Source: cirosantilli/updates/quick-fun-with-the-common-crawl-web-graph

= Quick fun with the Common Crawl web graph

https://stackoverflow.com/questions/31321009/best-more-standard-graph-representation-file-format-graphson-gexf-graphml/79467334#79467334

I wanted to do a quick exploration of <open PageRank implementation and data>.

My general motivation for this is that a <PageRank>-like algorithm could be useful for more accurate user and article ranking on <OurBigBook>, see: <ourbigbook com/PageRank-like ranking>{full}

But it could also be just generally cool to apply it to other <graph> datasets, e.g. for computing a <Wikipedia internal PageRank>.

A quick <Google> search turns up only <Open PageRank>, but their methods are apparently closed source.

Then I had a look at the <Common Crawl web graph> data to see if I could easily calculate it myself, and... they already have it! See: <Common Crawl web graph official PageRank>{full}

Their graph dumps are in the <BVGraph> <graph file format>, the native format of the <WebGraph (software)> framework, which implements both the format and algorithms such as <PageRank>.

The only thing I miss is a command line interface to calculate the PageRank. That would be so awesome.
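
For a feel of what such a calculation boils down to, here is a minimal sketch of PageRank by power iteration over a tiny in-memory edge list. It is not the WebGraph implementation and it does not read BVGraph files: the `pagerank` function, its `damping` and `iterations` parameters, and the toy edge list are all made up for illustration, and a real web graph would need a compressed representation rather than Python dictionaries.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) pairs.

    Hypothetical helper for illustration only, not part of WebGraph.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every other node splits its damped rank equally among its targets.
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph, assumed for the example: a -> b -> c -> a, plus d -> a.
    toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    ranks = pagerank(toy_edges)
    for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(f"{node}\t{score:.4f}")
```

The scores always sum to 1, and a node with no incoming links such as `d` ends up with just the teleportation share `(1 - damping) / n`.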

The more I look at it the more I love <Common Crawl>.