= Quick fun with the Common Crawl web graph
https://github.com/cirosantilli/cirosantilli.github.io/issues/198[]. Previously at: https://stackoverflow.com/questions/31321009/best-more-standard-graph-representation-file-format-graphson-gexf-graphml/79467334#79467334 but <deletionism on Stack Overflow>[Stack Overflow fucking deleted the question].
I wanted to do a quick exploration of <open PageRank implementation and data>.
My general motivation for this is that a <PageRank>-like algorithm could be useful for more accurate user and article ranking on <OurBigBook>, see: <ourbigbook.com/PageRank-like ranking>{full}
But it could also be just generally cool to apply it to other <graph> datasets, e.g. for computing a <Wikipedia internal PageRank>.
A quick <Google> search reveals only <Open PageRank>, but their methods are apparently closed source.
Then I had a look at the <Common Crawl web graph> data to see if I could easily calculate it myself, and... they already have it! See: <Common Crawl web graph official PageRank>{full}
Their graph dumps are in the <BVGraph> <graph file format>, the native format of the <WebGraph (software)> framework, which implements both the format and algorithms such as <PageRank>.
The only thing I miss is a command line interface to calculate the PageRank. That would be so awesome.
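To get a feel for what such a command would compute, here is a minimal sketch of PageRank by power iteration on a tiny in-memory edge list. This is plain Python and has nothing to do with how <WebGraph (software)> or Common Crawl implement it: the `pagerank` function, the 0.85 damping factor and the uniform handling of dangling nodes are all assumptions chosen for illustration, and it would not scale to the real web graph.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-12):
    """Toy PageRank by power iteration over a list of (src, dst) edges.

    Hypothetical helper for illustration only, not the WebGraph API.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    # Start from the uniform distribution.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                # Each node splits its damped rank evenly among its out-links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Made-up three-page graph: a and b both link to c, c links back to a.
    toy_edges = [("a", "c"), ("b", "c"), ("c", "a")]
    for node, score in sorted(pagerank(toy_edges).items(), key=lambda kv: -kv[1]):
        print(node, round(score, 4))
```

On the toy graph, `c` should come out on top since two pages link to it, followed by `a`, with `b` last because nothing links to it.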
The more I look at it the more I love <Common Crawl>.
Announcements:
* https://mastodon.social/@cirosantilli/114070985511493835
* https://x.com/cirosantilli/status/1894777704517406852
In `cc-main-2024-25-dec-jan-feb-domain-ranks.txt`:
* `cirosantilli.com` was ranked ~453k
* `ourbigbook.com` was at ~606k
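Lookups like the ones above can be scripted. The following is a rough sketch that assumes the ranks file is tab-separated, starts with a header line whose column names are prefixed with `#`, and stores domains in reversed form (e.g. `com.cirosantilli`) in a column named `host_rev`. Those layout details are from memory and should be checked against the actual file, and `find_ranks` and `reverse_domain` are hypothetical helper names. The sample rows are fake values for `example.com` and `example.org`, not real ranks.

```python
import io


def reverse_domain(domain):
    """Turn 'example.com' into 'com.example' (assumed storage form in the file)."""
    return ".".join(reversed(domain.split(".")))


def find_ranks(lines, domains):
    """Scan a ranks file line by line and return the rows for the given domains.

    Assumes a tab-separated file whose first line is a '#'-prefixed header
    containing a 'host_rev' column. Returns {domain: {column: value}}.
    """
    wanted = {reverse_domain(domain): domain for domain in domains}
    found = {}
    header = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if header is None:
            header = [field.lstrip("#") for field in fields]
            continue
        row = dict(zip(header, fields))
        name = row.get("host_rev")
        if name in wanted:
            found[wanted[name]] = row
            if len(found) == len(wanted):
                break  # Stop early once every requested domain was seen.
    return found


# Fake sample data in the assumed layout, for demonstration only.
SAMPLE = (
    "#harmonicc_pos\t#harmonicc_val\t#pr_pos\t#pr_val\t#host_rev\t#n_hosts\n"
    "1\t3.0E7\t2\t0.01\tcom.example\t10\n"
    "2\t2.9E7\t1\t0.02\torg.example\t5\n"
)

if __name__ == "__main__":
    for domain, row in find_ranks(io.StringIO(SAMPLE), ["example.org"]).items():
        print(domain, row)
```

With the real file, one could pass an open file object instead of the `io.StringIO` sample, e.g. `open(path)`, or `gzip.open(path, "rt")` if the download is still compressed. Since the scan is line by line, the whole file never needs to fit in memory.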