Open PageRank implementation and data by
Ciro Santilli 35 Updated 2025-03-28 +Created 2025-02-26
This section is about more "open" PageRank implementations, notably using either or both of:
- open source software
- open web crawling data such as Common Crawl
As of 2025, the most open and reproducible implementation appears to be whatever Common Crawl web graph official PageRank does, which is to use WebGraph. It's quite beautiful.
École Polytechnique alumnus of 2009 by
Ciro Santilli 35 Updated 2025-03-28 +Created 2025-02-26
École Polytechnique alumnus of 1983 by
Ciro Santilli 35 Updated 2025-03-28 +Created 2025-02-26
Apparently, in 2017 Common Crawl started making their own web graphs, i.e. they parse the HTML and extract the graph of what links to what.
This is exactly what we need for an open implementation of PageRank.
Edit: actually, they already calculate PageRank for us!!! Fantastic!!! Main section: Section "Common Crawl web graph official PageRank".
The graphs are dumped in BVGraph format.
A quick exploration of the graph can be seen at: github.com/cirosantilli/cirosantilli.github.io/issues/198
Their source code is at: github.com/commoncrawl/cc-webgraph
École Polytechnique alumnus by year by
Ciro Santilli 35 Updated 2025-03-28 +Created 2025-02-26
École Polytechnique students identify their academic year, or "promotion" in French, by the year in which it starts.
For example, Ciro Santilli's year started in 2009, though as a foreign student he arrived only at the start of 2010, and Ciro's promotion is usually known just as X09. Since two-digit years become ambiguous across centuries, we'll one day need to write it out in full as X2009.
List of notable alumni:
- fr.wikipedia.org/wiki/Liste_d%27élèves_de_l%27École_polytechnique This French list is a bit better as you'd expect
- en.wikipedia.org/wiki/List_of_%C3%89cole_Polytechnique_alumni
A quick hands-on introduction to the software by Ciro Santilli can be found at: github.com/cirosantilli/cirosantilli.github.io/issues/198
The native file format of WebGraph.
It is a binary format and highly storage efficient.
It is for example what Common Crawl web graph currently dumps to as of 2025, see e.g.: data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html
TODO meaning of "BV"?
A quick hands-on introduction to the format by Ciro Santilli can be found at: github.com/cirosantilli/cirosantilli.github.io/issues/198
Quick fun with the Common Crawl web graph by
Ciro Santilli 35 Updated 2025-03-28 +Created 2025-02-26
github.com/cirosantilli/cirosantilli.github.io/issues/198. Previously at: stackoverflow.com/questions/31321009/best-more-standard-graph-representation-file-format-graphson-gexf-graphml/79467334#79467334 but Stack Overflow fucking deleted the question.
I wanted to do a quick exploration of open PageRank implementation and data.
My general motivation for this is that a PageRank-like algorithm could be useful for more accurate user and article ranking on OurBigBook, see: Section "PageRank-like ranking"
But it could also be just generally cool to apply it to other graph datasets, e.g. for computing a Wikipedia internal PageRank.
A quick Google reveals only Open PageRank, but their methods are apparently closed source.
Then I had a look at the Common Crawl web graph data to see if I could easily calculate it myself, and... they already have it! See: Section "Common Crawl web graph official PageRank"
Their graph dumps are in BVGraph graph file format, which is the native format of the WebGraph framework, which implements the format and algorithms such as PageRank.
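To make the idea concrete, here is a minimal sketch of PageRank by power iteration on a tiny in-memory edge list. It is not how WebGraph or Common Crawl do it: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions made for illustration, and a real web graph would need a compressed representation such as BVGraph rather than Python dictionaries.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) links.

    Illustrative sketch only: damping and iterations are assumed defaults.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            # Each node splits its damped rank equally among its targets.
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph, made up for this example: "c" is linked to by everyone else.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

The scores sum to 1, so they can be read as the fraction of time a random surfer would spend on each node.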
The only thing I miss is a command line interface to calculate the PageRank. That would be so awesome.
The more I look at it the more I love Common Crawl.
Announcements:
In cc-main-2024-25-dec-jan-feb-domain-ranks.txt:
- cirosantilli.com was ranked ~453k
- ourbigbook.com was at ~606k
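A lookup like the one above can be sketched in a few lines of Python. The file layout used here is an assumption and should be checked against the actual dump: tab-separated columns, header lines starting with `#`, the PageRank position in the third column, and domain names stored reversed (e.g. `com.cirosantilli`). The function name `find_domain_rank`, its `rank_col` and `name_col` parameters and the sample rows are all hypothetical.

```python
def find_domain_rank(lines, domain, rank_col=2, name_col=4):
    """Return the rank of `domain` as an int, or None if it is not listed.

    Assumed layout (not confirmed here): tab-separated fields, '#' header
    lines, and reversed domain names such as 'com.cirosantilli'.
    """
    reversed_name = ".".join(reversed(domain.split(".")))
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(rank_col, name_col):
            continue  # skip malformed or short lines
        if fields[name_col] == reversed_name:
            return int(fields[rank_col])
    return None


# Made-up sample rows, only to exercise the function.
sample_lines = [
    "# hypothetical header line\n",
    "1\t3.0e7\t2\t0.01\tcom.example\t10\n",
    "2\t2.9e7\t1\t0.02\torg.example\t5\n",
]
print(find_domain_rank(sample_lines, "example.org"))
```

For the real file, any iterable of lines works, so `open(path, encoding="utf-8")` can be passed directly and the file is streamed rather than loaded into memory.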