When Ciro finally understood that this is a play on Larry Page's name (of course it is, typical programmer/academic humor stuff), his mind was blown.
This section is about more "open" PageRank implementations, notably using either or both of:
- open source software
- open web crawling data such as Common Crawl
As of 2025, the most open and reproducible implementation appears to be whatever Common Crawl web graph official PageRank does, which is to use WebGraph. It's quite beautiful.
As of 2025 Common Crawl web graph also dumps its own PageRank for each release. See e.g. the host-level ranks file at: data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html The first 20 rows are:
#harmonicc_pos #harmonicc_val #pr_pos #pr_val #host_rev
1 3.4626736E7 3 0.005384977821460953 com.facebook
2 3.42356E7 2 0.007010813553170503 com.googleapis.fonts
3 3.007577E7 1 0.008634952900502719 com.google
4 3.0036014E7 4 0.004411782034463272 com.googletagmanager
5 2.9900088E7 5 0.0036940035989790525 com.youtube
6 2.9537252E7 6 0.0032959808223701 com.instagram
7 2.9092556E7 9 0.0027616338842143423 com.twitter
8 2.7346152E7 7 0.0032101332824200743 com.gstatic.fonts
9 2.6818654E7 11 0.0017699438634060259 com.linkedin
10 2.5383126E7 8 0.0027849243241515574 org.gmpg
11 2.3747762E7 12 0.0016577826631867043 com.google.maps
12 2.3514198E7 15 0.0013399414238881337 com.googleapis.ajax
13 2.3504832E7 16 0.0012791339750445332 com.google.play
14 2.337092E7 47 3.794876113587071E-4 be.youtu
15 2.2925148E7 14 0.0013857916784687163 com.cloudflare.cdnjs
16 2.2851038E7 18 0.0012066313543285154 com.google.plus
17 2.2833728E7 13 0.0015745738381307273 org.wordpress
18 2.2830926E7 36 6.02400471665468E-4 com.pinterest
19 2.27056E7 45 4.001342924757244E-4 com.google.support
20 2.2687704E7 24 9.381217848819624E-4 net.jsdelivr.cdn
So quite plausible, except for gmpg.org. What the fuck is that and why is it ranked so high? Is it a quirk with the hosts inside subdomains? Perhaps a more relevant dump might be the domain-only one. But nope:
#harmonicc_pos #harmonicc_val #pr_pos #pr_val #host_rev #n_hosts
1 3.1238044E7 3 0.01110707704411023 com.facebook 3632
2 3.0950192E7 2 0.016650558868491434 com.googleapis 3470
3 3.000803E7 1 0.01749148008448444 com.google 14053
4 2.7319046E7 5 0.00670112168785935 com.instagram 789
5 2.7020862E7 7 0.005464885844102939 com.youtube 1628
6 2.6954494E7 4 0.007740808154448889 com.googletagmanager 42
7 2.6344278E7 8 0.0052073382920908295 com.twitter 712
8 2.5414934E7 6 0.0058790483755603844 com.gstatic 171
9 2.4803688E7 11 0.0038589161241338816 com.linkedin 690
10 2.4683842E7 10 0.004929923081722034 org.gmpg 2
11 2.3575146E7 9 0.005111453489231459 com.cloudflare 951
12 2.2735678E7 14 0.002131882799792225 com.gravatar 98
13 2.2356142E7 12 0.002513741654851857 org.wordpress 1250
14 2.2132868E7 15 0.0019991529719988496 com.apple 3261
15 2.2095914E7 31 0.0010706467268355303 org.wikipedia 2099
16 2.2057972E7 21 0.0015644264715267535 com.pinterest 360
17 2.1941062E7 40 8.52391305373285E-4 be.youtu 15
18 2.1826452E7 16 0.0018442726685905964 net.jsdelivr 40
19 2.1764224E7 34 9.747994384099485E-4 gl.goo 951
20 2.1690982E7 35 9.740295347556525E-4 com.vimeo
gmpg.org is still there! vigna.di.unimi.it/ftp/papers/GraphStructure.pdf comments on it:
for instance, gmpg.org is the reference for a vocabulary that describes relationships
So it appears to be a computer-readable ontology mechanism along the lines of Resource Description Framework which interlinks many websites. The article also mentions another interesting source of noise: a site which every Chinese website is required to link to for their ICP license.
The source code for the computation seems to be at: github.com/commoncrawl/cc-webgraph and seems to use the Java version of WebGraph quite directly on their BVGraph dump. Unfortunately there is apparently no CLI for PageRank, so they have to use a bit of Java code. A CLI would be so awesome!
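The rank rows shown above are easy to load for quick comparisons between the two rankings. The following is a minimal sketch, assuming whitespace-separated columns in the order given by the header line, and that `#host_rev` is simply the host name with its dot-separated labels reversed. The names `RankRow`, `unreverse_host` and `parse_ranks` are made up for this example and are not part of cc-webgraph. The sample rows are copied from the host-level table.

```python
from typing import Iterable, List, NamedTuple


class RankRow(NamedTuple):
    harmonic_pos: int    # rank by harmonic centrality
    harmonic_val: float  # harmonic centrality value
    pagerank_pos: int    # rank by PageRank
    pagerank_val: float  # PageRank value
    host: str            # host in the usual order, e.g. "facebook.com"


def unreverse_host(host_rev: str) -> str:
    """Turn 'com.googleapis.fonts' into 'fonts.googleapis.com'."""
    return ".".join(reversed(host_rev.split(".")))


def parse_ranks(lines: Iterable[str]) -> List[RankRow]:
    """Parse rank rows, skipping blank lines and the '#...' header.

    Extra trailing columns (such as #n_hosts in the domain-level
    table) are ignored.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        rows.append(RankRow(
            int(fields[0]),
            float(fields[1]),  # float() accepts the '3.4626736E7' notation
            int(fields[2]),
            float(fields[3]),
            unreverse_host(fields[4]),
        ))
    return rows


SAMPLE = """\
#harmonicc_pos #harmonicc_val #pr_pos #pr_val #host_rev
1 3.4626736E7 3 0.005384977821460953 com.facebook
2 3.42356E7 2 0.007010813553170503 com.googleapis.fonts
3 3.007577E7 1 0.008634952900502719 com.google
10 2.5383126E7 8 0.0027849243241515574 org.gmpg
14 2.337092E7 47 3.794876113587071E-4 be.youtu
"""

rows = parse_ranks(SAMPLE.splitlines())
by_pagerank = sorted(rows, key=lambda row: row.pagerank_pos)
for row in by_pagerank:
    # Positive difference: the host does better on PageRank than on
    # harmonic centrality; negative: the opposite.
    print(row.pagerank_pos, row.host, row.harmonic_pos - row.pagerank_pos)
```

Sorting by `pagerank_pos` instead of `harmonic_pos` makes hosts such as youtu.be stand out, since their two ranks differ a lot.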
This appears to be the direct precursor project of the Common Crawl web graph official PageRank.
This section is about: wwwranking.webdatacommons.org/
Did not contain either of cirosantilli.com or OurBigBook.com as of 2025!
Based on Common Crawl 2012, and they don't seem to be updating it regularly...
Created by the Università degli Studi di Milano.
This section is about: www.domcop.com/openpagerank/
TODO: is their source code open?
Top 10 million websites: www.domcop.com/top-10-million-websites Can be downloaded as CSV. Contained both cirosantilli.com and OurBigBook.com as of 2025!
Get values for some websites: www.domcop.com/openpagerank/
This is the family of algorithms to which PageRank belongs.
Just imagine being famous only for being 44 years too early to a party.
The downside of "Katz centrality" compared to PageRank appears to be that if a big node links to many many nodes, all of those earn a lot of reputation, regardless of how many outgoing links there are.
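A toy example makes the difference concrete. The sketch below assumes the usual textbook form of Katz centrality, where each node's score is a constant beta plus alpha times the sum of the scores of the nodes linking to it. The `normalize=True` variant additionally divides each source's score by its number of outgoing links, which is the PageRank-style ingredient. It is not a full PageRank: there is no dangling-node handling and no normalization to a probability distribution. The function names, the graph and the parameter values are all made up for illustration.

```python
def centrality(out_links, alpha=0.5, beta=1.0, normalize=False, iterations=50):
    """Iterate score[v] = beta + alpha * sum(contribution of each u linking to v).

    With normalize=False the contribution of u is score[u] (Katz).
    With normalize=True it is score[u] / out_degree(u) (PageRank-style).
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    score = {node: beta for node in nodes}
    for _ in range(iterations):
        new_score = {node: beta for node in nodes}
        for source, targets in out_links.items():
            if not targets:
                continue
            share = score[source] / len(targets) if normalize else score[source]
            for target in targets:
                new_score[target] += alpha * share
        score = new_score
    return score


def hub_graph(n_leaves, n_fans=10):
    """n_fans nodes link to 'hub', and 'hub' links to n_leaves leaf nodes."""
    graph = {f"fan{i}": ["hub"] for i in range(n_fans)}
    graph["hub"] = [f"leaf{i}" for i in range(n_leaves)]
    return graph


for n_leaves in (2, 100):
    graph = hub_graph(n_leaves)
    katz_leaf = centrality(graph)["leaf0"]
    normalized_leaf = centrality(graph, normalize=True)["leaf0"]
    print(n_leaves, katz_leaf, normalized_leaf)
```

With these parameters the Katz score of a leaf stays at 4.0 whether the hub links to 2 or to 100 leaves, while the degree-normalized score drops from 2.5 to about 1.03. The toy graph has no cycles, so the iteration settles after a couple of rounds for any alpha; on general graphs alpha has to be smaller than the reciprocal of the largest eigenvalue of the adjacency matrix for Katz centrality to converge.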
Was adopted by AskJeeves in 2001.
The Google Story Chapter 11. "The Google Economy" comments:
As they saw it, generation one was AltaVista, generation two was Google, and generation three was Teoma, or what Ask Jeeves came to refer to as Expert Rank. Teoma's technology involved mathematical formulas and calculations that went beyond Google's PageRank system, which was based on popularity. In fact, the concept had been cited in the original Stanford University paper written by Sergey Brin and Larry Page as one of the methods that could be used to rank indexed Web sites in response to search requests. "They called their method global popularity and they called this method local popularity, meaning you look more granularly at the Web and see who the authoritative sources are," Lanzone said. He said Brin and Page had concluded that local popularity would be exceedingly difficult to execute well, because either it would require too much processing power to do it in real time or it would take too long.
googlesystem.blogspot.com/2006/03/expertrank-authoritative-search.html mentions:
ExpertRank is an evolution of IBM's CLEVER project, a search engine that never made it to public.
The difference between PageRank and ExpertRank is that for ExpertRank the quality of the page is important and that quality is not absolute, but it's relative to a subject.
There are other more recent algorithms with similar names, which are perhaps related:
- www.researchgate.net/publication/257015904_ExpertRank_A_topic-aware_expert_finding_algorithm_for_online_knowledge_communities ExpertRank: A topic-aware expert finding algorithm for online knowledge communities (2013)
- ieeexplore.ieee.org/document/5260966 ExpertRank: An Expert User Ranking Algorithm in Online Communities
PageRank was apparently inspired by it originally, given that.