As of 2025, the Common Crawl web graph also dumps its own PageRank for each release. See e.g. the file cc-main-2024-25-dec-jan-feb-host-ranks.txt.gz listed at: data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html
The first 20 rows are:
#harmonicc_pos #harmonicc_val #pr_pos #pr_val #host_rev
1 3.4626736E7 3 0.005384977821460953 com.facebook
2 3.42356E7 2 0.007010813553170503 com.googleapis.fonts
3 3.007577E7 1 0.008634952900502719 com.google
4 3.0036014E7 4 0.004411782034463272 com.googletagmanager
5 2.9900088E7 5 0.0036940035989790525 com.youtube
6 2.9537252E7 6 0.0032959808223701 com.instagram
7 2.9092556E7 9 0.0027616338842143423 com.twitter
8 2.7346152E7 7 0.0032101332824200743 com.gstatic.fonts
9 2.6818654E7 11 0.0017699438634060259 com.linkedin
10 2.5383126E7 8 0.0027849243241515574 org.gmpg
11 2.3747762E7 12 0.0016577826631867043 com.google.maps
12 2.3514198E7 15 0.0013399414238881337 com.googleapis.ajax
13 2.3504832E7 16 0.0012791339750445332 com.google.play
14 2.337092E7 47 3.794876113587071E-4 be.youtu
15 2.2925148E7 14 0.0013857916784687163 com.cloudflare.cdnjs
16 2.2851038E7 18 0.0012066313543285154 com.google.plus
17 2.2833728E7 13 0.0015745738381307273 org.wordpress
18 2.2830926E7 36 6.02400471665468E-4 com.pinterest
19 2.27056E7 45 4.001342924757244E-4 com.google.support
20 2.2687704E7 24 9.381217848819624E-4 net.jsdelivr.cdn
So quite plausible, except for org.gmpg. What the fuck is that and why is it ranked so high? Is it a quirk with the hosts inside subdomains?
Perhaps a more relevant dump might be the domain-only one, cc-main-2024-25-dec-jan-feb-domain-ranks.txt.gz:
#harmonicc_pos #harmonicc_val #pr_pos #pr_val #host_rev #n_hosts
1 3.1238044E7 3 0.01110707704411023 com.facebook 3632
2 3.0950192E7 2 0.016650558868491434 com.googleapis 3470
3 3.000803E7 1 0.01749148008448444 com.google 14053
4 2.7319046E7 5 0.00670112168785935 com.instagram 789
5 2.7020862E7 7 0.005464885844102939 com.youtube 1628
6 2.6954494E7 4 0.007740808154448889 com.googletagmanager 42
7 2.6344278E7 8 0.0052073382920908295 com.twitter 712
8 2.5414934E7 6 0.0058790483755603844 com.gstatic 171
9 2.4803688E7 11 0.0038589161241338816 com.linkedin 690
10 2.4683842E7 10 0.004929923081722034 org.gmpg 2
11 2.3575146E7 9 0.005111453489231459 com.cloudflare 951
12 2.2735678E7 14 0.002131882799792225 com.gravatar 98
13 2.2356142E7 12 0.002513741654851857 org.wordpress 1250
14 2.2132868E7 15 0.0019991529719988496 com.apple 3261
15 2.2095914E7 31 0.0010706467268355303 org.wikipedia 2099
16 2.2057972E7 21 0.0015644264715267535 com.pinterest 360
17 2.1941062E7 40 8.52391305373285E-4 be.youtu 15
18 2.1826452E7 16 0.0018442726685905964 net.jsdelivr 40
19 2.1764224E7 34 9.747994384099485E-4 gl.goo 951
20 2.1690982E7 35 9.740295347556525E-4 com.vimeo
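Rows like the ones above are easy to inspect programmatically. The following is a minimal sketch, assuming the layout is exactly what the header lines show: whitespace-separated columns, a reversed host name in the fifth column, and an optional sixth n_hosts column in the domain-level dump. The names RankRow and parse_rank_line are made up for illustration and are not part of any Common Crawl tooling.

```python
from typing import NamedTuple, Optional


class RankRow(NamedTuple):
    """One data row of a ranks dump (hypothetical helper type)."""
    harmonic_pos: int
    harmonic_val: float
    pr_pos: int
    pr_val: float
    host: str                # un-reversed, e.g. "gmpg.org"
    n_hosts: Optional[int]   # assumed to exist only in the domain-level dump


def parse_rank_line(line: str) -> RankRow:
    """Parse one whitespace-separated data row, un-reversing the host name."""
    fields = line.split()
    # "org.gmpg" -> "gmpg.org"
    host = ".".join(reversed(fields[4].split(".")))
    # Tolerate rows without the n_hosts column.
    n_hosts = int(fields[5]) if len(fields) > 5 else None
    return RankRow(
        int(fields[0]), float(fields[1]),
        int(fields[2]), float(fields[3]),
        host, n_hosts,
    )


# Two rows copied from the domain-level listing above.
sample = """\
#harmonicc_pos #harmonicc_val #pr_pos #pr_val #host_rev #n_hosts
10 2.4683842E7 10 0.004929923081722034 org.gmpg 2
17 2.1941062E7 40 8.52391305373285E-4 be.youtu 15
"""

rows = [
    parse_rank_line(line)
    for line in sample.splitlines()
    if line and not line.startswith("#")  # skip the header line
]
for row in rows:
    print(row.host, row.harmonic_pos, row.pr_pos, row.n_hosts)
```

For the real dump, the same function could presumably be fed lines from gzip.open(path, "rt") rather than an inline sample, which makes it easy to filter for suspicious entries such as domains with a very high rank but very few hosts.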
But nope, org.gmpg is still there!
vigna.di.unimi.it/ftp/papers/GraphStructure.pdf comments on it: "for instance, gmpg.org is the reference for a vocabulary that describes relationships", so it appears to be a computer-readable ontology mechanism along the lines of Resource Description Framework which interlinks many websites. The article also mentions another interesting source of noise, miibeian.gov.cn, which every Chinese website is required to link to for their ICP license.

Intro/docs: www.jonmsterling.com/jms-005P.xml. It is very hard to find information in that system however, largely because they don't seem to have a proper recursive cross-file table of contents.
This is the project with the closest philosophy to OurBigBook that Ciro Santilli has ever found. It just tends to be even more idealistic than OurBigBook in general, which is insane!
Source code: sr.ht/~jonsterling/forester. Not on GitHub, too much idealism for that.
"Docs" at: www.jonmsterling.com/foreign-forester-jms-005P.xml
Sample repo at: github.com/jonsterling/forest but all parts of interest are in submodules on the author's private Git server.
Example:
- sample source file: git.sr.ht/~jonsterling/public-trees/tree/2356f52303c588fadc2136ffaa168e9e5fbe346c/item/jms-005P.tree
- appears rendered at: www.jonmsterling.com/foreign-forester-jms-005P.xml
Author's main social media account seems to be: mathstodon.xyz/@jonmsterling e.g. mathstodon.xyz/@jonmsterling/111359099228291730 His home page:
They have \Include like OurBigBook, nice: www.jonmsterling.com/jms-007L.xml, but OMG that name: \transclude{xxx-NNNN}!! It seems to be possible to have human readable IDs too if you want: www.jonmsterling.com/foreign-forester-armaëlguéneau.xml is under trees/public/roladex/armaëlguéneau.tree.
Headers have open/close:
\subtree[jms-00YG]{}
OurBigBook considered this, but finally went with parent= instead to avoid huge lists of closing parentheses at the end of deep nodes.
One really cool thing is that the headers render internal links as clickable, which brings it all closer to the "knowledge base as a formal ontology" approach.
Does not encourage human readable IDs, uses stuff like jms-00YG.
The markup is documented at: www.jonmsterling.com/foreign-forester-jms-007N.xml
The markup has relatively few insane constructs, notably you need explicit open paragraphs everywhere with \p{}?! OMG, too idealistic, not enough pragmatism. There are however a few insane constructs:
- []() : markdown like links
- [[bluecat]] : wikilinks (but to raw IDs only, you can't seem to be able to do [[blue cat]])
- #{} and ##{} for inline and block maths, though that might just be a sane construct with an insane name
Jon has some very good theory of personal knowledge base, rationalizing several points that Ciro Santilli had in his mind but hadn't fully put into words, which is quite cool.
OCaml dependency is not so bad, but it actually relies on LaTeX for maths, which is bad. Maybe using JavaScript for OurBigBook wasn't such a bad choice after all, KaTeX just works.
Viewing the generated output HTML directly requires changing the Firefox setting security.fileuri.strict_origin_policy, which is sad, but using a local server solves it. So it appears to actually pull pieces together with JavaScript? Also output files have the .xml extension, the idealism! They are reconsidering that though: www.jonmsterling.com/foreign-forester-jms-005P.xml#tree-8720.
The Ctrl+K article dropdown search navigation is quite cool.
\rel and \meta allow for arbitrary ontologies between nodes as semantic triples. But they suffer from one fatal flaw: the relations are not headers in themselves. We often want to explain why a relation is true, give intuition to it, and refer to it from other nodes. This is obviously how the brain works: relations are nodes just like objects.
They do appear to be putting full trees on every toplevel regardless of how deep and with JavaScript turned off e.g.:
which is cool but will take lots of storage. In OurBigBook Ciro Santilli only does that on OurBigBook Web where each page can be dynamically generated.
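Going back to the \rel and \meta point: the idea that "relations are nodes just like objects" can be made concrete with a tiny data model. This is only an illustrative sketch, it is not how Forester or OurBigBook store anything, and every name in it (Node, Relation, add, the example IDs) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Node:
    """Any addressable entry of the knowledge base (hypothetical model)."""
    id: str
    title: str
    body: str = ""


@dataclass
class Relation(Node):
    """A semantic triple that is itself a Node, so it has an ID and a body."""
    subject: str = ""    # ID of the subject node
    predicate: str = ""  # name of the relation
    object: str = ""     # ID of the object node


nodes: Dict[str, Node] = {}


def add(node: Node) -> Node:
    """Register a node under its ID and return it."""
    nodes[node.id] = node
    return node


add(Node("cat", "Cat"))
add(Node("mammal", "Mammal"))

# The relation gets its own ID and body, so we can explain why it holds.
add(Relation(
    "cat-is-mammal", "Cats are mammals",
    body="Explanation and intuition for the relation goes here.",
    subject="cat", predicate="is-a", object="mammal",
))

# Because the relation is a node, another triple can point at it.
add(Node("taxonomy-notes", "Taxonomy notes"))
add(Relation(
    "notes-mention-relation", "Taxonomy notes mention that cats are mammals",
    subject="taxonomy-notes", predicate="refers-to", object="cat-is-mammal",
))

target = nodes[nodes["notes-mention-relation"].object]
print(target.title, "|", target.body)
```

The design choice being illustrated is simply that the triple shares the same ID namespace as ordinary nodes, which is what lets it carry its own explanation and be the target of further links.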