Ciro Santilli @cirosantilli 37


Common Crawl web graph official PageRank Created 2025-02-26 Updated 2025-07-16

As of 2025, the Common Crawl web graph also dumps its own PageRank for each release. See e.g. the file cc-main-2024-25-dec-jan-feb-host-ranks.txt.gz at: data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html The first 20 rows are:

#harmonicc_pos  #harmonicc_val  #pr_pos #pr_val #host_rev
1       3.4626736E7     3       0.005384977821460953    com.facebook
2       3.42356E7       2       0.007010813553170503    com.googleapis.fonts
3       3.007577E7      1       0.008634952900502719    com.google
4       3.0036014E7     4       0.004411782034463272    com.googletagmanager
5       2.9900088E7     5       0.0036940035989790525   com.youtube
6       2.9537252E7     6       0.0032959808223701      com.instagram
7       2.9092556E7     9       0.0027616338842143423   com.twitter
8       2.7346152E7     7       0.0032101332824200743   com.gstatic.fonts
9       2.6818654E7     11      0.0017699438634060259   com.linkedin
10      2.5383126E7     8       0.0027849243241515574   org.gmpg
11      2.3747762E7     12      0.0016577826631867043   com.google.maps
12      2.3514198E7     15      0.0013399414238881337   com.googleapis.ajax
13      2.3504832E7     16      0.0012791339750445332   com.google.play
14      2.337092E7      47      3.794876113587071E-4    be.youtu
15      2.2925148E7     14      0.0013857916784687163   com.cloudflare.cdnjs
16      2.2851038E7     18      0.0012066313543285154   com.google.plus
17      2.2833728E7     13      0.0015745738381307273   org.wordpress
18      2.2830926E7     36      6.02400471665468E-4     com.pinterest
19      2.27056E7       45      4.001342924757244E-4    com.google.support
20      2.2687704E7     24      9.381217848819624E-4    net.jsdelivr.cdn

so quite plausible, except for org.gmpg. What the fuck is that and why is it ranked so high? Is it a quirk with the hosts inside subdomains?
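The rows above can be read with a few lines of Python. This is only a minimal sketch: it assumes whitespace-separated columns exactly as in the quoted rows, and that #host_rev holds the host name with its labels reversed (com.googleapis.fonts for fonts.googleapis.com). The names HostRank, unreverse_host and parse_rank_line are made up for illustration and are not part of any Common Crawl tooling.

```python
from typing import NamedTuple


class HostRank(NamedTuple):
    harmonic_pos: int
    harmonic_val: float
    pagerank_pos: int
    pagerank_val: float
    host: str  # in normal order, e.g. "fonts.googleapis.com"


def unreverse_host(host_rev: str) -> str:
    """Turn a reversed host like 'com.googleapis.fonts' into 'fonts.googleapis.com'."""
    return ".".join(reversed(host_rev.split(".")))


def parse_rank_line(line: str) -> HostRank:
    """Parse one data row; extra trailing columns such as n_hosts are ignored."""
    fields = line.split()
    return HostRank(
        int(fields[0]),
        float(fields[1]),
        int(fields[2]),
        float(fields[3]),
        unreverse_host(fields[4]),
    )


# A few rows copied from the dump quoted above.
sample = """\
#harmonicc_pos  #harmonicc_val  #pr_pos #pr_val #host_rev
1       3.4626736E7     3       0.005384977821460953    com.facebook
2       3.42356E7       2       0.007010813553170503    com.googleapis.fonts
10      2.5383126E7     8       0.0027849243241515574   org.gmpg
"""

rows = [
    parse_rank_line(line)
    for line in sample.splitlines()
    if line and not line.startswith("#")
]
for row in rows:
    print(row.harmonic_pos, row.pagerank_pos, row.host)
```

For the real file, the same parser could be fed line by line from gzip.open(path, "rt") instead of the inline sample.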

Perhaps a more relevant dump might be the domain-only one cc-main-2024-25-dec-jan-feb-domain-ranks.txt.gz:

#harmonicc_pos  #harmonicc_val  #pr_pos #pr_val #host_rev       #n_hosts
1       3.1238044E7     3       0.01110707704411023     com.facebook    3632
2       3.0950192E7     2       0.016650558868491434    com.googleapis  3470
3       3.000803E7      1       0.01749148008448444     com.google      14053
4       2.7319046E7     5       0.00670112168785935     com.instagram   789
5       2.7020862E7     7       0.005464885844102939    com.youtube     1628
6       2.6954494E7     4       0.007740808154448889    com.googletagmanager    42
7       2.6344278E7     8       0.0052073382920908295   com.twitter     712
8       2.5414934E7     6       0.0058790483755603844   com.gstatic     171
9       2.4803688E7     11      0.0038589161241338816   com.linkedin    690
10      2.4683842E7     10      0.004929923081722034    org.gmpg        2
11      2.3575146E7     9       0.005111453489231459    com.cloudflare  951
12      2.2735678E7     14      0.002131882799792225    com.gravatar    98
13      2.2356142E7     12      0.002513741654851857    org.wordpress   1250
14      2.2132868E7     15      0.0019991529719988496   com.apple       3261
15      2.2095914E7     31      0.0010706467268355303   org.wikipedia   2099
16      2.2057972E7     21      0.0015644264715267535   com.pinterest   360
17      2.1941062E7     40      8.52391305373285E-4     be.youtu        15
18      2.1826452E7     16      0.0018442726685905964   net.jsdelivr    40
19      2.1764224E7     34      9.747994384099485E-4    gl.goo  951
20      2.1690982E7     35      9.740295347556525E-4    com.vimeo

But nope, org.gmpg is still there!

vigna.di.unimi.it/ftp/papers/GraphStructure.pdf comments on it:

for instance, gmpg.org is the reference for a vocabulary that describes relationships

so it appears to be a computer-readable ontology mechanism along the lines of Resource Description Framework which interlinks many websites. The article also mentions another interesting source of noise: miibeian.gov.cn, which every Chinese website is required to link to for its ICP license.

The source code for it seems to be at: github.com/commoncrawl/cc-webgraph and it seems to use the Java version of WebGraph quite directly on their BVGraph dump. Unfortunately there is apparently no CLI for PageRank, so they have to use a bit of Java code. A CLI would be so awesome!
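To make the PageRank column more concrete, here is a toy power-iteration PageRank in Python. It is not the cc-webgraph or WebGraph code, just a sketch under assumptions: the damping factor 0.85, the fixed iteration count, the even redistribution of dangling-node rank and the tiny graph are all illustrative choices, and the a/b/c.example hosts are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank.

    links maps each node to the list of nodes it links to.
    Rank held by dangling nodes (no outlinks) is spread evenly over all nodes.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: every site links to one shared host, mimicking a
# vocabulary reference that many otherwise unrelated pages point at.
graph = {
    "a.example": ["gmpg.org", "b.example"],
    "b.example": ["gmpg.org"],
    "c.example": ["gmpg.org", "a.example"],
    "gmpg.org": [],
}
ranks = pagerank(graph)
for host, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{host}\t{value:.4f}")
```

In this toy graph the host that everybody links to comes out on top even though it links to nothing, which is the same kind of effect as gmpg.org in the real dump.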


Forester Created 2024-10-12 Updated 2025-07-16


www.jonmsterling.com/tfmt-0001.xml

Intro/docs: www.jonmsterling.com/jms-005P.xml. It is however very hard to find information in that system, largely because they don't seem to have a proper recursive cross-file table of contents.

This is the project with the closest philosophy to OurBigBook that Ciro Santilli has ever found. It just tends to be even more idealistic than OurBigBook in general, which is insane!

Source code: sr.ht/~jonsterling/forester. Not on GitHub, too much idealism for that.

"Docs" at: www.jonmsterling.com/foreign-forester-jms-005P.xml Sample repo at: github.com/jonsterling/forest but all parts of interest are in submodules on the author's private Git server.

Example:

sample source file: git.sr.ht/~jonsterling/public-trees/tree/2356f52303c588fadc2136ffaa168e9e5fbe346c/item/jms-005P.tree
appears rendered at: www.jonmsterling.com/foreign-forester-jms-005P.xml

Author's main social media account seems to be: mathstodon.xyz/@jonmsterling e.g. mathstodon.xyz/@jonmsterling/111359099228291730 His home page:

They have \Include like OurBigBook, nice: www.jonmsterling.com/jms-007L.xml, but OMG that name \transclude{xxx-NNNN}!! It seems to be possible to have human readable IDs too if you want: www.jonmsterling.com/foreign-forester-armaëlguéneau.xml is under trees/public/roladex/armaëlguéneau.tree.

Headers have open/close:

\subtree[jms-00YG]{}

OurBigBook considered this, but finally went with parent= instead to avoid huge lists of closing parentheses at the end of deep nodes.
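The trade-off can be sketched in Python: a flat list of headers with parent pointers carries the same tree as the nested open/close form, and the nested form can be derived from it mechanically. The record layout and function names below are hypothetical and do not reflect the actual syntax or implementation of either OurBigBook or Forester.

```python
# Flat header records: (id, parent id); None marks the root.
flat = [
    ("root", None),
    ("chapter", "root"),
    ("section", "chapter"),
    ("subsection", "section"),
    ("appendix", "root"),
]


def build_children(records):
    """Map each header id to the ordered list of its child ids."""
    children = {header_id: [] for header_id, _ in records}
    for header_id, parent in records:
        if parent is not None:
            children[parent].append(header_id)
    return children


def render_nested(children, node, depth=0):
    """Render the explicit open/close form that the flat form avoids writing by hand."""
    pad = "  " * depth
    if not children[node]:
        return [f"{pad}{node} {{}}"]
    lines = [f"{pad}{node} {{"]
    for child in children[node]:
        lines.extend(render_nested(children, child, depth + 1))
    lines.append(f"{pad}}}")
    return lines


children = build_children(flat)
nested_lines = render_nested(children, "root")
print("\n".join(nested_lines))
```

The deeper the tree, the more closing delimiters pile up at the end of the nested rendering, while each flat record stays one line long.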

One really cool thing is that the headers render internal links as clickable, which brings it all closer to the "knowledge base as a formal ontology" approach.

Does not encourage human readable IDs, uses stuff like jms-00YG.

The markup has relatively few insane (shorthand) constructs; notably, you need explicit open paragraphs everywhere with \p{}?! OMG, too idealistic, not enough pragmatism. There are however a few insane constructs:

[](): markdown like links
[[bluecat]]: wikilinks (but to raw IDs only, you don't seem to be able to do [[blue cat]])
#{} and ##{} for inline and block maths, though that might just be a sane construct with an insane name

The markup is documented at: www.jonmsterling.com/foreign-forester-jms-007N.xml

Jon has some very good theory of personal knowledge base, rationalizing several points that Ciro Santilli had in his mind but hadn't fully put into words, which is quite cool.

The OCaml dependency is not so bad, but it relies on actual LaTeX for maths, which is bad. Maybe using JavaScript for OurBigBook wasn't such a bad choice after all, KaTeX just works.

Viewing the generated output HTML directly from the filesystem requires disabling security.fileuri.strict_origin_policy in Firefox, which is sad, but using a local server solves it. So it appears to actually pull pieces together with JavaScript? Also output files have the .xml extension, the idealism! They are reconsidering that though: www.jonmsterling.com/foreign-forester-jms-005P.xml#tree-8720.

The Ctrl+K article dropdown search navigation is quite cool.

\rel and \meta allow for arbitrary ontologies between nodes as semantic triples. But they suffer from one fatal flaw: the relations are not headers in themselves. We often want to explain why a relation is true, give intuition for it, and refer to it from other nodes. This is obviously how the brain works: relations are nodes just like objects.
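The "relations are nodes just like objects" idea can be sketched as a tiny Python data model. Everything here is hypothetical (class names, field names, the sample IDs); it is not how Forester or OurBigBook store anything, only an illustration of a relation that has its own ID and body and can therefore be linked to.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    body: str = ""
    # IDs of other nodes that this node's body refers to.
    refs: list = field(default_factory=list)


@dataclass
class Relation(Node):
    # A relation is itself a Node, so it gets an ID and a body explaining it.
    subject: str = ""
    predicate: str = ""
    obj: str = ""


nodes = {}


def add(node):
    """Register a node (or relation) under its ID."""
    nodes[node.id] = node
    return node


add(Node("pagerank", "Ranks the nodes of a link graph."))
add(Node("web-graph", "Hosts as nodes, hyperlinks as edges."))
add(Relation(
    "pagerank-applies-to-web-graph",
    "Why it holds: hyperlinks act as votes between hosts.",
    subject="pagerank",
    predicate="applies-to",
    obj="web-graph",
))
# Another node can now refer to the relation itself, not just to its endpoints.
add(Node("ranking-notes", "See the relation for intuition.",
         refs=["pagerank-applies-to-web-graph"]))

target = nodes[nodes["ranking-notes"].refs[0]]
print(type(target).__name__, target.subject, target.predicate, target.obj)
```

With plain triples the explanation text and the incoming reference would have nowhere to attach, which is the flaw described above.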

They do appear to be putting full trees on every toplevel regardless of how deep, even with JavaScript turned off, e.g.:

which is cool but will take lots of storage. In OurBigBook Ciro Santilli only does that on OurBigBook Web where each page can be dynamically generated.
