Closurism
Closurism is a term invented by Ciro Santilli to refer to content moderation policies that lock threads in online forums, preventing people from adding new comments from that point onward.
This is similar to deletionism but a bit less bad, as the pre-existing content is maintained. But new relevant content that comes up can never be added, so it is still bad.
The outcome of closurism is that new forum posts must then be made about up-to-date aspects of the topic. But those may fail to reach the same PageRank, so most people never get the new information, or people create new posts, leading to useless duplication of work.
Common Crawl web graph
In 2017 they apparently started making their own web graphs, i.e. they parse the HTML and extract the graph of what links to what.
This is exactly what we need for an open implementation of PageRank.
Edit: actually, they already calculate PageRank for us!!! Fantastic!!! Main section: Section "Common Crawl web graph official PageRank".
The graphs are dumped in BVGraph format.
A quick exploration of the graph can be seen at: github.com/cirosantilli/cirosantilli.github.io/issues/198
Their source code is at: github.com/commoncrawl/cc-webgraph
Common Crawl web graph official PageRank
As of 2025 Common Crawl web graph also dumps its own PageRank for each release. See e.g. the file cc-main-2024-25-dec-jan-feb-host-ranks.txt.gz from: data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html The first 20 rows are:
#harmonicc_pos  #harmonicc_val  #pr_pos #pr_val #host_rev
1       3.4626736E7     3       0.005384977821460953    com.facebook
2       3.42356E7       2       0.007010813553170503    com.googleapis.fonts
3       3.007577E7      1       0.008634952900502719    com.google
4       3.0036014E7     4       0.004411782034463272    com.googletagmanager
5       2.9900088E7     5       0.0036940035989790525   com.youtube
6       2.9537252E7     6       0.0032959808223701      com.instagram
7       2.9092556E7     9       0.0027616338842143423   com.twitter
8       2.7346152E7     7       0.0032101332824200743   com.gstatic.fonts
9       2.6818654E7     11      0.0017699438634060259   com.linkedin
10      2.5383126E7     8       0.0027849243241515574   org.gmpg
11      2.3747762E7     12      0.0016577826631867043   com.google.maps
12      2.3514198E7     15      0.0013399414238881337   com.googleapis.ajax
13      2.3504832E7     16      0.0012791339750445332   com.google.play
14      2.337092E7      47      3.794876113587071E-4    be.youtu
15      2.2925148E7     14      0.0013857916784687163   com.cloudflare.cdnjs
16      2.2851038E7     18      0.0012066313543285154   com.google.plus
17      2.2833728E7     13      0.0015745738381307273   org.wordpress
18      2.2830926E7     36      6.02400471665468E-4     com.pinterest
19      2.27056E7       45      4.001342924757244E-4    com.google.support
20      2.2687704E7     24      9.381217848819624E-4    net.jsdelivr.cdn
so quite plausible, except for org.gmpg. What the fuck is that and why is it ranked so high? Is it a quirk of how hosts inside subdomains are counted?
Perhaps a more relevant dump might be the domain-only one cc-main-2024-25-dec-jan-feb-domain-ranks.txt.gz:
#harmonicc_pos  #harmonicc_val  #pr_pos #pr_val #host_rev       #n_hosts
1       3.1238044E7     3       0.01110707704411023     com.facebook    3632
2       3.0950192E7     2       0.016650558868491434    com.googleapis  3470
3       3.000803E7      1       0.01749148008448444     com.google      14053
4       2.7319046E7     5       0.00670112168785935     com.instagram   789
5       2.7020862E7     7       0.005464885844102939    com.youtube     1628
6       2.6954494E7     4       0.007740808154448889    com.googletagmanager    42
7       2.6344278E7     8       0.0052073382920908295   com.twitter     712
8       2.5414934E7     6       0.0058790483755603844   com.gstatic     171
9       2.4803688E7     11      0.0038589161241338816   com.linkedin    690
10      2.4683842E7     10      0.004929923081722034    org.gmpg        2
11      2.3575146E7     9       0.005111453489231459    com.cloudflare  951
12      2.2735678E7     14      0.002131882799792225    com.gravatar    98
13      2.2356142E7     12      0.002513741654851857    org.wordpress   1250
14      2.2132868E7     15      0.0019991529719988496   com.apple       3261
15      2.2095914E7     31      0.0010706467268355303   org.wikipedia   2099
16      2.2057972E7     21      0.0015644264715267535   com.pinterest   360
17      2.1941062E7     40      8.52391305373285E-4     be.youtu        15
18      2.1826452E7     16      0.0018442726685905964   net.jsdelivr    40
19      2.1764224E7     34      9.747994384099485E-4    gl.goo  951
20      2.1690982E7     35      9.740295347556525E-4    com.vimeo 
But nope, org.gmpg is still there!
vigna.di.unimi.it/ftp/papers/GraphStructure.pdf comments on it:
for instance, gmpg.org is the reference for a vocabulary that describes relationships
so it appears to be a computer-readable ontology mechanism along the lines of the Resource Description Framework, which interlinks many websites. The article also mentions another interesting source of noise: miibeian.gov.cn, which every Chinese website is required to link to for their ICP license.
The source code for it seems to be at: github.com/commoncrawl/cc-webgraph and appears to use the Java version of WebGraph quite directly on their BVGraph dump. Unfortunately there is apparently no CLI for PageRank however; they have to use a bit of Java code. A CLI would be so awesome!
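In the meantime, the rank dumps themselves are easy to consume directly. Here's a minimal Python sketch, assuming the host ranks file shown above has been downloaded locally, that prints the top 20 hosts with the reversed host notation converted back to the usual one:
import gzip

# Assumes a local download of the host ranks dump linked above.
path = "cc-main-2024-25-dec-jan-feb-host-ranks.txt.gz"
with gzip.open(path, "rt") as f:
    f.readline()  # skip the "#harmonicc_pos ... #host_rev" header
    for _ in range(20):
        cols = f.readline().split()
        # Reversed host notation: com.facebook -> facebook.com
        host = ".".join(reversed(cols[4].split(".")))
        print(cols[2], cols[3], host)  # PageRank position, PageRank value, host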
Eigenvector centrality
This is the family of algorithms to which PageRank belongs.
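To spell out the standard definition, given a graph with adjacency matrix A (A_{ji} = 1 when node j links to node i) and largest eigenvalue \lambda, the eigenvector centrality x satisfies:
x_i = \frac{1}{\lambda} \sum_j A_{ji} x_j
i.e. x is an eigenvector of the transposed adjacency matrix. PageRank belongs to this family: it additionally divides each contribution by the out-degree of j and mixes in a damping term.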
h-index
PageRank was apparently originally inspired by it.
How to develop Ciro Santilli's website before the OurBigBook migration
The website moved from AsciiDoctor to OurBigBook Markup in 2020, making this section mostly useless. But hey, history!
Ciro's website is powered by GitHub Pages and Jekyll Asciidoc.
The source code is located at: github.com/cirosantilli/cirosantilli.github.io
Build locally, watch for changes and rebuild automatically, and start a local server with:
git clone --recursive https://github.com/cirosantilli/cirosantilli.github.io
cd cirosantilli.github.io
bundle install
npm install
./run
Source: ./run.
The website will be visible at: localhost:4000.
Tested on the latest Ubuntu.
Publish changes to GitHub Pages:
git add -u
git commit -m 'make yourself look sillier'
./publish
Source: ./publish.
GitHub forces us to use the master branch for the build output... so the actual source is in the branch dev.
Update the gems with:
bundle update
git add Gemfile.lock
git commit -m 'update gems'
His website was originally written in Markdown, but that was deprecated in favour of AsciiDoctor when Ciro saw the light, rationale shown at: markdown-style-guide#use-asciidoc
Impact factor
This metric is so dumb! It only helps keep existing closed journals closed! Why not just do a PageRank on the articles themselves instead, like the h-index does for authors? That would make so much more sense!
Katz centrality
Just imagine being famous only for being 44 years too early to a party.
The downside of Katz centrality compared to PageRank appears to be that if a big node links to many, many nodes, all of those earn a lot of reputation, regardless of how many outgoing links it has.
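Writing the standard update equations side by side makes the difference explicit, where A_{ji} = 1 when node j links to node i, d_j^{out} is the out-degree of j, and N is the number of nodes:
x_i = \alpha \sum_j A_{ji} x_j + \beta                                     (Katz)
x_i = \alpha \sum_j \frac{A_{ji}}{d_j^{out}} x_j + \frac{1 - \alpha}{N}    (PageRank)
In Katz centrality, j passes its full score \alpha x_j to every single node it links to, while in PageRank that score is split across j's d_j^{out} outgoing links, so linking to everything dilutes each individual endorsement.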
Open PageRank implementation and data
This section is about more "open" PageRank implementations, notably ones using either or both of: open data such as the Common Crawl web graph, and open source code such as WebGraph.
As of 2025, the most open and reproducible implementation appears to be whatever Common Crawl web graph official PageRank does, which is to use WebGraph. It's quite beautiful.
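For intuition, here is a minimal self-contained power-iteration sketch of PageRank in Python. This is just a toy illustration of the underlying algorithm, not how WebGraph implements it at web scale:
def pagerank(out_links, damping=0.85, iters=50):
    # out_links: dict mapping each node to the list of nodes it links to.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling node: redistribute its rank uniformly.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Toy graph: "a" is linked to by both "b" and "c", so it ranks highest.
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))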
PageRank-like ranking
It would be really cool to have a PageRank-like algorithm that answers the key questions:
However, Ciro has decided to leave this for the phase two action plan, because it is impossible to tune such an algorithm if you have no users or test data.
Perhaps it is also worth looking into ExpertRank: they appear to do some kind of "expert in this area" ranking, but with clustering (unlike us, where the clustering would be more explicit).
Another dump of things worth looking into:
Stack Exchange
Stack Exchange solves to a good extent the use cases:
  • I have a very specific question, type it on Google, find top answers
  • I have an answer, and I put it here because it has a much greater chance of being found due to the larger PageRank than my personal web page will ever have
From these points of view, it is a big open question if we can actually substantially improve on it.
Major shortcomings are mentioned at: idiotic Stack Overflow policies.
Metrics and rationales
Long story short, the project is so far a complete failure on the most important metric: number of regular users, which currently sits at exactly one: myself.
There were notable users who found the project online, actually tried to use the website for some content, and provided extremely valuable feedback. Unfortunately, after a period of a few weeks they stopped using it to follow their other priorities instead. Which is of course totally fine, however sad.
I still believe that the OurBigBook Web feature is a significant tech innovation that could make the website go big.
I also believe that the project gets many fundamentals of braindumping right, notably the infinitely deep table of contents without forced scoping, e.g.:
- Mathematics
  - Calculus
does not make Calculus have an ID or URL of mathematics/calculus, rather it's just calculus.
But there is a fundamental difficulty in reaching the critical mass needed for that self-sustaining point, as people don't seem to be convinced by this logical "my system is better" argument alone, as opposed to Googling into stuff they need now and then realizing that the project is awesome.
A closely related critical mass issue is that existing big multiuser knowledge base websites such as Stack Overflow and Wikipedia have a tremendous advantage on PageRank. No matter how useless a Wikipedia article about something is, it will always be on top of Google within a week of creation for title hits. And since the main goal of publishing your stuff is to get it seen, it makes much more sense for writers to publish on such existing websites whenever possible, because anywhere else it is way way less likely to be seen by anybody.
Even I end up writing way more on Stack Overflow than on OurBigBook as a programmer. But I still believe that there is value in OurBigBook, for the usual reasons of:
  • it allows you to organize a more global view of a subject, i.e. a book. Even when I write answers on Stack Overflow, I also tend to organize links to those answers in a structured way here, see e.g. big topics such as SQL
  • it avoids deletionism and the overly narrow range of allowed topics/styles
Perhaps what saddens me the most is that even in GitHub stars/Twitter/Hacker News terms there is almost no interest in the project despite the fact that I consider that it has innovations, while many other note taking apps are well into the thousands of stars. Maybe I'm just delusional and all the tech that I'm doing is completely useless?
Part of the issue is probably linked to the fact that most other note taking apps focus on "help me organize my ideas so I can make more money" and often completely ignore "I want to publish my knowledge", and stuff that helps you make money is always easier to sell and promote.
OurBigBook on the other hand has a huge focus on "I want to publish my knowledge". It aims almost single-mindedly at being the best tool ever for that. However, this doesn't make money for people, and therefore there are going to be way fewer potential users.
I do believe strongly that all it takes is a few users for the project to snowball. For some people, once you start braindumping, it is very addictive, and you basically never want to stop. So with only a few of those we can open large parts of undergrad knowledge to the world. But these people are few, and so far I haven't been able to find even a single one like me, and on top of that convince them that I have created the ultimate system for their knowledge publishing desires.
Another general lesson is that I should perhaps have aimed for greater compatibility with existing systems such as Obsidian. Taking something that many people already know and use can have a huge impact on acceptance. E.g. anything that touches Obsidian can reach thousands of stars: github.com/KosmosisDire/obsidian-webpage-export. Note taking apps that aim for "markdown" compatibility also tend to fare better, even if in the end you inevitably have to extend the Markdown for some of your features. And WYSIWYG, which I want but don't have, is perhaps the ultimate familiarity.
Another issue compared to other platforms is that OurBigBook just came out late. Obsidian launched in 2020. Roam Research and Trilium Notes also came earlier. And it is hard to fight the advantage already gained by those in the "I'm going to take some personal notes" area. I do believe however that there is a strong separation between "these are my personal notes" and "I want to publish these". Once you decide to publish your knowledge, you immediately start to write in a different way, and it is very hard to convert pre-existing "private" notes into ones suitable for public consumption.
Quick fun with the Common Crawl web graph
I wanted to do a quick exploration of open PageRank implementation and data.
My general motivation for this is that a PageRank-like algorithm could be useful for more accurate user and article ranking on OurBigBook, see: Section "PageRank-like ranking".
But it could also be just generally cool to apply it to other graph datasets, e.g. for computing a Wikipedia internal PageRank.
A quick Google reveals only Open PageRank, but their methods are apparently closed source.
Then I had a look at the Common Crawl web graph data to see if I could easily calculate it myself, and... they already have it! See: Section "Common Crawl web graph official PageRank"
Their graph dumps are in BVGraph graph file format, which is the native format of the WebGraph framework, which implements the format and algorithms such as PageRank.
The only thing I miss is a command line interface to calculate the PageRank. That would be so awesome.
The more I look at it the more I love Common Crawl.
In cc-main-2024-25-dec-jan-feb-domain-ranks.txt:
  • cirosantilli.com was ranked ~453k
  • ourbigbook.com was at ~606k
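For the record, here is a minimal Python sketch of how such a lookup can be done, assuming a local download of the compressed dump and using the same reversed host notation as before:
import gzip

# Domains to look up, in reversed host notation.
wanted = {"com.cirosantilli", "com.ourbigbook"}
with gzip.open("cc-main-2024-25-dec-jan-feb-domain-ranks.txt.gz", "rt") as f:
    for line in f:
        cols = line.split()
        if cols and cols[4] in wanted:
            print(cols[4], "harmonic pos:", cols[0], "PageRank pos:", cols[2])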
Wikipedia internal PageRank
One day, once WebGraph exposes a PageRank CLI, we will be able to do it fully from the command line. It will be beautiful.