Cosmopedia by Ciro Santilli 37 Updated 2025-07-16
Cosmopedia is a dataset of synthetic textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.
Usenet personality by Ciro Santilli 37 Updated 2025-07-16
Ciro Santilli does the same via Google searches and Twitter/Reddit searches for himself, you can't invent anything new nowadays:
Kibo was known for his high-volume but thoughtful posts, but achieved Usenet celebrity circa 1991 by writing a small script to grep his entire Usenet feed for instances of his name, and then answering personally whenever and wherever he was mentioned, giving the illusion that he was personally reading the entire feed.
DigitalDreamDoor by Ciro Santilli 37 Updated 2025-07-16
Ahh, this brings good memories of Ciro Santilli's musical formative teenage years scouring the web for the best art humanity had ever produced in certain generes. And it still is a valuable resource as of the 2020's!
This section tries to explain how the discoveries were made in more detail.
Some of the subsections are quite readable, while others are mostly data dumps and work logs, so bear with us.
But by looking at the URLs of the screenshots they provided from other websites we can easily uncover all others that had screenshots, except for the Johnny Carson one, which is just generically named. E.g. the image for the Chinese one is www.reuters.com/investigates/special-report/assets/usa-spies-iran/screencap-activegaminginfo.com.jpg?v=192516290922 which leads us to domain activegaminginfo.com.
Oleg Shakirov later discovered that the Carson one had its domain written right on the screenshot, as part of a watermark present on the original website itself. Therefore the URLs of all the websites were in one way or another essentially given on the article.
The full list of domains from screenshots is:
This is a dark art, and many of the sources are shady as fuck! We often have no idea of their methodology. Also no source is fully complete. We just piece up as best we can.
In order to explore IPs in known IP ranges, what we need are good DNS databases.
Starting at twitter.com/shakirov2036/status/1746729471778988499, Russian expat Oleg Shakirov comments "Let me know if you are still looking for the Carson website".
He then proceeded to give Carson and 5 other domains in private communication. His name is given here with his consent. His advances besides not being blind were Yandexing for some of the known hits which led to pages that contained other hits:
Unfortunately, these methods are not very generalizable, and didn't lead to a large number of other hits. But every domain counts!
whoisxmlapi WHOIS history March 23, 2011:
  • Created Date: April 9, 2007 00:00:00 UTC
  • Updated Date: March 2, 2011 00:00:00 UTC
  • Expires Date: April 9, 2011 00:00:00 UTC
  • Registrant Name: domainsbyproxy.com
  • Name servers: dns1.registrar-servers.com|dns2.registrar-servers.com
whoisrequest.com/history/ mentions:
1 May, 2007: Domain created*, nameservers added. Nameservers:
  • ns1.qwknetllc.com
  • ns2.qwknetllc.com

Pinned article: Introduction to the OurBigBook Project

Welcome to the OurBigBook Project! Our goal is to create the perfect publishing platform for STEM subjects, and get university-level students to write the best free STEM tutorials ever.
Everyone is welcome to create an account and play with the site: ourbigbook.com/go/register. We belive that students themselves can write amazing tutorials, but teachers are welcome too. You can write about anything you want, it doesn't have to be STEM or even educational. Silly test content is very welcome and you won't be penalized in any way. Just keep it legal!
We have two killer features:
  1. topics: topics group articles by different users with the same title, e.g. here is the topic for the "Fundamental Theorem of Calculus" ourbigbook.com/go/topic/fundamental-theorem-of-calculus
    Articles of different users are sorted by upvote within each article page. This feature is a bit like:
    • a Wikipedia where each user can have their own version of each article
    • a Q&A website like Stack Overflow, where multiple people can give their views on a given topic, and the best ones are sorted by upvote. Except you don't need to wait for someone to ask first, and any topic goes, no matter how narrow or broad
    This feature makes it possible for readers to find better explanations of any topic created by other writers. And it allows writers to create an explanation in a place that readers might actually find it.
    Figure 1.
    Screenshot of the "Derivative" topic page
    . View it live at: ourbigbook.com/go/topic/derivative
  2. local editing: you can store all your personal knowledge base content locally in a plaintext markup format that can be edited locally and published either:
    This way you can be sure that even if OurBigBook.com were to go down one day (which we have no plans to do as it is quite cheap to host!), your content will still be perfectly readable as a static site.
    Figure 5. . You can also edit articles on the Web editor without installing anything locally.
    Video 3.
    Edit locally and publish demo
    . Source. This shows editing OurBigBook Markup and publishing it using the Visual Studio Code extension.
  3. https://raw.githubusercontent.com/ourbigbook/ourbigbook-media/master/feature/x/hilbert-space-arrow.png
  4. Infinitely deep tables of contents:
    Figure 6.
    Dynamic article tree with infinitely deep table of contents
    .
    Descendant pages can also show up as toplevel e.g.: ourbigbook.com/cirosantilli/chordate-subclade
All our software is open source and hosted at: github.com/ourbigbook/ourbigbook
Further documentation can be found at: docs.ourbigbook.com
Feel free to reach our to us for any help or suggestions: docs.ourbigbook.com/#contact