Common Crawl by Ciro Santilli 35 Updated +Created
Amazing project, that basically makes a more searchable Wayback Machine.
A bit hard to use their data though, partly due to size, but also due to the lack of free-to-use querying mechanisms, and how obtuse Amazon S3 is to use.
Notably, aws-cli with an account is the only reliable way; everything else is way too broken, e.g. trying to check an index at index.commoncrawl.org/CC-MAIN-2023-06/ very often returns a 500.
But still, their project is amazing.
The only out-of-the-box search they seem to have is: urlsearch.commoncrawl.org/ for domains/URLs. It is good, but there could be so much more... notably IPs.
They also should document the data shape a bit better.
To explore the data, after login:
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
Copy the toplevel directory only:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"
Copy some wet/wat files:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .
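The object keys used in the commands above follow a regular pattern. Here is a minimal Python sketch, assuming that pattern holds for the other files of this crawl, that splits such a key into its parts and rebuilds the s3:// URI. The names `CrawlFile`, `parse_key` and `s3_uri` are made up for illustration and are not part of any Common Crawl tooling, and the regular expression is inferred only from the sample paths on this page.

```python
import re
from typing import NamedTuple

BUCKET = "commoncrawl"  # bucket name as used in the aws commands above


class CrawlFile(NamedTuple):
    crawl: str      # e.g. CC-MAIN-2013-20
    segment: str    # e.g. 1368696381249
    kind: str       # warc, wat or wet
    timestamp: str  # e.g. 20130516092621
    serial: str     # e.g. 00000
    host: str       # e.g. ip-10-60-113-184.ec2.internal


# Assumption: pattern inferred from the sample paths only; other crawls may differ.
KEY_RE = re.compile(
    r"^crawl-data/(?P<crawl>CC-MAIN-\d{4}-\d{2})"
    r"/segments/(?P<segment>\d+)"
    r"/(?P<kind>warc|wat|wet)"
    r"/CC-MAIN-(?P<timestamp>\d{14})-(?P<serial>\d{5})-(?P<host>.+?)"
    r"\.warc(?:\.wat|\.wet)?\.gz$"
)


def parse_key(key: str) -> CrawlFile:
    """Split a crawl-data object key into its named components."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError(f"unrecognized key: {key}")
    return CrawlFile(**m.groupdict())


def s3_uri(key: str) -> str:
    """Prefix a bucket-relative key with the bucket to get an s3:// URI."""
    return f"s3://{BUCKET}/{key}"
```

With this, `parse_key(key).kind` tells WARC, WAT and WET files apart, and `s3_uri(key)` gives the argument to pass to `aws s3 cp`.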
Directory structure:
  • cc-index.paths.gz (1K)
  • cc-index-table.paths.gz (1K)
  • segment.paths.gz (1.7K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/
    crawl-data/CC-MAIN-2013-20/segments/1368696381630/
  • index.html (2.3K)
  • wat.paths.gz (98K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz
  • wet.paths.gz (98K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz
  • warc.paths.gz (99K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz
  • segments: directory with actual data
    • 1368696381249: one of many segments, any meaning of name?
      • CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)
        A tiny bit of metadata, and then plaintext content from the website, e.g. the second one:
        WARC/1.0
        WARC-Type: conversion
        WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
        WARC-Date: 2013-05-18T08:11:02Z
        WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
        WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
        WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
        Content-Type: text/plain
        Content-Length: 12046
        
        Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.
        No IP unfortunately.
      • CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz (329M, 1.4G unzipped)
        A lot of JSON metadata and no contents as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:
        WARC/1.0
        WARC-Type: metadata
        WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
        WARC-Date: 2013-11-22T14:51:12Z
        WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
        WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
        Content-Type: application/json
        Content-Length: 1180
        
        {"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
        
        WARC/1.0
        WARC-Type: metadata
        WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
        WARC-Date: 2013-05-18T05:48:54Z
        WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
        WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
        Content-Type: application/json
        Content-Length: 1501
        
        {"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
        Let's beautify one of them to see it better:
        
        {
          "Envelope": {
            "Format": "WARC",
            "WARC-Header-Length": "274",
            "Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
            "Actual-Content-Length": "372",
            "WARC-Header-Metadata": {
              "WARC-Type": "warcinfo",
              "WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
              "WARC-Date": "2013-11-22T14:51:12Z",
              "Content-Length": "372",
              "WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
              "Content-Type": "application/warc-fields"
            },
            "Payload-Metadata": {
              "Trailing-Slop-Length": "0",
              "Actual-Content-Type": "application/warc-fields",
              "Actual-Content-Length": "372",
              "Headers-Corrupt": true,
              "WARC-Info-Metadata": {
                "robots": "classic",
                "software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
                "description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
                "hostname": "ip-10-60-113-184.ec2.internal",
                "format": "WARC File Format 1.0",
                "isPartOf": "CC-MAIN-2013-20",
                "operator": "CommonCrawl Admin",
                "publisher": "CommonCrawl"
              }
            }
          },
          "Container": {
            "Compressed": true,
            "Gzip-Metadata": {
              "Footer-Length": "8",
              "Deflate-Length": "453",
              "Header-Length": "10",
              "Inflated-CRC": "866052549",
              "Inflated-Length": "650"
            },
            "Offset": "0",
            "Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
          }
        }
        Fuck no IP addresses either. But other entries do have it, why not this one?
        The reason these can be huge is the HTML-Metadata section, which contains all outlinks! gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34
      • CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz ()
        Obtain:
        aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .
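The WET and WAT samples above share the same record layout: a WARC/1.0 line, header fields, an empty line, and then Content-Length bytes of body. Here is a minimal sketch, using only the Python standard library, that walks those records in a downloaded file and pulls the IP out of the WAT JSON. It assumes the IP always sits at Envelope → WARC-Header-Metadata → WARC-IP-Address, as in the second WAT sample. The first WAT sample describes a warcinfo record rather than an HTTP request, which is presumably why it carries no IP. The function names are hypothetical, and a dedicated WARC library would be more robust than this hand-rolled reader.

```python
import gzip
import json
from typing import BinaryIO, Dict, Iterator, Optional, Tuple


def iter_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each WARC record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The body length is given in bytes by the Content-Length header.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def wat_ip(body: bytes) -> Optional[str]:
    """Return WARC-IP-Address from a WAT JSON body, or None if absent."""
    meta = json.loads(body)["Envelope"]["WARC-Header-Metadata"]
    return meta.get("WARC-IP-Address")


def iter_wat_ips(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (target URI, IP) pairs from a local .warc.wat.gz file."""
    with gzip.open(path, "rb") as f:
        for headers, body in iter_records(f):
            if headers.get("Content-Type") != "application/json":
                continue  # e.g. the leading warcinfo record of the file
            ip = wat_ip(body)
            if ip is not None:
                yield headers.get("WARC-Target-URI", ""), ip
```

`iter_records` also works on a WET file opened with `gzip.open(path, "rb")`, in which case each body is the extracted plaintext of one page. Streaming record by record avoids holding the 1.4G of unzipped WAT data in memory.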
Hacker group
Documentation generator
Apache License
GNU General Public License
Cutout (espionage)
Application of systems of linear equations
No 2x2 examples please. I'm talking about large matrices that would be used in supercomputers.
System of linear equations algorithm
University is broken
You just have to spend a few minutes with students until they complain about the courses or teachers. And you just have to spend a few hours with teachers until they complain about the students or broader system.
University is broken, and everyone knows it. The only question now is finding a viable, "political cash flow positive" path, into something better.
University should focus on inspiring and not on evaluating
As of the 2020s and much earlier, Ciro Santilli believes that undergrad studies were fundamentally broken (considering the Information Age which completely changed what would be possible) because university had only two goals, with the exception of a few enlightened professors:
  • rank students from worst to best so they can get into PhD programs.
    For regular jobs grades didn't even matter as much compared to the prestige of your university (and therefore, university entry exam grades) and your ability to stand the stress of exams to get a minimal passing grade.
    In particular, being able to rank requires setting the difficulty level at a point where you can see a normal distribution in grades, and not have everyone at either 0 or 100%.
    Also, this split could be caused by either shitty learning materials/conditions, or by mere volume. It doesn't matter.
  • get money from the students. Of course, in countries where university is "free", this means reporting how many students you had to some government office so they can give you a corresponding budget. But you still have an incentive to enroll as many as possible.
As a result, most students, who would not go on to do a PhD, essentially do a simple trade: all their time, and possibly some money, in exchange for imbuing themselves with the incredible name of a respected institution so they can get better jobs later on.
Beauty, deep understanding, and learning awesome things come basically as a second thought.
Adjacency list
Ordered and unordered trees
Spin-off company
Master's degree
In your normal 2020 broken educational system, it is the first time at which students get an official chance to learn something advanced, and possibly prepare to go venture into the PhD desert.
webots.cloud
Not going to lie, this is some cool shit, robot simulation and 3D visualization in the browser.
Deletionism on Wikipedia
Some examples by Ciro Santilli follow.
Of the tutorial-subjectivity type:
Notability constraints, which are way too strict:
  • even information about important companies can be disputed. E.g. once Ciro Santilli tried to create a page for PsiQuantum, a startup with $650m in funding, and there was a deletion proposal because it did not contain verifiable sources not linked directly to information provided by the company itself: en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/PsiQuantum Although this argument is correct, it is also true about 90% of everything that is on Wikipedia about any company. Where else can you get any information about a B2B company? Their clients are not going to say anything. Lawsuits and scandals are kind of the only possible source... In that case, the page was deleted with 2 votes against vs 3 votes for deletion.
    "should we delete this extremely likely useful/correct content or not according to this extremely complex system of guidelines" is very similar to Stack Exchange's own Stack Overflow content deletion issues. Ain't Nobody Got Time For That. "Ain't Nobody Got Time for That" actually has a Wiki page: en.wikipedia.org/wiki/Ain%27t_Nobody_Got_Time_for_That. That's notable. Unlike a $600M+ company of course.
    In December 2023 the page was re-created, and seemed to stick: en.wikipedia.org/wiki/Talk:PsiQuantum#Secondary_sources It's just a random going back and forth. Author Ctjk has an interesting background:
    I am a legal official at a major government antitrust agency. The only plausible connection is we regulate tech firms
There are even wikis that were created to remove notability constraints: Wiki without notability requirements.
These are the reasons why Ciro basically only contributes images to Wikipedia: because they are either all in or all out, and you can determine which one of them it is. And this allows images to be more attributable, so people can actually see that it was Ciro that created a given amazing image, thus overcoming Wikipedia's lack of reputation system a little bit as well.
Wikipedia is perfect for things like biographies, geography, or history, which have a much more defined and less subjective expository order. But when it comes to "tutorials of how to actually do stuff", which is what mathematics and physics are basically about, Wikipedia has a very hard time going beyond dry definitions which are only useful for people who already half know the stuff. But to learn from zero, newbies need tutorials with intuition and examples.
Box2D
Generative AI