Ciro can relate strongly to the level of passion depicted in this film: Section "Don't be a pussy". It almost feels like a business film in a sense, where the startup is the author's career and passions.
A pair of Australian deep learning training providers/consultants who have produced a lot of good free learning materials. Authors:
- twitter.com/jeremyphoward Jeremy Howard
- twitter.com/math_rachel Rachel Thomas
There are infinitely many pairs of consecutive primes that are at most 70 million apart. This was the first such finite bound to be proven, and therefore a major breakthrough.
This implies that at least one gap value below 70 million repeats infinitely many times, but we don't know which one, e.g. there could be infinitely many prime pairs with gap 2, or infinitely many with gap 4, or infinitely many with gap 6, or infinitely many with gap 8, but we don't know which of those.
The prime k-tuple conjecture predicts that it is all of them.
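To get a feel for what "a gap value that repeats" means, here is a minimal sketch that counts how often each gap between consecutive primes occurs below a chosen limit. The function names `primes_below` and `gap_counts` and the limit used in the demo are assumptions made for illustration. A finite count like this proves nothing about infinity, it only shows the small gaps recurring in practice.

```python
from collections import Counter


def primes_below(limit):
    """Return all primes < limit using the sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Mark every multiple of n starting at n*n as composite.
            is_prime[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return [n for n in range(limit) if is_prime[n]]


def gap_counts(limit):
    """Count how often each gap between consecutive primes < limit occurs."""
    primes = primes_below(limit)
    return Counter(b - a for a, b in zip(primes, primes[1:]))


if __name__ == "__main__":
    counts = gap_counts(100_000)
    for gap in (2, 4, 6, 8):
        print(f"gap {gap}: {counts[gap]} occurrences below 100000")
```

The theorem says that at least one of the roughly 35 million even gap values keeps showing up no matter how far the limit is pushed, without telling us which.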
Amazing project, that basically makes a more searchable Wayback Machine.
A bit hard to use their data though, partly due to size, but also due to the lack of free-to-use querying mechanisms, and how obtuse Amazon S3 is to use.
Notably, aws-cli with an account is the only reliable way, everything else is way too broken, e.g. trying to check an index at index.commoncrawl.org/CC-MAIN-2023-06/ very often returns HTTP 500.
But still, their project is amazing.
The only out-of-the-box search they seem to have is: urlsearch.commoncrawl.org/ for domains/URLs. It is good, but there could be so much more... notably IPs.
Sample sizes can be found at: commoncrawl.org/2023/04/mar-apr-2023-crawl-archive-now-available/
To explore the data, after login:
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
Copy the toplevel directory only:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"
Copy some wet/wat files:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .
Directory structure:
- cc-index.paths.gz (1K)
- cc-index-table.paths.gz (1K)
- segment.paths.gz (1.7K) Sample lines:
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/
  crawl-data/CC-MAIN-2013-20/segments/1368696381630/
- index.html (2.3K)
- wat.paths.gz (98K) Sample lines:
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz
- wet.paths.gz (98K) Sample lines:
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz
- warc.paths.gz (99K) Sample lines:
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz
- segments: directory with actual data
  - 1368696381249: one of many segments, any meaning of name?
    - CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)
      A tiny bit of metadata, and then plaintext content from the website, e.g. the second one:
      WARC/1.0
      WARC-Type: conversion
      WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
      WARC-Date: 2013-05-18T08:11:02Z
      WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
      WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
      WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
      Content-Type: text/plain
      Content-Length: 12046

      Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.
      No IP unfortunately.
    - wat file: a lot of JSON metadata and no contents, as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:
      WARC/1.0
      WARC-Type: metadata
      WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
      WARC-Date: 2013-11-22T14:51:12Z
      WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
      WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
      Content-Type: application/json
      Content-Length: 1180

      {"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

      WARC/1.0
      WARC-Type: metadata
      WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
      WARC-Date: 2013-05-18T05:48:54Z
      WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
      WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
      Content-Type: application/json
      Content-Length: 1501

      {"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
      Let's beautify one of them to see it better:
      {
        "Envelope": {
          "Format": "WARC",
          "WARC-Header-Length": "274",
          "Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
          "Actual-Content-Length": "372",
          "WARC-Header-Metadata": {
            "WARC-Type": "warcinfo",
            "WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
            "WARC-Date": "2013-11-22T14:51:12Z",
            "Content-Length": "372",
            "WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
            "Content-Type": "application/warc-fields"
          },
          "Payload-Metadata": {
            "Trailing-Slop-Length": "0",
            "Actual-Content-Type": "application/warc-fields",
            "Actual-Content-Length": "372",
            "Headers-Corrupt": true,
            "WARC-Info-Metadata": {
              "robots": "classic",
              "software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
              "description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
              "hostname": "ip-10-60-113-184.ec2.internal",
              "format": "WARC File Format 1.0",
              "isPartOf": "CC-MAIN-2013-20",
              "operator": "CommonCrawl Admin",
              "publisher": "CommonCrawl"
            }
          }
        },
        "Container": {
          "Compressed": true,
          "Gzip-Metadata": {
            "Footer-Length": "8",
            "Deflate-Length": "453",
            "Header-Length": "10",
            "Inflated-CRC": "866052549",
            "Inflated-Length": "650"
          },
          "Offset": "0",
          "Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
        }
      }
      Fuck, no IP addresses either. But other entries do have it, why not this one?
      The reason these can be huge is the HTML-Metadata section, which contains all outlinks! gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34
    - CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
      Obtain:
      aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .
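Since the *.paths.gz files only list keys relative to the bucket, turning them into download commands is a matter of prefixing each line. Here is a minimal sketch, assuming the bucket prefix is the `s3://commoncrawl/` seen in the commands above. The helper names `read_paths` and `cp_commands` are made up for this example, and the demo uses an in-memory stand-in for wet.paths.gz rather than the real file.

```python
import gzip
import io
from itertools import islice

# Assumption: same bucket prefix as in the aws commands of these notes.
BUCKET_PREFIX = "s3://commoncrawl/"


def read_paths(fileobj):
    """Yield one S3 URI per non-empty line of a gzipped *.paths.gz file object."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield BUCKET_PREFIX + line


def cp_commands(fileobj, limit=2):
    """Return `aws s3 cp` command lines for the first `limit` listed files."""
    return [f"aws s3 cp {uri} ." for uri in islice(read_paths(fileobj), limit)]


if __name__ == "__main__":
    # In-memory stand-in for wet.paths.gz, using the sample lines shown above.
    sample = gzip.compress(
        b"crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/"
        b"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz\n"
        b"crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/"
        b"CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz\n"
    )
    for command in cp_commands(io.BytesIO(sample)):
        print(command)
```

With a real download, `open("wet.paths.gz", "rb")` could be passed instead of the `io.BytesIO` object.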
 
 
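To pull the IP out of wat entries, a rough sketch follows. It assumes each record is a block of `Name: value` headers, a blank line, then a JSON payload, and that the IP lives under `Envelope` / `WARC-Header-Metadata` / `WARC-IP-Address` as in the request entry above. The function names are hypothetical and the sample records are heavily abbreviated versions of the two entries shown. A likely reason the beautified entry has no IP is that it is a `warcinfo` record, which describes the archive file itself rather than a fetched page.

```python
import json


def parse_warc_record(text):
    """Split one WARC record into (headers dict, body string).

    Headers end at the first blank line; CRLF line endings are normalized.
    """
    head, _, body = text.replace("\r\n", "\n").partition("\n\n")
    headers = {}
    for line in head.splitlines()[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, body


def wat_ip_address(body):
    """Return WARC-IP-Address from a WAT JSON payload, or None if absent."""
    envelope = json.loads(body).get("Envelope", {})
    return envelope.get("WARC-Header-Metadata", {}).get("WARC-IP-Address")


# Abbreviated versions of the two entries shown in these notes.
REQUEST_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Date: 2013-05-18T05:48:54Z\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"Envelope":{"WARC-Header-Metadata":{"WARC-Type":"request",'
    '"WARC-IP-Address":"165.1.125.44"}}}'
)
WARCINFO_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"Envelope":{"WARC-Header-Metadata":{"WARC-Type":"warcinfo"}}}'
)

if __name__ == "__main__":
    for record in (REQUEST_RECORD, WARCINFO_RECORD):
        headers, body = parse_warc_record(record)
        print(headers["WARC-Type"], wat_ip_address(body))
```

For real files, a dedicated WARC parsing library would be more robust than this hand-rolled split, since payloads are delimited by Content-Length rather than by blank lines.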