= Common Crawl
{wiki}
https://commoncrawl.org/
Amazing project that basically makes a more searchable <Wayback Machine>.
A bit hard to use their data though, partly due to size, but also due to the lack of free-to-use querying mechanisms, and how obtuse <Amazon S3> is to use.
Notably, <aws-cli> with an account is the only reliable way, everything else is way too broken, e.g. trying to check an index at https://index.commoncrawl.org/CC-MAIN-2023-06/ very often 500s.
But still, their project is amazing.
The only out-of-the-box search they seem to have is: http://urlsearch.commoncrawl.org/[] for domains/URLs. It is good, but there could be so much more... notably <IPs>.
They could also document the data shape a bit better.
Sample sizes can be found at: https://commoncrawl.org/2023/04/mar-apr-2023-crawl-archive-now-available/
To explore the data, after login:
``
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
``
Copy the toplevel directory only:
``
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"
``
Copy some wet/wat files:
``
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .
``
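The top-level copy above fetches listing files such as `wat.paths.gz`, which hold one object key per line (see the sample lines in the directory structure below). A minimal Python sketch of turning those lines into arguments for `aws s3 cp`, assuming the listing file sits in the current directory. The helper names `s3_uris` and `read_paths` are made up for illustration:

```python
import gzip

# Same bucket as in the aws commands above.
BUCKET_PREFIX = "s3://commoncrawl/"


def s3_uris(path_lines):
    """Turn lines from a *.paths.gz listing into full S3 URIs, skipping blanks."""
    return [BUCKET_PREFIX + line.strip() for line in path_lines if line.strip()]


def read_paths(filename):
    """Read a gzip-compressed listing such as wat.paths.gz and return S3 URIs."""
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        return s3_uris(f)


# Possible usage, once wat.paths.gz has been downloaded:
# for uri in read_paths("wat.paths.gz")[:5]:
#     print(uri)
```

Each printed URI can then be passed directly to `aws s3 cp <uri> .`.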
Directory structure:
* cc-index.paths.gz (1K)
* cc-index-table.paths.gz (1K)
* segment.paths.gz (1.7K) Sample lines:
``
crawl-data/CC-MAIN-2013-20/segments/1368696381249/
crawl-data/CC-MAIN-2013-20/segments/1368696381630/
``
* index.html (2.3K)
* wat.paths.gz (98K) Sample lines:
``
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz
``
* wet.paths.gz (98K) Sample lines:
``
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz
``
* warc.paths.gz (99K) Sample lines:
``
crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz
``
* segments: directory with actual data
* 1368696381249: one of many segments, any meaning of name?
* CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)
A tiny bit of metadata, and then plaintext content from the website, e.g. the second record:
``
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
WARC-Date: 2013-05-18T08:11:02Z
WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
Content-Type: text/plain
Content-Length: 12046
Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.
``
No <IP> unfortunately.
* CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz (329M, 1.4G unzipped)
A lot of JSON metadata and no contents as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:
``
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
WARC-Date: 2013-11-22T14:51:12Z
WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
Content-Type: application/json
Content-Length: 1180
{"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
WARC-Date: 2013-05-18T05:48:54Z
WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
Content-Type: application/json
Content-Length: 1501
{"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
``
Let's beautify one of them to see it better:
``
{
"Envelope": {
"Format": "WARC",
"WARC-Header-Length": "274",
"Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
"Actual-Content-Length": "372",
"WARC-Header-Metadata": {
"WARC-Type": "warcinfo",
"WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
"WARC-Date": "2013-11-22T14:51:12Z",
"Content-Length": "372",
"WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
"Content-Type": "application/warc-fields"
},
"Payload-Metadata": {
"Trailing-Slop-Length": "0",
"Actual-Content-Type": "application/warc-fields",
"Actual-Content-Length": "372",
"Headers-Corrupt": true,
"WARC-Info-Metadata": {
"robots": "classic",
"software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
"description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
"hostname": "ip-10-60-113-184.ec2.internal",
"format": "WARC File Format 1.0",
"isPartOf": "CC-MAIN-2013-20",
"operator": "CommonCrawl Admin",
"publisher": "CommonCrawl"
}
}
},
"Container": {
"Compressed": true,
"Gzip-Metadata": {
"Footer-Length": "8",
"Deflate-Length": "453",
"Header-Length": "10",
"Inflated-CRC": "866052549",
"Inflated-Length": "650"
},
"Offset": "0",
"Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
}
}
``
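Damn, no IP address in this one either. But other entries do have it, e.g. the second entry above has `WARC-IP-Address` under `WARC-Header-Metadata`. Presumably this one doesn't because it is a `warcinfo` record, which describes the WARC file itself rather than a fetched URL.
The reason these can be huge is the `HTML-Metadata` section, which contains all outlinks! https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34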
* `CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz` ()
Obtain:
``
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .
``
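Since the WAT files are the ones that carry the IP, scanning them for `WARC-IP-Address` is the obvious first experiment. A rough Python sketch, assuming the record layout seen in the WAT samples above: a `WARC/1.0` line, `Name: value` headers, a blank line (standard in the WARC format, though not visible in the samples above), then a payload of `Content-Length` bytes. The function names `iter_records` and `ip_of` are made up for illustration:

```python
import gzip
import json


def iter_records(stream):
    """Yield (headers, payload_bytes) for each record in a binary WARC-style stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the header block
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read the payload by length, so its contents can never confuse the parser.
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload


def ip_of(headers, payload):
    """Return (url, ip) for a WAT JSON record, or None if it carries no IP."""
    if headers.get("Content-Type") != "application/json":
        return None
    meta = json.loads(payload).get("Envelope", {}).get("WARC-Header-Metadata", {})
    ip = meta.get("WARC-IP-Address")
    if ip is None:
        return None  # e.g. the JSON describing a warcinfo record
    return meta.get("WARC-Target-URI"), ip


# Possible usage on a downloaded WAT file:
# with gzip.open("CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz", "rb") as f:
#     for headers, payload in iter_records(f):
#         hit = ip_of(headers, payload)
#         if hit:
#             print(*hit)
```

The same `iter_records` loop should also work on the WET files, where the payload is the extracted plaintext rather than JSON, so `ip_of` simply returns `None` there.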