Amazing project that basically makes a more searchable Wayback Machine.

A bit hard to use their data though, partly due to size, but also due to the lack of free-to-use querying mechanisms, and how obtuse Amazon S3 is to use.

Notably, aws-cli with an account is the only reliable way; everything else is way too broken, e.g. trying to check an index at index.commoncrawl.org/CC-MAIN-2023-06/ very often 500s.
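Since the index server 500s so often, any script that talks to it probably wants retries. Here is a minimal sketch, assuming the CDX-style query endpoint at index.commoncrawl.org/<crawl>-index with url and output parameters. The helper names index_query_url and fetch_with_retries are made up for this example, and the retry counts are arbitrary:

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def index_query_url(crawl: str, url_pattern: str) -> str:
    """Build a query URL for one crawl's index (endpoint shape is an assumption)."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{query}"


def fetch_with_retries(url, fetch=None, attempts=5, delay=1.0):
    """Call fetch(url), retrying with exponential backoff on HTTP 5xx errors."""
    if fetch is None:
        # Default fetcher: plain HTTP GET returning the response body as bytes.
        fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
    for attempt in range(attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Re-raise client errors, or server errors once attempts run out.
            if err.code < 500 or attempt == attempts - 1:
                raise
            time.sleep(delay * 2 ** attempt)
```

Usage would be something like fetch_with_retries(index_query_url("CC-MAIN-2023-06", "example.com")), which should return one JSON object per line if the server cooperates.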

But still, their project is amazing.

The only out-of-the-box search they seem to have is: urlsearch.commoncrawl.org/ for domains/URLs. It is good, but there could be so much more... notably IPs.

They could also document the data shape a bit better.

Sample sizes can be found at: commoncrawl.org/2023/04/mar-apr-2023-crawl-archive-now-available/

To explore the data, after login:

aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/

Copy only the top-level files of the crawl directory:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"

Copy some wet/wat files:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .

Directory structure:

cc-index.paths.gz (1K)
cc-index-table.paths.gz (1K)

segment.paths.gz (1.7K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/
crawl-data/CC-MAIN-2013-20/segments/1368696381630/

index.html (2.3K)

wat.paths.gz (98K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz

wet.paths.gz (98K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz

warc.paths.gz (99K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz
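The *.paths.gz files above are just gzipped lists of object keys, one per line, relative to the bucket root. A small sketch of reading one and turning each line into something downloadable. The s3://commoncrawl/ prefix matches the commands above; the https://data.commoncrawl.org/ prefix is my assumption for the public HTTPS mirror, and read_paths and locations are names made up for this example:

```python
import gzip

S3_PREFIX = "s3://commoncrawl/"
# Assumption: public HTTPS mirror of the same bucket.
HTTPS_PREFIX = "https://data.commoncrawl.org/"


def read_paths(gz_bytes: bytes) -> list[str]:
    """Decompress a *.paths.gz listing and return its non-empty lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def locations(path: str) -> tuple[str, str]:
    """Return the (S3 URI, HTTPS URL) pair for one listed path."""
    return S3_PREFIX + path, HTTPS_PREFIX + path
```

For example, with open("wat.paths.gz", "rb") as f: paths = read_paths(f.read()) gives the list of WAT files for the crawl, and locations(paths[0])[0] is what you would pass to aws s3 cp.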

segments: directory with actual data

1368696381249: one of many segments. Does the name have any meaning?
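The name looks like a Unix timestamp in milliseconds: decoding 1368696381249 gives the same date and time as the 20130516092621 in the file names under that segment. This is only an observation from the samples above, not something I found documented, and segment_time is a name made up for this check:

```python
from datetime import datetime, timezone


def segment_time(segment: str) -> datetime:
    """Interpret a segment name as milliseconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(int(segment) // 1000, tz=timezone.utc)


print(segment_time("1368696381249").strftime("%Y%m%d%H%M%S"))  # 20130516092621
```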

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)

A tiny bit of metadata, and then plaintext content from the website, e.g. the second one:

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
WARC-Date: 2013-05-18T08:11:02Z
WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
Content-Type: text/plain
Content-Length: 12046

Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.

No IP unfortunately.
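A record like the one above is simple enough to pick apart by hand: a version line, "Name: value" headers, a blank line, then the body. A rough sketch follows, with parse_warc_record being a made-up helper name. It splits on text rather than honouring Content-Length in bytes, so it is only good for eyeballing samples. For real processing a dedicated WARC library such as warcio is likely the better choice.

```python
def parse_warc_record(record: str) -> tuple[str, dict[str, str], str]:
    """Split one WARC record into (version line, header dict, body)."""
    # WARC files use CRLF line endings; normalise so both forms work.
    record = record.replace("\r\n", "\n")
    head, _, body = record.partition("\n\n")
    lines = head.splitlines()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only: values such as URLs contain more colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers, body
```

Checking "WARC-IP-Address" in headers on WET records is a quick way to confirm that the field really is absent.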

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz (329M, 1.4G unzipped)

A lot of JSON metadata and no contents as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
WARC-Date: 2013-11-22T14:51:12Z
WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
Content-Type: application/json
Content-Length: 1180

{"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
WARC-Date: 2013-05-18T05:48:54Z
WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
Content-Type: application/json
Content-Length: 1501

{"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

Let's beautify one of them to see it better:


{
  "Envelope": {
    "Format": "WARC",
    "WARC-Header-Length": "274",
    "Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
    "Actual-Content-Length": "372",
    "WARC-Header-Metadata": {
      "WARC-Type": "warcinfo",
      "WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
      "WARC-Date": "2013-11-22T14:51:12Z",
      "Content-Length": "372",
      "WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
      "Content-Type": "application/warc-fields"
    },
    "Payload-Metadata": {
      "Trailing-Slop-Length": "0",
      "Actual-Content-Type": "application/warc-fields",
      "Actual-Content-Length": "372",
      "Headers-Corrupt": true,
      "WARC-Info-Metadata": {
        "robots": "classic",
        "software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
        "description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
        "hostname": "ip-10-60-113-184.ec2.internal",
        "format": "WARC File Format 1.0",
        "isPartOf": "CC-MAIN-2013-20",
        "operator": "CommonCrawl Admin",
        "publisher": "CommonCrawl"
      }
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "453",
      "Header-Length": "10",
      "Inflated-CRC": "866052549",
      "Inflated-Length": "650"
    },
    "Offset": "0",
    "Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
  }
}
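The beautified version above can be reproduced with nothing but the standard library. A minimal sketch, where beautify is a made-up name and the input is one JSON line copied out of the WAT file:

```python
import json


def beautify(wat_json: str) -> str:
    """Re-indent one WAT JSON payload, keeping the original key order."""
    return json.dumps(json.loads(wat_json), indent=2)
```

From a shell, piping the line through python3 -m json.tool does about the same thing.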

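Fuck, no IP addresses either. But other entries do have it, why not this one? Most likely because this record has WARC-Type warcinfo: it describes the WARC file itself rather than a fetched URL, so there is no server IP to record. The second entry above is a request record, and it does carry WARC-IP-Address (165.1.125.44).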
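To see which entries carry an IP, it is enough to look at Envelope / WARC-Header-Metadata, which is where WARC-IP-Address sits in the request sample above. A small sketch, with warc_type_and_ip being a made-up helper name:

```python
import json


def warc_type_and_ip(wat_json: str):
    """Return (WARC-Type, WARC-IP-Address or None) for one WAT JSON payload."""
    meta = json.loads(wat_json)["Envelope"]["WARC-Header-Metadata"]
    return meta.get("WARC-Type"), meta.get("WARC-IP-Address")
```

Running this over every JSON line of a WAT file and counting the results per WARC-Type would show which record types have the field.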

The reason these can be huge is the HTML-Metadata section, which contains all outlinks! gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34
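A sketch of pulling the outlinks out of one entry. The key path Envelope / Payload-Metadata / HTTP-Response-Metadata / HTML-Metadata / Links, with a "url" field per link, is an assumption based on the linked gist and not on the samples above, which contain no response records. The function name outlinks is made up too:

```python
import json


def outlinks(wat_json: str) -> list[str]:
    """Return outlink URLs from one WAT payload, or [] if the section is absent."""
    node = json.loads(wat_json)
    # Assumed key path, taken from the linked gist (not verified against the samples).
    for key in ("Envelope", "Payload-Metadata", "HTTP-Response-Metadata", "HTML-Metadata"):
        node = node.get(key, {})
    return [link["url"] for link in node.get("Links", []) if "url" in link]
```

Entries without an HTML-Metadata section, like the two samples above, simply give an empty list.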

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz ()

Obtain:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .

Common Crawl web graph

 0  0

commoncrawl.org/web-graphs

Apparently in 2017 they started making their own Web Graphs, i.e. they parse the HTML and extract the graph of what links to what.

This is exactly what we need for an open implementation of PageRank.
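To make concrete what such an implementation would compute, here is a toy power-iteration PageRank over an in-memory edge list. It is only a sketch under the usual textbook assumptions (damping factor 0.85, fixed iteration count, rank of pages without outlinks spread uniformly). It says nothing about how Common Crawl computes anything, and it would not scale to their graphs:

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Toy PageRank: edges is a list of (source, destination) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out = {node: [] for node in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outlinks) is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out[node])
        new = {node: (1 - damping + damping * dangling) / len(nodes) for node in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank
```

The ranks always sum to 1, so pagerank([("a", "b"), ("b", "a"), ("c", "a")]) can be read as a probability distribution over the three pages, with "a" ranked highest since both other pages link to it.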

Edit: actually, they already calculate PageRank for us!!! Fantastic!!! Main section: Section "Common Crawl web graph official PageRank".

The graphs are dumped in BVGraph format.

A quick exploration of the graph can be seen at: github.com/cirosantilli/cirosantilli.github.io/issues/198

Their source code is at: github.com/commoncrawl/cc-webgraph

Reverse image search

 0  0

 Tagged

Google reverse image search

facecheck.id

 0  0

facecheck.id/

Became paid in 2024: www.reddit.com/r/OSINT/comments/1awkxbi/facecheckid_will_no_longer_be_free/ You can search, and it lists which social media websites it found the hits on, but it does not give the full URLs.

It had one possible non-trivial LinkedIn hit for Ross Ulbricht's wife in early 2025, before her identity was publicly known, so they may actually have something going on there.

