= Common Crawl
So far, no new domains have been found with <Common Crawl>, nor have any of the already known domains been found to be present in Common Crawl. Our working theory is that Common Crawl simply never reached the domains in the first place, see also: <How did Alexa find the domains?>
Let's try and do something with <Common Crawl>.
Unfortunately there's no <IP> data apparently: https://github.com/commoncrawl/cc-index-table/issues/30[], so let's focus on the URLs.
Using their <Common Crawl Athena> method: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
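For reference, the one-time Athena setup from that article boils down to creating an external table over the columnar index on S3 and then loading the partitions. The following is only an abridged sketch (just a few of the columns are listed; in practice use the full `CREATE TABLE` statement from the cc-index-table repository):
``
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                STRING,
  url                        STRING,
  url_host_name              STRING,
  url_host_registered_domain STRING,
  fetch_time                 TIMESTAMP,
  fetch_status               SMALLINT,
  warc_filename              STRING,
  warc_record_offset         INT,
  warc_record_length         INT
  -- remaining columns omitted here, see the full DDL in cc-index-table
)
PARTITIONED BY (crawl STRING, subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

-- Discover the crawl/subset partitions so queries can find any data at all.
MSCK REPAIR TABLE ccindex;
``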
Hello world:
``
select * from "ccindex"."ccindex" limit 100;
``
Data scanned: 11.75 MB
Sample first output line:
``
# 2
url_surtkey org,whwheelers)/robots.txt
url https://whwheelers.org/robots.txt
url_host_name whwheelers.org
url_host_tld org
url_host_2nd_last_part whwheelers
url_host_3rd_last_part
url_host_4th_last_part
url_host_5th_last_part
url_host_registry_suffix org
url_host_registered_domain whwheelers.org
url_host_private_suffix org
url_host_private_domain whwheelers.org
url_host_name_reversed
url_protocol https
url_port
url_path /robots.txt
url_query
fetch_time 2021-06-22 16:36:50.000
fetch_status 301
fetch_redirect https://www.whwheelers.org/robots.txt
content_digest 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
content_mime_type text/html
content_mime_detected text/html
content_charset
content_languages
content_truncated
warc_filename crawl-data/CC-MAIN-2021-25/segments/1623488519183.85/robotstxt/CC-MAIN-20210622155328-20210622185328-00312.warc.gz
warc_record_offset 1854030
warc_record_length 639
warc_segment 1623488519183.85
crawl CC-MAIN-2021-25
subset robotstxt
``
So `url_host_3rd_last_part` might be a winner for <CGI comms> fingerprinting!
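If that pans out, a reverse search by subdomain label could look something like the query below. The `'ss'` value is just a hypothetical placeholder for whatever label the fingerprint turns out to be, not a known fingerprint, and the query is restricted to a single crawl/subset partition to keep the scan small:
``
-- 'ss' is a hypothetical placeholder subdomain label, not a known fingerprint.
select url_host_registered_domain, count(*) as hits
from "ccindex"."ccindex"
where
  crawl = 'CC-MAIN-2013-20' and
  subset = 'warc' and
  url_host_3rd_last_part = 'ss'
group by url_host_registered_domain
order by hits desc
limit 100;
``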
A naive query for a single domain over the full index:
``
select * from "ccindex"."ccindex" where url_host_registered_domain = 'conquermstoday.com' limit 100;
``
No results... data scanned: 5.73 GB.
Let's see if they have any of the known domains at all. Let's also restrict by date to try to reduce the data scanned:
``
select * from "ccindex"."ccindex" where
fetch_time < TIMESTAMP '2014-01-01 00:00:00' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'topbillingsite.com',
'worldwildlifeadventure.com'
)
``
Hmm, data scanned: 60.59 GB and no hits... weird.
Sanity check:
``
select * from "ccindex"."ccindex" WHERE
crawl = 'CC-MAIN-2013-20' AND
subset = 'warc' AND
url_host_registered_domain IN (
'google.com',
'amazon.com'
)
``
has a bunch of hits of course. Also data scanned: only 212.88 MB. The table is partitioned by `crawl` and `subset`, so putting those in the `WHERE` clause is a must to prune the scan, which is also why the earlier `fetch_time` restriction didn't help. Should have read the article first.
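To see which crawl partition names are even available to filter on, assuming the standard partitioned `ccindex` setup, the partition listing can be inspected directly (this only touches partition metadata, not the table data):
``
SHOW PARTITIONS ccindex;
``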
Let's widen a bit more:
``
select * from "ccindex"."ccindex" WHERE
crawl IN (
'CC-MAIN-2013-20',
'CC-MAIN-2013-48',
'CC-MAIN-2014-10'
) AND
subset = 'warc' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'worldnewsandent.com',
'worldwildlifeadventure.com'
)
``
Still nothing found... they don't seem to have any of the URLs of interest?