Amazon Athena by Ciro Santilli
Amazon Redshift by Ciro Santilli
Amazon S3 by Ciro Santilli
B-tree by Ciro Santilli
Like a binary search tree, but each node can hold multiple objects (keys) and have more than two children.
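A minimal sketch of lookup in such a tree, to make the "multiple objects, more than two children" point concrete. The `BTreeNode` class and `btree_contains` function are hypothetical names for illustration only, and insertion and rebalancing are left out:

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    # Sorted keys stored in this node (a binary search tree node holds exactly one).
    keys: list
    # Children: empty for a leaf, otherwise len(keys) + 1 subtrees.
    children: list = field(default_factory=list)


def btree_contains(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while node is not None:
        # Binary search inside the node for the first key >= the target.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False
        # Child i holds every key between keys[i - 1] and keys[i].
        node = node.children[i]
    return False


# Toy tree: one root with two keys and three children.
root = BTreeNode(
    keys=[10, 20],
    children=[BTreeNode([1, 5]), BTreeNode([12, 17]), BTreeNode([25, 30, 40])],
)
print(btree_contains(root, 17), btree_contains(root, 18))
```

Each step down the tree discards all but one of several children, which is why the tree stays shallow compared to a binary search tree with the same number of keys.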
WebLearn (Oxford) by Ciro Santilli
Lancet by Ciro Santilli
Wayback Machine by Ciro Santilli
D'oh.
But to be serious: the Wayback Machine contains a very large proportion of all sites. It is the most complete database we have found so far. Some archives are very broken, but those are rare.
The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.
The Common Crawl project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.
CDX + 2013 DNS Census + heuristics has been fruitful, however.
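As a rough illustration of why a domain has to be in hand first, here is a sketch that builds a CDX query URL for one domain. It assumes the public CDX server endpoint and its `url`, `matchType`, `output` and `limit` parameters; `cdx_url` is a hypothetical helper, not part of the scripts used in this project:

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint (assumed here, check the CDX server docs).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(domain, limit=10):
    """Build a CDX query URL listing captures for a domain and its subdomains."""
    params = {
        "url": domain,          # the query is always keyed by a URL or domain
        "matchType": "domain",  # include subdomains of the given domain
        "output": "json",
        "limit": limit,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


print(cdx_url("todaysengineering.com"))
```

There is no parameter that says "any domain matching some property": the `url` key is mandatory, which is the cross-domain limitation described above.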
viewdns.info by Ciro Santilli
Accounts used so far: 6 (1500 reverse IP checks).
Their historic DNS and reverse DNS info was very valuable, and served as Ciro's initial entry point to finding hits in the IP ranges given by Reuters.
Their data is also quite disjoint from the data of the 2013 DNS Census. There is some overlap, but clearly their methodology is very different. Sometimes they slot into one another almost perfectly.
You can only get about 250 queries on the web interface, then 250 queries per free account via API.
Since this source is so scarce and valuable, we have been quite careful to note down all the domains and IP ranges that have been explored.
They check your IP when you sign up, and you can't sign in twice from the same IP. They also state that Tor addresses are blacklisted.
At news.ycombinator.com/item?id=38496244, the creator of viewdns.info, "Hughesey", also stated that he'd be able to give some free credits for public research projects such as this one. This would have saved us going to quite a few cafes to get those sweet extra IPs! But it was more fun in hard mode, no doubt.
They also normalize dots in gmail addresses, so you need more diverse email accounts. But they haven't covered the .gmail vs .googlemail trick.
We query IP ranges through the API with this simple helper: cia-2010-covert-communication-websites/viewdns-info.sh, usage:
./viewdns-info.sh <apikey> <start-ipv4-address> <end-ipv4-address>
e.g.:
./viewdns-info.sh 8b890b00b17ed2d66bbed878d51200b58d43d014 66.45.179.187 66.45.179.210
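The internals of viewdns-info.sh are not shown here, so the following is only a sketch of how an inclusive start/end IPv4 range like the one above could be enumerated before spending queries on it. `ip_range` is a hypothetical helper name:

```python
import ipaddress


def ip_range(start, end):
    """Return every IPv4 address from start to end, both inclusive."""
    first = int(ipaddress.IPv4Address(start))
    last = int(ipaddress.IPv4Address(end))
    if last < first:
        raise ValueError("end address is before start address")
    return [str(ipaddress.IPv4Address(n)) for n in range(first, last + 1)]


ips = ip_range("66.45.179.187", "66.45.179.210")
# Worth checking against the roughly 250 queries available per free account.
print(len(ips), ips[0], ips[-1])
```

Counting the addresses up front makes it easier to plan how many accounts a given range will cost.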
For domain to IP queries from the API you should use "iphistory" viewdns.info/api/docs/ip-history.php:
curl "https://api.viewdns.info/iphistory/?domain=todaysengineering.com&apikey=$APIKEY&output=json"
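The same request URL can be built from a script. This is a sketch that reuses only the endpoint and the three query parameters from the curl command; `iphistory_url` is a hypothetical helper, and the response format is deliberately not parsed since it is not documented here:

```python
import os
from urllib.parse import urlencode

API_BASE = "https://api.viewdns.info/iphistory/"


def iphistory_url(domain, apikey):
    """Build the iphistory request URL for one domain."""
    query = urlencode({"domain": domain, "apikey": apikey, "output": "json"})
    return f"{API_BASE}?{query}"


if __name__ == "__main__":
    # APIKEY is read from the environment so it never ends up in the source.
    print(iphistory_url("todaysengineering.com", os.environ.get("APIKEY", "")))
```

`urlencode` also escapes any unusual characters in the domain or key, which a hand-written query string would not.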
Very curiously, their reverse IP search appears to be somewhat broken, or not to be historic, e.g.
We've contacted viewdns.info support and they replied:
The reverse IP tool will only show a domain if that is it's current IP address.
This is likely not quite accurate: more precisely, it likely only works if that was the domain's last known IP address, not necessarily a current one.
DNS Census 2013 by Ciro Santilli
Main article: DNS Census 2013.
This data source was very valuable, and led to many hits, and to finding the first non-Reuters ranges with Section "secure subdomain search on 2013 DNS Census".
Hit overlap:
jq -r '.[].host' ../media/cia-2010-covert-communication-websites/hits.json | xargs -I{} sqlite3 aiddcu.sqlite "select * from t where d = '{}'"
Domain hit count when we were at 279 hits: 142 hits, so about half of the hits were present.
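A sketch of the same overlap check in Python, assuming only what the command above shows (a table `t` with a domain column `d`). The in-memory database and its rows are made-up stand-ins for aiddcu.sqlite, and `hosts_in_census` is a hypothetical helper:

```python
import sqlite3


def hosts_in_census(hosts, conn):
    """Return the hosts that have at least one row in table t (column d = domain)."""
    found = []
    for host in hosts:
        row = conn.execute("select 1 from t where d = ? limit 1", (host,)).fetchone()
        if row is not None:
            found.append(host)
    return found


# Toy in-memory stand-in for aiddcu.sqlite; the rows below are made up.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (d text)")
conn.executemany("insert into t values (?)", [("example.com",), ("example.org",)])

present = hosts_in_census(["example.com", "example.net"], conn)
print(present)

# The ratio quoted above: 142 of 279 hits.
print(round(142 / 279, 2))
```

Using a bound parameter instead of interpolating the host into the SQL string avoids quoting problems that the xargs version would hit on unusual hostnames.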
The timing of the database is perfect for this project, it is as if the CIA had planted it themselves!
dnshistory.org by Ciro Santilli
dnshistory.org contains historical domain -> IP mappings.
We have not managed to extract much from this source, as they don't have as much data on the range of interest.
But they do have some unique data at least, perhaps we should try them a bit more often, e.g. they were the only source we've seen so far that made the association: headlines2day.com -> 212.209.74.126 which places it in the more plausible globalbaseballnews.com IP range.
TODO can it do IP to domain? Or just domain to IP? Asked on their Discord: discord.com/channels/698151879166918727/968586102493552731/1124254204257632377. Their banner suggests that yes:
With our new look website you can now find other domains hosted on the same IP address, your website neighbours and more even quicker than before.
Owner replied, you can't:
At the moment you can only do this for current not historical records
This is a shame, reverse IP here could be quite valuable.
In principle, we could obtain this data from search engines, but Google doesn't track that entire website well, e.g. no hits for site:dnshistory.org "62.22.60.48" presumably due to heavy IP throttling.
The homepage dnshistory.org/ gives a start date of 2009:
Here at DNS History we have been crawling DNS records since 2009, our database currently contains over 1 billion domains and over 12 billion DNS records.
and it is true that they do have some hits from that useful era.
Any data that we have the patience to extract from this we will dump under github.com/cirosantilli/media/blob/master/cia-2010-covert-communication-websites/hits.json.
securitytrails.com by Ciro Santilli
They appear to piece together data from various sources. As a result, they have a very complete domain -> IP history.
TODO reverse IP? The fact that they don't seem to have it suggests that they are just making historical reverse IP requests to a third party via some API.
Account creation blacklists common email providers such as gmail to force users to use a "corporate" email address. But using random domains like ciro@cirosantilli.com works fine.
Their data seems to date back to 2008 for our searches.
Common Crawl by Ciro Santilli
So far, no new domains have been found with Common Crawl, nor have any existing known domains been found to be present in Common Crawl. Our working theory is that Common Crawl never reached the domains, see also: How did Alexa find the domains?
Let's try and do something with Common Crawl.
Unfortunately there's no IP data apparently: github.com/commoncrawl/cc-index-table/issues/30, so let's focus on the URLs.
Hello world:
select * from "ccindex"."ccindex" limit 100;
Data scanned: 11.75 MB
Sample first output line:
#                            2
url_surtkey                  org,whwheelers)/robots.txt
url                          https://whwheelers.org/robots.txt
url_host_name                whwheelers.org
url_host_tld                 org
url_host_2nd_last_part       whwheelers
url_host_3rd_last_part
url_host_4th_last_part
url_host_5th_last_part
url_host_registry_suffix     org
url_host_registered_domain   whwheelers.org
url_host_private_suffix      org
url_host_private_domain      whwheelers.org
url_host_name_reversed
url_protocol                 https
url_port
url_path                     /robots.txt
url_query
fetch_time                   2021-06-22 16:36:50.000
fetch_status                 301
fetch_redirect               https://www.whwheelers.org/robots.txt
content_digest               3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
content_mime_type            text/html
content_mime_detected        text/html
content_charset
content_languages
content_truncated
warc_filename                crawl-data/CC-MAIN-2021-25/segments/1623488519183.85/robotstxt/CC-MAIN-20210622155328-20210622185328-00312.warc.gz
warc_record_offset           1854030
warc_record_length           639
warc_segment                 1623488519183.85
crawl                        CC-MAIN-2021-25
subset                       robotstxt
So url_host_3rd_last_part might be a winner for CGI comms fingerprinting!
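To make the `url_host_*_last_part` columns concrete, here is a sketch that splits a hostname into labels counted from the right. It assumes these columns are plain label splits, which matches the sample row above but is not confirmed by the schema documentation here; `host_parts_from_end` and the `secure.example.com` hostname are hypothetical:

```python
def host_parts_from_end(host, n=5):
    """Return the last n dot-separated labels of host, rightmost first.

    Missing labels are returned as empty strings, like the empty columns
    in the sample row.
    """
    labels = host.lower().rstrip(".").split(".")
    labels.reverse()
    return [labels[i] if i < len(labels) else "" for i in range(n)]


# Matches the sample row: tld, 2nd last part, then empty columns.
print(host_parts_from_end("whwheelers.org"))

# For a hostname with a subdomain, index 2 is the "3rd last part".
print(host_parts_from_end("secure.example.com")[2])
```

Under that assumption, filtering on the third-from-last label would select subdomains without having to pattern match on the full URL.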
Naive one for one index:
select * from "ccindex"."ccindex" where url_host_registered_domain = 'conquermstoday.com' limit 100;
has no results... Data scanned: 5.73 GB
Let's see if they have any of the domain hits. Let's also restrict by date to try and reduce the data scanned:
select * from "ccindex"."ccindex" where
  fetch_time < TIMESTAMP '2014-01-01 00:00:00' AND
  url_host_registered_domain IN (
   'activegaminginfo.com',
   'altworldnews.com',
   ...
   'topbillingsite.com',
   'worldwildlifeadventure.com'
 )
Humm, data scanned: 60.59 GB and no hits... weird.
Sanity check:
select * from "ccindex"."ccindex" WHERE
  crawl = 'CC-MAIN-2013-20' AND
  subset = 'warc' AND
  url_host_registered_domain IN (
   'google.com',
   'amazon.com'
 )
has a bunch of hits of course. Also, data scanned: 212.88 MB, so restricting crawl and subset in the WHERE is a must! Should have read the article first.
Let's widen a bit more:
select * from "ccindex"."ccindex" WHERE
  crawl IN (
    'CC-MAIN-2013-20',
    'CC-MAIN-2013-48',
    'CC-MAIN-2014-10'
  ) AND
  subset = 'warc' AND
  url_host_registered_domain IN (
    'activegaminginfo.com',
    'altworldnews.com',
    ...
    'worldnewsandent.com',
    'worldwildlifeadventure.com'
 )
Still nothing found... they don't seem to have any of the URLs of interest?
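Since forgetting the crawl and subset restriction cost about 60 GB of scanning above, a small query builder that refuses to run without them may help. This is only a sketch: `ccindex_query` and `sql_list` are hypothetical helpers, and the guess that the savings come from crawl and subset being partition columns is an assumption based on the data-scanned numbers:

```python
def sql_list(values):
    """Render Python strings as a quoted SQL list, escaping single quotes."""
    return ", ".join("'" + v.replace("'", "''") + "'" for v in values)


def ccindex_query(domains, crawls, subset="warc"):
    """Build a ccindex query that always restricts crawl and subset."""
    if not crawls:
        raise ValueError("at least one crawl is required to limit data scanned")
    if not domains:
        raise ValueError("at least one domain is required")
    return (
        'select * from "ccindex"."ccindex" where\n'
        f"  crawl in ({sql_list(crawls)}) and\n"
        f"  subset = {sql_list([subset])} and\n"
        f"  url_host_registered_domain in ({sql_list(domains)})"
    )


print(ccindex_query(
    ["activegaminginfo.com", "altworldnews.com"],
    ["CC-MAIN-2013-20", "CC-MAIN-2013-48", "CC-MAIN-2014-10"],
))
```

The printed SQL can then be pasted into Athena as in the queries above.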
Internet Census 2012 by Ciro Santilli
Does not appear to have any reverse IP hits unfortunately: opendata.stackexchange.com/questions/1951/dataset-of-domain-names/21077#21077. Likely only has domains that were explicitly advertised.
We could not find anything useful in it so far, but there is great potential to use this tool to find new IP ranges based on properties of existing IP ranges. Part of the problem is that the dataset is huge, and is split by the top byte of the IP address into 256 parts. But it would be reasonable to at least explore ranges with pre-existing known hits...
We have started looking for patterns on 66.* and 208.*, both selected as two relatively far away ranges that have a number of pre-existing hits. 208 should likely have been 212 considering later finds that put several ranges in 212.
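A sketch of how known hit IPs could be grouped by top byte to decide which parts of the dataset are worth exploring first. `group_by_top_byte` is a hypothetical helper, and the three sample IPs are ones mentioned elsewhere on this page:

```python
from collections import defaultdict


def group_by_top_byte(ips):
    """Group dotted IPv4 strings by their first octet."""
    groups = defaultdict(list)
    for ip in ips:
        groups[int(ip.split(".")[0])].append(ip)
    return dict(groups)


known_hits = ["66.104.175.41", "66.175.106.150", "212.209.74.126"]
groups = group_by_top_byte(known_hits)
for top_byte, members in sorted(groups.items()):
    print(top_byte, len(members), members)
```

Run over the full hit list, the counts per top byte would show directly whether 208 or 212 is the better second candidate.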
tcpip_fp:
  • 66.104.
    • 66.104.175.41: grubbersworldrugbynews.com: 1346397300 SCAN(V=6.01%E=4%D=1/12%OT=22%CT=443%CU=%PV=N%G=N%TM=387CAB9E%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=N),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
    • 66.104.175.48: worlddispatch.net: 1346816700 SCAN(V=6.01%E=4%D=1/2%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=1D5EA%P=mipsel-openwrt-linux-gnu),SEQ(SP=F8%GCD=3%ISR=109%TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
    • 66.104.175.49: webworldsports.com: 1346692500 SCAN(V=6.01%E=4%D=9/3%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=5044E96E%P=mipsel-openwrt-linux-gnu),SEQ(SP=105%GCD=1%ISR=108%TI=Z%TS=A),OPS(O1=M550ST11NW6%O2=M550ST11NW6%O3=M550NNT11NW6%O4=M550ST11NW6%O5=M550ST11NW6%O6=M550ST11),WIN(W1=1510%W2=1510%W3=1510%W4=1510%W5=1510%W6=1510),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
    • 66.104.175.50: fly-bybirdies.com: 1346822100 SCAN(V=6.01%E=4%D=1/1%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=14655%P=mipsel-openwrt-linux-gnu),SEQ(TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
    • 66.104.175.53: info-ology.net: 1346712300 SCAN(V=6.01%E=4%D=9/4%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=50453230%P=mipsel-openwrt-linux-gnu),SEQ(SP=FB%GCD=1%ISR=FF%TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
  • 66.175.106
    • 66.175.106.150: noticiasmusica.net: 1340077500 SCAN(V=5.51%D=1/3%OT=22%CT=443%CU=%PV=N%G=N%TM=38707542%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
    • 66.175.106.155: atomworldnews.com: 1345562100 SCAN(V=5.51%D=8/21%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=5033A5F2%P=mips-openwrt-linux-gnu),SEQ(SP=FB%GCD=1%ISR=FC%TI=Z%TS=A),ECN(R=Y%DF=Y%TG=40%W=1540%O=M550NNSNW6%CC=N%Q=),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
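To compare these fingerprints systematically rather than by eye, they can be split into their named tests and key=value fields. This is a sketch based only on the visible syntax of the lines above (`NAME(k=v%k=v...)` groups separated by commas); `parse_fingerprint` is a hypothetical helper, and tests are kept in a list because some names such as `T1` appear twice:

```python
import re


def parse_fingerprint(fp):
    """Split a fingerprint string into a list of (test_name, fields) pairs."""
    tests = []
    for name, body in re.findall(r"(\w+)\(([^)]*)\)", fp):
        fields = {}
        for item in body.split("%"):
            if item:
                key, _, value = item.partition("=")
                fields[key] = value
        tests.append((name, fields))
    return tests


# Fingerprint of 66.104.175.41 (grubbersworldrugbynews.com) from the list above.
fp = (
    "SCAN(V=6.01%E=4%D=1/12%OT=22%CT=443%CU=%PV=N%G=N%TM=387CAB9E"
    "%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),"
    "T4(R=N),T5(R=N),T6(R=N),T7(R=N),U1(R=N),IE(R=N)"
)
parsed = parse_fingerprint(fp)
scan = parsed[0][1]
print(scan["OT"], scan["CT"], scan["P"])
```

With the fields in dictionaries, it becomes straightforward to search other ranges for rows sharing, for example, the same `OT` and `CT` values as the known hits.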
tb0hdan/domains by Ciro Santilli
Domain list only, no IPs and no dates. We haven't been able to extract anything of interest from this source so far.
Domain hit count when we were at 69 hits: only 9, some of which had since been reused. Likely their data collection did not cover the dates of interest.
iraniangoals.com by Ciro Santilli
whoisxmlapi WHOIS history April 11, 2011:
  • Created Date: March 6, 2008 00:00:00 UTC
  • Updated Date: March 7, 2011 00:00:00 UTC
  • Expires Date: March 6, 2014 00:00:00 UTC
  • Registrant Name: domainsbyproxy.com.
  • Registrant Organization: Domains by Proxy, Inc.
  • Registrant Street: 15111 N. Hayden Rd., Ste 160,
  • Registrant City: Scottsdale
  • Registrant State/Province: Arizona
  • Registrant Postal Code: 85260
  • Registrant Country: UNITED STATES
  • Name servers: NS29.WORLDNIC.COM|NS30.WORLDNIC.COM
Followed by Reuters registration in 2022.
whoisrequest.com/history/ mentions:
  • 1 Apr, 2008: Domain created*, nameservers added. Nameservers:
  • ns1.webhostingpad.com
  • ns2.webhostingpad.com
iraniangoalkicks.com by Ciro Santilli
whoisxmlapi WHOIS history March 23, 2011:
  • Created Date: April 9, 2007 00:00:00 UTC
  • Updated Date: March 2, 2011 00:00:00 UTC
  • Expires Date: April 9, 2011 00:00:00 UTC
  • Registrant Name: domainsbyproxy.com
  • Name servers: dns1.registrar-servers.com|dns2.registrar-servers.com
whoisrequest.com/history/ mentions:
1 May, 2007: Domain created*, nameservers added. Nameservers:
  • ns1.qwknetllc.com
  • ns2.qwknetllc.com
activegameinfo.com by Ciro Santilli
whoisxmlapi WHOIS history March 22, 2011:
  • Registrar Name: NETWORK SOLUTIONS, LLC.
  • Created Date: January 26, 2010 00:00:00 UTC
  • Updated Date: November 27, 2010 00:00:00 UTC
  • Expires Date: January 26, 2012 00:00:00 UTC
  • Registrant Name: Corral, Elizabeth|ATTN ACTIVEGAMINGINFO.COM|care of Network Solutions
  • Registrant Street: PO Box 459
  • Registrant City: PA
  • Registrant State/Province: US
  • Registrant Postal Code: 18222
  • Registrant Country: UNITED STATES
  • Administrative Name: Corral, Elizabeth|ATTN ACTIVEGAMINGINFO.COM|care of Network Solutions
  • Administrative Street: PO Box 459
  • Administrative City: Drums
  • Administrative State/Province: PA
  • Administrative Postal Code: 18222
  • Administrative Country: UNITED STATES
  • Administrative Email: xc2mv7ur8cw@networksolutionsprivateregistration.com
  • Administrative Phone: 5707088780
  • Name servers: NS23.DOMAINCONTROL.COM|NS24.DOMAINCONTROL.COM
feedsdemexicoyelmundo.com by Ciro Santilli
whoisxmlapi WHOIS record on April 28, 2011:
  • Registrar Name: GODADDY.COM, INC
  • Created Date: February 9, 2010 00:00:00 UTC
  • Updated Date: February 9, 2010 00:00:00 UTC
  • Expires Date: February 9, 2015 00:00:00 UTC
  • Registrant Name: domainsbyproxy.com
  • Name servers: NS55.DOMAINCONTROL.COM|NS56.DOMAINCONTROL.COM
noticiasmusica.net by Ciro Santilli
whoisxmlapi WHOIS record on September 13, 2011:
  • Registrar Name: NETWORK SOLUTIONS, LLC
  • Created Date: February 17, 2010 00:00:00 UTC
  • Updated Date: February 17, 2010 00:00:00 UTC
  • Expires Date: February 17, 2015 00:00:00 UTC
  • Registrant Name: See, Megan|ATTN NOTICIASMUSICA.NET|care of Network Solutions
  • Registrant Street: PO Box 459
  • Registrant City: PA
  • Registrant State/Province: US
  • Registrant Postal Code: 18222
  • Registrant Country: UNITED STATES
  • Administrative Contact
  • Administrative Name: See, Megan|ATTN NOTICIASMUSICA.NET|care of Network Solutions
  • Administrative Street: PO Box 459
  • Administrative City: Drums
  • Administrative State/Province: PA
  • Administrative Postal Code: 18222
  • Administrative Country: UNITED STATES
  • Administrative Email: hf3eg77c4nn@networksolutionsprivateregistration.com
  • Administrative Phone: 5707088780
  • Name Servers: NS45.WORLDNIC.COM|NS46.WORLDNIC.COM
2012:
  • Registrant Country: PANAMA
atomworldnews.com by Ciro Santilli
whoisxmlapi WHOIS record on April 17, 2011:
  • Created Date: April 9, 2010 00:00:00 UTC
  • Updated Date: April 9, 2010 00:00:00 UTC
  • Expires Date: April 9, 2012 00:00:00 UTC
  • Registrant Name: domainsbyproxy.com
  • Name servers: NS33.DOMAINCONTROL.COM|NS34.DOMAINCONTROL.COM
