CIA 2010 covert communication websites / 2013 DNS census MX records
Let's see if there's anything in records/mx.xz.
mx.csv is 21 GB.
They do use " in the files to escape commas, so:
mx.py
import csv
import sys

# Keep only the domain (column 0) and the MX host (column 3),
# re-emitting them as CSV on stdout.
writer = csv.writer(sys.stdout)
with open('mx.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        writer.writerow([row[0], row[3]])
Would have been better with csvkit: stackoverflow.com/questions/36287982/bash-parse-csv-with-quotes-commas-and-newlines
then:
# uniq not amazing as there are often two or three slightly different records repeated on multiple timestamps, but down to 11 GB
python3 mx.py | uniq > mx-uniq.csv
sqlite3 mx.sqlite 'create table t(d text, m text)'
# 13 GB
time sqlite3 mx.sqlite ".import --csv --skip 1 'mx-uniq.csv' t"

# 41 GB
time sqlite3 mx.sqlite 'create index td on t(d)'
time sqlite3 mx.sqlite 'create index tm on t(m)'
time sqlite3 mx.sqlite 'create index tdm on t(d, m)'

# Remove dupes.
# Rows: 150m
time sqlite3 mx.sqlite <<EOF
delete from t
where rowid not in (
  select min(rowid)
  from t
  group by d, m
)
EOF

# 15 GB
time sqlite3 mx.sqlite vacuum
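As a quick sanity check that the import and dedup worked, here is a minimal Python sketch using only the stdlib sqlite3 module; the table and column names are the ones created above:
import sqlite3

# Open the deduplicated MX database built above.
conn = sqlite3.connect('mx.sqlite')
# How many (domain, MX host) pairs survived the dedup.
print(conn.execute('select count(*) from t').fetchone()[0])
# Eyeball a few rows to confirm the (d, m) format.
for d, m in conn.execute('select d, m from t limit 5'):
    print(d, m)
conn.close()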
Let's see what the hits use:
awk -F, 'NR>1{ print $2 }' ../media/cia-2010-covert-communication-websites/hits.csv | xargs -I{} sqlite3 mx.sqlite "select distinct * from t where d = '{}'"
At around 267 total hits, only 84 have MX records, and of those that do, almost all have exactly:
smtp.secureserver.net
mailstore1.secureserver.net
with only three exceptions:
dailynewsandsports.com|dailynewsandsports.com
inews-today.com|mail.inews-today.com
just-kidding-news.com|just-kidding-news.com
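The summary above can also be reproduced with a small Python sketch instead of the awk + xargs loop; it assumes, as in the command above, that hits.csv has the domain in its second column (adjust the hits.csv path as needed):
import csv
import sqlite3
from collections import Counter

conn = sqlite3.connect('mx.sqlite')
mx_counts = Counter()
hits_with_mx = 0
with open('hits.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header line
    for row in reader:
        domain = row[1]
        mxs = [m for (m,) in conn.execute(
            'select distinct m from t where d = ?', (domain,))]
        if mxs:
            hits_with_mx += 1
        mx_counts.update(mxs)
print('hits with MX records:', hits_with_mx)
for m, n in mx_counts.most_common():
    print(n, m)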
We need to check these counts against the totals!
sqlite3 mx.sqlite "select count(*) from t where m = 'mailstore1.secureserver.net'"
which gives ~18M, so nope, it is too much by itself...
Let's try to use that to reduce av.sqlite from 2013 DNS Census virtual host cleanup a bit further:
time sqlite3 mx.sqlite '.mode csv' "attach 'aiddcu.sqlite' as 'av'" '.load ./ip' "select ipi2s(av.t.i), av.t.d from av.t inner join t as mx on av.t.d = mx.d and mx.m = 'mailstore1.secureserver.net' order by av.t.i asc" > avm.csv
where avm stands for av with MX pruning. This leaves us with only ~500k entries. With one more fingerprint we could do Wayback Machine CDX scanning.
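For reference, the same join can be written without the custom ./ip extension as a short Python sketch; it assumes aiddcu.sqlite has a table t with an integer IP column i and a domain column d, as implied by the query above:
import csv
import ipaddress
import sqlite3
import sys

conn = sqlite3.connect('aiddcu.sqlite')
conn.execute("attach 'mx.sqlite' as mx")
writer = csv.writer(sys.stdout)
query = '''
    select t.i, t.d
    from t
    inner join mx.t as m2 on t.d = m2.d
    where m2.m = 'mailstore1.secureserver.net'
    order by t.i asc
'''
for i, d in conn.execute(query):
    # Same conversion as the ipi2s() helper: integer IP to dotted quad.
    writer.writerow([str(ipaddress.ip_address(i)), d])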
Let's check that we still have most of our hits in there:
grep -f <(awk -F, 'NR>1{print $2}' /home/ciro/bak/git/media/cia-2010-covert-communication-websites/hits.csv) avm.csv
At 267 hits we got 81 matches, so all of the ones with the secureserver.net MX are still present.
secureserver.net is a hosting provider; we can see their blank page e.g. at web.archive.org/web/20110128152204/http://emmano.com/. security.stackexchange.com/questions/12610/why-did-secureserver-net-godaddy-access-my-gmail-account/12616#12616 comments:
secureserver.net is the name GoDaddy use as the reverse DNS for IP addresses used for dedicated/virtual server hosting
CIA 2010 covert communication websites / 2013 DNS census NS records
ns.csv is 57 GB. This file is too massive; working with it is a pain.
We can also cut down the data a lot with stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/76605540#76605540 and TLD filtering:
awk -F, 'BEGIN{OFS=","} { if ($1 != last) { print $1, $3; last = $1; } }' ns.csv | grep -E '\.(com|net|info|org|biz),' > nsu.csv
This brings us down to a much more manageable 3.0 GB, 83 M rows.
Let's just scan it once real quick to start with, since likely nothing will come of this avenue:
grep -f <(awk -F, 'NR>1{print $2}' ../media/cia-2010-covert-communication-websites/hits.csv) nsu.csv | tee nsu-hits.csv
cat nsu-hits.csv | csvcut -c 2 | sort | awk -F. '{OFS="."; print $(NF-1), $(NF)}' | sort | uniq -c | sort -k1 -n
As of 267 hits we get:
      1 a2hosting.com
      1 amerinoc.com
      1 ayns.net
      1 dailyrazor.com
      1 domainingdepot.com
      1 easydns.com
      1 frienddns.ru
      1 hostgator.com
      1 kolmic.com
      1 name-services.com
      1 namecity.com
      1 netnames.net
      1 tonsmovies.net
      1 webmailer.de
      2 cashparking.com
     55 worldnic.com
     86 domaincontrol.com
so yeah, most of those are likely going to be humongous just by looking at the names.
The smallest one by far out of the total is frienddns.ru with only 487 hits; all others are either quite large or fake hits due to CSV grepping. Did a quick Wayback Machine CDX scanning there but no luck alas.
Let's check the smaller ones:
inews-today.com,2013-08-12T03:14:01,ns1.frienddns.ru
source-commodities.net,2012-12-13T20:58:28,ns1.namecity.com -> fake hit due to grep e-commodities.net
dailynewsandsports.com,2013-08-13T08:36:28,ns3.a2hosting.com
just-kidding-news.com,2012-02-04T07:40:50,jns3.dailyrazor.com
fightwithoutrules.com,2012-11-09T01:17:40,sk.s2.ns1.ns92.kolmic.com
fightwithoutrules.com,2013-07-01T22:46:23,ns1625.ztomy.com
half-court.net,2012-09-10T09:49:15,sk.s2.ns1.ns92.kolmic.com
half-court.net,2013-07-07T00:31:12,ns1621.ztomy.com
Doubt anything will come out of this.
Let's do a bit of counting out of the total:
grep domaincontrol.com ns.csv | awk -F, '{print $1}' | uniq | wc
gives ~20M domains using domaincontrol.com. Let's see how many domains there are in the first place:
awk -F, '{print $1}' ns.csv | uniq | wc
so it accounts for 1/4 of the total.
We intersect 2013 DNS Census virtual host cleanup with 2013 DNS census MX records and that leaves 460k hits. We did lose a third on the MX records as of 260 hits, since secureserver.net is only used in 1/3 of sites, but we also concentrate 9x, so it may be worth it.
Then we do Wayback Machine CDX scanning. It takes about 5 days, but it is manageable.
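For concreteness, a single query of that scan looks roughly like this Python sketch, which lists the archived URLs for one domain via the standard Wayback Machine CDX API and then applies the .jar / cgi-bin fingerprints client side (the domain is just an example):
import json
import re
import urllib.request

domain = 'inews-today.com'  # example domain from the hits above
url = (
    'https://web.archive.org/cdx/search/cdx'
    '?url=' + domain + '/*'
    '&output=json&fl=timestamp,original&collapse=urlkey&limit=10000'
)
with urllib.request.urlopen(url) as resp:
    body = resp.read()
rows = json.loads(body) if body.strip() else []
# With output=json the first row is the list of field names.
for timestamp, original in rows[1:]:
    if re.search(r'\.jar$|cgi-bin.*\.cgi', original, re.IGNORECASE):
        print(timestamp, original)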
We did a full Wayback Machine CDX scanning for JAR, SWF and cgi-bin in those, but only found a single new hit:
There are two keywords that are killers: "news" and "world" and their translations or closely related words. Everything else is hard. So a good start is:
grep -e news -e noticias -e nouvelles -e world -e global
iran + football:
  • iranfootballsource.com: the third hit for this area after the two given by Reuters! Epic.
3 easy hits with "noticias" ("news" in Portuguese or Spanish), uncovering two brand new IP ranges:
  • 66.45.179.205 noticiasporjanua.com
  • 66.237.236.247 comunidaddenoticias.com
  • 204.176.38.143 noticiassofisticadas.com
Let's see some French "nouvelles/actualites" for those tumultuous Maghrebis:
  • 216.97.231.56 nouvelles-d-aujourdhuis.com
news + world:
  • 210.80.75.55 philippinenewsonline.net
news + global:
  • 204.176.39.115 globalprovincesnews.com
  • 212.209.74.105 globalbaseballnews.com
  • 212.209.79.40: hydradraco.com
OK, I've decided to do a complete Wayback Machine CDX scanning of news... Searches for .JAR or https.*cgi-bin.*\.cgi are killers, particularly the .jar hits. Here's what came out:
  • 62.22.60.49 telecom-headlines.com
  • 62.22.61.206 worldnewsnetworking.com
  • 64.16.204.55 holein1news.com
  • 66.104.169.184 bcenews.com
  • 69.84.156.90 stickshiftnews.com
  • 74.116.72.236 techtopnews.com
  • 74.254.12.168 non-stop-news.net
  • 193.203.49.212 inews-today.com
  • 199.85.212.118 just-kidding-news.com
  • 207.210.250.132 aeronet-news.com
  • 212.4.18.129 sightseeingnews.com
  • 212.209.90.84 thenewseditor.com
  • 216.105.98.152 modernarabicnews.com
Wayback Machine CDX scanning of "world":
  • 66.104.173.186 myworldlymusic.com
"headline": only 140 matches in 2013-dns-census-a-novirt.csv and 3 hits out of 269 hits. Full inspection without CDX led to no new hits.
"today": only 3.5k matches in 2013-dns-census-a-novirt.csv and 12 hits out of 269 hits, TODO how many of those are in 2013-dns-census-a-novirt? No new hits.
"world", "global", "international", and Spanish/Portuguese/French versions like "mondo", "mundo", "mondi": 15k matches in 2013-dns-census-a-novirt.csv. No new hits.
CIA 2010 covert communication websites / Non .com .net TLDs
.com and .net are very dominant. Here we list other choices made:
  • .info: has a few hits:
    • archived comms:
      • beyondthefringe.info
    • unarchived comms:
      • crickettoday.info
    • unarchived:
      • talkingpointnews.info
      • theventurenews.info
      • worldconcerns.info
    Did a full Wayback Machine CDX scanning on .info after:
    grep -e news -e noticias -e nouvelles -e world -e global
    That makes about 10k domains, so it's about the right size.
  • .org: has at least one hit, see: Are there .org hits?
  • .biz:
    • unarchived comms:
      • atthemovies.biz
CIA 2010 covert communication websites / secure subdomain search on 2013 DNS Census
Grepping the 2013 DNS Census first for the overused CGI comms subdomains secure. and ssl. leaves 200k lines. Grepping those for the overused "news" led to hits (a sketch of this filter follows the list):
  • secure.worldnewsandent.com,2012-02-13T21:28:15,208.254.40.117
  • ssl.beyondnetworknews.com,2012-02-13T20:10:13,66.104.175.40
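A minimal Python version of the secure. / ssl. + "news" grep above; it assumes the census A-record lines have the subdomain.domain,timestamp,ip layout seen in the two hits just listed (the input file name is hypothetical):
import csv
import sys

writer = csv.writer(sys.stdout)
with open('census-a.csv') as f:  # hypothetical file name for the A-record dump
    for row in csv.reader(f):
        host = row[0]
        # Keep only secure./ssl. subdomains whose name also contains "news".
        if host.startswith(('secure.', 'ssl.')) and 'news' in host:
            writer.writerow(row)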
Also tried but failed:
OK, after the initial successes in secure., we went a bit more data intensive:
New results: only one...
  • 208.254.42.205 secure.driversinternationalgolf.com,2012-02-13T10:42:20,
After 2013 DNS Census virtual host cleanup heuristic keyword searches we later understood why there were so few hits here: the 2013 DNS Census didn't capture the secure. subdomains of many domains it had for some reason. Shame, because if it had, this method would have yielded many more results.
CIA 2010 covert communication websites / Wayback Machine
D'oh.
But to be serious: the Wayback Machine contains a very large proportion of all sites. It does happen sometimes that a Wayback Machine archive is missing or broken and cqcounter has the screenshot. But the Wayback Machine is still the most complete database we have found so far. Some archives are very broken. But those are rare.
The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.
The Common Crawl project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.
CDX + 2013 DNS Census + heuristics has however been fruitful.
We have dumped all Wayback Machine archives of known websites to: github.com/cirosantilli/cia-2010-websites-dump using cia-2010-covert-communication-websites/download-websites.sh. This allows for better grepping and serves as a backup in case they ever go down.