This is a dark art, and many of the sources are shady as fuck! We often have no idea of their methodology. Also, no source is fully complete. We just piece things together as best we can.
www.reversedns.ch/en/ has some OK reverse IPs, but you have to do them one by one with CAPTCHA, and we were already past that point when that source was found, so nothing new has been found on it yet.
This is our primary data source, the first article that pointed out a few specific CIA websites which then served as the basis for all of our research.
We take the truth of this article as an axiom. All we then claim is that all other websites found were made by the same people, due to the strong shared design principles of such websites.
But to be serious. The Wayback Machine contains a very large proportion of all sites. It is the most complete database we have found so far. Some archives are very broken. But those are rare.
The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.
The Common Crawl project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.
CDX + 2013 DNS Census + heuristics has been fruitful, however.
This allows us to filter down tens of thousands of possible domains in a few hours. But hundreds of thousands would be too much. This is because you have to query exactly one URL at a time, and they possibly rate limit IPs. But no IP blacklisting so far after several hours, so it's not that bad.
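To make the "one query per domain" constraint concrete, here is a minimal Python sketch that builds a CDX query URL for a single domain. The endpoint and the `url`, `matchType`, `filter`, `to`, `fl` and `limit` parameters are taken from the public CDX server API as we understand it. The function name, the default filter regex and the default values are assumptions made up for illustration, and are not necessarily what our helper scripts send.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_url(domain, url_regex=r".*\.jar$", before="2014", limit=50):
    """Build the CDX query URL for one domain (hypothetical helper).

    url_regex, before and limit are illustrative defaults only.
    """
    params = [
        ("url", domain),
        ("matchType", "domain"),  # also match subdomains of the domain
        ("filter", "original:" + url_regex),  # regex on the archived URL
        ("to", before),  # only captures up to this timestamp prefix
        ("fl", "urlkey,timestamp,original"),  # fields to return
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # One URL per domain: there is no cross-domain query to batch these into.
    for d in ["example.com", "example.net"]:
        print(cdx_url(d))
```

Since each domain costs one HTTP request, the total scan time grows linearly with the length of the candidate list, which is why the heuristics that shrink that list matter so much.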
Once you have a heuristic to narrow down some domains, you can use this helper: cia-2010-covert-communication-websites/cdx.sh to drill them down from 10s of thousands down to hundreds or thousands.
and then use it on a newline-separated domain name list to check:
./cdx-tor.sh infile.txt
This creates a directory infile.txt.cdx/ containing:
infile.txt.cdx/out00, out01, etc.: the suspected CDX lines from domains from each tor instance based on the simple criteria that the CDX can handle directly. We split the input domains into 100 piles, and give one selected pile per tor instance.
infile.txt.cdx/out: the final combined CDX output of out00, out01, ...
infile.txt.cdx/out.post: the final output containing only domain names that match further CLI criteria that cannot be easily encoded on the CDX query. This is the cleanest domain name list you should look into at the end basically.
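The pile splitting described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name and the round-robin dealing are our assumptions, and the actual script may well split the list differently (e.g. into contiguous chunks).

```python
def split_into_piles(domains, n_piles=100):
    """Deal domains round-robin into at most n_piles lists.

    Hypothetical helper: each returned pile would be handed to one
    tor instance. Empty piles are dropped when there are few domains.
    """
    piles = [[] for _ in range(n_piles)]
    for idx, domain in enumerate(domains):
        piles[idx % n_piles].append(domain)
    return [pile for pile in piles if pile]


if __name__ == "__main__":
    domains = ["d%d.example" % i for i in range(250)]
    piles = split_into_piles(domains)
    print(len(piles), "piles, sizes from", len(piles[-1]), "to", len(piles[0]))
```

Round-robin keeps the piles within one domain of each other in size, so no single tor instance ends up with a much longer queue than the others.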
Since archive.org is so abysmal in its data access (something like a Google BigQuery interface would solve our issues in seconds), we have to come up with creative ways of getting around their IP throttling.
The CIA doesn't play fair. They're actually the exact opposite of fair. So neither shall we.
This should allow a full sweep of the 4.5M records in 2013 DNS Census virtual host cleanup in a reasonable amount of time. After JAR/SWF/CGI filtering we obtained 5.8k domains, so a reduction factor of roughly 800 (4.5M / 5.8k) with likely very few losses. Not bad.
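The JAR/SWF/CGI filtering boils down to keeping only domains that have at least one archived URL matching a few patterns. Here is a minimal Python sketch of that idea. The `.jar` and `https.*cgi-bin.*\.cgi` patterns are the ones these notes describe as killers; the `.swf` pattern, the function names and the choice to ignore query strings for the extension checks are assumptions for illustration.

```python
import re

JAR_RE = re.compile(r"\.jar$", re.IGNORECASE)
SWF_RE = re.compile(r"\.swf$", re.IGNORECASE)  # assumed pattern
CGI_RE = re.compile(r"https.*cgi-bin.*\.cgi")


def looks_like_comms(url):
    """True if an archived URL matches one of the comms fingerprints."""
    path = url.split("?", 1)[0]  # assumption: ignore the query string
    return bool(
        JAR_RE.search(path) or SWF_RE.search(path) or CGI_RE.search(url)
    )


def filter_domains(urls_by_domain):
    """Keep domains that have at least one matching archived URL."""
    return sorted(
        domain
        for domain, urls in urls_by_domain.items()
        if any(looks_like_comms(u) for u in urls)
    )


if __name__ == "__main__":
    sample = {
        "a.example": ["http://a.example/", "http://a.example/applet/News.JAR"],
        "b.example": ["http://b.example/index.html"],
    }
    print(filter_domains(sample))
```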
5.8k is still a bit annoying to fully go over however, so we can also try to count CDX hits to the domains and remove anything with too many hits, since the CIA websites basically have very few archives:
cd 2013-dns-census-a-novirt-domains.txt.cdx
./cdx-tor.sh -d out.post domain-list.txt
cd out.post.cdx
cut -d' ' -f1 out | uniq -c | sort -k1 -n | awk 'match($2, /([^,]+),([^)]+)/, a) {printf("%s.%s %d\n", a[2], a[1], $1)}' > out.count
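For readers who find the awk one-liner hard to parse, here is a rough Python sketch of the same counting step. It assumes each CDX line starts with a SURT-style urlkey such as `com,example)/path` followed by a space. The function names are hypothetical, and unlike the one-liner it reverses every host label, so subdomains come out in normal order.

```python
from collections import Counter


def surt_to_domain(urlkey):
    """Turn 'com,example,www)/index.html' into 'www.example.com'."""
    host = urlkey.split(")", 1)[0]
    return ".".join(reversed(host.split(",")))


def count_hits(cdx_lines):
    """Count CDX lines per host, rarest first."""
    counts = Counter()
    for line in cdx_lines:
        line = line.strip()
        if line:
            counts[surt_to_domain(line.split(" ", 1)[0])] += 1
    # Sort by ascending count so that rarely archived domains come first.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


if __name__ == "__main__":
    lines = [
        "com,example)/ 20110101000000 http://example.com/",
        "com,example)/a.jar 20110102000000 http://example.com/a.jar",
        "net,rare-site)/ 20110103000000 http://rare-site.net/",
    ]
    for domain, n in count_hits(lines):
        print(domain, n)
```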
Many hits appear to happen on the same days, and per-day data does exist: archive.org/details/widecrawl but apparently cannot be publicly downloaded unfortunately. But maybe there's another way? TODO select candidates.
Their historic DNS and reverse DNS info was very valuable, and served as Ciro's initial entry point to finding hits in the IP ranges given by Reuters.
Their data is also quite disjoint from the data of the 2013 DNS Census. There is some overlap, but clearly their methodology is very different. Sometimes they slot into one another almost perfectly.
You can only get about 250 queries on the web interface, then 250 queries per free account via API.
Since this source is so scarce and valuable, we have been quite careful to note down all the domain and IP ranges that have been explored.
They check your IP when you sign up, and you can't sign in twice from the same IP. They also state that Tor addresses are blacklisted.
At news.ycombinator.com/item?id=38496244, the creator of viewdns.info, "Hughesey", also stated that he'd be able to give some free credits for public research projects such as this one. This would have saved us going to quite a few cafes to get those sweet extra IPs! But it was more fun in hard mode, no doubt.
They also normalize dots in gmail addresses, so you need more diverse email accounts. But they haven't covered the .gmail vs .googlemail trick.
We've noticed that often when there is a hit range:
there is only one domain for each IP
there is a range of about 20-30 of those
and that this does not seem to be that common. Let's see if that is a reasonable fingerprint or not.
Note that although this is the most common case, we have found multiple hits that viewdns.info maps to the same IP.
First we create a table u (unique) that only has domains which are the only domain for an IP. Let's see by how much that lowers the 191 M total unique domains:
time sqlite3 u.sqlite 'create table t (d text, i text)'
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where d not like '%.%.%' group by i having count(distinct d) = 1"
The not like '%.%.%' removes subdomains from the counts so that CGI comms are still included, and the distinct in count(distinct d) is there because we have multiple entries at different timestamps for some of the hits.
Let's start with the 208 subset to see how it goes:
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where i glob '208.*' and d not like '%.%.%' and (d like '%.com' or d like '%.net') group by i having count(distinct d) = 1"
OK, after we fixed bugs with the above, we are down to 4 million lines with unique domain/IP pairs, which contain all of the original hits! Almost certainly more are to be found!
The numbers of the first column are the IPs as a 32-bit integer representation, which is more useful to search for ranges in.
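As a small illustration of the 32-bit integer representation, here is a Python sketch using only the standard `ipaddress` module. The helper names are ours. The first byte is just the integer divided by 2**24, which is the same division the histogram binning relies on.

```python
import ipaddress


def ip_to_int(ip):
    """Dotted quad string to 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(n):
    """32-bit integer back to dotted quad string."""
    return str(ipaddress.IPv4Address(n))


def first_byte(n):
    """Top byte of the address, i.e. n // 2**24."""
    return n >> 24


if __name__ == "__main__":
    lo = ip_to_int("208.254.40.96")
    hi = ip_to_int("208.254.40.127")
    # Neighbouring addresses are neighbouring integers, so range checks
    # become plain integer comparisons.
    print(lo, hi, hi - lo, first_byte(lo))
```

With integers, "is this IP inside that range" is a single `lo <= x <= hi` comparison, and sorting by the integer sorts the addresses numerically rather than as strings.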
To make a histogram with the distribution of the single hostname IPs:
#!/usr/bin/env bash
bin=$((2**24))
sqlite3 2013-dns-census-a-novirt.sqlite -cmd '.mode csv' >2013-dns-census-a-novirt-hist.csv <<EOF
select i, sum(cnt) from (
select floor(i/${bin}) as i,
count(*) as cnt
from t
group by 1
union
select *, 0 as cnt from generate_series(0, 255)
)
group by i
EOF
gnuplot \
-e 'set terminal svg size 1200, 800' \
-e 'set output "2013-dns-census-a-novirt-hist.svg"' \
-e 'set datafile separator ","' \
-e 'set tics scale 0' \
-e 'unset key' \
-e 'set xrange[0:255]' \
-e 'set title "Counts of IPs with a single hostname"' \
-e 'set xlabel "IPv4 first byte"' \
-e 'set ylabel "count"' \
-e 'plot "2013-dns-census-a-novirt-hist.csv" using 1:2:1 with labels' \
;
Which gives the following useless noise, there is basically no pattern:
There are two keywords that are killers: "news" and "world" and their translations or closely related words. Everything else is hard. So a good start is:
grep -e news -e noticias -e nouvelles -e world -e global
iran + football:
iranfootballsource.com: the third hit for this area after the two given by Reuters! Epic.
3 easy hits with "noticias" ("news" in Portuguese or Spanish), uncovering two brand new IP ranges:
66.45.179.205 noticiasporjanua.com
66.237.236.247 comunidaddenoticias.com
204.176.38.143 noticiassofisticadas.com
Let's see some French "nouvelles/actualites" for those tumultuous Maghrebis:
216.97.231.56 nouvelles-d-aujourdhuis.com
news + world:
210.80.75.55 philippinenewsonline.net
news + global:
204.176.39.115 globalprovincesnews.com
212.209.74.105 globalbaseballnews.com
212.209.79.40 hydradraco.com
OK, I've decided to do a complete Wayback Machine CDX scanning of news... Searching for .JAR or https.*cgi-bin.*\.cgi are killers, particularly the .jar hits, here's what came out:
"headline": only 140 matches in 2013-dns-census-a-novirt.csv and 3 hits out of 269 hits. Full inspection without CDX led to no new hits.
"today": only 3.5k matches in 2013-dns-census-a-novirt.csv and 12 hits out of 269 hits, TODO how many of those are on 2013-dns-census-a-novirt? No new hits.
"world", "global", "international", and Spanish/Portuguese/French versions like "mondo", "mundo", "mondi": 15k matches in 2013-dns-census-a-novirt.csv. No new hits.
# uniq not amazing as there are often two or three slightly different records repeated on multiple timestamps, but down to 11 GB
python3 mx.py | uniq > mx-uniq.csv
sqlite3 mx.sqlite 'create table t(d text, m text)'
# 13 GB
time sqlite3 mx.sqlite ".import --csv --skip 1 'mx-uniq.csv' t"
# 41 GB
time sqlite3 mx.sqlite 'create index td on t(d)'
time sqlite3 mx.sqlite 'create index tm on t(m)'
time sqlite3 mx.sqlite 'create index tdm on t(d, m)'
# Remove dupes.
# Rows: 150m
time sqlite3 mx.sqlite <<EOF
delete from t
where rowid not in (
select min(rowid)
from t
group by d, m
)
EOF
# 15 GB
time sqlite3 mx.sqlite vacuum
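To see what the duplicate removal step does on a small scale, here is a toy Python sketch that runs the same delete statement against an in-memory SQLite table. The table and column names (`t`, `d`, `m`) follow the commands above; the function name and the sample rows are made up for illustration.

```python
import sqlite3


def dedupe_pairs(rows):
    """Load (domain, mx) rows into a scratch table and drop exact repeats."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t(d text, m text)")
    conn.executemany("insert into t values (?, ?)", rows)
    # Same statement as above: keep the lowest rowid of each (d, m) pair.
    conn.execute(
        "delete from t where rowid not in "
        "(select min(rowid) from t group by d, m)"
    )
    result = conn.execute("select d, m from t order by d, m").fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        ("a.example", "mx1.example"),
        ("a.example", "mx1.example"),  # same pair at a later timestamp
        ("a.example", "mx2.example"),
        ("b.example", "mx1.example"),
    ]
    for row in dedupe_pairs(sample):
        print(row)
```

A domain that changed MX over time keeps one row per distinct MX, which is what we want when checking what the hits used.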
Let's see what the hits use:
awk -F, 'NR>1{ print $2 }' ../media/cia-2010-covert-communication-websites/hits.csv | xargs -I{} sqlite3 mx.sqlite "select distinct * from t where d = '{}'"
At around 267 total hits, only 84 have MX records, and of those that do, almost all of them have exactly:
time sqlite3 mx.sqlite '.mode csv' "attach 'aiddcu.sqlite' as 'av'" '.load ./ip' "select ipi2s(av.t.i), av.t.d from av.t inner join t as mx on av.t.d = mx.d and mx.m = 'mailstore1.secureserver.net' order by av.t.i asc" > avm.csv
where avm stands for av with mx pruning. This leaves us with only ~500k entries. With one more fingerprint we could run Wayback Machine CDX scanning on it.
Let's check that we still have most of our hits in there:
We intersect 2013 DNS Census virtual host cleanup with 2013 DNS census MX records and that leaves 460k entries. We did lose hits on the MX records: as of 260 hits, secureserver.net is only used by about 1/3 of the sites, but we also concentrate 9x, so it may be worth it.
so yeah, most of those are likely going to be humongous just by looking at the names.
The smallest ones by far from the total are: frienddns.ru with only 487 hits, all others quite large or fake hits due to CSV. Did a quick Wayback Machine CDX scanning there but no luck alas.
Let's check the smaller ones:
inews-today.com,2013-08-12T03:14:01,ns1.frienddns.ru
source-commodities.net,2012-12-13T20:58:28,ns1.namecity.com -> fake hit due to grep e-commodities.net
dailynewsandsports.com,2013-08-13T08:36:28,ns3.a2hosting.com
just-kidding-news.com,2012-02-04T07:40:50,jns3.dailyrazor.com
fightwithoutrules.com,2012-11-09T01:17:40,sk.s2.ns1.ns92.kolmic.com
fightwithoutrules.com,2013-07-01T22:46:23,ns1625.ztomy.com
half-court.net,2012-09-10T09:49:15,sk.s2.ns1.ns92.kolmic.com
half-court.net,2013-07-07T00:31:12,ns1621.ztomy.com
We have not managed to extract much from this source, they don't have as much data on the range of interest.
But they do have some unique data at least, perhaps we should try them a bit more often, e.g. they were the only source we've seen so far that made the association: headlines2day.com -> 212.209.74.126 which places it in the more plausible globalbaseballnews.com IP range.
With our new look website you can now find other domains hosted on the same IP address, your website neighbours and more even quicker than before.
Owner replied, you can't:
At the moment you can only do this for current not historical records
This is a shame, reverse IP here could be quite valuable.
In principle, we could obtain this data from search engines, but Google doesn't track that entire website well, e.g. no hits for site:dnshistory.org "62.22.60.48" presumably due to heavy IP throttling.
Here at DNS History we have been crawling DNS records since 2009, our database currently contains over 1 billion domains and over 12 billion DNS records.
and it is true that they do have some hits from that useful era.
They appear to piece together data from various sources. As a result, they have a very complete domain -> IP history.
TODO reverse IP? The fact that they don't seem to have it suggests that they are just making historical reverse IP requests to a third party via some API.
Account creation blacklists common email providers such as gmail to force users to use a "corporate" email address. But using random domains like ciro@cirosantilli.com works fine.
Their data seems to date back to 2008 for our searches.
So far, no new domains have been found with Common Crawl, nor have any existing known domains been found to be present in Common Crawl. Our working theory is that Common Crawl never reached the domains. See also: How did Alexa find the domains?
So url_host_3rd_last_part might be a winner for CGI comms fingerprinting!
Naive one for one index:
select * from "ccindex"."ccindex" where url_host_registered_domain = 'conquermstoday.com' limit 100;
has no results... data scanned: 5.73 GB
Let's see if they have any of the domain hits. Let's also restrict by date to try and reduce the data scanned:
select * from "ccindex"."ccindex" where
fetch_time < TIMESTAMP '2014-01-01 00:00:00' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'topbillingsite.com',
'worldwildlifeadventure.com'
)
Humm, data scanned: 60.59 GB and no hits... weird.
Sanity check:
select * from "ccindex"."ccindex" WHERE
crawl = 'CC-MAIN-2013-20' AND
subset = 'warc' AND
url_host_registered_domain IN (
'google.com',
'amazon.com'
)
has a bunch of hits of course. Also, data scanned: 212.88 MB. The WHERE conditions on crawl and subset are a must! Should have read the article first.
Let's widen a bit more:
select * from "ccindex"."ccindex" WHERE
crawl IN (
'CC-MAIN-2013-20',
'CC-MAIN-2013-48',
'CC-MAIN-2014-10'
) AND
subset = 'warc' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'worldnewsandent.com',
'worldwildlifeadventure.com'
)
Still nothing found... they don't seem to have any of the URLs of interest?
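Since forgetting the crawl and subset conditions is what made the earlier queries scan tens of GB, a tiny query builder that refuses to run without them can help. This is only a sketch: the function name is hypothetical, and the table and column names are simply the ones used in the queries above.

```python
def ccindex_query(domains, crawls, subset="warc"):
    """Build a ccindex query that always restricts crawl and subset."""
    if not crawls:
        raise ValueError("at least one crawl partition is required")
    if not domains:
        raise ValueError("at least one domain is required")

    def quote(value):
        # Minimal SQL string literal quoting: double any single quote.
        return "'" + value.replace("'", "''") + "'"

    crawl_list = ", ".join(quote(c) for c in crawls)
    domain_list = ", ".join(quote(d) for d in domains)
    return (
        'select * from "ccindex"."ccindex" where '
        f"crawl in ({crawl_list}) and "
        f"subset = {quote(subset)} and "
        f"url_host_registered_domain in ({domain_list})"
    )


if __name__ == "__main__":
    print(ccindex_query(
        ["activegaminginfo.com", "altworldnews.com"],
        ["CC-MAIN-2013-20", "CC-MAIN-2013-48"],
    ))
```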
We could not find anything useful in it so far, but there is great potential to use this tool to find new IP ranges based on properties of existing IP ranges. Part of the problem is that the dataset is huge, and is split into 256 files by the first byte of the IP. But it would be reasonable to at least explore ranges with pre-existing known hits...
We have started looking for patterns on 66.* and 208.*, both selected as two relatively far away ranges that have a number of pre-existing hits. 208 should likely have been 212 considering later finds that put several ranges in 212.
... similar down
208.254.40.95 1334668500 down no-response
208.254.40.95 1338270300 down no-response
208.254.40.95 1338839100 down no-response
208.254.40.95 1339361100 down no-response
208.254.40.95 1346391900 down no-response
208.254.40.96 1335806100 up unknown
208.254.40.96 1336979700 up unknown
208.254.40.96 1338840900 up unknown
208.254.40.96 1339454700 up unknown
208.254.40.96 1346778900 up echo-reply (0.34s latency).
208.254.40.96 1346838300 up echo-reply (0.30s latency).
208.254.40.97 1335840300 up unknown
208.254.40.97 1338446700 up unknown
208.254.40.97 1339334100 up unknown
208.254.40.97 1346658300 up echo-reply (0.26s latency).
... similar up
208.254.40.126 1335708900 up unknown
208.254.40.126 1338446700 up unknown
208.254.40.126 1339330500 up unknown
208.254.40.126 1346494500 up echo-reply (0.24s latency).
208.254.40.127 1335840300 up unknown
208.254.40.127 1337793300 up unknown
208.254.40.127 1338853500 up unknown
208.254.40.127 1346454900 up echo-reply (0.23s latency).
208.254.40.128 1335856500 up unknown
208.254.40.128 1338200100 down no-response
208.254.40.128 1338749100 down no-response
208.254.40.128 1339334100 down no-response
208.254.40.128 1346607900 down net-unreach
208.254.40.129 1335699900 up unknown
... similar down
Suggests exactly 127 - 96 + 1 = 32 IPs.
208.254.42:
... similar down
208.254.42.191 1334522700 down no-response
208.254.42.191 1335276900 down no-response
208.254.42.191 1335784500 down no-response
208.254.42.191 1337845500 down no-response
208.254.42.191 1338752700 down no-response
208.254.42.191 1339332300 down no-response
208.254.42.191 1346499900 down net-unreach
208.254.42.192 1334668500 up unknown
208.254.42.192 1336808700 up unknown
208.254.42.192 1339334100 up unknown
208.254.42.192 1346766300 up echo-reply (0.40s latency).
208.254.42.193 1335770100 up unknown
208.254.42.193 1338444900 up unknown
208.254.42.193 1339334100 up unknown
... similar up
208.254.42.221 1346517900 up echo-reply (0.19s latency).
208.254.42.222 1335708900 up unknown
208.254.42.222 1335708900 up unknown
208.254.42.222 1338066900 up unknown
208.254.42.222 1338747300 up unknown
208.254.42.222 1346872500 up echo-reply (0.27s latency).
208.254.42.223 1335773700 up unknown
208.254.42.223 1336949100 up unknown
208.254.42.223 1338750900 up unknown
208.254.42.223 1339334100 up unknown
208.254.42.223 1346854500 up echo-reply (0.13s latency).
208.254.42.224 1335665700 down no-response
208.254.42.224 1336567500 down no-response
208.254.42.224 1338840900 down no-response
208.254.42.224 1339425900 down no-response
208.254.42.224 1346494500 down time-exceeded
... similar down
Suggests exactly 223 - 192 + 1 = 32 IPs.
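The arithmetic on the range bounds can be double checked with Python's standard `ipaddress` module, which also tells us whether an inclusive range lines up with a single CIDR block. The helper name is ours. Both up ranges above, .96 to .127 and .192 to .223, turn out to be exactly one /27 each.

```python
import ipaddress


def summarize(first, last):
    """Return (address count, CIDR blocks) for an inclusive IPv4 range."""
    a = ipaddress.IPv4Address(first)
    b = ipaddress.IPv4Address(last)
    count = int(b) - int(a) + 1  # inclusive on both ends
    blocks = [str(net) for net in ipaddress.summarize_address_range(a, b)]
    return count, blocks


if __name__ == "__main__":
    print(summarize("208.254.40.96", "208.254.40.127"))
    print(summarize("208.254.42.192", "208.254.42.223"))
```

A range that collapses to a single CIDR block is consistent with one allocation from the hosting provider, while a range that needs several blocks suggests that our guessed bounds are off.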
Let's have a look at the file 68: outcome: no clear hits like on 208. One wonders why.
It does appear that long sequences of ranges are a sort of fingerprint. The question is how unique it would be.
First:
n=208
time awk '$3=="up"{ print $1 }' $n | uniq -c | sed -r 's/^ +//;s/ /,/' | tee $n-up-uniq
t=$n-up-uniq.sqlite
rm -f $t
time sqlite3 $t 'create table tmp(cnt text, i text)'
time sqlite3 $t ".import --csv $n-up-uniq tmp"
time sqlite3 $t 'create table t (i integer)'
time sqlite3 $t '.load ./ip' 'insert into t select str2ipv4(i) from tmp'
time sqlite3 $t 'drop table tmp'
time sqlite3 $t 'create index ti on t(i)'
This reduces us to 2 million IP rows from the total possible 16 million IPs.
OK now just counting hits on fixed windows has way too many results:
sqlite3 208-up-uniq.sqlite "\
SELECT * FROM (
SELECT i, COUNT(*) OVER (
ORDER BY i RANGE BETWEEN 15 PRECEDING AND 15 FOLLOWING
) as c FROM t
) WHERE c > 20 and c < 30
"
sqlite3 208-up-uniq.sqlite <<EOF
SELECT f, t - f as c FROM (
SELECT min(i) as f, max(i) as t
FROM (SELECT i, ROW_NUMBER() OVER (ORDER BY i) - i as grp FROM t)
GROUP BY grp
ORDER BY i
) where c = 31
EOF
271. Hmm. A bit more than we'd like...
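The consecutive run query relies on a classic trick: for sorted distinct integers, the value minus its rank stays constant inside a run of consecutive values and changes at every gap. Here is a pure Python sketch of the same idea using `itertools.groupby`. The function names are hypothetical, and it assumes the input is a sorted list of distinct IPs already converted to integers.

```python
from itertools import groupby


def consecutive_runs(sorted_ips):
    """Return (first, last) for each run of consecutive integers.

    Assumes sorted_ips is sorted and has no duplicates.
    """
    runs = []
    # value - index is constant within a run of consecutive values.
    for _, grp in groupby(enumerate(sorted_ips), key=lambda p: p[1] - p[0]):
        members = [ip for _, ip in grp]
        runs.append((members[0], members[-1]))
    return runs


def runs_of_span(sorted_ips, span=31):
    """Keep runs where last - first == span, i.e. span + 1 addresses."""
    return [(f, t) for f, t in consecutive_runs(sorted_ips) if t - f == span]


if __name__ == "__main__":
    ips = [10, 11, 12, 20] + list(range(96, 128)) + [130]
    print(consecutive_runs(ips))
    print(runs_of_span(ips))
```

Note that a difference of 31 between the last and first address corresponds to a run of 32 addresses.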
Another route is to also count the ups:
n=208
time awk '$3=="up"{ print $1 }' $n | uniq -c | sed -r 's/^ +//;s/ /,/' | tee $n-up-uniq-cnt
t=$n-up-uniq-cnt.sqlite
rm -f $t
time sqlite3 $t 'create table tmp(cnt text, i text)'
time sqlite3 $t ".import --csv $n-up-uniq-cnt tmp"
time sqlite3 $t 'create table t (cnt integer, i integer)'
time sqlite3 $t '.load ./ip' 'insert into t select cast(cnt as integer), str2ipv4(i) from tmp'
time sqlite3 $t 'drop table tmp'
time sqlite3 $t 'create index ti on t(i)'
Let's see how many consecutives with counts:
sqlite3 208-up-uniq-cnt.sqlite <<EOF
SELECT f, t - f as c FROM (
SELECT min(i) as f, max(i) as t
FROM (SELECT i, ROW_NUMBER() OVER (ORDER BY i) - i as grp FROM t WHERE cnt >= 3)
GROUP BY grp
ORDER BY i
) where c > 28 and c < 32
EOF
Let's check on 66:
grep -e '66.45.179' -e '66.45.179' 66
not representative at all... e.g. several confirmed hits are down:
66.45.179.215 1335305700 down no-response
66.45.179.215 1337579100 down no-response
66.45.179.215 1338765300 down no-response
66.45.179.215 1340271900 down no-response
66.45.179.215 1346813100 down no-response
Domain list only, no IPs and no dates. We haven't been able to extract anything of interest from this source so far.
Domain hit count when we were at 69 hits: only 9, some of which had been since reused. Likely their data collection did not cover the dates of interest.
When you Google most of the hit domains, many of them show up on "expired domain trackers", and above all Chinese expired domain trackers for some reason, notably e.g.:
The hits are very well distributed amongst days and months, at least they did a good job hiding these potential timing fingerprints. This feels very deliberately designed.
There are lots of hits. The data set is very inclusive. We also understand that it must have been obtained through means other than Web crawling, since it contains so many of the hits.
Nice output format for scraping as the HTML is very minimal
They randomly changed their URL format to remove the space before the .com after 2012-02-03:
Some files around 2011-09-06 start with an empty line. 2014-01-15 starts with about twenty empty lines. Oh and that last one also has some trash bytes at the end: <B7><B5><BB><D8>. Beauty.
Also has some randomly missing dates like hupo.com, though different missing ones from hupo, so they complement each other nicely.
Some of the URLs are broken and don't inform that with HTTP status code, they just replace the results with some Chinese text 无法找到该页 (The requested page could not be found):
curl -vvv http://domain.webmasterhome.cn/com/2015-10-31.asp
* Trying 125.90.93.11:80...
* Connected to domain.webmasterhome.cn (125.90.93.11) port 80 (#0)
> GET /com/2015-10-31.asp HTTP/1.1
> Host: domain.webmasterhome.cn
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sat, 21 Oct 2023 15:12:23 GMT
< Server: Microsoft-IIS/6.0
< X-Powered-By: ASP.NET
< Content-Length: 0
< Content-Type: text/html
< Set-Cookie: ASPSESSIONIDCSTTTBAD=BGGPAONBOFKMMFIPMOGGHLMJ; path=/
< Cache-control: private
<
* Connection #0 to host domain.webmasterhome.cn left intact
It is not fully clear if this is a throttling mechanism, or if the data is just missing entirely.
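Since the server answers 200 OK even when there is no data, a scraper has to look at the body to decide whether a page is usable. Here is a minimal Python sketch of such a check, based only on the two symptoms seen above: an empty body, or the Chinese "page not found" text. The function name is hypothetical, and it assumes the body has already been decoded to text.

```python
# Text shown by the site in place of results when a page is missing.
NOT_FOUND_MARKER = "无法找到该页"


def is_soft_404(status, text):
    """True if a 200 response carries no usable data."""
    if status != 200:
        return False  # real HTTP errors are a separate case
    return text.strip() == "" or NOT_FOUND_MARKER in text


if __name__ == "__main__":
    print(is_soft_404(200, ""))
    print(is_soft_404(200, "<html>" + NOT_FOUND_MARKER + "</html>"))
    print(is_soft_404(200, "example.com\nexample.net\n"))
```

Pages flagged this way are better logged and retried later rather than stored as empty days, since we cannot tell throttling from missing data.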
Starting around 2018, the IP limiting became very intense, 30 mins / 1 hour per URL, so we just gave up. Therefore, data from 2018 onwards does not contain webmasterhome.cn data.
Starting from 2013-05-10 the format changes randomly. This also shows us that they just have all the HTML pages as static files on their server. E.g. with:
This suggests that scraping these lists might be a good starting point to obtaining "all expired domains ever".
We've made the following pipelines for hupo.com + webmasterhome.cn merging:
./hupo.sh &
./webmastercn.sh &
wait
./hupo-merge.sh
# Export as small Google indexable files in a Git repository.
./hupo-repo.sh
# Export as per year zips for Internet Archive.
./hupo-zip.sh
# Obtain count statistics:
./hupo-wc.sh
Soon after uploading, these repos started getting some interesting traffic, presumably started by security trackers going "bling bling" on certain malicious domain names in their databases:
TODO what does this Chinese forum track? New registrations? Their focus seems to be domain name speculation
Some of the threads contain domain dumps. We haven't yet seen a scrapable URL pattern, but their data goes way back and did have various hits. The forum seems to have started in 2006: club.domain.cn/forum.php?mod=forumdisplay&fid=41&page=10127
Holy fuck the type of data source that we get in this area of work!
This pastebin contained a few new hits, in addition to some pre-existing ones. Most of the hits seem to be linked to the IP 72.34.53.174, which presumably is a major part of the fingerprint found by CYBERTAZIEX, though unsurprisingly the methodology is unclear. As documented, the domains appear to be linked to a "Condor hosting" provider, but it is hard to find any information about it online.
Ciro Santilli checked every single non-subdomain domain in the list.
The author's real name appears to be Deni Suwandi: twitter.com/denz_999 from Indonesia, but all accounts appear to be inactive, otherwise we'd ping him to ask for more info about the list.
The data here had a little bit of non-overlap from other sources. 4 new confirmed hits were found, plus 4 possible others that were left as candidates.