Main article: DNS Census 2013.
This data source was very valuable: it led to many hits, and to finding the first non-Reuters ranges with Section "secure subdomain search on 2013 DNS Census".
Hit overlap:
Domain hit count when we were at 279 hits: 142 hits, so about half of the hits were present.
jq -r '.[].host' ../media/cia-2010-covert-communication-websites/hits.json | xargs -I{} sqlite3 aiddcu.sqlite "select * from t where d = '{}'"
The timing of the database is perfect for this project, it is as if the CIA had planted it themselves!
We've noticed that often when there is a hit range:
- there is only one IP for each domain
- there is a range of about 20-30 of those
and that this does not seem to be that common. Let's see if that is a reasonable fingerprint or not.
Note that although this is the most common case, we have found multiple hits that viewdns.info maps to the same IP.
First we create a table u (unique) that only has domains which are the only domain for an IP. Let's see by how much that lowers the 191 M total unique domains:
time sqlite3 u.sqlite 'create table t (d text, i text)'
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where d not like '%.%.%' group by i having count(distinct d) = 1"
not like '%.%.%' removes subdomains from the counts so that CGI comms are still included, and the distinct in count(distinct d) is because we have multiple entries at different timestamps for some of the hits.
Let's start with the 208 subset to see how it goes:
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where i glob '208.*' and d not like '%.%.%' and (d like '%.com' or d like '%.net') group by i having count(distinct d) = 1"
OK, after we fixed bugs with the above we are down to 4 million lines with unique domain/IP pairs, which contain all of the original hits! Almost certainly more are to be found!
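The single-domain-per-IP filter above can be tried on a toy table before running it on the full database. The following is a minimal sketch using Python's sqlite3 module with an in-memory database. The function name single_domain_ips and the sample rows in the usage note are made up for illustration; only the shape of the query comes from the commands above.

```python
import sqlite3


def single_domain_ips(rows):
    """Return (domain, ip) pairs for IPs that host exactly one distinct domain.

    rows: iterable of (domain, ip) tuples, possibly with repeats, since the
    census can hold the same record at several timestamps.
    """
    con = sqlite3.connect(":memory:")
    con.execute("create table t (d text, i text)")
    con.executemany("insert into t values (?, ?)", rows)
    result = con.execute(
        """
        select min(d), min(i) from t
        where d not like '%.%.%'        -- drop subdomains before counting
        group by i
        having count(distinct d) = 1    -- repeated timestamps count once
        order by 1
        """
    ).fetchall()
    con.close()
    return result
```

For example, feeding it two copies of ("a.com", "1.1.1.1") plus ("b.com", "2.2.2.2") and ("c.com", "2.2.2.2") keeps only the a.com pair, since 2.2.2.2 serves two domains.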
This data is so valuable that we've decided to upload it to: archive.org/details/2013-dns-census-a-novirt.csv
The numbers of the first column are the IPs as a 32-bit integer representation, which is more useful to search for ranges in. Format:
8,chrisjmcgregor.com
11,80end.com
28,fine5.net
38,bestarabictv.com
49,xy005.com
50,cmsasoccer.com
80,museemontpellier.net
100,newtiger.com
108,lps-promptservice.com
111,bridesmaiddressesshow.com
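To search the 32-bit integer column for a given range, dotted IPs have to be converted to integers and back. Here is a minimal sketch using Python's standard ipaddress module. The helper names are hypothetical, and the /27 block in the usage note is purely an illustrative guess at a range size, not a known hit range.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(n):
    """Convert a 32-bit integer back to a dotted IPv4 string."""
    return str(ipaddress.IPv4Address(n))


def cidr_to_int_range(cidr):
    """Return (first, last) integer addresses of a CIDR block, inclusive."""
    net = ipaddress.IPv4Network(cidr)
    return int(net.network_address), int(net.broadcast_address)
```

With this, cidr_to_int_range("66.45.179.192/27") gives the two integers between which the first column can be searched, and int_to_ip turns any match back into a readable address.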
To make a histogram with the distribution of the single hostname IPs we use the following script. The result is useless noise, there is basically no pattern:
#!/usr/bin/env bash
bin=$((2**24))
sqlite3 2013-dns-census-a-novirt.sqlite -cmd '.mode csv' >2013-dns-census-a-novirt-hist.csv <<EOF
select i, sum(cnt) from (
select floor(i/${bin}) as i,
count(*) as cnt
from t
group by 1
union
select *, 0 as cnt from generate_series(0, 255)
)
group by i
EOF
gnuplot \
-e 'set terminal svg size 1200, 800' \
-e 'set output "2013-dns-census-a-novirt-hist.svg"' \
-e 'set datafile separator ","' \
-e 'set tics scale 0' \
-e 'unset key' \
-e 'set xrange[0:255]' \
-e 'set title "Counts of IPs with a single hostname"' \
-e 'set xlabel "IPv4 first byte"' \
-e 'set ylabel "count"' \
-e 'plot "2013-dns-census-a-novirt-hist.csv" using 1:2:1 with labels' \
;
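The binning done by the script above, with bin = 2**24, amounts to counting IPs by their first byte. As a rough sketch of the same idea in Python, assuming the integer IPs have already been read into memory (the function name is made up here):

```python
from collections import Counter


def first_byte_histogram(ip_ints):
    """Count IPs per first byte; returns a list of 256 counts.

    ip_ints: iterable of IPv4 addresses as 32-bit integers.
    Shifting right by 24 bits is the same as integer division by 2**24.
    """
    counts = Counter(n >> 24 for n in ip_ints)
    # Fill in empty bins with 0 so every first byte 0..255 is present,
    # which is what the generate_series union does in the SQL version.
    return [counts.get(b, 0) for b in range(256)]
```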
There are two keywords that are killers: "news" and "world" and their translations or closely related words. Everything else is hard. So a good start is:
grep -e news -e noticias -e nouvelles -e world -e global
iran + football:
- iranfootballsource.com: the third hit for this area after the two given by Reuters! Epic.
3 easy hits with "noticias" (news in Portuguese or Spanish), uncovering two brand new IP ranges:
- 66.45.179.205 noticiasporjanua.com
- 66.237.236.247 comunidaddenoticias.com
- 204.176.38.143 noticiassofisticadas.com
Let's see some French "nouvelles/actualites" for those tumultuous Maghrebis:
- 216.97.231.56 nouvelles-d-aujourdhuis.com
news + world:
- 210.80.75.55 philippinenewsonline.net
news + global:
- 204.176.39.115 globalprovincesnews.com
- 212.209.74.105 globalbaseballnews.com
- 212.209.79.40: hydradraco.com
OK, I've decided to do a complete Wayback Machine CDX scanning of news... Searching for .JAR or https.*cgi-bin.*\.cgi are killers, particularly the .jar hits, here's what came out:
- 62.22.60.49 telecom-headlines.com
- 62.22.61.206 worldnewsnetworking.com
- 64.16.204.55 holein1news.com
- 66.104.169.184 bcenews.com
- 69.84.156.90 stickshiftnews.com
- 74.116.72.236 techtopnews.com
- 74.254.12.168 non-stop-news.net
- 193.203.49.212 inews-today.com
- 199.85.212.118 just-kidding-news.com
- 207.210.250.132 aeronet-news.com
- 212.4.18.129 sightseeingnews.com
- 212.209.90.84 thenewseditor.com
- 216.105.98.152 modernarabicnews.com
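The two URL fingerprints mentioned above can be applied to a list of archived URLs with a single regular expression. This is only a sketch: it assumes the URLs have already been fetched from the CDX API by some other means, and the function name and the example.com URLs in the usage note are hypothetical.

```python
import re

# Either a .jar file anywhere in the URL, or an HTTPS cgi-bin .cgi script.
FINGERPRINT = re.compile(r"\.jar\b|https.*cgi-bin.*\.cgi", re.IGNORECASE)


def interesting_urls(urls):
    """Keep only the URLs that match one of the two fingerprints."""
    return [u for u in urls if FINGERPRINT.search(u)]
```

For example, http://example.com/applet.jar and https://example.com/cgi-bin/x.cgi are kept, while a plain http:// cgi-bin URL is dropped because the pattern requires https.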
Wayback Machine CDX scanning of "world":
- 66.104.173.186 myworldlymusic.com
"headline": only 140 matches in 2013-dns-census-a-novirt.csv and 3 hits out of 269 hits. Full inspection without CDX led to no new hits.
"today": only 3.5k matches in 2013-dns-census-a-novirt.csv and 12 hits out of 269 hits, TODO how many of those on 2013-dns-census-a-novirt? No new hits.
"world", "global", "international", and Spanish/Portuguese/French versions like "mondo", "mundo", "mondi": 15k matches in 2013-dns-census-a-novirt.csv. No new hits.
Let's see if there's anything in records/mx.xz.
mx.csv is 21GB.
They do have " in the files to escape commas so:
mx.py
import csv
import sys

# Keep only the domain (column 0) and the MX host (column 3).
writer = csv.writer(sys.stdout)
with open('mx.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        writer.writerow([row[0], row[3]])
Would have been better with csvkit: stackoverflow.com/questions/36287982/bash-parse-csv-with-quotes-commas-and-newlines
then:
# uniq not amazing as there are often two or three slightly different records repeated on multiple timestamps, but down to 11 GB
python3 mx.py | uniq > mx-uniq.csv
sqlite3 mx.sqlite 'create table t(d text, m text)'
# 13 GB
time sqlite3 mx.sqlite ".import --csv --skip 1 'mx-uniq.csv' t"
# 41 GB
time sqlite3 mx.sqlite 'create index td on t(d)'
time sqlite3 mx.sqlite 'create index tm on t(m)'
time sqlite3 mx.sqlite 'create index tdm on t(d, m)'
# Remove dupes.
# Rows: 150m
time sqlite3 mx.sqlite <<EOF
delete from t
where rowid not in (
select min(rowid)
from t
group by d, m
)
EOF
# 15 GB
time sqlite3 mx.sqlite vacuum
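The reason uniq alone is not enough is that it only drops adjacent repeats, while the delete above keeps the first row of every (d, m) pair regardless of position. A small sketch of that delete on an in-memory table, with a hypothetical function name and made-up rows in the usage note:

```python
import sqlite3


def dedupe_pairs(rows):
    """Keep the first occurrence of each (domain, mx) pair, in insert order."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (d text, m text)")
    con.executemany("insert into t values (?, ?)", rows)
    # Same statement as in the shell session above.
    con.execute(
        "delete from t where rowid not in "
        "(select min(rowid) from t group by d, m)"
    )
    out = con.execute("select d, m from t order by rowid").fetchall()
    con.close()
    return out
```

Given ("a.com", "mx1"), ("b.com", "mx2"), ("a.com", "mx1"), the third row is removed even though it is not adjacent to the first.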
Let's see what the hits use:
awk -F, 'NR>1{ print $2 }' ../media/cia-2010-covert-communication-websites/hits.csv | xargs -I{} sqlite3 mx.sqlite "select distinct * from t where d = '{}'"
At around 267 total hits, only 84 have MX records, and from those that do, almost all of them have exactly:
smtp.secureserver.net
mailstore1.secureserver.net
with only three exceptions:
dailynewsandsports.com|dailynewsandsports.com
inews-today.com|mail.inews-today.com
just-kidding-news.com|just-kidding-news.com
We need to count out of the totals!
sqlite3 mx.sqlite "select count(*) from t where m = 'mailstore1.secureserver.net'"
which gives ~18M, so nope, it is too much by itself...
Let's try to use that to reduce av.sqlite from 2013 DNS Census virtual host cleanup a bit further:
time sqlite3 mx.sqlite '.mode csv' "attach 'aiddcu.sqlite' as 'av'" '.load ./ip' "select ipi2s(av.t.i), av.t.d from av.t inner join t as mx on av.t.d = mx.d and mx.m = 'mailstore1.secureserver.net' order by av.t.i asc" > avm.csv
avm stands for av with mx pruning. This leaves us with only ~500k entries left. With one more fingerprint we could do a Wayback Machine CDX scanning.
Let's check that we still have most of our hits in there:
grep -f <(awk -F, 'NR>1{print $2}' /home/ciro/bak/git/media/cia-2010-covert-communication-websites/hits.csv) avm.csv
At 267 hits we got 81, so all are still present.
secureserver is a hosting provider, we can see their blank page e.g. at: web.archive.org/web/20110128152204/http://emmano.com/. security.stackexchange.com/questions/12610/why-did-secureserver-net-godaddy-access-my-gmail-account/12616#12616 comments:
secureserver.net is the name GoDaddy use as the reverse DNS for IP addresses used for dedicated/virtual server hosting
We intersect 2013 DNS Census virtual host cleanup with 2013 DNS census MX records and that leaves 460k hits. We did lose a third on the MX records as of 260 hits since secureserver.net is only used in 1/3 of sites, but we also concentrate 9x, so it may be worth it.
Then we do Wayback Machine CDX scanning. It takes about 5 days, but it is manageable.
We did a full Wayback Machine CDX scanning for JAR, SWF and cgi-bin in those, but only found a single new hit:
- 63.130.160.50 theglobalheadlines.com. Just barely missed with our 2013 DNS Census virtual host cleanup heuristic keyword searches as we did think of both "global" and "headlines" in the "news" themes!
ns.csv is 57 GB. This file is too massive, working with it is a pain.
We can also cut down the data a lot with stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/76605540#76605540 and tld filtering:
awk -F, 'BEGIN{OFS=","} { if ($1 != last) { print $1, $3; last = $1; } }' ns.csv | grep -E '\.(com|net|info|org|biz),' > nsu.csv
This brings us down to a much more manageable 3.0 GB, 83 M rows.
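The awk and grep pipeline above keeps the first row for each run of identical domains and then filters by TLD. A roughly equivalent sketch in Python follows, assuming the domain,timestamp,nameserver column layout seen in the sample rows of this section. It is not byte-for-byte identical to the shell version: it parses the CSV properly and checks the TLD on the domain column only, and the function name is made up.

```python
import csv
from itertools import groupby

TLDS = (".com", ".net", ".info", ".org", ".biz")


def first_ns_per_domain(lines):
    """Yield (domain, nameserver) for the first row of each run of a domain.

    lines: iterable of CSV lines with columns domain, timestamp, nameserver.
    Like uniq, this only merges consecutive rows for the same domain.
    """
    reader = csv.reader(lines)
    for domain, group in groupby(reader, key=lambda row: row[0]):
        first = next(group)
        if domain.endswith(TLDS):
            yield domain, first[2]
```

It can be fed an open file object directly, so the 57 GB input never has to fit in memory.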
Let's just scan it once real quick to start with, since likely nothing will come of this avenue:
grep -f <(awk -F, 'NR>1{print $2}' ../media/cia-2010-covert-communication-websites/hits.csv) nsu.csv | tee nsu-hits.csv
cat nsu-hits.csv | csvcut -c 2 | sort | awk -F. '{OFS="."; print $(NF-1), $(NF)}' | sort | uniq -c | sort -k1 -n
As of 267 hits we get:
1 a2hosting.com
1 amerinoc.com
1 ayns.net
1 dailyrazor.com
1 domainingdepot.com
1 easydns.com
1 frienddns.ru
1 hostgator.com
1 kolmic.com
1 name-services.com
1 namecity.com
1 netnames.net
1 tonsmovies.net
1 webmailer.de
2 cashparking.com
55 worldnic.com
86 domaincontrol.com
so yeah, most of those are likely going to be humongous just by looking at the names.
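The provider counting pipeline above reduces each nameserver to its last two labels and counts them. A minimal Python sketch of the same step is shown here; the function name is hypothetical, and taking the last two labels is only an approximation of the registered domain (it would be wrong for suffixes such as co.uk).

```python
from collections import Counter


def count_ns_providers(nameservers):
    """Count nameservers by their last two DNS labels, smallest count first."""
    counts = Counter(".".join(ns.split(".")[-2:]) for ns in nameservers)
    # Sort by count, then by name, so that ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
```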
The smallest ones by far from the total are: frienddns.ru with only 487 hits, all others quite large or fake hits due to CSV. Did a quick Wayback Machine CDX scanning there but no luck alas.
Let's check the smaller ones:
inews-today.com,2013-08-12T03:14:01,ns1.frienddns.ru
source-commodities.net,2012-12-13T20:58:28,ns1.namecity.com -> fake hit due to grep e-commodities.net
dailynewsandsports.com,2013-08-13T08:36:28,ns3.a2hosting.com
just-kidding-news.com,2012-02-04T07:40:50,jns3.dailyrazor.com
fightwithoutrules.com,2012-11-09T01:17:40,sk.s2.ns1.ns92.kolmic.com
fightwithoutrules.com,2013-07-01T22:46:23,ns1625.ztomy.com
half-court.net,2012-09-10T09:49:15,sk.s2.ns1.ns92.kolmic.com
half-court.net,2013-07-07T00:31:12,ns1621.ztomy.com
Doubt anything will come out of this.
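The fake hit on source-commodities.net happens because grep -f matches e-commodities.net as a substring anywhere in the line. One way to avoid that, sketched here under the assumption that the rows are already split into columns, is to compare the whole domain column against a set of hits. The function name is made up, and the second row in the usage note is invented for illustration.

```python
def exact_hits(hit_domains, rows):
    """Keep rows whose first column is exactly one of the hit domains.

    hit_domains: iterable of domain strings.
    rows: iterable of tuples or lists whose first element is a domain.
    """
    wanted = set(hit_domains)
    return [row for row in rows if row[0] in wanted]
```

With the hit e-commodities.net, the row for source-commodities.net is dropped, while a row such as ("e-commodities.net", "2013-01-01T00:00:00", "ns1.example.com") would be kept.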
Let's do a bit of counting out of the total:
grep domaincontrol.com ns.csv | awk -F, '{print $1}' | uniq | wc
gives ~20M domains using domaincontrol. Let's see how many domains are in the first place:
awk -F, '{print $1}' ns.csv | uniq | wc
so it accounts for 1/4 of the total.
Same as 2013 DNS census NS records basically, nothing came out.