We've noticed that often when there is a hit range:
there is only one IP for each domain
there is a range of about 20-30 of those
and that this does not seem to be that common. Let's see if that is a reasonable fingerprint or not.
Note that although this is the most common case, we have found multiple hits that viewdns.info maps to the same IP.
First we create a table u (unique) that only have domains which are the only domain for an IP, let's see by how much that lowers the 191 M total unique domains:
time sqlite3 u.sqlite 'create table t (d text, i text)'
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where d not like '%.%.%' group by i having count(distinct d) = 1"
The not like '%.%.%' removes subdomains from the counts so that CGI comms are still included, and distinct in count(distinct is because we have multiple entries at different timestamps for some of the hits.
Let's start with the 208 subset to see how it goes:
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where i glob '208.*' and d not like '%.%.%' and (d like '%.com' or d like '%.net') group by i having count(distinct d) = 1"
OK, after we fixed bugs with the above we are down to 4 million lines with unique domain/IP pairs and which contains all of the original hits! Almost certainly more are to be found!
The numbers of the first column are the IPs as a 32-bit integer representation, which is more useful to search for ranges in.
To make a histogram with the distribution of the single hostname IPs:
#!/usr/bin/env bash
bin=$((2**24))
sqlite3 2013-dns-census-a-novirt.sqlite -cmd '.mode csv' >2013-dns-census-a-novirt-hist.csv <<EOF
select i, sum(cnt) from (
select floor(i/${bin}) as i,
count(*) as cnt
from t
group by 1
union
select *, 0 as cnt from generate_series(0, 255)
)
group by i
EOF
gnuplot \
-e 'set terminal svg size 1200, 800' \
-e 'set output "2013-dns-census-a-novirt-hist.svg"' \
-e 'set datafile separator ","' \
-e 'set tics scale 0' \
-e 'unset key' \
-e 'set xrange[0:255]' \
-e 'set title "Counts of IPs with a single hostname"' \
-e 'set xlabel "IPv4 first byte"' \
-e 'set ylabel "count"' \
-e 'plot "2013-dns-census-a-novirt-hist.csv" using 1:2:1 with labels' \
;
Which gives the following useless noise, there is basically no pattern:
There are two keywords that are killers: "news" and "world" and their translations or closely related words. Everything else is hard. So a good start is:
grep -e news -e noticias -e nouvelles -e world -e global
iran + football:
iranfootballsource.com: the third hit for this area after the two given by Reuters! Epic.
3 easy hits with "noticias" (news in Portuguese or Spanish"), uncovering two brand new ip ranges:
66.45.179.205 noticiasporjanua.com
66.237.236.247 comunidaddenoticias.com
204.176.38.143 noticiassofisticadas.com
Let's see some French "nouvelles/actualites" for those tumultuous Maghrebis:
216.97.231.56 nouvelles-d-aujourdhuis.com
news + world:
210.80.75.55 philippinenewsonline.net
news + global:
204.176.39.115 globalprovincesnews.com
212.209.74.105 globalbaseballnews.com
212.209.79.40: hydradraco.com
OK, I've decided to do a complete Wayback Machine CDX scanning of news... Searching for .JAR or https.*cgi-bin.*\.cgi are killers, particularly the .jar hits, here's what came out:
"headline": only 140 matches in 2013-dns-census-a-novirt.csv and 3 hits out of 269 hits. Full inspection without CDX led to no new hits.
"today": only 3.5k matches in 2013-dns-census-a-novirt.csv and 12 hits out of 269 hits, TODO how many on those on 2013-dns-census-a-novirt? No new hits.
"world", "global", "international", and spanish/portuguese/French versions like "mondo", "mundo", "mondi": 15k matches in 2013-dns-census-a-novirt.csv. No new hits.