Source: cirosantilli/cia-2010-covert-communication-websites/2013-dns-census-virtual-host-cleanup

= 2013 DNS Census virtual host cleanup

We've noticed that often when there is a hit range:
* there is only one IP for each domain
* there is a range of about 20-30 of those
and that this does not seem to be that common. Let's see if that is a reasonable fingerprint or not.

Note that although this is the most common case, we have found multiple hits that <viewdns.info> maps to the same IP.

First we create a table `u` (`unique`) that only have domains which are the only domain for an IP, let's see by how much that lowers the 191 M total unique domains:
``
time sqlite3 u.sqlite 'create table t (d text, i text)'
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where d not like '%.%.%' group by i having count(distinct d) = 1"
``
The `not like '%.%.%'` removes subdomains from the counts so that <CGI comms> are still included, and `distinct` in `count(distinct` is because we have multiple entries at different timestamps for some of the hits.

Let's start with the 208 subset to see how it goes:
``
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where i glob '208.*' and d not like '%.%.%' and (d like '%.com' or d like '%.net') group by i having count(distinct d) = 1"
``
OK, after we fixed bugs with the above we are down to 4 million lines with unique domain/IP pairs and which contains all of the original hits! Almost certainly more are to be found!

This data is so valuable that we've decided to upload it to: https://archive.org/details/2013-dns-census-a-novirt.csv Format:
``
8,chrisjmcgregor.com
11,80end.com
28,fine5.net
38,bestarabictv.com
49,xy005.com
50,cmsasoccer.com
80,museemontpellier.net
100,newtiger.com
108,lps-promptservice.com
111,bridesmaiddressesshow.com
``
The numbers of the first column are the IPs as a 32-bit integer representation, which is more useful to search for ranges in.

To make a <histogram> with the distribution of the single hostname IPs:
``
#!/usr/bin/env bash
bin=$((2**24))
sqlite3 2013-dns-census-a-novirt.sqlite -cmd '.mode csv' >2013-dns-census-a-novirt-hist.csv <<EOF
select i, sum(cnt) from (
  select floor(i/${bin}) as i,
         count(*) as cnt
    from t
    group by 1
  union
  select *, 0 as cnt from generate_series(0, 255)
)
group by i
EOF
gnuplot \
  -e 'set terminal svg size 1200, 800' \
  -e 'set output "2013-dns-census-a-novirt-hist.svg"' \
  -e 'set datafile separator ","' \
  -e 'set tics scale 0' \
  -e 'unset key' \
  -e 'set xrange[0:255]' \
  -e 'set title "Counts of IPs with a single hostname"' \
  -e 'set xlabel "IPv4 first byte"' \
  -e 'set ylabel "count"' \
  -e 'plot "2013-dns-census-a-novirt-hist.csv" using 1:2:1 with labels' \
;
``
Which gives the following useless noise, there is basically no pattern:
\Image[https://raw.githubusercontent.com/cirosantilli/media/master/cia-2010-covert-communication-websites/2013-dns-census-a-novirt-hist.svg]