ns.csv is 57 GB. This file is too massive, working with it is a pain.
We can also cut down the data a lot with stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/76605540#76605540 and tld filtering:
awk -F, 'BEGIN{OFS=","} { if ($1 != last) { print $1, $3; last = $1; } }' ns.csv | grep -E '\.(com|net|info|org|biz),' > nsu.csv
This brings us down to a much more manageable 3.0 GB, 83 M rows.
Let's just scan it once real quick to start with, since likely nothing will come of this venue:
grep -f <(awk -F, 'NR>1{print $2}' ../media/cia-2010-covert-communication-websites/hits.csv) nsu.csv | tee nsu-hits.csv
cat nsu-hits.csv | csvcut -c 2 | sort | awk -F. '{OFS="."; print $(NF-1), $(NF)}' | sort | uniq -c | sort -k1 -n
As of 267 hits we get:
      1 a2hosting.com
      1 amerinoc.com
      1 ayns.net
      1 dailyrazor.com
      1 domainingdepot.com
      1 easydns.com
      1 frienddns.ru
      1 hostgator.com
      1 kolmic.com
      1 name-services.com
      1 namecity.com
      1 netnames.net
      1 tonsmovies.net
      1 webmailer.de
      2 cashparking.com
     55 worldnic.com
     86 domaincontrol.com
so yeah, most of those are likely going to be humongous just by looking at the names.
The smallest ones by far from the total are: frienddns.ru with only 487 hits, all others quite large or fake hits due to CSV. Did a quick Wayback Machine CDX scanning there but no luck alas.
Let's check the smaller ones:
source-commodities.net,2012-12-13T20:58:28,ns1.namecity.com -> fake hit due to grep e-commodities.net
Doubt anything will come out of this.
Let's do a bit of counting out of the total:
grep domaincontrol.com ns.csv | awk -F, '{print $1}' | uniq | wc
gives ~20M domain using domaincontrol. Let's see how many domains are in the first place:
awk -F, '{print $1}' ns.csv | uniq | wc
so it accounts for 1/4 of the total.

Articles by others on the same topic (0)

There are currently no matching articles

See all articles in the same topic