New articles - OurBigBook.com

Ciro Santilli 37 Updated 2025-07-16

It can't be HTML crawl because presumably there wouldn't have been links to those websites? Presumably this is why Common Crawl doesn't seem to have any hits.

So they must have had some kind of DNS A record database?

Or would IPv4 sweep have worked, without the Host header with the CIA's setup?

The same question also applies to the 2013 DNS Census. It has less hits, but still has many.

Whatever they did, we are so so glad that they did!

 Read the full article

CIA 2010 covert communication websites / Non .com .net TLDs by

Ciro Santilli 37 Updated 2025-07-16

 View more

.com and .net are very dominant. Here we list other choices made:

.info: has a few hits:
- archived comms:
  - beyondthefringe.info
- unarchived comms:
  - crickettoday.info
- unarchived:
  - talkingpointnews.info
  - theventurenews.info
  - worldconcerns.info
Did a full Wayback Machine CDX scanning on .info after:
```
grep -e news -e noticias -e nouvelles -e world -e global
```
That makes about 10k domains, so it's about the right size.
.org: has a least one hit, see: Are there .org hits?
.biz:
- unarchived comms:
  - atthemovies.biz

 Read the full article

CIA 2010 covert communication websites / Are there .org hits? by

Ciro Santilli 37 Updated 2025-07-16

 View more

Previously it was unclear if there were any .org hits, until we found the first one with clear comms: web.archive.org/web/20110624203548/http://awfaoi.org/hand.jar

Later on, two more clear ones were found with expired domain trackers:

azerinews.org
autism-news.org

further settling their existence. Later on newimages.org also came to light.

Others that had been previously found in IP ranges but without clear comms:

65.61.127.177: material-science.org
212.4.17.61: tech-stop.org
74.116.72.244 arborstribune.org

.org is very rare, and has been excluded from some of our search heuristics. That was a shame, but likely not much was missed.

 Read the full article

CIA 2010 covert communication websites / Data sources by

Ciro Santilli 37 Updated 2025-07-16

 View more

This is a dark art, and many of the sources are shady as fuck! We often have no idea of their methodology. Also no source is fully complete. We just piece up as best we can.

In order to explore IPs in known IP ranges, what we need are good DNS databases.

www.zone-h.org/archive/ip=208.76.80.93/page=11?hz=1 mentions newsupdatesite.com and mentions "defacement", the "Mass Deface III" pastebin comes to mind. No other nearby hits on quick inspection.

 Read the full article

CIA 2010 covert communication websites / Reuters article by

Ciro Santilli 37 Updated 2025-07-16

 View more

www.reuters.com/investigates/special-report/usa-spies-iran

This is our primary data source, the first article that pointed out a few specific CIA websites which then served as the basis for all of our research.

We take the truth of this article as an axiom. And then all we claim is that all other websites found were made by the same people due to strong shared design principles of the such websites.

 Read the full article

CIA 2010 covert communication websites / Wayback Machine by

Ciro Santilli 37 Updated 2025-07-16

 View more

D'oh.

But to be serious. The Wayback Machine contains a very large proportion of all sites. It does happen sometime that a Wayback Machine archive is missing or broken and cqcounter has the screenshot. But the Wayback Machine is still the most complete database we have found so far. Some archives are very broken. But those are rare.

The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.

The Common Crawl project attempts in part to address this lack of querriability, but we haven't managed to extract any hits from it.

CDX + 2013 DNS Census + heuristics however has been fruitful however.

We have dumped all Wayback Machine archives of known websites to: github.com/cirosantilli/cia-2010-websites-dump using ../cia-2010-covert-communication-websites/download-websites.sh. This allows for better grepping and serves as a backup in case they ever go down.

 Read the full article

CIA 2010 covert communication websites / Wayback Machine CDX scanning by

Ciro Santilli 37 Updated 2025-07-16

 View more

The Wayback Machine has an endpoint to query cralwed pages called the CDX server. It is documented at: github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.

This allows to filter down 10 thousands of possible domains in a few hours. But 100s of thousands would be too much. This is because you have to query exactly one URL at a time, and they possibly rate limit IPs. But no IP blacklisting so far after several hours, so it's not that bad.

Once you have a heuristic to narrow down some domains, you can use this helper: ../cia-2010-covert-communication-websites/cdx.sh to drill them down from 10s of thousands down to hundreds or thousands.

We then post process the results of cdx.sh with ../cia-2010-covert-communication-websites/cdx-post.sh to drill them down from from thousands to dozens, and manually inspect everything.

From then on, you can just manually inspect for hist on your browser.

 Read the full article

CIA 2010 covert communication websites / Wayback Machine CDX scanning with Tor parallelization by

Ciro Santilli 37 Updated 2025-07-16

 View more

Dire times require dire methods: ../cia-2010-covert-communication-websites/cdx-tor.sh.

First we must start the tor servers with the tor-army command from: stackoverflow.com/questions/14321214/how-to-run-multiple-tor-processes-at-once-with-different-exit-ips/76749983#76749983

tor-army 100

and then use it on a newline separated domain name list to check;

./cdx-tor.sh infile.txt

This creates a directory infile.txt.cdx/ containing:

infile.txt.cdx/out00, out01, etc.: the suspected CDX lines from domains from each tor instance based on the simple criteria that the CDX can handle directly. We split the input domains into 100 piles, and give one selected pile per tor instance.
infile.txt.cdx/out: the final combined CDX output of out00, out01, ...
infile.txt.cdx/out.post: the final output containing only domain names that match further CLI criteria that cannot be easily encoded on the CDX query. This is the cleanest domain name list you should look into at the end basically.

Since archive is so abysmal in its data access, e.g. a Google BigQuery would solve our issues in seconds, we have to come up with creative ways of getting around their IP throttling.

The CIA doesn't play fair. They're actually the exact opposite of fair. So neither shall we.

Distilled into an answer at: stackoverflow.com/questions/14321214/how-to-run-multiple-tor-processes-at-once-with-different-exit-ips/76749983#76749983

This should allow a full sweep of the 4.5M records in 2013 DNS Census virtual host cleanup in a reasonable amount of time. After JAR/SWF/CGI filtering we obtained 5.8k domains, so a reduction factor of about 1 million with likely very few losses. Not bad.

5.8k is still a bit annoying to fully go over however, so we can also try to count CDX hits to the domains and remove anything with too many hits, since the CIA websites basically have very few archives:

cd 2013-dns-census-a-novirt-domains.txt.cdx
./cdx-tor.sh -d out.post domain-list.txt
cd out.post.cdx
cut -d' ' -f1 out | uniq -c | sort -k1 -n | awk 'match($2, /([^,]+),([^)]+)/, a) {printf("%s.%s %d\n", a[2], a[1], $1)}' > out.count

This gives us something like:

12654montana.com 1
aeronet-news.com 1
atohms.com 1
av3net.com 1
beechstreetas400.com 1

sorted by increasing hit counts, so we can go down as far as patience allows for!

New results from a full CDX scan of 2013-dns-census-a-novirt.csv:

219.90.61.123 journeystravelled.com

 Read the full article

CIA 2010 covert communication websites / Wayback Machine crawl date search by

Ciro Santilli 37 Updated 2025-07-16

 View more

Many hits appear to happen on the same days, and per-day data does exist: archive.org/details/widecrawl but apparently cannot be publicly downloaded unfortunately. But maybe there's another way? TODO select candidates.

 Read the full article

CIA 2010 covert communication websites / DNS Census 2013 by

Ciro Santilli 37 Updated 2025-07-16

 View more

Main article: DNS Census 2013.

This data source was very valuable, and led to many hits, and to finding the first non Reuters ranges with Section "secure subdomain search on 2013 DNS Census".

Hit overlap:

jq -r '.[].host' ../media/cia-2010-covert-communication-websites/hits.json ) | xargs -I{} sqlite3 aiddcu.sqlite "select * from t where d = '{}'"

Domain hit count when we were at 279 hits: 142 hits, so about half of the hits were present.

The timing of the database is perfect for this project, it is as if the CIA had planted it themselves!

 Read the full article

CIA 2010 covert communication websites / 2013 DNS Census virtual host cleanup by

Ciro Santilli 37 Updated 2025-07-16

 View more

We've noticed that often when there is a hit range:

there is only one IP for each domain
there is a range of about 20-30 of those

and that this does not seem to be that common. Let's see if that is a reasonable fingerprint or not.

Note that although this is the most common case, we have found multiple hits that viewdns.info maps to the same IP.

First we create a table u (unique) that only have domains which are the only domain for an IP, let's see by how much that lowers the 191 M total unique domains:

time sqlite3 u.sqlite 'create table t (d text, i text)'
time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where d not like '%.%.%' group by i having count(distinct d) = 1"

The not like '%.%.%' removes subdomains from the counts so that CGI comms are still included, and distinct in count(distinct is because we have multiple entries at different timestamps for some of the hits.

Let's start with the 208 subset to see how it goes:

time sqlite3 av.sqlite -cmd "attach 'u.sqlite' as u" "insert into u.t select min(d) as d, min(i) as i from t where i glob '208.*' and d not like '%.%.%' and (d like '%.com' or d like '%.net') group by i having count(distinct d) = 1"

OK, after we fixed bugs with the above we are down to 4 million lines with unique domain/IP pairs and which contains all of the original hits! Almost certainly more are to be found!

This data is so valuable that we've decided to upload it to: archive.org/details/2013-dns-census-a-novirt.csv Format:

8,chrisjmcgregor.com
11,80end.com
28,fine5.net
38,bestarabictv.com
49,xy005.com
50,cmsasoccer.com
80,museemontpellier.net
100,newtiger.com
108,lps-promptservice.com
111,bridesmaiddressesshow.com

The numbers of the first column are the IPs as a 32-bit integer representation, which is more useful to search for ranges in.

To make a histogram with the distribution of the single hostname IPs:

#!/usr/bin/env bash
bin=$((2**24))
sqlite3 2013-dns-census-a-novirt.sqlite -cmd '.mode csv' >2013-dns-census-a-novirt-hist.csv <<EOF
select i, sum(cnt) from (
  select floor(i/${bin}) as i,
         count(*) as cnt
    from t
    group by 1
  union
  select *, 0 as cnt from generate_series(0, 255)
)
group by i
EOF
gnuplot \
  -e 'set terminal svg size 1200, 800' \
  -e 'set output "2013-dns-census-a-novirt-hist.svg"' \
  -e 'set datafile separator ","' \
  -e 'set tics scale 0' \
  -e 'unset key' \
  -e 'set xrange[0:255]' \
  -e 'set title "Counts of IPs with a single hostname"' \
  -e 'set xlabel "IPv4 first byte"' \
  -e 'set ylabel "count"' \
  -e 'plot "2013-dns-census-a-novirt-hist.csv" using 1:2:1 with labels' \
;

Which gives the following useless noise, there is basically no pattern:

https://raw.githubusercontent.com/cirosantilli/media/master/cia-2010-covert-communication-websites/2013-dns-census-a-novirt-hist.svg

 Read the full article

CIA 2010 covert communication websites / 2013 DNS Census virtual host cleanup heuristic keyword searches by

Ciro Santilli 37 Updated 2025-07-16

 View more

There are two keywords that are killers: "news" and "world" and their translations or closely related words. Everything else is hard. So a good start is:

grep -e news -e noticias -e nouvelles -e world -e global

iran + football:

iranfootballsource.com: the third hit for this area after the two given by Reuters! Epic.

3 easy hits with "noticias" (news in Portuguese or Spanish"), uncovering two brand new ip ranges:

66.45.179.205 noticiasporjanua.com
66.237.236.247 comunidaddenoticias.com
204.176.38.143 noticiassofisticadas.com

Let's see some French "nouvelles/actualites" for those tumultuous Maghrebis:

216.97.231.56 nouvelles-d-aujourdhuis.com

news + world:

210.80.75.55 philippinenewsonline.net

news + global:

204.176.39.115 globalprovincesnews.com
212.209.74.105 globalbaseballnews.com
212.209.79.40: hydradraco.com

OK, I've decided to do a complete Wayback Machine CDX scanning of news... Searching for .JAR or https.*cgi-bin.*\.cgi are killers, particularly the .jar hits, here's what came out:

62.22.60.49 telecom-headlines.com
62.22.61.206 worldnewsnetworking.com
64.16.204.55 holein1news.com
66.104.169.184 bcenews.com
69.84.156.90 stickshiftnews.com
74.116.72.236 techtopnews.com
74.254.12.168 non-stop-news.net
193.203.49.212 inews-today.com
199.85.212.118 just-kidding-news.com
207.210.250.132 aeronet-news.com
212.4.18.129 sightseeingnews.com
212.209.90.84 thenewseditor.com
216.105.98.152 modernarabicnews.com

Wayback Machine CDX scanning of "world":

66.104.173.186 myworldlymusic.com

"headline": only 140 matches in 2013-dns-census-a-novirt.csv and 3 hits out of 269 hits. Full inspection without CDX led to no new hits.

"today": only 3.5k matches in 2013-dns-census-a-novirt.csv and 12 hits out of 269 hits, TODO how many on those on 2013-dns-census-a-novirt? No new hits.

"world", "global", "international", and spanish/portuguese/French versions like "mondo", "mundo", "mondi": 15k matches in 2013-dns-census-a-novirt.csv. No new hits.

 Read the full article

CIA 2010 covert communication websites / 2013 DNS census MX records by

Ciro Santilli 37 Updated 2025-07-16

 View more

Let' see if there's anything in records/mx.xz.

mx.csv is 21GB.

They do have " in the files to escape commas so:

mx.py

import csv
import sys
writer = csv.writer(sys.stdout)
with open('mx.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        writer.writerow([row[0], row[3]])

Would have been better with csvkit: stackoverflow.com/questions/36287982/bash-parse-csv-with-quotes-commas-and-newlines

then:

# uniq not amazing as there are often two or three slightly different records repeated on multiple timestamps, but down to 11 GB
python3 mx.py | uniq > mx-uniq.csv
sqlite3 mx.sqlite 'create table t(d text, m text)'
# 13 GB
time sqlite3 mx.sqlite ".import --csv --skip 1 'mx-uniq.csv' t"

# 41 GB
time sqlite3 mx.sqlite 'create index td on t(d)'
time sqlite3 mx.sqlite 'create index tm on t(m)'
time sqlite3 mx.sqlite 'create index tdm on t(d, m)'

# Remove dupes.
# Rows: 150m
time sqlite3 mx.sqlite <<EOF
delete from t
where rowid not in (
  select min(rowid)
  from t
  group by d, m
)
EOF

# 15 GB
time sqlite3 mx.sqlite vacuum

Let's see what the hits use:

awk -F, 'NR>1{ print $2 }' ../media/cia-2010-covert-communication-websites/hits.csv | xargs -I{} sqlite3 mx.sqlite "select distinct * from t where d = '{}'"

At around 267 total hits, only 84 have MX records, and from those that do, almost all of them have exactly:

smtp.secureserver.net
mailstore1.secureserver.net

with only three exceptions:

dailynewsandsports.com|dailynewsandsports.com
inews-today.com|mail.inews-today.com
just-kidding-news.com|just-kidding-news.com

We need to count out of the totals!

sqlite3 mx.sqlite "select count(*) from t where m = 'mailstore1.secureserver.net'"

which gives, ~18M, so nope, it is too much by itself...

Let's try to use that to reduce av.sqlite from 2013 DNS Census virtual host cleanup a bit further:

time sqlite3 mx.sqlite '.mode csv' "attach 'aiddcu.sqlite' as 'av'" '.load ./ip' "select ipi2s(av.t.i), av.t.d from av.t inner join t as mx on av.t.d = mx.d and mx.m = 'mailstore1.secureserver.net' order by av.t.i asc" > avm.csv

where avm stands for av with mx pruning. This leaves us with only ~500k entries left. With one more figerprint we could do a Wayback Machine CDX scanning scan.

Let's check that we still have most our hits in there:

grep -f <(awk -F, 'NR>1{print $2}' /home/ciro/bak/git/media/cia-2010-covert-communication-websites/hits.csv) avm.csv

At 267 hits we got 81, so all are still present.

secureserver is a hosting provider, we can see their blank page e.g. at: web.archive.org/web/20110128152204/http://emmano.com/. security.stackexchange.com/questions/12610/why-did-secureserver-net-godaddy-access-my-gmail-account/12616#12616 comments:

secureserver.net is the name GoDaddy use as the reverse DNS for IP addresses used for dedicated/virtual server hosting

 Read the full article

CIA 2010 covert communication websites / 2013 DNS census NS records by

Ciro Santilli 37 Updated 2025-07-16

 View more

ns.csv is 57 GB. This file is too massive, working with it is a pain.

We can also cut down the data a lot with stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/76605540#76605540 and tld filtering:

awk -F, 'BEGIN{OFS=","} { if ($1 != last) { print $1, $3; last = $1; } }' ns.csv | grep -E '\.(com|net|info|org|biz),' > nsu.csv

This brings us down to a much more manageable 3.0 GB, 83 M rows.

Let's just scan it once real quick to start with, since likely nothing will come of this venue:

grep -f <(awk -F, 'NR>1{print $2}' ../media/cia-2010-covert-communication-websites/hits.csv) nsu.csv | tee nsu-hits.csv
cat nsu-hits.csv | csvcut -c 2 | sort | awk -F. '{OFS="."; print $(NF-1), $(NF)}' | sort | uniq -c | sort -k1 -n

As of 267 hits we get:

      1 a2hosting.com
      1 amerinoc.com
      1 ayns.net
      1 dailyrazor.com
      1 domainingdepot.com
      1 easydns.com
      1 frienddns.ru
      1 hostgator.com
      1 kolmic.com
      1 name-services.com
      1 namecity.com
      1 netnames.net
      1 tonsmovies.net
      1 webmailer.de
      2 cashparking.com
     55 worldnic.com
     86 domaincontrol.com

so yeah, most of those are likely going to be humongous just by looking at the names.

The smallest ones by far from the total are: frienddns.ru with only 487 hits, all others quite large or fake hits due to CSV. Did a quick Wayback Machine CDX scanning there but no luck alas.

Let's check the smaller ones:

inews-today.com,2013-08-12T03:14:01,ns1.frienddns.ru
source-commodities.net,2012-12-13T20:58:28,ns1.namecity.com -> fake hit due to grep e-commodities.net
dailynewsandsports.com,2013-08-13T08:36:28,ns3.a2hosting.com
just-kidding-news.com,2012-02-04T07:40:50,jns3.dailyrazor.com
fightwithoutrules.com,2012-11-09T01:17:40,sk.s2.ns1.ns92.kolmic.com
fightwithoutrules.com,2013-07-01T22:46:23,ns1625.ztomy.com
half-court.net,2012-09-10T09:49:15,sk.s2.ns1.ns92.kolmic.com
half-court.net,2013-07-07T00:31:12,ns1621.ztomy.com

Doubt anything will come out of this.

Let's do a bit of counting out of the total:

grep domaincontrol.com ns.csv | awk -F, '{print $1}' | uniq | wc

gives ~20M domain using domaincontrol. Let's see how many domains are in the first place:

awk -F, '{print $1}' ns.csv | uniq | wc

so it accounts for 1/4 of the total.

 Read the full article

CIA 2010 covert communication websites / 2013 DNS census SOA records by

Ciro Santilli 37 Updated 2025-07-16

 View more

Same as 2013 DNS census NS records basically, nothing came out.

 Read the full article

CIA 2010 covert communication websites / dnshistory.org by

Ciro Santilli 37 Updated 2025-07-16

 View more

dnshistory.org contains historical domain -> mappings.

We have not managed to extract much from this source, they don't have as much data on the range of interest.

But they do have some unique data at least, perhaps we should try them a bit more often, e.g. they were the only source we've seen so far that made the association: headlines2day.com -> 212.209.74.126 which places it in the more plausible globalbaseballnews.com IP range.

TODO can it do IP to domain? Or just domain to IP? Asked on their Discord: discord.com/channels/698151879166918727/968586102493552731/1124254204257632377. Their banner suggests that yes:

With our new look website you can now find other domains hosted on the same IP address, your website neighbours and more even quicker than before.

Owner replied, you can't:

At the moment you can only do this for current not historical records

This is a shame, reverse IP here could be quite valuable.

In principle, we could obtain this data from search engines, but Google doesn't track that entire website well, e.g. no hits for site:dnshistory.org "62.22.60.48" presumably due to heavy IP throttling.

Homepage dnshistory.org/ gives date starting in 2009:

Here at DNS History we have been crawling DNS records since 2009, our database currently contains over 1 billion domains and over 12 billion DNS records.

and it is true that they do have some hits from that useful era.

Any data that we have the patience of extracting from this we will dump under github.com/cirosantilli/media/blob/master/cia-2010-covert-communication-websites/hits.json.

 Read the full article

CIA 2010 covert communication websites / securitytrails.com by

Ciro Santilli 37 Updated 2025-07-16

 View more

They appear to piece together data from various sources. This is the most complete historical domain -> IP database we have so far. They don't have hugely more data than viewdns.info, but many times do offer something new. It feels like the key difference is that their data goes further back in the critical time period a bit.

TODO do they have historical reverse IP? The fact that they don't seem to have it suggests that they are just making historical reverse IP requests to a third party via some API?

E.g. searching thefilmcentre.com under historical data at securitytrails.com/domain/thefilmcentre.com/history/al gives the correct IP 62.22.60.55.

But searching the IP 62.22.60.55 is empty and there's no historical data option?

Account creation blacklists common email providers such as gmail to force users to use a "corporate" email address. But using random domains like ciro@cirosantilli.com works fine.

Their data seems to date back to 2008 for our searches.

 Read the full article

CIA 2010 covert communication websites / Common Crawl by

Ciro Santilli 37 Updated 2025-07-16

 View more

So far, no new domains have been found with Common Crawl, nor have any existing known domains been found to be present in Common Crawl. Our working theory is that Common Crawl never reached the domains How did Alexa find the domains?

Let's try and do something with Common Crawl.

Unfortunately there's no IP data apparently: github.com/commoncrawl/cc-index-table/issues/30, so let's focus on the URLs.

Using their Common Crawl Athena method: commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

Hello world:

select * from "ccindex"."ccindex" limit 100;

Data scanned: 11.75 MB

Sample first output line:

#                            2
url_surtkey                  org,whwheelers)/robots.txt
url                          https://whwheelers.org/robots.txt
url_host_name                whwheelers.org
url_host_tld                 org
url_host_2nd_last_part       whwheelers
url_host_3rd_last_part
url_host_4th_last_part
url_host_5th_last_part
url_host_registry_suffix     org
url_host_registered_domain   whwheelers.org
url_host_private_suffix      org
url_host_private_domain      whwheelers.org
url_host_name_reversed
url_protocol                 https
url_port
url_path                     /robots.txt
url_query
fetch_time                   2021-06-22 16:36:50.000
fetch_status                 301
fetch_redirect               https://www.whwheelers.org/robots.txt
content_digest               3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
content_mime_type            text/html
content_mime_detected        text/html
content_charset
content_languages
content_truncated
warc_filename                crawl-data/CC-MAIN-2021-25/segments/1623488519183.85/robotstxt/CC-MAIN-20210622155328-20210622185328-00312.warc.gz
warc_record_offset           1854030
warc_record_length           639
warc_segment                 1623488519183.85
crawl                        CC-MAIN-2021-25
subset                       robotstxt

So url_host_3rd_last_part might be a winner for CGI comms fingerprinting!

Naive one for one index:

select * from "ccindex"."ccindex" where url_host_registered_domain = 'conquermstoday.com' limit 100;

have no results... data scanned: 5.73 GB

Let's see if they have any of the domain hits. Let's also restrict by date to try and reduce the data scanned:

select * from "ccindex"."ccindex" where
  fetch_time < TIMESTAMP '2014-01-01 00:00:00' AND
  url_host_registered_domain IN (
   'activegaminginfo.com',
   'altworldnews.com',
   ...
   'topbillingsite.com',
   'worldwildlifeadventure.com'
 )

Humm, data scanned: 60.59 GB and no hits... weird.

Sanity check:

select * from "ccindex"."ccindex" WHERE
  crawl = 'CC-MAIN-2013-20' AND
  subset = 'warc' AND
  url_host_registered_domain IN (
   'google.com',
   'amazon.com'
 )

has a bunch of hits of course. Data scanned: 212.88 MB, WHERE crawl and subset are a must! Should have read the article first.

Let's widen a bit more:

select * from "ccindex"."ccindex" WHERE
  crawl IN (
    'CC-MAIN-2013-20',
    'CC-MAIN-2013-48',
    'CC-MAIN-2014-10'
  ) AND
  subset = 'warc' AND
  url_host_registered_domain IN (
    'activegaminginfo.com',
    'altworldnews.com',
    ...
    'worldnewsandent.com',
    'worldwildlifeadventure.com'
 )

Still nothing found... they don't seem to have any of the URLs of interest?

 Read the full article

CIA 2010 covert communication websites / Internet Census 2012 by

Ciro Santilli 37 Updated 2025-07-16

 View more

Does not appear to have any reverse IP hits unfortunately: opendata.stackexchange.com/questions/1951/dataset-of-domain-names/21077#21077. Likely only has domains that were explicitly advertised.

We could not find anything useful in it so far, but there is great potential to use this tool to find new IP ranges based on properties of existing IP ranges. Part of the problem is that the dataset is huge, and is split by top 256 bytes. But it would be reasonable to at least explore ranges with pre-existing known hits...

We have started looking for patterns on 66.* and 208.*, both selected as two relatively far away ranges that have a number of pre-existing hits. 208 should likely have been 212 considering later finds that put several ranges in 212.

tcpip_fp:

66.104.
- 66.104.175.41: grubbersworldrugbynews.com: 1346397300 SCAN(V=6.01%E=4%D=1/12%OT=22%CT=443%CU=%PV=N%G=N%TM=387CAB9E%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=N),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.48: worlddispatch.net: 1346816700 SCAN(V=6.01%E=4%D=1/2%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=1D5EA%P=mipsel-openwrt-linux-gnu),SEQ(SP=F8%GCD=3%ISR=109%TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.49: webworldsports.com: 1346692500 SCAN(V=6.01%E=4%D=9/3%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=5044E96E%P=mipsel-openwrt-linux-gnu),SEQ(SP=105%GCD=1%ISR=108%TI=Z%TS=A),OPS(O1=M550ST11NW6%O2=M550ST11NW6%O3=M550NNT11NW6%O4=M550ST11NW6%O5=M550ST11NW6%O6=M550ST11),WIN(W1=1510%W2=1510%W3=1510%W4=1510%W5=1510%W6=1510),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.50: fly-bybirdies.com: 1346822100 SCAN(V=6.01%E=4%D=1/1%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=14655%P=mipsel-openwrt-linux-gnu),SEQ(TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.53: info-ology.net: 1346712300 SCAN(V=6.01%E=4%D=9/4%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=50453230%P=mipsel-openwrt-linux-gnu),SEQ(SP=FB%GCD=1%ISR=FF%TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
66.175.106
- 66.175.106.150: noticiasmusica.net: 1340077500 SCAN(V=5.51%D=1/3%OT=22%CT=443%CU=%PV=N%G=N%TM=38707542%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.175.106.155: atomworldnews.com: 1345562100 SCAN(V=5.51%D=8/21%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=5033A5F2%P=mips-openwrt-linux-gnu),SEQ(SP=FB%GCD=1%ISR=FC%TI=Z%TS=A),ECN(R=Y%DF=Y%TG=40%W=1540%O=M550NNSNW6%CC=N%Q=),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)

 Read the full article

CIA 2010 covert communication websites / 2012 Internet Census hostprobes by

Ciro Santilli 37 Updated 2025-07-16

 View more

Hostprobes quick look on two ranges:

208.254.40:

... similar down

208.254.40.95	1334668500	down	no-response
208.254.40.95	1338270300	down	no-response
208.254.40.95	1338839100	down	no-response
208.254.40.95	1339361100	down	no-response
208.254.40.95	1346391900	down	no-response
208.254.40.96	1335806100	up	unknown
208.254.40.96	1336979700	up	unknown
208.254.40.96	1338840900	up	unknown
208.254.40.96	1339454700	up	unknown
208.254.40.96	1346778900	up	echo-reply (0.34s latency).
208.254.40.96	1346838300	up	echo-reply (0.30s latency).
208.254.40.97	1335840300	up	unknown
208.254.40.97	1338446700	up	unknown
208.254.40.97	1339334100	up	unknown
208.254.40.97	1346658300	up	echo-reply (0.26s latency).

... similar up

208.254.40.126	1335708900	up	unknown
208.254.40.126	1338446700	up	unknown
208.254.40.126	1339330500	up	unknown
208.254.40.126	1346494500	up	echo-reply (0.24s latency).
208.254.40.127	1335840300	up	unknown
208.254.40.127	1337793300	up	unknown
208.254.40.127	1338853500	up	unknown
208.254.40.127	1346454900	up	echo-reply (0.23s latency).

208.254.40.128	1335856500	up	unknown
208.254.40.128	1338200100	down	no-response
208.254.40.128	1338749100	down	no-response
208.254.40.128	1339334100	down	no-response
208.254.40.128	1346607900	down	net-unreach
208.254.40.129	1335699900	up	unknown

... similar down

Suggests exactly 127 - 96 + 1 = 31 IPs.

208.254.42:

... similar down

208.254.42.191	1334522700	down	no-response
208.254.42.191	1335276900	down	no-response
208.254.42.191	1335784500	down	no-response
208.254.42.191	1337845500	down	no-response
208.254.42.191	1338752700	down	no-response
208.254.42.191	1339332300	down	no-response
208.254.42.191	1346499900	down	net-unreach

208.254.42.192	1334668500	up	unknown
208.254.42.192	1336808700	up	unknown
208.254.42.192	1339334100	up	unknown
208.254.42.192	1346766300	up	echo-reply (0.40s latency).
208.254.42.193	1335770100	up	unknown
208.254.42.193	1338444900	up	unknown
208.254.42.193	1339334100	up	unknown

... similar up

208.254.42.221	1346517900	up	echo-reply (0.19s latency).
208.254.42.222	1335708900	up	unknown
208.254.42.222	1335708900	up	unknown
208.254.42.222	1338066900	up	unknown
208.254.42.222	1338747300	up	unknown
208.254.42.222	1346872500	up	echo-reply (0.27s latency).
208.254.42.223	1335773700	up	unknown
208.254.42.223	1336949100	up	unknown
208.254.42.223	1338750900	up	unknown
208.254.42.223	1339334100	up	unknown
208.254.42.223	1346854500	up	echo-reply (0.13s latency).

208.254.42.224	1335665700	down	no-response
208.254.42.224	1336567500	down	no-response
208.254.42.224	1338840900	down	no-response
208.254.42.224	1339425900	down	no-response
208.254.42.224	1346494500	down	time-exceeded

... similar down

Suggests exactly 223 - 192 + 1 = 31 IPs.

Let's have a look at the file 68: outcome: no clear hits like on 208. One wonders why.

It does appears that long sequences of ranges are a sort of fingerprint. The question is how unique it would be.

First:

n=208
time awk '$3=="up"{ print $1 }' $n | uniq -c | sed -r 's/^ +//;s/ /,/' | tee $n-up-uniq
t=$n-up-uniq.sqlite
rm -f $t
time sqlite3 $t 'create table tmp(cnt text, i text)'
time sqlite3 $t ".import --csv $n-up-uniq tmp"
time sqlite3 $t 'create table t (i integer)'
time sqlite3 $t '.load ./ip' 'insert into t select str2ipv4(i) from tmp'
time sqlite3 $t 'drop table tmp'
time sqlite3 $t 'create index ti on t(i)'

This reduces us to 2 million IP rows from the total possible 16 million IPs.

OK now just counting hits on fixed windows has way too many results:

sqlite3 208-up-uniq.sqlite "\
SELECT * FROM (
  SELECT min(i), COUNT(*) OVER (
    ORDER BY i RANGE BETWEEN 15 PRECEDING AND 15 FOLLOWING
  ) as c FROM t
) WHERE c > 20 and c < 30
"

Let's try instead consecutive ranges of length exactly 31 instead then:

sqlite3 208-up-uniq.sqlite <<EOF
SELECT f, t - f as c FROM (
  SELECT min(i) as f, max(i) as t
  FROM (SELECT i, ROW_NUMBER() OVER (ORDER BY i) - i as grp FROM t)
  GROUP BY grp
  ORDER BY i
) where c = 31
EOF

271. Hmm. A bit more than we'd like...

Another route is to also count the ups:

n=208
time awk '$3=="up"{ print $1 }' $n | uniq -c | sed -r 's/^ +//;s/ /,/' | tee $n-up-uniq-cnt
t=$n-up-uniq-cnt.sqlite
rm -f $t
time sqlite3 $t 'create table tmp(cnt text, i text)'
time sqlite3 $t ".import --csv $n-up-uniq-cnt tmp"
time sqlite3 $t 'create table t (cnt integer, i integer)'
time sqlite3 $t '.load ./ip' 'insert into t select cnt as integer, str2ipv4(i) from tmp'
time sqlite3 $t 'drop table tmp'
time sqlite3 $t 'create index ti on t(i)'

Let's see how many consecutives with counts:

sqlite3 208-up-uniq-cnt.sqlite <<EOF
SELECT f, t - f as c FROM (
  SELECT min(i) as f, max(i) as t
  FROM (SELECT i, ROW_NUMBER() OVER (ORDER BY i) - i as grp FROM t WHERE cnt >= 3)
  GROUP BY grp
  ORDER BY i
) where c > 28 and c < 32
EOF

Let's check on 66:

grep -e '66.45.179' -e '66.45.179' 66

not representative at all... e.g. several convfirmed hits are down:

66.45.179.215   1335305700      down    no-response
66.45.179.215   1337579100      down    no-response
66.45.179.215   1338765300      down    no-response
66.45.179.215   1340271900      down    no-response
66.45.179.215   1346813100      down    no-response

 Read the full article

 Pinned article: Introduction to the OurBigBook Project

Welcome to the OurBigBook Project! Our goal is to create the perfect publishing platform for STEM subjects, and get university-level students to write the best free STEM tutorials ever.

Everyone is welcome to create an account and play with the site: ourbigbook.com/go/register. We belive that students themselves can write amazing tutorials, but teachers are welcome too. You can write about anything you want, it doesn't have to be STEM or even educational. Silly test content is very welcome and you won't be penalized in any way. Just keep it legal!

Video 1.

Intro to OurBigBook

. Source.

We have two killer features:

topics: topics group articles by different users with the same title, e.g. here is the topic for the "Fundamental Theorem of Calculus" ourbigbook.com/go/topic/fundamental-theorem-of-calculus
Articles of different users are sorted by upvote within each article page. This feature is a bit like:
- a Wikipedia where each user can have their own version of each article
- a Q&A website like Stack Overflow, where multiple people can give their views on a given topic, and the best ones are sorted by upvote. Except you don't need to wait for someone to ask first, and any topic goes, no matter how narrow or broad
This feature makes it possible for readers to find better explanations of any topic created by other writers. And it allows writers to create an explanation in a place that readers might actually find it.
Figure 1.
Screenshot of the "Derivative" topic page
. View it live at: ourbigbook.com/go/topic/derivative
Video 2.
OurBigBook Web topics demo
. Source.
local editing: you can store all your personal knowledge base content locally in a plaintext markup format that can be edited locally and published either:
- to OurBigBook.com to get awesome multi-user features like topics and likes
- as HTML files to a static website, which you can host yourself for free on many external providers like GitHub Pages, and remain in full control
This way you can be sure that even if OurBigBook.com were to go down one day (which we have no plans to do as it is quite cheap to host!), your content will still be perfectly readable as a static site.
Figure 2.
You can publish local OurBigBook lightweight markup files to either https://OurBigBook.com or as a static website
.
Figure 3.
Visual Studio Code extension installation
.
Figure 4.
Visual Studio Code extension tree navigation
.
Figure 5.
Web editor
. You can also edit articles on the Web editor without installing anything locally.
Video 3.
Edit locally and publish demo
. Source. This shows editing OurBigBook Markup and publishing it using the Visual Studio Code extension.
Video 4.
OurBigBook Visual Studio Code extension editing and navigation demo
. Source.
Internal cross file references done right:
Infinitely deep tables of contents:
Figure 6.
Dynamic article tree with infinitely deep table of contents
.
Live URL: ourbigbook.com/cirosantilli/chordate
Descendant pages can also show up as toplevel e.g.: ourbigbook.com/cirosantilli/chordate-subclade