CIA 2010 covert communication websites / Fingerprints

From the Reuters websites and others we've found, we can see some clear stylistic trends across the websites, which would allow us to find other likely candidates upon inspection:

- natural sounding, sometimes long-ish, domain names, generally with 2 or 3 full words. Most are in English, but a few are in Spanish, and very few are in other languages like French.
- shallow websites with a few tabs, many external links, sometimes many images, and few internal pages
- common themes include:
  - news
  - hobbies, notably sports, travel and photography. Golf seems overrepresented. Must be a thing over there in Langley.
- .com and .net top-level domains, plus a few very rare other TLDs, notably .info and .org
- each one has one "communication mechanism file": communication mechanisms
- narrow page width like in the days of old, lots of images
- split header images
- some common patterns they follow in their news lists:
  - ul.rss-items > li.rss-item, e.g.: web.archive.org/web/20110202092126/http://beamingnews.com/
  - links with class a.newslink and a.newslinkalt, e.g. web.archive.org/web/20110128181622/http://profile-news.com/
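
The two news-list patterns lend themselves to a mechanical check. As a rough sketch, and assuming that the mere presence of these class names is already a useful signal (the exact ul > li nesting is not verified), the following Python helper reports which of them appear in a page. The names FINGERPRINT_CLASSES and fingerprint_classes are made up for this example:

```python
from html.parser import HTMLParser

# Class names taken from the two news-list patterns listed above.
FINGERPRINT_CLASSES = {"rss-items", "rss-item", "newslink", "newslinkalt"}


class ClassFingerprintParser(HTMLParser):
    """Collect which fingerprint classes appear on any element."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "class" and value:
                self.found.update(FINGERPRINT_CLASSES.intersection(value.split()))


def fingerprint_classes(html):
    """Return the subset of FINGERPRINT_CLASSES used in the given HTML."""
    parser = ClassFingerprintParser()
    parser.feed(html)
    return parser.found
```

A non-empty result is only a hint that a page deserves a manual look, since other websites may well use the same class names.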

The most notable dissonance from the rest of the web is that there are no commercial-looking company websites, presumably because it was felt that it would be possible to verify the existence of such companies.

Most domains are the only domain on their IP, i.e. the websites are mostly privately hosted. However, we later found many exceptions to this general indicator, so it should not be used as a strong exclusion rule.

It would be fun to actually reverse search into one of their stock image provider's original images. Ones we've found:

Split header images

 0  1

Many of the website banners are composed of several images cut up.

Often stock images were first assembled into the banner, and then the resulting image was cut.

Possibly this was done to make reverse image search to their stock image provider harder.

But it somewhat backfired and serves as a good marker that confirms authorship.

Maybe it is some kind of outdated web design practice, which they kept using much later than the average website, like the JAR.

Their websites do appear to follow common style guidelines from earlier eras, notably around the early 2000s. Some legit sites from that time look a lot like hits.

An example:

web.archive.org/web/20031002010827/http://www.ausiranstudy.com/

Looking at the source code of: web.archive.org/web/20130828122833/http://euronewsonline.net/euro_bus.php we noticed an interesting comment:

<!-- ImageReady Slices (enewsweather.psd) -->

which presumably refers to Adobe ImageReady:

Adobe ImageReady was a bitmap graphics editor that was shipped with Adobe Photoshop for six years. It was available for Windows, Classic Mac OS and Mac OS X from 1998 to 2007. ImageReady was designed for web development and closely interacted with Photoshop

A sample tutorial: people.goshen.edu/~paulmr/physix/326/imageready/slicendice.php

Some of the websites use CSS background images to display the banner slices, e.g. ingenuitytrendz.com has HTML:

ingenuitytrendz.com/20110201170354/index.html:                  <li><a id="banner1">&nbsp;</a></li>
ingenuitytrendz.com/20110201170354/index.html:                  <li><a id="banner2">&nbsp;</a></li>
ingenuitytrendz.com/20110201170354/index.html:                  <li><a id="banner3">&nbsp;</a></li>

and then the CSS engineering.css does:

#banner1 { background: url(/web/20110201170405im_/http://ingenuitytrendz.com/images/banner_01.jpg) no-repeat center; }
#banner2 { background: url(/web/20110201170405im_/http://ingenuitytrendz.com/images/banner_02.jpg) no-repeat center; }
#banner3 { background: url(/web/20110201170405im_/http://ingenuitytrendz.com/images/banner_03.jpg) no-repeat center; }
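
Such slices could be looked for mechanically. The following is only a heuristic sketch: it assumes that slices are named like banner_01.jpg, i.e. a base name followed by an underscore and a two-digit index, which is what we see on ingenuitytrendz.com but is not guaranteed elsewhere. The names SLICE_RE and find_sliced_images are hypothetical:

```python
import re
from collections import defaultdict

# Assumption: slices are named <base>_<two digit index>.<image extension>,
# as in banner_01.jpg above. Other naming schemes are not detected.
SLICE_RE = re.compile(r"([\w-]+)_(\d{2})\.(?:jpg|jpeg|gif|png)", re.IGNORECASE)


def find_sliced_images(text):
    """Return {base name: sorted slice indices} for bases with 2+ slices."""
    slices = defaultdict(set)
    for base, index in SLICE_RE.findall(text):
        slices[base].add(int(index))
    return {base: sorted(found) for base, found in slices.items() if len(found) >= 2}
```

It can be fed either HTML or CSS text, since it only looks at file names.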

HTML analysis

 0  0

The HTML of the index pages from the Wayback Machine was:

- dumped at: github.com/cirosantilli/media/tree/master/cia-2010-covert-communication-websites/html
- downloaded with: github.com/cirosantilli/media/tree/master/cia-2010-covert-communication-websites/download-html.sh. Note that there were many spurious errors, notably:
  OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to web.archive.org:443
  We just ran it multiple times until all errors were gone.
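
Re-running until the errors go away can also be automated per URL. This is not what download-html.sh does; it is just a minimal sketch of the same idea in Python, where fetch_with_retry, its parameters and the linear backoff are all assumptions of this example:

```python
import time


def fetch_with_retry(fetch, url, attempts=5, delay=1.0):
    """Call fetch(url) until it succeeds or attempts (>= 1) run out.

    fetch is any callable that returns the page body or raises OSError.
    urllib.error.URLError and ssl.SSLError are both OSError subclasses.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last_error
```

For a real download, fetch could be something like lambda u: urllib.request.urlopen(u).read().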

The best way to analyse the HTML is to grab our dumps from: github.com/cirosantilli/cia-2010-websites-dump.

Some possibly interesting searches include:

list all HTML comments, maybe something spicy was left over:
```
git grep '<!--'
```

search for weird file extensions:

git ls-files | grep -Ev '\.(jpg|gif|html|txt|png|css|php|js|jar|cgi|htm|swf|ico|JPG|class|zip|sf)'

have a look at the largest folders:
```
ncdu
```
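
For the file extension search, a histogram can be easier to eyeball than a filtered list. A small sketch, assuming the input is a list of paths such as the output of git ls-files; extension_counts is a made-up name, and archived URLs with query strings are only handled in the simplest way:

```python
import posixpath
from collections import Counter


def extension_counts(paths):
    """Count file extensions (lowercased, without the dot) across paths."""
    counts = Counter()
    for path in paths:
        # Drop a trailing query string such as "uirl.php?ok=461128".
        name = posixpath.basename(path).split("?", 1)[0]
        ext = posixpath.splitext(name)[1].lower().lstrip(".")
        counts[ext or "(none)"] += 1
    return counts
```

Calling extension_counts(paths).most_common() then lists the rare extensions at the end.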

Some of the HTML files contain conditional comments e.g. web.archive.org/web/20091023041107/http://aquaswimming.com/ contains:

<!--[if IE 6]> <link href="swimstyleie6.css" rel="stylesheet" type="text/css"> <![endif]-->

Various non-English websites seem to have comments translating the content, e.g.:

./noticiasmusica.net/20101230165001/index.html:<h2>Alguns dos Melhores Sites Nacionais</h2><!--some of the best national sites (in music)-->

This feels like a translation left there to help the technical webdev team know what is what.

Many of the RSS frame pages use:

<base target="_blank" />

which is an unusual HTML tag that makes all links open in new tabs, e.g. web.archive.org/web/20110202124411/http://thecricketfan.com/home.html.
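
To find which pages use this tag, a plain text search works, but a parser is more tolerant of attribute order and spacing. A minimal sketch using Python's standard html.parser, where BaseTargetParser and base_targets are hypothetical names:

```python
from html.parser import HTMLParser


class BaseTargetParser(HTMLParser):
    """Record the target attribute of every <base> element."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <base ... /> also end up here by default.
        if tag == "base":
            target = dict(attrs).get("target")
            if target is not None:
                self.targets.append(target)


def base_targets(html):
    """Return the list of <base target=...> values found in the HTML."""
    parser = BaseTargetParser()
    parser.feed(html)
    return parser.targets
```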

Various websites have pages with .php extension. It feels likely that all websites were written in PHP.

Some sites use a feeds.php for the feeds, e.g. http://www.absolutebearing.net//absolutebearing_feeds/feeds.php?src=http%3A%2F%2Ffeeds2.feedburner.com%2FOceanyachtsinfo&desc=1

Some URLs existed with both .html and .php extensions, or were converted at some point:

allworldstatistics.com/20110207151941/comprehensivesources.html
allworldstatistics.com/20130818155225/comprehensivesources.php

A few of the PHP URLs have weird IDs in them like omktf, juqwt and qlaqft:

./middle-east-newstoday.com/20100829004127/omktf/uirl.php?ok=461128
./newsandsportscentral.com/20100327130237/juqwt/eubcek.php?pe=747155
./pondernews.net/20100826031745/lldwg/qlaqft.php?fc=281298

we wonder what they mean.
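
To look for more URLs of this shape, we can generalize from the three examples. This is purely a guess at the pattern: a five-letter directory, a four to six letter PHP file name, a two-letter parameter and a six-digit value. The names OPAQUE_PHP_RE and parse_opaque_php are invented for this sketch, and ordinary five-letter words will also match:

```python
import re

# Assumption: shape inferred from only three samples, e.g.
# /omktf/uirl.php?ok=461128
OPAQUE_PHP_RE = re.compile(r"/([a-z]{5})/([a-z]{4,6})\.php\?([a-z]{2})=(\d{6})$")


def parse_opaque_php(path):
    """Return (directory, file, parameter, value) or None if no match."""
    match = OPAQUE_PHP_RE.search(path)
    return match.groups() if match else None
```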

A few separate websites have an archive with the same pid parameter:

fightwithoutrules.com/20131220205811/?pid=2POQ7BC1G/index.html
half-court.net/20131223165013/?pid=2POQ7BC1G/index.html
health-men-today.com/20131223002237/?pid=2POQ7BC1G/index.html
intlnewsdaily.com/20131221121441/?pid=2POQ7BC1G/index.html
intoworldnews.com/20131217193621/?pid=2POQ7BC1G/index.html

It is unclear what it means. All of them contain something like:

<html>
<head>
<meta name="robots" content="noarchive" />
<meta name="googlebot" content="nosnippet" />
</head>
<body>
<div align=center>
<h3>Error. Page cannot be displayed. Please contact your service provider for more details.  (11)</h3>
</div>
</body>
</html>

so it looks like an archival artifact only.

The following two websites have a feeds.php system for their RSS:

./mydailynewsreport.com/20110211111053/myrss/feeds.php?src=http:/www.refahemelli.com/pashto/news/rss.php&chan=y&desc=1&targ=y&utf=y
./magneticfieldnews.com/20110208063545/magneticfeeds/feeds.php?src=http:/www.bbc.co.uk/pashto/index.xml&chan=y&desc=1&targ=y&utf=y
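
The src parameter tells us which upstream feed each site was republishing. A small sketch that extracts it with Python's urllib.parse; feed_source is a hypothetical helper, and the meaning of the other parameters (chan, desc, targ, utf) is not something we know:

```python
from urllib.parse import parse_qs, urlsplit


def feed_source(url):
    """Return the upstream feed URL from a feeds.php link, or None."""
    parts = urlsplit(url)
    if not parts.path.endswith("/feeds.php"):
        return None
    values = parse_qs(parts.query).get("src")
    return values[0] if values else None
```

Note that in the archived file names the source shows up as http:/ with a single slash, presumably due to path normalization when dumping, so the result may need fixing up before use.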

Some of the HTML uses attributes without quotes, which is legal, but very unusual nowadays:

soldiersofsouthasia.com/20110207203705/home.htm: <a href=http://www.rss-to-javascript.com
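
Unquoted attributes can be listed with a regex-based scan. This is a rough sketch rather than a real HTML parser: it assumes that tags contain no literal > inside attribute values, and the names TAG_RE, ATTR_RE and unquoted_attributes are made up for illustration:

```python
import re

# A start tag, assuming no ">" appears inside attribute values.
TAG_RE = re.compile(r"<[a-zA-Z][^<>]*>")
# name=value where the value is double quoted, single quoted, or bare.
# Only the bare (unquoted) form is captured in group 2.
ATTR_RE = re.compile(r"""([^\s"'<>/=]+)\s*=\s*(?:"[^"]*"|'[^']*'|([^\s"'<>`]+))""")


def unquoted_attributes(html):
    """Return (name, value) pairs for attributes written without quotes."""
    found = []
    for tag in TAG_RE.findall(html):
        for match in ATTR_RE.finditer(tag):
            if match.group(2) is not None:
                found.append((match.group(1), match.group(2)))
    return found
```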

We can try to search for any link leaks by listing all domains linked to with:

git grep -E --no-color -I -h --no-line -o 'https?://[^/">?]+[/">?]' | sed -r 's/.$//' | sort | uniq -c | sort -nk1
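
The same counting can be sketched in Python, which is handy if we want to post-process the result. The regex mirrors the shell one: scheme plus host, terminated by /, ", > or ?. The names LINK_RE and count_link_origins are hypothetical:

```python
import re
from collections import Counter

# Scheme and host, which must be followed by one of / " > ?
LINK_RE = re.compile(r'https?://[^/">?]+(?=[/">?])')


def count_link_origins(texts):
    """Return (origin, count) pairs sorted from least to most common."""
    counts = Counter()
    for text in texts:
        counts.update(LINK_RE.findall(text))
    return sorted(counts.items(), key=lambda item: item[1])
```

Here texts is any iterable of file contents, e.g. the decoded HTML files of the dump.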

The first thing that shows up is that there are some IPs linked to directly! But they seem to be the direct IPs of legitimate websites. We are not sure why IPs were used rather than domain names:

69.167.160.171 at web.archive.org/web/20110208053653/http://sa-michigan.com/ to web.archive.org/web/20100304122019/http://69.167.160.171/ marked with image "fantasyplayers.com", a legit website called Fantasy Players Network
69.94.11.53 at web.archive.org/web/20101229193800/http://newsresolution.net/ titled "International Tribunal for Rwanda" to web.archive.org/web/20101229193800/http://69.94.11.53/default.htm
74.125.77.132 mynepalnews.com Webalizer
194.165.154.66/index.php web.archive.org/web/20110129161937/http://icwb-news.com/ MiddleEast links to 194.165.154.66/index.php but that is an actual page: web.archive.org/web/20110529142501/http://194.165.154.66/index.php
200.55.6.87 at web.archive.org/web/20110128170204/http://noticiasdelmundolatino.com/ after clicking "Maps" tab entitled "Mapas en la red" to web.archive.org/web/20100329150648/http://200.55.6.87/es/index.htm
213.97.154.118 at web.archive.org/web/20120429042725/http://montanismoaventura.com/ entitled "Mallorca Verde" to web.archive.org/web/20120430191214/http://213.97.154.118/mallorcaverde/ The target is a bit weird and almost empty.
216.218.196.146 at entitled "AskTheDr.com" to web.archive.org/web/20070303080403/http://216.218.196.146/askthedr/index.htm

We can also get the full line for each match, sorted by least common domain first, with the following slow command:

git grep -E --no-color -I -h --no-line -o 'https?://[^/">?]+[/">?]' | sed -r 's/.$//' | sort | uniq -c | sort -nk1 | awk '{if ($1 < 10) print $2}' | xargs -I{} git --no-pager grep -h --no-line -o '{}.*<' | tee tmp.log

We can search for all IP-like strings with:

git grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'
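
The regex also matches impossible addresses such as 999.1.1.1. As a sketch of a stricter filter, we can validate each candidate with Python's ipaddress module. The names IP_LIKE_RE and find_ipv4 are made up, and four-part version strings such as 1.2.3.4 will still get through:

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not glued to other digits or dots.
IP_LIKE_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ipv4(text):
    """Return the syntactically valid IPv4 addresses found in text."""
    found = []
    for candidate in IP_LIKE_RE.findall(text):
        try:
            found.append(str(ipaddress.IPv4Address(candidate)))
        except ipaddress.AddressValueError:
            pass  # e.g. an octet above 255
    return found
```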

Binary files

 0  0

As per:

grep . */index.html | grep 'binary file matches'

a few of the HTMLs are interpreted by grep as being binary:

grep: china-destinations.org/index.html: binary file matches
grep: classicalmusicboxonline.com/index.html: binary file matches
grep: driversinternationalgolf.com/index.html: binary file matches
grep: familyhealthonline.net/index.html: binary file matches
grep: grubbersworldrugbynews.com/index.html: binary file matches
grep: hai-pow.com/index.html: binary file matches
grep: hi-tech-today.com/index.html: binary file matches
grep: networkofnews.com/index.html: binary file matches
grep: nigeriastar.net/index.html: binary file matches
grep: noticias-caracas.com/index.html: binary file matches
grep: theentertainbiz.com/index.html: binary file matches
grep: thefilmcentre.com/index.html: binary file matches
grep: theinternationalgoal.com/index.html: binary file matches
grep: wildbirds-seasia.com/index.html: binary file matches
grep: worldedgenews.com/index.html: binary file matches
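
To see why a given file trips grep, we can check for the two usual suspects: NUL bytes and bytes that are invalid in the expected encoding. This is a sketch under the assumption that grep runs in a UTF-8 locale; the exact heuristics depend on the grep version and locale, and looks_binary is a hypothetical name:

```python
def looks_binary(data, encoding="utf-8"):
    """Return a reason string if data would plausibly trip grep, else None."""
    if b"\x00" in data:
        return "NUL byte"
    try:
        data.decode(encoding)
    except UnicodeDecodeError as error:
        return f"invalid {encoding} at byte {error.start}"
    return None
```

It takes the raw bytes of a file, e.g. open(path, "rb").read(). Pages saved in a legacy encoding such as Latin-1 would show up as invalid UTF-8.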

HTML title element

 0  0

The discovery of a possible information leak in the HTML <title> of webofcheer.com, which is cryptically set as:

pg1c

motivated us to download all HTML and have a grep.

We started grepping with:

grep -ai '<title>' */index.html

and to just get the titles alone for visual inspection:

grep -ahi '<title>' */index.html | sed -r 's/^\s*<title>//;s/<\/title>.*//'

Some mildly interesting facts include:

opensourcenewstoday.com is titled just "Title":

opensourcenewstoday.com/index.html:<title>Title</title>

a few sites are titled "Untitled Document" e.g.:

media-coverage-now.com/index.html:<title>Untitled Document</title>
newsandsportscentral.com/index.html:  <title>Untitled Document</title>
newsincirculation.com/index.html:<title>Untitled Document</title>
newsworldsite.com/index.html:<title>Untitled Document</title>
primetimemovies.net/index.html:<title>Untitled Document</title>
unganadormundial.com/index.html:<title>Untitled Document</title>

This may have been the default title in Adobe Dreamweaver.

some others have empty title:

aeronet-news.com/index.html:<title></title>
al-rashidrealestate.com/index.html:             <title></title>
arabicnewsunfiltered.com/index.html:<title></title>
dailynewsandsports.com/index.html:<title></title>
electronictechreviews.com/index.html:<title></title>
indirectfreekick.com/index.html:<title></title>
iran-newslink-today.com/index.html:<title></title>
iraniangoals.com/index.html:<title></title>
kickitnews.com/index.html:<title></title>
mediocampodefutbol.com/index.html:<title></title>
middle-east-newstoday.com/index.html:      <title></title>
mygadgettech.com/index.html:<title></title>
sayaara-auto.com/index.html:<title></title>
techwatchtoday.com/index.html:<title></title>
the-open-book-online.com/index.html:<title></title>
thenewsofpakistan.com/index.html:<title></title>
theworld-news.net/index.html:<title></title>
todaysengineering.com/index.html:<title></title>
todaysnewsreports.net/index.html:<title></title>
worldnewsandent.com/index.html:<title></title>

some others are titled just "index" or a variant of it:

all-sport-headlines.com/index.html:<title>index</title>
europeannewsflash.com/index.html:<title>Index</title>
fgnl.net/index.html:<title>Index Page</title>
iraniangoalkicks.com/index.html:<title>index</title>
just-the-news.com/index.html:<title>index</title>
mide-news.com/index.html:<title>index</title>
mytravelopian.com/index.html:<title>Index</title>
noticiasdelmundolatino.com/index.html:<title>index</title>
pakcricketgrd.com/index.html:  <title>index</title>
pangawana.com/index.html:<title>index</title>
sportsnewsfinder.com/index.html:<title>index</title>
thenewseditor.com/index.html:<title>index</title>
turkishnewslinks.com/index.html:<title>index2</title>
wahidfutbol.com/index.html:<title>index</title>
webscooper.com/index.html:<title>index</title>
webworldsports.com/index.html:<title>index</title>

a few don't have <title> at all:

b2bworldglobal.com/index.html
bailandstump.com/index.html
businessexchangetoday.com/index.html
commercialspacedesign.com/index.html
court-masters.com/index.html
flyingtimeline.com/index.html
marketflows.net/index.html
nouvellesetdesrapports.com/index.html
senderosdemontana.com/index.html
sixty2media.com/index.htm

It is impossible to tell if these were oversights, or intentional to simulate common web development quirks. But they are cute in any case.
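
The buckets above can be reproduced automatically. A minimal sketch in Python, where classify_title and its category names are our own invention and the regex only handles a simple, single <title> element:

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def classify_title(html):
    """Classify a page as missing, empty, placeholder, index or other."""
    match = TITLE_RE.search(html)
    if match is None:
        return "missing"
    title = match.group(1).strip()
    if not title:
        return "empty"
    if title.lower() in {"title", "untitled document"}:
        return "placeholder"
    # "index", "Index", "index2", "Index Page"
    if re.fullmatch(r"index\d*( page)?", title, re.IGNORECASE):
        return "index"
    return "other"
```

Anything classified as "other" still needs visual inspection, which is how oddities like pg1c get noticed.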

Adobe Dreamweaver JS functions

 0  0

Many of the files appear to contain JavaScript functions in a format generated by Adobe Dreamweaver, making it almost certain that at least some of the websites were developed in that editor. This was first pointed out by Reddit user sq00q. Also note that the username spells "boobs" upside down in leet.

For example, starwarsweb.net contains the following four functions, each with a version comment. Funnily, they first appear commented out:

function MM_swapImgRestore() { //v3.0
function MM_preloadImages() { //v3.0
function MM_findObj(n, d) { //v4.01
function MM_swapImage() { //v3.0

and then they are repeated on a body onload. Here MM_ stands for Macromedia, and is mentioned e.g. at:

Doing:

git grep MM_swapImage | sed -r 's/\/.*//' | sort -u | wc

on github.com/cirosantilli/cia-2010-websites-dump currently gives 64 hits out of 421 websites.
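
For reference, the same per-site count can be sketched in Python. It assumes the dump layout of one top-level directory per website, and unlike git grep it looks at all files on disk rather than only tracked ones. The name sites_containing is hypothetical:

```python
from pathlib import Path


def sites_containing(root, needle=b"MM_swapImage"):
    """Return sorted top-level directory names with any file containing needle."""
    hits = set()
    for site_dir in Path(root).iterdir():
        if not site_dir.is_dir() or site_dir.name.startswith("."):
            continue  # skip loose files and hidden directories such as .git
        for path in site_dir.rglob("*"):
            if path.is_file() and needle in path.read_bytes():
                hits.add(site_dir.name)
                break  # one hit per site is enough
    return sorted(hits)
```

Then len(sites_containing(".")) run from the root of the dump should give a number comparable to the shell pipeline.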

The same signatures are also found on the early websites from 2004 such as alljohnny.com.

The approximate version history is:
