The HTML of the index pages, taken from the Wayback Machine, was:
- dumped at: github.com/cirosantilli/media/tree/master/cia-2010-covert-communication-websites/html
- downloaded with: github.com/cirosantilli/media/tree/master/cia-2010-covert-communication-websites/download-html.sh. Note that there were many spurious errors, notably:
OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to web.archive.org:443
We just ran the script multiple times until all errors were gone.
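A minimal sketch of that retry loop, assuming download-html.sh skips files that were already downloaded and exits non-zero when any download fails (neither is guaranteed by the actual script):
# keep re-running the downloader until a run completes without errors
while ! ./download-html.sh; do
  echo 'some downloads failed, retrying...' >&2
  sleep 5
done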
The best way to analyse the HTML is to grab our dumps from: github.com/cirosantilli/cia-2010-websites-dump.
Some possibly interesting searches include:
Some of the HTML files contain conditional comments e.g. web.archive.org/web/20091023041107/http://aquaswimming.com/ contains:
<!--[if IE 6]> <link href="swimstyleie6.css" rel="stylesheet" type="text/css"> <![endif]-->
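A sketch of a search that surveys which conditional comments appear across the dump (run inside the github.com/cirosantilli/cia-2010-websites-dump checkout; the exact comment syntax may vary slightly between sites):
git grep -h -o -E '<!--\[if [^]]*\]>' | sort | uniq -c | sort -rn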
Various of the non-English websites contain HTML comments translating the content into English, e.g.:
./noticiasmusica.net/20101230165001/index.html:<h2>Alguns dos Melhores Sites Nacionais</h2><!--some of the best national sites (in music)-->
This feels like it could be translations helping the technical webdev team know what is what.
Many of the RSS frame pages use:
<base target="_blank" />
which is an unusual HTML tag that makes every link on the page open in a new tab, e.g. web.archive.org/web/20110202124411/http://thecricketfan.com/home.html.
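A sketch counting which websites use the tag, assuming the dump's domain-per-directory layout:
git grep -l '<base target="_blank"' | sed 's|/.*||' | sort | uniq -c | sort -rn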
Various websites have pages with the .php extension. It feels likely that all the websites were written in PHP.
Some sites use a feeds.php script for the feeds, e.g.:
http://www.absolutebearing.net//absolutebearing_feeds/feeds.php?src=http%3A%2F%2Ffeeds2.feedburner.com%2FOceanyachtsinfo&desc=1
Some URLs existed with both the .html and .php extensions, or were converted at some point:
allworldstatistics.com/20110207151941/comprehensivesources.html
allworldstatistics.com/20130818155225/comprehensivesources.php
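A sketch of a search for more such pairs, assuming the dump's domain/timestamp/page layout and ignoring the snapshot timestamp (file names containing spaces would confuse the awk step):
find . -name '*.html' -o -name '*.php' |
  sed -r 's|^\./([^/]+)/[0-9]+/|\1/|' |
  sed -r 's/\.(html|php)$/ \1/' |
  sort -u | awk '{print $1}' | uniq -d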
A few of the PHP URLs have weird IDs in them, such as omktf, juqwt and qlaqft; we wonder what they mean:
./middle-east-newstoday.com/20100829004127/omktf/uirl.php?ok=461128
./newsandsportscentral.com/20100327130237/juqwt/eubcek.php?pe=747155
./pondernews.net/20100826031745/lldwg/qlaqft.php?fc=281298
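A sketch of a search for more such IDs over the downloaded file names, assuming the pattern is a short random-looking directory, a short PHP file name, and a two-letter numeric query parameter (wget keeps the query string in the file name):
find . -regextype posix-extended -regex '.*/[a-z]{4,7}/[a-z]{4,7}\.php\?[a-z]{2}=[0-9]+'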
A few separate websites have an archive with the same pid parameter; it is unclear what it means:
fightwithoutrules.com/20131220205811/?pid=2POQ7BC1G/index.html
half-court.net/20131223165013/?pid=2POQ7BC1G/index.html
health-men-today.com/20131223002237/?pid=2POQ7BC1G/index.html
intlnewsdaily.com/20131221121441/?pid=2POQ7BC1G/index.html
intoworldnews.com/20131217193621/?pid=2POQ7BC1G/index.html
All of them contain something like:
<html>
<head>
<meta name="robots" content="noarchive" />
<meta name="googlebot" content="nosnippet" />
</head>
<body>
<div align=center>
<h3>Error. Page cannot be displayed. Please contact your service provider for more details. (11)</h3>
</div>
</body>
</html>
So this looks like an archival artifact only.
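A sketch listing every such snapshot directory in the dump (the backslash stops find from treating ? as a glob character):
find . -type d -name '\?pid=*'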
The following two websites have a feeds.php system for their RSS:
./mydailynewsreport.com/20110211111053/myrss/feeds.php?src=http:/www.refahemelli.com/pashto/news/rss.php&chan=y&desc=1&targ=y&utf=y
./magneticfieldnews.com/20110208063545/magneticfeeds/feeds.php?src=http:/www.bbc.co.uk/pashto/index.xml&chan=y&desc=1&targ=y&utf=y
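A sketch listing the src= targets of all such feeds.php proxies across the dump, decoding the two URL escapes that actually occur (%3A and %2F):
git grep -h -o -E 'feeds\.php\?src=[^&"]+' | sed 's/.*src=//; s/%3A/:/g; s|%2F|/|g' | sort -u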
Some of the HTML uses attributes without quotes, which is legal, but very unusual nowadays:
soldiersofsouthasia.com/20110207203705/home.htm: <a href=http://www.rss-to-javascript.com
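A sketch of a search for more unquoted href values (any value after href= that does not start with a quote character):
git grep -I -h -o -E "<a href=[^\"' >][^ >]*" | sort | uniq -c | sort -rn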
We can try to search for any link leaks by listing all domains linked to with:
git grep --no-color -I -h --no-line-number -o -E 'https?://[^/">?]+[/">?]' | sed -r 's/.$//' | sort | uniq -c | sort -nk1
The first thing that shows up is that there are some IPs linked to directly! But they seem to be the direct IPs of legitimate websites; we are not sure why IPs were used rather than domain names:
- 69.167.160.171 at web.archive.org/web/20110208053653/http://sa-michigan.com/ to web.archive.org/web/20100304122019/http://69.167.160.171/ marked with image "fantasyplayers.com", a legit website called Fantasy Players Network
- 69.94.11.53 at web.archive.org/web/20101229193800/http://newsresolution.net/ titled "International Tribunal for Rwanda" to web.archive.org/web/20101229193800/http://69.94.11.53/default.htm
- 74.125.77.132 mynepalnews.com Webalizer
- 194.165.154.66: web.archive.org/web/20110129161937/http://icwb-news.com/ links to 194.165.154.66/index.php under "MiddleEast", but that is an actual page: web.archive.org/web/20110529142501/http://194.165.154.66/index.php
- 200.55.6.87 at web.archive.org/web/20110128170204/http://noticiasdelmundolatino.com/ after clicking the "Maps" tab, entitled "Mapas en la red" ("Maps on the net"), to web.archive.org/web/20100329150648/http://200.55.6.87/es/index.htm
- 213.97.154.118 at web.archive.org/web/20120429042725/http://montanismoaventura.com/ entitled "Mallorca Verde" to web.archive.org/web/20120430191214/http://213.97.154.118/mallorcaverde/ The target is a bit weird and almost empty.
- 216.218.196.146, via a link entitled "AskTheDr.com", to web.archive.org/web/20070303080403/http://216.218.196.146/askthedr/index.htm
We can also get the full line for each match, sorted by least common domain, with the following slow command:
git grep --no-color -I -h --no-line-number -o -E 'https?://[^/">?]+[/">?]' | sed -r 's/.$//' | sort | uniq -c | sort -nk1 | awk '{if ($1 < 10) print $2}' | xargs -I{} git --no-pager grep -h --no-line-number -o '{}.*<' | tee tmp.log
We can search for all IP-like strings with:
git grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'
As per the following command, a few of the HTML files are interpreted by grep as being binary:
grep . */index.html 2>&1 | grep 'binary file matches'
grep: china-destinations.org/index.html: binary file matches
grep: classicalmusicboxonline.com/index.html: binary file matches
grep: driversinternationalgolf.com/index.html: binary file matches
grep: familyhealthonline.net/index.html: binary file matches
grep: grubbersworldrugbynews.com/index.html: binary file matches
grep: hai-pow.com/index.html: binary file matches
grep: hi-tech-today.com/index.html: binary file matches
grep: networkofnews.com/index.html: binary file matches
grep: nigeriastar.net/index.html: binary file matches
grep: noticias-caracas.com/index.html: binary file matches
grep: theentertainbiz.com/index.html: binary file matches
grep: thefilmcentre.com/index.html: binary file matches
grep: theinternationalgoal.com/index.html: binary file matches
grep: wildbirds-seasia.com/index.html: binary file matches
grep: worldedgenews.com/index.html: binary file matches
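What triggers the binary detection is presumably stray control bytes; a sketch locating them in one of the files (assumes GNU grep built with PCRE, i.e. -P support):
grep -aobP '[\x00-\x08\x0e-\x1f]' china-destinations.org/index.html | head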
The discovery of a possible information leak in the HTML <title> of webofcheer.com, which is cryptically set to pg1c, motivated us to download all the HTML and grep through it.
We started grepping with:
grep -ai '<title>' */index.html
and, to get just the titles alone for visual inspection:
grep -ahi '<title>' */index.html | sed -r 's/^\s*<title>//;s/<\/title>.*//'
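A sketch extending this to count duplicate titles across sites, which makes shared templates stand out:
grep -ahi '<title>' */index.html | sed -r 's/^\s*<title>//;s/<\/title>.*//' | sort | uniq -c | sort -rn | head -n 20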
Some mildly interesting facts follow. It is impossible to tell if these were oversights, or intentional touches to simulate common web development quirks, but they are cute in any case:
- opensourcenewstoday.com is titled just "Title":
opensourcenewstoday.com/index.html:<title>Title</title>
- a few sites are titled "Untitled Document", which may have been the default title in Adobe Dreamweaver, e.g.:
media-coverage-now.com/index.html:<title>Untitled Document</title>
newsandsportscentral.com/index.html: <title>Untitled Document</title>
newsincirculation.com/index.html:<title>Untitled Document</title>
newsworldsite.com/index.html:<title>Untitled Document</title>
primetimemovies.net/index.html:<title>Untitled Document</title>
unganadormundial.com/index.html:<title>Untitled Document</title>
- some others have an empty title:
aeronet-news.com/index.html:<title></title>
al-rashidrealestate.com/index.html: <title></title>
arabicnewsunfiltered.com/index.html:<title></title>
dailynewsandsports.com/index.html:<title></title>
electronictechreviews.com/index.html:<title></title>
indirectfreekick.com/index.html:<title></title>
iran-newslink-today.com/index.html:<title></title>
iraniangoals.com/index.html:<title></title>
kickitnews.com/index.html:<title></title>
mediocampodefutbol.com/index.html:<title></title>
middle-east-newstoday.com/index.html: <title></title>
mygadgettech.com/index.html:<title></title>
sayaara-auto.com/index.html:<title></title>
techwatchtoday.com/index.html:<title></title>
the-open-book-online.com/index.html:<title></title>
thenewsofpakistan.com/index.html:<title></title>
theworld-news.net/index.html:<title></title>
todaysengineering.com/index.html:<title></title>
todaysnewsreports.net/index.html:<title></title>
worldnewsandent.com/index.html:<title></title>
- some others are titled just "index" or a variant of it:
all-sport-headlines.com/index.html:<title>index</title>
europeannewsflash.com/index.html:<title>Index</title>
fgnl.net/index.html:<title>Index Page</title>
iraniangoalkicks.com/index.html:<title>index</title>
just-the-news.com/index.html:<title>index</title>
mide-news.com/index.html:<title>index</title>
mytravelopian.com/index.html:<title>Index</title>
noticiasdelmundolatino.com/index.html:<title>index</title>
pakcricketgrd.com/index.html: <title>index</title>
pangawana.com/index.html:<title>index</title>
sportsnewsfinder.com/index.html:<title>index</title>
thenewseditor.com/index.html:<title>index</title>
turkishnewslinks.com/index.html:<title>index2</title>
wahidfutbol.com/index.html:<title>index</title>
webscooper.com/index.html:<title>index</title>
webworldsports.com/index.html:<title>index</title>
- a few don't have a <title> at all:
b2bworldglobal.com/index.html
bailandstump.com/index.html
businessexchangetoday.com/index.html
commercialspacedesign.com/index.html
court-masters.com/index.html
flyingtimeline.com/index.html
marketflows.net/index.html
nouvellesetdesrapports.com/index.html
senderosdemontana.com/index.html
sixty2media.com/index.htm
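That last list can be regenerated with grep -L, which prints the files that do not match; note that sixty2media.com uses index.htm, which this glob misses:
grep -aiL '<title>' */index.html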
Many of the files appear to contain JavaScript functions in a format generated by Adobe Dreamweaver, making it almost certain that at least some of the websites were developed in that editor. This was first pointed out by Reddit user sq00q. Also note that the username spells "boobs" upside down in leet.
For example, starwarsweb.net contains the following four functions, first commented out (which is funny) and carrying version comments:
function MM_swapImgRestore() { //v3.0
function MM_preloadImages() { //v3.0
function MM_findObj(n, d) { //v4.01
function MM_swapImage() { //v3.0
and then repeated on a body onload. Here MM_ stands for Macromedia. Doing:
git grep MM_swapImage | sed -r 's/\/.*//' | sort -u | wc -l
on github.com/cirosantilli/cia-2010-websites-dump currently gives 64 hits out of the 421 websites.
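A sketch broadening the count to tally each Dreamweaver MM_ helper individually:
git grep -h -o -E 'function MM_[A-Za-z]+' | sort | uniq -c | sort -rn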
The approximate version history is: