
= HTML analysis

The HTML of the index pages archived on the Wayback Machine was:
* dumped at: https://github.com/cirosantilli/media/tree/master/cia-2010-covert-communication-websites/html[]
* downloaded with: https://github.com/cirosantilli/media/tree/master/cia-2010-covert-communication-websites/download-html.sh[]. Note that there were many spurious errors, notably:
  > OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to web.archive.org:443
  we just ran it multiple times until all errors were gone.
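
The "run it again until it works" approach can be automated with a small retry wrapper. This is only a sketch: `retry` and `RETRY_DELAY` are hypothetical names, and it assumes that the wrapped command exits with a non-zero status on failure, which we have not verified for `download-html.sh`:

```shell
# Hypothetical helper: re-run a command until it exits with status 0,
# giving up after max_tries attempts.
# Usage: retry MAX_TRIES COMMAND [ARGS...]
retry() {
  max_tries=$1
  shift
  try=1
  while ! "$@"; do
    if [ "$try" -ge "$max_tries" ]; then
      return 1
    fi
    try=$((try + 1))
    # Pause between attempts; RETRY_DELAY is an assumed knob, default 1 second.
    sleep "${RETRY_DELAY:-1}"
  done
}
```

For example, `retry 10 ./download-html.sh` would stop after the first successful run or after 10 failed ones.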

The best way to analyse the HTML is to grab our dumps from: https://github.com/cirosantilli/cia-2010-websites-dump[].

Some possibly interesting searches include:
* list all HTML comments, maybe something spicy was left over:
  ``
  git grep '<!--'
  ``
* search for weird file extensions:
  ``
  git ls-files | grep -Ev '\.(jpg|gif|html|txt|png|css|php|js|jar|cgi|htm|swf|ico|JPG|class|zip|sf)'
  ``
* have a look at the largest folders:
  ``
  ncdu
  ``
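
As a complement to the extension search above, a histogram of extensions can make rare ones stand out. The following is a minimal sketch: `ext_histogram` is a hypothetical helper that reads paths on stdin and treats whatever follows the last dot of the last path component as the extension, so archived URLs with query strings will produce noisy entries:

```shell
# Hypothetical helper: count paths per extension, least common first.
# Paths whose last component has no dot are ignored.
ext_histogram() {
  sed -n 's/.*\.\([^./]*\)$/\1/p' | sort | uniq -c | sort -n
}
```

It could be run from the root of the dump as `git ls-files | ext_histogram`.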

Some of the HTML files contain <#conditional comments> e.g. https://web.archive.org/web/20091023041107/http://aquaswimming.com/ contains:
``
<!--[if IE 6]> <link href="swimstyleie6.css" rel="stylesheet" type="text/css"> <![endif]-->
``

Several of the non-English websites seem to have comments translating the content, e.g.:
``
./noticiasmusica.net/20101230165001/index.html:<h2>Alguns dos Melhores Sites Nacionais</h2><!--some of the best national sites (in music)-->
``
This feels like it could be a translation added to help the technical web development team know what is what.
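
To review such short comments in bulk, we can print only the comment bodies rather than the full lines. A minimal sketch, where `single_line_comments` is a hypothetical helper: it only catches comments that fit on one line and contain no `>`, which conveniently skips conditional comments but also misses commented-out markup:

```shell
# Hypothetical helper: print only single-line HTML comments that
# contain no '>' character, one per output line.
single_line_comments() {
  grep -o '<!--[^>]*-->'
}
```

For example: `git grep -h -I '<!--' | single_line_comments | sort | uniq -c | sort -n`.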

Many of the RSS frame pages use:
``
<base target="_blank" />
``
which is an unusual HTML element that makes all links on the page open in a new tab or window by default, e.g. https://web.archive.org/web/20110202124411/http://thecricketfan.com/home.html[].

Various websites have pages with a .php extension, so it feels likely that all websites were written in <PHP>.

Some sites use a `feeds.php` for the feeds, e.g. https://web.archive.org/web/20101231174008/http://www.absolutebearing.net//absolutebearing_feeds/feeds.php?src=http%3A%2F%2Ffeeds2.feedburner.com%2FOceanyachtsinfo&desc=1[\http://www.absolutebearing.net//absolutebearing_feeds/feeds.php?src=http%3A%2F%2Ffeeds2.feedburner.com%2FOceanyachtsinfo&desc=1]

Some URLs existed with both .html and .php extensions, or were converted at some point:
``
allworldstatistics.com/20110207151941/comprehensivesources.html
allworldstatistics.com/20130818155225/comprehensivesources.php
``
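
Such pairs can be searched for systematically. The sketch below assumes that paths in the dump look like `site/timestamp/page.ext` with a 14-digit Wayback timestamp, as in the two examples above; `both_extensions` is a hypothetical helper name, and paths containing spaces are not handled:

```shell
# Hypothetical helper: read paths on stdin and print every site/page
# that was archived with both a .html and a .php extension.
both_extensions() {
  # site/timestamp/page.ext -> "site/page ext"
  sed -n -E 's#^(\./)?([^/]+)/[0-9]{14}/(.*)\.(html|php)$#\2/\3 \4#p' |
    # Keep one line per distinct (page, extension) pair.
    sort -u |
    # A page seen with exactly two distinct extensions has both.
    awk '{ n[$1]++ } END { for (k in n) if (n[k] == 2) print k }' |
    sort
}
```

It could be run as `git ls-files | both_extensions`.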

A few of the PHP URLs have weird IDs in them, like `omktf`, `juqwt` and `qlaqft`:
``
./middle-east-newstoday.com/20100829004127/omktf/uirl.php?ok=461128
./newsandsportscentral.com/20100327130237/juqwt/eubcek.php?pe=747155
./pondernews.net/20100826031745/lldwg/qlaqft.php?fc=281298
``
we wonder what they mean.
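
To look for more URLs of the same shape, we can generalize the three examples into a pattern. This is a guess based only on those three paths (a 5 to 6 letter directory, a 4 to 6 letter PHP file, a 2 letter parameter and a 6 digit value), and `odd_php_urls` is a hypothetical helper name:

```shell
# Hypothetical helper: keep only paths that end in something like
# /omktf/uirl.php?ok=461128
odd_php_urls() {
  grep -E '/[a-z]{5,6}/[a-z]{4,6}\.php\?[a-z]{2}=[0-9]{6}$'
}
```

For example: `git ls-files | odd_php_urls`. The length ranges are deliberately tight and may need loosening if other variants exist.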

A few separate websites have an archive with the same `pid` parameter:
``
fightwithoutrules.com/20131220205811/?pid=2POQ7BC1G/index.html
half-court.net/20131223165013/?pid=2POQ7BC1G/index.html
health-men-today.com/20131223002237/?pid=2POQ7BC1G/index.html
intlnewsdaily.com/20131221121441/?pid=2POQ7BC1G/index.html
intoworldnews.com/20131217193621/?pid=2POQ7BC1G/index.html
``
It is unclear what it means. All of them contain something like:
``
<html>
<head>
<meta name="robots" content="noarchive" />
<meta name="googlebot" content="nosnippet" />
</head>
<body>
<div align=center>
<h3>Error. Page cannot be displayed. Please contact your service provider for more details.  (11)</h3>
</div>
</body>
</html>
``
so this looks like an archival artifact only.
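
To check whether other `pid` values exist and which sites share them, the paths can be grouped by that parameter. A minimal sketch, assuming the `site/timestamp/?pid=VALUE/...` layout seen above; `pid_sites` is a hypothetical helper name:

```shell
# Hypothetical helper: read paths on stdin and print "PID SITE" pairs,
# sorted so that sites sharing a pid end up on adjacent lines.
pid_sites() {
  sed -n -E 's#^(\./)?([^/]+)/[0-9]{14}/\?pid=([^/&]+).*$#\3 \2#p' | sort
}
```

For example: `git ls-files | pid_sites`.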

The following two websites have a `feeds.php` system for their RSS:
``
./mydailynewsreport.com/20110211111053/myrss/feeds.php?src=http:/www.refahemelli.com/pashto/news/rss.php&chan=y&desc=1&targ=y&utf=y
./magneticfieldnews.com/20110208063545/magneticfeeds/feeds.php?src=http:/www.bbc.co.uk/pashto/index.xml&chan=y&desc=1&targ=y&utf=y
``
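
The upstream feeds pulled in by these `feeds.php` pages can be listed by extracting the `src` parameter. A minimal sketch, with `feed_sources` as a hypothetical helper name; it does not percent-decode values, so encoded sources such as `http%3A%2F%2F...` are printed as-is:

```shell
# Hypothetical helper: read paths or URLs on stdin and print the
# distinct values of the src= parameter of feeds.php.
feed_sources() {
  sed -n 's/.*feeds\.php?src=\([^&]*\).*/\1/p' | sort -u
}
```

For example: `git ls-files | feed_sources`.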

Some of the HTML uses attributes without quotes, which is legal, but very unusual nowadays:
``
soldiersofsouthasia.com/20110207203705/home.htm: <a href=http://www.rss-to-javascript.com
``
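
Unquoted `href` values can be listed across the dump with a pattern such as the one below. This is a sketch that only looks at `href`, not other attributes, and `unquoted_hrefs` is a hypothetical helper name:

```shell
# Hypothetical helper: print href attributes whose value does not
# start with a single or double quote.
unquoted_hrefs() {
  grep -E -i -o "href=[^\"' >]+"
}
```

For example: `git grep -h -I -i 'href=' | unquoted_hrefs | sort | uniq -c | sort -n`.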

We can try to search for any link leaks by listing all domains linked to with:
``
git grep -E --no-color -I -h --no-line -o 'https?://[^/">?]+[/">?]' | sed -r 's/.$//' | sort | uniq -c | sort -nk1
``
The first thing that shows up is that some IPs are linked to directly! They seem to be the direct IPs of legitimate websites, but we are not sure why IPs were used rather than domain names:
* http://69.167.160.171 at https://web.archive.org/web/20110208053653/http://sa-michigan.com/ to https://web.archive.org/web/20100304122019/http://69.167.160.171/ marked with image "fantasyplayers.com", a legit website called Fantasy Players Network
* http://69.94.11.53 at https://web.archive.org/web/20101229193800/http://newsresolution.net/ titled "International Tribunal for Rwanda" to https://web.archive.org/web/20101229193800/http://69.94.11.53/default.htm
* http://74.125.77.132 mynepalnews.com Webalizer
* http://194.165.154.66/index.php https://web.archive.org/web/20110129161937/http://icwb-news.com/ MiddleEast links to 194.165.154.66/index.php but that is an actual page: https://web.archive.org/web/20110529142501/http://194.165.154.66/index.php
* http://200.55.6.87 at https://web.archive.org/web/20110128170204/http://noticiasdelmundolatino.com/ after clicking "Maps" tab entitled "Mapas en la red" to https://web.archive.org/web/20100329150648/http://200.55.6.87/es/index.htm
* http://213.97.154.118 at https://web.archive.org/web/20120429042725/http://montanismoaventura.com/ entitled "Mallorca Verde" to https://web.archive.org/web/20120430191214/http://213.97.154.118/mallorcaverde/ The target is a bit weird and almost empty.
* http://216.218.196.146 at entitled "AskTheDr.com" to https://web.archive.org/web/20070303080403/http://216.218.196.146/askthedr/index.htm

We can also get the full line for each link, sorted by least common domains, with the slow:
``
git grep -E --no-color -I -h --no-line -o 'https?://[^/">?]+[/">?]' | sed -r 's/.$//' | sort | uniq -c | sort -nk1 | awk '{if ($1 < 10) print $2}' | xargs -I{} git --no-pager grep -h --no-line -o '{}.*<' | tee tmp.log
``

We can search for all IP-like strings with:
``
git grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'
``
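
The pattern above also matches strings such as `999.1.1.1`. A possible refinement is to extract only the candidates and drop those with an octet above 255. This is a sketch: `valid_ipv4s` is a hypothetical helper name, and dotted version numbers like `1.2.3.4` will still get through:

```shell
# Hypothetical helper: read text on stdin and print the distinct
# IPv4-looking strings whose four octets are all <= 255.
valid_ipv4s() {
  grep -E -o '[0-9]{1,3}(\.[0-9]{1,3}){3}' |
    awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255' |
    sort -u
}
```

For example: `git grep -h -I -E '[0-9]{1,3}(\.[0-9]{1,3}){3}' | valid_ipv4s`.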