{tag=/Wayback Machine}

D'oh.

But to be serious. The </Wayback Machine> contains a very large proportion of all sites. It does sometimes happen that a Wayback Machine archive is missing or broken and <cqcounter> has the screenshot, but the Wayback Machine is still the most complete database we have found so far. Some archives are very broken, but those are rare.

The only problem with the </Wayback Machine> is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: <Wayback Machine CDX scanning>.
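As a rough illustration of what a single-domain CDX query looks like, here is a minimal Python sketch against the public CDX endpoint at `web.archive.org/cdx/search/cdx`. The helper names (`cdx_url`, `parse_cdx_json`, `fetch_captures`) and the choice of fields are just assumptions for this example, not what our scripts actually do, and `example.com` is a placeholder domain.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(domain, fields=("timestamp", "original", "statuscode"), limit=10):
    """Build a CDX query URL for one known domain (hypothetical helper)."""
    params = {
        "url": domain,            # the domain we already have in hand
        "output": "json",         # ask for JSON instead of space-separated text
        "fl": ",".join(fields),   # which CDX fields to return
        "limit": str(limit),      # cap the number of captures returned
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_cdx_json(text):
    """Turn CDX JSON (first row = field names) into a list of dicts."""
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def fetch_captures(domain, **kwargs):
    """Query the CDX server over the network and return parsed captures."""
    with urllib.request.urlopen(cdx_url(domain, **kwargs), timeout=30) as resp:
        return parse_cdx_json(resp.read().decode("utf-8"))
```

Calling `fetch_captures("example.com")` would list a few captures of that one domain. Note that the query is keyed on the `url` parameter, which is exactly the limitation described above: there is no way to ask for "all domains whose archived pages look like X".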

The <Common Crawl> project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.

CDX + <2013 DNS Census> + heuristics has been fruitful however.

We have dumped all Wayback Machine archives of known websites to: https://github.com/cirosantilli/cia-2010-websites-dump using \a[../cia-2010-covert-communication-websites/download-websites.sh]. This allows for better grepping and serves as a backup in case they ever go down.

