= Wayback Machine
D'oh.
But to be serious: the </Wayback Machine> contains a very large proportion of all websites. It is the most complete database we have found so far. Some archives are badly broken, but those are rare.
The only problem with the </Wayback Machine> is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: <Wayback Machine CDX scanning>.
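To make the "domain in hand" constraint concrete, here is a minimal sketch of a single-domain CDX query in Python. It assumes the public CDX server endpoint at `https://web.archive.org/cdx/search/cdx` and its `url`, `matchType`, `output`, `fl` and `limit` parameters. The helper names (`build_cdx_url`, `parse_cdx_json`, `query_cdx`) and the chosen field list are ours for illustration, not part of any official client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public CDX server endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=10):
    """Build a CDX query URL for one known domain (hypothetical helper)."""
    params = {
        "url": domain,            # the domain we already have in hand
        "matchType": "domain",    # include subdomains of that domain
        "output": "json",         # JSON array of rows, first row is the header
        "fl": "timestamp,original,statuscode",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(text):
    """Turn a CDX JSON response into a list of dicts keyed by field name."""
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def query_cdx(domain, limit=10):
    """Fetch and parse captures for one domain. Requires network access."""
    with urlopen(build_cdx_url(domain, limit), timeout=30) as response:
        return parse_cdx_json(response.read().decode("utf-8"))
```

Calling `query_cdx("example.com")` would return one dict per capture. Note that every parameter combination still starts from a specific `url`: there is no parameter here that searches across all domains, which is exactly the limitation described above.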
The <Common Crawl> project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.
However, CDX + <2013 DNS Census> + heuristics has been fruitful.