Source: /cirosantilli/cia-2010-covert-communication-websites/wayback-machine-cdx-scanning

= Wayback Machine CDX scanning

The Wayback Machine has an endpoint for querying crawled pages called the CDX server. It is documented at: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md[].

This makes it possible to filter down tens of thousands of candidate domains in a few hours, but hundreds of thousands would be too many. The bottleneck is that you have to query exactly one URL at a time, and the server possibly rate limits IPs. We have seen no IP blacklisting so far after several hours of scanning, however, so it's not that bad.
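The one-URL-at-a-time workflow can be sketched in Python as follows. This is only an illustrative sketch, not the contents of `cdx.sh`: the function names, the one second pause between requests and the choice of returned fields are assumptions made for the example. The `url`, `output`, `limit`, `fl`, `from` and `to` parameters are the ones described in the CDX server README.

```python
import json
import time
import urllib.parse
import urllib.request

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_url(domain, limit=5, date_from=None, date_to=None):
    """Build the CDX query URL for a single domain."""
    params = {
        "url": domain,
        "output": "json",
        "limit": str(limit),
        # Fields to return; this selection is an assumption of the sketch.
        "fl": "timestamp,original,statuscode",
    }
    # Optional timestamp bounds, e.g. "2010" or "20100615".
    if date_from is not None:
        params["from"] = date_from
    if date_to is not None:
        params["to"] = date_to
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_cdx_json(text):
    """Convert output=json text (header row, then data rows) to dicts."""
    text = text.strip()
    if not text:
        return []
    rows = json.loads(text)
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def fetch(url):
    """Download one URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def scan(domains, fetch=fetch, delay=1.0):
    """Query each domain in turn, sleeping between requests.

    The delay is a guess meant to stay clear of possible rate limits.
    """
    results = {}
    for i, domain in enumerate(domains):
        if i > 0:
            time.sleep(delay)
        results[domain] = parse_cdx_json(fetch(cdx_url(domain)))
    return results
```

Calling `scan` on a list of candidate domains returns a dictionary that maps each domain to its archived captures, so domains with an empty list were never crawled and can be dropped from further inspection.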

Once you have a heuristic to narrow down the candidate domains, you can use this helper: \a[cia-2010-covert-communication-websites/cdx.sh] to drill them down from tens of thousands to hundreds or thousands.

We then post-process the results of cdx.sh with \a[cia-2010-covert-communication-websites/cdx-post.sh] to drill them down from thousands to dozens, and manually inspect everything.

From then on, you can just manually inspect for hits in your browser.