 Wayback Machine CDX scanning

ID: cia-2010-covert-communication-websites/wayback-machine-cdx-scanning



Ciro Santilli 37 Updated 2025-07-16

The Wayback Machine has an endpoint for querying crawled pages, called the CDX server. It is documented at: github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.

This makes it possible to filter tens of thousands of candidate domains in a few hours, but hundreds of thousands would be too many. The bottleneck is that you have to query exactly one URL at a time, and the server possibly rate limits IPs. There has been no IP blacklisting so far after several hours of querying, so it's not that bad.
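As a minimal sketch of what a single-URL query looks like, the Python below builds a CDX request and parses its JSON output. The `url`, `output`, `fl`, `from`, `to` and `limit` parameters come from the CDX server README. The function names, the 2010 to 2013 date range and the chosen fields are illustrative assumptions, not something this article specifies.

```python
import json
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, year_from="2010", year_to="2013", limit=5):
    """Build a CDX query for one URL (the API takes one URL per request).

    The date range and limit defaults are assumptions for illustration.
    """
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "from": year_from,
        "to": year_to,
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(text):
    """Convert CDX JSON output (a header row, then data rows) into dicts."""
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]
```

The URL returned by `build_cdx_url("example.com")` can be fetched with any HTTP client, and the response body passed to `parse_cdx_json`. An empty list means no capture matched the query.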

Once you have a heuristic to narrow down some domains, you can use this helper: ../cia-2010-covert-communication-websites/cdx.sh to drill them down from tens of thousands to hundreds or thousands.
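The scanning step can be sketched roughly as follows. This is not the contents of cdx.sh, only a hypothetical Python illustration of the idea: query the candidates one at a time, pause between requests in case of rate limiting, and keep the domains that have at least one capture. The names `fetch_cdx` and `scan_domains`, and the one second delay, are assumptions.

```python
import time
import urllib.request
from urllib.parse import urlencode


def fetch_cdx(domain):
    """Fetch at most one CDX row for a domain as JSON text (network call)."""
    query = urlencode({"url": domain, "output": "json", "limit": "1"})
    cdx_url = "http://web.archive.org/cdx/search/cdx?" + query
    with urllib.request.urlopen(cdx_url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def scan_domains(domains, fetch=fetch_cdx, delay_seconds=1.0):
    """Return the domains for which the CDX server reports any capture.

    Domains are queried sequentially; delay_seconds is an assumed polite
    pause between requests, not a documented limit.
    """
    hits = []
    for i, domain in enumerate(domains):
        if i > 0:
            time.sleep(delay_seconds)
        body = fetch(domain).strip()
        # With output=json, a response with no captures is empty or "[]".
        if body and body != "[]":
            hits.append(domain)
    return hits
```

Passing `fetch` as a parameter keeps the loop testable offline, and makes it easy to swap in a client with retries if the server starts refusing requests.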

We then post-process the results of cdx.sh with ../cia-2010-covert-communication-websites/cdx-post.sh to drill them down from thousands to dozens, and manually inspect everything.

From then on, you can just manually inspect for hits in your browser.
