It can't have been an HTML crawl, because presumably there wouldn't have been any links to those websites. Presumably this is also why Common Crawl doesn't seem to have any hits.
So they must have had some kind of DNS A record database?
Or would an IPv4 sweep have worked with the CIA's setup, even without the correct Host header?
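To make the Host header question concrete, here is a minimal sketch of the two HTTP requests involved. A scanner sweeping IPv4 only knows the address, so the best it can send as Host is the IP itself, whereas a client that already knows the domain sends the domain. Whether the two get the same page depends on how the server is configured, which we don't know for these sites. The helper name `build_request`, the address `192.0.2.1` and the domain `example.com` are placeholders for illustration, not anything taken from the actual data.

```python
def build_request(ip, hostname=None, path="/"):
    """Build the raw bytes of a minimal HTTP/1.1 GET request.

    If no hostname is known (the IPv4 sweep case), fall back to
    using the IP address itself as the Host header value.
    """
    host_value = hostname if hostname is not None else ip
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_value}",
        "Connection: close",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# What a sweep would send: it only knows the IP (placeholder address).
sweep_request = build_request("192.0.2.1")

# What a client that already knows the domain would send (placeholder domain).
named_request = build_request("192.0.2.1", "example.com")

print(sweep_request.decode("ascii"))
print(named_request.decode("ascii"))
```

If the server uses name-based virtual hosting, the first request would likely only reach whatever the default site is, so a sweep alone might not reveal the page served for the domain.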
The same question also applies to the 2013 DNS Census. It has fewer hits, but still has many.
Whatever they did, we are so so glad that they did!