Gridworld version of DeepMind Lab.
Open sourced in 2020: analyticsindiamag.com/deepmind-just-gave-away-this-ai-environment-simulator-for-free/
A tiny paper: arxiv.org/pdf/2011.07027.pdf
TODO get running, publish demo videos on YouTube.
- github.com/google-deepmind/pushworld 2023 Too combinatorial, gripping makes it so much easier to move stuff around in the real world. But cool nonetheless.
As of 2022, this tends to be the more "default" when you talk about a quantum computer.
But there are some serious analog quantum computer contestants in the field as well.
- quantumtech.blog/2023/01/17/quantum-computing-with-neutral-atoms/ OK this one hits it:So we understand that it is truly like the classical computer analog vs digital case.
As Alex Keesling, CEO of QuEra told me, "... whereas in gate-based [digital] quantum computing the focus is on the sequence of the gates, in analog quantum processing it's more about the position of the atoms and where you place them so they can mirror real life problems. We arrange the atoms and define the forces that drive them and then measure the result... so it’s a geometric encoding of the problem itself."
- thequantuminsider.com/2022/06/28/why-analog-neutral-atoms-quantum-computing-is-a-promising-direction-for-early-quantum-advantage on The Quantum Insider useless article mostly by Pasqal
TensorFlow quantum by Masoud Mohseni (2020)
Source. At the timestamp, Masoud gives a thought experiment example of the perhaps simplest to understand analog quantum computer: chained double-slit experiments with carefully calculated distances between slits. Calulating the final propability distribution of that grows exponentially.Google BigQuery alternative.
Like Binary search tree, but each node can have multiple objects and more than two children.
D'oh.
But to be serious. The Wayback Machine contains a very large proportion of all sites. It is the most complete database we have found so far. Some archives are very broken. But those are rares.
The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.
The Common Crawl project attempts in part to address this lack of querriability, but we haven't managed to extract any hits from it.
CDX + 2013 DNS Census + heuristics however has been fruitful however.
Accounts used so far: 6 (1500 reverse IP checks).
Their historic DNS and reverse DNS info was very valuable, and served as Ciro's the initial entry point to finding hits in the IP ranges given by Reuters.
Generic information about the website not specific on this project will be stored at: Section "viewdns.info".
Since this source is so scarce and valuable, we have been quite careful to note down all the domain and IP ranges that have been explored.
At news.ycombinator.com/item?id=38496244, the creator of the viewdns.info, "Hughesey", also stated that he'd able to give some free credits for public research projects such as this one. This would have saved up going to quite a few Cafes to get those sweet extra IPs! But it was more fun in hardmode, no doubt.
We do API access to IP ranges with this simple helper: cia-2010-covert-communication-websites/viewdns-info.sh, usage:e.g.:
./viewdns-info.sh <apikey> <start-ipv-address> <end-ipv-address>
./viewdns-info.sh 8b890b00b17ed2d66bbed878d51200b58d43d014 66.45.179.187 66.45.179.210
For domain to IP queries from the API you should use "iphistory" viewdns.info/api/docs/ip-history.php:
curl 'https://api.viewdns.info/iphistory/?domain=todaysengineering.com&apikey=$APIKEY&output=json'
Just beware of the viewdns.info reverse IP bug, that really sucks and led to us missing a ton of domains.
Main article: DNS Census 2013.
This data source was very valuable, and led to many hits, and to finding the first non Reuters ranges with Section "secure subdomain search on 2013 DNS Census".
Hit overlap:Domain hit count when we were at 279 hits: 142 hits, so about half of the hits were present.
jq -r '.[].host' ../media/cia-2010-covert-communication-websites/hits.json ) | xargs -I{} sqlite3 aiddcu.sqlite "select * from t where d = '{}'"
The timing of the database is perfect for this project, it is as if the CIA had planted it themselves!
dnshistory.org contains historical domain -> mappings.
We have not managed to extract much from this source, they don't have as much data on the range of interest.
But they do have some unique data at least, perhaps we should try them a bit more often, e.g. they were the only source we've seen so far that made the association: headlines2day.com -> 212.209.74.126 which places it in the more plausible globalbaseballnews.com IP range.
TODO can it do IP to domain? Or just domain to IP? Asked on their Discord: discord.com/channels/698151879166918727/968586102493552731/1124254204257632377. Their banner suggests that yes:
With our new look website you can now find other domains hosted on the same IP address, your website neighbours and more even quicker than before.
Owner replied, you can't:
At the moment you can only do this for current not historical records
This is a shame, reverse IP here could be quite valuable.
In principle, we could obtain this data from search engines, but Google doesn't track that entire website well, e.g. no hits for
site:dnshistory.org "62.22.60.48"
presumably due to heavy IP throttling.Homepage dnshistory.org/ gives date starting in 2009:and it is true that they do have some hits from that useful era.
Here at DNS History we have been crawling DNS records since 2009, our database currently contains over 1 billion domains and over 12 billion DNS records.
Any data that we have the patience of extracting from this we will dump under github.com/cirosantilli/media/blob/master/cia-2010-covert-communication-websites/hits.json.
They appear to piece together data from various sources. This is the most complete historical domain -> IP database we have so far. They don't have hugely more data than viewdns.info, but many times do offer something new. It feels like the key difference is that their data goes further back in the critical time period a bit.
TODO do they have historical reverse IP? The fact that they don't seem to have it suggests that they are just making historical reverse IP requests to a third party via some API?
E.g. searching
thefilmcentre.com
under historical data at securitytrails.com/domain/thefilmcentre.com/history/al gives the correct IP 62.22.60.55.But searching the IP 62.22.60.55 is empty and there's no historical data option?
Account creation blacklists common email providers such as gmail to force users to use a "corporate" email address. But using random domains like
ciro@cirosantilli.com
works fine.Their data seems to date back to 2008 for our searches.
So far, no new domains have been found with Common Crawl, nor have any existing known domains been found to be present in Common Crawl. Our working theory is that Common Crawl never reached the domains How did Alexa find the domains?
Let's try and do something with Common Crawl.
Unfortunately there's no IP data apparently: github.com/commoncrawl/cc-index-table/issues/30, so let's focus on the URLs.
Using their Common Crawl Athena method: commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
Hello world:Data scanned: 11.75 MB
select * from "ccindex"."ccindex" limit 100;
Sample first output line:So
# 2
url_surtkey org,whwheelers)/robots.txt
url https://whwheelers.org/robots.txt
url_host_name whwheelers.org
url_host_tld org
url_host_2nd_last_part whwheelers
url_host_3rd_last_part
url_host_4th_last_part
url_host_5th_last_part
url_host_registry_suffix org
url_host_registered_domain whwheelers.org
url_host_private_suffix org
url_host_private_domain whwheelers.org
url_host_name_reversed
url_protocol https
url_port
url_path /robots.txt
url_query
fetch_time 2021-06-22 16:36:50.000
fetch_status 301
fetch_redirect https://www.whwheelers.org/robots.txt
content_digest 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
content_mime_type text/html
content_mime_detected text/html
content_charset
content_languages
content_truncated
warc_filename crawl-data/CC-MAIN-2021-25/segments/1623488519183.85/robotstxt/CC-MAIN-20210622155328-20210622185328-00312.warc.gz
warc_record_offset 1854030
warc_record_length 639
warc_segment 1623488519183.85
crawl CC-MAIN-2021-25
subset robotstxt
url_host_3rd_last_part
might be a winner for CGI comms fingerprinting!Naive one for one index:have no results... data scanned: 5.73 GB
select * from "ccindex"."ccindex" where url_host_registered_domain = 'conquermstoday.com' limit 100;
Let's see if they have any of the domain hits. Let's also restrict by date to try and reduce the data scanned:Humm, data scanned: 60.59 GB and no hits... weird.
select * from "ccindex"."ccindex" where
fetch_time < TIMESTAMP '2014-01-01 00:00:00' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'topbillingsite.com',
'worldwildlifeadventure.com'
)
Sanity check:has a bunch of hits of course. Also Data scanned: 212.88 MB,
select * from "ccindex"."ccindex" WHERE
crawl = 'CC-MAIN-2013-20' AND
subset = 'warc' AND
url_host_registered_domain IN (
'google.com',
'amazon.com'
)
WHERE
crawl
and subset
are a must! Should have read the article first.Let's widen a bit more:Still nothing found... they don't seem to have any of the URLs of interest?
select * from "ccindex"."ccindex" WHERE
crawl IN (
'CC-MAIN-2013-20',
'CC-MAIN-2013-48',
'CC-MAIN-2014-10'
) AND
subset = 'warc' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'worldnewsandent.com',
'worldwildlifeadventure.com'
)
Unlisted articles are being shown, click here to show only listed articles.