D'oh.
But to be serious: the Wayback Machine contains a very large proportion of all sites. It does sometimes happen that a Wayback Machine archive is missing or broken and cqcounter has the screenshot, but the Wayback Machine is still the most complete database we have found so far. Some archives are very broken, but those are rare.
The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.
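For reference, per-domain queries can be done over the public CDX API, e.g. (a minimal sketch; the domain is just one of the known hits mentioned further down):

```bash
# List archived captures of one domain (prefix match) between 2010 and 2013.
curl 'https://web.archive.org/cdx/search/cdx?url=todaysengineering.com*&output=json&from=2010&to=2013&limit=20'
```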
The Common Crawl project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.
CDX + 2013 DNS Census + heuristics has however been fruitful.
We have dumped all Wayback Machine archives of known websites to: github.com/cirosantilli/cia-2010-websites-dump using ../cia-2010-covert-communication-websites/download-websites.sh. This allows for better grepping and serves as a backup in case they ever go down.
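For example, a quick recursive grep over the dump could look like this (the search string is purely illustrative):

```bash
# Clone the dump and list archived pages containing a given string.
git clone https://github.com/cirosantilli/cia-2010-websites-dump
grep -ril 'world news' cia-2010-websites-dump | head
```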
Their historic DNS and reverse DNS info was very valuable, and served as Ciro's initial entry point to finding hits in the IP ranges given by Reuters.
Generic information about the website that is not specific to this project will be stored at: Section "viewdns.info".
Since this source is so scarce and valuable, we have been quite careful to note down all the domain and IP ranges that have been explored.
At news.ycombinator.com/item?id=38496244, the creator of viewdns.info, "Hughesey", also stated that he'd be able to give some free credits for public research projects such as this one. This would have saved going to quite a few cafes to get those sweet extra IPs! But it was more fun in hard mode, no doubt.
We do API access to IP ranges with this simple helper: ../cia-2010-covert-communication-websites/viewdns-info.sh, usage:

```
./viewdns-info.sh <apikey> <start-ipv-address> <end-ipv-address>
```

e.g.:

```
./viewdns-info.sh 8b890b00b17ed2d66bbed878d51200b58d43d014 66.45.179.187 66.45.179.210
```

For domain to IP queries from the API you should use "iphistory" viewdns.info/api/docs/ip-history.php:
```bash
curl "https://api.viewdns.info/iphistory/?domain=todaysengineering.com&apikey=$APIKEY&output=json"
```

Just beware of the viewdns.info reverse IP bug, which really sucks and led to us missing a ton of domains.
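To pull just the timestamped IPs out of that JSON, something like the following should work (a sketch: the `.response.records[]` layout and the `ip`/`lastseen` field names are assumed from memory of the API docs and may need adjusting):

```bash
# Extract "lastseen ip" pairs from the iphistory JSON response.
curl -s "https://api.viewdns.info/iphistory/?domain=todaysengineering.com&apikey=$APIKEY&output=json" \
  | jq -r '.response.records[] | [.lastseen, .ip] | @tsv'
```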
Main article: DNS Census 2013.
This data source was very valuable, and led to many hits, and to finding the first non-Reuters ranges with Section "secure subdomain search on 2013 DNS Census".
dnshistory.org contains historical domain -> IP mappings.
We have not managed to extract much from this source, as they don't have as much data on the ranges of interest.
But they do have some unique data at least; perhaps we should try them a bit more often, e.g. they were the only source we've seen so far that made the association headlines2day.com -> 212.209.74.126, which places it in the more plausible globalbaseballnews.com IP range.
TODO can it do IP to domain? Or just domain to IP? Asked on their Discord: discord.com/channels/698151879166918727/968586102493552731/1124254204257632377. Their banner suggests that yes:
With our new look website you can now find other domains hosted on the same IP address, your website neighbours and more even quicker than before.
Owner replied, you can't:
At the moment you can only do this for current not historical records
In principle, we could obtain this data from search engines, but Google doesn't index that entire website well, e.g. no hits for:

```
site:dnshistory.org "62.22.60.48"
```

presumably due to heavy IP throttling.

The homepage dnshistory.org/ gives a starting date of 2009, and it is true that they do have some hits from that useful era:
Here at DNS History we have been crawling DNS records since 2009, our database currently contains over 1 billion domains and over 12 billion DNS records.
Any data that we have the patience of extracting from this we will dump under github.com/cirosantilli/media/blob/master/cia-2010-covert-communication-websites/hits.json.
They appear to piece together data from various sources. This is the most complete historical domain -> IP database we have found so far. They don't have hugely more data than viewdns.info, but they do often offer something new. It feels like the key difference is that their data goes back a bit further into the critical time period.
TODO do they have historical reverse IP? The fact that they don't seem to have it suggests that they are just making historical reverse IP requests to a third party via some API?
E.g. searching thefilmcentre.com under historical data at securitytrails.com/domain/thefilmcentre.com/history/a gives the correct IP 62.22.60.55.

Account creation blacklists common email providers such as Gmail to force users to use a "corporate" email address, but using random domains like ciro@cirosantilli.com works fine.

Their data seems to date back to 2008 for our searches.
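SecurityTrails also exposes this through an API; a sketch of a historical A-record lookup (endpoint path and APIKEY header as per their public docs, not re-verified here):

```bash
# Historical A records for a domain from the SecurityTrails API.
curl -s -H "APIKEY: $SECURITYTRAILS_APIKEY" \
  'https://api.securitytrails.com/v1/history/thefilmcentre.com/dns/a' | jq .
```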
So far, no new domains have been found with Common Crawl, nor have any existing known domains been found to be present in Common Crawl. Our working theory is that Common Crawl never reached the domains. See also: Section "How did Alexa find the domains?".
Let's try and do something with Common Crawl.
Unfortunately there's no IP data apparently: github.com/commoncrawl/cc-index-table/issues/30, so let's focus on the URLs.
Using their Common Crawl Athena method: commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
Sample first output line:

```
# 2
url_surtkey org,whwheelers)/robots.txt
url https://whwheelers.org/robots.txt
url_host_name whwheelers.org
url_host_tld org
url_host_2nd_last_part whwheelers
url_host_3rd_last_part
url_host_4th_last_part
url_host_5th_last_part
url_host_registry_suffix org
url_host_registered_domain whwheelers.org
url_host_private_suffix org
url_host_private_domain whwheelers.org
url_host_name_reversed
url_protocol https
url_port
url_path /robots.txt
url_query
fetch_time 2021-06-22 16:36:50.000
fetch_status 301
fetch_redirect https://www.whwheelers.org/robots.txt
content_digest 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
content_mime_type text/html
content_mime_detected text/html
content_charset
content_languages
content_truncated
warc_filename crawl-data/CC-MAIN-2021-25/segments/1623488519183.85/robotstxt/CC-MAIN-20210622155328-20210622185328-00312.warc.gz
warc_record_offset 1854030
warc_record_length 639
warc_segment 1623488519183.85
crawl CC-MAIN-2021-25
subset robotstxt
```

So url_host_3rd_last_part might be a winner for CGI comms fingerprinting!

A naive query for one domain over the entire index:

```sql
select * from "ccindex"."ccindex" where url_host_registered_domain = 'conquermstoday.com' limit 100;
```

has no results... Data scanned: 5.73 GB.

Let's see if they have any of the domain hits. Let's also restrict by date to try and reduce the data scanned:

```sql
select * from "ccindex"."ccindex" where
fetch_time < TIMESTAMP '2014-01-01 00:00:00' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'topbillingsite.com',
'worldwildlifeadventure.com'
)
```

Humm, data scanned: 60.59 GB and no hits... weird.

Sanity check:

```sql
select * from "ccindex"."ccindex" WHERE
crawl = 'CC-MAIN-2013-20' AND
subset = 'warc' AND
url_host_registered_domain IN (
'google.com',
'amazon.com'
)
```

has a bunch of hits of course. Data scanned: 212.88 MB. WHERE crawl and subset are a must! Should have read the article first.

Let's widen a bit more:

```sql
select * from "ccindex"."ccindex" WHERE
crawl IN (
'CC-MAIN-2013-20',
'CC-MAIN-2013-48',
'CC-MAIN-2014-10'
) AND
subset = 'warc' AND
url_host_registered_domain IN (
'activegaminginfo.com',
'altworldnews.com',
...
'worldnewsandent.com',
'worldwildlifeadventure.com'
)
```

Still nothing found... they don't seem to have any of the URLs of interest?

Does not appear to have any reverse IP hits unfortunately: opendata.stackexchange.com/questions/1951/dataset-of-domain-names/21077#21077. Likely only has domains that were explicitly advertised.
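For completeness, the same columnar index can in principle be queried without Athena by pointing DuckDB at the public parquet files (a sketch: the bucket layout is taken from the Common Crawl columnar index announcement, and anonymous S3 access in DuckDB may need extra configuration):

```sql
-- DuckDB sketch: scan one crawl partition of the public cc-index parquet files.
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SELECT url, fetch_time
FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2013-20/subset=warc/*.parquet')
WHERE url_host_registered_domain = 'activegaminginfo.com';
```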
We could not find anything useful in it so far, but there is great potential to use this tool to find new IP ranges based on properties of existing IP ranges. Part of the problem is that the dataset is huge, and is split into 256 chunks by the top IP byte. But it would be reasonable to at least explore ranges with pre-existing known hits...
We have started looking for patterns on 66.* and 208.*, both selected as two relatively far away ranges that have a number of pre-existing hits. 208 should likely have been 212, considering later finds that put several ranges in 212.

tcpip_fp (see the grep sketch after the list below):
- 66.104.
- 66.104.175.41: grubbersworldrugbynews.com: 1346397300 SCAN(V=6.01%E=4%D=1/12%OT=22%CT=443%CU=%PV=N%G=N%TM=387CAB9E%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=N),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.48: worlddispatch.net: 1346816700 SCAN(V=6.01%E=4%D=1/2%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=1D5EA%P=mipsel-openwrt-linux-gnu),SEQ(SP=F8%GCD=3%ISR=109%TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.49: webworldsports.com: 1346692500 SCAN(V=6.01%E=4%D=9/3%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=5044E96E%P=mipsel-openwrt-linux-gnu),SEQ(SP=105%GCD=1%ISR=108%TI=Z%TS=A),OPS(O1=M550ST11NW6%O2=M550ST11NW6%O3=M550NNT11NW6%O4=M550ST11NW6%O5=M550ST11NW6%O6=M550ST11),WIN(W1=1510%W2=1510%W3=1510%W4=1510%W5=1510%W6=1510),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.50: fly-bybirdies.com: 1346822100 SCAN(V=6.01%E=4%D=1/1%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=14655%P=mipsel-openwrt-linux-gnu),SEQ(TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.104.175.53: info-ology.net: 1346712300 SCAN(V=6.01%E=4%D=9/4%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=50453230%P=mipsel-openwrt-linux-gnu),SEQ(SP=FB%GCD=1%ISR=FF%TI=Z%TS=A),ECN(R=N),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.175.106
- 66.175.106.150: noticiasmusica.net: 1340077500 SCAN(V=5.51%D=1/3%OT=22%CT=443%CU=%PV=N%G=N%TM=38707542%P=mipsel-openwrt-linux-gnu),ECN(R=N),T1(R=N),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
- 66.175.106.155: atomworldnews.com: 1345562100 SCAN(V=5.51%D=8/21%OT=22%CT=443%CU=%PV=N%DC=I%G=N%TM=5033A5F2%P=mips-openwrt-linux-gnu),SEQ(SP=FB%GCD=1%ISR=FC%TI=Z%TS=A),ECN(R=Y%DF=Y%TG=40%W=1540%O=M550NNSNW6%CC=N%Q=),T1(R=Y%DF=Y%TG=40%S=O%A=S+%F=AS%RD=0%Q=),T2(R=N),T3(R=N),T4(R=N),T5(R=Y%DF=Y%TG=40%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=N),T7(R=N),U1(R=N),IE(R=N)
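A minimal sketch of the kind of pattern search done over this data, assuming one dump file per top IP byte and lines of roughly the form `IP timestamp fingerprint` (the file name and exact layout here are hypothetical):

```bash
# Look for the OpenWrt-looking fingerprint seen on the known hits above,
# restricted to a /24 of interest within the 66.* chunk of the dataset.
grep -E '^66\.104\.175\.' 66.tcpip_fp.txt | grep 'P=mipsel-openwrt-linux-gnu'
```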
Domain list only, no IPs and no dates. We haven't been able to extract anything of interest from this source so far.
Domain hit count when we were at 69 hits: only 9, some of which had since been reused. Likely their data collection did not cover the dates of interest.
whoisxmlapi WHOIS history for April 11, 2011 (followed by Reuters registration in 2022):
- Created Date: March 6, 2008 00:00:00 UTC
- Updated Date: March 7, 2011 00:00:00 UTC
- Expires Date: March 6, 2014 00:00:00 UTC
- Registrant Name: domainsbyproxy.com.
- Registrant Organization: Domains by Proxy, Inc.
- Registrant Street: 15111 N. Hayden Rd., Ste 160,
- Registrant City: Scottsdale
- Registrant State/Province: Arizona
- Registrant Postal Code: 85260
- Registrant Country: UNITED STATES
- Name servers: NS29.WORLDNIC.COM|NS30.WORLDNIC.COM
whoisrequest.com/history/ mentions:
- 1 Apr, 2008: Domain created*, nameservers added. Nameservers:
- ns1.webhostingpad.com
- ns2.webhostingpad.com
What poor countries have to do to get richer
Don't force international exchange students to come back early
Many of the student exchange programs Ciro witnessed in the 2010s in Brazil were inefficient because they required students to come back immediately after university or PhD, for fear that those students would never come back.
This is useless, because you don't learn anything unique during university: the truly valuable knowledge is obtained when you work for several years as a postdoc in a world class research laboratory or as an engineer in a world class company.
Therefore, Brazil should learn from the Chinese exchange system, which lets students go do whatever they want, and once they are Gods of the domain, entices them back with great positions and pay as heads of laboratories back in China. Just don't do fraudulent stuff like this as China did, or else you will get a bad rep.
To help this university collaboration happen, we should create communication channels between exchange students and professors of the origin country who work in the same domain so that they can discuss the subject. For example, Ciro Santilli once wanted to contact some of his former teachers at the University of São Paulo about "advanced" topics he had been exposed to as part of his job. However, they didn't even reply to his email, and Ciro didn't know who else to contact. This must never happen. We need a way to informally contact several professors of a given domain, to increase the chances that at least one might be interested. It is pointless to just let students loose abroad and hope that they will bring things back to their home country: a more cohesive infrastructure is needed to nurture that.
There is basically one sane way to achieve these goals: the exchange programs must be organized at a national level, not in an ad-hoc per-university manner.
Another good idea is to have taxes that depend on your nationality alone and which only start being collected when you reach a very high amount of net worth. So e.g. if someone leaves the country and makes it big, then and only then does the government start clawing back the benefits of its investments in the person. Furthermore, such taxes could be reduced if the person brings some of the business back to the country. And mandatory taxes should be charged if the person decides to drop their nationality at some point.
The above points would also be greatly eased by having a national-level exchange program. E.g. in the Brazil of the 2010s which Ciro experienced, every university had different terms and conditions, which made everything a mess. Exchange programs must be treated as a unified federal policy.
Ciro actually had to return for just six months from the École Polytechnique to the University of São Paulo, to finish a course for which he had only done the generic Maths/Physics introduction. Students from other Brazilian universities were forced to return for up to 3 years even to get their Brazilian diplomas! Ciro was lucky that his teachers understood the situation, and allowed him to develop online learning projects instead of his supposed control engineering projects, which will hopefully have led to changing the world with motivation one day. And for this, Ciro is eternally thankful.
This shows the complete and total lack of any Brazilian strategy to send its students abroad to really learn valuable things and then come back. There is no strategy at all. Things have just reached an equilibrium point of bureaucracies: Brazilian universities trying to bring students back to validate useless diploma pieces of paper, and foreign universities not caring about that, just wanting the students to stay abroad forever.
Ciro was once talking about why so few Brazilians go study abroad compared to the Chinese. Besides the likely true "there are a lot of Chinese" argument, his wife made another good point: Brazil is not so bad to live in, because you have good food and freedom, while China only has good food.
But Ciro still feels bad that so few of his University of São Paulo colleagues, who learnt automation and control engineering, are doing deep tech or physical engineering. They have all basically become computer people like Ciro.
This is not their fault. They basically don't have a choice: all physical science and technology is done in rich countries.
Yes, someone has to implement the newest tech to improve local country efficiency in projects that will never spread abroad.
But who will be left then for the next-big-thing problems that would really make Brazil richer? 6 out of a 30-person class ended up working at a gaming company at one point, even though they were not crazily passionate about the field! What could possibly be a worse investment for society?
This lack of technological innovation can also be clearly seen when you research the investment options available in Brazil. Huge emphasis is put on fixed-return financial products (often inflation adjusted) linked to basic non-tech businesses such as the housing market and agriculture. And when you look at the returns of the stock market through S&P 500-analogue-backed exchange-traded funds, they do not seem obviously better, especially considering inflation and the taxation benefits that exist for some of the other investment possibilities.
When the companies of a country are not clearly the best investment, you know that something is wrong. They are highly specialized money making machines, remember! And housing and agriculture are not such innovative markets where people can hugely influence efficiency.
When it is best to send students is a good question. Undergrad studies could be easily reproduced in poor countries if they had any intelligence at all, since even in rich countries laboratory usage is always limited. Masters and PhDs are generally more valuable moments to send people out. The question is whether the students will actually have a fighting chance without having been out, in particular in terms of language skills. Ciro feels that the Masters is a good entry point, as that is where PhD links are more actively made.
Some of their archiving accounts:
Lists:
- trilarion.github.io/opensourcegames/
- www.slant.co/topics/1933/~best-open-source-games
- libregamewiki.org/Main_Page
- www.reddit.com/r/opensourcegames/comments/197luuk/what_is_the_best_open_source_game_in_your_opinion/
- www.pcgamer.com/yall-know-about-these-huge-lists-of-free-open-source-game-clones-right/ is a list of lists
Why would anyone ever waste time playing a closed source game, when this will inevitably lead to endless hours of decompilation down the line when you want to:
Those who devote their time to the useless development of open source video games, before we even have decent open source development tooling, will, without a doubt, have their place in Heaven.
- tower defense
- www.edopedia.com/demo/pixeldefense, possible source: github.com/jesseakt/PixelDefense. As of 2020-03: desperately lacks a fast forward button and enemy health bars
- platformer
- 2D platformer
- 3D platformer
- OpenClonk: Terraria-like 2D mining crafting game. Pretty well done. Not sure if you can have a super huge open world. The fact that the music stops completely so often is a bit saddening.
- Pingus: Lemmings clone. Very good!
- github.com/The-Powder-Toy/The-Powder-Toy: en.wikipedia.org/wiki/Falling-sand_game in C++. No Ubuntu 19.10 package it seems, but was easy to compile from source.
- roguelike
- Worms clone
- Hedgewars
- pokemon clone:
- Tuxemon. Worked on Ubuntu 21.10. 20ea4181e1c0db04934ee69951ea1836a3b1f642
- ARPG
- Diablo II clones: v1.12 download worked well on Ubuntu 21.10.
- The Mana World: www.themanaworld.org/ Started somewhat as a loose The Secret of Mana clone, but they've added online play capabilities, effectively making it an MMORPG. Their user acquisition as of 2021 is really bad: download is a wiki page, there are two client versions, etc. The .deb did not work out of the box on Ubuntu 21.10 due to unmet dependencies:

  ```
  sudo apt install ./manaplus_amd64.deb
  ```

  fails with:

  ```
  manaplus : Depends: libpng12-0 (>= 1.2.13-4) but it is not installable
             Depends: libsdl-gfx1.2-4 (>= 2.0.22) but it is not installable
             Depends: manaplus-data (= 1.6.4.23-2) but 1.9.3.23-6 is to be installed
  ```

  so it won't be possible to play without trying to compile and possibly doing minor ports, since the deb does not pack its dependencies. There are some requests for a release with all dependencies prepacked. Their home page says it all: "Server status: Online: 9 players". Sad.
- Factorio clones:
- github.com/tobspr/shapez.io Also browser based.
Some that Ciro Santilli likes: