Ciro Santilli @cirosantilli 37


Particle accelerator Updated 2025-07-16


Physics Updated 2025-07-16


Physics (like all well-done science) is the art of predicting the future by modelling the world with mathematics.

And predicting the future is the first step towards controlling it, i.e. engineering.

Ciro Santilli doesn't know physics. He writes about it partly to start playing with some scientific content for OurBigBook.com, and partly because this stuff is just amazingly beautiful.

Ciro's main intellectual physics fetishes are to learn quantum electrodynamics (understanding the point of Lie groups being a subpart of that) and condensed matter physics.

Every science is Physics in disguise, but the number of objects in the real world is so large that we can't solve the real equations in practice.

Luckily, due to emergence, we can use uglier higher-level approximations of the world to solve many problems, within the complex limits of applicability of those approximations.

Therefore, such higher-level approximations are highly specialized, and are given different names.

As of 2019, all known physics can be described by two theories: general relativity and the Standard Model.

Unifying those two into the theory of everything is one of the major goals of modern physics.

Figure 1. xkcd 435: Fields arranged by purity. Reductionism comes to mind.

Figure 2. Physically accurate genie by Psychomic. This sane square composition is from: www.reddit.com/r/funny/comments/u08dw3/nice_guy_genie/.


Reductionism Updated 2025-07-16


Figure "xkcd 435: Fields arranged by purity" must again be cited.


Sticker album Updated 2025-07-16


Time dilation Updated 2025-07-16


One of the best ways to think about it is the transversal time dilation thought experiment.


Time invariance implies energy conservation Updated 2025-07-16


physics.stackexchange.com/questions/614757/intuitive-explanation-for-why-time-symmetry-implies-conservation-of-energy on Physics Stack Exchange


Times Higher Education Updated 2025-07-16


Topological Quantum Course of the University of Oxford Updated 2025-07-16


2010- professor: Steven H. Simon

Lecture notes/book: www-thphys.physics.ox.ac.uk/people/SteveSimon/topological2021/TopoBook-Sep28-2021.pdf

Course page index: www-thphys.physics.ox.ac.uk/people/SteveSimon/

2022 homepage: www-thphys.physics.ox.ac.uk/people/SteveSimon/topological2022/topocourse2022.html


Torus Updated 2025-07-16


Allotrope Updated 2025-07-16


Single chemical element, single phase (usually solid), but different 3D structures.

The prototypical examples are the allotropes of carbon such as diamond vs graphite.


Amazon acquisition Updated 2025-07-16


Parlour game Updated 2025-07-16


The eye in Ciro Santilli's website banner Updated 2025-07-16


The Hundred Greatest Theorems by Paul and Jack Abad (1999) Updated 2025-07-16


Randomly reproduced at: web.archive.org/web/20080105074243/http://personal.stevens.edu/~nkahl/Top100Theorems.html


Amazon AI accelerator silicon Updated 2025-07-16


2020: Trainium, e.g. techcrunch.com/2020/12/01/aws-launches-trainium-its-new-custom-ml-training-chip/
2018: AWS Inferentia, mentioned at en.wikipedia.org/wiki/Annapurna_Labs


Metabolomics Updated 2025-07-16


Study of the metabolome.


Microplate Updated 2025-07-16


Tesla (unit) Updated 2025-07-16


Zone of Avoidance Updated 2025-07-16


Common Crawl Updated 2025-07-16


commoncrawl.org/

Amazing project, that basically makes a more searchable Wayback Machine.

A bit hard to use their data though, partly due to size, but also due to the lack of free-to-use querying mechanisms, and how obtuse Amazon S3 is to use.

Notably, aws-cli with an account is the only reliable way, everything else is way too broken, e.g. trying to check an index such as index.commoncrawl.org/CC-MAIN-2023-06/ very often returns HTTP 500.
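If one still wants to poke at the index over HTTP, wrapping the request in a retry loop is one way to live with the intermittent 500s. A rough sketch: the `-index?url=...&output=json` query layout is an assumption about the index server and is not taken from this article, and `index_query_url` and `with_retries` are hypothetical helper names.

```python
import time
from urllib.parse import urlencode


def index_query_url(crawl: str, url_pattern: str) -> str:
    """Build a query URL for one crawl's index (URL layout is an assumption)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{query}"


def with_retries(fetch, url: str, attempts: int = 5, delay: float = 1.0):
    """Call fetch(url) until it stops raising OSError, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:  # urllib.error.HTTPError is a subclass of OSError
            if attempt == attempts - 1:
                raise
            time.sleep(delay * 2 ** attempt)
```

A real call could pass something like `lambda u: urllib.request.urlopen(u).read()` as `fetch`; it is left injectable here so the retry logic can be tried without touching the network.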

But still, their project is amazing.

The only out-of-the-box search they seem to have is: urlsearch.commoncrawl.org/ for domains/URLs. It is good, but there could be so much more... notably IPs.

They could also document the data shape a bit better.

Sample sizes can be found at: commoncrawl.org/2023/04/mar-apr-2023-crawl-archive-now-available/

To explore the data, after login:

aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/

Copy the toplevel directory only:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"

Copy some wet/wat files:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .

Directory structure:

cc-index.paths.gz (1K)
cc-index-table.paths.gz (1K)

segment.paths.gz (1.7K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/
crawl-data/CC-MAIN-2013-20/segments/1368696381630/

index.html (2.3K)

wat.paths.gz (98K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz

wet.paths.gz (98K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz

warc.paths.gz (99K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz
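Each line in the *.paths.gz files is a key relative to the bucket root, so turning a paths file into arguments for aws s3 cp is just a matter of prefixing. A minimal sketch, run here on a synthetic in-memory stand-in for wat.paths.gz built from the sample lines above; `s3_uris` is a hypothetical helper name.

```python
import gzip

BUCKET = "s3://commoncrawl/"  # bucket prefix used in the aws commands above


def s3_uris(paths_gz: bytes) -> list[str]:
    """Decompress a *.paths.gz blob and prefix every non-empty line with the bucket."""
    text = gzip.decompress(paths_gz).decode("utf-8")
    return [BUCKET + line.strip() for line in text.splitlines() if line.strip()]


# Synthetic stand-in for wat.paths.gz, built from the two sample lines above.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/"
    b"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz\n"
    b"crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/"
    b"CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz\n"
)

for uri in s3_uris(sample):
    print(uri)
```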

segments: directory with actual data

1368696381249: one of many segments, any meaning of name?
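On the naming question: the number looks like a Unix timestamp in milliseconds. A quick check under that assumption converts it to UTC and formats it like the timestamp prefix in the file names. This is only an observation about this one segment, not documented behaviour, and `segment_to_stamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def segment_to_stamp(segment: str) -> str:
    """Treat a segment name as epoch milliseconds (an assumption) and format it as YYYYmmddHHMMSS in UTC."""
    seconds = int(segment) // 1000  # drop the millisecond part
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y%m%d%H%M%S")


print(segment_to_stamp("1368696381249"))  # compare with the CC-MAIN-20130516092621 file name prefix
```

For this segment the result is 20130516092621, the same digits that appear in the file names under it.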

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)

A tiny bit of metadata, and then plaintext content from the website, e.g. the second one:

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
WARC-Date: 2013-05-18T08:11:02Z
WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
Content-Type: text/plain
Content-Length: 12046

Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.

No IP unfortunately.
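The header block of such a record is just `Name: value` lines ended by a blank line, so checking which fields are present takes only a few lines. A minimal sketch on the sample record above, with the body shortened; `parse_warc_headers` is a hypothetical helper, and anything serious would more likely use a dedicated WARC library.

```python
def parse_warc_headers(record: str):
    """Split one WARC record into (version line, header dict, body text)."""
    record = record.replace("\r\n", "\n")  # real WARC files use CRLF line endings
    head, _, body = record.partition("\n\n")
    lines = head.splitlines()
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    return lines[0], headers, body


# Sample WET record from above; the body is cut short, so Content-Length no longer matches.
sample = """WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
WARC-Date: 2013-05-18T08:11:02Z
WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
Content-Type: text/plain
Content-Length: 12046

Stephen Silver is a journalist and editor"""

version, headers, body = parse_warc_headers(sample)
print(headers["WARC-Target-URI"])
print("WARC-IP-Address" in headers)
```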

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz (329M, 1.4G unzipped)

A lot of JSON metadata and no contents as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
WARC-Date: 2013-11-22T14:51:12Z
WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
Content-Type: application/json
Content-Length: 1180

{"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
WARC-Date: 2013-05-18T05:48:54Z
WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
Content-Type: application/json
Content-Length: 1501

{"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

Let's beautify one of them to see it better:


{
  "Envelope": {
    "Format": "WARC",
    "WARC-Header-Length": "274",
    "Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
    "Actual-Content-Length": "372",
    "WARC-Header-Metadata": {
      "WARC-Type": "warcinfo",
      "WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
      "WARC-Date": "2013-11-22T14:51:12Z",
      "Content-Length": "372",
      "WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
      "Content-Type": "application/warc-fields"
    },
    "Payload-Metadata": {
      "Trailing-Slop-Length": "0",
      "Actual-Content-Type": "application/warc-fields",
      "Actual-Content-Length": "372",
      "Headers-Corrupt": true,
      "WARC-Info-Metadata": {
        "robots": "classic",
        "software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
        "description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
        "hostname": "ip-10-60-113-184.ec2.internal",
        "format": "WARC File Format 1.0",
        "isPartOf": "CC-MAIN-2013-20",
        "operator": "CommonCrawl Admin",
        "publisher": "CommonCrawl"
      }
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "453",
      "Header-Length": "10",
      "Inflated-CRC": "866052549",
      "Inflated-Length": "650"
    },
    "Offset": "0",
    "Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
  }
}
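The beautified form above can be reproduced with the standard library alone. A minimal sketch, assuming the JSON payload of one WAT record is already in a string; `line` is a cut-down stand-in for the real payload, kept short for readability.

```python
import json

# Cut-down stand-in for one WAT JSON payload (fields copied from the record above).
line = (
    '{"Envelope":{"Format":"WARC","WARC-Header-Metadata":{"WARC-Type":"warcinfo"}},'
    '"Container":{"Compressed":true,"Offset":"0"}}'
)

# Parse and re-serialize with two-space indentation, keeping key order.
pretty = json.dumps(json.loads(line), indent=2)
print(pretty)
```

From a shell, piping a single JSON line through `python3 -m json.tool` should give a similar result.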

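Fuck, no IP address here either. Other entries do have it, so why not this one? Presumably because this is the "warcinfo" record, which describes the archive file itself rather than a fetch from some server, while the "request" entry above does carry WARC-IP-Address.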
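One way to check which records carry an address is to look up WARC-IP-Address next to WARC-Type under Envelope / WARC-Header-Metadata. A sketch assuming the key always sits at that path, as it does in the two records shown above; `record_ip` is a hypothetical helper name and the payloads are cut down to the relevant fields.

```python
import json


def record_ip(wat_json: str):
    """Return (WARC-Type, WARC-IP-Address or None) for one WAT JSON payload."""
    meta = json.loads(wat_json)["Envelope"]["WARC-Header-Metadata"]
    return meta.get("WARC-Type"), meta.get("WARC-IP-Address")


# Cut-down versions of the two payloads shown above.
warcinfo = '{"Envelope":{"WARC-Header-Metadata":{"WARC-Type":"warcinfo"}}}'
request = (
    '{"Envelope":{"WARC-Header-Metadata":'
    '{"WARC-Type":"request","WARC-IP-Address":"165.1.125.44"}}}'
)

print(record_ip(warcinfo))
print(record_ip(request))
```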

The reason these can be huge is the HTML-Metadata section, which contains all outlinks! gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz ()

Obtain:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .
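The Container fields in the WAT metadata (Offset 0 with Deflate-Length 453 for the first record, Offset 453 for the second) suggest that each record in the .warc.gz is its own gzip member, so a single record could be cut out by byte range without decompressing the whole file. The sketch below shows the idea on synthetic data built in memory, not on the real archive; the record contents are made up and `read_member` is a hypothetical helper name.

```python
import gzip

# Two fake records, each compressed as a separate gzip member and then concatenated,
# mimicking the layout that Offset / Deflate-Length appear to describe.
records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: request\r\n\r\n",
]
members = [gzip.compress(r) for r in records]
archive = b"".join(members)


def read_member(data: bytes, offset: int, length: int) -> bytes:
    """Decompress exactly one gzip member given its byte offset and compressed length."""
    return gzip.decompress(data[offset:offset + length])


# The second member starts right where the first one ends.
offset_of_second = len(members[0])
print(read_member(archive, offset_of_second, len(members[1])))
```

With the real file, the same offset and length pair could presumably be used in an HTTP Range request instead of a slice, to fetch one record without downloading the whole thing.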
