x86 Paging Tutorial / Linux kernel usage by Ciro Santilli 35 Updated +Created
The Linux kernel makes extensive usage of the paging features of x86 to allow fast process switches with small data fragmentation.
There are also however some features that the Linux kernel might not use, either because they are only for backwards compatibility, or because the Linux devs didn't feel it was worth it yet.
x86 Paging Tutorial / TLB by Ciro Santilli 35 Updated +Created
The Translation Lookahead Buffer (TLB) is a cache for paging addresses.
Since it is a cache, it shares many of the design issues of the CPU cache, such as associativity level.
This section shall describe a simplified fully associative TLB with 4 single address entries. Note that like other caches, real TLBs are not usually fully associative.
x86 Paging Tutorial / PSE by Ciro Santilli 35 Updated +Created
Page size extension.
Allows for pages to be 4M (or 2M if PAE is on) in length instead of 4K.
PSE is turned on and off via the PSE bit of cr4.
x86 Paging Tutorial / PAE by Ciro Santilli 35 Updated +Created
Physical address extension.
With 32 bits, only 4GB RAM can be addressed.
This started becoming a limitation for large servers, so Intel introduced the PAE mechanism to Pentium Pro.
To relieve the problem, Intel added 4 new address lines, so that 64GB could be addressed.
Page table structure is also altered if PAE is on. The exact way in which it is altered depends on weather PSE is on or off.
PAE is turned on and off via the PAE bit of cr4.
Even if the total addressable memory is 64GB, individual process are still only able to use up to 4GB. The OS can however put different processes on different 4GB chunks.
Page directory given to process by the OS:
entry index   entry address      page table address  present
-----------   ----------------   ------------------  --------
0             CR3 + 0      * 4   0x10000             1
1             CR3 + 1      * 4                       0
2             CR3 + 2      * 4   0x80000             1
3             CR3 + 3      * 4                       0
...
2^10-1        CR3 + 2^10-1 * 4                       0
Page tables given to process by the OS at PT1 = 0x10000000 (0x10000 * 4K):
entry index   entry address      page address  present
-----------   ----------------   ------------  -------
0             PT1 + 0      * 4   0x00001       1
1             PT1 + 1      * 4                 0
2             PT1 + 2      * 4   0x0000D       1
...                                  ...
2^10-1        PT1 + 2^10-1 * 4   0x00005       1
Page tables given to process by the OS at PT2 = 0x80000000 (0x80000 * 4K):
entry index   entry address     page address  present
-----------   ---------------   ------------  ------------
0             PT2 + 0     * 4   0x0000A       1
1             PT2 + 1     * 4   0x0000C       1
2             PT2 + 2     * 4                 0
...
2^10-1        PT2 + 0x3FF * 4   0x00003       1
where PT1 and PT2: initial position of page table 1 and page table 2 for process 1 on RAM.
With that setup, the following translations would happen:
linear    10 10 12 split  physical
--------  --------------  ----------
00000001  000 000 001     00001001
00001001  000 001 001     page fault
003FF001  000 3FF 001     00005001
00400000  001 000 000     page fault
00800001  002 000 001     0000A001
00801004  002 001 004     0000C004
00802004  002 002 004     page fault
00B00001  003 000 000     page fault
Let's translate the linear address 0x00801004 step by step:
  • In binary the linear address is:
    0    0    8    0    1    0    0    4
    0000 0000 1000 0000 0001 0000 0000 0100
  • Grouping as 10 | 10 | 12 gives:
    0000000010 0000000001 000000000100
    0x2        0x1        0x4
    which gives:
    page directory entry = 0x2
    page table     entry = 0x1
    offset               = 0x4
    So the hardware looks for entry 2 of the page directory.
  • The page directory table says that the page table is located at 0x80000 * 4K = 0x80000000. This is the first RAM access of the process.
    Since the page table entry is 0x1, the hardware looks at entry 1 of the page table at 0x80000000, which tells it that the physical page is located at address 0x0000C * 4K = 0x0000C000. This is the second RAM access of the process.
  • Finally, the paging hardware adds the offset, and the final address is 0x0000C004.
Page faults occur if either a page directory entry or a page table entry is not present.
The Intel manual gives a picture of this translation process in the image "Linear-Address Translation to a 4-KByte Page using 32-Bit Paging": Figure 1. "x86 page translation process"
Figure 1.
x86 page translation process
.
x86 Paging Tutorial / K-ary trees to the rescue by Ciro Santilli 35 Updated +Created
The algorithmically minded will have noticed that paging requires associative array (like Java Map of Python dict()) abstract data structure where:
  • the keys are linear pages addresses, thus of integer type
  • the values are physical page addresses, also of integer type
The single level paging scheme uses a simple array implementation of the associative array:
  • the keys are the array index
  • this implementation is very fast in time
  • but it is too inefficient in memory
and in C pseudo-code it looks like this:
linear_address[0]      = physical_address_0
linear_address[1]      = physical_address_1
linear_address[2]      = physical_address_2
...
linear_address[2^20-1] = physical_address_N
But there another simple associative array implementation that overcomes the memory problem: an (unbalanced) k-ary tree.
A K-ary tree, is just like a binary tree, but with K children instead of 2.
Using a K-ary tree instead of an array implementation has the following trade-offs:
  • it uses way less memory
  • it is slower since we have to de-reference extra pointers
In C-pseudo code, a 2-level K-ary tree with K = 2^10 looks like this:
level0[0] = &level1_0[0]
    level1_0[0]      = physical_address_0_0
    level1_0[1]      = physical_address_0_1
    ...
    level1_0[2^10-1] = physical_address_0_N
level0[1] = &level1_1[0]
    level1_1[0]      = physical_address_1_0
    level1_1[1]      = physical_address_1_1
    ...
    level1_1[2^10-1] = physical_address_1_N
...
level0[N] = &level1_N[0]
    level1_N[0]      = physical_address_N_0
    level1_N[1]      = physical_address_N_1
    ...
    level1_N[2^10-1] = physical_address_N_N
and we have the following arrays:
  • one directory, which has 2^10 elements. Each element contains a pointer to a page table array.
  • up to 2^10 pagetable arrays. Each one has 2^10 4 byte page entries.
and it still contains 2^10 * 2^10 = 2^20 possible keys.
K-ary trees can save up a lot of space, because if we only have one key, then we only need the following arrays:
  • one directory with 2^10 entries
  • one pagetable at directory[0] with 2^10 entries
  • all other directory[i] are marked as invalid, don't point to anything, and we don't allocate pagetable for them at all
Antimatter by Ciro Santilli 35 Updated +Created
Common Crawl by Ciro Santilli 35 Updated +Created
Amazing project, that basically makes a more searchable Wayback Machine.
A bit hard to use their data though, partly due to size, but also lack of free to use querrying mechanisms, and how obtuse Amazon S3 is to use.
Notably, aws-cli with an account is the only reliable way, everything else is way too broken, e.g. trying the to check the an index index.commoncrawl.org/CC-MAIN-2023-06/ very often 500s.
But still, their projct is amazing.
The only out-of-the-box search they seem to have is: urlsearch.commoncrawl.org/ for domains/URLs. It is good, but there could be so much more... notably IPs.
Also could should document the data shape a bit better.
To explore the data, after login:
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
Copy the toplevel directory only:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"
Copy some wet/wat files:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 sync s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .
Directory structrure:
  • cc-index.paths.gz (1K)
  • cc-index-table.paths.gz (1K)
  • segment.paths.gz (1.7K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/
    crawl-data/CC-MAIN-2013-20/segments/1368696381630/
  • index.html (2.3K)
  • wat.paths.gz (98K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz
  • wet.paths.gz (98K) Sample lines:
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz
  • warc.paths.gz (99K)
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz
  • segments: directgory with actual data
    • 1368696381249: one of many segments, any meaning of name?
      • CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)
        A tiny bit of metadata, and then plaintext content from the website, e.g. the second one:
        WARC/1.0
        WARC-Type: conversion
        WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
        WARC-Date: 2013-05-18T08:11:02Z
        WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
        WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
        WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
        Content-Type: text/plain
        Content-Length: 12046
        
        Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.
        No IP unfortunately.
      • CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz (329M, 1.4G unzipped)
        A lot of JSON metadata and no contents as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:
        WARC/1.0
        WARC-Type: metadata
        WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
        WARC-Date: 2013-11-22T14:51:12Z
        WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
        WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
        Content-Type: application/json
        Content-Length: 1180
        
        {"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
        
        WARC/1.0
        WARC-Type: metadata
        WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
        WARC-Date: 2013-05-18T05:48:54Z
        WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
        WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
        Content-Type: application/json
        Content-Length: 1501
        
        {"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}
        Let's beautify one of them to see it better:
        
        {
          "Envelope": {
            "Format": "WARC",
            "WARC-Header-Length": "274",
            "Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
            "Actual-Content-Length": "372",
            "WARC-Header-Metadata": {
              "WARC-Type": "warcinfo",
              "WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
              "WARC-Date": "2013-11-22T14:51:12Z",
              "Content-Length": "372",
              "WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
              "Content-Type": "application/warc-fields"
            },
            "Payload-Metadata": {
              "Trailing-Slop-Length": "0",
              "Actual-Content-Type": "application/warc-fields",
              "Actual-Content-Length": "372",
              "Headers-Corrupt": true,
              "WARC-Info-Metadata": {
                "robots": "classic",
                "software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
                "description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
                "hostname": "ip-10-60-113-184.ec2.internal",
                "format": "WARC File Format 1.0",
                "isPartOf": "CC-MAIN-2013-20",
                "operator": "CommonCrawl Admin",
                "publisher": "CommonCrawl"
              }
            }
          },
          "Container": {
            "Compressed": true,
            "Gzip-Metadata": {
              "Footer-Length": "8",
              "Deflate-Length": "453",
              "Header-Length": "10",
              "Inflated-CRC": "866052549",
              "Inflated-Length": "650"
            },
            "Offset": "0",
            "Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
          }
        }
        Fuck no IP addresses either. But other entries do have it, why not this one?
        The reason these can be huge is the HTML-Metadata section which contain all outlinks! gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34
      • CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz ()
        Obtain:
        aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .
Free will by Ciro Santilli 35 Updated +Created
Ciro Santilli does not believe in free will of course because he is an agnostic and he believes that brains are controlled by the laws of physics, see also: physics and the illusion of life.
Frequency divider by Ciro Santilli 35 Updated +Created
Bird by Ciro Santilli 35 Updated +Created
List of wars by Ciro Santilli 35 Updated +Created
Chrome Android extension by Ciro Santilli 35 Updated +Created
Lol it is note possible what a joke. Notably this makes it harder to have of a superior third party password manager like Proton Pass (though there seems to be an autocomplete app as an alternative path), and an ad blocker. Fuck Google.
Also, Chromium is not available on Google Play by default, you can install the apk, but you will miss updates:
A blog in every web framework by Ciro Santilli 35 Updated +Created
A (multi-user) blog is the hello world of the web, so creating one of those is the best way to quickly evaluate web technology, i.e. time to Hello World.
Some new frameworks like FeathersJS are making a chat app instead, as that highlights the push notifications a bit better.
Fugging, Upper Austria by Ciro Santilli 35 Updated +Created
Fujitsu by Ciro Santilli 35 Updated +Created
The japanese name literally means:
  • 富士 fushi, from Mount Fuji, which itself has unknown origin
  • 通 tong: telecommunications

Pinned article: ourbigbook/introduction-to-the-ourbigbook-project

Welcome to the OurBigBook Project! Our goal is to create the perfect publishing platform for STEM subjects, and get university-level students to write the best free STEM tutorials ever.
Everyone is welcome to create an account and play with the site: ourbigbook.com/go/register. We belive that students themselves can write amazing tutorials, but teachers are welcome too. You can write about anything you want, it doesn't have to be STEM or even educational. Silly test content is very welcome and you won't be penalized in any way. Just keep it legal!
We have two killer features:
  1. topics: topics group articles by different users with the same title, e.g. here is the topic for the "Fundamental Theorem of Calculus" ourbigbook.com/go/topic/fundamental-theorem-of-calculus
    Articles of different users are sorted by upvote within each article page. This feature is a bit like:
    • a Wikipedia where each user can have their own version of each article
    • a Q&A website like Stack Overflow, where multiple people can give their views on a given topic, and the best ones are sorted by upvote. Except you don't need to wait for someone to ask first, and any topic goes, no matter how narrow or broad
    This feature makes it possible for readers to find better explanations of any topic created by other writers. And it allows writers to create an explanation in a place that readers might actually find it.
    Figure 1.
    Screenshot of the "Derivative" topic page
    . View it live at: ourbigbook.com/go/topic/derivative
  2. local editing: you can store all your personal knowledge base content locally in a plaintext markup format that can be edited locally and published either:
    This way you can be sure that even if OurBigBook.com were to go down one day (which we have no plans to do as it is quite cheap to host!), your content will still be perfectly readable as a static site.
    Figure 5. . You can also edit articles on the Web editor without installing anything locally.
    Video 3.
    Edit locally and publish demo
    . Source. This shows editing OurBigBook Markup and publishing it using the Visual Studio Code extension.
  3. https://raw.githubusercontent.com/ourbigbook/ourbigbook-media/master/feature/x/hilbert-space-arrow.png
  4. Infinitely deep tables of contents:
    Figure 6.
    Dynamic article tree with infinitely deep table of contents
    .
    Descendant pages can also show up as toplevel e.g.: ourbigbook.com/cirosantilli/chordate-subclade
All our software is open source and hosted at: github.com/ourbigbook/ourbigbook
Further documentation can be found at: docs.ourbigbook.com
Feel free to reach our to us for any help or suggestions: docs.ourbigbook.com/#contact