Physical address extension.
With 32 bits, only 4GB of RAM can be addressed.
This started becoming a limitation for large servers, so Intel introduced the PAE mechanism with the Pentium Pro.
To relieve the problem, Intel added 4 new address lines, so that 64GB can be addressed.
The page table structure is also altered if PAE is on. The exact way in which it is altered depends on whether PSE is on or off.
PAE is turned on and off via the PAE bit of cr4.
Even if the total addressable memory is 64GB, an individual process is still only able to use up to 4GB, since linear addresses remain 32 bits wide. The OS can however place different processes on different 4GB chunks of physical memory.
Page size extension.
Allows pages to be 4MB (or 2MB if PAE is on) in length instead of 4KB.
PSE is turned on and off via the PSE bit of cr4.
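As a minimal sketch, assuming ring 0 (kernel) context, the two bits can be inspected by reading cr4 with inline assembly; the helper and macro names below are made up for illustration, and the bit positions (PSE = bit 4, PAE = bit 5) match X86_CR4_PSE_BIT and X86_CR4_PAE_BIT in arch/x86/include/uapi/asm/processor-flags.h:
/* Sketch: mov from cr4 is a privileged instruction, so this can only run
 * in ring 0, e.g. from inside a kernel module, not as a user program. */
static inline unsigned long read_cr4_sketch(void)
{
    unsigned long cr4;
    asm volatile("mov %%cr4, %0" : "=r" (cr4));
    return cr4;
}

/* PSE is CR4 bit 4 and PAE is CR4 bit 5. */
#define CR4_PSE_ENABLED(cr4) (((cr4) >> 4) & 1)
#define CR4_PAE_ENABLED(cr4) (((cr4) >> 5) & 1)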
Depending on whether PAE and PSE are active, different paging level schemes are used (see the sketch after this list):
  • no PAE and no PSE: 10 | 10 | 12
  • no PAE and PSE: 10 | 22
    22 is the offset within the 4MB page, since 22 bits address 4MB.
  • PAE and no PSE: 2 | 9 | 9 | 12
    The design reason why 9 is used twice instead of 10 is that page table entries can no longer fit into 32 bits, which were entirely filled up by 20 address bits and 12 meaningful or reserved flag bits.
    The reason is that 20 bits are no longer enough to represent the address of page tables: 24 bits are now needed because of the 4 extra address lines added to the processor.
    Therefore, the designers increased the entry size to 64 bits, and to make a page table still fit into a single 4KB page it was necessary to reduce the number of entries to 2^9 instead of 2^10.
    The leading 2 bits index a new page level called the Page Directory Pointer Table (PDPT), which points to page directories and completes the 32-bit linear address. PDPT entries are also 64 bits wide.
    cr3 now points to the PDPT, which must be in the first 4GB of memory and aligned on 32-byte multiples for addressing efficiency. This means that cr3 now has 27 significant address bits instead of 20: 2^27 possible 32-byte-aligned positions times 2^5 bytes of alignment covers the 2^32 bytes of the first 4GB.
  • PAE and PSE: 2 | 9 | 21
    Designers decided to keep the directory field 9 bits wide so that a page directory still fits into a single page.
    This leaves 23 bits. Keeping 2 for the PDPT, to stay uniform with the PAE-without-PSE case, leaves 21 for the offset, meaning that pages are 2MB wide instead of 4MB.
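To make the splits concrete, here is a small user-space C sketch that decodes the same 32-bit linear address under each of the four schemes; the sample address is arbitrary and the program only does bit manipulation, it does not touch any real page table:
#include <stdio.h>
#include <stdint.h>

/* Decode a 32-bit linear address under the schemes listed above.
 * Field widths: directory | table | offset (the table level is absent with PSE). */
int main(void) {
    uint32_t lin = 0x12345678; /* arbitrary sample linear address */

    /* no PAE and no PSE: 10 | 10 | 12 */
    printf("10|10|12: dir=%u table=%u off=%u\n",
           lin >> 22, (lin >> 12) & 0x3FF, lin & 0xFFF);

    /* no PAE and PSE: 10 | 22 (4MB pages) */
    printf("10|22   : dir=%u off=%u\n",
           lin >> 22, lin & 0x3FFFFF);

    /* PAE and no PSE: 2 | 9 | 9 | 12 (64-bit entries, 512 per table) */
    printf("2|9|9|12: pdpt=%u dir=%u table=%u off=%u\n",
           lin >> 30, (lin >> 21) & 0x1FF, (lin >> 12) & 0x1FF, lin & 0xFFF);

    /* PAE and PSE: 2 | 9 | 21 (2MB pages) */
    printf("2|9|21  : pdpt=%u dir=%u off=%u\n",
           lin >> 30, (lin >> 21) & 0x1FF, lin & 0x1FFFFF);

    return 0;
}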
The Translation Lookaside Buffer (TLB) is a cache for paging addresses.
Since it is a cache, it shares many of the design issues of the CPU cache, such as associativity level.
This section describes a simplified fully associative TLB with 4 single-address entries. Note that, like other caches, real TLBs are not usually fully associative.
After a translation between a linear and a physical address happens, it is stored in the TLB. For example, a 4-entry TLB starts in the following state:
  valid  linear  physical
  -----  ------  --------
> 0      00000   00000
  0      00000   00000
  0      00000   00000
  0      00000   00000
The > indicates the entry that will be replaced next.
And after a page linear address 00003 is translated to a physical address 00005, the TLB becomes:
  valid  linear  physical
  -----  ------  --------
  1      00003   00005
> 0      00000   00000
  0      00000   00000
  0      00000   00000
and after a second translation of 00007 to 00009 it becomes:
  valid  linear  physical
  -----  ------  --------
  1      00003   00005
  1      00007   00009
> 0      00000   00000
  0      00000   00000
Now if 00003 needs to be translated again, the hardware first looks it up in the TLB and finds the physical address with a single access: 00003 --> 00005.
Of course, 00000 is not in the TLB, since no valid entry contains 00000 as a key.
When the TLB fills up, older addresses are overwritten. Just like for the CPU cache, the replacement policy is a potentially complex operation, but a simple and reasonable heuristic is to remove the least recently used (LRU) entry.
With LRU, starting from state:
  valid  linear  physical
  -----  ------  --------
> 1      00003   00005
  1      00007   00009
  1      00009   00001
  1      0000B   00003
adding 0000D -> 0000A would give:
  valid  linear  physical
  -----  ------  --------
  1      0000D   0000A
> 1      00007   00009
  1      00009   00001
  1      0000B   00003
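Here is a minimal user-space C sketch of such a 4-entry fully associative TLB with LRU replacement; the struct and function names are made up for illustration, and details such as entry invalidation are ignored:
#include <stdio.h>
#include <stdint.h>

#define TLB_ENTRIES 4

/* One TLB entry: a linear -> physical page number mapping. */
struct tlb_entry {
    int valid;
    uint32_t linear;
    uint32_t physical;
    unsigned age; /* higher means used longer ago */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Look up a linear page number. Returns 1 on hit (storing the physical
 * page number in *physical), 0 on miss. */
static int tlb_lookup(uint32_t linear, uint32_t *physical)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].linear == linear) {
            *physical = tlb[i].physical;
            tlb[i].age = 0; /* most recently used */
            for (int j = 0; j < TLB_ENTRIES; j++)
                if (j != i)
                    tlb[j].age++;
            return 1;
        }
    }
    return 0; /* miss: the page tables would have to be walked */
}

/* Insert a new mapping, evicting the least recently used entry. */
static void tlb_insert(uint32_t linear, uint32_t physical)
{
    int victim = 0;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid) { victim = i; break; } /* prefer a free slot */
        if (tlb[i].age > tlb[victim].age)
            victim = i; /* otherwise evict the least recently used */
    }
    tlb[victim].valid = 1;
    tlb[victim].linear = linear;
    tlb[victim].physical = physical;
    tlb[victim].age = 0;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (i != victim && tlb[i].valid)
            tlb[i].age++;
}

int main(void)
{
    uint32_t phys;

    tlb_insert(0x00003, 0x00005);
    tlb_insert(0x00007, 0x00009);
    if (tlb_lookup(0x00003, &phys))
        printf("hit:  00003 -> %05X\n", phys); /* hit:  00003 -> 00005 */
    if (!tlb_lookup(0x00000, &phys))
        printf("miss: 00000\n");               /* miss: 00000 */
    return 0;
}
Running it reproduces the example above: a hit for 00003 and a miss for 00000.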
Using the TLB makes translation faster, because the initial translation takes one memory access per page table level, which means 2 on a simple 32-bit scheme, but 3 or 4 on 64-bit architectures.
The TLB is usually implemented as an expensive type of RAM called content-addressable memory (CAM). CAM implements an associative map in hardware, that is, a structure that, given a key (the linear address), retrieves a value (the physical address).
The mapping could also be implemented by using the key as an ordinary RAM address, but a CAM mapping may require far fewer entries than a RAM mapping.
For example, a map in which:
  • both keys and values have 20 bits (the case of a simple paging scheme)
  • at most 4 values need to be stored at any given time
could be stored in a TLB with 4 entries:
linear  physical
------  --------
00000   00001
00001   00010
00010   00011
FFFFF   00000
However, to implement this with RAM, it would be necessary to have 2^20 addresses:
linear  physical
------  --------
00000   00001
00001   00010
00010   00011
... (from 00011 to FFFFE)
FFFFF   00000
which would be even more expensive than using a TLB.
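A rough back-of-the-envelope comparison of the two approaches, assuming (just for this sketch) that each mapping entry fits into 8 bytes:
#include <stdio.h>

int main(void) {
    /* Assumption for this sketch: one entry (20-bit key + 20-bit value + flags)
     * fits into 8 bytes. */
    unsigned long entry_bytes = 8;
    unsigned long cam_entries = 4;         /* a 4-entry TLB */
    unsigned long ram_entries = 1UL << 20; /* one slot per possible 20-bit key */

    printf("CAM mapping: %lu bytes\n", cam_entries * entry_bytes); /* 32 */
    printf("RAM mapping: %lu bytes\n", ram_entries * entry_bytes); /* 8388608 */
    return 0;
}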
The Linux kernel makes extensive use of the paging features of x86 to allow fast process switches with little data fragmentation.
There are also however some features that the Linux kernel might not use, either because they are only for backwards compatibility, or because the Linux devs didn't feel it was worth it yet.
The Linux Kernel reserves two zones of virtual memory:
  • one for kernel memory
  • one for programs
The exact split is configured by CONFIG_VMSPLIT_.... By default:
  • on 32-bit:
    • the bottom 3/4 is program space: 00000000 to BFFFFFFF
    • the top 1/4 is kernel memory: C0000000 to FFFFFFFF, like this:
      ------------------ FFFFFFFF
      Kernel
      ------------------ C0000000
      ------------------ BFFFFFFF
      
      
      Process
      
      
      ------------------ 00000000
  • on 64-bit: currently only 48 bits are actually used, split into two equally sized disjoint spaces. The Linux kernel just assigns:
    • the bottom part to processes: 00000000 00000000 to 00007FFF FFFFFFFF
    • the top part to the kernel: FFFF8000 00000000 to FFFFFFFF FFFFFFFF, like this:
      ------------------ FFFFFFFF FFFFFFFF
      Kernel
      ------------------ FFFF8000 00000000


      (not addressable)


      ------------------ 00007FFF FFFFFFFF
      Process
      ------------------ 00000000 00000000
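As a sanity check, a small user-space C program can print the addresses of a few of its own objects; on a typical 48-bit x86-64 Linux build they all fall in the lower, process half of the split (the exact values vary between runs because of ASLR):
#include <stdio.h>
#include <stdlib.h>

int global_var;

int main(void) {
    int stack_var;
    void *heap_var = malloc(1);

    /* All of these addresses lie in the process half of the split,
     * i.e. at or below 00007FFF FFFFFFFF. */
    printf("code:   %p\n", (void *)main);
    printf("global: %p\n", (void *)&global_var);
    printf("heap:   %p\n", heap_var);
    printf("stack:  %p\n", (void *)&stack_var);

    free(heap_var);
    return 0;
}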
Kernel memory is also paged.
Besides a missing page, a very common source of page faults is copy-on-write (COW).
Page tables have extra flags that allow the OS to mark a page as read-only.
Those page faults only happen when a process tries to write to such a page, not when it reads from it.
When Linux forks a process:
  • instead of copying all the pages, which is unnecessarily costly, it makes the page tables of the two processes point to the same physical pages.
  • it marks those linear addresses as read-only
  • whenever one of the processes tries to write to a page, the resulting page fault makes the kernel copy the physical page and update the page tables of the two processes so that they point to two different physical pages (as the sketch below shows)
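The visible effect can be checked from user space with a sketch like the following, assuming a POSIX environment: after fork(), a write by the child leaves the parent's value untouched, because the write fault gave each process its own physical copy of the page:
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_value = 1;

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: this write triggers a COW page fault, after which the child
         * has its own private copy of the page holding shared_value. */
        shared_value = 2;
        printf("child sees  %d\n", shared_value); /* 2 */
        return 0;
    }
    wait(NULL);
    /* Parent: unaffected by the child's write. */
    printf("parent sees %d\n", shared_value);     /* 1 */
    return 0;
}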
In v4.2, look under arch/x86/:
  • include/asm/pgtable*
  • include/asm/page*
  • mm/pgtable*
  • mm/page*
There seem to be no structs defined to represent the pages, only macros: include/asm/pgtable_types.h is especially interesting. Excerpt:
#define _PAGE_BIT_PRESENT   0   /* is present */
#define _PAGE_BIT_RW        1   /* writeable */
#define _PAGE_BIT_USER      2   /* userspace addressable */
#define _PAGE_BIT_PWT       3   /* page write through */
arch/x86/include/uapi/asm/processor-flags.h defines CR0, and in particular the PG bit position:
#define X86_CR0_PG_BIT      31 /* Paging */
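As a hedged illustration of how such bit positions are used, this self-contained user-space sketch re-declares them locally and decodes the low flag bits of a made-up raw 32-bit page table entry value:
#include <stdio.h>
#include <stdint.h>

/* Re-declared locally from the kernel excerpt above. */
#define _PAGE_BIT_PRESENT 0 /* is present */
#define _PAGE_BIT_RW      1 /* writeable */
#define _PAGE_BIT_USER    2 /* userspace addressable */
#define _PAGE_BIT_PWT     3 /* page write through */

int main(void) {
    uint32_t pte = 0x00042067; /* made-up raw entry value */

    printf("present: %u\n", (pte >> _PAGE_BIT_PRESENT) & 1);
    printf("rw:      %u\n", (pte >> _PAGE_BIT_RW) & 1);
    printf("user:    %u\n", (pte >> _PAGE_BIT_USER) & 1);
    printf("pwt:     %u\n", (pte >> _PAGE_BIT_PWT) & 1);
    printf("frame:   0x%05X\n", pte >> 12); /* physical page number */
    return 0;
}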
Paging is done by the Memory Management Unit (MMU), a part of the CPU.
Like many other components (e.g. the x87 co-processor, the APIC), the MMU used to be a separate chip in the early days.
It was later integrated into the CPU, but the term MMU is still used.
Peter Cordes mentions that some architectures like MIPS leave paging almost completely in the hands of software: a TLB miss runs an OS-supplied function to walk the page tables, and insert the new mapping into the TLB. In such architectures, the OS can use whatever data structure it wants.
Intel is known to have created customized chips for very large clients.
This is mentioned e.g. at: www.theregister.com/2021/03/23/google_to_build_server_socs/
Intel is known to do custom-ish cuts of Xeons for big customers.
Those chips are then used only in the large scale server deployments of those very large clients. Google is most likely one of them, given their penchant for custom hardware.
TODO better sources.
