This is an example of how paging operates on a simplified version of a x86 architecture to implement a virtual memory space with a 20 | 12 address split (4 KiB page size).
This is how the memory could look like in a single level paging scheme:
Links   Data                    Physical address

      +-----------------------+ 2^32 - 1
      |                       |
      .                       .
      |                       |
      +-----------------------+ page0 + 4k
      | data of page 0        |
+---->+-----------------------+ page0
|     |                       |
|     .                       .
|     |                       |
|     +-----------------------+ pageN + 4k
|     | data of page N        |
|  +->+-----------------------+ pageN
|  |  |                       |
|  |  .                       .
|  |  |                       |
|  |  +-----------------------+ CR3 + 2^20 * 4
|  +--| entry[2^20-1] = pageN |
|     +-----------------------+ CR3 + 2^20 - 1 * 4
|     |                       |
|     .    many entires       .
|     |                       |
|     +-----------------------+ CR3 + 2 * 4
|  +--| entry[1] = page1      |
|  |  +-----------------------+ CR3 + 1 * 4
+-----| entry[0] = page0      |
   |  +-----------------------+ <--- CR3
   |  |                       |
   |  .                       .
   |  |                       |
   |  +-----------------------+ page1 + 4k
   |  | data of page 1        |
   +->+-----------------------+ page1
      |                       |
      .                       .
      |                       |
      +-----------------------+  0
Notice that:
  • the CR3 register points to the first entry of the page table
  • the page table is just a large array with 2^20 page table entries
  • each entry is 4 bytes big, so the array takes up 4 MiB
  • each page table contains the physical address a page
  • each page is a 4 KiB aligned 4 KiB chunk of memory that user processes may use
  • we have 2^20 table entries. Since each page is 4 KiB == 2^12, this covers the whole 4 GiB (2^32) of 32-bit memory
Suppose that the OS has setup the following page tables for process 1:
entry index   entry address       page address   present
-----------   ------------------  ------------   -------
0             CR3_1 + 0      * 4  0x00001        1
1             CR3_1 + 1      * 4  0x00000        1
2             CR3_1 + 2      * 4  0x00003        1
3             CR3_1 + 3      * 4                 0
...
2^20-1        CR3_1 + 2^20-1 * 4  0x00005        1
and for process 2:
entry index   entry address       page address   present
-----------   -----------------   ------------   -------
0             CR3_2 + 0      * 4  0x0000A        1
1             CR3_2 + 1      * 4  0x12345        1
2             CR3_2 + 2      * 4                 0
3             CR3_2 + 3      * 4  0x00003        1
...
2^20-1        CR3_2 + 2^20-1 * 4  0xFFFFF        1
Before process 1 starts running, the OS sets its cr3 to point to the page table 1 at CR3_1.
When process 1 tries to access a linear address, this is the physical addresses that will be actually accessed:
linear     physical
---------  ---------
00000 001  00001 001
00000 002  00001 002
00000 003  00001 003
00000 FFF  00001 FFF
00001 000  00000 000
00001 001  00000 001
00001 FFF  00000 FFF
00002 000  00003 000
FFFFF 000  00005 000
To switch to process 2, the OS simply sets cr3 to CR3_2, and now the following translations would happen:
linear     physical
---------  ---------
00000 002  0000A 002
00000 003  0000A 003
00000 FFF  0000A FFF
00001 000  12345 000
00001 001  12345 001
00001 FFF  12345 FFF
00004 000  00003 000
FFFFF 000  FFFFF 000
Step-by-step translation for process 1 of logical address 0x00000001 to physical address 0x00001001:
  • split the linear address into two parts:
    | page (20 bits) | offset (12 bits) |
    So in this case we would have:
    *page = 0x00000. This part must be translated to a physical location.
    *offset = 0x001. This part is added directly to the page address, and is not translated: it contains the position within the page.
  • look into Page table 1 because cr3 points to it.
  • The hardware knows that this entry is located at RAM address CR3 + 0x00000 * 4 = CR3:
    *0x00000 because the page part of the logical address is 0x00000
    *4 because that is the fixed size in bytes of every page table entry
  • since it is present, the access is valid
  • by the page table, the location of page number 0x00000 is at 0x00001 * 4K = 0x00001000.
  • to find the final physical address we just need to add the offset:
      00001 000
    + 00000 001
      ---------
      00001 001
    because 00001 is the physical address of the page looked up on the table and 001 is the offset.
    We shift 00001 by 12 bits because the pages are always aligned to 4 KiB.
    The offset is always simply added the physical address of the page.
  • the hardware then gets the memory at that physical location and puts it in a register.
Another example: for logical address 0x00001001:
  • the page part is 00001, and the offset part is 001
  • the hardware knows that its page table entry is located at RAM address: CR3 + 1 * 4 (1 because of the page part), and that is where it will look for it
  • it finds the page address 0x00000 there
  • so the final address is 0x00000 * 4k + 0x001 = 0x00000001
The same linear address can translate to different physical addresses for different processes, depending only on the value inside cr3.
Both linear addresses 00002 000 from process 1 and 00004 000 from process 2 point to the same physical address 00003 000. This is completely allowed by the hardware, and it is up to the operating system to handle such cases.
This often in normal operation because of Copy-on-write (COW), which be explained elsewhere.
Such mappings are sometime called "aliases".
FFFFF 000 points to its own physical address FFFFF 000. This kind of translation is called an "identity mapping", and can be very convenient for OS-level debugging.
The exact format of table entries is fixed by the hardware.
Each page entry can be seen as a struct with many fields.
The page table is then an array of struct.
On this simplified example, the page table entries contain only two fields:
bits   function
-----  -----------------------------------------
20     physical address of the start of the page
1      present flag
so in this example the hardware designers could have chosen the size of the page table to b 21 instead of 32 as we've used so far.
All real page table entries have other fields, notably fields to set pages to read-only for Copy-on-write. This will be explained elsewhere.
It would be impractical to align things at 21 bits since memory is addressable by bytes and not bits. Therefore, even in only 21 bits are needed in this case, hardware designers would probably choose 32 to make access faster, and just reserve bits the remaining bits for later usage. The actual value on x86 is 32 bits.
Here is a screenshot from the Intel manual image "Formats of CR3 and Paging-Structure Entries with 32-Bit Paging" showing the structure of a page table in all its glory: Figure 1. "x86 page entry format".
Figure 1.
x86 page entry format
.
The fields are explained in the manual just after.
Why are pages 4 KiB anyways?
There is a trade-off between memory wasted in:
  • page tables
  • extra padding memory within pages
This can be seen with the extreme cases:
  • if the page size were 1 byte:
    • granularity would be great, and the OS would never have to allocate unneeded padding memory
    • but the page table would have 2^32 entries, and take up the entire memory!
  • if the page size were 4 GiB:
    • we would need to swap 4 GiB to disk every time a new process becomes active
    • the page size would be a single entry, so it would take almost no memory at all
x86 designers have found that 4 KiB pages are a good middle ground.
The problem with a single-level paging scheme is that it would take up too much RAM: 4G / 4K = 1M entries per process.
If each entry is 4 bytes long, that would make 4M per process, which is too much even for a desktop computer: ps -A | wc -l says that I am running 244 processes right now, so that would take around 1GB of my RAM!
For this reason, x86 developers decided to use a multi-level scheme that reduces RAM usage.
The downside of this system is that is has a slightly higher access time, as we need to access RAM more times for each translation.
Learned readers will ask themselves: so why use an unbalanced tree instead of balanced one, which offers better asymptotic times en.wikipedia.org/wiki/Self-balancing_binary_search_tree?
Likely:
  • the maximum number of entries is small enough due to memory size limitations, that we won't waste too much memory with the root directory entry
  • different entries would have different levels, and thus different access times
  • tree rotations would likely make caching more complicated
x86's multi-level paging scheme uses a 2 level K-ary tree with 2^10 bits on each level.
Addresses are now split as:
| directory (10 bits) | table (10 bits) | offset (12 bits) |
Then:
  • the top 10 bits are used to walk the top level of the K-ary tree (level0)
    The top table is called a "directory of page tables".
    cr3 now points to the location on RAM of the page directory of the current process instead of page tables.
    Page directory entries are very similar to page table entries except that they point to the physical addresses of page tables instead of physical addresses of pages.
    Each directory entry also takes up 4 bytes, just like page entries, so that makes 4 KiB per process minimum.
    Page directory entries also contain a valid flag: if invalid, the OS does not allocate a page table for that entry, and saves memory.
    Each process has one and only one page directory associated to it (and pointed to by cr3), so it will contain at least 2^10 = 1K page directory entries, much better than the minimum 1M entries required on a single-level scheme.
  • the next 10 bits are used to walk the second level of the K-ary tree (level1)
    Second level entries are also called page tables like the single level scheme.
    Page tables are only allocated only as needed by the OS.
    Each page table has only 2^10 = 1K page table entries instead of 2^20 for the single paging scheme.
    Each process can now have up to 2^10 page tables instead of 2^20 for the single paging scheme.
  • the offset is again not used for translation, it only gives the offset within a page
One reason for using 10 bits on the first two levels (and not, say, 12 | 8 | 12 ) is that each Page Table entry is 4 bytes long. Then the 2^10 entries of Page directories and Page Tables will fit nicely into 4Kb pages. This means that it faster and simpler to allocate and deallocate pages for that purpose.
64 bits is still too much address for current RAM sizes, so most architectures will use less bits.
x86_64 uses 48 bits (256 TiB), and legacy mode's PAE already allows 52-bit addresses (4 PiB). 56-bits is a likely future candidate.
12 of those 48 bits are already reserved for the offset, which leaves 36 bits.
If a 2 level approach is taken, the best split would be two 18 bit levels.
But that would mean that the page directory would have 2^18 = 256K entries, which would take too much RAM: close to a single-level paging for 32 bit architectures!
Therefore, 64 bit architectures create even further page levels, commonly 3 or 4.
x86_64 uses 4 levels in a 9 | 9 | 9 | 9 scheme, so that the upper level only takes up only 2^9 higher level entries.
The 48 bits are split equally into two disjoint parts:
----------------- FFFFFFFF FFFFFFFF
Top half
----------------- FFFF8000 00000000


Not addressable


----------------- 00007FFF FFFFFFFF
Bottom half
----------------- 00000000 00000000
A 5-level scheme is emerging in 2016: software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf which allows 52-bit addresses with 4k pagetables.
Physical address extension.
With 32 bits, only 4GB RAM can be addressed.
This started becoming a limitation for large servers, so Intel introduced the PAE mechanism to Pentium Pro.
To relieve the problem, Intel added 4 new address lines, so that 64GB could be addressed.
Page table structure is also altered if PAE is on. The exact way in which it is altered depends on weather PSE is on or off.
PAE is turned on and off via the PAE bit of cr4.
Even if the total addressable memory is 64GB, individual process are still only able to use up to 4GB. The OS can however put different processes on different 4GB chunks.
Page size extension.
Allows for pages to be 4M (or 2M if PAE is on) in length instead of 4K.
PSE is turned on and off via the PSE bit of cr4.
If either PAE and PSE are active, different paging level schemes are used:
  • no PAE and no PSE: 10 | 10 | 12
  • no PAE and PSE: 10 | 22.
    22 is the offset within the 4Mb page, since 22 bits address 4Mb.
  • PAE and no PSE: 2 | 9 | 9 | 12
    The design reason why 9 is used twice instead of 10 is that now entries cannot fit anymore into 32 bits, which were all filled up by 20 address bits and 12 meaningful or reserved flag bits.
    The reason is that 20 bits are not enough anymore to represent the address of page tables: 24 bits are now needed because of the 4 extra wires added to the processor.
    Therefore, the designers decided to increase entry size to 64 bits, and to make them fit into a single page table it is necessary reduce the number of entries to 2^9 instead of 2^10.
    The starting 2 is a new Page level called Page Directory Pointer Table (PDPT), since it points to page directories and fill in the 32 bit linear address. PDPTs are also 64 bits wide.
    cr3 now points to PDPTs which must be on the fist four 4GB of memory and aligned on 32 bit multiples for addressing efficiency. This means that now cr3 has 27 significative bits instead of 20: 2^5 for the 32 multiples * 2^27 to complete the 2^32 of the first 4GB.
  • PAE and PSE: 2 | 9 | 21
    Designers decided to keep a 9 bit wide field to make it fit into a single page.
    This leaves 23 bits. Leaving 2 for the PDPT to keep things uniform with the PAE case without PSE leaves 21 for offset, meaning that pages are 2M wide instead of 4M.
The Translation Lookahead Buffer (TLB) is a cache for paging addresses.
Since it is a cache, it shares many of the design issues of the CPU cache, such as associativity level.
This section shall describe a simplified fully associative TLB with 4 single address entries. Note that like other caches, real TLBs are not usually fully associative.
After a translation between linear and physical address happens, it is stored on the TLB. For example, a 4 entry TLB starts in the following state:
  valid  linear  physical
  -----  ------  --------
> 0      00000   00000
  0      00000   00000
  0      00000   00000
  0      00000   00000
The > indicates the current entry to be replaced.
And after a page linear address 00003 is translated to a physical address 00005, the TLB becomes:
  valid  linear  physical
  -----  ------  --------
  1      00003   00005
> 0      00000   00000
  0      00000   00000
  0      00000   00000
and after a second translation of 00007 to 00009 it becomes:
  valid  linear  physical
  -----  ------  --------
  1      00003   00005
  1      00007   00009
> 0      00000   00000
  0      00000   00000
Now if 00003 needs to be translated again, hardware first looks up the TLB and finds out its address with a single RAM access 00003 --> 00005.
Of course, 00000 is not on the TLB since no valid entry contains 00000 as a key.
When TLB is filled up, older addresses are overwritten. Just like CPU cache, the replacement policy is a potentially complex operation, but a simple and reasonable heuristic is to remove the least recently used entry (LRU).
With LRU, starting from state:
  valid  linear  physical
  -----  ------  --------
> 1      00003   00005
  1      00007   00009
  1      00009   00001
  1      0000B   00003
adding 0000D -> 0000A would give:
  valid  linear  physical
  -----  ------  --------
  1      0000D   0000A
> 1      00007   00009
  1      00009   00001
  1      0000B   00003
Using the TLB makes translation faster, because the initial translation takes one access per TLB level, which means 2 on a simple 32 bit scheme, but 3 or 4 on 64 bit architectures.
The TLB is usually implemented as an expensive type of RAM called content-addressable memory (CAM). CAM implements an associative map on hardware, that is, a structure that given a key (linear address), retrieves a value.
Mappings could also be implemented on RAM addresses, but CAM mappings may required much less entries than a RAM mapping.
For example, a map in which:
  • both keys and values have 20 bits (the case of a simple paging schemes)
  • at most 4 values need to be stored at each time
could be stored in a TLB with 4 entries:
linear  physical
------  --------
00000   00001
00001   00010
00010   00011
FFFFF   00000
However, to implement this with RAM, it would be necessary to have 2^20 addresses:
linear  physical
------  --------
00000   00001
00001   00010
00010   00011
... (from 00011 to FFFFE)
FFFFF   00000
which would be even more expensive than using a TLB.

Pinned article: Introduction to the OurBigBook Project

Welcome to the OurBigBook Project! Our goal is to create the perfect publishing platform for STEM subjects, and get university-level students to write the best free STEM tutorials ever.
Everyone is welcome to create an account and play with the site: ourbigbook.com/go/register. We belive that students themselves can write amazing tutorials, but teachers are welcome too. You can write about anything you want, it doesn't have to be STEM or even educational. Silly test content is very welcome and you won't be penalized in any way. Just keep it legal!
We have two killer features:
  1. topics: topics group articles by different users with the same title, e.g. here is the topic for the "Fundamental Theorem of Calculus" ourbigbook.com/go/topic/fundamental-theorem-of-calculus
    Articles of different users are sorted by upvote within each article page. This feature is a bit like:
    • a Wikipedia where each user can have their own version of each article
    • a Q&A website like Stack Overflow, where multiple people can give their views on a given topic, and the best ones are sorted by upvote. Except you don't need to wait for someone to ask first, and any topic goes, no matter how narrow or broad
    This feature makes it possible for readers to find better explanations of any topic created by other writers. And it allows writers to create an explanation in a place that readers might actually find it.
    Figure 1.
    Screenshot of the "Derivative" topic page
    . View it live at: ourbigbook.com/go/topic/derivative
  2. local editing: you can store all your personal knowledge base content locally in a plaintext markup format that can be edited locally and published either:
    This way you can be sure that even if OurBigBook.com were to go down one day (which we have no plans to do as it is quite cheap to host!), your content will still be perfectly readable as a static site.
    Figure 2.
    You can publish local OurBigBook lightweight markup files to either https://OurBigBook.com or as a static website
    .
    Figure 3.
    Visual Studio Code extension installation
    .
    Figure 4.
    Visual Studio Code extension tree navigation
    .
    Figure 5.
    Web editor
    . You can also edit articles on the Web editor without installing anything locally.
    Video 3.
    Edit locally and publish demo
    . Source. This shows editing OurBigBook Markup and publishing it using the Visual Studio Code extension.
    Video 4.
    OurBigBook Visual Studio Code extension editing and navigation demo
    . Source.
  3. https://raw.githubusercontent.com/ourbigbook/ourbigbook-media/master/feature/x/hilbert-space-arrow.png
  4. Infinitely deep tables of contents:
    Figure 6.
    Dynamic article tree with infinitely deep table of contents
    .
    Descendant pages can also show up as toplevel e.g.: ourbigbook.com/cirosantilli/chordate-subclade
All our software is open source and hosted at: github.com/ourbigbook/ourbigbook
Further documentation can be found at: docs.ourbigbook.com
Feel free to reach our to us for any help or suggestions: docs.ourbigbook.com/#contact