The problem with a single-level paging scheme is that it would take up too much RAM: 4G / 4K = 1M entries _per_ process.
If each entry is 4 bytes long, that would make 4M _per process_, which is too much even for a desktop computer: ps -A | wc -l says that I am running 244 processes right now, so that would take around 1GB of my RAM!
For this reason, x86 developers decided to use a multi-level scheme that reduces RAM usage.
The downside of this system is that is has a slightly higher access time, as we need to access RAM more times for each translation.
The algorithmically minded will have noticed that paging requires associative array (like Java Map of Python dict()) abstract data structure where:
  • the keys are linear pages addresses, thus of integer type
  • the values are physical page addresses, also of integer type
The single level paging scheme uses a simple array implementation of the associative array:
  • the keys are the array index
  • this implementation is very fast in time
  • but it is too inefficient in memory
and in C pseudo-code it looks like this:
linear_address[0]      = physical_address_0
linear_address[1]      = physical_address_1
linear_address[2]      = physical_address_2
...
linear_address[2^20-1] = physical_address_N
But there another simple associative array implementation that overcomes the memory problem: an (unbalanced) k-ary tree.
A K-ary tree, is just like a binary tree, but with K children instead of 2.
Using a K-ary tree instead of an array implementation has the following trade-offs:
  • it uses way less memory
  • it is slower since we have to de-reference extra pointers
In C-pseudo code, a 2-level K-ary tree with K = 2^10 looks like this:
level0[0] = &level1_0[0]
    level1_0[0]      = physical_address_0_0
    level1_0[1]      = physical_address_0_1
    ...
    level1_0[2^10-1] = physical_address_0_N
level0[1] = &level1_1[0]
    level1_1[0]      = physical_address_1_0
    level1_1[1]      = physical_address_1_1
    ...
    level1_1[2^10-1] = physical_address_1_N
...
level0[N] = &level1_N[0]
    level1_N[0]      = physical_address_N_0
    level1_N[1]      = physical_address_N_1
    ...
    level1_N[2^10-1] = physical_address_N_N
and we have the following arrays:
  • one directory, which has 2^10 elements. Each element contains a pointer to a page table array.
  • up to 2^10 pagetable arrays. Each one has 2^10 4 byte page entries.
and it still contains 2^10 * 2^10 = 2^20 possible keys.
K-ary trees can save up a lot of space, because if we only have one key, then we only need the following arrays:
  • one directory with 2^10 entries
  • one pagetable at directory[0] with 2^10 entries
  • all other directory[i] are marked as invalid, don't point to anything, and we don't allocate pagetable for them at all
Learned readers will ask themselves: so why use an unbalanced tree instead of balanced one, which offers better asymptotic times en.wikipedia.org/wiki/Self-balancing_binary_search_tree?
Likely:
  • the maximum number of entries is small enough due to memory size limitations, that we won't waste too much memory with the root directory entry
  • different entries would have different levels, and thus different access times
  • tree rotations would likely make caching more complicated
x86's multi-level paging scheme uses a 2 level K-ary tree with 2^10 bits on each level.
Addresses are now split as:
| directory (10 bits) | table (10 bits) | offset (12 bits) |
Then:
  • the top 10 bits are used to walk the top level of the K-ary tree (level0)
    The top table is called a "directory of page tables".
    cr3 now points to the location on RAM of the page directory of the current process instead of page tables.
    Page directory entries are very similar to page table entries except that they point to the physical addresses of page tables instead of physical addresses of pages.
    Each directory entry also takes up 4 bytes, just like page entries, so that makes 4 KiB per process minimum.
    Page directory entries also contain a valid flag: if invalid, the OS does not allocate a page table for that entry, and saves memory.
    Each process has one and only one page directory associated to it (and pointed to by cr3), so it will contain at least 2^10 = 1K page directory entries, much better than the minimum 1M entries required on a single-level scheme.
  • the next 10 bits are used to walk the second level of the K-ary tree (level1)
    Second level entries are also called page tables like the single level scheme.
    Page tables are only allocated only as needed by the OS.
    Each page table has only 2^10 = 1K page table entries instead of 2^20 for the single paging scheme.
    Each process can now have up to 2^10 page tables instead of 2^20 for the single paging scheme.
  • the offset is again not used for translation, it only gives the offset within a page
One reason for using 10 bits on the first two levels (and not, say, 12 | 8 | 12 ) is that each Page Table entry is 4 bytes long. Then the 2^10 entries of Page directories and Page Tables will fit nicely into 4Kb pages. This means that it faster and simpler to allocate and deallocate pages for that purpose.
Page directory given to process by the OS:
entry index   entry address      page table address  present
-----------   ----------------   ------------------  --------
0             CR3 + 0      * 4   0x10000             1
1             CR3 + 1      * 4                       0
2             CR3 + 2      * 4   0x80000             1
3             CR3 + 3      * 4                       0
...
2^10-1        CR3 + 2^10-1 * 4                       0
Page tables given to process by the OS at PT1 = 0x10000000 (0x10000 * 4K):
entry index   entry address      page address  present
-----------   ----------------   ------------  -------
0             PT1 + 0      * 4   0x00001       1
1             PT1 + 1      * 4                 0
2             PT1 + 2      * 4   0x0000D       1
...                                  ...
2^10-1        PT1 + 2^10-1 * 4   0x00005       1
Page tables given to process by the OS at PT2 = 0x80000000 (0x80000 * 4K):
entry index   entry address     page address  present
-----------   ---------------   ------------  ------------
0             PT2 + 0     * 4   0x0000A       1
1             PT2 + 1     * 4   0x0000C       1
2             PT2 + 2     * 4                 0
...
2^10-1        PT2 + 0x3FF * 4   0x00003       1
where PT1 and PT2: initial position of page table 1 and page table 2 for process 1 on RAM.
With that setup, the following translations would happen:
linear    10 10 12 split  physical
--------  --------------  ----------
00000001  000 000 001     00001001
00001001  000 001 001     page fault
003FF001  000 3FF 001     00005001
00400000  001 000 000     page fault
00800001  002 000 001     0000A001
00801004  002 001 004     0000C004
00802004  002 002 004     page fault
00B00001  003 000 000     page fault
Let's translate the linear address 0x00801004 step by step:
  • In binary the linear address is:
    0    0    8    0    1    0    0    4
    0000 0000 1000 0000 0001 0000 0000 0100
  • Grouping as 10 | 10 | 12 gives:
    0000000010 0000000001 000000000100
    0x2        0x1        0x4
    which gives:
    page directory entry = 0x2
    page table     entry = 0x1
    offset               = 0x4
    So the hardware looks for entry 2 of the page directory.
  • The page directory table says that the page table is located at 0x80000 * 4K = 0x80000000. This is the first RAM access of the process.
    Since the page table entry is 0x1, the hardware looks at entry 1 of the page table at 0x80000000, which tells it that the physical page is located at address 0x0000C * 4K = 0x0000C000. This is the second RAM access of the process.
  • Finally, the paging hardware adds the offset, and the final address is 0x0000C004.
Page faults occur if either a page directory entry or a page table entry is not present.
The Intel manual gives a picture of this translation process in the image "Linear-Address Translation to a 4-KByte Page using 32-Bit Paging": Figure 1. "x86 page translation process"
Figure 1. x86 page translation process.