This tutorial explains the very basics of how paging works, with focus on x86, although most high level concepts will also apply to other instruction set architectures, e.g. ARM.
The goals are to:
This tutorial was extracted and expanded from this Stack Overflow answer.
Paging is implemented by the CPU hardware itself.
Paging could be implemented in software, but that would be too slow, because every single RAM memory access uses it!
Operating systems must setup and control paging by communicating to the CPU hardware. This is done mostly via:
- the CR3 register, which tells the CPU where the page table is in RAM memory
- writing the correct paging data structures to the RAM pointed to the CR3 register.Using RAM data structures is a common technique when lots of data must be transmitted to the CPU as it would cost too much to have such a large CPU register.The format of the configuration data structures is fixed by the hardware, but it is up to the OS to set up and manage those data structures on RAM correctly, and to tell the hardware where to find them (via
cr3).Then some heavy caching is done to ensure that the RAM access will be fast, in particular using the TLB.Another notable example of RAM data structure used by the CPU is the IDT which sets up interrupt handlers. - CR3 cannot be modified in ring 3. The OS runs in ring 0. See also:
- the page table structures are made invisible to the process using paging itself!
Processes can however make requests to the OS that cause the page tables to be modified, notably:
- stack size changes
brkandmmapcalls, see also: stackoverflow.com/questions/6988487/what-does-brk-system-call-do/31082353#31082353
The kernel then decides if the request will be granted or not in a controlled manner.
x86 Paging Tutorial Single level paging scheme visualization by
Ciro Santilli 40 Updated 2025-07-16
This is how the memory could look like in a single level paging scheme:
Links Data Physical address
+-----------------------+ 2^32 - 1
| |
. .
| |
+-----------------------+ page0 + 4k
| data of page 0 |
+---->+-----------------------+ page0
| | |
| . .
| | |
| +-----------------------+ pageN + 4k
| | data of page N |
| +->+-----------------------+ pageN
| | | |
| | . .
| | | |
| | +-----------------------+ CR3 + 2^20 * 4
| +--| entry[2^20-1] = pageN |
| +-----------------------+ CR3 + 2^20 - 1 * 4
| | |
| . many entires .
| | |
| +-----------------------+ CR3 + 2 * 4
| +--| entry[1] = page1 |
| | +-----------------------+ CR3 + 1 * 4
+-----| entry[0] = page0 |
| +-----------------------+ <--- CR3
| | |
| . .
| | |
| +-----------------------+ page1 + 4k
| | data of page 1 |
+->+-----------------------+ page1
| |
. .
| |
+-----------------------+ 0Notice that:
- the CR3 register points to the first entry of the page table
- the page table is just a large array with 2^20 page table entries
- each entry is 4 bytes big, so the array takes up 4 MiB
- each page table contains the physical address a page
- each page is a 4 KiB aligned 4 KiB chunk of memory that user processes may use
- we have 2^20 table entries. Since each page is 4 KiB == 2^12, this covers the whole 4 GiB (2^32) of 32-bit memory
x86 Paging Tutorial Single level paging scheme numerical translation example by
Ciro Santilli 40 Updated 2025-07-16
Suppose that the OS has setup the following page tables for process 1:and for process 2:
entry index entry address page address present
----------- ------------------ ------------ -------
0 CR3_1 + 0 * 4 0x00001 1
1 CR3_1 + 1 * 4 0x00000 1
2 CR3_1 + 2 * 4 0x00003 1
3 CR3_1 + 3 * 4 0
...
2^20-1 CR3_1 + 2^20-1 * 4 0x00005 1entry index entry address page address present
----------- ----------------- ------------ -------
0 CR3_2 + 0 * 4 0x0000A 1
1 CR3_2 + 1 * 4 0x12345 1
2 CR3_2 + 2 * 4 0
3 CR3_2 + 3 * 4 0x00003 1
...
2^20-1 CR3_2 + 2^20-1 * 4 0xFFFFF 1When process 1 tries to access a linear address, this is the physical addresses that will be actually accessed:
linear physical
--------- ---------
00000 001 00001 001
00000 002 00001 002
00000 003 00001 003
00000 FFF 00001 FFF
00001 000 00000 000
00001 001 00000 001
00001 FFF 00000 FFF
00002 000 00003 000
FFFFF 000 00005 000To switch to process 2, the OS simply sets
cr3 to CR3_2, and now the following translations would happen:linear physical
--------- ---------
00000 002 0000A 002
00000 003 0000A 003
00000 FFF 0000A FFF
00001 000 12345 000
00001 001 12345 001
00001 FFF 12345 FFF
00004 000 00003 000
FFFFF 000 FFFFF 000Step-by-step translation for process 1 of logical address
0x00000001 to physical address 0x00001001:- split the linear address into two parts:
| page (20 bits) | offset (12 bits) | - look into Page table 1 because
cr3points to it. - The hardware knows that this entry is located at RAM address
CR3 + 0x00000 * 4 = CR3:
*0x00000because the page part of the logical address is0x00000
*4because that is the fixed size in bytes of every page table entry - since it is present, the access is valid
- by the page table, the location of page number
0x00000is at0x00001 * 4K = 0x00001000. - to find the final physical address we just need to add the offset:
00001 000 + 00000 001 --------- 00001 001because00001is the physical address of the page looked up on the table and001is the offset.The offset is always simply added the physical address of the page. - the hardware then gets the memory at that physical location and puts it in a register.
Another example: for logical address
0x00001001:- the page part is
00001, and the offset part is001 - the hardware knows that its page table entry is located at RAM address:
CR3 + 1 * 4(1because of the page part), and that is where it will look for it - it finds the page address
0x00000there - so the final address is
0x00000 * 4k + 0x001 = 0x00000001
FFFFF 000 points to its own physical address FFFFF 000. This kind of translation is called an "identity mapping", and can be very convenient for OS-level debugging.The problem with a single-level paging scheme is that it would take up too much RAM: 4G / 4K = 1M entries per process.
If each entry is 4 bytes long, that would make 4M per process, which is too much even for a desktop computer:
ps -A | wc -l says that I am running 244 processes right now, so that would take around 1GB of my RAM!Learned readers will ask themselves: so why use an unbalanced tree instead of balanced one, which offers better asymptotic times en.wikipedia.org/wiki/Self-balancing_binary_search_tree?
Addresses are now split as:
| directory (10 bits) | table (10 bits) | offset (12 bits) |Then:
- The top table is called a "directory of page tables".
cr3now points to the location on RAM of the page directory of the current process instead of page tables.Page directory entries are very similar to page table entries except that they point to the physical addresses of page tables instead of physical addresses of pages.Each directory entry also takes up 4 bytes, just like page entries, so that makes 4 KiB per process minimum.Page directory entries also contain a valid flag: if invalid, the OS does not allocate a page table for that entry, and saves memory.Each process has one and only one page directory associated to it (and pointed to bycr3), so it will contain at least2^10 = 1Kpage directory entries, much better than the minimum 1M entries required on a single-level scheme. - Second level entries are also called page tables like the single level scheme.Each page table has only
2^10 = 1Kpage table entries instead of2^20for the single paging scheme. - the offset is again not used for translation, it only gives the offset within a page
One reason for using 10 bits on the first two levels (and not, say,
12 | 8 | 12 ) is that each Page Table entry is 4 bytes long. Then the 2^10 entries of Page directories and Page Tables will fit nicely into 4Kb pages. This means that it faster and simpler to allocate and deallocate pages for that purpose.x86_64 uses 48 bits (256 TiB), and legacy mode's PAE already allows 52-bit addresses (4 PiB). 56-bits is a likely future candidate.
But that would mean that the page directory would have
2^18 = 256K entries, which would take too much RAM: close to a single-level paging for 32 bit architectures!x86_64 uses 4 levels in a
9 | 9 | 9 | 9 scheme, so that the upper level only takes up only 2^9 higher level entries.The 48 bits are split equally into two disjoint parts:
----------------- FFFFFFFF FFFFFFFF
Top half
----------------- FFFF8000 00000000
Not addressable
----------------- 00007FFF FFFFFFFF
Bottom half
----------------- 00000000 00000000A 5-level scheme is emerging in 2016: software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf which allows 52-bit addresses with 4k pagetables.
Physical address extension.
With 32 bits, only 4GB RAM can be addressed.
This started becoming a limitation for large servers, so Intel introduced the PAE mechanism to Pentium Pro.
Page table structure is also altered if PAE is on. The exact way in which it is altered depends on weather PSE is on or off.
The Translation Lookahead Buffer (TLB) is a cache for paging addresses.
When TLB is filled up, older addresses are overwritten. Just like CPU cache, the replacement policy is a potentially complex operation, but a simple and reasonable heuristic is to remove the least recently used entry (LRU).
With LRU, starting from state:adding
valid linear physical
----- ------ --------
> 1 00003 00005
1 00007 00009
1 00009 00001
1 0000B 000030000D -> 0000A would give: valid linear physical
----- ------ --------
1 0000D 0000A
> 1 00007 00009
1 00009 00001
1 0000B 00003Using the TLB makes translation faster, because the initial translation takes one access per TLB level, which means 2 on a simple 32 bit scheme, but 3 or 4 on 64 bit architectures.
The TLB is usually implemented as an expensive type of RAM called content-addressable memory (CAM). CAM implements an associative map on hardware, that is, a structure that given a key (linear address), retrieves a value.
Mappings could also be implemented on RAM addresses, but CAM mappings may required much less entries than a RAM mapping.
linear physical
------ --------
00000 00001
00001 00010
00010 00011
FFFFF 00000Those page faults only happen when a process tries to write to the page, and not read from it.
When Linux forks a process:
- instead of copying all the pages, which is unnecessarily costly, it makes the page tables of the two process point to the same physical address.
- it marks those linear addresses as read-only
- whenever one of the processes tries to write to a page, the makes a copy of the physical memory, and updates the pages of the two process to point to the two different physical addresses
In
v4.2, look under arch/x86/:include/asm/pgtable*include/asm/page*mm/pgtable*mm/page*
There seems to be no structs defined to represent the pages, only macros:
include/asm/page_types.h is specially interesting. Excerpt:#define _PAGE_BIT_PRESENT 0 /* is present */
#define _PAGE_BIT_RW 1 /* writeable */
#define _PAGE_BIT_USER 2 /* userspace addressable */
#define _PAGE_BIT_PWT 3 /* page write through */ x86 Paging Tutorial Multi-level paging scheme numerical translation example by
Ciro Santilli 40 Updated 2025-07-16
Page directory given to process by the OS:
entry index entry address page table address present
----------- ---------------- ------------------ --------
0 CR3 + 0 * 4 0x10000 1
1 CR3 + 1 * 4 0
2 CR3 + 2 * 4 0x80000 1
3 CR3 + 3 * 4 0
...
2^10-1 CR3 + 2^10-1 * 4 0Page tables given to process by the OS at
PT1 = 0x10000000 (0x10000 * 4K):entry index entry address page address present
----------- ---------------- ------------ -------
0 PT1 + 0 * 4 0x00001 1
1 PT1 + 1 * 4 0
2 PT1 + 2 * 4 0x0000D 1
... ...
2^10-1 PT1 + 2^10-1 * 4 0x00005 1Page tables given to process by the OS at where
PT2 = 0x80000000 (0x80000 * 4K):entry index entry address page address present
----------- --------------- ------------ ------------
0 PT2 + 0 * 4 0x0000A 1
1 PT2 + 1 * 4 0x0000C 1
2 PT2 + 2 * 4 0
...
2^10-1 PT2 + 0x3FF * 4 0x00003 1PT1 and PT2: initial position of page table 1 and page table 2 for process 1 on RAM.With that setup, the following translations would happen:
linear 10 10 12 split physical
-------- -------------- ----------
00000001 000 000 001 00001001
00001001 000 001 001 page fault
003FF001 000 3FF 001 00005001
00400000 001 000 000 page fault
00800001 002 000 001 0000A001
00801004 002 001 004 0000C004
00802004 002 002 004 page fault
00B00001 003 000 000 page faultLet's translate the linear address
0x00801004 step by step:- In binary the linear address is:
0 0 8 0 1 0 0 4 0000 0000 1000 0000 0001 0000 0000 0100 - Grouping as
10 | 10 | 12gives:which gives:0000000010 0000000001 000000000100 0x2 0x1 0x4So the hardware looks for entry 2 of the page directory.page directory entry = 0x2 page table entry = 0x1 offset = 0x4 - The page directory table says that the page table is located at
0x80000 * 4K = 0x80000000. This is the first RAM access of the process. - Finally, the paging hardware adds the offset, and the final address is
0x0000C004.
The Intel manual gives a picture of this translation process in the image "Linear-Address Translation to a 4-KByte Page using 32-Bit Paging": Figure 1. "x86 page translation process"
x86 page translation process
. Pinned article: Introduction to the OurBigBook Project
Welcome to the OurBigBook Project! Our goal is to create the perfect publishing platform for STEM subjects, and get university-level students to write the best free STEM tutorials ever.
Everyone is welcome to create an account and play with the site: ourbigbook.com/go/register. We belive that students themselves can write amazing tutorials, but teachers are welcome too. You can write about anything you want, it doesn't have to be STEM or even educational. Silly test content is very welcome and you won't be penalized in any way. Just keep it legal!
Intro to OurBigBook
. Source. We have two killer features:
- topics: topics group articles by different users with the same title, e.g. here is the topic for the "Fundamental Theorem of Calculus" ourbigbook.com/go/topic/fundamental-theorem-of-calculusArticles of different users are sorted by upvote within each article page. This feature is a bit like:
- a Wikipedia where each user can have their own version of each article
- a Q&A website like Stack Overflow, where multiple people can give their views on a given topic, and the best ones are sorted by upvote. Except you don't need to wait for someone to ask first, and any topic goes, no matter how narrow or broad
This feature makes it possible for readers to find better explanations of any topic created by other writers. And it allows writers to create an explanation in a place that readers might actually find it.Figure 1. Screenshot of the "Derivative" topic page. View it live at: ourbigbook.com/go/topic/derivativeVideo 2. OurBigBook Web topics demo. Source. - local editing: you can store all your personal knowledge base content locally in a plaintext markup format that can be edited locally and published either:This way you can be sure that even if OurBigBook.com were to go down one day (which we have no plans to do as it is quite cheap to host!), your content will still be perfectly readable as a static site.
- to OurBigBook.com to get awesome multi-user features like topics and likes
- as HTML files to a static website, which you can host yourself for free on many external providers like GitHub Pages, and remain in full control
Figure 3. Visual Studio Code extension installation.Figure 4. Visual Studio Code extension tree navigation.Figure 5. Web editor. You can also edit articles on the Web editor without installing anything locally.Video 3. Edit locally and publish demo. Source. This shows editing OurBigBook Markup and publishing it using the Visual Studio Code extension.Video 4. OurBigBook Visual Studio Code extension editing and navigation demo. Source. - Infinitely deep tables of contents:
All our software is open source and hosted at: github.com/ourbigbook/ourbigbook
Further documentation can be found at: docs.ourbigbook.com
Feel free to reach our to us for any help or suggestions: docs.ourbigbook.com/#contact






