Using the TLB makes translation faster, because the initial translation takes one access _per TLB level_, which means 2 on a simple 32 bit scheme, but 3 or 4 on 64 bit architectures.
The TLB is usually implemented as an expensive type of RAM called content-addressable memory (CAM). CAM implements an associative map on hardware, that is, a structure that given a key (linear address), retrieves a value.
Mappings could also be implemented on RAM addresses, but CAM mappings may required much less entries than a RAM mapping.
For example, a map in which:could be stored in a TLB with 4 entries:
- both keys and values have 20 bits (the case of a simple paging schemes)
- at most 4 values need to be stored at each time
linear physical
------ --------
00000 00001
00001 00010
00010 00011
FFFFF 00000
However, to implement this with RAM, _it would be necessary to have 2^20 addresses_:which would be even more expensive than using a TLB.
linear physical
------ --------
00000 00001
00001 00010
00010 00011
... (from 00011 to FFFFE)
FFFFF 00000
For each process, the virtual address space looks like this:
------------------ 2^32 - 1
Stack (grows down)
v v v v v v v v v
------------------
(unmapped)
------------------ Maximum stack size.
(unmapped)
-------------------
mmap
-------------------
(unmapped)
-------------------
^^^^^^^^^^^^^^^^^^^
brk (grows up)
-------------------
BSS
-------------------
Data
-------------------
Text
-------------------
------------------- 0
The kernel maintains a list of pages that belong to each process, and synchronizes that with the paging.
If the program accesses memory that does not belong to it, the kernel handles a page-fault, and decides what to do:
- if it is above the maximum stack size, allocate those pages to the process
- otherwise, send a SIGSEGV to the process, which usually kills it
When an ELF file is loaded by the kernel to start a program with the
exec
system call, the kernel automatically registers text, data, BSS and stack for the program.The
brk
and mmap
areas can be modified by request of the program through the brk
and mmap
system calls. But the kernel can also deny the program those areas if there is not enough memory.brk
and mmap
can be used to implement malloc
, or the so called "heap".mmap
is also used to load dynamically loaded libraries into the program's memory so that it can access and run it.Stack allocation: stackoverflow.com/questions/17671423/stack-allocation-for-process
Calculating exact addresses Things are complicated by:
- Address Space Layout Randomization.
- the fact that environment variables, CLI arguments, and some ELF header data take up initial stack space: unix.stackexchange.com/questions/145557/how-does-stack-allocation-work-in-linux/239323#239323
Why the text does not start at 0: stackoverflow.com/questions/14795164/why-do-linux-program-text-sections-start-at-0x0804800-and-stack-tops-start-at-0
Single level paging scheme numerical translation example by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
Suppose that the OS has setup the following page tables for process 1:and for process 2:
entry index entry address page address present
----------- ------------------ ------------ -------
0 CR3_1 + 0 * 4 0x00001 1
1 CR3_1 + 1 * 4 0x00000 1
2 CR3_1 + 2 * 4 0x00003 1
3 CR3_1 + 3 * 4 0
...
2^20-1 CR3_1 + 2^20-1 * 4 0x00005 1
entry index entry address page address present
----------- ----------------- ------------ -------
0 CR3_2 + 0 * 4 0x0000A 1
1 CR3_2 + 1 * 4 0x12345 1
2 CR3_2 + 2 * 4 0
3 CR3_2 + 3 * 4 0x00003 1
...
2^20-1 CR3_2 + 2^20-1 * 4 0xFFFFF 1
Before process 1 starts running, the OS sets its
cr3
to point to the page table 1 at CR3_1
.When process 1 tries to access a linear address, this is the physical addresses that will be actually accessed:
linear physical
--------- ---------
00000 001 00001 001
00000 002 00001 002
00000 003 00001 003
00000 FFF 00001 FFF
00001 000 00000 000
00001 001 00000 001
00001 FFF 00000 FFF
00002 000 00003 000
FFFFF 000 00005 000
To switch to process 2, the OS simply sets
cr3
to CR3_2
, and now the following translations would happen:linear physical
--------- ---------
00000 002 0000A 002
00000 003 0000A 003
00000 FFF 0000A FFF
00001 000 12345 000
00001 001 12345 001
00001 FFF 12345 FFF
00004 000 00003 000
FFFFF 000 FFFFF 000
Step-by-step translation for process 1 of logical address
0x00000001
to physical address 0x00001001
:- split the linear address into two parts:
| page (20 bits) | offset (12 bits) |
So in this case we would have:
*page = 0x00000. This part must be translated to a physical location.
*offset = 0x001. This part is added directly to the page address, and is not translated: it contains the position _within_ the page. - look into Page table 1 because
cr3
points to it. - The hardware knows that this entry is located at RAM address
CR3 + 0x00000 * 4 = CR3
:
*0x00000
because the page part of the logical address is0x00000
*4
because that is the fixed size in bytes of every page table entry - since it is present, the access is valid
- by the page table, the location of page number
0x00000
is at0x00001 * 4K = 0x00001000
. - to find the final physical address we just need to add the offset:
00001 000 + 00000 001 --------- 00001 001
because00001
is the physical address of the page looked up on the table and001
is the offset.We shift00001
by 12 bits because the pages are always aligned to 4KiB.The offset is always simply added the physical address of the page. - the hardware then gets the memory at that physical location and puts it in a register.
Another example: for logical address
0x00001001
:- the page part is
00001
, and the offset part is001
- the hardware knows that its page table entry is located at RAM address:
CR3 + 1 * 4
(1
because of the page part), and that is where it will look for it - it finds the page address
0x00000
there - so the final address is
0x00000 * 4k + 0x001 = 0x00000001
The problem with single-level paging by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
The problem with a single-level paging scheme is that it would take up too much RAM: 4G / 4K = 1M entries _per_ process.
If each entry is 4 bytes long, that would make 4M _per process_, which is too much even for a desktop computer:
ps -A | wc -l
says that I am running 244 processes right now, so that would take around 1GB of my RAM!For this reason, x86 developers decided to use a multi-level scheme that reduces RAM usage.
The downside of this system is that is has a slightly higher access time, as we need to access RAM more times for each translation.
The Quantum Story by Jim Baggott (2011) page 2 mentions how newton's support for the corpuscular theory of light led it to be held for a very long time, even when evidence of the wave theory of light was becoming overwhelming.
Kavli Institute for Theoretical Physics by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
Looks cool. That's what 1k USD gets you.
Human mitochondrial molecular clock by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
This is really cool. Ciro Santilli would be tempted to participate, but his wife is not a fan, in part due to the loss of privacy of children. Maybe she is right...
Someone should implement a version of that where you can upload your privately sequenced genome and get analytics for free.
Databases and projects:
- www.ncbi.nlm.nih.gov/pmc/articles/PMC2716027/ The Knockout Mouse Project (2004)
There are unlisted articles, also show them or only show them.