x86 Paging Tutorial / Linux kernel usage

The Linux kernel makes extensive use of the paging features of x86 to allow fast process switches with little memory fragmentation.

There are, however, some paging features that the Linux kernel might not use, either because they exist only for backwards compatibility, or because the Linux developers have not considered them worth supporting yet.

Play with physical addresses in Linux

 0  0

Convert virtual addresses to physical addresses from user space with /proc/<pid>/pagemap, and from kernel space with virt_to_phys.
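As a rough illustration of the user space side, the sketch below reads one entry from /proc/<pid>/pagemap and turns it into a physical address. It assumes the entry layout described in the kernel's pagemap documentation (bit 63 is the "present" flag, bits 0-54 hold the page frame number). The function names pagemap_entry_to_phys and virt_to_phys_user are made up for this example and are not kernel or libc APIs. Recent kernels hide the frame number from readers that lack CAP_SYS_ADMIN, so expect meaningful results only as root.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed entry layout (see the kernel's pagemap documentation):
 * bit 63 = page present in RAM, bits 0-54 = page frame number (PFN). */
#define PAGEMAP_PRESENT  (1ULL << 63)
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)

/* Hypothetical helper: combine a pagemap entry with the in-page offset
 * of vaddr. Returns 0 on success, -1 if the page is not present. */
int pagemap_entry_to_phys(uint64_t entry, uintptr_t vaddr,
                          uint64_t page_size, uint64_t *paddr)
{
    if (!(entry & PAGEMAP_PRESENT))
        return -1;
    uint64_t pfn = entry & PAGEMAP_PFN_MASK;
    *paddr = pfn * page_size + (vaddr % page_size);
    return 0;
}

/* Hypothetical helper: look up vaddr of process pid in /proc/<pid>/pagemap.
 * Returns 0 on success, -1 on any failure. */
int virt_to_phys_user(pid_t pid, uintptr_t vaddr, uint64_t *paddr)
{
    char path[64];
    uint64_t entry;
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);

    snprintf(path, sizeof(path), "/proc/%ld/pagemap", (long)pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file holds one 8-byte entry per virtual page,
     * indexed by virtual page number. */
    off_t offset = (off_t)(vaddr / page_size) * (off_t)sizeof(entry);
    ssize_t n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;

    return pagemap_entry_to_phys(entry, vaddr, page_size, paddr);
}
```

A caller would pass its own pid and the address of one of its variables, for example virt_to_phys_user(getpid(), (uintptr_t)&x, &paddr), after touching x so that the page is actually present.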

Dump all page tables from userspace with /proc/<pid>/maps and /proc/<pid>/pagemap.

Read and write physical addresses from userspace with /dev/mem.
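A minimal sketch of the /dev/mem approach follows, assuming a root process and a kernel that allows the access (many distributions restrict /dev/mem, for example through CONFIG_STRICT_DEVMEM). The helper names page_base and read_phys_byte are invented for this example. It only reads, since writing to arbitrary physical memory can easily crash the machine.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: round a physical address down to the start of
 * its page. page_size must be a power of two. */
uint64_t page_base(uint64_t paddr, uint64_t page_size)
{
    return paddr & ~(page_size - 1);
}

/* Hypothetical helper: read one byte at physical address paddr through
 * /dev/mem. Returns 0 on success, -1 on failure (not root, restricted
 * range, and so on). */
int read_phys_byte(uint64_t paddr, uint8_t *out)
{
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t base = page_base(paddr, page_size);

    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0)
        return -1;

    /* mmap offsets must be page aligned, so map the whole page that
     * contains paddr and index into it. */
    uint8_t *map = mmap(NULL, page_size, PROT_READ, MAP_SHARED,
                        fd, (off_t)base);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    *out = map[paddr - base];
    munmap(map, page_size);
    return 0;
}
```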

Kernel vs process memory layout

 0  0

The Linux kernel reserves two zones of virtual memory:

one for kernel memory
one for programs

The exact split is configured by CONFIG_VMSPLIT_.... By default:

on 32-bit:

the bottom 3/4 is program space: 00000000 to BFFFFFFF

the top 1/4 is kernel memory: C0000000 to FFFFFFFF, like this:

------------------ FFFFFFFF
Kernel
------------------ C0000000
------------------ BFFFFFFF


Process


------------------ 00000000

on 64-bit: currently only 48 bits are actually used, split into two equally sized disjoint spaces. The Linux kernel just assigns:

the bottom part to processes: 00000000 00000000 to 00007FFF FFFFFFFF

the top part to the kernel: FFFF8000 00000000 to FFFFFFFF FFFFFFFF, like this:

------------------ FFFFFFFF FFFFFFFF
Kernel
------------------ FFFF8000 00000000

(not addressable)

------------------ 00007FFF FFFFFFFF
Process
------------------ 00000000 00000000
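The address arithmetic behind both splits can be sketched as below. The sketch assumes the default 3/4 to 1/4 split on 32-bit and 48-bit addressing on 64-bit, where an address is usable ("canonical") only if bits 63 to 47 are all equal, which puts the top of the process half at 00007FFF FFFFFFFF. All function and macro names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Start of kernel memory under the default 32-bit split.
 * Other CONFIG_VMSPLIT_... choices move this boundary. */
#define KERNEL_BASE_32 0xC0000000u

/* First kernel address with 48-bit addressing on 64-bit. */
#define KERNEL_BASE_64 0xFFFF800000000000ULL

bool is_kernel_address_32(uint32_t vaddr)
{
    return vaddr >= KERNEL_BASE_32;
}

/* With 48-bit addressing, bits 63..47 must all be 0 or all be 1.
 * Everything in between is the non-addressable hole. */
bool is_canonical_48(uint64_t vaddr)
{
    uint64_t top = vaddr >> 47; /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;
}

bool is_kernel_address_64(uint64_t vaddr)
{
    return vaddr >= KERNEL_BASE_64;
}

bool is_process_address_64(uint64_t vaddr)
{
    /* Canonical and in the lower half. */
    return (vaddr >> 47) == 0;
}
```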

Kernel memory is also paged.

In previous versions, the kernel's mapping of physical memory was contiguous, but with HIGHMEM this changed.

There is no clear physical memory split: stackoverflow.com/questions/30471742/physical-memory-userspace-kernel-split-on-linux-x86-64

Process memory layout

 0  0

For each process, the virtual address space looks like this:

------------------ 2^32 - 1
Stack (grows down)
v v v v v v v v v
------------------

(unmapped)

------------------ Maximum stack size.


(unmapped)


-------------------
mmap
-------------------


(unmapped)


-------------------
^^^^^^^^^^^^^^^^^^^
brk (grows up)
-------------------
BSS
-------------------
Data
-------------------
Text
-------------------

------------------- 0

The kernel maintains a list of the pages that belong to each process, and keeps the page tables in sync with that list.

If the program accesses memory that does not belong to it, the CPU raises a page fault, and the kernel handles it and decides what to do:

if the address is above the "Maximum stack size" boundary in the diagram, that is, inside the region the stack is allowed to grow into, allocate those pages to the process
otherwise, send a SIGSEGV to the process, which usually kills it
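The decision above can be modeled with a toy function. This is a deliberately simplified sketch and not how the kernel's fault handler is written: the real code looks up the memory region that contains the address and checks many more conditions. The struct, enum and function names are invented, and the stack limit is assumed to be a plain byte count such as the RLIMIT_STACK value.

```c
#include <stdint.h>

/* Possible outcomes in this toy model. */
enum fault_action { FAULT_GROW_STACK, FAULT_SIGSEGV };

/* Toy description of a process stack. */
struct toy_stack {
    uintptr_t top;      /* one past the highest stack address */
    uintptr_t max_size; /* maximum stack size in bytes */
};

/* Decide what to do with a fault at addr on an unmapped page:
 * grow the stack if addr falls between the lowest address the stack
 * may ever reach and its top, otherwise deliver SIGSEGV. */
enum fault_action toy_handle_fault(const struct toy_stack *s, uintptr_t addr)
{
    uintptr_t lowest_allowed = s->top - s->max_size;

    if (addr >= lowest_allowed && addr < s->top)
        return FAULT_GROW_STACK;
    return FAULT_SIGSEGV;
}
```

With top = C0000000 and an 8 MiB limit, a fault at BFFFF000 grows the stack, while a fault at 00001000 results in SIGSEGV.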

When an ELF file is loaded by the kernel to start a program with the exec system call, the kernel automatically registers text, data, BSS and stack for the program.

The brk and mmap areas can be modified by request of the program through the brk and mmap system calls. But the kernel can also deny the program those areas if there is not enough memory.

brk and mmap can be used to implement malloc, or the so-called "heap".

mmap is also used to load dynamic libraries into the program's memory so that it can access and run them.

Stack allocation: stackoverflow.com/questions/17671423/stack-allocation-for-process

Calculating exact addresses is complicated by:

Address Space Layout Randomization.
the fact that environment variables, CLI arguments, and some ELF header data take up initial stack space: unix.stackexchange.com/questions/145557/how-does-stack-allocation-work-in-linux/239323#239323

Why the text does not start at 0: stackoverflow.com/questions/14795164/why-do-linux-program-text-sections-start-at-0x0804800-and-stack-tops-start-at-0

Copy-on-write (COW)

 0  0

en.wikipedia.org/wiki/Copy-on-write

Besides a missing page, a very common source of page faults is copy-on-write (COW).

Page tables have extra flags that allow the OS to mark a page as read-only.

Those page faults only happen when a process tries to write to such a page, not when it reads from it.

When Linux forks a process:

instead of copying all the pages, which is unnecessarily costly, it makes the page tables of the two processes point to the same physical addresses
it marks those linear addresses as read-only
whenever one of the processes tries to write to a page, the kernel makes a copy of the physical memory, and updates the page tables so that the two processes point to two different physical addresses
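The visible effect of these steps can be observed from user space with fork. In the sketch below the child writes to a variable that it initially shares with the parent, and the parent checks that its own copy is unchanged. Note that this only demonstrates the isolation that fork guarantees. Copy-on-write is how Linux provides it cheaply, and it cannot be observed directly from this code. The function name is invented, and -1 doubles as the error value, so the example should not be used with -1 as data.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, let the child overwrite *value with child_value, and return
 * what the parent still sees at the same virtual address afterwards.
 * Returns -1 on error. */
int parent_value_after_child_write(int *value, int child_value)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the first write to a shared read-only page faults,
         * and the kernel gives this process a private copy. */
        *value = child_value;
        _exit(*value == child_value ? 0 : 1);
    }

    /* Parent: wait for the child, then read our own copy. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return *value;
}
```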

Linux source tree

 0  0

In v4.2, look under arch/x86/:

include/asm/pgtable*
include/asm/page*
mm/pgtable*
mm/page*

There seem to be no structs defined to represent the page table entries, only macros: include/asm/pgtable_types.h is especially interesting. Excerpt:

#define _PAGE_BIT_PRESENT   0   /* is present */
#define _PAGE_BIT_RW        1   /* writeable */
#define _PAGE_BIT_USER      2   /* userspace addressable */
#define _PAGE_BIT_PWT       3   /* page write through */

arch/x86/include/uapi/asm/processor-flags.h defines CR0, and in particular the PG bit position:

#define X86_CR0_PG_BIT      31 /* Paging */
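Since these macros are bit positions rather than masks, using them means shifting a 1 into place. The sketch below repeats the definitions from the excerpts above so that it compiles on its own in user space, and adds a few test helpers. The BIT_AT macro and the toy_ functions are made up for this example and are not the kernel's own accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions copied from the excerpts above. */
#define _PAGE_BIT_PRESENT   0   /* is present */
#define _PAGE_BIT_RW        1   /* writeable */
#define _PAGE_BIT_USER      2   /* userspace addressable */
#define _PAGE_BIT_PWT       3   /* page write through */
#define X86_CR0_PG_BIT      31  /* Paging */

/* Hypothetical helper: turn a bit position into a 64-bit mask. */
#define BIT_AT(n) (1ULL << (n))

/* Hypothetical helpers that test individual flags of a raw page
 * table entry value. */
bool toy_pte_present(uint64_t pte)
{
    return (pte & BIT_AT(_PAGE_BIT_PRESENT)) != 0;
}

bool toy_pte_writable(uint64_t pte)
{
    return (pte & BIT_AT(_PAGE_BIT_RW)) != 0;
}

bool toy_pte_user(uint64_t pte)
{
    return (pte & BIT_AT(_PAGE_BIT_USER)) != 0;
}

/* Paging is enabled when the PG bit of CR0 is set. */
bool toy_cr0_paging_enabled(uint64_t cr0)
{
    return (cr0 & BIT_AT(X86_CR0_PG_BIT)) != 0;
}
```

For example, an entry whose low bits are 0x007 is present, writable and accessible from user space, while 0x001 is present but read-only and accessible only to the kernel.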
