| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275 |
- From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
- From: Dave Hansen <[email protected]>
- Date: Fri, 5 Jan 2018 09:44:36 -0800
- Subject: [PATCH] x86/Documentation: Add PTI description
- MIME-Version: 1.0
- Content-Type: text/plain; charset=UTF-8
- Content-Transfer-Encoding: 8bit
- CVE-2017-5754
- Add some details about how PTI works, what some of the downsides
- are, and how to debug it when things go wrong.
- Also document the kernel parameter: 'pti/nopti'.
- Signed-off-by: Dave Hansen <[email protected]>
- Signed-off-by: Thomas Gleixner <[email protected]>
- Reviewed-by: Randy Dunlap <[email protected]>
- Reviewed-by: Kees Cook <[email protected]>
- Cc: Moritz Lipp <[email protected]>
- Cc: Daniel Gruss <[email protected]>
- Cc: Michael Schwarz <[email protected]>
- Cc: Richard Fellner <[email protected]>
- Cc: Andy Lutomirski <[email protected]>
- Cc: Linus Torvalds <[email protected]>
- Cc: Hugh Dickins <[email protected]>
- Cc: Andi Lutomirsky <[email protected]>
- Cc: [email protected]
- Link: https://lkml.kernel.org/r/[email protected]
- (cherry picked from commit 01c9b17bf673b05bb401b76ec763e9730ccf1376)
- Signed-off-by: Andy Whitcroft <[email protected]>
- Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
- (cherry picked from commit 1acf87c45b0170e717fc1b06a2d6fef47e07f79b)
- Signed-off-by: Fabian Grünbichler <[email protected]>
- ---
- Documentation/admin-guide/kernel-parameters.txt | 21 ++-
- Documentation/x86/pti.txt | 186 ++++++++++++++++++++++++
- 2 files changed, 200 insertions(+), 7 deletions(-)
- create mode 100644 Documentation/x86/pti.txt
- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
- index b4d2edf316db..1a6ebc6cdf26 100644
- --- a/Documentation/admin-guide/kernel-parameters.txt
- +++ b/Documentation/admin-guide/kernel-parameters.txt
- @@ -2677,8 +2677,6 @@
- steal time is computed, but won't influence scheduler
- behaviour
-
- - nopti [X86-64] Disable kernel page table isolation
- -
- nolapic [X86-32,APIC] Do not enable or use the local APIC.
-
- nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
- @@ -3247,11 +3245,20 @@
- pt. [PARIDE]
- See Documentation/blockdev/paride.txt.
-
- - pti= [X86_64]
- - Control user/kernel address space isolation:
- - on - enable
- - off - disable
- - auto - default setting
- + pti= [X86_64] Control Page Table Isolation of user and
- + kernel address spaces. Disabling this feature
- + removes hardening, but improves performance of
- + system calls and interrupts.
- +
- + on - unconditionally enable
- + off - unconditionally disable
- + auto - kernel detects whether your CPU model is
- + vulnerable to issues that PTI mitigates
- +
- + Not specifying this option is equivalent to pti=auto.
- +
- + nopti [X86_64]
- + Equivalent to pti=off
-
- pty.legacy_count=
- [KNL] Number of legacy pty's. Overwrites compiled-in
- diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt
- new file mode 100644
- index 000000000000..d11eff61fc9a
- --- /dev/null
- +++ b/Documentation/x86/pti.txt
- @@ -0,0 +1,186 @@
- +Overview
- +========
- +
- +Page Table Isolation (pti, previously known as KAISER[1]) is a
- +countermeasure against attacks on the shared user/kernel address
- +space such as the "Meltdown" approach[2].
- +
- +To mitigate this class of attacks, we create an independent set of
- +page tables for use only when running userspace applications. When
- +the kernel is entered via syscalls, interrupts or exceptions, the
- +page tables are switched to the full "kernel" copy. When the system
- +switches back to user mode, the user copy is used again.
- +
- +The userspace page tables contain only a minimal amount of kernel
- +data: only what is needed to enter/exit the kernel such as the
- +entry/exit functions themselves and the interrupt descriptor table
- +(IDT). There are a few strictly unnecessary things that get mapped
- +such as the first C function when entering an interrupt (see
- +comments in pti.c).
- +
- +This approach helps to ensure that side-channel attacks leveraging
- +the paging structures do not function when PTI is enabled. It can be
- +enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
- +Once enabled at compile-time, it can be disabled at boot with the
- +'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
- +
- +Page Table Management
- +=====================
- +
- +When PTI is enabled, the kernel manages two sets of page tables.
- +The first set is very similar to the single set which is present in
- +kernels without PTI. This includes a complete mapping of userspace
- +that the kernel can use for things like copy_to_user().
- +
- +Although _complete_, the user portion of the kernel page tables is
- +crippled by setting the NX bit in the top level. This ensures
- +that any missed kernel->user CR3 switch will immediately crash
- +userspace upon executing its first instruction.
- +
- +The userspace page tables map only the kernel data needed to enter
- +and exit the kernel. This data is entirely contained in the 'struct
- +cpu_entry_area' structure which is placed in the fixmap which gives
- +each CPU's copy of the area a compile-time-fixed virtual address.
- +
- +For new userspace mappings, the kernel makes the entries in its
- +page tables like normal. The only difference is when the kernel
- +makes entries in the top (PGD) level. In addition to setting the
- +entry in the main kernel PGD, a copy of the entry is made in the
- +userspace page tables' PGD.
- +
- +This sharing at the PGD level also inherently shares all the lower
- +layers of the page tables. This leaves a single, shared set of
- +userspace page tables to manage. One PTE to lock, one set of
- +accessed bits, dirty bits, etc...
- +
- +Overhead
- +========
- +
- +Protection against side-channel attacks is important. But,
- +this protection comes at a cost:
- +
- +1. Increased Memory Use
- + a. Each process now needs an order-1 PGD instead of order-0.
- + (Consumes an additional 4k per process).
- + b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
- + aligned so that it can be mapped by setting a single PMD
- + entry. This consumes nearly 2MB of RAM once the kernel
- + is decompressed, but no space in the kernel image itself.
- +
- +2. Runtime Cost
- + a. CR3 manipulation to switch between the page table copies
- + must be done at interrupt, syscall, and exception entry
- + and exit (it can be skipped when the kernel is interrupted,
- + though.) Moves to CR3 are on the order of a hundred
- + cycles, and are required at every entry and exit.
- + b. A "trampoline" must be used for SYSCALL entry. This
- + trampoline depends on a smaller set of resources than the
- + non-PTI SYSCALL entry code, so requires mapping fewer
- + things into the userspace page tables. The downside is
- + that stacks must be switched at entry time.
- + d. Global pages are disabled for all kernel structures not
- + mapped into both kernel and userspace page tables. This
- + feature of the MMU allows different processes to share TLB
- + entries mapping the kernel. Losing the feature means more
- + TLB misses after a context switch. The actual loss of
- + performance is very small, however, never exceeding 1%.
- + d. Process Context IDentifiers (PCID) is a CPU feature that
- + allows us to skip flushing the entire TLB when switching page
- + tables by setting a special bit in CR3 when the page tables
- + are changed. This makes switching the page tables (at context
- + switch, or kernel entry/exit) cheaper. But, on systems with
- + PCID support, the context switch code must flush both the user
- + and kernel entries out of the TLB. The user PCID TLB flush is
- + deferred until the exit to userspace, minimizing the cost.
- + See intel.com/sdm for the gory PCID/INVPCID details.
- + e. The userspace page tables must be populated for each new
- + process. Even without PTI, the shared kernel mappings
- + are created by copying top-level (PGD) entries into each
- + new process. But, with PTI, there are now *two* kernel
- + mappings: one in the kernel page tables that maps everything
- + and one for the entry/exit structures. At fork(), we need to
- + copy both.
- + f. In addition to the fork()-time copying, there must also
- + be an update to the userspace PGD any time a set_pgd() is done
- + on a PGD used to map userspace. This ensures that the kernel
- + and userspace copies always map the same userspace
- + memory.
- + g. On systems without PCID support, each CR3 write flushes
- + the entire TLB. That means that each syscall, interrupt
- + or exception flushes the TLB.
- + h. INVPCID is a TLB-flushing instruction which allows flushing
- + of TLB entries for non-current PCIDs. Some systems support
- + PCIDs, but do not support INVPCID. On these systems, addresses
- + can only be flushed from the TLB for the current PCID. When
- + flushing a kernel address, we need to flush all PCIDs, so a
- + single kernel address flush will require a TLB-flushing CR3
- + write upon the next use of every PCID.
- +
- +Possible Future Work
- +====================
- +1. We can be more careful about not actually writing to CR3
- + unless its value is actually changed.
- +2. Allow PTI to be enabled/disabled at runtime in addition to the
- + boot-time switching.
- +
- +Testing
- +========
- +
- +To test stability of PTI, the following test procedure is recommended,
- +ideally doing all of these in parallel:
- +
- +1. Set CONFIG_DEBUG_ENTRY=y
- +2. Run several copies of all of the tools/testing/selftests/x86/ tests
- + (excluding MPX and protection_keys) in a loop on multiple CPUs for
- + several minutes. These tests frequently uncover corner cases in the
- + kernel entry code. In general, old kernels might cause these tests
- + themselves to crash, but they should never crash the kernel.
- +3. Run the 'perf' tool in a mode (top or record) that generates many
- + frequent performance monitoring non-maskable interrupts (see "NMI"
- + in /proc/interrupts). This exercises the NMI entry/exit code which
- + is known to trigger bugs in code paths that did not expect to be
- + interrupted, including nested NMIs. Using "-c" boosts the rate of
- + NMIs, and using two -c with separate counters encourages nested NMIs
- + and less deterministic behavior.
- +
- + while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
- +
- +4. Launch a KVM virtual machine.
- +5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
- + This has been a lightly-tested code path and needs extra scrutiny.
- +
- +Debugging
- +=========
- +
- +Bugs in PTI cause a few different signatures of crashes
- +that are worth noting here.
- +
- + * Failures of the selftests/x86 code. Usually a bug in one of the
- + more obscure corners of entry_64.S
- + * Crashes in early boot, especially around CPU bringup. Bugs
- + in the trampoline code or mappings cause these.
- + * Crashes at the first interrupt. Caused by bugs in entry_64.S,
- + like screwing up a page table switch. Also caused by
- + incorrectly mapping the IRQ handler entry code.
- + * Crashes at the first NMI. The NMI code is separate from main
- + interrupt handlers and can have bugs that do not affect
- + normal interrupts. Also caused by incorrectly mapping NMI
- + code. NMIs that interrupt the entry code must be very
- + careful and can be the cause of crashes that show up when
- + running perf.
- + * Kernel crashes at the first exit to userspace. entry_64.S
- + bugs, or failing to map some of the exit code.
- + * Crashes at first interrupt that interrupts userspace. The paths
- + in entry_64.S that return to userspace are sometimes separate
- + from the ones that return to the kernel.
- + * Double faults: overflowing the kernel stack because of page
- + faults upon page faults. Caused by touching non-pti-mapped
- + data in the entry code, or forgetting to switch to kernel
- + CR3 before calling into C functions which are not pti-mapped.
- + * Userspace segfaults early in boot, sometimes manifesting
- + as mount(8) failing to mount the rootfs. These have
- + tended to be TLB invalidation issues. Usually invalidating
- + the wrong PCID, or otherwise missing an invalidation.
- +
- +1. https://gruss.cc/files/kaiser.pdf
- +2. https://meltdownattack.com/meltdown.pdf
- --
- 2.14.2
|