- From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
- From: Andy Lutomirski <[email protected]>
- Date: Mon, 9 Oct 2017 09:50:49 -0700
- Subject: [PATCH 045/242] x86/mm: Flush more aggressively in lazy TLB mode
- MIME-Version: 1.0
- Content-Type: text/plain; charset=UTF-8
- Content-Transfer-Encoding: 8bit
- 
- CVE-2017-5754
- 
- Since commit:
- 
-   94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
- 
- x86's lazy TLB mode has been all the way lazy: when running a kernel thread
- (including the idle thread), the kernel keeps using the last user mm's
- page tables without attempting to maintain user TLB coherence at all.
- From a pure semantic perspective, this is fine -- kernel threads won't
- attempt to access user pages, so having stale TLB entries doesn't matter.
- 
- Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
- we also allow any paging-structure caches that may exist on the CPU
- to become incoherent. This means that we can have a
- paging-structure cache entry that references a freed page table, and
- the CPU is within its rights to do a speculative page walk starting
- at the freed page table.
- 
- I can imagine this causing two different problems:
- 
-  - A speculative page walk starting from a bogus page table could read
-    IO addresses. I haven't seen any reports of this causing problems.
- 
-  - A speculative page walk that involves a bogus page table can install
-    garbage in the TLB. Such garbage would always be at a user VA, but
-    some AMD CPUs have logic that triggers a machine check when it notices
-    these bogus entries. I've seen a couple reports of this.
- 
- Boris further explains the failure mode:
- 
- > It is actually more of an optimization which assumes that paging-structure
- > entries are in WB DRAM:
- >
- > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
- > performance optimization that assumes PML4, PDP, PDE, and PTE entries
- > are in cacheable WB-DRAM; memory type checks may be bypassed, and
- > addresses outside of WB-DRAM may result in undefined behavior or NB
- > protocol errors. 1=Disables performance optimization and allows PML4,
- > PDP, PDE and PTE entries to be in any memory type. Operating systems
- > that maintain page tables in memory types other than WB- DRAM must set
- > TlbCacheDis to insure proper operation."
- >
- > The MCE generated is an NB protocol error to signal that
- >
- > "Link: A specific coherent-only packet from a CPU was issued to an
- > IO link. This may be caused by software which addresses page table
- > structures in a memory type other than cacheable WB-DRAM without
- > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
- > example, when page table structure addresses are above top of memory. In
- > such cases, the NB will generate an MCE if it sees a mismatch between
- > the memory operation generated by the core and the link type."
- >
- > I'm assuming coherent-only packets don't go out on IO links, thus the
- > error.
- 
- To fix this, reinstate TLB coherence in lazy mode. With this patch
- applied, we do it in one of two ways:
- 
-  - If we have PCID, we simply switch back to init_mm's page tables
-    when we enter a kernel thread -- this seems to be quite cheap
-    except for the cost of serializing the CPU.
- 
-  - If we don't have PCID, then we set a flag and switch to init_mm
-    the first time we would otherwise need to flush the TLB.
- 
- The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
- to override the default mode for benchmarking.
- 
- In theory, we could optimize this better by only flushing the TLB in
- lazy CPUs when a page table is freed. Doing that would require
- auditing the mm code to make sure that all page table freeing goes
- through tlb_remove_page() as well as reworking some data structures
- to implement the improved flush logic.
- 
- Reported-by: Markus Trippelsdorf <[email protected]>
- Reported-by: Adam Borowski <[email protected]>
- Signed-off-by: Andy Lutomirski <[email protected]>
- Signed-off-by: Borislav Petkov <[email protected]>
- Cc: Borislav Petkov <[email protected]>
- Cc: Brian Gerst <[email protected]>
- Cc: Daniel Borkmann <[email protected]>
- Cc: Eric Biggers <[email protected]>
- Cc: Johannes Hirte <[email protected]>
- Cc: Kees Cook <[email protected]>
- Cc: Kirill A. Shutemov <[email protected]>
- Cc: Linus Torvalds <[email protected]>
- Cc: Nadav Amit <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
- Cc: Rik van Riel <[email protected]>
- Cc: Roman Kagan <[email protected]>
- Cc: Thomas Gleixner <[email protected]>
- Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
- Link: http://lkml.kernel.org/r/[email protected]
- Signed-off-by: Ingo Molnar <[email protected]>
- (backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
- Signed-off-by: Andy Whitcroft <[email protected]>
- Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
- (cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
- Signed-off-by: Fabian Grünbichler <[email protected]>
- ---
- arch/x86/include/asm/mmu_context.h | 8 +-
- arch/x86/include/asm/tlbflush.h | 24 ++++++
- arch/x86/mm/tlb.c | 160 +++++++++++++++++++++++++------------
- 3 files changed, 136 insertions(+), 56 deletions(-)
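
[Editor's sketch, not part of the patch: a minimal, stand-alone user-space model of
the policy described in the commit message. "struct mm" and "struct tlb_state_model"
are simplified stand-ins for the kernel's mm_struct and cpu_tlbstate, and the function
names are invented for illustration; only the branch structure mirrors enter_lazy_tlb()
and the lazy path of flush_tlb_func_common() in the hunks below.]

#include <stdbool.h>
#include <stdio.h>

struct mm { const char *name; };
static struct mm init_mm = { "init_mm" };

struct tlb_state_model {
	struct mm *loaded_mm;
	bool is_lazy;
};

/* Stand-in for the tlb_use_lazy_mode static key; disabled when PCID exists. */
static bool tlb_use_lazy_mode = true;

/* The scheduler is switching this CPU to a kernel thread. */
static void model_enter_lazy_tlb(struct tlb_state_model *ts)
{
	if (ts->loaded_mm == &init_mm)
		return;
	if (tlb_use_lazy_mode)
		ts->is_lazy = true;		/* keep the old CR3, defer the switch */
	else
		ts->loaded_mm = &init_mm;	/* no PCID benefit: switch away now */
}

/* A TLB flush request arrives for the lazily-loaded mm. */
static void model_flush(struct tlb_state_model *ts)
{
	if (ts->loaded_mm == &init_mm)
		return;				/* nothing user-visible is loaded */
	if (ts->is_lazy) {
		/* Switching to init_mm also drops the paging-structure caches. */
		ts->loaded_mm = &init_mm;
		ts->is_lazy = false;
		return;
	}
	/* ...otherwise the normal per-mm flush logic would run here. */
}

int main(void)
{
	struct mm user_mm = { "user_mm" };
	struct tlb_state_model ts = { &user_mm, false };

	model_enter_lazy_tlb(&ts);
	model_flush(&ts);
	printf("loaded_mm=%s is_lazy=%d\n", ts.loaded_mm->name, ts.is_lazy);
	return 0;
}
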
- diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
- index c120b5db178a..3c856a15b98e 100644
- --- a/arch/x86/include/asm/mmu_context.h
- +++ b/arch/x86/include/asm/mmu_context.h
- @@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
- DEBUG_LOCKS_WARN_ON(preemptible());
- }
-
- -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
- -{
- - int cpu = smp_processor_id();
- -
- - if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
- - cpumask_clear_cpu(cpu, mm_cpumask(mm));
- -}
- +void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
-
- static inline int init_new_context(struct task_struct *tsk,
- struct mm_struct *mm)
- diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
- index d23e61dc0640..6533da3036c9 100644
- --- a/arch/x86/include/asm/tlbflush.h
- +++ b/arch/x86/include/asm/tlbflush.h
- @@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
- #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
- #endif
-
- +/*
- + * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
- + * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
- + * it's false, then we immediately switch CR3 when entering a kernel thread.
- + */
- +DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
- +
- /*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- @@ -104,6 +111,23 @@ struct tlb_state {
- u16 loaded_mm_asid;
- u16 next_asid;
-
- + /*
- + * We can be in one of several states:
- + *
- + * - Actively using an mm. Our CPU's bit will be set in
- + * mm_cpumask(loaded_mm) and is_lazy == false;
- + *
- + * - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
- + * will not be set in mm_cpumask(&init_mm) and is_lazy == false.
- + *
- + * - Lazily using a real mm. loaded_mm != &init_mm, our bit
- + * is set in mm_cpumask(loaded_mm), but is_lazy == true.
- + * We're heuristically guessing that the CR3 load we
- + * skipped more than makes up for the overhead added by
- + * lazy mode.
- + */
- + bool is_lazy;
- +
- /*
- * Access to this CR4 shadow and to H/W CR4 is protected by
- * disabling interrupts when modifying either one.
- diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
- index 440400316c8a..b27aceaf7ed1 100644
- --- a/arch/x86/mm/tlb.c
- +++ b/arch/x86/mm/tlb.c
- @@ -30,6 +30,8 @@
-
- atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
-
- +DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
- +
- static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
- u16 *new_asid, bool *need_flush)
- {
- @@ -80,7 +82,7 @@ void leave_mm(int cpu)
- return;
-
- /* Warn if we're not lazy. */
- - WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
- + WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
-
- switch_mm(NULL, &init_mm, NULL);
- }
- @@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- __flush_tlb_all();
- }
- #endif
- + this_cpu_write(cpu_tlbstate.is_lazy, false);
-
- if (real_prev == next) {
- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
- next->context.ctx_id);
-
- - if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
- - /*
- - * There's nothing to do: we weren't lazy, and we
- - * aren't changing our mm. We don't need to flush
- - * anything, nor do we need to update CR3, CR4, or
- - * LDTR.
- - */
- - return;
- - }
- -
- - /* Resume remote flushes and then read tlb_gen. */
- - cpumask_set_cpu(cpu, mm_cpumask(next));
- - next_tlb_gen = atomic64_read(&next->context.tlb_gen);
- -
- - if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
- - next_tlb_gen) {
- - /*
- - * Ideally, we'd have a flush_tlb() variant that
- - * takes the known CR3 value as input. This would
- - * be faster on Xen PV and on hypothetical CPUs
- - * on which INVPCID is fast.
- - */
- - this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
- - next_tlb_gen);
- - write_cr3(build_cr3(next, prev_asid));
- -
- - /*
- - * This gets called via leave_mm() in the idle path
- - * where RCU functions differently. Tracing normally
- - * uses RCU, so we have to call the tracepoint
- - * specially here.
- - */
- - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- - TLB_FLUSH_ALL);
- - }
- -
- /*
- - * We just exited lazy mode, which means that CR4 and/or LDTR
- - * may be stale. (Changes to the required CR4 and LDTR states
- - * are not reflected in tlb_gen.)
- + * We don't currently support having a real mm loaded without
- + * our cpu set in mm_cpumask(). We have all the bookkeeping
- + * in place to figure out whether we would need to flush
- + * if our cpu were cleared in mm_cpumask(), but we don't
- + * currently use it.
- */
- + if (WARN_ON_ONCE(real_prev != &init_mm &&
- + !cpumask_test_cpu(cpu, mm_cpumask(next))))
- + cpumask_set_cpu(cpu, mm_cpumask(next));
- +
- + return;
- } else {
- u16 new_asid;
- bool need_flush;
- @@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- }
-
- /* Stop remote flushes for the previous mm */
- - if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
- - cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
- -
- - VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
- + VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
- + real_prev != &init_mm);
- + cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
- /*
- * Start remote flushes and then read tlb_gen.
- @@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- switch_ldt(real_prev, next);
- }
-
- +/*
- + * enter_lazy_tlb() is a hint from the scheduler that we are entering a
- + * kernel thread or other context without an mm. Acceptable implementations
- + * include doing nothing whatsoever, switching to init_mm, or various clever
- + * lazy tricks to try to minimize TLB flushes.
- + *
- + * The scheduler reserves the right to call enter_lazy_tlb() several times
- + * in a row. It will notify us that we're going back to a real mm by
- + * calling switch_mm_irqs_off().
- + */
- +void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
- +{
- + if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
- + return;
- +
- + if (static_branch_unlikely(&tlb_use_lazy_mode)) {
- + /*
- + * There's a significant optimization that may be possible
- + * here. We have accurate enough TLB flush tracking that we
- + * don't need to maintain coherence of TLB per se when we're
- + * lazy. We do, however, need to maintain coherence of
- + * paging-structure caches. We could, in principle, leave our
- + * old mm loaded and only switch to init_mm when
- + * tlb_remove_page() happens.
- + */
- + this_cpu_write(cpu_tlbstate.is_lazy, true);
- + } else {
- + switch_mm(NULL, &init_mm, NULL);
- + }
- +}
- +
- /*
- * Call this when reinitializing a CPU. It fixes the following potential
- * problems:
- @@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
- /* This code cannot presently handle being reentered. */
- VM_WARN_ON(!irqs_disabled());
-
- + if (unlikely(loaded_mm == &init_mm))
- + return;
- +
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
- loaded_mm->context.ctx_id);
-
- - if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
- + if (this_cpu_read(cpu_tlbstate.is_lazy)) {
- /*
- - * We're in lazy mode -- don't flush. We can get here on
- - * remote flushes due to races and on local flushes if a
- - * kernel thread coincidentally flushes the mm it's lazily
- - * still using.
- + * We're in lazy mode. We need to at least flush our
- + * paging-structure cache to avoid speculatively reading
- + * garbage into our TLB. Since switching to init_mm is barely
- + * slower than a minimal flush, just switch to init_mm.
- */
- + switch_mm_irqs_off(NULL, &init_mm, NULL);
- return;
- }
-
- @@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
- return 0;
- }
- late_initcall(create_tlb_single_page_flush_ceiling);
- +
- +static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
- + size_t count, loff_t *ppos)
- +{
- + char buf[2];
- +
- + buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
- + buf[1] = '\n';
- +
- + return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
- +}
- +
- +static ssize_t tlblazy_write_file(struct file *file,
- + const char __user *user_buf, size_t count, loff_t *ppos)
- +{
- + bool val;
- +
- + if (kstrtobool_from_user(user_buf, count, &val))
- + return -EINVAL;
- +
- + if (val)
- + static_branch_enable(&tlb_use_lazy_mode);
- + else
- + static_branch_disable(&tlb_use_lazy_mode);
- +
- + return count;
- +}
- +
- +static const struct file_operations fops_tlblazy = {
- + .read = tlblazy_read_file,
- + .write = tlblazy_write_file,
- + .llseek = default_llseek,
- +};
- +
- +static int __init init_tlb_use_lazy_mode(void)
- +{
- + if (boot_cpu_has(X86_FEATURE_PCID)) {
- + /*
- + * Heuristic: with PCID on, switching to and from
- + * init_mm is reasonably fast, but remote flush IPIs
- + * as expensive as ever, so turn off lazy TLB mode.
- + *
- + * We can't do this in setup_pcid() because static keys
- + * haven't been initialized yet, and it would blow up
- + * badly.
- + */
- + static_branch_disable(&tlb_use_lazy_mode);
- + }
- +
- + debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
- + arch_debugfs_dir, NULL, &fops_tlblazy);
- + return 0;
- +}
- +late_initcall(init_tlb_use_lazy_mode);
- --
- 2.14.2
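
[Editor's usage note, not part of the patch: the debugfs switch mentioned in the commit
message can be flipped for benchmarking with a small user-space helper along these
lines. It assumes debugfs is mounted at /sys/kernel/debug, root privileges, and a kernel
carrying this backport; the helper name tlblazy-toggle.c is purely illustrative.]

/* tlblazy-toggle.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TLB_LAZY_PATH "/sys/kernel/debug/x86/tlb_use_lazy_mode"

int main(int argc, char **argv)
{
	char cur;
	int fd = open(TLB_LAZY_PATH, O_RDWR);

	if (fd < 0) {
		perror("open " TLB_LAZY_PATH);
		return 1;
	}

	if (argc > 1) {
		/* The write handler parses kstrtobool-style input such as "0" or "1". */
		if (write(fd, argv[1], strlen(argv[1])) < 0) {
			perror("write");
			return 1;
		}
		lseek(fd, 0, SEEK_SET);
	}

	/* The read handler returns '0' or '1' followed by a newline. */
	if (read(fd, &cur, 1) == 1)
		printf("tlb_use_lazy_mode = %c\n", cur);

	close(fd);
	return 0;
}

For example, "./tlblazy-toggle 0" forces the immediate switch to init_mm on entry to
kernel threads, while "./tlblazy-toggle 1" re-enables the lazy flag-based mode.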