0045-x86-mm-Flush-more-aggressively-in-lazy-TLB-mode.patch

From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <[email protected]>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH 045/242] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.
From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent.  This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB-DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.
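
For example, forcing eager CR3 switching for a benchmarking run might look
like the following (an illustrative session, not part of the patch; it
assumes debugfs is mounted at /sys/kernel/debug, requires root because the
file is created with mode 0600, and the value read back depends on whether
the CPU has PCID):

  # cat /sys/kernel/debug/x86/tlb_use_lazy_mode
  1
  # echo 0 > /sys/kernel/debug/x86/tlb_use_lazy_mode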

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <[email protected]>
Reported-by: Adam Borowski <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Johannes Hirte <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Kagan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <[email protected]>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 * - Actively using an mm.  Our CPU's bit will be set in
+	 *   mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 * - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *   will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 * - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *   is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *   We're heuristically guessing that the CR3 load we
+	 *   skipped more than makes up for the overhead added by
+	 *   lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm. We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input. This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently. Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale. (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask(). We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm. Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row. It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here. We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy. We do, however, need to maintain coherence of
+		 * paging-structure caches. We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU. It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush. We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode. We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB. Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
+		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
--
2.14.2