020-v6.1-01-mm-x86-arm64-add-arch_has_hw_pte_young.patch 17 KB

  1. From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
  2. From: Yu Zhao <[email protected]>
  3. Date: Sun, 18 Sep 2022 01:59:58 -0600
  4. Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
  5. MIME-Version: 1.0
  6. Content-Type: text/plain; charset=UTF-8
  7. Content-Transfer-Encoding: 8bit
  8. Patch series "Multi-Gen LRU Framework", v14.
  9. What's new
  10. ==========
  11. 1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
  12. Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
  13. 2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
  14. machines. The old direct reclaim backoff, which tries to enforce a
  15. minimum fairness among all eligible memcgs, over-swapped by about
  16. (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
  17. pulls the plug on swapping once the target is met, trades some
  18. fairness for curtailed latency:
  19. https://lore.kernel.org/r/[email protected]/
  20. 3. Fixed minor build warnings and conflicts. More comments and nits.
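As a rough illustration of the over-swap estimate in item 2, the sketch
below evaluates (total_mem>>DEF_PRIORITY)-nr_to_reclaim. It assumes that
total_mem and nr_to_reclaim are both counted in pages and that
DEF_PRIORITY is 12, as in mainline; overswap_estimate() is a hypothetical
helper written for this example, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the mainline value of DEF_PRIORITY. */
#define DEF_PRIORITY 12

/*
 * Hypothetical helper: pages swapped beyond the reclaim target by the
 * old backoff, following the estimate quoted in the cover letter.
 * Both arguments are assumed to be in pages.
 */
static uint64_t overswap_estimate(uint64_t total_mem, uint64_t nr_to_reclaim)
{
	uint64_t scanned = total_mem >> DEF_PRIORITY;

	/* A reclaim request is assumed to have a nonzero target. */
	assert(nr_to_reclaim > 0);

	/* Clamp at zero when the target exceeds the first scan window. */
	return scanned > nr_to_reclaim ? scanned - nr_to_reclaim : 0;
}
```

Under these assumptions, a machine with 2^30 pages (4 TiB of 4 KiB pages)
and a 32-page target gives (2^30 >> 12) - 32 = 262112 pages, i.e., roughly
1 GiB swapped beyond the target.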
  21. TLDR
  22. ====
  23. The current page reclaim is too expensive in terms of CPU usage and it
  24. often makes poor choices about what to evict. This patchset offers an
  25. alternative solution that is performant, versatile and
  26. straightforward.
  27. Patchset overview
  28. =================
  29. The design and implementation overview is in patch 14:
  30. https://lore.kernel.org/r/[email protected]/
  31. 01. mm: x86, arm64: add arch_has_hw_pte_young()
  32. 02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  33. Take advantage of hardware features when trying to clear the accessed
  34. bit in many PTEs.
  35. 03. mm/vmscan.c: refactor shrink_node()
  36. 04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
  37. its sole caller"
  38. Minor refactors to improve readability for the following patches.
  39. 05. mm: multi-gen LRU: groundwork
  40. Adds the basic data structure and the functions that insert pages to
  41. and remove pages from the multi-gen LRU (MGLRU) lists.
  42. 06. mm: multi-gen LRU: minimal implementation
  43. A minimal implementation without optimizations.
  44. 07. mm: multi-gen LRU: exploit locality in rmap
  45. Exploits spatial locality to improve efficiency when using the rmap.
  46. 08. mm: multi-gen LRU: support page table walks
  47. Further exploits spatial locality by optionally scanning page tables.
  48. 09. mm: multi-gen LRU: optimize multiple memcgs
  49. Optimizes the overall performance for multiple memcgs running mixed
  50. types of workloads.
  51. 10. mm: multi-gen LRU: kill switch
  52. Adds a kill switch to enable or disable MGLRU at runtime.
  53. 11. mm: multi-gen LRU: thrashing prevention
  54. 12. mm: multi-gen LRU: debugfs interface
  55. Provide userspace with features like thrashing prevention, working set
  56. estimation and proactive reclaim.
  57. 13. mm: multi-gen LRU: admin guide
  58. 14. mm: multi-gen LRU: design doc
  59. Add an admin guide and a design doc.
  60. Benchmark results
  61. =================
  62. Independent lab results
  63. -----------------------
  64. Based on the popularity of searches [01] and the memory usage in
  65. Google's public cloud, the most popular open-source memory-hungry
  66. applications, in alphabetical order, are:
  67. Apache Cassandra      Memcached
  68. Apache Hadoop         MongoDB
  69. Apache Spark          PostgreSQL
  70. MariaDB (MySQL)       Redis
  71. An independent lab evaluated MGLRU with the most widely used benchmark
  72. suites for the above applications. They posted 960 data points along
  73. with kernel metrics and perf profiles collected over more than 500
  74. hours of total benchmark time. Their final reports show that, with 95%
  75. confidence intervals (CIs), the above applications all performed
  76. significantly better for at least part of their benchmark matrices.
  77. On 5.14:
  78. 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
  79. less wall time to sort three billion random integers, respectively,
  80. under the medium- and the high-concurrency conditions, when
  81. overcommitting memory. There were no statistically significant
  82. changes in wall time for the rest of the benchmark matrix.
  83. 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
  84. more transactions per minute (TPM), respectively, under the medium-
  85. and the high-concurrency conditions, when overcommitting memory.
  86. There were no statistically significant changes in TPM for the rest
  87. of the benchmark matrix.
  88. 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
  89. and [21.59, 30.02]% more operations per second (OPS), respectively,
  90. for sequential access, random access and Gaussian (distribution)
  91. access, when THP=always; 95% CIs [13.85, 15.97]% and
  92. [23.94, 29.92]% more OPS, respectively, for random access and
  93. Gaussian access, when THP=never. There were no statistically
  94. significant changes in OPS for the rest of the benchmark matrix.
  95. 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
  96. [2.16, 3.55]% more operations per second (OPS), respectively, for
  97. exponential (distribution) access, random access and Zipfian
  98. (distribution) access, when underutilizing memory; 95% CIs
  99. [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
  100. respectively, for exponential access, random access and Zipfian
  101. access, when overcommitting memory.
  102. On 5.15:
  103. 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
  104. and [4.11, 7.50]% more operations per second (OPS), respectively,
  105. for exponential (distribution) access, random access and Zipfian
  106. (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
  107. [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
  108. exponential access, random access and Zipfian access, when swap was
  109. on.
  110. 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
  111. less average wall time to finish twelve parallel TeraSort jobs,
  112. respectively, under the medium- and the high-concurrency
  113. conditions, when swap was on. There were no statistically
  114. significant changes in average wall time for the rest of the
  115. benchmark matrix.
  116. 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
  117. minute (TPM) under the high-concurrency condition, when swap was
  118. off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
  119. respectively, under the medium- and the high-concurrency
  120. conditions, when swap was on. There were no statistically
  121. significant changes in TPM for the rest of the benchmark matrix.
  122. 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
  123. [11.47, 19.36]% more total operations per second (OPS),
  124. respectively, for sequential access, random access and Gaussian
  125. (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
  126. [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
  127. for sequential access, random access and Gaussian access, when
  128. THP=never.
  129. Our lab results
  130. ---------------
  131. To supplement the above results, we ran the following benchmark suites
  132. on 5.16-rc7 and found no regressions [10].
  133. fs_fio_bench_hdd_mq      pft
  134. fs_lmbench               pgsql-hammerdb
  135. fs_parallelio            redis
  136. fs_postmark              stream
  137. hackbench                sysbenchthread
  138. kernbench                tpcc_spark
  139. memcached                unixbench
  140. multichase               vm-scalability
  141. mutilate                 will-it-scale
  142. nginx
  143. [01] https://trends.google.com
  144. [02] https://lore.kernel.org/r/[email protected]/
  145. [03] https://lore.kernel.org/r/[email protected]/
  146. [04] https://lore.kernel.org/r/[email protected]/
  147. [05] https://lore.kernel.org/r/[email protected]/
  148. [06] https://lore.kernel.org/r/[email protected]/
  149. [07] https://lore.kernel.org/r/[email protected]/
  150. [08] https://lore.kernel.org/r/[email protected]/
  151. [09] https://lore.kernel.org/r/[email protected]/
  152. [10] https://lore.kernel.org/r/[email protected]/
  153. Real-world applications
  154. =======================
  155. Third-party testimonials
  156. ------------------------
  157. Konstantin reported [11]:
  158. I have Archlinux with 8G RAM + zswap + swap. While developing, I
  159. have lots of apps opened such as multiple LSP-servers for different
  160. langs, chats, two browsers, etc... Usually, my system gets quickly
  161. to a point of SWAP-storms, where I have to kill LSP-servers,
  162. restart browsers to free memory, etc, otherwise the system lags
  163. heavily and is barely usable.
  164. 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
  165. patchset, and I started up by opening lots of apps to create memory
  166. pressure, and worked for a day like this. Till now I had not a
  167. single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
  168. getting to the point of 3G in SWAP before without a single
  169. SWAP-storm.
  170. Vaibhav from IBM reported [12]:
  171. In a synthetic MongoDB Benchmark, seeing an average of ~19%
  172. throughput improvement on POWER10(Radix MMU + 64K Page Size) with
  173. MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
  174. three different request distributions, namely, Exponential, Uniform
  175. and Zipfian.
  176. Shuang from U of Rochester reported [13]:
  177. With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
  178. and [9.26, 10.36]% higher throughput, respectively, for random
  179. access, Zipfian (distribution) access and Gaussian (distribution)
  180. access, when the average number of jobs per CPU is 1; 95% CIs
  181. [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
  182. throughput, respectively, for random access, Zipfian access and
  183. Gaussian access, when the average number of jobs per CPU is 2.
  184. Daniel from Michigan Tech reported [14]:
  185. With Memcached allocating ~100GB of byte-addressable Optane,
  186. performance improvement in terms of throughput (measured as queries
  187. per second) was about 10% for a series of workloads.
  188. Large-scale deployments
  189. -----------------------
  190. We've rolled out MGLRU to tens of millions of ChromeOS users and
  191. about a million Android users. Google's fleetwide profiling [15] shows
  192. an overall 40% decrease in kswapd CPU usage, in addition to
  193. improvements in other UX metrics, e.g., an 85% decrease in the number
  194. of low-memory kills at the 75th percentile and an 18% decrease in
  195. app launch time at the 50th percentile.
  196. The downstream kernels that have been using MGLRU include:
  197. 1. Android [16]
  198. 2. Arch Linux Zen [17]
  199. 3. Armbian [18]
  200. 4. ChromeOS [19]
  201. 5. Liquorix [20]
  202. 6. OpenWrt [21]
  203. 7. post-factum [22]
  204. 8. XanMod [23]
  205. [11] https://lore.kernel.org/r/[email protected]/
  206. [12] https://lore.kernel.org/r/[email protected]/
  207. [13] https://lore.kernel.org/r/[email protected]/
  208. [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
  209. [15] https://dl.acm.org/doi/10.1145/2749469.2750392
  210. [16] https://android.com
  211. [17] https://archlinux.org
  212. [18] https://armbian.com
  213. [19] https://chromium.org
  214. [20] https://liquorix.net
  215. [21] https://openwrt.org
  216. [22] https://codeberg.org/pf-kernel
  217. [23] https://xanmod.org
  218. Summary
  219. =======
  220. The facts are:
  221. 1. The independent lab results and the real-world applications
  222. indicate substantial improvements; there are no known regressions.
  223. 2. Thrashing prevention, working set estimation and proactive reclaim
  224. work out of the box; there are no equivalent solutions.
  225. 3. There is a lot of new code; no smaller changes have been
  226. demonstrated to have similar effects.
  227. Our options, accordingly, are:
  228. 1. Given the amount of evidence, the reported improvements will likely
  229. materialize for a wide range of workloads.
  230. 2. Gauging the interest from the past discussions, the new features
  231. will likely be put to use for both personal computers and data
  232. centers.
  233. 3. Based on Google's track record, the new code will likely be well
  234. maintained in the long term. It'd be more difficult if not
  235. impossible to achieve similar effects with other approaches.
  236. This patch (of 14):
  237. Some architectures automatically set the accessed bit in PTEs, e.g., x86
  238. and arm64 v8.2. On architectures that do not have this capability,
  239. clearing the accessed bit in a PTE usually triggers a page fault following
  240. the TLB miss of this PTE (to emulate the accessed bit).
  241. Being aware of this capability can help make better decisions, e.g.,
  242. whether to spread the work out over a period of time to reduce bursty page
  243. faults when trying to clear the accessed bit in many PTEs.
  244. Note that theoretically this capability can be unreliable, e.g.,
  245. hotplugged CPUs might be different from builtin ones. Therefore it should
  246. not be used in architecture-independent code that involves correctness,
  247. e.g., to determine whether TLB flushes are required (in combination with
  248. the accessed bit).
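A minimal userspace sketch of how this capability might be consumed as a
performance hint follows. The #ifndef fallback mirrors the generic stub
added by this patch; young_clear_batch() and its batch size of 64 are
hypothetical and exist only to illustrate spreading the work out when old
PTEs fault. The result is used only to size work, never to decide
correctness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Generic fallback: an architecture header would define
 * arch_has_hw_pte_young before this point to override it.
 */
#ifndef arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
	return false;
}
#endif

/*
 * Hypothetical policy helper: how many PTEs to clear the accessed bit
 * in during one pass. With a hardware-managed bit, clearing is cheap,
 * so the whole range is handled at once. Otherwise each cleared PTE may
 * cost a later page fault, so the work is capped per pass.
 */
static size_t young_clear_batch(size_t nr_ptes, bool hw_young)
{
	const size_t soft_batch = 64; /* illustrative value only */

	assert(soft_batch > 0);

	if (hw_young)
		return nr_ptes;

	return nr_ptes < soft_batch ? nr_ptes : soft_batch;
}
```

A caller would pass arch_has_hw_pte_young() as hw_young; with the fallback
stub above, young_clear_batch(1000, arch_has_hw_pte_young()) is capped at
64.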
  249. Link: https://lkml.kernel.org/r/[email protected]
  250. Link: https://lkml.kernel.org/r/[email protected]
  251. Signed-off-by: Yu Zhao <[email protected]>
  252. Reviewed-by: Barry Song <[email protected]>
  253. Acked-by: Brian Geffon <[email protected]>
  254. Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
  255. Acked-by: Oleksandr Natalenko <[email protected]>
  256. Acked-by: Steven Barrett <[email protected]>
  257. Acked-by: Suleiman Souhlal <[email protected]>
  258. Acked-by: Will Deacon <[email protected]>
  259. Tested-by: Daniel Byrne <[email protected]>
  260. Tested-by: Donald Carr <[email protected]>
  261. Tested-by: Holger Hoffstätte <[email protected]>
  262. Tested-by: Konstantin Kharlamov <[email protected]>
  263. Tested-by: Shuang Zhai <[email protected]>
  264. Tested-by: Sofia Trinh <[email protected]>
  265. Tested-by: Vaibhav Jain <[email protected]>
  266. Cc: Andi Kleen <[email protected]>
  267. Cc: Aneesh Kumar K.V <[email protected]>
  268. Cc: Catalin Marinas <[email protected]>
  269. Cc: Dave Hansen <[email protected]>
  270. Cc: Hillf Danton <[email protected]>
  271. Cc: Jens Axboe <[email protected]>
  272. Cc: Johannes Weiner <[email protected]>
  273. Cc: Jonathan Corbet <[email protected]>
  274. Cc: Linus Torvalds <[email protected]>
  275. Cc: [email protected]
  276. Cc: Matthew Wilcox <[email protected]>
  277. Cc: Mel Gorman <[email protected]>
  278. Cc: Michael Larabel <[email protected]>
  279. Cc: Michal Hocko <[email protected]>
  280. Cc: Mike Rapoport <[email protected]>
  281. Cc: Peter Zijlstra <[email protected]>
  282. Cc: Tejun Heo <[email protected]>
  283. Cc: Vlastimil Babka <[email protected]>
  284. Cc: Miaohe Lin <[email protected]>
  285. Cc: Mike Rapoport <[email protected]>
  286. Cc: Qi Zheng <[email protected]>
  287. Signed-off-by: Andrew Morton <[email protected]>
  288. ---
  289. arch/arm64/include/asm/pgtable.h | 14 ++------------
  290. arch/x86/include/asm/pgtable.h | 6 +++---
  291. include/linux/pgtable.h | 13 +++++++++++++
  292. mm/memory.c | 14 +-------------
  293. 4 files changed, 19 insertions(+), 28 deletions(-)
  294. --- a/arch/arm64/include/asm/pgtable.h
  295. +++ b/arch/arm64/include/asm/pgtable.h
  296. @@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
  297. * page after fork() + CoW for pfn mappings. We don't always have a
  298. * hardware-managed access flag on arm64.
  299. */
  300. -static inline bool arch_faults_on_old_pte(void)
  301. -{
  302. - WARN_ON(preemptible());
  303. -
  304. - return !cpu_has_hw_af();
  305. -}
  306. -#define arch_faults_on_old_pte arch_faults_on_old_pte
  307. +#define arch_has_hw_pte_young cpu_has_hw_af
  308. /*
  309. * Experimentally, it's cheap to set the access flag in hardware and we
  310. * benefit from prefaulting mappings as 'old' to start with.
  311. */
  312. -static inline bool arch_wants_old_prefaulted_pte(void)
  313. -{
  314. - return !arch_faults_on_old_pte();
  315. -}
  316. -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
  317. +#define arch_wants_old_prefaulted_pte cpu_has_hw_af
  318. #endif /* !__ASSEMBLY__ */
  319. --- a/arch/x86/include/asm/pgtable.h
  320. +++ b/arch/x86/include/asm/pgtable.h
  321. @@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
  322. return boot_cpu_has_bug(X86_BUG_L1TF);
  323. }
  324. -#define arch_faults_on_old_pte arch_faults_on_old_pte
  325. -static inline bool arch_faults_on_old_pte(void)
  326. +#define arch_has_hw_pte_young arch_has_hw_pte_young
  327. +static inline bool arch_has_hw_pte_young(void)
  328. {
  329. - return false;
  330. + return true;
  331. }
  332. #endif /* __ASSEMBLY__ */
  333. --- a/include/linux/pgtable.h
  334. +++ b/include/linux/pgtable.h
  335. @@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young
  336. #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
  337. #endif
  338. +#ifndef arch_has_hw_pte_young
  339. +/*
  340. + * Return whether the accessed bit is supported on the local CPU.
  341. + *
  342. + * This stub assumes accessing through an old PTE triggers a page fault.
  343. + * Architectures that automatically set the accessed bit should override it.
  344. + */
  345. +static inline bool arch_has_hw_pte_young(void)
  346. +{
  347. + return false;
  348. +}
  349. +#endif
  350. +
  351. #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
  352. static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
  353. unsigned long address,
  354. --- a/mm/memory.c
  355. +++ b/mm/memory.c
  356. @@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
  357. 2;
  358. #endif
  359. -#ifndef arch_faults_on_old_pte
  360. -static inline bool arch_faults_on_old_pte(void)
  361. -{
  362. - /*
  363. - * Those arches which don't have hw access flag feature need to
  364. - * implement their own helper. By default, "true" means pagefault
  365. - * will be hit on old pte.
  366. - */
  367. - return true;
  368. -}
  369. -#endif
  370. -
  371. #ifndef arch_wants_old_prefaulted_pte
  372. static inline bool arch_wants_old_prefaulted_pte(void)
  373. {
  374. @@ -2782,7 +2770,7 @@ static inline bool cow_user_page(struct
  375. * On architectures with software "accessed" bits, we would
  376. * take a double page fault, so mark it accessed here.
  377. */
  378. - if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
  379. + if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
  380. pte_t entry;
  381. vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);