- From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
- From: Yu Zhao <[email protected]>
- Date: Sun, 18 Sep 2022 01:59:58 -0600
- Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
- MIME-Version: 1.0
- Content-Type: text/plain; charset=UTF-8
- Content-Transfer-Encoding: 8bit
- Patch series "Multi-Gen LRU Framework", v14.
- What's new
- ==========
- 1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
- Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
- 2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
- machines. The old direct reclaim backoff, which tries to enforce a
- minimum fairness among all eligible memcgs, over-swapped by about
- (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
- pulls the plug on swapping once the target is met, trades some
- fairness for curtailed latency:
- https://lore.kernel.org/r/[email protected]/
- 3. Fixed minor build warnings and conflicts. More comments and nits.
- TLDR
- ====
- The current page reclaim is too expensive in terms of CPU usage and it
- often makes poor choices about what to evict. This patchset offers an
- alternative solution that is performant, versatile and
- straightforward.
- Patchset overview
- =================
- The design and implementation overview is in patch 14:
- https://lore.kernel.org/r/[email protected]/
- 01. mm: x86, arm64: add arch_has_hw_pte_young()
- 02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
- Take advantage of hardware features when trying to clear the accessed
- bit in many PTEs.
- 03. mm/vmscan.c: refactor shrink_node()
- 04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
- its sole caller"
- Minor refactors to improve readability for the following patches.
- 05. mm: multi-gen LRU: groundwork
- Adds the basic data structure and the functions that insert pages to
- and remove pages from the multi-gen LRU (MGLRU) lists.
- 06. mm: multi-gen LRU: minimal implementation
- A minimal implementation without optimizations.
- 07. mm: multi-gen LRU: exploit locality in rmap
- Exploits spatial locality to improve efficiency when using the rmap.
- 08. mm: multi-gen LRU: support page table walks
- Further exploits spatial locality by optionally scanning page tables.
- 09. mm: multi-gen LRU: optimize multiple memcgs
- Optimizes the overall performance for multiple memcgs running mixed
- types of workloads.
- 10. mm: multi-gen LRU: kill switch
- Adds a kill switch to enable or disable MGLRU at runtime.
- 11. mm: multi-gen LRU: thrashing prevention
- 12. mm: multi-gen LRU: debugfs interface
- Provide userspace with features like thrashing prevention, working set
- estimation and proactive reclaim.
- 13. mm: multi-gen LRU: admin guide
- 14. mm: multi-gen LRU: design doc
- Add an admin guide and a design doc.
- Benchmark results
- =================
- Independent lab results
- -----------------------
- Based on the popularity of searches [01] and the memory usage in
- Google's public cloud, the most popular open-source memory-hungry
- applications, in alphabetical order, are:
- Apache Cassandra
- Apache Hadoop
- Apache Spark
- MariaDB (MySQL)
- Memcached
- MongoDB
- PostgreSQL
- Redis
- An independent lab evaluated MGLRU with the most widely used benchmark
- suites for the above applications. They posted 960 data points along
- with kernel metrics and perf profiles collected over more than 500
- hours of total benchmark time. Their final reports show that, with 95%
- confidence intervals (CIs), the above applications all performed
- significantly better for at least part of their benchmark matrices.
- On 5.14:
- 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
- less wall time to sort three billion random integers, respectively,
- under the medium- and the high-concurrency conditions, when
- overcommitting memory. There were no statistically significant
- changes in wall time for the rest of the benchmark matrix.
- 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
- more transactions per minute (TPM), respectively, under the medium-
- and the high-concurrency conditions, when overcommitting memory.
- There were no statistically significant changes in TPM for the rest
- of the benchmark matrix.
- 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
- and [21.59, 30.02]% more operations per second (OPS), respectively,
- for sequential access, random access and Gaussian (distribution)
- access, when THP=always; 95% CIs [13.85, 15.97]% and
- [23.94, 29.92]% more OPS, respectively, for random access and
- Gaussian access, when THP=never. There were no statistically
- significant changes in OPS for the rest of the benchmark matrix.
- 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
- [2.16, 3.55]% more operations per second (OPS), respectively, for
- exponential (distribution) access, random access and Zipfian
- (distribution) access, when underutilizing memory; 95% CIs
- [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
- respectively, for exponential access, random access and Zipfian
- access, when overcommitting memory.
- On 5.15:
- 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
- and [4.11, 7.50]% more operations per second (OPS), respectively,
- for exponential (distribution) access, random access and Zipfian
- (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
- [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
- exponential access, random access and Zipfian access, when swap was
- on.
- 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
- less average wall time to finish twelve parallel TeraSort jobs,
- respectively, under the medium- and the high-concurrency
- conditions, when swap was on. There were no statistically
- significant changes in average wall time for the rest of the
- benchmark matrix.
- 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
- minute (TPM) under the high-concurrency condition, when swap was
- off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
- respectively, under the medium- and the high-concurrency
- conditions, when swap was on. There were no statistically
- significant changes in TPM for the rest of the benchmark matrix.
- 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
- [11.47, 19.36]% more total operations per second (OPS),
- respectively, for sequential access, random access and Gaussian
- (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
- [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
- for sequential access, random access and Gaussian access, when
- THP=never.
- Our lab results
- ---------------
- To supplement the above results, we ran the following benchmark suites
- on 5.16-rc7 and found no regressions [10].
- fs_fio_bench_hdd_mq
- fs_lmbench
- fs_parallelio
- fs_postmark
- hackbench
- kernbench
- memcached
- multichase
- mutilate
- nginx
- pft
- pgsql-hammerdb
- redis
- stream
- sysbenchthread
- tpcc_spark
- unixbench
- vm-scalability
- will-it-scale
- [01] https://trends.google.com
- [02] https://lore.kernel.org/r/[email protected]/
- [03] https://lore.kernel.org/r/[email protected]/
- [04] https://lore.kernel.org/r/[email protected]/
- [05] https://lore.kernel.org/r/[email protected]/
- [06] https://lore.kernel.org/r/[email protected]/
- [07] https://lore.kernel.org/r/[email protected]/
- [08] https://lore.kernel.org/r/[email protected]/
- [09] https://lore.kernel.org/r/[email protected]/
- [10] https://lore.kernel.org/r/[email protected]/
- Real-world applications
- =======================
- Third-party testimonials
- ------------------------
- Konstantin reported [11]:
- I have Archlinux with 8G RAM + zswap + swap. While developing, I
- have lots of apps opened such as multiple LSP-servers for different
- langs, chats, two browsers, etc... Usually, my system gets quickly
- to a point of SWAP-storms, where I have to kill LSP-servers,
- restart browsers to free memory, etc, otherwise the system lags
- heavily and is barely usable.
- 1.5 days ago I migrated from the 5.11.15 kernel to 5.12 + the LRU
- patchset, and I started up by opening lots of apps to create memory
- pressure, and worked for a day like this. Till now I had not a
- single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
- getting to the point of 3G in SWAP before without a single
- SWAP-storm.
- Vaibhav from IBM reported [12]:
- In a synthetic MongoDB Benchmark, seeing an average of ~19%
- throughput improvement on POWER10(Radix MMU + 64K Page Size) with
- MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
- three different request distributions, namely, Exponential, Uniform
- and Zipfian.
- Shuang from U of Rochester reported [13]:
- With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
- and [9.26, 10.36]% higher throughput, respectively, for random
- access, Zipfian (distribution) access and Gaussian (distribution)
- access, when the average number of jobs per CPU is 1; 95% CIs
- [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
- throughput, respectively, for random access, Zipfian access and
- Gaussian access, when the average number of jobs per CPU is 2.
- Daniel from Michigan Tech reported [14]:
- With Memcached allocating ~100GB of byte-addressable Optane,
- performance improvement in terms of throughput (measured as queries
- per second) was about 10% for a series of workloads.
- Large-scale deployments
- -----------------------
- We've rolled out MGLRU to tens of millions of ChromeOS users and
- about a million Android users. Google's fleetwide profiling [15] shows
- an overall 40% decrease in kswapd CPU usage, in addition to
- improvements in other UX metrics, e.g., an 85% decrease in the number
- of low-memory kills at the 75th percentile and an 18% decrease in
- app launch time at the 50th percentile.
- The downstream kernels that have been using MGLRU include:
- 1. Android [16]
- 2. Arch Linux Zen [17]
- 3. Armbian [18]
- 4. ChromeOS [19]
- 5. Liquorix [20]
- 6. OpenWrt [21]
- 7. post-factum [22]
- 8. XanMod [23]
- [11] https://lore.kernel.org/r/[email protected]/
- [12] https://lore.kernel.org/r/[email protected]/
- [13] https://lore.kernel.org/r/[email protected]/
- [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
- [15] https://dl.acm.org/doi/10.1145/2749469.2750392
- [16] https://android.com
- [17] https://archlinux.org
- [18] https://armbian.com
- [19] https://chromium.org
- [20] https://liquorix.net
- [21] https://openwrt.org
- [22] https://codeberg.org/pf-kernel
- [23] https://xanmod.org
- Summary
- =======
- The facts are:
- 1. The independent lab results and the real-world applications
- indicate substantial improvements; there are no known regressions.
- 2. Thrashing prevention, working set estimation and proactive reclaim
- work out of the box; there are no equivalent solutions.
- 3. There is a lot of new code; no smaller changes have demonstrated
- similar effects.
- Our conclusions, accordingly, are:
- 1. Given the amount of evidence, the reported improvements will likely
- materialize for a wide range of workloads.
- 2. Gauging the interest from the past discussions, the new features
- will likely be put to use for both personal computers and data
- centers.
- 3. Based on Google's track record, the new code will likely be well
- maintained in the long term. It'd be more difficult if not
- impossible to achieve similar effects with other approaches.
- This patch (of 14):
- Some architectures automatically set the accessed bit in PTEs, e.g., x86
- and arm64 v8.2. On architectures that do not have this capability,
- clearing the accessed bit in a PTE usually triggers a page fault following
- the TLB miss of this PTE (to emulate the accessed bit).
- Being aware of this capability can help make better decisions, e.g.,
- whether to spread the work out over a period of time to reduce bursty page
- faults when trying to clear the accessed bit in many PTEs.
- Note that theoretically this capability can be unreliable, e.g.,
- hotplugged CPUs might be different from builtin ones. Therefore it should
- not be used in architecture-independent code that involves correctness,
- e.g., to determine whether TLB flushes are required (in combination with
- the accessed bit).
- Link: https://lkml.kernel.org/r/[email protected]
- Link: https://lkml.kernel.org/r/[email protected]
- Signed-off-by: Yu Zhao <[email protected]>
- Reviewed-by: Barry Song <[email protected]>
- Acked-by: Brian Geffon <[email protected]>
- Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
- Acked-by: Oleksandr Natalenko <[email protected]>
- Acked-by: Steven Barrett <[email protected]>
- Acked-by: Suleiman Souhlal <[email protected]>
- Acked-by: Will Deacon <[email protected]>
- Tested-by: Daniel Byrne <[email protected]>
- Tested-by: Donald Carr <[email protected]>
- Tested-by: Holger Hoffstätte <[email protected]>
- Tested-by: Konstantin Kharlamov <[email protected]>
- Tested-by: Shuang Zhai <[email protected]>
- Tested-by: Sofia Trinh <[email protected]>
- Tested-by: Vaibhav Jain <[email protected]>
- Cc: Andi Kleen <[email protected]>
- Cc: Aneesh Kumar K.V <[email protected]>
- Cc: Catalin Marinas <[email protected]>
- Cc: Dave Hansen <[email protected]>
- Cc: Hillf Danton <[email protected]>
- Cc: Jens Axboe <[email protected]>
- Cc: Johannes Weiner <[email protected]>
- Cc: Jonathan Corbet <[email protected]>
- Cc: Linus Torvalds <[email protected]>
- Cc: [email protected]
- Cc: Matthew Wilcox <[email protected]>
- Cc: Mel Gorman <[email protected]>
- Cc: Michael Larabel <[email protected]>
- Cc: Michal Hocko <[email protected]>
- Cc: Mike Rapoport <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
- Cc: Tejun Heo <[email protected]>
- Cc: Vlastimil Babka <[email protected]>
- Cc: Miaohe Lin <[email protected]>
- Cc: Mike Rapoport <[email protected]>
- Cc: Qi Zheng <[email protected]>
- Signed-off-by: Andrew Morton <[email protected]>
- ---
- arch/arm64/include/asm/pgtable.h | 14 ++------------
- arch/x86/include/asm/pgtable.h | 6 +++---
- include/linux/pgtable.h | 13 +++++++++++++
- mm/memory.c | 14 +-------------
- 4 files changed, 19 insertions(+), 28 deletions(-)
- --- a/arch/arm64/include/asm/pgtable.h
- +++ b/arch/arm64/include/asm/pgtable.h
- @@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
- * page after fork() + CoW for pfn mappings. We don't always have a
- * hardware-managed access flag on arm64.
- */
- -static inline bool arch_faults_on_old_pte(void)
- -{
- - WARN_ON(preemptible());
- -
- - return !cpu_has_hw_af();
- -}
- -#define arch_faults_on_old_pte arch_faults_on_old_pte
- +#define arch_has_hw_pte_young cpu_has_hw_af
-
- /*
- * Experimentally, it's cheap to set the access flag in hardware and we
- * benefit from prefaulting mappings as 'old' to start with.
- */
- -static inline bool arch_wants_old_prefaulted_pte(void)
- -{
- - return !arch_faults_on_old_pte();
- -}
- -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
- +#define arch_wants_old_prefaulted_pte cpu_has_hw_af
-
- #endif /* !__ASSEMBLY__ */
-
- --- a/arch/x86/include/asm/pgtable.h
- +++ b/arch/x86/include/asm/pgtable.h
- @@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
- return boot_cpu_has_bug(X86_BUG_L1TF);
- }
-
- -#define arch_faults_on_old_pte arch_faults_on_old_pte
- -static inline bool arch_faults_on_old_pte(void)
- +#define arch_has_hw_pte_young arch_has_hw_pte_young
- +static inline bool arch_has_hw_pte_young(void)
- {
- - return false;
- + return true;
- }
-
- #endif /* __ASSEMBLY__ */
- --- a/include/linux/pgtable.h
- +++ b/include/linux/pgtable.h
- @@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young
- #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- #endif
-
- +#ifndef arch_has_hw_pte_young
- +/*
- + * Return whether the accessed bit is supported on the local CPU.
- + *
- + * This stub assumes accessing through an old PTE triggers a page fault.
- + * Architectures that automatically set the access bit should overwrite it.
- + */
- +static inline bool arch_has_hw_pte_young(void)
- +{
- + return false;
- +}
- +#endif
- +
- #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
- static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
- unsigned long address,
- --- a/mm/memory.c
- +++ b/mm/memory.c
- @@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
- 2;
- #endif
-
- -#ifndef arch_faults_on_old_pte
- -static inline bool arch_faults_on_old_pte(void)
- -{
- - /*
- - * Those arches which don't have hw access flag feature need to
- - * implement their own helper. By default, "true" means pagefault
- - * will be hit on old pte.
- - */
- - return true;
- -}
- -#endif
- -
- #ifndef arch_wants_old_prefaulted_pte
- static inline bool arch_wants_old_prefaulted_pte(void)
- {
- @@ -2782,7 +2770,7 @@ static inline bool cow_user_page(struct
- * On architectures with software "accessed" bits, we would
- * take a double page fault, so mark it accessed here.
- */
- - if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
- + if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
- pte_t entry;
-
- vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);