From 22d407b164ff79de42d21f37d99f9ee7abdd51c8 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 21 Mar 2024 09:36:35 -0700 Subject: lib: add allocation tagging support for memory allocation profiling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled. CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation. Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default. [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title] Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU] Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h] Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()] Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com Signed-off-by: Suren Baghdasaryan Co-developed-by: Kent Overstreet Signed-off-by: Kent Overstreet Signed-off-by: Klara Modin Tested-by: Kees Cook Cc: Alexander Viro Cc: Alex Gaynor Cc: Alice Ryhl Cc: Andreas Hindborg Cc: Benno Lossin Cc: "Björn Roy Baron" Cc: Boqun Feng Cc: Christoph Lameter Cc: Dennis Zhou Cc: Gary Guo Cc: Miguel Ojeda Cc: Pasha Tatashin Cc: Peter Zijlstra Cc: Tejun Heo Cc: Vlastimil Babka Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- Documentation/admin-guide/sysctl/vm.rst | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index c59889de122b..e86c968a7a0e 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm: - legacy_va_layout - lowmem_reserve_ratio - max_map_count +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y) - memory_failure_early_kill - memory_failure_recovery - min_free_kbytes @@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation. The default value is 65530. +mem_profiling +============== + +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y) + +1: Enable memory profiling. + +0: Disable memory profiling. + +Enabling memory profiling introduces a small performance overhead for all +memory allocations. + +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. + + memory_failure_early_kill: ========================== -- cgit v1.2.3 From 353dc187840100fabeb946bb9573bc5ca5e04fcb Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Wed, 6 Mar 2024 18:13:28 +0800 Subject: docs: hugetlbpage.rst: add hugetlb migration description Add some description of the hugetlb migration strategy. 
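A minimal usage sketch for the mem_profiling sysctl documented above; it assumes a kernel built with CONFIG_MEM_ALLOC_PROFILING=y and CONFIG_MEM_ALLOC_PROFILING_DEBUG=n, and the /proc/allocinfo output format is not reproduced here::

   # enable memory allocation profiling at runtime
   echo 1 > /proc/sys/vm/mem_profiling
   # skim the per-callsite allocation information exposed by the
   # "alloc_tags" codetag type
   head /proc/allocinfo
   # disable it again once done
   echo 0 > /proc/sys/vm/mem_profiling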
Link: https://lkml.kernel.org/r/63fb16e7a4ebc5cb69ce655af86e29b2d8e9ba34.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Oscar Salvador Cc: David Hildenbrand Cc: Miaohe Lin Cc: Michal Hocko Cc: Muchun Song Cc: Naoya Horiguchi Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/hugetlbpage.rst | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index e4d4b4a8dc97..f34a0d798d5b 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -376,6 +376,13 @@ Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted. +The hugetlb may be migrated between the per-node hugepages pool in the following +scenarios: memory offline, memory failure, longterm pinning, syscalls(mbind, +migrate_pages and move_pages), alloc_contig_range() and alloc_contig_pages(). +Now only memory offline, memory failure and syscalls allow fallbacking to allocate +a new hugetlb on a different node if the current node is unable to allocate during +hugetlb migration, that means these 3 cases can break the per-node hugepages pool. + .. _using_huge_pages: Using Huge Pages -- cgit v1.2.3 From 4dc7d37370951fe86216f03a4e0a6909f9b90a8c Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 26 Mar 2024 17:10:31 +0000 Subject: remove references to page->flags in documentation Mostly rewording, but remove entirely the copy of page_fixed_fake_head() in the documentation; we can refer people to the actual source if necessary. Link: https://lkml.kernel.org/r/20240326171045.410737-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 4 ++-- Documentation/mm/vmemmap_dedup.rst | 22 +--------------------- .../translations/zh_CN/core-api/cachetlb.rst | 2 +- mm/migrate.c | 2 +- mm/rmap.c | 4 ++-- 5 files changed, 7 insertions(+), 27 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index ca7d9402f6be..46110e6a31bb 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -300,14 +300,14 @@ When oom event notifier is registered, event will be delivered. Lock order is as follows:: - Page lock (PG_locked bit of page->flags) + folio_lock mm->page_table_lock or split pte_lock folio_memcg_lock (memcg->move_lock) mapping->i_pages lock lruvec->lru_lock. Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by -lruvec->lru_lock; PG_lru bit of page->flags is cleared before +lruvec->lru_lock; the folio LRU flag is cleared before isolating a page from its LRU under lruvec->lru_lock. .. _cgroup-v1-memory-kernel-extension: diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst index 593ede6d314b..b4a55b6569fa 100644 --- a/Documentation/mm/vmemmap_dedup.rst +++ b/Documentation/mm/vmemmap_dedup.rst @@ -180,27 +180,7 @@ this correctly. There is only **one** head ``struct page``, the tail ``struct page`` with ``PG_head`` are fake head ``struct page``. 
We need an approach to distinguish between those two different types of ``struct page`` so that ``compound_head()`` can return the real head ``struct page`` when the -parameter is the tail ``struct page`` but with ``PG_head``. The following code -snippet describes how to distinguish between real and fake head ``struct page``. - -.. code-block:: c - - if (test_bit(PG_head, &page->flags)) { - unsigned long head = READ_ONCE(page[1].compound_head); - - if (head & 1) { - if (head == (unsigned long)page + 1) - /* head struct page */ - else - /* tail struct page */ - } else { - /* head struct page */ - } - } - -We can safely access the field of the **page[1]** with ``PG_head`` because the -page is a compound page composed with at least two contiguous pages. -The implementation refers to ``page_fixed_fake_head()``. +parameter is the tail ``struct page`` but with ``PG_head``. Device DAX ========== diff --git a/Documentation/translations/zh_CN/core-api/cachetlb.rst b/Documentation/translations/zh_CN/core-api/cachetlb.rst index b4a76ec75daa..64295c61d1c1 100644 --- a/Documentation/translations/zh_CN/core-api/cachetlb.rst +++ b/Documentation/translations/zh_CN/core-api/cachetlb.rst @@ -260,7 +260,7 @@ HyperSparc cpu就是这样一个具有这种属性的cpu。 如果D-cache别名不是一个问题,这个程序可以简单地定义为该架构上 的nop。 - 在page->flags (PG_arch_1)中有一个位是“架构私有”。内核保证, + 在folio->flags (PG_arch_1)中有一个位是“架构私有”。内核保证, 对于分页缓存的页面,当这样的页面第一次进入分页缓存时,它将清除 这个位。 diff --git a/mm/migrate.c b/mm/migrate.c index a31aa75d223d..285072bca29c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -113,7 +113,7 @@ bool isolate_movable_page(struct page *page, isolate_mode_t mode) if (!mops->isolate_page(&folio->page, mode)) goto out_no_isolated; - /* Driver shouldn't use PG_isolated bit of page->flags */ + /* Driver shouldn't use the isolated flag */ WARN_ON_ONCE(folio_test_isolated(folio)); folio_set_isolated(folio); folio_unlock(folio); diff --git a/mm/rmap.c b/mm/rmap.c index d52759aa3ff7..5ee9e338d09b 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -23,7 +23,7 @@ * inode->i_rwsem (while writing or truncating, not reading or faulting) * mm->mmap_lock * mapping->invalidate_lock (in filemap_fault) - * page->flags PG_locked (lock_page) + * folio_lock * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below) * vma_start_write * mapping->i_mmap_rwsem @@ -50,7 +50,7 @@ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex) * vma_lock (hugetlb specific lock for pmd_sharing) * mapping->i_mmap_rwsem (also used for hugetlb pmd sharing) - * page->flags PG_locked (lock_page) + * folio_lock */ #include -- cgit v1.2.3 From ba42b524a0408b5f92bd41edaee1ea84309ab9ae Mon Sep 17 00:00:00 2001 From: York Jasper Niebuhr Date: Fri, 29 Mar 2024 15:56:05 +0100 Subject: mm: init_mlocked_on_free_v3 Implements the "init_mlocked_on_free" boot option. When this boot option is enabled, any mlock'ed pages are zeroed on free. If the pages are munlock'ed beforehand, no initialization takes place. This boot option is meant to combat the performance hit of "init_on_free" as reported in commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options"). With "init_mlocked_on_free=1" only relevant data is freed while everything else is left untouched by the kernel. Correspondingly, this patch introduces no performance hit for unmapping non-mlock'ed memory. The unmapping overhead for purely mlocked memory was measured to be approximately 13%. Realistically, most systems mlock only a fraction of the total memory so the real-world system overhead should be close to zero. 
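A hedged sketch of exercising the option, assuming a kernel carrying this patch (the exact dmesg wording follows the report_meminit() change further down)::

   # add to the kernel command line and reboot
   init_mlocked_on_free=1

   # after boot, confirm the mode was picked up; the summary line is
   # expected to contain something like "mlocked free:on"
   dmesg | grep 'mem auto-init'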
Optimally, userspace programs clear any key material or other confidential memory before exit and munlock the according memory regions. If a program crashes, userspace key managers fail to do this job. Accordingly, no munlock operations are performed so the data is caught and zeroed by the kernel. Should the program not crash, all memory will ideally be munlocked so no overhead is caused. CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable "init_mlocked_on_free" by default. Link: https://lkml.kernel.org/r/20240329145605.149917-1-yjnworkstation@gmail.com Signed-off-by: York Jasper Niebuhr Cc: Matthew Wilcox (Oracle) Cc: York Jasper Niebuhr Cc: Kees Cook Signed-off-by: Andrew Morton --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++ include/linux/mm.h | 9 +++++- mm/internal.h | 1 + mm/memory.c | 6 ++++ mm/mm_init.c | 43 +++++++++++++++++++++---- mm/page_alloc.c | 2 +- security/Kconfig.hardening | 15 +++++++++ 7 files changed, 73 insertions(+), 9 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 902ecd92a29f..3ff97de349da 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2148,6 +2148,12 @@ Format: 0 | 1 Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON. + init_mlocked_on_free= [MM] Fill freed userspace memory with zeroes if + it was mlock'ed and not explicitly munlock'ed + afterwards. + Format: 0 | 1 + Default set by CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON + init_pkru= [X86] Specify the default memory protection keys rights register contents for all processes. 0x55555554 by default (disallow access to all but pkey 0). Can diff --git a/include/linux/mm.h b/include/linux/mm.h index 2d5e492ef57f..4f4e460d7853 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3762,7 +3762,14 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_FREE_DEFAULT_ON, init_on_free); static inline bool want_init_on_free(void) { return static_branch_maybe(CONFIG_INIT_ON_FREE_DEFAULT_ON, - &init_on_free); + &init_on_free); +} + +DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON, init_mlocked_on_free); +static inline bool want_init_mlocked_on_free(void) +{ + return static_branch_maybe(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON, + &init_mlocked_on_free); } extern bool _debug_pagealloc_enabled_early; diff --git a/mm/internal.h b/mm/internal.h index 6614ba4ca9de..cf7799e29391 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -506,6 +506,7 @@ extern void __putback_isolated_page(struct page *page, unsigned int order, extern void memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order); extern void __free_pages_core(struct page *page, unsigned int order); +extern void kernel_init_pages(struct page *page, int numpages); /* * This will have no effect, other than possibly generating a warning, if the diff --git a/mm/memory.c b/mm/memory.c index 0b92336bcebd..80944acb5b4e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1506,6 +1506,12 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); } + + if (want_init_mlocked_on_free() && folio_test_mlocked(folio) && + !delay_rmap && folio_test_anon(folio)) { + kernel_init_pages(page, folio_nr_pages(folio)); + } + if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) { *force_flush = true; *force_break = true; diff --git a/mm/mm_init.c b/mm/mm_init.c 
index d01912b8a597..2c8f3af4430f 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -2522,6 +2522,9 @@ EXPORT_SYMBOL(init_on_alloc); DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_FREE_DEFAULT_ON, init_on_free); EXPORT_SYMBOL(init_on_free); +DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON, init_mlocked_on_free); +EXPORT_SYMBOL(init_mlocked_on_free); + static bool _init_on_alloc_enabled_early __read_mostly = IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON); static int __init early_init_on_alloc(char *buf) @@ -2539,6 +2542,14 @@ static int __init early_init_on_free(char *buf) } early_param("init_on_free", early_init_on_free); +static bool _init_mlocked_on_free_enabled_early __read_mostly + = IS_ENABLED(CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON); +static int __init early_init_mlocked_on_free(char *buf) +{ + return kstrtobool(buf, &_init_mlocked_on_free_enabled_early); +} +early_param("init_mlocked_on_free", early_init_mlocked_on_free); + DEFINE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled); /* @@ -2566,12 +2577,21 @@ static void __init mem_debugging_and_hardening_init(void) } #endif - if ((_init_on_alloc_enabled_early || _init_on_free_enabled_early) && + if ((_init_on_alloc_enabled_early || _init_on_free_enabled_early || + _init_mlocked_on_free_enabled_early) && page_poisoning_requested) { pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " - "will take precedence over init_on_alloc and init_on_free\n"); + "will take precedence over init_on_alloc, init_on_free " + "and init_mlocked_on_free\n"); _init_on_alloc_enabled_early = false; _init_on_free_enabled_early = false; + _init_mlocked_on_free_enabled_early = false; + } + + if (_init_mlocked_on_free_enabled_early && _init_on_free_enabled_early) { + pr_info("mem auto-init: init_on_free is on, " + "will take precedence over init_mlocked_on_free\n"); + _init_mlocked_on_free_enabled_early = false; } if (_init_on_alloc_enabled_early) { @@ -2588,9 +2608,17 @@ static void __init mem_debugging_and_hardening_init(void) static_branch_disable(&init_on_free); } - if (IS_ENABLED(CONFIG_KMSAN) && - (_init_on_alloc_enabled_early || _init_on_free_enabled_early)) - pr_info("mem auto-init: please make sure init_on_alloc and init_on_free are disabled when running KMSAN\n"); + if (_init_mlocked_on_free_enabled_early) { + want_check_pages = true; + static_branch_enable(&init_mlocked_on_free); + } else { + static_branch_disable(&init_mlocked_on_free); + } + + if (IS_ENABLED(CONFIG_KMSAN) && (_init_on_alloc_enabled_early || + _init_on_free_enabled_early || _init_mlocked_on_free_enabled_early)) + pr_info("mem auto-init: please make sure init_on_alloc, init_on_free and " + "init_mlocked_on_free are disabled when running KMSAN\n"); #ifdef CONFIG_DEBUG_PAGEALLOC if (debug_pagealloc_enabled()) { @@ -2629,9 +2657,10 @@ static void __init report_meminit(void) else stack = "off"; - pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s\n", + pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s, mlocked free:%s\n", stack, want_init_on_alloc(GFP_KERNEL) ? "on" : "off", - want_init_on_free() ? "on" : "off"); + want_init_on_free() ? "on" : "off", + want_init_mlocked_on_free() ? 
"on" : "off"); if (want_init_on_free()) pr_info("mem auto-init: clearing system memory may take some time...\n"); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 47ab0297838a..e030ccf9d5bc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1032,7 +1032,7 @@ static inline bool should_skip_kasan_poison(struct page *page) return page_kasan_tag(page) == KASAN_TAG_KERNEL; } -static void kernel_init_pages(struct page *page, int numpages) +void kernel_init_pages(struct page *page, int numpages) { int i; diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening index 2cff851ebfd7..effbf5982be1 100644 --- a/security/Kconfig.hardening +++ b/security/Kconfig.hardening @@ -255,6 +255,21 @@ config INIT_ON_FREE_DEFAULT_ON touching "cold" memory areas. Most cases see 3-5% impact. Some synthetic workloads have measured as high as 8%. +config INIT_MLOCKED_ON_FREE_DEFAULT_ON + bool "Enable mlocked memory zeroing on free" + depends on !KMSAN + help + This config has the effect of setting "init_mlocked_on_free=1" + on the kernel command line. If it is enabled, all mlocked process + memory is zeroed when freed. This restriction to mlocked memory + improves performance over "init_on_free" but can still be used to + protect confidential data like key material from content exposures + to other processes, as well as live forensics and cold boot attacks. + Any non-mlocked memory is not cleared before it is reassigned. This + configuration can be overwritten by setting "init_mlocked_on_free=0" + on the command line. The "init_on_free" boot option takes + precedence over "init_mlocked_on_free". + config CC_HAS_ZERO_CALL_USED_REGS def_bool $(cc-option,-fzero-call-used-regs=used-gpr) # https://github.com/ClangBuiltLinux/linux/issues/1766 -- cgit v1.2.3 From 34efe1c3b688944d9817a5faaab7aad870182c59 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Fri, 29 Mar 2024 18:39:41 +0900 Subject: zram: add max_pages param to recompression Introduce "max_pages" param to recompress device attribute which sets an upper limit on the number of entries (pages) zram attempts to recompress (in this particular recompression call). S/W recompression can be quite expensive so limiting the number of pages recompress touches can be quite helpful. Link: https://lkml.kernel.org/r/20240329094050.2815699-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Acked-by: Brian Geffon Cc: Minchan Kim Signed-off-by: Andrew Morton --- Documentation/admin-guide/blockdev/zram.rst | 5 +++++ drivers/block/zram/zram_drv.c | 31 ++++++++++++++++++++++++++--- 2 files changed, 33 insertions(+), 3 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index ee2b0030d416..091e8bb38887 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -466,6 +466,11 @@ of equal or greater size::: #recompress idle pages larger than 2000 bytes echo "type=idle threshold=2000" > /sys/block/zramX/recompress +It is also possible to limit the number of pages zram re-compression will +attempt to recompress::: + + echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress + Recompression of idle pages requires memory tracking. 
During re-compression for every page, that matches re-compression criteria, diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index f0639df6cd18..4cf38f7d3e0a 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1568,7 +1568,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, * Corresponding ZRAM slot should be locked. */ static int zram_recompress(struct zram *zram, u32 index, struct page *page, - u32 threshold, u32 prio, u32 prio_max) + u64 *num_recomp_pages, u32 threshold, u32 prio, + u32 prio_max) { struct zcomp_strm *zstrm = NULL; unsigned long handle_old; @@ -1645,6 +1646,15 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page, if (!zstrm) return 0; + /* + * Decrement the limit (if set) on pages we can recompress, even + * when current recompression was unsuccessful or did not compress + * the page below the threshold, because we still spent resources + * on it. + */ + if (*num_recomp_pages) + *num_recomp_pages -= 1; + if (class_index_new >= class_index_old) { /* * Secondary algorithms failed to re-compress the page @@ -1710,6 +1720,7 @@ static ssize_t recompress_store(struct device *dev, struct zram *zram = dev_to_zram(dev); unsigned long nr_pages = zram->disksize >> PAGE_SHIFT; char *args, *param, *val, *algo = NULL; + u64 num_recomp_pages = ULLONG_MAX; u32 mode = 0, threshold = 0; unsigned long index; struct page *page; @@ -1732,6 +1743,17 @@ static ssize_t recompress_store(struct device *dev, continue; } + if (!strcmp(param, "max_pages")) { + /* + * Limit the number of entries (pages) we attempt to + * recompress. + */ + ret = kstrtoull(val, 10, &num_recomp_pages); + if (ret) + return ret; + continue; + } + if (!strcmp(param, "threshold")) { /* * We will re-compress only idle objects equal or @@ -1788,6 +1810,9 @@ static ssize_t recompress_store(struct device *dev, for (index = 0; index < nr_pages; index++) { int err = 0; + if (!num_recomp_pages) + break; + zram_slot_lock(zram, index); if (!zram_allocated(zram, index)) @@ -1807,8 +1832,8 @@ static ssize_t recompress_store(struct device *dev, zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE)) goto next; - err = zram_recompress(zram, index, page, threshold, - prio, prio_max); + err = zram_recompress(zram, index, page, &num_recomp_pages, + threshold, prio, prio_max); next: zram_slot_unlock(zram, index); if (err) { -- cgit v1.2.3 From 658670607fae51ed2f8a64fcbfcd407bf820dd4f Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 9 Apr 2024 21:23:01 +0200 Subject: Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount() Let's stop talking about page_mapcount(). 
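Looping back to the recompress_store() handling above, a hedged sketch of driving the new max_pages limit from sysfs (the zram0 device name and the numbers are illustrative)::

   # mark resident pages as idle so they become recompression candidates
   echo all > /sys/block/zram0/idle
   # recompress at most 42 idle pages that are larger than 2000 bytes
   echo "type=idle threshold=2000 max_pages=42" > /sys/block/zram0/recompress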
Link: https://lkml.kernel.org/r/20240409192301.907377-19-david@redhat.com Signed-off-by: David Hildenbrand Cc: Chris Zankel Cc: Hugh Dickins Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Max Filippov Cc: Miaohe Lin Cc: Muchun Song Cc: Naoya Horiguchi Cc: Peter Xu Cc: Richard Chang Cc: Rich Felker Cc: Ryan Roberts Cc: Yang Shi Cc: Yin Fengwei Cc: Yoshinori Sato Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 46110e6a31bb..9cde26d33843 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -802,8 +802,8 @@ a page or a swap can be moved only when it is charged to the task's current | | anonymous pages, file pages (and swaps) in the range mmapped by the task | | | will be moved even if the task hasn't done page fault, i.e. they might | | | not be the task's "RSS", but other task's "RSS" that maps the same file. | -| | And mapcount of the page is ignored (the page can be moved even if | -| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | +| | The mapcount of the page is ignored (the page can be moved independent | +| | of the mapcount). You must enable Swap Extension (see 2.4) to | | | enable move of swap charges. | +---+--------------------------------------------------------------------------+ -- cgit v1.2.3 From 42248b9d34ea1be1b959d343fff465906cb787fc Mon Sep 17 00:00:00 2001 From: Barry Song Date: Fri, 12 Apr 2024 23:48:57 +1200 Subject: mm: add docs for per-order mTHP counters and transhuge_page ABI This patch includes documentation for mTHP counters and an ABI file for sys-kernel-mm-transparent-hugepage, which appears to have been missing for some time. [v-songbaohua@oppo.com: fix the name and unexpected indentation] Link: https://lkml.kernel.org/r/20240415054538.17071-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-4-21cnbao@gmail.com Signed-off-by: Barry Song Reviewed-by: Ryan Roberts Reviewed-by: David Hildenbrand Cc: Chris Li Cc: Domenico Cerasuolo Cc: Kairui Song Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Yosry Ahmed Cc: Yu Zhao Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- .../testing/sysfs-kernel-mm-transparent-hugepage | 18 ++++++++++++++ Documentation/admin-guide/mm/transhuge.rst | 28 ++++++++++++++++++++++ 2 files changed, 46 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage (limited to 'Documentation/admin-guide') diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage b/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage new file mode 100644 index 000000000000..7bfbb9cc2c11 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage @@ -0,0 +1,18 @@ +What: /sys/kernel/mm/transparent_hugepage/ +Date: April 2024 +Contact: Linux memory management mailing list +Description: + /sys/kernel/mm/transparent_hugepage/ contains a number of files and + subdirectories, + + - defrag + - enabled + - hpage_pmd_size + - khugepaged + - shmem_enabled + - use_zero_page + - subdirectories of the form hugepages-kB, where + is the page size of the hugepages supported by the kernel/CPU + combination. 
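A hedged reading sketch for the per-size directories and the stats counters documented in the transhuge.rst hunk below (hugepages-64kB is only an example; the available sizes depend on the kernel/CPU combination)::

   # list the per-size transparent hugepage directories
   ls -d /sys/kernel/mm/transparent_hugepage/hugepages-*kB
   # read one of the per-order counters for a particular size
   cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_fault_alloc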
+ + See Documentation/admin-guide/mm/transhuge.rst for details. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 04eb45a2f940..e0fe17affeb3 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -447,6 +447,34 @@ thp_swpout_fallback Usually because failed to allocate some continuous swap space for the huge page. +In /sys/kernel/mm/transparent_hugepage/hugepages-kB/stats, There are +also individual counters for each huge page size, which can be utilized to +monitor the system's effectiveness in providing huge pages for usage. Each +counter has its own corresponding file. + +anon_fault_alloc + is incremented every time a huge page is successfully + allocated and charged to handle a page fault. + +anon_fault_fallback + is incremented if a page fault fails to allocate or charge + a huge page and instead falls back to using huge pages with + lower orders or small pages. + +anon_fault_fallback_charge + is incremented if a page fault fails to charge a huge page and + instead falls back to using huge pages with lower orders or + small pages even though the allocation was successful. + +anon_swpout + is incremented every time a huge page is swapped out in one + piece without splitting. + +anon_swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help -- cgit v1.2.3 From a14421ae2a99378c4103bb03606465ab13e75509 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Fri, 12 Apr 2024 23:48:58 +1200 Subject: mm: correct the docs for thp_fault_alloc and thp_fault_fallback The documentation does not align with the code. In __do_huge_pmd_anonymous_page(), THP_FAULT_FALLBACK is incremented when mem_cgroup_charge() fails, despite the allocation succeeding, whereas THP_FAULT_ALLOC is only incremented after a successful charge. Link: https://lkml.kernel.org/r/20240412114858.407208-5-21cnbao@gmail.com Signed-off-by: Barry Song Reviewed-by: Ryan Roberts Reviewed-by: David Hildenbrand Cc: Chris Li Cc: Domenico Cerasuolo Cc: Kairui Song Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Ryan Roberts Cc: Suren Baghdasaryan Cc: Yosry Ahmed Cc: Yu Zhao Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index e0fe17affeb3..f82300b9193f 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -369,7 +369,7 @@ monitor how successfully the system is providing huge pages for use. thp_fault_alloc is incremented every time a huge page is successfully - allocated to handle a page fault. + allocated and charged to handle a page fault. thp_collapse_alloc is incremented by khugepaged when it has found @@ -377,7 +377,7 @@ thp_collapse_alloc successfully allocated a new huge page to store the data. thp_fault_fallback - is incremented if a page fault fails to allocate + is incremented if a page fault fails to allocate or charge a huge page and instead falls back to using small pages. 
thp_fault_fallback_charge -- cgit v1.2.3 From c074e1467f8546c8f8c9ea2128fb8bf8c4579418 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Sat, 13 Apr 2024 02:24:07 +0000 Subject: mm: zswap: remove same_filled module params These knobs offer more fine-grained control to userspace than needed and directly expose/influence kernel implementation; remove them. For disabling same_filled handling, there is no logical reason to refuse storing same-filled pages more efficiently and opt for compression. Scanning pages for patterns may be an argument, but the page contents will be read into the CPU cache anyway during compression. Also, removing the same_filled handling code does not move the needle significantly in terms of performance anyway [1]. For disabling non_same_filled handling, it was added when the compressed pages in zswap were not being properly charged to memcgs, as workloads could escape the accounting with compression [2]. This is no longer the case after commit f4840ccfca25 ("zswap: memcg accounting"), and using zswap without compression does not make much sense. [1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/ [2]https://lore.kernel.org/lkml/19d5cdee-2868-41bd-83d5-6da75d72e940@maciej.szmigiero.name/ [yosryahmed@google.com: remove same_filled_pages from docs] Link: https://lkml.kernel.org/r/ZhxFVggdyvCo79jc@google.com Link: https://lkml.kernel.org/r/20240413022407.785696-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed Acked-by: Johannes Weiner Reviewed-by: Nhat Pham Reviewed-by: Chengming Zhou Cc: "Maciej S. Szmigiero" Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/zswap.rst | 29 ---------------------- Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 2 -- mm/zswap.c | 19 -------------- 3 files changed, 50 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 13632671adae..3598dcd7dbe7 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -111,35 +111,6 @@ checked if it is a same-value filled page before compressing it. If true, the compressed length of the page is set to zero and the pattern or same-filled value is stored. -Same-value filled pages identification feature is enabled by default and can be -disabled at boot time by setting the ``same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and -disabled at runtime using the sysfs ``same_filled_pages_enabled`` -attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled - -When zswap same-filled page identification is disabled at runtime, it will stop -checking for the same-value filled pages during store operation. -In other words, every page will be then considered non-same-value filled. -However, the existing pages which are marked as same-value filled pages remain -stored unchanged in zswap until they are either loaded or invalidated. - -In some circumstances it might be advantageous to make use of just the zswap -ability to efficiently store same-filled pages without enabling the whole -compressed page storage. -In this case the handling of non-same-value pages by zswap (enabled by default) -can be disabled by setting the ``non_same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``. 
-It can also be enabled and disabled at runtime using the sysfs -``non_same_filled_pages_enabled`` attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled - -Disabling both ``zswap.same_filled_pages_enabled`` and -``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new -pages by zswap. - To prevent zswap from shrinking pool when zswap is full and there's a high pressure on swap (this will result in flipping pages in and out zswap pool without any real benefit but with a performance drop for the system), a diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst index de587cf9cbed..4cb1d52ea6dc 100644 --- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst +++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst @@ -457,7 +457,6 @@ Use the following commands to enable zswap:: # echo deflate-iaa > /sys/module/zswap/parameters/compressor # echo zsmalloc > /sys/module/zswap/parameters/zpool # echo 1 > /sys/module/zswap/parameters/enabled - # echo 0 > /sys/module/zswap/parameters/same_filled_pages_enabled # echo 100 > /proc/sys/vm/swappiness # echo never > /sys/kernel/mm/transparent_hugepage/enabled # echo 1 > /proc/sys/vm/overcommit_memory @@ -599,7 +598,6 @@ the 'fixed' compression mode:: echo deflate-iaa > /sys/module/zswap/parameters/compressor echo zsmalloc > /sys/module/zswap/parameters/zpool echo 1 > /sys/module/zswap/parameters/enabled - echo 0 > /sys/module/zswap/parameters/same_filled_pages_enabled echo 100 > /proc/sys/vm/swappiness echo never > /sys/kernel/mm/transparent_hugepage/enabled diff --git a/mm/zswap.c b/mm/zswap.c index 18fd70795f5b..a50e2986cd2f 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -123,19 +123,6 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */ module_param_named(accept_threshold_percent, zswap_accept_thr_percent, uint, 0644); -/* - * Enable/disable handling same-value filled pages (enabled by default). - * If disabled every page is considered non-same-value filled. - */ -static bool zswap_same_filled_pages_enabled = true; -module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled, - bool, 0644); - -/* Enable/disable handling non-same-value filled pages (enabled by default) */ -static bool zswap_non_same_filled_pages_enabled = true; -module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled, - bool, 0644); - /* Number of zpools in zswap_pool (empirically determined for scalability) */ #define ZSWAP_NR_ZPOOLS 32 @@ -1393,9 +1380,6 @@ static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1; bool ret = false; - if (!zswap_same_filled_pages_enabled) - return false; - page = kmap_local_folio(folio, 0); val = page[0]; @@ -1473,9 +1457,6 @@ bool zswap_store(struct folio *folio) goto store_entry; } - if (!zswap_non_same_filled_pages_enabled) - goto freepage; - /* if entry is successfully added, it keeps the reference */ entry->pool = zswap_pool_current_get(); if (!entry->pool) -- cgit v1.2.3 From 1bafe96e89f056cb6e25d47451fb16aee2c7c4d0 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 24 Apr 2024 14:26:30 +0200 Subject: mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared() We want to limit the use of page_mapcount() to places where absolutely required, to prepare for kernel configs where we won't keep track of per-page mapcounts in large folios. 
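For orientation, the khugepaged tunable this change interacts with, max_ptes_shared (discussed below), lives in sysfs; a hedged sketch, where the written value is only an example::

   # how many PTEs in a candidate PMD range may map shared pages
   cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
   # tighten the limit, making khugepaged more conservative about
   # collapsing COW-shared ranges
   echo 128 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared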
khugepaged is one of the remaining "more challenging" page_mapcount() users, but we might be able to move away from page_mapcount() without resulting in a significant behavior change that would warrant special-casing based on kernel configs. In 2020, we first added support to khugepaged for collapsing COW-shared pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared across fork"), followed by support for collapsing PTE-mapped THP in commit 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages") and limiting the memory waste via the "page_count() > 1" check in commit 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable"). As a default, khugepaged will allow up to half of the PTEs to map shared pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged setting. khugepaged does currently not care about swapcache page references, and does not check under folio lock: so in some corner cases the "shared vs. exclusive" detection might be a bit off, making us detect "exclusive" when it's actually "shared". Most of our anonymous folios in the system are usually exclusive. We frequently see sharing of anonymous folios for a short period of time, after which our short-lived suprocesses either quit or exec(). There are some famous examples, though, where child processes exist for a long time, and where memory is COW-shared with a lot of processes (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for reducing the memory footprint. We don't want to suddenly change the behavior to result in a significant increase in memory waste. Interestingly, khugepaged will only collapse an anonymous THP if at least one PTE is writable. After fork(), that means that something (usually a page fault) populated at least a single exclusive anonymous THP in that PMD range. So ... what happens when we switch to "is this folio mapped shared" instead of "is this page mapped shared" by using folio_likely_mapped_shared()? For "not-COW-shared" folios, small folios and for THPs (large folios) that are completely mapped into at least one process, switching to folio_likely_mapped_shared() will not result in a change. We'll only see a change for COW-shared PTE-mapped THPs that are partially mapped into all involved processes. There are two cases to consider: (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP If the folio is detected as exclusive, and it actually is exclusive, there is no change: page_mapcount() == 1. This is the common case without fork() or with short-lived child processes. folio_likely_mapped_shared() might currently still detect a folio as exclusive although it is shared (false negatives): if the first page is not mapped multiple times and if the average per-page mapcount is smaller than 1, implying that (1) the folio is partially mapped and (2) if we are responsible for many mapcounts by mapping many pages others can't ("mostly exclusive") (3) if we are not responsible for many mapcounts by mapping little pages ("mostly shared") it won't make a big impact on the end result. So while we might now detect a page as "exclusive" although it isn't, it's not expected to make a big difference in common cases. (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP folio_likely_mapped_shared() will never detect a large anonymous folio as shared although it is exclusive: there are no false positives. If we detect a THP as shared, at least one page of the THP is mapped by another process. 
It could well be that some pages are actually exclusive. For example, our child processes could have unmapped/COW'ed some pages such that they would now be exclusive to out process, which we now would treat as still-shared. Examples: (1) Parent maps all pages of a THP, child maps some pages. We detect all pages in the parent as shared although some are actually exclusive. (2) Parent maps all but some page of a THP, child maps the remainder. We detect all pages of the THP that the parent maps as shared although they are all exclusive. In (1) we wouldn't collapse a THP right now already: no PTE is writable, because a write fault would have resulted in COW of a single page and the parent would no longer map all pages of that THP. For (2) we would have collapsed a THP in the parent so far, now we wouldn't as long as the child process is still alive: unless the child process unmaps the remaining THP pages or we decide to split that THP. Possibly, the child COW'ed many pages, meaning that it's likely that we can populate a THP for our child first, and then for our parent. For (2), we are making really bad use of the THP in the first place (not even mapped completely in at least one process). If the THP would be completely partially mapped, it would be on the deferred split queue where we would split it lazily later. For short-running child processes, we don't particularly care. For long-running processes, the expectation is that such scenarios are rather rare: further, a THP might be best placed if most data in the PMD range is actually written, implying that we'll have to COW more pages first before khugepaged would collapse it. To summarize, in the common case, this change is not expected to matter much. The more common application of khugepaged operates on exclusive pages, either before fork() or after a child quit. Can we improve (A)? Yes, if we implement more precise tracking of "mapped shared" vs. "mapped exclusively", we could get rid of the false negatives completely. Can we improve (B)? We could count how many pages of a large folio we map inside the current page table and detect that we are responsible for most of the folio mapcount and conclude "as good as exclusive", which might help in some cases. ... but likely, some other mechanism should detect that the THP is not a good use in the scenario (not even mapped completely in a single process) and try splitting that folio lazily etc. We'll move the folio_test_anon() check before our "shared" check, so we might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order of checks now matches the one in __collapse_huge_page_isolate(). Extend documentation. Link: https://lkml.kernel.org/r/20240424122630.495788-1-david@redhat.com Signed-off-by: David Hildenbrand Cc: Jonathan Corbet Cc: Kirill A. Shutemov Cc: Zi Yan Cc: Yang Shi Cc: John Hubbard Cc: Ryan Roberts Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 3 ++- mm/khugepaged.c | 22 +++++++++++++++------- 2 files changed, 17 insertions(+), 8 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index f82300b9193f..076443cc10a6 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -278,7 +278,8 @@ collapsed, resulting fewer pages being collapsed into THPs, and lower memory access performance. ``max_ptes_shared`` specifies how many pages can be shared across multiple -processes. 
Exceeding the number would block the collapse:: +processes. khugepaged might treat pages of THPs as shared if any page of +that THP is shared. Exceeding the number would block the collapse:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2f73d2aa9ae8..cf518fc44098 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -583,7 +583,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, folio = page_folio(page); VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio); - if (page_mapcount(page) > 1) { + /* See hpage_collapse_scan_pmd(). */ + if (folio_likely_mapped_shared(folio)) { ++shared; if (cc->is_khugepaged && shared > khugepaged_max_ptes_shared) { @@ -1317,8 +1318,20 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, result = SCAN_PAGE_NULL; goto out_unmap; } + folio = page_folio(page); - if (page_mapcount(page) > 1) { + if (!folio_test_anon(folio)) { + result = SCAN_PAGE_ANON; + goto out_unmap; + } + + /* + * We treat a single page as shared if any part of the THP + * is shared. "False negatives" from + * folio_likely_mapped_shared() are not expected to matter + * much in practice. + */ + if (folio_likely_mapped_shared(folio)) { ++shared; if (cc->is_khugepaged && shared > khugepaged_max_ptes_shared) { @@ -1328,7 +1341,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, } } - folio = page_folio(page); /* * Record which node the original page is from and save this * information to cc->node_load[]. @@ -1349,10 +1361,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, result = SCAN_PAGE_LOCK; goto out_unmap; } - if (!folio_test_anon(folio)) { - result = SCAN_PAGE_ANON; - goto out_unmap; - } /* * Check if the page has any GUP (or other external) pins. -- cgit v1.2.3 From ed13c93b9393c9b9e0fc1f25fc5da13715bfddbd Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 26 Apr 2024 12:52:45 -0700 Subject: Docs/admin-guide/mm/damon/usage: update for young page type DAMOS filter Update DAMON usage document for the newly added DAMOS filter type, 'young page'. Link: https://lkml.kernel.org/r/20240426195247.100306-7-sj@kernel.org Signed-off-by: SeongJae Park Cc: Honggyu Kim Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 6fce035fdbf5..69bc8fabf378 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -410,19 +410,19 @@ in the numeric order. Each filter directory contains six files, namely ``type``, ``matcing``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type`` -file, you can write one of four special keywords: ``anon`` for anonymous pages, -``memcg`` for specific memory cgroup, ``addr`` for specific address range (an -open-ended interval), or ``target`` for specific DAMON monitoring target -filtering. In case of the memory cgroup filtering, you can specify the memory -cgroup of the interest by writing the path of the memory cgroup from the -cgroups mount point to ``memcg_path`` file. In case of the address range -filtering, you can specify the start and end address of the range to -``addr_start`` and ``addr_end`` files, respectively. 
For the DAMON monitoring -target filtering, you can specify the index of the target between the list of -the DAMON context's monitoring targets list to ``target_idx`` file. You can -write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does -not match to the type, respectively. Then, the scheme's action will not be -applied to the pages that specified to be filtered out. +file, you can write one of five special keywords: ``anon`` for anonymous pages, +``memcg`` for specific memory cgroup, ``young`` for young pages, ``addr`` for +specific address range (an open-ended interval), or ``target`` for specific +DAMON monitoring target filtering. In case of the memory cgroup filtering, you +can specify the memory cgroup of the interest by writing the path of the memory +cgroup from the cgroups mount point to ``memcg_path`` file. In case of the +address range filtering, you can specify the start and end address of the range +to ``addr_start`` and ``addr_end`` files, respectively. For the DAMON +monitoring target filtering, you can specify the index of the target between +the list of the DAMON context's monitoring targets list to ``target_idx`` file. +You can write ``Y`` or ``N`` to ``matching`` file to filter out pages that does +or does not match to the type, respectively. Then, the scheme's action will +not be applied to the pages that specified to be filtered out. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: -- cgit v1.2.3 From da2a061888883e067e8e649d086df35c92c760a7 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 3 May 2024 11:03:14 -0700 Subject: Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file The example usage of DAMOS filter sysfs files, specifically the part of 'matching' file writing for memcg type filter, is wrong. The intention is to exclude pages of a memcg that already getting enough care from a given scheme, but the example is setting the filter to apply the scheme to only the pages of the memcg. Fix it. Link: https://lkml.kernel.org/r/20240503180318.72798-7-sj@kernel.org Fixes: 9b7f9322a530 ("Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs") Closes: https://lore.kernel.org/r/20240317191358.97578-1-sj@kernel.org Signed-off-by: SeongJae Park Cc: [6.3.x] Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 69bc8fabf378..3ce3f0aaa1d5 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -434,7 +434,7 @@ pages of all memory cgroups except ``/having_care_already``.:: # # further filter out all cgroups except one at '/having_care_already' echo memcg > 1/type echo /having_care_already > 1/memcg_path - echo N > 1/matching + echo Y > 1/matching Note that ``anon`` and ``memcg`` filters are currently supported only when ``paddr`` :ref:`implementation ` is being used. 
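Putting the corrected semantics together with the new ``young`` keyword, a hedged sketch of excluding recently-accessed pages from a scheme's action; it assumes the current directory is a scheme's ``filters`` directory and that the usual ``nr_filters`` file controls the number of filter subdirectories::

   # create one filter directory (0/)
   echo 1 > nr_filters
   # filter on the young page type
   echo young > 0/type
   # matching=Y filters out pages that match the type, i.e. the scheme's
   # action will not be applied to young pages
   echo Y > 0/matching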
-- cgit v1.2.3 From 14e70e4660d6ecaf503c461f072949ef8758e4a1 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 3 May 2024 11:03:15 -0700 Subject: Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command To update effective size quota of DAMOS schemes on DAMON sysfs file interface, user should write 'update_schemes_effective_quotas' to the kdamond 'state' file. But the document is mistakenly saying the input string as 'update_schemes_effective_bytes'. Fix it (s/bytes/quotas/). Link: https://lkml.kernel.org/r/20240503180318.72798-8-sj@kernel.org Fixes: a6068d6dfa2f ("Docs/admin-guide/mm/damon/usage: document effective_bytes file") Signed-off-by: SeongJae Park Cc: [6.9.x] Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 3ce3f0aaa1d5..e58ceb89ea2a 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -153,7 +153,7 @@ Users can write below commands for the kdamond to the ``state`` file. - ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme action tried regions directory for each DAMON-based operation scheme of the kdamond. -- ``update_schemes_effective_bytes``: Update the contents of +- ``update_schemes_effective_quotas``: Update the contents of ``effective_bytes`` files for each DAMON-based operation scheme of the kdamond. For more details, refer to :ref:`quotas directory `. @@ -342,7 +342,7 @@ Based on the user-specified :ref:`goal `, the effective size quota is further adjusted. Reading ``effective_bytes`` returns the current effective size quota. The file is not updated in real time, so users should ask DAMON sysfs interface to update the content of the file for -the stats by writing a special keyword, ``update_schemes_effective_bytes`` to +the stats by writing a special keyword, ``update_schemes_effective_quotas`` to the relevant ``kdamonds//state`` file. Under ``weights`` directory, three files (``sz_permil``, -- cgit v1.2.3
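To close out the corrected keyword, a hedged sketch of refreshing and reading the effective quota; the kdamond/context/scheme indexes are illustrative and the paths assume the usual /sys/kernel/mm/damon/admin sysfs location::

   # ask the kdamond to refresh its schemes' effective_bytes files
   echo update_schemes_effective_quotas > /sys/kernel/mm/damon/admin/kdamonds/0/state
   # read the current effective size quota of the first scheme
   cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/effective_bytes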