summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-07-13mm/gup: introduce unpin_folio/unpin_folios helpersVivek Kasireddy2-0/+49
Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios", v16. Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. This patch (of 9): These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. We should probably sanity check the folio as part of unpin similar to how it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do that at the moment without also checking the subpage. Therefore, sanity checking needs to be added to these routines once we have a way to determine if any given folio is anon-exclusive (via a per folio AnonExclusive flag). Link: https://lkml.kernel.org/r/20240624063952.1572359-1-vivek.kasireddy@intel.com Link: https://lkml.kernel.org/r/20240624063952.1572359-2-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-13mm/zswap: use only one pool in zswapChengming Zhou1-41/+20
Zswap uses 32 pools to workaround the locking scalability problem in zswap backends (mainly zsmalloc nowadays), which brings its own problems like memory waste and more memory fragmentation. Testing results show that we can have near performance with only one pool in zswap after changing zsmalloc to use per-size_class lock instead of pool spinlock. Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with 10GB swapfile on ext4. real user sys 6.10.0-rc3 138.18 1241.38 1452.73 6.10.0-rc3-onepool 149.45 1240.45 1844.69 6.10.0-rc3-onepool-perclass 138.23 1242.37 1469.71 And do the same testing using zbud, which shows a little worse performance as expected since we don't do any locking optimization for zbud. I think it's acceptable since zsmalloc became a lot more popular than other backends, and we may want to support only zsmalloc in the future. real user sys 6.10.0-rc3-zbud 138.23 1239.58 1430.09 6.10.0-rc3-onepool-zbud 139.64 1241.37 1516.59 [chengming.zhou@linux.dev: fix error handling in zswap_pool_create(), per Dan Carpenter] Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-2-d30e9cd2b793@linux.dev [chengming.zhou@linux.dev: fix error handling again in zswap_pool_create(), per Yosry] Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-2-ad941699cb61@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-2-5e5081ea11b3@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-13mm/zsmalloc: change back to per-size_class lockChengming Zhou1-35/+50
Patch series "mm/zsmalloc: change back to per-size_class lock, v2". Commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") changed per-size_class lock to pool spinlock to prepare reclaim support in zsmalloc. Then reclaim support in zsmalloc had been dropped in favor of LRU reclaim in zswap, but this locking change had been left there. Obviously, the scalability of pool spinlock is worse than per-size_class. And we have a workaround that using 32 pools in zswap to avoid this scalability problem, which brings its own problems like memory waste and more memory fragmentation. So this series changes back to use per-size_class lock and using testing data in much stressed situation to verify that we can use only one pool in zswap. Note we only test and care about the zsmalloc backend, which makes sense now since zsmalloc became a lot more popular than other backends. Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with 10GB swapfile on ext4. real user sys 6.10.0-rc3 138.18 1241.38 1452.73 6.10.0-rc3-onepool 149.45 1240.45 1844.69 6.10.0-rc3-onepool-perclass 138.23 1242.37 1469.71 We can see from "sys" column that per-size_class locking with only one pool in zswap can have near performance with the current 32 pools. This patch (of 2): This patch is almost the revert of the commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks"), which changed to use a global pool->lock instead of per-size_class lock and pool->migrate_lock, was preparation for suppporting reclaim in zsmalloc. Then reclaim in zsmalloc had been dropped in favor of LRU reclaim in zswap. In theory, per-size_class is more fine-grained than the pool->lock, since a pool can have many size_classes. As for the additional pool->migrate_lock, only free() and map() need to grab it to access stable handle to get zspage, and only in read lock mode. Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-0-ad941699cb61@linux.dev Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-0-d30e9cd2b793@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-0-5e5081ea11b3@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-1-5e5081ea11b3@linux.dev Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-13mm/hugetlb.c: undo errant changeAndrew Morton1-0/+1
During conflict resolution a line was unintentionally removed by a ksm.c patch. Link: https://lkml.kernel.org/r/85b0d694-d1ac-8e7a-2e50-1edc03eee21a@google.com Fixes: ac90c56bbd73 ("mm/ksm: refactor out try_to_merge_with_zero_page()") Reported-by: Hugh Dickins <hughd@google.com> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm: zswap: fix zswap_never_enabled() for CONFIG_ZSWAP==NBarry Song1-1/+1
If CONFIG_ZSWAP is set to N, it means zswap cannot be enabled. zswap_never_enabled() should return true. The only effect of this issue is that with Barry's latest large folio swapin patches for zram ("mm: support mTHP swap-in for zRAM-like swapfile"), we will always fallback to order-0 swapin, even mistakenly when !CONFIG_ZSWAP. Basically this bug makes Barry's in progress patches not work at all. The API was created to inform the mm core that zswap has never been enabled, allowing the mm core to perform mTHP swap-in. This is a transitional solution until zswap supports mTHP. If zswap has been enabled, performing mTHP swap-in will result in corrupted data. You may find the answer in the mTHP swap-in series: https://lore.kernel.org/linux-mm/CAJD7tkZ4FQr6HZpduOdvmqgg_-whuZYE-Bz5O2t6yzw6Yg+v1A@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240629232231.42394-1-21cnbao@gmail.com Fixes: 0300e17d67c3 ("mm: zswap: add zswap_never_enabled()") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm/vmscan: drop checking if _deferred_list is empty before using TTU_SYNCBarry Song1-1/+1
The optimization of list_empty(&folio->_deferred_list) aimed to prevent increasing the PTL duration when a large folio is partially unmapped, for example, from subpage 0 to subpage (nr - 2). But Ryan's commit 5ed890ce5147 ("mm: vmscan: avoid split during shrink_folio_list()") actually splits this kind of large folios. This makes the "optimization" useless. Additionally, the list_empty() technically required a data_race() annotation. Link: https://lkml.kernel.org/r/20240629234155.53524-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm/page_alloc: remove prefetchw() on freeing page to buddy systemWei Yang1-11/+2
The prefetchw() is introduced from an ancient patch[1]. The change log says: The basic idea is to free higher order pages instead of going through every single one. Also, some unnecessary atomic operations are done away with and replaced with non-atomic equivalents, and prefetching is done where it helps the most. For a more in-depth discusion of this patch, please see the linux-ia64 archives (topic is "free bootmem feedback patch"). So there are several changes improve the bootmem freeing, in which the most basic idea is freeing higher order pages. And as Matthew says, "Itanium CPUs of this era had no prefetchers." I did 10 round bootup tests before and after this change, the data doesn't prove prefetchw() help speeding up bootmem freeing. The sum of the 10 round bootmem freeing time after prefetchw() removal even 5.2% faster than before. [1]: https://lore.kernel.org/linux-ia64/40F46962.4090604@sgi.com/ Link: https://lkml.kernel.org/r/20240702020931.7061-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10kernel/fork.c: put set_max_threads()/task_struct_whitelist() in __init sectionWei Yang1-2/+2
The functions set_max_threads() and task_struct_whitelist() are only used by fork_init() during bootup. Let's add __init tag to them. Link: https://lkml.kernel.org/r/20240701013410.17260-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10kernel/fork.c: get totalram_pages from memblock to calculate max_threadsWei Yang1-1/+2
Since we plan to move the accounting into __free_pages_core(), totalram_pages may not represent the total usable pages on system at this point when defer_init is enabled. Instead we can get the total usable pages from memblock directly. Link: https://lkml.kernel.org/r/20240701013410.17260-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm: remove CONFIG_MEMCG_KMEMJohannes Weiner20-131/+59
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab tracking is enabled. It has been default-enabled and equivalent to CONFIG_MEMCG for almost a decade. We've only grown more kernel memory accounting sites since, and there is no imaginable cgroup usecase going forward that wants to track user pages but not the multitude of user-drivable kernel allocations. Link: https://lkml.kernel.org/r/20240701153148.452230-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm: memcg: add cache line padding to mem_cgroup_per_nodeRoman Gushchin1-2/+4
Memcg v1-specific fields serve a buffer function between read-mostly and update often parts of the mem_cgroup_per_node structure. If CONFIG_MEMCG_V1 is not set and these fields are not present, an explicit cacheline padding is needed. Link: https://lkml.kernel.org/r/20240701185932.704807-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm: memcg: drop obsolete cache line padding in struct mem_cgroupRoman Gushchin1-4/+0
After the grouping of the cgroup v1-related fields and the corresponding reorganization of the struct mem_cgroup, the existing cache line padding doesn't make much sense anymore. Let's drop it for now and put back to new places, if necessary. Link: https://lkml.kernel.org/r/20240701185932.704807-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/index: add links to admin-guide docSeongJae Park1-7/+10
Readers of DAMON subsystem documents index would want to further learn how they can use DAMON from the user-space. Add the link to the admin guide. Link: https://lkml.kernel.org/r/20240701192706.51415-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/index: add links to designSeongJae Park2-5/+7
DAMON subsystem documents index page provides a short intro of DAMON core concepts. Add links to sections of the design document to let users easily browse to the details. Link: https://lkml.kernel.org/r/20240701192706.51415-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/design: add links to sections of DAMON sysfs interface usage docSeongJae Park1-0/+48
Readers of the design document would wonder how they can configure and use specific DAMON features. Add links to sections of DAMON sysfs interface usage document that provides the answers for easier browsing. Link: https://lkml.kernel.org/r/20240701192706.51415-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/design: remove 'Programmable Modules' section in favor of ↵SeongJae Park1-10/+0
'Modules' section 'Programmable Modules' section provides high level descriptions of the DAMON API-based kernel modules layer. But 'Modules' section, which is at the end of the document, provides every detail about the layer including that of 'Programmable Modules' section. Since the brief summary of the layers at the beginning of the document has a link to the 'Modules' section, browsing to the section is not that difficult. Remove 'Programmable Modules' section in favor of 'Modules' section and reducing duplicates. Link: https://lkml.kernel.org/r/20240701192706.51415-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/design: move 'Configurable Operations Set' section into ↵SeongJae Park1-25/+22
'Operations Set Layer' section 'Configurable Operations Set' section is for providing a description of the pluggability of the operations set layer. Just after that, 'Operations Set Layer' section, which is dedicated for the entire things of the layer, follows. The layout is odd, and some descriptions are duplicated. Move 'Configurable Operations Set' section into 'Operations Set Layer' and re-write some of the detailed descriptions. Link: https://lkml.kernel.org/r/20240701192706.51415-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/design: add links from overall architecture to sections of detailsSeongJae Park1-7/+13
DAMON design document briefly explains the overall layers architecture first, and then provides detailed explanations of each layer with dedicated sections. Letting readers go directly to the detailed sections for specific layers could help easy browsing of the not-very-short document. Add links from the overall summary to the sections of details. Link: https://lkml.kernel.org/r/20240701192706.51415-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/admin-guide/mm/damon/start: add access pattern snapshot exampleSeongJae Park1-4/+42
DAMON user-space tool (damo) provides access pattern snapshot feature, which is expected to be frequently used for real time access pattern analysis. The snapshot output is also showing what DAMON provides on its own, including the 'age' information. In contrast, the recorded access patterns, which is shown as an example usage on the quick start section, shows what users can make from what DAMON provided. It includes information that generated outside of DAMON and makes the 'age' concept bit unclear. Hence snapshot output is easier at understanding the raw realtime output of DAMON. Add the snapshot usage example on the quick start section. Link: https://lkml.kernel.org/r/20240701192706.51415-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/design: clarify regions merging operationSeongJae Park1-5/+12
DAMON design document is not explaining how min_nr_regions limit is kept, and what happens if the number of regions exceeds max_nr_regions. Add more clarification for those. Link: https://lkml.kernel.org/r/20240701192706.51415-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10Docs/mm/damon/design: fix two typosSeongJae Park1-2/+2
Patch series "Docs/damon: minor fixups and improvements". Fixup typos, clarify regions merging operation design with recent change, add access pattern snapshot example use case, and improve readability of the design document and subsystem documents index by reorganizing/wordsmithing and adding links to other sections and/or documents for easy browsing. This patch (of 9): Fix two typos. The first one is just a simple typo: s/accurach/accuracy/ The second one is made by the author being out of their mind. 'Region Based Sampling' section of the doc is mistakenly calling the access frequency counter of region as 'nr_regions'. Fix it with the correct name, 'nr_accesses'. Link: https://lkml.kernel.org/r/20240701192706.51415-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240701192706.51415-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm/shmem: fix input and output inconsistenciesBang Li1-1/+1
Commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP") added mTHP support for anonymous shmem. We can configure different policies through the multi-size THP sysfs interface for anonymous shmem. But when we configure the "advise" policy of /sys/kernel/mm/transparent_hugepage/hugepages-xxxkB/shmem_enabled, we cannot write the "advise", but write the "madvise", which is unreasonable. We should keep the output and input values consistent, which is more convenient for users. Link: https://lkml.kernel.org/r/20240628032327.16987-1-libang.li@antgroup.com Fixes: 61a57f1b1da9 ("mm: shmem: add multi-size THP sysfs interface for anonymous shmem") Signed-off-by: Bang Li <libang.li@antgroup.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bang Li <libang.li@antgroup.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10selftests: centralize -D_GNU_SOURCE= to CFLAGS in lib.mkEdward Liaw15-15/+12
Centralize the _GNU_SOURCE definition to CFLAGS in lib.mk. Remove redundant defines from Makefiles that import lib.mk. Convert any usage of "#define _GNU_SOURCE 1" to "#define _GNU_SOURCE". This uses the form "-D_GNU_SOURCE=", which is equivalent to "#define _GNU_SOURCE". Otherwise using "-D_GNU_SOURCE" is equivalent to "-D_GNU_SOURCE=1" and "#define _GNU_SOURCE 1", which is less commonly seen in source code and would require many changes in selftests to avoid redefinition warnings. Link: https://lkml.kernel.org/r/20240625223454.1586259-2-edliaw@google.com Signed-off-by: Edward Liaw <edliaw@google.com> Suggested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: André Almeida <andrealmeid@igalia.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <kees@kernel.org> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10tools/mm: introduce a tool to assess swap entry allocation for thp_swapoutBarry Song2-1/+235
Both Ryan and Chris have been utilizing the small test program to aid in debugging and identifying issues with swap entry allocation. While a real or intricate workload might be more suitable for assessing the correctness and effectiveness of the swap allocation policy, a small test program presents a simpler means of understanding the problem and initially verifying the improvements being made. Let's endeavor to integrate it into tools/mm. Although it presently only accommodates 64KB and 4KB, I'm optimistic that we can expand its capabilities to support multiple sizes and simulate more complex systems in the future as required. Basically, we have 1. Use MADV_PAGEPUT for rapid swap-out, putting the swap allocation code under high exercise in a short time. 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures new mTHP is always generated, released or swapped out, similar to the behavior on a PC or Android phone where many applications are frequently started and terminated. 3. Swap in with or without the "-a" option to observe how fragments due to swap-in and the incoming swap-in of large folios will impact swap-out fallback. Due to 2, we ensure a certain proportion of mTHP. Similarly, because of 3, we maintain a certain proportion of small folios, as we don't support large folios swap-in, meaning any swap-in will immediately result in small folios. Therefore, with both 2 and 3, we automatically achieve a system containing both mTHP and small folios. Additionally, 1 provides the ability to continuously swap them out. We can also use "-s" to add a dedicated small folios memory area. [akpm@linux-foundation.org: thp_swap_allocator_test.c needs mman.h, per Kairui Song] Link: https://lkml.kernel.org/r/20240622071231.576056-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Chris Li <chrisl@kernel.org> Tested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: migrate: remove folio_migrate_copy()Kefeng Wang3-9/+2
The folio_migrate_copy() is just a wrapper of folio_copy() and folio_migrate_flags(), it is simple and only aio use it for now, unfold it and remove folio_migrate_copy(). Link: https://lkml.kernel.org/r/20240626085328.608006-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()Kefeng Wang2-3/+9
This is similar to __migrate_folio(), use folio_mc_copy() in HugeTLB folio migration to avoid panic when copy from poisoned folio. Link: https://lkml.kernel.org/r/20240626085328.608006-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: migrate: support poisoned recover from migrate folioKefeng Wang1-3/+11
The folio migration is widely used in kernel, memory compaction, memory hotplug, soft offline page, numa balance, memory demote/promotion, etc, but once access a poisoned source folio when migrating, the kerenl will panic. There is a mechanism in the kernel to recover from uncorrectable memory errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths, eg, CoW, khugepaged, coredump, ksm copy, see copy_mc_to_{user,kernel}, copy_mc_{user_}highpage callers. In order to support poisoned folio copy recover from migrate folio, we chose to make folio migration tolerant of memory failures and return error for folio migration, because folio migration is no guarantee of success, this could avoid the similar panic shown below. CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0 pc : copy_page+0x10/0xc0 lr : copy_highpage+0x38/0x50 ... Call trace: copy_page+0x10/0xc0 folio_copy+0x78/0x90 migrate_folio_extra+0x54/0xa0 move_to_new_folio+0xd8/0x1f0 migrate_folio_move+0xb8/0x300 migrate_pages_batch+0x528/0x788 migrate_pages_sync+0x8c/0x258 migrate_pages+0x440/0x528 soft_offline_in_use_page+0x2ec/0x3c0 soft_offline_page+0x238/0x310 soft_offline_page_store+0x6c/0xc0 dev_attr_store+0x20/0x40 sysfs_kf_write+0x4c/0x68 kernfs_fop_write_iter+0x130/0x1c8 new_sync_write+0xa4/0x138 vfs_write+0x238/0x2d8 ksys_write+0x74/0x110 Note, folio copy is moved in the begin of the __migrate_folio(), which could simplify the error handling since there is no turning back if folio_migrate_mapping() return success, the downside is the folio copied even though folio_migrate_mapping() return fail, an optimization is to check whether source folio does not have extra refs before we do folio copy. Link: https://lkml.kernel.org/r/20240626085328.608006-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: migrate: split folio_migrate_mapping()Kefeng Wang1-16/+22
The folio refcount check is moved out for both !mapping and mapping folio, also update comment from page to folio for folio_migrate_mapping(). No functional change intended. Link: https://lkml.kernel.org/r/20240626085328.608006-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: add folio_mc_copy()Kefeng Wang2-0/+18
Add a #MC variant of folio_copy() which uses copy_mc_highpage() to support #MC handled during folio copy, it will be used in folio migration soon. Link: https://lkml.kernel.org/r/20240626085328.608006-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: move memory_failure_queue() into copy_mc_[user]_highpage()Kefeng Wang3-10/+9
Patch series "mm: migrate: support poison recover from migrate folio", v5. The folio migration is widely used in kernel, memory compaction, memory hotplug, soft offline page, numa balance, memory demote/promotion, etc, but once access a poisoned source folio when migrating, the kernel will panic. There is a mechanism in the kernel to recover from uncorrectable memory errors, ARCH_HAS_COPY_MC(eg, Machine Check Safe Memory Copy on x86), which is already used in NVDIMM or core-mm paths(eg, CoW, khugepaged, coredump, ksm copy), see copy_mc_to_{user,kernel}, copy_mc_{user_}highpage callers. This series of patches provide the recovery mechanism from folio copy for the widely used folio migration. Please note, because folio migration is no guarantee of success, so we could chose to make folio migration tolerant of memory failures, adding folio_mc_copy() which is a #MC versions of folio_copy(), once accessing a poisoned source folio, we could return error and make the folio migration fail, and this could avoid the similar panic shown below. CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0 pc : copy_page+0x10/0xc0 lr : copy_highpage+0x38/0x50 ... Call trace: copy_page+0x10/0xc0 folio_copy+0x78/0x90 migrate_folio_extra+0x54/0xa0 move_to_new_folio+0xd8/0x1f0 migrate_folio_move+0xb8/0x300 migrate_pages_batch+0x528/0x788 migrate_pages_sync+0x8c/0x258 migrate_pages+0x440/0x528 soft_offline_in_use_page+0x2ec/0x3c0 soft_offline_page+0x238/0x310 soft_offline_page_store+0x6c/0xc0 dev_attr_store+0x20/0x40 sysfs_kf_write+0x4c/0x68 kernfs_fop_write_iter+0x130/0x1c8 new_sync_write+0xa4/0x138 vfs_write+0x238/0x2d8 ksys_write+0x74/0x110 This patch (of 5): There is a memory_failure_queue() call after copy_mc_[user]_highpage(), see callers, eg, CoW/KSM page copy, it is used to mark the source page as h/w poisoned and unmap it from other tasks, and the upcomming poison recover from migrate folio will do the similar thing, so let's move the memory_failure_queue() into the copy_mc_[user]_highpage() instead of adding it into each user, this should also enhance the handling of poisoned page in khugepaged. Link: https://lkml.kernel.org/r/20240626085328.608006-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240626085328.608006-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fixAndrew Morton22-274/+323
crashes from deferred split racing folio migration", needed by "mm: migrate: split folio_migrate_mapping()".
2024-07-06MAINTAINERS: mailmap: update Lorenzo Stoakes's email addressLorenzo Stoakes2-1/+2
Now working at Oracle. Link: https://lkml.kernel.org/r/20240703092704.11571-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: fix crashes from deferred split racing folio migrationHugh Dickins2-11/+13
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on flags when freeing, yet the flags shown are not bad: PG_locked had been set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN symptoms implying double free by deferred split and large folio migration. 6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") was right to fix the memcg-dependent locking broken in 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), but missed a subtlety of deferred_split_scan(): it moves folios to its own local list to work on them without split_queue_lock, during which time folio->_deferred_list is not empty, but even the "right" lock does nothing to secure the folio and the list it is on. Fortunately, deferred_split_scan() is careful to use folio_try_get(): so folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() while the old folio's reference count is temporarily frozen to 0 - adding such a freeze in the !mapping case too (originally, folio lock and unmapping and no swap cache left an anon folio unreachable, so no freezing was needed there: but the deferred split queue offers a way to reach it). Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 ↵Paul Menzel1-1/+3
compat On a system with Perl 5.12.1, commit 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output") causes the build to fail with the error below. Bareword found where operator expected at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r" syntax error at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r" Execution of ./lib/build_OID_registry aborted due to compilation errors. make[3]: *** [lib/Makefile:352: lib/oid_registry_data.c] Error 255 Ahmad Fatoum analyzed that non-destructive substitution is only supported since Perl 5.13.2. Instead of dropping `r` and having the side effect of modifying `$0`, introduce a dedicated variable to support older Perl versions. Link: https://lkml.kernel.org/r/20240702223512.8329-2-pmenzel@molgen.mpg.de Link: https://lkml.kernel.org/r/20240701155802.75152-1-pmenzel@molgen.mpg.de Fixes: 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output") Link: https://lore.kernel.org/all/259f7a87-2692-480e-9073-1c1c35b52f67@molgen.mpg.de/ Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Suggested-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06mm: gup: stop abusing try_grab_folioYang Shi3-139/+156
A kernel warning was reported when pinning folio in CMA memory when launching SEV virtual machine. The splat looks like: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] <TASK> [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] Per the analysis done by yangge, when starting the SEV virtual machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in CMA area, so fast GUP will fail then fallback to the slow path due to the longterm pinnalbe check in try_grab_folio(). The slow path will try to pin the pages then migrate them out of CMA area. But the slow path also uses try_grab_folio() to pin the page, it will also fail due to the same check then the above warning is triggered. In addition, the try_grab_folio() is supposed to be used in fast path and it elevates folio refcount by using add ref unless zero. We are guaranteed to have at least one stable reference in slow path, so the simple atomic add could be used. The performance difference should be trivial, but the misuse may be confusing and misleading. Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page() to try_grab_folio(), and use them in the proper paths. This solves both the abuse and the kernel warning. The proper naming makes their usecase more clear and should prevent from abusing in the future. peterx said: : The user will see the pin fails, for gpu-slow it further triggers the WARN : right below that failure (as in the original report): : : folio = try_grab_folio(page, page_increm - 1, : foll_flags); : if (WARN_ON_ONCE(!folio)) { <------------------------ here : /* : * Release the 1st page ref if the : * folio is problematic, fail hard. : */ : gup_put_folio(page_folio(page), 1, : foll_flags); : ret = -EFAULT; : goto out; : } [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/ [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast] Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: yangge <yangge1116@126.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05docs: mm: add enable_soft_offline sysctlJiaqi Yan1-0/+38
Add the documentation for soft offline behaviors / costs, and what the new enable_soft_offline sysctl is for. [jiaqiyan@google.com: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/CACw3F52=GxTCDw-PqFh3-GDM-fo3GbhGdu0hedxYXOTT4TQSTg@mail.gmail.com [jiaqiyan@google.com: there are more blank lines needed] Link: https://lkml.kernel.org/r/CACw3F52_obAB742XeDRNun4BHBYtrxtbvp5NkUincXdaob0j1g@mail.gmail.com Link: https://lkml.kernel.org/r/20240626050818.2277273-5-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05selftest/mm: test enable_soft_offline behaviorsJiaqi Yan4-0/+236
Add regression and new tests when hugepage has correctable memory errors, and how userspace wants to deal with it: * if enable_soft_offline=1, mapped hugepage is soft offlined * if enable_soft_offline=0, mapped hugepage is intact Free hugepages case is not explicitly covered by the tests. Hugepage having corrected memory errors is emulated with MADV_SOFT_OFFLINE. [jiaqiyan@google.com: v7] Link: https://lkml.kernel.org/r/20240628205958.2845610-4-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20240626050818.2277273-4-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm/memory-failure: userspace controls soft-offlining pagesJiaqi Yan1-2/+20
Correctable memory errors are very common on servers with large amount of memory, and are corrected by ECC. Soft offline is kernel's additional recovery handling for memory pages having (excessive) corrected memory errors. Impacted page is migrated to a healthy page if inuse; the original page is discarded for any future use. The actual policy on whether (and when) to soft offline should be maintained by userspace, especially in case of an 1G HugeTLB page. Soft-offline dissolves the HugeTLB page, either in-use or free, into chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If userspace has not acknowledged such behavior, it may be surprised when later failed to mmap hugepages due to lack of hugepages. In case of a transparent hugepage, it will be split into 4K pages as well; userspace will stop enjoying the transparent performance. In addition, discarding the entire 1G HugeTLB page only because of corrected memory errors sounds very costly and kernel better not doing under the hood. But today there are at least 2 such cases doing so: 1. when GHES driver sees both GHES_SEV_CORRECTED and CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER. 2. RAS Correctable Errors Collector counts correctable errors per PFN and when the counter for a PFN reaches threshold In both cases, userspace has no control of the soft offline performed by kernel's memory failure recovery. This commit gives userspace the control of softofflining any page: kernel only soft offlines raw page / transparent hugepage / HugeTLB hugepage if userspace has agreed to. The interface to userspace is a new sysctl at /proc/sys/vm/enable_soft_offline. By default its value is set to 1 to preserve existing behavior in kernel. When set to 0, soft-offline (e.g. MADV_SOFT_OFFLINE) will fail with EOPNOTSUPP. [jiaqiyan@google.com: v7] Link: https://lkml.kernel.org/r/20240628205958.2845610-3-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20240626050818.2277273-3-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm/memory-failure: refactor log format in soft offline codeJiaqi Yan1-6/+9
Patch series "Userspace controls soft-offline pages", v6. Correctable memory errors are very common on servers with large amount of memory, and are corrected by ECC, but with two pain points to users: 1. Correction usually happens on the fly and adds latency overhead 2. Not-fully-proved theory states excessive correctable memory errors can develop into uncorrectable memory error. Soft offline is kernel's additional solution for memory pages having (excessive) corrected memory errors. Impacted page is migrated to healthy page if it is in use, then the original page is discarded for any future use. The actual policy on whether (and when) to soft offline should be maintained by userspace, especially in case of an 1G HugeTLB page. Soft-offline dissolves the HugeTLB page, either in-use or free, into chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If userspace has not acknowledged such behavior, it may be surprised when later mmap hugepages MAP_FAILED due to lack of hugepages. In case of a transparent hugepage, it will be split into 4K pages as well; userspace will stop enjoying the transparent performance. In addition, discarding the entire 1G HugeTLB page only because of corrected memory errors sounds very costly and kernel better not doing under the hood. But today there are at least 2 such cases: 1. GHES driver sees both GHES_SEV_CORRECTED and CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER. 2. RAS Correctable Errors Collector counts correctable errors per PFN and when the counter for a PFN reaches threshold In both cases, userspace has no control of the soft offline performed by kernel's memory failure recovery. This patch series give userspace the control of softofflining any page: kernel only soft offlines raw page / transparent hugepage / HugeTLB hugepage if userspace has agreed to. The interface to userspace is a new sysctl called enable_soft_offline under /proc/sys/vm. By default enable_soft_line is 1 to preserve existing behavior in kernel. This patch (of 4): Logs from soft_offline_page and soft_offline_in_use_page have different formats than majority of the memory failure code: "Memory failure: 0x${pfn}: ${lower_case_message}" Convert them to the following format: "Soft offline: 0x${pfn}: ${lower_case_message}" No functional change in this commit. Link: https://lkml.kernel.org/r/20240626050818.2277273-1-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20240626050818.2277273-2-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: memcg: adjust the warning when seq_buf overflowsXiu Jianfeng1-1/+2
Currently it uses WARN_ON_ONCE() if seq_buf overflows when user reads memory.stat, the only advantage of WARN_ON_ONCE is that the splat is so verbose that it gets noticed. And also it panics the system if panic_on_warn is enabled. It seems like the warning is just an over reaction and a simple pr_warn should just achieve the similar effect. Link: https://lkml.kernel.org/r/20240628072333.2496527-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: memcg: remove redundant seq_buf_has_overflowed()Xiu Jianfeng1-3/+0
Both the end of memory_stat_format() and memcg_stat_format() will call WARN_ON_ONCE(seq_buf_has_overflowed()). However, memory_stat_format() is the only caller of memcg_stat_format(), when memcg is on the default hierarchy, seq_buf_has_overflowed() will be executed twice, so remove the redundant one. Link: https://lkml.kernel.org/r/20240626094232.2432891-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: turn off test_uffdio_wp if CONFIG_PTE_MARKER_UFFD_WP is not configured.Audra Mitchell1-0/+3
If CONFIG_PTE_MARKER_UFFD_WP is disabled, then we turn off three features in userfaultfd_api (UFFD_FEATURE_WP_HUGETLBFS_SHMEM, UFFD_FEATURE_WP_UNPOPULATED, and UFFD_FEATURE_WP_ASYNC). Currently this test always will call uffdio_regsiter with the flag UFFDIO_REGISTER_MODE_WP. However, the kernel ensures in vma_can_userfault that if the feature UFFD_FEATURE_WP_HUGETLBFS_SHMEM is disabled, only allow the VM_UFFD_WP on anonymous vmas, meaning our call to uffdio_regsiter will fail. We still want to be able to run the test even if we have CONFIG_PTE_MARKER_UFFD_WP disabled, so check to see if the feature UFFD_FEATURE_WP_HUGETLBFS_SHMEM has been turned off in the test and if so, disable us from calling uffdio_regsiter with the flag UFFDIO_REGISTER_MODE_WP. Link: https://lkml.kernel.org/r/20240626130513.120193-3-audra@redhat.com Signed-off-by: Audra Mitchell <audra@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: update uffd-stress to handle EINVAL for unset config featuresAudra Mitchell1-5/+9
Now that we have updated userfaultfd_api to correctly return EINVAL when a feature is requested but not available, let's fix the uffd-stress test to only set the UFFD_FEATURE_WP_UNPOPULATED feature when the config is set. In addition, still run the test if the CONFIG_PTE_MARKER_UFFD_WP is not set, just dont use the corresponding UFFD_FEATURE_WP_UNPOPULATED feature. Link: https://lkml.kernel.org/r/20240626130513.120193-2-audra@redhat.com Signed-off-by: Audra Mitchell <audra@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05get_task_mm: check PF_KTHREAD locklessOleg Nesterov1-6/+5
Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock. Move the PF_KTHREAD check outside of task_lock() section to make this code more understandable. Link: https://lkml.kernel.org/r/20240626191017.GA20031@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05memcg: mm_update_next_owner: move for_each_thread() into try_to_set_owner()Oleg Nesterov1-16/+24
mm_update_next_owner() checks the children / real_parent->children to avoid the "everything else" loop in the likely case, but this won't work if a child/sibling has a zombie leader with ->mm == NULL. Move the for_each_thread() logic into try_to_set_owner(), if nothing else this makes the children/siblings/everything searches more consistent. Link: https://lkml.kernel.org/r/20240626152930.GA17936@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jinliang Zheng <alexjlzheng@tencent.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05memcg: mm_update_next_owner: kill the "retry" logicOleg Nesterov1-30/+27
Add the new helper, try_to_set_owner(), which tries to update mm->owner once we see c->mm == mm. This way mm_update_next_owner() doesn't need to restart the list_for_each_entry/for_each_process loops from the very beginning if it races with exit/exec, it can just continue. Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before it drops tasklist_lock, so it doesn't need get/put_task_struct(). Link: https://lkml.kernel.org/r/20240626152924.GA17933@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jinliang Zheng <alexjlzheng@tencent.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: memcg: put struct task_struct::in_user_fault under CONFIG_MEMCG_V1Roman Gushchin2-21/+21
The struct task_struct's in_user_fault member is not used by the cgroup v2's memory controller, so it can be put under the CONFIG_MEMCG_V1 config option. To do so, mem_cgroup_enter_user_fault() and mem_cgroup_exit_user_fault() are moved under the CONFIG_MEMCG_V1 option as well. Link: https://lkml.kernel.org/r/20240628210317.272856-10-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: memcg: put struct task_struct::memcg_in_oom under CONFIG_MEMCG_V1Roman Gushchin1-1/+3
The memcg_in_oom field of the struct task_struct is not used by the cgroup v2's memory controller, so it can be happily compiled out if CONFIG_MEMCG_V1 is not set. Link: https://lkml.kernel.org/r/20240628210317.272856-9-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_nodeRoman Gushchin1-0/+2
Put memcg1-specific members of struct mem_cgroup_per_node under the CONFIG_MEMCG_V1 config option. Link: https://lkml.kernel.org/r/20240628210317.272856-8-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-05mm: memcg: put memcg1-specific struct mem_cgroup's members under CONFIG_MEMCG_V1Roman Gushchin1-50/+53
Put memcg1-specific members of struct mem_cgroup under the CONFIG_MEMCG_V1 config option. Also group them close to the end of struct mem_cgroup just before the dynamic per-node part. Link: https://lkml.kernel.org/r/20240628210317.272856-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>