summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2024-02-22mm: zswap: clean up zswap_entry_put()Johannes Weiner1-7/+3
Remove stale comment and unnecessary local variable. Link: https://lkml.kernel.org/r/20240130014208.565554-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: warn when referencing a dead entryJohannes Weiner1-0/+1
Put a standard sanity check on zswap_entry_get() for UAF scenario. Link: https://lkml.kernel.org/r/20240130014208.565554-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: move zswap_invalidate_entry() to related functionsJohannes Weiner1-12/+12
Move it up to the other tree and refcounting functions. Link: https://lkml.kernel.org/r/20240130014208.565554-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: inline and remove zswap_entry_find_get()Johannes Weiner1-15/+2
There is only one caller and the function is trivial. Inline it. Link: https://lkml.kernel.org/r/20240130014208.565554-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: rename zswap_free_entry to zswap_entry_freeJohannes Weiner1-2/+2
There is a zswap_entry_ namespace with multiple functions already. Link: https://lkml.kernel.org/r/20240130014208.565554-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/list_lru: remove list_lru_putback()Chengming Zhou2-15/+1
Since the only user zswap_lru_putback() has gone, remove list_lru_putback() too. Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-3-b10479847099@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: fix race between lru writeback and swapoffChengming Zhou1-65/+49
LRU writeback has race problem with swapoff, as spotted by Yosry [1]: CPU1 CPU2 shrink_memcg_cb swap_off list_lru_isolate zswap_invalidate zswap_swapoff kfree(tree) // UAF spin_lock(&tree->lock) The problem is that the entry in lru list can't protect the tree from being swapoff and freed, and the entry also can be invalidated and freed concurrently after we unlock the lru lock. We can fix it by moving the swap cache allocation ahead before referencing the tree, then check invalidate race with tree lock, only after that we can safely deref the entry. Note we couldn't deref entry or tree anymore after we unlock the folio, since we depend on this to hold on swapoff. So this patch moves all tree and entry usage to zswap_writeback_entry(), we only use the copied swpentry on the stack to allocate swap cache and if returned with folio locked we can reference the tree safely. Then we can check invalidate race with tree lock, the following things is much the same like zswap_load(). Since we can't deref the entry after zswap_writeback_entry(), we can't use zswap_lru_putback() anymore, instead we rotate the entry in the beginning. And it will be unlinked and freed when invalidated if writeback success. Another change is we don't update the memcg nr_zswap_protected in the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with swapin or concurrent shrinker action, since swapin already have memcg nr_zswap_protected updated, don't need double counts here. For concurrent shrinker, the folio will be writeback and freed anyway. -ENOMEM case is extremely rare and doesn't happen spuriously either, so don't bother distinguishing this case. [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chris Li <chriscli@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm and cache_info: remove unnecessary CPU cache info updateHuang Ying1-21/+18
For each CPU hotplug event, we will update per-CPU data slice size and corresponding PCP configuration for every online CPU to make the implementation simple. But, Kyle reported that this takes tens seconds during boot on a machine with 34 zones and 3840 CPUs. So, in this patch, for each CPU hotplug event, we only update per-CPU data slice size and corresponding PCP configuration for the CPUs that share caches with the hotplugged CPU. With the patch, the system boot time reduces 67 seconds on the machine. Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Originally-by: Kyle Meyer <kyle.meyer@hpe.com> Reported-and-tested-by: Kyle Meyer <kyle.meyer@hpe.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22kswapd: replace try_to_freeze() with kthread_freezable_should_stop()Levi Yun1-6/+6
Instead of using try_to_freeze, use kthread_freezable_should_stop in kswapd. By this, we can avoid unnecessary freezing when kswapd should stop. Link: https://lkml.kernel.org/r/20240126152556.58791-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: memcg: don't periodically flush stats when memcg is disabledT.J. Mercier1-1/+1
The root memcg is onlined even when memcg is disabled. When it's onlined a 2 second periodic stat flush is started, but no stat flushing is required when memcg is disabled because there can be no child memcgs. Most calls to flush memcg stats are avoided when memcg is disabled as a result of the mem_cgroup_disabled check added in 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"), but the periodic flushing started in mem_cgroup_css_online is not. Skip it. Link: https://lkml.kernel.org/r/20240126211927.1171338-1-tjmercier@google.com Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Li <chrisl@kernel.org> Reported-by: Minchan Kim <minchan@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Koutn <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: kmsan: remove runtime checks from kmsan_unpoison_memory()Alexander Potapenko1-23/+13
Similarly to what's been done in commit 85716a80c16d ("kmsan: allow using __msan_instrument_asm_store() inside runtime"), it should be safe to call kmsan_unpoison_memory() from within the runtime, as it does not allocate memory or take locks. Remove the redundant runtime checks. This should fix false positives seen with CONFIG_DEBUG_LIST=y when the non-instrumented lib/stackdepot.c failed to unpoison the memory chunks later checked by the instrumented lib/list_debug.c Also replace the implementation of kmsan_unpoison_entry_regs() with a call to kmsan_unpoison_memory(). Link: https://lkml.kernel.org/r/20240124173134.1165747-1-glider@google.com Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Signed-off-by: Alexander Potapenko <glider@google.com> Tested-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory_hotplug: export mhp_supports_memmap_on_memory()Vishal Verma1-11/+6
In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-4-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: remove unused tree argument in zswap_entry_put()Yosry Ahmed1-5/+4
Commit 7310895779624 ("mm: zswap: tighten up entry invalidation") removed the usage of tree argument, delete it. Link: https://lkml.kernel.org/r/20240125081423.1200336-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mmap: introduce vma_set_range()Yajun Deng2-22/+16
There is a lot of code needs to set the range of vma in mmap.c, introduce vma_set_range() to simplify the code. Link: https://lkml.kernel.org/r/20240124035719.3685193-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: remove unnecessary trees cleanups in zswap_swapoff()Yosry Ahmed1-13/+3
During swapoff, try_to_unuse() makes sure that zswap_invalidate() is called for all swap entries before zswap_swapoff() is called. This means that all zswap entries should already be removed from the tree. Simplify zswap_swapoff() by removing the trees cleanup code, and leave an assertion in its place. Link: https://lkml.kernel.org/r/20240124045113.415378-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: swap: enforce updating inuse_pages at the end of swap_range_free()Yosry Ahmed1-3/+15
Patch series "mm: zswap: simplify zswap_swapoff()", v2. These patches aim to simplify zswap_swapoff() by removing the unnecessary trees cleanup code. Patch 1 makes sure that the order of operations during swapoff is enforced correctly, making sure the simplification in patch 2 is correct in a future-proof manner. This patch (of 2): In swap_range_free(), we update inuse_pages then do some cleanups (arch invalidation, zswap invalidation, swap cache cleanups, etc). During swapoff, try_to_unuse() checks that inuse_pages is 0 to make sure all swap entries are freed. Make sure we only update inuse_pages after we are done with the cleanups in swap_range_free(), and use the proper memory barriers to enforce it. This makes sure that code following try_to_unuse() can safely assume that swap_range_free() ran for all entries in thr swapfile (e.g. swap cache cleanup, zswap_swapoff()). In practice, this currently isn't a problem because swap_range_free() is called with the swap info lock held, and the swapoff code happens to spin for that after try_to_unuse(). However, this seems fragile and unintentional, so make it more relable and future-proof. This also facilitates a following simplification of zswap_swapoff(). Link: https://lkml.kernel.org/r/20240124045113.415378-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240124045113.415378-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: split zswap rb-treeChengming Zhou2-26/+47
Each swapfile has one rb-tree to search the mapping of swp_entry_t to zswap_entry, that use a spinlock to protect, which can cause heavy lock contention if multiple tasks zswap_store/load concurrently. Optimize the scalability problem by splitting the zswap rb-tree into multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M), just like we did in the swap cache address_space splitting. Although this method can't solve the spinlock contention completely, it can mitigate much of that contention. Below is the results of kernel build in tmpfs with zswap shrinker enabled: linux-next zswap-lock-optimize real 1m9.181s 1m3.820s user 17m44.036s 17m40.100s sys 7m37.297s 4m54.622s So there are clearly improvements. Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-2-b5cc55479090@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: make sure each swapfile always have zswap rb-treeChengming Zhou2-8/+10
Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2. When testing the zswap performance by using kernel build -j32 in a tmpfs directory, I found the scalability of zswap rb-tree is not good, which is protected by the only spinlock. That would cause heavy lock contention if multiple tasks zswap_store/load concurrently. So a simple solution is to split the only one zswap rb-tree into multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"). Although this method can't solve the spinlock contention completely, it can mitigate much of that contention. Below is the results of kernel build in tmpfs with zswap shrinker enabled: linux-next zswap-lock-optimize real 1m9.181s 1m3.820s user 17m44.036s 17m40.100s sys 7m37.297s 4m54.622s So there are clearly improvements. And it's complementary with the ongoing zswap xarray conversion by Chris. Anyway, I think we can also merge this first, it's complementary IMHO. So I just refresh and resend this for further discussion. This patch (of 2): Not all zswap interfaces can handle the absence of the zswap rb-tree, actually only zswap_store() has handled it for now. To make things simple, we make sure each swapfile always have the zswap rb-tree prepared before being enabled and used. The preparation is unlikely to fail in practice, this patch just make it explicit. Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-0-b5cc55479090@bytedance.com Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-1-b5cc55479090@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mempolicy: clean up minor dead code in queue_pages_test_walk()Lukas Bulwahn1-4/+0
Commit 2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code") removes MPOL_MF_LAZY handling in queue_pages_test_walk(), and with that, there is no effective use of the local variable endvma in that function remaining. Remove the local variable endvma and its dead code. No functional change. This issue was identified with clang-analyzer's dead stores analysis. Link: https://lkml.kernel.org/r/20240122092504.18377-1-lukas.bulwahn@gmail.com Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: writeback: ratelimit stat flush from mem_cgroup_wb_statsShakeel Butt1-1/+1
One of our workloads (Postgres 14) has regressed when migrated from 5.10 to 6.1 upstream kernel. The regression can be reproduced by sysbench's oltp_write_only benchmark. It seems like the always on rstat flush in mem_cgroup_wb_stats() is causing the regression. So, rate limit that specific rstat flush. One potential consequence would be the dirty throttling might be decided on stale memcg stats. However from our benchmarks and production traffic we have not observed any change in the dirty throttling behavior of the application. Link: https://lkml.kernel.org/r/20240118184235.618164-1-shakeelb@google.com Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: memory: move mem_cgroup_charge() into alloc_anon_folio()Kefeng Wang1-8/+8
The GFP flags from vma_thp_gfp_mask() according to user configuration only used for large folio allocation but not for memory cgroup charge, and GFP_KERNEL is used for both order-0 and large order folio when memory cgroup charge at present. However, mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way. In addition to checking gfpflags_allow_blocking(), it pays attention to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within this memcg do not exceed their quotas. So we'd better to move mem_cgroup_charge() into alloc_anon_folio(), 1) it will make us to allocate as much as possible large order folio, because we could try the next order if mem_cgroup_charge() fails, although the memcg's memory usage is close to its limits. 2) using same GFP flags for allocation and charge is to be consistent with PMD THP firstly, in addition, according to GFP flag returned from vma_thp_gfp_mask(), GFP_TRANSHUGE_LIGHT could make us skip direct reclaim, _GFP_NORETRY will make us skip mem_cgroup_oom() and won't trigger memory cgroup oom from large order(order <= COSTLY_ORDER) folio charging. Link: https://lkml.kernel.org/r/20240122011612.501029-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240117103954.2756050-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: improve with alloc_workqueue() callRonald Monthero1-1/+2
The core-api create_workqueue is deprecated, this patch replaces the create_workqueue with alloc_workqueue. The previous implementation workqueue of zswap was a bounded workqueue, this patch uses alloc_workqueue() to create an unbounded workqueue. The WQ_UNBOUND attribute is desirable making the workqueue not localized to a specific cpu so that the scheduler is free to exercise improvisations in any demanding scenarios for offloading cpu time slices for workqueues. For example if any other workqueues of the same primary cpu had to be served which are WQ_HIGHPRI and WQ_CPU_INTENSIVE. Also Unbound workqueue happens to be more efficient in a system during memory pressure scenarios in comparison to a bounded workqueue. shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND|WQ_MEM_RECLAIM, 1); Overall the change suggested in this patch should be seamless and does not alter the existing behavior, other than the improvisation to be an unbounded workqueue. Link: https://lkml.kernel.org/r/20240116133145.12454-1-debug.penguin32@gmail.com Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22readahead: use ilog2 instead of a while loop in page_cache_ra_order()Pankaj Raghav1-4/+2
A while loop is used to adjust the new_order to be lower than the ra->size. ilog2 could be used to do the same instead of using a loop. ilog2 typically resolves to a bit scan reverse instruction. This is particularly useful when ra->size is smaller than the 2^new_order as it resolves in one instruction instead of looping to find the new_order. No functional changes. Link: https://lkml.kernel.org/r/20240115102523.2336742-1-kernel@pankajraghav.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: convert mm_counter_file() to take a folioKefeng Wang4-10/+10
Now all callers of mm_counter_file() have a folio, convert mm_counter_file() to take a folio. Saves a call to compound_head() hidden inside PageSwapBacked(). Link: https://lkml.kernel.org/r/20240111152429.3374566-11-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: convert mm_counter() to take a folioKefeng Wang3-10/+10
Now all callers of mm_counter() have a folio, convert mm_counter() to take a folio. Saves a call to compound_head() hidden inside PageAnon(). Link: https://lkml.kernel.org/r/20240111152429.3374566-10-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: convert to should_zap_page() to should_zap_folio()Kefeng Wang1-14/+17
Make should_zap_page() take a folio and rename it to should_zap_folio() as preparation for converting mm counter functions to take a folio. Saves a call to compound_head() hidden inside PageAnon(). [wangkefeng.wang@huawei.com: fix used-uninitialized warning] Link: https://lkml.kernel.org/r/962a7993-fce9-4de8-85cd-25e290f25736@huawei.com Link: https://lkml.kernel.org/r/20240111152429.3374566-9-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: use pfn_swap_entry_folio() in copy_nonpresent_pte()Kefeng Wang1-2/+2
Call pfn_swap_entry_folio() as preparation for converting mm counter functions to take a folio. Link: https://lkml.kernel.org/r/20240111152429.3374566-8-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: use pfn_swap_entry_to_folio() in zap_huge_pmd()Kefeng Wang1-7/+10
Call pfn_swap_entry_to_folio() in zap_huge_pmd() as preparation for converting mm counter functions to take a folio. Saves a call to compound_head() embedded inside PageAnon(). Link: https://lkml.kernel.org/r/20240111152429.3374566-7-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: use pfn_swap_entry_folio() in __split_huge_pmd_locked()Kefeng Wang1-2/+2
Call pfn_swap_entry_folio() in __split_huge_pmd_locked() as preparation for converting mm counter functions to take a folio. Link: https://lkml.kernel.org/r/20240111152429.3374566-6-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mprotect: use pfn_swap_entry_folioMatthew Wilcox (Oracle)1-2/+2
We only want to know whether the folio is anonymous, so use pfn_swap_entry_folio() and save a call to compound_head(). Link: https://lkml.kernel.org/r/20240111152429.3374566-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: add pfn_swap_entry_folio()Matthew Wilcox (Oracle)2-2/+2
Patch series "mm: convert mm counter to take a folio", v3. Make sure all mm_counter() and mm_counter_file() callers have a folio, then convert mm counter functions to take a folio, which saves some compound_head() calls. This patch (of 10): Thanks to the compound_head() hidden inside PageLocked(), this saves a call to compound_head() over calling page_folio(pfn_swap_entry_to_page()) Link: https://lkml.kernel.org/r/20240111152429.3374566-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240111152429.3374566-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22memcg: use a folio in get_mctgt_type_thpMatthew Wilcox (Oracle)1-5/+7
Replace five calls to compound_head() with one. Link: https://lkml.kernel.org/r/20240111181219.3462852-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22memcg: use a folio in get_mctgt_typeMatthew Wilcox (Oracle)1-10/+13
Replace seven calls to compound_head() with one. We still use the page as page_mapped() is different from folio_mapped(). Link: https://lkml.kernel.org/r/20240111181219.3462852-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22memcg: return the folio in union mc_targetMatthew Wilcox (Oracle)1-7/+7
All users of target.page convert it to the folio, so we can just return the folio directly and save a few calls to compound_head(). Link: https://lkml.kernel.org/r/20240111181219.3462852-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22memcg: convert mem_cgroup_move_charge_pte_range() to use a folioMatthew Wilcox (Oracle)1-25/+24
Patch series "Convert memcontrol charge moving to use folios". No part of these patches should change behaviour; all the called functions already convert from page to folio, so this ought to simply be a reduction in the number of calls to compound_head(). This patch (of 4): Remove many calls to compound_head() by calling page_folio() once at the start of each stanza which receives a struct page from 'target'. There should be no change in behaviour here as all the called functions start out by converting the page to its folio. Link: https://lkml.kernel.org/r/20240111181219.3462852-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240111181219.3462852-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: mmap: no need to call khugepaged_enter_vma() for stackYang Shi1-2/+0
We avoid allocating THP for temporary stack, even though khugepaged_enter_vma() is called for stack VMAs, it actualy returns false. So no need to call it in the first place at all. Link: https://lkml.kernel.org/r/20231221065943.2803551-1-shy828301@gmail.com Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: kernel test robot <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: list_lru: disable memcg_aware when cgroup.memory is set to "nokmem"Haifeng Xu1-0/+3
Actually, when using a boot time kernel option "cgroup.memory=nokmem", all lru items are inserted to list_lru_node. But for those users who invoke list_lru_init_memcg() to initialize list_lru, list_lru_memcg_aware() returns true. And this brings unneeded operations related to memcg. To make things more convenient, let's disable memcg_aware when cgroup.memory is set to "nokmem". Link: https://lkml.kernel.org/r/20231228062715.338672-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: memory: use nth_page() in clear/copy_subpage()Kefeng Wang1-4/+5
The clear and copy of huge gigantic page has converted to use nth_page() to handle the possible discontinuous struct page(SPARSEMEM without VMEMMAP), but not change for the non-gigantic part, fix it too. Link: https://lkml.kernel.org/r/20231229082207.60235-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mmap: simplify vma link and unlinkYajun Deng1-25/+19
The file parameter in the __remove_shared_vm_struct is no longer used, remove it. These functions vma_link() and mmap_region() have some of the same code, introduce vma_link_file() helper function to simplify the code. Link: https://lkml.kernel.org/r/20240110084622.2425927-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/filemap: avoid type conversionHongbo Li1-1/+1
The return type of function folio_test_hugetlb is bool type, there is no need to assign it to an integer type. Link: https://lkml.kernel.org/r/20240108044815.3291487-1-lihongbo22@huawei.com Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiersSumanth Korikkar2-4/+16
Patch series "implement "memmap on memory" feature on s390". This series provides "memmap on memory" support on s390 platform. "memmap on memory" allows struct pages array to be allocated from the hotplugged memory range instead of allocating it from main system memory. s390 currently preallocates struct pages array for all potentially possible memory, which ensures memory onlining always succeeds, but with the cost of significant memory consumption from the available system memory during boottime. In certain extreme configuration, this could lead to ipl failure. "memmap on memory" ensures struct pages array are populated from self contained hotplugged memory range instead of depleting the available system memory and this could eliminate ipl failure on s390 platform. On other platforms, system might go OOM when the physically hotplugged memory depletes the available memory before it is onlined. Hence, "memmap on memory" feature was introduced as described in commit a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range"). Unlike other architectures, s390 memory blocks are not physically accessible until it is online. To make it physically accessible two new memory notifiers MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE are added and this notifier lets the hypervisor inform that the memory should be made physically accessible. This allows for "memmap on memory" initialization during memory hotplug onlining phase, which is performed before calling MEM_GOING_ONLINE notifier. Patch 1 introduces MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. New mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced to ensure altmap cannot be written when adding memory - before it is set online. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Patches 2 allocates vmemmap pages from self-contained memory range for s390. It allocates memory map (struct pages array) from the hotplugged memory range, rather than using system memory by passing altmap to vmemmap functions. Patch 3 removes unhandled memory notifier types on s390. Patch 4 implements MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers on s390. MEM_PREPARE_ONLINE memory notifier makes memory block physical accessible via sclp assign command. The notifier ensures self-contained memory maps are accessible and hence enabling the "memmap on memory" on s390. MEM_FINISH_OFFLINE memory notifier shifts the memory block to an inaccessible state via sclp unassign command. Patch 5 finally enables MHP_MEMMAP_ON_MEMORY on s390. This patch (of 5): Introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Platforms such as x86 can support physical memory hotplug via ACPI. When there is physical memory hotplug, ACPI event leads to the memory addition with the following callchain: acpi_memory_device_add() -> acpi_memory_enable_device() -> __add_memory() After this, the hotplugged memory is physically accessible, and altmap support prepared, before the "memmap on memory" initialization in memory_block_online() is called. On s390, memory hotplug works in a different way. The available hotplug memory has to be defined upfront in the hypervisor, but it is made physically accessible only when the user sets it online via sysfs, currently in the MEM_GOING_ONLINE notifier. This is too late and "memmap on memory" initialization is performed before calling MEM_GOING_ONLINE notifier. During the memory hotplug addition phase, altmap support is prepared and during the memory onlining phase s390 requires memory to be physically accessible and then subsequently initiate the "memmap on memory" initialization process. The memory provider will handle new MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE notifications and make the memory accessible. The mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced and is relevant when used along with MHP_MEMMAP_ON_MEMORY, because the altmap cannot be written (e.g., poisoned) when adding memory -- before it is set online. This allows for adding memory with an altmap that is not currently made available by a hypervisor. When onlining that memory, the hypervisor can be instructed to make that memory accessible via the new notifiers and the onlining phase will not require any memory allocations, which is helpful in low-memory situations. All architectures ignore unknown memory notifiers. Therefore, the introduction of these new notifiers does not result in any functional modifications across architectures. Link: https://lkml.kernel.org/r/20240108132747.3238763-1-sumanthk@linux.ibm.com Link: https://lkml.kernel.org/r/20240108132747.3238763-2-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Suggested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/cma: fix placement of trace_cma_alloc_start/finishKalesh Singh1-4/+4
The current placement of trace_cma_alloc_start/finish misses the fail cases: !cma || !cma->count || !cma->bitmap. trace_cma_alloc_finish is also not emitted for the failure case where bitmap_count > bitmap_maxno. Fix these missed cases by moving the start event before the failure checks and moving the finish event to the out label. Link: https://lkml.kernel.org/r/20240110012234.3793639-1-kaleshsingh@google.com Fixes: 7bc1aec5e287 ("mm: cma: add trace events for CMA alloc perf testing") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Liam Mark <lmark@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21kasan: guard release_free_meta() shadow access with kasan_arch_is_ready()Benjamin Gray1-0/+3
release_free_meta() accesses the shadow directly through the path kasan_slab_free __kasan_slab_free kasan_release_object_meta release_free_meta kasan_mem_to_shadow There are no kasan_arch_is_ready() guards here, allowing an oops when the shadow is not initialized. The oops can be seen on a Power8 KVM guest. This patch adds the guard to release_free_meta(), as it's the first level that specifically requires the shadow. It is safe to put the guard at the start of this function, before the stack put: only kasan_save_free_info() can initialize the saved stack, which itself is guarded with kasan_arch_is_ready() by its caller poison_slab_object(). If the arch becomes ready before release_free_meta() then we will not observe KASAN_SLAB_FREE_META in the object's shadow, so we will not put an uninitialized stack either. Link: https://lkml.kernel.org/r/20240213033958.139383-1-bgray@linux.ibm.com Fixes: 63b85ac56a64 ("kasan: stop leaking stack trace handles") Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/damon/lru_sort: fix quota status loss due to online tuningsSeongJae Park1-7/+36
For online parameters change, DAMON_LRU_SORT creates new schemes based on latest values of the parameters and replaces the old schemes with the new one. When creating it, the internal status of the quotas of the old schemes is not preserved. As a result, charging of the quota starts from zero after the online tuning. The data that collected to estimate the throughput of the scheme's action is also reset, and therefore the estimation should start from the scratch again. Because the throughput estimation is being used to convert the time quota to the effective size quota, this could result in temporal time quota inaccuracy. It would be recovered over time, though. In short, the quota accuracy could be temporarily degraded after online parameters update. Fix the problem by checking the case and copying the internal fields for the status. Link: https://lkml.kernel.org/r/20240216194025.9207-3-sj@kernel.org Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/damon/reclaim: fix quota stauts loss due to online tuningsSeongJae Park1-1/+17
Patch series "mm/damon: fix quota status loss due to online tunings". DAMON_RECLAIM and DAMON_LRU_SORT is not preserving internal quota status when applying new user parameters, and hence could cause temporal quota accuracy degradation. Fix it by preserving the status. This patch (of 2): For online parameters change, DAMON_RECLAIM creates new scheme based on latest values of the parameters and replaces the old scheme with the new one. When creating it, the internal status of the quota of the old scheme is not preserved. As a result, charging of the quota starts from zero after the online tuning. The data that collected to estimate the throughput of the scheme's action is also reset, and therefore the estimation should start from the scratch again. Because the throughput estimation is being used to convert the time quota to the effective size quota, this could result in temporal time quota inaccuracy. It would be recovered over time, though. In short, the quota accuracy could be temporarily degraded after online parameters update. Fix the problem by checking the case and copying the internal fields for the status. Link: https://lkml.kernel.org/r/20240216194025.9207-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240216194025.9207-2-sj@kernel.org Fixes: e035c280f6df ("mm/damon/reclaim: support online inputs update") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/damon/sysfs-schemes: handle schemes sysfs dir removal before ↵SeongJae Park1-0/+4
commit_schemes_quota_goals 'commit_schemes_quota_goals' command handler, damos_sysfs_set_quota_scores() assumes the number of schemes sysfs directory will be same to the number of schemes of the DAMON context. The assumption is wrong since users can remove schemes sysfs directories while DAMON is running. In the case, illegal memory accesses can happen. Fix it by checking the case. Link: https://lkml.kernel.org/r/20240213023633.124928-1-sj@kernel.org Fixes: d91beaa505a0 ("mm/damon/sysfs-schemes: implement a command for scheme quota goals only commit") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: memcontrol: clarify swapaccount=0 deprecation warningJohannes Weiner1-3/+7
The swapaccount deprecation warning is throwing false positives. Since we deprecated the knob and defaulted to enabling, the only reports we've been getting are from folks that set swapaccount=1. While this is a nice affirmation that always-enabling was the right choice, we certainly don't want to warn when users request the supported mode. Only warn when disabling is requested, and clarify the warning. [colin.i.king@gmail.com: spelling: "commdandline" -> "commandline"] Link: https://lkml.kernel.org/r/20240215090544.1649201-1-colin.i.king@gmail.com Link: https://lkml.kernel.org/r/20240213081634.3652326-1-hannes@cmpxchg.org Fixes: b25806dcd3d5 ("mm: memcontrol: deprecate swapaccounting=0 mode") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reported-by: "Jonas Schäfer" <jonas@wielicki.name> Reported-by: Narcis Garcia <debianlists@actiu.net> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/memblock: add MEMBLOCK_RSRV_NOINIT into flagname[] arrayAnshuman Khandual1-0/+1
The commit 77e6c43e137c ("memblock: introduce MEMBLOCK_RSRV_NOINIT flag") skipped adding this newly introduced memblock flag into flagname[] array, thus preventing a correct memblock flags output for applicable memblock regions. Link: https://lkml.kernel.org/r/20240209030912.1382251-1-anshuman.khandual@arm.com Fixes: 77e6c43e137c ("memblock: introduce MEMBLOCK_RSRV_NOINIT flag") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/zswap: invalidate duplicate entry when !zswap_enabledChengming Zhou1-1/+5
We have to invalidate any duplicate entry even when !zswap_enabled since zswap can be disabled anytime. If the folio store success before, then got dirtied again but zswap disabled, we won't invalidate the old duplicate entry in the zswap_store(). So later lru writeback may overwrite the new data in swapfile. Link: https://lkml.kernel.org/r/20240208023254.3873823-1-chengming.zhou@linux.dev Fixes: 42c06a0e8ebe ("mm: kill frontswap") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/swap: fix race when skipping swapcacheKairui Song3-0/+38
When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads swapin the same entry at the same time, they get different pages (A, B). Before one thread (T0) finishes the swapin and installs page (A) to the PTE, another thread (T1) could finish swapin of page (B), swap_free the entry, then swap out the possibly modified page reusing the same entry. It breaks the pte_same check in (T0) because PTE value is unchanged, causing ABA problem. Thread (T0) will install a stalled page (A) into the PTE and cause data corruption. One possible callstack is like this: CPU0 CPU1 ---- ---- do_swap_page() do_swap_page() with same entry <direct swapin path> <direct swapin path> <alloc page A> <alloc page B> swap_read_folio() <- read to page A swap_read_folio() <- read to page B <slow on later locks or interrupt> <finished swapin first> ... set_pte_at() swap_free() <- entry is free <write to page B, now page A stalled> <swap out page B to same swap entry> pte_same() <- Check pass, PTE seems unchanged, but page A is stalled! swap_free() <- page B content lost! set_pte_at() <- staled page A installed! And besides, for ZRAM, swap_free() allows the swap device to discard the entry content, so even if page (B) is not modified, if swap_read_folio() on CPU0 happens later than swap_free() on CPU1, it may also cause data loss. To fix this, reuse swapcache_prepare which will pin the swap entry using the cache flag, and allow only one thread to swap it in, also prevent any parallel code from putting the entry in the cache. Release the pin after PT unlocked. Racers just loop and wait since it's a rare and very short event. A schedule_timeout_uninterruptible(1) call is added to avoid repeated page faults wasting too much CPU, causing livelock or adding too much noise to perf statistics. A similar livelock issue was described in commit 029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead") Reproducer: This race issue can be triggered easily using a well constructed reproducer and patched brd (with a delay in read path) [1]: With latest 6.8 mainline, race caused data loss can be observed easily: $ gcc -g -lpthread test-thread-swap-race.c && ./a.out Polulating 32MB of memory region... Keep swapping out... Starting round 0... Spawning 65536 workers... 32746 workers spawned, wait for done... Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! Round 0 Failed, 15 data loss! This reproducer spawns multiple threads sharing the same memory region using a small swap device. Every two threads updates mapped pages one by one in opposite direction trying to create a race, with one dedicated thread keep swapping out the data out using madvise. The reproducer created a reproduce rate of about once every 5 minutes, so the race should be totally possible in production. After this patch, I ran the reproducer for over a few hundred rounds and no data loss observed. Performance overhead is minimal, microbenchmark swapin 10G from 32G zram: Before: 10934698 us After: 11157121 us Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) [kasong@tencent.com: v4] Link: https://lkml.kernel.org/r/20240219082040.7495-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <21cnbao@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>