summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2023-01-19mm: multi-gen LRU: per-node lru_gen_folio listsYu Zhao3-33/+358
For each node, memcgs are divided into two generations: the old and the young. For each generation, memcgs are randomly sharded into multiple bins to improve scalability. For each bin, an RCU hlist_nulls is virtually divided into three segments: the head, the tail and the default. An onlining memcg is added to the tail of a random bin in the old generation. The eviction starts at the head of a random bin in the old generation. The per-node memcg generation counter, whose reminder (mod 2) indexes the old generation, is incremented when all its bins become empty. There are four operations: 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its current generation (old or young) and updates its "seg" to "head"; 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its current generation (old or young) and updates its "seg" to "tail"; 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old generation, updates its "gen" to "old" and resets its "seg" to "default"; 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the young generation, updates its "gen" to "young" and resets its "seg" to "default". The events that trigger the above operations are: 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD; 2. The first attempt to reclaim an memcg below low, which triggers MEMCG_LRU_TAIL; 3. The first attempt to reclaim an memcg below reclaimable size threshold, which triggers MEMCG_LRU_TAIL; 4. The second attempt to reclaim an memcg below reclaimable size threshold, which triggers MEMCG_LRU_YOUNG; 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG; 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG; 7. Offlining an memcg, which triggers MEMCG_LRU_OLD. Note that memcg LRU only applies to global reclaim, and the round-robin incrementing of their max_seq counters ensures the eventual fairness to all eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter(). Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: multi-gen LRU: shuffle should_run_aging()Yu Zhao1-62/+62
Move should_run_aging() next to its only caller left. Link: https://lkml.kernel.org/r/20221222041905.2431096-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: multi-gen LRU: remove aging fairness safeguardYu Zhao1-67/+59
Recall that the aging produces the youngest generation: first it scans for accessed folios and updates their gen counters; then it increments lrugen->max_seq. The current aging fairness safeguard for kswapd uses two passes to ensure the fairness to multiple eligible memcgs. On the first pass, which is shared with the eviction, it checks whether all eligible memcgs are low on cold folios. If so, it requires a second pass, on which it ages all those memcgs at the same time. With memcg LRU, the aging, while ensuring eventual fairness, will run when necessary. Therefore the current aging fairness safeguard for kswapd will not be needed. Note that memcg LRU only applies to global reclaim. For memcg reclaim, the aging can be unfair to different memcgs, i.e., their lrugen->max_seq can be incremented at different paces. Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: multi-gen LRU: remove eviction fairness safeguardYu Zhao1-58/+23
Recall that the eviction consumes the oldest generation: first it bucket-sorts folios whose gen counters were updated by the aging and reclaims the rest; then it increments lrugen->min_seq. The current eviction fairness safeguard for global reclaim has a dilemma: when there are multiple eligible memcgs, should it continue or stop upon meeting the reclaim goal? If it continues, it overshoots and increases direct reclaim latency; if it stops, it loses fairness between memcgs it has taken memory away from and those it has yet to. With memcg LRU, the eviction, while ensuring eventual fairness, will stop upon meeting its goal. Therefore the current eviction fairness safeguard for global reclaim will not be needed. Note that memcg LRU only applies to global reclaim. For memcg reclaim, the eviction will continue, even if it is overshooting. This becomes unconditional due to code simplification. Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]Yu Zhao1-10/+10
lru_gen_folio will be chained into per-node lists by the coming lrugen->list. Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folioYu Zhao2-19/+19
Patch series "mm: multi-gen LRU: memcg LRU", v3. Overview ======== An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, since each node and memcg combination has an LRU of folios (see mem_cgroup_lruvec()). Its goal is to improve the scalability of global reclaim, which is critical to system-wide memory overcommit in data centers. Note that memcg reclaim is currently out of scope. Its memory bloat is a pointer to each lruvec and negligible to each pglist_data. In terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity in contrast to the current linear complexity. The basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios): 1. It has the young and the old (generations), i.e., the counterparts to the active and the inactive; 2. The increment of max_seq triggers promotion, i.e., the counterpart to activation; 3. Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation. In terms of global reclaim, it has two distinct features: 1. Sharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism; 2. Eventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time. The commit message in patch 6 details the workflow: https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/ The following is a simple test to quickly verify its effectiveness. Test design: 1. Create multiple memcgs. 2. Each memcg contains a job (fio). 3. All jobs access the same amount of memory randomly. 4. The system does not experience global memory pressure. 5. Periodically write to the root memory.reclaim. Desired outcome: 1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal) over mean(pgsteal) is close to 0%. 2. The total pgsteal is close to the total requested through memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close to 100%. Actual outcome [1]: MGLRU off MGLRU on stddev(pgsteal) / mean(pgsteal) 75% 20% sum(pgsteal) / sum(requested) 425% 95% #################################################################### MEMCGS=128 for ((memcg = 0; memcg < $MEMCGS; memcg++)); do mkdir /sys/fs/cgroup/memcg$memcg done start() { echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \ --filename=/dev/zero --size=1920M --rw=randrw \ --rate=64m,64m --random_distribution=random \ --fadvise_hint=0 --time_based --runtime=10h \ --group_reporting --minimal } for ((memcg = 0; memcg < $MEMCGS; memcg++)); do start & done sleep 600 for ((i = 0; i < 600; i++)); do echo 256m >/sys/fs/cgroup/memory.reclaim sleep 6 done for ((memcg = 0; memcg < $MEMCGS; memcg++)); do grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat done #################################################################### [1]: This was obtained from running the above script (touches less than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an hour. This patch (of 8): The new name lru_gen_folio will be more distinct from the coming lru_gen_memcg. Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE()Uladzislau Rezki (Sony)1-1/+3
Currently a vm_unmap_ram() functions triggers a BUG() if an area is not found. Replace it by the WARN_ON_ONCE() error message and keep machine alive instead of stopping it. The worst case is a memory leaking. Link: https://lkml.kernel.org/r/20221222190022.134380-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap()Uladzislau Rezki (Sony)1-31/+48
Currently the __vunmap() path calls __find_vmap_area() twice. Once on entry to check that the area exists, then inside the remove_vm_area() function which also performs a new search for the VA. In order to improvie it from a performance point of view we split remove_vm_area() into two new parts: - find_unlink_vmap_area() that does a search and unlink from tree; - __remove_vm_area() that removes without searching. In this case there is no any functional change for remove_vm_area() whereas vm_remove_mappings(), where a second search happens, switches to the __remove_vm_area() variant where the already detached VA is passed as a parameter, so there is no need to find it again. Performance wise, i use test_vmalloc.sh with 32 threads doing alloc free on a 64-CPUs-x86_64-box: perf without this patch: - 31.41% 0.50% vmalloc_test/10 [kernel.vmlinux] [k] __vunmap - 30.92% __vunmap - 17.67% _raw_spin_lock native_queued_spin_lock_slowpath - 12.33% remove_vm_area - 11.79% free_vmap_area_noflush - 11.18% _raw_spin_lock native_queued_spin_lock_slowpath 0.76% free_unref_page perf with this patch: - 11.35% 0.13% vmalloc_test/14 [kernel.vmlinux] [k] __vunmap - 11.23% __vunmap - 8.28% find_unlink_vmap_area - 7.95% _raw_spin_lock 7.44% native_queued_spin_lock_slowpath - 1.93% free_vmap_area_noflush - 0.56% _raw_spin_lock 0.53% native_queued_spin_lock_slowpath 0.60% __vunmap_range_noflush __vunmap() consumes around ~20% less CPU cycles on this test. Also, switch from find_vmap_area() to find_unlink_vmap_area() to prevent a double access to the vmap_area_lock: one for finding area, second time is for unlinking from a tree. [urezki@gmail.com: switch to find_unlink_vmap_area() in vm_unmap_ram()] Link: https://lkml.kernel.org/r/20221222190022.134380-2-urezki@gmail.com Link: https://lkml.kernel.org/r/20221222190022.134380-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: new primitive kvmemdup()Hao Sun1-1/+23
Similar to kmemdup(), but support large amount of bytes with kvmalloc() and does *not* guarantee that the result will be physically contiguous. Use only in cases where kvmalloc() is needed and free it with kvfree(). Also adapt policy_unpack.c in case someone bisect into this. Link: https://lkml.kernel.org/r/20221221144245.27164-1-sunhao.th@gmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nick Terrell <terrelln@fb.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Paul Moore <paul@paul-moore.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/swap: convert deactivate_page() to folio_deactivate()Vishal Moola (Oracle)5-14/+12
Deactivate_page() has already been converted to use folios, this change converts it to take in a folio argument instead of calling page_folio(). It also renames the function folio_deactivate() to be more consistent with other folio functions. [akpm@linux-foundation.org: fix left-over comments, per Yu Zhao] Link: https://lkml.kernel.org/r/20221221180848.20774-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use foliosVishal Moola (Oracle)1-6/+8
This change replaces 2 calls to compound_head() from put_page() and 1 call from mark_page_accessed() with one from page_folio(). This is in preparation for the conversion of deactivate_page() to folio_deactivate(). Link: https://lkml.kernel.org/r/20221221180848.20774-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19madvise: convert madvise_cold_or_pageout_pte_range() to use foliosVishal Moola (Oracle)1-49/+49
This change removes a number of calls to compound_head(), and saves 1729 bytes of kernel text. Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/memory: add vm_normal_folio()Vishal Moola (Oracle)1-0/+10
Patch series "Convert deactivate_page() to folio_deactivate()", v4. Deactivate_page() has already been converted to use folios. This patch series modifies the callers of deactivate_page() to use folios. It also introduces vm_normal_folio() to assist with folio conversions, and converts deactivate_page() to folio_deactivate() which takes in a folio. This patch (of 4): Introduce a wrapper function called vm_normal_folio(). This function calls vm_normal_page() and returns the folio of the page found, or null if no page is found. This function allows callers to get a folio from a pte, which will eventually allow them to completely replace their struct page variables with struct folio instead. Link: https://lkml.kernel.org/r/20221221180848.20774-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20221221180848.20774-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()Lorenzo Stoakes1-1/+1
This function sets __GFP_NOWARN in the gfp_mask rendering the warn_alloc() invocations no-ops. Remove this and instead rely on this flag being set only for the vm_area_alloc_pages() function, ensuring it is cleared for each of the warn_alloc() calls. Link: https://lkml.kernel.org/r/20221219123659.90614-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19kasan: allow sampling page_alloc allocations for HW_TAGSAndrey Konovalov4-16/+123
As Hardware Tag-Based KASAN is intended to be used in production, its performance impact is crucial. As page_alloc allocations tend to be big, tagging and checking all such allocations can introduce a significant slowdown. Add two new boot parameters that allow to alleviate that slowdown: - kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only every Nth page_alloc allocation with the order configured by the second added parameter (default: tag every such allocation). - kasan.page_alloc.sample.order, which makes sampling enabled by the first parameter only affect page_alloc allocations with the order equal or greater than the specified value (default: 3, see below). The exact performance improvement caused by using the new parameters depends on their values and the applied workload. The chosen default value for kasan.page_alloc.sample.order is 3, which matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is done for two reasons: 1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed costly to service", which corresponds to the idea that only large and thus costly allocations are supposed to sampled. 2. One of the workloads targeted by this patch is a benchmark that sends a large amount of data over a local loopback connection. Most multi-page data allocations in the networking subsystem have the order of SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER). When running a local loopback test on a testing MTE-enabled device in sync mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown. Applying this patch and setting kasan.page_alloc.sampling to a value higher than 1 allows to lower the slowdown. The performance improvement saturates around the sampling interval value of 10 with the default sampling page order of 3. This lowers the slowdown to ~20%. The slowdown in real scenarios involving the network will likely be better. Enabling page_alloc sampling has a downside: KASAN misses bad accesses to a page_alloc allocation that has not been tagged. This lowers the value of KASAN as a security mitigation. However, based on measuring the number of page_alloc allocations of different orders during boot in a test build, sampling with the default kasan.page_alloc.sample.order value affects only ~7% of allocations. The rest ~93% of allocations are still checked deterministically. Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Jann Horn <jannh@google.com> Cc: Mark Brand <markbrand@google.com> Cc: Peter Collingbourne <pcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19swap: avoid holding swap reference in swap_cache_get_folioKairui Song2-6/+13
All its callers either already hold a reference to, or lock the swap device while calling this function. There is only one exception in shmem_swapin_folio, just make this caller also hold a reference of the swap device, so this helper can be simplified and saves a few cycles. This also provides finer control of error handling in shmem_swapin_folio, on race (with swap off), it can just try again. For invalid swap entry, it can fail with a proper error code. Link: https://lkml.kernel.org/r/20221219185840.25441-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19swap: fold swap_ra_clamp_pfn into swap_ra_infoKairui Song1-25/+19
This makes the code cleaner. This helper is made of only two line of self explanational code and not reused anywhere else. And this actually make the compiled object smaller by a bit. bloat-o-meter results on x86_64 of mm/swap_state.o: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-35 (-35) Function old new delta swap_ra_info.constprop 512 477 -35 Total: Before=8388, After=8353, chg -0.42% Link: https://lkml.kernel.org/r/20221219185840.25441-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19swap: avoid a redundant pte map if ra window is 1Kairui Song1-5/+2
Avoid a redundant pte map/unmap when swap readahead window is 1. Link: https://lkml.kernel.org/r/20221219185840.25441-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19swapfile: get rid of volatile and avoid redundant readKairui Song1-3/+4
Patch series "Clean up and fixes for swap", v2. This series cleans up some code paths, saves a few cycles and reduces the object size by a bit. It also fixes some rare race issue with statistics. This patch (of 4): Convert a volatile variable to more readable READ_ONCE. And this actually avoids the code from reading the variable twice redundantly when it races. Link: https://lkml.kernel.org/r/20221219185840.25441-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20221219185840.25441-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/sysfs-schemes: implement scheme filtersSeongJae Park1-1/+92
Implement scheme filters functionality of DAMON sysfs interface by making the code reads the values of files under the filter directories and pass that to DAMON using DAMON kernel API. [sj@kernel.org: fix leaking a filter for wrong cgroup path] Link: https://lkml.kernel.org/r/20221219171807.55708-2-sj@kernel.org [sj@kernel.org: return an error for filter memcg path id lookup failure] Link: https://lkml.kernel.org/r/20221219171807.55708-3-sj@kernel.org Link: https://lkml.kernel.org/r/20221205230830.144349-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/sysfs-schemes: connect filter directory and filters directorySeongJae Park1-0/+68
Implement 'nr_filters' file under 'filters' directory, which will be used to populate specific number of 'filter' directory under the directory, similar to other 'nr_*' files in DAMON sysfs interface. Link: https://lkml.kernel.org/r/20221205230830.144349-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/sysfs-schemes: implement filter directorySeongJae Park1-0/+128
Implement DAMOS filter directory which will be located under the filters directory. The directory provides three files, namely type, matching, and memcg_path. 'type' and 'matching' will be directly connected to the fields of 'struct damos_filter' having same name. 'memcg_path' will receive the path of the memory cgroup of the interest and later converted to memcg id when it's committed. Link: https://lkml.kernel.org/r/20221205230830.144349-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/sysfs-schemes: implement filters directorySeongJae Park1-1/+84
DAMOS filters are currently supported by only DAMON kernel API. To expose the feature to user space, implement a DAMON sysfs directory named 'filters' under each scheme directory. Please note that this is implementing only the directory. Following commits will implement more files and directories, and finally connect the DAMOS filters feature. Link: https://lkml.kernel.org/r/20221205230830.144349-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/reclaim: add a parameter called skip_anon for avoiding anonymous ↵SeongJae Park1-0/+19
pages reclamation In some cases, for example if users have confidence at anonymous pages management or the swap device is too slow, users would want to avoid DAMON_RECLAIM swapping the anonymous pages out. For such case, add yet another DAMON_RECLAIM parameter, namely 'skip_anon'. When it is set as 'Y', DAMON_RECLAIM will avoid reclaiming anonymous pages using a DAMOS filter. Link: https://lkml.kernel.org/r/20221205230830.144349-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/paddr: support DAMOS filtersSeongJae Park1-9/+62
Implement support of the DAMOS filters in the physical address space monitoring operations set, for all DAMOS actions that it supports including 'pageout', 'lru_prio', and 'lru_deprio'. Link: https://lkml.kernel.org/r/20221205230830.144349-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/damon/core: implement damos filterSeongJae Park1-0/+39
Patch series "implement DAMOS filtering for anon pages and/or specific memory cgroups" DAMOS let users do system operations in a data access pattern oriented way. The data access pattern, which is extracted by DAMON, is somewhat accurate more than what user space could know in many cases. However, in some situation, users could know something more than the kernel about the pattern or some special requirements for some types of memory or processes. For example, some users would have slow swap devices and knows latency-ciritical processes and therefore want to use DAMON-based proactive reclamation (DAMON_RECLAIM) for only non-anonymous pages of non-latency-critical processes. For such restriction, users could exclude the memory regions from the initial monitoring regions and use non-dynamic monitoring regions update monitoring operations set including fvaddr and paddr. They could also adjust the DAMOS target access pattern. For dynamically changing memory layout and access pattern, those would be not enough. To help the case, add an interface, namely DAMOS filters, which can be used to avoid the DAMOS actions be applied to specific types of memory, to DAMON kernel API (damon.h). At the moment, it supports filtering anonymous pages and/or specific memory cgroups in or out for each DAMOS scheme. This patchset adds the support for all DAMOS actions that 'paddr' monitoring operations set supports ('pageout', 'lru_prio', and 'lru_deprio'), and the functionality is exposed via DAMON kernel API (damon.h) the DAMON sysfs interface (/sys/kernel/mm/damon/admins/), and DAMON_RECLAIM module parameters. Patches Sequence ---------------- First patch implements DAMOS filter interface to DAMON kernel API. Second patch makes the physical address space monitoring operations set to support the filters from all supporting DAMOS actions. Third patch adds anonymous pages filter support to DAMON_RECLAIM, and the fourth patch documents the DAMON_RECLAIM's new feature. Fifth to seventh patches implement DAMON sysfs files for support of the filters, and eighth patch connects the file to use DAMOS filters feature. Ninth patch adds simple self test cases for DAMOS filters of the sysfs interface. Finally, following two patches (tenth and eleventh) document the new features and interfaces. This patch (of 11): DAMOS lets users do system operation in a data access pattern oriented way. The data access pattern, which is extracted by DAMON, is somewhat accurate more than what user space could know in many cases. However, in some situation, users could know something more than the kernel about the pattern or some special requirements for some types of memory or processes. For example, some users would have slow swap devices and knows latency-ciritical processes and therefore want to use DAMON-based proactive reclamation (DAMON_RECLAIM) for only non-anonymous pages of non-latency-critical processes. For such restriction, users could exclude the memory regions from the initial monitoring regions and use non-dynamic monitoring regions update monitoring operations set including fvaddr and paddr. They could also adjust the DAMOS target access pattern. For dynamically changing memory layout and access pattern, those would be not enough. To help the case, add an interface, namely DAMOS filters, which can be used to avoid the DAMOS actions be applied to specific types of memory, to DAMON kernel API (damon.h). At the moment, it supports filtering anonymous pages and/or specific memory cgroups in or out for each DAMOS scheme. Note that this commit adds only the interface to the DAMON kernel API. The impelmentation should be made in the monitoring operations sets, and following commits will add that. Link: https://lkml.kernel.org/r/20221205230830.144349-1-sj@kernel.org Link: https://lkml.kernel.org/r/20221205230830.144349-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: memcontrol: deprecate charge movingJohannes Weiner1-0/+4
Charge moving mode in cgroup1 allows memory to follow tasks as they migrate between cgroups. This is, and always has been, a questionable thing to do - for several reasons. First, it's expensive. Pages need to be identified, locked and isolated from various MM operations, and reassigned, one by one. Second, it's unreliable. Once pages are charged to a cgroup, there isn't always a clear owner task anymore. Cache isn't moved at all, for example. Mapped memory is moved - but if trylocking or isolating a page fails, it's arbitrarily left behind. Frequent moving between domains may leave a task's memory scattered all over the place. Third, it isn't really needed. Launcher tasks can kick off workload tasks directly in their target cgroup. Using dedicated per-workload groups allows fine-grained policy adjustments - no need to move tasks and their physical pages between control domains. The feature was never forward-ported to cgroup2, and it hasn't been missed. Despite it being a niche usecase, the maintenance overhead of supporting it is enormous. Because pages are moved while they are live and subject to various MM operations, the synchronization rules are complicated. There are lock_page_memcg() in MM and FS code, which non-cgroup people don't understand. In some cases we've been able to shift code and cgroup API calls around such that we can rely on native locking as much as possible. But that's fragile, and sometimes we need to hold MM locks for longer than we otherwise would (pte lock e.g.). Mark the feature deprecated. Hopefully we can remove it soon. And backport into -stable kernels so that people who develop against earlier kernels are warned about this deprecation as early as possible. [akpm@linux-foundation.org: fix memory.rst underlining] Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: rmap: remove lock_page_memcg()Johannes Weiner1-18/+8
The previous patch made sure charge moving only touches pages for which page_mapped() is stable. lock_page_memcg() is no longer needed. Link: https://lkml.kernel.org/r/20221206171340.139790-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: memcontrol: skip moving non-present pages that are mapped elsewhereJohannes Weiner1-14/+38
Patch series "mm: push down lock_page_memcg()", v2. This patch (of 3): During charge moving, the pte lock and the page lock cover nearly all cases of stabilizing page_mapped(). The only exception is when we're looking at a non-present pte and find a page in the page cache or in the swapcache: if the page is mapped elsewhere, it can become unmapped outside of our control. For this reason, rmap needs lock_page_memcg(). We don't like cgroup-specific locks in generic MM code - especially in performance-critical MM code - and for a legacy feature that's unlikely to have many users left - if any. So remove the exception. Arguably that's better semantics anyway: the page is shared, and another process seems to be the more active user. Once we stop moving such pages, rmap doesn't need lock_page_memcg() anymore. The next patch will remove it. Link: https://lkml.kernel.org/r/20221206171340.139790-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20221206171340.139790-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19hugetlb: initialize variable to avoid compiler warningMike Kravetz1-1/+1
With the gcc 'maybe-uninitialized' warning enabled, gcc will produce: mm/hugetlb.c:6896:20: warning: `chg' may be used uninitialized This is a false positive, but may be difficult for the compiler to determine. maybe-uninitialized is disabled by default, but this gets flagged as a 0-DAY build regression. Initialize the variable to silence the warning. Link: https://lkml.kernel.org/r/20221216224507.106789-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: swap: convert mark_page_lazyfree() to folio_mark_lazyfree()Kefeng Wang3-9/+7
mark_page_lazyfree() and the callers are converted to use folio, this rename and make it to take in a folio argument instead of calling page_folio(). Link: https://lkml.kernel.org/r/20221209020618.190306-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: huge_memory: convert madvise_free_huge_pmd to use a folioKefeng Wang1-14/+14
Using folios instead of pages removes several calls to compound_head(), Link: https://lkml.kernel.org/r/20221207023431.151008-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19cma: tracing: print alloc result in trace_cma_alloc_finishWenchao Hao1-1/+1
The result of the allocation attempt is not printed in trace_cma_alloc_finish, but it's important to do it so we can set filters to catch specific errors on allocation or to trigger some operations on specific errors. We have printed the result in log, but the log is conditional and could not be filtered by tracing events. It introduces little overhead to print this result. The result of allocation is named `errorno' in the trace. Link: https://lkml.kernel.org/r/20221208142130.1501195-1-haowenchao@huawei.com Signed-off-by: Wenchao Hao <haowenchao@huawei.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/mempolicy: do not duplicate policy if it is not applicable for ↵Michal Hocko1-16/+12
set_mempolicy_home_node set_mempolicy_home_node tries to duplicate a memory policy before checking it whether it is applicable for the operation. There is no real reason for doing that and it might actually be a pointless memory allocation and deallocation exercise for MPOL_INTERLEAVE. Not a big problem but we can do better. Simply check the policy before acting on it. Link: https://lkml.kernel.org/r/20221216194537.238047-2-mathieu.desnoyers@efficios.com Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19page_io: remove buffer_head includeMatthew Wilcox (Oracle)1-1/+0
page_io never uses buffer_heads to do I/O. Link: https://lkml.kernel.org/r/20221215214402.3522366-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: introduce hugetlb_walk()Peter Xu3-24/+20
huge_pte_offset() is the main walker function for hugetlb pgtables. The name is not really representing what it does, though. Instead of renaming it, introduce a wrapper function called hugetlb_walk() which will use huge_pte_offset() inside. Assert on the locks when walking the pgtable. Note, the vma lock assertion will be a no-op for private mappings. Document the last special case in the page_vma_mapped_walk() path where we don't need any more lock to call hugetlb_walk(). Taking vma lock there is not needed because either: (1) potential callers of hugetlb pvmw holds i_mmap_rwsem already (from one rmap_walk()), or (2) the caller will not walk a hugetlb vma at all so the hugetlb code path not reachable (e.g. in ksm or uprobe paths). It's slightly implicit for future page_vma_mapped_walk() callers on that lock requirement. But anyway, when one day this rule breaks, one will get a straightforward warning in hugetlb_walk() with lockdep, then there'll be a way out. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: make walk_hugetlb_range() safe to pmd unsharePeter Xu2-1/+16
Since walk_hugetlb_range() walks the pgtable, it needs the vma lock to make sure the pgtable page will not be freed concurrently. Link: https://lkml.kernel.org/r/20221216155226.2043738-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: make follow_hugetlb_page() safe to pmd unsharePeter Xu1-0/+7
Since follow_hugetlb_page() walks the pgtable, it needs the vma lock to make sure the pgtable page will not be freed concurrently. Link: https://lkml.kernel.org/r/20221216155223.2043727-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unsharePeter Xu1-1/+4
Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock to make sure the pgtable page will not be freed concurrently. Link: https://lkml.kernel.org/r/20221216155219.2043714-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: move swap entry handling into vma lock when faultedPeter Xu2-25/+37
In hugetlb_fault(), there used to have a special path to handle swap entry at the entrance using huge_pte_offset(). That's unsafe because huge_pte_offset() for a pmd sharable range can access freed pgtables if without any lock to protect the pgtable from being freed after pmd unshare. Here the simplest solution to make it safe is to move the swap handling to be after the vma lock being held. We may need to take the fault mutex on either migration or hwpoison entries now (also the vma lock, but that's really needed), however neither of them is hot path. Note that the vma lock cannot be released in hugetlb_fault() when the migration entry is detected, because in migration_entry_wait_huge() the pgtable page will be used again (by taking the pgtable lock), so that also need to be protected by the vma lock. Modify migration_entry_wait_huge() so that it must be called with vma read lock held, and properly release the lock in __migration_entry_wait_huge(). Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/hugetlb: don't wait for migration entry during follow pagePeter Xu1-11/+0
That's what the code does with !hugetlb pages, so we should logically do the same for hugetlb, so migration entry will also be treated as no page. This is probably also the last piece in follow_page code that may sleep, the last one should be removed in cf994dd8af27 ("mm/gup: remove FOLL_MIGRATION", 2022-11-16). Link: https://lkml.kernel.org/r/20221216155100.2043537-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19hugetlb: update vma flag check for hugetlb vma lockMike Kravetz1-2/+1
The check for whether a hugetlb vma lock exists partially depends on the vma's flags. Currently, it checks for either VM_MAYSHARE or VM_SHARED. The reason both flags are used is because VM_MAYSHARE was previously cleared in hugetlb vmas as they are tore down. This is no longer the case, and only the VM_MAYSHARE check is required. Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/memfd: add write seals when apply SEAL_EXEC to executable memfdJeff Xu1-0/+6
In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to an executable memfd, so W^X from start. This implys application need to fill the content of the memfd first, after F_SEAL_EXEC is applied, application can no longer modify the content of the memfd. Typically, application seals the memfd right after writing to it. For example: 1. memfd_create(MFD_EXEC). 2. write() code to the memfd. 3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X. 4. call exec() on the memfd. Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: kernel test robot <lkp@intel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXECJeff Xu1-2/+46
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create). When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation. when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create. The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default. Also, the value of memfd_noexec is passed to child namespace at creation time. For example, if the init namespace has vm.memfd_noexec=2, all its children namespaces will be created with 2. [akpm@linux-foundation.org: add stub functions to fix build] [akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff] [akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review] [akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning] Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/memfd: add F_SEAL_EXECDaniel Verkamp2-0/+8
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8. Since Linux introduced the memfd feature, memfd have always had their execute bit set, and the memfd_create() syscall doesn't allow setting it differently. However, in a secure by default system, such as ChromeOS, (where all executables should come from the rootfs, which is protected by Verified boot), this executable nature of memfd opens a door for NoExec bypass and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm process created a memfd to share the content with an external process, however the memfd is overwritten and used for executing arbitrary code and root escalation. [2] lists more VRP in this kind. On the other hand, executable memfd has its legit use, runc uses memfd’s seal and executable feature to copy the contents of the binary then execute them, for such system, we need a solution to differentiate runc's use of executable memfds and an attacker's [3]. To address those above, this set of patches add following: 1> Let memfd_create() set X bit at creation time. 2> Let memfd to be sealed for modifying X bit. 3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of X bit.For example, if a container has vm.memfd_noexec=2, then memfd_create() without MFD_NOEXEC_SEAL will be rejected. 4> A new security hook in memfd_create(). This make it possible to a new LSM, which rejects or allows executable memfd based on its security policy. This patch (of 5): The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM. This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable. Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Co-developed-by: Jeff Xu <jeffxu@google.com> Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()Peter Xu5-32/+12
This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths. The reasons I still think this patch is worthwhile, are: (1) It is a cleanup already; diffstat tells. (2) It just feels natural after I thought about this, if the pte is uffd protected, let's remove the write bit no matter what it was. (2) Since x86 is the only arch that supports uffd-wp, it also redefines pte|pmd_mkuffd_wp() in that it should always contain removals of write bits. It means any future arch that want to implement uffd-wp should naturally follow this rule too. It's good to make it a default, even if with vm_page_prot changes on VM_UFFD_WP. (3) It covers more than vm_page_prot. So no chance of any potential future "accident" (like pte_mkdirty() sparc64 or loongarch, even though it just got its pte_mkdirty fixed <1 month ago). It'll be fairly clear when reading the code too that we don't worry anything before a pte_mkuffd_wp() on uncertainty of the write bit. We may call pte_wrprotect() one more time in some paths (e.g. thp split), but that should be fully local bitop instruction so the overhead should be negligible. Although this patch should logically also fix all the known issues on uffd-wp too recently on page migration (not for numa hint recovery - that may need another explcit pte_wrprotect), but this is not the plan for that fix. So no fixes, and stable doesn't need this. Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ives van Hoorne <ives@codesandbox.io> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: move folio_set_compound_order() to mm/internal.hSidhartha Kumar2-3/+22
folio_set_compound_order() is moved to an mm-internal location so external folio users cannot misuse this function. Change the name of the function to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON. Also, handle the case if a non-large folio is passed and add clarifying comments to the function. Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/ Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com Fixes: 9fd330582b2f ("mm: add folio dtor and order setter functions") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm: fix a few rare cases of using swapin error pte markerPeter Xu3-3/+16
This patch should harden commit 15520a3f0469 ("mm: use pte markers for swap errors") on using pte markers for swapin errors on a few corner cases. 1. Propagate swapin errors across fork()s: if there're swapin errors in the parent mm, after fork()s the child should sigbus too when an error page is accessed. 2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte marker can be quickly switched to a swapin error. 3. Explicitly ignore swapin error pte markers in change_protection(). I mostly don't worry on (2) or (3) at all, but we should still have them. Case (1) is special because it can potentially cause silent data corrupt on child when parent has swapin error triggered with swapoff, but since swapin error is rare itself already it's probably not easy to trigger either. Currently there is a priority difference between the uffd-wp bit and the swapin error entry, in which the swapin error always has higher priority (e.g. we don't need to wr-protect a swapin error pte marker). If there will be a 3rd bit introduced, we'll probably need to consider a more involved approach so we may need to start operate on the bits. Let's leave that for later. This patch is tested with case (1) explicitly where we'll get corrupted data before in the child if there's existing swapin error pte markers, and after patch applied the child can be rightfully killed. We don't need to copy stable for this one since 15520a3f0469 just landed as part of v6.2-rc1, only "Fixes" applied. Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com Fixes: 15520a3f0469 ("mm: use pte markers for swap errors") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19mm/uffd: fix pte marker when fork() without fork eventPeter Xu1-6/+2
Patch series "mm: Fixes on pte markers". Patch 1 resolves the syzkiller report from Pengfei. Patch 2 further harden pte markers when used with the recent swapin error markers. The major case is we should persist a swapin error marker after fork(), so child shouldn't read a corrupted page. This patch (of 2): When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may have it and has pte marker installed. The warning is improper along with the comment. The right thing is to inherit the pte marker when needed, or keep the dst pte empty. A vague guess is this happened by an accident when there's the prior patch to introduce src/dst vma into this helper during the uffd-wp feature got developed and I probably messed up in the rebase, since if we replace dst_vma with src_vma the warning & comment it all makes sense too. Hugetlb did exactly the right here (copy_hugetlb_page_range()). Fix the general path. Reproducer: https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808 Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com Fixes: c56d1b62cce8 ("mm/shmem: handle uffd-wp during fork()") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> # 5.19+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19Sync with v6.2-rc4Andrew Morton1-1/+7
Merge branch 'master' into mm-hotfixes-stable