summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2023-06-10mm: memory-failure: move sysctl register in memory_failure_init()Kefeng Wang1-9/+2
There is already a memory_failure_init(), don't add a new initcall, move register_sysctl_init() into it to cleanup a bit. Link: https://lkml.kernel.org/r/20230508114128.37081-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: hugetlb_vmemmap: provide stronger vmemmap allocation guaranteesPasha Tatashin1-6/+5
HugeTLB pages have a struct page optimizations where struct pages for tail pages are freed. However, when HugeTLB pages are destroyed, the memory for struct pages (vmemmap) need to be allocated again. Currently, __GFP_NORETRY flag is used to allocate the memory for vmemmap, but given that this flag makes very little effort to actually reclaim memory the returning of huge pages back to the system can be problem. Lets use __GFP_RETRY_MAYFAIL instead. This flag is also performs graceful reclaim without causing ooms, but at least it may perform a few retries, and will fail only when there is genuinely little amount of unused memory in the system. Freeing a 1G page requires 16M of free memory. A machine might need to be reconfigured from one task to another, and release a large number of 1G pages back to the system if allocating 16M fails, the release won't work. Link: https://lkml.kernel.org/r/20230508234059.2529638-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Suggested-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10kasan: use internal prototypes matching gcc-13 builtinsArnd Bergmann9-162/+161
gcc-13 warns about function definitions for builtin interfaces that have a different prototype, e.g.: In file included from kasan_test.c:31: kasan.h:574:6: error: conflicting types for built-in function '__asan_register_globals'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch] 574 | void __asan_register_globals(struct kasan_global *globals, size_t size); kasan.h:577:6: error: conflicting types for built-in function '__asan_alloca_poison'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch] 577 | void __asan_alloca_poison(unsigned long addr, size_t size); kasan.h:580:6: error: conflicting types for built-in function '__asan_load1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch] 580 | void __asan_load1(unsigned long addr); kasan.h:581:6: error: conflicting types for built-in function '__asan_store1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch] 581 | void __asan_store1(unsigned long addr); kasan.h:643:6: error: conflicting types for built-in function '__hwasan_tag_memory'; expected 'void(void *, unsigned char, long int)' [-Werror=builtin-declaration-mismatch] 643 | void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size); The two problems are: - Addresses are passes as 'unsigned long' in the kernel, but gcc-13 expects a 'void *'. - sizes meant to use a signed ssize_t rather than size_t. Change all the prototypes to match these. Using 'void *' consistently for addresses gets rid of a couple of type casts, so push that down to the leaf functions where possible. This now passes all randconfig builds on arm, arm64 and x86, but I have not tested it on the other architectures that support kasan, since they tend to fail randconfig builds in other ways. This might fail if any of the 32-bit architectures expect a 'long' instead of 'int' for the size argument. The __asan_allocas_unpoison() function prototype is somewhat weird, since it uses a pointer for 'stack_top' and an size_t for 'stack_bottom'. This looks like it is meant to be 'addr' and 'size' like the others, but the implementation clearly treats them as 'top' and 'bottom'. Link: https://lkml.kernel.org/r/20230509145735.9263-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10kasan: add kasan_tag_mismatch prototypeArnd Bergmann1-0/+3
The kasan sw-tags implementation contains one function that is only called from assembler and has no prototype in a header. This causes a W=1 warning: mm/kasan/sw_tags.c:171:6: warning: no previous prototype for 'kasan_tag_mismatch' [-Wmissing-prototypes] 171 | void kasan_tag_mismatch(unsigned long addr, unsigned long access_info, Add a prototype in the local header to get a clean build. Link: https://lkml.kernel.org/r/20230509145735.9263-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10migrate_pages_batch: simplify retrying and failure counting of large foliosHuang Ying1-76/+36
After recent changes to the retrying and failure counting in migrate_pages_batch(), it was found that it's unnecessary to count retrying and failure for normal, large, and THP folios separately. Because we don't use retrying and failure number of large folios directly. So, in this patch, we simplified retrying and failure counting of large folios via counting retrying and failure of normal and large folios together. This results in the reduced line number. Previously, in migrate_pages_batch we need to track whether the source folio is large/THP before splitting. So is_large is used to cache folio_test_large() result. Now, we don't need that variable any more because we don't count retrying and failure of large folios (only counting that of THP folios). So, in this patch, is_large is removed to simplify the code. This is just code cleanup, no functionality changes are expected. Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: memory_hotplug: fix format string in warningsRick Wertenbroek1-2/+2
The format string in __add_pages and __remove_pages has a typo and prints e.g., "Misaligned __add_pages start: 0xfc605 end: #fc609" instead of "Misaligned __add_pages start: 0xfc605 end: 0xfc609" Fix "#%lx" => "%#lx" Link: https://lkml.kernel.org/r/20230510090758.3537242-1-rick.wertenbroek@gmail.com Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10filemap: remove page_endio()Pankaj Raghav1-30/+0
page_endio() is not used anymore. Remove it. Link: https://lkml.kernel.org/r/20230510124716.73655-1-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm/gup: add missing gup_must_unshare() check to gup_huge_pgd()Lorenzo Stoakes1-0/+5
All other instances of gup_huge_pXd() perform the unshare check, so update the PGD-specific function to do so as well. While checking pgd_write() might seem unusual, this function already performs such a check via pgd_access_permitted() so this is in line with the existing implementation. David said: : This change makes unshare handling across all GUP-fast variants : consistent, which is desirable as GUP-fast is complicated enough : already even when consistent. : : This function was the only one I seemed to have missed (or left out and : forgot why -- maybe because it's really dead code for now). The COW : selftest would identify the problem, so far there was no report. : Either the selftest wasn't run on corresponding architectures with that : hugetlb size, or that code is still dead code and unused by : architectures. : : the original commit(s) that added unsharing explain why we care about : these checks: : : a7f226604170acd6 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page") : 84209e87c6963f92 ("mm/gup: reliable R/O long-term pinning in COW mappings") Link: https://lkml.kernel.org/r/cb971ac8dd315df97058ea69442ecc007b9a364a.1683381545.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10dmapool: create/destroy cleanupKeith Busch1-6/+4
Set the 'empty' bool directly from the result of the function that determines its value instead of adding additional logic. Link: https://lkml.kernel.org/r/20230126215125.4069751-13-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10cachestat: implement cachestat syscallNhat Pham1-0/+171
There is currently no good way to query the page cache state of large file sets and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really doesn not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in the following thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This patch implements a new syscall that queries cache state of a file and summarizes the number of cached pages, number of dirty pages, number of pages marked for writeback, number of (recently) evicted pages, etc. in a given range. Currently, the syscall is only wired in for x86 architecture. NAME cachestat - query the page cache statistics of a file. SYNOPSIS #include <sys/mman.h> struct cachestat_range { __u64 off; __u64 len; }; struct cachestat { __u64 nr_cache; __u64 nr_dirty; __u64 nr_writeback; __u64 nr_evicted; __u64 nr_recently_evicted; }; int cachestat(unsigned int fd, struct cachestat_range *cstat_range, struct cachestat *cstat, unsigned int flags); DESCRIPTION cachestat() queries the number of cached pages, number of dirty pages, number of pages marked for writeback, number of evicted pages, number of recently evicted pages, in the bytes range given by `off` and `len`. An evicted page is a page that is previously in the page cache but has been evicted since. A page is recently evicted if its last eviction was recent enough that its reentry to the cache would indicate that it is actively being used by the system, and that there is memory pressure on the system. These values are returned in a cachestat struct, whose address is given by the `cstat` argument. The `off` and `len` arguments must be non-negative integers. If `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == 0, we will query in the range from `off` to the end of the file. The `flags` argument is unused for now, but is included for future extensibility. User should pass 0 (i.e no flag specified). Currently, hugetlbfs is not supported. Because the status of a page can change after cachestat() checks it but before it returns to the application, the returned values may contain stale information. RETURN VALUE On success, cachestat returns 0. On error, -1 is returned, and errno is set to indicate the error. ERRORS EFAULT cstat or cstat_args points to an invalid address. EINVAL invalid flags. EBADF invalid file descriptor. EOPNOTSUPP file descriptor is of a hugetlbfs file [nphamcs@gmail.com: replace rounddown logic with the existing helper] Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10workingset: refactor LRU refault to expose refault recency checkNhat Pham1-48/+102
Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg, oom: remove explicit wakeup in mem_cgroup_oom_synchronize()Haifeng Xu1-8/+1
Before commit 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path"), all memcg oom killers were delayed to page fault path. And the explicit wakeup is used in this case: thread A: ... if (locked) { // complete oom-kill, hold the lock mem_cgroup_oom_unlock(memcg); ... } ... thread B: ... if (locked && !memcg->oom_kill_disable) { ... } else { schedule(); // can't acquire the lock ... } ... The reason is that thread A kicks off the OOM-killer, which leads to wakeups from the uncharges of the exiting task. But thread B is not guaranteed to see them if it enters the OOM path after the OOM kills but before thread A releases the lock. Now only oom_kill_disable case is handled from the #PF path. In that case it is userspace to trigger the wake up not the #PF path itself. All potential paths to free some charges are responsible to call memcg_oom_recover() , so the explicit wakeup is not needed in the mem_cgroup_oom_synchronize() path which doesn't release any memory itself. Link: https://lkml.kernel.org/r/20230419030739.115845-2-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize()Haifeng Xu1-10/+3
mem_cgroup_oom_synchronize() is only used when the memcg oom handling is handed over to the edge of the #PF path. Since commit 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") this is the case only when the kernel memcg oom killer is disabled (current->memcg_in_oom is only set if memcg->oom_kill_disable). Therefore a check for oom_kill_disable in mem_cgroup_oom_synchronize() is not required. Link: https://lkml.kernel.org/r/20230419030739.115845-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: remove mem_cgroup_flush_stats_atomic()Yosry Ahmed1-19/+5
Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). Remove the function and simplify the code. Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: calculate root usage from global stateYosry Ahmed1-19/+5
Currently, we approximate the root usage by adding the memcg stats for anon, file, and conditionally swap (for memsw). To read the memcg stats we need to invoke an rstat flush. rstat flushes can be expensive, they scale with the number of cpus and cgroups on the system. mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold() with irqs disabled, so such an expensive operation with irqs disabled can cause problems. Instead, approximate the root usage from global state. This is not 100% accurate, but the root usage has always been ill-defined anyway. Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: flush stats non-atomically in mem_cgroup_wb_stats()Yosry Ahmed1-5/+1
The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats() code path in wb_writeback() outside the lock section. We no longer need to flush the stats atomically. Flush the stats non-atomically. Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm/page_alloc: drop the unnecessary pfn_valid() for start pfnBaolin Wang1-1/+1
__pageblock_pfn_to_page() currently performs both pfn_valid check and pfn_to_online_page(). The former one is redundant because the latter is a stronger check. Drop pfn_valid(). Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: compaction: optimize compact_memory to comply with the admin-guideWen Yang1-1/+11
For the /proc/sys/vm/compact_memory file, the admin-guide states: When 1 is written to the file, all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required But it was not strictly followed, writing any value would cause all zones to be compacted. It has been slightly optimized to comply with the admin-guide. Enforce the 1 on the unlikely chance that the sysctl handler is ever extended to do something different. Commit ef4984384172 ("mm/compaction: remove unused variable sysctl_compact_memory") has also been optimized a bit here, as the declaration in the external header file has been eliminated, and sysctl_compact_memory also needs to be verified. [akpm@linux-foundation.org: add __read_mostly, per Mel] Link: https://lkml.kernel.org/r/tencent_DFF54DB2A60F3333F97D3F6B5441519B050A@qq.com Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Lam <william.lam@bytedance.com> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Fu Wei <wefu@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: dump memory.stat during cgroup OOM for v1Yosry Ahmed1-25/+35
Patch series "memcg: OOM log improvements", v2. This short patch series brings back some cgroup v1 stats in OOM logs that were unnecessarily changed before. It also makes memcg OOM logs less reliant on printk() internals. This patch (of 2): Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup OOM") made sure we dump all the stats in memory.stat during a cgroup OOM, but it also introduced a slight behavioral change. The code used to print the non-hierarchical v1 cgroup stats for the entire cgroup subtree, now it only prints the v2 cgroup stats for the cgroup under OOM. For cgroup v1 users, this introduces a few problems: (a) The non-hierarchical stats of the memcg under OOM are no longer shown. (b) A couple of v1-only stats (e.g. pgpgin, pgpgout) are no longer shown. (c) We show the list of cgroup v2 stats, even in cgroup v1. This list of stats is not tracked with v1 in mind. While most of the stats seem to be working on v1, there may be some stats that are not fully or correctly tracked. Although OOM log is not set in stone, we should not change it for no reason. When upgrading the kernel version to a version including commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup OOM"), these behavioral changes are noticed in cgroup v1. The fix is simple. Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup OOM") separated stats formatting from stats display for v2, to reuse the stats formatting in the OOM logs. Do the same for v1. Move the v2 specific formatting from memory_stat_format() to memcg_stat_format(), add memcg1_stat_format() for v1, and make memory_stat_format() select between them based on cgroup version. Since memory_stat_show() now works for both v1 & v2, drop memcg_stat_show(). Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()Yosry Ahmed1-13/+14
Currently, we format all the memcg stats into a buffer in mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs. However, this buffer is large in size. Although it is currently working as intended, ther is a dependency between the memcg stats buffer and the printk record size limit. If we add more stats in the future and the buffer becomes larger than the printk record size limit, or if the prink record size limit is reduced, the logs may be truncated. It is safer to use seq_buf_do_printk(), which will automatically break up the buffer at line breaks and issue small printk() calls. Refactor the code to move the seq_buf from memory_stat_format() to its callers, and use seq_buf_do_printk() to print the seq_buf in mem_cgroup_print_oom_meminfo(). Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHTDouglas Anderson1-23/+26
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will finish quickly but not for things that will take a long time. Exactly how long is too long is not well defined, but waits of tens of milliseconds is likely non-ideal. When putting a Chromebook under memory pressure (opening over 90 tabs on a 4GB machine) it was fairly easy to see delays waiting for some locks in the kcompactd code path of > 100 ms. While the laptop wasn't amazingly usable in this state, it was still limping along and this state isn't something artificial. Sometimes we simply end up with a lot of memory pressure. Putting the same Chromebook under memory pressure while it was running Android apps (though not stressing them) showed a much worse result (NOTE: this was on a older kernel but the codepaths here are similar). Android apps on ChromeOS currently run from a 128K-block, zlib-compressed, loopback-mounted squashfs disk. If we get a page fault from something backed by the squashfs filesystem we could end up holding a folio lock while reading enough from disk to decompress 128K (and then decompressing it using the somewhat slow zlib algorithms). That reading goes through the ext4 subsystem (because it's a loopback mount) before eventually ending up in the block subsystem. This extra jaunt adds extra overhead. Without much work I could see cases where we ended up blocked on a folio lock for over a second. With more extreme memory pressure I could see up to 25 seconds. We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the two locks that were seen to be slow [1] and that generated much discussion. After discussion, it was decided that we should avoid waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being held for IO. We'll continue with the unbounded wait for the more full SYNC modes. With this change, I couldn't see any slow waits on these locks with my previous testcases. NOTE: The reason I stated digging into this originally isn't because some benchmark had gone awry, but because we've received in-the-field crash reports where we have a hung task waiting on the page lock (which is the equivalent code path on old kernels). While the root cause of those crashes is likely unrelated and won't be fixed by this patch, analyzing those crash reports did point out these very long waits seemed like something good to fix. With this patch we should no longer hang waiting on these locks, but presumably the system will still be in a bad shape and hang somewhere else. [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hdanton@sina.com> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->cachedRoman Gushchin1-6/+6
A memcg pointer in the percpu stock can be accessed by drain_all_stock() from another cpu in a lockless way. In theory it might lead to an issue, similar to the one which has been discovered with stock->cached_objcg, where the pointer was zeroed between the check for being NULL and dereferencing. In this case the issue is unlikely a real problem, but to make it bulletproof and similar to stock->cached_objcg, let's annotate all accesses to stock->cached with READ_ONCE()/WTRITE_ONCE(). Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: kmem: fix a NULL pointer dereference in obj_stock_flush_required()Roman Gushchin1-9/+10
KCSAN found an issue in obj_stock_flush_required(): stock->cached_objcg can be reset between the check and dereference: ================================================================== BUG: KCSAN: data-race in drain_all_stock / drain_obj_stock write to 0xffff888237c2a2f8 of 8 bytes by task 19625 on cpu 0: drain_obj_stock+0x408/0x4e0 mm/memcontrol.c:3306 refill_obj_stock+0x9c/0x1e0 mm/memcontrol.c:3340 obj_cgroup_uncharge+0xe/0x10 mm/memcontrol.c:3408 memcg_slab_free_hook mm/slab.h:587 [inline] __cache_free mm/slab.c:3373 [inline] __do_kmem_cache_free mm/slab.c:3577 [inline] kmem_cache_free+0x105/0x280 mm/slab.c:3602 __d_free fs/dcache.c:298 [inline] dentry_free fs/dcache.c:375 [inline] __dentry_kill+0x422/0x4a0 fs/dcache.c:621 dentry_kill+0x8d/0x1e0 dput+0x118/0x1f0 fs/dcache.c:913 __fput+0x3bf/0x570 fs/file_table.c:329 ____fput+0x15/0x20 fs/file_table.c:349 task_work_run+0x123/0x160 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop+0xcf/0xe0 kernel/entry/common.c:171 exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:296 do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888237c2a2f8 of 8 bytes by task 19632 on cpu 1: obj_stock_flush_required mm/memcontrol.c:3319 [inline] drain_all_stock+0x174/0x2a0 mm/memcontrol.c:2361 try_charge_memcg+0x6d0/0xd10 mm/memcontrol.c:2703 try_charge mm/memcontrol.c:2837 [inline] mem_cgroup_charge_skmem+0x51/0x140 mm/memcontrol.c:7290 sock_reserve_memory+0xb1/0x390 net/core/sock.c:1025 sk_setsockopt+0x800/0x1e70 net/core/sock.c:1525 udp_lib_setsockopt+0x99/0x6c0 net/ipv4/udp.c:2692 udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2817 sock_common_setsockopt+0x61/0x70 net/core/sock.c:3668 __sys_setsockopt+0x1c3/0x230 net/socket.c:2271 __do_sys_setsockopt net/socket.c:2282 [inline] __se_sys_setsockopt net/socket.c:2279 [inline] __x64_sys_setsockopt+0x66/0x80 net/socket.c:2279 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0xffff8881382d52c0 -> 0xffff888138893740 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 19632 Comm: syz-executor.0 Not tainted 6.3.0-rc2-syzkaller-00387-g534293368afa #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023 Fix it by using READ_ONCE()/WRITE_ONCE() for all accesses to stock->cached_objcg. Link: https://lkml.kernel.org/r/20230502160839.361544-1-roman.gushchin@linux.dev Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: syzbot+774c29891415ab0fd29d@syzkaller.appspotmail.com Reported-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/linux-mm/CACT4Y+ZfucZhM60YPphWiCLJr6+SGFhT+jjm8k1P-a_8Kkxsjg@mail.gmail.com/T/#t Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-18mm: fix zswap writeback race conditionDomenico Cerasuolo1-0/+16
The zswap writeback mechanism can cause a race condition resulting in memory corruption, where a swapped out page gets swapped in with data that was written to a different page. The race unfolds like this: 1. a page with data A and swap offset X is stored in zswap 2. page A is removed off the LRU by zpool driver for writeback in zswap-shrink work, data for A is mapped by zpool driver 3. user space program faults and invalidates page entry A, offset X is considered free 4. kswapd stores page B at offset X in zswap (zswap could also be full, if so, page B would then be IOed to X, then skip step 5.) 5. entry A is replaced by B in tree->rbroot, this doesn't affect the local reference held by zswap-shrink work 6. zswap-shrink work writes back A at X, and frees zswap entry A 7. swapin of slot X brings A in memory instead of B The fix: Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW), zswap-shrink work just checks that the local zswap_entry reference is still the same as the one in the tree. If it's not the same it means that it's either been invalidated or replaced, in both cases the writeback is aborted because the local entry contains stale data. Reproducer: I originally found this by running `stress` overnight to validate my work on the zswap writeback mechanism, it manifested after hours on my test machine. The key to make it happen is having zswap writebacks, so whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do the trick. In order to reproduce this faster on a vm, I setup a system with ~100M of available memory and a 500M swap file, then running `stress --vm 1 --vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens of minutes. One can speed things up even more by swinging /sys/module/zswap/parameters/max_pool_percent up and down between, say, 20 and 1; this makes it reproduce in tens of seconds. It's crucial to set `--vm-stride` to something other than 4096 otherwise `stress` won't realize that memory has been corrupted because all pages would have the same data. Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Chris Li (Google) <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-18mm: kfence: fix false positives on big endianMichael Ellerman1-1/+1
Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance of __kfence_alloc() and __kfence_free()"), kfence reports failures in random places at boot on big endian machines. The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the address of each byte in its value, so it needs to be byte swapped on big endian machines. The compiler is smart enough to do the le64_to_cpu() at compile time, so there is no runtime overhead. Link: https://lkml.kernel.org/r/20230505035127.195387-1-mpe@ellerman.id.au Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of __kfence_alloc() and __kfence_free()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-18zsmalloc: move LRU update from zs_map_object() to zs_malloc()Nhat Pham1-27/+9
Under memory pressure, we sometimes observe the following crash: [ 5694.832838] ------------[ cut here ]------------ [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100) [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80 [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1 [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021 [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80 [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7 [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246 [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000 [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480 [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370 [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002 [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240 [ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000 [ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0 [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5695.207197] PKRU: 55555554 [ 5695.212602] Call Trace: [ 5695.217486] <TASK> [ 5695.221674] zs_map_object+0x91/0x270 [ 5695.229000] zswap_frontswap_store+0x33d/0x870 [ 5695.237885] ? do_raw_spin_lock+0x5d/0xa0 [ 5695.245899] __frontswap_store+0x51/0xb0 [ 5695.253742] swap_writepage+0x3c/0x60 [ 5695.261063] shrink_page_list+0x738/0x1230 [ 5695.269255] shrink_lruvec+0x5ec/0xcd0 [ 5695.276749] ? shrink_slab+0x187/0x5f0 [ 5695.284240] ? mem_cgroup_iter+0x6e/0x120 [ 5695.292255] shrink_node+0x293/0x7b0 [ 5695.299402] do_try_to_free_pages+0xea/0x550 [ 5695.307940] try_to_free_pages+0x19a/0x490 [ 5695.316126] __folio_alloc+0x19ff/0x3e40 [ 5695.323971] ? __filemap_get_folio+0x8a/0x4e0 [ 5695.332681] ? walk_component+0x2a8/0xb50 [ 5695.340697] ? generic_permission+0xda/0x2a0 [ 5695.349231] ? __filemap_get_folio+0x8a/0x4e0 [ 5695.357940] ? walk_component+0x2a8/0xb50 [ 5695.365955] vma_alloc_folio+0x10e/0x570 [ 5695.373796] ? walk_component+0x52/0xb50 [ 5695.381634] wp_page_copy+0x38c/0xc10 [ 5695.388953] ? filename_lookup+0x378/0xbc0 [ 5695.397140] handle_mm_fault+0x87f/0x1800 [ 5695.405157] do_user_addr_fault+0x1bd/0x570 [ 5695.413520] exc_page_fault+0x5d/0x110 [ 5695.421017] asm_exc_page_fault+0x22/0x30 After some investigation, I have found the following issue: unlike other zswap backends, zsmalloc performs the LRU list update at the object mapping time, rather than when the slot for the object is allocated. This deviation was discussed and agreed upon during the review process of the zsmalloc writeback patch series: https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/ Unfortunately, this introduces a subtle bug that occurs when there is a concurrent store and reclaim, which interleave as follows: zswap_frontswap_store() shrink_worker() zs_malloc() zs_zpool_shrink() spin_lock(&pool->lock) zs_reclaim_page() zspage = find_get_zspage() spin_unlock(&pool->lock) spin_lock(&pool->lock) zspage = list_first_entry(&pool->lru) list_del(&zspage->lru) zspage->lru.next = LIST_POISON1 zspage->lru.prev = LIST_POISON2 spin_unlock(&pool->lock) zs_map_object() spin_lock(&pool->lock) if (!list_empty(&zspage->lru)) list_del(&zspage->lru) CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */ With the current upstream code, this issue rarely happens. zswap only triggers writeback when the pool is already full, at which point all further store attempts are short-circuited. This creates an implicit pseudo-serialization between reclaim and store. I am working on a new zswap shrinking mechanism, which makes interleaving reclaim and store more likely, exposing this bug. zbud and z3fold do not have this problem, because they perform the LRU list update in the alloc function, while still holding the pool's lock. This patch fixes the aforementioned bug by moving the LRU update back to zs_malloc(), analogous to zbud and z3fold. Link: https://lkml.kernel.org/r/20230505185054.2417128-1-nphamcs@gmail.com Fixes: 64f768c6b32e ("zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-18mm: shrinkers: fix race condition on debugfs cleanupJoan Bruguera Micó2-7/+13
When something registers and unregisters many shrinkers, such as: for x in $(seq 10000); do unshare -Ui true; done Sometimes the following error is printed to the kernel log: debugfs: Directory '...' with parent 'shrinker' already present! This occurs since commit badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs") / v6.2: Since the call to `debugfs_remove_recursive` was moved outside the `shrinker_rwsem`/`shrinker_mutex` lock, but the call to `ida_free` stayed inside, a newly registered shrinker can be re-assigned that ID and attempt to create the debugfs directory before the directory from the previous shrinker has been removed. The locking changes in commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless") made the race condition more likely, though it existed before then. Commit badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs") could be reverted since the issue is addressed should no longer occur since the count and scan operations are lockless since commit 20cd1892fcc3 ("mm: shrinkers: make count and scan in shrinker debugfs lockless"). However, since this is a contended lock, prefer instead moving `ida_free` outside the lock to avoid the race. Link: https://lkml.kernel.org/r/20230503013232.299211-1-joanbrugueram@gmail.com Fixes: badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs") Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06Merge tag 'mm-stable-2023-05-06-10-49' of ↵Linus Torvalds1-205/+202
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull dmapool updates - again - from Andrew Morton: "Reinstate the dmapool changes which were accidentally removed by a mishap on the last commit in the previous attempt at the series" Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup"). [ The whole old series: def8574308ed..2d55c16c0c54 results in an empty diff because that last commit ended up being just a revert of all that came everything before it. - Linus ] * tag 'mm-stable-2023-05-06-10-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: dmapool: link blocks across pages dmapool: don't memset on free twice dmapool: simplify freeing dmapool: consolidate page initialization dmapool: rearrange page alloc failure handling dmapool: move debug code to own functions dmapool: speedup DMAPOOL_DEBUG with init_on_alloc dmapool: cleanup integer types dmapool: use sysfs_emit() instead of scnprintf() dmapool: remove checks for dev == NULL
2023-05-06Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of ↵Linus Torvalds2-5/+15
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Five hotfixes. Three are cc:stable, two pertain to merge window changes" * tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: afs: fix the afs_dir_get_folio return value nilfs2: do not write dirty data after degenerating to read-only mm: do not reclaim private data from pinned page nilfs2: fix infinite loop in nilfs_mdt_get_block() mm/mmap/vma_merge: always check invariants
2023-05-06dmapool: link blocks across pagesKeith Busch1-127/+130
The allocated dmapool pages are never freed for the lifetime of the pool. There is no need for the two level list+stack lookup for finding a free block since nothing is ever removed from the list. Just use a simple stack, reducing time complexity to constant. The implementation inserts the stack linking elements and the dma handle of the block within itself when freed. This means the smallest possible dmapool block is increased to at most 16 bytes to accommodate these fields, but there are no exisiting users requesting a dma pool smaller than that anyway. Removing the list has a significant change in performance. Using the kernel's micro-benchmarking self test: Before: # modprobe dmapool_test dmapool test: size:16 blocks:8192 time:57282 dmapool test: size:64 blocks:8192 time:172562 dmapool test: size:256 blocks:8192 time:789247 dmapool test: size:1024 blocks:2048 time:371823 dmapool test: size:4096 blocks:1024 time:362237 After: # modprobe dmapool_test dmapool test: size:16 blocks:8192 time:24997 dmapool test: size:64 blocks:8192 time:26584 dmapool test: size:256 blocks:8192 time:33542 dmapool test: size:1024 blocks:2048 time:9022 dmapool test: size:4096 blocks:1024 time:6045 The module test allocates quite a few blocks that may not accurately represent how these pools are used in real life. For a more marco level benchmark, running fio high-depth + high-batched on nvme, this patch shows submission and completion latency reduced by ~100usec each, 1% IOPs improvement, and perf record's time spent in dma_pool_alloc/free were reduced by half. [kbusch@kernel.org: push new blocks in ascending order] Link: https://lkml.kernel.org/r/20230221165400.1595247-1-kbusch@meta.com Link: https://lkml.kernel.org/r/20230126215125.4069751-12-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: don't memset on free twiceKeith Busch1-2/+2
If debug is enabled, dmapool will poison the range, so no need to clear it to 0 immediately before writing over it. Link: https://lkml.kernel.org/r/20230126215125.4069751-11-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: simplify freeingKeith Busch1-16/+6
The actions for busy and not busy are mostly the same, so combine these and remove the unnecessary function. Also, the pool is about to be freed so there's no need to poison the page data since we only check for poison on alloc, which can't be done on a freed pool. Link: https://lkml.kernel.org/r/20230126215125.4069751-10-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: consolidate page initializationKeith Busch1-4/+3
Various fields of the dma pool are set in different places. Move it all to one function. Link: https://lkml.kernel.org/r/20230126215125.4069751-9-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: rearrange page alloc failure handlingKeith Busch1-7/+9
Handle the error in a condition so the good path can be in the normal flow. Link: https://lkml.kernel.org/r/20230126215125.4069751-8-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: move debug code to own functionsKeith Busch1-51/+77
Clean up the normal path by moving the debug code outside it. Link: https://lkml.kernel.org/r/20230126215125.4069751-7-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: speedup DMAPOOL_DEBUG with init_on_allocTony Battersby1-1/+1
Avoid double-memset of the same allocated memory in dma_pool_alloc() when both DMAPOOL_DEBUG is enabled and init_on_alloc=1. Link: https://lkml.kernel.org/r/20230126215125.4069751-6-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: cleanup integer typesTony Battersby1-8/+11
To represent the size of a single allocation, dmapool currently uses 'unsigned int' in some places and 'size_t' in other places. Standardize on 'unsigned int' to reduce overhead, but use 'size_t' when counting all the blocks in the entire pool. Link: https://lkml.kernel.org/r/20230126215125.4069751-5-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: use sysfs_emit() instead of scnprintf()Tony Battersby1-16/+7
Use sysfs_emit instead of scnprintf, snprintf or sprintf. Link: https://lkml.kernel.org/r/20230126215125.4069751-4-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06dmapool: remove checks for dev == NULLTony Battersby1-31/+14
dmapool originally tried to support pools without a device because dma_alloc_coherent() supports allocations without a device. But nobody ended up using dma pools without a device, and trying to do so will result in an oops. So remove the checks for pool->dev == NULL since they are unneeded bloat. [kbusch@kernel.org: add check for null dev on create] Link: https://lkml.kernel.org/r/20230126215125.4069751-3-kbusch@meta.com Fixes: 2d55c16c0c54 ("dmapool: create/destroy cleanup") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06mm: do not reclaim private data from pinned pageJan Kara1-0/+10
If the page is pinned, there's no point in trying to reclaim it. Furthermore if the page is from the page cache we don't want to reclaim fs-private data from the page because the pinning process may be writing to the page at any time and reclaiming fs private info on a dirty page can upset the filesystem (see link below). Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz Link: https://lkml.kernel.org/r/20230428124140.30166-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06mm/mmap/vma_merge: always check invariantsLorenzo Stoakes1-5/+5
We may still have inconsistent input parameters even if we choose not to merge and the vma_merge() invariant checks are useful for checking this with no production runtime cost (these are only relevant when CONFIG_DEBUG_VM is specified). Therefore, perform these checks regardless of whether we merge. This is relevant, as a recent issue (addressed in commit "mm/mempolicy: Correctly update prev when policy is equal on mbind") in the mbind logic was only picked up in the 6.2.y stable branch where these assertions are performed prior to determining mergeability. Had this remained the same in mainline this issue may have been picked up faster, so moving forward let's always check them. Link: https://lkml.kernel.org/r/df548a6ae3fa135eec3b446eb3dae8eb4227da97.1682885809.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06filemap: Handle error return from __filemap_get_folio()Matthew Wilcox1-1/+1
Smatch reports that filemap_fault() was missed in the conversion of __filemap_get_folio() error returns from NULL to ERR_PTR. Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reported-by: syzbot+48011b86c8ea329af1b9@syzkaller.appspotmail.com Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-05Merge branch 'x86-uaccess-cleanup': x86 uaccess header cleanupsLinus Torvalds1-0/+2
Merge my x86 uaccess updates branch. The LAM ("Linear Address Masking") updates in this release made me unhappy about how "access_ok()" was done, and it actually turned out to have a couple of small bugs in it too. This is my cleanup of the code: - use the sign bit of the __user pointer rather than masking the address and checking it against the TASK_SIZE range. We already did this part for the get/put_user() side, but 'access_ok()' did the naïve "mask and range check" thing, which not only generates nasty code, but also ended up meaning that __access_ok itself didn't do a good job, and so copy_from_user_nmi() didn't get the check right. - move all the code that is 64-bit only into the 64-bit version of the header file, so that we don't unnecessarily pollute the shared x86 code and make it look like LAM might work in 32-bit too. - fix a bug in the address masking (that doesn't end up mattering: in this case the fix was to just remove the buggy code entirely). - a couple of trivial cleanups and added commentary about the access_ok() rules. * x86-uaccess-cleanup: x86-64: mm: clarify the 'positive addresses' user address rules x86: mm: remove 'sign' games from LAM untagged_addr*() macros x86: uaccess: move 32-bit and 64-bit parts into proper <asm/uaccess_N.h> header x86: mm: remove architecture-specific 'access_ok()' define x86-64: make access_ok() independent of LAM
2023-05-04Merge tag 'mm-hotfixes-stable-2023-05-03-16-27' of ↵Linus Torvalds3-5/+13
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hitfixes from Andrew Morton: "Five hotfixes. Three are cc:stable, two for this -rc cycle" * tag 'mm-hotfixes-stable-2023-05-03-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: change per-VMA lock statistics to be disabled by default MAINTAINERS: update Michal Simek's email mm/mempolicy: correctly update prev when policy is equal on mbind relayfs: fix out-of-bounds access in relay_file_read kasan: hw_tags: avoid invalid virt_to_page()
2023-05-04Merge tag 'mm-stable-2023-05-03-16-22' of ↵Linus Torvalds3-16/+89
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - Some DAMON cleanups from Kefeng Wang - Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE ioctl's behavior more similar to KSM's behavior. [ Andrew called these "final", but I suspect we'll have a series fixing up the fact that the last commit in the dmapools series in the previous pull seems to have unintentionally just reverted all the other commits in the same series.. - Linus ] * tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: hwpoison: coredump: support recovery from dump_user_range() mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page() mm/ksm: move disabling KSM from s390/gmap code to KSM code selftests/ksm: ksm_functional_tests: add prctl unmerge test mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0 mm/damon/paddr: fix missing folio_sz update in damon_pa_young() mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate() mm/damon/paddr: minor refactor of damon_pa_pageout()
2023-05-03x86-64: make access_ok() independent of LAMLinus Torvalds1-0/+2
The linear address masking (LAM) code made access_ok() more complicated, in that it now needs to untag the address in order to verify the access range. See commit 74c228d20a51 ("x86/uaccess: Provide untagged_addr() and remove tags before address check"). We were able to avoid that overhead in the get_user/put_user code paths by simply using the sign bit for the address check, and depending on the GP fault if the address was non-canonical, which made it all independent of LAM. And we can do the same thing for access_ok(): simply check that the user pointer range has the high bit clear. No need to bother with any address bit masking. In fact, we can go a bit further, and just check the starting address for known small accesses ranges: any accesses that overflow will still be in the non-canonical area and will still GP fault. To still make syzkaller catch any potentially unchecked user addresses, we'll continue to warn about GP faults that are caused by accesses in the non-canonical range. But we'll limit that to purely "high bit set and past the one-page 'slop' area". We could probably just do that "check only starting address" for any arbitrary range size: realistically all kernel accesses to user space will be done starting at the low address. But let's leave that kind of optimization for later. As it is, this already allows us to generate simpler code and not worry about any tag bits in the address. The one thing to look out for is the GUP address check: instead of actually copying data in the virtual address range (and thus bad addresses being caught by the GP fault), GUP will look up the page tables manually. As a result, the page table limits need to be checked, and that was previously implicitly done by the access_ok(). With the relaxed access_ok() check, we need to just do an explicit check for TASK_SIZE_MAX in the GUP code instead. The GUP code already needs to do the tag bit unmasking anyway, so there this is all very straightforward, and there are no LAM issues. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-03mm: change per-VMA lock statistics to be disabled by defaultSuren Baghdasaryan1-2/+8
Change CONFIG_PER_VMA_LOCK_STATS to be disabled by default, as most users don't need it. Add configuration help to clarify its usage. Link: https://lkml.kernel.org/r/20230428173533.18158-1-surenb@google.com Fixes: 52f238653e45 ("mm: introduce per-VMA lock statistics") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03mm/mempolicy: correctly update prev when policy is equal on mbindLorenzo Stoakes1-1/+3
The refactoring in commit f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator") introduces a subtle bug which arises when attempting to apply a new NUMA policy across a range of VMAs in mbind_range(). The refactoring passes a **prev pointer to keep track of the previous VMA in order to reduce duplication, and in all but one case it keeps this correctly updated. The bug arises when a VMA within the specified range has an equivalent policy as determined by mpol_equal() - which unlike other cases, does not update prev. This can result in a situation where, later in the iteration, a VMA is found whose policy does need to change. At this point, vma_merge() is invoked with prev pointing to a VMA which is before the previous VMA. Since vma_merge() discovers the curr VMA by looking for the one immediately after prev, it will now be in a situation where this VMA is incorrect and the merge will not proceed correctly. This is checked in the VM_WARN_ON() invariant case with end > curr->vm_end, which, if a merge is possible, results in a warning (if CONFIG_DEBUG_VM is specified). I note that vma_merge() performs these invariant checks only after merge_prev/merge_next are checked, which is debatable as it hides this issue if no merge is possible even though a buggy situation has arisen. The solution is simply to update the prev pointer even when policies are equal. This caused a bug to arise in the 6.2.y stable tree, and this patch resolves this bug. Link: https://lkml.kernel.org/r/83f1d612acb519d777bebf7f3359317c4e7f4265.1682866629.git.lstoakes@gmail.com Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator") Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304292203.44ddeff6-oliver.sang@intel.com Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03kasan: hw_tags: avoid invalid virt_to_page()Mark Rutland1-2/+2
When booting with 'kasan.vmalloc=off', a kernel configured with support for KASAN_HW_TAGS will explode at boot time due to bogus use of virt_to_page() on a vmalloc adddress. With CONFIG_DEBUG_VIRTUAL selected this will be reported explicitly, and with or without CONFIG_DEBUG_VIRTUAL the kernel will dereference a bogus address: | ------------[ cut here ]------------ | virt_to_phys used for non-linear address: (____ptrval____) (0xffff800008000000) | WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x78/0x80 | Modules linked in: | CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc3-00073-g83865133300d-dirty #4 | Hardware name: linux,dummy-virt (DT) | pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __virt_to_phys+0x78/0x80 | lr : __virt_to_phys+0x78/0x80 | sp : ffffcd076afd3c80 | x29: ffffcd076afd3c80 x28: 0068000000000f07 x27: ffff800008000000 | x26: fffffbfff0000000 x25: fffffbffff000000 x24: ff00000000000000 | x23: ffffcd076ad3c000 x22: fffffc0000000000 x21: ffff800008000000 | x20: ffff800008004000 x19: ffff800008000000 x18: ffff800008004000 | x17: 666678302820295f x16: ffffffffffffffff x15: 0000000000000004 | x14: ffffcd076b009e88 x13: 0000000000000fff x12: 0000000000000003 | x11: 00000000ffffefff x10: c0000000ffffefff x9 : 0000000000000000 | x8 : 0000000000000000 x7 : 205d303030303030 x6 : 302e30202020205b | x5 : ffffcd076b41d63f x4 : ffffcd076afd3827 x3 : 0000000000000000 | x2 : 0000000000000000 x1 : ffffcd076afd3a30 x0 : 000000000000004f | Call trace: | __virt_to_phys+0x78/0x80 | __kasan_unpoison_vmalloc+0xd4/0x478 | __vmalloc_node_range+0x77c/0x7b8 | __vmalloc_node+0x54/0x64 | init_IRQ+0x94/0xc8 | start_kernel+0x194/0x420 | __primary_switched+0xbc/0xc4 | ---[ end trace 0000000000000000 ]--- | Unable to handle kernel paging request at virtual address 03fffacbe27b8000 | Mem abort info: | ESR = 0x0000000096000004 | EC = 0x25: DABT (current EL), IL = 32 bits | SET = 0, FnV = 0 | EA = 0, S1PTW = 0 | FSC = 0x04: level 0 translation fault | Data abort info: | ISV = 0, ISS = 0x00000004 | CM = 0, WnR = 0 | swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041bc5000 | [03fffacbe27b8000] pgd=0000000000000000, p4d=0000000000000000 | Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP | Modules linked in: | CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.3.0-rc3-00073-g83865133300d-dirty #4 | Hardware name: linux,dummy-virt (DT) | pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __kasan_unpoison_vmalloc+0xe4/0x478 | lr : __kasan_unpoison_vmalloc+0xd4/0x478 | sp : ffffcd076afd3ca0 | x29: ffffcd076afd3ca0 x28: 0068000000000f07 x27: ffff800008000000 | x26: 0000000000000000 x25: 03fffacbe27b8000 x24: ff00000000000000 | x23: ffffcd076ad3c000 x22: fffffc0000000000 x21: ffff800008000000 | x20: ffff800008004000 x19: ffff800008000000 x18: ffff800008004000 | x17: 666678302820295f x16: ffffffffffffffff x15: 0000000000000004 | x14: ffffcd076b009e88 x13: 0000000000000fff x12: 0000000000000001 | x11: 0000800008000000 x10: ffff800008000000 x9 : ffffb2f8dee00000 | x8 : 000ffffb2f8dee00 x7 : 205d303030303030 x6 : 302e30202020205b | x5 : ffffcd076b41d63f x4 : ffffcd076afd3827 x3 : 0000000000000000 | x2 : 0000000000000000 x1 : ffffcd076afd3a30 x0 : ffffb2f8dee00000 | Call trace: | __kasan_unpoison_vmalloc+0xe4/0x478 | __vmalloc_node_range+0x77c/0x7b8 | __vmalloc_node+0x54/0x64 | init_IRQ+0x94/0xc8 | start_kernel+0x194/0x420 | __primary_switched+0xbc/0xc4 | Code: d34cfc08 aa1f03fa 8b081b39 d503201f (f9400328) | ---[ end trace 0000000000000000 ]--- | Kernel panic - not syncing: Attempted to kill the idle task! This is because init_vmalloc_pages() erroneously calls virt_to_page() on a vmalloc address, while virt_to_page() is only valid for addresses in the linear/direct map. Since init_vmalloc_pages() expects virtual addresses in the vmalloc range, it must use vmalloc_to_page() rather than virt_to_page(). We call init_vmalloc_pages() from __kasan_unpoison_vmalloc(), where we check !is_vmalloc_or_module_addr(), suggesting that we might encounter a non-vmalloc address. Luckily, this never happens. By design, we only call __kasan_unpoison_vmalloc() on pointers in the vmalloc area, and I have verified that we don't violate that expectation. Given that, is_vmalloc_or_module_addr() must always be true for any legitimate argument to __kasan_unpoison_vmalloc(). Correct init_vmalloc_pages() to use vmalloc_to_page(), and remove the redundant and misleading use of is_vmalloc_or_module_addr() in __kasan_unpoison_vmalloc(). Link: https://lkml.kernel.org/r/20230418164212.1775741-1-mark.rutland@arm.com Fixes: 6c2f761dad7851d8 ("kasan: fix zeroing vmalloc memory with HW_TAGS") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03mm/page_alloc: add some comments to explain the possible hole in ↵Baolin Wang1-0/+9
__pageblock_pfn_to_page() Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(), which checks whether the given zone contains holes, and uses pfn_to_online_page() to validate if the start pfn is online and valid, as well as using pfn_valid() to validate the end pfn. However, the __pageblock_pfn_to_page() function may return non-NULL even if the end pfn of a pageblock is in a memory hole in some situations. For example, if the pageblock order is MAX_ORDER, which will fall into 2 sub-sections, and the end pfn of the pageblock may be hole even though the start pfn is online and valid. See below memory layout as an example and suppose the pageblock order is MAX_ORDER. [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] [ 0.000000] DMA32 empty [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000040000000-0x0000001fa3c7ffff] [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7dfffff] Focus on the last memory range, and there is a hole for the range [mem 0x0000001fa7590000-0x0000001fa7dfffff]. That means the last pageblock will contain the range from 0x1fa7c00000 to 0x1fa7ffffff, since the pageblock must be 4M aligned. And in this pageblock, these pfns will fall into 2 sub-section (the sub-section size is 2M aligned). So, the 1st sub-section (indicates pfn range: 0x1fa7c00000 - 0x1fa7dfffff ) in this pageblock is valid by calling subsection_map_init() in free_area_init(), but the 2nd sub-section (indicates pfn range: 0x1fa7e00000 - 0x1fa7ffffff ) in this pageblock is not valid. This did not break anything until now, but the zone continuous is fragile in this possible scenario. So as previous discussion[1], it is better to add some comments to explain this possible issue in case there are some future pfn walkers that rely on this. [1] https://lore.kernel.org/all/87r0sdsmr6.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lkml.kernel.org/r/5c26368865e79c743a453dea48d30670b19d2e4f.1682425534.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/5c26368865e79c743a453dea48d30670b19d2e4f.1682425534.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>