summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2023-12-29kasan: rename pagealloc testsAndrey Konovalov1-25/+26
Rename "pagealloc" KASAN tests: 1. Use "kmalloc_large" for tests that use large kmalloc allocations. 2. Use "page_alloc" for tests that use page_alloc. Also clean up the comments. Link: https://lkml.kernel.org/r/f3eef6ddb87176c40958a3e5a0bd2386b52af4c6.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: add mempool testsAndrey Konovalov1-0/+319
Add KASAN tests for mempool. Link: https://lkml.kernel.org/r/5fd64732266be8287711b6408d86ffc78784be06.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mempool: introduce mempool_use_prealloc_onlyAndrey Konovalov1-0/+37
Introduce a new mempool_alloc_preallocated API that asks the mempool to only use the elements preallocated during the mempool's creation when allocating and to not attempt allocating new ones from the underlying allocator. This API is required to test the KASAN poisoning/unpoisoning functionality in KASAN tests, but it might be also useful on its own. Link: https://lkml.kernel.org/r/a14d809dbdfd04cc33bcacc632fee2abd6b83c00.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mempool: use new mempool KASAN hooksAndrey Konovalov1-10/+12
Update the mempool code to use the new mempool KASAN hooks. Rely on the return value of kasan_mempool_poison_object and kasan_mempool_poison_pages to prevent double-free and invalid-free bugs. Link: https://lkml.kernel.org/r/d36fc4a6865bdbd297cadb46b67641d436849f4c.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mempool: skip slub_debug poisoning when KASAN is enabledAndrey Konovalov1-0/+8
With the changes in the following patch, KASAN starts saving its metadata within freed mempool elements. Thus, skip slub_debug poisoning and checking of mempool elements when KASAN is enabled. Corruptions of freed mempool elements will be detected by KASAN anyway. Link: https://lkml.kernel.org/r/98a4b1617e8ceeb266ef9a46f5e8c7f67a563ad2.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: save alloc stack traces for mempoolAndrey Konovalov1-10/+40
Update kasan_mempool_unpoison_object to properly poison the redzone and save alloc strack traces for kmalloc and slab pools. As a part of this change, split out and use a unpoison_slab_object helper function from __kasan_slab_alloc. [nathan@kernel.org: mark unpoison_slab_object() as static] Link: https://lkml.kernel.org/r/20231221180042.104694-1-andrey.konovalov@linux.dev Link: https://lkml.kernel.org/r/05ad235da8347cfe14d496d01b2aaf074b4f607c.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: introduce poison_kmalloc_large_redzoneAndrey Konovalov1-18/+23
Split out a poison_kmalloc_large_redzone helper from __kasan_kmalloc_large and use it in the caller's code. This is a preparatory change for the following patches in this series. Link: https://lkml.kernel.org/r/93317097b668519d76097fb065201b2027436e22.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: clean up and rename ____kasan_kmallocAndrey Konovalov1-20/+22
Introduce a new poison_kmalloc_redzone helper function that poisons the redzone for kmalloc object. Drop the confusingly named ____kasan_kmalloc function and instead use poison_kmalloc_redzone along with the other required parts of ____kasan_kmalloc in the callers' code. This is a preparatory change for the following patches in this series. Link: https://lkml.kernel.org/r/5881232ad357ec0d59a5b1aefd9e0673a386399a.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: save free stack traces for slab mempoolsAndrey Konovalov1-11/+9
Make kasan_mempool_poison_object save free stack traces for slab and kmalloc mempools when the object is freed into the mempool. Also simplify and rename ____kasan_slab_free to poison_slab_object and do a few other reability changes. Link: https://lkml.kernel.org/r/413a7c7c3344fb56809853339ffaabc9e4905e94.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: clean up __kasan_mempool_poison_objectAndrey Konovalov1-12/+7
Reorganize the code and reword the comment in __kasan_mempool_poison_object to improve the code readability. Link: https://lkml.kernel.org/r/4f6fc8840512286c1a96e16e86901082c671677d.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: introduce kasan_mempool_unpoison_pagesAndrey Konovalov1-0/+6
Introduce and document a new kasan_mempool_unpoison_pages hook to be used by the mempool code instead of kasan_unpoison_pages. This hook is not functionally different from kasan_unpoison_pages, but using it improves the mempool code readability. Link: https://lkml.kernel.org/r/239bd9af6176f2cc59f5c25893eb36143184daff.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: introduce kasan_mempool_poison_pagesAndrey Konovalov1-0/+23
Introduce and document a kasan_mempool_poison_pages hook to be used by the mempool code instead of kasan_poison_pages. Compated to kasan_poison_pages, the new hook: 1. For the tag-based modes, skips checking and poisoning allocations that were not tagged due to sampling. 2. Checks for double-free and invalid-free bugs. In the future, kasan_poison_pages can also be updated to handle #2, but this is out-of-scope of this series. Link: https://lkml.kernel.org/r/88dc7340cce28249abf789f6e0c792c317df9ba5.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: introduce kasan_mempool_unpoison_objectAndrey Konovalov1-0/+5
Introduce and document a kasan_mempool_unpoison_object hook. This hook serves as a replacement for the generic kasan_unpoison_range that the mempool code relies on right now. mempool will be updated to use the new hook in one of the following patches. For now, define the new hook to be identical to kasan_unpoison_range. One of the following patches will update it to add stack trace collection. Link: https://lkml.kernel.org/r/dae25f0e18ed8fd50efe509c5b71a0592de5c18d.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: add return value for kasan_mempool_poison_objectAndrey Konovalov1-11/+10
Add a return value for kasan_mempool_poison_object that lets the caller know whether the allocation is affected by a double-free or an invalid-free bug. The caller can use this return value to stop operating on the object. Also introduce a check_page_allocation helper function to improve the code readability. Link: https://lkml.kernel.org/r/618af65273875fb9f56954285443279b15f1fcd9.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: move kasan_mempool_poison_objectAndrey Konovalov1-23/+23
Move kasan_mempool_poison_object after all slab-related KASAN hooks. This is a preparatory change for the following patches in this series. No functional changes. Link: https://lkml.kernel.org/r/23ea215409f43c13cdf9ecc454501a264c107d67.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: rename kasan_slab_free_mempool to kasan_mempool_poison_objectAndrey Konovalov2-3/+3
Patch series "kasan: save mempool stack traces". This series updates KASAN to save alloc and free stack traces for secondary-level allocators that cache and reuse allocations internally instead of giving them back to the underlying allocator (e.g. mempool). As a part of this change, introduce and document a set of KASAN hooks: bool kasan_mempool_poison_pages(struct page *page, unsigned int order); void kasan_mempool_unpoison_pages(struct page *page, unsigned int order); bool kasan_mempool_poison_object(void *ptr); void kasan_mempool_unpoison_object(void *ptr, size_t size); and use them in the mempool code. Besides mempool, skbuff and io_uring also cache allocations and already use KASAN hooks to poison those. Their code is updated to use the new mempool hooks. The new hooks save alloc and free stack traces (for normal kmalloc and slab objects; stack traces for large kmalloc objects and page_alloc are not supported by KASAN yet), improve the readability of the users' code, and also allow the users to prevent double-free and invalid-free bugs; see the patches for the details. This patch (of 21): Rename kasan_slab_free_mempool to kasan_mempool_poison_object. kasan_slab_free_mempool is a slightly confusing name: it is unclear whether this function poisons the object when it is freed into mempool or does something when the object is freed from mempool to the underlying allocator. The new name also aligns with other mempool-related KASAN hooks added in the following patches in this series. Link: https://lkml.kernel.org/r/cover.1703024586.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/c5618685abb7cdbf9fb4897f565e7759f601da84.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: migrate: fix getting incorrect page mapping during page migrationBaolin Wang1-17/+10
When running stress-ng testing, we found below kernel crash after a few hours: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 pc : dentry_name+0xd8/0x224 lr : pointer+0x22c/0x370 sp : ffff800025f134c0 ...... Call trace: dentry_name+0xd8/0x224 pointer+0x22c/0x370 vsnprintf+0x1ec/0x730 vscnprintf+0x2c/0x60 vprintk_store+0x70/0x234 vprintk_emit+0xe0/0x24c vprintk_default+0x3c/0x44 vprintk_func+0x84/0x2d0 printk+0x64/0x88 __dump_page+0x52c/0x530 dump_page+0x14/0x20 set_migratetype_isolate+0x110/0x224 start_isolate_page_range+0xc4/0x20c offline_pages+0x124/0x474 memory_block_offline+0x44/0xf4 memory_subsys_offline+0x3c/0x70 device_offline+0xf0/0x120 ...... After analyzing the vmcore, I found this issue is caused by page migration. The scenario is that, one thread is doing page migration, and we will use the target page's ->mapping field to save 'anon_vma' pointer between page unmap and page move, and now the target page is locked and refcount is 1. Currently, there is another stress-ng thread performing memory hotplug, attempting to offline the target page that is being migrated. It discovers that the refcount of this target page is 1, preventing the offline operation, thus proceeding to dump the page. However, page_mapping() of the target page may return an incorrect file mapping to crash the system in dump_mapping(), since the target page->mapping only saves 'anon_vma' pointer without setting PAGE_MAPPING_ANON flag. There are seveval ways to fix this issue: (1) Setting the PAGE_MAPPING_ANON flag for target page's ->mapping when saving 'anon_vma', but this can confuse PageAnon() for PFN walkers, since the target page has not built mappings yet. (2) Getting the page lock to call page_mapping() in __dump_page() to avoid crashing the system, however, there are still some PFN walkers that call page_mapping() without holding the page lock, such as compaction. (3) Using target page->private field to save the 'anon_vma' pointer and 2 bits page state, just as page->mapping records an anonymous page, which can remove the page_mapping() impact for PFN walkers and also seems a simple way. So I choose option 3 to fix this issue, and this can also fix other potential issues for PFN walkers, such as compaction. Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.1702641709.git.baolin.wang@linux.alibaba.com Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Xu Yu <xuyu@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert swap_cluster_readahead and swap_vma_readahead to return a folioMatthew Wilcox (Oracle)3-19/+19
shmem_swapin_cluster() immediately converts the page back to a folio, and swapin_readahead() may as well call folio_file_page() once instead of having each function call it. [willy@infradead.org: avoid NULL pointer deref] Link: https://lkml.kernel.org/r/ZYI7OcVlM1voKfBl@casper.infradead.org Link: https://lkml.kernel.org/r/20231213215842.671461-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: return a folio from read_swap_cache_async()Matthew Wilcox (Oracle)3-19/+18
The only two callers simply call put_page() on the page returned, so they're happier calling folio_put(). Saves two calls to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove page_swap_info()Matthew Wilcox (Oracle)2-8/+2
It's more efficient to get the swap_info_struct by calling swp_swap_info() directly. Link: https://lkml.kernel.org/r/20231213215842.671461-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert swap_readpage() to swap_read_folio()Matthew Wilcox (Oracle)5-20/+21
All callers have a folio, so pass it in, saving two calls to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert swap_page_sector() to swap_folio_sector()Matthew Wilcox (Oracle)2-7/+7
All callers have a folio, so pass it in. Saves a couple of calls to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to swap_readpage_bdev_async()Matthew Wilcox (Oracle)1-4/+4
Make it plain that this takes the head page (which before this point was just an assumption, but is now enforced by the compiler). Link: https://lkml.kernel.org/r/20231213215842.671461-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to swap_readpage_bdev_sync()Matthew Wilcox (Oracle)1-4/+4
Make it plain that this takes the head page (which before this point was just an assumption, but is now enforced by the compiler). Link: https://lkml.kernel.org/r/20231213215842.671461-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to swap_readpage_fs()Matthew Wilcox (Oracle)1-7/+6
Saves a call to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to swap_writepage_bdev_async()Matthew Wilcox (Oracle)1-5/+4
Saves a call to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to swap_writepage_bdev_sync()Matthew Wilcox (Oracle)1-5/+4
Saves a call to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to swap_writepage_fs()Matthew Wilcox (Oracle)1-9/+9
Saves several calls to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: pass a folio to __swap_writepage()Matthew Wilcox (Oracle)3-9/+9
Both callers now have a folio, so pass that in instead of the page. Removes a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: return the folio from __read_swap_cache_async()Matthew Wilcox (Oracle)3-73/+67
Patch series "More swap folio conversions". These all seem like fairly straightforward conversions to me. A lot of compound_head() calls get removed. And page_swap_info(), which is nice. This patch (of 13): Move the folio->page conversion into the callers that actually want that. Most of the callers are happier with the folio anyway. If the page_allocated boolean is set, the folio allocated is of order-0, so it is safe to pass the page directly to swap_readpage(). Link: https://lkml.kernel.org/r/20231213215842.671461-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231213215842.671461-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/zswap: change per-cpu mutex and buffer to per-acomp_ctxChengming Zhou1-71/+33
First of all, we need to rename acomp_ctx->dstmem field to buffer, since we are now using for purposes other than compression. Then we change per-cpu mutex and buffer to per-acomp_ctx, since them belong to the acomp_ctx and are necessary parts when used in the compress/decompress contexts. So we can remove the old per-cpu mutex and dstmem. Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-5-9382162bbf05@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/zswap: cleanup zswap_writeback_entry()Chengming Zhou1-19/+10
Also after the common decompress part goes to __zswap_load(), we can cleanup the zswap_writeback_entry() a little. Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-4-9382162bbf05@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Cc: Barry Song <21cnbao@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/zswap: cleanup zswap_load()Chengming Zhou1-9/+5
After the common decompress part goes to __zswap_load(), we can cleanup the zswap_load() a little. Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-3-9382162bbf05@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Chis Li <chrisl@kernel.org> (Google) Cc: Barry Song <21cnbao@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/zswap: refactor out __zswap_load()Chengming Zhou1-60/+32
zswap_load() and zswap_writeback_entry() have the same part that decompress the data from zswap_entry to page, so refactor out the common part as __zswap_load(entry, page). Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-2-9382162bbf05@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Cc: Barry Song <21cnbao@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/zswap: reuse dstmem when decompressChengming Zhou1-32/+12
Patch series "mm/zswap: dstmem reuse optimizations and cleanups", v5. The problem this series tries to optimize is that zswap_load() and zswap_writeback_entry() have to malloc a temporary memory to support !zpool_can_sleep_mapped(). We can avoid it by reusing the percpu crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex. This patch (of 5): In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first copy the entry->handle memory to a temporary memory, which is allocated using kmalloc. Obviously we can reuse the per-compressor dstmem to avoid allocating every time, since it's percpu-compressor and protected in percpu mutex. Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-0-9382162bbf05@bytedance.com Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-1-9382162bbf05@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Cc: Barry Song <21cnbao@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/ksm: add tracepoint for ksm advisorStefan Roesch1-0/+1
This adds a new tracepoint for the ksm advisor. It reports the last scan time, the new setting of the pages_to_scan parameter and the average cpu percent usage of the ksmd background thread for the last scan. Link: https://lkml.kernel.org/r/20231218231054.1625219-4-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/ksm: add sysfs knobs for advisorStefan Roesch1-0/+148
This adds four new knobs for the KSM advisor to influence its behaviour. The knobs are: - advisor_mode: none: no advisor (default) scan-time: scan time advisor - advisor_max_cpu: 70 (default, cpu usage percent) - advisor_min_pages_to_scan: 500 (default) - advisor_max_pages_to_scan: 30000 (default) - advisor_target_scan_time: 200 (default in seconds) The new values will take effect on the next scan round. Link: https://lkml.kernel.org/r/20231218231054.1625219-3-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/ksm: add ksm advisorStefan Roesch1-1/+157
Patch series "mm/ksm: Add ksm advisor", v5. What is the KSM advisor? ========================= The ksm advisor automatically manages the pages_to_scan setting to achieve a target scan time. The target scan time defines how many seconds it should take to scan all the candidate KSM pages. In other words the pages_to_scan rate is changed by the advisor to achieve the target scan time. Why do we need a KSM advisor? ============================== The number of candidate pages for KSM is dynamic. It can often be observed that during the startup of an application more candidate pages need to be processed. Without an advisor the pages_to_scan parameter needs to be sized for the maximum number of candidate pages. With the scan time advisor the pages_to_scan parameter based can be changed based on demand. Algorithm ========== The algorithm calculates the change value based on the target scan time and the previous scan time. To avoid pertubations an exponentially weighted moving average is applied. The algorithm has a max and min value to: - guarantee responsiveness to changes - to limit CPU resource consumption Parameters to influence the KSM scan advisor ============================================= The respective parameters are: - ksm_advisor_mode 0: None (default), 1: scan time advisor - ksm_advisor_target_scan_time how many seconds a scan should of all candidate pages take - ksm_advisor_max_cpu upper limit for the cpu usage in percent of the ksmd background thread The initial value and the max value for the pages_to_scan parameter can be limited with: - ksm_advisor_min_pages_to_scan minimum value for pages_to_scan per batch - ksm_advisor_max_pages_to_scan maximum value for pages_to_scan per batch The default settings for the above two parameters should be suitable for most workloads. The parameters are exposed as knobs in /sys/kernel/mm/ksm. By default the scan time advisor is disabled. Currently there are two advisors: - none and - scan-time. Resource savings ================= Tests with various workloads have shown considerable CPU savings. Most of the workloads I have investigated have more candidate pages during startup. Once the workload is stable in terms of memory, the number of candidate pages is reduced. Without the advisor, the pages_to_scan needs to be sized for the maximum number of candidate pages. So having this advisor definitely helps in reducing CPU consumption. For the instagram workload, the advisor achieves a 25% CPU reduction. Once the memory is stable, the pages_to_scan parameter gets reduced to about 40% of its max value. The new advisor works especially well if the smart scan feature is also enabled. How is defining a target scan time better? =========================================== For an administrator it is more logical to set a target scan time.. The administrator can determine how many pages are scanned on each scan. Therefore setting a target scan time makes more sense. In addition the administrator might have a good idea about the memory sizing of its respective workloads. Setting cpu limits is easier than setting The pages_to_scan parameter. The pages_to_scan parameter is per batch. For the administrator it is difficult to set the pages_to_scan parameter. Tracing ======= A new tracing event has been added for the scan time advisor. The new trace event is called ksm_advisor. It reports the scan time, the new pages_to_scan setting and the cpu usage of the ksmd background thread. Other approaches ================= Approach 1: Adapt pages_to_scan after processing each batch. If KSM merges pages, increase the scan rate, if less KSM pages, reduce the the pages_to_scan rate. This doesn't work too well. While it increases the pages_to_scan for a short period, but generally it ends up with a too low pages_to_scan rate. Approach 2: Adapt pages_to_scan after each scan. The problem with that approach is that the calculated scan rate tends to be high. The more aggressive KSM scans, the more pages it can de-duplicate. There have been earlier attempts at an advisor: propose auto-run mode of ksm and its tests (https://marc.info/?l=linux-mm&m=166029880214485&w=2) This patch (of 5): This adds the ksm advisor. The ksm advisor automatically manages the pages_to_scan setting to achieve a target scan time. The target scan time defines how many seconds it should take to scan all the candidate KSM pages. In other words the pages_to_scan rate is changed by the advisor to achieve the target scan time. The algorithm has a max and min value to: - guarantee responsiveness to changes - limit CPU resource consumption The respective parameters are: - ksm_advisor_target_scan_time (how many seconds a scan should take) - ksm_advisor_max_cpu (maximum value for cpu percent usage) - ksm_advisor_min_pages (minimum value for pages_to_scan per batch) - ksm_advisor_max_pages (maximum value for pages_to_scan per batch) The algorithm calculates the change value based on the target scan time and the previous scan time. To avoid pertubations an exponentially weighted moving average is applied. The advisor is managed by two main parameters: target scan time, cpu max time for the ksmd background thread. These parameters determine how aggresive ksmd scans. In addition there are min and max values for the pages_to_scan parameter to make sure that its initial and max values are not set too low or too high. This ensures that it is able to react to changes quickly enough. The default values are: - target scan time: 200 secs - max cpu: 70% - min pages: 500 - max pages: 30000 By default the advisor is disabled. Currently there are two advisors: none and scan-time. Tests with various workloads have shown considerable CPU savings. Most of the workloads I have investigated have more candidate pages during startup, once the workload is stable in terms of memory, the number of candidate pages is reduced. Without the advisor, the pages_to_scan needs to be sized for the maximum number of candidate pages. So having this advisor definitely helps in reducing CPU consumption. For the instagram workload, the advisor achieves a 25% CPU reduction. Once the memory is stable, the pages_to_scan parameter gets reduced to about 40% of its max value. Link: https://lkml.kernel.org/r/20231218231054.1625219-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20231218231054.1625219-2-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictableMatthew Wilcox (Oracle)1-16/+0
All callers have now been converted to folio_add_new_anon_rmap() and folio_add_lru_vma() so we can remove the wrapper. Link: https://lkml.kernel.org/r/20231211162214.2146080-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert collapse_huge_page() to use a folioMatthew Wilcox (Oracle)1-7/+8
Replace three calls to compound_head() with one. Link: https://lkml.kernel.org/r/20231211162214.2146080-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert migrate_vma_insert_page() to use a folioMatthew Wilcox (Oracle)1-11/+12
Replaces five calls to compound_head() with one. Link: https://lkml.kernel.org/r/20231211162214.2146080-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove references to page_add_new_anon_rmap in commentsMatthew Wilcox (Oracle)1-2/+2
Refer to folio_add_new_anon_rmap() instead. Link: https://lkml.kernel.org/r/20231211162214.2146080-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove stale example from commentMatthew Wilcox (Oracle)1-14/+4
folio_add_new_anon_rmap() no longer works this way, so just remove the entire example. Link: https://lkml.kernel.org/r/20231211162214.2146080-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove some calls to page_add_new_anon_rmap()Matthew Wilcox (Oracle)2-2/+2
We already have the folio in these functions, we just need to use it. folio_add_new_anon_rmap() didn't exist at the time they were converted to folios. Link: https://lkml.kernel.org/r/20231211162214.2146080-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert unuse_pte() to use a folio throughoutMatthew Wilcox (Oracle)1-22/+25
Saves about eight calls to compound_head(). Link: https://lkml.kernel.org/r/20231211162214.2146080-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove PageAnonExclusive assertions in unuse_pte()Matthew Wilcox (Oracle)1-4/+0
The page in question is either freshly allocated or known to be in the swap cache; these assertions are not particularly useful. Link: https://lkml.kernel.org/r/20231212164813.2540119-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert ksm_might_need_to_copy() to work on foliosMatthew Wilcox (Oracle)3-17/+23
Patch series "Finish two folio conversions". Most callers of page_add_new_anon_rmap() and lru_cache_add_inactive_or_unevictable() have been converted to their folio equivalents, but there are still a few stragglers. There's a bit of preparatory work in ksm and unuse_pte(), but after that it's pretty mechanical. This patch (of 9): Accept a folio as an argument and return a folio result. Removes a call to compound_head() in do_swap_page(), and prevents folio & page from getting out of sync in unuse_pte(). Reviewed-by: David Hildenbrand <david@redhat.com> [willy@infradead.org: fix smatch warning] Link: https://lkml.kernel.org/r/ZXnPtblC6A1IkyAB@casper.infradead.org [david@redhat.com: only adjust the page if the folio changed] Link: https://lkml.kernel.org/r/6a8f2110-fa91-4c10-9eae-88315309a6e3@redhat.com Link: https://lkml.kernel.org/r/20231211162214.2146080-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231211162214.2146080-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29userfaultfd: UFFDIO_MOVE uABIAndrea Arcangeli4-0/+745
Implement the uABI of UFFDIO_MOVE ioctl. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [2]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [2]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/ [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/ Update for the ioctl_userfaultfd(2) manpage: UFFDIO_MOVE (Since Linux xxx) Move a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. The source and destination addresses and the number of bytes to move are specified by the src, dst, and len fields of the uffdio_move structure pointed to by argp: struct uffdio_move { __u64 dst; /* Destination of move */ __u64 src; /* Source of move */ __u64 len; /* Number of bytes to move */ __u64 mode; /* Flags controlling behavior of move */ __s64 move; /* Number of bytes moved, or negated error */ }; The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_MOVE operation: UFFDIO_MOVE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES Allow holes in the source virtual range that is being moved. When not specified, the holes will result in ENOENT error. When specified, the holes will be accounted as successfully moved memory. This is mostly useful to move hugepage aligned virtual regions without knowing if there are transparent hugepages in the regions or not, but preventing the risk of having to split the hugepage during the operation. The move field is used by the kernel to return the number of bytes that was actually moved, or an error (a negated errno- style value). If the value returned in move doesn't match the value that was specified in len, the operation fails with the error EAGAIN. The move field is output-only; it is not read by the UFFDIO_MOVE operation. The operation may fail for various reasons. Usually, remapping of pages that are not exclusive to the given process fail; once KSM might deduplicate pages or fork() COW-shares pages during fork() with child processes, they are no longer exclusive. Further, the kernel might only perform lightweight checks for detecting whether the pages are exclusive, and return -EBUSY in case that check fails. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source VMA before fork(). This ioctl(2) operation returns 0 on success. In this case, the entire area was moved. On error, -1 is returned and errno is set to indicate the error. Possible errors include: EAGAIN The number of bytes moved (i.e., the value returned in the move field) does not equal the value that was specified in the len field. EINVAL Either dst or len was not a multiple of the system page size, or the range specified by src and len or dst and len was invalid. EINVAL An invalid bit was specified in the mode field. ENOENT The source virtual memory range has unmapped holes and UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. EEXIST The destination virtual memory range is fully or partially mapped. EBUSY The pages in the source virtual memory range are either pinned or not exclusive to the process. The kernel might only perform lightweight checks for detecting whether the pages are exclusive. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source virtual memory area before fork(). ENOMEM Allocating memory needed for the operation failed. ESRCH The target process has exited at the time of a UFFDIO_MOVE operation. Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/rmap: support move to different root anon_vma in folio_move_anon_rmap()Andrea Arcangeli1-0/+24
Patch series "userfaultfd move option", v6. This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has long been implemented and maintained by Andrea in his local tree [1], but was not upstreamed due to lack of use cases where this approach would be better than allocating a new page and copying the contents. Previous upstraming attempts could be found at [6] and [7]. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [3]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. TODOs for follow-up improvements: - cross-mm support. Known differences from single-mm and missing pieces: - memcg recharging (might need to isolate pages in the process) - mm counters - cross-mm deposit table moves - cross-mm test - document the address space where src and dest reside in struct uffdio_move - TLB flush batching. Will require extensive changes to PTL locking in move_pages_pte(). OTOH that might let us reuse parts of mremap code. This patch (of 5): For now, folio_move_anon_rmap() was only used to move a folio to a different anon_vma after fork(), whereby the root anon_vma stayed unchanged. For that, it was sufficient to hold the folio lock when calling folio_move_anon_rmap(). However, we want to make use of folio_move_anon_rmap() to move folios between VMAs that have a different root anon_vma. As folio_referenced() performs an RMAP walk without holding the folio lock but only holding the anon_vma in read mode, holding the folio lock is insufficient. When moving to an anon_vma with a different root anon_vma, we'll have to hold both, the folio lock and the anon_vma lock in write mode. Consequently, whenever we succeeded in folio_lock_anon_vma_read() to read-lock the anon_vma, we have to re-check if the mapping was changed in the meantime. If that was the case, we have to retry. Note that folio_move_anon_rmap() must only be called if the anon page is exclusive to a process, and must not be called on KSM folios. This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the anon_vma lock in write mode, and the mmap_lock in read mode. Link: https://lkml.kernel.org/r/20231206103702.3873743-1-surenb@google.com Link: https://lkml.kernel.org/r/20231206103702.3873743-2-surenb@google.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: kernel-team@android.com Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/mglru: skip special VMAs in lru_gen_look_around()Yu Zhao1-4/+9
Special VMAs like VM_PFNMAP can contain anon pages from COW. There isn't much profit in doing lookaround on them. Besides, they can trigger the pte_special() warning in get_pte_pfn(). Skip them in lru_gen_look_around(). Link: https://lkml.kernel.org/r/20231223045647.1566043-1-yuzhao@google.com Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: syzbot+03fd9b3f71641f0ebf2d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/000000000000f9ff00060d14c256@google.com/ Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>