summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-12-12fsdax: zero the edges if source is HOLE or UNWRITTENShiyang Ruan1-30/+49
If srcmap contains invalid data, such as HOLE and UNWRITTEN, the dest page should be zeroed. Otherwise, since it's a pmem, old data may remains on the dest page, the result of CoW will be incorrect. The function name is also not easy to understand, rename it to "dax_iomap_copy_around()", which means it copies data around the range. [akpm@linux-foundation.org: update dax_iomap_copy_around() kerneldoc, per Darrick] Link: https://lkml.kernel.org/r/1669973145-318-1-git-send-email-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/1669908538-55-4-git-send-email-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12fsdax: invalidate pages when CoWShiyang Ruan1-4/+13
CoW changes the share state of a dax page, but the share count of the page isn't updated. The next time access this page, it should have been a newly accessed, but old association exists. So, we need to clear the share state when CoW happens, in both dax_iomap_rw() and dax_zero_iter(). Link: https://lkml.kernel.org/r/1669908538-55-3-git-send-email-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12fsdax: introduce page->share for fsdax in reflink modeShiyang Ruan3-18/+27
Patch series "fsdax,xfs: fix warning messages", v2. Many testcases failed in dax+reflink mode with warning message in dmesg. Such as generic/051,075,127. The warning message is like this: [ 775.509337] ------------[ cut here ]------------ [ 775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 dax_insert_entry.cold+0x2e/0x69 [ 775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs xfs libcrc32c fuse [ 775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: G W 6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d [ 775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.0-3-3 04/01/2014 [ 775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69 [ 775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48 [ 775.526708] RSP: 0000:ffffc90001d57b30 EFLAGS: 00010082 [ 775.527042] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000042 [ 775.527396] RDX: ffffea000a0f6c80 RSI: ffffffff81dfab1b RDI: 00000000ffffffff [ 775.527819] RBP: ffffea000a0f6c40 R08: 0000000000000000 R09: ffffffff820625e0 [ 775.528241] R10: ffffc90001d579d8 R11: ffffffff820d2628 R12: ffff88815fc98320 [ 775.528598] R13: ffffc90001d57c18 R14: 0000000000000000 R15: 0000000000000001 [ 775.528997] FS: 00007f39fc75d740(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000 [ 775.529474] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 775.529800] CR2: 00007f39fc772040 CR3: 0000000107eb6001 CR4: 00000000003706e0 [ 775.530214] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 775.530592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 775.531002] Call Trace: [ 775.531230] <TASK> [ 775.531444] dax_fault_iter+0x267/0x6c0 [ 775.531719] dax_iomap_pte_fault+0x198/0x3d0 [ 775.532002] __xfs_filemap_fault+0x24a/0x2d0 [xfs aa8d25411432b306d9554da38096f4ebb86bdfe7] [ 775.532603] __do_fault+0x30/0x1e0 [ 775.532903] do_fault+0x314/0x6c0 [ 775.533166] __handle_mm_fault+0x646/0x1250 [ 775.533480] handle_mm_fault+0xc1/0x230 [ 775.533810] do_user_addr_fault+0x1ac/0x610 [ 775.534110] exc_page_fault+0x63/0x140 [ 775.534389] asm_exc_page_fault+0x22/0x30 [ 775.534678] RIP: 0033:0x7f39fc55820a [ 775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00 [ 775.535839] RSP: 002b:00007ffc66a08118 EFLAGS: 00010202 [ 775.536157] RAX: 00007f39fc772001 RBX: 0000000000042001 RCX: 00000000000063c1 [ 775.536537] RDX: 0000000000006400 RSI: 00007f39fac42050 RDI: 00007f39fc772040 [ 775.536919] RBP: 0000000000006400 R08: 00007f39fc772001 R09: 0000000000042000 [ 775.537304] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 [ 775.537694] R13: 00007f39fc772000 R14: 0000000000006401 R15: 0000000000000003 [ 775.538086] </TASK> [ 775.538333] ---[ end trace 0000000000000000 ]--- This also affects dax+noreflink mode if we run the test after a dax+reflink test. So, the most urgent thing is solving the warning messages. With these fixes, most warning messages in dax_associate_entry() are gone. But honestly, generic/388 will randomly failed with the warning. The case shutdown the xfs when fsstress is running, and do it for many times. I think the reason is that dax pages in use are not able to be invalidated in time when fs is shutdown. The next time dax page to be associated, it still remains the mapping value set last time. I'll keep on solving it. The warning message in dax_writeback_one() can also be fixed because of the dax unshare. This patch (of 8): fsdax page is used not only when CoW, but also mapread. To make the it easily understood, use 'share' to indicate that the dax page is shared by more than one extent. And add helper functions to use it. Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED. [ruansy.fnst@fujitsu.com: rename several functions] Link: https://lkml.kernel.org/r/1669972991-246-1-git-send-email-ruansy.fnst@fujitsu.com [ruansy.fnst@fujitsu.com: v2.2] Link: https://lkml.kernel.org/r/1670381359-53-1-git-send-email-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/1669908538-55-1-git-send-email-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/1669908538-55-2-git-send-email-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12selftests/damon: test removed scheme sysfs dir access bugSeongJae Park2-1/+59
A DAMON sysfs user could start DAMON with a scheme, remove the sysfs directory for the scheme, and then ask stats or schemes tried regions update. The related logic were not aware of the already removed directory situation, so it was able to results in invalid memory accesses. The fix has made with commit 8468b486612c ("mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed"), though. Add a selftest to prevent such kinds of bugs from being introduced again. Link: https://lkml.kernel.org/r/20221201170834.62823-1-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12kasan: fail non-kasan KUnit tests on KASAN reportsAndrey Konovalov3-0/+69
After the recent changes done to KUnit-enabled KASAN tests, non-KASAN KUnit tests stopped being failed when KASAN report is detected. Recover that property by failing the currently running non-KASAN KUnit test when KASAN detects and prints a report for a bad memory access. Note that if the bad accesses happened in a kernel thread that doesn't have a reference to the currently running KUnit-test available via current->kunit_test, the test won't be failed. This is a limitation of KUnit, which doesn't yet provide a thread-agnostic way to find the reference to the currenly running test. Link: https://lkml.kernel.org/r/7be29a8ea967cee6b7e48d3d5a242d1d0bd96851.1669820505.git.andreyknvl@google.com Fixes: 49d9977ac909 ("kasan: check CONFIG_KASAN_KUNIT_TEST instead of CONFIG_KUNIT") Fixes: 7ce0ea19d50e ("kasan: switch kunit tests to console tracepoints") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: David Gow <davidgow@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: change hugetlb allocation functions to return a folioSidhartha Kumar1-70/+64
Many hugetlb allocation helper functions have now been converting to folios, update their higher level callers to be compatible with folios. alloc_pool_huge_page is reorganized to avoid a smatch warning reporting the folio variable is uninitialized. [sidhartha.kumar@oracle.com: update alloc_and_dissolve_hugetlb_folio comments] Link: https://lkml.kernel.org/r/20221206233512.146535-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-11-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Suggested-by: John Hubbard <jhubbard@nvidia.com> Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Tarun Sahu <tsahu@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert hugetlb prep functions to foliosSidhartha Kumar1-33/+30
Convert prep_new_huge_page() and __prep_compound_gigantic_page() to folios. Link: https://lkml.kernel.org/r/20221129225039.82257-10-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert free_gigantic_page() to foliosSidhartha Kumar1-12/+17
Convert callers of free_gigantic_page() to use folios, function is then renamed to free_gigantic_folio(). Link: https://lkml.kernel.org/r/20221129225039.82257-9-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert enqueue_huge_page() to foliosSidhartha Kumar1-11/+11
Convert callers of enqueue_huge_page() to pass in a folio, function is renamed to enqueue_hugetlb_folio(). Link: https://lkml.kernel.org/r/20221129225039.82257-8-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert add_hugetlb_page() to folios and add hugetlb_cma_folio()Sidhartha Kumar1-21/+21
Convert add_hugetlb_page() to take in a folio, also convert hugetlb_cma_page() to take in a folio. Link: https://lkml.kernel.org/r/20221129225039.82257-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert update_and_free_page() to foliosSidhartha Kumar1-14/+16
Make more progress on converting the free_huge_page() destructor to operate on folios by converting update_and_free_page() to folios. Link: https://lkml.kernel.org/r/20221129225039.82257-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>\ Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert remove_hugetlb_page() to foliosSidhartha Kumar1-23/+25
Removes page_folio() call by converting callers to directly pass a folio into __remove_hugetlb_page(). Link: https://lkml.kernel.org/r/20221129225039.82257-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert dissolve_free_huge_page() to foliosSidhartha Kumar1-10/+10
Removes compound_head() call by using a folio rather than a head page. Link: https://lkml.kernel.org/r/20221129225039.82257-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/hugetlb: convert destroy_compound_gigantic_page() to foliosSidhartha Kumar1-22/+21
Convert page operations within __destroy_compound_gigantic_page() to the corresponding folio operations. Link: https://lkml.kernel.org/r/20221129225039.82257-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm: add folio dtor and order setter functionsSidhartha Kumar2-3/+24
Patch series "convert core hugetlb functions to folios", v5. ============== OVERVIEW =========================== Now that many hugetlb helper functions that deal with hugetlb specific flags[1] and hugetlb cgroups[2] are converted to folios, higher level allocation, prep, and freeing functions within hugetlb can also be converted to operate in folios. Patch 1 of this series implements the wrapper functions around setting the compound destructor and compound order for a folio. Besides the user added in patch 1, patch 2 and patch 9 also use these helper functions. Patches 2-10 convert the higher level hugetlb functions to folios. ============== TESTING =========================== LTP: Ran 10 back to back rounds of the LTP hugetlb test suite. Gigantic Huge Pages: Test allocation and freeing via hugeadm commands: hugeadm --pool-pages-min 1GB:10 hugeadm --pool-pages-min 1GB:0 Demote: Demote 1 1GB hugepages to 512 2MB hugepages echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # 512 cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # 0 [1] https://lore.kernel.org/lkml/20220922154207.1575343-1-sidhartha.kumar@oracle.com/ [2] https://lore.kernel.org/linux-mm/20221101223059.460937-1-sidhartha.kumar@oracle.com/ This patch (of 10): Add folio equivalents for set_compound_order() and set_compound_page_dtor(). Also remove extra new-lines introduced by mm/hugetlb: convert move_hugetlb_state() to folios and mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios. [sidhartha.kumar@oracle.com: clarify folio_set_compound_order() zero support] Link: https://lkml.kernel.org/r/20221207223731.32784-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12folio-compat: remove lru_cache_add()Vishal Moola (Oracle)4-9/+5
There are no longer any callers of lru_cache_add(), so remove it. This saves 79 bytes of kernel text. Also cleanup some comments such that they reference the new folio_add_lru() instead. Link: https://lkml.kernel.org/r/20221101175326.13265-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12khugepage: replace lru_cache_add() with folio_add_lru()Vishal Moola (Oracle)1-4/+7
Replaces some calls with their folio equivalents. This is in preparation for the removal of lru_cache_add(). This replaces 3 calls to compound_head() with 1. Link: https://lkml.kernel.org/r/20221101175326.13265-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12userfaultfd: replace lru_cache functions with folio_add functionsVishal Moola (Oracle)1-2/+4
Replaces lru_cache_add() and lru_cache_add_inactive_or_unevictable() with folio_add_lru() and folio_add_lru_vma(). This is in preparation for the removal of lru_cache_add(). Link: https://lkml.kernel.org/r/20221101175326.13265-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12fuse: convert fuse_try_move_page() to use foliosVishal Moola (Oracle)1-27/+28
Converts the function to try to move folios instead of pages. Also converts fuse_check_page() to fuse_get_folio() since this is its only caller. This change removes 15 calls to compound_head(). Link: https://lkml.kernel.org/r/20221101175326.13265-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12filemap: convert replace_page_cache_page() to replace_page_cache_folio()Vishal Moola (Oracle)3-29/+27
Patch series "Removing the lru_cache_add() wrapper". This patchset replaces all calls of lru_cache_add() with the folio equivalent: folio_add_lru(). This is allows us to get rid of the wrapper The series passes xfstests and the userfaultfd selftests. This patch (of 5): Eliminates 7 calls to compound_head(). Link: https://lkml.kernel.org/r/20221101175326.13265-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20221101175326.13265-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12LoongArch: enable ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAPFeiyang Chen1-0/+1
The feature of minimizing overhead of struct page associated with each HugeTLB page is implemented on x86_64. However, the infrastructure of this feature is already there, so just select ARCH_WANT_HUGETLB_PAGE_ OPTIMIZE_VMEMMAP is enough to enable this feature for LoongArch. Link: https://lkml.kernel.org/r/20221027125253.3458989-5-chenhuacai@loongson.cn Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Min Zhou <zhoumin@loongson.cn> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()Feiyang Chen5-143/+132
Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can share its implementation. Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Acked-by: Will Deacon <will@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Min Zhou <zhoumin@loongson.cn> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12LoongArch: add sparse memory vmemmap supportFeiyang Chen6-4/+96
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn operations. This is the most efficient option when sufficient kernel resources are available. Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn Signed-off-by: Min Zhou <zhoumin@loongson.cn> Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12MIPS&LoongArch&NIOS2: adjust prototypes of p?d_init()Feiyang Chen11-57/+46
Patch series "mm/sparse-vmemmap: Generalise helpers and enable for LoongArch", v14. This series is in order to enable sparse-vmemmap for LoongArch. But LoongArch cannot use generic helpers directly because MIPS&LoongArch need to call pgd_init()/pud_init()/pmd_init() when populating page tables. So we adjust the prototypes of p?d_init() to make generic helpers can call them, then enable sparse-vmemmap with generic helpers, and to be further, generalise vmemmap_populate_hugepages() for ARM64, X86 and LoongArch. This patch (of 4): We are preparing to add sparse vmemmap support to LoongArch. MIPS and LoongArch need to call pgd_init()/pud_init()/pmd_init() when populating page tables, so adjust their prototypes to make generic helpers can call them. NIOS2 declares pmd_init() but doesn't use, just remove it to avoid build errors. Link: https://lkml.kernel.org/r/20221027125253.3458989-1-chenhuacai@loongson.cn Link: https://lkml.kernel.org/r/20221027125253.3458989-2-chenhuacai@loongson.cn Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Min Zhou <zhoumin@loongson.cn> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12kmsan: allow using __msan_instrument_asm_store() inside runtimeAlexander Potapenko1-3/+5
In certain cases (e.g. when handling a softirq) __msan_instrument_asm_store(&var, sizeof(var)) may be called with from within KMSAN runtime, but later the value of @var is used with !kmsan_in_runtime(), leading to false positives. Because kmsan_internal_unpoison_memory() doesn't take locks, it should be fine to call it without kmsan_in_runtime() checks, which fixes the mentioned false positives. Link: https://lkml.kernel.org/r/20221128094541.2645890-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Elver <elver@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12lockdep: allow instrumenting lockdep.c with KMSANAlexander Potapenko1-1/+0
Lockdep and KMSAN used to play badly together, causing deadlocks when KMSAN instrumentation of lockdep.c called lockdep functions recursively. Looks like this is no more the case, and a kernel can run (yet slower) with both KMSAN and lockdep enabled. This patch should fix false positives on wq_head->lock->dep_map, which KMSAN used to consider uninitialized because of lockdep.c not being instrumented. Link: https://lore.kernel.org/lkml/Y3b9AAEKp2Vr3e6O@sol.localdomain/ Link: https://lkml.kernel.org/r/20221128094541.2645890-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Eric Biggers <ebiggers@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Elver <elver@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12include/linux/pgtable.h: : remove redundant pte variablezhang songyi1-3/+1
Return value from ptep_get_and_clear_full() directly instead of taking this in another redundant variable. Link: https://lkml.kernel.org/r/202211282107437343474@zte.com.cn Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/fadvise: use LLONG_MAX instead of -1 for eofBrian Foster1-1/+1
generic_fadvise() sets endbyte = -1 to specify end of file (i.e. if length == 0 is passed from userspace). Most other callers to filemap_fdatawrite_range() use LLONG_MAX for this purpose, particularly if they also call fdatawait_range() (which requires end >= start). For example, sync_file_range(), vfs_fsync() (where the range is passed down through per-fs ->fsync() callbacks), filemap_flush(), etc. generic_fadvise() does not currently wait on writeback, but fix the call up to be consistent with other callers. Link: https://lkml.kernel.org/r/20221128155632.3950447-3-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12filemap: skip write and wait if end offset precedes startBrian Foster1-3/+6
Patch series "filemap: skip write and wait if end offset precedes start", v2. A fix for the odd write and wait behavior described in the patch 1 commit log. Technically patch 1 could simply remove the check rather than lift it into the callers, but this seemed a bit more user friendly to me. Patch 2 is appended after observation that fadvise() interacted poorly with the v1 patch. This is no longer a problem with v2, making patch 2 purely a cleanup. This series survived both fstests and ltp regression runs without observable problems. I had (end < start) warning checks in each relevant function, with fadvise() being the only caller that triggered them. That said, I dropped the warnings after testing because there seemed to much potential for noise from the various other callers. This patch (of 2): A call to file[map]_write_and_wait_range() with an end offset that precedes the start offset but happens to land in the same page can trigger writeback submission but fails to wait on the submitted page. Writeback submission occurs because __filemap_fdatawrite_range() passes both offsets down into write_cache_pages(), which rounds down to page indexes before it starts processing writeback. However, __filemap_fdatawait_range() immediately returns if the byte-granular end offset precedes the start offset. This behavior was observed in the form of unpredictable latency from a frequent write and wait call with incorrect parameters. The behavior gave the impression that the fdatawait path might occasionally fail to wait on writeback, but further investigation showed the latency was from write_cache_pages() waiting on writeback state to clear for a page already under writeback. Therefore, this indicated that fdatawait actually never waits on writeback in this particular situation. The byte granular check in __filemap_fdatawait_range() goes all the way back to the old wait_on_page_writeback() helper. It originally used page offsets and so would have waited in this problematic case. That changed to byte granularity file offsets in commit 94004ed726f3 ("kill wait_on_page_writeback_range"), which subtly changed this behavior. The check itself has become somewhat redundant since the error checking code that used to follow the wait loop (at the time of the aforementioned commit) has now been removed and lifted into the higher level callers. Therefore, we can restore historical fdatawait behavior by simply removing the check. Since the current fdatawait behavior has been in place for quite some time and is consistent with other interfaces that use file offsets, instead lift the check into the file[map]_write_and_wait_range() helpers to provide consistent behavior between the write and wait. Link: https://lkml.kernel.org/r/20221128155632.3950447-1-bfoster@redhat.com Link: https://lkml.kernel.org/r/20221128155632.3950447-2-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12zsmalloc: implement writeback mechanism for zsmallocNhat Pham1-11/+183
This commit adds the writeback mechanism for zsmalloc, analogous to the zbud allocator. Zsmalloc will attempt to determine the coldest zspage (i.e least recently used) in the pool, and attempt to write back all the stored compressed objects via the pool's evict handler. Link: https://lkml.kernel.org/r/20221128191616.1261026-7-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12zsmalloc: add zpool_ops field to zs_pool to store evict handlersNhat Pham1-1/+10
This adds a new field to zs_pool to store evict handlers for writeback, analogous to the zbud allocator. Link: https://lkml.kernel.org/r/20221128191616.1261026-6-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU orderNhat Pham1-0/+50
This helps determines the coldest zspages as candidates for writeback. Link: https://lkml.kernel.org/r/20221128191616.1261026-5-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12zsmalloc: consolidate zs_pool's migrate_lock and size_class's locksNhat Pham1-50/+37
Currently, zsmalloc has a hierarchy of locks, which includes a pool-level migrate_lock, and a lock for each size class. We have to obtain both locks in the hotpath in most cases anyway, except for zs_malloc. This exception will no longer exist when we introduce a LRU into the zs_pool for the new writeback functionality - we will need to obtain a pool-level lock to synchronize LRU handling even in zs_malloc. In preparation for zsmalloc writeback, consolidate these locks into a single pool-level lock, which drastically reduces the complexity of synchronization in zsmalloc. We have also benchmarked the lock consolidation to see the performance effect of this change on zram. First, we ran a synthetic FS workload on a server machine with 36 cores (same machine for all runs), using fs_mark -d ../zram1mnt -s 100000 -n 2500 -t 32 -k before and after for btrfs and ext4 on zram (FS usage is 80%). Here is the result (unit is file/second): With lock consolidation (btrfs): Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028 Without lock consolidation (btrfs): Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665 With lock consolidation (ext4): Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668 Without lock consolidation (ext4) Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469 As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression for ext4. This is a small, barely measurable difference in my opinion. For a more realistic scenario, we also tries building the kernel on zram. Here is the time it takes (in seconds): With lock consolidation (btrfs): real Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159 user Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656 sys Average: 521.4, Median: 522.0, Stddev: 1.51657508881031 Without lock consolidation (btrfs): real Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756 user Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023 sys Average: 520.6, Median: 521.0, Stddev: 1.140175425099138 With lock consolidation (ext4): real Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951 user Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307 sys Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317 Without lock consolidation (ext4) real Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159 user Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523 sys Average: 520.4, Median: 520.0, Stddev: 1.140175425099138 The difference is entirely within the noise of a typical run on zram. This hardly justifies the complexity of maintaining both the pool lock and the class lock. In fact, for writeback, we would need to introduce yet another lock to prevent data races on the pool's LRU, further complicating the lock handling logic. IMHO, it is just better to collapse all of these into a single pool-level lock. Link: https://lkml.kernel.org/r/20221128191616.1261026-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12zpool: clean out dead codeJohannes Weiner3-66/+12
There is a lot of provision for flexibility that isn't actually needed or used. Zswap (the only zpool user) always passes zpool_ops with an .evict method set. The backends who reclaim only do so for zswap, so they can also directly call zpool_ops without indirection or checks. Finally, there is no need to check the retries parameters and bail with -EINVAL in the reclaim function, when that's called just a few lines below with a hard-coded 8. There is no need to duplicate the evictable and sleep_mapped attrs from the driver in zpool_ops. Link: https://lkml.kernel.org/r/20221128191616.1261026-3-nphamcs@gmail.com Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12zswap: fix writeback lock ordering for zsmallocJohannes Weiner1-16/+19
Patch series "Implement writeback for zsmalloc", v7. Unlike other zswap allocators such as zbud or z3fold, zsmalloc currently lacks the writeback mechanism. This means that when the zswap pool is full, it will simply reject further allocations, and the pages will be written directly to swap. This series of patches implements writeback for zsmalloc. When the zswap pool becomes full, zsmalloc will attempt to evict all the compressed objects in the least-recently used zspages. This patch (of 6): zswap's customary lock order is tree->lock before pool->lock, because the tree->lock protects the entries' refcount, and the free callbacks in the backends acquire their respective pool locks to dispatch the backing object. zsmalloc's map callback takes the pool lock, so zswap must not grab the tree->lock while a handle is mapped. This currently only happens during writeback, which isn't implemented for zsmalloc. In preparation for it, move the tree->lock section out of the mapped entry section Link: https://lkml.kernel.org/r/20221128191616.1261026-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20221128191616.1261026-2-nphamcs@gmail.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/madvise: fix madvise_pageout for private file mappingsPavankumar Kondeti1-18/+35
When MADV_PAGEOUT is called on a private file mapping VMA region, we bail out early if the process is neither owner nor write capable of the file. However, this VMA may have both private/shared clean pages and private dirty pages. The opportunity of paging out the private dirty pages (Anon pages) is missed. Fix this behavior by allowing private file mappings pageout further and perform the file access check along with PageAnon() during page walk. We observe ~10% improvement in zram usage, thus leaving more available memory on a 4GB RAM system running Android. [quic_pkondeti@quicinc.com: v2] Link: https://lkml.kernel.org/r/1669962597-27724-1-git-send-email-quic_pkondeti@quicinc.com Link: https://lkml.kernel.org/r/1667971116-12900-1-git-send-email-quic_pkondeti@quicinc.com Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/khugepaged: add tracepoint to collapse_file()Gautam Menghani2-3/+42
"mm_khugepaged_collapse_file" for capturing is_shmem. Currently, is_shmem is not being captured. Capturing is_shmem is useful as it can indicate if tmpfs is being used as a backing store instead of persistent storage. Add the tracepoint in collapse_file() named "mm_khugepaged_collapse_file" for capturing is_shmem. [gautammenghani201@gmail.com: swap is_shmem and addr to save space, per Steven Rostedt] Link: https://lkml.kernel.org/r/20221202201807.182829-1-gautammenghani201@gmail.com Link: https://lkml.kernel.org/r/20221026052218.148234-1-gautammenghani201@gmail.com Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> [tracing] Cc: David Hildenbrand <david@redhat.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/gup: remove FOLL_MIGRATIONDavid Hildenbrand2-51/+5
Fortunately, the last user (KSM) is gone, so let's just remove this rather special code from generic GUP handling -- especially because KSM never required the PMD handling as KSM only deals with individual base pages. [akpm@linux-foundation.org: fix merge snafu]Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/ksm: convert break_ksm() to use walk_page_range_vma()David Hildenbrand1-10/+39
FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually, there is not even the need to wait for the migration to finish, we only want to know if we're dealing with a KSM page. Using follow_page() just to identify a KSM page overcomplicates GUP code. Let's use walk_page_range_vma() instead, because we don't actually care about the page itself, we only need to know a single property -- no need to even grab a reference. So, get rid of follow_page() usage such that we can get rid of FOLL_MIGRATION now and eventually be able to get rid of follow_page() in the future. In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s). I don't think we particularly care for now. Interestingly, the benchmark reduction is due to the single callback. Adding a second callback (e.g., pud_entry()) reduces the benchmark by another 100-200 MiB/s. Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/pagewalk: add walk_page_range_vma()David Hildenbrand2-0/+23
Let's add walk_page_range_vma(), which is similar to walk_page_vma(), however, is only interested in a subset of the VMA range. To be used in KSM code to stop using follow_page() next. Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHAREDavid Hildenbrand1-7/+5
Let's stop breaking COW via a fake write fault and let's use FAULT_FLAG_UNSHARE instead. This avoids any wrong side effects of the fake write fault, such as mapping the PTE writable and marking the pte dirty/softdirty. Consequently, we will no longer trigger a fake write fault and break COW without any such side-effects. Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault() will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with 0 instead of actually breaking COW. For now, the KSM unmerge tests can trigger that: $ sudo ./ksm_functional_tests TAP version 13 1..3 # [RUN] test_unmerge ok 1 Pages were unmerged # [RUN] test_unmerge_discarded ok 2 Pages were unmerged # [RUN] test_unmerge_uffd_wp not ok 3 Pages were unmerged Bail out! 1 out of 3 tests failed # Planned tests != run tests (2 != 3) # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0 The warning in dmesg also indicates this wrong handling: [ 230.096368] FAULT_FLAG_ALLOW_RETRY missing 881 [ 230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...] [ 230.110124] Hardware name: [...] [ 230.117775] Call Trace: [ 230.120227] <TASK> [ 230.122334] dump_stack_lvl+0x44/0x5c [ 230.126010] handle_userfault.cold+0x14/0x19 [ 230.130281] ? tlb_finish_mmu+0x65/0x170 [ 230.134207] ? uffd_wp_range+0x65/0xa0 [ 230.137959] ? _raw_spin_unlock+0x15/0x30 [ 230.141972] ? do_wp_page+0x50/0x590 [ 230.145551] __handle_mm_fault+0x9f5/0xf50 [ 230.149652] ? mmput+0x1f/0x40 [ 230.152712] handle_mm_fault+0xb9/0x2a0 [ 230.156550] break_ksm+0x141/0x180 [ 230.159964] unmerge_ksm_pages+0x60/0x90 [ 230.163890] ksm_madvise+0x3c/0xb0 [ 230.167295] do_madvise.part.0+0x10c/0xeb0 [ 230.171396] ? do_syscall_64+0x67/0x80 [ 230.175157] __x64_sys_madvise+0x5a/0x70 [ 230.179082] do_syscall_64+0x58/0x80 [ 230.182661] ? do_syscall_64+0x67/0x80 [ 230.186413] entry_SYSCALL_64_after_hwframe+0x63/0xcd This is primarily a fix for KSM+userfaultfd-wp, however, the fake write fault was always questionable. As this fix is not easy to backport and it's not very critical, let's not cc stable. Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com Fixes: 529b930b87d9 ("userfaultfd: wp: hook userfault handler to write protection fault") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm: remove VM_FAULT_WRITEDavid Hildenbrand3-9/+5
All users -- GUP and KSM -- are gone, let's just remove it. Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITEDavid Hildenbrand1-12/+13
Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole remaining user of VM_FAULT_WRITE. As we also want to stop triggering a fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to GUP-triggered unsharing when taking a R/O pin on a shared anonymous page (including KSM pages), let's stop relying on VM_FAULT_WRITE. Let's rework break_ksm() to not rely on the return value of handle_mm_fault() anymore to figure out whether COW-breaking was successful. Simply perform another follow_page() lookup to verify the result. While this makes break_ksm() slightly less efficient, we can simplify handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE. In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010 MiB/s). I don't think that we particularly care about that performance drop when unmerging. If it ever turns out to be an actual performance issue, we can think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep it simple for now. Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12selftests/vm: add test to measure MADV_UNMERGEABLE performanceDavid Hildenbrand1-2/+74
Let's add a test to measure performance of KSM breaking not triggered via COW, but triggered by disabling KSM on an area filled with KSM pages via MADV_UNMERGEABLE. Link: https://lkml.kernel.org/r/20221021101141.84170-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12mm/pagewalk: don't trigger test_walk() in walk_page_vma()David Hildenbrand2-7/+2
As Peter points out, the caller passes a single VMA and can just do that check itself. And in fact, no existing users rely on test_walk() getting called. So let's just remove it and make the implementation slightly more efficient. Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12selftests/vm: add KSM unmerge testsDavid Hildenbrand5-0/+294
Patch series "mm/ksm: break_ksm() cleanups and fixes", v2. This series cleans up and fixes break_ksm(). In summary, we no longer use fake write faults to break COW but instead FAULT_FLAG_UNSHARE. Further, we move away from using follow_page() --- that we can hopefully remove completely at one point --- and use new walk_page_range_vma() instead. Fortunately, we can get rid of VM_FAULT_WRITE and FOLL_MIGRATION in common code now. Extend the existing ksm tests by an unmerge benchmark, and a some new unmerge tests. Also, add a selftest to measure MADV_UNMERGEABLE performance. In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a performance degradation of ~6% -- 7% (old: ~5250 MiB/s, new: ~4900 MiB/s). I don't think we particularly care for now, but it's good to be aware of the implication. This patch (of 9): Let's add three unmerge tests (MADV_UNMERGEABLE unmerging all pages in the range). test_unmerge(): basic unmerge tests test_unmerge_discarded(): have some pte_none() entries in the range test_unmerge_uffd_wp(): protect the merged pages using uffd-wp ksm_tests.c currently contains a mixture of benchmarks and tests, whereby each test is carried out by executing the ksm_tests binary with specific parameters. Let's add new ksm_functional_tests.c that performs multiple, smaller functional tests all at once. Link: https://lkml.kernel.org/r/20221021101141.84170-1-david@redhat.com Link: https://lkml.kernel.org/r/20221021101141.84170-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-12selftests/vm: enable running select groups of testsJoel Savitz1-63/+144
Our memory management kernel CI testing at Red Hat uses the VM selftests and we have run into two problems: First, our LTP tests overlap with the VM selftests. We want to avoid unhelpful redundancy in our testing practices. Second, we have observed the current run_vmtests.sh to report overall failure/ambiguous results in the case that a machine lacks the necessary hardware to perform one or more of the tests. E.g. ksm tests that require more than one numa node. We want to be able to run the vm selftests suitable to particular hardware. Add the ability to run one or more groups of vm tests via run_vmtests.sh instead of simply all-or-none in order to solve these problems. Preserve existing default behavior of running all tests when the script is invoked with no arguments. Documentation of test groups is included in the patch as follows: # ./run_vmtests.sh [ -h || --help ] usage: ./tools/testing/selftests/vm/run_vmtests.sh [ -h | -t "<categories>"] -t: specify specific categories to tests to run -h: display this message The default behavior is to run all tests. Alternatively, specific groups tests can be run by passing a string to the -t argument containing one or more of the following categories separated by spaces: - mmap tests for mmap(2) - gup_test tests for gup using gup_test interface - userfaultfd tests for userfaultfd(2) - compaction a test for the patch "Allow compaction of unevictable pages" - mlock tests for mlock(2) - mremap tests for mremap(2) - hugevm tests for very large virtual address space - vmalloc vmalloc smoke tests - hmm hmm smoke tests - madv_populate test memadvise(2) MADV_POPULATE_{READ,WRITE} options - memfd_secret test memfd_secret(2) - process_mrelease test process_mrelease(2) - ksm ksm tests that do not require >=2 NUMA nodes - ksm_numa ksm tests that require >=2 NUMA nodes - pkey memory protection key tests - soft_dirty test soft dirty page bit semantics - anon_cow test anonymous copy-on-write semantics example: ./run_vmtests.sh -t "hmm mmap ksm" Link: https://lkml.kernel.org/r/20221018231222.1884715-1-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Nico Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-10Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton10-22/+43
2022-12-10memcg: fix possible use-after-free in memcg_write_event_control()Tejun Heo3-3/+14
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-10MAINTAINERS: update Muchun Song's emailMuchun Song2-2/+4
I'm moving to the @linux.dev account. Map my old addresses and update it to my new address. Link: https://lkml.kernel.org/r/20221208115548.85244-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>