From 2b4f3b4987b56365b981f44a7e843efa5b6619b9 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 5 Jul 2023 18:13:59 -0700 Subject: fork: lock VMAs of the parent process when forking MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Avoid memory corruption caused by per-VMA locks", v4. A memory corruption was reported in [1] with bisection pointing to the patch [2] enabling per-VMA locks for x86. Based on the reproducer provided in [1] we suspect this is caused by the lack of VMA locking while forking a child process. Patch 1/2 in the series implements proper VMA locking during fork. I tested the fix locally using the reproducer and was unable to reproduce the memory corruption problem. This fix can potentially regress some fork-heavy workloads. Kernel build time did not show noticeable regression on a 56-core machine while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7% regression. If such fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic. Patch 2/2 disables per-VMA locks until the fix is tested and verified. This patch (of 2): When forking a child process, parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte(). Parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write-fault before that TLB flush in the parent, and we end up replacing that anonymous page in the parent process in do_wp_page() (because, COW-shared with the child), this might lead to some stale writable TLB entries targeting the wrong (old) page. Similar issue happened in the past with userfaultfd (see flush_tlb_page() call inside do_wp_page()). Lock VMAs of the parent process when forking a child, which prevents concurrent page faults during fork operation and avoids this issue. 
This fix can potentially regress some fork-heavy workloads. Kernel build time did not show noticeable regression on a 56-core machine while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7% regression. If such fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic. Link: https://lkml.kernel.org/r/20230706011400.2949242-1-surenb@google.com Link: https://lkml.kernel.org/r/20230706011400.2949242-2-surenb@google.com Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") Signed-off-by: Suren Baghdasaryan Suggested-by: David Hildenbrand Reported-by: Jiri Slaby Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/ Reported-by: Holger Hoffstätte Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/ Reported-by: Jacob Young Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 Reviewed-by: Liam R. 
Howlett Acked-by: David Hildenbrand Tested-by: Holger Hoffstätte Cc: Signed-off-by: Andrew Morton --- kernel/fork.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/fork.c b/kernel/fork.c index b85814e614a5..2ba918f83bde 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -658,6 +658,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, retval = -EINTR; goto fail_uprobe_end; } +#ifdef CONFIG_PER_VMA_LOCK + /* Disallow any page faults before calling flush_cache_dup_mm */ + for_each_vma(old_vmi, mpnt) + vma_start_write(mpnt); + vma_iter_set(&old_vmi, 0); +#endif flush_cache_dup_mm(oldmm); uprobe_dup_mmap(oldmm, mm); /* -- cgit v1.2.3 From f96c48670319d685d18d50819ed0c1ef751ed2ac Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 5 Jul 2023 18:14:00 -0700 Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A memory corruption was reported in [1] with bisection pointing to the patch [2] enabling per-VMA locks for x86. Disable per-VMA locks config to prevent this issue until the fix is confirmed. This is expected to be a temporary measure. 
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624 [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com Link: https://lkml.kernel.org/r/20230706011400.2949242-3-surenb@google.com Reported-by: Jiri Slaby Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/ Reported-by: Jacob Young Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") Signed-off-by: Suren Baghdasaryan Cc: David Hildenbrand Cc: Holger Hoffstätte Cc: Signed-off-by: Andrew Morton --- mm/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index 09130434e30d..0abc6c71dd89 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1224,8 +1224,9 @@ config ARCH_SUPPORTS_PER_VMA_LOCK def_bool n config PER_VMA_LOCK - def_bool y + bool "Enable per-vma locking during page fault handling." depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP + depends on BROKEN help Allow per-vma locking during page fault handling. -- cgit v1.2.3 From a57b4b7f0557be4fa40d57e2c5e71f17e4510248 Mon Sep 17 00:00:00 2001 From: Anthony Iliopoulos Date: Wed, 28 Jun 2023 03:34:36 +0200 Subject: MAINTAINERS: update ocfs2-devel mailing list address The ocfs2-devel mailing list has been migrated to the kernel.org infrastructure, update the related entry to reflect the change. 
Link: https://lkml.kernel.org/r/20230628013437.47030-2-ailiop@suse.com Signed-off-by: Anthony Iliopoulos Acked-by: Joseph Qi Acked-by: Joel Becker Cc: Mark Fasheh Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Signed-off-by: Andrew Morton --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 41385f01fa98..7d3050e57cb3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15922,7 +15922,7 @@ ORACLE CLUSTER FILESYSTEM 2 (OCFS2) M: Mark Fasheh M: Joel Becker M: Joseph Qi -L: ocfs2-devel@oss.oracle.com (moderated for non-subscribers) +L: ocfs2-devel@lists.linux.dev S: Supported W: http://ocfs2.wiki.kernel.org F: Documentation/filesystems/dlmfs.rst -- cgit v1.2.3 From 5a569db68c6a961cf75993b16bdcc2fed087df9d Mon Sep 17 00:00:00 2001 From: Anthony Iliopoulos Date: Wed, 28 Jun 2023 03:34:37 +0200 Subject: docs: update ocfs2-devel mailing list address The ocfs2-devel mailing list has been migrated to the kernel.org infrastructure, update all related documentation pointers to reflect the change. 
Link: https://lkml.kernel.org/r/20230628013437.47030-3-ailiop@suse.com Signed-off-by: Anthony Iliopoulos Acked-by: Joseph Qi Acked-by: Joel Becker Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Cc: Junxiao Bi Cc: Mark Fasheh Signed-off-by: Andrew Morton --- Documentation/ABI/obsolete/o2cb | 4 ++-- Documentation/ABI/removed/o2cb | 4 ++-- Documentation/ABI/stable/o2cb | 4 ++-- Documentation/ABI/testing/sysfs-ocfs2 | 12 ++++++------ Documentation/filesystems/dlmfs.rst | 2 +- Documentation/filesystems/ocfs2.rst | 2 +- fs/ocfs2/Kconfig | 6 +++--- 7 files changed, 17 insertions(+), 17 deletions(-) diff --git a/Documentation/ABI/obsolete/o2cb b/Documentation/ABI/obsolete/o2cb index fe7e45e17bc7..8f39b596731d 100644 --- a/Documentation/ABI/obsolete/o2cb +++ b/Documentation/ABI/obsolete/o2cb @@ -1,11 +1,11 @@ What: /sys/o2cb Date: Dec 2005 KernelVersion: 2.6.16 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: Ocfs2-tools looks at 'interface-revision' for versioning information. Each logmask/ file controls a set of debug prints and can be written into with the strings "allow", "deny", or "off". Reading the file returns the current state. Was renamed to /sys/fs/u2cb/ Users: ocfs2-tools. It's sufficient to mail proposed changes to - ocfs2-devel@oss.oracle.com. + ocfs2-devel@lists.linux.dev. diff --git a/Documentation/ABI/removed/o2cb b/Documentation/ABI/removed/o2cb index 20c91adca6d4..61cff238fbe8 100644 --- a/Documentation/ABI/removed/o2cb +++ b/Documentation/ABI/removed/o2cb @@ -1,10 +1,10 @@ What: /sys/o2cb symlink Date: May 2011 KernelVersion: 3.0 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: This is a symlink: /sys/o2cb to /sys/fs/o2cb. The symlink is removed when new versions of ocfs2-tools which know to look in /sys/fs/o2cb are sufficiently prevalent. Don't code new software to look here, it should try /sys/fs/o2cb instead. Users: ocfs2-tools. 
It's sufficient to mail proposed changes to - ocfs2-devel@oss.oracle.com. + ocfs2-devel@lists.linux.dev. diff --git a/Documentation/ABI/stable/o2cb b/Documentation/ABI/stable/o2cb index b62a967f01a0..3a83b5c54e93 100644 --- a/Documentation/ABI/stable/o2cb +++ b/Documentation/ABI/stable/o2cb @@ -1,10 +1,10 @@ What: /sys/fs/o2cb/ Date: Dec 2005 KernelVersion: 2.6.16 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: Ocfs2-tools looks at 'interface-revision' for versioning information. Each logmask/ file controls a set of debug prints and can be written into with the strings "allow", "deny", or "off". Reading the file returns the current state. Users: ocfs2-tools. It's sufficient to mail proposed changes to - ocfs2-devel@oss.oracle.com. + ocfs2-devel@lists.linux.dev. diff --git a/Documentation/ABI/testing/sysfs-ocfs2 b/Documentation/ABI/testing/sysfs-ocfs2 index b7cc516a8a8a..494d7c1ac710 100644 --- a/Documentation/ABI/testing/sysfs-ocfs2 +++ b/Documentation/ABI/testing/sysfs-ocfs2 @@ -1,13 +1,13 @@ What: /sys/fs/ocfs2/ Date: April 2008 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: The /sys/fs/ocfs2 directory contains knobs used by the ocfs2-tools to interact with the filesystem. What: /sys/fs/ocfs2/max_locking_protocol Date: April 2008 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: The /sys/fs/ocfs2/max_locking_protocol file displays version of ocfs2 locking supported by the filesystem. This version @@ -28,7 +28,7 @@ Description: What: /sys/fs/ocfs2/loaded_cluster_plugins Date: April 2008 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: The /sys/fs/ocfs2/loaded_cluster_plugins file describes the available plugins to support ocfs2 cluster operation. 
@@ -48,7 +48,7 @@ Description: What: /sys/fs/ocfs2/active_cluster_plugin Date: April 2008 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: The /sys/fs/ocfs2/active_cluster_plugin displays which cluster plugin is currently in use by the filesystem. @@ -65,7 +65,7 @@ Description: What: /sys/fs/ocfs2/cluster_stack Date: April 2008 -Contact: ocfs2-devel@oss.oracle.com +Contact: ocfs2-devel@lists.linux.dev Description: The /sys/fs/ocfs2/cluster_stack file contains the name of current ocfs2 cluster stack. This value is set by @@ -86,4 +86,4 @@ Description: stack return an error. Users: - ocfs2-tools + ocfs2-tools diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst index 28dd41a63be2..7e2b1fd471d7 100644 --- a/Documentation/filesystems/dlmfs.rst +++ b/Documentation/filesystems/dlmfs.rst @@ -12,7 +12,7 @@ dlmfs is built with OCFS2 as it requires most of its infrastructure. :Project web page: http://ocfs2.wiki.kernel.org :Tools web page: https://github.com/markfasheh/ocfs2-tools -:OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/ +:OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html All code copyright 2005 Oracle except when otherwise noted. diff --git a/Documentation/filesystems/ocfs2.rst b/Documentation/filesystems/ocfs2.rst index 42ca9a3d4c6e..5827062995cb 100644 --- a/Documentation/filesystems/ocfs2.rst +++ b/Documentation/filesystems/ocfs2.rst @@ -14,7 +14,7 @@ get "mount.ocfs2" and "ocfs2_hb_ctl". Project web page: http://ocfs2.wiki.kernel.org Tools git tree: https://github.com/markfasheh/ocfs2-tools -OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/ +OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html All code copyright 2005 Oracle except when otherwise noted. 
diff --git a/fs/ocfs2/Kconfig b/fs/ocfs2/Kconfig index 304d12186ccd..3123da7cfb30 100644 --- a/fs/ocfs2/Kconfig +++ b/fs/ocfs2/Kconfig @@ -17,9 +17,9 @@ config OCFS2_FS You'll want to install the ocfs2-tools package in order to at least get "mount.ocfs2". - Project web page: https://oss.oracle.com/projects/ocfs2 - Tools web page: https://oss.oracle.com/projects/ocfs2-tools - OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/ + Project web page: https://ocfs2.wiki.kernel.org/ + Tools web page: https://github.com/markfasheh/ocfs2-tools + OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html For more information on OCFS2, see the file . -- cgit v1.2.3 From 191fcdb6c9cf8b738b1628cbcf3af63d545c825c Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Fri, 30 Jun 2023 18:04:42 -0700 Subject: mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison The following crash happens for me when running the -mm selftests (below). Specifically, it happens while running the uffd-stress subtests: kernel BUG at mm/hugetlb.c:7249! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 3238 Comm: uffd-stress Not tainted 6.4.0-hubbard-github+ #109 Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1503 08/03/2018 RIP: 0010:huge_pte_alloc+0x12c/0x1a0 ... Call Trace: ? __die_body+0x63/0xb0 ? die+0x9f/0xc0 ? do_trap+0xab/0x180 ? huge_pte_alloc+0x12c/0x1a0 ? do_error_trap+0xc6/0x110 ? huge_pte_alloc+0x12c/0x1a0 ? handle_invalid_op+0x2c/0x40 ? huge_pte_alloc+0x12c/0x1a0 ? exc_invalid_op+0x33/0x50 ? asm_exc_invalid_op+0x16/0x20 ? __pfx_put_prev_task_idle+0x10/0x10 ? huge_pte_alloc+0x12c/0x1a0 hugetlb_fault+0x1a3/0x1120 ? finish_task_switch+0xb3/0x2a0 ? lock_is_held_type+0xdb/0x150 handle_mm_fault+0xb8a/0xd40 ? 
find_vma+0x5d/0xa0 do_user_addr_fault+0x257/0x5d0 exc_page_fault+0x7b/0x1f0 asm_exc_page_fault+0x22/0x30 That happens because a BUG() statement in huge_pte_alloc() attempts to check that a pte, if present, is a hugetlb pte, but it does so in a non-lockless-safe manner that leads to a false BUG() report. We got here due to a couple of bugs, each of which by itself was not quite enough to cause a problem: First of all, before commit c33c794828f2 ("mm: ptep_get() conversion"), the BUG() statement in huge_pte_alloc() was itself fragile: it relied upon compiler behavior to only read the pte once, despite using it twice in the same conditional. Next, commit c33c794828f2 ("mm: ptep_get() conversion") broke that delicate situation, by causing all direct pte reads to be done via READ_ONCE(). And so READ_ONCE() got called twice within the same BUG() conditional, leading to comparing (potentially, occasionally) different versions of the pte, and thus to false BUG() reports. Fix this by taking a single snapshot of the pte before using it in the BUG conditional. Now, that commit is only partially to blame here, but people doing bisections will invariably land there, so this will help them find a fix for a real crash. And also, the previous behavior was unlikely to ever expose this bug--it was fragile, yet not actually broken. So that's why I chose this commit for the Fixes tag, rather than the commit that created the original BUG() statement. 
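The single-snapshot idea described above can be illustrated outside the kernel. The following is only a userspace sketch under stated assumptions: DEMO_PRESENT, DEMO_HUGE, demo_read_once() and demo_entry_is_bogus() are hypothetical names invented for this illustration, and the bit layout is not the kernel's pte layout. The point it shows is that the shared entry is read once into a local, and both tests are then applied to that same value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; NOT the kernel's pte layout. */
#define DEMO_PRESENT 0x1u
#define DEMO_HUGE    0x2u

/* Read the shared entry exactly once through a volatile access. */
static uint32_t demo_read_once(const volatile uint32_t *entry)
{
	return *entry;
}

/*
 * Return true when the entry is present but not huge.
 * Both tests use the same local snapshot, so a concurrent update
 * cannot make the two halves of the condition disagree.
 */
static bool demo_entry_is_bogus(const volatile uint32_t *entry)
{
	uint32_t val = demo_read_once(entry);

	return (val & DEMO_PRESENT) && !(val & DEMO_HUGE);
}
```

Calling demo_read_once() twice inside the condition would be the analogue of the double ptep_get() that the patch removes.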
Link: https://lkml.kernel.org/r/20230701010442.2041858-1-jhubbard@nvidia.com Fixes: c33c794828f2 ("mm: ptep_get() conversion") Signed-off-by: John Hubbard Acked-by: James Houghton Acked-by: Muchun Song Reviewed-by: Ryan Roberts Acked-by: Mike Kravetz Cc: Adrian Hunter Cc: Al Viro Cc: Alex Williamson Cc: Alexander Potapenko Cc: Alexander Shishkin Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Christian Brauner Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Dave Airlie Cc: Dimitri Sivanich Cc: Dmitry Vyukov Cc: Ian Rogers Cc: Jason Gunthorpe Cc: Jiri Olsa Cc: Johannes Weiner Cc: Kirill A. Shutemov Cc: Lorenzo Stoakes Cc: Mark Rutland Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Michal Hocko Cc: Mike Rapoport (IBM) Cc: Namhyung Kim Cc: Naoya Horiguchi Cc: Oleksandr Tyshchenko Cc: Pavel Tatashin Cc: Roman Gushchin Cc: SeongJae Park Cc: Shakeel Butt Cc: Uladzislau Rezki (Sony) Cc: Vincenzo Frascino Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/hugetlb.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bce28cca73a1..64a3239b6407 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7246,7 +7246,12 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, pte = (pte_t *)pmd_alloc(mm, pud, addr); } } - BUG_ON(pte && pte_present(ptep_get(pte)) && !pte_huge(ptep_get(pte))); + + if (pte) { + pte_t pteval = ptep_get_lockless(pte); + + BUG_ON(pte_present(pteval) && !pte_huge(pteval)); + } return pte; } -- cgit v1.2.3 From 08bab74ae653b57bb2bfcec7d499bfe7ff0efe4f Mon Sep 17 00:00:00 2001 From: Vincent Whitchurch Date: Thu, 29 Jun 2023 16:17:57 +0200 Subject: squashfs: fix cache race with migration Migration replaces the page in the mapping before copying the contents and the flags over from the old page, so check that the page in the page cache is really up to date before using it. Without this, stressing squashfs reads with parallel compaction sometimes results in squashfs reporting data corruption. 
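As a rough userspace model of the check described above (look the page up, and hand it back only if it is really up to date, otherwise drop the reference and fall back to a fresh allocation), something like the following could be used. struct demo_page and demo_get_cache_page() are hypothetical stand-ins for this sketch; they are not squashfs or page cache APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cached page: a reference count and an uptodate flag. */
struct demo_page {
	int refcount;
	bool uptodate;
};

/*
 * Return the cached page with an extra reference only if it is up to date.
 * Otherwise drop the reference again and return NULL so that the caller
 * falls back to allocating and reading a fresh page.
 */
static struct demo_page *demo_get_cache_page(struct demo_page *slot)
{
	if (!slot)
		return NULL;

	slot->refcount++;		/* models taking a reference on lookup */
	if (!slot->uptodate) {
		slot->refcount--;	/* models dropping that reference */
		return NULL;
	}
	return slot;
}
```

The important property is that a page which is present but not yet up to date is treated exactly like a cache miss.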
Link: https://lkml.kernel.org/r/20230629-squashfs-cache-migration-v1-1-d50ebe55099d@axis.com Fixes: e994f5b677ee ("squashfs: cache partial compressed blocks") Signed-off-by: Vincent Whitchurch Cc: Christoph Hellwig Cc: Phillip Lougher Signed-off-by: Andrew Morton --- fs/squashfs/block.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c index 6aa9c2e1e8eb..581ce9519339 100644 --- a/fs/squashfs/block.c +++ b/fs/squashfs/block.c @@ -166,6 +166,26 @@ static int squashfs_bio_read_cached(struct bio *fullbio, return 0; } +static struct page *squashfs_get_cache_page(struct address_space *mapping, + pgoff_t index) +{ + struct page *page; + + if (!mapping) + return NULL; + + page = find_get_page(mapping, index); + if (!page) + return NULL; + + if (!PageUptodate(page)) { + put_page(page); + return NULL; + } + + return page; +} + static int squashfs_bio_read(struct super_block *sb, u64 index, int length, struct bio **biop, int *block_offset) { @@ -190,11 +210,10 @@ static int squashfs_bio_read(struct super_block *sb, u64 index, int length, for (i = 0; i < page_count; ++i) { unsigned int len = min_t(unsigned int, PAGE_SIZE - offset, total_len); - struct page *page = NULL; + pgoff_t index = (read_start >> PAGE_SHIFT) + i; + struct page *page; - if (cache_mapping) - page = find_get_page(cache_mapping, - (read_start >> PAGE_SHIFT) + i); + page = squashfs_get_cache_page(cache_mapping, index); if (!page) page = alloc_page(GFP_NOIO); -- cgit v1.2.3 From 6dca4ac6fc91fd41ea4d6c4511838d37f4e0eab2 Mon Sep 17 00:00:00 2001 From: Peter Collingbourne Date: Mon, 22 May 2023 17:43:08 -0700 Subject: mm: call arch_swap_restore() from do_swap_page() Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved the call to swap_free() before the call to set_pte_at(), which meant that the MTE tags could end up being freed before set_pte_at() had a chance to restore them. 
Fix it by adding a call to the arch_swap_restore() hook before the call to swap_free(). Link: https://lkml.kernel.org/r/20230523004312.1807357-2-pcc@google.com Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965 Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") Signed-off-by: Peter Collingbourne Reported-by: Qun-wei Lin Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/ Acked-by: David Hildenbrand Acked-by: "Huang, Ying" Reviewed-by: Steven Price Acked-by: Catalin Marinas Cc: [6.1+] Signed-off-by: Andrew Morton --- mm/memory.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index 0ae594703021..01f39e8144ef 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3950,6 +3950,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } } + /* + * Some architectures may have to restore extra metadata to the page + * when reading from swap. This metadata may be indexed by swap entry + * so this must be called before swap_free(). + */ + arch_swap_restore(entry, folio); + /* * Remove the swap entry and conditionally try to free up the swapcache. * We're already holding a reference on the page but haven't mapped it -- cgit v1.2.3 From 8344a3d44be3d18671e18c4ba23bb03dd21e14ad Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 28 Jun 2023 19:55:48 +0100 Subject: writeback: account the number of pages written back nr_to_write is a count of pages, so we need to decrease it by the number of pages in the folio we just wrote, not by 1. Most callers specify either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes() might end up writing 512x as many pages as it asked for. Dave added: : XFS is the only filesystem this would affect, right? AFAIA, nothing : else enables large folios and uses writeback through : write_cache_pages() at this point... : : In which case, I'd be surprised if much difference, if any, gets : noticed by anyone. 
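The 512x figure above can be checked with a small model of the accounting. This is a sketch only: demo_pages_written() is a hypothetical helper written for this illustration and does not mirror write_cache_pages() beyond the budget arithmetic. With a budget of 4 pages and folios of 512 pages each, decrementing the budget by one per folio writes 2048 pages, while decrementing by the folio size stops after the first folio.

```c
#include <stddef.h>

/*
 * Hypothetical model: folio_pages[i] is the number of pages in folio i.
 * Returns the total number of pages written before the budget runs out.
 * A non-zero per_folio reproduces the old "--nr_to_write" accounting;
 * zero charges the budget by the number of pages in the folio.
 */
static long demo_pages_written(const long *folio_pages, size_t n,
			       long nr_to_write, int per_folio)
{
	long written = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		long nr = folio_pages[i];

		written += nr;
		nr_to_write -= per_folio ? 1 : nr;
		if (nr_to_write <= 0)
			break;
	}
	return written;
}
```

For four 512-page folios and nr_to_write = 4, the old accounting yields 2048 pages written (512 times the request) and the new accounting yields 512.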
Link: https://lkml.kernel.org/r/20230628185548.981888-1-willy@infradead.org Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Christoph Hellwig Cc: Jan Kara Cc: Dave Chinner Signed-off-by: Andrew Morton --- mm/page-writeback.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 1d17fb1ec863..d3f42009bb70 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping, for (i = 0; i < nr_folios; i++) { struct folio *folio = fbatch.folios[i]; + unsigned long nr; done_index = folio->index; @@ -2471,6 +2472,7 @@ continue_unlock: trace_wbc_writepage(wbc, inode_to_bdi(mapping->host)); error = writepage(folio, wbc, data); + nr = folio_nr_pages(folio); if (unlikely(error)) { /* * Handle errors according to the type of @@ -2489,8 +2491,7 @@ continue_unlock: error = 0; } else if (wbc->sync_mode != WB_SYNC_ALL) { ret = error; - done_index = folio->index + - folio_nr_pages(folio); + done_index = folio->index + nr; done = 1; break; } @@ -2504,7 +2505,8 @@ continue_unlock: * keep going until we have written all the pages * we tagged for writeback prior to entering this loop. */ - if (--wbc->nr_to_write <= 0 && + wbc->nr_to_write -= nr; + if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE) { done = 1; break; -- cgit v1.2.3 From 6dedd768f380a6977234891fc3c7e0df656f1908 Mon Sep 17 00:00:00 2001 From: Markus Schneider-Pargmann Date: Wed, 28 Jun 2023 10:13:41 +0200 Subject: mailmap: add Markus Schneider-Pargmann Add my old mail address and update my name. 
Link: https://lkml.kernel.org/r/20230628081341.3470229-1-msp@baylibre.com Signed-off-by: Markus Schneider-Pargmann Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 4a9d87472ba8..113126433121 100644 --- a/.mailmap +++ b/.mailmap @@ -301,6 +301,7 @@ Marek Behún Marek Behún Marek Behun Mark Brown Mark Starovoytov +Markus Schneider-Pargmann Mark Yao Martin Kepplinger Martin Kepplinger -- cgit v1.2.3 From 0d707cdefb3b7f52d23967e1473d24d591329e13 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Mon, 3 Jul 2023 22:44:10 -0700 Subject: MAINTAINERS: add linux-next info Add linux-next info to MAINTAINERS for ease of finding this data. Link: https://lkml.kernel.org/r/20230704054410.12527-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap Acked-by: Stephen Rothwell Signed-off-by: Andrew Morton --- MAINTAINERS | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 7d3050e57cb3..ba5c92f3458b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12105,6 +12105,13 @@ F: Documentation/litmus-tests/ F: Documentation/memory-barriers.txt F: tools/memory-model/ +LINUX-NEXT TREE +M: Stephen Rothwell +L: linux-next@vger.kernel.org +S: Supported +B: mailto:linux-next@vger.kernel.org and the appropriate development tree +T: git git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/ + LIS3LV02D ACCELEROMETER DRIVER M: Eric Piel S: Maintained -- cgit v1.2.3 From 028725e73375a1ff080bbdf9fb503306d0116f28 Mon Sep 17 00:00:00 2001 From: Liu Shixin Date: Tue, 4 Jul 2023 18:19:42 +0800 Subject: bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page Commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem") fixed an "overlaps existing" problem in kmemleak. But the problem still exists when HAVE_BOOTMEM_INFO_NODE is disabled, because in that case free_bootmem_page() calls free_reserved_page() directly. 
Fix the problem by adding kmemleak_free_part() in free_bootmem_page() when HAVE_BOOTMEM_INFO_NODE is disabled. Link: https://lkml.kernel.org/r/20230704101942.2819426-1-liushixin2@huawei.com Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page") Signed-off-by: Liu Shixin Acked-by: Muchun Song Cc: Matthew Wilcox Cc: Mike Kravetz Cc: Oscar Salvador Cc: Signed-off-by: Andrew Morton --- include/linux/bootmem_info.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h index cc35d010fa94..e1a3c9c9754c 100644 --- a/include/linux/bootmem_info.h +++ b/include/linux/bootmem_info.h @@ -3,6 +3,7 @@ #define __LINUX_BOOTMEM_INFO_H #include +#include /* * Types for free bootmem stored in page->lru.next. These have to be in @@ -59,6 +60,7 @@ static inline void get_page_bootmem(unsigned long info, struct page *page, static inline void free_bootmem_page(struct page *page) { + kmemleak_free_part(page_to_virt(page), PAGE_SIZE); free_reserved_page(page); } #endif -- cgit v1.2.3 From ddcd91f4cb42fcc833b0a5e00d4e9f034da95249 Mon Sep 17 00:00:00 2001 From: Heiko Stuebner Date: Tue, 4 Jul 2023 18:39:18 +0200 Subject: mailmap: update manpage link Patch series "Update .mailmap for my work address and fix manpage". While updating mailmap for the going-away address, I also found that on current systems the manpage linked from the header comment changed. And in fact it looks like the git mailmap feature got its own manpage. This patch (of 2): On recent systems the git-shortlog manpage only tells people to See gitmailmap(5) So instead of sending people on a scavenger hunt, put that info into the header directly. Though keep the old reference around for older systems. 
Link: https://lkml.kernel.org/r/20230704163919.1136784-1-heiko@sntech.de Link: https://lkml.kernel.org/r/20230704163919.1136784-2-heiko@sntech.de Signed-off-by: Heiko Stuebner Signed-off-by: Andrew Morton --- .mailmap | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index 113126433121..e30928b8d23e 100644 --- a/.mailmap +++ b/.mailmap @@ -5,7 +5,8 @@ # same person appearing not to be so or badly displayed. Also allows for # old email addresses to map to new email addresses. # -# For format details, see "MAPPING AUTHORS" in "man git-shortlog". +# For format details, see "man gitmailmap" or "MAPPING AUTHORS" in +# "man git-shortlog" on older systems. # # Please keep this list dictionary sorted. # -- cgit v1.2.3 From d3a808ec787e8cbfee053405f95105b3be3c7743 Mon Sep 17 00:00:00 2001 From: Heiko Stuebner Date: Tue, 4 Jul 2023 18:39:19 +0200 Subject: mailmap: add entries for Heiko Stuebner I am going to lose my vrull.eu address at the end of July, and while adding it to mailmap I also realised that there are more old addresses from me dangling, so update .mailmap for all of them. 
Link: https://lkml.kernel.org/r/20230704163919.1136784-3-heiko@sntech.de Signed-off-by: Heiko Stuebner Signed-off-by: Heiko Stuebner Signed-off-by: Andrew Morton --- .mailmap | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.mailmap b/.mailmap index e30928b8d23e..a3b4d7ac25b5 100644 --- a/.mailmap +++ b/.mailmap @@ -178,6 +178,9 @@ Gustavo Padovan Hanjun Guo Heiko Carstens Heiko Carstens +Heiko Stuebner +Heiko Stuebner +Heiko Stuebner Henk Vergonet Henrik Kretzschmar Henrik Rydberg -- cgit v1.2.3 From 05c56e7b4319d7f6352f27da876a1acdc8fa5cc4 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Tue, 4 Jul 2023 02:52:05 +0200 Subject: kasan: fix type cast in memory_is_poisoned_n Commit bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13 builtins") introduced a bug into the memory_is_poisoned_n implementation: it effectively removed the cast to a signed integer type after applying KASAN_GRANULE_MASK. As a result, KASAN started failing to properly check memset, memcpy, and other similar functions. Fix the bug by adding the cast back (through an additional signed integer variable to make the code more readable). 
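Why the missing signed cast matters can be shown with a small userspace sketch. Everything here is an assumption made for illustration: DEMO_GRANULE_MASK models an 8-byte granule with an unsigned long mask, and the two helper functions are hypothetical, not KASAN code. When the masked offset stays unsigned long, a negative shadow byte (a poison value) is converted to a very large unsigned number, so the comparison never reports poisoning. Storing the offset in a signed 8-bit variable first keeps the comparison signed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 8-byte granules and an unsigned long mask. */
#define DEMO_GRANULE_MASK 7UL

/*
 * Buggy shape: the masked offset is unsigned long, so the shadow byte is
 * converted to unsigned long before comparing. The cast is spelled out
 * here; C's usual arithmetic conversions would apply it implicitly.
 */
static bool demo_poisoned_unsigned_cmp(unsigned long last_byte, int8_t shadow)
{
	return (last_byte & DEMO_GRANULE_MASK) >= (unsigned long)shadow;
}

/*
 * Fixed shape: narrow the offset to a signed 8-bit value first, so both
 * operands are promoted to int and negative shadow values compare as
 * negative.
 */
static bool demo_poisoned_signed_cmp(unsigned long last_byte, int8_t shadow)
{
	int8_t last_accessible_byte = (int8_t)(last_byte & DEMO_GRANULE_MASK);

	return last_accessible_byte >= shadow;
}
```

With an offset of 5 inside the granule and a shadow byte of -1, the unsigned comparison returns false while the signed comparison returns true. For non-negative shadow values the two agree.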
Link: https://lkml.kernel.org/r/8c9e0251c2b8b81016255709d4ec42942dcaf018.1688431866.git.andreyknvl@google.com Fixes: bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13 builtins") Signed-off-by: Andrey Konovalov Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Arnd Bergmann Cc: Dmitry Vyukov Cc: Marco Elver Cc: Signed-off-by: Andrew Morton --- mm/kasan/generic.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c index 5b4c97baa656..4d837ab83f08 100644 --- a/mm/kasan/generic.c +++ b/mm/kasan/generic.c @@ -130,9 +130,10 @@ static __always_inline bool memory_is_poisoned_n(const void *addr, size_t size) if (unlikely(ret)) { const void *last_byte = addr + size - 1; s8 *last_shadow = (s8 *)kasan_mem_to_shadow(last_byte); + s8 last_accessible_byte = (unsigned long)last_byte & KASAN_GRANULE_MASK; if (unlikely(ret != (unsigned long)last_shadow || - (((long)last_byte & KASAN_GRANULE_MASK) >= *last_shadow))) + last_accessible_byte >= *last_shadow)) return true; } return false; -- cgit v1.2.3 From fdb54d96600aafe45951f549866cd6fc1af59954 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Wed, 5 Jul 2023 14:44:02 +0200 Subject: kasan, slub: fix HW_TAGS zeroing with slub_debug Commit 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested") added precise kmalloc redzone poisoning to the slub_debug functionality. However, this commit didn't account for HW_TAGS KASAN fully initializing the object via its built-in memory initialization feature. Even though HW_TAGS KASAN memory initialization contains special memory initialization handling for when slub_debug is enabled, it does not account for in-object slub_debug redzones. As a result, HW_TAGS KASAN can overwrite these redzones and cause false-positive slub_debug reports. To fix the issue, avoid HW_TAGS KASAN memory initialization when slub_debug is enabled altogether. 
Implement this by moving the __slub_debug_enabled check to slab_post_alloc_hook. Common slab code seems like a more appropriate place for a slub_debug check anyway. Link: https://lkml.kernel.org/r/678ac92ab790dba9198f9ca14f405651b97c8502.1688561016.git.andreyknvl@google.com Fixes: 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested") Signed-off-by: Andrey Konovalov Reported-by: Will Deacon Acked-by: Marco Elver Cc: Mark Rutland Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Catalin Marinas Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Feng Tang Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim Cc: kasan-dev@googlegroups.com Cc: Pekka Enberg Cc: Peter Collingbourne Cc: Roman Gushchin Cc: Vincenzo Frascino Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- mm/kasan/kasan.h | 12 ------------ mm/slab.h | 16 ++++++++++++++-- 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index b799f11e45dc..2e973b36fe07 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -466,18 +466,6 @@ static inline void kasan_unpoison(const void *addr, size_t size, bool init) if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK)) return; - /* - * Explicitly initialize the memory with the precise object size to - * avoid overwriting the slab redzone. This disables initialization in - * the arch code and may thus lead to performance penalty. This penalty - * does not affect production builds, as slab redzones are not enabled - * there. 
-	 */
-	if (__slub_debug_enabled() &&
-	    init && ((unsigned long)size & KASAN_GRANULE_MASK)) {
-		init = false;
-		memzero_explicit((void *)addr, size);
-	}
 	size = round_up(size, KASAN_GRANULE_SIZE);
 
 	hw_set_mem_tag_range((void *)addr, size, tag, init);
diff --git a/mm/slab.h b/mm/slab.h
index 6a5633b25eb5..9c0e09d0f81f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -723,6 +723,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					unsigned int orig_size)
 {
 	unsigned int zero_size = s->object_size;
+	bool kasan_init = init;
 	size_t i;
 
 	flags &= gfp_allowed_mask;
@@ -739,6 +740,17 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	    (s->flags & SLAB_KMALLOC))
 		zero_size = orig_size;
 
+	/*
+	 * When slub_debug is enabled, avoid memory initialization integrated
+	 * into KASAN and instead zero out the memory via the memset below with
+	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
+	 * cause false-positive reports. This does not lead to a performance
+	 * penalty on production builds, as slub_debug is not intended to be
+	 * enabled there.
+	 */
+	if (__slub_debug_enabled())
+		kasan_init = false;
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * kasan_slab_alloc and initialization memset must be
@@ -747,8 +759,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
 	 */
 	for (i = 0; i < size; i++) {
-		p[i] = kasan_slab_alloc(s, p[i], flags, init);
-		if (p[i] && init && !kasan_has_integrated_init())
+		p[i] = kasan_slab_alloc(s, p[i], flags, kasan_init);
+		if (p[i] && init && (!kasan_init || !kasan_has_integrated_init()))
 			memset(p[i], 0, zero_size);
 		kmemleak_alloc_recursive(p[i], s->object_size, 1,
 					 s->flags, flags);
--
cgit v1.2.3

From 8ba388c06bc8056935ec1814b2689bfb42f3b89a Mon Sep 17 00:00:00 2001
From: Geert Uytterhoeven
Date: Wed, 5 Jul 2023 16:54:04 +0200
Subject: lib: dhry: fix sleeping allocations inside non-preemptable section

The Smatch static checker reports the following warnings:

    lib/dhry_run.c:38 dhry_benchmark() warn: sleeping in atomic context
    lib/dhry_run.c:43 dhry_benchmark() warn: sleeping in atomic context

Indeed, dhry() does sleeping allocations inside the non-preemptable
section delimited by get_cpu()/put_cpu().

Fix this by using atomic allocations instead. Add error handling, as
these atomic allocations may fail.

Link: https://lkml.kernel.org/r/bac6d517818a7cd8efe217c1ad649fffab9cc371.1688568764.git.geert+renesas@glider.be
Fixes: 13684e966d46283e ("lib: dhry: fix unstable smp_processor_id(_) usage")
Reported-by: Dan Carpenter
Closes: https://lore.kernel.org/r/0469eb3a-02eb-4b41-b189-de20b931fa56@moroto.mountain
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
---
 lib/dhry_1.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/lib/dhry_1.c b/lib/dhry_1.c
index 83247106824c..08edbbb19f57 100644
--- a/lib/dhry_1.c
+++ b/lib/dhry_1.c
@@ -139,8 +139,15 @@ int dhry(int n)
 
 	/* Initializations */
 
-	Next_Ptr_Glob = (Rec_Pointer)kzalloc(sizeof(Rec_Type), GFP_KERNEL);
-	Ptr_Glob = (Rec_Pointer)kzalloc(sizeof(Rec_Type), GFP_KERNEL);
+	Next_Ptr_Glob = (Rec_Pointer)kzalloc(sizeof(Rec_Type), GFP_ATOMIC);
+	if (!Next_Ptr_Glob)
+		return -ENOMEM;
+
+	Ptr_Glob = (Rec_Pointer)kzalloc(sizeof(Rec_Type), GFP_ATOMIC);
+	if (!Ptr_Glob) {
+		kfree(Next_Ptr_Glob);
+		return -ENOMEM;
+	}
 
 	Ptr_Glob->Ptr_Comp = Next_Ptr_Glob;
 	Ptr_Glob->Discr = Ident_1;
--
cgit v1.2.3
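
The error handling added to dhry() in the patch above follows a common two-allocation pattern: check each allocation, and free the first one if the second fails. A minimal userspace sketch of that pattern is given here, assuming calloc()/free() as stand-ins for kzalloc(..., GFP_ATOMIC)/kfree(). The names struct record and records_init() are hypothetical and do not appear in lib/dhry_1.c.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for Rec_Type; not the layout used in lib/dhry_1.c. */
struct record {
	struct record *next;
	int discr;
};

/*
 * Allocate two zeroed records and link the first to the second.
 * calloc() stands in for kzalloc(..., GFP_ATOMIC): either call may fail,
 * so each result is checked. On failure nothing stays allocated and
 * -ENOMEM is returned; on success 0 is returned.
 */
static int records_init(struct record **first, struct record **second)
{
	struct record *next_rec, *rec;

	next_rec = calloc(1, sizeof(*next_rec));
	if (!next_rec)
		return -ENOMEM;

	rec = calloc(1, sizeof(*rec));
	if (!rec) {
		free(next_rec);	/* undo the first allocation */
		return -ENOMEM;
	}

	rec->next = next_rec;
	*first = rec;
	*second = next_rec;
	return 0;
}
```

A caller would treat a negative return value as failure and would not touch the output pointers in that case, much as a caller of dhry() has to handle -ENOMEM after this change.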