Age | Commit message (Collapse) | Author | Files | Lines |
|
Currently the implementation will split the PUD when a fallback is taken
inside the create_huge_pud function. This isn't where it should be done:
the splitting should be done in wp_huge_pud, just like it's done for PMDs.
Reason being that if a callback is taken during create, there is no PUD
yet so nothing to split, whereas if a fallback is taken when encountering
a write protection fault there is something to split.
It looks like this was the original intention with the commit where the
splitting was introduced, but somehow it got moved to the wrong place
between v1 and v2 of the patch series. Rebase mistake perhaps.
Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
Fixes: 327e9fd48972 ("mm: Split huge pages on write-notify or COW")
Signed-off-by: James Gowans <jgowans@amazon.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The subpage we calculate is an invalid pointer for device private pages,
because device private pages are mapped via non-present device private
entries, not ordinary present PTEs.
Let's just not compute broken pointers and fixup later. Move the proper
assignment of the correct subpage to the beginning of the function and
assert that we really only have a single page in our folio.
This currently results in a BUG when tying to compute anon_exclusive,
because:
[ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
[ 528.739585] #PF: supervisor read access in kernel mode
[ 528.745324] #PF: error_code(0x0000) - not-present page
[ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
[ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
[ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
[ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
[ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
[ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
[ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
[ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
[ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
[ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
[ 528.874275] PKRU: 55555554
[ 528.877286] Call Trace:
[ 528.880016] <TASK>
[ 528.882356] ? lock_is_held_type+0xdf/0x130
[ 528.887033] rmap_walk_anon+0x167/0x410
[ 528.891316] try_to_migrate+0x90/0xd0
[ 528.895405] ? try_to_unmap_one+0xe10/0xe10
[ 528.900074] ? anon_vma_ctor+0x50/0x50
[ 528.904260] ? put_anon_vma+0x10/0x10
[ 528.908347] ? invalid_mkclean_vma+0x20/0x20
[ 528.913114] migrate_vma_setup+0x5f4/0x750
[ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm]
[ 528.923532] do_swap_page+0xac0/0xe50
[ 528.927623] ? __lock_acquire+0x4b2/0x1ac0
[ 528.932199] __handle_mm_fault+0x949/0x1440
[ 528.936876] handle_mm_fault+0x13f/0x3e0
[ 528.941256] do_user_addr_fault+0x215/0x740
[ 528.945928] exc_page_fault+0x75/0x280
[ 528.950115] asm_exc_page_fault+0x27/0x30
[ 528.954593] RIP: 0033:0x40366b
...
Link: https://lkml.kernel.org/r/20220623205332.319257-1-david@redhat.com
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alistair Popple <apopple@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Higher order allocations for vmemmap pages from buddy allocator must be
able to be treated as indepdenent small pages as they can be freed
individually by the caller. There is no problem for higher order vmemmap
pages allocated at boot time since each individual small page will be
initialized at boot time. However, it will be an issue for memory hotplug
case since those higher order vmemmap pages are allocated from buddy
allocator without initializing each individual small page's refcount. The
system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled
if the vmemmap page is freed.
Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com
Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The huge_ptep_set_access_flags() can not make the huge pte old according
to the discussion [1], that means we will always mornitor the young state
of the hugetlb though we stopped accessing the hugetlb, as a result DAMON
will get inaccurate accessing statistics.
So changing to use set_huge_pte_at() to make the huge pte old to fix this
issue.
[1] https://lore.kernel.org/all/Yqy97gXI4Nqb7dYo@arm.com/
Link: https://lkml.kernel.org/r/1655692482-28797-1-git-send-email-baolin.wang@linux.alibaba.com
Fixes: 49f4203aae06 ("mm/damon: add access checking for hugetlb pages")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When fallocate() is used on a shmem file, the pages we allocate can end up
with !PageUptodate.
Since UFFDIO_CONTINUE tries to find the existing page the user wants to
map with SGP_READ, we would fail to find such a page, since
shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
!PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
do if the page wasn't found in the page cache at all.
This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find
if a page exists, and doesn't care whether it still needs to be cleared or
not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same,
except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
will clear the page and then return it. With this change, UFFDIO_CONTINUE
works properly (succeeds) on a shmem file which has been fallocated, but
otherwise not modified.
Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com
Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A recent commit factored out a series of annoying (unsigned long) casts
into a single variable declaration, but made the pointer type a
`uintptr_t` rather than the usual `unsigned long`. This patch changes it
to be the integer type more typically used by the kernel to represent
addresses.
Fixes: 35fb9ae4aa2e ("usercopy: Cast pointer to an integer once")
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220616143617.449094-1-Jason@zx2c4.com
|
|
By passing ->read_folio to filemap_read_folio(), we can use
filemap_read_folio() in do_read_cache_folio().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
If the call to filler() returns AOP_TRUNCATED_PAGE, we need to
retry the page cache lookup.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
No functionality change intended; this simply moves code around to
disentangle the function a little.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
All callers of find_get_pages_range(), pagevec_lookup_range() and
pagevec_lookup() have now been removed.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
This is a straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Change the guts of check_move_unevictable_pages() over to use folios
and add check_move_unevictable_pages() as a wrapper.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
This is the equivalent of find_get_pages() but fills a folio_batch
instead of an array of pages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
These functions have no more users, so delete them.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
|
|
Remove the last caller of add_to_page_cache()
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
|
|
This wrapper is no longer used. Remove it and all references to it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
If the last folio in a file is split as a result of truncation,
we simply clear the dirty bits for the pages we're discarding.
That causes NR_FILE_DIRTY (among other counters) to be thrown off
and eventually Linux will hang in balance_dirty_pages_ratelimited()
Reported-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Fixes: d68eccad3706 ("mm/filemap: Allow large folios to be added to the page cache")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
Add special hook for architecture to verify addr, size or prot
when ioremap() or iounmap(), which will make the generic ioremap
more useful.
ioremap_allowed() return a bool,
- true means continue to remap
- false means skip remap and return directly
iounmap_allowed() return a bool,
- true means continue to vunmap
- false code means skip vunmap and return directly
Meanwhile, only vunmap the address when it is in vmalloc area
as the generic ioremap only returns vmalloc addresses.
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20220607125027.44946-5-wangkefeng.wang@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Show physical address of each ioremap in /proc/vmallocinfo.
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20220607125027.44946-4-wangkefeng.wang@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Use more meaningful and sensible naming phys_addr
instead addr in ioremap_prot().
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610092255.32445-1-wangkefeng.wang@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Minor things, mainly - mailmap updates, MAINTAINERS updates, etc.
Fixes for this merge window:
- fix for a damon boot hang, from SeongJae
- fix for a kfence warning splat, from Jason Donenfeld
- fix for zero-pfn pinning, from Alex Williamson
- fix for fallocate hole punch clearing, from Mike Kravetz
Fixes for previous releases:
- fix for a performance regression, from Marcelo
- fix for a hwpoisining BUG from zhenwei pi"
* tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add entry for Christian Marangi
mm/memory-failure: disable unpoison once hw error happens
hugetlbfs: zero partial pages during fallocate hole punch
mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py
mm: re-allow pinning of zero pfns
mm/kfence: select random number before taking raw lock
MAINTAINERS: add maillist information for LoongArch
MAINTAINERS: update MM tree references
MAINTAINERS: update Abel Vesa's email
MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer
MAINTAINERS: add Miaohe Lin as a memory-failure reviewer
mailmap: add alias for jarkko@profian.com
mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized
kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal
mm: lru_cache_disable: use synchronize_rcu_expedited
mm/page_isolation.c: fix one kernel-doc comment
|
|
Commit 793917d997df ("mm/readahead: Add large folio readahead")
introduced support for using large folios for filebacked pages if the
filesystem supports it.
page_cache_ra_order() was introduced to allocate and add these large
folios to the page cache. However adding pages to the page cache should
be serialized against truncation and hole punching by taking
invalidate_lock. Not doing so can lead to data races resulting in stale
data getting added to the page cache and marked up-to-date. See commit
730633f0b7f9 ("mm: Protect operations adding pages to page cache with
invalidate_lock") for more details.
This issue was found by inspection but a testcase revealed it was
possible to observe in practice on XFS. Fix this by taking
invalidate_lock in page_cache_ra_order(), to mirror what is done for the
non-thp case in page_cache_ra_unbounded().
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
In our efforts to remove uses of PG_private, we have found folios with
the private flag clear and folio->private not-NULL. That is the root
cause behind 642d51fb0775 ("ceph: check folio PG_private bit instead
of folio->private"). It can also affect a few other filesystems that
haven't yet reported a problem.
compaction_alloc() can return a page with uninitialised page->private,
and rather than checking all the callers of migrate_pages(), just zero
page->private after calling get_new_page(). Similarly, the tail pages
from split_huge_page() may also have an uninitialised page->private.
Reported-by: Xiubo Li <xiubli@redhat.com>
Tested-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
If a read races with an invalidation followed by another read, it is
possible for a folio to be replaced with a higher-order folio. If that
happens, we'll see a sibling entry for the new folio in the next iteration
of the loop. This manifests as a NULL pointer dereference while holding
the RCU read lock.
Handle this by simply returning. The next call will find the new folio
and handle it correctly. The other ways of handling this rare race are
more complex and it's just not worth it.
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Debugged-by: Brian Foster <bfoster@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Fixes: cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
We had an off-by-one error which meant that we never marked the first page
in a read as accessed. This was visible as a slowdown when re-reading
a file as pages were being evicted from cache too soon. In reviewing
this code, we noticed a second bug where a multi-page folio would be
marked as accessed multiple times when doing reads that were less than
the size of the folio.
Abstract the comparison of whether two file positions are in the same
folio into a new function, fixing both of these bugs.
Reported-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- A slub fix for PREEMPT_RT locking semantics from Sebastian.
- A slub fix for state corruption due to a possible race scenario from
Jann.
* tag 'slab-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: add missing TID updates on slab deactivation
mm/slub: Move the stackdepot related allocation out of IRQ-off section.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull writeback and ext2 fixes from Jan Kara:
"A fix for writeback bug which prevented machines with kdevtmpfs from
booting and also one small ext2 bugfix in IO error handling"
* tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
init: Initialize noop_backing_dev_info early
ext2: fix fs corruption when trying to remove a non-empty directory with IO error
|
|
Currently unpoison_memory(unsigned long pfn) is designed for soft
poison(hwpoison-inject) only. Since 17fae1294ad9d, the KPTE gets cleared
on a x86 platform once hardware memory corrupts.
Unpoisoning a hardware corrupted page puts page back buddy only, the
kernel has a chance to access the page with *NOT PRESENT* KPTE. This
leads BUG during accessing on the corrupted KPTE.
Suggested by David&Naoya, disable unpoison mechanism when a real HW error
happens to avoid BUG like this:
Unpoison: Software-unpoisoned page 0x61234
BUG: unable to handle page fault for address: ffff888061234000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0.bm.1-amd64 #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
RIP: 0010:clear_page_erms+0x7/0x10
Code: ...
RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
FS: 00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
prep_new_page+0x151/0x170
get_page_from_freelist+0xca0/0xe20
? sysvec_apic_timer_interrupt+0xab/0xc0
? asm_sysvec_apic_timer_interrupt+0x1b/0x20
__alloc_pages+0x17e/0x340
__folio_alloc+0x17/0x40
vma_alloc_folio+0x84/0x280
__handle_mm_fault+0x8d4/0xeb0
handle_mm_fault+0xd5/0x2a0
do_user_addr_fault+0x1d0/0x680
? kvm_read_and_reset_apf_flags+0x3b/0x50
exc_page_fault+0x78/0x170
asm_exc_page_fault+0x27/0x30
Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
Fixes: 847ce401df392 ("HWPOISON: Add unpoisoning support")
Fixes: 17fae1294ad9d ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is no slabinfo.py in tools/cgroup, but has memcg_slabinfo.py instead.
Link: https://lkml.kernel.org/r/20220610024451.744135-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The RNG uses vanilla spinlocks, not raw spinlocks, so kfence should pick
its random numbers before taking its raw spinlocks. This also has the
nice effect of doing less work inside the lock. It should fix a splat
that Geert saw with CONFIG_PROVE_RAW_LOCK_NESTING:
dump_backtrace.part.0+0x98/0xc0
show_stack+0x14/0x28
dump_stack_lvl+0xac/0xec
dump_stack+0x14/0x2c
__lock_acquire+0x388/0x10a0
lock_acquire+0x190/0x2c0
_raw_spin_lock_irqsave+0x6c/0x94
crng_make_state+0x148/0x1e4
_get_random_bytes.part.0+0x4c/0xe8
get_random_u32+0x4c/0x140
__kfence_alloc+0x460/0x5c4
kmem_cache_alloc_trace+0x194/0x1dc
__kthread_create_on_node+0x5c/0x1a8
kthread_create_on_node+0x58/0x7c
printk_start_kthread.part.0+0x34/0xa8
printk_activate_kthreads+0x4c/0x54
do_one_initcall+0xec/0x278
kernel_init_freeable+0x11c/0x214
kernel_init+0x24/0x124
ret_from_fork+0x10/0x20
Link: https://lkml.kernel.org/r/20220609123319.17576-1-Jason@zx2c4.com
Fixes: d4150779e60f ("random32: use real rng for non-deterministic randomness")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
initialized
Commit 059342d1dd4e ("mm/damon/reclaim: fix the timer always stays
active") made DAMON_RECLAIM's 'enabled' parameter store callback,
'enabled_store()', to schedule 'damon_reclaim_timer'. The scheduling uses
'system_wq', which is initialized in 'workqueue_init_early()'. As kernel
parameters parsing function ('parse_args()') is called before
'workqueue_init_early()', 'enabled_store()' can be executed before
'workqueue_init_early()' and end up accessing the uninitialized
'system_wq'. As a result, the booting hang[1]. This commit fixes the
issue by checking if the initialization is done before scheduling the
timer.
[1] https://lkml.kernel.org/20220604192222.1488-1-sj@kernel.org/
Link: https://lkml.kernel.org/r/20220604195051.1589-1-sj@kernel.org
Fixes: 059342d1dd4e ("mm/damon/reclaim: fix the timer always stays active")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Greg White <gwhite@kupulau.com>
Cc: Hailong Tu <tuhailong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit ff042f4a9b050 ("mm: lru_cache_disable: replace work queue
synchronization with synchronize_rcu") replaced lru_cache_disable's usage
of work queues with synchronize_rcu.
Some users reported large performance regressions due to this commit, for
example:
https://lore.kernel.org/all/20220521234616.GO1790663@paulmck-ThinkPad-P17-Gen-1/T/
Switching to synchronize_rcu_expedited fixes the problem.
Link: https://lkml.kernel.org/r/YpToHCmnx/HEcVyR@fuller.cnet
Fixes: ff042f4a9b050 ("mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu")
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Phil Elwell <phil@raspberrypi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove one warning found by running scripts/kernel-doc, which is caused by
using 'make W=1':
mm/page_isolation.c:304: warning: Function parameter or member
'skip_isolation' not described in 'isolate_single_pageblock'
Link: https://lkml.kernel.org/r/20220602062116.61199-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
noop_backing_dev_info is used by superblocks of various
pseudofilesystems such as kdevtmpfs. After commit 10e14073107d
("writeback: Fix inode->i_io_list not be protected by inode->i_lock
error") this broke because __mark_inode_dirty() started to access more
fields from noop_backing_dev_info and this led to crashes inside
locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty().
Fix the problem by initializing noop_backing_dev_info before the
filesystems get mounted.
Fixes: 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error")
Reported-and-tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reported-and-tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
If system have some mirrored memory and mirrored feature is not specified
in boot parameter, the basic mirrored feature will be enabled and this will
lead to the following situations:
- memblock memory allocation prefers mirrored region. This may have some
unexpected influence on numa affinity.
- contiguous memory will be split into several parts if parts of them
is mirrored memory via memblock_mark_mirror().
To fix this, variable mirrored_kernelcore will be checked in
memblock_mark_mirror(). Mark mirrored memory with flag MEMBLOCK_MIRROR iff
kernelcore=mirror is added in the kernel parameters.
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220614092156.1972846-6-mawupeng1@huawei.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
For a system only have limited mirrored memory or some numa node without
mirrored memory, the per node vmemmap page_structs prefer to allocate
memory from mirrored region, which will lead to vmemmap_verify() in
vmemmap_populate_basepages() report lots of warning message.
This patch change the frequency of "potential offnode page_structs" warning
messages to only once to avoid a very long print during bootup.
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20220614092156.1972846-4-mawupeng1@huawei.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
If system has mirrored memory, memblock will try to allocate mirrored
memory firstly and fallback to non-mirrored memory when fails, but if with
limited mirrored memory or some numa node without mirrored memory, lots of
warning message about memblock allocation will occur.
This patch ratelimit the warning message to avoid a very long print during
bootup.
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20220614092156.1972846-3-mawupeng1@huawei.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
Create a security context for the inodes created by memfd_secret(2) via
the LSM hook inode_init_security_anon to allow a fine grained control.
As secret memory areas can affect hibernation and have a global shared
limit access control might be desirable.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
If 'n' is so large that it's negative, we might wrap around and mistakenly
think that the copy is OK when it's not. Such a copy would probably
crash, but just doing the arithmetic in a more simple way lets us detect
and refuse this case.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220612213227.3881769-4-willy@infradead.org
|
|
Get rid of a lot of annoying casts by setting 'addr' once at the top
of the function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220612213227.3881769-3-willy@infradead.org
|
|
vmalloc does not allocate a vm_struct for vm_map_ram() areas. That causes
us to deny usercopies from those areas. This affects XFS which uses
vm_map_ram() for its directories.
Fix this by calling find_vmap_area() instead of find_vm_area().
Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220612213227.3881769-2-willy@infradead.org
|
|
The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
the TID stays the same. However, two places in __slab_alloc() currently
don't update the TID when deactivating the CPU slab.
If multiple operations race the right way, this could lead to an object
getting lost; or, in an even more unlikely situation, it could even lead to
an object being freed onto the wrong slab's freelist, messing up the
`inuse` counter and eventually causing a page to be freed to the page
allocator while it still contains slab objects.
(I haven't actually tested these cases though, this is just based on
looking at the code. Writing testcases for this stuff seems like it'd be
a pain...)
The race leading to state inconsistency is (all operations on the same CPU
and kmem_cache):
- task A: begin do_slab_free():
- read TID
- read pcpu freelist (==NULL)
- check `slab == c->slab` (true)
- [PREEMPT A->B]
- task B: begin slab_alloc_node():
- fastpath fails (`c->freelist` is NULL)
- enter __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- take local_lock_irqsave()
- read c->freelist as NULL
- get_freelist() returns NULL
- write `c->slab = NULL`
- drop local_unlock_irqrestore()
- goto new_slab
- slub_percpu_partial() is NULL
- get_partial() returns NULL
- slub_put_cpu_ptr() (enables preemption)
- [PREEMPT B->A]
- task A: finish do_slab_free():
- this_cpu_cmpxchg_double() succeeds()
- [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
From there, the object on c->freelist will get lost if task B is allowed to
continue from here: It will proceed to the retry_load_slab label,
set c->slab, then jump to load_freelist, which clobbers c->freelist.
But if we instead continue as follows, we get worse corruption:
- task A: run __slab_free() on object from other struct slab:
- CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
- task A: run slab_alloc_node() with NUMA node constraint:
- fastpath fails (c->slab is NULL)
- call __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- c->slab is NULL: goto new_slab
- slub_percpu_partial() is non-NULL
- set c->slab to slub_percpu_partial(c)
- [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
from slab-2]
- goto redo
- node_match() fails
- goto deactivate_slab
- existing c->freelist is passed into deactivate_slab()
- inuse count of slab-1 is decremented to account for object from
slab-2
At this point, the inuse count of slab-1 is 1 lower than it should be.
This means that if we free all allocated objects in slab-1 except for one,
SLUB will think that slab-1 is completely unused, and may free its page,
leading to use-after-free.
Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
Fixes: 03e404af26dc2 ("slub: fast release on full slab")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com
|
|
The set_track() invocation in free_debug_processing() is invoked with
acquired slab_lock(). The lock disables interrupts on PREEMPT_RT and
this forbids to allocate memory which is done in stack_depot_save().
Split set_track() into two parts: set_track_prepare() which allocate
memory and set_track_update() which only performs the assignment of the
trace data structure. Use set_track_prepare() before disabling
interrupts.
[ vbabka@suse.cz: make set_track() call set_track_update() instead of
open-coded assignments ]
Fixes: 5cf909c553e9e ("mm/slub: use stackdepot to save stack trace in objects")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/Yp9sqoUi4fVa5ExF@linutronix.de
|
|
If xas_split_alloc() fails to allocate the necessary nodes to complete the
xarray entry split, it sets the xa_state to -ENOMEM, which xas_nomem()
then interprets as "Please allocate more memory", not as "Please free
any unnecessary memory" (which was the intended outcome). It's confusing
to use xas_nomem() to free memory in this context, so call xas_destroy()
instead.
Reported-by: syzbot+9e27a75a8c24f3fe75c1@syzkaller.appspotmail.com
Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
After we have unlocked the mmap_lock for I/O, the file is pinned, but
the VMA is not. Checking this flag after that can be a use-after-free.
It's not a terribly interesting use-after-free as it can only read one
bit, and it's used to decide whether to read 2MB or 4MB. But it
upsets the automated tools and it's generally bad practice anyway,
so let's fix it.
Reported-by: syzbot+5b96d55e5b54924c77ad@syzkaller.appspotmail.com
Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
We must hold a reference over the call to filemap_release_folio(),
otherwise the page cache will put the last reference to the folio
before we unlock it, leading to splats like this:
BUG: Bad page state in process u8:5 pfn:1ab1f4
page:ffffea0006ac7d00 refcount:0 mapcount:0 mapping:0000000000000000 index:0x28b1de pfn:0x1ab1f4
flags: 0x17ff80000040001(locked|reclaim|node=0|zone=2|lastcpupid=0xfff)
raw: 017ff80000040001 dead000000000100 dead000000000122 0000000000000000
raw: 000000000028b1de 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
It's an error path, so it doesn't see much testing.
Reported-by: Darrick J. Wong <djwong@kernel.org>
Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm hotfixes from Andrew Morton:
"Fixups for various recently-added and longer-term issues and a few
minor tweaks:
- fixes for material merged during this merge window
- cc:stable fixes for more longstanding issues
- minor mailmap and MAINTAINERS updates"
* tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery
x86/kexec: fix memory leak of elf header buffer
mm/memremap: fix missing call to untrack_pfn() in pagemap_range()
mm: page_isolation: use compound_nr() correctly in isolate_single_pageblock()
mm: hugetlb_vmemmap: fix CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
MAINTAINERS: add maintainer information for z3fold
mailmap: update Josh Poimboeuf's email
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull delay-accounting update from Andrew Morton:
"A single featurette for delay accounting.
Delayed a bit because, unusually, it had dependencies on both the
mm-stable and mm-nonmm-stable queues"
* tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
delayacct: track delays from write-protect copy
|
|
Pull bitmap updates from Yury Norov:
- bitmap: optimize bitmap_weight() usage, from me
- lib/bitmap.c make bitmap_print_bitmask_to_buf parseable, from Mauro
Carvalho Chehab
- include/linux/find: Fix documentation, from Anna-Maria Behnsen
- bitmap: fix conversion from/to fix-sized arrays, from me
- bitmap: Fix return values to be unsigned, from Kees Cook
It has been in linux-next for at least a week with no problems.
* tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linux: (31 commits)
nodemask: Fix return values to be unsigned
bitmap: Fix return values to be unsigned
KVM: x86: hyper-v: replace bitmap_weight() with hweight64()
KVM: x86: hyper-v: fix type of valid_bank_mask
ia64: cleanup remove_siblinginfo()
drm/amd/pm: use bitmap_{from,to}_arr32 where appropriate
KVM: s390: replace bitmap_copy with bitmap_{from,to}_arr64 where appropriate
lib/bitmap: add test for bitmap_{from,to}_arr64
lib: add bitmap_{from,to}_arr64
lib/bitmap: extend comment for bitmap_(from,to)_arr32()
include/linux/find: Fix documentation
lib/bitmap.c make bitmap_print_bitmask_to_buf parseable
MAINTAINERS: add cpumask and nodemask files to BITMAP_API
arch/x86: replace nodes_weight with nodes_empty where appropriate
mm/vmstat: replace cpumask_weight with cpumask_empty where appropriate
clocksource: replace cpumask_weight with cpumask_empty in clocksource.c
genirq/affinity: replace cpumask_weight with cpumask_empty where appropriate
irq: mips: replace cpumask_weight with cpumask_empty where appropriate
drm/i915/pmu: replace cpumask_weight with cpumask_empty where appropriate
arch/x86: replace cpumask_weight with cpumask_empty where appropriate
...
|
|
mm/vmstat.c code calls cpumask_weight() to check if any bit of a given
cpumask is set. We can do it more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
|