summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-06-17mm: kmem: make mem_cgroup_from_obj() vmalloc()-safeRoman Gushchin2-22/+51
Currently mem_cgroup_from_obj() is not working properly with objects allocated using vmalloc(). It creates problems in some cases, when it's called for static objects belonging to modules or generally allocated using vmalloc(). This patch makes mem_cgroup_from_obj() safe to be called on objects allocated using vmalloc(). It also introduces mem_cgroup_from_slab_obj(), which is a faster version to use in places when we know the object is either a slab object or a generic slab page (e.g. when adding an object to a lru list). Link: https://lkml.kernel.org/r/20220610180310.1725111-1-roman.gushchin@linux.dev Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Tested-by: Vasily Averin <vvs@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Qian Cai <quic_qiancai@quicinc.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/memremap: fix memunmap_pages() race with get_dev_pagemap()Miaohe Lin1-1/+1
Think about the below scene: CPU1 CPU2 memunmap_pages percpu_ref_exit __percpu_ref_exit free_percpu(percpu_count); /* percpu_count is freed here! */ get_dev_pagemap xa_load(&pgmap_array, PHYS_PFN(phys)) /* pgmap still in the pgmap_array */ percpu_ref_tryget_live(&pgmap->ref) if __ref_is_percpu /* __PERCPU_REF_ATOMIC_DEAD not set yet */ this_cpu_inc(*percpu_count) /* access freed percpu_count here! */ ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD; /* too late... */ pageunmap_range To fix the issue, do percpu_ref_exit() after pgmap_array is emptied. So we won't do percpu_ref_tryget_live() against a being freed percpu_ref. Link: https://lkml.kernel.org/r/20220609121305.2508-1-linmiaohe@huawei.com Fixes: b7b3c01b1915 ("mm/memremap_pages: support multiple ranges per invocation") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: kmemleak: check physical address when scanPatrick Wang1-3/+14
Check the physical address of objects for its boundary when scan instead of in kmemleak_*_phys(). Link: https://lkml.kernel.org/r/20220611035551.1823303-5-patrick.wang.shcn@gmail.com Fixes: 23c2d497de21 ("mm: kmemleak: take a full lowmem check in kmemleak_*_phys()") Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: kmemleak: add rbtree and store physical address for objects allocated ↵Patrick Wang1-42/+91
with PA Add object_phys_tree_root to store the objects allocated with physical address. Distinguish it from object_tree_root by OBJECT_PHYS flag or function argument. The physical address is stored directly in those objects. Link: https://lkml.kernel.org/r/20220611035551.1823303-4-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: kmemleak: add OBJECT_PHYS flag for objects allocated with physical addressPatrick Wang1-9/+31
Add OBJECT_PHYS flag for object. This flag is used to identify the objects allocated with physical address. The create_object_phys() function is added as well to set that flag and is used by kmemleak_alloc_phys(). Link: https://lkml.kernel.org/r/20220611035551.1823303-3-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: kmemleak: remove kmemleak_not_leak_phys() and the min_count argument to ↵Patrick Wang2-24/+10
kmemleak_alloc_phys() Patch series "mm: kmemleak: store objects allocated with physical address separately and check when scan", v4. The kmemleak_*_phys() interface uses "min_low_pfn" and "max_low_pfn" to check address. But on some architectures, kmemleak_*_phys() is called before those two variables initialized. The following steps will be taken: 1) Add OBJECT_PHYS flag and rbtree for the objects allocated with physical address 2) Store physical address in objects if allocated with OBJECT_PHYS 3) Check the boundary when scan instead of in kmemleak_*_phys() This patch set will solve: https://lore.kernel.org/r/20220527032504.30341-1-yee.lee@mediatek.com https://lore.kernel.org/r/9dd08bb5-f39e-53d8-f88d-bec598a08c93@gmail.com v3: https://lore.kernel.org/r/20220609124950.1694394-1-patrick.wang.shcn@gmail.com v2: https://lore.kernel.org/r/20220603035415.1243913-1-patrick.wang.shcn@gmail.com v1: https://lore.kernel.org/r/20220531150823.1004101-1-patrick.wang.shcn@gmail.com This patch (of 4): Remove the unused kmemleak_not_leak_phys() function. And remove the min_count argument to kmemleak_alloc_phys() function, assume it's 0. Link: https://lkml.kernel.org/r/20220611035551.1823303-1-patrick.wang.shcn@gmail.com Link: https://lkml.kernel.org/r/20220611035551.1823303-2-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/memremap: fix wrong function name above memremap_pages()Miaohe Lin1-2/+2
Fix the wrong function name dev_memremap_pages above memremap_pages() to avoid confusion. Minor readability improvement. Link: https://lkml.kernel.org/r/20220607143621.58989-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/mempool: use might_alloc()Daniel Vetter1-1/+1
mempool are generally used for GFP_NOIO, so this wont benefit all that much because might_alloc currently only checks GFP_NOFS. But it does validate against mmu notifier pte zapping, some might catch some drivers doing really silly things, plus it's a bit more meaningful in what we're checking for here. Link: https://lkml.kernel.org/r/20220605152539.3196045-3-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/slab: delete cache_alloc_debugcheck_before()Daniel Vetter1-10/+0
It only does a might_sleep_if(GFP_RECLAIM) check, which is already covered by the might_alloc() in slab_pre_alloc_hook(). And all callers of cache_alloc_debugcheck_before() call that beforehand already. Link: https://lkml.kernel.org/r/20220605152539.3196045-2-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/page_alloc: use might_alloc()Daniel Vetter1-4/+1
... instead of open coding it. Completely equivalent code, just a notch more meaningful when reading. Link: https://lkml.kernel.org/r/20220605152539.3196045-1-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: memcontrol: add {pgscan,pgsteal}_{kswapd,direct} items in memory.stat of ↵Qi Zheng1-28/+27
cgroup v2 There are already statistics of {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct of memcg event here, but now only the sum of the two is displayed in memory.stat of cgroup v2. In order to obtain more accurate information during monitoring and debugging, and to align with the display in /proc/vmstat, it better to display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately. Also, for forward compatibility, we still display pgscan and pgsteal items so that it won't break existing applications. [zhengqi.arch@bytedance.com: add comment for memcg_vm_event_stat (suggested by Michal)] Link: https://lkml.kernel.org/r/20220606154028.55030-1-zhengqi.arch@bytedance.com [zhengqi.arch@bytedance.com: fix the doc, thanks to Johannes] Link: https://lkml.kernel.org/r/20220607064803.79363-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20220604082209.55174-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/vmalloc: add code comment for find_vmap_area_exceed_addr()Baoquan He1-2/+3
Its behaviour is like find_vma() which finds an area above the specified address, add comment to make it easier to understand. And also fix two places of grammer mistake/typo. Link: https://lkml.kernel.org/r/20220607105958.382076-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/vmalloc: fix typo in local variable nameBaoquan He1-6/+6
In __purge_vmap_area_lazy(), rename local_pure_list to local_purge_list. Link: https://lkml.kernel.org/r/20220607105958.382076-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/vmalloc: remove the redundant boundary checkBaoquan He1-4/+2
In find_va_links(), when traversing the vmap_area tree, the comparing to check if the passed in 'va' is above or below 'tmp_va' is redundant, assuming both 'va' and 'tmp_va' has ->va_start <= ->va_end. Here, to simplify the checking as code change. Link: https://lkml.kernel.org/r/20220607105958.382076-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/vmalloc: invoke classify_va_fit_type() in adjust_va_to_fit_type()Baoquan He1-17/+6
Patch series "Cleanup patches of vmalloc", v2. Some cleanup patches found when reading vmalloc code. This patch (of 4): adjust_va_to_fit_type() checks all values of passed in fit type, including NOTHING_FIT in the else branch. However, the check of NOTHING_FIT has been done inside adjust_va_to_fit_type() and before it's called in all call sites. In fact, both of these functions are coupled tightly, since classify_va_fit_type() is doing the preparation work for adjust_va_to_fit_type(). So putting invocation of classify_va_fit_type() inside adjust_va_to_fit_type() can simplify code logic and the redundant check of NOTHING_FIT issue will go away. Link: https://lkml.kernel.org/r/20220607105958.382076-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20220607105958.382076-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/memory_hotplug: drop 'reason' argument from check_pfn_span()Anshuman Khandual1-11/+9
In check_pfn_span(), a 'reason' string is being used to recreate the caller function name, while printing the warning message. It is really unnecessary as the warning message could just be printed inside the caller depending on the return code. Currently there are just two callers for check_pfn_span() i.e __add_pages() and __remove_pages(). Let's clean this up. Link: https://lkml.kernel.org/r/20220531090441.170650-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm/shmem.c: clean up comment of shmem_swapin_folioMiaohe Lin1-5/+5
shmem_swapin_folio has changed to use folio but comment still mentions page. Update the relevant comment accordingly as suggested by Naoya. Link: https://lkml.kernel.org/r/20220530115841.4348-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: avoid unnecessary page fault retires on shared memory typesPeter Xu2-2/+34
I observed that for each of the shared file-backed page faults, we're very likely to retry one more time for the 1st write fault upon no page. It's because we'll need to release the mmap lock for dirty rate limit purpose with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()). Then after that throttling we return VM_FAULT_RETRY. We did that probably because VM_FAULT_RETRY is the only way we can return to the fault handler at that time telling it we've released the mmap lock. However that's not ideal because it's very likely the fault does not need to be retried at all since the pgtable was well installed before the throttling, so the next continuous fault (including taking mmap read lock, walk the pgtable, etc.) could be in most cases unnecessary. It's not only slowing down page faults for shared file-backed, but also add more mmap lock contention which is in most cases not needed at all. To observe this, one could try to write to some shmem page and look at "pgfault" value in /proc/vmstat, then we should expect 2 counts for each shmem write simply because we retried, and vm event "pgfault" will capture that. To make it more efficient, add a new VM_FAULT_COMPLETED return code just to show that we've completed the whole fault and released the lock. It's also a hint that we should very possibly not need another fault immediately on this page because we've just completed it. This patch provides a ~12% perf boost on my aarch64 test VM with a simple program sequentially dirtying 400MB shmem file being mmap()ed and these are the time it needs: Before: 650.980 ms (+-1.94%) After: 569.396 ms (+-1.38%) I believe it could help more than that. We need some special care on GUP and the s390 pgfault handler (for gmap code before returning from pgfault), the rest changes in the page fault handlers should be relatively straightforward. Another thing to mention is that mm_account_fault() does take this new fault as a generic fault to be accounted, unlike VM_FAULT_RETRY. I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping them as-is. Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vineet Gupta <vgupta@kernel.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part] Acked-by: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Stafford Horne <shorne@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Brian Cain <bcain@quicinc.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Richard Weinberger <richard@nod.at> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Will Deacon <will@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Simek <monstr@monstr.eu> Cc: Matt Turner <mattst88@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: David Hildenbrand <david@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Rich Felker <dalias@libc.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Helge Deller <deller@gmx.de> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17mm: use PAGE_ALIGNED instead of IS_ALIGNEDFanjun Kong1-2/+2
<linux/mm.h> already provides the PAGE_ALIGNED macro. Let's use this macro instead of IS_ALIGNED and passing PAGE_SIZE directly. Link: https://lkml.kernel.org/r/20220526140257.1568744-1-bh1scw@gmail.com Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-09mm/huge_memory: Fix xarray node memory leakMatthew Wilcox (Oracle)1-2/+1
If xas_split_alloc() fails to allocate the necessary nodes to complete the xarray entry split, it sets the xa_state to -ENOMEM, which xas_nomem() then interprets as "Please allocate more memory", not as "Please free any unnecessary memory" (which was the intended outcome). It's confusing to use xas_nomem() to free memory in this context, so call xas_destroy() instead. Reported-by: syzbot+9e27a75a8c24f3fe75c1@syzkaller.appspotmail.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-09filemap: Cache the value of vm_flagsMatthew Wilcox (Oracle)1-4/+5
After we have unlocked the mmap_lock for I/O, the file is pinned, but the VMA is not. Checking this flag after that can be a use-after-free. It's not a terribly interesting use-after-free as it can only read one bit, and it's used to decide whether to read 2MB or 4MB. But it upsets the automated tools and it's generally bad practice anyway, so let's fix it. Reported-by: syzbot+5b96d55e5b54924c77ad@syzkaller.appspotmail.com Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-09filemap: Don't release a locked folioMatthew Wilcox (Oracle)1-0/+2
We must hold a reference over the call to filemap_release_folio(), otherwise the page cache will put the last reference to the folio before we unlock it, leading to splats like this: BUG: Bad page state in process u8:5 pfn:1ab1f4 page:ffffea0006ac7d00 refcount:0 mapcount:0 mapping:0000000000000000 index:0x28b1de pfn:0x1ab1f4 flags: 0x17ff80000040001(locked|reclaim|node=0|zone=2|lastcpupid=0xfff) raw: 017ff80000040001 dead000000000100 dead000000000122 0000000000000000 raw: 000000000028b1de 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set It's an error path, so it doesn't see much testing. Reported-by: Darrick J. Wong <djwong@kernel.org> Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-06Merge tag 'mm-hotfixes-stable-2022-06-05' of ↵Linus Torvalds4-32/+32
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm hotfixes from Andrew Morton: "Fixups for various recently-added and longer-term issues and a few minor tweaks: - fixes for material merged during this merge window - cc:stable fixes for more longstanding issues - minor mailmap and MAINTAINERS updates" * tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery x86/kexec: fix memory leak of elf header buffer mm/memremap: fix missing call to untrack_pfn() in pagemap_range() mm: page_isolation: use compound_nr() correctly in isolate_single_pageblock() mm: hugetlb_vmemmap: fix CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON MAINTAINERS: add maintainer information for z3fold mailmap: update Josh Poimboeuf's email
2022-06-06Merge tag 'mm-nonmm-stable-2022-06-05' of ↵Linus Torvalds2-0/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull delay-accounting update from Andrew Morton: "A single featurette for delay accounting. Delayed a bit because, unusually, it had dependencies on both the mm-stable and mm-nonmm-stable queues" * tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: delayacct: track delays from write-protect copy
2022-06-05Merge tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linuxLinus Torvalds1-2/+2
Pull bitmap updates from Yury Norov: - bitmap: optimize bitmap_weight() usage, from me - lib/bitmap.c make bitmap_print_bitmask_to_buf parseable, from Mauro Carvalho Chehab - include/linux/find: Fix documentation, from Anna-Maria Behnsen - bitmap: fix conversion from/to fix-sized arrays, from me - bitmap: Fix return values to be unsigned, from Kees Cook It has been in linux-next for at least a week with no problems. * tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linux: (31 commits) nodemask: Fix return values to be unsigned bitmap: Fix return values to be unsigned KVM: x86: hyper-v: replace bitmap_weight() with hweight64() KVM: x86: hyper-v: fix type of valid_bank_mask ia64: cleanup remove_siblinginfo() drm/amd/pm: use bitmap_{from,to}_arr32 where appropriate KVM: s390: replace bitmap_copy with bitmap_{from,to}_arr64 where appropriate lib/bitmap: add test for bitmap_{from,to}_arr64 lib: add bitmap_{from,to}_arr64 lib/bitmap: extend comment for bitmap_(from,to)_arr32() include/linux/find: Fix documentation lib/bitmap.c make bitmap_print_bitmask_to_buf parseable MAINTAINERS: add cpumask and nodemask files to BITMAP_API arch/x86: replace nodes_weight with nodes_empty where appropriate mm/vmstat: replace cpumask_weight with cpumask_empty where appropriate clocksource: replace cpumask_weight with cpumask_empty in clocksource.c genirq/affinity: replace cpumask_weight with cpumask_empty where appropriate irq: mips: replace cpumask_weight with cpumask_empty where appropriate drm/i915/pmu: replace cpumask_weight with cpumask_empty where appropriate arch/x86: replace cpumask_weight with cpumask_empty where appropriate ...
2022-06-03mm/vmstat: replace cpumask_weight with cpumask_empty where appropriateYury Norov1-2/+2
mm/vmstat.c code calls cpumask_weight() to check if any bit of a given cpumask is set. We can do it more efficiently with cpumask_empty() because cpumask_empty() stops traversing the cpumask as soon as it finds first set bit, while cpumask_weight() counts all bits unconditionally. Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
2022-06-02mm/oom_kill.c: fix vm_oom_kill_table[] ifdefferyAndrew Morton1-29/+29
arm allnoconfig: mm/oom_kill.c:60:25: warning: 'vm_oom_kill_table' defined but not used [-Wunused-variable] 60 | static struct ctl_table vm_oom_kill_table[] = { | ^~~~~~~~~~~~~~~~~ Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-02mm/memremap: fix missing call to untrack_pfn() in pagemap_range()Miaohe Lin1-1/+1
We forget to call untrack_pfn() to pair with track_pfn_remap() when range is not allowed to hotplug. Fix it by jump err_kasan. Link: https://lkml.kernel.org/r/20220531122643.25249-1-linmiaohe@huawei.com Fixes: bca3feaa0764 ("mm/memory_hotplug: prevalidate the address range being added with platform") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-02mm: page_isolation: use compound_nr() correctly in isolate_single_pageblock()Zi Yan1-1/+1
When compound_nr(page) was used, page was not guaranteed to be the head of the compound page and it could cause an infinite loop. Fix it by calling it on the head page. Link: https://lkml.kernel.org/r/20220531024450.2498431-1-zi.yan@sent.com Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/linux-mm/20220530115027.123341-1-anshuman.khandual@arm.com/ Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Qian Cai <quic_qiancai@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-02mm: hugetlb_vmemmap: fix CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ONMuchun Song1-1/+1
The following: commit 47010c040dec ("mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*") forgot to update CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON used in vmemmap_optimize_mode to CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON. The result is we cannot enable hugetlb_optimize_vmemmap at boot time when we configure CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON. Fix it. Link: https://lkml.kernel.org/r/20220527081948.68832-1-songmuchun@bytedance.com Fixes: 47010c040dec ("mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-02delayacct: track delays from write-protect copyYang Yang2-0/+16
Delay accounting does not track the delay of write-protect copy. When tasks trigger many write-protect copys(include COW and unsharing of anonymous pages[1]), it may spend a amount of time waiting for them. To get the delay of tasks in write-protect copy, could help users to evaluate the impact of using KSM or fork() or GUP. Also update tools/accounting/getdelays.c: / # ./getdelays -dl -p 231 print delayacct stats ON listen forever PID 231 CPU count real total virtual total delay total delay average 6247 1859000000 2154070021 1674255063 0.268ms IO count delay total delay average 0 0 0ms SWAP count delay total delay average 0 0 0ms RECLAIM count delay total delay average 0 0 0ms THRASHING count delay total delay average 0 0 0ms COMPACT count delay total delay average 3 72758 0ms WPCOPY count delay total delay average 3635 271567604 0ms [1] commit 31cc5bc4af70("mm: support GUP-triggered unsharing of anonymous pages") Link: https://lkml.kernel.org/r/20220409014342.2505532-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: wangyong <wang.yong12@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-01Merge tag 'riscv-for-linus-5.19-mw0' of ↵Linus Torvalds2-0/+18
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for the Svpbmt extension, which allows memory attributes to be encoded in pages - Support for the Allwinner D1's implementation of page-based memory attributes - Support for running rv32 binaries on rv64 systems, via the compat subsystem - Support for kexec_file() - Support for the new generic ticket-based spinlocks, which allows us to also move to qrwlock. These should have already gone in through the asm-geneic tree as well - A handful of cleanups and fixes, include some larger ones around atomics and XIP * tag 'riscv-for-linus-5.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits) RISC-V: Prepare dropping week attribute from arch_kexec_apply_relocations[_add] riscv: compat: Using seperated vdso_maps for compat_vdso_info RISC-V: Fix the XIP build RISC-V: Split out the XIP fixups into their own file RISC-V: ignore xipImage RISC-V: Avoid empty create_*_mapping definitions riscv: Don't output a bogus mmu-type on a no MMU kernel riscv: atomic: Add custom conditional atomic operation implementation riscv: atomic: Optimize dec_if_positive functions riscv: atomic: Cleanup unnecessary definition RISC-V: Load purgatory in kexec_file RISC-V: Add purgatory RISC-V: Support for kexec_file on panic RISC-V: Add kexec_file support RISC-V: use memcpy for kexec_file mode kexec_file: Fix kexec_file.c build error for riscv platform riscv: compat: Add COMPAT Kbuild skeletal support riscv: compat: ptrace: Add compat_arch_ptrace implement riscv: compat: signal: Add rt_frame implementation riscv: add memory-type errata for T-Head ...
2022-05-28Merge tag 'powerpc-5.19-1' of ↵Linus Torvalds2-10/+27
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Convert to the generic mmap support (ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT) - Add support for outline-only KASAN with 64-bit Radix MMU (P9 or later) - Increase SIGSTKSZ and MINSIGSTKSZ and add support for AT_MINSIGSTKSZ - Enable the DAWR (Data Address Watchpoint) on POWER9 DD2.3 or later - Drop support for system call instruction emulation - Many other small features and fixes Thanks to Alexey Kardashevskiy, Alistair Popple, Andy Shevchenko, Bagas Sanjaya, Bjorn Helgaas, Bo Liu, Chen Huang, Christophe Leroy, Colin Ian King, Daniel Axtens, Dwaipayan Ray, Fabiano Rosas, Finn Thain, Frank Rowand, Fuqian Huang, Guilherme G. Piccoli, Hangyu Hua, Haowen Bai, Haren Myneni, Hari Bathini, He Ying, Jason Wang, Jiapeng Chong, Jing Yangyang, Joel Stanley, Julia Lawall, Kajol Jain, Kevin Hao, Krzysztof Kozlowski, Laurent Dufour, Lv Ruyi, Madhavan Srinivasan, Magali Lemes, Miaoqian Lin, Minghao Chi, Nathan Chancellor, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Oscar Salvador, Pali Rohár, Paul Mackerras, Peng Wu, Qing Wang, Randy Dunlap, Reza Arbab, Russell Currey, Sohaib Mohamed, Vaibhav Jain, Vasant Hegde, Wang Qing, Wang Wensheng, Xiang wangx, Xiaomeng Tong, Xu Wang, Yang Guang, Yang Li, Ye Bin, YueHaibing, Yu Kuai, Zheng Bin, Zou Wei, and Zucheng Zheng. * tag 'powerpc-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits) powerpc/64: Include cache.h directly in paca.h powerpc/64s: Only set HAVE_ARCH_UNMAPPED_AREA when CONFIG_PPC_64S_HASH_MMU is set powerpc/xics: Include missing header powerpc/powernv/pci: Drop VF MPS fixup powerpc/fsl_book3e: Don't set rodata RO too early powerpc/microwatt: Add mmu bits to device tree powerpc/powernv/flash: Check OPAL flash calls exist before using powerpc/powermac: constify device_node in of_irq_parse_oldworld() powerpc/powermac: add missing g5_phy_disable_cpu1() declaration selftests/powerpc/pmu: fix spelling mistake "mis-match" -> "mismatch" powerpc: Enable the DAWR on POWER9 DD2.3 and above powerpc/64s: Add CPU_FTRS_POWER10 to ALWAYS mask powerpc/64s: Add CPU_FTRS_POWER9_DD2_2 to CPU_FTRS_ALWAYS mask powerpc: Fix all occurences of "the the" selftests/powerpc/pmu/ebb: remove fixed_instruction.S powerpc/platforms/83xx: Use of_device_get_match_data() powerpc/eeh: Drop redundant spinlock initialization powerpc/iommu: Add missing of_node_put in iommu_init_early_dart powerpc/pseries/vas: Call misc_deregister if sysfs init fails powerpc/papr_scm: Fix leaking nvdimm_events_map elements ...
2022-05-27Merge tag 'mm-stable-2022-05-27' of ↵Linus Torvalds12-87/+261
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - Two follow-on fixes for the post-5.19 series "Use pageblock_order for cma and alloc_contig_range alignment", from Zi Yan. - A series of z3fold cleanups and fixes from Miaohe Lin. - Some memcg selftests work from Michal Koutný <mkoutny@suse.com> - Some swap fixes and cleanups from Miaohe Lin - Several individual minor fixups * tag 'mm-stable-2022-05-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) mm/shmem.c: suppress shift warning mm: Kconfig: reorganize misplaced mm options mm: kasan: fix input of vmalloc_to_page() mm: fix is_pinnable_page against a cma page mm: filter out swapin error entry in shmem mapping mm/shmem: fix infinite loop when swap in shmem error at swapoff time mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range mm/swapfile: fix lost swap bits in unuse_pte() mm/swapfile: unuse_pte can map random data if swap read fails selftests: memcg: factor out common parts of memory.{low,min} tests selftests: memcg: remove protection from top level memcg selftests: memcg: adjust expected reclaim values of protected cgroups selftests: memcg: expect no low events in unprotected sibling selftests: memcg: fix compilation mm/z3fold: fix z3fold_page_migrate races with z3fold_map mm/z3fold: fix z3fold_reclaim_page races with z3fold_free mm/z3fold: always clear PAGE_CLAIMED under z3fold page lock mm/z3fold: put z3fold page back into unbuddied list when reclaim or migration fails revert "mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc" mm/z3fold: throw warning on failure of trylock_page in z3fold_alloc ...
2022-05-27Merge tag 'mm-hotfixes-stable-2022-05-27' of ↵Linus Torvalds5-9/+47
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Six hotfixes. The page_table_check one from Miaohe Lin is considered a minor thing so it isn't marked for -stable. The remainder address pre-5.19 issues and are cc:stable" * tag 'mm-hotfixes-stable-2022-05-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/page_table_check: fix accessing unmapped ptep kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add] mm/page_alloc: always attempt to allocate at least one page during bulk allocation hugetlb: fix huge_pmd_unshare address update zsmalloc: fix races between asynchronous zspage free and page migration Revert "mm/cma.c: remove redundant cma_mutex lock"
2022-05-27mm/shmem.c: suppress shift warningAndrew Morton1-1/+1
mm/shmem.c:1948 shmem_getpage_gfp() warn: should '(((1) << 12) / 512) << folio_order(folio)' be a 64 bit type? On i386, so an unsigned long is 32-bit, but i_blocks is a 64-bit blkcnt_t. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Jessica Clarke <jrtc27@jrtc27.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm: Kconfig: reorganize misplaced mm optionsVlastimil Babka2-0/+87
After commits 7b42f1041c98 ("mm: Kconfig: move swap and slab config options to the MM section") and 519bcb797907 ("mm: Kconfig: group swap, slab, hotplug and thp options into submenus") we now have nicely organized mm related config options. I have noticed some that were still misplaced, so this moves them from various places into the new structure: VM_EVENT_COUNTERS, COMPAT_BRK, MMAP_ALLOW_UNINITIALIZED to mm/Kconfig and general MM section. SLUB_STATS to mm/Kconfig and the slab submenu. DEBUG_SLAB, SLUB_DEBUG, SLUB_DEBUG_ON to mm/Kconfig.debug and the Kernel hacking / Memory Debugging submenu. Link: https://lkml.kernel.org/r/20220525112559.1139-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm: kasan: fix input of vmalloc_to_page()Kefeng Wang1-1/+1
When print virtual mapping info for vmalloc address, it should pass the addr not page, fix it. Link: https://lkml.kernel.org/r/20220525120804.38155-1-wangkefeng.wang@huawei.com Fixes: c056a364e954 ("kasan: print virtual mapping info in reports") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm: fix is_pinnable_page against a cma pageMinchan Kim1-2/+6
Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so the current is_pinnable_page() could miss CMA pages which have MIGRATE_ISOLATE. It ends up pinning CMA pages as longterm for the pin_user_pages() API so CMA allocations keep failing until the pin is released. CPU 0 CPU 1 - Task B cma_alloc alloc_contig_range pin_user_pages_fast(FOLL_LONGTERM) change pageblock as MIGRATE_ISOLATE internal_get_user_pages_fast lockless_pages_from_mm gup_pte_range try_grab_folio is_pinnable_page return true; So, pinned the page successfully. page migration failure with pinned page .. .. After 30 sec unpin_user_page(page) CMA allocation succeeded after 30 sec. The CMA allocation path protects the migration type change race using zone->lock but what GUP path need to know is just whether the page is on CMA area or not rather than exact migration type. Thus, we don't need zone->lock but just checks migration type in either of (MIGRATE_ISOLATE and MIGRATE_CMA). Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause rejecting of pinning pages on MIGRATE_ISOLATE pageblocks even though it's neither CMA nor movable zone if the page is temporarily unmovable. However, such a migration failure by unexpected temporal refcount holding is general issue, not only come from MIGRATE_ISOLATE and the MIGRATE_ISOLATE is also transient state like other temporal elevated refcount problem. Link: https://lkml.kernel.org/r/20220524171525.976723-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm: filter out swapin error entry in shmem mappingMiaohe Lin2-1/+7
There might be swapin error entries in shmem mapping. Filter them out to avoid "Bad swap file entry" complaint. Link: https://lkml.kernel.org/r/20220519125030.21486-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/shmem: fix infinite loop when swap in shmem error at swapoff timeMiaohe Lin1-0/+39
When swap in shmem error at swapoff time, there would be a infinite loop in the while loop in shmem_unuse_inode(). It's because swapin error is deliberately ignored now and thus info->swapped will never reach 0. So we can't escape the loop in shmem_unuse(). In order to fix the issue, swapin_error entry is stored in the mapping when swapin error occurs. So the swapcache page can be freed and the user won't end up with a permanently mounted swap because a sector is bad. If the page is accessed later, the user process will be killed so that corrupted data is never consumed. On the other hand, if the page is never accessed, the user won't even notice it. Link: https://lkml.kernel.org/r/20220519125030.21486-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_rangeMiaohe Lin1-5/+8
Once the MADV_FREE operation has succeeded, callers can expect they might get zero-fill pages if accessing the memory again. Therefore it should be safe to delete the hwpoison entry and swapin error entry. There is no reason to kill the process if it has called MADV_FREE on the range. Link: https://lkml.kernel.org/r/20220519125030.21486-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/swapfile: fix lost swap bits in unuse_pte()Miaohe Lin1-3/+7
This is observed by code review only but not any real report. When we turn off swapping we could have lost the bits stored in the swap ptes. The new rmap-exclusive bit is fine since that turned into a page flag, but not for soft-dirty and uffd-wp. Add them. Link: https://lkml.kernel.org/r/20220519125030.21486-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/swapfile: unuse_pte can map random data if swap read failsMiaohe Lin2-1/+15
Patch series "A few fixup patches for mm", v4. This series contains a few patches to avoid mapping random data if swap read fails and fix lost swap bits in unuse_pte. Also we free hwpoison and swapin error entry in madvise_free_pte_range and so on. More details can be found in the respective changelogs. This patch (of 5): There is a bug in unuse_pte(): when swap page happens to be unreadable, page filled with random data is mapped into user address space. In case of error, a special swap entry indicating swap read fails is set to the page table. So the swapcache page can be freed and the user won't end up with a permanently mounted swap because a sector is bad. And if the page is accessed later, the user process will be killed so that corrupted data is never consumed. On the other hand, if the page is never accessed, the user won't even notice it. Link: https://lkml.kernel.org/r/20220519125030.21486-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220519125030.21486-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/z3fold: fix z3fold_page_migrate races with z3fold_mapMiaohe Lin1-4/+12
Think about the below scenario: CPU1 CPU2 z3fold_page_migrate z3fold_map z3fold_page_trylock ... z3fold_page_unlock /* slots still points to old zhdr*/ get_z3fold_header get slots from handle get old zhdr from slots z3fold_page_trylock return *old* zhdr encode_handle(new_zhdr, FIRST|LAST|MIDDLE) put_page(page) /* zhdr is freed! */ but zhdr is still used by caller! z3fold_map can map freed z3fold page and lead to use-after-free bug. To fix it, we add PAGE_MIGRATED to indicate z3fold page is migrated and soon to be released. So get_z3fold_header won't return such page. Link: https://lkml.kernel.org/r/20220429064051.61552-10-linmiaohe@huawei.com Fixes: 1f862989b04a ("mm/z3fold.c: support page migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/z3fold: fix z3fold_reclaim_page races with z3fold_freeMiaohe Lin1-15/+3
Think about the below scenario: CPU1 CPU2 z3fold_reclaim_page z3fold_free spin_lock(&pool->lock) get_z3fold_header -- hold page_lock kref_get_unless_zero kref_put--zhdr->refcount can be 1 now !z3fold_page_trylock kref_put -- zhdr->refcount is 0 now release_z3fold_page WARN_ON(!list_empty(&zhdr->buddy)); -- we're on buddy now! spin_lock(&pool->lock); -- deadlock here! z3fold_reclaim_page might race with z3fold_free and will lead to pool lock deadlock and zhdr buddy non-empty warning. To fix this, defer getting the refcount until page_lock is held just like what __z3fold_alloc does. Note this has the side effect that we won't break the reclaim if we meet a soon to be released z3fold page now. Link: https://lkml.kernel.org/r/20220429064051.61552-9-linmiaohe@huawei.com Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/z3fold: always clear PAGE_CLAIMED under z3fold page lockMiaohe Lin1-3/+3
Think about the below race window: CPU1 CPU2 z3fold_reclaim_page z3fold_free test_and_set_bit PAGE_CLAIMED failed to reclaim page z3fold_page_lock(zhdr); add back to the lru list; z3fold_page_unlock(zhdr); get_z3fold_header page_claimed=test_and_set_bit PAGE_CLAIMED clear_bit(PAGE_CLAIMED, &page->private); if (!page_claimed) /* it's false true */ free_handle is not called free_handle won't be called in this case. So z3fold_buddy_slots will leak. Fix it by always clear PAGE_CLAIMED under z3fold page lock. Link: https://lkml.kernel.org/r/20220429064051.61552-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/z3fold: put z3fold page back into unbuddied list when reclaim or ↵Miaohe Lin1-0/+4
migration fails When doing z3fold page reclaim or migration, the page is removed from unbuddied list. If reclaim or migration succeeds, it's fine as page is released. But in case it fails, the page is not put back into unbuddied list now. The page will be leaked until next compaction work, reclaim or migration is done. Link: https://lkml.kernel.org/r/20220429064051.61552-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27revert "mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc"Miaohe Lin1-5/+3
Revert commit f1549cb5ab2b ("mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc"). z3fold can't support GFP_HIGHMEM page now. page_address is used directly at all places. Moreover, z3fold_header is on per cpu unbuddied list which could be accessed anytime. So we should remove the support of GFP_HIGHMEM allocation for z3fold. Link: https://lkml.kernel.org/r/20220429064051.61552-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-27mm/z3fold: throw warning on failure of trylock_page in z3fold_allocMiaohe Lin1-4/+3
If trylock_page fails, the page won't be non-lru movable page. When this page is freed via free_z3fold_page, it will trigger bug on PageMovable check in __ClearPageMovable. Throw warning on failure of trylock_page to guard against such rare case just as what zsmalloc does. Link: https://lkml.kernel.org/r/20220429064051.61552-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>