path: root/mm/memory-failure.c
Age  Commit message  [Author]  Files/Lines
2022-04-27  mm/memory-failure.c: skip huge_zero_page in memory_failure()  [Xu Yu]  1 file, -0/+13
commit d173d5417fb67411e623d394aab986d847e47dad upstream.

The kernel panics when injecting memory_failure for the global huge_zero_page, with CONFIG_DEBUG_VM enabled, as follows:

  Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
  page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
  head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
  flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
  raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
  page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
  ------------[ cut here ]------------
  kernel BUG at mm/huge_memory.c:2499!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
  Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
  RIP: 0010:split_huge_page_to_list+0x66a/0x880
  Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
  RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
  RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
  RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
  R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
  R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
  FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   try_to_split_thp_page+0x3a/0x130
   memory_failure+0x128/0x800
   madvise_inject_error.cold+0x8b/0xa1
   __x64_sys_madvise+0x54/0x60
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7fc3754f8bf9
  Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
  RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
  RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
  RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
  R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
  R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

This makes huge_zero_page bail out explicitly before the split in memory_failure(), so the panic above won't happen again.

Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7") Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reported-by: Abaci <abaci@linux.alibaba.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
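The shape of the fix is an explicit bail-out before the THP split is attempted; a minimal sketch (the label and action names follow the usual memory_failure() conventions and are assumptions here, not a verbatim quote of the patch):

    if (PageTransHuge(hpage)) {
        /*
         * The huge zero page must never be split; bail out before
         * try_to_split_thp_page() trips the VM_BUG_ON_PAGE() above.
         */
        if (is_huge_zero_page(hpage)) {
            action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
            res = -EBUSY;
            goto unlock_mutex;
        }
    }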
2021-12-29  mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()  [Liu Shixin]  1 file, -0/+1
commit 2a57d83c78f889bf3f54eede908d0643c40d5418 upstream.

Hulk Robot reported a panic in put_page_testzero() when testing madvise() with MADV_SOFT_OFFLINE. The BUG() is triggered when retrying get_any_page(). This happens because we keep the MF_COUNT_INCREASED flag on the second try, but the refcount has not been increased.

  page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
  ------------[ cut here ]------------
  kernel BUG at include/linux/mm.h:737!
  invalid opcode: 0000 [#1] PREEMPT SMP
  CPU: 5 PID: 2135 Comm: sshd Tainted: G B 5.16.0-rc6-dirty #373
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  RIP: release_pages+0x53f/0x840
  Call Trace:
   free_pages_and_swap_cache+0x64/0x80
   tlb_flush_mmu+0x6f/0x220
   unmap_page_range+0xe6c/0x12c0
   unmap_single_vma+0x90/0x170
   unmap_vmas+0xc4/0x180
   exit_mmap+0xde/0x3a0
   mmput+0xa3/0x250
   do_exit+0x564/0x1470
   do_group_exit+0x3b/0x100
   __do_sys_exit_group+0x13/0x20
   __x64_sys_exit_group+0x16/0x20
   do_syscall_64+0x34/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  Modules linked in:
  ---[ end trace e99579b570fe0649 ]---
  RIP: 0010:release_pages+0x53f/0x840

Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
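A minimal sketch of the retry path with the fix applied (the surrounding names such as try_again follow the existing soft-offline retry code and are assumptions here):

    if (ret == -EBUSY && try_again) {
        try_again = false;
        /*
         * The pin taken by the caller (MF_COUNT_INCREASED) was
         * consumed by the first attempt; the retry must grab its
         * own reference instead of pretending it still holds one.
         */
        flags &= ~MF_COUNT_INCREASED;
        goto retry;
    }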
2021-12-29  mm, hwpoison: fix condition in free hugetlb page path  [Naoya Horiguchi]  1 file, -9/+4
commit e37e7b0b3bd52ec4f8ab71b027bcec08f57f1b3b upstream. When a memory error hits a tail page of a free hugepage, __page_handle_poison() is expected to be called to isolate the error in 4kB units, but it's not called due to an outdated if-condition in memory_failure_hugetlb(). This loses the chance to isolate the error in the finer unit, so it's not optimal. Drop the condition. This "(p != head && TestSetPageHWPoison(head))" condition is based on the old semantics of PageHWPoison on hugepages (where the PG_hwpoison flag was set on the subpage), so it's not necessary any more. Now that PG_hwpoison is set on the head page for hugepages, concurrent error events on different subpages in a single hugepage can be prevented by the TestSetPageHWPoison(head) at the beginning of memory_failure_hugetlb(). So dropping the condition should not reopen the race window originally mentioned in commit b985194c8c0a ("hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage"). [naoya.horiguchi@linux.dev: fix "HardwareCorrupted" counter] Link: https://lkml.kernel.org/r/20211220084851.GA1460264@u2004 Link: https://lkml.kernel.org/r/20211210110208.879740-1-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Fei Luo <luofei@unicloud.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> [5.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-29  mm: filemap: check if THP has hwpoisoned subpage for PMD page fault  [Yang Shi]  1 file, -0/+14
When handling a shmem page fault, a THP with a corrupted subpage can be PMD-mapped if certain conditions are satisfied, but the kernel is supposed to send SIGBUS when trying to map a hwpoisoned page. There are two paths which may do the PMD map: fault-around and regular fault. Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") things were even worse in the fault-around path: the THP could be PMD-mapped as long as the VMA fit, regardless of which subpage was accessed and corrupted. After that commit, the THP can be PMD-mapped as long as the head page is not corrupted. In the regular fault path the THP can be PMD-mapped as long as the corrupted page is not accessed and the VMA fits. This loophole could be closed by iterating over every subpage to check whether any of them is hwpoisoned, but that is somewhat costly in the page fault path. So introduce a new page flag called HasHWPoisoned on the first tail page. It indicates that the THP has hwpoisoned subpage(s). It is set if any subpage of the THP is found hwpoisoned by memory failure and after the refcount is bumped successfully, then cleared when the THP is freed or split. The soft offline path doesn't need this, since the soft offline handler just marks a subpage hwpoisoned once the subpage has been migrated successfully; but shmem THPs never get split and migrated at all. Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
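The consumer side of the new flag is a cheap head-flag test in the PMD fault path; a minimal sketch (placement in do_set_pmd() is an assumption based on the description above):

    /* never PMD-map a THP carrying a hwpoisoned subpage */
    if (unlikely(PageHasHWPoisoned(page)))
        return VM_FAULT_FALLBACK;   /* fall back to PTE mapping; the
                                       bad subpage then gets SIGBUS */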
2021-10-29  mm: hwpoison: remove the unnecessary THP check  [Yang Shi]  1 file, -14/+0
When handling a THP, hwpoison checked whether the THP was in the allocation or free stage, since hwpoison may otherwise mistreat it as a hugetlb page. After commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling") the problem has been fixed, so this check is no longer needed. Remove it. The side effect of the removal is that hwpoison may report "unsplit THP" instead of "unknown error" for shmem THPs; that does not seem like a big deal. The following patch "mm: filemap: check if THP has hwpoisoned subpage for PMD page fault" depends on this one; it fixes shmem THPs with hwpoisoned subpage(s) being wrongly PMD-mapped. So this patch needs to be backported to -stable as well. Link: https://lkml.kernel.org/r/20211020210755.23964-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-25  mm/memory_failure: fix the missing pte_unmap() call  [Qi Zheng]  1 file, -5/+5
The paired pte_unmap() call is missing before dev_pagemap_mapping_shift() returns, so fix it. David says: "I guess this code never runs on 32bit / highmem, that's why we didn't notice so far". [akpm@linux-foundation.org: cleanup] Link: https://lkml.kernel.org/r/20210923122642.4999-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-25  mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()  [Naoya Horiguchi]  1 file, -1/+1
Commit fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages") changed the return value of __get_hwpoison_page() to retry for transiently unhandlable cases. However, __get_hwpoison_page() currently fails to properly judge buddy pages as handlable, so hard/soft offline for buddy pages always fail as "unhandlable page". This is totally regrettable. So let's add is_free_buddy_page() in HWPoisonHandlable(), so that __get_hwpoison_page() returns different return values between buddy pages and unhandlable pages as intended. Link: https://lkml.kernel.org/r/20210909004131.163221-1-naoya.horiguchi@linux.dev Fixes: fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  Merge branch 'akpm' (patches from Andrew)  [Linus Torvalds]  1 file, -27/+24
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03  mm/migrate: enable returning precise migrate_pages() success count  [Yang Shi]  1 file, -1/+1
Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm: fix panic caused by __page_handle_poison()  [Michael Wang]  1 file, -2/+2
In commit 510d25c92ec4 ("mm/hwpoison: disable pcp for page_handle_poison()"), __page_handle_poison() was introduced. If we write

  RET_A = dissolve_free_huge_page();
  RET_B = take_page_off_buddy();

then __page_handle_poison() was supposed to return true when RET_A == 0 && RET_B == true. But since it failed to take care of the case when RET_A is -EBUSY or -ENOMEM, and just returned ret as a bool (which then becomes true), it broke the original logic. The result is a huge page on the freelist that is still referenced as poisoned, leading to the final panic:

  kernel BUG at mm/internal.h:95!
  invalid opcode: 0000 [#1] SMP PTI
  skip...
  RIP: 0010:set_page_refcounted mm/internal.h:95 [inline]
  RIP: 0010:remove_hugetlb_page+0x23c/0x240 mm/hugetlb.c:1371
  skip...
  Call Trace:
   remove_pool_huge_page+0xe4/0x110 mm/hugetlb.c:1892
   return_unused_surplus_pages+0x8d/0x150 mm/hugetlb.c:2272
   hugetlb_acct_memory.part.91+0x524/0x690 mm/hugetlb.c:4017

This patch replaces 'bool' with 'int' so that RET_A is handled correctly.

Link: https://lkml.kernel.org/r/61782ac6-1e8a-4f6f-35e6-e94fce3b37f5@linux.alibaba.com Fixes: 510d25c92ec4 ("mm/hwpoison: disable pcp for page_handle_poison()") Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Abaci <abaci@linux.alibaba.com> Cc: <stable@vger.kernel.org> [5.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
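A sketch of the corrected helper, close to what the description implies (error codes now propagate instead of being collapsed into a bool; take_page_off_buddy() returns a bool, so callers treat a result <= 0 as failure):

    static int __page_handle_poison(struct page *page)
    {
        int ret;

        zone_pcp_disable(page_zone(page));
        ret = dissolve_free_huge_page(page);   /* 0, -EBUSY or -ENOMEM */
        if (!ret)
            ret = take_page_off_buddy(page);   /* bool: 1 on success */
        zone_pcp_enable(page_zone(page));

        return ret;   /* callers check ret > 0 for success */
    }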
2021-09-03  mm: hwpoison: dump page for unhandlable page  [Yang Shi]  1 file, -3/+3
Currently just a very simple message is shown for an unhandlable page, e.g. a non-LRU page, like: soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 (). That is not very helpful for further debugging; calling dump_page() could show more useful information. Call dump_page() in get_any_page() in order not to duplicate the call in a couple of different places. It may be called with pcp disabled and while holding the memory hotplug lock, which should not be a big deal, since the hwpoison handler is not called very often. [shy828301@gmail.com: remove redundant pr_info per Naoya Horiguchi] Link: https://lkml.kernel.org/r/20210824020946.195257-3-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210819054116.266126-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Mackey <tdmackey@twitter.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm: hwpoison: don't drop slab caches for offlining non-LRU page  [Yang Shi]  1 file, -10/+8
In the current implementation of soft offline, if a non-LRU page is met, all the slab caches will be dropped to free the page, then offline it. But if the page is not a slab page, all that effort is wasted. And even if it is a slab page, it is not guaranteed that the page can be freed at all. However, the side effect and cost are quite high: it does not only drop the slab caches, it may also drop a significant amount of page caches which are associated with inode caches. It can throw away most of the workingset just to offline a single page. And the offline is not guaranteed to succeed at all; actually I really doubt the success rate for real-life workloads. Furthermore, the worst consequence is that the system may be locked up and unusable, since the page cache release may incur a huge amount of work queued for memcg release.

Actually we ran into such an unpleasant case in our production environment. Firstly, the workqueue of memory_failure_work_func locked up as below:

  BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
  Showing busy workqueues and worker pools:
  workqueue events: flags=0x0
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
      in-flight: 409271:memory_failure_work_func
      pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
  workqueue mm_percpu_wq: flags=0x8
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      pending: vmstat_update
  workqueue cgroup_destroy: flags=0x0
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
      pending: css_release_work_fn

There were over 12K css_release_work_fn queued, and this caused a few lockups due to contention on the worker pool lock with IRQ disabled, for example:

  NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
  Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
  CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
  Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
  Workqueue: events memory_failure_work_func
  RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
  Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
  RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
  RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
  RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
  R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
  R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
  FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
  Call Trace:
   __queue_work+0xd6/0x3c0
   queue_work_on+0x1c/0x30
   uncharge_batch+0x10e/0x110
   mem_cgroup_uncharge_list+0x6d/0x80
   release_pages+0x37f/0x3f0
   __pagevec_release+0x1c/0x50
   __invalidate_mapping_pages+0x348/0x380
   inode_lru_isolate+0x10a/0x160
   __list_lru_walk_one+0x7b/0x170
   list_lru_walk_one+0x4a/0x60
   prune_icache_sb+0x37/0x50
   super_cache_scan+0x123/0x1a0
   do_shrink_slab+0x10c/0x2c0
   shrink_slab+0x1f1/0x290
   drop_slab_node+0x4d/0x70
   soft_offline_page+0x1ac/0x5b0
   memory_failure_work_func+0x6a/0x90
   process_one_work+0x19e/0x340
   worker_thread+0x30/0x360
   kthread+0x116/0x130

The lockup made the machine quite unusable. And it also made most of the workingset gone: the reclaimable slab caches were reduced from 12G to 300MB, and the page caches were decreased from 17G to 4G. But the most disappointing thing is that all the effort doesn't take the page offline; it just returns:

  soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It seems the aggressive behavior for non-LRU pages didn't pay off, so it doesn't make too much sense to keep it, considering the terrible side effect.

Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: David Mackey <tdmackey@twitter.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/hwpoison: fix some obsolete comments  [Miaohe Lin]  1 file, -3/+3
Since commit cb731d6c62bb ("vmscan: per memory cgroup slab shrinkers"), shrink_node_slabs has been renamed to drop_slab_node. And the doit argument was changed to forcekill by commit 6751ed65dc66 ("x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults"). Update the obsolete comments accordingly. Link: https://lkml.kernel.org/r/20210814105131.48814-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/hwpoison: change argument struct page **hpagep to *hpage  [Miaohe Lin]  1 file, -4/+3
It's unnecessary to pass in a struct page **hpagep, because it's never modified. Change it to use *hpage to simplify the code. Link: https://lkml.kernel.org/r/20210814105131.48814-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/hwpoison: fix potential pte_unmap_unlock pte error  [Miaohe Lin]  1 file, -3/+4
If the first pte is equal to poisoned_pfn, i.e. check_hwpoisoned_entry() returns 1, the wrong pointer ptep - 1 would be passed to pte_unmap_unlock(). Link: https://lkml.kernel.org/r/20210814105131.48814-3-linmiaohe@huawei.com Fixes: ad9c59c24095 ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
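The usual fix pattern is to remember the pte actually returned by pte_offset_map_lock() and unlock on that, regardless of how far the scan advanced; a sketch (the hwp/tk names are borrowed from the SIGBUS-with-virtual-address series referenced in the Fixes tag and are assumptions here):

    pte_t *ptep, *mapped_pte;

    mapped_pte = ptep = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
    for (; addr != end; ptep++, addr += PAGE_SIZE) {
        ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
                                     hwp->pfn, &hwp->tk);
        if (ret == 1)
            break;   /* ptep no longer points at the first pte */
    }
    pte_unmap_unlock(mapped_pte, ptl);   /* always unmap what we mapped */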
2021-09-03  mm/hwpoison: remove unneeded variable unmap_success  [Miaohe Lin]  1 file, -2/+1
Patch series "Cleanups and fixup for hwpoison" This series contains cleanups to remove unneeded variable, fix some obsolete comments and so on. Also we fix potential pte_unmap_unlock on wrong pte. More details can be found in the respective changelogs. This patch (of 4): unmap_success is used to indicate whether page is successfully unmapped but it's irrelated with ZONE_DEVICE page and unmap_success is always true here. Remove this unneeded one. Link: https://lkml.kernel.org/r/20210814105131.48814-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210814105131.48814-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-30  Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs  [Linus Torvalds]  1 file, -1/+1
Pull fs hole punching vs cache filling race fixes from Jan Kara: "Fix races leading to possible data corruption or stale data exposure in multiple filesystems when hole punching races with operations such as readahead. This is the series I was sending for the last merge window, but with your objection fixed - now filemap_fault() has been modified to take invalidate_lock only when we need to create a new page in the page cache and/or bring it uptodate"

* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  filesystems/locking: fix Malformed table warning
  cifs: Fix race between hole punch and page fault
  ceph: Fix race between hole punch and page fault
  fuse: Convert to using invalidate_lock
  f2fs: Convert to using invalidate_lock
  zonefs: Convert to using invalidate_lock
  xfs: Convert double locking of MMAPLOCK to use VFS helpers
  xfs: Convert to use invalidate_lock
  xfs: Refactor xfs_isilocked()
  ext2: Convert to using invalidate_lock
  ext4: Convert to use mapping->invalidate_lock
  mm: Add functions to lock invalidate_lock for two mappings
  mm: Protect operations adding pages to page cache with invalidate_lock
  documentation: Sync file_operations members with reality
  mm: Fix comments mentioning i_mutex
2021-08-20  mm/hwpoison: retry with shake_page() for unhandlable pages  [Naoya Horiguchi]  1 file, -3/+9
HWPoisonHandlable() sometimes returns false for typical user pages due to races with ordinary memory events like transfers over LRU lists. This causes failures in hwpoison handling. There is retry code for such a case, but it does not work, because the retry loop reaches the retry limit too quickly, before the page settles down to a handlable state. Let get_any_page() call shake_page() to fix it. [naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached] Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
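A minimal sketch of the retry tail in get_any_page() after this change (the pass counter and retry label follow the existing retry code; exact placement and limit are assumptions):

    if (pass++ < 3) {
        shake_page(p, 1);   /* flush LRU pagevecs etc. so the page
                               settles into a handlable state */
        goto try_again;
    }
    ret = -EIO;             /* retry limit reached, per the note above */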
2021-07-12  mm: Fix comments mentioning i_mutex  [Jan Kara]  1 file, -1/+1
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix comments still mentioning i_mutex. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-07-01  mm: fix spelling mistakes  [Zhen Lei]  1 file, -1/+1
Fix some spelling mistakes in comments: each having differents usage ==> each has a different usage statments ==> statements adresses ==> addresses aggresive ==> aggressive datas ==> data posion ==> poison higer ==> higher precisly ==> precisely wont ==> won't We moves tha ==> We move the endianess ==> endianness Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01  mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC  [Hugh Dickins]  1 file, -1/+1
TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly before the page is accounted as unmapped. It is unlikely to coincide with hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would do well to use it. Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01  mm: rmap: make try_to_unmap() void function  [Yang Shi]  1 file, -8/+7
Currently try_to_unmap() returns a bool value obtained by checking page_mapcount(); however, this may produce false positives, since page_mapcount() doesn't check all subpages of a compound page. total_mapcount() could be used instead, but its cost is higher since it traverses all subpages. Actually, most callers of try_to_unmap() don't care about the return value at all. So we just need to check whether the page is still mapped, via page_mapped(), when necessary. And page_mapped() does bail out early when it finds a mapped subpage. Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01  mm/hwpoison: disable pcp for page_handle_poison()  [Naoya Horiguchi]  1 file, -3/+16
Recent changes by the patch "mm/page_alloc: allow high-order pages to be stored on the per-cpu lists" make the kernel determine whether to use pcp via pcp_allowed_order(), which breaks soft-offline for hugetlb pages. Soft-offline dissolves a migration source page, then removes it from the buddy free list, so it is assumed that any subpage of the soft-offlined hugepage is recognized as a buddy page just after returning from dissolve_free_huge_page(). pcp_allowed_order() returns true for hugetlb, so this assumption is no longer true. So disable pcp during dissolve_free_huge_page() and take_page_off_buddy() to prevent soft-offlined hugepages from being linked into pcp lists. Soft-offline should not be a common event, so the impact on performance should be minimal. And I think that the optimization of Mel's patch could benefit hugetlb, so zone_pcp_disable() is called only in the hwpoison context. Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm,hwpoison: make get_hwpoison_page() call get_any_page()  [Naoya Horiguchi]  1 file, -85/+109
__get_hwpoison_page() could fail to grab a refcount due to some race condition, so it's helpful if we can handle it by retrying. We already have retry logic, so make get_hwpoison_page() call get_any_page() when called from memory_failure(). As a result, get_hwpoison_page() can return negative values (i.e. error codes), so some callers are also changed to handle error cases. soft_offline_page() does nothing for -EBUSY because that's enough, and users in userspace can easily handle it. unpoison_memory() is also unchanged because it's broken and needs thorough fixes (to be done later). Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm,hwpoison: send SIGBUS with error virutal address  [Naoya Horiguchi]  1 file, -1/+149
Now an action required MCE on an already hwpoisoned address surely sends a SIGBUS to the current process, but the SIGBUS doesn't convey the error virtual address. That's not optimal for hwpoison-aware applications. To fix the issue, make memory_failure() call kill_accessing_process(), which does a pagetable walk to find the error virtual address. It could find multiple virtual addresses for the same error page, and it seems hard to tell which virtual address is the correct one. But that's rare, and sending an incorrect virtual address could be better than no address. So let's report the first found virtual address for now. [naoya.horiguchi@nec.com: fix walk_page_range() return] Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jue Wang <juew@google.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
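A condensed sketch of what such a best-effort walk could look like (hwp_walk, hwp_walk_ops and the tk bookkeeping are assumed helper names for illustration, not verbatim kernel API; kill_proc() is the existing helper mentioned elsewhere in this log):

    static int kill_accessing_process(struct task_struct *p,
                                      unsigned long pfn, int flags)
    {
        struct hwp_walk hwp = { .pfn = pfn };  /* pte callback compares
                                                  pte_pfn() against .pfn */
        int ret;

        mmap_read_lock(p->mm);
        ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops, &hwp);
        mmap_read_unlock(p->mm);

        if (ret == 1 && hwp.tk.addr)           /* first hit wins */
            kill_proc(&hwp.tk, pfn, flags);    /* SIGBUS with the address */
        return ret > 0 ? -EHWPOISON : -EFAULT;
    }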
2021-06-25  mm/hwpoison: do not lock page again when me_huge_page() successfully recovers  [Naoya Horiguchi]  1 file, -14/+30
Currently me_huge_page() temporarily unlocks the page to perform some actions, then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that the page allocation code detects this page lock on a buddy page and prints out a "BUG: Bad page state" message. check_new_page_bad() does not consider a page with __PG_HWPOISON a bad page, so this flag works as a kind of filter, but the filtering doesn't work in this case because the "bad page" is not the actual hwpoisoned page. So stop locking the page again. Actions to be taken depend on the page type of the error, so page unlocking should be done in the ->action() callbacks. Make that the assumed behavior and change all existing callbacks accordingly. Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-25  mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned  [Aili Yao]  1 file, -1/+2
When memory_failure() is called with MF_ACTION_REQUIRED on a page that has already been hwpoisoned, memory_failure() can fail to send SIGBUS to the affected process, which results in an infinite loop of MCEs. Currently memory_failure() returns 0 if it is called for an already hwpoisoned page, and then the caller, kill_me_maybe(), can return without sending SIGBUS to the current process. An action required MCE is raised when the current process accesses the broken memory, so no SIGBUS means that the current process continues to run and accesses the error page again soon, running into the MCE loop. This issue can arise for example in the following scenarios:

- Two or more threads access the poisoned page concurrently. If local MCE is enabled, the MCE handler handles the MCE events independently. So there's a race among MCE events, and the second or later threads fall into the situation in question.
- If there was a preceding memory error event and memory_failure() for that event failed to unmap the error page for some reason, a subsequent memory access to the error page triggers the MCE loop situation.

To fix the issue, make memory_failure() return an error code when the error page has already been hwpoisoned. This allows the memory error handler to control how it sends signals to userspace. And make sure that any process touching a hwpoisoned page gets a SIGBUS even in the "already hwpoisoned" path of memory_failure(), as is done in the page fault path. Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
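Roughly, the "already hwpoisoned" path then composes with the kill_accessing_process() patch above like this sketch (the label name is an assumption):

    if (TestSetPageHWPoison(p)) {
        pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
        res = -EHWPOISON;   /* tell the caller this is not a fresh error */
        if (flags & MF_ACTION_REQUIRED)
            /* still SIGBUS the toucher, with its virtual address */
            res = kill_accessing_process(current, pfn, flags);
        goto unlock_mutex;
    }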
2021-06-25  mm/memory-failure: use a mutex to avoid memory_failure() races  [Tony Luck]  1 file, -13/+23
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5. I wrote this patchset to materialize what I think is the current allowable solution mentioned by the previous discussion [1]. I simply borrowed Tony's mutex patch and Aili's return code patch, then I queued another one to find error virtual address in the best effort manner. I know that this is not a perfect solution, but should work for some typical case. [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ This patch (of 2): There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. [akpm@linux-foundation.org: make mf_mutex local to memory_failure()] Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16  mm/memory-failure: make sure wait for page writeback in memory_failure  [yangerkun]  1 file, -1/+6
Our syzkaller triggered the "BUG_ON(!list_empty(&inode->i_wb_list))" in clear_inode():

  kernel BUG at fs/inode.c:519!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in:
  Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
  CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
  Hardware name: linux,dummy-virt (DT)
  pstate: 80000005 (Nzcv daif -PAN -UAO)
  pc : clear_inode+0x280/0x2a8
  lr : clear_inode+0x280/0x2a8
  Call trace:
   clear_inode+0x280/0x2a8
   ext4_clear_inode+0x38/0xe8
   ext4_free_inode+0x130/0xc68
   ext4_evict_inode+0xb20/0xcb8
   evict+0x1a8/0x3c0
   iput+0x344/0x460
   do_unlinkat+0x260/0x410
   __arm64_sys_unlinkat+0x6c/0xc0
   el0_svc_common+0xdc/0x3b0
   el0_svc_handler+0xf8/0x160
   el0_svc+0x10/0x218
  Kernel panic - not syncing: Fatal exception

A crash dump of this problem shows that someone called __munlock_pagevec() to clear the page's LRU flag without lock_page: do_mmap -> mmap_region -> do_munmap -> munlock_vma_pages_range -> __munlock_pagevec. As a result, memory_failure() calls identify_page_state() without wait_on_page_writeback(). And after truncate_error_page() clears the mapping of this page, end_page_writeback() won't call sb_clear_inode_writeback() to clear inode->i_wb_list. That triggers the BUG_ON in clear_inode()! Fix it by also checking PageWriteback to help determine whether we should skip wait_on_page_writeback(). Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com Fixes: 0bc1f8b0682c ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU") Signed-off-by: yangerkun <yangerkun@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
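A sketch of the strengthened condition in memory_failure() (exact guard shape is an assumption based on the description):

    /*
     * A page can be off the LRU yet still under writeback (see the
     * __munlock_pagevec() race above), so only skip the writeback
     * wait when writeback is not in flight either.
     */
    if (!PageTransTail(p) && !PageLRU(p) && !PageWriteback(p))
        goto identify_page_state;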
2021-06-16  mm,hwpoison: fix race with hugetlb page allocation  [Naoya Horiguchi]  1 file, -2/+27
When a hugetlb page fault (under an overcommitting situation) races with memory_failure(), VM_BUG_ON_PAGE() is triggered as follows:

  CPU0:                                  CPU1:

  gather_surplus_pages()
    page = alloc_surplus_huge_page()
                                         memory_failure_hugetlb()
                                           get_hwpoison_page(page)
                                             __get_hwpoison_page(page)
                                               get_page_unless_zero(page)
    zero = put_page_testzero(page)
    VM_BUG_ON_PAGE(!zero, page)
    enqueue_huge_page(h, page)
                                           put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an additional one for memory error handling, which is not enough, because there's a time window where compound pages have a non-zero refcount during hugetlb page initialization. So make __get_hwpoison_page() check the page status a bit more for hugetlb pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags under hugetlb_lock makes sure that the hugetlb page is not in a transient state. It's notable that another new function, HWPoisonHandlable(), is helpful in preventing a race against other transient page states (like a generic compound page just before PageHuge becomes true). Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
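The hugetlb-side helper plausibly looks like the following sketch, where HPageFreed()/HPageMigratable() are the hugetlb-specific flags checked under hugetlb_lock (a hedged reconstruction, not a verbatim quote of the patch):

    int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
    {
        int ret = 0;

        spin_lock(&hugetlb_lock);
        if (PageHeadHuge(page)) {
            *hugetlb = true;
            /* only pin pages in a stable hugetlb state */
            if (HPageFreed(page) || HPageMigratable(page))
                ret = get_page_unless_zero(page);
        }
        spin_unlock(&hugetlb_lock);
        return ret;
    }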
2021-05-07  mm: fix typos in comments  [Ingo Molnar]  1 file, -1/+1
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm/memory-failure: unnecessary amount of unmapping  [Jane Chu]  1 file, -1/+1
It appears that unmap_mapping_range() actually takes a 'size' as its third argument rather than a location; the current calling fashion causes an unnecessary amount of unmapping to occur. Link: https://lkml.kernel.org/r/20210420002821.2749748-1-jane.chu@oracle.com Fixes: 6100e34b2526e ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages") Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
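A hedged sketch of the corrected call (the start/size variable names are assumptions about the surrounding dev_pagemap handler):

    loff_t start = (loff_t)page->index << PAGE_SHIFT;  /* byte offset of page */
    unsigned long size = page_size(page);              /* length to unmap */

    /* the third argument is a length ('holelen'), not an end offset */
    unmap_mapping_range(page->mapping, start, size, 0);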
2021-02-26  mm: fix memory_failure() handling of dax-namespace metadata  [Dan Williams]  1 file, -0/+6
Given that 'struct dev_pagemap' spans both data pages and metadata pages, be careful to consult the altmap, if present, to delineate metadata. In fact the pfn_first() helper already identifies the first valid data pfn, so export that helper for other code paths via pgmap_pfn_valid(). Other usages of get_dev_pagemap() are not a concern because those operate on known data pfns that were looked up by get_user_pages(). I.e. metadata pfns are never user mapped. Link: https://lkml.kernel.org/r/161058501758.1840162.4239831989762604527.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25  mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events  [Aili Yao]  1 file, -15/+19
When an uncorrected memory error is triggered by a process that accessed the address with the error, it is an Action Required case only for the current process that triggered it; for other processes sharing the same page, the event is effectively Action Optional. Usually killing the current process is sufficient, and other processes sharing the same page will be signaled when they actually touch the poisoned page. But there is another scenario: other processes sharing the same page may want to be signaled early, with PF_MCE_EARLY set. In that case, we should put them on the kill list and signal BUS_MCEERR_AO to them. So in this patch, task_early_kill() checks the current process if force_early is set, and if the task is not current, the code falls back to find_early_kill_thread() to check whether there is a PF_MCE_EARLY process that cares about the error. In kill_proc(), BUS_MCEERR_AR is only sent to current; other processes on the kill list are signaled with BUS_MCEERR_AO. Link: https://lkml.kernel.org/r/20210122132424.313c8f5f.yaoaili@kingsoft.com Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  mm: fix page reference leak in soft_offline_page()  [Dan Williams]  1 file, -4/+16
The conversion that moved pfn_to_online_page() internal to soft_offline_page() missed that the get_user_pages() reference taken by the madvise() path needs to be dropped when pfn_to_online_page() fails. Note that the direct sysfs path to soft_offline_page() does not perform a get_user_pages() lookup. When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page() pfn, the kernel hangs at dax-device shutdown due to the leaked reference. Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-13  mm,hwpoison: fix printing of page flags  [Oscar Salvador]  1 file, -1/+1
Format %pG expects a lower case 'p' in order to print the flags. Fix it. Link: https://lkml.kernel.org/r/20210108085202.4506-1-osalvador@suse.de Fixes: 8295d535e2aa ("mm,hwpoison: refactor get_any_page") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
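For reference, page flags print with the %pGp printk extension (the lower-case 'p' suffix selects page flags); a sketch with an illustrative message:

    /* %pGp (not bare %pG) decodes the page->flags bitmap */
    pr_info("soft offline: %#lx: unhandlable page, flags %pGp\n",
            pfn, &page->flags);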
2020-12-15  mm,hwpoison: return -EBUSY when migration fails  [Oscar Salvador]  1 file, -3/+3
Currently, we return -EIO when we fail to migrate the page. Migration failures are rather transient, as they can happen for several reasons, e.g. a high page refcount bump, mapping->migrate_page failing, etc. All of these mean that at that time the page could not be migrated, but that has nothing to do with an EIO error. Let us return -EBUSY instead, as we do in case we failed to isolate the page. While at it, let us remove the "ret" print, as its value does not change. Link: https://lkml.kernel.org/r/20201209092818.30417-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm,memory_failure: always pin the page in madvise_inject_error  [Oscar Salvador]  1 file, -0/+6
madvise_inject_error() uses get_user_pages_fast to translate the address we specified to a page. After [1], we drop the extra reference count for the memory_failure() path. That commit says that memory_failure wanted to keep the pin in order to take the page out of circulation. The truth is that we need to keep the page pinned; otherwise the page might be re-used after the put_page() and we can end up messing with someone else's memory. E.g:

  CPU0 (process X)                 CPU1
  madvise_inject_error
    get_user_pages
    put_page
                                   page gets reclaimed
                                   process Y allocates the page
    memory_failure
    // We mess with process Y's memory

madvise() is meant to operate on the caller's own address space, so messing with pages that do not belong to us seems the wrong thing to do. To avoid that, let us keep the page pinned for memory_failure as well. Pages for DAX mappings will release this extra refcount in memory_failure_dev_pagemap. [1] ("23e7b5c2e271: mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm,hwpoison: remove drain_all_pages from shake_page  [Oscar Salvador]  1 file, -5/+2
get_hwpoison_page already drains pcplists, previously disabling them when trying to grab a refcount. We do not need shake_page to take care of it anymore. Link: https://lkml.kernel.org/r/20201204102558.31607-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Qian Cai <qcai@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm,hwpoison: disable pcplists before grabbing a refcount  [Oscar Salvador]  1 file, -67/+65
Currently, we have a sort of retry mechanism to make sure pages in pcp-lists are spilled to the buddy system, so we can handle those. We can save ourselves these extra checks with the new disable-pcplist mechanism that is available with [1]. zone_pcplist_disable makes sure to 1) disable pcplists, so any page that is freed up from that point onwards will end up in the buddy system, and 2) drain pcplists, so those pages that are already in pcplists are spilled to buddy. With that, we can make a common entry point for grabbing a refcount from both the soft_offline and memory_failure paths, guarded by zone_pcplist_disable/zone_pcplist_enable. [1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/ Link: https://lkml.kernel.org/r/20201204102558.31607-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Qian Cai <qcai@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm,hwpoison: refactor get_any_page  [Oscar Salvador]  1 file, -57/+42
Patch series "HWPoison: Refactor get page interface", v2. This patch (of 3): When we want to grab a refcount via get_any_page, we call __get_any_page that calls get_hwpoison_page to get the actual refcount. get_any_page() is only there because we have a sort of retry mechanism in case the page we met is unknown to us or if we raced with an allocation. Also __get_any_page() prints some messages about the page type in case the page was a free page or the page type was unknown, but if anything, we only need to print a message in case the pagetype was unknown, as that is reporting an error down the chain. Let us merge get_any_page() and __get_any_page(), and let the message be printed in soft_offline_page. While we are it, we can also remove the 'pfn' parameter as it is no longer used. Link: https://lkml.kernel.org/r/20201204102558.31607-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20201204102558.31607-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <Vbabka@suse.cz> Cc: Qian Cai <qcai@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm,hwpoison: take free pages off the buddy freelists  [Oscar Salvador]  1 file, -16/+30
The crux of the matter is that historically we left poisoned pages in the buddy system, because we have some checks in place when allocating a page that act as gatekeepers for poisoned pages. Unfortunately, we do have other users (e.g: compaction [1]) that scan buddy freelists and try to get a page from there without checking whether the page is HWPoison. As I stated already, I think it is fundamentally wrong to keep HWPoison pages within the buddy system, checks in place or not. Let us fix this the same way we did for soft_offline [2], taking the page off the buddy freelist so it is completely unreachable. Note that this is fairly simple to trigger, as we only need to poison free buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress on the system. Just for reference, I put a dump_page() in compaction_alloc() to trigger for HWPoison pages:

  page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
  flags: 0xfffffc0800000(hwpoison)
  raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
  raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: compaction_alloc
  CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
   dump_stack+0x6d/0x8b
   compaction_alloc+0xb2/0xc0
   migrate_pages+0x2a6/0x12a0
   compact_zone+0x5eb/0x11c0
   proactive_compact_node+0x89/0xf0
   kcompactd+0x2d0/0x3a0
   kthread+0x118/0x130
   ret_from_fork+0x22/0x30

After that, if e.g: a process faults in the page, it will get killed unexpectedly. Fix it by containing the page immediately. Besides that, two more changes can be noticed:

* MF_DELAYED no longer suits, as we are fixing the issue by containing the page immediately, so it no longer relies on the allocation-time checks to stop a HWPoison page from being handed over again unless it is unpoisoned; so we fixed the situation. Because of that, let us use MF_RECOVERED from now on.
* The second block that handles PageBuddy pages is no longer needed: we call shake_page and then check whether the page is Buddy, because shake_page calls drain_all_pages, which sends pcp-pages back to the buddy freelists, so we could have a chance to handle free pages. Currently, get_hwpoison_page already calls drain_all_pages, and we call get_hwpoison_page right before coming here, so we should be on the safe side.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u [2] https://patchwork.kernel.org/cover/11792607/ [osalvador@suse.de: take the poisoned subpage off the buddy frelists] Link: https://lkml.kernel.org/r/20201013144447.6706-4-osalvador@suse.de Link: https://lkml.kernel.org/r/20201013144447.6706-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount pageOscar Salvador1-2/+22
Patch series "HWpoison: further fixes and cleanups", v5. This patchset includes some more fixes and a cleanup. Patch#2 and patch#3 are both fixes for taking a HWpoison page off a buddy freelist, since having them there has proved to be bad (see [1] and pathch#2's commit log). Patch#3 does the same for hugetlb pages. [1] https://lkml.org/lkml/2020/9/22/565 This patch (of 4): A page with 0-refcount and !PageBuddy could perfectly be a pcppage. Currently, we bail out with an error if we encounter such a page, meaning that we do not handle pcppages neither from hard-offline nor from soft-offline path. Fix this by draining pcplists whenever we find this kind of page and retry the check again. It might be that pcplists have been spilled into the buddy allocator and so we can handle it. Link: https://lkml.kernel.org/r/20201013144447.6706-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20201013144447.6706-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/rmap: always do TTU_IGNORE_ACCESSShakeel Butt1-1/+1
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS), because the page is unmapped from the secondary MMU's page table before the check. More specifically, this affects secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start(), like kvm.

However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table access check before trying to unmap the page. So, at worst, reclaim will miss accesses in a very short window if we remove the page table access check from the unmapping code.

There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg reclaim: in memcg reclaim, page_referenced() only accounts accesses from processes in the same memcg as the target page, but the unmapping code considers accesses from all processes, decreasing the effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in the unmapping code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
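For context, the access check being dropped lived in the page walk of try_to_unmap_one(); the fragment below is a reconstruction from mm/rmap.c of that era, not a quote of the diff:

    /* Reconstructed pre-patch logic: without TTU_IGNORE_ACCESS, a
     * recently referenced page made the unmap bail out. Always
     * assuming the flag means this whole block goes away. */
    if (!(flags & TTU_IGNORE_ACCESS)) {
        if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
            ret = false;
            page_vma_mapped_walk_done(&pvmw);
            break;
        }
    }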
2020-11-14hugetlbfs: fix anon huge page migration raceMike Kravetz1-19/+17
Qian Cai reported the following BUG in [1]:

LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
 rmap_walk_anon+0x141/0xa30
 rmap_walk_anon at mm/rmap.c:1864
 try_to_unmap+0x209/0x2d0
 try_to_unmap at mm/rmap.c:1763
 migrate_pages+0x1005/0x1fb0
 move_pages_and_store_status.isra.47+0xd7/0x1a0
 __x64_sys_move_pages+0xa5c/0x1100
 do_syscall_64+0x5f/0x310
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for anon pages, as the anon_vma_lock should be held in this case. Further analysis suggested that i_mmap_rwsem was not required to be held at all when calling try_to_unmap() for anon pages, as an anon page could never be part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write() to drop the page lock and acquire i_mmap_rwsem is wrong: there is no way to keep the mapping valid while dropping the page lock.

This patch does the following:

- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when calling try_to_unmap().

- Remove the hacky code in hugetlb_page_mapping_lock_write(). The routine will now simply do a 'trylock' while still holding the page lock. If the trylock fails, it will return NULL. This could impact the callers:
  - migration calling code will receive -EAGAIN and retry up to the hard coded limit (10).
  - memory error code will treat the page as BUSY. This will force killing (SIGKILL) of any mapping tasks instead of SIGBUS.

Do note that this change in behavior only happens when there is a race. None of the standard kernel testing suites actually hit this race, but it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
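A sketch of the simplified routine after this change (a reconstruction, not a quote of the patch): no page-lock dropping, just a trylock on the rwsem:

    struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
    {
        struct address_space *mapping = page_mapping(hpage);

        if (!mapping)
            return mapping;

        /* Caller still holds the page lock; only trylock the rwsem. */
        if (i_mmap_trylock_write(mapping))
            return mapping;

        return NULL;    /* callers retry (-EAGAIN) or treat the page as busy */
    }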
2020-10-18mm/memory-failure: remove a wrapper for alloc_migration_target()Joonsoo Kim1-12/+6
There is a well-defined standard migration target callback. Use it directly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
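The standard callback takes a struct migration_target_control; a sketch of how the soft-offline path can use it instead of a local wrapper (the gfp flags shown are an assumption typical of that era, not quoted from the patch):

    struct migration_target_control mtc = {
        .nid = NUMA_NO_NODE,
        .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
    };

    /* One generic allocator callback instead of new_page(). */
    ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
                        (unsigned long)&mtc, MIGRATE_SYNC,
                        MR_MEMORY_FAILURE);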
2020-10-16mm,hwpoison: try to narrow window race for free pagesOscar Salvador1-1/+6
Aristeu Rozanski reported that a customer test case started to report -EBUSY after the hwpoison rework patchset. There is a race window between spotting a free page and taking it off its buddy freelist, so it might be that, by the time we try to take it off, the page has already been allocated. This patch handles such a race window by retrying: if the page was allocated under us, we handle it according to its new page type.

Reported-by: Aristeu Rozanski <aris@ruivo.org>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Aristeu Rozanski <aris@ruivo.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-15-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
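A hedged sketch of the retry in the soft-offline entry point; names and return conventions are condensed from the era's memory-failure.c and are illustrative:

    static int soft_offline_page(unsigned long pfn, int flags)
    {
        struct page *page = pfn_to_page(pfn);
        bool try_again = true;
        int ret;

    retry:
        ret = get_any_page(page, pfn, flags);
        if (ret > 0) {
            /* The page got allocated under us: handle it as in-use. */
            ret = soft_offline_in_use_page(page);
        } else if (ret == 0) {
            /* Still free; if containing it fails, look again once. */
            if (soft_offline_free_page(page) && try_again) {
                try_again = false;
                goto retry;
            }
        }
        return ret;
    }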
2020-10-16mm,hwpoison: double-check page count in __get_any_page()Naoya Horiguchi1-0/+6
Soft offlining could fail with EIO due to a race condition with hugepage migration. This issue became visible due to the change in the previous patch, which makes the soft offline handler take the page refcount on its own. We have no way to directly pin a zero-refcount page, and a page considered to have zero refcount could be allocated just after the first check. This patch adds a second check to catch the race and give us a chance to handle it more reliably.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
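A condensed sketch of the double check in the failed-to-pin branch of __get_any_page(); the surrounding cases are simplified for illustration:

    if (!get_hwpoison_page(p)) {
        if (PageHuge(p) || is_free_buddy_page(p)) {
            ret = 0;        /* genuinely free, can be handled */
        } else if (page_count(p)) {
            /* Second check: the page was allocated just after the
             * first look, so we raced. Report it as retryable. */
            ret = -EBUSY;
        } else {
            ret = -EIO;     /* unknown zero-refcount page */
        }
    }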
2020-10-16mm,hwpoison: introduce MF_MSG_UNSPLIT_THPNaoya Horiguchi1-1/+4
memory_failure() is supposed to call action_result() when it handles a memory error event, but there's one missing case. So let's add it. I find that include/ras/ras_event.h has some other MF_MSG_* undefined, so this patch also adds them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-13-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
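A sketch of the newly covered case in memory_failure() (a reconstruction of the era's code, not a quote of the diff): a THP that cannot be split now gets its own action_result() message:

    if (PageTransHuge(hpage)) {
        if (try_to_split_thp_page(p, "Memory Failure") < 0) {
            /* New: report the unsplittable THP instead of
             * returning silently. */
            action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
            return -EBUSY;
        }
    }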
2020-10-16mm,hwpoison: return 0 if the page is already poisoned in soft-offlineOscar Salvador1-2/+2
Currently, there is an inconsistency when calling soft-offline from different paths on a page that is already poisoned:

1) madvise: madvise_inject_error() skips any poisoned page and continues the loop. If that was the only page to madvise, it returns 0.

2) /sys/devices/system/memory/: When calling soft_offline_page_store()->soft_offline_page(), we return -EBUSY in case the page is already poisoned. This is inconsistent with a) the above example and b) memory_failure, where we return 0 if the page was poisoned.

Fix this by dropping the PageHWPoison() check in madvise_inject_error(), and let soft_offline_page() return 0 if it finds the page already poisoned.

Please note that this represents a user-api change, since now the error returned when calling soft_offline_page_store()->soft_offline_page() will be different.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
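A sketch of the new early exit in soft_offline_page() (reconstructed; the message text is an assumption): an already-poisoned page is now reported as success, matching memory_failure():

    if (PageHWPoison(page)) {
        /* Already handled; agree with memory_failure() and return 0. */
        pr_info("soft offline: %#lx page already poisoned\n", pfn);
        return 0;
    }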