Age | Commit message | Author | Files | Lines
2022-09-12 | fs/buffer: remove __breadahead_gfp() | Zhang Yi | 2 | -19/+0
Patch series "fs/buffer: remove ll_rw_block()", v2. ll_rw_block() will skip locked buffer before submitting IO, it assumes that locked buffer means it is under IO. This assumption is not always true because we cannot guarantee every buffer lock path would submit IO. After commit 88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages"), buffer_migrate_folio_norefs() becomes one exceptional case, and there may be others. So ll_rw_block() is not safe on the sync read path, we could get false positive EIO return value when filesystem reading metadata. It seems that it could be only used on the readahead path. Unfortunately, many filesystem misuse the ll_rw_block() on the sync read path. This patch set just remove ll_rw_block() and add new friendly helpers, which could prevent false positive EIO on the read metadata path. Thanks for the suggestion from Jan, the original discussion is at [1]. patch 1: remove unused helpers in fs/buffer.c patch 2: add new bh_read_[*] helpers patch 3-11: remove all ll_rw_block() calls in filesystems patch 12-14: do some leftover cleanups. [1]. https://lore.kernel.org/linux-mm/20220825080146.2021641-1-chengzhihao1@huawei.com/ This patch (of 14): No one use __breadahead_gfp() and sb_breadahead_unmovable() any more, remove them. Link: https://lkml.kernel.org/r/20220901133505.2510834-1-yi.zhang@huawei.com Link: https://lkml.kernel.org/r/20220901133505.2510834-2-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Heming Zhao <ocfs2-devel@oss.oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/vmalloc: extend find_vmap_lowest_match_check with extra arguments | Song Liu | 1 | -6/+7
find_vmap_lowest_match() is now able to handle different roots. With DEBUG_AUGMENT_LOWEST_MATCH_CHECK enabled as:

: --- a/mm/vmalloc.c
: +++ b/mm/vmalloc.c
: @@ -713,7 +713,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
:  /*** Global kva allocator ***/
:
: -#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
: +#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 1

compilation failed as:

  mm/vmalloc.c: In function 'find_vmap_lowest_match_check':
  mm/vmalloc.c:1328:32: warning: passing argument 1 of 'find_vmap_lowest_match' makes pointer from integer without a cast [-Wint-conversion]
   1328 |         va_1 = find_vmap_lowest_match(size, align, vstart, false);
        |                                       ^~~~
        |                                       |
        |                                       long unsigned int
  mm/vmalloc.c:1236:40: note: expected 'struct rb_root *' but argument is of type 'long unsigned int'
   1236 | find_vmap_lowest_match(struct rb_root *root, unsigned long size,
        |                        ~~~~~~~~~~~~~~~~^~~~
  mm/vmalloc.c:1328:9: error: too few arguments to function 'find_vmap_lowest_match'
   1328 |         va_1 = find_vmap_lowest_match(size, align, vstart, false);
        |                ^~~~~~~~~~~~~~~~~~~~~~
  mm/vmalloc.c:1236:1: note: declared here
   1236 | find_vmap_lowest_match(struct rb_root *root, unsigned long size,
        | ^~~~~~~~~~~~~~~~~~~~~~

Extend find_vmap_lowest_match_check() and find_vmap_lowest_linear_match() with extra arguments to fix this.

Link: https://lkml.kernel.org/r/20220906060548.1127396-1-song@kernel.org
Link: https://lkml.kernel.org/r/20220831052734.3423079-1-song@kernel.org
Fixes: f9863be49312 ("mm/vmalloc: extend __alloc_vmap_area() with extra arguments")
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/migrate_device.c: fix a misleading and outdated comment | Alistair Popple | 1 | -3/+10
Commit ab09243aa95a ("mm/migrate.c: remove MIGRATE_PFN_LOCKED") changed the way trylock_page() in migrate_vma_collect_pmd() works without updating the comment. Reword the comment to be less misleading and a better reflection of what happens. Link: https://lkml.kernel.org/r/20220830020138.497063-1-apopple@nvidia.com Fixes: ab09243aa95a ("mm/migrate.c: remove MIGRATE_PFN_LOCKED") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Peter Xu <peterx@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/page_alloc.c: delete a redundant parameter of rmqueue_pcplist | zezuo | 1 | -3/+2
The gfp_flags parameter is not used in rmqueue_pcplist, so directly delete this parameter. Link: https://lkml.kernel.org/r/20220831013404.3360714-1-zuoze1@huawei.com Signed-off-by: zezuo <zuoze1@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/damon: get the hotness from damon_hot_score() in damon_pageout_score() | Kaixu Xia | 1 | -40/+6
We can get the hotness value from damon_hot_score() directly in damon_pageout_score() function and improve the code readability. Link: https://lkml.kernel.org/r/1661766366-20998-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/thp: remove redundant CONFIG_TRANSPARENT_HUGEPAGE | Liu Shixin | 1 | -2/+1
Simplify the code by removing the redundant CONFIG_TRANSPARENT_HUGEPAGE check. No functional change.

Link: https://lkml.kernel.org/r/20220829095125.3284567-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/thp: simplify has_transparent_hugepage by using IS_BUILTIN | Liu Shixin | 1 | -5/+1
Simplify the has_transparent_hugepage() definition by using IS_BUILTIN(). No functional change.

Link: https://lkml.kernel.org/r/20220829095709.3287462-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/damon/vaddr: remove comparison between mm and last_mm when checking region accesses | Kaixu Xia | 1 | -5/+6
The damon regions that belong to the same damon target share the same 'struct mm_struct *mm', so it is unnecessary to compare the mm and last_mm objects among the regions of one damon target when checking accesses. The check is only needed when the target changes in '__damon_va_check_accesses()', so we can simplify the whole operation by using the bool 'same_target' to indicate whether the target has changed.

Link: https://lkml.kernel.org/r/1661590971-20893-3-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/damon: simplify the parameter passing for 'check_accesses' | Kaixu Xia | 2 | -6/+5
Patch series "mm/damon: Simplify the damon regions access check", v2. This patchset simplifies the operations when checking the damon regions accesses. This patch (of 2): The parameter 'struct damon_ctx *ctx' isn't used in the functions __damon_{p,v}a_check_access(), so we can remove it and simplify the parameter passing. Link: https://lkml.kernel.org/r/1661590971-20893-1-git-send-email-kaixuxia@tencent.com Link: https://lkml.kernel.org/r/1661590971-20893-2-git-send-email-kaixuxia@tencent.com Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: fix null-ptr-deref in kswapd_is_running() | Kefeng Wang | 6 | -15/+55
kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with kswapd_is_running() in kcompactd():

 kswapd_run/stop()                    kcompactd()
                                        kswapd_is_running()
   pgdat->kswapd // error or normal ptr
                                          verify pgdat->kswapd // load non-NULL pgdat->kswapd
   pgdat->kswapd = NULL
                                          task_is_running(pgdat->kswapd)
                                            // NULL pointer dereference

KASAN reports the null-ptr-deref shown below:

 vmscan: Failed to start kswapd on node 0
 ...
 BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
 Read of size 8 at addr 0000000000000024 by task kcompactd0/37

 CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G OE 5.10.60 #1
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  dump_backtrace+0x0/0x394
  show_stack+0x34/0x4c
  dump_stack+0x158/0x1e4
  __kasan_report+0x138/0x140
  kasan_report+0x44/0xdc
  __asan_load8+0x94/0xd0
  kcompactd+0x440/0x504
  kthread+0x1a4/0x1f0
  ret_from_fork+0x10/0x18

At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected by mem_hotplug_begin/done(), but kcompactd() itself is not. There is no need to involve the memory hotplug lock in kcompactd(), so let's add a new mutex to protect pgdat->kswapd accesses. Also, because the kcompactd task checks the state of the kswapd task, it is better to call kcompactd_stop() before kswapd_stop() to reduce lock conflicts.

[akpm@linux-foundation.org: add comments]
Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
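For illustration, a minimal sketch of the locking discipline described in the entry above; the mutex and function names here are placeholders, not the identifiers added by the patch:

  #include <linux/mmzone.h>
  #include <linux/mutex.h>
  #include <linux/sched.h>

  static DEFINE_MUTEX(example_kswapd_lock);     /* hypothetical name */

  static bool example_kswapd_is_running(pg_data_t *pgdat)
  {
          bool running;

          mutex_lock(&example_kswapd_lock);
          /*
           * kswapd_run()/kswapd_stop() would update pgdat->kswapd under the
           * same mutex, so the pointer cannot be cleared or freed between
           * the NULL check and the task_is_running() call.
           */
          running = pgdat->kswapd && task_is_running(pgdat->kswapd);
          mutex_unlock(&example_kswapd_lock);

          return running;
  }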
2022-09-12 | mm: kill is_memblock_offlined() | Kefeng Wang | 3 | -10/+1
Directly check the state of struct memory_block; there is no need for a dedicated helper.

Link: https://lkml.kernel.org/r/20220827112043.187028-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | filemap: remove find_get_pages_contig() | Vishal Moola (Oracle) | 2 | -62/+0
All callers of find_get_pages_contig() have been removed, so it is no longer needed. Link: https://lkml.kernel.org/r/20220824004023.77310-8-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: David Sterba <dsterb@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | ramfs: convert ramfs_nommu_get_unmapped_area() to use filemap_get_folios_contig() | Vishal Moola (Oracle) | 1 | -21/+29
Convert to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios.

The initial version of this function set the page_address to be returned after finishing all the checks. Since folio_batches have a maximum of 15 folios, the function had to be modified to support getting and checking up to lpages, 15 pages at a time, while still returning the initial page address. Now the function sets ret as soon as the first batch arrives, and updates it only if a check fails.

The physical adjacency check uses the page frame numbers: the page frame number of each folio must be nr_pages away from the first folio.

Link: https://lkml.kernel.org/r/20220824004023.77310-7-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
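As an illustration of the pfn-based adjacency check described in the entry above, a hedged sketch; the function and variable names are assumptions, not copied from ramfs_nommu_get_unmapped_area():

  #include <linux/mm.h>
  #include <linux/pagevec.h>

  /* Return true if every folio in the batch is physically contiguous
   * starting at first_pfn. */
  static bool folios_physically_contig(struct folio_batch *fbatch,
                                       unsigned long first_pfn)
  {
          unsigned long nr_pages = 0;
          unsigned int i;

          for (i = 0; i < folio_batch_count(fbatch); i++) {
                  struct folio *folio = fbatch->folios[i];

                  if (folio_pfn(folio) != first_pfn + nr_pages)
                          return false;   /* physically non-adjacent */
                  nr_pages += folio_nr_pages(folio);  /* large folios span several pages */
          }
          return true;
  }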
2022-09-12 | nilfs2: convert nilfs_find_uncommited_extent() to use filemap_get_folios_contig() | Vishal Moola (Oracle) | 1 | -27/+18
Convert the function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios.

Also clean up an unnecessary if statement - pvec.pages[0]->index > index will always evaluate to false, and filemap_get_folios_contig() returns 0 if there is no folio found at index.

Link: https://lkml.kernel.org/r/20220824004023.77310-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | btrfs: convert process_page_range() to use filemap_get_folios_contig() | Vishal Moola (Oracle) | 2 | -16/+18
Converted function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Also minor comment renaming for consistency in subpage. Link: https://lkml.kernel.org/r/20220824004023.77310-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | btrfs: convert end_compressed_writeback() to use filemap_get_folios() | Vishal Moola (Oracle) | 1 | -16/+15
Converted function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Also this function does not care about the pages being contiguous so we can just use filemap_get_folios() to be more efficient. Link: https://lkml.kernel.org/r/20220824004023.77310-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterb@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | btrfs: convert __process_pages_contig() to use filemap_get_folios_contig() | Vishal Moola (Oracle) | 1 | -18/+15
Convert to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Link: https://lkml.kernel.org/r/20220824004023.77310-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterb@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | filemap: add filemap_get_folios_contig() | Vishal Moola (Oracle) | 2 | -0/+75
Patch series "Convert to filemap_get_folios_contig()", v3. This patch series replaces find_get_pages_contig() with filemap_get_folios_contig(). This patch (of 7): This function is meant to replace find_get_pages_contig(). Unlike find_get_pages_contig(), filemap_get_folios_contig() no longer takes in a target number of pages to find - It returns up to 15 contiguous folios. To be more consistent with filemap_get_folios(), filemap_get_folios_contig() now also updates the start index passed in, and takes an end index. Link: https://lkml.kernel.org/r/20220824004023.77310-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20220824004023.77310-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: David Sterba <dsterba@suse.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Sterba <dsterb@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | zram: don't retry compress incompressible page | Alexey Romanov | 1 | -2/+12
It doesn't make sense for us to retry to compress an uncompressible page (comp_len == PAGE_SIZE) in zsmalloc slowpath, because we will be storing it uncompressed anyway. We can avoid wasting time on another compression attempt. It is enough to take lock (zcomp_stream_get) and execute the code below. Link: https://lkml.kernel.org/r/20220824113117.78849-1-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Signed-off-by: Dmitry Rokosov <ddrokosov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Dmitry Rokosov <DDRokosov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: backing-dev: Remove the unneeded result variable | ye xingchen | 1 | -5/+1
Return the value of cgwb_bdi_init() directly instead of storing it in a redundant local variable.

Link: https://lkml.kernel.org/r/20220826071906.252419-1-ye.xingchen@zte.com.cn
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | page_ext: introduce boot parameter 'early_page_ext' | Li Zhe | 5 | -1/+34
In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in page_ext_init""), we call page_ext_init() after page_alloc_init_late() to avoid a panic problem. As a result, we cannot track early page allocations in the current kernel even if the page structures have been initialized early.

This patch introduces a new boot parameter 'early_page_ext' to resolve this. If it is passed to the kernel, page_ext_init() is moved up and the feature 'deferred initialization of struct pages' is disabled, so the page allocator is initialized early and the panic problem above is prevented. This helps us catch early page allocations, and is especially useful when the free memory value is not the same right after booting across different kernels.

[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | s390/hugetlb: switch to generic version of follow_huge_pud() | Gerald Schaefer | 1 | -10/+0
When pud-sized hugepages were introduced for s390, the generic version of follow_huge_pud() was using pte_page() instead of pud_page(). This would be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use pmd_page() in follow_huge_pmd()"). Therefore, and probably because not all archs were supporting pud_page() at that time, a private version of follow_huge_pud() was added for s390, correctly using pud_page(). Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry"), the generic version of follow_huge_pud() is now also using pud_page(), and in general behaves similar to follow_huge_pmd(). Therefore we can now switch to the generic version and get rid of the s390-specific follow_huge_pud(). Link: https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Haiyue Wang <haiyue.wang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | memcg: increase MEMCG_CHARGE_BATCH to 64 | Shakeel Butt | 1 | -3/+4
For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger machines and network-intensive workloads requiring throughput in Gbps, 32 is too small and makes the memcg charging path a bottleneck. For now, increase it to 64 for easy acceptance into 6.0. We will need to revisit this in the future for the ever-increasing demand for higher performance.

Please note that the memcg charge path drains the per-cpu memcg charge stock, so there should not be any oom behavior change. It does, however, have an impact on rstat flushing and high limit reclaim backoff.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
 Without (6.0-rc1)  10482.7 Mbps
 With patch         17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: page_counter: rearrange struct page_counter fields | Shakeel Butt | 1 | -12/+23
With memcg v2 enabled, memcg->memory.usage is a very hot member for workloads doing memcg charging on multiple CPUs concurrently, particularly network-intensive workloads. In addition, there is false cache sharing between memory.usage and memory.high on the charge path. This patch moves the usage into a separate cacheline and moves all the read-mostly fields into a separate cacheline.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
 Without (6.0-rc1)  10482.7 Mbps
 With patch         12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is the increase in the size of struct mem_cgroup. For example, with this patch on a 64-bit build, the size of struct mem_cgroup increased from 4032 bytes to 4416 bytes. However, for the performance improvement, this additional size is worth it. In addition, there are opportunities to reduce the size of struct mem_cgroup, like deprecation of the kmem and tcpmem page counters and better packing.

Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
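A generic, hedged illustration of the layout idea from the entry above; this is not the struct from the patch, and the field names and padding mechanism are assumptions for the example:

  #include <linux/atomic.h>
  #include <linux/cache.h>

  struct counter_layout_sketch {
          /* the hot, frequently written field gets its own cacheline */
          atomic_long_t usage ____cacheline_aligned_in_smp;

          /*
           * Read-mostly limits start on a different cacheline, so charge-path
           * writes to 'usage' do not keep bouncing the line that holds them.
           */
          unsigned long min ____cacheline_aligned_in_smp;
          unsigned long low;
          unsigned long high;
          unsigned long max;
  };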
2022-09-12 | mm: page_counter: remove unneeded atomic ops for low/min | Shakeel Butt | 1 | -7/+6
Patch series "memcg: optimize charge codepath", v2. Recently Linux networking stack has moved from a very old per socket pre-charge caching to per-cpu caching to avoid pre-charge fragmentation and unwarranted OOMs. One impact of this change is that for network traffic workloads, memcg charging codepath can become a bottleneck. The kernel test robot has also reported this regression[1]. This patch series tries to improve the memcg charging for such workloads. This patch series implement three optimizations: (A) Reduce atomic ops in page counter update path. (B) Change layout of struct page_counter to eliminate false sharing between usage and high. (C) Increase the memcg charge batch to 64. To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran the following workload in root memcg and then compared with scenario where the workload is run in a three level of cgroup hierarchy with top level having min and low setup appropriately. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): 1. root memcg 21694.8 Mbps 2. 6.0-rc1 10482.7 Mbps (-51.6%) 3. 6.0-rc1 + (A) 14542.5 Mbps (-32.9%) 4. 6.0-rc1 + (B) 12413.7 Mbps (-42.7%) 5. 6.0-rc1 + (C) 17063.7 Mbps (-21.3%) 6. 6.0-rc1 + (A+B+C) 20120.3 Mbps (-7.2%) With all three optimizations, the memcg overhead of this workload has been reduced from 51.6% to just 7.2%. [1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/ This patch (of 3): For cgroups using low or min protections, the function propagate_protected_usage() was doing an atomic xchg() operation irrespectively. We can optimize out this atomic operation for one specific scenario where the workload is using the protection (i.e. min > 0) and the usage is above the protection (i.e. usage > min). This scenario is actually very common where the users want a part of their workload to be protected against the external reclaim. Though this optimization does introduce a race when the usage is around the protection and concurrent charges and uncharged trip it over or under the protection. In such cases, we might see lower effective protection but the subsequent charge/uncharge will correct it. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately to see if this optimization is effective for the mentioned case. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 14542.5 Mbps (38.7% improvement) With the patch, the throughput improved by 38.7% Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: remove EXPERIMENTAL flag for zswap | David Heidelberg | 1 | -7/+1
zswap has been with us since 2013, and it's widely used in many products. Link: https://lkml.kernel.org/r/20220823152033.66682-1-david@ixit.cz Signed-off-by: David Heidelberg <david@ixit.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | drivers/block/zram/zram_drv.c: do not keep dangling zcomp pointer after zram reset | Sergey Senozhatsky | 1 | -9/+4
We do all reset operations under write lock, so we don't need to save ->disksize and ->comp to stack variables. Another thing is that ->comp is freed during zram reset, but comp pointer is not NULL-ed, so zram keeps the freed pointer value.

Link: https://lkml.kernel.org/r/20220824035100.971816-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/gup.c: refactor check_and_migrate_movable_pages() | Alistair Popple | 1 | -68/+111
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is called to migrate pages out of zones which should not contain any longterm pinned pages. When migration succeeds all pages will have been unpinned so pinning needs to be retried. Migration can also fail, in which case the pages will also have been unpinned but the operation should not be retried. If all pages are in the correct zone nothing will be unpinned and no retry is required. The logic in check_and_migrate_movable_pages() tracks unnecessary state and the return codes for each case are difficult to follow. Refactor the code to clean this up. No behaviour change is intended. [akpm@linux-foundation.org: fix unused var warning] Link: https://lkml.kernel.org/r/19583d1df07fdcb99cfa05c265588a3fa58d1902.1661317396.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages() | Alistair Popple | 1 | -14/+9
gup_flags is passed to check_and_migrate_movable_pages() so that it can call either put_page() or unpin_user_page() to drop the page reference. However check_and_migrate_movable_pages() is only called for FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass gup_flags. Link: https://lkml.kernel.org/r/d611c65a9008ff55887307df457c6c2220ad6163.1661317396.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: skip retry when new limit is not below old one in page_counter_set_max | Bui Quang Minh | 1 | -1/+1
In page_counter_set_max, we want to make sure the new limit is not below the concurrently-changing counter value. We read the counter and check that the limit is not below the counter before the swap. After the swap, we read the counter again and retry in case the counter is incremented as this may violate the requirement. Even though the page_counter_try_charge can see the old limit, it is guaranteed that the counter is not above the old limit after the increment. So in case the new limit is not below the old limit, the counter is guaranteed to be not above the new limit too. We can skip the retry in this case to optimize a little bit. Link: https://lkml.kernel.org/r/20220821154055.109635-1-minhquangbui99@gmail.com Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
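To make the reasoning above concrete, a hedged sketch of a set_max-style loop with the described early exit; this is a simplified illustration, not the exact kernel function:

  #include <linux/errno.h>
  #include <linux/page_counter.h>
  #include <linux/sched.h>

  static int set_max_sketch(struct page_counter *counter, unsigned long nr_pages)
  {
          for (;;) {
                  unsigned long usage, old;

                  usage = page_counter_read(counter);
                  if (usage > nr_pages)
                          return -EBUSY;

                  old = xchg(&counter->max, nr_pages);

                  /*
                   * If the new limit is not below the old one, a concurrent
                   * charge that saw the old limit still cannot push usage
                   * above the new limit, so no retry is needed.
                   */
                  if (page_counter_read(counter) <= usage || nr_pages >= old)
                          return 0;

                  counter->max = old;     /* counter grew past the limit: undo and retry */
                  cond_resched();
          }
  }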
2022-09-12 | mm: pagewalk: add api documentation for walk_page_range_novma() | Rolf Eike Beer | 1 | -1/+9
Link: https://lkml.kernel.org/r/8991525.CDJkKcVGEf@devpool047 Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: pagewalk: fix documentation of PTE hole handling | Rolf Eike Beer | 1 | -5/+5
Empty PTEs are passed to the pte_entry callback, not to pte_hole. Link: https://lkml.kernel.org/r/3695521.kQq0lBPeGt@devpool047 Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: memcg: export workingset refault stats for cgroup v1 | Yang Shi | 1 | -2/+7
Workingset refault stats are important and useful metrics to measure how well the reclaimer and swapping work and how healthy the services are, but they are only available for cgroup v2. There are still plenty of users on cgroup v1, so export the stats for cgroup v1 as well.

Link: https://lkml.kernel.org/r/20220816185801.651091-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/page_owner.c: add llseek for page_owner | Kassey Li | 2 | -3/+26
It is too slow to dump all the pages; in some use cases we just want to dump a given start pfn, for example a CMA range or a single page. To speed this up and save time, allow specifying a start pfn by adding llseek support to page_owner.

Link: https://lkml.kernel.org/r/20220818022425.31056-1-quic_yingangl@quicinc.com
Signed-off-by: Kassey Li <quic_yingangl@quicinc.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/damon: replace pmd_huge() with pmd_trans_huge() for THP | Baolin Wang | 1 | -4/+4
pmd_huge() is usually used to indicate a pmd level hugetlb. However a pmd mapped huge page can only be THP in damon_mkold_pmd_entry() or damon_young_pmd_entry(), so replace pmd_huge() with pmd_trans_huge() in this case to make the code more readable according to the discussion [1]. [1] https://lore.kernel.org/all/098c1480-416d-bca9-cedb-ca495df69b64@linux.alibaba.com/ Link: https://lkml.kernel.org/r/a9e010ca5d299e18d740c7c52290ecb6a014dde6.1660805030.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm/damon: validate if the pmd entry is present before accessing | Baolin Wang | 1 | -0/+10
pmd_huge() is used to validate whether the pmd entry is mapped by a huge page, which also includes the case of a non-present (migration or hwpoisoned) pmd entry on arm64 or x86 architectures. This means that pmd_pfn() cannot get the correct pfn number for a non-present pmd entry, which will cause damon_get_page() to get an incorrect page struct (which may also be NULL, via pfn_to_online_page()), making the access statistics incorrect.

With incorrect statistics, DAMON may make wrong decisions. For example, if the DAMOS_PAGEOUT operation is specified, DAMON may fail to reclaim a cold page in time because that page was mistakenly regarded as accessed. Moreover, it makes no sense to waste time getting the page of a non-present entry. Just treat it as not accessed and skip it, which keeps consistency with non-present pte-level entries.

So add pmd entry present validation to fix the above issues.

Link: https://lkml.kernel.org/r/58b1d1f5fbda7db49ca886d9ef6783e3dcbbbc98.1660805030.git.baolin.wang@linux.alibaba.com
Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
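As a rough illustration of the check being added, a hedged fragment in the shape of a pagewalk pmd handler; the structure and names are simplified assumptions, not copied from mm/damon/vaddr.c:

  #include <linux/hugetlb.h>
  #include <linux/mm.h>

  static int mkold_pmd_entry_sketch(pmd_t *pmd, struct mm_struct *mm)
  {
          if (pmd_huge(*pmd)) {
                  spinlock_t *ptl = pmd_lock(mm, pmd);

                  if (!pmd_present(*pmd)) {
                          /*
                           * Migration or hwpoison entry: pmd_pfn() would be
                           * bogus, so treat it as not accessed and skip it.
                           */
                          spin_unlock(ptl);
                          return 0;
                  }
                  /* safe to use pmd_pfn()/damon_get_page() from here on */
                  spin_unlock(ptl);
          }
          return 0;
  }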
2022-09-12 | mm: release private data before split THP | Yin Fengwei | 1 | -2/+12
If there is private data attached to a THP, the refcount of the THP will be increased and will prevent the THP from being split. Attempt to release any private data attached to the THP before attempting the split, to increase the chance of splitting successfully.

There was a memory failure issue hit during HW error injection testing with a 5.18 kernel + xfs as rootfs. The test was killed and a system reboot was required to re-run the test. The issue was tracked down to a THP split failure caused by the memory failure not being handled. The page dump showed:

 [ 1785.433075] page:0000000025f9530b refcount:18 mapcount:0 mapping:000000008162eea7 index:0xa10 pfn:0x2f0200
 [ 1785.443954] head:0000000025f9530b order:4 compound_mapcount:0 compound_pincount:0
 [ 1785.452408] memcg:ff4247f2d28e9000
 [ 1785.456304] aops:xfs_address_space_operations ino:8555182 dentry name:"baseos-filenames.solvx"
 [ 1785.466612] flags: 0x1000000000012036(referenced|uptodate|lru|active|private|head|node=0|zone=2)
 [ 1785.476514] raw: 1000000000012036 ffb9460f8bc07c08 ffb9460f8bc08408 ff4247f22e6299f8
 [ 1785.485268] raw: 0000000000000a10 ff4247f194ade900 00000012ffffffff ff4247f2d28e9000

It appears the error was injected into a large folio for xfs with private data attached. With the private data released before splitting the THP, the test case could be run successfully many times without rebooting the system.

Link: https://lkml.kernel.org/r/20220810064907.582899-1-fengwei.yin@intel.com
Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
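A hedged sketch of the ordering described in the entry above; the caller shown is hypothetical, and the gfp flag and helpers are generic pagecache/THP ones rather than necessarily the exact hunk from the patch:

  #include <linux/huge_mm.h>
  #include <linux/mm.h>
  #include <linux/pagemap.h>

  /* Caller holds the page lock on a THP head page. */
  static int split_after_dropping_private(struct page *page)
  {
          /*
           * Private data (e.g. filesystem buffers) holds an extra reference
           * that makes split_huge_page() fail; try to drop it first.
           */
          if (page_has_private(page))
                  try_to_release_page(page, GFP_KERNEL);

          return split_huge_page(page);   /* 0 on success, nonzero on failure */
  }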
2022-09-12 | mm: thp: remove redundant pgtable check in set_huge_zero_page() | Qi Zheng | 1 | -2/+1
When the pgtable is NULL in the set_huge_zero_page(), we should not increment the count of PTE page table pages by calling mm_inc_nr_ptes(). Otherwise we may receive the following warning when the mm exits: BUG: non-zero pgtables_bytes on freeing mm Now we can't observe the above warning since only do_huge_pmd_anonymous_page() invokes set_huge_zero_page() and the pgtable can not be NULL. Therefore, instead of moving mm_inc_nr_ptes() to the non-NULL branch of pgtable, it is better to remove the redundant pgtable check directly. Link: https://lkml.kernel.org/r/20220818082748.40021-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: memory-failure: kill __soft_offline_page() | Kefeng Wang | 1 | -15/+10
Squash the __soft_offline_page() into soft_offline_in_use_page() and kill __soft_offline_page(). [wangkefeng.wang@huawei.com: update hpage when try_to_split_thp_page() succeeds] Link: https://lkml.kernel.org/r/20220830104654.28234-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20220819033402.156519-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: memory-failure: kill soft_offline_free_page() | Kefeng Wang | 1 | -11/+1
Open-code the page_handle_poison() into soft_offline_page() and kill unneeded soft_offline_free_page(). Link: https://lkml.kernel.org/r/20220819033402.156519-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: hugetlb_vmemmap: simplify reset_struct_pages() | Muchun Song | 1 | -3/+2
We can choose to copy three contiguous tail pages' content to the first three pages instead of copying one by one to simplify the code and reduce code size from 229 bytes to 63 bytes. The BUILD_BUG_ON() aims to avoid out-of-bounds accesses. Link: https://lkml.kernel.org/r/20220819035532.6189-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
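For illustration, a minimal sketch of the "one memcpy for the three tail pages" idea described in the entry above; the constant name is defined here for the example and may differ from the one used in the file:

  #include <linux/build_bug.h>
  #include <linux/mm.h>
  #include <linux/string.h>

  #define NR_RESET_STRUCT_PAGE_SKETCH 3

  static void reset_struct_pages_sketch(struct page *start)
  {
          struct page *from = start + NR_RESET_STRUCT_PAGE_SKETCH;

          /* the changelog notes the build check avoids out-of-bounds accesses */
          BUILD_BUG_ON(NR_RESET_STRUCT_PAGE_SKETCH * 2 >
                       PAGE_SIZE / sizeof(struct page));
          /* one copy of three contiguous struct pages instead of a loop */
          memcpy(start, from, sizeof(*from) * NR_RESET_STRUCT_PAGE_SKETCH);
  }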
2022-09-12 | mm, hwpoison: avoid trying to unpoison reserved page | Miaohe Lin | 1 | -2/+2
For reserved pages, the HWPoison flag will be set without increasing the page refcnt. So we shouldn't even try to unpoison these pages, and thus decrease the page refcnt unexpectedly. Add a PageReserved() check to filter this case out and remove the below unneeded zero page check (the zero page is reserved).

Link: https://lkml.kernel.org/r/20220818130016.45313-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm, hwpoison: kill procs if unmap fails | Miaohe Lin | 1 | -8/+4
If try_to_unmap() fails, the hwpoisoned page still resides in the address space of some processes. We should kill these processes or the hwpoisoned page might be consumed later. collect_procs() is always called to collect relevant processes now so they can be killed later if unmap fails. [linmiaohe@huawei.com: v2] Link: https://lkml.kernel.org/r/20220823032346.4260-6-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm, hwpoison: fix possible use-after-free in mf_dax_kill_procs() | Miaohe Lin | 1 | -1/+2
After kill_procs(), tk will be freed without being removed from the to_kill list. In the next iteration, the freed list entry in the to_kill list will be accessed, leading to a use-after-free issue. Add list_del() in kill_procs() to fix the issue.

Link: https://lkml.kernel.org/r/20220823032346.4260-5-linmiaohe@huawei.com
Fixes: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
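A hedged sketch of the fix pattern described above; the struct is a stand-in for the file-local struct to_kill, and the loop shape is assumed, not copied from mm/memory-failure.c:

  #include <linux/list.h>
  #include <linux/sched.h>
  #include <linux/slab.h>

  struct to_kill_sketch {               /* stand-in for the real struct to_kill */
          struct list_head nd;
          struct task_struct *tsk;
  };

  /* Free every entry and unlink it, so the list never holds a stale node. */
  static void kill_procs_cleanup_sketch(struct list_head *to_kill)
  {
          struct to_kill_sketch *tk, *next;

          list_for_each_entry_safe(tk, next, to_kill, nd) {
                  /* ...deliver the signal for tk->tsk here... */
                  list_del(&tk->nd);
                  kfree(tk);
          }
  }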
2022-09-12 | mm, hwpoison: fix extra put_page() in soft_offline_page() | Miaohe Lin | 1 | -2/+0
When hwpoison_filter() refuses to soft offline a page, the page refcnt incremented previously by MF_COUNT_INCREASED would have been consumed via get_hwpoison_page() if ret <= 0. So the put_ref_page() here will put the extra one. Remove it to fix the issue. Link: https://lkml.kernel.org/r/20220818130016.45313-4-linmiaohe@huawei.com Fixes: 9113eaf331bf ("mm/memory-failure.c: add hwpoison_filter for soft offline") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm, hwpoison: fix page refcnt leaking in unpoison_memory() | Miaohe Lin | 1 | -0/+1
When free_raw_hwp_pages() fails its work, the refcnt of the hugetlb page would have been incremented if ret > 0. Using put_page() to fix refcnt leaking in this case. Link: https://lkml.kernel.org/r/20220818130016.45313-3-linmiaohe@huawei.com Fixes: debb6b9c3fdd ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb() | Miaohe Lin | 1 | -2/+4
Patch series "A few fixup patches for memory-failure", v2. This series contains a few fixup patches to fix incorrect update of page refcnt, fix possible use-after-free issue and so on. More details can be found in the respective changelogs. This patch (of 6): When hwpoison_filter() refuses to hwpoison a hugetlb page, the refcnt of the page would have been incremented if res == 1. Using put_page() to fix the refcnt leaking in this case. Link: https://lkml.kernel.org/r/20220823032346.4260-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-2-linmiaohe@huawei.com Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: fix use-after free of page_ext after race with memory-offline | Charan Teja Kalla | 5 | -45/+192
Below is one path where a race between page_ext access and offlining of the respective memory blocks causes a use-after-free on access of the page_ext structure.

 process1                                  process2
 ---------                                 ---------
 a) reading /proc/page_owner               doing memory offline
                                           through offline_pages.

 b) PageBuddy check fails, thus proceed
    to get the page_owner information
    through page_ext access.
    page_ext = lookup_page_ext(page);
                                           migrate_pages();
                                           .................
                                           Since all pages are successfully
                                           migrated as part of the offline
                                           operation, send the MEM_OFFLINE
                                           notification, where for page_ext
                                           it calls:
                                           offline_page_ext() -->
                                             __free_page_ext() -->
                                               free_page_ext() -->
                                                 vfree(ms->page_ext)
                                           mem_section->page_ext = NULL

 c) The check of the PAGE_EXT flags in the
    page_ext->flags access results in the
    use-after-free (leading to the
    translation faults).

As mentioned above, there is really no synchronization between page_ext access and its freeing during memory offline. The memory offline steps (roughly) on a memory block are:

1) Isolate all the pages.
2) while(1) try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
3) Delete the pages from this buddy list.
4) Then free page_ext. (Note: the struct page is still alive, as it is freed only during hot remove of the memory, which frees the memmap; that step the user might not perform.)

This design leads to a state where struct page is alive but struct page_ext is freed, even though the latter is ideally part of the former and just represents the page flags (check [3] for why this design was chosen).

The abovementioned race is just one example, but the problem persists in the other paths involving page_ext->flags access too (e.g. page_is_idle()).

Fix all the paths where offline races with page_ext access by maintaining synchronization with the RCU lock, achieved in 3 steps:

1) Invalidate all the page_ext's of the sections of a memory block by storing a flag in the LSB of mem_section->page_ext.

2) Wait for all the existing readers to finish working with the ->page_ext's with synchronize_rcu(). Any parallel process that starts after this call will not get the page_ext, through lookup_page_ext(), for the block on which the parallel offline operation is being performed.

3) Now safely free all the sections' ->page_ext's of the block on which the offline operation is being performed.

Note: if synchronize_rcu() takes too long, this path can be optimized through call_rcu()[2].

Thanks to David Hildenbrand for his views/suggestions on the initial discussion[1] and to Pavan Kondeti for various inputs on this patch.

[1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/
[2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@quicinc.com/T/#u
[3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com/

[quic_charante@quicinc.com: rename label `loop' to `ext_put_continue' per David]
Link: https://lkml.kernel.org/r/1661496993-11473-1-git-send-email-quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/1660830600-9068-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
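For illustration, a minimal sketch of the reader-side discipline described in the entry above; the patch wraps this in dedicated get/put helpers, so treat the exact calls here as a simplified assumption rather than the patched code:

  #include <linux/bitops.h>
  #include <linux/page_ext.h>
  #include <linux/rcupdate.h>

  static bool page_ext_flag_test_sketch(struct page *page, int flag_bit)
  {
          struct page_ext *page_ext;
          bool set = false;

          rcu_read_lock();
          page_ext = lookup_page_ext(page);
          if (page_ext)
                  /*
                   * Safe: the offliner invalidates the section's page_ext,
                   * then waits for readers via synchronize_rcu() before
                   * vfree()ing it, so this access cannot be use-after-free.
                   */
                  set = test_bit(flag_bit, &page_ext->flags);
          rcu_read_unlock();

          return set;
  }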
2022-09-12 | mm/vmalloc.c: support HIGHMEM pages in vmap_pages_range_noflush() | Matthew Wilcox | 1 | -1/+1
If the pages being mapped are in HIGHMEM, page_address() returns NULL. This probably wasn't noticed before because there aren't currently any architectures with HAVE_ARCH_HUGE_VMALLOC and HIGHMEM, but it's simpler to call page_to_phys() and futureproofs us against such configurations existing. Link: https://lkml.kernel.org/r/Yv6qHc6e+m7TMWhi@casper.infradead.org Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12 | mm: memcontrol: fix a typo in comment | xupanda | 1 | -1/+1
Fix a spelling mistake in comment. Link: https://lkml.kernel.org/r/20220815065102.74347-1-xu.panda@zte.com.cn Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: xupanda <xu.panda@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>