summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-05mm/zsmalloc: don't need to reserve LSB in handleChengming Zhou1-4/+1
We will save allocated tag in the object header to indicate that it's allocated. handle |= OBJ_ALLOCATED_TAG; So the object header needs to reserve LSB for this tag bit. But the handle itself doesn't need to reserve LSB to save tag, since it's only used to find the position of object, by (pfn + obj_idx). So remove LSB reserve from handle, one more bit can be used as obj_idx. Link: https://lkml.kernel.org/r/20240228023854.3511239-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm/memory.c: do_numa_page(): remove a redundant page table readJohn Hubbard1-6/+6
do_numa_page() is reading from the same page table entry, twice, while holding the page table lock: once while checking that the pte hasn't changed, and again in order to modify the pte. Instead, just read the pte once, and save it in the same old_pte variable that already exists. This has no effect on behavior, other than to provide a tiny potential improvement to performance, by avoiding the redundant memory read (which the compiler cannot elide, due to READ_ONCE()). Also improve the associated comments nearby. Link: https://lkml.kernel.org/r/20240228034151.459370-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: add alloc_contig_migrate_range allocation statisticsRichard Chang4-7/+68
alloc_contig_migrate_range has every information to be able to understand big contiguous allocation latency. For example, how many pages are migrated, how many times they were needed to unmap from page tables. This patch adds the trace event to collect the allocation statistics. In the field, it was quite useful to understand CMA allocation latency. [akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled] Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org. Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: use folio more widely in __split_huge_pageMatthew Wilcox (Oracle)1-10/+11
We already have a folio; use it instead of the head page where reasonable. Saves a couple of calls to compound_head() and elimimnates a few references to page->mapping. Link: https://lkml.kernel.org/r/20240228164326.1355045-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabledHuang Shijie1-0/+3
In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configed, kernel will use vmemmap to do the __pfn_to_page/page_to_pfn, and kernel will not use the "classic sparse" to do the __pfn_to_page/page_to_pfn. So export the vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configed. This makes the user applications (crash, etc) get faster pfn_to_page/page_to_pfn operations too. Link: https://lkml.kernel.org/r/20240227014952.3184-1-shijie@os.amperecomputing.com Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05modules: wait do_free_init correctlyChangbin Du3-4/+18
The synchronization here is to ensure the ordering of freeing of a module init so that it happens before W+X checking. It is worth noting it is not that the freeing was not happening, it is just that our sanity checkers raced against the permission checkers which assume init memory is already gone. Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling do_free_init() into a global workqueue instead of relying on it being called through call_rcu(..., do_free_init), which used to allowed us call do_free_init() asynchronously after the end of a subsequent grace period. The move to a global workqueue broke the gaurantees for code which needed to be sure the do_free_init() would complete with rcu_barrier(). To fix this callers which used to rely on rcu_barrier() must now instead use flush_work(&init_free_wq). Without this fix, we still could encounter false positive reports in W+X checking since the rcu_barrier() here can not ensure the ordering now. Even worse, the rcu_barrier() can introduce significant delay. Eric Chanudet reported that the rcu_barrier introduces ~0.1s delay on a PREEMPT_RT kernel. [ 0.291444] Freeing unused kernel memory: 5568K [ 0.402442] Run /sbin/init as init process With this fix, the above delay can be eliminated. Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag") Signed-off-by: Changbin Du <changbin.du@huawei.com> Tested-by: Eric Chanudet <echanude@redhat.com> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Xiaoyi Su <suxiaoyi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: convert free_swap_cache() to take a folioMatthew Wilcox (Oracle)4-12/+12
All but one caller already has a folio, so convert free_page_and_swap_cache() to have a folio and remove the call to page_folio(). Link: https://lkml.kernel.org/r/20240227174254.710559-19-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: use a folio in __collapse_huge_page_copy_succeeded()Matthew Wilcox (Oracle)1-16/+14
These pages are all chained together through the lru list, so we know they're folios. Use the folio APIs to save three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20240227174254.710559-18-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: convert free_pages_and_swap_cache() to use folios_put()Matthew Wilcox (Oracle)1-8/+13
Process the pages in batch-sized quantities instead of all-at-once. Link: https://lkml.kernel.org/r/20240227174254.710559-17-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: remove lru_to_page()Matthew Wilcox (Oracle)1-1/+0
The last user was removed over a year ago; remove the definition. Link: https://lkml.kernel.org/r/20240227174254.710559-16-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: remove free_unref_page_list()Matthew Wilcox (Oracle)2-19/+0
All callers now use free_unref_folios() so we can delete this function. Link: https://lkml.kernel.org/r/20240227174254.710559-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05memcg: remove mem_cgroup_uncharge_list()Matthew Wilcox (Oracle)2-31/+0
All users have been converted to mem_cgroup_uncharge_folios() so we can remove this API. Link: https://lkml.kernel.org/r/20240227174254.710559-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: free folios directly in move_folios_to_lru()Matthew Wilcox (Oracle)1-20/+12
The few folios which can't be moved to the LRU list (because their refcount dropped to zero) used to be returned to the caller to dispose of. Make this simpler to call by freeing the folios directly through free_unref_folios(). Link: https://lkml.kernel.org/r/20240227174254.710559-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: free folios in a batch in shrink_folio_list()Matthew Wilcox (Oracle)1-11/+9
Use free_unref_page_batch() to free the folios. This may increase the number of IPIs from calling try_to_unmap_flush() more often, but that's going to be very workload-dependent. It may even reduce the number of IPIs as we now batch-free large folios instead of freeing them one at a time. Link: https://lkml.kernel.org/r/20240227174254.710559-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: allow non-hugetlb large folios to be batch processedMatthew Wilcox (Oracle)1-2/+3
Hugetlb folios still get special treatment, but normal large folios can now be freed by free_unref_folios(). This should have a reasonable performance impact, TBD. Link: https://lkml.kernel.org/r/20240227174254.710559-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: handle large folios in free_unref_folios()Matthew Wilcox (Oracle)1-8/+17
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare() destroys the ability to call folio_order(), so stash the order in folio->private for the benefit of the second loop. Link: https://lkml.kernel.org/r/20240227174254.710559-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: use __page_cache_release() in folios_put()Matthew Wilcox (Oracle)2-41/+37
Pass a pointer to the lruvec so we can take advantage of the folio_lruvec_relock_irqsave(). Adjust the calling convention of folio_lruvec_relock_irqsave() to suit and add a page_cache_release() wrapper. Link: https://lkml.kernel.org/r/20240227174254.710559-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: use free_unref_folios() in put_pages_list()Matthew Wilcox (Oracle)1-7/+10
Break up the list of folios into batches here so that the folios are more likely to be cache hot when doing the rest of the processing. Link: https://lkml.kernel.org/r/20240227174254.710559-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: remove use of folio list from folios_put()Matthew Wilcox (Oracle)1-7/+12
Instead of putting the interesting folios on a list, delete the uninteresting one from the folio_batch. Link: https://lkml.kernel.org/r/20240227174254.710559-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05memcg: add mem_cgroup_uncharge_folios()Matthew Wilcox (Oracle)2-2/+25
Almost identical to mem_cgroup_uncharge_list(), except it takes a folio_batch instead of a list_head. Link: https://lkml.kernel.org/r/20240227174254.710559-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: use folios_put() in __folio_batch_release()Matthew Wilcox (Oracle)1-2/+1
There's no need to indirect through release_pages() and iterate over this batch of folios an extra time; we can just use the batch that we have. Link: https://lkml.kernel.org/r/20240227174254.710559-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: add free_unref_folios()Matthew Wilcox (Oracle)2-25/+39
Iterate over a folio_batch rather than a linked list. This is easier for the CPU to prefetch and has a batch count naturally built in so we don't need to track it. Again, this lowers the maximum lock hold time from 32 folios to 15, but I do not expect this to have a significant effect. Link: https://lkml.kernel.org/r/20240227174254.710559-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: convert free_unref_page_list() to use foliosMatthew Wilcox (Oracle)1-18/+20
Most of its callees are not yet ready to accept a folio, but we know all of the pages passed in are actually folios because they're linked through ->lru. Link: https://lkml.kernel.org/r/20240227174254.710559-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: make folios_put() the basis of release_pages()Matthew Wilcox (Oracle)3-49/+70
Patch series "Rearrange batched folio freeing", v3. Other than the obvious "remove calls to compound_head" changes, the fundamental belief here is that iterating a linked list is much slower than iterating an array (5-15x slower in my testing). There's also an associated belief that since we iterate the batch of folios three times, we do better when the array is small (ie 15 entries) than we do with a batch that is hundreds of entries long, which only gives us the opportunity for the first pages to fall out of cache by the time we get to the end. It is possible we should increase the size of folio_batch. Hopefully the bots let us know if this introduces any performance regressions. This patch (of 3): By making release_pages() call folios_put(), we can get rid of the calls to compound_head() for the callers that already know they have folios. We can also get rid of the lock_batch tracking as we know the size of the batch is limited by folio_batch. This does reduce the maximum number of pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX (32) to PAGEVEC_SIZE (15). I do not expect this to make a significant difference, but if it does, we can increase PAGEVEC_SIZE to 31. Link: https://lkml.kernel.org/r/20240227174254.710559-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240227174254.710559-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm/khugepaged: keep mm in mm_slot without MMF_DISABLE_THP checkLance Yang1-3/+3
Previously, we removed the mm from mm_slot and dropped mm_count if the MMF_THP_DISABLE flag was set. However, we didn't re-add the mm back after clearing the MMF_THP_DISABLE flag. Additionally, We add a check for the MMF_THP_DISABLE flag in hugepage_vma_revalidate(). Link: https://lkml.kernel.org/r/20240227035135.54593-1-ioworker0@gmail.com Fixes: 879c6000e191 ("mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check") Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05lib/test_vmalloc.c: use unsigned long constantMartin Kaiser1-1/+1
Use an unsigned long constant instead of an int constant and a cast. This fixes the checkpatch warning WARNING: Unnecessary typecast of c90 int constant - '(unsigned long) 1' could be '1UL' + align = ((unsigned long) 1) << i; Link: https://lkml.kernel.org/r/20240226191159.39509-4-martin@kaiser.cx Signed-off-by: Martin Kaiser <martin@kaiser.cx> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05lib/test_vmalloc.c: drop empty exit functionMartin Kaiser1-5/+0
The module is never loaded successfully. Therefore, it'll never be unloaded and we can remove the exit function. Link: https://lkml.kernel.org/r/20240226191159.39509-3-martin@kaiser.cx Signed-off-by: Martin Kaiser <martin@kaiser.cx> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05lib/test_vmalloc.c: fix typo in function nameMartin Kaiser1-2/+2
Fix a typo and change the function name to init_test_configuration. Both caller and definition have the same typo, so the current code already works. Link: https://lkml.kernel.org/r/20240226191159.39509-2-martin@kaiser.cx Signed-off-by: Martin Kaiser <martin@kaiser.cx> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: remove total_mapcount()David Hildenbrand1-8/+1
All users of total_mapcount() are gone, let's remove it. Link: https://lkml.kernel.org/r/20240226141324.278526-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins()David Hildenbrand1-29/+18
Patch series "mm: remove total_mapcount()", v2. Let's remove the remaining user from mm/memfd.c so we can get rid of total_mapcount(). This patch (of 2): Both functions are the remaining users of total_mapcount(). Let's get rid of the calls by converting the code to folios. As it turns out, the code is unnecessarily complicated, especially: 1) We can query the number of pagecache references for a folio simply via folio_nr_pages(). This will handle other folio sizes in the future correctly. 2) The xas_set(xas, page->index + cache_count) call to increment the iterator for large folios is not required. Remove it. Further, simplify the XA_CHECK_SCHED check, counting each entry exactly once. Memfd pages can be swapped out when using shmem; leave xa_is_value() checks in place. Link: https://lkml.kernel.org/r/20240226141324.278526-1-david@redhat.com Link: https://lkml.kernel.org/r/20240226141324.278526-2-david@redhat.com Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: huge_memory: enable debugfs to split huge pages to any orderZi Yan3-19/+205
It is used to test split_huge_page_to_list_to_order for pagecache THPs. Also add test cases for split_huge_page_to_list_to_order via both debugfs. [ziy@nvidia.com: fix issue discovered with NFS] Link: https://lkml.kernel.org/r/262E4DAA-4A78-4328-B745-1355AE356A07@nvidia.com Link: https://lkml.kernel.org/r/20240226205534.1603748-9-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Tested-by: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Aishwarya TCV <aishwarya.tcv@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: thp: split huge page to any lower order pagesZi Yan2-32/+96
To split a THP to any lower order pages, we need to reform THPs on subpages at given order and add page refcount based on the new page order. Also we need to reinitialize page_deferred_list after removing the page from the split_queue, otherwise a subsequent split will see list corruption when checking the page_deferred_list again. Note: Anonymous order-1 folio is not supported because _deferred_list, which is used by partially mapped folios, is stored in subpage 2 and an order-1 folio only has subpage 0 and 1. File-backed order-1 folios are fine, since they do not use _deferred_list. [ziy@nvidia.com: fixup per discussion with Ryan] Link: https://lkml.kernel.org/r/494F48CD-1F0F-4CAD-884E-6D48F40AF990@nvidia.com Link: https://lkml.kernel.org/r/20240226205534.1603748-8-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: page_owner: add support for splitting to any order in split page_ownerZi Yan4-12/+14
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-7-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: memcg: make memcg huge page split support any order splitZi Yan4-10/+11
It sets memcg information for the pages after the split. A new parameter new_order is added to tell the order of subpages in the new page, always 0 for now. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-6-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm/page_owner: use order instead of nr in split_page_owner()Zi Yan4-9/+9
We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-5-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm/memcg: use order instead of nr in split_page_memcg()Zi Yan4-7/+9
We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: support order-1 folios in the page cacheMatthew Wilcox (Oracle)4-11/+16
Folios of order 1 have no space to store the deferred list. This is not a problem for the page cache as file-backed folios are never placed on the deferred list. All we need to do is prevent the core MM from touching the deferred list for order 1 folios and remove the code which prevented us from allocating order 1 folios. Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/ Link: https://lkml.kernel.org/r/20240226205534.1603748-3-zi.yan@sent.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm/huge_memory: only split PMD mapping when necessary in unmap_folio()Zi Yan1-2/+5
Patch series "Split a folio to any lower order folios", v5. File folio supports any order and multi-size THP is upstreamed[1], so both file and anonymous folios can be >0 order. Currently, split_huge_page() only splits a huge page to order-0 pages, but splitting to orders higher than 0 might better utilize large folios, if done properly. In addition, Large Block Sizes in XFS support would benefit from it during truncate[2]. This patchset adds support for splitting a large folio to any lower order folios. In addition to this implementation of split_huge_page_to_list_to_order(), a possible optimization could be splitting a large folio to arbitrary smaller folios instead of a single order. As both Hugh and Ryan pointed out [3,5] that split to a single order might not be optimal, an order-9 folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and 2 order-0 folios, depending on subsequent folio operations. Leave this as future work. [1] https://lore.kernel.org/all/20231207161211.2374093-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20240226094936.2677493-1-kernel@pankajraghav.com/ [3] https://lore.kernel.org/linux-mm/9dd96da-efa2-5123-20d4-4992136ef3ad@google.com/ [4] https://lore.kernel.org/linux-mm/cbb1d6a0-66dd-47d0-8733-f836fe050374@arm.com/ [5] https://lore.kernel.org/linux-mm/20240213215520.1048625-1-zi.yan@sent.com/ This patch (of 8): As multi-size THP support is added, not all THPs are PMD-mapped, thus during a huge page split, there is no need to always split PMD mapping in unmap_folio(). Make it conditional. Link: https://lkml.kernel.org/r/20240226205534.1603748-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20240226205534.1603748-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: enumerate all gfp flagsSuren Baghdasaryan1-28/+62
Introduce GFP bits enumeration to let compiler track the number of used bits (which depends on the config options) instead of hardcoding them. That simplifies __GFP_BITS_SHIFT calculation. Link: https://lkml.kernel.org/r/20240224015800.2569851-1-surenb@google.com Suggested-by: Petr Tesařík <petr@tesarici.cz> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: madvise: pageout: ignore references rather than clearing youngBarry Song4-11/+13
While doing MADV_PAGEOUT, the current code will clear PTE young so that vmscan won't read young flags to allow the reclamation of madvised folios to go ahead. It seems we can do it by directly ignoring references, thus we can remove tlb flush in madvise and rmap overhead in vmscan. Regarding the side effect, in the original code, if a parallel thread runs side by side to access the madvised memory with the thread doing madvise, folios will get a chance to be re-activated by vmscan (though the time gap is actually quite small since checking PTEs is done immediately after clearing PTEs young). But with this patch, they will still be reclaimed. But this behaviour doing PAGEOUT and doing access at the same time is quite silly like DoS. So probably, we don't need to care. Or ignoring the new access during the quite small time gap is even better. For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep its behaviour as is since a physical address might be mapped by multiple processes. MADV_PAGEOUT based on virtual address is actually much more aggressive on reclamation. To untouch paddr's DAMOS_PAGEOUT, we simply pass ignore_references as false in reclaim_pages(). A microbench as below has shown 6% decrement on the latency of MADV_PAGEOUT, #define PGSIZE 4096 main() { int i; #define SIZE 512*1024*1024 volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < SIZE/sizeof(long); i += PGSIZE / sizeof(long)) p[i] = 0x11; madvise(p, SIZE, MADV_PAGEOUT); } w/o patch w/ patch root@10:~# time ./a.out root@10:~# time ./a.out real 0m49.634s real 0m46.334s user 0m0.637s user 0m0.648s sys 0m47.434s sys 0m44.265s Link: https://lkml.kernel.org/r/20240226005739.24350-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05arm64/mm: improve comment in contpte_ptep_get_lockless()Ryan Roberts1-10/+14
Make clear the atmicity/consistency requirements of the API and how we achieve them. Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/ Link: https://lkml.kernel.org/r/20240226120321.1055731-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05arm64/mm: export contpte symbols only to GPL usersRyan Roberts1-11/+11
Patch series "Address some contpte nits". These 2 patches address some nits raised by Catalin late in the review cycle for my contpte series [1]. [1] https://lore.kernel.org/linux-mm/20240215103205.2607016-1-ryan.roberts@arm.com/ This patch (of 2): The contpte symbols must be exported since some of the public inline ptep_* APIs are called from modules and these inlines now call the contpte functions. Originally they were exported as EXPORT_SYMBOL() for fear of breaking out-of-tree modules. But we subsequently concluded that EXPORT_SYMBOL_GPL() should be safe since these functions are deeply core mm routines, and any module operating at this level is not going to be able to survive on EXPORT_SYMBOL alone. Link: https://lkml.kernel.org/r/20240226120321.1055731-1-ryan.roberts@arm.com Link: https://lore.kernel.org/linux-mm/f9fc2b31-11cb-4969-8961-9c89fea41b74@nvidia.com/ Link: https://lkml.kernel.org/r/20240226120321.1055731-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05Docs/mm/damon/design: remove the details for pageout as paddr doesn't use ↵Barry Song1-1/+1
MADV_PAGEOUT The doc needs a fix. As only in the case of virtual address, we are calling madvise() with MADV_PAGEOUT. But in the case of physical address, we are calling reclaim_pages() directly. MADV_PAGEOUT on virtual address is much more aggresive to reclaim memory compared to reclaim_pages() on paddr region. This patch removes the details so that the description can apply to both cases. And we don't need to couple with the implementation details. Link: https://lkml.kernel.org/r/20240224224751.4673-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05kasan: fix a2 allocation and remove explicit cast in atomic testsPaul Heidekrüger1-3/+3
Address the additional feedback since 4e76c8cc3378 kasan: add atomic tests (""kasan: add atomic tests") by removing an explicit cast and fixing the size as well as the check of the allocation of `a2`. Link: https://lkml.kernel.org/r/20240224105414.211995-1-paul.heidekrueger@tum.de Link: https://lore.kernel.org/all/20240131210041.686657-1-paul.heidekrueger@tum.de/T/#u Fixes: 4e76c8cc3378 ("kasan: add atomic tests") Signed-off-by: Paul Heidekrüger <paul.heidekrueger@tum.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055 Reviewed-by: Marco Elver <elver@google.com> Tested-by: Marco Elver <elver@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05lib/stackdepot: off by one in depot_fetch_stack()Dan Carpenter1-1/+1
The stack_pools[] array has DEPOT_MAX_POOLS. The "pools_num" tracks the number of pools which are initialized. See depot_init_pool() for more details. If pool_index == pools_num_cached, this will read one element beyond what we want. If not all the pools are initialized, then the pool will be NULL, triggering a WARN(), and if they are all initialized it will read one element beyond the end of the array. Link: https://lkml.kernel.org/r/361ac881-60b7-471f-91e5-5bf8fe8042b2@moroto.mountain Fixes: b29d31885814 ("lib/stackdepot: store free stack records in a freelist") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05zram: zcomp: remove zcomp_set_max_streams() declarationKefeng Wang1-1/+0
The zcomp_set_max_streams() is removed from commit 43209ea2d17a ("zram: remove max_comp_streams internals"), remove the declaration. Link: https://lkml.kernel.org/r/20240223035548.2591882-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05mm: update mark_victim tracepoints fieldsCarlos Galo2-5/+37
The current implementation of the mark_victim tracepoint provides only the process ID (pid) of the victim process. This limitation poses challenges for userspace tools requiring real-time OOM analysis and intervention. Although this information is available from the kernel logs, it’s not the appropriate format to provide OOM notifications. In Android, BPF programs are used with the mark_victim trace events to notify userspace of an OOM kill. For consistency, update the trace event to include the same information about the OOMed victim as the kernel logs. - UID In Android each installed application has a unique UID. Including the `uid` assists in correlating OOM events with specific apps. - Process Name (comm) Enables identification of the affected process. - OOM Score Will allow userspace to get additional insight of the relative kill priority of the OOM victim. In Android, the oom_score_adj is used to categorize app state (foreground, background, etc.), which aids in analyzing user-perceptible impacts of OOM events [1]. - Total VM, RSS Stats, and pgtables Amount of memory used by the victim that will, potentially, be freed up by killing it. [1] https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283 Signed-off-by: Carlos Galo <carlosgalo@google.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05selftests: damon: add access_memory to .gitignoreJavier Carrasco1-0/+1
This binary is missing in the .gitignore and stays as an untracked file. Link: https://lore.kernel.org/r/20240214-damon_selftest_gitignore-v1-1-f517d0f9f783@gmail.com Link: https://lkml.kernel.org/r/20240221211148.46522-3-sj@kernel.org Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com> Singed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Closes: https://lore.kernel.org/all/AS8P193MB1285C963658008F1B2702AF7E4792@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM/ Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Vincenzo Mezzela <vincenzo.mezzela@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05selftest: damon: fix minor typos in test logsVincenzo Mezzela2-2/+2
Patch series "selftests/damon: misc fixes". Misc fixes for DAMON selftets on behalf of the original authors. This patch (of 2): This patch resolves a spelling error in the test log, preventing potential confusion. It is submitted as part of my application to the "Linux Kernel Bug Fixing Spring Unpaid 2024" mentorship program of the Linux Foundation. Link: https://lore.kernel.org/r/20240204122523.14160-1-vincenzo.mezzela@gmail.com Link: https://lkml.kernel.org/r/20240221211148.46522-2-sj@kernel.org Signed-off-by: Vincenzo Mezzela <vincenzo.mezzela@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-05hugetlb: allow faults to be handled under the VMA lockVishal Moola (Oracle)1-6/+0
Hugetlb can now safely handle faults under the VMA lock, so allow it to do so. This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb counters, and expects the counters to remain unchanged on failure to handle a fault. In hugetlb_no_page(), vmf_anon_prepare() may bailout with no anon_vma under the VMA lock after allocating a folio for the hugepage. In free_huge_folio(), this folio is completely freed on bailout iff there is a surplus of hugetlb pages. This will remove a folio off the freelist and decrement the number of hugepages while ltp expects these counters to remain unchanged on failure. Originally this could only happen due to OOM failures, but now it may also occur after we allocate a hugetlb folio without a suitable anon_vma under the VMA lock. This should only happen for the first freshly allocated hugepage in this vma. Link: https://lkml.kernel.org/r/20240221234732.187629-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>