summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2024-05-06mm/ksm: use ksm_get_folio in scan_get_next_rmap_itemAlex Shi (tencent)1-5/+5
Save a compound_head call. Link: https://lkml.kernel.org/r/20240411061713.1847574-7-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: use folio in stable_node_dupAlex Shi (tencent)1-8/+8
Use ksm_get_folio() and save 2 compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-6-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: use folio in remove_stable_nodeAlex Shi (tencent)1-12/+12
Pages in stable tree are all single normal page, so uses ksm_get_folio() and folio_set_stable_node(), also saves 3 calls to compound_head(). Link: https://lkml.kernel.org/r/20240411061713.1847574-5-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: add folio_set_stable_nodeAlex Shi (tencent)1-1/+7
Turn set_page_stable_node() into a wrapper folio_set_stable_node, and then use it to replace the former. we will merge them together after all place converted to folio. Link: https://lkml.kernel.org/r/20240411061713.1847574-4-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: use folio in remove_rmap_item_from_treeAlex Shi (tencent)1-5/+5
To save 2 compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-3-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: add ksm_get_folioAlex Shi (tencent)1-17/+25
Patch series "transfer page to folio in KSM". This is the first part of page to folio transfer on KSM. Since only single page could be stored in KSM, we could safely transfer stable tree pages to folios. This patchset could reduce ksm.o 57kbytes from 2541776 bytes on latest akpm/mm-stable branch with CONFIG_DEBUG_VM enabled. It pass the KSM testing in LTP and kernel selftest. Thanks for Matthew Wilcox and David Hildenbrand's suggestions and comments! This patch (of 10): The ksm only contains single pages, so we could add a new func ksm_get_folio for get_ksm_page to use folio instead of pages to save a couple of compound_head calls. After all caller replaced, get_ksm_page will be removed. Link: https://lkml.kernel.org/r/20240411061713.1847574-1-alexs@kernel.org Link: https://lkml.kernel.org/r/20240411061713.1847574-2-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/debug: print only page mapcount (excluding folio entire mapcount) in ↵David Hildenbrand1-7/+2
__dump_folio() Let's simplify and only print the page mapcount: we already print the large folio mapcount and the entire folio mapcount for large folios separately; that should be sufficient to figure out what's happening. While at it, print the page mapcount also if it had an underflow, filtering out only typed pages. Link: https://lkml.kernel.org/r/20240409192301.907377-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()David Hildenbrand1-5/+7
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's convert migrate_vma_check_page() to work on a folio internally so we can remove the page_mapcount() usage. Note that we reject any large folios. There is a lot more folio conversion to be had, but that has to wait for another day. No functional change intended. Link: https://lkml.kernel.org/r/20240409192301.907377-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/filemap: use folio_mapcount() in filemap_unaccount_folio()David Hildenbrand1-1/+1
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's use folio_mapcount() instead of filemap_unaccount_folio(). No functional change intended, because we're only dealing with small folios. Link: https://lkml.kernel.org/r/20240409192301.907377-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()David Hildenbrand1-1/+1
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In add_page_for_migration(), we actually want to check if the folio is mapped shared, to reject such folios. So let's use folio_likely_mapped_shared() instead. For small folios, fully mapped THP, and hugetlb folios, there is no change. For partially mapped, shared THP, we should now do a better job at rejecting such folios. Link: https://lkml.kernel.org/r/20240409192301.907377-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()David Hildenbrand1-2/+6
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. For tracing purposes, we use page_mapcount() in __alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds strange: total_migrated and total_reclaimed would count each page only once, not multiple times. But then, isolate_migratepages_range() adds each folio only once to the list. So for large folios, we would query the mapcount of the first page of the folio, which doesn't make too much sense for large folios. Let's simply use folio_mapped() * folio_nr_pages(), which makes more sense as nr_migratepages is also incremented by the number of pages in the folio in case of successful migration. Link: https://lkml.kernel.org/r/20240409192301.907377-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()David Hildenbrand1-2/+2
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. We can only unmap full folios; page_mapped(), which we check here, is translated to folio_mapped() -- based on folio_mapcount(). So let's print the folio mapcount instead. Link: https://lkml.kernel.org/r/20240409192301.907377-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity checkDavid Hildenbrand1-1/+1
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's similarly check for folio_mapcount() underflows instead of page_mapcount() underflows like we do in zap_present_folio_ptes() now. Instead of the VM_BUG_ON(), we should actually be doing something like print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(), performing that check independently of DEBUG_VM. Link: https://lkml.kernel.org/r/20240409192301.907377-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory: use folio_mapcount() in zap_present_folio_ptes()David Hildenbrand1-2/+1
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In zap_present_folio_ptes(), let's simply check the folio mapcount(). If there is some issue, it will underflow at some point either way when unmapping. As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP"), we already documented "If we ever have a cheap folio_mapcount(), we might just want to check for underflows there.". There is no change for small folios. For large folios, we'll now catch more underflows when batch-unmapping, because instead of only testing the mapcount of the first subpage, we'll test if the folio mapcount underflows. Link: https://lkml.kernel.org/r/20240409192301.907377-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: track mapcount of large folios in single valueDavid Hildenbrand6-27/+23
Let's track the mapcount of large folios in a single value. The mapcount of a large folio currently corresponds to the sum of the entire mapcount and all page mapcounts. This sum is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and more widely used, we want to avoid looping over all pages of a folio just to obtain the mapcount of large folios. The comment "In the common case, avoid the loop when no pages mapped by PTE" in folio_total_mapcount() does no longer hold for mTHP that are always mapped by PTE. Further, we are planning on using folio_mapcount() more frequently, and might even want to remove page mapcounts for large folios in some kernel configs. Therefore, allow for reading the mapcount of large folios efficiently and atomically without looping over any pages. Maintain the mapcount also for hugetlb pages for simplicity. Use the new mapcount to implement folio_mapcount() and folio_mapped(). Make page_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped(). _nr_pages_mapped is now only used in rmap code and for debugging purposes. Keep folio_nr_pages_mapped() around, but document that its use should be limited to rmap internals and debugging purposes. This change implies one additional atomic add/sub whenever mapping/unmapping (parts of) a large folio. As we now batch RMAP operations for PTE-mapped THP during fork(), during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount for a PTE batch only once, the added overhead in the common case is small. Only when unmapping individual pages of a large folio (e.g., during COW), the overhead might be bigger in comparison, but it's essentially one additional atomic operation. Note that before the new mapcount would overflow, already our refcount would overflow: each mapping requires a folio reference. Extend the focumentation of folio_mapcount(). Link: https://lkml.kernel.org/r/20240409192301.907377-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/rmap: add fast-path for small folios when adding/removing/duplicatingDavid Hildenbrand1-10/+16
Let's add a fast-path for small folios to all relevant rmap functions. Note that only RMAP_LEVEL_PTE applies. This is a preparation for tracking the mapcount of large folios in a single value. Link: https://lkml.kernel.org/r/20240409192301.907377-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: follow_pte() improvementsDavid Hildenbrand1-5/+15
follow_pte() is now our main function to lookup PTEs in VM_PFNMAP/VM_IO VMAs. Let's perform some more sanity checks to make this exported function harder to abuse. Further, extend the doc a bit, it still focuses on the KVM use case with MMU notifiers. Drop the KVM+follow_pfn() comment, follow_pfn() is no more, and we have other users nowadays. Also extend the doc regarding refcounted pages and the interaction with MMU notifiers. KVM is one example that uses MMU notifiers and can deal with refcounted pages properly. VFIO is one example that doesn't use MMU notifiers, and to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)). Protection changes are less of a concern for users like VFIO: the behavior is similar to longterm-pinning a page, and getting the PTE protection changed afterwards. The primary concern with refcounted pages is use-after-free, which callers should be aware of. Link: https://lkml.kernel.org/r/20240410155527.474777-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: pass VMA instead of MM to follow_pte()David Hildenbrand1-7/+8
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm,swap: add document about RCU read lock and swapoff interactionHuang Ying1-13/+13
During reviewing a patch to fix the race condition between free_swap_and_cache() and swapoff() [1], it was found that the document about how to prevent racing with swapoff isn't clear enough. Especially RCU read lock can prevent swapoff from freeing data structures. So, the document is added as comments. [1] https://lore.kernel.org/linux-mm/c8fe62d0-78b8-527a-5bef-ee663ccdc37a@huawei.com/ Link: https://lkml.kernel.org/r/20240407065450.498821-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/mmap: make accountable_mapping return boolHao Ge1-2/+2
accountable_mapping() can return bool, so change it. Link: https://lkml.kernel.org/r/20240407063843.804274-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/mmap: make vma_wants_writenotify return boolHao Ge1-6/+6
vma_wants_writenotify() should return bool, so change it. Link: https://lkml.kernel.org/r/20240407062653.803142-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06memory tier: create CPUless memory tiers after obtaining HMAT infoHo-Ren (Jack) Chuang1-24/+70
The current implementation treats emulated memory devices, such as CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory (E820_TYPE_RAM). However, these emulated devices have different characteristics than traditional DRAM, making it important to distinguish them. Thus, we modify the tiered memory initialization process to introduce a delay specifically for CPUless NUMA nodes. This delay ensures that the memory tier initialization for these nodes is deferred until HMAT information is obtained during the boot process. Finally, demotion tables are recalculated at the end. * late_initcall(memory_tier_late_init); Some device drivers may have initialized memory tiers between `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing online memory nodes and configuring memory tiers. They should be excluded in the late init. * Handle cases where there is no HMAT when creating memory tiers There is a scenario where a CPUless node does not provide HMAT information. If no HMAT is specified, it falls back to using the default DRAM tier. * Introduce another new lock `default_dram_perf_lock` for adist calculation In the current implementation, iterating through CPUlist nodes requires holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up trying to acquire the same lock, leading to a potential deadlock. Therefore, we propose introducing a standalone `default_dram_perf_lock` to protect `default_dram_perf_*`. This approach not only avoids deadlock but also prevents holding a large lock simultaneously. * Upgrade `set_node_memory_tier` to support additional cases, including default DRAM, late CPUless, and hot-plugged initializations. To cover hot-plugged memory nodes, `mt_calc_adistance()` and `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to handle cases where memtype is not initialized and where HMAT information is available. * Introduce `default_memory_types` for those memory types that are not initialized by device drivers. Because late initialized memory and default DRAM memory need to be managed, a default memory type is created for storing all memory types that are not initialized by device drivers and as a fallback. Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06memory tier: dax/kmem: introduce an abstract layer for finding, allocating, ↵Ho-Ren (Jack) Chuang1-0/+29
and putting memory types Patch series "Improved Memory Tier Creation for CPUless NUMA Nodes", v11. When a memory device, such as CXL1.1 type3 memory, is emulated as normal memory (E820_TYPE_RAM), the memory device is indistinguishable from normal DRAM in terms of memory tiering with the current implementation. The current memory tiering assigns all detected normal memory nodes to the same DRAM tier. This results in normal memory devices with different attributions being unable to be assigned to the correct memory tier, leading to the inability to migrate pages between different types of memory. https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/ This patchset automatically resolves the issues. It delays the initialization of memory tiers for CPUless NUMA nodes until they obtain HMAT information and after all devices are initialized at boot time, eliminating the need for user intervention. If no HMAT is specified, it falls back to using `default_dram_type`. Example usecase: We have CXL memory on the host, and we create VMs with a new system memory device backed by host CXL memory. We inject CXL memory performance attributes through QEMU, and the guest now sees memory nodes with performance attributes in HMAT. With this change, we enable the guest kernel to construct the correct memory tiering for the memory nodes. This patch (of 2): Since different memory devices require finding, allocating, and putting memory types, these common steps are abstracted in this patch, enhancing the scalability and conciseness of the code. Link: https://lkml.kernel.org/r/20240405000707.2670063-1-horenchuang@bytedance.com Link: https://lkml.kernel.org/r/20240405000707.2670063-2-horenchuang@bytedance.com Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawie.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Hao Xiang <hao.xiang@bytedance.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: convert free_zone_device_page to free_zone_device_folioMatthew Wilcox (Oracle)3-17/+19
Both callers already have a folio; pass it in and save a few calls to compound_head(). Link: https://lkml.kernel.org/r/20240405153228.2563754-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: combine __folio_put_small, __folio_put_large and __folio_putMatthew Wilcox (Oracle)1-26/+6
It's now obvious that __folio_put_small() and __folio_put_large() do almost exactly the same thing. Inline them both into __folio_put(). Link: https://lkml.kernel.org/r/20240405153228.2563754-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: inline destroy_large_folio() into __folio_put_large()Matthew Wilcox (Oracle)2-17/+10
destroy_large_folio() has only one caller, move its contents there. Link: https://lkml.kernel.org/r/20240405153228.2563754-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: combine free_the_page() and free_unref_page()Matthew Wilcox (Oracle)1-14/+11
The pcp_allowed_order() check in free_the_page() was only being skipped by __folio_put_small() which is about to be rearranged. Link: https://lkml.kernel.org/r/20240405153228.2563754-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: free non-hugetlb large folios in a batchMatthew Wilcox (Oracle)1-2/+2
Patch series "Clean up __folio_put()". With all the changes over the last few years, __folio_put_small and __folio_put_large have become almost identical to each other ... except you can't tell because they're spread over two files. Rearrange it all so that you can tell, and then inline them both into __folio_put(). This patch (of 5): free_unref_folios() can now handle non-hugetlb large folios, so keep normal large folios in the batch. hugetlb folios still need to be handled specially. [peterx@redhat.com: fix panic] Link: https://lkml.kernel.org/r/ZikjPB0Dt5HA8-uL@x1n Link: https://lkml.kernel.org/r/20240405153228.2563754-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240405153228.2563754-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: convert pagecache_isize_extended to use a folioMatthew Wilcox (Oracle)1-19/+17
Remove four hidden calls to compound_head(). Also exit early if the filesystem block size is >= PAGE_SIZE instead of just equal to PAGE_SIZE. Link: https://lkml.kernel.org/r/20240405180038.2618624-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nidFrank van der Linden1-3/+3
The hugetlb_cma code passes 0 in the order_per_bit argument to cma_declare_contiguous_nid (the alignment, computed using the page order, is correctly passed in). This causes a bit in the cma allocation bitmap to always represent a 4k page, making the bitmaps potentially very large, and slower. It would create bitmaps that would be pretty big. E.g. for a 4k page size on x86, hugetlb_cma=64G would mean a bitmap size of (64G / 4k) / 8 == 2M. With HUGETLB_PAGE_ORDER as order_per_bit, as intended, this would be (64G / 2M) / 8 == 4k. So, that's quite a difference. Also, this restricted the hugetlb_cma area to ((PAGE_SIZE << MAX_PAGE_ORDER) * 8) * PAGE_SIZE (e.g. 128G on x86) , since bitmap_alloc uses normal page allocation, and is thus restricted by MAX_PAGE_ORDER. Specifying anything about that would fail the CMA initialization. So, correctly pass in the order instead. Link: https://lkml.kernel.org/r/20240404162515.527802-2-fvdl@google.com Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/cma: drop incorrect alignment check in cma_init_reserved_memFrank van der Linden1-4/+0
cma_init_reserved_mem uses IS_ALIGNED to check if the size represented by one bit in the cma allocation bitmask is aligned with CMA_MIN_ALIGNMENT_BYTES (pageblock size). However, this is too strict, as this will fail if order_per_bit > pageblock_order, which is a valid configuration. We could check IS_ALIGNED both ways, but since both numbers are powers of two, no check is needed at all. Link: https://lkml.kernel.org/r/20240404162515.527802-1-fvdl@google.com Fixes: de9e14eebf33 ("drivers: dma-contiguous: add initialization from device tree") Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26hugetlb: convert hugetlb_wp() to use struct vm_faultVishal Moola (Oracle)1-32/+32
hugetlb_wp() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 5 variables into a single struct. [vishal.moola@gmail.com: simplify hugetlb_wp() arguments] Link: https://lkml.kernel.org/r/ZhQtoFNZBNwBCeXn@fedora Link: https://lkml.kernel.org/r/20240401202651.31440-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26hugetlb: convert hugetlb_no_page() to use struct vm_faultVishal Moola (Oracle)1-32/+31
hugetlb_no_page() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 7 variables into a single struct. [vishal.moola@gmail.com: simplify hugetlb_no_page() arguments] Link: https://lkml.kernel.org/r/ZhQtN8y5zud8iI1u@fedora Link: https://lkml.kernel.org/r/20240401202651.31440-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26hugetlb: convert hugetlb_fault() to use struct vm_faultVishal Moola (Oracle)1-43/+41
Patch series "Hugetlb fault path to use struct vm_fault", v2. This patchset converts the hugetlb fault path to use struct vm_fault. This helps make the code more readable, and alleviates the stack by allowing us to consolidate many fault-related variables into an individual pointer. This patch (of 3): Now that hugetlb_fault() has a vm_fault available for fault tracking, use it throughout. This cleans up the code by removing 2 variables, and prepares hugetlb_fault() to take in a struct vm_fault argument. Link: https://lkml.kernel.org/r/20240401202651.31440-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20240401202651.31440-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: use "GUP-fast" instead "fast GUP" in remaining commentsDavid Hildenbrand2-2/+2
Let's fixup the remaining comments to consistently call that thing "GUP-fast". With this change, we consistently call it "GUP-fast". Link: https://lkml.kernel.org/r/20240402125516.223131-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FASTDavid Hildenbrand3-7/+7
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm/gup: consistently name GUP-fast functionsDavid Hildenbrand1-102/+103
Patch series "mm/gup: consistently call it GUP-fast". Some cleanups around function names, comments and the config option of "GUP-fast" -- GUP without "lock" safety belts on. With this cleanup it's easy to judge which functions are GUP-fast specific. We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP", "lockless", or simply "gup" (which I always considered confusing in the ode). So the magic now happens in functions that contain "gup_fast", whereby gup_fast() is the entry point into that magic. Comments consistently reference either "GUP-fast" or "gup_fast()". This patch (of 3): Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename all relevant internal functions to start with "gup_fast", to make it clearer that this is not ordinary GUP. The current mixture of "lockless", "gup" and "gup_fast" is confusing. Further, avoid the term "huge" when talking about a "leaf" -- for example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that stays. What remains is the "external" interface: * get_user_pages_fast_only() * get_user_pages_fast() * pin_user_pages_fast() The high-level internal functions for GUP-fast (+slow fallback) are now: * internal_get_user_pages_fast() -> gup_fast_fallback() * lockless_pages_from_mm() -> gup_fast() The basic GUP-fast walker functions: * gup_pgd_range() -> gup_fast_pgd_range() * gup_p4d_range() -> gup_fast_p4d_range() * gup_pud_range() -> gup_fast_pud_range() * gup_pmd_range() -> gup_fast_pmd_range() * gup_pte_range() -> gup_fast_pte_range() * gup_huge_pgd() -> gup_fast_pgd_leaf() * gup_huge_pud() -> gup_fast_pud_leaf() * gup_huge_pmd() -> gup_fast_pmd_leaf() The weird hugepd stuff: * gup_huge_pd() -> gup_fast_hugepd() * gup_hugepte() -> gup_fast_hugepte() The weird devmap stuff: * __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf() * __gup_device_huge_pmd -> gup_fast_devmap_pmd_leaf() * __gup_device_huge() -> gup_fast_devmap_leaf() * undo_dev_pagemap() -> gup_fast_undo_dev_pagemap() Helper functions: * unpin_user_pages_lockless() -> gup_fast_unpin_user_pages() * gup_fast_folio_allowed() is already properly named * gup_fast_permitted() is already properly named With "gup_fast()", we now even have a function that is referred to in comment in mm/mmu_gather.c. Link: https://lkml.kernel.org/r/20240402125516.223131-1-david@redhat.com Link: https://lkml.kernel.org/r/20240402125516.223131-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26hugetlb: convert alloc_buddy_hugetlb_folio to use a folioMatthew Wilcox (Oracle)1-17/+16
While this function returned a folio, it was still using __alloc_pages() and __free_pages(). Use __folio_alloc() and put_folio() instead. This actually removes a call to compound_head(), but more importantly, it prepares us for the move to memdescs. Link: https://lkml.kernel.org/r/20240402200656.913841-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: remove struct page from get_shadow_from_swap_cacheMatthew Wilcox (Oracle)1-4/+4
We don't actually use any parts of struct page; all we do is check the value of the pointer. So give the pointer the appropriate name & type. Link: https://lkml.kernel.org/r/20240402201659.918308-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLDRyan Roberts3-41/+62
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc) we now leave it as is and move to the next pte in the range and continue work on the proceeding folios. Previously any failure of this sort would cause the entire operation to give up and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would have likely lead to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficent than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: vmscan: avoid split during shrink_folio_list()Ryan Roberts1-10/+10
Now that swap supports storing all mTHP sizes, avoid splitting large folios before swap-out. This benefits performance of the swap-out path by eliding split_folio_to_list(), which is expensive, and also sets us up for swapping in large folios in a future series. If the folio is partially mapped, we continue to split it since we want to avoid the extra IO overhead and storage of writing out pages uneccessarily. THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count events only for PMD-mappable folios to avoid user confusion. THP_SWPOUT already has the appropriate guard. Add a guard for THP_SWPOUT_FALLBACK. It may be appropriate to add per-size counters in future. Link: https://lkml.kernel.org/r/20240408183946.2991168-7-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: swap: allow storage of all mTHP ordersRyan Roberts1-71/+91
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocates individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: swap: update get_swap_pages() to take folio orderRyan Roberts2-9/+10
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it. Link: https://lkml.kernel.org/r/20240408183946.2991168-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: swap: simplify struct percpu_clusterRyan Roberts1-11/+11
struct percpu_cluster stores the index of cpu's current cluster and the offset of the next entry that will be allocated for the cpu. These two pieces of information are redundant because the cluster index is just (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the cluster index is because the structure used for it also has a flag to indicate "no cluster". However this data structure also contains a spin lock, which is never used in this context, as a side effect the code copies the spinlock_t structure, which is questionable coding practice in my view. So let's clean this up and store only the next offset, and use a sentinal value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen legitimately; The first page in the swap file is the swap header, which is always marked bad to prevent it from being allocated as an entry. This also prevents the cluster to which it belongs being marked free, so it will never appear on the free list. This change saves 16 bytes per cpu. And given we are shortly going to extend this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the system. Link: https://lkml.kernel.org/r/20240408183946.2991168-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()Ryan Roberts4-28/+158
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flagsRyan Roberts2-42/+8
Patch series "Swap-out mTHP without splitting", v7. This series adds support for swapping out multi-size THP (mTHP) without needing to first split the large folio via split_huge_page_to_list_to_order(). It closely follows the approach already used to swap-out PMD-sized THP. There are a couple of reasons for swapping out mTHP without splitting: - Performance: It is expensive to split a large folio and under extreme memory pressure some workloads regressed performance when using 64K mTHP vs 4K small folios because of this extra cost in the swap-out path. This series not only eliminates the regression but makes it faster to swap out 64K mTHP vs 4K small folios. - Memory fragmentation avoidance: If we can avoid splitting a large folio memory is less likely to become fragmented, making it easier to re-allocate a large folio in future. - Performance: Enables a separate series [7] to swap-in whole mTHPs, which means we won't lose the TLB-efficiency benefits of mTHP once the memory has been through a swap cycle. I've done what I thought was the smallest change possible, and as a result, this approach is only employed when the swap is backed by a non-rotating block device (just as PMD-sized THP is supported today). Discussion against the RFC concluded that this is sufficient. Performance Testing =================== I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The VM is set up with a 35G block ram device as the swap device and the test is run from inside a memcg limited to 40G memory. I've then run `usemem` from vm-scalability with 70 processes, each allocating and writing 1G of memory. I've repeated everything 6 times and taken the mean performance improvement relative to 4K page baseline: | alloc size | baseline | + this series | | | mm-unstable (~v6.9-rc1) | | |:-----------|------------------------:|------------------------:| | 4K Page | 0.0% | 1.3% | | 64K THP | -13.6% | 46.3% | | 2M THP | 91.4% | 89.6% | So with this change, the 64K swap performance goes from a 14% regression to a 46% improvement. While 2M shows a small regression I'm confident that this is just noise. [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/ [4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/ [5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/ [6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.com/ [7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ [8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/ [9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/ This patch (of 7): As preparation for supporting small-sized THP in the swap-out path, without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE, which, when present, always implies PMD-sized THP, which is the same as the cluster size. The only use of the flag was to determine whether a swap entry refers to a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead of relying on the flag, we now pass in order, which originates from the folio's order. This allows the logic to work for folios of any order. The one snag is that one of the swap_page_trans_huge_swapped() call sites does not have the folio. But it was only being called there to shortcut a call __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets the folio and (via some other functions) calls swap_page_trans_huge_swapped(). So I've removed the problematic call site and believe the new logic should be functionally equivalent. That said, removing the fast path means that we will take a reference and trylock a large folio much more often, which we would like to avoid. The next patch will solve this. Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster() which used to be called during folio splitting, since split_swap_cluster()'s only job was to remove the flag. Link: https://lkml.kernel.org/r/20240408183946.2991168-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: page_alloc: use the correct THP order for THP PCPBaolin Wang1-3/+3
Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") extends the PCP allocator to store THP pages, and it determines whether to cache THP pages in PCP by comparing with pageblock_order. But the pageblock_order is not always equal to THP order. It might also be MAX_PAGE_ORDER, which could prevent PCP from caching THP pages. Therefore, using HPAGE_PMD_ORDER instead to determine the need for caching THP for PCP will fix this issue Link: https://lkml.kernel.org/r/a25c9e14cd03907d5978b60546a69e6aa3fc2a7d.1712151833.git.baolin.wang@linux.alibaba.com Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Barry Song <baohua@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26khugepaged: use a folio throughout hpage_collapse_scan_file()Matthew Wilcox (Oracle)1-17/+16
Replace the use of pages with folios. Saves a few calls to compound_head() and removes some uses of obsolete functions. Link: https://lkml.kernel.org/r/20240403171838.1445826-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26khugepaged: use a folio throughout collapse_file()Matthew Wilcox (Oracle)1-59/+54
Pull folios from the page cache instead of pages. Half of this work had been done already, but we were still operating on pages for a large chunk of this function. There is no attempt in this patch to handle large folios that are smaller than a THP; that will have to wait for a future patch. [willy@infradead.org: the unlikely() is embedded in IS_ERR()] Link: https://lkml.kernel.org/r/ZhIWX8K0E2tSyMSr@casper.infradead.org Link: https://lkml.kernel.org/r/20240403171838.1445826-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26khugepaged: remove hpage from collapse_file()Matthew Wilcox (Oracle)1-38/+39
Use new_folio throughout where we had been using hpage. Link: https://lkml.kernel.org/r/20240403171838.1445826-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>