summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-05-06mm/memory-failure: add some folio conversions to unpoison_memoryMatthew Wilcox (Oracle)1-4/+4
Some of these folio APIs didn't exist when the unpoison_memory() conversion was done originally. Link: https://lkml.kernel.org/r/20240412193510.2356957-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory-failure: convert hwpoison_user_mappings to take a folioMatthew Wilcox (Oracle)1-15/+15
Pass the folio from the callers, and use it throughout instead of hpage. Saves dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory-failure: convert memory_failure() to use a folioMatthew Wilcox (Oracle)1-19/+21
Saves dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: convert hugetlb_page_mapping_lock_write to folioMatthew Wilcox (Oracle)4-8/+8
The page is only used to get the mapping, so the folio will do just as well. Both callers already have a folio available, so this saves a call to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu  <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory-failure: convert shake_page() to shake_folio()Matthew Wilcox (Oracle)4-11/+17
Removes two calls to compound_head(). Move the prototype to internal.h; we definitely don't want code outside mm using it. Link: https://lkml.kernel.org/r/20240412193510.2356957-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: make page_mapped_in_vma conditional on CONFIG_MEMORY_FAILUREMatthew Wilcox (Oracle)1-0/+2
This function is only currently used by the memory-failure code, so we can omit it if we're not compiling in the memory-failure code. Link: https://lkml.kernel.org/r/20240412193510.2356957-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: return the address from page_mapped_in_vma()Matthew Wilcox (Oracle)3-17/+23
The only user of this function calls page_address_in_vma() immediately after page_mapped_in_vma() calculates it and uses it to return true/false. Return the address instead, allowing memory-failure to skip the call to page_address_in_vma(). Link: https://lkml.kernel.org/r/20240412193510.2356957-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory-failure: pass addr to __add_to_kill()Matthew Wilcox (Oracle)1-2/+4
Handle anon/file folios the same way as KSM & DAX folios by passing in the address. Link: https://lkml.kernel.org/r/20240412193510.2356957-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory-failure: remove fsdax_pgoff argument from __add_to_killMatthew Wilcox (Oracle)1-18/+9
Patch series "Some cleanups for memory-failure", v3. A lot of folio conversions, plus some other simplifications. This patch (of 11): Unify the KSM and DAX codepaths by calculating the addr in add_to_kill_fsdax() instead of telling __add_to_kill() to calculate it. Link: https://lkml.kernel.org/r/20240412193510.2356957-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240412193510.2356957-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06xarray: don't use "proxy" headersAndy Shevchenko1-1/+5
Update header inclusions to follow IWYU (Include What You Use) principle. Link: https://lkml.kernel.org/r/20240423142204.2408923-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06xarray: use BITS_PER_LONGS()Andy Shevchenko1-1/+1
Patch series "xarray: Clean up xarray.h". Main portion of this change is to get rid of kernel.h included into other globally available headers. This decreases a dependency hell degree. The first patch makes it possible to avoid math.h to be included as bitops.h is implied by bitmap.h. This patch (of 2): Use BITS_PER_LONGS() instead of open coded variant. Link: https://lkml.kernel.org/r/20240423142204.2408923-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06memcg: simple cleanup of stats update functionsShakeel Butt3-35/+15
mod_memcg_lruvec_state() is never called from outside of memcontrol.c and with always irq disabled. So, replace it with the irq disabled version and add an assert that irq is disabled in the caller. Similarly mod_objcg_state() is not called from outside of memcontrol.c, so simply make it static and change it's name to __mod_objcg_state(). Link: https://lkml.kernel.org/r/20240420232505.2768428-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()Kefeng Wang1-5/+5
Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the unnecessary FAULT_FLAG_ORIG_PTE_VALID check/pte_marker_entry_uffd_wp() in most pagefault, note, the function vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions, the difference is shown below, perf date, perf report -i perf.data.before | grep vmf 0.17% 0.13% lat_pagefault [kernel.kallsyms] [k] vmf_orig_pte_uffd_wp.part.0.isra.0 perf report -i perf.data.after | grep vmf lat_pagefault -W 5 -N 5 /tmp/XXX latency before after diff average(8 tests) 0.262675 0.2600375 -0.0026375 Although it's a small, but the uffd_wp is a new feature than previous kernel, when the vma is not registered with UFFD_WP, let's avoid to execute the new logical, also adding __always_inline attribute to vmf_orig_pte_uffd_wp(), which make set_pte_range() only check VM_UFFD_WP flags without the function call. In addition, directly call the vmf_orig_pte_uffd_wp() in do_anonymous_page() and set_pte_range() to save an uffd_wp variable. Link: https://lkml.kernel.org/r/20240422030039.3293568-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/page-flags: make PageUptodate return boolHao Ge1-1/+1
Make PageUptodate return bool to align the return values of folio_test_uptodate function Link: https://lkml.kernel.org/r/20240422032725.41452-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/madvise: optimize lazyfreeing with mTHP in madvise_freeLance Yang1-41/+44
This patch optimizes lazyfreeing with PTE-mapped mTHP[1] (Inspired by David Hildenbrand[2]). We aim to avoid unnecessary folio splitting if the large folio is fully mapped within the target range. If a large folio is locked or shared, or if we fail to split it, we just leave it in place and advance to the next PTE in the range. But note that the behavior is changed; previously, any failure of this sort would cause the entire operation to give up. As large folios become more common, sticking to the old way could result in wasted opportunities. On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of the same size results in the following runtimes for madvise(MADV_FREE) in seconds (shorter is better): Folio Size | Old | New | Change ------------------------------------------ 4KiB | 0.590251 | 0.590259 | 0% 16KiB | 2.990447 | 0.185655 | -94% 32KiB | 2.547831 | 0.104870 | -95% 64KiB | 2.457796 | 0.052812 | -97% 128KiB | 2.281034 | 0.032777 | -99% 256KiB | 2.230387 | 0.017496 | -99% 512KiB | 2.189106 | 0.010781 | -99% 1024KiB | 2.183949 | 0.007753 | -99% 2048KiB | 0.002799 | 0.002804 | 0% [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com Link: https://lkml.kernel.org/r/20240418134435.6092-5-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/memory: add any_dirty optional pointer to folio_pte_batch()Lance Yang3-9/+26
This commit adds the any_dirty pointer as an optional parameter to folio_pte_batch() function. By using both the any_young and any_dirty pointers, madvise_free can make smarter decisions about whether to clear the PTEs when marking large folios as lazyfree. Link: https://lkml.kernel.org/r/20240418134435.6092-4-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/arm64: override clear_young_dirty_ptes() batch helperLance Yang2-0/+84
The per-pte get_and_clear/modify/set approach would result in unfolding/refolding for contpte mappings on arm64. So we need to override clear_young_dirty_ptes() for arm64 to avoid it. Link: https://lkml.kernel.org/r/20240418134435.6092-3-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Barry Song <21cnbao@gmail.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/madvise: introduce clear_young_dirty_ptes() batch helperLance Yang3-31/+55
Patch series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free", v10. This patchset adds support for lazyfreeing multi-size THP (mTHP) without needing to first split the large folio via split_folio(). However, we still need to split a large folio that is not fully mapped within the target range. If a large folio is locked or shared, or if we fail to split it, we just leave it in place and advance to the next PTE in the range. But note that the behavior is changed; previously, any failure of this sort would cause the entire operation to give up. As large folios become more common, sticking to the old way could result in wasted opportunities. Performance Testing =================== On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of the same size results in the following runtimes for madvise(MADV_FREE) in seconds (shorter is better): Folio Size | Old | New | Change ------------------------------------------ 4KiB | 0.590251 | 0.590259 | 0% 16KiB | 2.990447 | 0.185655 | -94% 32KiB | 2.547831 | 0.104870 | -95% 64KiB | 2.457796 | 0.052812 | -97% 128KiB | 2.281034 | 0.032777 | -99% 256KiB | 2.230387 | 0.017496 | -99% 512KiB | 2.189106 | 0.010781 | -99% 1024KiB | 2.183949 | 0.007753 | -99% 2048KiB | 0.002799 | 0.002804 | 0% This patch (of 4): This commit introduces clear_young_dirty_ptes() to replace mkold_ptes(). By doing so, we can use the same function for both use cases (madvise_pageout and madvise_free), and it also provides the flexibility to only clear the dirty flag in the future if needed. Link: https://lkml.kernel.org/r/20240418134435.6092-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240418134435.6092-2-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: swapfile: check usable swap device in __folio_throttle_swaprate()Kefeng Wang1-3/+10
Skip blk_cgroup_congested() if there is no usable swap device since no swapin/out will occur, Thereby avoid taking swap_lock. The difference is shown below from perf date of CoW pagefault, perf report -g -i perf.data.swapon | egrep "blk_cgroup_congested|__folio_throttle_swaprate" 1.01% 0.16% page_fault2_pro [kernel.kallsyms] [k] __folio_throttle_swaprate 0.83% 0.80% page_fault2_pro [kernel.kallsyms] [k] blk_cgroup_congested perf report -g -i perf.data.swapoff | egrep "blk_cgroup_congested|__folio_throttle_swaprate" 0.15% 0.15% page_fault2_pro [kernel.kallsyms] [k] __folio_throttle_swaprate Link: https://lkml.kernel.org/r/20240418135644.2736748-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/huge_memory: improve split_huge_page_to_list_to_order() return value ↵David Hildenbrand1-3/+11
documentation The documentation is wrong and relying on it almost resulted in BUGs in new callers: ever since fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt") we return -EAGAIN on unexpected folio references, not -EBUSY. Let's fix that and also document which other return values we can currently see and why they could happen. [david@redhat.com: v2] Link: https://lkml.kernel.org/r/20240422194217.442933-1-david@redhat.com Link: https://lkml.kernel.org/r/20240418151834.216557-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/page_table_check: support userfault wr-protect entriesPeter Xu3-18/+39
Allow page_table_check hooks to check over userfaultfd wr-protect criteria upon pgtable updates. The rule is no co-existance allowed for any writable flag against userfault wr-protect flag. This should be better than c2da319c2e, where we used to only sanitize such issues during a pgtable walk, but when hitting such issue we don't have a good chance to know where does that writable bit came from [1], so that even the pgtable walk exposes a kernel bug (which is still helpful on triaging) but not easy to track and debug. Now we switch to track the source. It's much easier too with the recent introduction of page table check. There are some limitations with using the page table check here for userfaultfd wr-protect purpose: - It is only enabled with explicit enablement of page table check configs and/or boot parameters, but should be good enough to track at least syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for x86 [1]. We used to have DEBUG_VM but it's now off for most distros, while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which is similar. - It conditionally works with the ptep_modify_prot API. It will be bypassed when e.g. XEN PV is enabled, however still work for most of the rest scenarios, which should be the common cases so should be good enough. - Hugetlb check is a bit hairy, as the page table check cannot identify hugetlb pte or normal pte via trapping at set_pte_at(), because of the current design where hugetlb maps every layers to pte_t... For example, the default set_huge_pte_at() can invoke set_pte_at() directly and lose the hugetlb context, treating it the same as a normal pte_t. So far it's fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as long as supported (x86 only). It'll be a bigger problem when we'll define _PAGE_UFFD_WP differently at various pgtable levels, because then one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now we can leave this for later too. This patch also removes commit c2da319c2e altogether, as we have something better now. [1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/ Link: https://lkml.kernel.org/r/20240417212549.2766883-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/hugetlb: assert hugetlb_lock in __hugetlb_cgroup_commit_chargePeter Xu1-1/+1
This is similar to __hugetlb_cgroup_uncharge_folio() where it relies on holding hugetlb_lock. Add the similar assertion like the other one, since it looks like such things may help some day. Link: https://lkml.kernel.org/r/20240417211836.2742593-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06fs/proc/task_mmu: convert smaps_hugetlb_range() to work on foliosDavid Hildenbrand1-6/+7
Let's get rid of another page_mapcount() check and simply use folio_likely_mapped_shared(), which is precise for hugetlb folios. While at it, use huge_ptep_get() + pte_page() instead of ptep_get() + vm_normal_page(), just like we do in pagemap_hugetlb_range(). No functional change intended. Link: https://lkml.kernel.org/r/20240417092313.753919-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06fs/proc/task_mmu: convert pagemap_hugetlb_range() to work on foliosDavid Hildenbrand1-3/+4
Patch series "fs/proc/task_mmu: convert hugetlb functions to work on folis". Let's convert two more functions, getting rid of two more page_mapcount() calls. This patch (of 2): Let's get rid of another page_mapcount() check and simply use folio_likely_mapped_shared(), which is precise for hugetlb folios. While at it, also check for PMD table sharing, like we do in smaps_hugetlb_range(). No functional change intended, except that we would now detect hugetlb folios shared via PMD table sharing correctly. Link: https://lkml.kernel.org/r/20240417092313.753919-1-david@redhat.com Link: https://lkml.kernel.org/r/20240417092313.753919-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/sparse: guard the size of mem_section is power of 2Wei Yang1-0/+2
We usually have this check, while commit 2a3cb8baef71 ("mm/sparse: delete old sparse_init and enable new one") missed to take it. Link: https://lkml.kernel.org/r/20240416012559.4536-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06doc: split buffer.rst out of api-summary.rstMatthew Wilcox (Oracle)3-3/+13
Buffer heads are no longer a generic filesystem API but an optional filesystem support library. Make the documentation structure reflect that, and include the fine documentation kept in buffer_head.h. We could give a better overview of what buffer heads are all about, but my enthusiasm for documenting it is limited. [willy@infradead.org: fix kerneldoc warning] Link: https://lkml.kernel.org/r/20240417015933.453505-1-willy@infradead.org [akpm@linux-foundation.org: remove newline at EOF] Link: https://lkml.kernel.org/r/20240416031754.4076917-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06buffer: improve bdev_getblk documentationMatthew Wilcox (Oracle)1-0/+5
Add some more information about the state of the buffer_head returned. Link: https://lkml.kernel.org/r/20240416031754.4076917-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06buffer: add kernel-doc for bforget() and __bforget()Matthew Wilcox (Oracle)2-3/+16
Distinguish these functions from brelse() and __brelse(). Link: https://lkml.kernel.org/r/20240416031754.4076917-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06buffer: add kernel-doc for brelse() and __brelse()Matthew Wilcox (Oracle)2-9/+24
Move the documentation for __brelse() to brelse(), format it as kernel-doc and update it from talking about pages to folios. Link: https://lkml.kernel.org/r/20240416031754.4076917-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06buffer: fix __bread and __bread_gfp kernel-docMatthew Wilcox (Oracle)2-22/+35
The extra indentation confused the kernel-doc parser, so remove it. Fix some other wording while I'm here, and advise the user they need to call brelse() on this buffer. __bread_gfp() isn't used directly by filesystems, but the other wrappers for it don't have documentation, so document it accordingly. Link: https://lkml.kernel.org/r/20240416031754.4076917-5-willy@infradead.org Co-developed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06buffer: add kernel-doc for try_to_free_buffers()Matthew Wilcox (Oracle)1-20/+24
The documentation for this function has become separated from it over time; move it to the right place and turn it into kernel-doc. Mild editing of the content to make it more about what the function does, and less about how it does it. Link: https://lkml.kernel.org/r/20240416031754.4076917-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06buffer: add kernel-doc for block_dirty_folio()Matthew Wilcox (Oracle)1-24/+31
Turn the excellent documentation for this function into kernel-doc. Replace 'page' with 'folio' and make a few other minor updates. Link: https://lkml.kernel.org/r/20240416031754.4076917-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06doc: improve the description of __folio_mark_dirtyMatthew Wilcox (Oracle)1-5/+9
Patch series "Improve buffer head documentation", v3. Turn buffer head documentation into its own document, and make many general improvements to the docs. Obviously there is much more that could be done. Tested with make htmldocs. This patch (of 8): I've learned why it's safe to call __folio_mark_dirty() from mark_buffer_dirty() without holding the folio lock, so update the description to explain why. Link: https://lkml.kernel.org/r/20240416031754.4076917-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240416031754.4076917-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06xarray: inline xas_descend to improve performanceLong Li1-1/+2
The commit 63b1898fffcd ("XArray: Disallow sibling entries of nodes") modified the xas_descend function in such a way that it was no longer being compiled as an inline function, because it increased the size of xas_descend(), and the compiler no longer optimizes it as inline. This had a negative impact on performance, xas_descend is called frequently to traverse downwards in the xarray tree, making it a hot function. Inlining xas_descend has been shown to significantly improve performance by approximately 4.95% in the iozone write test. Machine: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz #iozone i 0 -i 1 -s 64g -r 16m -f /test/tmptest Before this patch: kB reclen write rewrite read reread 67108864 16384 2230080 3637689 6315197 5496027 After this patch: kB reclen write rewrite read reread 67108864 16384 2340360 3666175 6272401 5460782 Percentage change: 4.95% 0.78% -0.68% -0.64% This patch introduces inlining to the xas_descend function. While this change increases the size of lib/xarray.o, the performance gains in critical workloads make this an acceptable trade-off. Size comparison before and after patch: .text .data .bss file 0x3502 0 0 lib/xarray.o.before 0x3602 0 0 lib/xarray.o.after Link: https://lkml.kernel.org/r/20240416061628.3768901-1-leo.lilong@huawei.com Signed-off-by: Long Li <leo.lilong@huawei.com> Cc: Hou Tao <houtao1@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: yangerkun <yangerkun@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: remove page_mapcount() usage in stable_tree_search()David Hildenbrand1-5/+8
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. If our folio has a stable node, it is a (small) KSM folio -- see folio_stable_node(). Let's use folio_mapcount() in stable_tree_search() instead, which results in no functional change. The mapcount > 1 check is a bit confusing, because that's usually a check for page sharing. Looks like the reason is that we are guaranteed to not exceed ksm_max_page_sharing for the tree KSM folio when merging with that. Let's update the documentation to make that clearer. Link: https://lkml.kernel.org/r/20240416172533.663418-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: zswap: remove same_filled module paramsYosry Ahmed3-50/+0
These knobs offer more fine-grained control to userspace than needed and directly expose/influence kernel implementation; remove them. For disabling same_filled handling, there is no logical reason to refuse storing same-filled pages more efficiently and opt for compression. Scanning pages for patterns may be an argument, but the page contents will be read into the CPU cache anyway during compression. Also, removing the same_filled handling code does not move the needle significantly in terms of performance anyway [1]. For disabling non_same_filled handling, it was added when the compressed pages in zswap were not being properly charged to memcgs, as workloads could escape the accounting with compression [2]. This is no longer the case after commit f4840ccfca25 ("zswap: memcg accounting"), and using zswap without compression does not make much sense. [1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/ [2]https://lore.kernel.org/lkml/19d5cdee-2868-41bd-83d5-6da75d72e940@maciej.szmigiero.name/ [yosryahmed@google.com: remove same_filled_pages from docs] Link: https://lkml.kernel.org/r/ZhxFVggdyvCo79jc@google.com Link: https://lkml.kernel.org/r/20240413022407.785696-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: zswap: move more same-filled pages checks outside of zswap_store()Yosry Ahmed1-20/+25
Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps the folio, then calls zswap_is_page_same_filled() to check the folio contents. Move this logic into zswap_is_page_same_filled() as well (and rename it to use 'folio' while we are at it). This makes zswap_store() cleaner, and makes following changes to that logic contained within the helper. While we are at it: - Rename the insert_entry label to store_entry to match xa_store(). - Add comment headers for same-filled functions and the main API functions (load, store, invalidate, swapon, swapoff). No functional change intended. Link: https://lkml.kernel.org/r/20240413022407.785696-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: zswap: refactor limit checking from zswap_store()Yosry Ahmed1-16/+16
Refactor limit and acceptance threshold checking outside of zswap_store(). This code will be moved around in a following patch, so it would be cleaner to move a function call around. Link: https://lkml.kernel.org/r/20240413022407.785696-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: zswap: always shrink in zswap_store() if zswap_pool_reached_fullYosry Ahmed1-6/+4
Patch series "zswap same-filled and limit checking cleanups", v3. Miscellaneous cleanups for limit checking and same-filled handling in the store path. This series was broken out of the "zswap: store zero-filled pages more efficiently" series [1]. It contains the cleanups and drops the main functional changes. [1]https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/ This patch (of 4): The cleanup code in zswap_store() is not pretty, particularly the 'shrink' label at the bottom that ends up jumping between cleanup labels. Instead of having a dedicated label to shrink the pool, just use zswap_pool_reached_full directly to figure out if the pool needs shrinking. zswap_pool_reached_full should be true if and only if the pool needs shrinking. The only caveat is that the value of zswap_pool_reached_full may be changed by concurrent zswap_store() calls between checking the limit and testing zswap_pool_reached_full in the cleanup code. This is fine because: - If zswap_pool_reached_full was true during limit checking then became false during the cleanup code, then someone else already took care of shrinking the pool and there is no need to queue the worker. That would be a good change. - If zswap_pool_reached_full was false during limit checking then became true during the cleanup code, then someone else hit the limit meanwhile. In this case, both threads will try to queue the worker, but it never gets queued more than once anyway. Also, calling queue_work() multiple times when the limit is hit could already happen today, so this isn't a significant change in any way. Link: https://lkml.kernel.org/r/20240413022407.785696-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240413022407.785696-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06userfaultfd: remove WRITE_ONCE when setting folio->index during UFFDIO_MOVESuren Baghdasaryan2-2/+2
When folio is moved with UFFDIO_MOVE it gets locked before the rmap and index are modified. Due to the folio lock being already held, WRITE_ONCE() is not needed when setting the folio index. Remove it. Link: https://lkml.kernel.org/r/20240415020821.1152951-1-surenb@google.com Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: page_alloc: allowing mTHP compaction to capture the freed page directlyBaolin Wang1-2/+4
Currently, compaction_capture() does not allow lower-order allocations to directly capture the movable free pages, even though lower-order allocations might also be requesting movable pages, that can lead to more compaction scanning. And, with the enablement of mTHP, such situations will become more common. Thus allowing lower-order (mTHP) allocations of movable page types directly capture the movable free pages can avoid unnecessary compaction scanning, meanwhile that won't pollute the movable pageblock. With testing 1M mTHP compaction, it can be seen that compaction scanning is significantly reduced. mm-unstable patched Ops Compaction pages isolated 116598741.00 120946702.00 Ops Compaction migrate scanned 1764870054.00 1488621550.00 Ops Compaction free scanned 7707879039.00 4986299318.00 Ops Compact scan efficiency 22.90 29.85 Ops Compaction cost 73797.69 72933.48 Link: https://lkml.kernel.org/r/8118a5d66a034736a48433beddaca60ed78577c4.1712892329.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: filemap: batch mm counter updating in filemap_map_pages()Kefeng Wang1-9/+12
Like copy_pte_range()/zap_pte_range(), make mm counter batch updating in filemap_map_pages(), since folios type are same(MM_SHMEMPAGES or MM_FILEPAGES) in filemap_map_pages(), only check the first folio type is enough, the 'lat_pagefault -P 1 file' test from lmbench shows 12% improvement, and the percpu_counter_add_batch() is gone from perf flame graph. Link: https://lkml.kernel.org/r/20240412064751.119015-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: move mm counter updating out of set_pte_range()Kefeng Wang2-3/+9
Patch series "mm: batch mm counter updating in filemap_map_pages()", v3. Let's batch mm counter updating to accelerate filemap_map_pages(). This patch (of 2): In order to support batch mm counter updating in filemap_map_pages(), move mm counter updating out of set_pte_range(), the folios are file from filemap, and distinguish folios by vmf->flags and vma->vm_flags from another caller finish_fault(). Link: https://lkml.kernel.org/r/20240412064751.119015-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240412064751.119015-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: correct the docs for thp_fault_alloc and thp_fault_fallbackBarry Song1-2/+2
The documentation does not align with the code. In __do_huge_pmd_anonymous_page(), THP_FAULT_FALLBACK is incremented when mem_cgroup_charge() fails, despite the allocation succeeding, whereas THP_FAULT_ALLOC is only incremented after a successful charge. Link: https://lkml.kernel.org/r/20240412114858.407208-5-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: add docs for per-order mTHP counters and transhuge_page ABIBarry Song2-0/+46
This patch includes documentation for mTHP counters and an ABI file for sys-kernel-mm-transparent-hugepage, which appears to have been missing for some time. [v-songbaohua@oppo.com: fix the name and unexpected indentation] Link: https://lkml.kernel.org/r/20240415054538.17071-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: add per-order mTHP anon_swpout and anon_swpout_fallback countersBarry Song4-0/+10
This helps to display the fragmentation situation of the swapfile, knowing the proportion of how much we haven't split large folios. So far, we only support non-split swapout for anon memory, with the possibility of expanding to shmem in the future. So, we add the "anon" prefix to the counter names. Link: https://lkml.kernel.org/r/20240412114858.407208-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm: add per-order mTHP anon_fault_alloc and anon_fault_fallback countersBarry Song3-0/+78
Patch series "mm: add per-order mTHP alloc and swpout counters", v6. The patchset introduces a framework to facilitate mTHP counters, starting with the allocation and swap-out counters. Currently, only four new nodes are appended to the stats directory for each mTHP size. /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats anon_fault_alloc anon_fault_fallback anon_fault_fallback_charge anon_swpout anon_swpout_fallback These nodes are crucial for us to monitor the fragmentation levels of both the buddy system and the swap partitions. In the future, we may consider adding additional nodes for further insights. This patch (of 4): Profiling a system blindly with mTHP has become challenging due to the lack of visibility into its operations. Presenting the success rate of mTHP allocations appears to be pressing need. Recently, I've been experiencing significant difficulty debugging performance improvements and regressions without these figures. It's crucial for us to understand the true effectiveness of mTHP in real-world scenarios, especially in systems with fragmented memory. This patch establishes the framework for per-order mTHP counters. It begins by introducing the anon_fault_alloc and anon_fault_fallback counters. Additionally, to maintain consistency with thp_fault_fallback_charge in /proc/vmstat, this patch also tracks anon_fault_fallback_charge when mem_cgroup_charge fails for mTHP. Incorporating additional counters should now be straightforward as well. Link: https://lkml.kernel.org/r/20240412114858.407208-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/hugetlb: rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folios()Sidhartha Kumar3-5/+5
dissolve_free_huge_pages() only uses folios internally, rename it to dissolve_free_hugetlb_folios() and change the comments which reference it. [akpm@linux-foundation.org: remove unneeded `extern'] Link: https://lkml.kernel.org/r/20240412182139.120871-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/hugetlb: convert dissolve_free_huge_pages() to foliosSidhartha Kumar3-15/+14
Allows us to rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folio(). Convert one caller to pass in a folio directly and use page_folio() to convert the caller in mm/memory-failure. [sidhartha.kumar@oracle.com: remove unneeded `extern'] Link: https://lkml.kernel.org/r/71760ed4-e80d-493a-95ea-2545414b1aba@oracle.com [sidhartha.kumar@oracle.com: v2] Link: https://lkml.kernel.org/r/20240412182139.120871-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20240411164756.261178-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-06mm/ksm: replace set_page_stable_node by folio_set_stable_nodeAlex Shi (tencent)1-9/+3
Only single page could be reached where we set stable node after write protect, so use folio converted func to replace page's. And remove the unused func set_page_stable_node(). Link: https://lkml.kernel.org/r/20240411061713.1847574-11-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>