path: root/mm/Kconfig

2023-06-30  Merge tag 'slab-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab  (Linus Torvalds; 1 file, -4/+14)

Pull slab updates from Vlastimil Babka:

 - SLAB deprecation: Following the discussion at LSF/MM 2023 [1] and no objections, the SLAB allocator is deprecated by renaming the config option (to make its users notice) to CONFIG_SLAB_DEPRECATED with updated help text. SLUB should be used instead. Existing defconfigs with CONFIG_SLAB are also updated.

 - SLAB_NO_MERGE kmem_cache flag (Jesper Dangaard Brouer): There are (very limited) cases where kmem_cache merging is undesirable, and existing ways to prevent it are hacky. Introduce a new flag to do that cleanly and convert the existing hacky users. Btrfs plans to use this for debug kernel builds (that use case is always fine), networking for performance reasons (that should be very rare).

 - Replace the usage of weak PRNGs (David Keisar Schmidt): In addition to using stronger RNGs for the security related features, the code is a bit cleaner.

 - Misc code cleanups (SeongJae Park, Xiongwei Song, Zhen Lei, and zhaoxinchao)

Link: https://lwn.net/Articles/932201/ [1]

* tag 'slab-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab_common: use SLAB_NO_MERGE instead of negative refcount
  mm/slab: break up RCU readers on SLAB_TYPESAFE_BY_RCU example code
  mm/slab: add a missing semicolon on SLAB_TYPESAFE_BY_RCU example code
  mm/slab_common: reduce an if statement in create_cache()
  mm/slab: introduce kmem_cache flag SLAB_NO_MERGE
  mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED
  mm/slab: remove HAVE_HARDENED_USERCOPY_ALLOCATOR
  mm/slab_common: Replace invocation of weak PRNG
  mm/slab: Replace invocation of weak PRNG
  slub: Don't read nr_slabs and total_objects directly
  slub: Remove slabs_node() function
  slub: Remove CONFIG_SMP defined check
  slub: Put objects_show() into CONFIG_SLUB_DEBUG enabled block
  slub: Correct the error code when slab_kset is NULL
  mm/slab: correct return values in comment for _kmem_cache_create()
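
For illustration, a minimal sketch of how a cache opts out of merging with the new SLAB_NO_MERGE flag; the cache name and object type below are made up, only the flag itself comes from this pull:

    #include <linux/slab.h>

    struct my_obj {                 /* hypothetical object type */
            int payload;
    };

    static struct kmem_cache *my_cache;

    static int __init my_cache_init(void)
    {
            /* SLAB_NO_MERGE keeps this cache from being merged with
             * other compatible caches, e.g. so its per-cache debug
             * statistics stay meaningful. */
            my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
                                         0, SLAB_NO_MERGE, NULL);
            return my_cache ? 0 : -ENOMEM;
    }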

2023-06-29  Merge branch 'expand-stack'  (Linus Torvalds; 1 file, -0/+4)

This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout.

It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking.

And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward.

That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops.

It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently:

 - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike.

 - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong.

 - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case.

None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store.

So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it.

Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages.

And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds.

That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP.

So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's an obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stacks are special, because at execve time even they grow down".

The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead of having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP.

And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case).

In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace().

Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around.

Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal.

Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates.

Reported-by: Ruihan Li <lrh2000@pku.edu.cn>

* branch 'expand-stack':
  gup: add warning if some caller would seem to want stack expansion
  mm: always expand the stack with the mmap write lock held
  execve: expand new process stack manually ahead of time
  mm: make find_extend_vma() fail if write lock not held
  powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
  mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
  arm/mm: Convert to using lock_mm_and_find_vma()
  riscv/mm: Convert to using lock_mm_and_find_vma()
  mips/mm: Convert to using lock_mm_and_find_vma()
  powerpc/mm: Convert to using lock_mm_and_find_vma()
  arm64/mm: Convert to using lock_mm_and_find_vma()
  mm: make the page fault mmap locking killable
  mm: introduce new 'lock_mm_and_find_vma()' page fault helper
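
A hedged sketch of the new expand_stack() calling convention described above; the caller shape is illustrative, not taken verbatim from any converted architecture:

    /* Entered with the mmap_lock held for reading. expand_stack() may
     * internally drop it and upgrade to a write lock; on success it
     * returns with the lock held for reading again, on failure it
     * returns NULL with the lock dropped entirely. */
    static int touch_stack_page(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            mmap_read_lock(mm);
            vma = expand_stack(mm, addr);
            if (!vma)
                    return -EFAULT;   /* mmap_lock already released */

            /* ... access memory under the read lock ... */
            mmap_read_unlock(mm);
            return 0;
    }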

2023-06-25  mm: introduce new 'lock_mm_and_find_vma()' page fault helper  (Linus Torvalds; 1 file, -0/+4)

.. and make x86 use it.

This basically extracts the existing x86 "find and expand faulting vma" code, but extends it to also take the mmap lock for writing in case we actually do need to expand the vma.

We've historically short-circuited that case, and have some rather ugly special logic to serialize the stack segment expansion (since we only hold the mmap lock for reading) that doesn't match the normal VM locking.

That slight violation of locking worked well, right up until it didn't: the maple tree code really does want proper locking even for simple extension of an existing vma.

So extract the code for "look up the vma of the fault" from x86, fix it up to do the necessary write locking, and make it available as a helper function for other architectures that can use the common helper.

Note: I say "common helper", but it really only handles the normal stack-grows-down case. Which is all architectures except for PA-RISC and IA64. So some rare architectures can't use the helper, but if they care they'll just need to open-code this logic.

It's also worth pointing out that this code really would like to have an optimistic "mmap_upgrade_trylock()" to make it quicker to go from a read-lock (for the common case) to taking the write lock (for having to extend the vma) in the normal single-threaded situation where there is no other locking activity. But that _is_ all the very uncommon special case, so while it would be nice to have such an operation, it probably doesn't matter in reality. I did put in the skeleton code for such a possible future expansion, even if it only acts as pseudo-documentation for what we're doing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
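
A condensed, hedged sketch of the fault-path pattern the helper centralizes; the in-tree version in mm/memory.c additionally handles killable locking and the trylock-upgrade skeleton mentioned above, so treat this as pseudo-documentation rather than the exact code:

    static struct vm_area_struct *
    lock_mm_and_find_vma_sketch(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            mmap_read_lock(mm);
            vma = find_vma(mm, addr);
            if (likely(vma && vma->vm_start <= addr))
                    return vma;     /* common case: read lock suffices */

            /* Fault below the vma: possible stack growth, which now
             * needs the write lock to modify the maple tree. */
            if (!vma || !(vma->vm_flags & VM_GROWSDOWN)) {
                    mmap_read_unlock(mm);
                    return NULL;
            }
            mmap_read_unlock(mm);
            mmap_write_lock(mm);
            vma = find_vma(mm, addr);       /* layout may have changed */
            if (vma && vma->vm_start > addr && expand_stack(vma, addr))
                    vma = NULL;             /* expansion failed */
            mmap_write_downgrade(mm);       /* fault runs under read */
            return vma;
    }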

2023-06-20  mm: zswap: support exclusive loads  (Yosry Ahmed; 1 file, -0/+16)

Commit 71024cb4a0bf ("frontswap: remove frontswap_tmem_exclusive_gets") removed support for exclusive loads from frontswap as it was not used. Bring back exclusive loads support to frontswap by adding an "exclusive" output parameter to frontswap_ops->load.

On the zswap side, add a module parameter to enable/disable exclusive loads, and a config option to control the boot default value. Refactor zswap entry invalidation in zswap_frontswap_invalidate_page() into zswap_invalidate_entry() to reuse it in zswap_frontswap_load() if exclusive loads are enabled.

With exclusive loads, we avoid having two copies of the same page in memory (compressed & uncompressed) after faulting it in from zswap. On the other hand, if the page is to be reclaimed again without being dirtied, it will be re-compressed. Compression is not usually slow, and a page that was just faulted in is less likely to be reclaimed again soon.

Link: https://lkml.kernel.org/r/20230607195143.1473802-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
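
A simplified sketch of the exclusive-load flow; zswap_invalidate_entry() is the refactored helper named above, but the surrounding function, its arguments, and the decompress helper are illustrative stand-ins, not the exact zswap/frontswap interfaces:

    struct zswap_entry;
    void zswap_decompress_into(struct zswap_entry *e, struct page *p); /* assumed */
    void zswap_invalidate_entry(struct zswap_entry *e); /* tree arg omitted */

    static bool zswap_exclusive_loads_enabled = true;   /* module param */

    static int zswap_load_sketch(struct zswap_entry *entry,
                                 struct page *page, bool *exclusive)
    {
            zswap_decompress_into(entry, page);

            if (zswap_exclusive_loads_enabled) {
                    /* Drop the compressed copy: the caller now owns
                     * the only copy of the page. */
                    zswap_invalidate_entry(entry);
                    *exclusive = true;      /* reported via the new
                                             * frontswap_ops->load
                                             * output parameter */
            }
            return 0;
    }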

2023-06-16  Merge branches 'slab/for-6.5/prandom', 'slab/for-6.5/slab_no_merge' and 'slab/for-6.5/slab-deprecate' into slab/for-next  (Vlastimil Babka; 1 file, -2/+14)

Merge the feature branches scheduled for 6.5:

 - replace the usage of weak PRNGs, by David Keisar Schmidt
 - introduce the SLAB_NO_MERGE kmem_cache flag, by Jesper Dangaard Brouer
 - deprecate CONFIG_SLAB, with a planned removal, by myself

2023-05-26  mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED  (Vlastimil Babka; 1 file, -2/+14)

As discussed at LSF/MM [1] [2] and with no objections raised there, deprecate the SLAB allocator. Rename the user-visible option so that users with CONFIG_SLAB=y get a new prompt with explanation during make oldconfig, while make olddefconfig will just switch to SLUB. In all defconfigs with CONFIG_SLAB=y remove the line so those also switch to SLUB.

Regressions due to the switch should be reported to linux-mm and slab maintainers.

[1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737e5f@suse.cz/
[2] https://lwn.net/Articles/932201/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Helge Deller <deller@gmx.de> # parisc

2023-05-24  mm/slab: remove HAVE_HARDENED_USERCOPY_ALLOCATOR  (Vlastimil Babka; 1 file, -2/+0)

With SLOB removed, both remaining allocators support hardened usercopy, so remove the config and associated #ifdef.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

2023-04-28  Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds; 1 file, -5/+32)

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
    - removal of most of the callers of write_one_page()
    - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd, permitting userspace to wr-protect unpopulated anon memory ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning.

 - Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more.

 - Lorenzo has also converted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics flushing.

 - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd selftests.

 - Yosry Ahmed has fixed an issue around memcg's page reclaim accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...

2023-04-25  Merge tag 'slab-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab  (Linus Torvalds; 1 file, -22/+0)

Pull slab updates from Vlastimil Babka:

 "The main change is naturally the SLOB removal. Since its deprecation in 6.2 I've seen no complaints so hopefully SLUB_(TINY) works well for everyone and we can proceed.

  Besides the code cleanup, the main immediate benefit will be allowing the kfree() family of functions to work on kmem_cache_alloc() objects, which was incompatible with SLOB. This includes kfree_rcu() which had no kmem_cache_free_rcu() counterpart yet and now it shouldn't be necessary anymore.

  Besides that, there are several small code and comment improvements from Thomas, Thorsten and Vernon"

* tag 'slab-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: document kfree() as allowed for kmem_cache_alloc() objects
  mm/slob: remove slob.c
  mm/slab: remove CONFIG_SLOB code from slab common code
  mm, pagemap: remove SLOB and SLQB from comments and documentation
  mm, page_flags: remove PG_slob_free
  mm/slob: remove CONFIG_SLOB
  mm/slub: fix help comment of SLUB_DEBUG
  mm: slub: make kobj_type structure constant
  slab: Adjust comment after refactoring of gfp.h

2023-04-19  mm/hugetlb_vmemmap: rename ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP  (Aneesh Kumar K.V; 1 file, -0/+6)

Now we use the ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option to indicate devdax and hugetlb vmemmap optimization support. Hence rename that to a generic ARCH_WANT_OPTIMIZE_VMEMMAP.

Link: https://lkml.kernel.org/r/20230412050025.84346-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Tarun Sahu <tsahu@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2023-04-06  mm: introduce CONFIG_PER_VMA_LOCK  (Suren Baghdasaryan; 1 file, -0/+12)
Patch series "Per-VMA locks", v4. LWN article describing the feature: https://lwn.net/Articles/906852/ Per-vma locks idea that was discussed during SPF [1] discussion at LSF/MM last year [2], which concluded with suggestion that “a reader/writer semaphore could be put into the VMA itself; that would have the effect of using the VMA as a sort of range lock. There would still be contention at the VMA level, but it would be an improvement.” This patchset implements this suggested approach. When handling page faults we lookup the VMA that contains the faulting page under RCU protection and try to acquire its lock. If that fails we fall back to using mmap_lock, similar to how SPF handled this situation. One notable way the implementation deviates from the proposal is the way VMAs are read-locked. During some of mm updates, multiple VMAs need to be locked until the end of the update (e.g. vma_merge, split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks, figuring out when it's safe to unlock previously locked VMAs would make the code more complex. So, instead of the usual lock/unlock pattern, the proposed solution marks a VMA as locked and provides an efficient way to: 1. Identify locked VMAs. 2. Unlock all locked VMAs in bulk. We also postpone unlocking the locked VMAs until the end of the update, when we do mmap_write_unlock. Potentially this keeps a VMA locked for longer than is absolutely necessary but it results in a big reduction of code complexity. Read-locking a VMA is done using two sequence numbers - one in the vm_area_struct and one in the mm_struct. VMA is considered read-locked when these sequence numbers are equal. To read-lock a VMA we set the sequence number in vm_area_struct to be equal to the sequence number in mm_struct. To unlock all VMAs we increment mm_struct's seq number. This allows for an efficient way to track locked VMAs and to drop the locks on all VMAs at the end of the update. The patchset implements per-VMA locking only for anonymous pages which are not in swap and avoids userfaultfs as their implementation is more complex. Additional support for file-back page faults, swapped and user pages can be added incrementally. Performance benchmarks show similar although slightly smaller benefits as with SPF patchset (~75% of SPF benefits). Still, with lower complexity this approach might be more desirable. Since RFC was posted in September 2022, two separate Google teams outside of Android evaluated the patchset and confirmed positive results. Here are the known usecases when per-VMA locks show benefits: Android: Apps with high number of threads (~100) launch times improve by up to 20%. Each thread mmaps several areas upon startup (Stack and Thread-local storage (TLS), thread signal stack, indirect ref table), which requires taking mmap_lock in write mode. Page faults take mmap_lock in read mode. During app launch, both thread creation and page faults establishing the active workinget are happening in parallel and that causes lock contention between mm writers and readers even if updates and page faults are happening in different VMAs. Per-vma locks prevent this contention by providing more granular lock. Google Fibers: We have several dynamically sized thread pools that spawn new threads under increased load and reduce their number when idling. For example, Google's in-process scheduling/threading framework, UMCG/Fibers, is backed by such a thread pool. 
When idling, only a small number of idle worker threads are available; when a spike of incoming requests arrive, each request is handled in its own "fiber", which is a work item posted onto a UMCG worker thread; quite often these spikes lead to a number of new threads spawning. Each new thread needs to allocate and register an RSEQ section on its TLS, then register itself with the kernel as a UMCG worker thread, and only after that it can be considered by the in-process UMCG/Fiber scheduler as available to do useful work. In short, during an incoming workload spike new threads have to be spawned, and they perform several syscalls (RSEQ registration, UMCG worker registration, memory allocations) before they can actually start doing useful work. Removing any bottlenecks on this thread startup path will greatly improve our services' latencies when faced with request/workload spikes. At high scale, mmap_lock contention during thread creation and stack page faults leads to user-visible multi-second serving latencies in a similar pattern to Android app startup. Per-VMA locking patchset has been run successfully in limited experiments with user-facing production workloads. In these experiments, we observed that the peak thread creation rate was high enough that thread creation is no longer a bottleneck. TCP zerocopy receive: From the point of view of TCP zerocopy receive, the per-vma lock patch is massively beneficial. In today's implementation, a process with N threads where N - 1 are performing zerocopy receive and 1 thread is performing madvise() with the write lock taken (e.g. needs to change vm_flags) will result in all N -1 receive threads blocking until the madvise is done. Conversely, on a busy process receiving a lot of data, an madvise operation that does need to take the mmap lock in write mode will need to wait for all of the receives to be done - a lose:lose proposition. Per-VMA locking _removes_ by definition this source of contention entirely. There are other benefits for receive as well, chiefly a reduction in cacheline bouncing across receiving threads for locking/unlocking the single mmap lock. On an RPC style synthetic workload with 4KB RPCs: 1a) The find+lock+unlock VMA path in the base case, without the per-vma lock patchset, is about 0.7% of cycles as measured by perf. 1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5% cycles overall - most of this is within the TCP read hotpath (a small fraction is 'other' usage in the system). 2a) The find+lock+unlock VMA path, with the per-vma patchset and a trivial patch written to take advantage of it in TCP, is about 0.4% of cycles (down from 0.7% above) 2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is < 0.1% cycles and is out of the TCP read hotpath entirely (down from 0.5% before, the remaining usage is the 'other' usage in the system). So, in addition to entirely removing an onerous source of contention, it also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+ (compared to overall cycles in perf) for the 'small' RPC scenario. In https://lkml.kernel.org/r/87fsaqouyd.fsf_-_@stealth, Punit demonstrated throughput improvements of as much as 188% from this patchset. This patch (of 25): This configuration variable will be used to build the support for VMA locking during page fault handling. This is enabled on supported architectures with SMP and MMU set. 
The architecture support is needed since the page fault handler is called from the architecture's page faulting code which needs modifications to handle faults under VMA lock. Link: https://lkml.kernel.org/r/20230227173632.3292573-1-surenb@google.com Link: https://lkml.kernel.org/r/20230227173632.3292573-10-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
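
To make the sequence-number scheme concrete, here is a minimal model of the description above; the struct and function names are illustrative, not the kernel's actual fields or API:

    struct mm_model  { int mm_lock_seq; };
    struct vma_model { int vm_lock_seq; };

    /* Lock one VMA for the duration of an mm update: record the mm's
     * current sequence number in the VMA. */
    static void vma_mark_locked(struct vma_model *vma, struct mm_model *mm)
    {
            vma->vm_lock_seq = mm->mm_lock_seq;
    }

    /* A VMA counts as locked while the two sequence numbers match. */
    static bool vma_is_locked(struct vma_model *vma, struct mm_model *mm)
    {
            return vma->vm_lock_seq == mm->mm_lock_seq;
    }

    /* Unlock every VMA locked during the update in one O(1) step, at
     * mmap_write_unlock() time: bumping the mm sequence number makes
     * all the stored equalities stale at once. */
    static void vma_unlock_all(struct mm_model *mm)
    {
            mm->mm_lock_seq++;
    }

This is the design trade-off the cover letter describes: no per-VMA unlock bookkeeping, at the cost of VMAs occasionally staying locked slightly longer than strictly necessary.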

2023-04-06  mm, treewide: redefine MAX_ORDER sanely  (Kirill A. Shutemov; 1 file, -5/+5)

MAX_ORDER is currently defined as the number of orders the page allocator supports: the user can ask the buddy allocator for a page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and has led to a number of bugs all over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders the user can ask from the buddy allocator is 0..MAX_ORDER now.

[kirill@shutemov.name: fix min() warning]
Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
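
The practical effect on callers, sketched with an illustrative value (the real MAX_ORDER is set per architecture):

    #define MAX_ORDER 10    /* example value only */

    static void walk_orders(void)
    {
            int order;

            /* Old, exclusive convention: valid orders were
             * 0 .. MAX_ORDER - 1, so loops were written as
             *     for (order = 0; order < MAX_ORDER; order++)
             * With the inclusive definition, valid orders are
             * 0 .. MAX_ORDER and the same loop becomes: */
            for (order = 0; order <= MAX_ORDER; order++)
                    ;       /* e.g. inspect zone->free_area[order] */
    }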

2023-04-06  dmapool: add alloc/free performance test  (Keith Busch; 1 file, -0/+9)

Patch series "dmapool enhancements", v4.

Time spent in dma_pool alloc/free increases linearly with the number of pages backing the pool. We can reduce this to constant time with minor changes to how free pages are tracked.

This patch (of 12):

Provide a module that allocates and frees many blocks of various sizes and report how long it takes. This is intended to provide a consistent way to measure how changes to the dma_pool_alloc/free routines affect timing.

Link: https://lkml.kernel.org/r/20230126215125.4069751-1-kbusch@meta.com
Link: https://lkml.kernel.org/r/20230126215125.4069751-2-kbusch@meta.com
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2023-04-05  mm: Remove "select SRCU"  (Paul E. McKenney; 1 file, -1/+0)

Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

2023-03-29  mm/slob: remove CONFIG_SLOB  (Vlastimil Babka; 1 file, -22/+0)

Remove SLOB from Kconfig and Makefile. Everything under #ifdef CONFIG_SLOB, and mm/slob.c, is now dead code.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>

2023-02-03  zsmalloc: set default zspage chain size to 8  (Sergey Senozhatsky; 1 file, -1/+1)

This changes key characteristics (pages per-zspage and objects per-zspage) of a number of size classes, which results in a different pool configuration. With a zspage chain size of 8 we have more size class clusters (123) and a higher huge size class watermark (3632 bytes). Please read the zsmalloc documentation for more details.

Link: https://lkml.kernel.org/r/20230118005210.2814763-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2023-02-03  zsmalloc: make zspage chain size configurable  (Sergey Senozhatsky; 1 file, -0/+19)

Remove the hard coded limit on the maximum number of physical pages per-zspage. This will allow tuning of the zsmalloc pool, as zspage chain size changes `pages per-zspage` and `objects per-zspage` characteristics of size classes, which also affects size classes clustering (the way size classes are merged).

Link: https://lkml.kernel.org/r/20230118005210.2814763-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
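
Combined with the follow-up above that sets the default to 8, the resulting knob in mm/Kconfig looks roughly like this (a sketch; the exact prompt text and range as merged may differ):

    config ZSMALLOC_CHAIN_SIZE
    	int "Maximum number of physical pages per-zspage"
    	default 8
    	range 4 16
    	depends on ZSMALLOC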

2023-01-19  selftests/vm: rename selftests/vm to selftests/mm  (SeongJae Park; 1 file, -1/+1)

Rename selftests/vm to selftests/mm to be more consistent with the code, documentation, and tools directories, and to avoid confusion with virtual machines.

[sj@kernel.org: convert missing vm->mm changes]
Link: https://lkml.kernel.org/r/20230107230643.252273-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20230103180754.129637-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-12-17  Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 1 file, -1/+1)

Pull x86 mm updates from Dave Hansen:

 "New Feature:

   - Randomize the per-cpu entry areas

  Cleanups:

   - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it
   - Move to "native" set_memory_rox() helper
   - Clean up pmd_get_atomic() and i386-PAE
   - Remove some unused page table size macros"

* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  x86/mm: Ensure forced page table splitting
  x86/kasan: Populate shadow for shared chunk of the CPU entry area
  x86/kasan: Add helpers to align shadow addresses up and down
  x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
  x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
  x86/mm: Recompute physical address for every page of per-CPU CEA mapping
  x86/mm: Rename __change_page_attr_set_clr(.checkalias)
  x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
  x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
  x86/mm: Add a few comments
  x86/mm: Fix CR3_ADDR_MASK
  x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
  mm: Convert __HAVE_ARCH_P..P_GET to the new style
  mm: Remove pointless barrier() after pmdp_get_lockless()
  x86/mm/pae: Get rid of set_64bit()
  x86_64: Remove pointless set_64bit() usage
  x86/mm/pae: Be consistent with pXXp_get_and_clear()
  x86/mm/pae: Use WRITE_ONCE()
  x86/mm/pae: Don't (ab)use atomic64
  mm/gup: Fix the lockless PMD access
  ...

2022-12-15  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds; 1 file, -0/+8)

Pull kvm updates from Paolo Bonzini:

 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private.

   - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the not-quite-compliant CHAIN-ed counter support for the machines that actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages.

   - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address support

   - Removal of an unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:
      - Allow the Xen runstate information to cross a page boundary
      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:
      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02.
      - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64.
      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID.
      - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency.
      - Advertise (on AMD) that the SMM_CTL MSR is not supported
      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight pseudo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests.

   - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:
      - Clean up x86's page table management.
      - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure.
      - Clean up the nEPT support checks.
      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
      - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that triggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...

2022-12-15  mm: Rename GUP_GET_PTE_LOW_HIGH  (Peter Zijlstra; 1 file, -1/+1)

Since it no longer applies to only PTEs, rename it to PXX.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114424.776404066%40infradead.org

2022-12-14  Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds; 1 file, -9/+8)

Pull MM updates from Andrew Morton:

 - More userfaultfd work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it

 - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remain enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...

2022-12-01  mm: Kconfig: make config SECRETMEM visible with EXPERT  (Lukas Bulwahn; 1 file, -1/+7)

Commit 6a108a14fa35 ("kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT") introduces CONFIG_EXPERT to carry the previous intent of CONFIG_EMBEDDED and just gives that intent a much better name. That has been clearly a good and long overdue renaming, and it is clearly an improvement to the kernel build configuration that has shown to help managing the kernel build configuration in the last decade.

However, rather than bravely and radically just deleting CONFIG_EMBEDDED, this commit gives CONFIG_EMBEDDED a new intended semantics, but keeps it open for future contributors to implement that intended semantics: a new CONFIG_EMBEDDED option is added that automatically selects CONFIG_EXPERT when enabled and can be used in the future to isolate options that should only be considered for embedded systems (RISC architectures, SLOB, etc).

Since then, this CONFIG_EMBEDDED implicitly had two purposes:

 - It can make even more options visible beyond what CONFIG_EXPERT makes visible. In other words, it may introduce another level of enabling the visibility of configuration options: always visible, visible with CONFIG_EXPERT and visible with CONFIG_EMBEDDED.

 - Set certain default values of some configurations differently, following the assumption that configuring a kernel build for an embedded system generally starts with a different set of default values compared to kernel builds for all other kinds of systems.

Considering the second purpose, note that arguing that a kernel build for an embedded system would choose some values differently is already tricky: the set of embedded systems with Linux kernels is quite diverse. Many embedded systems have powerful CPUs and it would not be clear that all embedded systems just optimize towards one specific aspect, e.g., a smaller kernel image size. So, it is unclear if starting with "one set of default configuration" that is induced by CONFIG_EMBEDDED is a good offer for developers configuring their kernels.

Also, the differences of needed user-space features in an embedded system compared to a non-embedded system are probably difficult or even impossible to name in some generic way.

So it is not surprising that in the last decade hardly anyone has contributed changes to make something default differently in case of CONFIG_EMBEDDED=y.

Currently, in v6.0-rc4, SECRETMEM is the only config switched off if CONFIG_EMBEDDED=y. As long as that is actually the only option that currently is selected or deselected, it is better to just make SECRETMEM configurable at build time by experts using menuconfig instead.

Make SECRETMEM configurable when EXPERT is set and otherwise default to yes. Further, SECRETMEM needs ARCH_HAS_SET_DIRECT_MAP.

This allows us to remove CONFIG_EMBEDDED in the close future.

Link: https://lkml.kernel.org/r/20221116131922.25533-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
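
The reworked option then reads roughly as follows; a sketch of the shape described in the last two paragraphs above, not necessarily the exact merged text:

    config SECRETMEM
    	default y
    	bool "Enable memfd_secret() system call" if EXPERT
    	depends on ARCH_HAS_SET_DIRECT_MAP

With this shape, non-EXPERT configurations silently keep SECRETMEM=y, while experts see a prompt and can disable it.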

2022-12-01  mm,hugetlb: use folio fields in second tail page  (Hugh Dickins; 1 file, -1/+1)

Patch series "mm,huge,rmap: unify and speed up compound mapcounts".

This patch (of 3):

We want to declare one more int in the first tail of a compound page: that first tail page being valuable property, since every compound page has a first tail, but perhaps no more than that.

No problem on 64-bit: there is already space for it. No problem with 32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always available") kindly cleared the space for it, apparently not realizing that only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail page->private might conflict) - but make sure of that in its Kconfig.

But hugetlb pages use tail page->private of the first tail page for a subpool pointer, which will conflict; and they also use page->private of the 2nd, 3rd and 4th tails.

Undo "mm: add private field of first tail to struct page and struct folio"'s recent addition of private_1 to the folio tail: instead add hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison to a second tail page of the folio: THP has long been using several fields of that tail, so make better use of it for hugetlb too. This is not how a generic folio should be declared in future, but it is an effective transitional way to make use of it.

Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3.

[hughd@google.com: prefix folio's page_1 and page_2 with double underscore, give folio's _flags_2 and _head_2 a line documentation each]

Link: https://lkml.kernel.org/r/9e2cb6b-5b58-d3f2-b5ee-5f8a14e8f10@google.com
Link: https://lkml.kernel.org/r/5f52de70-975-e94f-f141-543765736181@google.com
Link: https://lkml.kernel.org/r/3818cc9a-9999-d064-d778-9c94c5911e6@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-12-01  mm: always compile in pte markers  (Peter Xu; 1 file, -7/+0)

Patch series "mm: Use pte marker for swapin errors".

This series uses the pte marker to replace the swapin error swap entry, then we save one more swap entry slot for swap devices. A new pte marker bit is defined.

This patch (of 2):

The PTE markers code is tiny and now it's enabled for most of the distributions. It's fine to keep it as-is, but to make a broader use of it (e.g. replacing read error swap entry) it needs to be there always, otherwise we need a special code path to take care of the !PTE_MARKER case. It'll be easier to just make the pte marker always exist.

Use this chance to extend its usage to anonymous too by simply touching up some of the old comments, because it'll be used for anonymous pages in the follow up patches.

Link: https://lkml.kernel.org/r/20221030214151.402274-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221030214151.402274-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-12-01  mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED  (Vlastimil Babka; 1 file, -2/+15)

As explained in [1], we would like to remove SLOB if possible.

 - There are no known users that need its somewhat lower memory footprint so much that they cannot handle SLUB (after some modifications by the previous patches) instead.

 - It is an extra maintenance burden, and a number of features are incompatible with it.

 - It blocks the API improvement of allowing kfree() on objects allocated via kmem_cache_alloc().

As the first step, rename the CONFIG_SLOB option in the slab allocator configuration choice to CONFIG_SLOB_DEPRECATED. Add CONFIG_SLOB depending on CONFIG_SLOB_DEPRECATED as an internal option to avoid code churn, as sketched below. This will cause existing .config files and defconfigs with CONFIG_SLOB=y to silently switch to the default (and recommended replacement) SLUB, while still allowing SLOB to be configured by anyone that notices and needs it. But those should contact the slab maintainers and linux-mm@kvack.org as explained in the updated help. With no valid objections, the plan is to update the existing defconfigs to SLUB and remove SLOB in a few cycles.

To make SLUB a more suitable replacement for SLOB, a CONFIG_SLUB_TINY option was introduced to limit SLUB's memory overhead. There are a number of defconfigs specifying CONFIG_SLOB=y. As part of this patch, update them to select CONFIG_SLUB and CONFIG_SLUB_TINY.

[1] https://lore.kernel.org/all/b35c3f82-f67b-2103-7d82-7a7ba7521439@suse.cz/

Cc: Russell King <linux@armlinux.org.uk>
Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Conor Dooley <conor@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> # riscv k210
Acked-by: Arnd Bergmann <arnd@arndb.de> # arm
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
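
In Kconfig terms, the rename plus the internal compatibility option look roughly like this (abridged sketch; the prompt wording and help text are omitted and may differ from the merged patch):

    config SLOB_DEPRECATED
    	depends on EXPERT
    	bool "SLOB (Simple Allocator - DEPRECATED)"

    config SLOB
    	def_bool y
    	depends on SLOB_DEPRECATED

Because the user-visible prompt moved to SLOB_DEPRECATED, an old CONFIG_SLOB=y answer no longer matches any prompt and olddefconfig falls back to the default choice (SLUB), while the internal CONFIG_SLOB symbol keeps all existing #ifdef CONFIG_SLOB code working unchanged.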

2022-11-29  mm: Do not enable PG_arch_2 for all 64-bit architectures  (Catalin Marinas; 1 file, -0/+8)

Commit 4beba9486abd ("mm: Add PG_arch_2 page flag") introduced a new page flag for all 64-bit architectures. However, even if an architecture is 64-bit, it may still have limited spare bits in the 'flags' member of 'struct page'. This may happen if an architecture enables SPARSEMEM without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch. This architecture port needs 19 more bits for the sparsemem section information and, while it is currently fine with PG_arch_2, adding any more PG_arch_* flags will trigger build-time warnings.

Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by architectures that need more PG_arch_* flags beyond PG_arch_1. Select it on arm64.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[pcc@google.com: fix build with CONFIG_ARM64_MTE disabled]
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-2-pcc@google.com
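
The shape of the opt-in, sketched (illustrative only; the exact select condition in arch/arm64/Kconfig may differ):

    # mm/Kconfig: a hidden symbol architectures opt into
    config ARCH_USES_PG_ARCH_X
    	bool

    # arch/arm64/Kconfig:
    	select ARCH_USES_PG_ARCH_X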

2022-11-28  mm, slub: add CONFIG_SLUB_TINY  (Vlastimil Babka; 1 file, -4/+17)

For tiny systems that have used SLOB until now, SLUB might be impractical due to its higher memory usage. To help with that, introduce an option CONFIG_SLUB_TINY that modifies SLUB to use less memory. This is done by sacrificing scalability, security and debugging features, therefore not recommended for any system with more than 16MB RAM.

This commit introduces the option and uses it to set other related options in a way that reduces memory usage.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
2022-10-11Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds1-7/+27
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. 
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-09-27mm: multi-gen LRU: admin guideYu Zhao1-1/+2
Add an admin guide. Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-27mm: multi-gen LRU: kill switchYu Zhao1-0/+6
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include:

  0x0001: the multi-gen LRU core
  0x0002: walking page table, when arch_has_hw_pte_young() returns true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above

E.g.,

  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
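The runtime kill switch complements the build-time knobs in mm/Kconfig, which look roughly like this (a sketch; prompts and help text approximate):

    config LRU_GEN
            bool "Multi-Gen LRU"
            depends on MMU
            # make sure folio->flags has enough spare bits
            depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
            help
              A high performance LRU implementation to overcommit memory.

    config LRU_GEN_ENABLED
            bool "Enable by default"
            depends on LRU_GEN
            help
              This option enables the multi-gen LRU by default; without
              it, the feature stays off until enabled at runtime via
              /sys/kernel/mm/lru_gen/enabled.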
2022-09-27mm: multi-gen LRU: minimal implementationYu Zhao1-0/+11
To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages.

The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation.

The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages:

1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.
    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
        --buffered=1 --ioengine=io_uring --iodepth=128 \
        --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
        --rw=randread --random_distribution=random --norandommap \
        --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin
<linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-27mm: multi-gen LRU: groundworkYu Zhao1-0/+8
Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0.

There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because:

1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page faults
   like they do for blocked I/O. E.g., GUI applications commonly use
   dedicated I/O threads to avoid blocking rendering threads.

There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-19mm/compaction: Get rid of RT ifdefferyThomas Gleixner1-0/+6
Move the RT dependency for the initial value of sysctl_compact_unevictable_allowed into Kconfig. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220825164131.402717-7-bigeasy@linutronix.de
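Concretely, the ifdeffery is replaced by an int config whose default depends on PREEMPT_RT; the resulting mm/Kconfig entry is roughly the following sketch:

    config COMPACT_UNEVICTABLE_DEFAULT
            int
            depends on COMPACTION
            default 0 if PREEMPT_RT
            default 1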
2022-09-12mm: remove EXPERIMENTAL flag for zswapDavid Heidelberg1-7/+1
zswap has been with us since 2013, and it's widely used in many products. Link: https://lkml.kernel.org/r/20220823152033.66682-1-david@ixit.cz Signed-off-by: David Heidelberg <david@ixit.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-10Merge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds1-0/+5
Pull cxl updates from Dan Williams:
 "Compute Express Link (CXL) updates for 6.0:

  - Introduce a 'struct cxl_region' object with support for
    provisioning and assembling persistent memory regions.

  - Introduce alloc_free_mem_region() to accompany the existing
    request_free_mem_region() as a method to allocate physical memory
    capacity out of an existing resource.

  - Export insert_resource_expand_to_fit() for the CXL subsystem to
    late-publish CXL platform windows in iomem_resource.

  - Add a polled mode PCI DOE (Data Object Exchange) driver service and
    use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
    Table)"

* tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits)
  cxl/hdm: Fix skip allocations vs multiple pmem allocations
  cxl/region: Disallow region granularity != window granularity
  cxl/region: Fix x1 interleave to greater than x1 interleave routing
  cxl/region: Move HPA setup to cxl_region_attach()
  cxl/region: Fix decoder interleave programming
  Documentation: cxl: remove dangling kernel-doc reference
  cxl/region: describe targets and nr_targets members of cxl_region_params
  cxl/regions: add padding for cxl_rr_ep_add nested lists
  cxl/region: Fix IS_ERR() vs NULL check
  cxl/region: Fix region reference target accounting
  cxl/region: Fix region commit uninitialized variable warning
  cxl/region: Fix port setup uninitialized variable warnings
  cxl/region: Stop initializing interleave granularity
  cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime
  cxl/acpi: Minimize granularity for x1 interleaves
  cxl/region: Delete 'region' attribute from root decoders
  cxl/acpi: Autoload driver for 'cxl_acpi' test devices
  cxl/region: decrement ->nr_targets on error in cxl_region_attach()
  cxl/region: prevent underflow in ways_to_cxl()
  cxl/region: uninitialized variable in alloc_hpa()
  ...
2022-08-06Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds1-5/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

  - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
    Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

  - Some kmemleak fixes from Patrick Wang and Waiman Long

  - DAMON updates from SeongJae Park

  - memcg debug/visibility work from Roman Gushchin

  - vmalloc speedup from Uladzislau Rezki

  - more folio conversion work from Matthew Wilcox

  - enhancements for coherent device memory mapping from Alex Sierra

  - addition of shared pages tracking and CoW support for fsdax, from
    Shiyang Ruan

  - hugetlb optimizations from Mike Kravetz

  - Mel Gorman has contributed some pagealloc changes to improve
    latency and realtime behaviour.

  - mprotect soft-dirty checking has been improved by Peter Xu

  - Many other singleton patches all over the place"

[ XFS merge from hell as per Darrick Wong in
  https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-07-30mm: Kconfig: fix typoSophia Gabriella1-1/+1
Fixes a typo in the help section for ZSWAP. Link: https://lkml.kernel.org/r/Message-ID: Signed-off-by: Sophia Gabriella <sophia.gabriellla@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-22resource: Introduce alloc_free_mem_region()Dan Williams1-0/+5
The core of devm_request_free_mem_region() is a helper that searches for free space in iomem_resource and performs __request_region_locked() on the result of that search. The policy choices of the implementation conform to what CONFIG_DEVICE_PRIVATE users want, which is memory that is immediately marked busy, and a preference to search for the first-fit free range in descending order from the top of the physical address space.

CXL has a need for a similar allocator, but with the following tweaks:

1/ Search for free space in ascending order
2/ Search for free space relative to a given CXL window
3/ 'insert' rather than 'request' the new resource, given that
   downstream drivers from the CXL Region driver (like the pmem or dax
   drivers) are responsible for request_mem_region() when they activate
   the memory range.

Rework __request_free_mem_region() into get_free_mem_region(), which takes a set of GFR_* (Get Free Region) flags to control the allocation policy (ascending vs descending) and the "busy" policy (insert_resource() vs request_region()). As part of consolidating the legacy GFR_REQUEST_REGION case with the new default of just inserting a new resource into the free space, some minor cleanups are included, like not checking for NULL before calling devres_free() (which does its own check).

Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
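The mm/Kconfig side of this rework (the +5 in the diffstat) is presumably a small hidden helper symbol that CONFIG_DEVICE_PRIVATE gains a select of, along these lines (a sketch, not verbatim from the patch):

    config GET_FREE_REGION
            depends on SPARSEMEM
            bool

    config DEVICE_PRIVATE
            # ... existing dependencies elided ...
            select GET_FREE_REGION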
2022-07-18mm/mmap: drop ARCH_HAS_VM_GET_PAGE_PROTAnshuman Khandual1-3/+0
Now all the platforms enable ARCH_HAS_VM_GET_PAGE_PROT. They define and export their own vm_get_page_prot(), whether custom or the standard DECLARE_VM_GET_PAGE_PROT. Hence there is no need for a default generic fallback for vm_get_page_prot(). Just drop this fallback and also the ARCH_HAS_VM_GET_PAGE_PROT mechanism. Link: https://lkml.kernel.org/r/20220711070600.2378316-27-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-28arch/*/: remove CONFIG_VIRT_TO_BUSArnd Bergmann1-8/+0
All architecture-independent users of virt_to_bus() and bus_to_virt() have been fixed to use the dma mapping interfaces or have been removed now. This means the definitions on most architectures, and the CONFIG_VIRT_TO_BUS symbol, are now obsolete and can be removed. The only exceptions to this are a few network and scsi drivers for m68k Amiga and VME machines and ppc32 Macintosh. These drivers work correctly with the old interfaces and are probably not worth changing. On alpha and parisc, virt_to_bus() was still used in asm/floppy.h. alpha can use isa_virt_to_bus() like x86 does, and parisc can just open-code the virt_to_phys() here, as this is architecture specific code. I tried updating the bus-virt-phys-mapping.rst documentation, which started as an email from Linus to explain some details of the Linux-2.0 driver interfaces. The bits about virt_to_bus() were declared obsolete back in 2000, and the rest is not all that relevant any more, so in the end I just decided to remove the file completely. Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-06-27docs: rename Documentation/vm to Documentation/mmMike Rapoport1-1/+1
so it will be consistent with the mm code directory and with Documentation/admin-guide/mm, and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn>
2022-05-27mm: Kconfig: reorganize misplaced mm optionsVlastimil Babka1-0/+56
After commits 7b42f1041c98 ("mm: Kconfig: move swap and slab config options to the MM section") and 519bcb797907 ("mm: Kconfig: group swap, slab, hotplug and thp options into submenus") we now have nicely organized mm related config options. I have noticed some that were still misplaced, so this moves them from various places into the new structure:

 - VM_EVENT_COUNTERS, COMPAT_BRK, MMAP_ALLOW_UNINITIALIZED to mm/Kconfig
   and general MM section.
 - SLUB_STATS to mm/Kconfig and the slab submenu.
 - DEBUG_SLAB, SLUB_DEBUG, SLUB_DEBUG_ON to mm/Kconfig.debug and the
   Kernel hacking / Memory Debugging submenu.

Link: https://lkml.kernel.org/r/20220525112559.1139-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
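As an illustration, VM_EVENT_COUNTERS now sits in mm/Kconfig's general MM section; its entry is roughly the following (help text elided, wording approximate):

    config VM_EVENT_COUNTERS
            default y
            bool "Enable VM event counters for /proc/vmstat" if EXPERT
            # help text elided: the counters feed the event counts
            # shown in /proc/vmstat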
2022-05-20mm: Kconfig: simplify zswap configurationJohannes Weiner1-30/+25
- CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is not.
  Don't make the user chase down a technical dependency like that, just
  select it automatically when zram is requested. The CONFIG_CRYPTO
  dependency is redundant due to more specific deps.

- CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and
  have it selected in as needed.

- CONFIG_ZSWAP: Select CRYPTO instead of depending on it. Common
  pattern.

- Make the ZSWAP suboptions and their descriptions (compression,
  allocation backend) a bit more straightforward for the user.

Link: https://lkml.kernel.org/r/20220510152847.230957-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
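The select-instead-of-depend pattern described above looks roughly like this in mm/Kconfig (a sketch; the prompt reflects that ZSWAP was still marked EXPERIMENTAL at the time, and the dependency list is abbreviated):

    config ZSWAP
            bool "Compressed cache for swap pages (EXPERIMENTAL)"
            depends on SWAP
            select CRYPTO
            select ZPOOL
            # help text elided

    config ZPOOL
            bool
            # hidden: not user-facing, selected by its users as needed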
2022-05-20mm: Kconfig: group swap, slab, hotplug and thp options into submenusJohannes Weiner1-217/+230
There are several clusters of related config options spread throughout the mostly flat MM submenu. Group them together and put specialization options into further subdirectories to make the MM submenu a bit more organized and easier to navigate. [hannes@cmpxchg.org: fix kbuild warnings] Link: https://lkml.kernel.org/r/YnvkSVivfnT57Vwh@cmpxchg.org [hannes@cmpxchg.org: fix more kbuild warnings] Link: https://lkml.kernel.org/r/Ynz8NusTdEGcCnJN@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
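The grouping relies on standard Kconfig submenu idioms; schematically (the group names here are illustrative, not the exact ones from the patch):

    menu "Memory Management options"

    menuconfig SWAP
            bool "Support for paging of anonymous memory (swap)"
            default y

    if SWAP
    # swap-related specialization options, e.g. zswap, live here
    endif # SWAP

    endmenu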
2022-05-20mm: Kconfig: move swap and slab config options to the MM sectionJohannes Weiner1-0/+123
These are currently under General Setup. MM seems like a better fit. Link: https://lkml.kernel.org/r/20220510152847.230957-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/uffd: move USERFAULTFD configs into mm/Peter Xu1-0/+17
We used to have USERFAULTFD configs stored in init/. It made sense as a start because that's the default place for storing syscall related configs. However userfaultfd evolved a bit in the past few years and some more config options were added. They're no longer related to syscalls and are no longer a good fit for the init/ directory, because they're pure mm concepts. But it's not ideal either to keep the userfaultfd configs separate from each other. Hence this patch moves the userfaultfd configs from init/ to mm/, so that all userfaultfd configs are grouped together. We do have quite a few examples of syscall related configs that are not put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING, MEMFD_CREATE.. They all reside in the directory where they're more suitable for the concept. So there seems to be no hard rule that syscall related CONFIG_* options must live under init/ only. Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
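After the move, the userfaultfd block in mm/Kconfig plausibly looks like the following (a sketch; help texts abbreviated):

    config USERFAULTFD
            bool "Enable userfaultfd() system call"
            depends on MMU
            help
              Enable the userfaultfd() system call that allows to
              intercept and handle page faults in userland.

    config HAVE_ARCH_USERFAULTFD_WP
            bool
            help
              Arch has userfaultfd write protection support

    config HAVE_ARCH_USERFAULTFD_MINOR
            bool
            help
              Arch has userfaultfd minor fault support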
2022-05-13mm: enable PTE markers by defaultPeter Xu1-3/+5
Enable PTE markers by default. On x86_64 it means it'll auto-enable PTE_MARKER_UFFD_WP as well. [peterx@redhat.com: hide PTE_MARKER option] Link: https://lkml.kernel.org/r/20220419202531.27415-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220405014929.15158-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
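The end state after this patch (including the fixup that hides the PTE_MARKER prompt) is approximately: PTE_MARKER becomes a hidden symbol selected by its users, and the uffd-wp marker defaults to y on architectures with uffd-wp support. A sketch, not verbatim:

    config PTE_MARKER
            bool

    config PTE_MARKER_UFFD_WP
            bool "Userfaultfd write protection support for shmem/hugetlbfs"
            default y
            depends on HAVE_ARCH_USERFAULTFD_WP
            select PTE_MARKER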
2022-05-13mm/uffd: PTE_MARKER_UFFD_WPPeter Xu1-0/+9
This patch introduces the 1st user of pte marker: the uffd-wp marker. When the pte marker is installed with the uffd-wp bit set, it means this pte was wr-protected by uffd. We will use this special pte to arm the ptes that got either unmapped or swapped out for a file-backed region that was previously wr-protected. This special pte could trigger a page fault just like swap entries. This idea is greatly inspired by Hugh and Andrea in the discussion, which is referenced in the links below. Some helpers are introduced to detect whether a swap pte is uffd wr-protected. After the pte marker introduced, one swap pte can be wr-protected in two forms: either it is a normal swap pte and it has _PAGE_SWP_UFFD_WP set, or it's a pte marker that has PTE_MARKER_UFFD_WP set. [peterx@redhat.com: fixup] Link: https://lkml.kernel.org/r/YkzKiM8tI4+qOfXF@xz-m1.local Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/ Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/ Link: https://lkml.kernel.org/r/20220405014838.14131-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: introduce PTE_MARKER swap entryPeter Xu1-0/+6
Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8. Overview ======== Userfaultfd-wp anonymous support was merged two years ago. There're quite a few applications that started to leverage this capability either to take snapshots for user-app memory, or use it for full user controled swapping. This series tries to complete the feature for uffd-wp so as to cover all the RAM-based memory types. So far uffd-wp is the only missing piece of the rest features (uffd-missing & uffd-minor mode). One major reason to do so is that anonymous pages are sometimes not satisfying the need of applications, and there're growing users of either shmem and hugetlbfs for either sharing purpose (e.g., sharing guest mem between hypervisor process and device emulation process, shmem local live migration for upgrades), or for performance on tlb hits. All these mean that if a uffd-wp app wants to switch to any of the memory types, it'll stop working. I think it's worthwhile to have the kernel to cover all these aspects. This series chose to protect pages in pte level not page level. One major reason is safety. I have no idea how we could make it safe if any of the uffd-privileged app can wr-protect a page that any other application can use. It means this app can block any process potentially for any time it wants. The other reason is that it aligns very well with not only the anonymous uffd-wp solution, but also uffd as a whole. For example, userfaultfd is implemented fundamentally based on VMAs. We set flags to VMAs showing the status of uffd tracking. For another per-page based protection solution, it'll be crossing the fundation line on VMA-based, and it could simply be too far away already from what's called userfaultfd. PTE markers =========== The patchset is based on the idea called PTE markers. It was discussed in one of the mm alignment sessions, proposed starting from v6, and this is the 2nd version of it using PTE marker idea. PTE marker is a new type of swap entry that is ony applicable to file backed memories like shmem and hugetlbfs. It's used to persist some pte-level information even if the original present ptes in pgtable are zapped. Logically pte markers can store more than uffd-wp information, but so far only one bit is used for uffd-wp purpose. When the pte marker is installed with uffd-wp bit set, it means this pte is wr-protected by uffd. It solves the problem on e.g. file-backed memory mapped ptes got zapped due to any reason (e.g. thp split, or swapped out), we can still keep the wr-protect information in the ptes. Then when the page fault triggers again, we'll know this pte is wr-protected so we can treat the pte the same as a normal uffd wr-protected pte. The extra information is encoded into the swap entry, or swp_offset to be explicit, with the swp_type being PTE_MARKER. So far uffd-wp only uses one bit out of the swap entry, the rest bits of swp_offset are still reserved for other purposes. There're two configs to enable/disable PTE markers: CONFIG_PTE_MARKER CONFIG_PTE_MARKER_UFFD_WP We can set !PTE_MARKER to completely disable all the PTE markers, along with uffd-wp support. I made two config so we can also enable PTE marker but disable uffd-wp file-backed for other purposes. At the end of current series, I'll enable CONFIG_PTE_MARKER by default, but that patch is standalone and if anyone worries about having it by default, we can also consider turn it off by dropping that oneliner patch. So far I don't see a huge risk of doing so, so I kept that patch. 
In most cases, PTE markers should be treated as none ptes. That is because, unlike most of the other swap entry types, there's no PFN or block offset information encoded into PTE markers, just some extra well-defined bits showing the status of the pte. These bits should only be used as extra data when servicing an upcoming page fault, and then we behave as if it's a none pte.

I did spend a lot of time observing all the pte_none() users this time. It is indeed a challenge because there're a lot, and I hope I didn't miss a single one of them where we should take care of pte markers. Luckily, I don't think it'll need to be considered in many cases, for example: boot code, arch code (especially non-x86), kernel-only page handling (e.g. CPA), or device driver code when dealing with pure PFN mappings.

I introduced pte_none_mostly() in this series for when we need to handle pte markers the same as none ptes; the "mostly" is the other way to write "either none pte or a pte marker".

I didn't replace pte_none() to cover pte markers, for the reasons below:

 - A very rare set of pte_none() callers will handle pte markers. E.g.,
   all the kernel pages do not require knowledge of pte markers. So we
   don't pollute the major use cases.

 - Unconditionally changing pte_none() semantics could confuse people,
   because pte_none() has existed for so long a time.

 - Unconditionally changing pte_none() semantics could make pte_none()
   slower, even if in many cases pte markers do not exist.

 - There're cases where we'd like to handle pte markers differently
   from pte_none(), so a full replacement is also impossible. E.g.
   khugepaged should still treat pte markers as normal swap ptes rather
   than none ptes, because pte markers will always need a fault-in to
   merge the marker with a valid pte. Or the smaps code will need to
   parse PTE markers, not none ptes.
Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:
  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:
  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:
  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest mm for uffd-wp file-backed:
  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:
  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

This patch (of 23):

Introduces a new swap entry type called PTE_MARKER. It can be installed for any pte that maps a file-backed memory when the pte is temporarily zapped, so as to maintain per-pte information. The information kept in the pte is called a "marker". Here we define the marker as "unsigned long" just to match pgoff_t, however it will only work if it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64. A new config CONFIG_PTE_MARKER is introduced too; it's by default off. A bunch of helpers are defined altogether to service the rest of the pte marker code.

[peterx@redhat.com: fixup] Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>