2022-05-13  mm: make alloc_contig_range work at pageblock granularity  (Zi Yan, 5 files, -18/+242)
alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid merging pageblocks with different migratetypes. It might unnecessarily convert extra pageblocks at the beginning and at the end of the range. Change alloc_contig_range() to work at pageblock granularity. Special handling is needed for free pages and in-use pages across the boundaries of the range specified by alloc_contig_range(), because partially isolated pages cause free page accounting issues. The free pages will be split and freed into separate migratetype lists; the in-use pages will be migrated and then the freed pages will be handled in the aforementioned way. [ziy@nvidia.com: fix deadlock/crash] Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm: page_isolation: check specified range for unmovable pages  (Zi Yan, 1 file, -13/+34)
Enable set_migratetype_isolate() to check specified range for unmovable pages during isolation to prepare arbitrary range page isolation. The functionality will take effect in upcoming commits by adjusting the callers of start_isolate_page_range(), which uses set_migratetype_isolate(). For example, alloc_contig_range(), which calls start_isolate_page_range(), accepts unaligned ranges, but because page isolation is currently done at MAX_ORDER_NR_PAGES granularity, pages that are out of the specified range but within MAX_ORDER_NR_PAGES alignment might be attempted for isolation, and the failure of isolating these unrelated pages fails the whole operation undesirably. Link: https://lkml.kernel.org/r/20220425143118.2850746-3-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: kernel test robot <lkp@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c  (Zi Yan, 3 files, -121/+119)
Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11. This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA and alloc_contig_range(). It prepares for my upcoming changes to make MAX_ORDER adjustable at boot time[1]. The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range() isolates pageblocks to remove free memory from buddy allocator but isolating only a subset of pageblocks within a page spanning across multiple pageblocks causes free page accounting issues. Isolated page might not be put into the right free list, since the code assumes the migratetype of the first pageblock as the whole free page migratetype. This is based on the discussion at [2]. To remove the requirement, this patchset: 1. isolates pages at pageblock granularity instead of max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages); 2. splits free pages across the specified range or migrates in-use pages across the specified range then splits the freed page to avoid free page accounting issues (it happens when multiple pageblocks within a single page have different migratetypes); 3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned range during isolation to avoid alloc_contig_range() failure when pageblocks within a MAX_ORDER - 1 aligned range are allocated separately. 4. returns pages not in the range as it did before. One optimization might come later: 1. make MIGRATE_ISOLATE a separate bit to be able to restore the original migratetypes when isolation fails in the middle of the range. [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/ [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/ This patch (of 6): has_unmovable_pages() is only used in mm/page_isolation.c. Move it from mm/page_alloc.c and make it static. Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Minchan Kim <minchan@kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  cgroup: fix racy check in alloc_pagecache_max_30M() helper function  (David Vernet, 1 file, -2/+7)
alloc_pagecache_max_30M() in the cgroup memcg tests performs a 50MB pagecache allocation, which it expects to be capped at 30MB due to the calling process having a memory.high setting of 30MB. After the allocation, the function contains a check that verifies that MB(29) < memory.current <= MB(30). This check can actually fail non-deterministically. The testcases that use this function are test_memcg_high() and test_memcg_max(), which set memory.high and memory.max to 30MB respectively for the cgroup under test. The allocation can slightly exceed this number in both cases, and for memory.max, the process performing the allocation will not have the OOM killer invoked as it's performing a pagecache allocation. This patchset therefore updates the above check to instead use the verify_close() helper function. Link: https://lkml.kernel.org/r/20220423155619.3669555-6-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  cgroup: remove racy check in test_memcg_sock()  (David Vernet, 1 file, -3/+0)
test_memcg_sock() in the cgroup memcg tests, verifies expected memory accounting for sockets. The test forks a process which functions as a TCP server, and sends large buffers back and forth between itself (as the TCP client) and the forked TCP server. While doing so, it verifies that memory.current and memory.stat.sock look correct. There is currently a check in tcp_client() which asserts memory.current >= memory.stat.sock. This check is racy, as between memory.current and memory.stat.sock being queried, a packet could come in which causes mem_cgroup_charge_skmem() to be invoked. This could cause memory.stat.sock to exceed memory.current. Reversing the order of querying doesn't address the problem either, as memory may be reclaimed between the two calls. Instead, this patch just removes that assertion altogether, and instead relies on the values_close() check that follows to validate the expected accounting. Link: https://lkml.kernel.org/r/20220423155619.3669555-5-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
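The values_close() approach relies on a tolerance check instead of a strict ordering. As a rough illustration, such a helper can be sketched as below; this is a hedged sketch in plain C, and the exact in-tree helper in the cgroup selftests may be shaped differently.

    #include <stdbool.h>
    #include <stdlib.h>

    /*
     * Hedged sketch of a "values are within err percent of each other" check,
     * in the spirit of the values_close() helper the test relies on.
     */
    static bool values_close(long a, long b, int err_percent)
    {
            /* |a - b| must stay within err_percent of the combined magnitude */
            return labs(a - b) <= (a + b) / 100 * err_percent;
    }

With a check like this, a transient charge or uncharge landing between the reads of memory.current and memory.stat.sock no longer flips the test result, as long as the two values stay reasonably close.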
2022-05-13  cgroup: account for memory_localevents in test_memcg_oom_group_leaf_events()  (David Vernet, 1 file, -5/+17)
The test_memcg_oom_group_leaf_events() testcase in the cgroup memcg tests validates that processes in a group that perform allocations exceeding memory.oom.group are killed. It also validates that the memory.events.oom_kill events are properly propagated in this case. Commit 06e11c907ea4 ("kselftests: memcg: update the oom group leaf events test") fixed test_memcg_oom_group_leaf_events() to account for the fact that the memory.events.oom_kill events in a child cgroup is propagated up to its parent. This behavior can actually be configured by the memory_localevents mount option, so this patch updates the testcase to properly account for the possible presence of this mount option. Link: https://lkml.kernel.org/r/20220423155619.3669555-4-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  cgroup: account for memory_recursiveprot in test_memcg_low()  (David Vernet, 3 files, -3/+26)
The test_memcg_low() testcase in test_memcontrol.c verifies the expected behavior of groups using the memory.low knob. Part of the testcase verifies that a group with memory.low that experiences reclaim due to memory pressure elsewhere in the system observes memory.events.low events as a result of that reclaim. In commit 8a931f801340 ("mm: memcontrol: recursive memory.low protection"), the memory controller was updated to propagate memory.low and memory.min protection from a parent group to its children via a configurable memory_recursiveprot mount option. This unfortunately broke the memcg tests, which assert that a sibling that experienced reclaim but had a memory.low value of 0 would not observe any memory.low events. This patch updates test_memcg_low() to account for the new behavior introduced by memory_recursiveprot. So as to make the test resilient to multiple configurations, the patch also adds a new proc_mount_contains() helper that checks for a string in /proc/mounts, and is used to toggle behavior based on whether the default memory_recursiveprot was present. Link: https://lkml.kernel.org/r/20220423155619.3669555-3-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
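A helper that scans /proc/mounts for an option string can be as simple as the sketch below; this is illustrative only, and the actual proc_mount_contains() in test_memcontrol.c may read the file differently.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Return true if the given option string appears anywhere in /proc/mounts. */
    static bool proc_mount_contains(const char *option)
    {
            char line[4096];
            FILE *fp = fopen("/proc/mounts", "r");
            bool found = false;

            if (!fp)
                    return false;
            while (fgets(line, sizeof(line), fp)) {
                    if (strstr(line, option)) {
                            found = true;
                            break;
                    }
            }
            fclose(fp);
            return found;
    }

The test can then branch on proc_mount_contains("memory_recursiveprot") to decide which memory.events.low expectations apply.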
2022-05-13  cgroups: refactor children cgroups in memcg tests  (David Vernet, 1 file, -14/+14)
Patch series "Fix bugs in memcontroller cgroup tests", v2. tools/testing/selftests/cgroup/test_memcontrol.c contains a set of testcases which validate expected behavior of the cgroup memory controller. Roman Gushchin recently sent out a patchset that fixed a few issues in the test. This patchset continues that effort by fixing a few more issues that were causing non-deterministic failures in the suite. With this patchset, I'm unable to reproduce any more errors after running the tests in a continuous loop for many iterations. Before, I was able to reproduce at least one of the errors fixed in this patchset with just one or two runs. This patch (of 5): In test_memcg_min() and test_memcg_low(), there is an array of four sibling cgroups. All but one of these sibling groups does a 50MB allocation, and the group that does no allocation is the third of four in the array. This is not a problem per se, but makes it a bit tricky to do some assertions in test_memcg_low(), as we want to make assertions on the siblings based on whether or not they performed allocations. Having a static index before which all groups have performed an allocation makes this cleaner. This patch therefore reorders the sibling groups so that the group that performs no allocations is the last in the array. A follow-on patch will leverage this to fix a bug in the test that incorrectly asserts that a sibling group that had performed an allocation, but only had protection from its parent, will not observe any memory.events.low events during reclaim. Link: https://lkml.kernel.org/r/20220423155619.3669555-1-void@manifault.com Link: https://lkml.kernel.org/r/20220423155619.3669555-2-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/uffd: move USERFAULTFD configs into mm/  (Peter Xu, 2 files, -17/+17)
We used to have the USERFAULTFD configs stored in init/. That made sense as a start because init/ is the default place for storing syscall related configs. However userfaultfd evolved a bit in the past few years and some more config options were added. They're no longer related to syscalls and are no longer suitable for the init/ directory, because they're pure mm concepts. But it's not ideal either to keep the userfaultfd configs separate from each other. Hence this patch moves the userfaultfd configs from init/ to mm/ so that we'll start to group all userfaultfd configs together. We do have quite a few examples of syscall related configs that are not put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING, MEMFD_CREATE, etc. They all reside in the directory that best matches their concept. So there seems to be no hard rule that syscall related CONFIG_* options must live under init/ only. Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  userfaultfd/selftests: use swap() instead of open coding it  (Guo Zhengkui, 1 file, -7/+2)
Address the following coccicheck warnings: tools/testing/selftests/vm/userfaultfd.c:1536:21-22: WARNING opportunity for swap(). tools/testing/selftests/vm/userfaultfd.c:1540:33-34: WARNING opportunity for swap(). Do so by using swap() to exchange the variable values and dropping `tmp_area`, which is no longer needed. The `swap()` macro in userfaultfd.c was introduced in commit 681696862bc18 ("selftests: vm: remove dependecy from internal kernel macros"). It has been tested with gcc (Debian 8.3.0-6) 8.3.0. Link: https://lkml.kernel.org/r/20220407123141.4998-1-guozhengkui@vivo.com Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
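For reference, the pattern looks roughly like the sketch below; the exact macro definition in the selftest is assumed here (a typical typeof()-based swap), so treat it as illustrative.

    /* Assumed form of the selftest's swap() macro (typeof()-based). */
    #define swap(a, b) \
            do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

    static char *area_src, *area_dst;  /* illustrative: the two mapping pointers */

    static void swap_test_areas(void)
    {
            /*
             * Before: an open-coded three-assignment swap through tmp_area.
             * After: a single swap() call, no temporary needed at the call site.
             */
            swap(area_src, area_dst);
    }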
2022-05-13  selftests/uffd: enable uffd-wp for shmem/hugetlbfs  (Peter Xu, 1 file, -3/+1)
After we added support for shmem and hugetlbfs, we can turn uffd-wp test on always now. Link: https://lkml.kernel.org/r/20220405014932.15212-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm: enable PTE markers by default  (Peter Xu, 1 file, -3/+5)
Enable PTE markers by default. On x86_64 it means it'll auto-enable PTE_MARKER_UFFD_WP as well. [peterx@redhat.com: hide PTE_MARKER option] Link: https://lkml.kernel.org/r/20220419202531.27415-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220405014929.15158-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/uffd: enable write protection for shmem & hugetlbfs  (Peter Xu, 4 files, -26/+34)
We've had all the necessary changes ready for both shmem and hugetlbfs. Turn on all the shmem/hugetlbfs switches for userfaultfd-wp. We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because all existing types now support write protection mode. Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h. Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/pagemap: recognize uffd-wp bit for shmem/hugetlbfs  (Peter Xu, 1 file, -0/+7)
This requires the pagemap code to be able to recognize the newly introduced swap special pte for uffd-wp, as well as the general hugetlb case that we recently started to support. It should make pagemap uffd-wp support complete. Link: https://lkml.kernel.org/r/20220405014923.15047-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/khugepaged: don't recycle vma pgtable if uffd-wp registered  (Peter Xu, 1 file, -1/+13)
When we're trying to collapse a 2M huge shmem page, don't retract the pgtable pmd page if it's registered with uffd-wp, because that pgtable could have pte markers installed. Recycling of that pgtable means we'll lose the pte markers. That could cause data loss for an uffd-wp enabled application on shmem. Instead of disabling khugepaged on these files, simply skip retracting these special VMAs, then the page cache can still be merged into a huge thp, and other mms/vmas can still map the range of the file with a huge thp when appropriate. Note that checking VM_UFFD_WP needs to be done with mmap_sem held for write; that avoids races like:

    khugepaged                     user thread
    ==========                     ===========
    check VM_UFFD_WP, not set
                                   UFFDIO_REGISTER with uffd-wp on shmem
                                   wr-protect some pages (install markers)
    take mmap_sem write lock
    erase pmd and free pmd page
      --> pte markers are dropped unnoticed!

Link: https://lkml.kernel.org/r/20220405014921.14994-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: handle uffd-wp during fork()  (Peter Xu, 3 files, -18/+35)
Firstly, we'll need to pass in dst_vma into copy_hugetlb_page_range() because for uffd-wp it's the dst vma that matters on deciding how we should treat uffd-wp protected ptes. We should recognize pte markers during fork and do the pte copy if needed. [lkp@intel.com: vma_needs_copy can be static] Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: only drop uffd-wp special pte if required  (Peter Xu, 6 files, -20/+45)
As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if unmapping an entire vma or synchronized such that faults can not race with the unmap operation. This requires passing zap_flags all the way to the lowest level hugetlb unmap routine: __unmap_hugepage_range. In general, unmap calls originated in hugetlbfs code will pass the ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent faults. The exception is hole punch, which will first unmap without any synchronization. Later when hole punch actually removes the page from the file, it will check to see if there was a subsequent fault and if so take the hugetlb fault mutex while unmapping again. This second unmap will pass in ZAP_FLAG_DROP_MARKER. The justification of "whether to apply ZAP_FLAG_DROP_MARKER flag when unmapping a hugetlb range" is (IMHO): we should never reach a state where a page fault could erroneously fault in, as writable, a page-cache page that was wr-protected, even for an extremely short period. That could happen if e.g. we pass ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page faults after that call and before remove_inode_hugepages() is executed, the page cache can be mapped writable again in that small racy window, which can cause unexpected data to be overwritten. [peterx@redhat.com: fix sparse warning] Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues] Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: allow uffd wr-protect none ptes  (Peter Xu, 1 file, -4/+24)
Teach the hugetlbfs code to wr-protect none ptes just in case the page cache existed for that pte. Meanwhile we also need to be able to recognize a uffd-wp marker pte and remove it for uffd_wp_resolve. While at it, introduce a variable "psize" to replace all references to the huge page size fetcher. Link: https://lkml.kernel.org/r/20220405014912.14815-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: handle pte markers in page faults  (Peter Xu, 1 file, -4/+14)
Allow hugetlb code to handle pte markers just like none ptes. It's mostly there, we just need to make sure we don't assume hugetlb_no_page() only handles none pte, so when detecting pte change we should use pte_same() rather than pte_none(). We need to pass in the old_pte to do the comparison. Check the original pte to see whether it's a pte marker, if it is, we should recover uffd-wp bit on the new pte to be installed, so that the next write will be trapped by uffd. Link: https://lkml.kernel.org/r/20220405014909.14761-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: handle UFFDIO_WRITEPROTECT  (Peter Xu, 4 files, -4/+26)
This starts from passing cp_flags into hugetlb_change_protection() so hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests. huge_pte_clear_uffd_wp() is introduced to handle the case where the UFFDIO_WRITEPROTECT is requested upon migrating huge page entries. Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: take care of UFFDIO_COPY_MODE_WP  (Peter Xu, 3 files, -13/+36)
Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the stack. Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is used with UFFDIO_COPY. Hugetlb pages are only managed by hugetlbfs, so we're safe even without setting the dirty bit in the huge pte if the page is installed as read-only. However we'd better still keep the dirty bit set for a read-only UFFDIO_COPY pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do with shmem, but also because the page does contain dirty data that the kernel just copied from userspace. Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: hook page faults for uffd write protection  (Peter Xu, 1 file, -0/+20)
Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults. We do this slightly earlier than hugetlb_cow() so that we can avoid taking some extra locks that we definitely don't need. Link: https://lkml.kernel.org/r/20220405014901.14590-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/hugetlb: introduce huge pte version of uffd-wp helpers  (Peter Xu, 2 files, -0/+30)
They will be used in the follow up patches to either check/set/clear uffd-wp bit of a huge pte. So far it reuses all the small pte helpers. Archs can overwrite these versions when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the future. Link: https://lkml.kernel.org/r/20220405014858.14531-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/shmem: handle uffd-wp during fork()  (Peter Xu, 1 file, -8/+41)
Normally we skip copying pages at fork() for VM_SHARED shmem, but we can't skip it anymore if uffd-wp is enabled on the dst vma. This should only happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a uffd-wp shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma too, and then we should copy the pgtables with the uffd-wp bit and pte markers, because this information would otherwise be lost. Since the condition checks will become even more complicated for deciding "whether a vma needs to copy the pgtable during fork()", introduce a helper vma_needs_copy() for it, so everything will be clearer. Link: https://lkml.kernel.org/r/20220405014855.14468-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
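A minimal sketch of the vma_needs_copy() idea, under the assumption that the real helper in mm/memory.c may differ in detail:

    #include <linux/mm.h>
    #include <linux/userfaultfd_k.h>

    /* Decide at fork() time whether the src VMA's page tables must be copied. */
    static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                               struct vm_area_struct *src_vma)
    {
            /*
             * uffd-wp on the destination means the wr-protect bits and pte
             * markers must be carried over, even for VM_SHARED mappings.
             */
            if (userfaultfd_wp(dst_vma))
                    return true;

            if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
                    return true;

            /* Anonymous memory needs CoW setup, hence a pgtable copy. */
            if (src_vma->anon_vma)
                    return true;

            return false;
    }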
2022-05-13  mm/shmem: allows file-back mem to be uffd wr-protected on thps  (Peter Xu, 1 file, -1/+8)
We don't have "huge" version of pte markers, instead when necessary we split the thp. However split the thp is not enough, because file-backed thp is handled totally differently comparing to anonymous thps: rather than doing a real split, the thp pmd will simply got cleared in __split_huge_pmd_locked(). That is not enough if e.g. when there is a thp covers range [0, 2M) but we want to wr-protect small page resides in [4K, 8K) range, because after __split_huge_pmd() returns, there will be a none pmd, and change_pmd_range() will just skip it right after the split. Here we leverage the previously introduced change_pmd_prepare() macro so that we'll populate the pmd with a pgtable page after the pmd split (in which process the pmd will be cleared for cases like shmem). Then change_pte_range() will do all the rest for us by installing the uffd-wp pte marker at any none pte that we'd like to wr-protect. Link: https://lkml.kernel.org/r/20220405014852.14413-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/shmem: allow uffd wr-protect none pte for file-backed mem  (Peter Xu, 1 file, -2/+62)
File-backed memory differs from anonymous memory in that even if the pte is missing, the data could still reside either in the file or in the page/swap cache. So when wr-protecting a pte, we need to consider none ptes too. We do that by installing the uffd-wp pte markers when necessary. So when there's a future write to the pte, the fault handler will go the special path to first fault-in the page as read-only, then report to the userfaultfd server with the wr-protect message. On the other hand, when unprotecting a page, it's also possible that the pte got unmapped but replaced by the special uffd-wp marker. Then we'll need to be able to recover from a uffd-wp pte marker into a none pte, so that the next access to the page will fault in correctly as usual. Special care needs to be taken throughout the change_protection_range() process. Since now we allow the user to wr-protect a none pte, we need to be able to pre-populate the page table entries if we see (!anonymous && MM_CP_UFFD_WP) requests, otherwise change_protection_range() will always skip when the pgtable entry does not exist. For example, the pgtable can be missing for a whole chunk of 2M pmd, but the page cache can exist for the 2M range. When we want to wr-protect one 4K page within the 2M pmd range, we need to pre-populate the pgtable and install the pte marker showing that we want to get a message and block the thread when the page cache of that 4K page is written. Without pre-populating the pmd, change_protection() will simply skip that whole pmd. Note that this patch only covers the small pages (pte level) but does not cover any of the transparent huge pages yet. That will be done later, and this patch will be a preparation for it too. Link: https://lkml.kernel.org/r/20220405014850.14352-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/shmem: persist uffd-wp bit across zapping for file-backed  (Peter Xu, 4 files, -3/+107)
File-backed memory is prone to being unmapped at any time. It means all information in the pte will be dropped, including the uffd-wp flag. To persist the uffd-wp flag, we'll use the pte markers. This patch teaches the zap code to understand uffd-wp and know when to keep or drop the uffd-wp bit. Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we don't want to persist such information, for example, when destroying the whole vma, or punching a hole in a shmem file. For all other cases we should never drop the uffd-wp bit, or the wr-protect information will get lost. The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than memory.c because it'll be further referenced in hugetlb files later. Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/shmem: handle uffd-wp special pte in page fault handler  (Peter Xu, 2 files, -9/+75)
File-backed memories are prone to unmap/swap so the ptes are always unstable, because they can be easily faulted back later using the page cache. This could lead to uffd-wp getting lost when unmapping or swapping out such memory. One example is shmem. PTE markers are needed to store that information. This patch prepares for that by handling uffd-wp pte markers in the page fault handler first, before they start being installed elsewhere, so that the fault path can already recognize them. The handling of uffd-wp pte markers is similar to missing fault; it's just that we'll handle this "missing fault" when we see the pte markers, while making sure the marker information is kept during processing of the fault. This is a slow path of uffd-wp handling, because zapping of wr-protected shmem ptes should be rare. So far it should only trigger in two conditions: (1) When trying to punch holes in shmem_fallocate(), there is an optimization to zap the pgtables before evicting the page. (2) When swapping out shmem pages. Because of this, the page fault handling is simplified too by not sending the wr-protect message in the 1st page fault; instead the page will be installed read-only, so the uffd-wp message will be generated in the next fault, which will trigger the do_wp_page() path of general uffd-wp handling. Disable fault-around for all uffd-wp registered ranges for extra safety just like uffd-minor fault, and clean the code up. Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/shmem: take care of UFFDIO_COPY_MODE_WP  (Peter Xu, 3 files, -9/+22)
Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply the UFFD_WP bit properly when the UFFDIO_COPY on shmem is with UFFDIO_COPY_MODE_WP. wp_copy lands mfill_atomic_install_pte() finally. Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file). Link: https://lkml.kernel.org/r/20220405014841.14185-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/uffd: PTE_MARKER_UFFD_WP  (Peter Xu, 3 files, -1/+58)
This patch introduces the 1st user of pte marker: the uffd-wp marker. When the pte marker is installed with the uffd-wp bit set, it means this pte was wr-protected by uffd. We will use this special pte to arm the ptes that got either unmapped or swapped out for a file-backed region that was previously wr-protected. This special pte could trigger a page fault just like swap entries. This idea is greatly inspired by Hugh and Andrea in the discussion, which is referenced in the links below. Some helpers are introduced to detect whether a swap pte is uffd wr-protected. After the pte marker introduced, one swap pte can be wr-protected in two forms: either it is a normal swap pte and it has _PAGE_SWP_UFFD_WP set, or it's a pte marker that has PTE_MARKER_UFFD_WP set. [peterx@redhat.com: fixup] Link: https://lkml.kernel.org/r/YkzKiM8tI4+qOfXF@xz-m1.local Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/ Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/ Link: https://lkml.kernel.org/r/20220405014838.14131-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
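The two forms can be illustrated with a hedged sketch like the one below (the helper name and exact checks are illustrative; the real helpers live in the swapops/userfaultfd headers and may differ):

    #include <linux/swapops.h>
    #include <linux/userfaultfd_k.h>

    /* Is this swap pte uffd-wp protected, in either of the two forms? */
    static inline bool swp_pte_uffd_wp_any(pte_t pte)
    {
            if (!is_swap_pte(pte))
                    return false;

            /* Form 1: a normal swap pte carrying _PAGE_SWP_UFFD_WP. */
            if (pte_swp_uffd_wp(pte))
                    return true;

            /* Form 2: a pte marker with PTE_MARKER_UFFD_WP set. */
            if (is_pte_marker_entry(pte_to_swp_entry(pte)) &&
                (pte_marker_get(pte_to_swp_entry(pte)) & PTE_MARKER_UFFD_WP))
                    return true;

            return false;
    }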
2022-05-13  mm: check against orig_pte for finish_fault()  (Peter Xu, 2 files, -1/+14)
This patch allows do_fault() to trigger on !pte_none() cases too. This prepares for the pte markers to be handled by do_fault() just like none pte. To achieve this, instead of unconditionally check against pte_none() in finish_fault(), we may hit the case that the orig_pte was some pte marker so what we want to do is to replace the pte marker with some valid pte entry. Then if orig_pte was set we'd want to check the current *pte (under pgtable lock) against orig_pte rather than none pte. Right now there's no solid way to safely reference orig_pte because when pmd is not allocated handle_pte_fault() will not initialize orig_pte, so it's not safe to reference it. There's another solution proposed before this patch to do pte_clear() for vmf->orig_pte for pmd==NULL case, however it turns out it'll break arm32 because arm32 could have assumption that pte_t* pointer will always reside on a real ram32 pgtable, not any kernel stack variable. To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be set along with orig_pte when there is valid orig_pte, or it'll be cleared when orig_pte was not initialized. It'll be updated every time we call handle_pte_fault(), so e.g. if a page fault retry happened it'll be properly updated along with orig_pte. [1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/ [akpm@linux-foundation.org: coding-style cleanups] [peterx@redhat.com: fix crash reported by Marek] Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm: teach core mm about pte markers  (Peter Xu, 7 files, -8/+47)
This patch still does not use pte marker in any way, however it teaches the core mm about the pte marker idea. For example, handle_pte_marker() is introduced that will parse and handle all the pte marker faults. Many of the places are more about commenting it up - so that we know there's the possibility of pte marker showing up, and why we don't need special code for the cases. [peterx@redhat.com: userfaultfd.c needs swapops.h] Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm: introduce PTE_MARKER swap entry  (Peter Xu, 5 files, -1/+112)
Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8. Overview ======== Userfaultfd-wp anonymous support was merged two years ago. There're quite a few applications that started to leverage this capability either to take snapshots for user-app memory, or use it for full user controled swapping. This series tries to complete the feature for uffd-wp so as to cover all the RAM-based memory types. So far uffd-wp is the only missing piece of the rest features (uffd-missing & uffd-minor mode). One major reason to do so is that anonymous pages are sometimes not satisfying the need of applications, and there're growing users of either shmem and hugetlbfs for either sharing purpose (e.g., sharing guest mem between hypervisor process and device emulation process, shmem local live migration for upgrades), or for performance on tlb hits. All these mean that if a uffd-wp app wants to switch to any of the memory types, it'll stop working. I think it's worthwhile to have the kernel to cover all these aspects. This series chose to protect pages in pte level not page level. One major reason is safety. I have no idea how we could make it safe if any of the uffd-privileged app can wr-protect a page that any other application can use. It means this app can block any process potentially for any time it wants. The other reason is that it aligns very well with not only the anonymous uffd-wp solution, but also uffd as a whole. For example, userfaultfd is implemented fundamentally based on VMAs. We set flags to VMAs showing the status of uffd tracking. For another per-page based protection solution, it'll be crossing the fundation line on VMA-based, and it could simply be too far away already from what's called userfaultfd. PTE markers =========== The patchset is based on the idea called PTE markers. It was discussed in one of the mm alignment sessions, proposed starting from v6, and this is the 2nd version of it using PTE marker idea. PTE marker is a new type of swap entry that is ony applicable to file backed memories like shmem and hugetlbfs. It's used to persist some pte-level information even if the original present ptes in pgtable are zapped. Logically pte markers can store more than uffd-wp information, but so far only one bit is used for uffd-wp purpose. When the pte marker is installed with uffd-wp bit set, it means this pte is wr-protected by uffd. It solves the problem on e.g. file-backed memory mapped ptes got zapped due to any reason (e.g. thp split, or swapped out), we can still keep the wr-protect information in the ptes. Then when the page fault triggers again, we'll know this pte is wr-protected so we can treat the pte the same as a normal uffd wr-protected pte. The extra information is encoded into the swap entry, or swp_offset to be explicit, with the swp_type being PTE_MARKER. So far uffd-wp only uses one bit out of the swap entry, the rest bits of swp_offset are still reserved for other purposes. There're two configs to enable/disable PTE markers: CONFIG_PTE_MARKER CONFIG_PTE_MARKER_UFFD_WP We can set !PTE_MARKER to completely disable all the PTE markers, along with uffd-wp support. I made two config so we can also enable PTE marker but disable uffd-wp file-backed for other purposes. At the end of current series, I'll enable CONFIG_PTE_MARKER by default, but that patch is standalone and if anyone worries about having it by default, we can also consider turn it off by dropping that oneliner patch. So far I don't see a huge risk of doing so, so I kept that patch. 
In most cases, PTE markers should be treated as none ptes. That is because, unlike most other swap entry types, there's no PFN or block offset information encoded into PTE markers, only some extra well-defined bits showing the status of the pte. These bits should only be used as extra data when servicing an upcoming page fault, and then we behave as if it's a none pte. I did spend a lot of time observing all the pte_none() users this time. It is indeed a challenge because there're a lot, and I hope I didn't miss a single one of them where we should take care of pte markers. Luckily, I don't think they need to be considered in many cases, for example: boot code, arch code (especially non-x86), kernel-only page handling (e.g. CPA), or device driver code dealing with pure PFN mappings. I introduced pte_none_mostly() in this series for when we need to handle pte markers the same as none ptes; the "mostly" is the other way to write "either none pte or a pte marker". I didn't replace pte_none() to cover pte markers, for the reasons below:

- Only a very rare subset of pte_none() callers will handle pte markers. E.g., all the kernel pages do not require knowledge of pte markers. So we don't pollute the major use cases.
- Unconditionally changing pte_none() semantics could confuse people, because pte_none() has existed for so long a time.
- Unconditionally changing pte_none() semantics could make pte_none() slower, even if in many cases pte markers do not exist.
- There're cases where we'd like to handle pte markers differently from pte_none(), so a full replace is also impossible. E.g. khugepaged should still treat pte markers as normal swap ptes rather than none ptes, because pte markers will always need a fault-in to merge the marker with a valid pte. Or the smaps code will need to parse PTE markers, not none ptes.
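As a rough sketch of the pte_none_mostly() idea described above (the in-tree helper may be placed and spelled differently):

    /* "Mostly none": either a true none pte, or a pte marker. */
    static inline bool pte_none_mostly(pte_t pte)
    {
            return pte_none(pte) ||
                   (is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte)));
    }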
Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:
  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:
  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:
  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest mm for uffd-wp file-backed:
  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:
  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====
- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

This patch (of 23): Introduces a new swap entry type called PTE_MARKER. It can be installed for any pte that maps a file-backed memory when the pte is temporarily zapped, so as to maintain per-pte information. The information kept in the pte is called a "marker". Here we define the marker as "unsigned long" just to match pgoff_t, however it will only work if it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64. A new config CONFIG_PTE_MARKER is introduced too; it's by default off. A bunch of helpers are defined altogether to service the rest of the pte marker code. [peterx@redhat.com: fixup] Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/page_alloc: cache the result of node_dirty_ok()  (Wonhyuk Yang, 1 file, -6/+7)
To spread dirty pages, nodes are checked for whether they have reached the dirty limit using the expensive node_dirty_ok(). To reduce the frequency of calling node_dirty_ok(), the last node that hit the dirty limit can be cached. Instead of caching only the node, caching both the node and its node_dirty_ok() status can reduce the number of calls to node_dirty_ok(). [akpm@linux-foundation.org: rename last_pgdat_dirty_limit to last_pgdat_dirty_ok] Link: https://lkml.kernel.org/r/20220430011032.64071-1-vvghjk1234@gmail.com Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Donghyeok Kim <dthex5d@gmail.com> Cc: JaeSang Yoo <jsyoo5b@gmail.com> Cc: Jiyoup Kim <lakroforce@gmail.com> Cc: Ohhoon Kwon <ohkwon1043@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
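The caching pattern described above would sit in the zonelist walk of get_page_from_freelist(); the fragment below is a hedged sketch with illustrative variable names, not the literal patch.

    struct pglist_data *last_pgdat = NULL;
    bool last_pgdat_dirty_ok = false;

    for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                                    ac->highest_zoneidx, ac->nodemask) {
            if (ac->spread_dirty_pages) {
                    /* Re-evaluate node_dirty_ok() only when the node changes. */
                    if (last_pgdat != zone->zone_pgdat) {
                            last_pgdat = zone->zone_pgdat;
                            last_pgdat_dirty_ok = node_dirty_ok(zone->zone_pgdat);
                    }
                    if (!last_pgdat_dirty_ok)
                            continue;
            }
            /* ... rest of the zone suitability checks ... */
    }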
2022-05-13  Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter  (SeongJae Park, 1 file, -0/+11)
This commit documents the new DAMON_RECLAIM parameter, 'commit_inputs' in its usage document. Link: https://lkml.kernel.org/r/20220429160606.127307-15-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/damon/reclaim: support online inputs update  (SeongJae Park, 1 file, -33/+62)
DAMON_RECLAIM reads the user input parameters only when it starts. To allow more efficient online tuning, this commit implements a new input parameter called 'commit_inputs'. Writing true to the parameter makes DAMON_RECLAIM read the input parameters again. Link: https://lkml.kernel.org/r/20220429160606.127307-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  Docs/{ABI,admin-guide}/damon: Update for 'state' sysfs file input keyword, 'commit'  (SeongJae Park, 2 files, -7/+9)
This commit documents the newly added 'state' sysfs file input keyword, 'commit', which allows online tuning of DAMON contexts. Link: https://lkml.kernel.org/r/20220429160606.127307-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13  mm/damon/sysfs: support online inputs update  (SeongJae Park, 1 file, -9/+90)
Currently, the DAMON sysfs interface doesn't provide a way for adjusting DAMON input parameters while it is turned on. Therefore, users who want to reconfigure DAMON need to stop DAMON and restart it. This means all the monitoring results accumulated so far, which could be useful, are flushed. This would be inefficient for many cases. For example, suppose a sysadmin was running a DAMON-based Operation Scheme to find memory regions not accessed for more than 5 mins and page out those regions. If it turns out the 5 mins threshold was too long and the sysadmin therefore wants to reduce it to 4 mins, the sysadmin should turn off DAMON, restart it, and wait for at least 4 more minutes so that DAMON can find the cold memory regions again, even though DAMON already knew, at the time of shutdown, that there were regions not accessed for 4 mins. This commit makes the DAMON sysfs interface support online updates of the DAMON input parameters by adding a new input keyword for the 'state' DAMON sysfs file, 'commit'. Writing the keyword to the 'state' file while the corresponding kdamond is running makes the kdamond read the sysfs file values again and update the DAMON context. Link: https://lkml.kernel.org/r/20220429160606.127307-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon/sysfs: update schemes stat in the kdamond contextSeongJae Park1-26/+135
Only '->kdamond' and '->kdamond_stop' are protected by 'kdamond_lock' of 'struct damon_ctx'. All other DAMON context internal data items are recommended to be accessed from DAMON callbacks, or under some additional synchronization. However, DAMON sysfs accesses the schemes stats under 'kdamond_lock' only. This is not a big issue, as the read values are not used anywhere inside the kernel, but it is better to fix it. This commit moves the reads to the DAMON callback context, which is the intended place for such accesses. Link: https://lkml.kernel.org/r/20220429160606.127307-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon/sysfs: use enum for 'state' input handlingSeongJae Park1-10/+62
DAMON sysfs 'state' file handling code uses string literals in both 'state_show()' and 'state_store()'. This makes the code error-prone and inflexible for future extensions. To improve the situation, this commit defines the possible input strings and an 'enum' for identifying each input keyword only once, and refactors the code to reuse those. Link: https://lkml.kernel.org/r/20220429160606.127307-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
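A standalone sketch of the refactoring idea (hypothetical names, not the actual sysfs code): define the keyword strings and a matching enum once, then have both the show and store paths use the same table instead of repeating string literals.

    #include <stdio.h>
    #include <string.h>

    /* Each possible 'state' input keyword, defined once. */
    enum state_cmd {
            CMD_ON,
            CMD_OFF,
            CMD_COMMIT,
            NR_CMDS,
    };

    static const char * const cmd_strs[NR_CMDS] = {
            [CMD_ON] = "on",
            [CMD_OFF] = "off",
            [CMD_COMMIT] = "commit",
    };

    /* Store path: map the written string to the enum in one place. */
    static int parse_cmd(const char *buf)
    {
            for (int i = 0; i < NR_CMDS; i++)
                    if (!strcmp(buf, cmd_strs[i]))
                            return i;
            return -1;
    }

    int main(void)
    {
            printf("%d %d %d\n", parse_cmd("on"), parse_cmd("commit"),
                   parse_cmd("bogus"));   /* prints 0 2 -1 */
            return 0;
    }

Adding a new keyword then only requires extending the enum and the table, rather than touching every comparison site.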
2022-05-13mm/damon/sysfs: reuse damon_set_regions() for regions settingSeongJae Park1-15/+18
'damon_set_regions()' is general enough that it can also be used just for creating regions. This commit makes the DAMON sysfs interface reuse the function rather than keeping two implementations for the same purpose. Link: https://lkml.kernel.org/r/20220429160606.127307-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon/sysfs: move targets setup code to a separated functionSeongJae Park1-21/+28
This commit separates the DAMON sysfs interface's monitoring context targets setup code into a new function for better readability. Link: https://lkml.kernel.org/r/20220429160606.127307-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon/sysfs: prohibit multiple physical address space monitoring targetsSeongJae Park1-0/+4
Having multiple targets for physical address space monitoring makes no sense. This commit prohibits such a ridiculous DAMON context setup by making the DAMON context build function check for the case and return an error. Link: https://lkml.kernel.org/r/20220429160606.127307-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
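The shape of such a validation can be sketched in a few lines (hypothetical names, not the actual DAMON sysfs code): the build step inspects the chosen operations set and the target count, and refuses the combination that makes no sense.

    #include <stdio.h>

    enum ops_id { OPS_VADDR, OPS_PADDR };

    /* Reject setups with more than one physical address space target. */
    static int validate_targets(enum ops_id ops, int nr_targets)
    {
            if (ops == OPS_PADDR && nr_targets > 1)
                    return -1;   /* the real code would return an errno-style error */
            return 0;
    }

    int main(void)
    {
            printf("%d %d\n", validate_targets(OPS_PADDR, 1),
                   validate_targets(OPS_PADDR, 2));   /* prints 0 -1 */
            return 0;
    }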
2022-05-13mm/damon/vaddr: remove damon_va_apply_three_regions()SeongJae Park2-20/+4
'damon_va_apply_three_regions()' is just a wrapper around its general version, 'damon_set_regions()'. This commit replaces the wrapper calls with direct calls to the general version. Link: https://lkml.kernel.org/r/20220429160606.127307-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon/vaddr: move 'damon_set_regions()' to coreSeongJae Park3-73/+75
This commit moves 'damon_set_regions()' from vaddr to core, as it is intended to be used not only by 'vaddr' but also by other parts of DAMON. Link: https://lkml.kernel.org/r/20220429160606.127307-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon/vaddr: generalize damon_va_apply_three_regions()SeongJae Park1-24/+42
'damon_va_apply_three_regions()' is for adjusting address ranges to fit in three discontiguous ranges. The function can be generalized to an arbitrary number of discontiguous ranges and reused for future purposes, such as arbitrary online region updates. For such future use, this commit introduces a generalized version of the function called 'damon_set_regions()'. Link: https://lkml.kernel.org/r/20220429160606.127307-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
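The generalization is essentially an interface change: a parameter that was a fixed array of three ranges becomes a pointer plus a count. A hedged, much-simplified sketch of that shape (hypothetical types and a trivial body; the real 'damon_set_regions()' also adjusts, splits, and removes monitoring regions):

    #include <stdio.h>

    struct addr_range { unsigned long start, end; };

    /* Old shape: hard-coded to exactly three discontiguous ranges. */
    static unsigned long covered_three(const struct addr_range ranges[3])
    {
            unsigned long total = 0;

            for (int i = 0; i < 3; i++)
                    total += ranges[i].end - ranges[i].start;
            return total;
    }

    /* Generalized shape: any number of discontiguous ranges. */
    static unsigned long covered(const struct addr_range *ranges,
                                 unsigned int nr_ranges)
    {
            unsigned long total = 0;

            for (unsigned int i = 0; i < nr_ranges; i++)
                    total += ranges[i].end - ranges[i].start;
            return total;
    }

    int main(void)
    {
            struct addr_range r[3] = { { 0, 10 }, { 20, 30 }, { 40, 50 } };

            printf("%lu %lu\n", covered_three(r), covered(r, 3));   /* 30 30 */
            return 0;
    }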
2022-05-13mm/damon/core: finish kdamond as soon as any callback returns an errorSeongJae Park1-2/+6
When the 'after_sampling()' or 'after_aggregation()' DAMON callbacks return an error, kdamond still runs the rest of the loop iteration once. It makes little sense to run the remaining work after something has already gone wrong; the context might be corrupted or hold invalid data. This commit therefore makes kdamond skip the remaining work and finish immediately in these cases. Link: https://lkml.kernel.org/r/20220429160606.127307-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
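The control-flow change can be summarized with a small sketch (hypothetical names, not the kdamond main loop itself): once a callback reports an error, the rest of the iteration is skipped and the loop terminates.

    #include <stdio.h>

    /* Stand-in callbacks; returning non-zero signals an error. */
    static int after_sampling_cb(int iteration)    { return iteration == 2 ? -1 : 0; }
    static int after_aggregation_cb(int iteration) { (void)iteration; return 0; }

    int main(void)
    {
            for (int i = 0; i < 5; i++) {
                    if (after_sampling_cb(i))
                            break;           /* finish immediately on error */
                    /* ... remaining per-iteration work runs only on success ... */
                    if (after_aggregation_cb(i))
                            break;
                    printf("iteration %d completed\n", i);
            }
            printf("worker finished\n");
            return 0;
    }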
2022-05-13mm/damon/core: add a new callback for watermarks checksSeongJae Park2-1/+14
Patch series "mm/damon: Support online tuning". Effects of DAMON and DAMON-based Operation Schemes highly depends on the configurations. Wrong configurations could even result in unexpected efficiency degradations. For finding a best configuration, repeating incremental configuration changes and results measurements, in other words, online tuning, could be helpful. Nevertheless, DAMON kernel API supports only restrictive online tuning. Worse yet, the sysfs-based DAMON user interface doesn't support online tuning at all. DAMON_RECLAIM also doesn't support online tuning. This patchset makes the DAMON kernel API, DAMON sysfs interface, and DAMON_RECLAIM supports online tuning. Sequence of patches ------------------- First two patches enhance DAMON online tuning for kernel API users. Specifically, patch 1 let kernel API users to be able to do DAMON online tuning without a restriction, and patch 2 makes error handling easier. Following seven patches (patches 3-9) refactor code for better readability and easier reuse of code fragments that will be useful for online tuning support. Patch 10 introduces DAMON callback based user request handling structure for DAMON sysfs interface, and patch 11 enables DAMON online tuning via DAMON sysfs interface. Documentation patch (patch 12) for usage of it follows. Patch 13 enables online tuning of DAMON_RECLAIM and finally patch 14 documents the DAMON_RECLAIM online tuning usage. This patch (of 14): For updating input parameters for running DAMON contexts, DAMON kernel API users can use the contexts' callbacks, as it is the safe place for context internal data accesses. When the context has DAMON-based operation schemes and all schemes are deactivated due to their watermarks, however, DAMON does nothing but only watermarks checks. As a result, no callbacks will be called back, and therefore the kernel API users cannot update the input parameters including monitoring attributes, DAMON-based operation schemes, and watermarks. To let users easily update such DAMON input parameters in such a case, this commit adds a new callback, 'after_wmarks_check()'. It will be called after each watermarks check. Users can do the online input parameters update in the callback even under the schemes deactivated case. Link: https://lkml.kernel.org/r/20220429160606.127307-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13selftest/vm: test that mremap fails on non-existent vmaNiels Dossche1-0/+6
Add a regression test that validates that mremap fails for VMAs that don't exist. Link: https://lkml.kernel.org/r/20220427224439.23828-3-dossche.niels@gmail.com Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
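A standalone userspace sketch of the scenario the test covers (not the selftest code itself): map and then unmap a region so that no VMA backs it, then check that mremap() on that address fails.

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            void *addr = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            assert(addr != MAP_FAILED);
            assert(munmap(addr, page) == 0);    /* now no VMA exists at 'addr' */

            /* mremap() on a non-existent VMA must fail. */
            void *res = mremap(addr, page, 2 * page, MREMAP_MAYMOVE);
            assert(res == MAP_FAILED);
            perror("mremap");                   /* typically reports EFAULT */
            printf("mremap correctly failed on a non-existent vma\n");
            return 0;
    }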
2022-05-13mm/rmap: Fix typos in commentsAdrian Huang1-2/+2
Fix spelling/grammar mistakes in comments. Link: https://lkml.kernel.org/r/20220428061522.666-1-adrianhuang0701@gmail.com Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>