summaryrefslogtreecommitdiff
path: root/mm/gup.c
AgeCommit message (Collapse)AuthorFilesLines
2022-03-23mm/gup: follow_pfn_pte(): -EEXIST cleanupJohn Hubbard1-5/+8
Remove a quirky special case from follow_pfn_pte(), and adjust its callers to match. Caller changes include: __get_user_pages(): Regardless of any FOLL_* flags, get_user_pages() and its variants should handle PFN-only entries by stopping early, if the caller expected **pages to be filled in. This makes for a more reliable API, as compared to the previous approach of skipping over such entries (and thus leaving them silently unwritten). move_pages(): squash the -EEXIST error return from follow_page() into -EFAULT, because -EFAULT is listed in the man page, whereas -EEXIST is not. Link: https://lkml.kernel.org/r/20220204020010.68930-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Peter Xu <peterx@redhat.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23mm: fix invalid page pointer returned with FOLL_PIN gupsPeter Xu1-1/+1
Patch series "mm/gup: some cleanups", v5. This patch (of 5): Alex reported invalid page pointer returned with pin_user_pages_remote() from vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched pinning with struct vfio_batch"). It turns out that it's not the fault of the vfio commit; however after vfio switches to a full page buffer to store the page pointers it starts to expose the problem easier. The problem is for VM_PFNMAP vmas we should normally fail with an -EFAULT then vfio will carry on to handle the MMIO regions. However when the bug triggered, follow_page_mask() returned -EEXIST for such a page, which will jump over the current page, leaving that entry in **pages untouched. However the caller is not aware of it, hence the caller will reference the page as usual even if the pointer data can be anything. We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn mapping unless FOLL_GET is requested") which seems very reasonable. It could be that when we reworked GUP with FOLL_PIN we could have overlooked that special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when it needs to return an -EEXIST. Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than 1027e4436b6a. [jhubbard@nvidia.com: added some tags, removed a reference to an out of tree module.] Link: https://lkml.kernel.org/r/20220207062213.235127-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20220204020010.68930-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20220204020010.68930-2-jhubbard@nvidia.com Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reported-by: Alex Williamson <alex.williamson@redhat.com> Debugged-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: David Hildenbrand <david@redhat.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21mm/gup: Convert check_and_migrate_movable_pages() to use a folioMatthew Wilcox (Oracle)1-13/+14
Switch from head pages to folios. This removes an assumption that THPs are the only way to have a high-order page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Turn compound_range_next() into gup_folio_range_next()Matthew Wilcox (Oracle)1-21/+17
Convert the only caller to work on folios instead of pages. This removes the last caller of put_compound_head(), so delete it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Turn compound_next() into gup_folio_next()Matthew Wilcox (Oracle)1-19/+21
Convert both callers to work on folios instead of pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Convert gup_huge_pgd() to use a folioMatthew Wilcox (Oracle)1-11/+6
Use the new folio-based APIs. This was the last user of try_grab_compound_head(), so remove it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Convert gup_huge_pud() to use a folioMatthew Wilcox (Oracle)1-5/+6
Use the new folio-based APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Convert gup_huge_pmd() to use a folioMatthew Wilcox (Oracle)1-5/+6
Use the new folio-based APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Convert gup_hugepte() to use a folioMatthew Wilcox (Oracle)1-7/+7
There should be little to no effect from this patch; just removing uses of some old APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Convert gup_pte_range() to use a folioMatthew Wilcox (Oracle)1-10/+8
We still call try_grab_folio() once per PTE; a future patch could optimise to just adjust the reference count for each page within the folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head()Matthew Wilcox (Oracle)1-1/+1
follow_hugetlb_page() only cares about success or failure, so it doesn't need to know the type of the returned pointer, only whether it's NULL or not. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Add gup_put_folio()Matthew Wilcox (Oracle)1-26/+12
Convert put_compound_head() to gup_put_folio() and hpage_pincount_sub() to folio_pincount_sub(). This removes the last call to put_page_refs(), so delete it. Add a temporary put_compound_head() wrapper which will be deleted by the end of this series. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Convert try_grab_page() to use a folioMatthew Wilcox (Oracle)1-15/+13
Hoist the folio conversion and the folio_ref_count() check to the top of the function instead of using the one buried in try_get_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm/gup: Add try_get_folio() and try_grab_folio()Matthew Wilcox (Oracle)1-50/+49
Convert try_get_compound_head() into try_get_folio() and convert try_grab_compound_head() into try_grab_folio(). Add a temporary try_grab_compound_head() wrapper around try_grab_folio() to let us convert callers individually. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm: Make compound_pincount always availableMatthew Wilcox (Oracle)1-11/+9
Move compound_pincount from the third page to the second page, which means it's available for all compound pages. That lets us delete hpage_pincount_available(). On 32-bit systems, there isn't enough space for both compound_pincount and compound_nr in the second page (it would collide with page->private, which is in use for pages in the swap cache), so revert the optimisation of storing both compound_order and compound_nr on 32-bit systems. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Remove hpage_pincount_sub()Matthew Wilcox (Oracle)1-10/+3
Move the assertion (and correct it to be a cheaper variant), and inline the atomic_sub() operation. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Remove hpage_pincount_add()Matthew Wilcox (Oracle)1-20/+11
It's clearer to call atomic_add() in the callers; the assertions clearly can't fire there because they're part of the condition for calling atomic_add(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm/gup: Handle page split race more efficientlyMatthew Wilcox (Oracle)1-2/+5
If we hit the page split race, the current code returns NULL which will presumably trigger a retry under the mmap_lock. This isn't necessary; we can just retry the compound_head() lookup. This is a very minor optimisation of an unlikely path, but conceptually it matches (eg) the page cache RCU-protected lookup. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Remove an assumption of a contiguous memmapMatthew Wilcox (Oracle)1-2/+2
This assumption needs the inverse of nth_page(), which is temporarily named page_nth() until it's renamed later in this series. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Fix some contiguous memmap assumptionsMatthew Wilcox (Oracle)1-7/+7
Several functions in gup.c assume that a compound page has virtually contiguous page structs. This isn't true for SPARSEMEM configs unless SPARSEMEM_VMEMMAP is also set. Fix them by using nth_page() instead of plain pointer arithmetic. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Change the calling convention for compound_next()Matthew Wilcox (Oracle)1-6/+5
Return the head page instead of storing it to a passed parameter. Reorder the arguments to match the calling function's arguments. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Optimise compound_range_next()Matthew Wilcox (Oracle)1-1/+1
By definition, a compound page has an order >= 1, so the second half of the test was redundant. Also, this cannot be a tail page since it's the result of calling compound_head(), so use PageHead() instead of PageCompound(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Change the calling convention for compound_range_next()Matthew Wilcox (Oracle)1-6/+5
Return the head page instead of storing it to a passed parameter. Pass the start page directly instead of passing a pointer to it. Reorder the arguments to match the calling function's arguments. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm/gup: Remove for_each_compound_head()Matthew Wilcox (Oracle)1-11/+5
This macro doesn't simplify the users; it's easier to just call compound_next() inside a standard loop. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Remove for_each_compound_range()Matthew Wilcox (Oracle)1-10/+2
This macro doesn't simplify the users; it's easier to just call compound_range_next() inside the loop. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Increment the page refcount before the pincountMatthew Wilcox (Oracle)1-8/+6
We should always increase the refcount before doing anything else to the page so that other page users see the elevated refcount first. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-10mm: gup: make fault_in_safe_writeable() use fixup_user_fault()Linus Torvalds1-38/+19
Instead of using GUP, make fault_in_safe_writeable() actually force a 'handle_mm_fault()' using the same fixup_user_fault() machinery that futexes already use. Using the GUP machinery meant that fault_in_safe_writeable() did not do everything that a real fault would do, ranging from not auto-expanding the stack segment, to not updating accessed or dirty flags in the page tables (GUP sets those flags on the pages themselves). The latter causes problems on architectures (like s390) that do accessed bit handling in software, which meant that fault_in_safe_writeable() didn't actually do all the fault handling it needed to, and trying to access the user address afterwards would still cause faults. Reported-and-tested-by: Andreas Gruenbacher <agruenba@redhat.com> Fixes: cdd591fc86e3 ("iov_iter: Introduce fault_in_iov_iter_writeable") Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-03mm: refactor check_and_migrate_movable_pagesChristoph Hellwig1-37/+44
Remove up to two levels of indentation by using continue statements and move variables to local scope where possible. Link: https://lkml.kernel.org/r/20220210072828.2930359-11-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: delete FOLL_MLOCK and FOLL_POPULATEHugh Dickins1-35/+8
If counting page mlocks, we must not double-count: follow_page_pte() can tell if a page has already been Mlocked or not, but cannot tell if a pte has already been counted or not: that will have to be done when the pte is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks for new anon pages, but there's no such tracking yet for others). Delete all the FOLL_MLOCK code - faulting in the missing pages will do all that is necessary, without special mlock_vma_page() calls from here. But then FOLL_POPULATE turns out to serve no purpose - it was there so that its absence would tell faultin_page() not to faultin page when setting up VM_LOCKONFAULT areas; but if there's no special work needed here for mlock, then there's no work at all here for VM_LOCKONFAULT. Have I got that right? I've not looked into the history, but see that FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different purpose before? Ah, yes, it was used to skip the old stack guard page. And is it intentional that COW is not broken on existing pages when setting up a VM_LOCKONFAULT area? I can see that being argued either way, and have no reason to disagree with current behaviour. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-03Revert "mm/gup: small refactoring: simplify try_grab_page()"John Hubbard1-5/+30
This reverts commit 54d516b1d62ff8f17cee2da06e5e4706a0d00b8a That commit did a refactoring that effectively combined fast and slow gup paths (again). And that was again incorrect, for two reasons: a) Fast gup and slow gup get reference counts on pages in different ways and with different goals: see Linus' writeup in commit cd1adf1b63a1 ("Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly""), and b) try_grab_compound_head() also has a specific check for "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller can fall back to slow gup. This resulted in new failures, as recently report by Will McVicker [1]. But (a) has problems too, even though they may not have been reported yet. So just revert this. Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1] Fixes: 54d516b1d62f ("mm/gup: small refactoring: simplify try_grab_page()") Reported-and-tested-by: Will McVicker <willmcvicker@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Minchan Kim <minchan@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: stable@vger.kernel.org # 5.15 Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/gup.c: stricter check on THP migration entry during follow_pmd_maskLi Xinhai1-4/+9
When BUG_ON check for THP migration entry, the existing code only check thp_migration_supported case, but not for !thp_migration_supported case. If !thp_migration_supported() and !pmd_present(), the original code may dead loop in theory. To make the BUG_ON check consistent, we need catch both cases. Move the BUG_ON check one step earlier, because if the bug happen we should know it instead of depend on FOLL_MIGRATION been used by caller. Because pmdval instead of *pmd is read by the is_pmd_migration_entry() check, the existing code don't help to avoid useless locking within pmd_migration_entry_wait(), so remove that check. Link: https://lkml.kernel.org/r/20211217062559.737063-1-lixinhai.lxh@gmail.com Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Zi Yan <ziy@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15gup: avoid multiple user access locking/unlocking in fault_in_{read/write}ableChristophe Leroy1-8/+10
fault_in_readable() and fault_in_writeable() perform __get_user() and __put_user() in a loop, implying multiple user access locking/unlocking. To avoid that, use user access blocks. Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-4/+1
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06mm/gup: further simplify __gup_device_huge()John Hubbard1-4/+1
Commit 6401c4eb57f9 ("mm: gup: fix potential pgmap refcnt leak in __gup_device_huge()") simplified the return paths, but didn't go quite far enough, as discussed in [1]. Remove the "ret" variable entirely, because there is enough information already available to provide the return value. [1] https://lore.kernel.org/r/CAHk-=wgQTRX=5SkCmS+zfmpqubGHGJvXX_HgnPG8JSpHKHBMeg@mail.gmail.com Link: https://lkml.kernel.org/r/20210904004224.86391-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-24gup: Introduce FOLL_NOFAULT flag to disable page faultsAndreas Gruenbacher1-1/+3
Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return -EFAULT when it would otherwise trigger a page fault. This is roughly similar to FOLL_FAST_ONLY but available on all architectures, and less fragile. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20iov_iter: Introduce fault_in_iov_iter_writeableAndreas Gruenbacher1-0/+63
Introduce a new fault_in_iov_iter_writeable helper for safely faulting in an iterator for writing. Uses get_user_pages() to fault in the pages without actually writing to them, which would be destructive. We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that the iterator passed to .read_iter isn't in memory. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}Andreas Gruenbacher1-0/+72
Turn fault_in_pages_{readable,writeable} into versions that return the number of bytes not faulted in, similar to copy_to_user, instead of returning a non-zero value when any of the requested pages couldn't be faulted in. This supports the existing users that require all pages to be faulted in as well as new users that are happy if any pages can be faulted in. Rename the functions to fault_in_{readable,writeable} to make sure this change doesn't silently break things. Neither of these functions is entirely trivial and it doesn't seem useful to inline them, so move them to mm/gup.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-09-07Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"Linus Torvalds1-17/+4
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e. That commit was completely broken, and I should have caught on to it earlier. But happily, the kernel test robot noticed the breakage fairly quickly. The breakage is because "try_get_page()" is about avoiding the page reference count overflow case, but is otherwise the exact same as a plain "get_page()". In contrast, "try_get_compound_head()" is an entirely different beast, and uses __page_cache_add_speculative() because it's not just about the page reference count, but also about possibly racing with the underlying page going away. So all the commentary about how "try_get_page() has fallen a little behind in terms of maintenance, try_get_compound_head() handles speculative page references more thoroughly" was just completely wrong: yes, try_get_compound_head() handles speculative page references, but the point is that try_get_page() does not, and must not. So there's no lack of maintainance - there are fundamentally different semantics. A speculative page reference would be entirely wrong in "get_page()", and it's entirely wrong in "try_get_page()". It's not about speculation, it's purely about "uhhuh, you can't get this page because you've tried to increment the reference count too much already". The reason the kernel test robot noticed this bug was that it hit the VM_BUG_ON() in __page_cache_add_speculative(), which is all about verifying that the context of any speculative page access is correct. But since that isn't what try_get_page() is all about, the VM_BUG_ON() tests things that are not correct to test for try_get_page(). Reported-by: kernel test robot <oliver.sang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/migrate: enable returning precise migrate_pages() success countYang Shi1-1/+1
Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/gup: remove try_get_page(), call try_get_compound_head() directlyJohn Hubbard1-4/+17
try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/gup: small refactoring: simplify try_grab_page()John Hubbard1-30/+5
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1, ...), just with a different API. So there is a lot of code duplication there. Change try_grab_page() to call try_grab_compound_head(), while keeping the API contract identical for callers. Also, now that try_grab_compound_head() always has a caller, remove the __maybe_unused annotation. Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/gup: documentation corrections for gup/pupJohn Hubbard1-7/+20
Patch series "A few gup refactorings and documentation updates", v3. While reviewing some of the other things going on around gup.c, I noticed that the documentation was wrong for a few of the routines that I wrote. And then I noticed that there was some significant code duplication too. So this fixes those issues. This is not entirely risk-free, but after looking closely at this, I think it's actually a useful improvement, getting rid of the code duplication here. This patch (of 3): The documentation for try_grab_compound_head() and try_grab_page() has fallen a little out of date. Update and clarify a few points. Also make it kerneldoc-correct, by adding @args documentation. Link: https://lkml.kernel.org/r/20210813044133.1536842-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20210813044133.1536842-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: gup: use helper PAGE_ALIGNED in populate_vma_page_range()Miaohe Lin1-2/+2
Use helper PAGE_ALIGNED to check if address is aligned to PAGE_SIZE. Minor readability improvement. Link: https://lkml.kernel.org/r/20210807093620.21347-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: gup: fix potential pgmap refcnt leak in __gup_device_huge()Miaohe Lin1-5/+7
When failed to try_grab_page, put_dev_pagemap() is missed. So pgmap refcnt will leak in this case. Also we remove the check for pgmap against NULL as it's also checked inside the put_dev_pagemap(). [akpm@linux-foundation.org: simplify, cleanup] [akpm@linux-foundation.org: fix return value] Link: https://lkml.kernel.org/r/20210807093620.21347-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages") Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: gup: remove useless BUG_ON in __get_user_pages()Miaohe Lin1-1/+0
Indeed, this BUG_ON couldn't catch anything useful. We are sure ret == 0 here because we would already bail out if ret != 0 and ret is untouched till here. Link: https://lkml.kernel.org/r/20210807093620.21347-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: gup: remove unneed local variable orig_refsMiaohe Lin1-3/+1
Remove unneed local variable orig_refs since refs is unchanged now. Link: https://lkml.kernel.org/r/20210807093620.21347-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: gup: remove set but unused local variable majorMiaohe Lin1-2/+1
Patch series "Cleanups and fixup for gup". This series contains cleanups to remove unneeded variable, useless BUG_ON and use helper to improve readability. Also we fix a potential pgmap refcnt leak. More details can be found in the respective changelogs. This patch (of 5): Since commit a2beb5f1efed ("mm: clean up the last pieces of page fault accountings"), the local variable major is unused. Remove it. Link: https://lkml.kernel.org/r/20210807093620.21347-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210807093620.21347-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-14mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)David Hildenbrand1-2/+5
Doing some extended tests and polishing the man page update for MADV_POPULATE_(READ|WRITE), I realized that we end up converting also SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another madvise() user error. We want to report only problematic mappings and permission problems that the user could have know as -EINVAL. Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL, but instead indicate -EFAULT to user space. While we could also convert it to -ENOMEM, using -EFAULT looks more helpful when user space might want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is not part of an final Linux release and we can still adjust the behavior. Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08mm: introduce memfd_secret system call to create "secret" memory areasMike Rapoport1-0/+12
Introduce "memfd_secret" system call with the ability to create memory areas visible only in the context of the owning process and not mapped not only to other processes but in the kernel page tables as well. The secretmem feature is off by default and the user must explicitly enable it at the boot time. Once secretmem is enabled, the user will be able to create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and they will be only mapped in the page table of the processes that have access to the file descriptor. Secretmem is designed to provide the following protections: * Enhanced protection (in conjunction with all the other in-kernel attack prevention systems) against ROP attacks. Seceretmem makes "simple" ROP insufficient to perform exfiltration, which increases the required complexity of the attack. Along with other protections like the kernel stack size limit and address space layout randomization which make finding gadgets is really hard, absence of any in-kernel primitive for accessing secret memory means the one gadget ROP attack can't work. Since the only way to access secret memory is to reconstruct the missing mapping entry, the attacker has to recover the physical page and insert a PTE pointing to it in the kernel and then retrieve the contents. That takes at least three gadgets which is a level of difficulty beyond most standard attacks. * Prevent cross-process secret userspace memory exposures. Once the secret memory is allocated, the user can't accidentally pass it into the kernel to be transmitted somewhere. The secreremem pages cannot be accessed via the direct map and they are disallowed in GUP. * Harden against exploited kernel flaws. In order to access secretmem, a kernel-side attack would need to either walk the page tables and create new ones, or spawn a new privileged uiserspace process to perform secrets exfiltration using ptrace. The file descriptor based memory has several advantages over the "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File descriptor approach allows explicit and controlled sharing of the memory areas, it allows to seal the operations. Besides, file descriptor based memory paves the way for VMMs to remove the secret memory range from the userspace hipervisor process, for instance QEMU. Andy Lutomirski says: "Getting fd-backed memory into a guest will take some possibly major work in the kernel, but getting vma-backed memory into a guest without mapping it in the host user address space seems much, much worse." memfd_secret() is made a dedicated system call rather than an extension to memfd_create() because it's purpose is to allow the user to create more secure memory mappings rather than to simply allow file based access to the memory. Nowadays a new system call cost is negligible while it is way simpler for userspace to deal with a clear-cut system calls than with a multiplexer or an overloaded syscall. Moreover, the initial implementation of memfd_secret() is completely distinct from memfd_create() so there is no much sense in overloading memfd_create() to begin with. If there will be a need for code sharing between these implementation it can be easily achieved without a need to adjust user visible APIs. The secret memory remains accessible in the process context using uaccess primitives, but it is not exposed to the kernel otherwise; secret memory areas are removed from the direct map and functions in the follow_page()/get_user_page() family will refuse to return a page that belongs to the secret memory area. Once there will be a use case that will require exposing secretmem to the kernel it will be an opt-in request in the system call flags so that user would have to decide what data can be exposed to the kernel. Removing of the pages from the direct map may cause its fragmentation on architectures that use large pages to map the physical memory which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time. Pages in the secretmem regions are unevictable and unmovable to avoid accidental exposure of the sensitive data via swap or during page migration. Since the secretmem mappings are locked in memory they cannot exceed RLIMIT_MEMLOCK. Since these mappings are already locked independently from mlock(), an attempt to mlock()/munlock() secretmem range would fail and mlockall()/munlockall() will ignore secretmem mappings. However, unlike mlock()ed memory, secretmem currently behaves more like long-term GUP: secretmem mappings are unmovable mappings directly consumed by user space. With default limits, there is no excessive use of secretmem and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in the future this should be addressed to allow balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA. A page that was a part of the secret memory area is cleared when it is freed to ensure the data is not exposed to the next user of that page. The following example demonstrates creation of a secret mapping (error handling is omitted): fd = memfd_secret(0); ftruncate(fd, MAP_SIZE); ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ [akpm@linux-foundation.org: suppress Kconfig whine] Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net> Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tablesDavid Hildenbrand1-0/+58
I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. On preallcoation/prefaulting from user space Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach 1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem. 2) Can result in mmap_sem contention when prefaulting via multiple threads. 3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g, handle hardware errors. "Simply" doing preallocation concurrently from other thread is not that easy. III. On MADV_WILLNEED Extending MADV_WILLNEED is not an option because 1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/ preallocate all memory and definitely fail on errors.". 2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent. IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics: 1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes. 2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated. 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL. 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated. 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range. 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process. While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. while not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting. MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explciit fallocate() upfront to achieve preallcoation+prepopulation. Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed. Looking at the history, MADV_POPULATE was already proposed in 2013 [1], however, the main motivation back than was performance improvements -- which should also still be the case. V. Single-threaded performance comparison I did a short experiment, prefaulting page tables on completely *empty mappings/files* and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings. V.1: Private mappings POPULATE_READ and POPULATE_WRITE is fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which result in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE. V.2: Shared mappings fallocate() is fastest, however, doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without a fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE. With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty. v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>