summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-08-18selftests/mm: add -a to run_vmtests.shPeter Xu1-3/+8
Allows to specify optional tests in run_vmtests.sh, where we can run time consuming test matrix only when user specified "-a". Link: https://lkml.kernel.org/r/20230628215310.73782-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/gup: retire follow_hugetlb_page()Peter Xu4-256/+1
Now __get_user_pages() should be well prepared to handle thp completely, as long as hugetlb gup requests even without the hugetlb's special path. Time to retire follow_hugetlb_page(). Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal. Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/gup: accelerate thp gup even for "pages != NULL"Peter Xu1-7/+44
The acceleration of THP was done with ctx.page_mask, however it'll be ignored if **pages is non-NULL. The old optimization was introduced in 2013 in 240aadeedc4a ("mm: accelerate mm_populate() treatment of THP pages"). It didn't explain why we can't optimize the **pages non-NULL case. It's possible that at that time the major goal was for mm_populate() which should be enough back then. Optimize thp for all cases, by properly looping over each subpage, doing cache flushes, and boost refcounts / pincounts where needed in one go. This can be verified using gup_test below: # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10 Before: 13992.50 ( +-8.75%) After: 378.50 (+-69.62%) Link: https://lkml.kernel.org/r/20230628215310.73782-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/gup: cleanup next_page handlingPeter Xu1-4/+3
The only path that doesn't use generic "**pages" handling is the gate vma. Make it use the same path, meanwhile tune the next_page label upper to cover "**pages" handling. This prepares for THP handling for "**pages". Link: https://lkml.kernel.org/r/20230628215310.73782-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/hugetlb: add page_mask for hugetlb_follow_page_mask()Peter Xu3-5/+11
follow_page() doesn't need it, but we'll start to need it when unifying gup for hugetlb. Link: https://lkml.kernel.org/r/20230628215310.73782-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/hugetlb: prepare hugetlb_follow_page_mask() for FOLL_PINPeter Xu1-11/+22
follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the target of FOLL_WRITE either. However add the checks. Namely, either the need to CoW due to missing write bit, or proper unsharing on !AnonExclusive pages over R/O pins to reject the follow page. That brings this function closer to follow_hugetlb_page(). So we don't care before, and also for now. But we'll care if we switch over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to return -EMLINK properly, as that's the gup internal api to mean "we should unshare". Not really needed for follow page path, though. When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be clear that it just should never fail. When error happens, instead of setting page==NULL, capture the errno instead. Link: https://lkml.kernel.org/r/20230628215310.73782-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/hugetlb: handle FOLL_DUMP well in follow_page_mask()Peter Xu2-7/+11
Patch series "mm/gup: Unify hugetlb, speed up thp", v4. Hugetlb has a special path for slow gup that follow_page_mask() is actually skipped completely along with faultin_page(). It's not only confusing, but also duplicating a lot of logics that generic gup already has, making hugetlb slightly special. This patchset tries to dedup the logic, by first touching up the slow gup code to be able to handle hugetlb pages correctly with the current follow page and faultin routines (where we're mostly there.. due to 10 years ago we did try to optimize thp, but half way done; more below), then at the last patch drop the special path, then the hugetlb gup will always go the generic routine too via faultin_page(). Note that hugetlb is still special for gup, mostly due to the pgtable walking (hugetlb_walk()) that we rely on which is currently per-arch. But this is still one small step forward, and the diffstat might be a proof too that this might be worthwhile. Then for the "speed up thp" side: as a side effect, when I'm looking at the chunk of code, I found that thp support is actually partially done. It doesn't mean that thp won't work for gup, but as long as **pages pointer passed over, the optimization will be skipped too. Patch 6 should address that, so for thp we now get full speed gup. For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" gives me 13992.50us -> 378.50us. Gup_test is an extreme case, but just to show how it affects thp gups. This patch (of 8): Firstly, the no_page_table() is meaningless for hugetlb which is a no-op there, because a hugetlb page always satisfies: - vma_is_anonymous() == false - vma->vm_ops->fault != NULL So we can already safely remove it in hugetlb_follow_page_mask(), alongside with the page* variable. Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a dump: we try to fault in the page only if the page cache is already allocated. Let's do the same here for follow_page_mask() on hugetlb. It should so far has zero effect on real dumps, because that still goes into follow_hugetlb_page(). But this may start to influence a bit on follow_page() users who mimics a "dump page" scenario, but hopefully in a good way. This also paves way for unifying the hugetlb gup-slow. Link: https://lkml.kernel.org/r/20230628215310.73782-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230628215310.73782-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18arm64: mte: simplify swap tag restoration logicPeter Collingbourne3-44/+11
As a result of the patches "mm: Call arch_swap_restore() from do_swap_page()" and "mm: Call arch_swap_restore() from unuse_pte()", there are no circumstances in which a swapped-in page is installed in a page table without first having arch_swap_restore() called on it. Therefore, we no longer need the logic in set_pte_at() that restores the tags, so remove it. Link: https://lkml.kernel.org/r/20230523004312.1807357-4-pcc@google.com Link: https://linux-review.googlesource.com/id/I8ad54476f3b2d0144ccd8ce0c1d7a2963e5ff6f3 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Hildenbrand <david@redhat.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: kasan-dev@googlegroups.com Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: "Kuan-Ying Lee (李冠穎)" <Kuan-Ying.Lee@mediatek.com> Cc: Qun-Wei Lin <qun-wei.lin@mediatek.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: call arch_swap_restore() from unuse_pte()Peter Collingbourne1-0/+7
We would like to move away from requiring architectures to restore metadata from swap in the set_pte_at() implementation, as this is not only error-prone but adds complexity to the arch-specific code. This requires us to call arch_swap_restore() before calling swap_free() whenever pages are restored from swap. We are currently doing so everywhere except in unuse_pte(); do so there as well. Link: https://lkml.kernel.org/r/20230523004312.1807357-3-pcc@google.com Link: https://linux-review.googlesource.com/id/I68276653e612d64cde271ce1b5a99ae05d6bbc4f Signed-off-by: Peter Collingbourne <pcc@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Steven Price <steven.price@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: "Kuan-Ying Lee (李冠穎)" <Kuan-Ying.Lee@mediatek.com> Cc: Qun-Wei Lin <qun-wei.lin@mediatek.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: make show_free_areas() staticKefeng Wang5-19/+13
All callers of show_free_areas() pass 0 and NULL, so we can directly use show_mem() instead of show_free_areas(0, NULL), which could make show_free_areas() a static function. Link: https://lkml.kernel.org/r/20230630062253.189440-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: remove arguments of show_mem()Kefeng Wang6-7/+7
All callers of show_mem() pass 0 and NULL, so we can remove the two arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in show_mem(). Link: https://lkml.kernel.org/r/20230630062253.189440-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: make MEMFD_CREATE into a selectable config optionThomas Weißschuh2-3/+5
The memfd_create() syscall, enabled by CONFIG_MEMFD_CREATE, is useful on its own even when not required by CONFIG_TMPFS or CONFIG_HUGETLBFS. Split it into its own proper bool option that can be enabled by users. Move that option into mm/ where the code itself also lies. Also add "select" statements to CONFIG_TMPFS and CONFIG_HUGETLBFS so they automatically enable CONFIG_MEMFD_CREATE as before. Link: https://lkml.kernel.org/r/20230630-config-memfd-v1-1-9acc3ae38b5a@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Tested-by: Zhangjin Wu <falcon@tinylab.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: remove page_rmapping()ZhangPeng2-7/+0
After converting the last user to folio_raw_mapping(), we can safely remove the function. Link: https://lkml.kernel.org/r/20230701032853.258697-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: use a folio in fault_dirty_shared_page()ZhangPeng1-8/+8
We can replace four implicit calls to compound_head() with one by using folio. Link: https://lkml.kernel.org/r/20230701032853.258697-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18swap: stop add to avail list if swap is fullMa Wupeng1-1/+4
Our test finds a WARN_ON in add_to_avail_list. During add_to_avail_list, avail_lists is already in swap_avail_heads, while leads to this WARN_ON. Here is the simplified calltrace: ------------[ cut here ]------------ Call trace: add_to_avail_list+0xb8/0xc0 swap_range_free+0x110/0x138 swapcache_free_entries+0x100/0x1c0 free_swap_slot+0xbc/0xe0 put_swap_folio+0x1f0/0x2ec delete_from_swap_cache+0x6c/0xd0 folio_free_swap+0xa4/0xe4 __try_to_reclaim_swap+0x9c/0x190 free_swap_and_cache+0x84/0x88 unmap_page_range+0x31c/0x934 unmap_single_vma.isra.0+0x48/0x84 unmap_vmas+0x98/0x10c exit_mmap+0xa4/0x210 mmput+0x88/0x158 do_exit+0x284/0x970 do_group_exit+0x34/0x90 post_copy_siginfo_from_user32+0x0/0x1cc do_notify_resume+0x15c/0x470 el0_svc+0x74/0x84 el0t_64_sync_handler+0xb8/0xbc el0t_64_sync+0x190/0x194 During swapoff, try_to_unuse fails to alloc memory due to memory limit and this leads to the failure of swapoff and causes re-insertion of swap space back into swap_list. During _enable_swap_info, this swap device is added to avail list even this swap device if full. At the same time, one entry in this full swap device in released and we try to add this device into avail list and find it is already in the avail list. This causes this WARN_ON. To fix this. Don't add to avail list is swap is full. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20230627120833.2230766-3-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18swap: cleanup duplicated WARN_ON in add_to_avail_listMa Wupeng1-3/+1
Patch series "fix WARN_ON in add_to_avail_list". Empty check for plist_node is checked in add_to_avail_list and plist_add. Drop the duplicate one in add_to_avail_list. Link: https://lkml.kernel.org/r/20230627120833.2230766-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20230627120833.2230766-2-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: increase usage of folio_next_index() helperSidhartha Kumar5-9/+9
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Link: https://lkml.kernel.org/r/20230627174349.491803-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/mm_init.c: update obsolete comment in get_pfn_range_for_nid()Miaohe Lin1-2/+1
Since commit 633c0666b5a5 ("Memoryless nodes: drop one memoryless node boot warning"), the warning for a node with no available memory is removed. Update the corresponding comment. Link: https://lkml.kernel.org/r/20230625033340.1054103-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18maple_tree: fix a few documentation issuesThomas Gleixner2-7/+24
The documentation of mt_next() claims that it starts the search at the provided index. That's incorrect as it starts the search after the provided index. The documentation of mt_find() is slightly confusing. "Handles locking" is not really helpful as it does not explain how the "locking" works. Also the documentation of index talks about a range, while in reality the index is updated on a succesful search to the index of the found entry plus one. Fix similar issues for mt_find_after() and mt_prev(). Reword the confusing "Note: Will not return the zero entry." comment on mt_for_each() and document @__index correctly. Link: https://lkml.kernel.org/r/87ttw2n556.ffs@tglx Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: madvise: fix uneven accounting of psiCharan Teja Kalla1-0/+4
A folio turns into a Workingset during: 1) shrink_active_list() placing the folio from active to inactive list. 2) When a workingset transition is happening during the folio refault. And when Workingset is set on a folio, PSI for memory can be accounted during a) That folio is being reclaimed and b) Refault of that folio, for usual reclaims. This accounting of PSI for memory is not consistent for reclaim + refault operation between usual reclaim and madvise(COLD/PAGEOUT) which deactivate or proactively reclaim a folio: a) A folio started at inactive and moved to active as part of accesses. Workingset is absent on the folio thus refault of it when reclaimed through MADV_PAGEOUT operation doesn't account for PSI. b) When the same folio transition from inactive->active and then to inactive through shrink_active_list(). Workingset is set on the folio thus refault of it when reclaimed through MADV_PAGEOUT operation accounts for PSI. c) When the same folio is part of active list directly as a result of folio refault and this was a workingset folio prior to eviction. Workingset is set on the folio thus the refault of it when reclaimed through MADV_PAGEOUT/MADV_COLD operation accounts for PSI. d) MADV_COLD transfers the folio from active list to inactive list. Such folios may not have the Workingset thus refault operation on such folio doesn't account for PSI. As said above, refault operation caused because of MADV_PAGEOUT on a folio is accounts for memory PSI in b) and c) but not in a). Refault caused by the reclaim of a folio on which MADV_COLD is performed accounts memory PSI in c) but not in d). These behaviours are inconsistent w.r.t usual reclaim + refault operation. Make this PSI accounting always consistent by turning a folio into a workingset one whenever it is leaving the active list. Also, accounting of PSI on a folio whenever it leaves the active list as part of the MADV_COLD/PAGEOUT operation helps the users whether they are operating on proper folios[1]. [1] https://lore.kernel.org/all/20230605180013.GD221380@cmpxchg.org/ Link: https://lkml.kernel.org/r/1688393201-11135-1-git-send-email-quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Sai Manobhiram Manapragada <quic_smanapra@quicinc.com> Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-30Linux 6.5-rc4v6.5-rc4Linus Torvalds1-1/+1
2023-07-30Merge tag 'spi-fix-v6.5-rc3' of ↵Linus Torvalds1-5/+49
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "A bunch of fixes for the Qualcomm QSPI driver, fixing multiple issues with the newly added DMA mode - it had a number of issues exposed when tested in a wider range of use cases, both race condition style issues and issues with different inputs to those that had been used in test" * tag 'spi-fix-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: spi-qcom-qspi: Add mem_ops to avoid PIO for badly sized reads spi: spi-qcom-qspi: Fallback to PIO for xfers that aren't multiples of 4 bytes spi: spi-qcom-qspi: Add DMA_CHAIN_DONE to ALL_IRQS spi: spi-qcom-qspi: Call dma_wmb() after setting up descriptors spi: spi-qcom-qspi: Use GFP_ATOMIC flag while allocating for descriptor spi: spi-qcom-qspi: Ignore disabled interrupts' status in isr
2023-07-30Merge tag 'regulator-fix-v6.5-rc3' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A couple of small fixes for the the mt6358 driver, fixing error reporting and a bootstrapping issue" * tag 'regulator-fix-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: mt6358: Fix incorrect VCN33 sync error message regulator: mt6358: Sync VCN33_* enable status after checking ID
2023-07-30Merge tag 'usb-6.5-rc4' of ↵Linus Torvalds21-114/+103
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a set of USB driver fixes for 6.5-rc4. Include in here are: - new USB serial device ids - dwc3 driver fixes for reported issues - typec driver fixes for reported problems - gadget driver fixes - reverts of some problematic USB changes that went into -rc1 All of these have been in linux-next with no reported problems" * tag 'usb-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (24 commits) usb: misc: ehset: fix wrong if condition usb: dwc3: pci: skip BYT GPIO lookup table for hardwired phy usb: cdns3: fix incorrect calculation of ep_buf_size when more than one config usb: gadget: call usb_gadget_check_config() to verify UDC capability usb: typec: Use sysfs_emit_at when concatenating the string usb: typec: Iterate pds array when showing the pd list usb: typec: Set port->pd before adding device for typec_port usb: typec: qcom: fix return value check in qcom_pmic_typec_probe() Revert "usb: gadget: tegra-xudc: Fix error check in tegra_xudc_powerdomain_init()" Revert "usb: xhci: tegra: Fix error check" USB: gadget: Fix the memory leak in raw_gadget driver usb: gadget: core: remove unbalanced mutex_unlock in usb_gadget_activate Revert "usb: dwc3: core: Enable AutoRetry feature in the controller" Revert "xhci: add quirk for host controllers that don't update endpoint DCS" USB: quirks: add quirk for Focusrite Scarlett usb: xhci-mtk: set the dma max_seg_size MAINTAINERS: drop invalid usb/cdns3 Reviewer e-mail usb: dwc3: don't reset device side if dwc3 was configured as host-only usb: typec: ucsi: move typec_set_mode(TYPEC_STATE_SAFE) to ucsi_unregister_partner() usb: ohci-at91: Fix the unhandle interrupt when resume ...
2023-07-30Merge tag 'tty-6.5-rc4' of ↵Linus Torvalds8-8/+18
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some small TTY and serial driver fixes for 6.5-rc4 for some reported problems. Included in here is: - TIOCSTI fix for braille readers - documentation fix for minor numbers - MAINTAINERS update for new serial files in -rc1 - minor serial driver fixes for reported problems All of these have been in linux-next with no reported problems" * tag 'tty-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: serial: 8250_dw: Preserve original value of DLF register tty: serial: sh-sci: Fix sleeping in atomic context serial: sifive: Fix sifive_serial_console_setup() section Documentation: devices.txt: reconcile serial/ucc_uart minor numers MAINTAINERS: Update TTY layer for lists and recently added files tty: n_gsm: fix UAF in gsm_cleanup_mux TIOCSTI: always enable for CAP_SYS_ADMIN
2023-07-30Merge tag 'staging-6.5-rc4' of ↵Linus Torvalds4-12/+45
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fixes from Greg KH: "Here are three small staging driver fixes for 6.5-rc4 that resolve some reported problems. These fixes are: - fix for an old bug in the r8712 driver - fbtft driver fix for a spi device - potential overflow fix in the ks7010 driver All of these have been in linux-next with no reported problems" * tag 'staging-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: ks7010: potential buffer overflow in ks_wlan_set_encode_ext() staging: fbtft: ili9341: use macro FBTFT_REGISTER_SPI_DRIVER staging: r8712: Fix memory leak in _r8712_init_xmit_priv()
2023-07-30Merge tag 'char-misc-6.5-rc4' of ↵Linus Torvalds4-26/+20
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char driver and Documentation fixes from Greg KH: "Here is a char driver fix and some documentation updates for 6.5-rc4 that contain the following changes: - sram/genalloc bugfix for reported problem - security-bugs.rst update based on recent discussions - embargoed-hardware-issues minor cleanups and then partial revert for the project/company lists All of these have been in linux-next for a while with no reported problems, and the documentation updates have all been reviewed by the relevant developers" * tag 'char-misc-6.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: misc/genalloc: Name subpools by of_node_full_name() Documentation: embargoed-hardware-issues.rst: add AMD to the list Documentation: embargoed-hardware-issues.rst: clean out empty and unused entries Documentation: security-bugs.rst: clarify CVE handling Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group
2023-07-30Merge tag 'probes-fixes-v6.5-rc3' of ↵Linus Torvalds3-6/+18
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe fixes from Masami Hiramatsu: - probe-events: add NULL check for some BTF API calls which can return error code and NULL. - ftrace selftests: check fprobe and kprobe event correctly. This fixes a miss condition of the test command. - kprobes: do not allow probing functions that start with "__cfi_" or "__pfx_" since those are auto generated for kernel CFI and not executed. * tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Prohibit probing on CFI preamble symbol selftests/ftrace: Fix to check fprobe event eneblement tracing/probes: Fix to add NULL check for BTF APIs
2023-07-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds14-186/+255
Pull kvm fixes from Paolo Bonzini: "x86: - Do not register IRQ bypass consumer if posted interrupts not supported - Fix missed device interrupt due to non-atomic update of IRR - Use GFP_KERNEL_ACCOUNT for pid_table in ipiv - Make VMREAD error path play nice with noinstr - x86: Acquire SRCU read lock when handling fastpath MSR writes - Support linking rseq tests statically against glibc 2.35+ - Fix reference count for stats file descriptors - Detect userspace setting invalid CR0 Non-KVM: - Remove coccinelle script that has caused multiple confusion ("debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage", acked by Greg)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: selftests: Expand x86's sregs test to cover illegal CR0 values KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guest KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid Revert "debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage" KVM: selftests: Verify stats fd is usable after VM fd has been closed KVM: selftests: Verify stats fd can be dup()'d and read KVM: selftests: Verify userspace can create "redundant" binary stats files KVM: selftests: Explicitly free vcpus array in binary stats test KVM: selftests: Clean up stats fd in common stats_test() helper KVM: selftests: Use pread() to read binary stats header KVM: Grab a reference to KVM for VM and vCPU stats file descriptors selftests/rseq: Play nice with binaries statically linked against glibc 2.35+ Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid" KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes KVM: VMX: Use vmread_error() to report VM-Fail in "goto" path KVM: VMX: Make VMREAD error path play nice with noinstr KVM: x86/irq: Conditionally register IRQ bypass consumer again KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipiv KVM: x86: check the kvm_cpu_get_interrupt result before using it KVM: x86: VMX: set irr_pending in kvm_apic_update_irr ...
2023-07-30Merge tag 'locking_urgent_for_v6.5_rc4' of ↵Linus Torvalds4-76/+155
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Fix a rtmutex race condition resulting from sharing of the sort key between the lock waiters and the PI chain tree (->pi_waiters) of a task by giving each tree their own sort key * tag 'locking_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Fix task->pi_waiters integrity
2023-07-30Merge tag 'x86_urgent_for_v6.5_rc4' of ↵Linus Torvalds4-13/+33
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - AMD's automatic IBRS doesn't enable cross-thread branch target injection protection (STIBP) for user processes. Enable STIBP on such systems. - Do not delete (but put the ref instead) of AMD MCE error thresholding sysfs kobjects when destroying them in order not to delete the kernfs pointer prematurely - Restore annotation in ret_from_fork_asm() in order to fix kthread stack unwinding from being marked as unreliable and thus breaking livepatching * tag 'x86_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks x86: Fix kthread unwind
2023-07-30Merge tag 'irq_urgent_for_v6.5_rc4' of ↵Linus Torvalds4-40/+117
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Work around an erratum on GIC700, where a race between a CPU handling a wake-up interrupt, a change of affinity, and another CPU going to sleep can result in a lack of wake-up event on the next interrupt - Fix the locking required on a VPE for GICv4 - Enable Rockchip 3588001 erratum workaround for RK3588S - Fix the irq-bcm6345-l1 assumtions of the boot CPU always be the first CPU in the system * tag 'irq_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3: Workaround for GIC-700 erratum 2941627 irqchip/gic-v3: Enable Rockchip 3588001 erratum workaround for RK3588S irqchip/gic-v4.1: Properly lock VPEs when doing a directLPI invalidation irq-bcm6345-l1: Do not assume a fixed block to cpu mapping
2023-07-30Merge tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds9-8/+20
Pull smb client fixes from Steve French: "Four small SMB3 client fixes: - two reconnect fixes (to address the case where non-default iocharset gets incorrectly overridden at reconnect with the default charset) - fix for NTLMSSP_AUTH request setting a flag incorrectly) - Add missing check for invalid tlink (tree connection) in ioctl" * tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: add missing return value check for cifs_sb_tlink smb3: do not set NTLMSSP_VERSION flag for negotiate not auth request cifs: fix charset issue in reconnection fs/nls: make load_nls() take a const parameter
2023-07-30Merge tag 'trace-v6.5-rc3' of ↵Linus Torvalds6-26/+21
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix to /sys/kernel/tracing/per_cpu/cpu*/stats read and entries. If a resize shrinks the buffer it clears the read count to notify readers that they need to reset. But the read count is also used for accounting and this causes the numbers to be off. Instead, create a separate variable to use to notify readers to reset. - Fix the ref counts of the "soft disable" mode. The wrong value was used for testing if soft disable mode should be enabled or disable, but instead, just change the logic to do the enable and disable in place when the SOFT_MODE is set or cleared. - Several kernel-doc fixes - Removal of unused external declarations * tag 'trace-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix warning in trace_buffered_event_disable() ftrace: Remove unused extern declarations tracing: Fix kernel-doc warnings in trace_seq.c tracing: Fix kernel-doc warnings in trace_events_trigger.c tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c ring-buffer: Fix kernel-doc warnings in ring_buffer.c ring-buffer: Fix wrong stat of cpu_buffer->read
2023-07-30arch/*/configs/*defconfig: Replace AUTOFS4_FS by AUTOFS_FSSven Joachim64-75/+63
Commit a2225d931f75 ("autofs: remove left-over autofs4 stubs") promised the removal of the fs/autofs/Kconfig fragment for AUTOFS4_FS within a couple of releases, but five years later this still has not happened yet, and AUTOFS4_FS is still enabled in 63 defconfigs. Get rid of it mechanically: git grep -l CONFIG_AUTOFS4_FS -- '*defconfig' | xargs sed -i 's/AUTOFS4_FS/AUTOFS_FS/' Also just remove the AUTOFS4_FS config option stub. Anybody who hasn't regenerated their config file in the last five years will need to just get the new name right when they do. Signed-off-by: Sven Joachim <svenjoac@gmx.de> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-29Merge tag 'loongarch-fixes-6.5-1' of ↵Linus Torvalds7-15/+29
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai Chen: "Some bug fixes for build system, builtin cmdline handling, bpf and {copy, clear}_user, together with a trivial cleanup" * tag 'loongarch-fixes-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: Cleanup __builtin_constant_p() checking for cpu_has_* LoongArch: BPF: Fix check condition to call lu32id in move_imm() LoongArch: BPF: Enable bpf_probe_read{, str}() on LoongArch LoongArch: Fix return value underflow in exception path LoongArch: Fix CMDLINE_EXTEND and CMDLINE_BOOTLOADER handling LoongArch: Fix module relocation error with binutils 2.41 LoongArch: Only fiddle with CHECKFLAGS if `need-compiler'
2023-07-29KVM: selftests: Expand x86's sregs test to cover illegal CR0 valuesSean Christopherson1-31/+39
Add coverage to x86's set_sregs_test to verify KVM rejects vendor-agnostic illegal CR0 values, i.e. CR0 values whose legality doesn't depend on the current VMX mode. KVM historically has neglected to reject bad CR0s from userspace, i.e. would happily accept a completely bogus CR0 via KVM_SET_SREGS{2}. Punt VMX specific subtests to future work, as they would require quite a bit more effort, and KVM gets coverage for CR0 checks in general through other means, e.g. KVM-Unit-Tests. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guestSean Christopherson1-4/+9
Stuff CR0 and/or CR4 to be compliant with a restricted guest if and only if KVM itself is not configured to utilize unrestricted guests, i.e. don't stuff CR0/CR4 for a restricted L2 that is running as the guest of an unrestricted L1. Any attempt to VM-Enter a restricted guest with invalid CR0/CR4 values should fail, i.e. in a nested scenario, KVM (as L0) should never observe a restricted L2 with incompatible CR0/CR4, since nested VM-Enter from L1 should have failed. And if KVM does observe an active, restricted L2 with incompatible state, e.g. due to a KVM bug, fudging CR0/CR4 instead of letting VM-Enter fail does more harm than good, as KVM will often neglect to undo the side effects, e.g. won't clear rmode.vm86_active on nested VM-Exit, and thus the damage can easily spill over to L1. On the other hand, letting VM-Enter fail due to bad guest state is more likely to contain the damage to L2 as KVM relies on hardware to perform most guest state consistency checks, i.e. KVM needs to be able to reflect a failed nested VM-Enter into L1 irrespective of (un)restricted guest behavior. Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Fixes: bddd82d19e2e ("KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalidSean Christopherson5-20/+52
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid, e.g. due to setting bits 63:32, illegal combinations, or to a value that isn't allowed in VMX (non-)root mode. The VMX checks in particular are "fun" as failure to disallow Real Mode for an L2 that is configured with unrestricted guest disabled, when KVM itself has unrestricted guest enabled, will result in KVM forcing VM86 mode to virtual Real Mode for L2, but then fail to unwind the related metadata when synthesizing a nested VM-Exit back to L1 (which has unrestricted guest enabled). Opportunistically fix a benign typo in the prototype for is_valid_cr4(). Cc: stable@vger.kernel.org Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29Revert "debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage"Sean Christopherson1-68/+0
Remove coccinelle's recommendation to use DEFINE_DEBUGFS_ATTRIBUTE() instead of DEFINE_SIMPLE_ATTRIBUTE(). Regardless of whether or not the "significant overhead" incurred by debugfs_create_file() is actually meaningful, warnings from the script have led to a rash of low-quality patches that have sowed confusion and consumed maintainer time for little to no benefit. There have been no less than four attempts to "fix" KVM, and a quick search on lore shows that KVM is not alone. This reverts commit 5103068eaca290f890a30aae70085fac44cecaf6. Link: https://lore.kernel.org/all/87tu2nbnz3.fsf@mpe.ellerman.id.au Link: https://lore.kernel.org/all/c0b98151-16b6-6d8f-1765-0f7d46682d60@redhat.com Link: https://lkml.kernel.org/r/20230706072954.4881-1-duminjie%40vivo.com Link: https://lore.kernel.org/all/Y2FsbufV00jbyF0B@google.com Link: https://lore.kernel.org/all/Y2ENJJ1YiSg5oHiy@orome Link: https://lore.kernel.org/all/7560b350e7b23786ce712118a9a504356ff1cca4.camel@kernel.org Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230726202920.507756-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: selftests: Verify stats fd is usable after VM fd has been closedSean Christopherson1-2/+8
Verify that VM and vCPU binary stats files are usable even after userspace has put its last direct reference to the VM. This is a regression test for a UAF bug where KVM didn't gift the stats files a reference to the VM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230711230131.648752-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: selftests: Verify stats fd can be dup()'d and readSean Christopherson1-1/+7
Expand the binary stats test to verify that a stats fd can be dup()'d and read, to (very) roughly simulate userspace passing around the file. Adding the dup() test is primarily an intermediate step towards verifying that userspace can read VM/vCPU stats before _and_ after userspace closes its copy of the VM fd; the dup() test itself is only mildly interesting. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230711230131.648752-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: selftests: Verify userspace can create "redundant" binary stats filesSean Christopherson1-2/+23
Verify that KVM doesn't artificially limit KVM_GET_STATS_FD to a single file per VM/vCPU. There's no known use case for getting multiple stats fds, but it should work, and more importantly creating multiple files will make it easier to test that KVM correct manages VM refcounts for stats files. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230711230131.648752-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: selftests: Explicitly free vcpus array in binary stats testSean Christopherson1-0/+1
Explicitly free the all-encompassing vcpus array in the binary stats test so that the test is consistent with respect to freeing all dynamically allocated resources (versus letting them be freed on exit). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230711230131.648752-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: selftests: Clean up stats fd in common stats_test() helperSean Christopherson1-18/+4
Move the stats fd cleanup code into stats_test() and drop the superfluous vm_stats_test() and vcpu_stats_test() helpers in order to decouple creation of the stats file from consuming/testing the file (deduping code is a bonus). This will make it easier to test various edge cases related to stats, e.g. that userspace can dup() a stats fd, that userspace can have multiple stats files for a singleVM/vCPU, etc. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230711230131.648752-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: selftests: Use pread() to read binary stats headerSean Christopherson2-4/+8
Use pread() with an explicit offset when reading the header and the header name for a binary stats fd so that the common helper and the binary stats test don't subtly rely on the file effectively being untouched, e.g. to allow multiple reads of the header, name, etc. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230711230131.648752-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: Grab a reference to KVM for VM and vCPU stats file descriptorsSean Christopherson1-0/+24
Grab a reference to KVM prior to installing VM and vCPU stats file descriptors to ensure the underlying VM and vCPU objects are not freed until the last reference to any and all stats fds are dropped. Note, the stats paths manually invoke fd_install() and so don't need to grab a reference before creating the file. Fixes: ce55c049459c ("KVM: stats: Support binary stats retrieval for a VCPU") Fixes: fcfe1baeddbf ("KVM: stats: Support binary stats retrieval for a VM") Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu> Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com Cc: stable@vger.kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Message-Id: <20230711230131.648752-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29selftests/rseq: Play nice with binaries statically linked against glibc 2.35+Sean Christopherson1-6/+22
To allow running rseq and KVM's rseq selftests as statically linked binaries, initialize the various "trampoline" pointers to point directly at the expect glibc symbols, and skip the dlysm() lookups if the rseq size is non-zero, i.e. the binary is statically linked *and* the libc registered its own rseq. Define weak versions of the symbols so as not to break linking against libc versions that don't support rseq in any capacity. The KVM selftests in particular are often statically linked so that they can be run on targets with very limited runtime environments, i.e. test machines. Fixes: 233e667e1ae3 ("selftests/rseq: Uplift rseq selftests for compatibility with glibc-2.35") Cc: Aaron Lewis <aaronlewis@google.com> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721223352.2333911-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"Sean Christopherson1-8/+2
Now that handle_fastpath_set_msr_irqoff() acquires kvm->srcu, i.e. allows dereferencing memslots during WRMSR emulation, drop the requirement that "next RIP" is valid. In hindsight, acquiring kvm->srcu would have been a better fix than avoiding the pastpath, but at the time it was thought that accessing SRCU-protected data in the fastpath was a one-off edge case. This reverts commit 5c30e8101e8d5d020b1d7119117889756a6ed713. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721224337.2335137-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: Acquire SRCU read lock when handling fastpath MSR writesSean Christopherson1-0/+4
Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in the VM-Exit fastpath handler, as several of the common helpers used during emulation expect the caller to provide SRCU protection. E.g. if the guest is counting instructions retired, KVM will query the PMU event filter when stepping over the WRMSR. dump_stack+0x85/0xdf lockdep_rcu_suspicious+0x109/0x120 pmc_event_is_allowed+0x165/0x170 kvm_pmu_trigger_event+0xa5/0x190 handle_fastpath_set_msr_irqoff+0xca/0x1e0 svm_vcpu_run+0x5c3/0x7b0 [kvm_amd] vcpu_enter_guest+0x2108/0x2580 Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this isn't the first bug of this nature, e.g. see commit 5c30e8101e8d ("KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"). Providing protection for the entirety of WRMSR emulation will allow reverting the aforementioned commit, and will avoid having to play whack-a-mole when new uses of SRCU-protected structures are inevitably added in common emulation helpers. Fixes: dfdeda67ea2d ("KVM: x86/pmu: Prevent the PMU from counting disallowed events") Reported-by: Greg Thelen <gthelen@google.com> Reported-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721224337.2335137-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>