summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/mmu
AgeCommit message (Collapse)AuthorFilesLines
2023-07-01Merge tag 'kvm-x86-vmx-6.5' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-0/+4
KVM VMX changes for 6.5: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups
2023-07-01Merge tag 'kvm-x86-mmu-6.5' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2-6/+48
KVM x86/mmu changes for 6.5: - Add back a comment about the subtle side effect of try_cmpxchg64() in tdp_mmu_set_spte_atomic() - Add an assertion in __kvm_mmu_invalidate_addr() to verify that the target KVM MMU is the current MMU - Add a "never" option to effectively avoid creating NX hugepage recovery threads
2023-06-13KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pagesSean Christopherson1-5/+36
Add a "never" option to the nx_huge_pages module param to allow userspace to do a one-way hard disabling of the mitigation, and don't create the per-VM recovery threads when the mitigation is hard disabled. Letting userspace pinky swear that userspace doesn't want to enable NX mitigation (without reloading KVM) allows certain use cases to avoid the latency problems associated with spawning a kthread for each VM. E.g. in FaaS use cases, the guest kernel is trusted and the host may create 100+ VMs per logical CPU, which can result in 100ms+ latencies when a burst of VMs is created. Reported-by: Li RongQing <lirongqing@baidu.com> Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com Cc: Yong He <zhuangel570@gmail.com> Cc: Robert Hoo <robert.hoo.linux@gmail.com> Cc: Kai Huang <kai.huang@intel.com> Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Luiz Capitulino <luizcap@amazon.com> Reviewed-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-07KVM: x86/mmu: Trigger APIC-access page reload iff vendor code caresSean Christopherson1-1/+2
Request an APIC-access page reload when the backing page is migrated (or unmapped) if and only if vendor code actually plugs the backing pfn into structures that reside outside of KVM's MMU. This avoids kicking all vCPUs in the (hopefully infrequent) scenario where the backing page is migrated/invalidated. Unlike VMX's APICv, SVM's AVIC doesn't plug the backing pfn directly into the VMCB and so doesn't need a hook to invalidate an out-of-MMU "mapping". Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-07KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access pageSean Christopherson1-0/+3
Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-03KVM: x86/mmu: Grab memslot for correct address space in NX recovery workerSean Christopherson1-1/+4
Factor in the address space (non-SMM vs. SMM) of the target shadow page when recovering potential NX huge pages, otherwise KVM will retrieve the wrong memslot when zapping shadow pages that were created for SMM. The bug most visibly manifests as a WARN on the memslot being non-NULL, but the worst case scenario is that KVM could unaccount the shadow page without ensuring KVM won't install a huge page, i.e. if the non-SMM slot is being dirty logged, but the SMM slot is not. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015 kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450 R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98 R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0 Call Trace: <TASK> __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm] kvm_vm_worker_thread+0x106/0x1c0 [kvm] kthread+0xd9/0x100 ret_from_fork+0x2c/0x50 </TASK> ---[ end trace 0000000000000000 ]--- This bug was exposed by commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated"), which allowed KVM to retain SMM TDP MMU roots effectively indefinitely. Before commit edbdb43fc96b, KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages once all vCPUs exited SMM, which made the window where this bug (recovering an SMM NX huge page) could be encountered quite tiny. To hit the bug, the NX recovery thread would have to run while at least one vCPU was in SMM. Most VMs typically only use SMM during boot, and so the problematic shadow pages were gone by the time the NX recovery thread ran. Now that KVM preserves TDP MMU roots until they are explicitly invalidated (e.g. by a memslot deletion), the window to trigger the bug is effectively never closed because most VMMs don't delete memslots after boot (except for a handful of special scenarios). Fixes: eb298605705a ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages") Reported-by: Fabio Coatti <fabio.coatti@gmail.com> Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: x86/mmu: Assert on @mmu in the __kvm_mmu_invalidate_addr()Like Xu1-0/+8
Add assertion to track that "mmu == vcpu->arch.mmu" is always true in the context of __kvm_mmu_invalidate_addr(). for_each_shadow_entry_using_root() and kvm_sync_spte() operate on vcpu->arch.mmu, but the only reason that doesn't cause explosions is because handle_invept() frees roots instead of doing a manual invalidation. As of now, there are no major roadblocks to switching INVEPT emulation over to use kvm_mmu_invalidate_addr(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230523032947.60041-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: x86/mmu: Add comment on try_cmpxchg64 usage in tdp_mmu_set_spte_atomicUros Bizjak1-1/+4
Commit aee98a6838d5 ("KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomic") removed the comment that iter->old_spte is updated when different logical CPU modifies the page table entry. Although this is what try_cmpxchg does implicitly, it won't hurt if this fact is explicitly mentioned in a restored comment. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20230425113932.3148-1-ubizjak@gmail.com [sean: extend comment above try_cmpxchg64()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-05Merge tag 'kvm-x86-mmu-6.4-2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-65/+56
Fix a long-standing flaw in x86's TDP MMU where unloading roots on a vCPU can result in the root being freed even though the root is completely valid and can be reused as-is (with a TLB flush).
2023-04-27KVM: x86: Preserve TDP MMU roots until they are explicitly invalidatedSean Christopherson1-65/+56
Preserve TDP MMU roots until they are explicitly invalidated by gifting the TDP MMU itself a reference to a root when it is allocated. Keeping a reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible performance, and can potentially even soft-hang a vCPU, if a vCPU frequently unloads its roots, e.g. when KVM is emulating SMI+RSM. When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU cache of previous roots). Unloading roots is a simple way to ensure KVM flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs when allocating a "new" root (from the vCPU's perspective). In the shadow MMU, KVM keeps track of all shadow pages, roots included, in a per-VM hash table. Unloading a shadow MMU root just wipes it from the per-vCPU cache; the root is still tracked in the per-VM hash table. When KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root in the per-VM hash table. Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a per-VM structure, where "active" in this case means a root is either in-use or cached as a previous root by at least one vCPU. When a TDP MMU root becomes inactive, i.e. the last vCPU reference to the root is put, KVM immediately frees the root (asterisk on "immediately" as the actual freeing may be done by a worker, but for all intents and purposes the root is gone). The TDP MMU behavior is especially problematic for 1-vCPU setups, as unloading all roots effectively frees all roots. The issue is mitigated to some degree in multi-vCPU setups as a different vCPU usually holds a reference to an unloaded root and thus keeps the root alive, allowing the vCPU to reuse its old root after unloading (with a flush+sync). The TDP MMU flaw has been known for some time, as until very recently, KVM's handling of CR0.WP also triggered unloading of all roots. The CR0.WP toggling scenario was eventually addressed by not unloading roots when _only_ CR0.WP is toggled, but such an approach doesn't Just Work for emulating SMM as KVM must emulate a full TLB flush on entry and exit to/from SMM. Given that the shadow MMU plays nice with unloading roots at will, teaching the TDP MMU to do the same is far less complex than modifying KVM to track which roots need to be flushed before reuse. Note, preserving all possible TDP MMU roots is not a concern with respect to memory consumption. Now that the role for direct MMUs doesn't include information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there are _at most_ six possible roots (where "guest_mode" here means L2): 1. 4-level !SMM !guest_mode 2. 4-level SMM !guest_mode 3. 5-level !SMM !guest_mode 4. 5-level SMM !guest_mode 5. 4-level !SMM guest_mode 6. 5-level !SMM guest_mode And because each vCPU can track 4 valid roots, a VM can already have all 6 root combinations live at any given time. Not to mention that, in practice, no sane VMM will advertise different guest.MAXPHYADDR values across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for a single VM. Furthermore, the vast majority of modern hypervisors will utilize EPT/NPT when available, thus the guest_mode=%true cases are also unlikely to be utilized. Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Cc: stable@vger.kernel.org Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-26Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-1/+1
KVM x86 PMU changes for 6.4: - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - Misc cleanups and fixes
2023-04-26Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini6-525/+464
KVM x86 MMU changes for 6.4: - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to share the .sync_page() implementation, i.e. utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Misc cleanups
2023-04-11KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faultsSean Christopherson1-0/+15
Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for permission faults when emulating a guest memory access and CR0.WP may be guest owned. If the guest toggles only CR0.WP and triggers emulation of a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a stale CR0.WP, i.e. use stale protection bits metadata. Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't support per-bit interception controls for CR0. Don't bother checking for EPT vs. NPT as the "old == new" check will always be true under NPT, i.e. the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0 from the VMCB on VM-Exit). Reported-by: Mathias Krause <minipli@grsecurity.net> Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net Fixes: fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit") Tested-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-11KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V codeSean Christopherson1-6/+2
Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages pair instead of a struct, and bury said struct in Hyper-V specific code. Passing along two params generates much better code for the common case where KVM is _not_ running on Hyper-V, as forwarding the flush on to Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range() becomes a tail call. Cc: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-11KVM: x86: Rename Hyper-V remote TLB hooks to match established schemeSean Christopherson1-6/+6
Rename the Hyper-V hooks for TLB flushing to match the naming scheme used by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code, arch hooks from common code, etc. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07KVM: x86: Add a helper to query whether or not a vCPU has ever runSean Christopherson1-1/+1
Add a helper to query if a vCPU has run so that KVM doesn't have to open code the check on last_vmentry_cpu being set to a magic value. No functional change intended. Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Like Xu <like.xu.linux@gmail.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Merge all handle_changed_pte*() functionsVipin Sharma1-30/+12
Merge __handle_changed_pte() and handle_changed_spte_acc_track() into a single function, handle_changed_pte(), as the two are always used together. Remove the existing handle_changed_pte(), as it's just a wrapper that calls __handle_changed_pte() and handle_changed_spte_acc_track(). Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Remove handle_changed_spte_dirty_log()Vipin Sharma1-23/+3
Remove handle_changed_spte_dirty_log() as there is no code flow which sets 4KiB SPTE writable and hit this path. This function marks the page dirty in a memslot only if new SPTE is 4KiB in size and writable. Current users of handle_changed_spte_dirty_log() are: 1. set_spte_gfn() - Create only non writable SPTEs. 2. write_protect_gfn() - Change an SPTE to non writable. 3. zap leaf and roots APIs - Everything is 0. 4. handle_removed_pt() - Sets SPTEs to REMOVED_SPTE 5. tdp_mmu_link_sp() - Makes non leaf SPTEs. There is also no path which creates a writable 4KiB without going through make_spte() and this functions takes care of marking SPTE dirty in the memslot if it is PT_WRITABLE. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: add blurb to __handle_changed_spte()'s comment] Link: https://lore.kernel.org/r/20230321220021.2119033-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte()Vipin Sharma1-34/+17
Remove bool parameter "record_acc_track" from __tdp_mmu_set_spte() and refactor the code. This variable is always set to true by its caller. Remove single and double underscore prefix from tdp_mmu_set_spte() related APIs: 1. Change __tdp_mmu_set_spte() to tdp_mmu_set_spte() 2. Change _tdp_mmu_set_spte() to tdp_mmu_iter_set_spte() Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEsVipin Sharma1-2/+2
Drop everything except the "tdp_mmu_spte_changed" tracepoint part of __handle_changed_spte() when aging SPTEs in the TDP MMU, as clearing the accessed status doesn't affect the SPTE's shadow-present status, whether or not the SPTE is a leaf, or change the PFN. I.e. none of the functional updates handled by __handle_changed_spte() are relevant. Losing __handle_changed_spte()'s sanity checks does mean that a bug could theoretical go unnoticed, but that scenario is extremely unlikely, e.g. would effectively require a misconfigured MMU or a locking bug elsewhere. Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEsVipin Sharma1-2/+0
Drop the unnecessary call to handle dirty log updates when aging TDP MMU SPTEs, as neither clearing the Accessed bit nor marking a SPTE for access tracking can _set_ the Writable bit, i.e. can't trigger marking a gfn dirty in its memslot. The access tracking path can _clear_ the Writable bit, e.g. if the XCHG races with fast_page_fault() and writes the stale value without the Writable bit set, but clearing the Writable bit outside of mmu_lock is not allowed, i.e. access tracking can't spuriously set the Writable bit. Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty path, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEsVipin Sharma1-17/+21
Use tdp_mmu_clear_spte_bits() when clearing the Accessed bit in TDP MMU SPTEs so as to use an atomic-AND instead of XCHG to clear the A-bit. Similar to the D-bit story, this will allow KVM to bypass __handle_changed_spte() by ensuring only the A-bit is modified. Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte()Vipin Sharma1-15/+9
Remove bool parameter "record_dirty_log" from __tdp_mmu_set_spte() and refactor the code as this variable is always set to true by its caller. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bitsVipin Sharma1-3/+4
Drop everything except marking the PFN dirty and the relevant tracepoint parts of __handle_changed_spte() when clearing the dirty status of gfns in the TDP MMU. Clearing only the Dirty (or Writable) bit doesn't affect the SPTEs shadow-present status, whether or not the SPTE is a leaf, or change the SPTE's PFN. I.e. other than marking the PFN dirty, none of the functional updates handled by __handle_changed_spte() are relevant. Losing __handle_changed_spte()'s sanity checks does mean that a bug could theoretical go unnoticed, but that scenario is extremely unlikely, e.g. would effectively require a misconfigured or a locking bug elsewhere. Opportunistically remove a comment blurb from __handle_changed_spte() about all modifications to TDP MMU SPTEs needing to invoke said function, that "rule" hasn't been true since fast page fault support was added for the TDP MMU (and perhaps even before). Tested on a VM (160 vCPUs, 160 GB memory) and found that performance of clear dirty log stage improved by ~40% in dirty_log_perf_test (with the full optimization applied). Before optimization: -------------------- Iteration 1 clear dirty log time: 3.638543593s Iteration 2 clear dirty log time: 3.145032742s Iteration 3 clear dirty log time: 3.142340358s Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration) After optimization: ------------------- Iteration 1 clear dirty log time: 2.318988110s Iteration 2 clear dirty log time: 1.794470164s Iteration 3 clear dirty log time: 1.791668628s Clear dirty log over 3 iterations took 5.905126902s. (Avg 1.968375634s/iteration) Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: split the switch to atomic-AND to a separate patch] Link: https://lore.kernel.org/r/20230321220021.2119033-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bitsVipin Sharma1-2/+0
Drop the unnecessary call to handle access-tracking changes when clearing the dirty status of TDP MMU SPTEs. Neither the Dirty bit nor the Writable bit has any impact on the accessed state of a page, i.e. clearing only the aforementioned bits doesn't make an accessed SPTE suddently not accessed. Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flowVipin Sharma2-8/+22
Optimize the clearing of dirty state in TDP MMU SPTEs by doing an atomic-AND (on SPTEs that have volatile bits) instead of the full XCHG that currently ends up being invoked (see kvm_tdp_mmu_write_spte()). Clearing _only_ the bit in question will allow KVM to skip the many irrelevant checks in __handle_changed_spte() by avoiding any collateral damage due to the XCHG writing all SPTE bits, e.g. the XCHG could race with fast_page_fault() setting the W-bit and the CPU setting the D-bit, and thus incorrectly drop the CPU's D-bit update. Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: split the switch to atomic-AND to a separate patch] Link: https://lore.kernel.org/r/20230321220021.2119033-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMUVipin Sharma1-26/+9
Deduplicate the guts of the TDP MMU's clearing of dirty status by snapshotting whether to check+clear the Dirty bit vs. the Writable bit, which is the only difference between the two flavors of dirty tracking. Note, kvm_ad_enabled() is just a wrapper for shadow_accessed_mask, i.e. is constant after kvm-{intel,amd}.ko is loaded. Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty log, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprotVipin Sharma1-2/+8
Use the constant-after-module-load kvm_ad_enabled() to check if SPTEs in the TDP MMU need to be write-protected when clearing accessed/dirty status instead of manually checking every SPTE. The per-SPTE A/D enabling is specific to nested EPT MMUs, i.e. when KVM is using EPT A/D bits but L1 is not, and so cannot happen in the TDP MMU (which is non-nested only). Keep the original code as sanity checks buried under MMU_WARN_ON(). MMU_WARN_ON() is more or less useless at the moment, but there are plans to change that. Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty path, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Add a helper function to check if an SPTE needs atomic writeVipin Sharma1-14/+20
Move conditions in kvm_tdp_mmu_write_spte() to check if an SPTE should be written atomically or not to a separate function. This new function, kvm_tdp_mmu_spte_need_atomic_write(), will be used in future commits to optimize clearing bits in SPTEs. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22KVM: x86/mmu: Fix comment typoMathias Krause1-1/+1
Fix a small comment typo in make_spte(). Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-6-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22KVM: x86/mmu: Avoid indirect call for get_cr3Paolo Bonzini2-12/+21
Most of the time, calls to get_guest_pgd result in calling kvm_read_cr3 (the exception is only nested TDP). Hardcode the default instead of using the get_cr3 function, avoiding a retpoline if they are enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Clean up mmu.c functions that put return type on separate lineSean Christopherson1-30/+27
Adjust a variety of functions in mmu.c to put the function return type on the same line as the function declaration. As stated in the Linus specification: But the "on their own line" is complete garbage to begin with. That will NEVER be a kernel rule. We should never have a rule that assumes things are so long that they need to be on multiple lines. We don't put function return types on their own lines either, even if some other projects have that rule (just to get function names at the beginning of lines or some other odd reason). Leave the functions generated by BUILD_MMU_ROLE_REGS_ACCESSOR() as-is, that code is basically illegible no matter how it's formatted. No functional change intended. Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com Signed-off-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230202182809.1929122-4-bgardon@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Replace comment with an actual lockdep assertion on mmu_lockSean Christopherson1-1/+2
Assert that mmu_lock is held for write in __walk_slot_rmaps() instead of hoping the function comment will magically prevent introducing bugs. Signed-off-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230202182809.1929122-3-bgardon@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Rename slot rmap walkers to add clarity and clean up codeSean Christopherson1-33/+33
Replace "slot_handle_level" with "walk_slot_rmaps" to better capture what the helpers are doing, and to slightly shorten the function names so that each function's return type and attributes can be placed on the same line as the function declaration. No functional change intended. Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com Signed-off-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230202182809.1929122-2-bgardon@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Use gfn_t in kvm_flush_remote_tlbs_range()David Matlack2-3/+5
Use gfn_t instead of u64 for kvm_flush_remote_tlbs_range()'s parameters, since gfn_t is the standard type for GFNs throughout KVM. Opportunistically rename pages to nr_pages to make its role even more obvious. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-6-dmatlack@google.com [sean: convert pages to gfn_t too, and rename] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Rename kvm_flush_remote_tlbs_with_address()David Matlack2-13/+8
Rename kvm_flush_remote_tlbs_with_address() to kvm_flush_remote_tlbs_range(). This name is shorter, which reduces the number of callsites that need to be broken up across multiple lines, and more readable since it conveys a range of memory is being flushed rather than a single address. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Collapse kvm_flush_remote_tlbs_with_{range,address}() togetherDavid Matlack1-13/+6
Collapse kvm_flush_remote_tlbs_with_range() and kvm_flush_remote_tlbs_with_address() into a single function. This eliminates some lines of code and a useless NULL check on the range struct. Opportunistically switch from ENOTSUPP to EOPNOTSUPP to make checkpatch happy. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-4-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Track tail count in pte_list_desc to optimize guest fork()Lai Jiangshan1-44/+65
Rework "struct pte_list_desc" and pte_list_{add|remove} to track the tail count, i.e. number of PTEs in non-head descriptors, and to always keep all tail descriptors full so that adding a new entry and counting the number of entries is done in constant time instead of linear time. No visible performace is changed in tests. But pte_list_add() is no longer shown in the perf result for the COWed pages even the guest forks millions of tasks. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230113122910.672417-1-jiangshanlai@gmail.com [sean: reword shortlog, tweak changelog, add lots of comments, add BUG_ON()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Skip calling mmu->sync_spte() when the spte is 0Lai Jiangshan2-3/+11
Sync the spte only when the spte is set and avoid the indirect branch. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-5-jiangshanlai@gmail.com [sean: add wrapper instead of open coding each check] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17kvm: x86/mmu: Remove @no_dirty_log from FNAME(prefetch_gpte)Lai Jiangshan1-4/+3
FNAME(prefetch_gpte) is always called with @no_dirty_log=true. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-4-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB ↵Lai Jiangshan2-74/+33
instead. In hardware TLB, invalidating TLB entries means the translations are removed from the TLB. In KVM shadowed vTLB, the translations (combinations of shadow paging and hardware TLB) are generally maintained as long as they remain "clean" when the TLB of an address space (i.e. a PCID or all) is flushed with the help of write-protections, sp->unsync, and kvm_sync_page(), where "clean" in this context means that no updates to KVM's SPTEs are needed. However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is unsync, and thus triggers a remote flush even if the original vTLB entry is clean, i.e. is usable as-is. Besides this, FNAME(invlpg) is largely is a duplicate implementation of FNAME(sync_spte) to invalidate a vTLB entry. To address both issues, reuse FNAME(sync_spte) to share the code and slightly modify the semantics, i.e. keep the vTLB entry if it's "clean" and avoid remote TLB flush. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Allow the roots to be invalid in FNAME(invlpg)Lai Jiangshan2-5/+2
Don't assume the current root to be valid, just check it and remove the WARN(). Also move the code to check if the root is valid into FNAME(invlpg) to simplify the code. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-2-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in nested_ept_invalidate_addr()Lai Jiangshan1-0/+1
Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg(). No functional change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-1-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in kvm_mmu_invpcid_gva()Lai Jiangshan1-14/+7
Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg(). No functional change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-10-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17kvm: x86/mmu: Use KVM_MMU_ROOT_XXX for kvm_mmu_invalidate_addr()Lai Jiangshan1-19/+19
The @root_hpa for kvm_mmu_invalidate_addr() is called with @mmu->root.hpa or INVALID_PAGE where @mmu->root.hpa is to invalidate gva for the current root (the same meaning as KVM_MMU_ROOT_CURRENT) and INVALID_PAGE is to invalidate gva for all roots (the same meaning as KVM_MMU_ROOTS_ALL). Change the argument type of kvm_mmu_invalidate_addr() and use KVM_MMU_ROOT_XXX instead so that we can reuse the function for kvm_mmu_invpcid_gva() and nested_ept_invalidate_addr() for invalidating gva for different set of roots. No fuctionalities changed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-9-jiangshanlai@gmail.com [sean: massage comment slightly] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Sanity check input to kvm_mmu_free_roots()Sean Christopherson1-0/+2
Tweak KVM_MMU_ROOTS_ALL to precisely cover all current+previous root flags, and add a sanity in kvm_mmu_free_roots() to verify that the set of roots to free doesn't stray outside KVM_MMU_ROOTS_ALL. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-8-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)Lai Jiangshan1-0/+8
Sometimes when the guest updates its pagetable, it adds only new gptes to it without changing any existed one, so there is no point to update the sptes for these existed gptes. Also when the sptes for these unchanged gptes are updated, the AD bits are also removed since make_spte() is called with prefetch=true which might result unneeded TLB flushing. Just do nothing if the gpte's permissions are unchanged. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-7-jiangshanlai@gmail.com [sean: expand comment to call out A/D bits] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17KVM: x86/mmu: Move the code out of FNAME(sync_page)'s loop body into mmu.cLai Jiangshan2-74/+74
Rename mmu->sync_page to mmu->sync_spte and move the code out of FNAME(sync_page)'s loop body into mmu.c. No functionalities change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-6-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16KVM: x86/mmu: Set mmu->sync_page as NULL for direct pagingLai Jiangshan1-8/+2
mmu->sync_page for direct paging is never called. And both mmu->sync_page and mm->invlpg only make sense in shadow paging. Setting mmu->sync_page as NULL for direct paging makes it consistent with mm->invlpg which is set NULL for the case. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-5-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16KVM: x86/mmu: Check mmu->sync_page pointer in kvm_sync_page_check()Lai Jiangshan1-1/+1
Assert that mmu->sync_page is non-NULL as part of the sanity checks performed before attempting to sync a shadow page. Explicitly checking mmu->sync_page is all but guaranteed to be redundant with the existing sanity check that the MMU is indirect, but the cost is negligible, and the explicit check also serves as documentation. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-4-jiangshanlai@gmail.com [sean: increase verbosity of changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>