diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-05-01 22:06:20 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-05-01 22:06:20 +0300 |
commit | c8c655c34e33544aec9d64b660872ab33c29b5f1 (patch) | |
tree | 4aad88f698f04cef9e5d9d573a6df6283085dadd /arch/x86/kvm/svm/svm.c | |
parent | d75439d64a1e2b35e0f08906205b00279753cbed (diff) | |
parent | b3c98052d46948a8d65d2778c7f306ff38366aac (diff) | |
download | linux-c8c655c34e33544aec9d64b660872ab33c29b5f1.tar.xz |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
Diffstat (limited to 'arch/x86/kvm/svm/svm.c')
-rw-r--r-- | arch/x86/kvm/svm/svm.c | 201 |
1 files changed, 145 insertions, 56 deletions
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index a1b08359769b..ca32389f3c36 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -99,6 +99,7 @@ static const struct svm_direct_access_msrs { #endif { .index = MSR_IA32_SPEC_CTRL, .always = false }, { .index = MSR_IA32_PRED_CMD, .always = false }, + { .index = MSR_IA32_FLUSH_CMD, .always = false }, { .index = MSR_IA32_LASTBRANCHFROMIP, .always = false }, { .index = MSR_IA32_LASTBRANCHTOIP, .always = false }, { .index = MSR_IA32_LASTINTFROMIP, .always = false }, @@ -234,6 +235,8 @@ module_param(dump_invalid_vmcb, bool, 0644); bool intercept_smi = true; module_param(intercept_smi, bool, 0444); +bool vnmi = true; +module_param(vnmi, bool, 0444); static bool svm_gp_erratum_intercept = true; @@ -1315,6 +1318,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu) if (kvm_vcpu_apicv_active(vcpu)) avic_init_vmcb(svm, vmcb); + if (vnmi) + svm->vmcb->control.int_ctl |= V_NMI_ENABLE_MASK; + if (vgif) { svm_clr_intercept(svm, INTERCEPT_STGI); svm_clr_intercept(svm, INTERCEPT_CLGI); @@ -1588,6 +1594,16 @@ static void svm_set_vintr(struct vcpu_svm *svm) svm_set_intercept(svm, INTERCEPT_VINTR); /* + * Recalculating intercepts may have cleared the VINTR intercept. If + * V_INTR_MASKING is enabled in vmcb12, then the effective RFLAGS.IF + * for L1 physical interrupts is L1's RFLAGS.IF at the time of VMRUN. + * Requesting an interrupt window if save.RFLAGS.IF=0 is pointless as + * interrupts will never be unblocked while L2 is running. + */ + if (!svm_is_intercept(svm, INTERCEPT_VINTR)) + return; + + /* * This is just a dummy VINTR to actually cause a vmexit to happen. * Actual injection of virtual interrupts happens through EVENTINJ. */ @@ -2484,16 +2500,29 @@ static int task_switch_interception(struct kvm_vcpu *vcpu) has_error_code, error_code); } +static void svm_clr_iret_intercept(struct vcpu_svm *svm) +{ + if (!sev_es_guest(svm->vcpu.kvm)) + svm_clr_intercept(svm, INTERCEPT_IRET); +} + +static void svm_set_iret_intercept(struct vcpu_svm *svm) +{ + if (!sev_es_guest(svm->vcpu.kvm)) + svm_set_intercept(svm, INTERCEPT_IRET); +} + static int iret_interception(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); ++vcpu->stat.nmi_window_exits; svm->awaiting_iret_completion = true; - if (!sev_es_guest(vcpu->kvm)) { - svm_clr_intercept(svm, INTERCEPT_IRET); + + svm_clr_iret_intercept(svm); + if (!sev_es_guest(vcpu->kvm)) svm->nmi_iret_rip = kvm_rip_read(vcpu); - } + kvm_make_request(KVM_REQ_EVENT, vcpu); return 1; } @@ -2876,7 +2905,7 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { struct vcpu_svm *svm = to_svm(vcpu); - int r; + int ret = 0; u32 ecx = msr->index; u64 data = msr->data; @@ -2946,21 +2975,6 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) */ set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); break; - case MSR_IA32_PRED_CMD: - if (!msr->host_initiated && - !guest_has_pred_cmd_msr(vcpu)) - return 1; - - if (data & ~PRED_CMD_IBPB) - return 1; - if (!boot_cpu_has(X86_FEATURE_IBPB)) - return 1; - if (!data) - break; - - wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); - set_msr_interception(vcpu, svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); - break; case MSR_AMD64_VIRT_SPEC_CTRL: if (!msr->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_VIRT_SSBD)) @@ -3013,10 +3027,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) * guest via direct_access_msrs, and switch it via user return. */ preempt_disable(); - r = kvm_set_user_return_msr(tsc_aux_uret_slot, data, -1ull); + ret = kvm_set_user_return_msr(tsc_aux_uret_slot, data, -1ull); preempt_enable(); - if (r) - return 1; + if (ret) + break; svm->tsc_aux = data; break; @@ -3074,7 +3088,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) default: return kvm_set_msr_common(vcpu, msr); } - return 0; + return ret; } static int msr_interception(struct kvm_vcpu *vcpu) @@ -3485,11 +3499,43 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu) return; svm->nmi_masked = true; - if (!sev_es_guest(vcpu->kvm)) - svm_set_intercept(svm, INTERCEPT_IRET); + svm_set_iret_intercept(svm); ++vcpu->stat.nmi_injections; } +static bool svm_is_vnmi_pending(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (!is_vnmi_enabled(svm)) + return false; + + return !!(svm->vmcb->control.int_ctl & V_NMI_BLOCKING_MASK); +} + +static bool svm_set_vnmi_pending(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (!is_vnmi_enabled(svm)) + return false; + + if (svm->vmcb->control.int_ctl & V_NMI_PENDING_MASK) + return false; + + svm->vmcb->control.int_ctl |= V_NMI_PENDING_MASK; + vmcb_mark_dirty(svm->vmcb, VMCB_INTR); + + /* + * Because the pending NMI is serviced by hardware, KVM can't know when + * the NMI is "injected", but for all intents and purposes, passing the + * NMI off to hardware counts as injection. + */ + ++vcpu->stat.nmi_injections; + + return true; +} + static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3585,6 +3631,35 @@ static void svm_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) svm_set_intercept(svm, INTERCEPT_CR8_WRITE); } +static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (is_vnmi_enabled(svm)) + return svm->vmcb->control.int_ctl & V_NMI_BLOCKING_MASK; + else + return svm->nmi_masked; +} + +static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (is_vnmi_enabled(svm)) { + if (masked) + svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK; + else + svm->vmcb->control.int_ctl &= ~V_NMI_BLOCKING_MASK; + + } else { + svm->nmi_masked = masked; + if (masked) + svm_set_iret_intercept(svm); + else + svm_clr_iret_intercept(svm); + } +} + bool svm_nmi_blocked(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3596,8 +3671,10 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu) if (is_guest_mode(vcpu) && nested_exit_on_nmi(svm)) return false; - return (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) || - svm->nmi_masked; + if (svm_get_nmi_mask(vcpu)) + return true; + + return vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK; } static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) @@ -3615,26 +3692,6 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) return 1; } -static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) -{ - return to_svm(vcpu)->nmi_masked; -} - -static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (masked) { - svm->nmi_masked = true; - if (!sev_es_guest(vcpu->kvm)) - svm_set_intercept(svm, INTERCEPT_IRET); - } else { - svm->nmi_masked = false; - if (!sev_es_guest(vcpu->kvm)) - svm_clr_intercept(svm, INTERCEPT_IRET); - } -} - bool svm_interrupt_blocked(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3715,7 +3772,16 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nmi_masked && !svm->awaiting_iret_completion) + /* + * KVM should never request an NMI window when vNMI is enabled, as KVM + * allows at most one to-be-injected NMI and one pending NMI, i.e. if + * two NMIs arrive simultaneously, KVM will inject one and set + * V_NMI_PENDING for the other. WARN, but continue with the standard + * single-step approach to try and salvage the pending NMI. + */ + WARN_ON_ONCE(is_vnmi_enabled(svm)); + + if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion) return; /* IRET will cause a vm exit */ if (!gif_set(svm)) { @@ -3777,13 +3843,13 @@ static void svm_flush_tlb_all(struct kvm_vcpu *vcpu) { /* * When running on Hyper-V with EnlightenedNptTlb enabled, remote TLB - * flushes should be routed to hv_remote_flush_tlb() without requesting + * flushes should be routed to hv_flush_remote_tlbs() without requesting * a "regular" remote flush. Reaching this point means either there's - * a KVM bug or a prior hv_remote_flush_tlb() call failed, both of + * a KVM bug or a prior hv_flush_remote_tlbs() call failed, both of * which might be fatal to the guest. Yell, but try to recover. */ if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu))) - hv_remote_flush_tlb(vcpu->kvm); + hv_flush_remote_tlbs(vcpu->kvm); svm_flush_tlb_asid(vcpu); } @@ -4142,7 +4208,7 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index) { switch (index) { case MSR_IA32_MCG_EXT_CTL: - case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: + case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR: return false; case MSR_IA32_SMBASE: if (!IS_ENABLED(CONFIG_KVM_SMM)) @@ -4184,8 +4250,18 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) svm->vgif_enabled = vgif && guest_cpuid_has(vcpu, X86_FEATURE_VGIF); + svm->vnmi_enabled = vnmi && guest_cpuid_has(vcpu, X86_FEATURE_VNMI); + svm_recalc_instruction_intercepts(vcpu, svm); + if (boot_cpu_has(X86_FEATURE_IBPB)) + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_PRED_CMD, 0, + !!guest_has_pred_cmd_msr(vcpu)); + + if (boot_cpu_has(X86_FEATURE_FLUSH_L1D)) + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_FLUSH_CMD, 0, + !!guest_cpuid_has(vcpu, X86_FEATURE_FLUSH_L1D)); + /* For sev guests, the memory encryption bit is not reserved in CR3. */ if (sev_guest(vcpu->kvm)) { best = kvm_find_cpuid_entry(vcpu, 0x8000001F); @@ -4563,7 +4639,6 @@ static bool svm_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type, void *insn, int insn_len) { bool smep, smap, is_user; - unsigned long cr4; u64 error_code; /* Emulation is always possible when KVM has access to all guest state. */ @@ -4655,9 +4730,8 @@ static bool svm_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type, if (error_code & (PFERR_GUEST_PAGE_MASK | PFERR_FETCH_MASK)) goto resume_guest; - cr4 = kvm_read_cr4(vcpu); - smep = cr4 & X86_CR4_SMEP; - smap = cr4 & X86_CR4_SMAP; + smep = kvm_is_cr4_bit_set(vcpu, X86_CR4_SMEP); + smap = kvm_is_cr4_bit_set(vcpu, X86_CR4_SMAP); is_user = svm_get_cpl(vcpu) == 3; if (smap && (!smep || is_user)) { pr_err_ratelimited("SEV Guest triggered AMD Erratum 1096\n"); @@ -4795,6 +4869,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .patch_hypercall = svm_patch_hypercall, .inject_irq = svm_inject_irq, .inject_nmi = svm_inject_nmi, + .is_vnmi_pending = svm_is_vnmi_pending, + .set_vnmi_pending = svm_set_vnmi_pending, .inject_exception = svm_inject_exception, .cancel_injection = svm_cancel_injection, .interrupt_allowed = svm_interrupt_allowed, @@ -4937,6 +5013,9 @@ static __init void svm_set_cpu_caps(void) if (vgif) kvm_cpu_cap_set(X86_FEATURE_VGIF); + if (vnmi) + kvm_cpu_cap_set(X86_FEATURE_VNMI); + /* Nested VM can receive #VMEXIT instead of triggering #GP */ kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK); } @@ -5088,6 +5167,16 @@ static __init int svm_hardware_setup(void) pr_info("Virtual GIF supported\n"); } + vnmi = vgif && vnmi && boot_cpu_has(X86_FEATURE_VNMI); + if (vnmi) + pr_info("Virtual NMI enabled\n"); + + if (!vnmi) { + svm_x86_ops.is_vnmi_pending = NULL; + svm_x86_ops.set_vnmi_pending = NULL; + } + + if (lbrv) { if (!boot_cpu_has(X86_FEATURE_LBRV)) lbrv = false; |