From 955997e880175b00161efb83b1b282849ff412d3 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Mon, 30 Oct 2023 16:17:28 +0200 Subject: KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init() Use the recently introduced guard(mutex) infrastructure acquire and automatically release vendor_module_lock when the guard goes out of scope. Drop the inner __kvm_x86_vendor_init(), its sole purpose was to simplify releasing vendor_module_lock in error paths. No functional change intended. Signed-off-by: Nikolay Borisov Reviewed-by: Kai Huang Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com [sean: rewrite changelog] Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 363b1c080205..709844fcd521 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9632,11 +9632,13 @@ static void kvm_x86_check_cpu_compat(void *ret) *(int *)ret = kvm_x86_check_processor_compatibility(); } -static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) +int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) { u64 host_pat; int r, cpu; + guard(mutex)(&vendor_module_lock); + if (kvm_x86_ops.hardware_enable) { pr_err("already loaded vendor module '%s'\n", kvm_x86_ops.name); return -EEXIST; @@ -9766,17 +9768,6 @@ out_free_x86_emulator_cache: kmem_cache_destroy(x86_emulator_cache); return r; } - -int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) -{ - int r; - - mutex_lock(&vendor_module_lock); - r = __kvm_x86_vendor_init(ops); - mutex_unlock(&vendor_module_lock); - - return r; -} EXPORT_SYMBOL_GPL(kvm_x86_vendor_init); void kvm_x86_vendor_exit(void) -- cgit v1.2.3 From 5eb7fcbdea63fc91685ac0c89ceb42d3becc96d2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:21 -0800 Subject: KVM: x86/pmu: Always treat Fixed counters as available when supported Treat fixed counters as available when they are supported, i.e. don't silently ignore an enabled fixed counter just because guest CPUID says the associated general purpose architectural event is unavailable. KVM originally treated fixed counters as always available, but that got changed as part of a fix to avoid confusing REF_CPU_CYCLES, which does NOT map to an architectural event, with the actual architectural event used associated with bit 7, TOPDOWN_SLOTS. The commit justified the change with: If the event is marked as unavailable in the Intel guest CPUID 0AH.EBX leaf, we need to avoid any perf_event creation, whether it's a gp or fixed counter. but that justification doesn't mesh with reality. The Intel SDM uses "architectural events" to refer to both general purpose events (the ones with the reverse polarity mask in CPUID.0xA.EBX) and the events for fixed counters, e.g. the SDM makes statements like: Each of the fixed-function PMC can count only one architectural performance event. but the fact that fixed counter 2 (TSC reference cycles) doesn't have an associated general purpose architectural makes trying to apply the mask from CPUID.0xA.EBX impossible. Furthermore, the lack of enumeration for an architectural event in CPUID only means the CPU doesn't officially support the architectural encoding, i.e. it doesn't mean using the architectural encoding _won't_ work, it sipmly means there are no guarantees that it will work as expected. E.g. if KVM is running in a VM that advertises a fixed counters but not the corresponding architectural event encoding, and perf decides to use a general purpose counter instead of a fixed counter, odds are very good that the underlying hardware actually does support the architectrual encoding, and that programming the encoding will count the right thing. In other words, asking perf to count the event will probably work, whereas intentionally doing nothing is obviously guaranteed to fail. Note, at the time of the change, KVM didn't enforce hardware support, i.e. didn't prevent userspace from enumerating support in guest CPUID.0xA.EBX for architectural events that aren't supported in hardware. I.e. silently dropping the fixed counter didn't somehow protection against counting the wrong event, it just enforced guest CPUID. And practically speaking, this issue is almost certainly limited to running KVM on a funky virtual CPU model. No known real hardware has an asymmetric PMU where a fixed counter is supported but the associated architectural event is not. Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event") Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index a6216c874729..8207f8c03585 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -108,11 +108,24 @@ static bool intel_hw_event_available(struct kvm_pmc *pmc) u8 unit_mask = (pmc->eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8; int i; + /* + * Fixed counters are always available if KVM reaches this point. If a + * fixed counter is unsupported in hardware or guest CPUID, KVM doesn't + * allow the counter's corresponding MSR to be written. KVM does use + * architectural events to program fixed counters, as the interface to + * perf doesn't allow requesting a specific fixed counter, e.g. perf + * may (sadly) back a guest fixed PMC with a general purposed counter. + * But if _hardware_ doesn't support the associated event, KVM simply + * doesn't enumerate support for the fixed counter. + */ + if (pmc_is_fixed(pmc)) + return true; + BUILD_BUG_ON(ARRAY_SIZE(intel_arch_events) != NR_INTEL_ARCH_EVENTS); /* * Disallow events reported as unavailable in guest CPUID. Note, this - * doesn't apply to pseudo-architectural events. + * doesn't apply to pseudo-architectural events (see above). */ for (i = 0; i < NR_REAL_INTEL_ARCH_EVENTS; i++) { if (intel_arch_events[i].eventsel != event_select || -- cgit v1.2.3 From cbbd1aa891398e9ddbdd05bc4e40a988175f22af Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:22 -0800 Subject: KVM: x86/pmu: Allow programming events that match unsupported arch events Remove KVM's bogus restriction that the guest can't program an event whose encoding matches an unsupported architectural event. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed using the architectural encoding. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the architectural event. Preventing the guest from counting events whose encoding happens to match an architectural event breaks existing functionality whenever Intel adds an architectural encoding that was *ever* used for a CPU that doesn't enumerate support for the architectural event, even if the encoding is for the exact same event! E.g. the architectural encoding for Top-Down Slots is 0x01a4. Broadwell CPUs, which do not support the Top-Down Slots architectural event, 0x01a4 is a valid, model-specific event. Denying guest usage of 0x01a4 if/when KVM adds support for Top-Down slots would break any Broadwell-based guest. Reported-by: Kan Liang Closes: https://lore.kernel.org/all/2004baa6-b494-462c-a11f-8104ea152c6a@linux.intel.com Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event") Reviewed-by: Dapeng Mi Reviewed-by: Jim Mattson Reviewed-by: Kan Liang Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 - arch/x86/kvm/pmu.c | 1 - arch/x86/kvm/pmu.h | 1 - arch/x86/kvm/svm/pmu.c | 6 ------ arch/x86/kvm/vmx/pmu_intel.c | 38 ---------------------------------- 5 files changed, 47 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h index 058bc636356a..d7eebee4450c 100644 --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h @@ -12,7 +12,6 @@ BUILD_BUG_ON(1) * a NULL definition, for example if "static_call_cond()" will be used * at the call sites. */ -KVM_X86_PMU_OP(hw_event_available) KVM_X86_PMU_OP(pmc_idx_to_pmc) KVM_X86_PMU_OP(rdpmc_ecx_to_pmc) KVM_X86_PMU_OP(msr_idx_to_pmc) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 87cc6c8809ad..30945fea6988 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -441,7 +441,6 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc) static bool pmc_event_is_allowed(struct kvm_pmc *pmc) { return pmc_is_globally_enabled(pmc) && pmc_speculative_in_use(pmc) && - static_call(kvm_x86_pmu_hw_event_available)(pmc) && check_pmu_event_filter(pmc); } diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 7caeb3d8d4fd..87ecf22f5b25 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -19,7 +19,6 @@ #define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002 struct kvm_pmu_ops { - bool (*hw_event_available)(struct kvm_pmc *pmc); struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx); struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask); diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c index b6a7ad4d6914..1475d47c821c 100644 --- a/arch/x86/kvm/svm/pmu.c +++ b/arch/x86/kvm/svm/pmu.c @@ -73,11 +73,6 @@ static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr, return amd_pmc_idx_to_pmc(pmu, idx); } -static bool amd_hw_event_available(struct kvm_pmc *pmc) -{ - return true; -} - static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); @@ -233,7 +228,6 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu) } struct kvm_pmu_ops amd_pmu_ops __initdata = { - .hw_event_available = amd_hw_event_available, .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc, .msr_idx_to_pmc = amd_msr_idx_to_pmc, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 8207f8c03585..1a7d021a6c7b 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -101,43 +101,6 @@ static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) } } -static bool intel_hw_event_available(struct kvm_pmc *pmc) -{ - struct kvm_pmu *pmu = pmc_to_pmu(pmc); - u8 event_select = pmc->eventsel & ARCH_PERFMON_EVENTSEL_EVENT; - u8 unit_mask = (pmc->eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8; - int i; - - /* - * Fixed counters are always available if KVM reaches this point. If a - * fixed counter is unsupported in hardware or guest CPUID, KVM doesn't - * allow the counter's corresponding MSR to be written. KVM does use - * architectural events to program fixed counters, as the interface to - * perf doesn't allow requesting a specific fixed counter, e.g. perf - * may (sadly) back a guest fixed PMC with a general purposed counter. - * But if _hardware_ doesn't support the associated event, KVM simply - * doesn't enumerate support for the fixed counter. - */ - if (pmc_is_fixed(pmc)) - return true; - - BUILD_BUG_ON(ARRAY_SIZE(intel_arch_events) != NR_INTEL_ARCH_EVENTS); - - /* - * Disallow events reported as unavailable in guest CPUID. Note, this - * doesn't apply to pseudo-architectural events (see above). - */ - for (i = 0; i < NR_REAL_INTEL_ARCH_EVENTS; i++) { - if (intel_arch_events[i].eventsel != event_select || - intel_arch_events[i].unit_mask != unit_mask) - continue; - - return pmu->available_event_types & BIT(i); - } - - return true; -} - static bool intel_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); @@ -780,7 +743,6 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu) } struct kvm_pmu_ops intel_pmu_ops __initdata = { - .hw_event_available = intel_hw_event_available, .pmc_idx_to_pmc = intel_pmc_idx_to_pmc, .rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc, .msr_idx_to_pmc = intel_msr_idx_to_pmc, -- cgit v1.2.3 From db9e008a0f37bd31ac3d77d650812612b7c873da Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:23 -0800 Subject: KVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodings Drop KVM's enumeration of Intel's architectural event encodings, and instead open code the three encodings (of which only two are real) that KVM uses to emulate fixed counters. Now that KVM doesn't incorrectly enforce the availability of architectural encodings, there is no reason for KVM to ever care about the encodings themselves, at least not in the current format of an array indexed by the encoding's position in CPUID. Opportunistically add a comment to explain why KVM cares about eventsel values for fixed counters. Suggested-by: Jim Mattson Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 72 ++++++++++++++------------------------------ 1 file changed, 23 insertions(+), 49 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 1a7d021a6c7b..f3c44ddc09f8 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -22,52 +22,6 @@ #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) -enum intel_pmu_architectural_events { - /* - * The order of the architectural events matters as support for each - * event is enumerated via CPUID using the index of the event. - */ - INTEL_ARCH_CPU_CYCLES, - INTEL_ARCH_INSTRUCTIONS_RETIRED, - INTEL_ARCH_REFERENCE_CYCLES, - INTEL_ARCH_LLC_REFERENCES, - INTEL_ARCH_LLC_MISSES, - INTEL_ARCH_BRANCHES_RETIRED, - INTEL_ARCH_BRANCHES_MISPREDICTED, - - NR_REAL_INTEL_ARCH_EVENTS, - - /* - * Pseudo-architectural event used to implement IA32_FIXED_CTR2, a.k.a. - * TSC reference cycles. The architectural reference cycles event may - * or may not actually use the TSC as the reference, e.g. might use the - * core crystal clock or the bus clock (yeah, "architectural"). - */ - PSEUDO_ARCH_REFERENCE_CYCLES = NR_REAL_INTEL_ARCH_EVENTS, - NR_INTEL_ARCH_EVENTS, -}; - -static struct { - u8 eventsel; - u8 unit_mask; -} const intel_arch_events[] = { - [INTEL_ARCH_CPU_CYCLES] = { 0x3c, 0x00 }, - [INTEL_ARCH_INSTRUCTIONS_RETIRED] = { 0xc0, 0x00 }, - [INTEL_ARCH_REFERENCE_CYCLES] = { 0x3c, 0x01 }, - [INTEL_ARCH_LLC_REFERENCES] = { 0x2e, 0x4f }, - [INTEL_ARCH_LLC_MISSES] = { 0x2e, 0x41 }, - [INTEL_ARCH_BRANCHES_RETIRED] = { 0xc4, 0x00 }, - [INTEL_ARCH_BRANCHES_MISPREDICTED] = { 0xc5, 0x00 }, - [PSEUDO_ARCH_REFERENCE_CYCLES] = { 0x00, 0x03 }, -}; - -/* mapping between fixed pmc index and intel_arch_events array */ -static int fixed_pmc_events[] = { - [0] = INTEL_ARCH_INSTRUCTIONS_RETIRED, - [1] = INTEL_ARCH_CPU_CYCLES, - [2] = PSEUDO_ARCH_REFERENCE_CYCLES, -}; - static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data) { struct kvm_pmc *pmc; @@ -440,8 +394,29 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } +/* + * Map fixed counter events to architectural general purpose event encodings. + * Perf doesn't provide APIs to allow KVM to directly program a fixed counter, + * and so KVM instead programs the architectural event to effectively request + * the fixed counter. Perf isn't guaranteed to use a fixed counter and may + * instead program the encoding into a general purpose counter, e.g. if a + * different perf_event is already utilizing the requested counter, but the end + * result is the same (ignoring the fact that using a general purpose counter + * will likely exacerbate counter contention). + * + * Note, reference cycles is counted using a perf-defined "psuedo-encoding", + * as there is no architectural general purpose encoding for reference cycles. + */ static void setup_fixed_pmc_eventsel(struct kvm_pmu *pmu) { + const struct { + u8 eventsel; + u8 unit_mask; + } fixed_pmc_events[] = { + [0] = { 0xc0, 0x00 }, /* Instruction Retired / PERF_COUNT_HW_INSTRUCTIONS. */ + [1] = { 0x3c, 0x00 }, /* CPU Cycles/ PERF_COUNT_HW_CPU_CYCLES. */ + [2] = { 0x00, 0x03 }, /* Reference Cycles / PERF_COUNT_HW_REF_CPU_CYCLES*/ + }; int i; BUILD_BUG_ON(ARRAY_SIZE(fixed_pmc_events) != KVM_PMC_MAX_FIXED); @@ -449,10 +424,9 @@ static void setup_fixed_pmc_eventsel(struct kvm_pmu *pmu) for (i = 0; i < pmu->nr_arch_fixed_counters; i++) { int index = array_index_nospec(i, KVM_PMC_MAX_FIXED); struct kvm_pmc *pmc = &pmu->fixed_counters[index]; - u32 event = fixed_pmc_events[index]; - pmc->eventsel = (intel_arch_events[event].unit_mask << 8) | - intel_arch_events[event].eventsel; + pmc->eventsel = (fixed_pmc_events[index].unit_mask << 8) | + fixed_pmc_events[index].eventsel; } } -- cgit v1.2.3 From 61bb2ad795a72a7a75add3ffc257660fd6c7cfea Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:24 -0800 Subject: KVM: x86/pmu: Setup fixed counters' eventsel during PMU initialization Set the eventsel for all fixed counters during PMU initialization, the eventsel is hardcoded and consumed if and only if the counter is supported, i.e. there is no reason to redo the setup every time the PMU is refreshed. Configuring all KVM-supported fixed counter also eliminates a potential pitfall if/when KVM supports discontiguous fixed counters, in which case configuring only nr_arch_fixed_counters will be insufficient (ignoring the fact that KVM will need many other changes to support discontiguous fixed counters). Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index f3c44ddc09f8..98e92b9ece09 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -407,27 +407,21 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) * Note, reference cycles is counted using a perf-defined "psuedo-encoding", * as there is no architectural general purpose encoding for reference cycles. */ -static void setup_fixed_pmc_eventsel(struct kvm_pmu *pmu) +static u64 intel_get_fixed_pmc_eventsel(int index) { const struct { - u8 eventsel; + u8 event; u8 unit_mask; } fixed_pmc_events[] = { [0] = { 0xc0, 0x00 }, /* Instruction Retired / PERF_COUNT_HW_INSTRUCTIONS. */ [1] = { 0x3c, 0x00 }, /* CPU Cycles/ PERF_COUNT_HW_CPU_CYCLES. */ [2] = { 0x00, 0x03 }, /* Reference Cycles / PERF_COUNT_HW_REF_CPU_CYCLES*/ }; - int i; BUILD_BUG_ON(ARRAY_SIZE(fixed_pmc_events) != KVM_PMC_MAX_FIXED); - for (i = 0; i < pmu->nr_arch_fixed_counters; i++) { - int index = array_index_nospec(i, KVM_PMC_MAX_FIXED); - struct kvm_pmc *pmc = &pmu->fixed_counters[index]; - - pmc->eventsel = (fixed_pmc_events[index].unit_mask << 8) | - fixed_pmc_events[index].eventsel; - } + return (fixed_pmc_events[index].unit_mask << 8) | + fixed_pmc_events[index].event; } static void intel_pmu_refresh(struct kvm_vcpu *vcpu) @@ -493,7 +487,6 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) kvm_pmu_cap.bit_width_fixed); pmu->counter_bitmask[KVM_PMC_FIXED] = ((u64)1 << edx.split.bit_width_fixed) - 1; - setup_fixed_pmc_eventsel(pmu); } for (i = 0; i < pmu->nr_arch_fixed_counters; i++) @@ -571,6 +564,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu) pmu->fixed_counters[i].vcpu = vcpu; pmu->fixed_counters[i].idx = i + INTEL_PMC_IDX_FIXED; pmu->fixed_counters[i].current_config = 0; + pmu->fixed_counters[i].eventsel = intel_get_fixed_pmc_eventsel(i); } lbr_desc->records.nr = 0; -- cgit v1.2.3 From 7a277c22412cf72cf6376f21c6d826e8f4a44cc3 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:25 -0800 Subject: KVM: x86/pmu: Get eventsel for fixed counters from perf Get the event selectors used to effectively request fixed counters for perf events from perf itself instead of hardcoding them in KVM and hoping that they match the underlying hardware. While fixed counters 0 and 1 use architectural events, as of ffbe4ab0beda ("perf/x86/intel: Extend the ref-cycles event to GP counters") fixed counter 2 (reference TSC cycles) may use a software-defined pseudo-encoding or a real hardware-defined encoding. Reported-by: Kan Liang Closes: https://lkml.kernel.org/r/4281eee7-6423-4ec8-bb18-c6aeee1faf2c%40linux.intel.com Reviewed-by: Kan Liang Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 98e92b9ece09..ec4feaef3d55 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -404,24 +404,28 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) * result is the same (ignoring the fact that using a general purpose counter * will likely exacerbate counter contention). * - * Note, reference cycles is counted using a perf-defined "psuedo-encoding", - * as there is no architectural general purpose encoding for reference cycles. + * Forcibly inlined to allow asserting on @index at build time, and there should + * never be more than one user. */ -static u64 intel_get_fixed_pmc_eventsel(int index) +static __always_inline u64 intel_get_fixed_pmc_eventsel(unsigned int index) { - const struct { - u8 event; - u8 unit_mask; - } fixed_pmc_events[] = { - [0] = { 0xc0, 0x00 }, /* Instruction Retired / PERF_COUNT_HW_INSTRUCTIONS. */ - [1] = { 0x3c, 0x00 }, /* CPU Cycles/ PERF_COUNT_HW_CPU_CYCLES. */ - [2] = { 0x00, 0x03 }, /* Reference Cycles / PERF_COUNT_HW_REF_CPU_CYCLES*/ + const enum perf_hw_id fixed_pmc_perf_ids[] = { + [0] = PERF_COUNT_HW_INSTRUCTIONS, + [1] = PERF_COUNT_HW_CPU_CYCLES, + [2] = PERF_COUNT_HW_REF_CPU_CYCLES, }; + u64 eventsel; - BUILD_BUG_ON(ARRAY_SIZE(fixed_pmc_events) != KVM_PMC_MAX_FIXED); + BUILD_BUG_ON(ARRAY_SIZE(fixed_pmc_perf_ids) != KVM_PMC_MAX_FIXED); + BUILD_BUG_ON(index >= KVM_PMC_MAX_FIXED); - return (fixed_pmc_events[index].unit_mask << 8) | - fixed_pmc_events[index].event; + /* + * Yell if perf reports support for a fixed counter but perf doesn't + * have a known encoding for the associated general purpose event. + */ + eventsel = perf_get_hw_event_config(fixed_pmc_perf_ids[index]); + WARN_ON_ONCE(!eventsel && index < kvm_pmu_cap.num_counters_fixed); + return eventsel; } static void intel_pmu_refresh(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From ecb490770ad42c7d3dc08f06efe7bf0779990745 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:26 -0800 Subject: KVM: x86/pmu: Don't ignore bits 31:30 for RDPMC index on AMD Stop stripping bits 31:30 prior to validating/consuming the RDPMC index on AMD. Per the APM's documentation of RDPMC, *values* greater than 27 are reserved. The behavior of upper bits being flags is firmly Intel-only. Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/pmu.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c index 1475d47c821c..1fafc46f61c9 100644 --- a/arch/x86/kvm/svm/pmu.c +++ b/arch/x86/kvm/svm/pmu.c @@ -77,8 +77,6 @@ static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - idx &= ~(3u << 30); - return idx < pmu->nr_arch_gp_counters; } @@ -86,7 +84,7 @@ static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask) { - return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx & ~(3u << 30)); + return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx); } static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr) -- cgit v1.2.3 From 7bb7fce13601d2e6818be500ef3ce0b60cd59603 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:27 -0800 Subject: KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index Apply the pre-intercepts RDPMC validity check only to AMD, and rename all relevant functions to make it as clear as possible that the check is not a standard PMC index check. On Intel, the basic rule is that only invalid opcodes and privilege/permission/mode checks have priority over VM-Exit, i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a non-existent MSR as an example where VM-Exit has priority over #GP, and RDPMC is effectively just a variation of RDMSR. Manually testing on various Intel CPUs confirms this behavior, and the inverted priority was introduced for SVM compatibility, i.e. was not an intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have priority over VM-Exit. Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0 static call so as to provide a convenient location to document the difference between Intel and AMD, and to again try to make it as obvious as possible that the early check is a one-off thing, not a generic "is this PMC valid?" helper. Fixes: 8061252ee0d2 ("KVM: SVM: Add intercept checks for remaining twobyte instructions") Cc: Jim Mattson Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 +- arch/x86/kvm/emulate.c | 2 +- arch/x86/kvm/kvm_emulate.h | 2 +- arch/x86/kvm/pmu.c | 16 +++++++++++++--- arch/x86/kvm/pmu.h | 4 ++-- arch/x86/kvm/svm/pmu.c | 9 ++++++--- arch/x86/kvm/vmx/pmu_intel.c | 12 ------------ arch/x86/kvm/x86.c | 9 +++------ 8 files changed, 27 insertions(+), 29 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h index d7eebee4450c..f0cd48222133 100644 --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h @@ -15,7 +15,7 @@ BUILD_BUG_ON(1) KVM_X86_PMU_OP(pmc_idx_to_pmc) KVM_X86_PMU_OP(rdpmc_ecx_to_pmc) KVM_X86_PMU_OP(msr_idx_to_pmc) -KVM_X86_PMU_OP(is_valid_rdpmc_ecx) +KVM_X86_PMU_OP_OPTIONAL(check_rdpmc_early) KVM_X86_PMU_OP(is_valid_msr) KVM_X86_PMU_OP(get_msr) KVM_X86_PMU_OP(set_msr) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e223043ef5b2..695ab5b6055c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3962,7 +3962,7 @@ static int check_rdpmc(struct x86_emulate_ctxt *ctxt) * protected mode. */ if ((!(cr4 & X86_CR4_PCE) && ctxt->ops->cpl(ctxt)) || - ctxt->ops->check_pmc(ctxt, rcx)) + ctxt->ops->check_rdpmc_early(ctxt, rcx)) return emulate_gp(ctxt, 0); return X86EMUL_CONTINUE; diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index e6d149825169..4351149484fb 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -208,7 +208,7 @@ struct x86_emulate_ops { int (*set_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data); int (*get_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata); int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata); - int (*check_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc); + int (*check_rdpmc_early)(struct x86_emulate_ctxt *ctxt, u32 pmc); int (*read_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc, u64 *pdata); void (*halt)(struct x86_emulate_ctxt *ctxt); void (*wbinvd)(struct x86_emulate_ctxt *ctxt); diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 30945fea6988..0b0d804ee239 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -524,10 +524,20 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) kvm_pmu_cleanup(vcpu); } -/* check if idx is a valid index to access PMU */ -bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) +int kvm_pmu_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx) { - return static_call(kvm_x86_pmu_is_valid_rdpmc_ecx)(vcpu, idx); + /* + * On Intel, VMX interception has priority over RDPMC exceptions that + * aren't already handled by the emulator, i.e. there are no additional + * check needed for Intel PMUs. + * + * On AMD, _all_ exceptions on RDPMC have priority over SVM intercepts, + * i.e. an invalid PMC results in a #GP, not #VMEXIT. + */ + if (!kvm_pmu_ops.check_rdpmc_early) + return 0; + + return static_call(kvm_x86_pmu_check_rdpmc_early)(vcpu, idx); } bool is_vmware_backdoor_pmc(u32 pmc_idx) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 87ecf22f5b25..51bbb01b21c8 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -23,7 +23,7 @@ struct kvm_pmu_ops { struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask); struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, u32 msr); - bool (*is_valid_rdpmc_ecx)(struct kvm_vcpu *vcpu, unsigned int idx); + int (*check_rdpmc_early)(struct kvm_vcpu *vcpu, unsigned int idx); bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); @@ -215,7 +215,7 @@ static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc) void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu); void kvm_pmu_handle_event(struct kvm_vcpu *vcpu); int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data); -bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx); +int kvm_pmu_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx); bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr); int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c index 1fafc46f61c9..e886300f0f97 100644 --- a/arch/x86/kvm/svm/pmu.c +++ b/arch/x86/kvm/svm/pmu.c @@ -73,11 +73,14 @@ static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr, return amd_pmc_idx_to_pmc(pmu, idx); } -static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) +static int amd_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - return idx < pmu->nr_arch_gp_counters; + if (idx >= pmu->nr_arch_gp_counters) + return -EINVAL; + + return 0; } /* idx is the ECX register of RDPMC instruction */ @@ -229,7 +232,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = { .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc, .msr_idx_to_pmc = amd_msr_idx_to_pmc, - .is_valid_rdpmc_ecx = amd_is_valid_rdpmc_ecx, + .check_rdpmc_early = amd_check_rdpmc_early, .is_valid_msr = amd_is_valid_msr, .get_msr = amd_pmu_get_msr, .set_msr = amd_pmu_set_msr, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index ec4feaef3d55..1b1f888ad32b 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -55,17 +55,6 @@ static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) } } -static bool intel_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - bool fixed = idx & (1u << 30); - - idx &= ~(3u << 30); - - return fixed ? idx < pmu->nr_arch_fixed_counters - : idx < pmu->nr_arch_gp_counters; -} - static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask) { @@ -718,7 +707,6 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = { .pmc_idx_to_pmc = intel_pmc_idx_to_pmc, .rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc, .msr_idx_to_pmc = intel_msr_idx_to_pmc, - .is_valid_rdpmc_ecx = intel_is_valid_rdpmc_ecx, .is_valid_msr = intel_is_valid_msr, .get_msr = intel_pmu_get_msr, .set_msr = intel_pmu_set_msr, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 363b1c080205..cbee277254f0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8389,12 +8389,9 @@ static int emulator_get_msr(struct x86_emulate_ctxt *ctxt, return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata); } -static int emulator_check_pmc(struct x86_emulate_ctxt *ctxt, - u32 pmc) +static int emulator_check_rdpmc_early(struct x86_emulate_ctxt *ctxt, u32 pmc) { - if (kvm_pmu_is_valid_rdpmc_ecx(emul_to_vcpu(ctxt), pmc)) - return 0; - return -EINVAL; + return kvm_pmu_check_rdpmc_early(emul_to_vcpu(ctxt), pmc); } static int emulator_read_pmc(struct x86_emulate_ctxt *ctxt, @@ -8526,7 +8523,7 @@ static const struct x86_emulate_ops emulate_ops = { .set_msr_with_filter = emulator_set_msr_with_filter, .get_msr_with_filter = emulator_get_msr_with_filter, .get_msr = emulator_get_msr, - .check_pmc = emulator_check_pmc, + .check_rdpmc_early = emulator_check_rdpmc_early, .read_pmc = emulator_read_pmc, .halt = emulator_halt, .wbinvd = emulator_wbinvd, -- cgit v1.2.3 From d652981db08fe2db185e00ad9cd41871f49807b0 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:28 -0800 Subject: KVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUs Move the handling of "fast" RDPMC instructions, which drop bits 63:32 of the count, to Intel. The "fast" flag, and all modifiers for that matter, are Intel-only and aren't supported by AMD. Opportunistically replace open coded bit crud with proper #defines, and add comments to try and disentangle the flags vs. values mess for non-architectural vs. architectural PMUs. Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Reviewed-by: Dapeng Mi Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 3 +-- arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++-- 2 files changed, 15 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 0b0d804ee239..09b0feb975c3 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -576,10 +576,9 @@ static int kvm_pmu_rdpmc_vmware(struct kvm_vcpu *vcpu, unsigned idx, u64 *data) int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data) { - bool fast_mode = idx & (1u << 31); struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; - u64 mask = fast_mode ? ~0u : ~0ull; + u64 mask = ~0ull; if (!pmu->version) return 1; diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 1b1f888ad32b..03bd188b5754 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -20,6 +20,15 @@ #include "nested.h" #include "pmu.h" +/* + * Perf's "BASE" is wildly misleading, architectural PMUs use bits 31:16 of ECX + * to encode the "type" of counter to read, i.e. this is not a "base". And to + * further confuse things, non-architectural PMUs use bit 31 as a flag for + * "fast" reads, whereas the "type" is an explicit value. + */ +#define INTEL_RDPMC_FIXED INTEL_PMC_FIXED_RDPMC_BASE +#define INTEL_RDPMC_FAST BIT(31) + #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data) @@ -59,11 +68,14 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - bool fixed = idx & (1u << 30); + bool fixed = idx & INTEL_RDPMC_FIXED; struct kvm_pmc *counters; unsigned int num_counters; - idx &= ~(3u << 30); + if (idx & INTEL_RDPMC_FAST) + *mask &= GENMASK_ULL(31, 0); + + idx &= ~(INTEL_RDPMC_FIXED | INTEL_RDPMC_FAST); if (fixed) { counters = pmu->fixed_counters; num_counters = pmu->nr_arch_fixed_counters; -- cgit v1.2.3 From 5728a4a0ea79e2f2e650db4793170900e57359a7 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:29 -0800 Subject: KVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Inject #GP on RDPMC if the "fast" flag is set for architectural Intel PMUs, i.e. if the PMU version is non-zero. Per Intel's SDM, and confirmed on bare metal, the "fast" flag is supported only for non-architectural PMUs, and is reserved for architectural PMUs. If the processor does not support architectural performance monitoring (CPUID.0AH:EAX[7:0]=0), ECX[30:0] specifies the index of the PMC to be read. Setting ECX[31] selects “fast” read mode if supported. In this mode, RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero. If the processor does support architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type. The following PMC types are currently defined: — General-purpose counters use type 0. The index x (to read IA32_PMCx) must be less than the value enumerated by CPUID.0AH.EAX[15:8] (thus ECX[15:8] must be zero). — Fixed-function counters use type 4000H. The index x (to read IA32_FIXED_CTRx) can be used if either CPUID.0AH.EDX[4:0] > x or CPUID.0AH.ECX[x] = 1 (thus ECX[15:5] must be 0). — Performance metrics use type 2000H. This type can be used only if IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type, the index in ECX[15:0] is implementation specific. Opportunistically WARN if KVM ever actually tries to complete RDPMC for a non-architectural PMU, and drop the non-existent "support" for fast RDPMC, as KVM doesn't support such PMUs, i.e. kvm_pmu_rdpmc() should reject the RDPMC before getting to the Intel code. Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Fixes: 67f4d4288c35 ("KVM: x86: rdpmc emulation checks the counter incorrectly") Reviewed-by: Dapeng Mi Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-10-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 03bd188b5754..5a5dfae6055c 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -27,7 +27,6 @@ * "fast" reads, whereas the "type" is an explicit value. */ #define INTEL_RDPMC_FIXED INTEL_PMC_FIXED_RDPMC_BASE -#define INTEL_RDPMC_FAST BIT(31) #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) @@ -72,10 +71,25 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, struct kvm_pmc *counters; unsigned int num_counters; - if (idx & INTEL_RDPMC_FAST) - *mask &= GENMASK_ULL(31, 0); + /* + * The encoding of ECX for RDPMC is different for architectural versus + * non-architecturals PMUs (PMUs with version '0'). For architectural + * PMUs, bits 31:16 specify the PMC type and bits 15:0 specify the PMC + * index. For non-architectural PMUs, bit 31 is a "fast" flag, and + * bits 30:0 specify the PMC index. + * + * Yell and reject attempts to read PMCs for a non-architectural PMU, + * as KVM doesn't support such PMUs. + */ + if (WARN_ON_ONCE(!pmu->version)) + return NULL; - idx &= ~(INTEL_RDPMC_FIXED | INTEL_RDPMC_FAST); + /* + * Fixed PMCs are supported on all architectural PMUs. Note, KVM only + * emulates fixed PMCs for PMU v2+, but the flag itself is still valid, + * i.e. let RDPMC fail due to accessing a non-existent counter. + */ + idx &= ~INTEL_RDPMC_FIXED; if (fixed) { counters = pmu->fixed_counters; num_counters = pmu->nr_arch_fixed_counters; -- cgit v1.2.3 From 7a0fc734c20dd1ad9bc1b8b922b61616f31d3823 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:30 -0800 Subject: KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flag Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an explicit value, not a flag (minus one wart). While non-architectural PMUs do use bit 31 as a flag (for "fast" reads), architectural PMUs use the upper half of ECX to encode the type. From the SDM: ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type Note, that the known supported types are 4000H and 2000H, i.e. look a lot like flags, doesn't contradict the above statement that ECX[31:16] holds the type, at least not by any sane reading of the SDM. Keep the explicitly clearing of the FIXED "flag", as KVM subtly relies on that behavior to disallow unsupported types while allowing the correct indices for fixed counters. This wart will be cleaned up in short order. Opportunistically grab the per-type bitmask in the if-else blocks to eliminate the one-off usage of the local "fixed" bool. Reported-by: Jim Mattson Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 5a5dfae6055c..c37dd3aa056b 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -28,6 +28,9 @@ */ #define INTEL_RDPMC_FIXED INTEL_PMC_FIXED_RDPMC_BASE +#define INTEL_RDPMC_TYPE_MASK GENMASK(31, 16) +#define INTEL_RDPMC_INDEX_MASK GENMASK(15, 0) + #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data) @@ -66,10 +69,11 @@ static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask) { + unsigned int type = idx & INTEL_RDPMC_TYPE_MASK; struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - bool fixed = idx & INTEL_RDPMC_FIXED; struct kvm_pmc *counters; unsigned int num_counters; + u64 bitmask; /* * The encoding of ECX for RDPMC is different for architectural versus @@ -90,16 +94,20 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, * i.e. let RDPMC fail due to accessing a non-existent counter. */ idx &= ~INTEL_RDPMC_FIXED; - if (fixed) { + if (type == INTEL_RDPMC_FIXED) { counters = pmu->fixed_counters; num_counters = pmu->nr_arch_fixed_counters; + bitmask = pmu->counter_bitmask[KVM_PMC_FIXED]; } else { counters = pmu->gp_counters; num_counters = pmu->nr_arch_gp_counters; + bitmask = pmu->counter_bitmask[KVM_PMC_GP]; } + if (idx >= num_counters) return NULL; - *mask &= pmu->counter_bitmask[fixed ? KVM_PMC_FIXED : KVM_PMC_GP]; + + *mask &= bitmask; return &counters[array_index_nospec(idx, num_counters)]; } -- cgit v1.2.3 From a634c76b2c1a642d11f5d6e1810181daedcbe0ab Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 15:02:31 -0800 Subject: KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC types MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Explicitly check for attempts to read unsupported PMC types instead of letting the bounds check fail. Functionally, letting the check fail is ok, but it's unnecessarily subtle and does a poor job of documenting the architectural behavior that KVM is emulating. Reviewed-by: Dapeng Mi  Tested-by: Dapeng Mi Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/pmu_intel.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index c37dd3aa056b..b41bdb0a0995 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -26,6 +26,7 @@ * further confuse things, non-architectural PMUs use bit 31 as a flag for * "fast" reads, whereas the "type" is an explicit value. */ +#define INTEL_RDPMC_GP 0 #define INTEL_RDPMC_FIXED INTEL_PMC_FIXED_RDPMC_BASE #define INTEL_RDPMC_TYPE_MASK GENMASK(31, 16) @@ -89,21 +90,29 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, return NULL; /* - * Fixed PMCs are supported on all architectural PMUs. Note, KVM only - * emulates fixed PMCs for PMU v2+, but the flag itself is still valid, - * i.e. let RDPMC fail due to accessing a non-existent counter. + * General Purpose (GP) PMCs are supported on all PMUs, and fixed PMCs + * are supported on all architectural PMUs, i.e. on all virtual PMUs + * supported by KVM. Note, KVM only emulates fixed PMCs for PMU v2+, + * but the type itself is still valid, i.e. let RDPMC fail due to + * accessing a non-existent counter. Reject attempts to read all other + * types, which are unknown/unsupported. */ - idx &= ~INTEL_RDPMC_FIXED; - if (type == INTEL_RDPMC_FIXED) { + switch (type) { + case INTEL_RDPMC_FIXED: counters = pmu->fixed_counters; num_counters = pmu->nr_arch_fixed_counters; bitmask = pmu->counter_bitmask[KVM_PMC_FIXED]; - } else { + break; + case INTEL_RDPMC_GP: counters = pmu->gp_counters; num_counters = pmu->nr_arch_gp_counters; bitmask = pmu->counter_bitmask[KVM_PMC_GP]; + break; + default: + return NULL; } + idx &= INTEL_RDPMC_INDEX_MASK; if (idx >= num_counters) return NULL; -- cgit v1.2.3 From 0dbd054699661dfffbc1c148664f8d03fd132569 Mon Sep 17 00:00:00 2001 From: Kunwu Chan Date: Tue, 16 Jan 2024 18:00:25 +0800 Subject: KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create() Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Note, KMEM_CACHE() uses the required alignment of the struct, '8' as the alignment, whereas KVM's existing code passes '0'. In the end, the two values yield the same result as x86's minimum slab alignment is also '8' (which is not at all coincidental). Signed-off-by: Kunwu Chan Link: https://lore.kernel.org/r/20240116100025.95702-1-chentao@kylinos.cn [sean: call out alignment behavior] Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2d6cdeab1f8a..3c193b096b45 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6997,9 +6997,7 @@ int kvm_mmu_vendor_module_init(void) kvm_mmu_reset_all_pte_masks(); - pte_list_desc_cache = kmem_cache_create("pte_list_desc", - sizeof(struct pte_list_desc), - 0, SLAB_ACCOUNT, NULL); + pte_list_desc_cache = KMEM_CACHE(pte_list_desc, SLAB_ACCOUNT); if (!pte_list_desc_cache) goto out; -- cgit v1.2.3 From f933b88e20150f15787390e2a1754a7e412754ed Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:48 -0800 Subject: KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled Move the purging of common PMU metadata from intel_pmu_refresh() to kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if the VM is supposed to have a vPMU. KVM already denies access to the PMU based on kvm->arch.enable_pmu, as get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already violates AMD's architecture by not virtualizing a PMU (kernels have long since learned to not panic when the PMU is unavailable). But configuring the PMU as if it were enabled causes unwanted side effects, e.g. calls to kvm_pmu_trigger_event() waste an absurd number of cycles due to the all_valid_pmc_idx bitmap being non-zero. Fixes: b1d66dad65dc ("KVM: x86/svm: Add module param to control PMU virtualization") Reported-by: Konstantin Khorenko Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 20 ++++++++++++++++++-- arch/x86/kvm/vmx/pmu_intel.c | 16 ++-------------- 2 files changed, 20 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 09b0feb975c3..4d4366c6f7cd 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -749,6 +749,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu) */ void kvm_pmu_refresh(struct kvm_vcpu *vcpu) { + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm)) return; @@ -758,8 +760,22 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu) */ kvm_pmu_reset(vcpu); - bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX); - static_call(kvm_x86_pmu_refresh)(vcpu); + pmu->version = 0; + pmu->nr_arch_gp_counters = 0; + pmu->nr_arch_fixed_counters = 0; + pmu->counter_bitmask[KVM_PMC_GP] = 0; + pmu->counter_bitmask[KVM_PMC_FIXED] = 0; + pmu->reserved_bits = 0xffffffff00200000ull; + pmu->raw_event_mask = X86_RAW_EVENT_MASK; + pmu->global_ctrl_mask = ~0ull; + pmu->global_status_mask = ~0ull; + pmu->fixed_ctr_ctrl_mask = ~0ull; + pmu->pebs_enable_mask = ~0ull; + pmu->pebs_data_cfg_mask = ~0ull; + bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX); + + if (vcpu->kvm->arch.enable_pmu) + static_call(kvm_x86_pmu_refresh)(vcpu); } void kvm_pmu_init(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index b41bdb0a0995..a87f4725a9be 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -471,19 +471,6 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) u64 counter_mask; int i; - pmu->nr_arch_gp_counters = 0; - pmu->nr_arch_fixed_counters = 0; - pmu->counter_bitmask[KVM_PMC_GP] = 0; - pmu->counter_bitmask[KVM_PMC_FIXED] = 0; - pmu->version = 0; - pmu->reserved_bits = 0xffffffff00200000ull; - pmu->raw_event_mask = X86_RAW_EVENT_MASK; - pmu->global_ctrl_mask = ~0ull; - pmu->global_status_mask = ~0ull; - pmu->fixed_ctr_ctrl_mask = ~0ull; - pmu->pebs_enable_mask = ~0ull; - pmu->pebs_data_cfg_mask = ~0ull; - memset(&lbr_desc->records, 0, sizeof(lbr_desc->records)); /* @@ -495,8 +482,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) return; entry = kvm_find_cpuid_entry(vcpu, 0xa); - if (!entry || !vcpu->kvm->arch.enable_pmu) + if (!entry) return; + eax.full = entry->eax; edx.full = entry->edx; -- cgit v1.2.3 From be6b067dae1573cf4d53c8b08175d8872d82f030 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:49 -0800 Subject: KVM: x86/pmu: Add common define to capture fixed counters offset Add a common define to "officially" solidify KVM's split of counters, i.e. to commit to using bits 31:0 to track general purpose counters and bits 63:32 to track fixed counters (which only Intel supports). KVM already bleeds this behavior all over common PMU code, and adding a KVM- defined macro allows clarifying that the value is a _base_, as oppposed to the _flag_ that is used to access fixed PMCs via RDPMC (which perf confusingly calls INTEL_PMC_FIXED_RDPMC_BASE). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 8 ++++---- arch/x86/kvm/pmu.h | 4 +++- arch/x86/kvm/vmx/pmu_intel.c | 12 ++++++------ 3 files changed, 13 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 4d4366c6f7cd..67d589ac9363 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -67,7 +67,7 @@ static const struct x86_cpu_id vmx_pebs_pdist_cpu[] = { * all perf counters (both gp and fixed). The mapping relationship * between pmc and perf counters is as the following: * * Intel: [0 .. KVM_INTEL_PMC_MAX_GENERIC-1] <=> gp counters - * [INTEL_PMC_IDX_FIXED .. INTEL_PMC_IDX_FIXED + 2] <=> fixed + * [KVM_FIXED_PMC_BASE_IDX .. KVM_FIXED_PMC_BASE_IDX + 2] <=> fixed * * AMD: [0 .. AMD64_NUM_COUNTERS-1] and, for families 15H * and later, [0 .. AMD64_NUM_COUNTERS_CORE-1] <=> gp counters */ @@ -411,7 +411,7 @@ static bool is_gp_event_allowed(struct kvm_x86_pmu_event_filter *f, static bool is_fixed_event_allowed(struct kvm_x86_pmu_event_filter *filter, int idx) { - int fixed_idx = idx - INTEL_PMC_IDX_FIXED; + int fixed_idx = idx - KVM_FIXED_PMC_BASE_IDX; if (filter->action == KVM_PMU_EVENT_DENY && test_bit(fixed_idx, (ulong *)&filter->fixed_counter_bitmap)) @@ -465,7 +465,7 @@ static void reprogram_counter(struct kvm_pmc *pmc) if (pmc_is_fixed(pmc)) { fixed_ctr_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl, - pmc->idx - INTEL_PMC_IDX_FIXED); + pmc->idx - KVM_FIXED_PMC_BASE_IDX); if (fixed_ctr_ctrl & 0x1) eventsel |= ARCH_PERFMON_EVENTSEL_OS; if (fixed_ctr_ctrl & 0x2) @@ -841,7 +841,7 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc) select_user = config & ARCH_PERFMON_EVENTSEL_USR; } else { config = fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl, - pmc->idx - INTEL_PMC_IDX_FIXED); + pmc->idx - KVM_FIXED_PMC_BASE_IDX); select_os = config & 0x1; select_user = config & 0x2; } diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 51bbb01b21c8..e8c6a1f4b8e8 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -18,6 +18,8 @@ #define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001 #define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002 +#define KVM_FIXED_PMC_BASE_IDX INTEL_PMC_IDX_FIXED + struct kvm_pmu_ops { struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx); struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu, @@ -130,7 +132,7 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) if (pmc_is_fixed(pmc)) return fixed_ctrl_field(pmu->fixed_ctr_ctrl, - pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3; + pmc->idx - KVM_FIXED_PMC_BASE_IDX) & 0x3; return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE; } diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index a87f4725a9be..fe7a2ba51e1b 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -50,18 +50,18 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data) pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i); - __set_bit(INTEL_PMC_IDX_FIXED + i, pmu->pmc_in_use); + __set_bit(KVM_FIXED_PMC_BASE_IDX + i, pmu->pmc_in_use); kvm_pmu_request_counter_reprogram(pmc); } } static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) { - if (pmc_idx < INTEL_PMC_IDX_FIXED) { + if (pmc_idx < KVM_FIXED_PMC_BASE_IDX) { return get_gp_pmc(pmu, MSR_P6_EVNTSEL0 + pmc_idx, MSR_P6_EVNTSEL0); } else { - u32 idx = pmc_idx - INTEL_PMC_IDX_FIXED; + u32 idx = pmc_idx - KVM_FIXED_PMC_BASE_IDX; return get_fixed_pmc(pmu, idx + MSR_CORE_PERF_FIXED_CTR0); } @@ -516,7 +516,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) for (i = 0; i < pmu->nr_arch_fixed_counters; i++) pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4)); counter_mask = ~(((1ull << pmu->nr_arch_gp_counters) - 1) | - (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED)); + (((1ull << pmu->nr_arch_fixed_counters) - 1) << KVM_FIXED_PMC_BASE_IDX)); pmu->global_ctrl_mask = counter_mask; /* @@ -560,7 +560,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE; for (i = 0; i < pmu->nr_arch_fixed_counters; i++) { pmu->fixed_ctr_ctrl_mask &= - ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4)); + ~(1ULL << (KVM_FIXED_PMC_BASE_IDX + i * 4)); } pmu->pebs_data_cfg_mask = ~0xff00000full; } else { @@ -586,7 +586,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu) for (i = 0; i < KVM_PMC_MAX_FIXED; i++) { pmu->fixed_counters[i].type = KVM_PMC_FIXED; pmu->fixed_counters[i].vcpu = vcpu; - pmu->fixed_counters[i].idx = i + INTEL_PMC_IDX_FIXED; + pmu->fixed_counters[i].idx = i + KVM_FIXED_PMC_BASE_IDX; pmu->fixed_counters[i].current_config = 0; pmu->fixed_counters[i].eventsel = intel_get_fixed_pmc_eventsel(i); } -- cgit v1.2.3 From b31880ca2f41dc2196e31d97e498b0fa884c2b2a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:50 -0800 Subject: KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code Add a common helper for *internal* PMC lookups, and delete the ops hook and Intel's implementation. Keep AMD's implementation, but rename it to amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for both KVM-internal and guest-initiated lookups. Because KVM tracks all counters in a single bitmap, getting a counter when iterating over a bitmap, e.g. of all valid PMCs, requires a small amount of math, that while simple, isn't super obvious and doesn't use the same semantics as PMC lookups from RDPMC! Although AMD doesn't support fixed counters, the common PMU code still behaves as if there a split, the high half of which just happens to always be empty. Opportunstically add a comment to explain both what is going on, and why KVM uses a single bitmap, e.g. the boilerplate for iterating over separate bitmaps could be done via macros, so it's not (just) about deduplicating code. Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 - arch/x86/kvm/pmu.c | 8 ++++---- arch/x86/kvm/pmu.h | 29 ++++++++++++++++++++++++++++- arch/x86/kvm/svm/pmu.c | 7 +++---- arch/x86/kvm/vmx/pmu_intel.c | 15 +-------------- 5 files changed, 36 insertions(+), 24 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h index f0cd48222133..f852b13aeefe 100644 --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h @@ -12,7 +12,6 @@ BUILD_BUG_ON(1) * a NULL definition, for example if "static_call_cond()" will be used * at the call sites. */ -KVM_X86_PMU_OP(pmc_idx_to_pmc) KVM_X86_PMU_OP(rdpmc_ecx_to_pmc) KVM_X86_PMU_OP(msr_idx_to_pmc) KVM_X86_PMU_OP_OPTIONAL(check_rdpmc_early) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 67d589ac9363..0873937c90bc 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -505,7 +505,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) int bit; for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) { - struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit); + struct kvm_pmc *pmc = kvm_pmc_idx_to_pmc(pmu, bit); if (unlikely(!pmc)) { clear_bit(bit, pmu->reprogram_pmi); @@ -725,7 +725,7 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu) bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX); for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) { - pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i); + pmc = kvm_pmc_idx_to_pmc(pmu, i); if (!pmc) continue; @@ -801,7 +801,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu) pmu->pmc_in_use, X86_PMC_IDX_MAX); for_each_set_bit(i, bitmask, X86_PMC_IDX_MAX) { - pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i); + pmc = kvm_pmc_idx_to_pmc(pmu, i); if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc)) pmc_stop_counter(pmc); @@ -856,7 +856,7 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id) int i; for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) { - pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i); + pmc = kvm_pmc_idx_to_pmc(pmu, i); if (!pmc || !pmc_event_is_allowed(pmc)) continue; diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index e8c6a1f4b8e8..56e8e665e1af 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -4,6 +4,8 @@ #include +#include + #define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu) #define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu)) #define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu) @@ -21,7 +23,6 @@ #define KVM_FIXED_PMC_BASE_IDX INTEL_PMC_IDX_FIXED struct kvm_pmu_ops { - struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx); struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask); struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, u32 msr); @@ -56,6 +57,32 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu) return pmu->version > 1; } +/* + * KVM tracks all counters in 64-bit bitmaps, with general purpose counters + * mapped to bits 31:0 and fixed counters mapped to 63:32, e.g. fixed counter 0 + * is tracked internally via index 32. On Intel, (AMD doesn't support fixed + * counters), this mirrors how fixed counters are mapped to PERF_GLOBAL_CTRL + * and similar MSRs, i.e. tracking fixed counters at base index 32 reduces the + * amounter of boilerplate needed to iterate over PMCs *and* simplifies common + * enabling/disable/reset operations. + * + * WARNING! This helper is only for lookups that are initiated by KVM, it is + * NOT safe for guest lookups, e.g. will do the wrong thing if passed a raw + * ECX value from RDPMC (fixed counters are accessed by setting bit 30 in ECX + * for RDPMC, not by adding 32 to the fixed counter index). + */ +static inline struct kvm_pmc *kvm_pmc_idx_to_pmc(struct kvm_pmu *pmu, int idx) +{ + if (idx < pmu->nr_arch_gp_counters) + return &pmu->gp_counters[idx]; + + idx -= KVM_FIXED_PMC_BASE_IDX; + if (idx >= 0 && idx < pmu->nr_arch_fixed_counters) + return &pmu->fixed_counters[idx]; + + return NULL; +} + static inline u64 pmc_bitmask(struct kvm_pmc *pmc) { struct kvm_pmu *pmu = pmc_to_pmu(pmc); diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c index e886300f0f97..dfcc38bd97d3 100644 --- a/arch/x86/kvm/svm/pmu.c +++ b/arch/x86/kvm/svm/pmu.c @@ -25,7 +25,7 @@ enum pmu_type { PMU_TYPE_EVNTSEL, }; -static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) +static struct kvm_pmc *amd_pmu_get_pmc(struct kvm_pmu *pmu, int pmc_idx) { unsigned int num_counters = pmu->nr_arch_gp_counters; @@ -70,7 +70,7 @@ static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr, return NULL; } - return amd_pmc_idx_to_pmc(pmu, idx); + return amd_pmu_get_pmc(pmu, idx); } static int amd_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx) @@ -87,7 +87,7 @@ static int amd_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx) static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask) { - return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx); + return amd_pmu_get_pmc(vcpu_to_pmu(vcpu), idx); } static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr) @@ -229,7 +229,6 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu) } struct kvm_pmu_ops amd_pmu_ops __initdata = { - .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc, .msr_idx_to_pmc = amd_msr_idx_to_pmc, .check_rdpmc_early = amd_check_rdpmc_early, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index fe7a2ba51e1b..845a964f22a6 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -55,18 +55,6 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data) } } -static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) -{ - if (pmc_idx < KVM_FIXED_PMC_BASE_IDX) { - return get_gp_pmc(pmu, MSR_P6_EVNTSEL0 + pmc_idx, - MSR_P6_EVNTSEL0); - } else { - u32 idx = pmc_idx - KVM_FIXED_PMC_BASE_IDX; - - return get_fixed_pmc(pmu, idx + MSR_CORE_PERF_FIXED_CTR0); - } -} - static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask) { @@ -718,7 +706,7 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu) for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX) { - pmc = intel_pmc_idx_to_pmc(pmu, bit); + pmc = kvm_pmc_idx_to_pmc(pmu, bit); if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_globally_enabled(pmc) || !pmc->perf_event) @@ -735,7 +723,6 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu) } struct kvm_pmu_ops intel_pmu_ops __initdata = { - .pmc_idx_to_pmc = intel_pmc_idx_to_pmc, .rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc, .msr_idx_to_pmc = intel_msr_idx_to_pmc, .is_valid_msr = intel_is_valid_msr, -- cgit v1.2.3 From 004a0aa56edea3effc97bc90a620da1bcde5c63c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:51 -0800 Subject: KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming Refactor the handling of the reprogramming bitmap to snapshot and clear to-be-processed bits before doing the reprogramming, and then explicitly set bits for PMCs that need to be reprogrammed (again). This will allow adding a macro to iterate over all valid PMCs without having to add special handling for the reprogramming bit, which (a) can have bits set for non-existent PMCs and (b) needs to clear such bits to avoid wasting cycles in perpetuity. Note, the existing behavior of clearing bits after reprogramming does NOT have a race with kvm_vm_ioctl_set_pmu_event_filter(). Setting a new PMU filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that the vCPU isn't in the middle of reprogramming with a stale filter prior to setting the bitmap. Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/pmu.c | 52 +++++++++++++++++++++++------------------ 2 files changed, 30 insertions(+), 23 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b5b2d0fde579..ad5319a503f0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -536,6 +536,7 @@ struct kvm_pmc { #define KVM_PMC_MAX_FIXED 3 #define MSR_ARCH_PERFMON_FIXED_CTR_MAX (MSR_ARCH_PERFMON_FIXED_CTR0 + KVM_PMC_MAX_FIXED - 1) #define KVM_AMD_PMC_MAX_GENERIC 6 + struct kvm_pmu { u8 version; unsigned nr_arch_gp_counters; diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 0873937c90bc..81a0c6719863 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -444,7 +444,7 @@ static bool pmc_event_is_allowed(struct kvm_pmc *pmc) check_pmu_event_filter(pmc); } -static void reprogram_counter(struct kvm_pmc *pmc) +static int reprogram_counter(struct kvm_pmc *pmc) { struct kvm_pmu *pmu = pmc_to_pmu(pmc); u64 eventsel = pmc->eventsel; @@ -455,7 +455,7 @@ static void reprogram_counter(struct kvm_pmc *pmc) emulate_overflow = pmc_pause_counter(pmc); if (!pmc_event_is_allowed(pmc)) - goto reprogram_complete; + return 0; if (emulate_overflow) __kvm_perf_overflow(pmc, false); @@ -476,43 +476,49 @@ static void reprogram_counter(struct kvm_pmc *pmc) } if (pmc->current_config == new_config && pmc_resume_counter(pmc)) - goto reprogram_complete; + return 0; pmc_release_perf_event(pmc); pmc->current_config = new_config; - /* - * If reprogramming fails, e.g. due to contention, leave the counter's - * regprogram bit set, i.e. opportunistically try again on the next PMU - * refresh. Don't make a new request as doing so can stall the guest - * if reprogramming repeatedly fails. - */ - if (pmc_reprogram_counter(pmc, PERF_TYPE_RAW, - (eventsel & pmu->raw_event_mask), - !(eventsel & ARCH_PERFMON_EVENTSEL_USR), - !(eventsel & ARCH_PERFMON_EVENTSEL_OS), - eventsel & ARCH_PERFMON_EVENTSEL_INT)) - return; - -reprogram_complete: - clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi); + return pmc_reprogram_counter(pmc, PERF_TYPE_RAW, + (eventsel & pmu->raw_event_mask), + !(eventsel & ARCH_PERFMON_EVENTSEL_USR), + !(eventsel & ARCH_PERFMON_EVENTSEL_OS), + eventsel & ARCH_PERFMON_EVENTSEL_INT); } void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) { + DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX); struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); int bit; - for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) { + bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX); + + /* + * The reprogramming bitmap can be written asynchronously by something + * other than the task that holds vcpu->mutex, take care to clear only + * the bits that will actually processed. + */ + BUILD_BUG_ON(sizeof(bitmap) != sizeof(atomic64_t)); + atomic64_andnot(*(s64 *)bitmap, &pmu->__reprogram_pmi); + + for_each_set_bit(bit, bitmap, X86_PMC_IDX_MAX) { struct kvm_pmc *pmc = kvm_pmc_idx_to_pmc(pmu, bit); - if (unlikely(!pmc)) { - clear_bit(bit, pmu->reprogram_pmi); + if (unlikely(!pmc)) continue; - } - reprogram_counter(pmc); + /* + * If reprogramming fails, e.g. due to contention, re-set the + * regprogram bit set, i.e. opportunistically try again on the + * next PMU refresh. Don't make a new request as doing so can + * stall the guest if reprogramming repeatedly fails. + */ + if (reprogram_counter(pmc)) + set_bit(pmc->idx, pmu->reprogram_pmi); } /* -- cgit v1.2.3 From e5a65d4f723ab9997deab798c539e6bfd71f8440 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:52 -0800 Subject: KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops that iterate over valid PMCs given a bitmap (and because seeing checkpatch whine about bad macro style is always amusing). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 26 +++++++------------------- arch/x86/kvm/pmu.h | 6 ++++++ arch/x86/kvm/vmx/pmu_intel.c | 7 ++----- 3 files changed, 15 insertions(+), 24 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 81a0c6719863..7b5563ff259c 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -493,6 +493,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) { DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX); struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + struct kvm_pmc *pmc; int bit; bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX); @@ -505,12 +506,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) BUILD_BUG_ON(sizeof(bitmap) != sizeof(atomic64_t)); atomic64_andnot(*(s64 *)bitmap, &pmu->__reprogram_pmi); - for_each_set_bit(bit, bitmap, X86_PMC_IDX_MAX) { - struct kvm_pmc *pmc = kvm_pmc_idx_to_pmc(pmu, bit); - - if (unlikely(!pmc)) - continue; - + kvm_for_each_pmc(pmu, pmc, bit, bitmap) { /* * If reprogramming fails, e.g. due to contention, re-set the * regprogram bit set, i.e. opportunistically try again on the @@ -730,11 +726,7 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu) bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX); - for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) { - pmc = kvm_pmc_idx_to_pmc(pmu, i); - if (!pmc) - continue; - + kvm_for_each_pmc(pmu, pmc, i, pmu->all_valid_pmc_idx) { pmc_stop_counter(pmc); pmc->counter = 0; pmc->emulated_counter = 0; @@ -806,10 +798,8 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu) bitmap_andnot(bitmask, pmu->all_valid_pmc_idx, pmu->pmc_in_use, X86_PMC_IDX_MAX); - for_each_set_bit(i, bitmask, X86_PMC_IDX_MAX) { - pmc = kvm_pmc_idx_to_pmc(pmu, i); - - if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc)) + kvm_for_each_pmc(pmu, pmc, i, bitmask) { + if (pmc->perf_event && !pmc_speculative_in_use(pmc)) pmc_stop_counter(pmc); } @@ -861,10 +851,8 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id) struct kvm_pmc *pmc; int i; - for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) { - pmc = kvm_pmc_idx_to_pmc(pmu, i); - - if (!pmc || !pmc_event_is_allowed(pmc)) + kvm_for_each_pmc(pmu, pmc, i, pmu->all_valid_pmc_idx) { + if (!pmc_event_is_allowed(pmc)) continue; /* Ignore checks for edge detect, pin control, invert and CMASK bits */ diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 56e8e665e1af..fd18bc0b281c 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -83,6 +83,12 @@ static inline struct kvm_pmc *kvm_pmc_idx_to_pmc(struct kvm_pmu *pmu, int idx) return NULL; } +#define kvm_for_each_pmc(pmu, pmc, i, bitmap) \ + for_each_set_bit(i, bitmap, X86_PMC_IDX_MAX) \ + if (!(pmc = kvm_pmc_idx_to_pmc(pmu, i))) \ + continue; \ + else \ + static inline u64 pmc_bitmask(struct kvm_pmc *pmc) { struct kvm_pmu *pmu = pmc_to_pmu(pmc); diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 845a964f22a6..73b9943ceb17 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -704,11 +704,8 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu) struct kvm_pmc *pmc = NULL; int bit, hw_idx; - for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl, - X86_PMC_IDX_MAX) { - pmc = kvm_pmc_idx_to_pmc(pmu, bit); - - if (!pmc || !pmc_speculative_in_use(pmc) || + kvm_for_each_pmc(pmu, pmc, bit, (unsigned long *)&pmu->global_ctrl) { + if (!pmc_speculative_in_use(pmc) || !pmc_is_globally_enabled(pmc) || !pmc->perf_event) continue; -- cgit v1.2.3 From d2b321ea9380564510d281d45ccd2b424da14e7f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:53 -0800 Subject: KVM: x86/pmu: Process only enabled PMCs when emulating events in software Mask off disabled counters based on PERF_GLOBAL_CTRL *before* iterating over PMCs to emulate (branch) instruction required events in software. In the common case where the guest isn't utilizing the PMU, pre-checking for enabled counters turns a relatively expensive search into a few AND uops and a Jcc. Sadly, PMUs without PERF_GLOBAL_CTRL, e.g. most existing AMD CPUs, are out of luck as there is no way to check that a PMC isn't being used without checking the PMC's event selector. Cc: Konstantin Khorenko Link: https://lore.kernel.org/r/20231110022857.1273836-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 7b5563ff259c..c04c3f37a1b8 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -847,11 +847,20 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc) void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id) { + DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX); struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; int i; - kvm_for_each_pmc(pmu, pmc, i, pmu->all_valid_pmc_idx) { + BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX); + + if (!kvm_pmu_has_perf_global_ctrl(pmu)) + bitmap_copy(bitmap, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX); + else if (!bitmap_and(bitmap, pmu->all_valid_pmc_idx, + (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX)) + return; + + kvm_for_each_pmc(pmu, pmc, i, bitmap) { if (!pmc_event_is_allowed(pmc)) continue; -- cgit v1.2.3 From f19063b1ca058b3971d6883b5ec45955b7f53f9c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:54 -0800 Subject: KVM: x86/pmu: Snapshot event selectors that KVM emulates in software Snapshot the event selectors for the events that KVM emulates in software, which is currently instructions retired and branch instructions retired. The event selectors a tied to the underlying CPU, i.e. are constant for a given platform even though perf doesn't manage the mappings as such. Getting the event selectors from perf isn't exactly cheap, especially if mitigations are enabled, as at least one indirect call is involved. Snapshot the values in KVM instead of optimizing perf as working with the raw event selectors will be required if KVM ever wants to emulate events that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id" entry. Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 17 ++++++++--------- arch/x86/kvm/pmu.h | 13 ++++++++++++- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/x86.c | 6 +++--- 4 files changed, 24 insertions(+), 14 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index c04c3f37a1b8..f40d8c1edca9 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -29,6 +29,9 @@ struct x86_pmu_capability __read_mostly kvm_pmu_cap; EXPORT_SYMBOL_GPL(kvm_pmu_cap); +struct kvm_pmu_emulated_event_selectors __read_mostly kvm_pmu_eventsel; +EXPORT_SYMBOL_GPL(kvm_pmu_eventsel); + /* Precise Distribution of Instructions Retired (PDIR) */ static const struct x86_cpu_id vmx_pebs_pdir_cpu[] = { X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL), @@ -819,13 +822,6 @@ static void kvm_pmu_incr_counter(struct kvm_pmc *pmc) kvm_pmu_request_counter_reprogram(pmc); } -static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc, - unsigned int perf_hw_id) -{ - return !((pmc->eventsel ^ perf_get_hw_event_config(perf_hw_id)) & - AMD64_RAW_EVENT_MASK_NB); -} - static inline bool cpl_is_matched(struct kvm_pmc *pmc) { bool select_os, select_user; @@ -845,7 +841,7 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc) return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user; } -void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id) +void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel) { DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX); struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); @@ -865,7 +861,10 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id) continue; /* Ignore checks for edge detect, pin control, invert and CMASK bits */ - if (eventsel_match_perf_hw_id(pmc, perf_hw_id) && cpl_is_matched(pmc)) + if ((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB) + continue; + + if (cpl_is_matched(pmc)) kvm_pmu_incr_counter(pmc); } } diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index fd18bc0b281c..4d52b0b539ba 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -22,6 +22,11 @@ #define KVM_FIXED_PMC_BASE_IDX INTEL_PMC_IDX_FIXED +struct kvm_pmu_emulated_event_selectors { + u64 INSTRUCTIONS_RETIRED; + u64 BRANCH_INSTRUCTIONS_RETIRED; +}; + struct kvm_pmu_ops { struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu, unsigned int idx, u64 *mask); @@ -171,6 +176,7 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) } extern struct x86_pmu_capability kvm_pmu_cap; +extern struct kvm_pmu_emulated_event_selectors kvm_pmu_eventsel; static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops) { @@ -212,6 +218,11 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops) pmu_ops->MAX_NR_GP_COUNTERS); kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed, KVM_PMC_MAX_FIXED); + + kvm_pmu_eventsel.INSTRUCTIONS_RETIRED = + perf_get_hw_event_config(PERF_COUNT_HW_INSTRUCTIONS); + kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED = + perf_get_hw_event_config(PERF_COUNT_HW_BRANCH_INSTRUCTIONS); } static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc) @@ -259,7 +270,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu); void kvm_pmu_cleanup(struct kvm_vcpu *vcpu); void kvm_pmu_destroy(struct kvm_vcpu *vcpu); int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp); -void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 perf_hw_id); +void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel); bool is_vmware_backdoor_pmc(u32 pmc_idx); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 6329a306856b..994e014f8a50 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3606,7 +3606,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) return 1; } - kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_BRANCH_INSTRUCTIONS); + kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED); if (CC(evmptrld_status == EVMPTRLD_VMFAIL)) return nested_vmx_failInvalid(vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cbee277254f0..a30df9f8d9d5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8903,7 +8903,7 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu) if (unlikely(!r)) return 0; - kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS); + kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.INSTRUCTIONS_RETIRED); /* * rflags is the old, "raw" value of the flags. The new value has @@ -9216,9 +9216,9 @@ writeback: */ if (!ctxt->have_exception || exception_type(ctxt->exception.vector) == EXCPT_TRAP) { - kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS); + kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.INSTRUCTIONS_RETIRED); if (ctxt->is_branch) - kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_BRANCH_INSTRUCTIONS); + kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED); kvm_rip_write(vcpu, ctxt->eip); if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP))) r = kvm_vcpu_do_singlestep(vcpu); -- cgit v1.2.3 From afda2d7666f894d1d7b8406cf54801e6c11f63c2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:55 -0800 Subject: KVM: x86/pmu: Expand the comment about what bits are check emulating events Expand the comment about what bits are and aren't checked when emulating PMC events in software. As pointed out by Jim, AMD's mask includes bits 35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33) as well as reserved bits (34 and 45). Checking The IN_TX* bits is actually correct, as it's safe to assert that the vCPU can't be in an HLE/RTM transaction if KVM is emulating an instruction, i.e. KVM *shouldn't count if either of those bits is set. For the reserved bits, KVM is has equal odds of being right if Intel adds new behavior, i.e. ignoring them is just as likely to be correct as checking them. Opportunistically explain *why* the other flags aren't checked. Suggested-by: Jim Mattson Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index f40d8c1edca9..d891a954b45a 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -860,7 +860,20 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel) if (!pmc_event_is_allowed(pmc)) continue; - /* Ignore checks for edge detect, pin control, invert and CMASK bits */ + /* + * Ignore checks for edge detect (all events currently emulated + * but KVM are always rising edges), pin control (unsupported + * by modern CPUs), and counter mask and its invert flag (KVM + * doesn't emulate multiple events in a single clock cycle). + * + * Note, the uppermost nibble of AMD's mask overlaps Intel's + * IN_TX (bit 32) and IN_TXCP (bit 33), as well as two reserved + * bits (bits 35:34). Checking the "in HLE/RTM transaction" + * flags is correct as the vCPU can't be in a transaction if + * KVM is emulating an instruction. Checking the reserved bits + * might be wrong if they are defined in the future, but so + * could ignoring them, so do the simple thing for now. + */ if ((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB) continue; -- cgit v1.2.3 From e35529fb4ac930a4a39e0c15bafcb28a30d26611 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:56 -0800 Subject: KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired When triggering events, i.e. emulating PMC events in software, check for a matching event selector before checking the event is allowed. The "is allowed" check *might* be cheap, but it could also be very costly, e.g. if userspace has defined a large PMU event filter. The event selector check on the other hand is all but guaranteed to be <10 uops, e.g. looks something like: 0xffffffff8105e615 <+5>: movabs $0xf0000ffff,%rax 0xffffffff8105e61f <+15>: xor %rdi,%rsi 0xffffffff8105e622 <+18>: test %rax,%rsi 0xffffffff8105e625 <+21>: sete %al Link: https://lore.kernel.org/r/20231110022857.1273836-10-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index d891a954b45a..8d81f176ab7b 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -857,9 +857,6 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel) return; kvm_for_each_pmc(pmu, pmc, i, bitmap) { - if (!pmc_event_is_allowed(pmc)) - continue; - /* * Ignore checks for edge detect (all events currently emulated * but KVM are always rising edges), pin control (unsupported @@ -874,11 +871,11 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel) * might be wrong if they are defined in the future, but so * could ignoring them, so do the simple thing for now. */ - if ((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB) + if (((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB) || + !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc)) continue; - if (cpl_is_matched(pmc)) - kvm_pmu_incr_counter(pmc); + kvm_pmu_incr_counter(pmc); } } EXPORT_SYMBOL_GPL(kvm_pmu_trigger_event); -- cgit v1.2.3 From 83bdfe04c968e0fe3181e4cd41b764e17ac73911 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 9 Nov 2023 18:28:57 -0800 Subject: KVM: x86/pmu: Avoid CPL lookup if PMC enabline for USER and KERNEL is the same Don't bother querying the CPL if a PMC is (not) counting for both USER and KERNEL, i.e. if the end result is guaranteed to be the same regardless of the CPL. Querying the CPL on Intel requires a VMREAD, i.e. isn't free, and a single CMP+Jcc is cheap. Link: https://lore.kernel.org/r/20231110022857.1273836-11-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/pmu.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 8d81f176ab7b..c397b28e3d1b 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -838,6 +838,13 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc) select_user = config & 0x2; } + /* + * Skip the CPL lookup, which isn't free on Intel, if the result will + * be the same regardless of the CPL. + */ + if (select_os == select_user) + return select_os; + return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user; } -- cgit v1.2.3 From e1dda3afe2a9f466940d44db8baaaf6c0ff8793f Mon Sep 17 00:00:00 2001 From: Mathias Krause Date: Sat, 3 Feb 2024 13:45:22 +0100 Subject: KVM: x86: Fix broken debugregs ABI for 32 bit kernels The ioctl()s to get and set KVM's debug registers are broken for 32 bit kernels as they'd only copy half of the user register state because of a UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4 bytes). This makes it impossible for userland to set anything but DR0 without resorting to bit folding tricks. Switch to a loop for copying debug registers that'll implicitly do the type conversion for us, if needed. There are likely no users (left) for 32bit KVM, fix the bug nonetheless. Fixes: a1efbe77c1fd ("KVM: x86: Add support for saving&restoring debug registers") Signed-off-by: Mathias Krause Link: https://lore.kernel.org/r/20240203124522.592778-4-minipli@grsecurity.net Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 709844fcd521..7eb7736cfd76 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5510,9 +5510,14 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, struct kvm_debugregs *dbgregs) { unsigned long val; + unsigned int i; memset(dbgregs, 0, sizeof(*dbgregs)); - memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db)); + + BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); + for (i = 0; i < ARRAY_SIZE(vcpu->arch.db); i++) + dbgregs->db[i] = vcpu->arch.db[i]; + kvm_get_dr(vcpu, 6, &val); dbgregs->dr6 = val; dbgregs->dr7 = vcpu->arch.dr7; @@ -5521,6 +5526,8 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, struct kvm_debugregs *dbgregs) { + unsigned int i; + if (dbgregs->flags) return -EINVAL; @@ -5529,7 +5536,9 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, if (!kvm_dr7_valid(dbgregs->dr7)) return -EINVAL; - memcpy(vcpu->arch.db, dbgregs->db, sizeof(vcpu->arch.db)); + for (i = 0; i < ARRAY_SIZE(vcpu->arch.db); i++) + vcpu->arch.db[i] = dbgregs->db[i]; + kvm_update_dr0123(vcpu); vcpu->arch.dr6 = dbgregs->dr6; vcpu->arch.dr7 = dbgregs->dr7; -- cgit v1.2.3 From d7f0a00e438d2275d398536d78aab4097a3aa25e Mon Sep 17 00:00:00 2001 From: Chao Gao Date: Fri, 29 Dec 2023 10:26:52 +0800 Subject: KVM: VMX: Report up-to-date exit qualification to userspace Use vmx_get_exit_qual() to read the exit qualification. vcpu->arch.exit_qualification is cached for EPT violation only and even for EPT violation, it is stale at this point because the up-to-date value is cached later in handle_ept_violation(). Fixes: 70bcd708dfd1 ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits") Signed-off-by: Chao Gao Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e262bc2ba4e5..d4e6625e0a9a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6509,7 +6509,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV; vcpu->run->internal.data[0] = vectoring_info; vcpu->run->internal.data[1] = exit_reason.full; - vcpu->run->internal.data[2] = vcpu->arch.exit_qualification; + vcpu->run->internal.data[2] = vmx_get_exit_qual(vcpu); if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG) { vcpu->run->internal.data[ndata++] = vmcs_read64(GUEST_PHYSICAL_ADDRESS); -- cgit v1.2.3 From 03f6298c7cf6d2c1ccd0961ab9b340502d63840a Mon Sep 17 00:00:00 2001 From: Thomas Prescher Date: Tue, 12 Dec 2023 10:59:37 +0100 Subject: KVM: x86/emulator: emulate movbe with operand-size prefix The MOVBE instruction can come with an operand-size prefix (66h). In this, case the x86 emulation code returns EMULATION_FAILED. It turns out that em_movbe can already handle this case and all that is missing is an entry in respective opcode tables to populate gprefix->pfx_66. Signed-off-by: Thomas Prescher Signed-off-by: Julian Stecklina Acked-by: Borislav Petkov (AMD) Link: https://lore.kernel.org/r/20231212095938.26731-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e223043ef5b2..c75924f4f737 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4505,11 +4505,11 @@ static const struct instr_dual instr_dual_0f_38_f1 = { }; static const struct gprefix three_byte_0f_38_f0 = { - ID(0, &instr_dual_0f_38_f0), N, N, N + ID(0, &instr_dual_0f_38_f0), ID(0, &instr_dual_0f_38_f0), N, N }; static const struct gprefix three_byte_0f_38_f1 = { - ID(0, &instr_dual_0f_38_f1), N, N, N + ID(0, &instr_dual_0f_38_f1), ID(0, &instr_dual_0f_38_f1), N, N }; /* -- cgit v1.2.3 From 6fd1e3963f20e850d7b3c89485e58d1ae79c309a Mon Sep 17 00:00:00 2001 From: Julian Stecklina Date: Mon, 9 Oct 2023 11:20:53 +0200 Subject: KVM: x86: Clean up partially uninitialized integer in emulate_pop() Explicitly zero out variables passed to emulate_pop() as output params to harden against consuming uninitialized data, and to make sanitizers happy. Many flows that use emulate_pop() pass an "unsigned long" so as to be able to hold the largest possible operand, but the actual number of bytes written is usually the word with of the vCPU. E.g. if the vCPU is in 16-bit or 32-bit mode (on a 64-bit host), the upper portion of the output param will be uninitialized. Passing around the uninitialized data is benign, as actual KVM usage of the output is also tied to the word width, but passing around uninitialized data makes some sanitizers rightly complain. Note, initializing the data in emulate_pop() is not a safe alternative, e.g. it would result in em_leave() clobbering RBP[31:16] if LEAVE were emulated with a 16-bit stack. Signed-off-by: Julian Stecklina Link: https://lore.kernel.org/r/20231009092054.556935-1-julian.stecklina@cyberus-technology.de [sean: massage changelog, drop em_popa() variable size change]] Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c75924f4f737..711b3250ebaa 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1863,7 +1863,8 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt, void *dest, int len) { int rc; - unsigned long val, change_mask; + unsigned long val = 0; + unsigned long change_mask; int iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> X86_EFLAGS_IOPL_BIT; int cpl = ctxt->ops->cpl(ctxt); @@ -1954,7 +1955,7 @@ static int em_push_sreg(struct x86_emulate_ctxt *ctxt) static int em_pop_sreg(struct x86_emulate_ctxt *ctxt) { int seg = ctxt->src2.val; - unsigned long selector; + unsigned long selector = 0; int rc; rc = emulate_pop(ctxt, &selector, 2); @@ -2000,7 +2001,7 @@ static int em_popa(struct x86_emulate_ctxt *ctxt) { int rc = X86EMUL_CONTINUE; int reg = VCPU_REGS_RDI; - u32 val; + u32 val = 0; while (reg >= VCPU_REGS_RAX) { if (reg == VCPU_REGS_RSP) { @@ -2229,7 +2230,7 @@ static int em_cmpxchg8b(struct x86_emulate_ctxt *ctxt) static int em_ret(struct x86_emulate_ctxt *ctxt) { int rc; - unsigned long eip; + unsigned long eip = 0; rc = emulate_pop(ctxt, &eip, ctxt->op_bytes); if (rc != X86EMUL_CONTINUE) @@ -2241,7 +2242,8 @@ static int em_ret(struct x86_emulate_ctxt *ctxt) static int em_ret_far(struct x86_emulate_ctxt *ctxt) { int rc; - unsigned long eip, cs; + unsigned long eip = 0; + unsigned long cs = 0; int cpl = ctxt->ops->cpl(ctxt); struct desc_struct new_desc; @@ -3184,7 +3186,7 @@ fail: static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt) { int rc; - unsigned long eip; + unsigned long eip = 0; rc = emulate_pop(ctxt, &eip, ctxt->op_bytes); if (rc != X86EMUL_CONTINUE) -- cgit v1.2.3 From 64435aaa4a6aca019fabca7df45e0782d98a477b Mon Sep 17 00:00:00 2001 From: Julian Stecklina Date: Mon, 9 Oct 2023 11:20:54 +0200 Subject: KVM: x86: rename push to emulate_push for consistency push and emulate_pop are counterparts. Rename push to emulate_push and harmonize its function signature with emulate_pop. This should remove a bit of cognitive load when reading this code. Signed-off-by: Julian Stecklina Link: https://lore.kernel.org/r/20231009092054.556935-2-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 711b3250ebaa..eafb48f0b745 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1820,22 +1820,22 @@ static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op) return X86EMUL_CONTINUE; } -static int push(struct x86_emulate_ctxt *ctxt, void *data, int bytes) +static int emulate_push(struct x86_emulate_ctxt *ctxt, const void *data, int len) { struct segmented_address addr; - rsp_increment(ctxt, -bytes); + rsp_increment(ctxt, -len); addr.ea = reg_read(ctxt, VCPU_REGS_RSP) & stack_mask(ctxt); addr.seg = VCPU_SREG_SS; - return segmented_write(ctxt, addr, data, bytes); + return segmented_write(ctxt, addr, data, len); } static int em_push(struct x86_emulate_ctxt *ctxt) { /* Disable writeback. */ ctxt->dst.type = OP_NONE; - return push(ctxt, &ctxt->src.val, ctxt->op_bytes); + return emulate_push(ctxt, &ctxt->src.val, ctxt->op_bytes); } static int emulate_pop(struct x86_emulate_ctxt *ctxt, @@ -1921,7 +1921,7 @@ static int em_enter(struct x86_emulate_ctxt *ctxt) return X86EMUL_UNHANDLEABLE; rbp = reg_read(ctxt, VCPU_REGS_RBP); - rc = push(ctxt, &rbp, stack_size(ctxt)); + rc = emulate_push(ctxt, &rbp, stack_size(ctxt)); if (rc != X86EMUL_CONTINUE) return rc; assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP), -- cgit v1.2.3 From 882dd4aee36bcf4e734b9c7fb5c17b387d71241d Mon Sep 17 00:00:00 2001 From: Dionna Glaze Date: Thu, 7 Dec 2023 00:11:32 +0000 Subject: kvm: x86: use a uapi-friendly macro for BIT Change uapi header uses of BIT to instead use the uapi/linux/const.h bit macros, since BIT is not defined in uapi headers. The PMU mask uses _BITUL since it targets a 32 bit flag field, whereas the longmode definition is meant for a 64 bit flag field. Cc: Sean Christophersen Cc: Paolo Bonzini Signed-off-by: Dionna Glaze Message-Id: <20231207001142.3617856-1-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/uapi/asm/kvm.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index a448d0964fc0..9ae91a21ffea 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -7,6 +7,7 @@ * */ +#include #include #include #include @@ -526,7 +527,7 @@ struct kvm_pmu_event_filter { #define KVM_PMU_EVENT_ALLOW 0 #define KVM_PMU_EVENT_DENY 1 -#define KVM_PMU_EVENT_FLAG_MASKED_EVENTS BIT(0) +#define KVM_PMU_EVENT_FLAG_MASKED_EVENTS _BITUL(0) #define KVM_PMU_EVENT_FLAGS_VALID_MASK (KVM_PMU_EVENT_FLAG_MASKED_EVENTS) /* @@ -552,7 +553,7 @@ struct kvm_pmu_event_filter { (GENMASK_ULL(7, 0) | GENMASK_ULL(35, 32)) #define KVM_PMU_MASKED_ENTRY_UMASK_MASK (GENMASK_ULL(63, 56)) #define KVM_PMU_MASKED_ENTRY_UMASK_MATCH (GENMASK_ULL(15, 8)) -#define KVM_PMU_MASKED_ENTRY_EXCLUDE (BIT_ULL(55)) +#define KVM_PMU_MASKED_ENTRY_EXCLUDE (_BITULL(55)) #define KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT (56) /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */ @@ -560,7 +561,7 @@ struct kvm_pmu_event_filter { #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */ /* x86-specific KVM_EXIT_HYPERCALL flags. */ -#define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0) +#define KVM_EXIT_HYPERCALL_LONG_MODE _BITULL(0) #define KVM_X86_DEFAULT_VM 0 #define KVM_X86_SW_PROTECTED_VM 1 -- cgit v1.2.3 From 458822416a88c66455eec884649eee12bdc53cb4 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 12 Dec 2023 11:26:53 -0500 Subject: kvm: x86: use a uapi-friendly macro for GENMASK Change uapi header uses of GENMASK to instead use the uapi/linux/bits.h bit macros, since GENMASK is not defined in uapi headers. Signed-off-by: Paolo Bonzini --- arch/arm64/include/uapi/asm/kvm.h | 8 ++++---- arch/x86/include/uapi/asm/kvm.h | 7 ++++--- arch/x86/include/uapi/asm/kvm_para.h | 2 +- 3 files changed, 9 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 89d2fc872d9f..6b8b57b97228 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -76,11 +76,11 @@ struct kvm_regs { /* KVM_ARM_SET_DEVICE_ADDR ioctl id encoding */ #define KVM_ARM_DEVICE_TYPE_SHIFT 0 -#define KVM_ARM_DEVICE_TYPE_MASK GENMASK(KVM_ARM_DEVICE_TYPE_SHIFT + 15, \ - KVM_ARM_DEVICE_TYPE_SHIFT) +#define KVM_ARM_DEVICE_TYPE_MASK __GENMASK(KVM_ARM_DEVICE_TYPE_SHIFT + 15, \ + KVM_ARM_DEVICE_TYPE_SHIFT) #define KVM_ARM_DEVICE_ID_SHIFT 16 -#define KVM_ARM_DEVICE_ID_MASK GENMASK(KVM_ARM_DEVICE_ID_SHIFT + 15, \ - KVM_ARM_DEVICE_ID_SHIFT) +#define KVM_ARM_DEVICE_ID_MASK __GENMASK(KVM_ARM_DEVICE_ID_SHIFT + 15, \ + KVM_ARM_DEVICE_ID_SHIFT) /* Supported device IDs */ #define KVM_ARM_DEVICE_VGIC_V2 0 diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 9ae91a21ffea..bd36d74b593b 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -8,6 +8,7 @@ */ #include +#include #include #include #include @@ -550,9 +551,9 @@ struct kvm_pmu_event_filter { ((__u64)(!!(exclude)) << 55)) #define KVM_PMU_MASKED_ENTRY_EVENT_SELECT \ - (GENMASK_ULL(7, 0) | GENMASK_ULL(35, 32)) -#define KVM_PMU_MASKED_ENTRY_UMASK_MASK (GENMASK_ULL(63, 56)) -#define KVM_PMU_MASKED_ENTRY_UMASK_MATCH (GENMASK_ULL(15, 8)) + (__GENMASK_ULL(7, 0) | __GENMASK_ULL(35, 32)) +#define KVM_PMU_MASKED_ENTRY_UMASK_MASK (__GENMASK_ULL(63, 56)) +#define KVM_PMU_MASKED_ENTRY_UMASK_MATCH (__GENMASK_ULL(15, 8)) #define KVM_PMU_MASKED_ENTRY_EXCLUDE (_BITULL(55)) #define KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT (56) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 6e64b27b2c1e..6bc3456a8ebf 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -92,7 +92,7 @@ struct kvm_clock_pairing { #define KVM_ASYNC_PF_DELIVERY_AS_INT (1 << 3) /* MSR_KVM_ASYNC_PF_INT */ -#define KVM_ASYNC_PF_VEC_MASK GENMASK(7, 0) +#define KVM_ASYNC_PF_VEC_MASK __GENMASK(7, 0) /* MSR_KVM_MIGRATION_CONTROL */ #define KVM_MIGRATION_READY (1 << 0) -- cgit v1.2.3 From bcac0477277e44b0b4b5c93b24159d86eeff566a Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 11 Jan 2024 03:19:08 -0500 Subject: KVM: x86: move x86-specific structs to uapi/asm/kvm.h Several capabilities that exist only on x86 nevertheless have their structs defined in include/uapi/linux/kvm.h. Move them to arch/x86/include/uapi/asm/kvm.h for cleanliness. Signed-off-by: Paolo Bonzini --- arch/x86/include/uapi/asm/kvm.h | 262 +++++++++++++++++++++++++++++++++++++++ include/uapi/linux/kvm.h | 265 ---------------------------------------- 2 files changed, 262 insertions(+), 265 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index bd36d74b593b..bf68e56fd484 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -531,6 +531,268 @@ struct kvm_pmu_event_filter { #define KVM_PMU_EVENT_FLAG_MASKED_EVENTS _BITUL(0) #define KVM_PMU_EVENT_FLAGS_VALID_MASK (KVM_PMU_EVENT_FLAG_MASKED_EVENTS) +/* for KVM_CAP_MCE */ +struct kvm_x86_mce { + __u64 status; + __u64 addr; + __u64 misc; + __u64 mcg_status; + __u8 bank; + __u8 pad1[7]; + __u64 pad2[3]; +}; + +/* for KVM_CAP_XEN_HVM */ +#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0) +#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1) +#define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) +#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3) +#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) +#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) +#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6) +#define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7) + +struct kvm_xen_hvm_config { + __u32 flags; + __u32 msr; + __u64 blob_addr_32; + __u64 blob_addr_64; + __u8 blob_size_32; + __u8 blob_size_64; + __u8 pad2[30]; +}; + +struct kvm_xen_hvm_attr { + __u16 type; + __u16 pad[3]; + union { + __u8 long_mode; + __u8 vector; + __u8 runstate_update_flag; + struct { + __u64 gfn; +#define KVM_XEN_INVALID_GFN ((__u64)-1) + } shared_info; + struct { + __u32 send_port; + __u32 type; /* EVTCHNSTAT_ipi / EVTCHNSTAT_interdomain */ + __u32 flags; +#define KVM_XEN_EVTCHN_DEASSIGN (1 << 0) +#define KVM_XEN_EVTCHN_UPDATE (1 << 1) +#define KVM_XEN_EVTCHN_RESET (1 << 2) + /* + * Events sent by the guest are either looped back to + * the guest itself (potentially on a different port#) + * or signalled via an eventfd. + */ + union { + struct { + __u32 port; + __u32 vcpu; + __u32 priority; + } port; + struct { + __u32 port; /* Zero for eventfd */ + __s32 fd; + } eventfd; + __u32 padding[4]; + } deliver; + } evtchn; + __u32 xen_version; + __u64 pad[8]; + } u; +}; + + +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */ +#define KVM_XEN_ATTR_TYPE_LONG_MODE 0x0 +#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x1 +#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR 0x2 +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ +#define KVM_XEN_ATTR_TYPE_EVTCHN 0x3 +#define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4 +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */ +#define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG 0x5 + +struct kvm_xen_vcpu_attr { + __u16 type; + __u16 pad[3]; + union { + __u64 gpa; +#define KVM_XEN_INVALID_GPA ((__u64)-1) + __u64 pad[8]; + struct { + __u64 state; + __u64 state_entry_time; + __u64 time_running; + __u64 time_runnable; + __u64 time_blocked; + __u64 time_offline; + } runstate; + __u32 vcpu_id; + struct { + __u32 port; + __u32 priority; + __u64 expires_ns; + } timer; + __u8 vector; + } u; +}; + +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */ +#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO 0x0 +#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO 0x1 +#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR 0x2 +#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT 0x3 +#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA 0x4 +#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST 0x5 +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ +#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 0x6 +#define KVM_XEN_VCPU_ATTR_TYPE_TIMER 0x7 +#define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 0x8 + +/* Secure Encrypted Virtualization command */ +enum sev_cmd_id { + /* Guest initialization commands */ + KVM_SEV_INIT = 0, + KVM_SEV_ES_INIT, + /* Guest launch commands */ + KVM_SEV_LAUNCH_START, + KVM_SEV_LAUNCH_UPDATE_DATA, + KVM_SEV_LAUNCH_UPDATE_VMSA, + KVM_SEV_LAUNCH_SECRET, + KVM_SEV_LAUNCH_MEASURE, + KVM_SEV_LAUNCH_FINISH, + /* Guest migration commands (outgoing) */ + KVM_SEV_SEND_START, + KVM_SEV_SEND_UPDATE_DATA, + KVM_SEV_SEND_UPDATE_VMSA, + KVM_SEV_SEND_FINISH, + /* Guest migration commands (incoming) */ + KVM_SEV_RECEIVE_START, + KVM_SEV_RECEIVE_UPDATE_DATA, + KVM_SEV_RECEIVE_UPDATE_VMSA, + KVM_SEV_RECEIVE_FINISH, + /* Guest status and debug commands */ + KVM_SEV_GUEST_STATUS, + KVM_SEV_DBG_DECRYPT, + KVM_SEV_DBG_ENCRYPT, + /* Guest certificates commands */ + KVM_SEV_CERT_EXPORT, + /* Attestation report */ + KVM_SEV_GET_ATTESTATION_REPORT, + /* Guest Migration Extension */ + KVM_SEV_SEND_CANCEL, + + KVM_SEV_NR_MAX, +}; + +struct kvm_sev_cmd { + __u32 id; + __u64 data; + __u32 error; + __u32 sev_fd; +}; + +struct kvm_sev_launch_start { + __u32 handle; + __u32 policy; + __u64 dh_uaddr; + __u32 dh_len; + __u64 session_uaddr; + __u32 session_len; +}; + +struct kvm_sev_launch_update_data { + __u64 uaddr; + __u32 len; +}; + + +struct kvm_sev_launch_secret { + __u64 hdr_uaddr; + __u32 hdr_len; + __u64 guest_uaddr; + __u32 guest_len; + __u64 trans_uaddr; + __u32 trans_len; +}; + +struct kvm_sev_launch_measure { + __u64 uaddr; + __u32 len; +}; + +struct kvm_sev_guest_status { + __u32 handle; + __u32 policy; + __u32 state; +}; + +struct kvm_sev_dbg { + __u64 src_uaddr; + __u64 dst_uaddr; + __u32 len; +}; + +struct kvm_sev_attestation_report { + __u8 mnonce[16]; + __u64 uaddr; + __u32 len; +}; + +struct kvm_sev_send_start { + __u32 policy; + __u64 pdh_cert_uaddr; + __u32 pdh_cert_len; + __u64 plat_certs_uaddr; + __u32 plat_certs_len; + __u64 amd_certs_uaddr; + __u32 amd_certs_len; + __u64 session_uaddr; + __u32 session_len; +}; + +struct kvm_sev_send_update_data { + __u64 hdr_uaddr; + __u32 hdr_len; + __u64 guest_uaddr; + __u32 guest_len; + __u64 trans_uaddr; + __u32 trans_len; +}; + +struct kvm_sev_receive_start { + __u32 handle; + __u32 policy; + __u64 pdh_uaddr; + __u32 pdh_len; + __u64 session_uaddr; + __u32 session_len; +}; + +struct kvm_sev_receive_update_data { + __u64 hdr_uaddr; + __u32 hdr_len; + __u64 guest_uaddr; + __u32 guest_len; + __u64 trans_uaddr; + __u32 trans_len; +}; + +#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0) +#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1) + +struct kvm_hyperv_eventfd { + __u32 conn_id; + __s32 fd; + __u32 flags; + __u32 padding[3]; +}; + +#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff +#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) + /* * Masked event layout. * Bits Description diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index e107ea953585..9744e1891c16 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1224,40 +1224,6 @@ struct kvm_irq_routing { #endif -#ifdef KVM_CAP_MCE -/* x86 MCE */ -struct kvm_x86_mce { - __u64 status; - __u64 addr; - __u64 misc; - __u64 mcg_status; - __u8 bank; - __u8 pad1[7]; - __u64 pad2[3]; -}; -#endif - -#ifdef KVM_CAP_XEN_HVM -#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0) -#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1) -#define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) -#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3) -#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) -#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) -#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6) -#define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7) - -struct kvm_xen_hvm_config { - __u32 flags; - __u32 msr; - __u64 blob_addr_32; - __u64 blob_addr_64; - __u8 blob_size_32; - __u8 blob_size_64; - __u8 pad2[30]; -}; -#endif - #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0) /* * Available with KVM_CAP_IRQFD_RESAMPLE @@ -1737,58 +1703,6 @@ struct kvm_pv_cmd { #define KVM_XEN_HVM_GET_ATTR _IOWR(KVMIO, 0xc8, struct kvm_xen_hvm_attr) #define KVM_XEN_HVM_SET_ATTR _IOW(KVMIO, 0xc9, struct kvm_xen_hvm_attr) -struct kvm_xen_hvm_attr { - __u16 type; - __u16 pad[3]; - union { - __u8 long_mode; - __u8 vector; - __u8 runstate_update_flag; - struct { - __u64 gfn; -#define KVM_XEN_INVALID_GFN ((__u64)-1) - } shared_info; - struct { - __u32 send_port; - __u32 type; /* EVTCHNSTAT_ipi / EVTCHNSTAT_interdomain */ - __u32 flags; -#define KVM_XEN_EVTCHN_DEASSIGN (1 << 0) -#define KVM_XEN_EVTCHN_UPDATE (1 << 1) -#define KVM_XEN_EVTCHN_RESET (1 << 2) - /* - * Events sent by the guest are either looped back to - * the guest itself (potentially on a different port#) - * or signalled via an eventfd. - */ - union { - struct { - __u32 port; - __u32 vcpu; - __u32 priority; - } port; - struct { - __u32 port; /* Zero for eventfd */ - __s32 fd; - } eventfd; - __u32 padding[4]; - } deliver; - } evtchn; - __u32 xen_version; - __u64 pad[8]; - } u; -}; - - -/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */ -#define KVM_XEN_ATTR_TYPE_LONG_MODE 0x0 -#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x1 -#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR 0x2 -/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ -#define KVM_XEN_ATTR_TYPE_EVTCHN 0x3 -#define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4 -/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */ -#define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG 0x5 - /* Per-vCPU Xen attributes */ #define KVM_XEN_VCPU_GET_ATTR _IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr) #define KVM_XEN_VCPU_SET_ATTR _IOW(KVMIO, 0xcb, struct kvm_xen_vcpu_attr) @@ -1799,175 +1713,6 @@ struct kvm_xen_hvm_attr { #define KVM_GET_SREGS2 _IOR(KVMIO, 0xcc, struct kvm_sregs2) #define KVM_SET_SREGS2 _IOW(KVMIO, 0xcd, struct kvm_sregs2) -struct kvm_xen_vcpu_attr { - __u16 type; - __u16 pad[3]; - union { - __u64 gpa; -#define KVM_XEN_INVALID_GPA ((__u64)-1) - __u64 pad[8]; - struct { - __u64 state; - __u64 state_entry_time; - __u64 time_running; - __u64 time_runnable; - __u64 time_blocked; - __u64 time_offline; - } runstate; - __u32 vcpu_id; - struct { - __u32 port; - __u32 priority; - __u64 expires_ns; - } timer; - __u8 vector; - } u; -}; - -/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */ -#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO 0x0 -#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO 0x1 -#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR 0x2 -#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT 0x3 -#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA 0x4 -#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST 0x5 -/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ -#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 0x6 -#define KVM_XEN_VCPU_ATTR_TYPE_TIMER 0x7 -#define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 0x8 - -/* Secure Encrypted Virtualization command */ -enum sev_cmd_id { - /* Guest initialization commands */ - KVM_SEV_INIT = 0, - KVM_SEV_ES_INIT, - /* Guest launch commands */ - KVM_SEV_LAUNCH_START, - KVM_SEV_LAUNCH_UPDATE_DATA, - KVM_SEV_LAUNCH_UPDATE_VMSA, - KVM_SEV_LAUNCH_SECRET, - KVM_SEV_LAUNCH_MEASURE, - KVM_SEV_LAUNCH_FINISH, - /* Guest migration commands (outgoing) */ - KVM_SEV_SEND_START, - KVM_SEV_SEND_UPDATE_DATA, - KVM_SEV_SEND_UPDATE_VMSA, - KVM_SEV_SEND_FINISH, - /* Guest migration commands (incoming) */ - KVM_SEV_RECEIVE_START, - KVM_SEV_RECEIVE_UPDATE_DATA, - KVM_SEV_RECEIVE_UPDATE_VMSA, - KVM_SEV_RECEIVE_FINISH, - /* Guest status and debug commands */ - KVM_SEV_GUEST_STATUS, - KVM_SEV_DBG_DECRYPT, - KVM_SEV_DBG_ENCRYPT, - /* Guest certificates commands */ - KVM_SEV_CERT_EXPORT, - /* Attestation report */ - KVM_SEV_GET_ATTESTATION_REPORT, - /* Guest Migration Extension */ - KVM_SEV_SEND_CANCEL, - - KVM_SEV_NR_MAX, -}; - -struct kvm_sev_cmd { - __u32 id; - __u64 data; - __u32 error; - __u32 sev_fd; -}; - -struct kvm_sev_launch_start { - __u32 handle; - __u32 policy; - __u64 dh_uaddr; - __u32 dh_len; - __u64 session_uaddr; - __u32 session_len; -}; - -struct kvm_sev_launch_update_data { - __u64 uaddr; - __u32 len; -}; - - -struct kvm_sev_launch_secret { - __u64 hdr_uaddr; - __u32 hdr_len; - __u64 guest_uaddr; - __u32 guest_len; - __u64 trans_uaddr; - __u32 trans_len; -}; - -struct kvm_sev_launch_measure { - __u64 uaddr; - __u32 len; -}; - -struct kvm_sev_guest_status { - __u32 handle; - __u32 policy; - __u32 state; -}; - -struct kvm_sev_dbg { - __u64 src_uaddr; - __u64 dst_uaddr; - __u32 len; -}; - -struct kvm_sev_attestation_report { - __u8 mnonce[16]; - __u64 uaddr; - __u32 len; -}; - -struct kvm_sev_send_start { - __u32 policy; - __u64 pdh_cert_uaddr; - __u32 pdh_cert_len; - __u64 plat_certs_uaddr; - __u32 plat_certs_len; - __u64 amd_certs_uaddr; - __u32 amd_certs_len; - __u64 session_uaddr; - __u32 session_len; -}; - -struct kvm_sev_send_update_data { - __u64 hdr_uaddr; - __u32 hdr_len; - __u64 guest_uaddr; - __u32 guest_len; - __u64 trans_uaddr; - __u32 trans_len; -}; - -struct kvm_sev_receive_start { - __u32 handle; - __u32 policy; - __u64 pdh_uaddr; - __u32 pdh_len; - __u64 session_uaddr; - __u32 session_len; -}; - -struct kvm_sev_receive_update_data { - __u64 hdr_uaddr; - __u32 hdr_len; - __u64 guest_uaddr; - __u32 guest_len; - __u64 trans_uaddr; - __u32 trans_len; -}; - -#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0) -#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1) - /* Available with KVM_CAP_ARM_USER_IRQ */ /* Bits for run->s.regs.device_irq_level */ @@ -1975,16 +1720,6 @@ struct kvm_sev_receive_update_data { #define KVM_ARM_DEV_EL1_PTIMER (1 << 1) #define KVM_ARM_DEV_PMU (1 << 2) -struct kvm_hyperv_eventfd { - __u32 conn_id; - __s32 fd; - __u32 flags; - __u32 padding[3]; -}; - -#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff -#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) - #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0) #define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1) -- cgit v1.2.3 From 8886640dade4ae2595fcdce511c8bcc716aa47d3 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 11 Jan 2024 03:00:34 -0500 Subject: kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask unused definitions in include/uapi/linux/kvm.h. __KVM_HAVE_READONLY_MEM however was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly return nonzero. This however does not make sense, and it prevented userspace from supporting this architecture-independent feature without recompilation. Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and is only used in virt/kvm/kvm_main.c. Userspace does not need to test it and there should be no need for it to exist. Remove it and replace it with a Kconfig symbol within Linux source code. Signed-off-by: Paolo Bonzini --- arch/arm64/include/uapi/asm/kvm.h | 1 - arch/arm64/kvm/Kconfig | 1 + arch/loongarch/include/uapi/asm/kvm.h | 2 -- arch/loongarch/kvm/Kconfig | 1 + arch/mips/include/uapi/asm/kvm.h | 2 -- arch/mips/kvm/Kconfig | 1 + arch/riscv/include/uapi/asm/kvm.h | 1 - arch/riscv/kvm/Kconfig | 1 + arch/x86/include/uapi/asm/kvm.h | 1 - arch/x86/kvm/Kconfig | 1 + virt/kvm/Kconfig | 3 +++ virt/kvm/kvm_main.c | 2 +- 12 files changed, 9 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 75809c8dc2f0..9c1040dad4eb 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -39,7 +39,6 @@ #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_IRQ_LINE -#define __KVM_HAVE_READONLY_MEM #define __KVM_HAVE_VCPU_EVENTS #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 6c3c8ca73e7f..114626b816f7 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -36,6 +36,7 @@ menuconfig KVM select HAVE_KVM_IRQ_ROUTING select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS + select HAVE_KVM_READONLY_MEM select HAVE_KVM_VCPU_RUN_PID_CHANGE select SCHED_INFO select GUEST_PERF_EVENTS if PERF_EVENTS diff --git a/arch/loongarch/include/uapi/asm/kvm.h b/arch/loongarch/include/uapi/asm/kvm.h index 923d0bd38294..109785922cf9 100644 --- a/arch/loongarch/include/uapi/asm/kvm.h +++ b/arch/loongarch/include/uapi/asm/kvm.h @@ -14,8 +14,6 @@ * Some parts derived from the x86 version of this file. */ -#define __KVM_HAVE_READONLY_MEM - #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 64 diff --git a/arch/loongarch/kvm/Kconfig b/arch/loongarch/kvm/Kconfig index 61f7e33b1f95..0768b45422a3 100644 --- a/arch/loongarch/kvm/Kconfig +++ b/arch/loongarch/kvm/Kconfig @@ -28,6 +28,7 @@ config KVM select KVM_GENERIC_HARDWARE_ENABLING select KVM_GENERIC_MMU_NOTIFIER select KVM_MMIO + select HAVE_KVM_READONLY_MEM select KVM_XFER_TO_GUEST_WORK help Support hosting virtualized guest machines using diff --git a/arch/mips/include/uapi/asm/kvm.h b/arch/mips/include/uapi/asm/kvm.h index edcf717c4327..9673dc9cb315 100644 --- a/arch/mips/include/uapi/asm/kvm.h +++ b/arch/mips/include/uapi/asm/kvm.h @@ -20,8 +20,6 @@ * Some parts derived from the x86 version of this file. */ -#define __KVM_HAVE_READONLY_MEM - #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 /* diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig index 18e7a17d5115..8474bf9c689e 100644 --- a/arch/mips/kvm/Kconfig +++ b/arch/mips/kvm/Kconfig @@ -26,6 +26,7 @@ config KVM select KVM_MMIO select KVM_GENERIC_MMU_NOTIFIER select KVM_GENERIC_HARDWARE_ENABLING + select HAVE_KVM_READONLY_MEM help Support for hosting Guest kernels. diff --git a/arch/riscv/include/uapi/asm/kvm.h b/arch/riscv/include/uapi/asm/kvm.h index 7499e88a947c..fdf4ba066131 100644 --- a/arch/riscv/include/uapi/asm/kvm.h +++ b/arch/riscv/include/uapi/asm/kvm.h @@ -16,7 +16,6 @@ #include #define __KVM_HAVE_IRQ_LINE -#define __KVM_HAVE_READONLY_MEM #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig index d490db943858..26d1727f0550 100644 --- a/arch/riscv/kvm/Kconfig +++ b/arch/riscv/kvm/Kconfig @@ -24,6 +24,7 @@ config KVM select HAVE_KVM_IRQ_ROUTING select HAVE_KVM_MSI select HAVE_KVM_VCPU_ASYNC_IOCTL + select HAVE_KVM_READONLY_MEM select KVM_COMMON select KVM_GENERIC_DIRTYLOG_READ_PROTECT select KVM_GENERIC_HARDWARE_ENABLING diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index bf68e56fd484..c5e682ee7726 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -51,7 +51,6 @@ #define __KVM_HAVE_DEBUGREGS #define __KVM_HAVE_XSAVE #define __KVM_HAVE_XCRS -#define __KVM_HAVE_READONLY_MEM /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 87e3da7b0439..5895aee5dfef 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -32,6 +32,7 @@ config KVM select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS select HAVE_KVM_IRQ_ROUTING + select HAVE_KVM_READONLY_MEM select KVM_ASYNC_PF select USER_RETURN_NOTIFIER select KVM_MMIO diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 184dab4ee871..a11e9c80fac9 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -55,6 +55,9 @@ config KVM_ASYNC_PF_SYNC config HAVE_KVM_MSI bool +config HAVE_KVM_READONLY_MEM + bool + config HAVE_KVM_CPU_RELAX_INTERCEPT bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 10bfc88a69f7..ff588677beb7 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1614,7 +1614,7 @@ static int check_memory_region_flags(struct kvm *kvm, if (mem->flags & KVM_MEM_GUEST_MEMFD) valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES; -#ifdef __KVM_HAVE_READONLY_MEM +#ifdef CONFIG_HAVE_KVM_READONLY_MEM valid_flags |= KVM_MEM_READONLY; #endif -- cgit v1.2.3 From 6bda055d625860736f7ea5b4eda816f276899d8b Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 11 Jan 2024 03:12:59 -0500 Subject: KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally Since all architectures (for historical reasons) have to define struct kvm_guest_debug_arch, and since userspace has to check KVM_CHECK_EXTENSION(KVM_CAP_SET_GUEST_DEBUG) anyway, there is no advantage in masking the capability #define itself. Remove the #define __KVM_HAVE_GUEST_DEBUG from architecture-specific headers. Signed-off-by: Paolo Bonzini --- arch/arm64/include/uapi/asm/kvm.h | 1 - arch/powerpc/include/uapi/asm/kvm.h | 1 - arch/s390/include/uapi/asm/kvm.h | 1 - arch/x86/include/uapi/asm/kvm.h | 1 - include/uapi/linux/kvm.h | 7 +++++-- 5 files changed, 5 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 9c1040dad4eb..964df31da975 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -37,7 +37,6 @@ #include #include -#define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_IRQ_LINE #define __KVM_HAVE_VCPU_EVENTS diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index a96ef6b36bba..1691297a766a 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -28,7 +28,6 @@ #define __KVM_HAVE_PPC_SMT #define __KVM_HAVE_IRQCHIP #define __KVM_HAVE_IRQ_LINE -#define __KVM_HAVE_GUEST_DEBUG /* Not always available, but if it is, this is the correct offset. */ #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h index fee712ff1cf8..05eaf6db3ad4 100644 --- a/arch/s390/include/uapi/asm/kvm.h +++ b/arch/s390/include/uapi/asm/kvm.h @@ -12,7 +12,6 @@ #include #define __KVM_S390 -#define __KVM_HAVE_GUEST_DEBUG struct kvm_s390_skeys { __u64 start_gfn; diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index c5e682ee7726..0ad6bda1fc39 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -42,7 +42,6 @@ #define __KVM_HAVE_IRQ_LINE #define __KVM_HAVE_MSI #define __KVM_HAVE_USER_NMI -#define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_MSIX #define __KVM_HAVE_MCE #define __KVM_HAVE_PIT_STATE2 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 00d5cecd057d..2ad6aa81b03f 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -16,6 +16,11 @@ #define KVM_API_VERSION 12 +/* + * Backwards-compatible definitions. + */ +#define __KVM_HAVE_GUEST_DEBUG + /* for KVM_SET_USER_MEMORY_REGION */ struct kvm_userspace_memory_region { __u32 slot; @@ -682,9 +687,7 @@ struct kvm_enable_cap { /* Bug in KVM_SET_USER_MEMORY_REGION fixed: */ #define KVM_CAP_DESTROY_MEMORY_REGION_WORKS 21 #define KVM_CAP_USER_NMI 22 -#ifdef __KVM_HAVE_GUEST_DEBUG #define KVM_CAP_SET_GUEST_DEBUG 23 -#endif #ifdef __KVM_HAVE_PIT #define KVM_CAP_REINJECT_CONTROL 24 #endif -- cgit v1.2.3 From 61df71ee992d57ee2b7fdb802802c4268969409f Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 11 Jan 2024 03:02:05 -0500 Subject: kvm: move "select IRQ_BYPASS_MANAGER" to common code CONFIG_IRQ_BYPASS_MANAGER is a dependency of the common code included by CONFIG_HAVE_KVM_IRQ_BYPASS. There is no advantage in adding the corresponding "select" directive to each architecture. Signed-off-by: Paolo Bonzini --- arch/arm64/kvm/Kconfig | 1 - arch/powerpc/kvm/Kconfig | 1 - arch/x86/kvm/Kconfig | 1 - virt/kvm/Kconfig | 1 + 4 files changed, 1 insertion(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 6c3c8ca73e7f..46bdbd852857 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -34,7 +34,6 @@ menuconfig KVM select HAVE_KVM_MSI select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQ_ROUTING - select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS select HAVE_KVM_VCPU_RUN_PID_CHANGE select SCHED_INFO diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 074263429faf..dbfdc126bf14 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -22,7 +22,6 @@ config KVM select KVM_COMMON select HAVE_KVM_VCPU_ASYNC_IOCTL select KVM_VFIO - select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS config KVM_BOOK3S_HANDLER diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 87e3da7b0439..e7cbf011d766 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -29,7 +29,6 @@ config KVM select HAVE_KVM_PFNCACHE select HAVE_KVM_DIRTY_RING_TSO select HAVE_KVM_DIRTY_RING_ACQ_REL - select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS select HAVE_KVM_IRQ_ROUTING select KVM_ASYNC_PF diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 184dab4ee871..6408bc5dacd1 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -73,6 +73,7 @@ config KVM_COMPAT config HAVE_KVM_IRQ_BYPASS bool + select IRQ_BYPASS_MANAGER config HAVE_KVM_VCPU_ASYNC_IOCTL bool -- cgit v1.2.3 From dcf0926e9b899eca754a07c4064de69815b85a38 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 4 Jan 2024 15:15:43 -0500 Subject: x86: replace CONFIG_HAVE_KVM with IS_ENABLED(CONFIG_KVM) It is more accurate to check if KVM is enabled, instead of having the architecture say so. Architectures always "have" KVM, so for example checking CONFIG_HAVE_KVM in x86 code is pointless, but if KVM is disabled in a specific build, there is no need for support code. Alternatively, many of the #ifdefs could simply be deleted. However, this would add completely dead code. For example, when KVM is disabled, there should not be any posted interrupts, i.e. NOT wiring up the "dummy" handlers and treating IRQs on those vectors as spurious is the right thing to do. Cc: x86@kernel.org Cc: kbingham@kernel.org Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/hardirq.h | 2 +- arch/x86/include/asm/idtentry.h | 2 +- arch/x86/include/asm/irq.h | 2 +- arch/x86/include/asm/irq_vectors.h | 2 +- arch/x86/kernel/idt.c | 2 +- arch/x86/kernel/irq.c | 4 ++-- scripts/gdb/linux/constants.py.in | 6 +++++- scripts/gdb/linux/interrupts.py | 2 +- tools/arch/x86/include/asm/irq_vectors.h | 2 +- 9 files changed, 14 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index 66837b8c67f1..fbc7722b87d1 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -15,7 +15,7 @@ typedef struct { unsigned int irq_spurious_count; unsigned int icr_read_retry_count; #endif -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) unsigned int kvm_posted_intr_ipis; unsigned int kvm_posted_intr_wakeup_ipis; unsigned int kvm_posted_intr_nested_ipis; diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 13639e57e1f8..d9c86733d0db 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -675,7 +675,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR, sysvec_irq_work); # endif #endif -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR, sysvec_kvm_posted_intr_ipi); DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR, sysvec_kvm_posted_intr_wakeup_ipi); DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR, sysvec_kvm_posted_intr_nested_ipi); diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h index 836c170d3087..194dfff84cb1 100644 --- a/arch/x86/include/asm/irq.h +++ b/arch/x86/include/asm/irq.h @@ -29,7 +29,7 @@ struct irq_desc; extern void fixup_irqs(void); -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void)); #endif diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 3a19904c2db6..3f73ac3ed3a0 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -84,7 +84,7 @@ #define HYPERVISOR_CALLBACK_VECTOR 0xf3 /* Vector for KVM to deliver posted interrupt IPI */ -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) #define POSTED_INTR_VECTOR 0xf2 #define POSTED_INTR_WAKEUP_VECTOR 0xf1 #define POSTED_INTR_NESTED_VECTOR 0xf0 diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 660b601f1d6c..d2bc67cbaf92 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -153,7 +153,7 @@ static const __initconst struct idt_data apic_idts[] = { #ifdef CONFIG_X86_LOCAL_APIC INTG(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt), INTG(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi), -# ifdef CONFIG_HAVE_KVM +# if IS_ENABLED(CONFIG_KVM) INTG(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi), INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi), INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi), diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index 11761c124545..35fde0107901 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -164,7 +164,7 @@ int arch_show_interrupts(struct seq_file *p, int prec) #if defined(CONFIG_X86_IO_APIC) seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count)); #endif -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) seq_printf(p, "%*s: ", prec, "PIN"); for_each_online_cpu(j) seq_printf(p, "%10u ", irq_stats(j)->kvm_posted_intr_ipis); @@ -290,7 +290,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi) } #endif -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) static void dummy_handler(void) {} static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler; diff --git a/scripts/gdb/linux/constants.py.in b/scripts/gdb/linux/constants.py.in index e810e0c27ff1..5cace7588e24 100644 --- a/scripts/gdb/linux/constants.py.in +++ b/scripts/gdb/linux/constants.py.in @@ -130,7 +130,11 @@ LX_CONFIG(CONFIG_X86_MCE_THRESHOLD) LX_CONFIG(CONFIG_X86_MCE_AMD) LX_CONFIG(CONFIG_X86_MCE) LX_CONFIG(CONFIG_X86_IO_APIC) -LX_CONFIG(CONFIG_HAVE_KVM) +/* + * CONFIG_KVM can be "m" but it affects common code too. Use CONFIG_KVM_COMMON + * as a proxy for IS_ENABLED(CONFIG_KVM). + */ +LX_CONFIG_KVM = IS_BUILTIN(CONFIG_KVM_COMMON) LX_CONFIG(CONFIG_NUMA) LX_CONFIG(CONFIG_ARM64) LX_CONFIG(CONFIG_ARM64_4K_PAGES) diff --git a/scripts/gdb/linux/interrupts.py b/scripts/gdb/linux/interrupts.py index ef478e273791..66ae5c7690cf 100644 --- a/scripts/gdb/linux/interrupts.py +++ b/scripts/gdb/linux/interrupts.py @@ -151,7 +151,7 @@ def x86_show_interupts(prec): if cnt is not None: text += "%*s: %10u\n" % (prec, "MIS", cnt['counter']) - if constants.LX_CONFIG_HAVE_KVM: + if constants.LX_CONFIG_KVM: text += x86_show_irqstat(prec, "PIN", 'kvm_posted_intr_ipis', 'Posted-interrupt notification event') text += x86_show_irqstat(prec, "NPI", 'kvm_posted_intr_nested_ipis', 'Nested posted-interrupt event') text += x86_show_irqstat(prec, "PIW", 'kvm_posted_intr_wakeup_ipis', 'Posted-interrupt wakeup event') diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h index 3a19904c2db6..3f73ac3ed3a0 100644 --- a/tools/arch/x86/include/asm/irq_vectors.h +++ b/tools/arch/x86/include/asm/irq_vectors.h @@ -84,7 +84,7 @@ #define HYPERVISOR_CALLBACK_VECTOR 0xf3 /* Vector for KVM to deliver posted interrupt IPI */ -#ifdef CONFIG_HAVE_KVM +#if IS_ENABLED(CONFIG_KVM) #define POSTED_INTR_VECTOR 0xf2 #define POSTED_INTR_WAKEUP_VECTOR 0xf1 #define POSTED_INTR_NESTED_VECTOR 0xf0 -- cgit v1.2.3 From f48212ee8e78f765917dc64a7ff4bd13c7e71384 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 4 Jan 2024 11:34:25 -0500 Subject: treewide: remove CONFIG_HAVE_KVM It has no users anymore. Signed-off-by: Paolo Bonzini --- arch/arm64/Kconfig | 1 - arch/arm64/kvm/Kconfig | 1 - arch/loongarch/Kconfig | 1 - arch/loongarch/kvm/Kconfig | 1 - arch/mips/Kconfig | 9 --------- arch/s390/Kconfig | 1 - arch/s390/kvm/Kconfig | 1 - arch/x86/Kconfig | 1 - arch/x86/kvm/Kconfig | 2 -- virt/kvm/Kconfig | 3 --- 10 files changed, 21 deletions(-) (limited to 'arch/x86') diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index aa7c1d435139..f86383d9c771 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -216,7 +216,6 @@ config ARM64 select HAVE_HW_BREAKPOINT if PERF_EVENTS select HAVE_IOREMAP_PROT select HAVE_IRQ_TIME_ACCOUNTING - select HAVE_KVM select HAVE_MOD_ARCH_SPECIFIC select HAVE_NMI select HAVE_PERF_EVENTS diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 46bdbd852857..99193d3b8312 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -20,7 +20,6 @@ if VIRTUALIZATION menuconfig KVM bool "Kernel-based Virtual Machine (KVM) support" - depends on HAVE_KVM select KVM_COMMON select KVM_GENERIC_HARDWARE_ENABLING select KVM_GENERIC_MMU_NOTIFIER diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 929f68926b34..eb2139387a54 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -133,7 +133,6 @@ config LOONGARCH select HAVE_KPROBES select HAVE_KPROBES_ON_FTRACE select HAVE_KRETPROBES - select HAVE_KVM select HAVE_MOD_ARCH_SPECIFIC select HAVE_NMI select HAVE_PCI diff --git a/arch/loongarch/kvm/Kconfig b/arch/loongarch/kvm/Kconfig index 61f7e33b1f95..ac4de790dc31 100644 --- a/arch/loongarch/kvm/Kconfig +++ b/arch/loongarch/kvm/Kconfig @@ -20,7 +20,6 @@ if VIRTUALIZATION config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on AS_HAS_LVZ_EXTENSION - depends on HAVE_KVM select HAVE_KVM_DIRTY_RING_ACQ_REL select HAVE_KVM_VCPU_ASYNC_IOCTL select KVM_COMMON diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 3eb3239013d9..cf5bb1c756e6 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -1262,7 +1262,6 @@ config CPU_LOONGSON64 select MIPS_FP_SUPPORT select GPIOLIB select SWIOTLB - select HAVE_KVM help The Loongson GSx64(GS264/GS464/GS464E/GS464V) series of processor cores implements the MIPS64R2 instruction set with many extensions, @@ -1375,7 +1374,6 @@ config CPU_MIPS32_R2 select CPU_SUPPORTS_32BIT_KERNEL select CPU_SUPPORTS_HIGHMEM select CPU_SUPPORTS_MSA - select HAVE_KVM help Choose this option to build a kernel for release 2 or later of the MIPS32 architecture. Most modern embedded systems with a 32-bit @@ -1391,7 +1389,6 @@ config CPU_MIPS32_R5 select CPU_SUPPORTS_HIGHMEM select CPU_SUPPORTS_MSA select CPU_SUPPORTS_VZ - select HAVE_KVM select MIPS_O32_FP64_SUPPORT help Choose this option to build a kernel for release 5 or later of the @@ -1408,7 +1405,6 @@ config CPU_MIPS32_R6 select CPU_SUPPORTS_HIGHMEM select CPU_SUPPORTS_MSA select CPU_SUPPORTS_VZ - select HAVE_KVM select MIPS_O32_FP64_SUPPORT help Choose this option to build a kernel for release 6 or later of the @@ -1444,7 +1440,6 @@ config CPU_MIPS64_R2 select CPU_SUPPORTS_HIGHMEM select CPU_SUPPORTS_HUGEPAGES select CPU_SUPPORTS_MSA - select HAVE_KVM help Choose this option to build a kernel for release 2 or later of the MIPS64 architecture. Many modern embedded systems with a 64-bit @@ -1463,7 +1458,6 @@ config CPU_MIPS64_R5 select CPU_SUPPORTS_MSA select MIPS_O32_FP64_SUPPORT if 32BIT || MIPS32_O32 select CPU_SUPPORTS_VZ - select HAVE_KVM help Choose this option to build a kernel for release 5 or later of the MIPS64 architecture. This is a intermediate MIPS architecture @@ -1482,7 +1476,6 @@ config CPU_MIPS64_R6 select CPU_SUPPORTS_MSA select MIPS_O32_FP64_SUPPORT if 32BIT || MIPS32_O32 select CPU_SUPPORTS_VZ - select HAVE_KVM help Choose this option to build a kernel for release 6 or later of the MIPS64 architecture. New MIPS processors, starting with the Warrior @@ -1500,7 +1493,6 @@ config CPU_P5600 select CPU_SUPPORTS_VZ select CPU_MIPSR2_IRQ_VI select CPU_MIPSR2_IRQ_EI - select HAVE_KVM select MIPS_O32_FP64_SUPPORT help Choose this option to build a kernel for MIPS Warrior P5600 CPU. @@ -1621,7 +1613,6 @@ config CPU_CAVIUM_OCTEON select USB_OHCI_BIG_ENDIAN_MMIO if CPU_BIG_ENDIAN select MIPS_L1_CACHE_SHIFT_7 select CPU_SUPPORTS_VZ - select HAVE_KVM help The Cavium Octeon processor is a highly integrated chip containing many ethernet hardware widgets for networking tasks. The processor diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index fe565f3a3a91..a6409a312747 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -193,7 +193,6 @@ config S390 select HAVE_KPROBES select HAVE_KPROBES_ON_FTRACE select HAVE_KRETPROBES - select HAVE_KVM select HAVE_LIVEPATCH select HAVE_MEMBLOCK_PHYS_MAP select HAVE_MOD_ARCH_SPECIFIC diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig index 72e9b7dcdf7d..cae908d64550 100644 --- a/arch/s390/kvm/Kconfig +++ b/arch/s390/kvm/Kconfig @@ -19,7 +19,6 @@ if VIRTUALIZATION config KVM def_tristate y prompt "Kernel-based Virtual Machine (KVM) support" - depends on HAVE_KVM select HAVE_KVM_CPU_RELAX_INTERCEPT select HAVE_KVM_VCPU_ASYNC_IOCTL select KVM_ASYNC_PF diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5edec175b9bf..da140c3aed84 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -244,7 +244,6 @@ config X86 select HAVE_FUNCTION_ERROR_INJECTION select HAVE_KRETPROBES select HAVE_RETHOOK - select HAVE_KVM select HAVE_LIVEPATCH if X86_64 select HAVE_MIXED_BREAKPOINTS_REGS select HAVE_MOD_ARCH_SPECIFIC diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index e7cbf011d766..5692393b7565 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -7,7 +7,6 @@ source "virt/kvm/Kconfig" menuconfig VIRTUALIZATION bool "Virtualization" - depends on HAVE_KVM || X86 default y help Say Y here to get to see options for using your Linux host to run other @@ -20,7 +19,6 @@ if VIRTUALIZATION config KVM tristate "Kernel-based Virtual Machine (KVM) support" - depends on HAVE_KVM depends on HIGH_RES_TIMERS depends on X86_LOCAL_APIC select KVM_COMMON diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 6408bc5dacd1..aa8a7ee0f554 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -1,9 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 # KVM common configuration items and defaults -config HAVE_KVM - bool - config KVM_COMMON bool select EVENTFD -- cgit v1.2.3 From 4438355ec6e165f73357203e0792079dec9e3059 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:28:58 +0000 Subject: KVM: x86/xen: mark guest pages dirty with the pfncache lock held Sampling gpa and memslot from an unlocked pfncache may yield inconsistent values so, since there is no problem with calling mark_page_dirty_in_slot() with the pfncache lock held, relocate the calls in kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events() accordingly. Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-4-paul@xen.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/xen.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 4b4e738c6f1b..f3327508ae41 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -452,14 +452,13 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic) smp_wmb(); } - if (user_len2) + if (user_len2) { + mark_page_dirty_in_slot(v->kvm, gpc2->memslot, gpc2->gpa >> PAGE_SHIFT); read_unlock(&gpc2->lock); - - read_unlock_irqrestore(&gpc1->lock, flags); + } mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT); - if (user_len2) - mark_page_dirty_in_slot(v->kvm, gpc2->memslot, gpc2->gpa >> PAGE_SHIFT); + read_unlock_irqrestore(&gpc1->lock, flags); } void kvm_xen_update_runstate(struct kvm_vcpu *v, int state) @@ -565,13 +564,13 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v) : "0" (evtchn_pending_sel32)); WRITE_ONCE(vi->evtchn_upcall_pending, 1); } + + mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT); read_unlock_irqrestore(&gpc->lock, flags); /* For the per-vCPU lapic vector, deliver it as MSI. */ if (v->arch.xen.upcall_vector) kvm_xen_inject_vcpu_vector(v); - - mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT); } int __kvm_xen_has_interrupt(struct kvm_vcpu *v) -- cgit v1.2.3 From 78b74638eb6dffd9b24bc3b121556a9039292df6 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:28:59 +0000 Subject: KVM: pfncache: add a mark-dirty helper At the moment pages are marked dirty by open-coded calls to mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot from the cache. After a subsequent patch these may not always be set so add a helper now so that caller will protected from the need to know about this detail. Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org [sean: decrease indentation, use gpa_to_gfn()] Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 2 +- arch/x86/kvm/xen.c | 6 +++--- include/linux/kvm_host.h | 10 ++++++++++ 3 files changed, 14 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bf10a9073a09..f0f37c769a3a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3160,7 +3160,7 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v, guest_hv_clock->version = ++vcpu->hv_clock.version; - mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT); + kvm_gpc_mark_dirty_in_slot(gpc); read_unlock_irqrestore(&gpc->lock, flags); trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock); diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index f3327508ae41..2d001a9c6378 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -453,11 +453,11 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic) } if (user_len2) { - mark_page_dirty_in_slot(v->kvm, gpc2->memslot, gpc2->gpa >> PAGE_SHIFT); + kvm_gpc_mark_dirty_in_slot(gpc2); read_unlock(&gpc2->lock); } - mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT); + kvm_gpc_mark_dirty_in_slot(gpc1); read_unlock_irqrestore(&gpc1->lock, flags); } @@ -565,7 +565,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v) WRITE_ONCE(vi->evtchn_upcall_pending, 1); } - mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT); + kvm_gpc_mark_dirty_in_slot(gpc); read_unlock_irqrestore(&gpc->lock, flags); /* For the per-vCPU lapic vector, deliver it as MSI. */ diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7e7fd25b09b3..604ae285d9a9 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1795,6 +1795,16 @@ static inline bool kvm_is_error_gpa(struct kvm *kvm, gpa_t gpa) return kvm_is_error_hva(hva); } +static inline void kvm_gpc_mark_dirty_in_slot(struct gfn_to_pfn_cache *gpc) +{ + lockdep_assert_held(&gpc->lock); + + if (!gpc->memslot) + return; + + mark_page_dirty_in_slot(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa)); +} + enum kvm_stat_kind { KVM_STAT_VM, KVM_STAT_VCPU, -- cgit v1.2.3 From a4bff3df51472f555ab8dea05a3d2faf4abbf199 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:00 +0000 Subject: KVM: pfncache: remove KVM_GUEST_USES_PFN usage As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any callers of kvm_gpc_init(), and for good reason: the implementation is incomplete/broken. And it's not clear that there will ever be a user of KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is non-trivial. Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid having to reason about broken code as __kvm_gpc_refresh() evolves. Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage check in hva_to_pfn_retry() and hence the 'usage' argument to kvm_gpc_init() are also redundant. [1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org [sean: explicitly call out that guest usage is incomplete] Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 2 +- arch/x86/kvm/xen.c | 14 ++++------- include/linux/kvm_host.h | 11 +-------- include/linux/kvm_types.h | 8 ------- virt/kvm/pfncache.c | 61 +++++++---------------------------------------- 5 files changed, 16 insertions(+), 80 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f0f37c769a3a..415723a28dce 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12056,7 +12056,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) vcpu->arch.regs_avail = ~0; vcpu->arch.regs_dirty = ~0; - kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm, vcpu, KVM_HOST_USES_PFN); + kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm); if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu)) vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 2d001a9c6378..e90464225467 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -2108,14 +2108,10 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu) timer_setup(&vcpu->arch.xen.poll_timer, cancel_evtchn_poll, 0); - kvm_gpc_init(&vcpu->arch.xen.runstate_cache, vcpu->kvm, NULL, - KVM_HOST_USES_PFN); - kvm_gpc_init(&vcpu->arch.xen.runstate2_cache, vcpu->kvm, NULL, - KVM_HOST_USES_PFN); - kvm_gpc_init(&vcpu->arch.xen.vcpu_info_cache, vcpu->kvm, NULL, - KVM_HOST_USES_PFN); - kvm_gpc_init(&vcpu->arch.xen.vcpu_time_info_cache, vcpu->kvm, NULL, - KVM_HOST_USES_PFN); + kvm_gpc_init(&vcpu->arch.xen.runstate_cache, vcpu->kvm); + kvm_gpc_init(&vcpu->arch.xen.runstate2_cache, vcpu->kvm); + kvm_gpc_init(&vcpu->arch.xen.vcpu_info_cache, vcpu->kvm); + kvm_gpc_init(&vcpu->arch.xen.vcpu_time_info_cache, vcpu->kvm); } void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu) @@ -2158,7 +2154,7 @@ void kvm_xen_init_vm(struct kvm *kvm) { mutex_init(&kvm->arch.xen.xen_lock); idr_init(&kvm->arch.xen.evtchn_ports); - kvm_gpc_init(&kvm->arch.xen.shinfo_cache, kvm, NULL, KVM_HOST_USES_PFN); + kvm_gpc_init(&kvm->arch.xen.shinfo_cache, kvm); } void kvm_xen_destroy_vm(struct kvm *kvm) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 604ae285d9a9..3e1c04608c67 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1319,21 +1319,12 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn); * * @gpc: struct gfn_to_pfn_cache object. * @kvm: pointer to kvm instance. - * @vcpu: vCPU to be used for marking pages dirty and to be woken on - * invalidation. - * @usage: indicates if the resulting host physical PFN is used while - * the @vcpu is IN_GUEST_MODE (in which case invalidation of - * the cache from MMU notifiers---but not for KVM memslot - * changes!---will also force @vcpu to exit the guest and - * refresh the cache); and/or if the PFN used directly - * by KVM (and thus needs a kernel virtual mapping). * * This sets up a gfn_to_pfn_cache by initializing locks and assigning the * immutable attributes. Note, the cache must be zero-allocated (or zeroed by * the caller before init). */ -void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm, - struct kvm_vcpu *vcpu, enum pfn_cache_usage usage); +void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm); /** * kvm_gpc_activate - prepare a cached kernel mapping and HPA for a given guest diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 9d1f7835d8c1..d93f6522b2c3 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -49,12 +49,6 @@ typedef u64 hfn_t; typedef hfn_t kvm_pfn_t; -enum pfn_cache_usage { - KVM_GUEST_USES_PFN = BIT(0), - KVM_HOST_USES_PFN = BIT(1), - KVM_GUEST_AND_HOST_USE_PFN = KVM_GUEST_USES_PFN | KVM_HOST_USES_PFN, -}; - struct gfn_to_hva_cache { u64 generation; gpa_t gpa; @@ -69,13 +63,11 @@ struct gfn_to_pfn_cache { unsigned long uhva; struct kvm_memory_slot *memslot; struct kvm *kvm; - struct kvm_vcpu *vcpu; struct list_head list; rwlock_t lock; struct mutex refresh_lock; void *khva; kvm_pfn_t pfn; - enum pfn_cache_usage usage; bool active; bool valid; }; diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index f3571f44d9af..6f4b537eb25b 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -25,9 +25,7 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, unsigned long end, bool may_block) { - DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS); struct gfn_to_pfn_cache *gpc; - bool evict_vcpus = false; spin_lock(&kvm->gpc_lock); list_for_each_entry(gpc, &kvm->gpc_list, list) { @@ -37,43 +35,10 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && gpc->uhva >= start && gpc->uhva < end) { gpc->valid = false; - - /* - * If a guest vCPU could be using the physical address, - * it needs to be forced out of guest mode. - */ - if (gpc->usage & KVM_GUEST_USES_PFN) { - if (!evict_vcpus) { - evict_vcpus = true; - bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS); - } - __set_bit(gpc->vcpu->vcpu_idx, vcpu_bitmap); - } } write_unlock_irq(&gpc->lock); } spin_unlock(&kvm->gpc_lock); - - if (evict_vcpus) { - /* - * KVM needs to ensure the vCPU is fully out of guest context - * before allowing the invalidation to continue. - */ - unsigned int req = KVM_REQ_OUTSIDE_GUEST_MODE; - bool called; - - /* - * If the OOM reaper is active, then all vCPUs should have - * been stopped already, so perform the request without - * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd. - */ - if (!may_block) - req &= ~KVM_REQUEST_WAIT; - - called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap); - - WARN_ON_ONCE(called && !may_block); - } } bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len) @@ -206,16 +171,14 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc) * pfn. Note, kmap() and memremap() can both sleep, so this * too must be done outside of gpc->lock! */ - if (gpc->usage & KVM_HOST_USES_PFN) { - if (new_pfn == gpc->pfn) - new_khva = old_khva; - else - new_khva = gpc_map(new_pfn); - - if (!new_khva) { - kvm_release_pfn_clean(new_pfn); - goto out_error; - } + if (new_pfn == gpc->pfn) + new_khva = old_khva; + else + new_khva = gpc_map(new_pfn); + + if (!new_khva) { + kvm_release_pfn_clean(new_pfn); + goto out_error; } write_lock_irq(&gpc->lock); @@ -346,18 +309,12 @@ int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len) return __kvm_gpc_refresh(gpc, gpc->gpa, len); } -void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm, - struct kvm_vcpu *vcpu, enum pfn_cache_usage usage) +void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm) { - WARN_ON_ONCE(!usage || (usage & KVM_GUEST_AND_HOST_USE_PFN) != usage); - WARN_ON_ONCE((usage & KVM_GUEST_USES_PFN) && !vcpu); - rwlock_init(&gpc->lock); mutex_init(&gpc->refresh_lock); gpc->kvm = kvm; - gpc->vcpu = vcpu; - gpc->usage = usage; gpc->pfn = KVM_PFN_ERR_FAULT; gpc->uhva = KVM_HVA_ERR_BAD; } -- cgit v1.2.3 From c01c55a34f284d27719638c4398282442c13ca34 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:05 +0000 Subject: KVM: x86/xen: separate initialization of shared_info cache and content A subsequent patch will allow shared_info to be initialized using either a GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate the initialization of the shared_info content from the activation of the pfncache. Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-11-paul@xen.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/xen.c | 55 +++++++++++++++++++++++++++++++----------------------- 1 file changed, 32 insertions(+), 23 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index e90464225467..031e98d88ba2 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -34,41 +34,32 @@ static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, u64 param, u64 *r); DEFINE_STATIC_KEY_DEFERRED_FALSE(kvm_xen_enabled, HZ); -static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn) +static int kvm_xen_shared_info_init(struct kvm *kvm) { struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; struct pvclock_wall_clock *wc; - gpa_t gpa = gfn_to_gpa(gfn); u32 *wc_sec_hi; u32 wc_version; u64 wall_nsec; int ret = 0; int idx = srcu_read_lock(&kvm->srcu); - if (gfn == KVM_XEN_INVALID_GFN) { - kvm_gpc_deactivate(gpc); - goto out; - } + read_lock_irq(&gpc->lock); + while (!kvm_gpc_check(gpc, PAGE_SIZE)) { + read_unlock_irq(&gpc->lock); - do { - ret = kvm_gpc_activate(gpc, gpa, PAGE_SIZE); + ret = kvm_gpc_refresh(gpc, PAGE_SIZE); if (ret) goto out; - /* - * This code mirrors kvm_write_wall_clock() except that it writes - * directly through the pfn cache and doesn't mark the page dirty. - */ - wall_nsec = kvm_get_wall_clock_epoch(kvm); - - /* It could be invalid again already, so we need to check */ read_lock_irq(&gpc->lock); + } - if (gpc->valid) - break; - - read_unlock_irq(&gpc->lock); - } while (1); + /* + * This code mirrors kvm_write_wall_clock() except that it writes + * directly through the pfn cache and doesn't mark the page dirty. + */ + wall_nsec = kvm_get_wall_clock_epoch(kvm); /* Paranoia checks on the 32-bit struct layout */ BUILD_BUG_ON(offsetof(struct compat_shared_info, wc) != 0x900); @@ -639,12 +630,30 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data) } break; - case KVM_XEN_ATTR_TYPE_SHARED_INFO: + case KVM_XEN_ATTR_TYPE_SHARED_INFO: { + int idx; + mutex_lock(&kvm->arch.xen.xen_lock); - r = kvm_xen_shared_info_init(kvm, data->u.shared_info.gfn); + + idx = srcu_read_lock(&kvm->srcu); + + if (data->u.shared_info.gfn == KVM_XEN_INVALID_GFN) { + kvm_gpc_deactivate(&kvm->arch.xen.shinfo_cache); + r = 0; + } else { + r = kvm_gpc_activate(&kvm->arch.xen.shinfo_cache, + gfn_to_gpa(data->u.shared_info.gfn), + PAGE_SIZE); + } + + srcu_read_unlock(&kvm->srcu, idx); + + if (!r && kvm->arch.xen.shinfo_cache.active) + r = kvm_xen_shared_info_init(kvm); + mutex_unlock(&kvm->arch.xen.xen_lock); break; - + } case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR: if (data->u.vector && data->u.vector < 0x10) r = -EINVAL; -- cgit v1.2.3 From 18b99e4d6db65fface45f1e9bcd1041d93c1ac66 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:06 +0000 Subject: KVM: x86/xen: re-initialize shared_info if guest (32/64-bit) mode is set If the shared_info PFN cache has already been initialized then the content of the shared_info page needs to be re-initialized whenever the guest mode is (re)set. Setting the guest mode is either done explicitly by the VMM via the KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes the MSR to set up the hypercall page. Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-12-paul@xen.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/xen.c | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 031e98d88ba2..52edf676c471 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -625,8 +625,16 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data) } else { mutex_lock(&kvm->arch.xen.xen_lock); kvm->arch.xen.long_mode = !!data->u.long_mode; + + /* + * Re-initialize shared_info to put the wallclock in the + * correct place. Whilst it's not necessary to do this + * unless the mode is actually changed, it does no harm + * to make the call anyway. + */ + r = kvm->arch.xen.shinfo_cache.active ? + kvm_xen_shared_info_init(kvm) : 0; mutex_unlock(&kvm->arch.xen.xen_lock); - r = 0; } break; @@ -1101,9 +1109,24 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data) u32 page_num = data & ~PAGE_MASK; u64 page_addr = data & PAGE_MASK; bool lm = is_long_mode(vcpu); + int r = 0; + + mutex_lock(&kvm->arch.xen.xen_lock); + if (kvm->arch.xen.long_mode != lm) { + kvm->arch.xen.long_mode = lm; + + /* + * Re-initialize shared_info to put the wallclock in the + * correct place. + */ + if (kvm->arch.xen.shinfo_cache.active && + kvm_xen_shared_info_init(kvm)) + r = 1; + } + mutex_unlock(&kvm->arch.xen.xen_lock); - /* Latch long_mode for shared_info pages etc. */ - vcpu->kvm->arch.xen.long_mode = lm; + if (r) + return r; /* * If Xen hypercall intercept is enabled, fill the hypercall -- cgit v1.2.3 From b9220d32799a62c634f7b0b3618da52772626952 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:07 +0000 Subject: KVM: x86/xen: allow shared_info to be mapped by fixed HVA The shared_info page is not guest memory as such. It is a dedicated page allocated by the VMM and overlaid onto guest memory in a GFN chosen by the guest and specified in the XENMEM_add_to_physmap hypercall. The guest may even request that shared_info be moved from one GFN to another by re-issuing that hypercall, but the HVA is never going to change. Because the shared_info page is an overlay the memory slots need to be updated in response to the hypercall. However, memory slot adjustment is not atomic and, whilst all vCPUs are paused, there is still the possibility that events may be delivered (which requires the shared_info page to be updated) whilst the shared_info GPA is absent. The HVA is never absent though, so it makes much more sense to use that as the basis for the kernel's mapping. Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its availability. Don't actually advertize it yet though. That will be done in a subsequent patch, which will also add tests for the new attribute type. Also update the KVM API documentation with the new attribute and also fix it up to consistently refer to 'shared_info' (with the underscore). Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org [sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson --- Documentation/virt/kvm/api.rst | 25 +++++++++++++++++++------ arch/x86/include/uapi/asm/kvm.h | 6 +++++- arch/x86/kvm/xen.c | 40 ++++++++++++++++++++++++++++++++-------- 3 files changed, 56 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 3ec0b7a455a0..3372be85b335 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -372,7 +372,7 @@ The bits in the dirty bitmap are cleared before the ioctl returns, unless KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information, see the description of the capability. -Note that the Xen shared info page, if configured, shall always be assumed +Note that the Xen shared_info page, if configured, shall always be assumed to be dirty. KVM will not explicitly mark it such. @@ -5487,8 +5487,9 @@ KVM_PV_ASYNC_CLEANUP_PERFORM __u8 long_mode; __u8 vector; __u8 runstate_update_flag; - struct { + union { __u64 gfn; + __u64 hva; } shared_info; struct { __u32 send_port; @@ -5516,10 +5517,10 @@ type values: KVM_XEN_ATTR_TYPE_LONG_MODE Sets the ABI mode of the VM to 32-bit or 64-bit (long mode). This - determines the layout of the shared info pages exposed to the VM. + determines the layout of the shared_info page exposed to the VM. KVM_XEN_ATTR_TYPE_SHARED_INFO - Sets the guest physical frame number at which the Xen "shared info" + Sets the guest physical frame number at which the Xen shared_info page resides. Note that although Xen places vcpu_info for the first 32 vCPUs in the shared_info page, KVM does not automatically do so and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used @@ -5528,7 +5529,7 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO not be aware of the Xen CPU id which is used as the index into the vcpu_info[] array, so may know the correct default location. - Note that the shared info page may be constantly written to by KVM; + Note that the shared_info page may be constantly written to by KVM; it contains the event channel bitmap used to deliver interrupts to a Xen guest, amongst other things. It is exempt from dirty tracking mechanisms — KVM will not explicitly mark the page as dirty each @@ -5537,9 +5538,21 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO any vCPU has been running or any event channel interrupts can be routed to the guest. - Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared info + Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared_info page. +KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA + If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the + Xen capabilities, then this attribute may be used to set the + userspace address at which the shared_info page resides, which + will always be fixed in the VMM regardless of where it is mapped + in guest physical address space. This attribute should be used in + preference to KVM_XEN_ATTR_TYPE_SHARED_INFO as it avoids + unnecessary invalidation of an internal cache when the page is + re-mapped in guest physcial address space. + + Setting the hva to zero will disable the shared_info page. + KVM_XEN_ATTR_TYPE_UPCALL_VECTOR Sets the exception vector used to deliver Xen event channel upcalls. This is the HVM-wide vector injected directly by the hypervisor diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 0ad6bda1fc39..9482f22333c8 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -549,6 +549,7 @@ struct kvm_x86_mce { #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6) #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7) +#define KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA (1 << 8) struct kvm_xen_hvm_config { __u32 flags; @@ -567,9 +568,10 @@ struct kvm_xen_hvm_attr { __u8 long_mode; __u8 vector; __u8 runstate_update_flag; - struct { + union { __u64 gfn; #define KVM_XEN_INVALID_GFN ((__u64)-1) + __u64 hva; } shared_info; struct { __u32 send_port; @@ -611,6 +613,8 @@ struct kvm_xen_hvm_attr { #define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4 /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */ #define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG 0x5 +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */ +#define KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA 0x6 struct kvm_xen_vcpu_attr { __u16 type; diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 52edf676c471..45a74307cc25 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -638,20 +638,36 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data) } break; - case KVM_XEN_ATTR_TYPE_SHARED_INFO: { + case KVM_XEN_ATTR_TYPE_SHARED_INFO: + case KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA: { int idx; mutex_lock(&kvm->arch.xen.xen_lock); idx = srcu_read_lock(&kvm->srcu); - if (data->u.shared_info.gfn == KVM_XEN_INVALID_GFN) { - kvm_gpc_deactivate(&kvm->arch.xen.shinfo_cache); - r = 0; + if (data->type == KVM_XEN_ATTR_TYPE_SHARED_INFO) { + gfn_t gfn = data->u.shared_info.gfn; + + if (gfn == KVM_XEN_INVALID_GFN) { + kvm_gpc_deactivate(&kvm->arch.xen.shinfo_cache); + r = 0; + } else { + r = kvm_gpc_activate(&kvm->arch.xen.shinfo_cache, + gfn_to_gpa(gfn), PAGE_SIZE); + } } else { - r = kvm_gpc_activate(&kvm->arch.xen.shinfo_cache, - gfn_to_gpa(data->u.shared_info.gfn), - PAGE_SIZE); + void __user * hva = u64_to_user_ptr(data->u.shared_info.hva); + + if (!PAGE_ALIGNED(hva) || !access_ok(hva, PAGE_SIZE)) { + r = -EINVAL; + } else if (!hva) { + kvm_gpc_deactivate(&kvm->arch.xen.shinfo_cache); + r = 0; + } else { + r = kvm_gpc_activate_hva(&kvm->arch.xen.shinfo_cache, + (unsigned long)hva, PAGE_SIZE); + } } srcu_read_unlock(&kvm->srcu, idx); @@ -715,13 +731,21 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data) break; case KVM_XEN_ATTR_TYPE_SHARED_INFO: - if (kvm->arch.xen.shinfo_cache.active) + if (kvm_gpc_is_gpa_active(&kvm->arch.xen.shinfo_cache)) data->u.shared_info.gfn = gpa_to_gfn(kvm->arch.xen.shinfo_cache.gpa); else data->u.shared_info.gfn = KVM_XEN_INVALID_GFN; r = 0; break; + case KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA: + if (kvm_gpc_is_hva_active(&kvm->arch.xen.shinfo_cache)) + data->u.shared_info.hva = kvm->arch.xen.shinfo_cache.uhva; + else + data->u.shared_info.hva = 0; + r = 0; + break; + case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR: data->u.vector = kvm->arch.xen.upcall_vector; r = 0; -- cgit v1.2.3 From 3991f35805d0f1ad84c3259dc1435021ccf63f16 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:08 +0000 Subject: KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA If the guest does not explicitly set the GPA of vcpu_info structure in memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded in the shared_info page may be used. As described in a previous commit, the shared_info page is an overlay at a fixed HVA within the VMM, so in this case it also more optimal to activate the vcpu_info cache with a fixed HVA to avoid unnecessary invalidation if the guest memory layout is modified. Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org [sean: use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson --- Documentation/virt/kvm/api.rst | 26 +++++++++++++++++++++----- arch/x86/include/uapi/asm/kvm.h | 3 +++ arch/x86/kvm/xen.c | 35 ++++++++++++++++++++++++++++------- 3 files changed, 52 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 3372be85b335..bd93cafd3e4e 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -5523,11 +5523,12 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO Sets the guest physical frame number at which the Xen shared_info page resides. Note that although Xen places vcpu_info for the first 32 vCPUs in the shared_info page, KVM does not automatically do so - and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used - explicitly even when the vcpu_info for a given vCPU resides at the - "default" location in the shared_info page. This is because KVM may - not be aware of the Xen CPU id which is used as the index into the - vcpu_info[] array, so may know the correct default location. + and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or + KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA be used explicitly even when + the vcpu_info for a given vCPU resides at the "default" location + in the shared_info page. This is because KVM may not be aware of + the Xen CPU id which is used as the index into the vcpu_info[] + array, so may know the correct default location. Note that the shared_info page may be constantly written to by KVM; it contains the event channel bitmap used to deliver interrupts to @@ -5649,6 +5650,21 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO on dirty logging. Setting the gpa to KVM_XEN_INVALID_GPA will disable the vcpu_info. +KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA + If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the + Xen capabilities, then this attribute may be used to set the + userspace address of the vcpu_info for a given vCPU. It should + only be used when the vcpu_info resides at the "default" location + in the shared_info page. In this case it is safe to assume the + userspace address will not change, because the shared_info page is + an overlay on guest memory and remains at a fixed host address + regardless of where it is mapped in guest physical address space + and hence unnecessary invalidation of an internal cache may be + avoided if the guest memory layout is modified. + If the vcpu_info does not reside at the "default" location then + it is not guaranteed to remain at the same host address and + hence the aforementioned cache invalidation is required. + KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO Sets the guest physical address of an additional pvclock structure for a given vCPU. This is typically used for guest vsyscall support. diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 9482f22333c8..ad29984d5e39 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -622,6 +622,7 @@ struct kvm_xen_vcpu_attr { union { __u64 gpa; #define KVM_XEN_INVALID_GPA ((__u64)-1) + __u64 hva; __u64 pad[8]; struct { __u64 state; @@ -652,6 +653,8 @@ struct kvm_xen_vcpu_attr { #define KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 0x6 #define KVM_XEN_VCPU_ATTR_TYPE_TIMER 0x7 #define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 0x8 +/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */ +#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA 0x9 /* Secure Encrypted Virtualization command */ enum sev_cmd_id { diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 45a74307cc25..cd05faa19308 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -782,20 +782,33 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data) switch (data->type) { case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO: + case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA: /* No compat necessary here. */ BUILD_BUG_ON(sizeof(struct vcpu_info) != sizeof(struct compat_vcpu_info)); BUILD_BUG_ON(offsetof(struct vcpu_info, time) != offsetof(struct compat_vcpu_info, time)); - if (data->u.gpa == KVM_XEN_INVALID_GPA) { - kvm_gpc_deactivate(&vcpu->arch.xen.vcpu_info_cache); - r = 0; - break; + if (data->type == KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO) { + if (data->u.gpa == KVM_XEN_INVALID_GPA) { + kvm_gpc_deactivate(&vcpu->arch.xen.vcpu_info_cache); + r = 0; + break; + } + + r = kvm_gpc_activate(&vcpu->arch.xen.vcpu_info_cache, + data->u.gpa, sizeof(struct vcpu_info)); + } else { + if (data->u.hva == 0) { + kvm_gpc_deactivate(&vcpu->arch.xen.vcpu_info_cache); + r = 0; + break; + } + + r = kvm_gpc_activate_hva(&vcpu->arch.xen.vcpu_info_cache, + data->u.hva, sizeof(struct vcpu_info)); } - r = kvm_gpc_activate(&vcpu->arch.xen.vcpu_info_cache, - data->u.gpa, sizeof(struct vcpu_info)); if (!r) kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); @@ -1017,13 +1030,21 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data) switch (data->type) { case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO: - if (vcpu->arch.xen.vcpu_info_cache.active) + if (kvm_gpc_is_gpa_active(&vcpu->arch.xen.vcpu_info_cache)) data->u.gpa = vcpu->arch.xen.vcpu_info_cache.gpa; else data->u.gpa = KVM_XEN_INVALID_GPA; r = 0; break; + case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA: + if (kvm_gpc_is_hva_active(&vcpu->arch.xen.vcpu_info_cache)) + data->u.hva = vcpu->arch.xen.vcpu_info_cache.uhva; + else + data->u.hva = 0; + r = 0; + break; + case KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO: if (vcpu->arch.xen.vcpu_time_info_cache.active) data->u.gpa = vcpu->arch.xen.vcpu_time_info_cache.gpa; -- cgit v1.2.3 From 615451d8cb3f82265e0ed60414b606b4fa120f5e Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:11 +0000 Subject: KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability Now that all relevant kernel changes and selftests are in place, enable the new capability. Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 415723a28dce..2911e6383fef 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4682,7 +4682,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) KVM_XEN_HVM_CONFIG_SHARED_INFO | KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL | KVM_XEN_HVM_CONFIG_EVTCHN_SEND | - KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE; + KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE | + KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA; if (sched_info_on()) r |= KVM_XEN_HVM_CONFIG_RUNSTATE | KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG; -- cgit v1.2.3 From 003d914220c97ef93cabfe3ec4e245e2383e19e9 Mon Sep 17 00:00:00 2001 From: Paul Durrant Date: Thu, 15 Feb 2024 15:29:15 +0000 Subject: KVM: x86/xen: allow vcpu_info content to be 'safely' copied If the guest sets an explicit vcpu_info GPA then, for any of the first 32 vCPUs, the content of the default vcpu_info in the shared_info page must be copied into the new location. Because this copy may race with event delivery (which updates the 'evtchn_pending_sel' field in vcpu_info), event delivery needs to be deferred until the copy is complete. Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen that is used in atomic context if the vcpu_info PFN cache has been invalidated so that the update of vcpu_info can be deferred until the cache can be refreshed (on vCPU thread's the way back into guest context). Use this shadow if the vcpu_info cache has been *deactivated*, so that the VMM can safely copy the vcpu_info content and then re-activate the cache with the new GPA. To do this, stop considering an inactive vcpu_info cache as a hard error in kvm_xen_set_evtchn_fast(), and let the existing kvm_gpc_check() fail and kick the vCPU (if necessary). Signed-off-by: Paul Durrant Reviewed-by: David Woodhouse Link: https://lore.kernel.org/r/20240215152916.1158-21-paul@xen.org [sean: add a bit of verbosity to the changelog] Signed-off-by: Sean Christopherson --- arch/x86/kvm/xen.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index cd05faa19308..8a04e0ae9245 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -1697,9 +1697,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm) WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx); } - if (!vcpu->arch.xen.vcpu_info_cache.active) - return -EINVAL; - if (xe->port >= max_evtchn_port(kvm)) return -EINVAL; -- cgit v1.2.3 From b1a3c366cbc783d6600b357ccfec2f440eed5453 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 16:23:40 -0800 Subject: x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query whether or not the CPU supports 5-level EPT paging. EPT capabilities are enumerated via MSR, i.e. aren't accessible to userspace without help from the kernel, and knowing whether or not 5-level EPT is supported is useful for debug, triage, testing, etc. For example, when EPT is enabled, bits 51:48 of guest physical addresses are consumed by the CPU if and only if 5-level EPT is enabled. For CPUs with MAXPHYADDR > 48, KVM *can't* map all legal guest memory without 5-level EPT, making 5-level EPT support valuable information for userspace. Reported-by: Yi Lai Cc: Tao Su Cc: Xudong Hao Link: https://lore.kernel.org/r/20240110002340.485595-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/vmxfeatures.h | 1 + arch/x86/kernel/cpu/feat_ctl.c | 2 ++ 2 files changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h index c6a7eed03914..266daf5b5b84 100644 --- a/arch/x86/include/asm/vmxfeatures.h +++ b/arch/x86/include/asm/vmxfeatures.h @@ -25,6 +25,7 @@ #define VMX_FEATURE_EPT_EXECUTE_ONLY ( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */ #define VMX_FEATURE_EPT_AD ( 0*32+ 18) /* EPT Accessed/Dirty bits */ #define VMX_FEATURE_EPT_1GB ( 0*32+ 19) /* 1GB EPT pages */ +#define VMX_FEATURE_EPT_5LEVEL ( 0*32+ 20) /* 5-level EPT paging */ /* Aggregated APIC features 24-27 */ #define VMX_FEATURE_FLEXPRIORITY ( 0*32+ 24) /* TPR shadow + virt APIC */ diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c index 03851240c3e3..1640ae76548f 100644 --- a/arch/x86/kernel/cpu/feat_ctl.c +++ b/arch/x86/kernel/cpu/feat_ctl.c @@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c) c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD); if (ept & VMX_EPT_1GB_PAGE_BIT) c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB); + if (ept & VMX_EPT_PAGE_WALK_5_BIT) + c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL); /* Synthetic APIC features that are aggregates of multiple features. */ if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) && -- cgit v1.2.3 From fc5375dd8c062d81001b50008d65ca09e942f924 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 9 Feb 2024 14:07:51 -0800 Subject: KVM: x86: Make kvm_get_dr() return a value, not use an out parameter Convert kvm_get_dr()'s output parameter to a return value, and clean up most of the mess that was created by forcing callers to provide a pointer. No functional change intended. Acked-by: Mathias Krause Reviewed-by: Mathias Krause Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/emulate.c | 17 ++++------------- arch/x86/kvm/kvm_emulate.h | 2 +- arch/x86/kvm/smm.c | 15 ++++----------- arch/x86/kvm/svm/svm.c | 7 ++----- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 5 +---- arch/x86/kvm/x86.c | 20 +++++++------------- 8 files changed, 21 insertions(+), 49 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b5b2d0fde579..7a59043a6ab0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2045,7 +2045,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3); int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4); int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8); int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val); -void kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val); +unsigned long kvm_get_dr(struct kvm_vcpu *vcpu, int dr); unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu); void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw); int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index eafb48f0b745..3328c7c7f2f9 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3013,7 +3013,7 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt, ret = em_push(ctxt); } - ops->get_dr(ctxt, 7, &dr7); + dr7 = ops->get_dr(ctxt, 7); ops->set_dr(ctxt, 7, dr7 & ~(DR_LOCAL_ENABLE_MASK | DR_LOCAL_SLOWDOWN)); return ret; @@ -3868,15 +3868,6 @@ static int check_cr_access(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int check_dr7_gd(struct x86_emulate_ctxt *ctxt) -{ - unsigned long dr7; - - ctxt->ops->get_dr(ctxt, 7, &dr7); - - return dr7 & DR7_GD; -} - static int check_dr_read(struct x86_emulate_ctxt *ctxt) { int dr = ctxt->modrm_reg; @@ -3889,10 +3880,10 @@ static int check_dr_read(struct x86_emulate_ctxt *ctxt) if ((cr4 & X86_CR4_DE) && (dr == 4 || dr == 5)) return emulate_ud(ctxt); - if (check_dr7_gd(ctxt)) { + if (ctxt->ops->get_dr(ctxt, 7) & DR7_GD) { ulong dr6; - ctxt->ops->get_dr(ctxt, 6, &dr6); + dr6 = ctxt->ops->get_dr(ctxt, 6); dr6 &= ~DR_TRAP_BITS; dr6 |= DR6_BD | DR6_ACTIVE_LOW; ctxt->ops->set_dr(ctxt, 6, dr6); @@ -5451,7 +5442,7 @@ twobyte_insn: ctxt->dst.val = ops->get_cr(ctxt, ctxt->modrm_reg); break; case 0x21: /* mov from dr to reg */ - ops->get_dr(ctxt, ctxt->modrm_reg, &ctxt->dst.val); + ctxt->dst.val = ops->get_dr(ctxt, ctxt->modrm_reg); break; case 0x40 ... 0x4f: /* cmov */ if (test_cc(ctxt->b, ctxt->eflags)) diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index e6d149825169..153977c556f4 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -203,7 +203,7 @@ struct x86_emulate_ops { ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr); int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val); int (*cpl)(struct x86_emulate_ctxt *ctxt); - void (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong *dest); + ulong (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr); int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value); int (*set_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data); int (*get_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata); diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c index dc3d95fdca7d..19a7a0a31953 100644 --- a/arch/x86/kvm/smm.c +++ b/arch/x86/kvm/smm.c @@ -184,7 +184,6 @@ static void enter_smm_save_state_32(struct kvm_vcpu *vcpu, struct kvm_smram_state_32 *smram) { struct desc_ptr dt; - unsigned long val; int i; smram->cr0 = kvm_read_cr0(vcpu); @@ -195,10 +194,8 @@ static void enter_smm_save_state_32(struct kvm_vcpu *vcpu, for (i = 0; i < 8; i++) smram->gprs[i] = kvm_register_read_raw(vcpu, i); - kvm_get_dr(vcpu, 6, &val); - smram->dr6 = (u32)val; - kvm_get_dr(vcpu, 7, &val); - smram->dr7 = (u32)val; + smram->dr6 = (u32)kvm_get_dr(vcpu, 6); + smram->dr7 = (u32)kvm_get_dr(vcpu, 7); enter_smm_save_seg_32(vcpu, &smram->tr, &smram->tr_sel, VCPU_SREG_TR); enter_smm_save_seg_32(vcpu, &smram->ldtr, &smram->ldtr_sel, VCPU_SREG_LDTR); @@ -231,7 +228,6 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, struct kvm_smram_state_64 *smram) { struct desc_ptr dt; - unsigned long val; int i; for (i = 0; i < 16; i++) @@ -240,11 +236,8 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, smram->rip = kvm_rip_read(vcpu); smram->rflags = kvm_get_rflags(vcpu); - - kvm_get_dr(vcpu, 6, &val); - smram->dr6 = val; - kvm_get_dr(vcpu, 7, &val); - smram->dr7 = val; + smram->dr6 = kvm_get_dr(vcpu, 6); + smram->dr7 = kvm_get_dr(vcpu, 7); smram->cr0 = kvm_read_cr0(vcpu); smram->cr3 = kvm_read_cr3(vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e90b429c84f1..dda91f7cd71b 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2735,7 +2735,6 @@ static int dr_interception(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); int reg, dr; - unsigned long val; int err = 0; /* @@ -2763,11 +2762,9 @@ static int dr_interception(struct kvm_vcpu *vcpu) dr = svm->vmcb->control.exit_code - SVM_EXIT_READ_DR0; if (dr >= 16) { /* mov to DRn */ dr -= 16; - val = kvm_register_read(vcpu, reg); - err = kvm_set_dr(vcpu, dr, val); + err = kvm_set_dr(vcpu, dr, kvm_register_read(vcpu, reg)); } else { - kvm_get_dr(vcpu, dr, &val); - kvm_register_write(vcpu, reg, val); + kvm_register_write(vcpu, reg, kvm_get_dr(vcpu, dr)); } return kvm_complete_insn_gp(vcpu, err); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 6329a306856b..692fd559f086 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4433,7 +4433,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) (vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE); if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) - kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7); + vmcs12->guest_dr7 = kvm_get_dr(vcpu, 7); if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_EFER) vmcs12->guest_ia32_efer = vcpu->arch.efer; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e262bc2ba4e5..aa47433d0c9b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5566,10 +5566,7 @@ static int handle_dr(struct kvm_vcpu *vcpu) reg = DEBUG_REG_ACCESS_REG(exit_qualification); if (exit_qualification & TYPE_MOV_FROM_DR) { - unsigned long val; - - kvm_get_dr(vcpu, dr, &val); - kvm_register_write(vcpu, reg, val); + kvm_register_write(vcpu, reg, kvm_get_dr(vcpu, dr)); err = 0; } else { err = kvm_set_dr(vcpu, dr, kvm_register_read(vcpu, reg)); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7eb7736cfd76..b195eb21639c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1399,22 +1399,19 @@ int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val) } EXPORT_SYMBOL_GPL(kvm_set_dr); -void kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val) +unsigned long kvm_get_dr(struct kvm_vcpu *vcpu, int dr) { size_t size = ARRAY_SIZE(vcpu->arch.db); switch (dr) { case 0 ... 3: - *val = vcpu->arch.db[array_index_nospec(dr, size)]; - break; + return vcpu->arch.db[array_index_nospec(dr, size)]; case 4: case 6: - *val = vcpu->arch.dr6; - break; + return vcpu->arch.dr6; case 5: default: /* 7 */ - *val = vcpu->arch.dr7; - break; + return vcpu->arch.dr7; } } EXPORT_SYMBOL_GPL(kvm_get_dr); @@ -5509,7 +5506,6 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu, static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, struct kvm_debugregs *dbgregs) { - unsigned long val; unsigned int i; memset(dbgregs, 0, sizeof(*dbgregs)); @@ -5518,8 +5514,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, for (i = 0; i < ARRAY_SIZE(vcpu->arch.db); i++) dbgregs->db[i] = vcpu->arch.db[i]; - kvm_get_dr(vcpu, 6, &val); - dbgregs->dr6 = val; + dbgregs->dr6 = kvm_get_dr(vcpu, 6); dbgregs->dr7 = vcpu->arch.dr7; } @@ -8173,10 +8168,9 @@ static void emulator_wbinvd(struct x86_emulate_ctxt *ctxt) kvm_emulate_wbinvd_noskip(emul_to_vcpu(ctxt)); } -static void emulator_get_dr(struct x86_emulate_ctxt *ctxt, int dr, - unsigned long *dest) +static unsigned long emulator_get_dr(struct x86_emulate_ctxt *ctxt, int dr) { - kvm_get_dr(emul_to_vcpu(ctxt), dr, dest); + return kvm_get_dr(emul_to_vcpu(ctxt), dr); } static int emulator_set_dr(struct x86_emulate_ctxt *ctxt, int dr, -- cgit v1.2.3 From 2a5f091ce1c9222fc9f98374d92db9539e6004ae Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 9 Feb 2024 14:07:52 -0800 Subject: KVM: x86: Open code all direct reads to guest DR6 and DR7 Bite the bullet, and open code all direct reads of DR6 and DR7. KVM currently has a mix of open coded accesses and calls to kvm_get_dr(), which is confusing and ugly because there's no rhyme or reason as to why any particular chunk of code uses kvm_get_dr(). The obvious alternative is to force all accesses through kvm_get_dr(), but it's not at all clear that doing so would be a net positive, e.g. even if KVM ends up wanting/needing to force all reads through a common helper, e.g. to play caching games, the cost of reverting this change is likely lower than the ongoing cost of maintaining weird, arbitrary code. No functional change intended. Cc: Mathias Krause Reviewed-by: Mathias Krause Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/smm.c | 8 ++++---- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/x86.c | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c index 19a7a0a31953..d06d43d8d2aa 100644 --- a/arch/x86/kvm/smm.c +++ b/arch/x86/kvm/smm.c @@ -194,8 +194,8 @@ static void enter_smm_save_state_32(struct kvm_vcpu *vcpu, for (i = 0; i < 8; i++) smram->gprs[i] = kvm_register_read_raw(vcpu, i); - smram->dr6 = (u32)kvm_get_dr(vcpu, 6); - smram->dr7 = (u32)kvm_get_dr(vcpu, 7); + smram->dr6 = (u32)vcpu->arch.dr6; + smram->dr7 = (u32)vcpu->arch.dr7; enter_smm_save_seg_32(vcpu, &smram->tr, &smram->tr_sel, VCPU_SREG_TR); enter_smm_save_seg_32(vcpu, &smram->ldtr, &smram->ldtr_sel, VCPU_SREG_LDTR); @@ -236,8 +236,8 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, smram->rip = kvm_rip_read(vcpu); smram->rflags = kvm_get_rflags(vcpu); - smram->dr6 = kvm_get_dr(vcpu, 6); - smram->dr7 = kvm_get_dr(vcpu, 7); + smram->dr6 = vcpu->arch.dr6; + smram->dr7 = vcpu->arch.dr7; smram->cr0 = kvm_read_cr0(vcpu); smram->cr3 = kvm_read_cr3(vcpu); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 692fd559f086..0cd977f22c4b 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4433,7 +4433,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) (vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE); if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) - vmcs12->guest_dr7 = kvm_get_dr(vcpu, 7); + vmcs12->guest_dr7 = vcpu->arch.dr7; if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_EFER) vmcs12->guest_ia32_efer = vcpu->arch.efer; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b195eb21639c..f97e443ce853 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5514,7 +5514,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, for (i = 0; i < ARRAY_SIZE(vcpu->arch.db); i++) dbgregs->db[i] = vcpu->arch.db[i]; - dbgregs->dr6 = kvm_get_dr(vcpu, 6); + dbgregs->dr6 = vcpu->arch.dr6; dbgregs->dr7 = vcpu->arch.dr7; } -- cgit v1.2.3 From 474b99ed703b7e4031f3925adacf19e7c8af2075 Mon Sep 17 00:00:00 2001 From: Mingwei Zhang Date: Fri, 2 Feb 2024 16:23:40 -0800 Subject: KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic Drop KVM's completely pointless acquisition of mmu_lock when deciding whether or not to unprotect any shadow pages residing at the gfn before resuming the guest to let it retry an instruction that KVM failed to emulated. In this case, indirect_shadow_pages is used as a coarse-grained heuristic to check if there is any chance of there being a relevant shadow page to unprotected. But acquiring mmu_lock largely defeats any benefit to the heuristic, as taking mmu_lock for write is likely far more costly to the VM as a whole than unnecessarily walking mmu_page_hash. Furthermore, the current code is already prone to false negatives and false positives, as it drops mmu_lock before checking the flag and unprotecting shadow pages. And as evidenced by the lack of bug reports, neither false positives nor false negatives are problematic. A false positive simply means that KVM will try to unprotect shadow pages that have already been zapped. And a false negative means that KVM will resume the guest without unprotecting the gfn, i.e. if a shadow page was _just_ created, the vCPU will hit the same page fault and do the whole dance all over again, and detect and unprotect the shadow page the second time around (or not, if something else zaps it first). Reported-by: Jim Mattson Signed-off-by: Mingwei Zhang [sean: drop READ_ONCE() and comment change, rewrite changelog] Link: https://lore.kernel.org/r/20240203002343.383056-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 363b1c080205..7015f8786397 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8789,13 +8789,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, /* The instructions are well-emulated on direct mmu. */ if (vcpu->arch.mmu->root_role.direct) { - unsigned int indirect_shadow_pages; - - write_lock(&vcpu->kvm->mmu_lock); - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; - write_unlock(&vcpu->kvm->mmu_lock); - - if (indirect_shadow_pages) + if (vcpu->kvm->arch.indirect_shadow_pages) kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); return true; -- cgit v1.2.3 From 515c18a64e704bc932c5a64e25aaeb712252cf0b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 2 Feb 2024 16:23:41 -0800 Subject: KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction() Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop the dedicated path entirely and always query indirect_shadow_pages when deciding whether or not to try unprotecting the gfn. For indirect, a.k.a. shadow MMUs, checking indirect_shadow_pages is harmless; unless *every* shadow page was somehow zapped while KVM was attempting to emulate the instruction, indirect_shadow_pages is guaranteed to be non-zero. Well, unless the instruction used a direct hugepage with 2-level paging for its code page, but in that case, there's obviously nothing to unprotect. And in the extremely unlikely case all shadow pages were zapped, there's again obviously nothing to unprotect. Link: https://lore.kernel.org/r/20240203002343.383056-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7015f8786397..ac3ea5829df6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8787,27 +8787,27 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, kvm_release_pfn_clean(pfn); - /* The instructions are well-emulated on direct mmu. */ - if (vcpu->arch.mmu->root_role.direct) { - if (vcpu->kvm->arch.indirect_shadow_pages) - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); - - return true; - } - /* - * if emulation was due to access to shadowed page table - * and it failed try to unshadow page and re-enter the - * guest to let CPU execute the instruction. + * If emulation may have been triggered by a write to a shadowed page + * table, unprotect the gfn (zap any relevant SPTEs) and re-enter the + * guest to let the CPU re-execute the instruction in the hope that the + * CPU can cleanly execute the instruction that KVM failed to emulate. */ - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + if (vcpu->kvm->arch.indirect_shadow_pages) + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); /* - * If the access faults on its page table, it can not - * be fixed by unprotecting shadow page and it should - * be reported to userspace. + * If the failed instruction faulted on an access to page tables that + * are used to translate any part of the instruction, KVM can't resolve + * the issue by unprotecting the gfn, as zapping the shadow page will + * result in the instruction taking a !PRESENT page fault and thus put + * the vCPU into an infinite loop of page faults. E.g. KVM will create + * a SPTE and write-protect the gfn to resolve the !PRESENT fault, and + * then zap the SPTE to unprotect the gfn, and then do it all over + * again. Report the error to userspace. */ - return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP); + return vcpu->arch.mmu->root_role.direct || + !(emulation_type & EMULTYPE_WRITE_PF_TO_SP); } static bool retry_instruction(struct x86_emulate_ctxt *ctxt, -- cgit v1.2.3 From dfeef3d3f310ee464493e848383c4e9fe879089a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 2 Feb 2024 16:23:42 -0800 Subject: KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag Remove reexecute_instruction()'s final check on the MMU being direct, as EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a shadow MMU. Prior to commit 93c05d3ef252 ("KVM: x86: improve reexecute_instruction"), the flag simply didn't exist (and KVM actually returned "true" unconditionally for both types of MMUs). I.e. the explicit check for a direct MMU is simply leftover artifact from old code. Link: https://lore.kernel.org/r/20240203002343.383056-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ac3ea5829df6..48ec889452e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8806,8 +8806,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, * then zap the SPTE to unprotect the gfn, and then do it all over * again. Report the error to userspace. */ - return vcpu->arch.mmu->root_role.direct || - !(emulation_type & EMULTYPE_WRITE_PF_TO_SP); + return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP); } static bool retry_instruction(struct x86_emulate_ctxt *ctxt, -- cgit v1.2.3 From 9c9025ea003a03f967affd690f39b4ef3452c0f5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 17:27:00 -0800 Subject: KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to inject an event but needs to first complete some other operation. Knowing that KVM is (or isn't) forcing an exit is useful information when debugging issues related to event injection. Suggested-by: Maxim Levitsky Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 3 ++- arch/x86/kvm/svm/svm.c | 5 +++-- arch/x86/kvm/trace.h | 9 ++++++--- arch/x86/kvm/vmx/vmx.c | 4 ++-- arch/x86/kvm/x86.c | 2 +- 5 files changed, 14 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7a59043a6ab0..bdda3edddec9 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1663,7 +1663,8 @@ struct kvm_x86_ops { void (*flush_tlb_guest)(struct kvm_vcpu *vcpu); int (*vcpu_pre_run)(struct kvm_vcpu *vcpu); - enum exit_fastpath_completion (*vcpu_run)(struct kvm_vcpu *vcpu); + enum exit_fastpath_completion (*vcpu_run)(struct kvm_vcpu *vcpu, + bool force_immediate_exit); int (*handle_exit)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion exit_fastpath); int (*skip_emulated_instruction)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index dda91f7cd71b..ab8ad8d0e818 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4112,12 +4112,13 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in guest_state_exit_irqoff(); } -static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) +static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, + bool force_immediate_exit) { struct vcpu_svm *svm = to_svm(vcpu); bool spec_ctrl_intercepted = msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL); - trace_kvm_entry(vcpu); + trace_kvm_entry(vcpu, force_immediate_exit); svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 83843379813e..88659de4d2a7 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -15,20 +15,23 @@ * Tracepoint for guest mode entry. */ TRACE_EVENT(kvm_entry, - TP_PROTO(struct kvm_vcpu *vcpu), - TP_ARGS(vcpu), + TP_PROTO(struct kvm_vcpu *vcpu, bool force_immediate_exit), + TP_ARGS(vcpu, force_immediate_exit), TP_STRUCT__entry( __field( unsigned int, vcpu_id ) __field( unsigned long, rip ) + __field( bool, immediate_exit ) ), TP_fast_assign( __entry->vcpu_id = vcpu->vcpu_id; __entry->rip = kvm_rip_read(vcpu); + __entry->immediate_exit = force_immediate_exit; ), - TP_printk("vcpu %u, rip 0x%lx", __entry->vcpu_id, __entry->rip) + TP_printk("vcpu %u, rip 0x%lx%s", __entry->vcpu_id, __entry->rip, + __entry->immediate_exit ? "[immediate exit]" : "") ); /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index aa47433d0c9b..e3653d11f001 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7265,7 +7265,7 @@ out: guest_state_exit_irqoff(); } -static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) +static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) { struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long cr3, cr4; @@ -7292,7 +7292,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) return EXIT_FASTPATH_NONE; } - trace_kvm_entry(vcpu); + trace_kvm_entry(vcpu, force_immediate_exit); if (vmx->ple_window_dirty) { vmx->ple_window_dirty = false; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f97e443ce853..e92f14b9d0d6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10956,7 +10956,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) WARN_ON_ONCE((kvm_vcpu_apicv_activated(vcpu) != kvm_vcpu_apicv_active(vcpu)) && (kvm_get_apic_mode(vcpu) != LAPIC_MODE_DISABLED)); - exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu); + exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu, req_immediate_exit); if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST)) break; -- cgit v1.2.3 From e6b5d16bbd2d4c8259ad76aa33de80d561aba5f9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 17:27:01 -0800 Subject: KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exits Re-enter the guest in the fast path if VMX preeemption timer VM-Exit was "spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by some miracle the timer expired before any other VM-Exit occurred. This is just an intermediate step to cleaning up the preemption timer handling, optimizing these types of spurious VM-Exits is not interesting as they are extremely rare/infrequent. Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e3653d11f001..a44b09e35fd6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5992,8 +5992,15 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - if (!vmx->req_immediate_exit && - !unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled)) { + /* + * In the *extremely* unlikely scenario that this is a spurious VM-Exit + * due to the timer expiring while it was "soft" disabled, just eat the + * exit and re-enter the guest. + */ + if (unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled)) + return EXIT_FASTPATH_REENTER_GUEST; + + if (!vmx->req_immediate_exit) { kvm_lapic_expired_hv_timer(vcpu); return EXIT_FASTPATH_REENTER_GUEST; } -- cgit v1.2.3 From 11776aa0cfa7d007ad1799b1553bdcbd830e5010 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 17:27:02 -0800 Subject: KVM: VMX: Handle forced exit due to preemption timer in fastpath Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the exit fastpath, i.e. avoid calling back into handle_preemption_timer() for the same exit. There is no work to be done for forced exits, as the name suggests the goal is purely to get control back in KVM. In addition to shaving a few cycles, this will allow cleanly separating handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g. it's not immediately obvious why _apparently_ calling handle_fastpath_preemption_timer() twice on a "slow" exit is necessary: the "slow" call is necessary to handle exits from L2, which are excluded from the fastpath by vmx_vcpu_run(). Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a44b09e35fd6..7bd360b5bfb9 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6000,12 +6000,15 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu) if (unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled)) return EXIT_FASTPATH_REENTER_GUEST; - if (!vmx->req_immediate_exit) { - kvm_lapic_expired_hv_timer(vcpu); - return EXIT_FASTPATH_REENTER_GUEST; - } + /* + * If the timer expired because KVM used it to force an immediate exit, + * then mission accomplished. + */ + if (vmx->req_immediate_exit) + return EXIT_FASTPATH_EXIT_HANDLED; - return EXIT_FASTPATH_NONE; + kvm_lapic_expired_hv_timer(vcpu); + return EXIT_FASTPATH_REENTER_GUEST; } static int handle_preemption_timer(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From bf1a49436ea37b98dd2f37c57608951d0e28eecc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 17:27:03 -0800 Subject: KVM: x86: Move handling of is_guest_mode() into fastpath exit handlers Let the fastpath code decide which exits can/can't be handled in the fastpath when L2 is active, e.g. when KVM generates a VMX preemption timer exit to forcefully regain control, there is no "work" to be done and so such exits can be handled in the fastpath regardless of whether L1 or L2 is active. Moving the is_guest_mode() check into the fastpath code also makes it easier to see that L2 isn't allowed to use the fastpath in most cases, e.g. it's not immediately obvious why handle_fastpath_preemption_timer() is called from the fastpath and the normal path. Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 6 +++--- arch/x86/kvm/vmx/vmx.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index ab8ad8d0e818..fe4ac03aaa3a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4089,6 +4089,9 @@ static int svm_vcpu_pre_run(struct kvm_vcpu *vcpu) static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu) { + if (is_guest_mode(vcpu)) + return EXIT_FASTPATH_NONE; + if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_MSR && to_svm(vcpu)->vmcb->control.exit_info_1) return handle_fastpath_set_msr_irqoff(vcpu); @@ -4235,9 +4238,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, svm_complete_interrupts(vcpu); - if (is_guest_mode(vcpu)) - return EXIT_FASTPATH_NONE; - return svm_exit_handlers_fastpath(vcpu); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 7bd360b5bfb9..6f21d389bed5 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7214,6 +7214,9 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx, static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu) { + if (is_guest_mode(vcpu)) + return EXIT_FASTPATH_NONE; + switch (to_vmx(vcpu)->exit_reason.basic) { case EXIT_REASON_MSR_WRITE: return handle_fastpath_set_msr_irqoff(vcpu); @@ -7425,9 +7428,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) vmx_recover_nmi_blocking(vmx); vmx_complete_interrupts(vmx); - if (is_guest_mode(vcpu)) - return EXIT_FASTPATH_NONE; - return vmx_exit_handlers_fastpath(vcpu); } -- cgit v1.2.3 From 7b3d1bbf8d68d76fb21210932a5e8ed8ea80dbcc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 17:27:04 -0800 Subject: KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2 Eat VMX treemption timer exits in the fastpath regardless of whether L1 or L2 is active. The VM-Exit is 100% KVM-induced, i.e. there is nothing directly related to the exit that KVM needs to do on behalf of the guest, thus there is no reason to wait until the slow path to do nothing. Opportunistically add comments explaining why preemption timer exits for emulating the guest's APIC timer need to go down the slow path. Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6f21d389bed5..1a3a4a92dc60 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6007,13 +6007,26 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu) if (vmx->req_immediate_exit) return EXIT_FASTPATH_EXIT_HANDLED; + /* + * If L2 is active, go down the slow path as emulating the guest timer + * expiration likely requires synthesizing a nested VM-Exit. + */ + if (is_guest_mode(vcpu)) + return EXIT_FASTPATH_NONE; + kvm_lapic_expired_hv_timer(vcpu); return EXIT_FASTPATH_REENTER_GUEST; } static int handle_preemption_timer(struct kvm_vcpu *vcpu) { - handle_fastpath_preemption_timer(vcpu); + /* + * This non-fastpath handler is reached if and only if the preemption + * timer was being used to emulate a guest timer while L2 is active. + * All other scenarios are supposed to be handled in the fastpath. + */ + WARN_ON_ONCE(!is_guest_mode(vcpu)); + kvm_lapic_expired_hv_timer(vcpu); return 1; } @@ -7214,7 +7227,12 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx, static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu) { - if (is_guest_mode(vcpu)) + /* + * If L2 is active, some VMX preemption timer exits can be handled in + * the fastpath even, all other exits must use the slow path. + */ + if (is_guest_mode(vcpu) && + to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_PREEMPTION_TIMER) return EXIT_FASTPATH_NONE; switch (to_vmx(vcpu)->exit_reason.basic) { -- cgit v1.2.3 From 0ec3d6d1f169baa7fc512ae4b78d17e7c94b7763 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 17:27:05 -0800 Subject: KVM: x86: Fully defer to vendor code to decide how to force immediate exit Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption to force a VM-Exit and let vendor code fully handle forcing a VM-Exit. Opportunsitically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit. Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm-x86-ops.h | 1 - arch/x86/include/asm/kvm_host.h | 3 --- arch/x86/kvm/svm/svm.c | 7 ++++--- arch/x86/kvm/vmx/vmx.c | 32 ++++++++++++++------------------ arch/x86/kvm/vmx/vmx.h | 2 -- arch/x86/kvm/x86.c | 10 +--------- 6 files changed, 19 insertions(+), 36 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index 378ed944b849..3942b74c1b75 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -103,7 +103,6 @@ KVM_X86_OP(write_tsc_multiplier) KVM_X86_OP(get_exit_info) KVM_X86_OP(check_intercept) KVM_X86_OP(handle_exit_irqoff) -KVM_X86_OP(request_immediate_exit) KVM_X86_OP(sched_in) KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging) KVM_X86_OP_OPTIONAL(vcpu_blocking) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bdda3edddec9..198d30cceaf2 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1732,8 +1732,6 @@ struct kvm_x86_ops { struct x86_exception *exception); void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu); - void (*request_immediate_exit)(struct kvm_vcpu *vcpu); - void (*sched_in)(struct kvm_vcpu *vcpu, int cpu); /* @@ -2239,7 +2237,6 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn); int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu); int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err); -void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu); void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index fe4ac03aaa3a..b9096bb79c00 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4140,9 +4140,12 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, * is enough to force an immediate vmexit. */ disable_nmi_singlestep(svm); - smp_send_reschedule(vcpu->cpu); + force_immediate_exit = true; } + if (force_immediate_exit) + smp_send_reschedule(vcpu->cpu); + pre_svm_run(vcpu); sync_lapic_to_cr8(vcpu); @@ -4995,8 +4998,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .check_intercept = svm_check_intercept, .handle_exit_irqoff = svm_handle_exit_irqoff, - .request_immediate_exit = __kvm_request_immediate_exit, - .sched_in = svm_sched_in, .nested_ops = &svm_nested_ops, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 1a3a4a92dc60..c9e117506d73 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -49,6 +49,8 @@ #include #include +#include + #include "capabilities.h" #include "cpuid.h" #include "hyperv.h" @@ -1281,8 +1283,6 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) u16 fs_sel, gs_sel; int i; - vmx->req_immediate_exit = false; - /* * Note that guest MSRs to be saved/restored can also be changed * when guest state is loaded. This happens when guest transitions @@ -5988,7 +5988,8 @@ static int handle_pml_full(struct kvm_vcpu *vcpu) return 1; } -static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu) +static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu, + bool force_immediate_exit) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -6004,7 +6005,7 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu) * If the timer expired because KVM used it to force an immediate exit, * then mission accomplished. */ - if (vmx->req_immediate_exit) + if (force_immediate_exit) return EXIT_FASTPATH_EXIT_HANDLED; /* @@ -7166,13 +7167,13 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx) msrs[i].host, false); } -static void vmx_update_hv_timer(struct kvm_vcpu *vcpu) +static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit) { struct vcpu_vmx *vmx = to_vmx(vcpu); u64 tscl; u32 delta_tsc; - if (vmx->req_immediate_exit) { + if (force_immediate_exit) { vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0); vmx->loaded_vmcs->hv_timer_soft_disabled = false; } else if (vmx->hv_deadline_tsc != -1) { @@ -7225,7 +7226,8 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx, barrier_nospec(); } -static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu) +static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu, + bool force_immediate_exit) { /* * If L2 is active, some VMX preemption timer exits can be handled in @@ -7239,7 +7241,7 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu) case EXIT_REASON_MSR_WRITE: return handle_fastpath_set_msr_irqoff(vcpu); case EXIT_REASON_PREEMPTION_TIMER: - return handle_fastpath_preemption_timer(vcpu); + return handle_fastpath_preemption_timer(vcpu, force_immediate_exit); default: return EXIT_FASTPATH_NONE; } @@ -7382,7 +7384,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) vmx_passthrough_lbr_msrs(vcpu); if (enable_preemption_timer) - vmx_update_hv_timer(vcpu); + vmx_update_hv_timer(vcpu, force_immediate_exit); + else if (force_immediate_exit) + smp_send_reschedule(vcpu->cpu); kvm_wait_lapic_expire(vcpu); @@ -7446,7 +7450,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) vmx_recover_nmi_blocking(vmx); vmx_complete_interrupts(vmx); - return vmx_exit_handlers_fastpath(vcpu); + return vmx_exit_handlers_fastpath(vcpu, force_immediate_exit); } static void vmx_vcpu_free(struct kvm_vcpu *vcpu) @@ -7926,11 +7930,6 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG); } -static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu) -{ - to_vmx(vcpu)->req_immediate_exit = true; -} - static int vmx_check_intercept_io(struct kvm_vcpu *vcpu, struct x86_instruction_info *info) { @@ -8383,8 +8382,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, - .request_immediate_exit = vmx_request_immediate_exit, - .sched_in = vmx_sched_in, .cpu_dirty_log_size = PML_ENTITY_NUM, @@ -8644,7 +8641,6 @@ static __init int hardware_setup(void) if (!enable_preemption_timer) { vmx_x86_ops.set_hv_timer = NULL; vmx_x86_ops.cancel_hv_timer = NULL; - vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit; } kvm_caps.supported_mce_cap |= MCG_LMCE_P; diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index e3b0985bb74a..65786dbe7d60 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -332,8 +332,6 @@ struct vcpu_vmx { unsigned int ple_window; bool ple_window_dirty; - bool req_immediate_exit; - /* Support for PML */ #define PML_ENTITY_NUM 512 struct page *pml_pg; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e92f14b9d0d6..d9724a35d8f3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10667,12 +10667,6 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu); } -void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu) -{ - smp_send_reschedule(vcpu->cpu); -} -EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit); - /* * Called within kvm->srcu read side. * Returns 1 to let vcpu_run() continue the guest execution loop without @@ -10922,10 +10916,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) goto cancel_injection; } - if (req_immediate_exit) { + if (req_immediate_exit) kvm_make_request(KVM_REQ_EVENT, vcpu); - static_call(kvm_x86_request_immediate_exit)(vcpu); - } fpregs_assert_state_consistent(); if (test_thread_flag(TIF_NEED_FPU_LOAD)) -- cgit v1.2.3 From a78d9046696b88079a5696bccec4e4e439a3f2a2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 9 Feb 2024 14:20:46 -0800 Subject: KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC code Move incrementing and decrementing of kvm_has_noapic_vcpu into kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug where KVM fails to decrement the count if vCPU creation ultimately fails, e.g. due to a memory allocation failing. Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize lapic_in_kernel() checks, and that optimization is quite dubious. That, and practically speaking no setup that cares at all about performance runs with a userspace local APIC. Reported-by: Li RongQing Cc: Maxim Levitsky Reviewed-by: Xu Yilun Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++++++++- arch/x86/kvm/x86.c | 29 +++-------------------------- 2 files changed, 29 insertions(+), 27 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 3242f3da2457..681f6d82d015 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -124,6 +124,9 @@ static inline int __apic_test_and_clear_vector(int vec, void *bitmap) return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); } +__read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu); +EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu); + __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ); __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ); @@ -2466,8 +2469,10 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; - if (!vcpu->arch.apic) + if (!vcpu->arch.apic) { + static_branch_dec(&kvm_has_noapic_vcpu); return; + } hrtimer_cancel(&apic->lapic_timer.timer); @@ -2809,6 +2814,11 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns) ASSERT(vcpu != NULL); + if (!irqchip_in_kernel(vcpu->kvm)) { + static_branch_inc(&kvm_has_noapic_vcpu); + return 0; + } + apic = kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT); if (!apic) goto nomem; @@ -2844,6 +2854,21 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns) static_branch_inc(&apic_sw_disabled.key); /* sw disabled at reset */ kvm_iodevice_init(&apic->dev, &apic_mmio_ops); + /* + * Defer evaluating inhibits until the vCPU is first run, as this vCPU + * will not get notified of any changes until this vCPU is visible to + * other vCPUs (marked online and added to the set of vCPUs). + * + * Opportunistically mark APICv active as VMX in particularly is highly + * unlikely to have inhibits. Ignore the current per-VM APICv state so + * that vCPU creation is guaranteed to run with a deterministic value, + * the request will ensure the vCPU gets the correct state before VM-Entry. + */ + if (enable_apicv) { + apic->apicv_active = true; + kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu); + } + return 0; nomem_free_apic: kfree(apic); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d9724a35d8f3..19b18679779c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12046,27 +12046,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) if (r < 0) return r; - if (irqchip_in_kernel(vcpu->kvm)) { - r = kvm_create_lapic(vcpu, lapic_timer_advance_ns); - if (r < 0) - goto fail_mmu_destroy; - - /* - * Defer evaluating inhibits until the vCPU is first run, as - * this vCPU will not get notified of any changes until this - * vCPU is visible to other vCPUs (marked online and added to - * the set of vCPUs). Opportunistically mark APICv active as - * VMX in particularly is highly unlikely to have inhibits. - * Ignore the current per-VM APICv state so that vCPU creation - * is guaranteed to run with a deterministic value, the request - * will ensure the vCPU gets the correct state before VM-Entry. - */ - if (enable_apicv) { - vcpu->arch.apic->apicv_active = true; - kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu); - } - } else - static_branch_inc(&kvm_has_noapic_vcpu); + r = kvm_create_lapic(vcpu, lapic_timer_advance_ns); + if (r < 0) + goto fail_mmu_destroy; r = -ENOMEM; @@ -12187,8 +12169,6 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) srcu_read_unlock(&vcpu->kvm->srcu, idx); free_page((unsigned long)vcpu->arch.pio_data); kvfree(vcpu->arch.cpuid_entries); - if (!lapic_in_kernel(vcpu)) - static_branch_dec(&kvm_has_noapic_vcpu); } void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) @@ -12465,9 +12445,6 @@ bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) return (vcpu->arch.apic_base & MSR_IA32_APICBASE_BSP) != 0; } -__read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu); -EXPORT_SYMBOL_GPL(kvm_has_noapic_vcpu); - void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); -- cgit v1.2.3 From fc3c94142b3a4391cab94adde56fcfbea25723e5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 9 Feb 2024 14:20:47 -0800 Subject: KVM: x86: Sanity check that kvm_has_noapic_vcpu is zero at module_exit() WARN if kvm.ko is unloaded with an elevated kvm_has_noapic_vcpu to guard against incorrect management of the key, e.g. to detect if KVM fails to decrement the key in error paths. Because kvm_has_noapic_vcpu is purely an optimization, in all likelihood KVM could completely botch handling of kvm_has_noapic_vcpu and no one would notice (which is a good argument for deleting the key entirely, but that's a problem for another day). Note, ideally the sanity check would be performance when kvm_usage_count goes to zero, but adding an arch callback just for this sanity check isn't at all worth doing. Link: https://lore.kernel.org/r/20240209222047.394389-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 19b18679779c..019320580a84 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -13871,9 +13871,6 @@ module_init(kvm_x86_init); static void __exit kvm_x86_exit(void) { - /* - * If module_init() is implemented, module_exit() must also be - * implemented to allow module unload. - */ + WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu)); } module_exit(kvm_x86_exit); -- cgit v1.2.3 From 77bcd9e6231a5297ef417a7d7f734d61c2bcceb6 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 16:39:35 -0800 Subject: KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel Plumb in a dedicated hook for querying whether or not a vCPU was preempted in-kernel. Unlike literally every other architecture, x86's VMX can check if a vCPU is in kernel context if and only if the vCPU is loaded on the current pCPU. x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has had at least one bug where KVM incorrectly used a stale preempted_in_kernel. No functional change intended. Reviewed-by: Yuan Yao Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 5 +++++ include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 14 +++++++++++++- 3 files changed, 19 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 019320580a84..94346490f407 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -13054,6 +13054,11 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu) return false; } +bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu) +{ + return kvm_arch_vcpu_in_kernel(vcpu); +} + bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu) { if (READ_ONCE(vcpu->arch.pv.pv_unhalted)) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7e7fd25b09b3..28b020404a41 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1505,6 +1505,7 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu); bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); +bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 10bfc88a69f7..9b92858c8b72 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4042,6 +4042,18 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu) return false; } +/* + * By default, simply query the target vCPU's current mode when checking if a + * vCPU was preempted in kernel mode. All architectures except x86 (or more + * specifical, except VMX) allow querying whether or not a vCPU is in kernel + * mode even if the vCPU is NOT loaded, i.e. using kvm_arch_vcpu_in_kernel() + * directly for cross-vCPU checks is functionally correct and accurate. + */ +bool __weak kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu) +{ + return kvm_arch_vcpu_in_kernel(vcpu); +} + bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu) { return false; @@ -4080,7 +4092,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode) continue; if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode && !kvm_arch_dy_has_pending_interrupt(vcpu) && - !kvm_arch_vcpu_in_kernel(vcpu)) + !kvm_arch_vcpu_preempted_in_kernel(vcpu)) continue; if (!kvm_vcpu_eligible_for_directed_yield(vcpu)) continue; -- cgit v1.2.3 From 9b8615c5d37fca15b330882bafceaf24f2398352 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 16:39:36 -0800 Subject: KVM: x86: Rely solely on preempted_in_kernel flag for directed yield Snapshot preempted_in_kernel using kvm_arch_vcpu_in_kernel() so that the flag is "accurate" (or rather, consistent and deterministic within KVM) for guests with protected state, and explicitly use preempted_in_kernel when checking if a vCPU was preempted in kernel mode instead of bouncing through kvm_arch_vcpu_in_kernel(). Drop the gnarly logic in kvm_arch_vcpu_in_kernel() that redirects to preempted_in_kernel if the target vCPU is not the "running", i.e. loaded, vCPU, as the only reason that code existed was for the directed yield case where KVM wants to check the CPL of a vCPU that may or may not be loaded on the current pCPU. Cc: Like Xu Reviewed-by: Yuan Yao Link: https://lore.kernel.org/r/20240110003938.490206-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 94346490f407..ab7d2baaae7a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5059,8 +5059,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) int idx; if (vcpu->preempted) { - if (!vcpu->arch.guest_state_protected) - vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu); + vcpu->arch.preempted_in_kernel = kvm_arch_vcpu_in_kernel(vcpu); /* * Take the srcu lock as memslots will be accessed to check the gfn @@ -13056,7 +13055,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu) bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu) { - return kvm_arch_vcpu_in_kernel(vcpu); + return vcpu->arch.preempted_in_kernel; } bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu) @@ -13079,9 +13078,6 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu) if (vcpu->arch.guest_state_protected) return true; - if (vcpu != kvm_get_running_vcpu()) - return vcpu->arch.preempted_in_kernel; - return static_call(kvm_x86_get_cpl)(vcpu) == 0; } -- cgit v1.2.3 From 322d79f1db4b033844dcb2de43abd570abbd04b4 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Jan 2024 16:39:37 -0800 Subject: KVM: x86: Clean up directed yield API for "has pending interrupt" Directly return the boolean result of whether or not a vCPU has a pending interrupt instead of effectively doing: if (true) return true; return false; Reviewed-by: Yuan Yao Link: https://lore.kernel.org/r/20240110003938.490206-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ab7d2baaae7a..e6283628fd53 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -13046,11 +13046,8 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu) { - if (kvm_vcpu_apicv_active(vcpu) && - static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu)) - return true; - - return false; + return kvm_vcpu_apicv_active(vcpu) && + static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu); } bool kvm_arch_vcpu_preempted_in_kernel(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From 8ca983631f3c4ba16ac70d3310a31316e06f9e36 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:41 -0800 Subject: KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity Zap invalidated TDP MMU roots at maximum granularity, i.e. with more frequent conditional resched checkpoints, in order to avoid running for an extended duration (milliseconds, or worse) without honoring a reschedule request. And for kernels running with full or real-time preempt models, zapping at 4KiB granularity also provides significantly reduced latency for other tasks that are contending for mmu_lock (which isn't necessarily an overall win for KVM, but KVM should do its best to honor the kernel's preemption model). To keep KVM's assertion that zapping at 1GiB granularity is functionally ok, which is the main reason 1GiB was selected in the past, skip straight to zapping at 1GiB if KVM is configured to prove the MMU. Zapping roots is far more common than a vCPU replacing a 1GiB page table with a hugepage, e.g. generally happens multiple times during boot, and so keeping the test coverage provided by root zaps is desirable, just not for production. Cc: David Matlack Cc: Pattara Teerapong Link: https://lore.kernel.org/r/20240111020048.844847-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6ae19b4ee5b1..372da098d3ce 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -734,15 +734,26 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, rcu_read_lock(); /* - * To avoid RCU stalls due to recursively removing huge swaths of SPs, - * split the zap into two passes. On the first pass, zap at the 1gb - * level, and then zap top-level SPs on the second pass. "1gb" is not - * arbitrary, as KVM must be able to zap a 1gb shadow page without - * inducing a stall to allow in-place replacement with a 1gb hugepage. + * Zap roots in multiple passes of decreasing granularity, i.e. zap at + * 4KiB=>2MiB=>1GiB=>root, in order to better honor need_resched() (all + * preempt models) or mmu_lock contention (full or real-time models). + * Zapping at finer granularity marginally increases the total time of + * the zap, but in most cases the zap itself isn't latency sensitive. * - * Because zapping a SP recurses on its children, stepping down to - * PG_LEVEL_4K in the iterator itself is unnecessary. + * If KVM is configured to prove the MMU, skip the 4KiB and 2MiB zaps + * in order to mimic the page fault path, which can replace a 1GiB page + * table with an equivalent 1GiB hugepage, i.e. can get saddled with + * zapping a 1GiB region that's fully populated with 4KiB SPTEs. This + * allows verifying that KVM can safely zap 1GiB regions, e.g. without + * inducing RCU stalls, without relying on a relatively rare event + * (zapping roots is orders of magnitude more common). Note, because + * zapping a SP recurses on its children, stepping down to PG_LEVEL_4K + * in the iterator itself is unnecessary. */ + if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU)) { + __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_4K); + __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_2M); + } __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G); __tdp_mmu_zap_root(kvm, root, shared, root->role.level); -- cgit v1.2.3 From fcdffe97f80e6fb488f6b5c6bd38f6cd899944ab Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:42 -0800 Subject: KVM: x86/mmu: Don't do TLB flush when zappings SPTEs in invalid roots Don't force a TLB flush when zapping SPTEs in invalid roots as vCPUs can't be actively using invalid roots (zapping SPTEs in invalid roots is necessary only to ensure KVM doesn't mark a page accessed/dirty after it is freed by the primary MMU). Link: https://lore.kernel.org/r/20240111020048.844847-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 372da098d3ce..68920877370b 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -811,7 +811,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, continue; tdp_mmu_iter_set_spte(kvm, &iter, 0); - flush = true; + + /* + * Zappings SPTEs in invalid roots doesn't require a TLB flush, + * see kvm_tdp_mmu_zap_invalidated_roots() for details. + */ + if (!root->role.invalid) + flush = true; } rcu_read_unlock(); -- cgit v1.2.3 From 6577f1efdff443277b19c0fbe4b933404e7c84e6 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:43 -0800 Subject: KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators Modify for_each_tdp_mmu_root() and __for_each_tdp_mmu_root_yield_safe() to accept -1 for _as_id to mean "process all memslot address spaces". That way code that wants to process both SMM and !SMM doesn't need to iterate over roots twice (and likely copy+paste code in the process). Deliberately don't cast _as_id to an "int", just in case not casting helps the compiler elide the "_as_id >=0" check when being passed an unsigned value, e.g. from a memslot. No functional change intended. Link: https://lore.kernel.org/r/20240111020048.844847-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 68920877370b..60fff2aad59e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -149,11 +149,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * If shared is set, this function is operating under the MMU lock in read * mode. */ -#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid)\ - for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \ - ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \ - _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \ - if (kvm_mmu_page_as_id(_root) != _as_id) { \ +#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \ + for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \ + ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \ + _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \ + if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \ } else #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \ @@ -171,10 +171,10 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * Holding mmu_lock for write obviates the need for RCU protection as the list * is guaranteed to be stable. */ -#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ - list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \ - if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \ - kvm_mmu_page_as_id(_root) != _as_id) { \ +#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \ + if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \ + _as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \ } else static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From 99b85fda91b164b91a0d4e0aae376f32dc38d59c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:44 -0800 Subject: KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range When zapping a GFN in response to an APICv or MTRR change, don't zap SPTEs for invalid roots as KVM only needs to ensure the guest can't use stale mappings for the GFN. Unlike kvm_tdp_mmu_unmap_gfn_range(), which must zap "unreachable" SPTEs to ensure KVM doesn't mark a page accessed/dirty, kvm_tdp_mmu_zap_leafs() isn't used (and isn't intended to be used) to handle freeing of host memory. Link: https://lore.kernel.org/r/20240111020048.844847-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 60fff2aad59e..1a9c16e5c287 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -830,16 +830,16 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, } /* - * Zap leaf SPTEs for the range of gfns, [start, end), for all roots. Returns - * true if a TLB flush is needed before releasing the MMU lock, i.e. if one or - * more SPTEs were zapped since the MMU lock was last acquired. + * Zap leaf SPTEs for the range of gfns, [start, end), for all *VALID** roots. + * Returns true if a TLB flush is needed before releasing the MMU lock, i.e. if + * one or more SPTEs were zapped since the MMU lock was last acquired. */ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush) { struct kvm_mmu_page *root; lockdep_assert_held_write(&kvm->mmu_lock); - for_each_tdp_mmu_root_yield_safe(kvm, root) + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1) flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush); return flush; -- cgit v1.2.3 From d746182337c205660fd4d8eaa5fdc4f4e8320b9a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:45 -0800 Subject: KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs When write-protecting SPTEs, don't process invalid roots as invalid roots are unreachable, i.e. can't be used to access guest memory and thus don't need to be write-protected. Note, this is *almost* a nop for kvm_tdp_mmu_clear_dirty_pt_masked(), which is called under slots_lock, i.e. is mutually exclusive with kvm_mmu_zap_all_fast(). But it's possible for something other than the "fast zap" thread to grab a reference to an invalid root and thus keep a root alive (but completely empty) after kvm_mmu_zap_all_fast() completes. The kvm_tdp_mmu_write_protect_gfn() case is more interesting as KVM write- protects SPTEs for reasons other than dirty logging, e.g. if a KVM creates a SPTE for a nested VM while a fast zap is in-progress. Add another TDP MMU iterator to visit only valid roots, and opportunistically convert kvm_tdp_mmu_get_vcpu_root_hpa() to said iterator. Link: https://lore.kernel.org/r/20240111020048.844847-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1a9c16e5c287..e0a8343f66dc 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -171,12 +171,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * Holding mmu_lock for write obviates the need for RCU protection as the list * is guaranteed to be stable. */ -#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ +#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \ - _as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \ + ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \ + ((_only_valid) && (_root)->role.invalid))) { \ } else +#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ + __for_each_tdp_mmu_root(_kvm, _root, _as_id, false) + +#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \ + __for_each_tdp_mmu_root(_kvm, _root, _as_id, true) + static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu) { struct kvm_mmu_page *sp; @@ -224,11 +231,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) lockdep_assert_held_write(&kvm->mmu_lock); - /* - * Check for an existing root before allocating a new one. Note, the - * role check prevents consuming an invalid root. - */ - for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { + /* Check for an existing root before allocating a new one. */ + for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { if (root->role.word == role.word && kvm_tdp_mmu_get_root(root)) goto out; @@ -1639,7 +1643,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, { struct kvm_mmu_page *root; - for_each_tdp_mmu_root(kvm, root, slot->as_id) + for_each_valid_tdp_mmu_root(kvm, root, slot->as_id) clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot); } @@ -1757,7 +1761,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm, bool spte_set = false; lockdep_assert_held_write(&kvm->mmu_lock); - for_each_tdp_mmu_root(kvm, root, slot->as_id) + for_each_valid_tdp_mmu_root(kvm, root, slot->as_id) spte_set |= write_protect_gfn(kvm, root, gfn, min_level); return spte_set; -- cgit v1.2.3 From f5238c2a60f1e0eb48ce21037bce6f4781afa37f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:46 -0800 Subject: KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read When allocating a new TDP MMU root, check for a usable root while holding mmu_lock for read and only acquire mmu_lock for write if a new root needs to be created. There is no need to serialize other MMU operations if a vCPU is simply grabbing a reference to an existing root, holding mmu_lock for write is "necessary" (spoiler alert, it's not strictly necessary) only to ensure KVM doesn't end up with duplicate roots. Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and to setups that frequently delete memslots, i.e. which force all vCPUs to reload all roots. Link: https://lore.kernel.org/r/20240111020048.844847-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 8 +++---- arch/x86/kvm/mmu/tdp_mmu.c | 60 ++++++++++++++++++++++++++++++++++++++-------- arch/x86/kvm/mmu/tdp_mmu.h | 2 +- 3 files changed, 55 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 3c193b096b45..c9150080fbd2 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3693,15 +3693,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) unsigned i; int r; + if (tdp_mmu_enabled) + return kvm_tdp_mmu_alloc_root(vcpu); + write_lock(&vcpu->kvm->mmu_lock); r = make_mmu_pages_available(vcpu); if (r < 0) goto out_unlock; - if (tdp_mmu_enabled) { - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); - mmu->root.hpa = root; - } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { + if (shadow_root_level >= PT64_ROOT_4LEVEL) { root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); mmu->root.hpa = root; } else if (shadow_root_level == PT32E_ROOT_LEVEL) { diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e0a8343f66dc..9a8250a14fc1 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -223,21 +223,52 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp, tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role); } -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) +static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu) { union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; + int as_id = kvm_mmu_role_as_id(role); struct kvm *kvm = vcpu->kvm; struct kvm_mmu_page *root; - lockdep_assert_held_write(&kvm->mmu_lock); - - /* Check for an existing root before allocating a new one. */ - for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { - if (root->role.word == role.word && - kvm_tdp_mmu_get_root(root)) - goto out; + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) { + if (root->role.word == role.word) + return root; } + return NULL; +} + +int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu *mmu = vcpu->arch.mmu; + union kvm_mmu_page_role role = mmu->root_role; + struct kvm *kvm = vcpu->kvm; + struct kvm_mmu_page *root; + + /* + * Check for an existing root while holding mmu_lock for read to avoid + * unnecessary serialization if multiple vCPUs are loading a new root. + * E.g. when bringing up secondary vCPUs, KVM will already have created + * a valid root on behalf of the primary vCPU. + */ + read_lock(&kvm->mmu_lock); + root = kvm_tdp_mmu_try_get_root(vcpu); + read_unlock(&kvm->mmu_lock); + + if (root) + goto out; + + write_lock(&kvm->mmu_lock); + + /* + * Recheck for an existing root after acquiring mmu_lock for write. It + * is possible a new usable root was created between dropping mmu_lock + * (for read) and acquiring it for write. + */ + root = kvm_tdp_mmu_try_get_root(vcpu); + if (root) + goto out_unlock; + root = tdp_mmu_alloc_sp(vcpu); tdp_mmu_init_sp(root, NULL, 0, role); @@ -254,8 +285,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots); spin_unlock(&kvm->arch.tdp_mmu_pages_lock); +out_unlock: + write_unlock(&kvm->mmu_lock); out: - return __pa(root->spt); + /* + * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest + * and actually consuming the root if it's invalidated after dropping + * mmu_lock, and the root can't be freed as this vCPU holds a reference. + */ + mmu->root.hpa = __pa(root->spt); + mmu->root.pgd = 0; + return 0; } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, @@ -917,7 +957,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) * the VM is being destroyed). * * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference. - * See kvm_tdp_mmu_get_vcpu_root_hpa(). + * See kvm_tdp_mmu_alloc_root(). */ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm) { diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 20d97aa46c49..6e1ea04ca885 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -10,7 +10,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm); void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm); -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); +int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu); __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) { -- cgit v1.2.3 From dab285e4ec736d964cfa6c7fd6eebd22666b5ebc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:47 -0800 Subject: KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read Allocate TDP MMU roots while holding mmu_lock for read, and instead use tdp_mmu_pages_lock to guard against duplicate roots. This allows KVM to create new roots without forcing kvm_tdp_mmu_zap_invalidated_roots() to yield, e.g. allows vCPUs to load new roots after memslot deletion without forcing the zap thread to detect contention and yield (or complete if the kernel isn't preemptible). Note, creating a new TDP MMU root as an mmu_lock reader is safe for two reasons: (1) paths that must guarantee all roots/SPTEs are *visited* take mmu_lock for write and so are still mutually exclusive, e.g. mmu_notifier invalidations, and (2) paths that require all roots/SPTEs to *observe* some given state without holding mmu_lock for write must ensure freshness through some other means, e.g. toggling dirty logging must first wait for SRCU readers to recognize the memslot flags change before processing existing roots/SPTEs. Link: https://lore.kernel.org/r/20240111020048.844847-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 55 +++++++++++++++++++--------------------------- 1 file changed, 22 insertions(+), 33 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 9a8250a14fc1..d078157e62aa 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -223,51 +223,42 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp, tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role); } -static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu) -{ - union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; - int as_id = kvm_mmu_role_as_id(role); - struct kvm *kvm = vcpu->kvm; - struct kvm_mmu_page *root; - - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) { - if (root->role.word == role.word) - return root; - } - - return NULL; -} - int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; union kvm_mmu_page_role role = mmu->root_role; + int as_id = kvm_mmu_role_as_id(role); struct kvm *kvm = vcpu->kvm; struct kvm_mmu_page *root; /* - * Check for an existing root while holding mmu_lock for read to avoid + * Check for an existing root before acquiring the pages lock to avoid * unnecessary serialization if multiple vCPUs are loading a new root. * E.g. when bringing up secondary vCPUs, KVM will already have created * a valid root on behalf of the primary vCPU. */ read_lock(&kvm->mmu_lock); - root = kvm_tdp_mmu_try_get_root(vcpu); - read_unlock(&kvm->mmu_lock); - if (root) - goto out; + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) { + if (root->role.word == role.word) + goto out_read_unlock; + } - write_lock(&kvm->mmu_lock); + spin_lock(&kvm->arch.tdp_mmu_pages_lock); /* - * Recheck for an existing root after acquiring mmu_lock for write. It - * is possible a new usable root was created between dropping mmu_lock - * (for read) and acquiring it for write. + * Recheck for an existing root after acquiring the pages lock, another + * vCPU may have raced ahead and created a new usable root. Manually + * walk the list of roots as the standard macros assume that the pages + * lock is *not* held. WARN if grabbing a reference to a usable root + * fails, as the last reference to a root can only be put *after* the + * root has been invalidated, which requires holding mmu_lock for write. */ - root = kvm_tdp_mmu_try_get_root(vcpu); - if (root) - goto out_unlock; + list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) { + if (root->role.word == role.word && + !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) + goto out_spin_unlock; + } root = tdp_mmu_alloc_sp(vcpu); tdp_mmu_init_sp(root, NULL, 0, role); @@ -280,14 +271,12 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots(). */ refcount_set(&root->tdp_mmu_root_count, 2); - - spin_lock(&kvm->arch.tdp_mmu_pages_lock); list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots); - spin_unlock(&kvm->arch.tdp_mmu_pages_lock); -out_unlock: - write_unlock(&kvm->mmu_lock); -out: +out_spin_unlock: + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); +out_read_unlock: + read_unlock(&kvm->mmu_lock); /* * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest * and actually consuming the root if it's invalidated after dropping -- cgit v1.2.3 From 576a15de8d299d9d225b86504547ff6498bc2eeb Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 10 Jan 2024 18:00:48 -0800 Subject: KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read Free TDP MMU roots from vCPU context while holding mmu_lock for read, it is completely legal to invoke kvm_tdp_mmu_put_root() as a reader. This eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated roots, free obsolete roots, and allocate new roots in parallel. On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap" operation to run in parallel with freeing and allocating roots reduces the worst case latency for a vCPU to reload a root from 2-3ms to <100us. Link: https://lore.kernel.org/r/20240111020048.844847-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c9150080fbd2..e5e2af69e24d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3575,10 +3575,14 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, if (WARN_ON_ONCE(!sp)) return; - if (is_tdp_mmu_page(sp)) + if (is_tdp_mmu_page(sp)) { + lockdep_assert_held_read(&kvm->mmu_lock); kvm_tdp_mmu_put_root(kvm, sp); - else if (!--sp->root_count && sp->role.invalid) - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + } else { + lockdep_assert_held_write(&kvm->mmu_lock); + if (!--sp->root_count && sp->role.invalid) + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + } *root_hpa = INVALID_PAGE; } @@ -3587,6 +3591,7 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, ulong roots_to_free) { + bool is_tdp_mmu = tdp_mmu_enabled && mmu->root_role.direct; int i; LIST_HEAD(invalid_list); bool free_active_root; @@ -3609,7 +3614,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, return; } - write_lock(&kvm->mmu_lock); + if (is_tdp_mmu) + read_lock(&kvm->mmu_lock); + else + write_lock(&kvm->mmu_lock); for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) @@ -3635,8 +3643,13 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, mmu->root.pgd = 0; } - kvm_mmu_commit_zap_page(kvm, &invalid_list); - write_unlock(&kvm->mmu_lock); + if (is_tdp_mmu) { + read_unlock(&kvm->mmu_lock); + WARN_ON_ONCE(!list_empty(&invalid_list)); + } else { + kvm_mmu_commit_zap_page(kvm, &invalid_list); + write_unlock(&kvm->mmu_lock); + } } EXPORT_SYMBOL_GPL(kvm_mmu_free_roots); -- cgit v1.2.3 From 0cbca1bf44a0b8666c91ce3438f235c6fe70fbf1 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Fri, 23 Feb 2024 05:17:53 -0500 Subject: x86: irq: unconditionally define KVM interrupt vectors Unlike arch/x86/kernel/idt.c, FRED support chose to remove the #ifdefs from the .c files and concentrate them in the headers, where unused handlers are #define'd to NULL. However, the constants for KVM's 3 posted interrupt vectors are still defined conditionally in irq_vectors.h. In the tree that FRED support was developed on, this is innocuous because CONFIG_HAVE_KVM was effectively always set. With the cleanups that recently went into the KVM tree to remove CONFIG_HAVE_KVM, the conditional became IS_ENABLED(CONFIG_KVM). This causes a linux-next compilation failure in FRED code, when CONFIG_KVM=n. In preparation for the merging of FRED in Linux 6.9, define the interrupt vector numbers unconditionally. Cc: x86@kernel.org Reported-by: Stephen Rothwell Suggested-by: Xin Li (Intel) Suggested-by: Thomas Gleixner Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/irq_vectors.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 3f73ac3ed3a0..d18bfb238f66 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -84,11 +84,9 @@ #define HYPERVISOR_CALLBACK_VECTOR 0xf3 /* Vector for KVM to deliver posted interrupt IPI */ -#if IS_ENABLED(CONFIG_KVM) #define POSTED_INTR_VECTOR 0xf2 #define POSTED_INTR_WAKEUP_VECTOR 0xf1 #define POSTED_INTR_NESTED_VECTOR 0xf0 -#endif #define MANAGED_IRQ_SHUTDOWN_VECTOR 0xef -- cgit v1.2.3 From 284851ee5caef1b42b513752bf1642ce4570bdc1 Mon Sep 17 00:00:00 2001 From: Oliver Upton Date: Fri, 16 Feb 2024 15:59:41 +0000 Subject: KVM: Get rid of return value from kvm_arch_create_vm_debugfs() The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Acked-by: Marc Zyngier Acked-by: Paolo Bonzini Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton --- arch/powerpc/kvm/powerpc.c | 3 +-- arch/x86/kvm/debugfs.c | 3 +-- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 8 ++------ 4 files changed, 5 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 23407fbd73c9..d32abe7fe6ab 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -2538,9 +2538,8 @@ void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_ vcpu->kvm->arch.kvm_ops->create_vcpu_debugfs(vcpu, debugfs_dentry); } -int kvm_arch_create_vm_debugfs(struct kvm *kvm) +void kvm_arch_create_vm_debugfs(struct kvm *kvm) { if (kvm->arch.kvm_ops->create_vm_debugfs) kvm->arch.kvm_ops->create_vm_debugfs(kvm); - return 0; } diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c index 95ea1a1f7403..999227fc7c66 100644 --- a/arch/x86/kvm/debugfs.c +++ b/arch/x86/kvm/debugfs.c @@ -189,9 +189,8 @@ static const struct file_operations mmu_rmaps_stat_fops = { .release = kvm_mmu_rmaps_stat_release, }; -int kvm_arch_create_vm_debugfs(struct kvm *kvm) +void kvm_arch_create_vm_debugfs(struct kvm *kvm) { debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm, &mmu_rmaps_stat_fops); - return 0; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7e7fd25b09b3..9a45f673f687 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1507,7 +1507,7 @@ bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu); bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); -int kvm_arch_create_vm_debugfs(struct kvm *kvm); +void kvm_arch_create_vm_debugfs(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 10bfc88a69f7..c681149c382a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1150,10 +1150,7 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname) &stat_fops_per_vm); } - ret = kvm_arch_create_vm_debugfs(kvm); - if (ret) - goto out_err; - + kvm_arch_create_vm_debugfs(kvm); return 0; out_err: kvm_destroy_vm_debugfs(kvm); @@ -1183,9 +1180,8 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm) * Cleanup should be automatic done in kvm_destroy_vm_debugfs() recursively, so * a per-arch destroy interface is not needed. */ -int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm) +void __weak kvm_arch_create_vm_debugfs(struct kvm *kvm) { - return 0; } static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) -- cgit v1.2.3 From 812d432373f629eb8d6cb696ea6804fca1534efa Mon Sep 17 00:00:00 2001 From: Like Xu Date: Wed, 6 Dec 2023 11:20:54 +0800 Subject: KVM: x86/pmu: Explicitly check NMI from guest to reducee false positives Explicitly check that the source of external interrupt is indeed an NMI in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples (host samples labelled as guest samples) generated by perf/core NMI mode if an NMI arrives after VM-Exit, but before kvm_after_interrupt(): # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) # Symbol Overhead sys usr guest sys guest usr # ....................................... ........ ........ ........ ......... ......... # # Before: [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% # After: [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% In the above case, perf records the samples labelled '[g]', the RIPs behind the weird samples are actually being queried by perf_instruction_pointer() after determining whether it's in GUEST state or not, and here's the issue: If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest() will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). During this window, if a PMI occurs on host (since the KVM instructions on host are being executed), the control flow, with the help of the host NMI context, will be transferred to perf/core to generate performance samples, thus perf_instruction_pointer() and perf_guest_get_ip() is called. Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may cause perf/core to mistakenly assume that the source RIP of the host NMI belongs to the guest world and use perf_guest_get_ip() to get the RIP of a vCPU that has already exited by a non-NMI interrupt. Error samples are recorded and presented to the end-user via perf-report. Such false positive samples could be eliminated by explicitly determining if the exit reason is KVM_HANDLING_NMI. Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI is cleared, it's also still possible that another PMI is generated on host. Also for perf/core timer mode, the false positives are still possible since those non-NMI sources of interrupts are not always being used by perf/core. For events that are host-only, perf/core can and should eliminate false positives by checking event->attr.exclude_guest, i.e. events that are configured to exclude KVM guests should never fire in the guest. Events that are configured to count host and guest are trickier, perhaps impossible to handle with 100% accuracy? And regardless of what accuracy is provided by perf/core, improving KVM's accuracy is cheap and easy, with no real downsides. Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") Signed-off-by: Like Xu Reviewed-by: Maxim Levitsky Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com [sean: massage changelog, squash !!in_nmi() fixup from Like] Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 10 +++++++++- arch/x86/kvm/x86.h | 6 ------ 2 files changed, 9 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ad5319a503f0..672ea5e97478 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1880,8 +1880,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, } #endif /* CONFIG_HYPERV */ +enum kvm_intr_type { + /* Values are arbitrary, but must be non-zero. */ + KVM_HANDLING_IRQ = 1, + KVM_HANDLING_NMI, +}; + +/* Enable perf NMI and timer modes to work, and minimise false positives. */ #define kvm_arch_pmi_in_guest(vcpu) \ - ((vcpu) && (vcpu)->arch.handling_intr_from_guest) + ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ + (!!in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) void __init kvm_mmu_x86_module_init(void); int kvm_mmu_vendor_module_init(void); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 2f7e19166658..4dc38092d599 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm) return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED; } -enum kvm_intr_type { - /* Values are arbitrary, but must be non-zero. */ - KVM_HANDLING_IRQ = 1, - KVM_HANDLING_NMI, -}; - static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, enum kvm_intr_type intr) { -- cgit v1.2.3 From 8e24eeedfda34a920d35eda2f17a0f4b4473dcde Mon Sep 17 00:00:00 2001 From: Dongli Zhang Date: Fri, 23 Feb 2024 12:21:02 -0800 Subject: KVM: VMX: fix comment to add LBR to passthrough MSRs According to the is_valid_passthrough_msr(), the LBR MSRs are also passthrough MSRs, since the commit 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE"). Signed-off-by: Dongli Zhang Link: https://lore.kernel.org/r/20240223202104.3330974-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index d4e6625e0a9a..05319b74bd3d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -159,7 +159,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO); /* * List of MSRs that can be directly passed to the guest. - * In addition to these x2apic and PT MSRs are handled specially. + * In addition to these x2apic, PT and LBR MSRs are handled specially. */ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = { MSR_IA32_SPEC_CTRL, -- cgit v1.2.3 From bab22040d7fdeb935e57215a183912ba46eee7f0 Mon Sep 17 00:00:00 2001 From: Dongli Zhang Date: Fri, 23 Feb 2024 12:21:03 -0800 Subject: KVM: VMX: return early if msr_bitmap is not supported The vmx_msr_filter_changed() may directly/indirectly calls only vmx_enable_intercept_for_msr() or vmx_disable_intercept_for_msr(). Those two functions may exit immediately if !cpu_has_vmx_msr_bitmap(). vmx_msr_filter_changed() -> vmx_disable_intercept_for_msr() -> pt_update_intercept_for_msr() -> vmx_set_intercept_for_msr() -> vmx_enable_intercept_for_msr() -> vmx_disable_intercept_for_msr() Therefore, we exit early if !cpu_has_vmx_msr_bitmap(). Signed-off-by: Dongli Zhang Link: https://lore.kernel.org/r/20240223202104.3330974-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 05319b74bd3d..5a866d3c2bc8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4127,6 +4127,9 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu) struct vcpu_vmx *vmx = to_vmx(vcpu); u32 i; + if (!cpu_has_vmx_msr_bitmap()) + return; + /* * Redo intercept permissions for MSRs that KVM is passing through to * the guest. Disabling interception will check the new MSR filter and -- cgit v1.2.3 From a364c014a2c1ad6e011bc5fdb8afb9d4ba316956 Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Tue, 13 Feb 2024 11:23:40 -0800 Subject: kvm/x86: allocate the write-tracking metadata on-demand The write-track is used externally only by the gpu/drm/i915 driver. Currently, it is always enabled, if a kernel has been compiled with this driver. Enabling the write-track mechanism adds a two-byte overhead per page across all memory slots. It isn't significant for regular VMs. However in gVisor, where the entire process virtual address space is mapped into the VM, even with a 39-bit address space, the overhead amounts to 256MB. Rework the write-tracking mechanism to enable it on-demand in kvm_page_track_register_notifier. Here is Sean's comment about the locking scheme: The only potential hiccup would be if taking slots_arch_lock would deadlock, but it should be impossible for slots_arch_lock to be taken in any other path that involves VFIO and/or KVMGT *and* can be coincident. Except for kvm_arch_destroy_vm() (which deletes KVM's internal memslots), slots_arch_lock is taken only through KVM ioctls(), and the caller of kvm_page_track_register_notifier() *must* hold a reference to the VM. Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Zhenyu Wang Co-developed-by: Sean Christopherson Signed-off-by: Andrei Vagin Link: https://lore.kernel.org/r/20240213192340.2023366-1-avagin@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 9 ++++++ arch/x86/kvm/mmu/page_track.c | 68 +++++++++++++++++++++++++++++++++++++++-- 2 files changed, 75 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b5b2d0fde579..7d33a2605ad5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1466,6 +1466,15 @@ struct kvm_arch { */ bool shadow_root_allocated; +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING + /* + * If set, the VM has (or had) an external write tracking user, and + * thus all write tracking metadata has been allocated, even if KVM + * itself isn't using write tracking. + */ + bool external_write_tracking_enabled; +#endif + #if IS_ENABLED(CONFIG_HYPERV) hpa_t hv_root_tdp; spinlock_t hv_root_tdp_lock; diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c index c87da11f3a04..f6448284c18e 100644 --- a/arch/x86/kvm/mmu/page_track.c +++ b/arch/x86/kvm/mmu/page_track.c @@ -20,10 +20,23 @@ #include "mmu_internal.h" #include "page_track.h" +static bool kvm_external_write_tracking_enabled(struct kvm *kvm) +{ +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING + /* + * Read external_write_tracking_enabled before related pointers. Pairs + * with the smp_store_release in kvm_page_track_write_tracking_enable(). + */ + return smp_load_acquire(&kvm->arch.external_write_tracking_enabled); +#else + return false; +#endif +} + bool kvm_page_track_write_tracking_enabled(struct kvm *kvm) { - return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || - !tdp_enabled || kvm_shadow_root_allocated(kvm); + return kvm_external_write_tracking_enabled(kvm) || + kvm_shadow_root_allocated(kvm) || !tdp_enabled; } void kvm_page_track_free_memslot(struct kvm_memory_slot *slot) @@ -153,6 +166,50 @@ int kvm_page_track_init(struct kvm *kvm) return init_srcu_struct(&head->track_srcu); } +static int kvm_enable_external_write_tracking(struct kvm *kvm) +{ + struct kvm_memslots *slots; + struct kvm_memory_slot *slot; + int r = 0, i, bkt; + + mutex_lock(&kvm->slots_arch_lock); + + /* + * Check for *any* write tracking user (not just external users) under + * lock. This avoids unnecessary work, e.g. if KVM itself is using + * write tracking, or if two external users raced when registering. + */ + if (kvm_page_track_write_tracking_enabled(kvm)) + goto out_success; + + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { + slots = __kvm_memslots(kvm, i); + kvm_for_each_memslot(slot, bkt, slots) { + /* + * Intentionally do NOT free allocations on failure to + * avoid having to track which allocations were made + * now versus when the memslot was created. The + * metadata is guaranteed to be freed when the slot is + * freed, and will be kept/used if userspace retries + * the failed ioctl() instead of killing the VM. + */ + r = kvm_page_track_write_tracking_alloc(slot); + if (r) + goto out_unlock; + } + } + +out_success: + /* + * Ensure that external_write_tracking_enabled becomes true strictly + * after all the related pointers are set. + */ + smp_store_release(&kvm->arch.external_write_tracking_enabled, true); +out_unlock: + mutex_unlock(&kvm->slots_arch_lock); + return r; +} + /* * register the notifier so that event interception for the tracked guest * pages can be received. @@ -161,10 +218,17 @@ int kvm_page_track_register_notifier(struct kvm *kvm, struct kvm_page_track_notifier_node *n) { struct kvm_page_track_notifier_head *head; + int r; if (!kvm || kvm->mm != current->mm) return -ESRCH; + if (!kvm_external_write_tracking_enabled(kvm)) { + r = kvm_enable_external_write_tracking(kvm); + if (r) + return r; + } + kvm_get_kvm(kvm); head = &kvm->arch.track_notifier_head; -- cgit v1.2.3 From 78ccfce774435a08d9c69ce434099166cc7952c8 Mon Sep 17 00:00:00 2001 From: John Allen Date: Tue, 27 Feb 2024 20:03:56 +0000 Subject: KVM: SVM: Rename vmplX_ssp -> plX_ssp The SSP fields in the SEV-ES save area were mistakenly named vmplX_ssp instead of plX_ssp. Rename these to the correct names as defined in the APM. Fixes: 6d3b3d34e39e ("KVM: SVM: Update the SEV-ES save area mapping") Signed-off-by: John Allen Link: https://lore.kernel.org/r/20240227200356.35114-1-john.allen@amd.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 87a7b917d30e..728c98175b9c 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -358,10 +358,10 @@ struct sev_es_save_area { struct vmcb_seg ldtr; struct vmcb_seg idtr; struct vmcb_seg tr; - u64 vmpl0_ssp; - u64 vmpl1_ssp; - u64 vmpl2_ssp; - u64 vmpl3_ssp; + u64 pl0_ssp; + u64 pl1_ssp; + u64 pl2_ssp; + u64 pl3_ssp; u64 u_cet; u8 reserved_0xc8[2]; u8 vmpl; -- cgit v1.2.3 From 259720c37d51aae21f70060ef96e1f1b08df0652 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 23 Feb 2024 12:21:04 -0800 Subject: KVM: VMX: Combine "check" and "get" APIs for passthrough MSR lookups Combine possible_passthrough_msr_slot() and is_valid_passthrough_msr() into a single function, vmx_get_passthrough_msr_slot(), and have the combined helper return the slot on success, using a negative value to indicate "failure". Combining the operations avoids iterating over the array of passthrough MSRs twice for relevant MSRs. Suggested-by: Dongli Zhang Reviewed-by: Dongli Zhang Link: https://lore.kernel.org/r/20240223202104.3330974-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 65 ++++++++++++++++++++------------------------------ 1 file changed, 26 insertions(+), 39 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5a866d3c2bc8..7e7d044e0669 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -658,25 +658,14 @@ static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu) return flexpriority_enabled && lapic_in_kernel(vcpu); } -static int possible_passthrough_msr_slot(u32 msr) +static int vmx_get_passthrough_msr_slot(u32 msr) { - u32 i; - - for (i = 0; i < ARRAY_SIZE(vmx_possible_passthrough_msrs); i++) - if (vmx_possible_passthrough_msrs[i] == msr) - return i; - - return -ENOENT; -} - -static bool is_valid_passthrough_msr(u32 msr) -{ - bool r; + int i; switch (msr) { case 0x800 ... 0x8ff: /* x2APIC MSRs. These are handled in vmx_update_msr_bitmap_x2apic() */ - return true; + return -ENOENT; case MSR_IA32_RTIT_STATUS: case MSR_IA32_RTIT_OUTPUT_BASE: case MSR_IA32_RTIT_OUTPUT_MASK: @@ -691,14 +680,16 @@ static bool is_valid_passthrough_msr(u32 msr) case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8: case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8: /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */ - return true; + return -ENOENT; } - r = possible_passthrough_msr_slot(msr) != -ENOENT; - - WARN(!r, "Invalid MSR %x, please adapt vmx_possible_passthrough_msrs[]", msr); + for (i = 0; i < ARRAY_SIZE(vmx_possible_passthrough_msrs); i++) { + if (vmx_possible_passthrough_msrs[i] == msr) + return i; + } - return r; + WARN(1, "Invalid MSR %x, please adapt vmx_possible_passthrough_msrs[]", msr); + return -ENOENT; } struct vmx_uret_msr *vmx_find_uret_msr(struct vcpu_vmx *vmx, u32 msr) @@ -3954,6 +3945,7 @@ void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type) { struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap; + int idx; if (!cpu_has_vmx_msr_bitmap()) return; @@ -3963,16 +3955,13 @@ void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type) /* * Mark the desired intercept state in shadow bitmap, this is needed * for resync when the MSR filters change. - */ - if (is_valid_passthrough_msr(msr)) { - int idx = possible_passthrough_msr_slot(msr); - - if (idx != -ENOENT) { - if (type & MSR_TYPE_R) - clear_bit(idx, vmx->shadow_msr_intercept.read); - if (type & MSR_TYPE_W) - clear_bit(idx, vmx->shadow_msr_intercept.write); - } + */ + idx = vmx_get_passthrough_msr_slot(msr); + if (idx >= 0) { + if (type & MSR_TYPE_R) + clear_bit(idx, vmx->shadow_msr_intercept.read); + if (type & MSR_TYPE_W) + clear_bit(idx, vmx->shadow_msr_intercept.write); } if ((type & MSR_TYPE_R) && @@ -3998,6 +3987,7 @@ void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type) { struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap; + int idx; if (!cpu_has_vmx_msr_bitmap()) return; @@ -4007,16 +3997,13 @@ void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type) /* * Mark the desired intercept state in shadow bitmap, this is needed * for resync when the MSR filter changes. - */ - if (is_valid_passthrough_msr(msr)) { - int idx = possible_passthrough_msr_slot(msr); - - if (idx != -ENOENT) { - if (type & MSR_TYPE_R) - set_bit(idx, vmx->shadow_msr_intercept.read); - if (type & MSR_TYPE_W) - set_bit(idx, vmx->shadow_msr_intercept.write); - } + */ + idx = vmx_get_passthrough_msr_slot(msr); + if (idx >= 0) { + if (type & MSR_TYPE_R) + set_bit(idx, vmx->shadow_msr_intercept.read); + if (type & MSR_TYPE_W) + set_bit(idx, vmx->shadow_msr_intercept.write); } if (type & MSR_TYPE_R) -- cgit v1.2.3 From 451a707813aee24b4a734f28d1d414be0360862b Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Tue, 27 Feb 2024 11:49:15 +0000 Subject: KVM: x86/xen: improve accuracy of Xen timers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A test program such as http://david.woodhou.se/timerlat.c confirms user reports that timers are increasingly inaccurate as the lifetime of a guest increases. Reporting the actual delay observed when asking for 100µs of sleep, it starts off OK on a newly-launched guest but gets worse over time, giving incorrect sleep times: root@ip-10-0-193-21:~# ./timerlat -c -n 5 00000000 latency 103243/100000 (3.2430%) 00000001 latency 103243/100000 (3.2430%) 00000002 latency 103242/100000 (3.2420%) 00000003 latency 103245/100000 (3.2450%) 00000004 latency 103245/100000 (3.2450%) The biggest problem is that get_kvmclock_ns() returns inaccurate values when the guest TSC is scaled. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). The guest then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day. It's non-trivial to fix get_kvmclock_ns(), although I'll come back to that. The actual guest hv_clock is per-CPU, and *theoretically* each vCPU could be running at a *different* frequency. But this patch is needed anyway because... The other issue with Xen timers was that the code would snapshot the host CLOCK_MONOTONIC at some point in time, and then... after a few interrupts may have occurred, some preemption perhaps... would also read the guest's kvmclock. Then it would proceed under the false assumption that those two happened at the *same* time. Any time which *actually* elapsed between reading the two clocks was introduced as inaccuracies in the time at which the timer fired. Fix it to use a variant of kvm_get_time_and_clockread(), which reads the host TSC just *once*, then use the returned TSC value to calculate the kvmclock (making sure to do that the way the guest would instead of making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use CLOCK_MONOTONIC. In practice the difference between the two won't matter over the timescales involved, as the *absolute* values don't matter; just the delta. This does mean a new variant of kvm_get_time_and_clockread() is needed; called kvm_get_monotonic_and_clockread() because that's what it does. Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org [sean: massage moved comment, tweak if statement formatting] Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 61 ++++++++++++++++++++++--- arch/x86/kvm/x86.h | 1 + arch/x86/kvm/xen.c | 130 ++++++++++++++++++++++++++++++++++++++--------------- 3 files changed, 152 insertions(+), 40 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2911e6383fef..89815a887e4d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2862,7 +2862,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, return v * clock->mult; } -static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp) +/* + * As with get_kvmclock_base_ns(), this counts from boot time, at the + * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot). + */ +static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp) { struct pvclock_gtod_data *gtod = &pvclock_gtod_data; unsigned long seq; @@ -2881,6 +2885,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp) return mode; } +/* + * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with + * no boot time offset. + */ +static int do_monotonic(s64 *t, u64 *tsc_timestamp) +{ + struct pvclock_gtod_data *gtod = &pvclock_gtod_data; + unsigned long seq; + int mode; + u64 ns; + + do { + seq = read_seqcount_begin(>od->seq); + ns = gtod->clock.base_cycles; + ns += vgettsc(>od->clock, tsc_timestamp, &mode); + ns >>= gtod->clock.shift; + ns += ktime_to_ns(gtod->clock.offset); + } while (unlikely(read_seqcount_retry(>od->seq, seq))); + *t = ns; + + return mode; +} + static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp) { struct pvclock_gtod_data *gtod = &pvclock_gtod_data; @@ -2902,18 +2929,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp) return mode; } -/* returns true if host is using TSC based clocksource */ +/* + * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and + * reports the TSC value from which it do so. Returns true if host is + * using TSC based clocksource. + */ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp) { /* checked again under seqlock below */ if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode)) return false; - return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns, - tsc_timestamp)); + return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns, + tsc_timestamp)); } -/* returns true if host is using TSC based clocksource */ +/* + * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did + * so. Returns true if host is using TSC based clocksource. + */ +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp) +{ + /* checked again under seqlock below */ + if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode)) + return false; + + return gtod_is_based_on_tsc(do_monotonic(kernel_ns, + tsc_timestamp)); +} + +/* + * Calculates CLOCK_REALTIME and reports the TSC value from which it did + * so. Returns true if host is using TSC based clocksource. + * + * DO NOT USE this for anything related to migration. You want CLOCK_TAI + * for that. + */ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts, u64 *tsc_timestamp) { diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 2f7e19166658..56b7a78f45bf 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -294,6 +294,7 @@ void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip); u64 get_kvmclock_ns(struct kvm *kvm); uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm); +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp); int kvm_read_guest_virt(struct kvm_vcpu *vcpu, gva_t addr, void *val, unsigned int bytes, diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 8a04e0ae9245..8a4c563e2398 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -24,6 +24,7 @@ #include #include +#include #include "cpuid.h" #include "trace.h" @@ -149,8 +150,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer) return HRTIMER_NORESTART; } -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns) +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, + bool linux_wa) { + int64_t kernel_now, delta; + uint64_t guest_now; + + /* + * The guest provides the requested timeout in absolute nanoseconds + * of the KVM clock — as *it* sees it, based on the scaled TSC and + * the pvclock information provided by KVM. + * + * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW + * so use CLOCK_MONOTONIC. In the timescales covered by timers, the + * difference won't matter much as there is no cumulative effect. + * + * Calculate the time for some arbitrary point in time around "now" + * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the + * delta between the kvmclock "now" value and the guest's requested + * timeout, apply the "Linux workaround" described below, and add + * the resulting delta to the CLOCK_MONOTONIC "now" value, to get + * the absolute CLOCK_MONOTONIC time at which the timer should + * fire. + */ + if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock && + static_cpu_has(X86_FEATURE_CONSTANT_TSC)) { + uint64_t host_tsc, guest_tsc; + + if (!IS_ENABLED(CONFIG_64BIT) || + !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) { + /* + * Don't fall back to get_kvmclock_ns() because it's + * broken; it has a systemic error in its results + * because it scales directly from host TSC to + * nanoseconds, and doesn't scale first to guest TSC + * and *then* to nanoseconds as the guest does. + * + * There is a small error introduced here because time + * continues to elapse between the ktime_get() and the + * subsequent rdtsc(). But not the systemic drift due + * to get_kvmclock_ns(). + */ + kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */ + host_tsc = rdtsc(); + } + + /* Calculate the guest kvmclock as the guest would do it. */ + guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc); + guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, + guest_tsc); + } else { + /* + * Without CONSTANT_TSC, get_kvmclock_ns() is the only option. + * + * Also if the guest PV clock hasn't been set up yet, as is + * likely to be the case during migration when the vCPU has + * not been run yet. It would be possible to calculate the + * scaling factors properly in that case but there's not much + * point in doing so. The get_kvmclock_ns() drift accumulates + * over time, so it's OK to use it at startup. Besides, on + * migration there's going to be a little bit of skew in the + * precise moment at which timers fire anyway. Often they'll + * be in the "past" by the time the VM is running again after + * migration. + */ + guest_now = get_kvmclock_ns(vcpu->kvm); + kernel_now = ktime_get(); + } + + delta = guest_abs - guest_now; + + /* + * Xen has a 'Linux workaround' in do_set_timer_op() which checks for + * negative absolute timeout values (caused by integer overflow), and + * for values about 13 days in the future (2^50ns) which would be + * caused by jiffies overflow. For those cases, Xen sets the timeout + * 100ms in the future (not *too* soon, since if a guest really did + * set a long timeout on purpose we don't want to keep churning CPU + * time by waking it up). Emulate Xen's workaround when starting the + * timer in response to __HYPERVISOR_set_timer_op. + */ + if (linux_wa && + unlikely((int64_t)guest_abs < 0 || + (delta > 0 && (uint32_t) (delta >> 50) != 0))) { + delta = 100 * NSEC_PER_MSEC; + guest_abs = guest_now + delta; + } + /* * Avoid races with the old timer firing. Checking timer_expires * to avoid calling hrtimer_cancel() will only have false positives @@ -162,14 +248,12 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ atomic_set(&vcpu->arch.xen.timer_pending, 0); vcpu->arch.xen.timer_expires = guest_abs; - if (delta_ns <= 0) { + if (delta <= 0) xen_timer_callback(&vcpu->arch.xen.timer); - } else { - ktime_t ktime_now = ktime_get(); + else hrtimer_start(&vcpu->arch.xen.timer, - ktime_add_ns(ktime_now, delta_ns), + ktime_add_ns(kernel_now, delta), HRTIMER_MODE_ABS_HARD); - } } static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu) @@ -997,9 +1081,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data) /* Start the timer if the new value has a valid vector+expiry. */ if (data->u.timer.port && data->u.timer.expires_ns) - kvm_xen_start_timer(vcpu, data->u.timer.expires_ns, - data->u.timer.expires_ns - - get_kvmclock_ns(vcpu->kvm)); + kvm_xen_start_timer(vcpu, data->u.timer.expires_ns, false); r = 0; break; @@ -1472,7 +1554,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd, { struct vcpu_set_singleshot_timer oneshot; struct x86_exception e; - s64 delta; if (!kvm_xen_timer_enabled(vcpu)) return false; @@ -1506,9 +1587,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd, return true; } - /* A delta <= 0 results in an immediate callback, which is what we want */ - delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm); - kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta); + kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false); *r = 0; return true; @@ -1531,29 +1610,10 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout, if (!kvm_xen_timer_enabled(vcpu)) return false; - if (timeout) { - uint64_t guest_now = get_kvmclock_ns(vcpu->kvm); - int64_t delta = timeout - guest_now; - - /* Xen has a 'Linux workaround' in do_set_timer_op() which - * checks for negative absolute timeout values (caused by - * integer overflow), and for values about 13 days in the - * future (2^50ns) which would be caused by jiffies - * overflow. For those cases, it sets the timeout 100ms in - * the future (not *too* soon, since if a guest really did - * set a long timeout on purpose we don't want to keep - * churning CPU time by waking it up). - */ - if (unlikely((int64_t)timeout < 0 || - (delta > 0 && (uint32_t) (delta >> 50) != 0))) { - delta = 100 * NSEC_PER_MSEC; - timeout = guest_now + delta; - } - - kvm_xen_start_timer(vcpu, timeout, delta); - } else { + if (timeout) + kvm_xen_start_timer(vcpu, timeout, true); + else kvm_xen_stop_timer(vcpu); - } *r = 0; return true; -- cgit v1.2.3 From 8e62bf2bfa46367e14d0ffdcde5aada08759497c Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Tue, 27 Feb 2024 11:49:16 +0000 Subject: KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled Linux guests since commit b1c3497e604d ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flags is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is *disabled*. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/lapic.c | 5 ++++- arch/x86/kvm/xen.c | 2 +- arch/x86/kvm/xen.h | 18 ++++++++++++++++++ 3 files changed, 23 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 3242f3da2457..75bc7d3f0022 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -41,6 +41,7 @@ #include "ioapic.h" #include "trace.h" #include "x86.h" +#include "xen.h" #include "cpuid.h" #include "hyperv.h" #include "smm.h" @@ -499,8 +500,10 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val) } /* Check if there are APF page ready requests pending */ - if (enabled) + if (enabled) { kvm_make_request(KVM_REQ_APF_READY, apic->vcpu); + kvm_xen_sw_enable_lapic(apic->vcpu); + } } static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id) diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 8a4c563e2398..5c24c92588cf 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -567,7 +567,7 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state) kvm_xen_update_runstate_guest(v, state == RUNSTATE_runnable); } -static void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) +void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) { struct kvm_lapic_irq irq = { }; int r; diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h index f8f1fe22d090..f5841d9000ae 100644 --- a/arch/x86/kvm/xen.h +++ b/arch/x86/kvm/xen.h @@ -18,6 +18,7 @@ extern struct static_key_false_deferred kvm_xen_enabled; int __kvm_xen_has_interrupt(struct kvm_vcpu *vcpu); void kvm_xen_inject_pending_events(struct kvm_vcpu *vcpu); +void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *vcpu); int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data); int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data); int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data); @@ -36,6 +37,19 @@ int kvm_xen_setup_evtchn(struct kvm *kvm, const struct kvm_irq_routing_entry *ue); void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu); +static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu) +{ + /* + * The local APIC is being enabled. If the per-vCPU upcall vector is + * set and the vCPU's evtchn_upcall_pending flag is set, inject the + * interrupt. + */ + if (static_branch_unlikely(&kvm_xen_enabled.key) && + vcpu->arch.xen.vcpu_info_cache.active && + vcpu->arch.xen.upcall_vector && __kvm_xen_has_interrupt(vcpu)) + kvm_xen_inject_vcpu_vector(vcpu); +} + static inline bool kvm_xen_msr_enabled(struct kvm *kvm) { return static_branch_unlikely(&kvm_xen_enabled.key) && @@ -101,6 +115,10 @@ static inline void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu) { } +static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu) +{ +} + static inline bool kvm_xen_msr_enabled(struct kvm *kvm) { return false; -- cgit v1.2.3 From 66e3cf729b1ea017cad643415ade319556862b68 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Tue, 27 Feb 2024 11:49:17 +0000 Subject: KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast version will always work for physical unicast", justifying its use of kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails. In fact that assumption isn't true if X2APIC isn't in use by the guest and there is (8-bit x)APIC ID aliasing. A single "unicast" destination APIC ID *may* then be delivered to multiple vCPUs. Remove the warning, and in fact it might as well just call kvm_irq_delivery_to_apic(). Reported-by: Michal Luczaj Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/xen.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 5c24c92588cf..54c4ced96b69 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -10,7 +10,7 @@ #include "x86.h" #include "xen.h" #include "hyperv.h" -#include "lapic.h" +#include "irq.h" #include #include @@ -570,7 +570,6 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state) void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) { struct kvm_lapic_irq irq = { }; - int r; irq.dest_id = v->vcpu_id; irq.vector = v->arch.xen.upcall_vector; @@ -579,8 +578,7 @@ void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) irq.delivery_mode = APIC_DM_FIXED; irq.level = 1; - /* The fast version will always work for physical unicast */ - WARN_ON_ONCE(!kvm_irq_delivery_to_apic_fast(v->kvm, NULL, &irq, &r, NULL)); + kvm_irq_delivery_to_apic(v->kvm, NULL, &irq, NULL); } /* -- cgit v1.2.3 From 7a36d680658ba5a0d350f2ad275b97156b8d4333 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Tue, 27 Feb 2024 11:49:19 +0000 Subject: KVM: x86/xen: fix recursive deadlock in timer injection The fast-path timer delivery introduced a recursive locking deadlock when userspace configures a timer which has already expired and is delivered immediately. The call to kvm_xen_inject_timer_irqs() can call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock, which is already held in kvm_xen_vcpu_get_attr(). ============================================ WARNING: possible recursive locking detected 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O -------------------------------------------- xen_shinfo_test/250013 is trying to acquire lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm] but task is already holding lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm] Now that the gfn_to_pfn_cache has its own self-sufficient locking, its callers no longer need to ensure serialization, so just stop taking kvm->arch.xen.xen_lock from kvm_xen_set_evtchn(). Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/xen.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 54c4ced96b69..f65b35a05d91 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -1862,8 +1862,6 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm) mm_borrowed = true; } - mutex_lock(&kvm->arch.xen.xen_lock); - /* * It is theoretically possible for the page to be unmapped * and the MMU notifier to invalidate the shared_info before @@ -1891,8 +1889,6 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm) srcu_read_unlock(&kvm->srcu, idx); } while(!rc); - mutex_unlock(&kvm->arch.xen.xen_lock); - if (mm_borrowed) kthread_unuse_mm(kvm->mm); -- cgit v1.2.3