summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/kvm.c
AgeCommit message (Collapse)AuthorFilesLines
2024-03-19Merge tag 'kvm-x86-asyncpf_abi-6.9' of https://github.com/kvm-x86/linux into ↵Paolo Bonzini1-5/+6
HEAD Guest-side KVM async #PF ABI cleanup for 6.9 Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes, i.e. beyond a single cache line. The enabled field is purely a guest-side flag that Linux-as-a-guest uses to track whether or not the guest has enabled async #PF support. The actual flag that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the shared kvm_vcpu_pv_apf_data structure. Simply drop the the field and use a dedicated guest-side per-CPU variable to fix the ABI, as opposed to fixing the documentation to match reality. KVM has never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change breaking anything are extremely low.
2024-03-15Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-12Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-02-24x86, crash: wrap crash dumping code into crash related ifdefsBaoquan He1-2/+2
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on x86 with some adjustments. Here, also change some ifdefs or IS_ENABLED() check to more appropriate ones, e,g - #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP - (!IS_ENABLED(CONFIG_KEXEC_CORE)) - > (!IS_ENABLED(CONFIG_CRASH_RESERVE)) [bhe@redhat.com: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope] Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u Link: https://lkml.kernel.org/r/20240124051254.67105-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-06x86/kvm: Use separate percpu variable to track the enabling of asyncpfXiaoyao Li1-5/+6
Refer to commit fd10cde9294f ("KVM paravirt: Add async PF initialization to PV guest") and commit 344d9588a9df ("KVM: Add PV MSR to enable asynchronous page faults delivery"). It turns out that at the time when asyncpf was introduced, the purpose was defining the shared PV data 'struct kvm_vcpu_pv_apf_data' with the size of 64 bytes. However, it made a mistake and defined the size to 68 bytes, which failed to make fit in a cache line and made the code inconsistent with the documentation. Below justification quoted from Sean[*] KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and the documentation clearly states that enabling is based solely on the bit in the synthetic MSR. So rather than update the documentation, fix the goof by removing the enabled filed and use the separate percpu variable instread. KVM-as-a-host obviously doesn't enforce anything or consume the size, and changing the header will only affect guests that are rebuilt against the new header, so there's no chance of ABI breakage between KVM and its guests. The only possible breakage is if some other hypervisor is emulating KVM's async #PF (LOL) and relies on the guest to set kvm_vcpu_pv_apf_data.enabled. But (a) I highly doubt such a hypervisor exists, (b) that would arguably be a violation of KVM's "spec", and (c) the worst case scenario is that the guest would simply lose async #PF functionality. [*] https://lore.kernel.org/all/ZS7ERnnRqs8Fl0ZF@google.com/T/#u Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20231025055914.1201792-2-xiaoyao.li@intel.com [sean: use true/false instead of 1/0 for booleans] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01x86/kvm: Fix SEV check in sev_map_percpu_data()Kirill A. Shutemov1-1/+2
The function sev_map_percpu_data() checks if it is running on an SEV platform by checking the CC_ATTR_GUEST_MEM_ENCRYPT attribute. However, this attribute is also defined for TDX. To avoid false positives, add a cc_vendor check. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: 4d96f9109109 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()") Suggested-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: David Rientjes <rientjes@google.com> Message-Id: <20240124130317.495519-1-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-01x86/traps: Add sysvec_install() to install a system interrupt handlerXin Li1-1/+1
Add sysvec_install() to install a system interrupt handler into the IDT or the FRED system interrupt handler table. Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-28-xin3.li@intel.com
2024-01-09Merge tag 'x86-cleanups-2024-01-08' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Change global variables to local - Add missing kernel-doc function parameter descriptions - Remove unused parameter from a macro - Remove obsolete Kconfig entry - Fix comments - Fix typos, mostly scripted, manually reviewed and a micro-optimization got misplaced as a cleanup: - Micro-optimize the asm code in secondary_startup_64_no_verify() * tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86: Fix typos x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify() x86/docs: Remove reference to syscall trampoline in PTI x86/Kconfig: Remove obsolete config X86_32_SMP x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro x86/mtrr: Document missing function parameters in kernel-doc x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
2024-01-03arch/x86: Fix typosBjorn Helgaas1-1/+1
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2023-12-10x86/paravirt: Move some functions and defines to alternative.cJuergen Gross1-2/+2
As a preparation for replacing paravirt patching completely by alternative patching, move some backend functions and #defines to the alternatives code and header. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231129133332.31043-3-jgross@suse.com
2023-10-10x86/apic: Use u32 for APIC IDs in global dataThomas Gleixner1-3/+3
APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width and fixup the most obvious usage sites of that. The APIC callbacks will be addressed separately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.922905727@linutronix.de
2023-08-30Merge tag 'x86_apic_for_6.6-rc1' of ↵Linus Torvalds1-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Dave Hansen: "This includes a very thorough rework of the 'struct apic' handlers. Quite a variety of them popped up over the years, especially in the 32-bit days when odd apics were much more in vogue. The end result speaks for itself, which is a removal of a ton of code and static calls to replace indirect calls. If there's any breakage here, it's likely to be around the 32-bit museum pieces that get light to no testing these days. Summary: - Rework apic callbacks, getting rid of unnecessary ones and coalescing lots of silly duplicates. - Use static_calls() instead of indirect calls for apic->foo() - Tons of cleanups an crap removal along the way" * tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) x86/apic: Turn on static calls x86/apic: Provide static call infrastructure for APIC callbacks x86/apic: Wrap IPI calls into helper functions x86/apic: Mark all hotpath APIC callback wrappers __always_inline x86/xen/apic: Mark apic __ro_after_init x86/apic: Convert other overrides to apic_update_callback() x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb() x86/apic: Provide apic_update_callback() x86/xen/apic: Use standard apic driver mechanism for Xen PV x86/apic: Provide common init infrastructure x86/apic: Wrap apic->native_eoi() into a helper x86/apic: Nuke ack_APIC_irq() x86/apic: Remove pointless arguments from [native_]eoi_write() x86/apic/noop: Tidy up the code x86/apic: Remove pointless NULL initializations x86/apic: Sanitize APIC ID range validation x86/apic: Prepare x2APIC for using apic::max_apic_id x86/apic: Simplify X2APIC ID validation x86/apic: Add max_apic_id member x86/apic: Wrap APIC ID validation into an inline ...
2023-08-25x86/sev: Make enc_dec_hypercall() accept a size instead of npagesSteve Rutherford1-3/+1
enc_dec_hypercall() accepted a page count instead of a size, which forced its callers to round up. As a result, non-page aligned vaddrs caused pages to be spuriously marked as decrypted via the encryption status hypercall, which in turn caused consistent corruption of pages during live migration. Live migration requires accurate encryption status information to avoid migrating pages from the wrong perspective. Fixes: 064ce6c550a0 ("mm: x86: Invoke hypercall when page encryption status is changed") Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Tested-by: Ben Hillier <bhillier@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230824223731.2055016-1-srutherford@google.com
2023-08-09x86/apic: Convert other overrides to apic_update_callback()Thomas Gleixner1-3/+3
Convert all places which just assign a new function directly to the apic callback to use apic_update_callback() which prepares for using static calls. Mark snp_set_wakeup_secondary_cpu() and kvm_setup_pv_ipi() __init, as they are only invoked from init code and otherwise trigger a section mismatch as they are now invoking a __init function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb()Thomas Gleixner1-2/+2
Switch them over to apic_update_callback() and remove the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09x86/apic: Wrap apic->native_eoi() into a helperThomas Gleixner1-1/+1
Prepare for converting the hotpath APIC callbacks to static calls. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09x86/apic: Nuke ack_APIC_irq()Dave Hansen1-1/+1
Yet another wrapper of a wrapper gone along with the outdated comment that this compiles to a single instruction. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09x86/apic: Remove pointless arguments from [native_]eoi_write()Thomas Gleixner1-3/+3
Every callsite hands in the same constants which is a pointless exercise and cannot be optimized by the compiler due to the indirect calls. Use the constants in the eoi() callbacks and remove the arguments. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2022-12-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+1
Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of a unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...
2022-11-24x86/paravirt: Use common macro for creating simple asm paravirt functionsJuergen Gross1-13/+6
There are some paravirt assembler functions which are sharing a common pattern. Introduce a macro DEFINE_PARAVIRT_ASM() for creating them. Note that this macro is including explicit alignment of the generated functions, leading to __raw_callee_save___kvm_vcpu_is_preempted(), _paravirt_nop() and paravirt_ret0() to be aligned at 4 byte boundaries now. The explicit _paravirt_nop() prototype in paravirt.c isn't needed, as it is included in paravirt_types.h already. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Link: https://lkml.kernel.org/r/20221109134418.6516-1-jgross@suse.com
2022-11-09x86/kvm: Remove unused virt to phys translation in kvm_guest_cpu_init()Rafael Mendonca1-1/+1
Presumably, this was introduced due to a conflict resolution with commit ef68017eb570 ("x86/kvm: Handle async page faults directly through do_page_fault()"), given that the last posted version [1] of the blamed commit was not based on the aforementioned commit. [1] https://lore.kernel.org/kvm/20200525144125.143875-9-vkuznets@redhat.com/ Fixes: b1d405751cd5 ("KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery") Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com> Message-Id: <20221021020113.922027-1-rafaelmendsr@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-17x86/paravirt: Properly align PV functionsThomas Gleixner1-0/+1
Ensure inline asm functions are consistently aligned with compiler generated and SYM_FUNC_START*() functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20220915111144.038540008@infradead.org
2022-06-20x86: kvm: remove NULL check before kfreeDongliang Mu1-2/+1
kfree can handle NULL pointer as its argument. According to coccinelle isnullfree check, remove NULL check before kfree operation. Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Message-Id: <20220614133458.147314-1-dzm91@hust.edu.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25x86, kvm: use correct GFP flags for preemption disabledPaolo Bonzini1-1/+1
Commit ddd7ed842627 ("x86/kvm: Alloc dummy async #PF token outside of raw spinlock") leads to the following Smatch static checker warning: arch/x86/kernel/kvm.c:212 kvm_async_pf_task_wake() warn: sleeping in atomic context arch/x86/kernel/kvm.c 202 raw_spin_lock(&b->lock); 203 n = _find_apf_task(b, token); 204 if (!n) { 205 /* 206 * Async #PF not yet handled, add a dummy entry for the token. 207 * Allocating the token must be down outside of the raw lock 208 * as the allocator is preemptible on PREEMPT_RT kernels. 209 */ 210 if (!dummy) { 211 raw_spin_unlock(&b->lock); --> 212 dummy = kzalloc(sizeof(*dummy), GFP_KERNEL); ^^^^^^^^^^ Smatch thinks the caller has preempt disabled. The `smdb.py preempt kvm_async_pf_task_wake` output call tree is: sysvec_kvm_asyncpf_interrupt() <- disables preempt -> __sysvec_kvm_asyncpf_interrupt() -> kvm_async_pf_task_wake() The caller is this: arch/x86/kernel/kvm.c 290 DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt) 291 { 292 struct pt_regs *old_regs = set_irq_regs(regs); 293 u32 token; 294 295 ack_APIC_irq(); 296 297 inc_irq_stat(irq_hv_callback_count); 298 299 if (__this_cpu_read(apf_reason.enabled)) { 300 token = __this_cpu_read(apf_reason.token); 301 kvm_async_pf_task_wake(token); 302 __this_cpu_write(apf_reason.token, 0); 303 wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1); 304 } 305 306 set_irq_regs(old_regs); 307 } The DEFINE_IDTENTRY_SYSVEC() is a wrapper that calls this function from the call_on_irqstack_cond(). It's inside the call_on_irqstack_cond() where preempt is disabled (unless it's already disabled). The irq_enter/exit_rcu() functions disable/enable preempt. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25x86/kvm: Alloc dummy async #PF token outside of raw spinlockSean Christopherson1-14/+27
Drop the raw spinlock in kvm_async_pf_task_wake() before allocating the the dummy async #PF token, the allocator is preemptible on PREEMPT_RT kernels and must not be called from truly atomic contexts. Opportunistically document why it's ok to loop on allocation failure, i.e. why the function won't get stuck in an infinite loop. Reported-by: Yajun Deng <yajun.deng@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25Merge tag 'kvmarm-5.19' of ↵Paolo Bonzini1-0/+13
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 5.19 - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes [Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated from 4 to 6. - Paolo]
2022-04-21x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resumeWanpeng Li1-0/+13
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to host-side polling after suspend/resume. Non-bootstrap CPUs are restored correctly by the haltpoll driver because they are hot-unplugged during suspend and hot-plugged during resume; however, the BSP is not hotpluggable and remains in host-sde polling mode after the guest resume. The makes the guest pay for the cost of vmexits every time the guest enters idle. Fix it by recording BSP's haltpoll state and resuming it during guest resume. Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13Merge branch 'kvm-older-features' into HEADPaolo Bonzini1-38/+39
Merge branch for features that did not make it into 5.18: * New ioctls to get/set TSC frequency for a whole VM * Allow userspace to opt out of hypercall patching Nested virtualization improvements for AMD: * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) * Allow AVIC to co-exist with a nested guest running * Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support * PAUSE filtering for nested hypervisors Guest support: * Decoupling of vcpu_is_preempted from PV spinlocks Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+1
Pull kvm fixes from Paolo Bonzini: - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr - Documentation improvements - Prevent module exit until all VMs are freed - PMU Virtualization fixes - Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences - Other miscellaneous bugfixes * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits) KVM: x86: fix sending PV IPI KVM: x86/mmu: do compare-and-exchange of gPTE via the user address KVM: x86: Remove redundant vm_entry_controls_clearbit() call KVM: x86: cleanup enter_rmode() KVM: x86: SVM: fix tsc scaling when the host doesn't support it kvm: x86: SVM: remove unused defines KVM: x86: SVM: move tsc ratio definitions to svm.h KVM: x86: SVM: fix avic spec based definitions again KVM: MIPS: remove reference to trap&emulate virtualization KVM: x86: document limitations of MSR filtering KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr KVM: x86/emulator: Emulate RDPID only if it is enabled in guest KVM: x86/pmu: Fix and isolate TSX-specific performance event logic KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs KVM: x86: Trace all APICv inhibit changes and capture overall status KVM: x86: Add wrappers for setting/clearing APICv inhibits KVM: x86: Make APICv inhibit reasons an enum and cleanup naming KVM: X86: Handle implicit supervisor access with SMAP KVM: X86: Rename variable smap to not_smap in permission_fault() ...
2022-04-02KVM: x86: Support the vCPU preemption check with nopvspin and realtime hintLi RongQing1-37/+38
If guest kernel is configured with nopvspin, or CONFIG_PARAVIRT_SPINLOCK is disabled, or guest find its has dedicated pCPUs from realtime hint feature, the pvspinlock will be disabled, and vCPU preemption check is disabled too. Hoever, KVM still can emulating HLT for vCPU for both cases. Checking if a vCPU is preempted or not can still boost performance in IPI-heavy scenarios such as unixbench file copy and pipe-based context switching tests: Here the vCPU is running with a dedicated pCPU, so the guest kernel has nopvspin but is emulating HLT for the vCPU: Testcase Base with patch System Benchmarks Index Values INDEX INDEX Dhrystone 2 using register variables 3278.4 3277.7 Double-Precision Whetstone 822.8 825.8 Execl Throughput 1296.5 941.1 File Copy 1024 bufsize 2000 maxblocks 2124.2 2142.7 File Copy 256 bufsize 500 maxblocks 1335.9 1353.6 File Copy 4096 bufsize 8000 maxblocks 4256.3 4760.3 Pipe Throughput 1050.1 1054.0 Pipe-based Context Switching 243.3 352.0 Process Creation 820.1 814.4 Shell Scripts (1 concurrent) 2169.0 2086.0 Shell Scripts (8 concurrent) 7710.3 7576.3 System Call Overhead 672.4 673.9 ======== ======= System Benchmarks Index Score 1467.2 1483.0 Move the setting of pv_ops.lock.vcpu_is_preempted to kvm_guest_init, so that it does not depend on pvspinlock. Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1646815610-43315-1-git-send-email-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: fix sending PV IPILi RongQing1-1/+1
If apic_id is less than min, and (max - apic_id) is greater than KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but the new apic_id does not fit the bitmask. In this case __send_ipi_mask should send the IPI. This is mostly theoretical, but it can happen if the apic_ids on three iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0. Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest") Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-15x86/ibt,paravirt: Sprinkle ENDBRPeter Zijlstra1-1/+2
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.051635891@infradead.org
2022-02-25KVM: x86: Yield to IPI target vCPU only if it is busyLi RongQing1-1/+1
When sending a call-function IPI-many to vCPUs, yield to the IPI target vCPU which is marked as preempted. but when emulating HLT, an idling vCPU will be voluntarily scheduled out and mark as preempted from the guest kernel perspective. yielding to idle vCPU is pointless and increase unnecessary vmexit, maybe miss the true preempted vCPU so yield to IPI target vCPU only if vCPU is busy and preempted Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1644380201-29423-1-git-send-email-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25x86/kvm: Don't use PV TLB/yield when mwait is advertisedWanpeng Li1-0/+2
MWAIT is advertised in host is not overcommitted scenario, however, PV TLB/sched yield should be enabled in host overcommitted scenario. Let's add the MWAIT checking when enabling PV TLB/sched yield. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPUWanpeng Li1-3/+6
Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07x86/kvm: Silence per-cpu pr_info noise about KVM clocks and steal timeDavid Woodhouse1-3/+3
I made the actual CPU bringup go nice and fast... and then Linux spends half a minute printing stupid nonsense about clocks and steal time for each of 256 vCPUs. Don't do that. Nobody cares. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211209150938.3518-12-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11Merge branch 'kvm-5.16-fixes' into kvm-masterPaolo Bonzini1-1/+1
* Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status * Fix selftests on APICv machines * Fix sparse warnings * Fix detection of KVM features in CPUID * Cleanups for bogus writes to MSR_KVM_PV_EOI_EN * Fixes and cleanups for MSR bitmap handling * Cleanups for INVPCID * Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures
2021-11-11KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURESPaul Durrant1-1/+1
Currently when kvm_update_cpuid_runtime() runs, it assumes that the KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true, however, if Hyper-V support is enabled. In this case the KVM leaves will be offset. This patch introdues as new 'kvm_cpuid_base' field into struct kvm_vcpu_arch to track the location of the KVM leaves and function kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now target the correct leaf. NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is intoduced into processor.h to avoid having duplicate code for the iteration over possible hypervisor base leaves. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Message-Id: <20211105095101.5384-3-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11Merge branch 'kvm-guest-sev-migration' into kvm-masterPaolo Bonzini1-0/+107
Add guest api and guest kernel support for SEV live migration. Introduces a new hypercall to notify the host of changes to the page encryption status. If the page is encrypted then it must be migrated through the SEV firmware or a helper VM sharing the key. If page is not encrypted then it can be migrated normally by userspace. This new hypercall is invoked using paravirt_ops. Conflicts: sev_active() replaced by cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
2021-11-11x86/kvm: Add kexec support for SEV Live Migration.Ashish Kalra1-0/+25
Reset the host's shared pages list related to kernel specific page encryption status settings before we load a new kernel by kexec. We cannot reset the complete shared pages list here as we need to retain the UEFI/OVMF firmware specific settings. The host's shared pages list is maintained for the guest to keep track of all unencrypted guest memory regions, therefore we need to explicitly mark all shared pages as encrypted again before rebooting into the new guest kernel. Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Message-Id: <3e051424ab839ea470f88333273d7a185006754f.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11x86/kvm: Add guest support for detecting and enabling SEV Live Migration ↵Ashish Kalra1-0/+82
feature. The guest support for detecting and enabling SEV Live migration feature uses the following logic : - kvm_init_plaform() checks if its booted under the EFI - If not EFI, i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl() to enable the SEV live migration support - If EFI, i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read the UEFI variable which indicates OVMF support for live migration ii) the variable indicates live migration is supported, issue a wrmsrl() to enable the SEV live migration support The EFI live migration check is done using a late_initcall() callback. Also, ensure that _bss_decrypted section is marked as decrypted in the hypervisor's guest page encryption status tracking. Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-04x86/sev: Replace occurrences of sev_active() with cc_platform_has()Tom Lendacky1-1/+2
Replace uses of sev_active() with the more generic cc_platform_has() using CC_ATTR_GUEST_MEM_ENCRYPT. If future support is added for other memory encryption technologies, the use of CC_ATTR_GUEST_MEM_ENCRYPT can be updated, as required. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210928191009.32551-7-bp@alien8.de
2021-09-06x86/kvm: Don't enable IRQ when IRQ enabled in kvm_waitLai Jiangshan1-2/+3
Commit f4e61f0c9add3 ("x86/kvm: Fix broken irq restoration in kvm_wait") replaced "local_irq_restore() when IRQ enabled" with "local_irq_enable() when IRQ enabled" to suppress a warnning. Although there is no similar debugging warnning for doing local_irq_enable() when IRQ enabled as doing local_irq_restore() in the same IRQ situation. But doing local_irq_enable() when IRQ enabled is no less broken as doing local_irq_restore() and we'd better avoid it. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210814035129.154242-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()Vitaly Kuznetsov1-25/+17
Simplify the code by making PV features shutdown happen in one place. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Disable all PV features on crashVitaly Kuznetsov1-12/+32
Crash shutdown handler only disables kvmclock and steal time, other PV features remain active so we risk corrupting memory or getting some side-effects in kdump kernel. Move crash handler to kvm.c and unify with CPU offline. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Disable kvmclock on all CPUs on shutdownVitaly Kuznetsov1-0/+1
Currenly, we disable kvmclock from machine_shutdown() hook and this only happens for boot CPU. We need to disable it for all CPUs to guard against memory corruption e.g. on restore from hibernate. Note, writing '0' to kvmclock MSR doesn't clear memory location, it just prevents hypervisor from updating the location so for the short while after write and while CPU is still alive, the clock remains usable and correct so we don't need to switch to some other clocksource. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Teardown PV features on boot CPU as wellVitaly Kuznetsov1-16/+40
Various PV features (Async PF, PV EOI, steal time) work through memory shared with hypervisor and when we restore from hibernation we must properly teardown all these features to make sure hypervisor doesn't write to stale locations after we jump to the previously hibernated kernel (which can try to place anything there). For secondary CPUs the job is already done by kvm_cpu_down_prepare(), register syscore ops to do the same for boot CPU. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-05x86/kvm: Fix pr_info() for async PF setup/teardownVitaly Kuznetsov1-2/+2
'pr_fmt' already has 'kvm-guest: ' so 'KVM' prefix is redundant. "Unregister pv shared memory" is very ambiguous, it's hard to say which particular PV feature it relates to. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-68/+60
Pull kvm updates from Paolo Bonzini: "This is a large update by KVM standards, including AMD PSP (Platform Security Processor, aka "AMD Secure Technology") and ARM CoreSight (debug and trace) changes. ARM: - CoreSight: Add support for ETE and TRBE - Stage-2 isolation for the host kernel when running in protected mode - Guest SVE support when running in nVHE mode - Force W^X hypervisor mappings in nVHE mode - ITS save/restore for guests using direct injection with GICv4.1 - nVHE panics now produce readable backtraces - Guest support for PTP using the ptp_kvm driver - Performance improvements in the S2 fault handler x86: - AMD PSP driver changes - Optimizations and cleanup of nested SVM code - AMD: Support for virtual SPEC_CTRL - Optimizations of the new MMU code: fast invalidation, zap under read lock, enable/disably dirty page logging under read lock - /dev/kvm API for AMD SEV live migration (guest API coming soon) - support SEV virtual machines sharing the same encryption context - support SGX in virtual machines - add a few more statistics - improved directed yield heuristics - Lots and lots of cleanups Generic: - Rework of MMU notifier interface, simplifying and optimizing the architecture-specific code - a handful of "Get rid of oprofile leftovers" patches - Some selftests improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits) KVM: selftests: Speed up set_memory_region_test selftests: kvm: Fix the check of return value KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt() KVM: SVM: Skip SEV cache flush if no ASIDs have been used KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids() KVM: SVM: Drop redundant svm_sev_enabled() helper KVM: SVM: Move SEV VMCB tracking allocation to sev.c KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup() KVM: SVM: Unconditionally invoke sev_hardware_teardown() KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported) KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features KVM: SVM: Move SEV module params/variables to sev.c KVM: SVM: Disable SEV/SEV-ES if NPT is disabled KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails KVM: SVM: Zero out the VMCB array used to track SEV ASID association x86/sev: Drop redundant and potentially misleading 'sev_enabled' KVM: x86: Move reverse CPUID helpers to separate header file KVM: x86: Rename GPR accessors to make mode-aware variants the defaults ...
2021-04-29Merge tag 'x86-mm-2021-04-29' of ↵Linus Torvalds1-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tlb updates from Ingo Molnar: "The x86 MM changes in this cycle were: - Implement concurrent TLB flushes, which overlaps the local TLB flush with the remote TLB flush. In testing this improved sysbench performance measurably by a couple of percentage points, especially if TLB-heavy security mitigations are active. - Further micro-optimizations to improve the performance of TLB flushes" * tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Micro-optimize smp_call_function_many_cond() smp: Inline on_each_cpu_cond() and on_each_cpu() x86/mm/tlb: Remove unnecessary uses of the inline keyword cpumask: Mark functions as pure x86/mm/tlb: Do not make is_lazy dirty for no reason x86/mm/tlb: Privatize cpu_tlbstate x86/mm/tlb: Flush remote and local TLBs concurrently x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() smp: Run functions concurrently in smp_call_function_many_cond()