summaryrefslogtreecommitdiff
path: root/Documentation/virt/kvm/x86/mmu.rst
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2023-11-03 04:45:15 +0300
committerLinus Torvalds <torvalds@linux-foundation.org>2023-11-03 04:45:15 +0300
commit6803bd7956ca8fc43069c2e42016f17f3c2fbf30 (patch)
treeebcd7d47efe649781817dd0d7664c7c618645b21 /Documentation/virt/kvm/x86/mmu.rst
parent5be9911406ada8fe6187db7ce402f7ff4c21ebdf (diff)
parent45b890f7689eb0aba454fc5831d2d79763781677 (diff)
downloadlinux-6803bd7956ca8fc43069c2e42016f17f3c2fbf30.tar.xz
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini: "ARM: - Generalized infrastructure for 'writable' ID registers, effectively allowing userspace to opt-out of certain vCPU features for its guest - Optimization for vSGI injection, opportunistically compressing MPIDR to vCPU mapping into a table - Improvements to KVM's PMU emulation, allowing userspace to select the number of PMCs available to a VM - Guest support for memory operation instructions (FEAT_MOPS) - Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing bugs and getting rid of useless code - Changes to the way the SMCCC filter is constructed, avoiding wasted memory allocations when not in use - Load the stage-2 MMU context at vcpu_load() for VHE systems, reducing the overhead of errata mitigations - Miscellaneous kernel and selftest fixes LoongArch: - New architecture for kvm. The hardware uses the same model as x86, s390 and RISC-V, where guest/host mode is orthogonal to supervisor/user mode. The virtualization extensions are very similar to MIPS, therefore the code also has some similarities but it's been cleaned up to avoid some of the historical bogosities that are found in arch/mips. The kernel emulates MMU, timer and CSR accesses, while interrupt controllers are only emulated in userspace, at least for now. RISC-V: - Support for the Smstateen and Zicond extensions - Support for virtualizing senvcfg - Support for virtualized SBI debug console (DBCN) S390: - Nested page table management can be monitored through tracepoints and statistics x86: - Fix incorrect handling of VMX posted interrupt descriptor in KVM_SET_LAPIC, which could result in a dropped timer IRQ - Avoid WARN on systems with Intel IPI virtualization - Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs without forcing more common use cases to eat the extra memory overhead. - Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and SBPB, aka Selective Branch Predictor Barrier). - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of creating the original vCPU would cause KVM to try to synchronize the vCPU's TSC and thus clobber the correct TSC being set by userspace. - Compute guest wall clock using a single TSC read to avoid generating an inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads. - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain about a "Firmware Bug" if the bit isn't set for select F/M/S combos. Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to appease Windows Server 2022. - Don't apply side effects to Hyper-V's synthetic timer on writes from userspace to fix an issue where the auto-enable behavior can trigger spurious interrupts, i.e. do auto-enabling only for guest writes. - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log without PML enabled. - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate. - Harden the fast page fault path to guard against encountering an invalid root when walking SPTEs. - Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n. - Use the fast path directly from the timer callback when delivering Xen timer events, instead of waiting for the next iteration of the run loop. This was not done so far because previously proposed code had races, but now care is taken to stop the hrtimer at critical points such as restarting the timer or saving the timer information for userspace. - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag. - Optimize injection of PMU interrupts that are simultaneous with NMIs. - Usual handful of fixes for typos and other warts. x86 - MTRR/PAT fixes and optimizations: - Clean up code that deals with honoring guest MTRRs when the VM has non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled. - Zap EPT entries when non-coherent DMA assignment stops/start to prevent using stale entries with the wrong memtype. - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y This was done as a workaround for virtual machine BIOSes that did not bother to clear CR0.CD (because ancient KVM/QEMU did not bother to set it, in turn), and there's zero reason to extend the quirk to also ignore guest PAT. x86 - SEV fixes: - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while running an SEV-ES guest. - Clean up the recognition of emulation failures on SEV guests, when KVM would like to "skip" the instruction but it had already been partially emulated. This makes it possible to drop a hack that second guessed the (insufficient) information provided by the emulator, and just do the right thing. Documentation: - Various updates and fixes, mostly for x86 - MTRR and PAT fixes and optimizations" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (164 commits) KVM: selftests: Avoid using forced target for generating arm64 headers tools headers arm64: Fix references to top srcdir in Makefile KVM: arm64: Add tracepoint for MMIO accesses where ISV==0 KVM: arm64: selftest: Perform ISB before reading PAR_EL1 KVM: arm64: selftest: Add the missing .guest_prepare() KVM: arm64: Always invalidate TLB for stage-2 permission faults KVM: x86: Service NMI requests after PMI requests in VM-Enter path KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WI KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregs KVM: arm64: Refine _EL2 system register list that require trap reinjection arm64: Add missing _EL2 encodings arm64: Add missing _EL12 encodings KVM: selftests: aarch64: vPMU test for validating user accesses KVM: selftests: aarch64: vPMU register test for unimplemented counters KVM: selftests: aarch64: vPMU register test for implemented counters KVM: selftests: aarch64: Introduce vpmu_counter_access test tools: Import arm_pmuv3.h KVM: arm64: PMU: Allow userspace to limit PMCR_EL0.N for the guest KVM: arm64: Sanitize PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} before first run KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} ...
Diffstat (limited to 'Documentation/virt/kvm/x86/mmu.rst')
-rw-r--r--Documentation/virt/kvm/x86/mmu.rst43
1 files changed, 34 insertions, 9 deletions
diff --git a/Documentation/virt/kvm/x86/mmu.rst b/Documentation/virt/kvm/x86/mmu.rst
index d47595b33fcf..2b3b6d442302 100644
--- a/Documentation/virt/kvm/x86/mmu.rst
+++ b/Documentation/virt/kvm/x86/mmu.rst
@@ -202,10 +202,22 @@ Shadow pages contain the following information:
Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D
bits before Haswell; shadow EPT page tables also cannot use A/D bits
if the L1 hypervisor does not enable them.
+ role.guest_mode:
+ Indicates the shadow page is created for a nested guest.
role.passthrough:
The page is not backed by a guest page table, but its first entry
points to one. This is set if NPT uses 5-level page tables (host
CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=0).
+ mmu_valid_gen:
+ The MMU generation of this page, used to fast zap of all MMU pages within a
+ VM without blocking vCPUs too long. Specifically, KVM updates the per-VM
+ valid MMU generation which causes the mismatch of mmu_valid_gen for each mmu
+ page. This makes all existing MMU pages obsolete. Obsolete pages can't be
+ used. Therefore, vCPUs must load a new, valid root before re-entering the
+ guest. The MMU generation is only ever '0' or '1'. Note, the TDP MMU doesn't
+ use this field as non-root TDP MMU pages are reachable only from their
+ owning root. Thus it suffices for TDP MMU to use role.invalid in root pages
+ to invalidate all MMU pages.
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.
@@ -219,21 +231,30 @@ Shadow pages contain the following information:
at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.
The spt array forms a DAG structure with the shadow page as a node, and
guest pages as leaves.
- gfns:
- An array of 512 guest frame numbers, one for each present pte. Used to
- perform a reverse map from a pte to a gfn. When role.direct is set, any
- element of this array can be calculated from the gfn field when used, in
- this case, the array of gfns is not allocated. See role.direct and gfn.
- root_count:
- A counter keeping track of how many hardware registers (guest cr3 or
- pdptrs) are now pointing at the page. While this counter is nonzero, the
- page cannot be destroyed. See role.invalid.
+ shadowed_translation:
+ An array of 512 shadow translation entries, one for each present pte. Used
+ to perform a reverse map from a pte to a gfn as well as its access
+ permission. When role.direct is set, the shadow_translation array is not
+ allocated. This is because the gfn contained in any element of this array
+ can be calculated from the gfn field when used. In addition, when
+ role.direct is set, KVM does not track access permission for each of the
+ gfn. See role.direct and gfn.
+ root_count / tdp_mmu_root_count:
+ root_count is a reference counter for root shadow pages in Shadow MMU.
+ vCPUs elevate the refcount when getting a shadow page that will be used as
+ a root page, i.e. page that will be loaded into hardware directly (CR3,
+ PDPTRs, nCR3 EPTP). Root pages cannot be destroyed while their refcount is
+ non-zero. See role.invalid. tdp_mmu_root_count is similar but exclusively
+ used in TDP MMU as an atomic refcount.
parent_ptes:
The reverse mapping for the pte/ptes pointing at this page's spt. If
parent_ptes bit 0 is zero, only one spte points at this page and
parent_ptes points at this single spte, otherwise, there exists multiple
sptes pointing at this page and (parent_ptes & ~0x1) points at a data
structure with a list of parent sptes.
+ ptep:
+ The kernel virtual address of the SPTE that points at this shadow page.
+ Used exclusively by the TDP MMU, this field is a union with parent_ptes.
unsync:
If true, then the translations in this page may not match the guest's
translation. This is equivalent to the state of the tlb when a pte is
@@ -261,6 +282,10 @@ Shadow pages contain the following information:
since the last time the page table was actually used; if emulation
is triggered too frequently on this page, KVM will unmap the page
to avoid emulation in the future.
+ tdp_mmu_page:
+ Is 1 if the shadow page is a TDP MMU page. This variable is used to
+ bifurcate the control flows for KVM when walking any data structure that
+ may contain pages from both TDP MMU and shadow MMU.
Reverse map
===========