Age | Commit message (Collapse) | Author | Files | Lines |
|
KVM internally uses accessor functions when reading or writing the
guest's system registers. This takes care of accessing either the stored
copy or using the "live" EL1 system registers when the host uses VHE.
With the introduction of virtual EL2 we add a bunch of EL2 system
registers, which now must also be taken care of:
- If the guest is running in vEL2, and we access an EL1 sysreg, we must
revert to the stored version of that, and not use the CPU's copy.
- If the guest is running in vEL1, and we access an EL2 sysreg, we must
also use the stored version, since the CPU carries the EL1 copy.
- Some EL2 system registers are supposed to affect the current execution
of the system, so we need to put them into their respective EL1
counterparts. For this we need to define a mapping between the two.
- Some EL2 system registers have a different format than their EL1
counterpart, so we need to translate them before writing them to the
CPU. This is done using an (optional) translate function in the map.
All of these cases are now wrapped into the existing accessor functions,
so KVM users wouldn't need to care whether they access EL2 or EL1
registers and also which state the guest is in.
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Co-developed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
With ARMv8.4-NV, registers that can be directly accessed in memory
by the guest have to live at architected offsets in a special page.
Let's annotate the sysreg enum to reflect the offset at which they
are in this page, whith a little twist:
If running on HW that doesn't have the ARMv8.4-NV feature, or even
a VM that doesn't use NV, we store all the system registers in the
usual sys_regs array. The only difference with the pre-8.4
situation is that VNCR-capable registers are at a "similar" offset
as in the VNCR page (we can compute the actual offset at compile
time), and that the sys_regs array is both bigger and sparse.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add two helpers to deal with EL2 registers are are either redirected
to the VNCR page, or that are redirected to their EL1 counterpart.
In either cases, no trap is expected.
THe relevant register descriptors are repainted accordingly.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In order to ease the debugging of NV, it is helpful to have the kernel
shout at you when an unexpected trap is handled. We already have this
in a couple of cases. Make this a more generic infrastructure that we
will make use of very shortly.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
VNCR_EL2 points to a page containing a number of system registers
accessed by a guest hypervisor when ARMv8.4-NV is enabled.
Let's document the offsets in that page, as we are going to use
this layout.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Some EL2 system registers immediately affect the current execution
of the system, so we need to use their respective EL1 counterparts.
For this we need to define a mapping between the two. In general,
this only affects non-VHE guest hypervisors, as VHE system registers
are compatible with the EL1 counterparts.
These helpers will get used in subsequent patches.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Co-developed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
With FEAT_NV2, a bunch of system register writes are turned into
memory writes. This is specially the fate of the EL12 registers
that the guest hypervisor manipulates out of context.
Remove the trap descriptors for those, as they are never going
to be used again.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Now that we have a full copy of the idregs for each VM, there is
no point in repainting the sysregs on each access. Instead, we
can simply perform the transmation as a one-off and be done
with it.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
A rather common idiom when writing NV code as part of KVM is
to have things such has:
if (vcpu_has_nv(vcpu) && is_hyp_ctxt(vcpu)) {
[...]
}
to check that we are in a hyp-related context. The second part of
the conjunction would be enough, but the first one contains a
static key that allows the rest of the checkis to be elided when
in a non-NV environment.
Rewrite is_hyp_ctxt() to directly use vcpu_has_nv(). The result
is the same, and the code easier to read. The one occurence of
this that is already merged is rewritten in the process.
In order to avoid nasty cirtular dependencies between kvm_emulate.h
and kvm_nested.h, vcpu_has_feature() is itself hoisted into kvm_host.h,
at the cost of some #deferry...
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
To anyone who has played with FEAT_NV, it is obvious that the level
of performance is rather low due to the trap amplification that it
imposes on the host hypervisor. FEAT_NV2 solves a number of the
problems that FEAT_NV had.
It also turns out that all the existing hardware that has FEAT_NV
also has FEAT_NV2. Finally, it is now allowed by the architecture
to build FEAT_NV2 *only* (as denoted by ID_AA64MMFR4_EL1.NV_frac),
which effectively seals the fate of FEAT_NV.
Restrict the NV support to NV2, and be done with it. Nobody will
cry over the old crap. NV_frac will eventually be supported once
the intrastructure is ready.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
This patch adds LASX (256bit SIMD) support for LoongArch KVM.
There will be LASX exception in KVM when guest use the LASX instructions.
KVM will enable LASX and restore the vector registers for guest and then
return to guest to continue running.
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
This patch adds LSX (128bit SIMD) support for LoongArch KVM.
There will be LSX exception in KVM when guest use the LSX instructions.
KVM will enable LSX and restore the vector registers for guest and then
return to guest to continue running.
Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When timer is fired in oneshot mode, CSR TVAL will be -1 rather than 0.
There needs special handing for this situation. There are two scenarios
when oneshot timer is fired. One scenario is that time is fired after
exiting to host, CSR TVAL is set with 0 in order to inject hw interrupt,
and -1 will assigned to CSR TVAL soon.
The other situation is that timer is fired in VM and guest kernel is
hanlding timer IRQ, IRQ is acked and is ready to set next expired timer
value, then vm exits to host. Timer interrupt should not be inject at
this point, else there will be spurious timer interrupt.
Here hw timer irq status in CSR ESTAT is used to judge these two
scenarios. If CSR TVAL is -1, the oneshot timer is fired; and if timer
hw irq is on in CSR ESTAT register, it happens after exiting to host;
else if timer hw irq is off, we think that it happens in vm and timer
IRQ handler has already acked IRQ.
With this patch, runltp with version ltp20230516 passes to run in vm.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Timer emulation method in VM is switch to SW timer, there are two
places where timer emulation is needed. One is during vcpu thread
context switch, the other is halt-polling with idle instruction
emulation. SW timer switching is removed during halt-polling mode,
so it is not necessary to disable SW timer before entering to guest.
This patch removes SW timer handling before entering guest mode, and
put it in HW timer restoring flow when vcpu thread is sched-in. With
this patch, vm timer emulation is simpler, there is SW/HW timer
switch only in vcpu thread context switch scenario.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Currently HW timer CSR registers are allowed to access before entering
to vm and disabled if switch to SW timer in host mode, instead it is not
necessary to do so. HW timer CSR registers can be accessed always, it
is nothing to do with whether it is in vm mode or host mode. This patch
removes the limitation.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
With halt-polling supported, there is checking for pending events or
interrupts when vcpu executes idle instruction. Pending interrupts
include injected SW interrupts and passthrough HW interrupts, such as
HW timer interrupts, since HW timer works still even if vcpu exists from
VM mode.
Since HW timer pending interrupt can be set directly with CSR status
register, and pending HW timer interrupt checking is used in vcpu block
checking function, it is not necessary to switch to SW timer during
halt-polling. This patch adds preemption disabling in function
kvm_cpu_has_pending_timer(), and removes SW timer switching in idle
instruction emulation function.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
During shadow mmu page fault, there is checking for huge page for
specified memslot. Page fault is hot path, check logic can be done
when memslot is created. Here two flags are added for huge page
checking, KVM_MEM_HUGEPAGE_CAPABLE and KVM_MEM_HUGEPAGE_INCAPABLE.
Indeed for an optimized qemu, memslot for DRAM is always huge page
aligned. The flag is firstly checked during hot page fault path.
Now only huge page flag is supported, there is a long way for super
page support in LoongArch system. Since super page size is 64G for
16K pagesize and 1G for 4K pagesize, 64G physical address is rarely
used and LoongArch kernel needs support super page for 4K. Also memory
layout of LoongArch qemu VM should be 1G aligned.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
* kvm-arm64/fgt-rework: (30 commits)
: .
: Fine Grain Trapping update, courtesy of Fuad Tabba.
:
: From the cover letter:
:
: "This patch series has fixes, updates, and code for validating
: fine grain trap register masks, as well as some fixes to feature
: trapping in pKVM.
:
: New fine grain trap (FGT) bits have been defined in the latest
: Arm Architecture System Registers xml specification (DDI0601 and
: DDI0602 2023-09) [1], so the code is updated to reflect them.
: Moreover, some of the already-defined masks overlap with RES0,
: which this series fixes.
:
: It also adds FGT register masks that weren't defined earlier,
: handling of HAFGRTR_EL2 in nested virt, as well as build time
: validation that the bits of the various masks are all accounted
: for and without overlap."
:
: This branch also drags the arm64/for-next/sysregs branch,
: which is a dependency on this work.
: .
KVM: arm64: Trap external trace for protected VMs
KVM: arm64: Mark PAuth as a restricted feature for protected VMs
KVM: arm64: Fix which features are marked as allowed for protected VMs
KVM: arm64: Macros for setting/clearing FGT bits
KVM: arm64: Define FGT nMASK bits relative to other fields
KVM: arm64: Use generated FGT RES0 bits instead of specifying them
KVM: arm64: Add build validation for FGT trap mask values
KVM: arm64: Update and fix FGT register masks
KVM: arm64: Handle HAFGRTR_EL2 trapping in nested virt
KVM: arm64: Add bit masks for HAFGRTR_EL2
KVM: arm64: Add missing HFGITR_EL2 FGT entries to nested virt
KVM: arm64: Add missing HFGxTR_EL2 FGT entries to nested virt
KVM: arm64: Explicitly trap unsupported HFGxTR_EL2 features
arm64/sysreg: Add missing system instruction definitions for FGT
arm64/sysreg: Add missing system register definitions for FGT
arm64/sysreg: Add missing ExtTrcBuff field definition to ID_AA64DFR0_EL1
arm64/sysreg: Add missing Pauth_LR field definitions to ID_AA64ISAR1_EL1
arm64/sysreg: Add new system registers for GCS
arm64/sysreg: Add definition for FPMR
arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The recent fix to ignore invalid x2APIC entries inadvertently broke
systems with creative MADT APIC tables. The affected systems have APIC
MADT tables where all entries have invalid APIC IDs (0xFF), which means
they register exactly zero CPUs.
But the condition to ignore the entries of APIC IDs < 255 in the X2APIC
MADT table is solely based on the count of MADT APIC table entries.
As a consequence, the affected machines enumerate no secondary CPUs at
all because the APIC table has entries and therefore the X2APIC table
entries with APIC IDs < 255 are ignored.
Change the condition so that the APIC table preference for APIC IDs <
255 only becomes effective when the APIC table has valid APIC ID
entries.
IOW, an APIC table full of invalid APIC IDs is considered to be empty
which in consequence enables the X2APIC table entries with a APIC ID
< 255 and restores the expected behaviour.
Fixes: ec9aedb2aa1a ("x86/acpi: Ignore invalid x2APIC entries")
Reported-by: John Sperbeck <jsperbeck@google.com>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/169953729188.3135.6804572126118798018.tip-bot2@tip-bot2
|
|
pKVM does not support external trace for protected VMs. Trap
external trace, and add the ExtTrcBuff to make it possible to
check for the feature.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-18-tabba@google.com
|
|
Protected VMs will only support basic PAuth (FEAT_PAuth). Mark it
as restricted to ensure that later versions aren't supported for
protected guests.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-17-tabba@google.com
|
|
Cache maintenance operations are not trapped for protected VMs,
and shouldn't be. Mark them as allowed.
Moreover, features advertised by ID_AA64PFR2 and ID_AA64MMFR3 are
(already) not allowed, mark them as such.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-16-tabba@google.com
|
|
There's a lot of boilerplate code for setting and clearing FGT
bits when activating guest traps. Refactor it into macros. These
macros will also be used in future patch series.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-15-tabba@google.com
|
|
Now that RES0 and MASK have full coverage, no need to manually
encode nMASK. Calculate it relative to the other fields.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-14-tabba@google.com
|
|
Now that all FGT fields are accounted for and represented, use
the generated value instead of manually specifying them.
For __HFGWTR_EL2_RES0, however, there is no generated value. Its
fields are subset of HFGRTR_EL2, with the remaining being RES0.
Therefore, add a mask that represents the HFGRTR_EL2 only bits
and define __HFGWTR_EL2_* using those and the __HFGRTR_EL2_*
fields.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-13-tabba@google.com
|
|
These checks help ensure that all the bits are accounted for,
that there hasn't been a transcribing error from the spec nor
from the generated mask values, which will be used in subsequent
patches.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-12-tabba@google.com
|
|
New trap bits have been defined since the latest update to this
patch. Moreover, the existing definitions of some of the mask
and the RES0 bits overlap, which could be wrong, confusing, or
both.
Update the bits based on DDI0601 2023-09, and ensure that the
existing bits are consistent.
Subsequent patches will use the generated RES0 fields instead of
specifying them manually. This patch keeps the manual encoding of
the bits to make it easier to review the series.
Fixes: 0fd76865006d ("KVM: arm64: Add nPIR{E0}_EL1 to HFG traps")
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-11-tabba@google.com
|
|
Add the encodings to fine grain trapping fields for HAFGRTR_EL2
and add the associated handling code in nested virt. Based on
DDI0601 2023-09. Add the missing field definitions as well,
both to generate the correct RES0 mask and to be able to toggle
their FGT bits.
Also add the code for handling FGT trapping, reading of the
register, to nested virt.
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-10-tabba@google.com
|
|
The KERNEL_FPR mask only contains a flag for the first eight vector
registers. However floating point registers overlay parts of the first
sixteen vector registers.
This could lead to vector register corruption if a kernel fpu context uses
any of the vector registers 8 to 15 and is interrupted or calls a
KERNEL_FPR context. If that context uses also vector registers 8 to 15,
their contents will be corrupted on return.
Luckily this is currently not a real bug, since the kernel has only one
KERNEL_FPR user with s390_adjust_jiffies() and it is only using floating
point registers 0 to 2.
Fix this by using the correct bits for KERNEL_FPR.
Fixes: 7f79695cc1b6 ("s390/fpu: improve kernel_fpu_[begin|end]")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
To support HAFGRTR_EL2 supported in nested virt in the following
patch, first add its bitmask definitions based on DDI0601 2023-09.
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-9-tabba@google.com
|
|
Add the missing nested virt FGT table entries HFGITR_EL2. Based
on DDI0601 and DDI0602 2023-09.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-8-tabba@google.com
|
|
Add the missing nested virt FGT table entries HFGxTR_EL2. Based
on DDI0601 2023-09.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-7-tabba@google.com
|
|
Do not rely on the value of __HFGRTR_EL2_nMASK to trap
unsupported features, since the nMASK can (and will) change as
new traps are added and as its value is updated. Instead,
explicitly specify the trap bits.
Suggested-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-6-tabba@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a bug where heavy VAS (accelerator) usage could race with
partition migration and prevent the migration from completing.
- Update MAINTAINERS to add Aneesh & Naveen.
Thanks to Haren Myneni.
* tag 'powerpc-6.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
MAINTAINERS: powerpc: Add Aneesh & Naveen
powerpc/pseries/vas: Migration suspend waits for no in-progress open windows
|
|
Add the definitions of missing system instructions that are
trappable by fine grain traps. The definitions are based on
DDI0602 2023-09.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20231214100158.2305400-5-tabba@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add the definitions of missing system registers that are
trappable by fine grain traps. The definitions are based on
DDI0601 2023-09.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-4-tabba@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add the ExtTrcBuff field definitions to ID_AA64DFR0_EL1 from
DDI0601 2023-09.
This field isn't used yet. Adding it for completeness and because
it will be used in future patches.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-3-tabba@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add the Pauth_LR field definitions to ID_AA64ISAR1_EL1, based on
DDI0601 2023-09.
These fields aren't used yet. Adding them for completeness and
consistency (definition already exists for ID_AA64ISAR2_EL1).
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20231214100158.2305400-2-tabba@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Arm CMN perf: fix the DTC allocation failure path which can end up
erroneously clearing live counters
- arm64/mm: fix hugetlb handling of the dirty page state leading to a
continuous fault loop in user on hardware without dirty bit
management (DBM). That's caused by the dirty+writeable information
not being properly preserved across a series of mprotect(PROT_NONE),
mprotect(PROT_READ|PROT_WRITE)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
perf/arm-cmn: Fail DTC counter allocation correctly
|
|
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes. 8 are cc:stable and the other 9 pertain to post-6.6
issues"
* tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mglru: reclaim offlined memcgs harder
mm/mglru: respect min_ttl_ms with memcgs
mm/mglru: try to stop at high watermarks
mm/mglru: fix underprotected page cache
mm/shmem: fix race in shmem_undo_range w/THP
Revert "selftests: error out if kernel header files are not yet built"
crash_core: fix the check for whether crashkernel is from high memory
x86, kexec: fix the wrong ifdeffery CONFIG_KEXEC
sh, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
mips, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
m68k, kexec: fix the incorrect ifdeffery and build dependency of CONFIG_KEXEC
loongarch, kexec: change dependency of object files
mm/damon/core: make damon_start() waits until kdamond_fn() starts
selftests/mm: cow: print ksft header before printing anything else
mm: fix VMA heap bounds checking
riscv: fix VMALLOC_START definition
kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP
|
|
apply_alternatives() treats alternatives with the ALT_FLAG_NOT flag set
special as it optimizes the existing NOPs in place.
Unfortunately, this happens with interrupts enabled and does not provide any
form of core synchronization.
So an interrupt hitting in the middle of the update and using the affected code
path will observe a half updated NOP and crash and burn. The following
3 NOP sequence was observed to expose this crash halfway reliably under QEMU
32bit:
0x90 0x90 0x90
which is replaced by the optimized 3 byte NOP:
0x8d 0x76 0x00
So an interrupt can observe:
1) 0x90 0x90 0x90 nop nop nop
2) 0x8d 0x90 0x90 undefined
3) 0x8d 0x76 0x90 lea -0x70(%esi),%esi
4) 0x8d 0x76 0x00 lea 0x0(%esi),%esi
Where only #1 and #4 are true NOPs. The same problem exists for 64bit obviously.
Disable interrupts around this NOP optimization and invoke sync_core()
before re-enabling them.
Fixes: 270a69c4485d ("x86/alternative: Support relocations in alternatives")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com
|
|
text_poke_early() does:
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
sync_core();
That's not really correct because the synchronization should happen before
interrupts are re-enabled to ensure that a pending interrupt observes the
complete update of the opcodes.
It's not entirely clear whether the interrupt entry provides enough
serialization already, but moving the sync_core() invocation into interrupt
disabled region does no harm and is obviously correct.
Fixes: 6fffacb30349 ("x86/alternatives, jumplabel: Use text_poke_early() before mm_init()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com
|
|
Chris reported that a Dell PowerEdge T340 system stopped to boot when upgrading
to a kernel which contains the parallel hotplug changes. Disabling parallel
hotplug on the kernel command line makes it boot again.
It turns out that the Dell BIOS has x2APIC enabled and the boot CPU comes up in
X2APIC mode, but the APs come up inconsistently in xAPIC mode.
Parallel hotplug requires that the upcoming CPU reads out its APIC ID from the
local APIC in order to map it to the Linux CPU number.
In this particular case the readout on the APs uses the MMIO mapped registers
because the BIOS failed to enable x2APIC mode. That readout results in a page
fault because the kernel does not have the APIC MMIO space mapped when X2APIC
mode was enabled by the BIOS on the boot CPU and the kernel switched to X2APIC
mode early. That page fault can't be handled on the upcoming CPU that early and
results in a silent boot failure.
If parallel hotplug is disabled the system boots because in that case the APIC
ID read is not required as the Linux CPU number is provided to the AP in the
smpboot control word. When the kernel uses x2APIC mode then the APs are
switched to x2APIC mode too slightly later in the bringup process, but there is
no reason to do it that late.
Cure the BIOS bogosity by checking in the parallel bootup path whether the
kernel uses x2APIC mode and if so switching over the APs to x2APIC mode before
the APIC ID readout.
Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Chris Lindee <chris.lindee@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Chris Lindee <chris.lindee@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/CA%2B2tU59853R49EaU_tyvOZuOTDdcU0RshGyydccp9R1NX9bEeQ@mail.gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Current release - regressions:
- tcp: fix tcp_disordered_ack() vs usec TS resolution
Current release - new code bugs:
- dpll: sanitize possible null pointer dereference in
dpll_pin_parent_pin_set()
- eth: octeon_ep: initialise control mbox tasks before using APIs
Previous releases - regressions:
- io_uring/af_unix: disable sending io_uring over sockets
- eth: mlx5e:
- TC, don't offload post action rule if not supported
- fix possible deadlock on mlx5e_tx_timeout_work
- eth: iavf: fix iavf_shutdown to call iavf_remove instead iavf_close
- eth: bnxt_en: fix skb recycling logic in bnxt_deliver_skb()
- eth: ena: fix DMA syncing in XDP path when SWIOTLB is on
- eth: team: fix use-after-free when an option instance allocation
fails
Previous releases - always broken:
- neighbour: don't let neigh_forced_gc() disable preemption for long
- net: prevent mss overflow in skb_segment()
- ipv6: support reporting otherwise unknown prefix flags in
RTM_NEWPREFIX
- tcp: remove acked SYN flag from packet in the transmit queue
correctly
- eth: octeontx2-af:
- fix a use-after-free in rvu_nix_register_reporters
- fix promisc mcam entry action
- eth: dwmac-loongson: make sure MDIO is initialized before use
- eth: atlantic: fix double free in ring reinit logic"
* tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
net: atlantic: fix double free in ring reinit logic
appletalk: Fix Use-After-Free in atalk_ioctl
net: stmmac: Handle disabled MDIO busses from devicetree
net: stmmac: dwmac-qcom-ethqos: Fix drops in 10M SGMII RX
dpaa2-switch: do not ask for MDB, VLAN and FDB replay
dpaa2-switch: fix size of the dma_unmap
net: prevent mss overflow in skb_segment()
vsock/virtio: Fix unsigned integer wrap around in virtio_transport_has_space()
Revert "tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set"
MIPS: dts: loongson: drop incorrect dwmac fallback compatible
stmmac: dwmac-loongson: drop useless check for compatible fallback
stmmac: dwmac-loongson: Make sure MDIO is initialized before use
tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
dpll: sanitize possible null pointer dereference in dpll_pin_parent_pin_set()
net: ena: Fix XDP redirection error
net: ena: Fix DMA syncing in XDP path when SWIOTLB is on
net: ena: Fix xdp drops handling due to multibuf packets
net: ena: Destroy correct number of xdp queues upon failure
net: Remove acked SYN flag from packet in the transmit queue correctly
qed: Fix a potential use-after-free in qed_cxt_tables_alloc
...
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into arm/fixes
- Fix ethernet node for Orange Pi Zero 3 board
* tag 'sunxi-fixes-for-6.7-1' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
arm64: dts: allwinner: h616: update emac for Orange Pi Zero 3
Link: https://lore.kernel.org/r/ZXtVUJ0SG2NRpPG4@archlinux
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Add hugetlb definitions if THP enabled. ARC doesn't support
HugeTLB FS but it supports THP. Some kernel code such as pagemap
uses hugetlb definitions with THP.
This patch fixes ARC build issue (HPAGE_SIZE undeclared error) with
TRANSPARENT_HUGEPAGE enabled.
Signed-off-by: Pavel Kozlov <pavel.kozlov@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Deal with a regression in the recently refactored x86 EFI stub code
on older Dell systems by disabling randomization of the physical load
address
- Use the correct load address for relocatable Loongarch kernels
* tag 'efi-urgent-for-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/x86: Avoid physical KASLR on older Dell systems
efi/loongarch: Use load address to calculate kernel entry address
|
|
When intercepts are enabled for MSR_IA32_XSS, the host will swap in/out
the guest-defined values while context-switching to/from guest mode.
However, in the case of SEV-ES, vcpu->arch.guest_state_protected is set,
so the guest-defined value is effectively ignored when switching to
guest mode with the understanding that the VMSA will handle swapping
in/out this register state.
However, SVM is still configured to intercept these accesses for SEV-ES
guests, so the values in the initial MSR_IA32_XSS are effectively
read-only, and a guest will experience undefined behavior if it actually
tries to write to this MSR. Fortunately, only CET/shadowstack makes use
of this register on SEV-ES-capable systems currently, which isn't yet
widely used, but this may become more of an issue in the future.
Additionally, enabling intercepts of MSR_IA32_XSS results in #VC
exceptions in the guest in certain paths that can lead to unexpected #VC
nesting levels. One example is SEV-SNP guests when handling #VC
exceptions for CPUID instructions involving leaf 0xD, subleaf 0x1, since
they will access MSR_IA32_XSS as part of servicing the CPUID #VC, then
generate another #VC when accessing MSR_IA32_XSS, which can lead to
guest crashes if an NMI occurs at that point in time. Running perf on a
guest while it is issuing such a sequence is one example where these can
be problematic.
Address this by disabling intercepts of MSR_IA32_XSS for SEV-ES guests
if the host/guest configuration allows it. If the host/guest
configuration doesn't allow for MSR_IA32_XSS, leave it intercepted so
that it can be caught by the existing checks in
kvm_{set,get}_msr_common() if the guest still attempts to access it.
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Cc: Alexey Kardashevskiy <aik@amd.com>
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231016132819.1002933-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The hypervisor returns migration failure if all VAS windows are not
closed. During pre-migration stage, vas_migration_handler() sets
migration_in_progress flag and closes all windows from the list.
The allocate VAS window routine checks the migration flag, setup
the window and then add it to the list. So there is possibility of
the migration handler missing the window that is still in the
process of setup.
t1: Allocate and open VAS t2: Migration event
window
lock vas_pseries_mutex
If migration_in_progress set
unlock vas_pseries_mutex
return
open window HCALL
unlock vas_pseries_mutex
Modify window HCALL lock vas_pseries_mutex
setup window migration_in_progress=true
Closes all windows from the list
// May miss windows that are
// not in the list
unlock vas_pseries_mutex
lock vas_pseries_mutex return
if nr_closed_windows == 0
// No DLPAR CPU or migration
add window to the list
// Window will be added to the
// list after the setup is completed
unlock vas_pseries_mutex
return
unlock vas_pseries_mutex
Close VAS window
// due to DLPAR CPU or migration
return -EBUSY
This patch resolves the issue with the following steps:
- Set the migration_in_progress flag without holding mutex.
- Introduce nr_open_wins_progress counter in VAS capabilities
struct
- This counter tracks the number of open windows are still in
progress
- The allocate setup window thread closes windows if the migration
is set and decrements nr_open_window_progress counter
- The migration handler waits for no in-progress open windows.
The code flow with the fix is as follows:
t1: Allocate and open VAS t2: Migration event
window
lock vas_pseries_mutex
If migration_in_progress set
unlock vas_pseries_mutex
return
open window HCALL
nr_open_wins_progress++
// Window opened, but not
// added to the list yet
unlock vas_pseries_mutex
Modify window HCALL migration_in_progress=true
setup window lock vas_pseries_mutex
Closes all windows from the list
While nr_open_wins_progress {
unlock vas_pseries_mutex
lock vas_pseries_mutex sleep
if nr_closed_windows == 0 // Wait if any open window in
or migration is not started // progress. The open window
// No DLPAR CPU or migration // thread closes the window without
add window to the list // adding to the list and return if
nr_open_wins_progress-- // the migration is in progress.
unlock vas_pseries_mutex
return
Close VAS window
nr_open_wins_progress--
unlock vas_pseries_mutex
return -EBUSY lock vas_pseries_mutex
}
unlock vas_pseries_mutex
return
Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231125235104.3405008-1-haren@linux.ibm.com
|