path: root/arch/x86/kernel
Age | Commit message | Author | Files | Lines
2024-04-12 | x86/sev: Take NUMA node into account when allocating memory for per-CPU SEV data | Li RongQing | 1 | -4/+4
per-CPU SEV data is dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Nikunj A Dadhania <nikunj@amd.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20240412030130.49704-1-lirongqing@baidu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-12 | Merge branch 'x86/urgent' into x86/cpu, to resolve conflict | Ingo Molnar | 7 | -101/+216
There's a new conflict between this commit pending in x86/cpu: 63edbaa48a57 x86/cpu/topology: Add support for the AMD 0x80000026 leaf And these fixes in x86/urgent: c064b536a8f9 x86/cpu/amd: Make the NODEID_MSR union actually work 1b3108f6898e x86/cpu/amd: Make the CPUID 0x80000008 parser correct Resolve them. Conflicts: arch/x86/kernel/cpu/topology_amd.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-12 | x86/cpu/amd: Move TOPOEXT enablement into the topology parser | Thomas Gleixner | 2 | -15/+21
The topology rework missed that early_init_amd() tries to re-enable the Topology Extensions when the BIOS disabled them. The new parser is invoked before early_init_amd() so the re-enable attempt happens too late. Move it into the AMD specific topology parser code where it belongs. Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx
2024-04-12 | x86/cpu/amd: Make the NODEID_MSR union actually work | Thomas Gleixner | 1 | -3/+3
A system with NODEID_MSR was reported to crash during early boot without any output. The reason is that the union which is used for accessing the bitfields in the MSR is written wrongly and the resulting executable code accesses the wrong part of the MSR data. As a consequence a later division by that value results in 0 and that result is used for another division as divisor, which obviously does not work well. The magic world of C, unions and bitfields: union { u64 bita : 3, bitb : 3; u64 all; } x; x.all = foo(); a = x.bita; b = x.bitb; results in the effective executable code of: a = b = x.bita; because bita and bitb are treated as union members and therefore both end up at bit offset 0. Wrapping the bitfield into an anonymous struct: union { struct { u64 bita : 3, bitb : 3; }; u64 all; } x; works like expected. Rework the NODEID_MSR union in exactly that way to cure the problem. Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Laura Nao <laura.nao@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/
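The union pitfall described in this commit message can be reproduced outside the kernel. The following is a minimal userspace sketch, not the kernel's NODEID_MSR definition: the type and function names (`demo_broken`, `demo_fixed`, and so on) are hypothetical, `uint64_t` stands in for `u64`, and the bit layout assumes GCC or Clang on a little-endian target such as x86-64, since bitfield layout is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Broken: bita and bitb are both direct union members, so each starts at bit 0. */
union demo_broken {
	uint64_t bita : 3,
		 bitb : 3;
	uint64_t all;
};

/* Fixed: the anonymous struct (C11) lays the bitfields out one after the other. */
union demo_fixed {
	struct {
		uint64_t bita : 3,
			 bitb : 3;
	};
	uint64_t all;
};

/* Assumption of this sketch: the bitfield struct occupies one 64-bit word. */
static_assert(sizeof(union demo_fixed) == sizeof(uint64_t),
	      "bitfields are expected to overlay a single 64-bit word");

/* Reads bitb through the broken layout; it aliases bits 0-2, like bita. */
static unsigned int demo_broken_bitb(uint64_t raw)
{
	union demo_broken x = { .all = raw };

	return (unsigned int)x.bitb;
}

/* Reads bitb through the fixed layout; it covers bits 3-5. */
static unsigned int demo_fixed_bitb(uint64_t raw)
{
	union demo_fixed x = { .all = raw };

	return (unsigned int)x.bitb;
}
```

With a raw value of 0x2b (binary 101011), bits 0-2 hold 3 and bits 3-5 hold 5, so on the assumed toolchain the broken layout reports 3 for `bitb` while the fixed layout reports 5.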
2024-04-12 | x86/cpu/amd: Make the CPUID 0x80000008 parser correct | Thomas Gleixner | 1 | -6/+18
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the package. The parser uses this value to initialize the SMT domain level. That's wrong because cpu_nthreads does not describe the number of threads per physical core. So this needs to set the CORE domain level and let the later parsers set the SMT shift if available. Preset the SMT domain level with the assumption of one thread per core, which is correct if there are no other CPUID leafs to parse, and propagate cpu_nthreads and the core level APIC bitwidth into the CORE domain. Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Laura Nao <laura.nao@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de
2024-04-12 | x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI | Josh Poimboeuf | 1 | -1/+1
For consistency with the other CONFIG_MITIGATION_* options, replace the CONFIG_SPECTRE_BHI_{ON,OFF} options with a single CONFIG_MITIGATION_SPECTRE_BHI option. [ mingo: Fix ] Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/3833812ea63e7fdbe36bf8b932e63f70d18e2a2a.1712813475.git.jpoimboe@kernel.org
2024-04-12 | x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto | Josh Poimboeuf | 1 | -9/+1
Unlike most other mitigations' "auto" options, spectre_bhi=auto only mitigates newer systems, which is confusing and not particularly useful. Remove it. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org
2024-04-11 | Merge tag 'v6.9-rc3' into x86/boot, to pick up fixes before queueing up more changes | Ingo Molnar | 24 | -201/+185
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-11 | x86/bugs: Clarify that syscall hardening isn't a BHI mitigation | Josh Poimboeuf | 1 | -3/+3
While syscall hardening helps prevent some BHI attacks, there's still other low-hanging fruit remaining. Don't classify it as a mitigation and make it clear that the system may still be vulnerable if it doesn't have a HW or SW mitigation enabled. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org
2024-04-11 | x86/bugs: Fix BHI handling of RRSBA | Josh Poimboeuf | 1 | -12/+18
The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been disabled by the Spectre v2 mitigation (or can otherwise be disabled by the BHI mitigation itself if needed). In that case retpolines are fine. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/6f56f13da34a0834b69163467449be7f58f253dc.1712813475.git.jpoimboe@kernel.org
2024-04-11 | x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr' | Ingo Molnar | 3 | -42/+42
So we are using the 'ia32_cap' value in a number of places, which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register. But there's very little 'IA32' about it - this isn't 32-bit only code, nor does it originate from there, it's just a historic quirk that many Intel MSR names are prefixed with IA32_. This is already clear from the helper method around the MSR: x86_read_arch_cap_msr(), which doesn't have the IA32 prefix. So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with its role and with the naming of the helper function. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
2024-04-11 | x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES | Josh Poimboeuf | 1 | -15/+7
There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and over. It's even read in the BHI sysfs function which is a big no-no. Just read it once and cache it. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
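The read-once-and-cache pattern this commit describes can be sketched in plain C. This is an illustration only, not the kernel code: every name here is hypothetical, and `demo_read_arch_cap_msr()` merely simulates an expensive MSR read while counting how often it is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counts simulated MSR reads so the effect of caching is observable. */
static unsigned int demo_msr_reads;

/* Hypothetical stand-in for the MSR read; 0x5 is an arbitrary demo value. */
static uint64_t demo_read_arch_cap_msr(void)
{
	demo_msr_reads++;
	return 0x5;
}

static uint64_t demo_arch_cap;
static bool demo_arch_cap_valid;

/* Read the value exactly once, early, and keep it. */
static void demo_cache_arch_cap(void)
{
	demo_arch_cap = demo_read_arch_cap_msr();
	demo_arch_cap_valid = true;
}

/* Later users consult the cached copy instead of re-reading the MSR. */
static bool demo_has_cap(uint64_t bit)
{
	assert(demo_arch_cap_valid);	/* catch use before initialization */
	return (demo_arch_cap & bit) != 0;
}
```

After one call to `demo_cache_arch_cap()`, any number of `demo_has_cap()` queries leaves the read counter at 1, which is the property that keeps slow paths out of callers such as a sysfs show function.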
2024-04-10 | x86/topology: Don't update cpu_possible_map in topo_set_cpuids() | Thomas Gleixner | 1 | -2/+5
topo_set_cpuids() updates cpu_present_map and cpu_possible_map. It is invoked during enumeration and "physical hotplug" operations. In the latter case this results in a kernel crash because cpu_possible_map is marked read only after init completes. There is no reason to update cpu_possible_map in that function. During enumeration cpu_possible_map is not relevant and gets fully initialized after enumeration completed. On "physical hotplug" the bit is already set because the kernel allows only CPUs to be plugged which have been enumerated and associated to a CPU number during early boot. Remove the bogus update of cpu_possible_map. Fixes: 0e53e7b656cf ("x86/cpu/topology: Sanitize the APIC admission logic") Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/87ttkc6kwx.ffs@tglx
2024-04-10 | x86/bugs: Fix return type of spectre_bhi_state() | Daniel Sneddon | 1 | -1/+1
The definition of spectre_bhi_state() incorrectly returns a const char * const. This causes a compiler warning when building with W=1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers] 2812 | static const char * const spectre_bhi_state(void) Remove the const qualifier from the pointer. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240409230806.1545822-1-daniel.sneddon@linux.intel.com
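The difference between a meaningful and a meaningless `const` can be shown with a small sketch. The names and strings below are hypothetical and unrelated to the actual spectre_bhi_state() body. A qualifier on a returned pointer value is ignored because a return value is not an object, whereas the same qualifier on an array of pointers does protect the array.

```c
#include <assert.h>
#include <string.h>

enum demo_state { DEMO_VULNERABLE, DEMO_MITIGATED, DEMO_NSTATES };

/*
 * The shape the warning complains about would be:
 *
 *   static const char * const demo_state_name(enum demo_state s);
 *
 * The second const qualifies the returned pointer value and is ignored.
 * The declaration below keeps only the meaningful one: the characters
 * pointed to are const.
 */
static const char *demo_state_name(enum demo_state s)
{
	/* Here "* const" is useful: the table entries cannot be reassigned. */
	static const char * const names[DEMO_NSTATES] = {
		"vulnerable",
		"mitigated",
	};

	assert((unsigned int)s < DEMO_NSTATES);
	return names[s];
}
```

Callers are unaffected by the change, since they could never observe the dropped qualifier; they still receive a pointer to read-only characters.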
2024-04-10 | x86/cpu: Improve readability of per-CPU cpumask initialization code | Ingo Molnar | 2 | -14/+17
In smp_prepare_cpus_common() and x2apic_prepare_cpu(): - use 'cpu' instead of 'i' - use 'node' instead of 'n' - use vertical alignment to improve readability - better structure basic blocks - reduce col80 checkpatch damage Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org
2024-04-10 | x86/cpu: Take NUMA node into account when allocating per-CPU cpumasks | Li RongQing | 2 | -7/+9
per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. [ mingo: Rewrote the changelog. ] Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240410030114.6201-1-lirongqing@baidu.com
2024-04-09 | x86/alternatives: Sort local vars in apply_alternatives() | Borislav Petkov (AMD) | 1 | -2/+2
In a reverse x-mas tree. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-5-bp@alien8.de
2024-04-09 | x86/alternatives: Optimize optimize_nops() | Borislav Petkov (AMD) | 1 | -0/+4
Return early if NOPs have already been optimized. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-4-bp@alien8.de
2024-04-09 | x86/alternatives: Get rid of __optimize_nops() | Borislav Petkov (AMD) | 1 | -43/+16
There's no need to carve out bits of the NOP optimization functionality and look at JMP opcodes - simply do one more NOPs optimization pass at the end of patching. A lot simpler code. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-3-bp@alien8.de
2024-04-09 | x86/alternatives: Use a temporary buffer when optimizing NOPs | Borislav Petkov (AMD) | 2 | -45/+48
Instead of optimizing NOPs in-place, use a temporary buffer like the usual alternatives patching flow does. This obviates the need to grab locks when patching, see 6778977590da ("x86/alternatives: Disable interrupts and sync when optimizing NOPs in place") While at it, add nomenclature definitions clarifying and simplifying the naming of function-local variables in the alternatives code. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-2-bp@alien8.de
2024-04-09 | x86/alternatives: Catch late X86_FEATURE modifiers | Borislav Petkov (AMD) | 1 | -0/+3
After alternatives have been patched, changes to the X86_FEATURE flags won't take effect and could potentially even be wrong. Warn about it. This is something which has been long overdue. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Srikanth Aithal <sraithal@amd.com> Link: https://lore.kernel.org/r/20240327154317.29909-3-bp@alien8.de
2024-04-09 | Merge tag 'v6.9-rc3' into locking/core, to pick up fixes | Ingo Molnar | 24 | -201/+185
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-09 | x86/mce: Implement recovery for errors in TDX/SEAM non-root mode | Tony Luck | 2 | -2/+32
Machine check SMIs (MSMI) signaled during SEAM operation (typically inside TDX guests), on a system with Intel eMCA enabled, might eventually be reported to the kernel #MC handler with the saved RIP on the stack pointing to the instruction in kernel code after the SEAMCALL instruction that entered the SEAM operation. Linux currently says that is a fatal error and shuts down. There is a new bit in IA32_MCG_STATUS that, when set to 1, indicates that the machine check didn't originally occur at that saved RIP, but during SEAM non-root operation. Add new entries to the severity table to detect this for both data load and instruction fetch that set the severity to "AR" (action required). Increase the width of the mcgmask/mcgres fields in "struct severity" from unsigned char to unsigned short since the new bit is in position 12. Action required for these errors is just mark the page as poisoned and return from the machine check handler. HW ABI notes: ============= The SEAM_NR bit in IA32_MCG_STATUS hasn't yet made it into the Intel Software Developers' Manual. But it is described in section 16.5.2 of "Intel(R) Trust Domain Extensions (Intel(R) TDX) Module Base Architecture Specification" downloadable from: https://cdrdv2.intel.com/v1/dl/getContent/733575 Backport notes: =============== Little value in backporting this patch to stable or LTS kernels as this is only relevant with support for TDX, which I assume won't be backported. But for anyone taking this to v6.1 or older, you also need commit: a51cbd0d86d3 ("x86/mce: Use severity table to handle uncorrected errors in kernel") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240408180944.44638-1-tony.luck@intel.com
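The reason for widening the mcgmask/mcgres fields can be checked with a few lines of C. This is a sketch of the arithmetic only: the helper names and `DEMO_BIT12` are hypothetical, and the only fact taken from the commit message is that the new bit sits at position 12. It assumes 8-bit chars.

```c
#include <assert.h>
#include <limits.h>

static_assert(CHAR_BIT == 8, "sketch assumes 8-bit chars");

/* A mask with only bit 12 set, the position named in the commit message. */
#define DEMO_BIT12 (1u << 12)

/* Store a mask in an 8-bit field: bits 8 and above are lost. */
static unsigned int demo_store_in_uchar(unsigned int mask)
{
	unsigned char field = (unsigned char)mask;

	return field;
}

/* Store a mask in a field of at least 16 bits: bit 12 survives. */
static unsigned int demo_store_in_ushort(unsigned int mask)
{
	unsigned short field = (unsigned short)mask;

	return field;
}
```

An `unsigned char` field silently turns the bit-12 mask into 0, so a table entry relying on it could never match, while an `unsigned short` field keeps the value 0x1000.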
2024-04-09 | Merge tag 'v6.9-rc3' into x86/cpu, to pick up fixes | Ingo Molnar | 24 | -201/+185
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-09 | Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 3 | -21/+125
Pull x86 mitigations from Thomas Gleixner: "Mitigations for the native BHI hardware vulnerability: Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes. Add mitigations against it either with the help of microcode or with software sequences for the affected CPUs" [ This also ends up enabling the full mitigation by default despite the system call hardening, because apparently there are other indirect calls that are still sufficiently reachable, and the 'auto' case just isn't hardened enough. We'll have some more inevitable tweaking in the future - Linus ] * tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM: x86: Add BHI_NO x86/bhi: Mitigate KVM by default x86/bhi: Add BHI mitigation knob x86/bhi: Enumerate Branch History Injection (BHI) bug x86/bhi: Define SPEC_CTRL_BHI_DIS_S x86/bhi: Add support for clearing branch history at syscall entry x86/syscall: Don't force use of indirect calls for system calls x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
2024-04-08 | x86/bhi: Mitigate KVM by default | Pawan Gupta | 1 | -1/+8
BHI mitigation mode spectre_bhi=auto does not deploy the software mitigation by default. In a cloud environment, it is a likely scenario where userspace is trusted but the guests are not trusted. Deploying system wide mitigation in such cases is not desirable. Update the auto mode to unconditionally mitigate against malicious guests. Deploy the software sequence at VMexit in auto mode also, when hardware mitigation is not available. Unlike the force =on mode, software sequence is not deployed at syscalls in auto mode. Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 | x86/bhi: Add BHI mitigation knob | Pawan Gupta | 1 | -1/+89
Branch history clearing software sequences and hardware control BHI_DIS_S were defined to mitigate Branch History Injection (BHI). Add cmdline spectre_bhi={on|off|auto} to control BHI mitigation: auto - Deploy the hardware mitigation BHI_DIS_S, if available. on - Deploy the hardware mitigation BHI_DIS_S, if available, otherwise deploy the software sequence at syscall entry and VMexit. off - Turn off BHI mitigation. The default is auto mode which does not deploy the software sequence mitigation. This is because of the hardening done in the syscall dispatch path, which is the likely target of BHI. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
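The three-way knob described in this commit message maps naturally onto an enum. The sketch below shows how such an option string could be parsed in userspace C. It is not the kernel's parser: the enum, its values and the function name are hypothetical, and only the accepted strings `on`, `off` and `auto` come from the commit message.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical modes mirroring spectre_bhi={on|off|auto}. */
enum demo_bhi_mode {
	DEMO_BHI_OFF,
	DEMO_BHI_ON,
	DEMO_BHI_AUTO,
	DEMO_BHI_INVALID,	/* anything else */
};

/* Map the option value (the text after "spectre_bhi=") to a mode. */
static enum demo_bhi_mode demo_parse_spectre_bhi(const char *arg)
{
	assert(arg != NULL);

	if (!strcmp(arg, "on"))
		return DEMO_BHI_ON;
	if (!strcmp(arg, "off"))
		return DEMO_BHI_OFF;
	if (!strcmp(arg, "auto"))
		return DEMO_BHI_AUTO;

	return DEMO_BHI_INVALID;
}
```

Returning an explicit invalid value, rather than silently falling back to a default, leaves the policy decision for unrecognized input to the caller. That is a design choice of this sketch, not something the commit message specifies.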
2024-04-08 | x86/bhi: Enumerate Branch History Injection (BHI) bug | Pawan Gupta | 1 | -8/+16
Mitigation for BHI is selected based on the bug enumeration. Add bits needed to enumerate BHI bug. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 | x86/bhi: Define SPEC_CTRL_BHI_DIS_S | Daniel Sneddon | 1 | -0/+1
Newer processors support a hardware control BHI_DIS_S to mitigate Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel from userspace BHI attacks without having to manually overwrite the branch history. Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL. Mitigation is enabled later. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 | x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file | Josh Poimboeuf | 1 | -12/+12
Change the format of the 'spectre_v2' vulnerabilities sysfs file slightly by converting the commas to semicolons, so that mitigations for future variants can be grouped together and separated by commas. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2024-04-06 | Merge branch 'linus' into x86/urgent, to pick up dependent commit | Ingo Molnar | 2 | -7/+8
We want to fix: 0e110732473e ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO") So merge in Linus's latest into x86/urgent to have it available. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-06 | x86/microcode/AMD: Remove unused PATCH_MAX_SIZE macro | Borislav Petkov (AMD) | 1 | -2/+0
Orphaned after 05e91e721138 ("x86/microcode/AMD: Rip out static buffers") No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-04-06 | x86/microcode/AMD: Avoid -Wformat warning with clang-15 | Arnd Bergmann | 1 | -1/+1
Older versions of clang show a warning for amd.c after a fix for a gcc warning: arch/x86/kernel/cpu/microcode/amd.c:478:47: error: format specifies type \ 'unsigned char' but the argument has type 'u16' (aka 'unsigned short') [-Werror,-Wformat] "amd-ucode/microcode_amd_fam%02hhxh.bin", family); ~~~~~~ ^~~~~~ %02hx In clang-16 and higher, this warning is disabled by default, but clang-15 is still supported, and it's trivial to avoid by adapting the types according to the range of the passed data and the format string. [ bp: Massage commit message. ] Fixes: 2e9064faccd1 ("x86/microcode/amd: Fix snprintf() format string warning in W=1 build") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240405204919.1003409-1-arnd@kernel.org
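One way to make a format string and its argument agree, in the spirit of this commit, is to pass a value whose type matches the `hh` length modifier. The sketch below is a userspace illustration with a hypothetical helper name. The format string is the one quoted in the warning, `uint8_t` stands in for a kernel `u8`, and the family value 0x19 is only an example input.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the firmware file name for a CPU family. The argument is 8 bits
 * wide, which is what the %02hhx conversion expects, so -Wformat has
 * nothing to complain about.
 */
static int demo_fw_name(char *buf, size_t len, uint8_t family)
{
	assert(buf != NULL && len > 0);

	return snprintf(buf, len,
			"amd-ucode/microcode_amd_fam%02hhxh.bin", family);
}
```

For example, formatting family 0x19 into a 64-byte buffer yields `amd-ucode/microcode_amd_fam19h.bin`.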
2024-04-05 | Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 1 | -2/+2
Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, bluetooth and bpf. Fairly usual collection of driver and core fixes. The large selftest accompanying one of the fixes is also becoming a common occurrence. Current release - regressions: - ipv6: fix infinite recursion in fib6_dump_done() - net/rds: fix possible null-deref in newly added error path Current release - new code bugs: - net: do not consume a full cacheline for system_page_pool - bpf: fix bpf_arena-related file descriptor leaks in the verifier - drv: ice: fix freeing uninitialized pointers, fixing misuse of the newfangled __free() auto-cleanup Previous releases - regressions: - x86/bpf: fixes the BPF JIT with retbleed=stuff - xen-netfront: add missing skb_mark_for_recycle, fix page pool accounting leaks, revealed by recently added explicit warning - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6 non-wildcard addresses - Bluetooth: - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with better workarounds to un-break some buggy Qualcomm devices - set conn encrypted before conn establishes, fix re-connecting to some headsets which use slightly unusual sequence of msgs - mptcp: - prevent BPF accessing lowat from a subflow socket - don't account accept() of non-MPC client as fallback to TCP - drv: mana: fix Rx DMA datasize and skb_over_panic - drv: i40e: fix VF MAC filter removal Previous releases - always broken: - gro: various fixes related to UDP tunnels - netns crossing problems, incorrect checksum conversions, and incorrect packet transformations which may lead to panics - bpf: support deferring bpf_link dealloc to after RCU grace period - nf_tables: - release batch on table validation from abort path - release mutex after nft_gc_seq_end from abort path - flush pending destroy work before exit_net release - drv: r8169: skip DASH fw status checks when DASH is disabled" * tag 'net-6.9-rc3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) netfilter: validate user input for expected length net/sched: act_skbmod: prevent kernel-infoleak net: usb: ax88179_178a: avoid the interface always configured as random address net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45() net: ravb: Always update error counters net: ravb: Always process TX descriptor ring netfilter: nf_tables: discard table flag update with pending basechain deletion netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get() netfilter: nf_tables: reject new basechain after table flag update netfilter: nf_tables: flush pending destroy work before exit_net release netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path netfilter: nf_tables: release batch on table validation from abort path Revert "tg3: Remove residual error handling in tg3_suspend" tg3: Remove residual error handling in tg3_suspend net: mana: Fix Rx DMA datasize and skb_over_panic net/sched: fix lockdep splat in qdisc_tree_reduce_backlog() net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping net: stmmac: fix rx queue priority assignment net: txgbe: fix i2c dev name cannot match clkdev net: fec: Set mac_managed_pm during probe ...
2024-04-04 | x86/mce: Make sure to grab mce_sysfs_mutex in set_bank() | Borislav Petkov (AMD) | 1 | -1/+3
Modifying a MCA bank's MCA_CTL bits which control which error types to be reported is done over /sys/devices/system/machinecheck/ ├── machinecheck0 │   ├── bank0 │   ├── bank1 │   ├── bank10 │   ├── bank11 ... sysfs nodes by writing the new bit mask of events to enable. When the write is accepted, the kernel deletes all current timers and reinits all banks. Doing that in parallel can lead to initializing a timer which is already armed and in the timer wheel, i.e., in use already: ODEBUG: init active (active state 0) object: ffff888063a28000 object type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642 WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514 debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514 Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code does. Reported by: Yue Sun <samsun1006219@gmail.com> Reported by: xingwei lee <xrivendell7@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%2Bx_-PqMv98gKig55=2vbzffRw@mail.gmail.com
2024-04-04 | x86/extable: Remove unused fixup type EX_TYPE_COPY | Tong Tiangen | 1 | -1/+0
After 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function") rewrote __copy_user_nocache() to use EX_TYPE_UACCESS instead of the EX_TYPE_COPY exception type, there are no more EX_TYPE_COPY users, so remove it. [ bp: Massage commit message. ] Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240204082627.3892816-2-tongtiangen@huawei.com
2024-04-04 | x86/CPU/AMD: Track SNP host status with cc_platform_*() | Borislav Petkov (AMD) | 3 | -26/+24
The host SNP worthiness can be determined later, after alternatives have been patched, in snp_rmptable_init() depending on cmdline options like iommu=pt which is incompatible with SNP, for example. Which means that one cannot use X86_FEATURE_SEV_SNP and will need to have a special flag for that control. Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places. Move kdump_sev_callback() to its rightful place, while at it. Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Srikanth Aithal <sraithal@amd.com> Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
2024-04-04 | x86/coco: Require seeding RNG with RDRAND on CoCo systems | Jason A. Donenfeld | 1 | -0/+2
There are few uses of CoCo that don't rely on working cryptography and hence a working RNG. Unfortunately, the CoCo threat model means that the VM host cannot be trusted and may actively work against guests to extract secrets or manipulate computation. Since a malicious host can modify or observe nearly all inputs to guests, the only remaining source of entropy for CoCo guests is RDRAND. If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole is meant to gracefully continue on gathering entropy from other sources, but since there aren't other sources on CoCo, this is catastrophic. This is mostly a concern at boot time when initially seeding the RNG, as after that the consequences of a broken RDRAND are much more theoretical. So, try at boot to seed the RNG using 256 bits of RDRAND output. If this fails, panic(). This will also trigger if the system is booted without RDRAND, as RDRAND is essential for a safe CoCo boot. Add this deliberately to be "just a CoCo x86 driver feature" and not part of the RNG itself. Many device drivers and platforms have some desire to contribute something to the RNG, and add_device_randomness() is specifically meant for this purpose. Any driver can call it with seed data of any quality, or even garbage quality, and it can only possibly make the quality of the RNG better or have no effect, but can never make it worse. Rather than trying to build something into the core of the RNG, consider the particular CoCo issue just a CoCo issue, and therefore separate it all out into driver (well, arch/platform) code. [ bp: Massage commit message. ] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Elena Reshetova <elena.reshetova@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240326160735.73531-1-Jason@zx2c4.com
2024-04-04 | x86/fpu: Update fpu_swap_kvm_fpu() uses in comments as well | Li RongQing | 1 | -2/+2
The following commit: d69c1382e1b7 ("x86/kvm: Convert FPU handling to a single swap buffer") reworked KVM FPU handling, but forgot to update the comments in xstate_op_valid(): fpu_swap_kvm_fpu() doesn't exist anymore, fpu_swap_kvm_fpstate() is used instead. Update the comments accordingly. [ mingo: Improved the changelog. ] Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240403091803.818-1-lirongqing@baidu.com
2024-04-03 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 | -5/+6
Pull KVM fixes from Paolo Bonzini: "ARM: - Ensure perf events programmed to count during guest execution are actually enabled before entering the guest in the nVHE configuration - Restore out-of-range handler for stage-2 translation faults - Several fixes to stage-2 TLB invalidations to avoid stale translations, possibly including partial walk caches - Fix early handling of architectural VHE-only systems to ensure E2H is appropriately set - Correct a format specifier warning in the arch_timer selftest - Make the KVM banner message correctly handle all of the possible configurations RISC-V: - Remove redundant semicolon in num_isa_ext_regs() - Fix APLIC setipnum_le/be write emulation - Fix APLIC in_clrip[x] read emulation x86: - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting disabled - Documentation fixes for SEV - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP - Fix a 14-year-old goof in a declaration shared by host and guest; the enabled field used by Linux when running as a guest pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. 
This is really inconsequential because KVM never consumes anything beyond the first 64 bytes, but the resulting struct does not match the documentation Selftests: - Fix spelling mistake in arch_timer selftest" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: arm64: Rationalise KVM banner output arm64: Fix early handling of FEAT_E2H0 not being implemented KVM: arm64: Ensure target address is granule-aligned for range TLBI KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range() KVM: arm64: Don't pass a TLBI level hint when zapping table entries KVM: arm64: Don't defer TLB invalidation when zapping table entries KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test KVM: arm64: Fix out-of-IPA space translation fault handling KVM: arm64: Fix host-programmed guest events in nVHE RISC-V: KVM: Fix APLIC in_clrip[x] read emulation RISC-V: KVM: Fix APLIC setipnum_le/be write emulation RISC-V: KVM: Remove second semicolon KVM: selftests: Fix spelling mistake "trigged" -> "triggered" Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP Documentation: kvm/sev: separate description of firmware KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES ...
2024-04-03x86/apic: Improve data types to fix Coccinelle warningsThorsten Blum1-4/+4
Given that acpi_pm_read_early() returns a u32 (masked to 24 bits), several variables that store its return value are improved by adjusting their data types from unsigned long to u32. Specifically, change deltapm's type from long to u32 because its value fits into 32 bits and it cannot be negative. These data type improvements resolve the following two Coccinelle/coccicheck warnings reported by do_div.cocci:
  arch/x86/kernel/apic/apic.c:734:1-7: WARNING: do_div() does a 64-by-32 division, please consider using div64_long instead.
  arch/x86/kernel/apic/apic.c:742:2-8: WARNING: do_div() does a 64-by-32 division, please consider using div64_long instead.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240318104721.117741-3-thorsten.blum%40toblux.com
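Why the divisor's type matters for a 64-by-32 division can be sketched in userspace C. The helpers below are hypothetical stand-ins, not the kernel's do_div() macro (which also returns a remainder and updates its first argument in place); they only show that a divisor wider than 32 bits would lose its upper half, while a 24-bit value always fits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 64-by-32 division: the divisor slot is 32 bits wide. */
static uint64_t div_64_by_32(uint64_t dividend, uint32_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* What a 64-bit value becomes when squeezed into a 32-bit divisor. */
static uint32_t as_u32_divisor(uint64_t wide)
{
	return (uint32_t)wide;	/* keeps only the low 32 bits */
}
```

With a divisor declared as u32 from the start, no narrowing takes place at the call site, which is the property the type change in this commit relies on.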
2024-04-03x86/rtc: Remove unused intel-mid.hAndy Shevchenko1-1/+0
The rtc driver used to be disabled with a direct check for Intel MID platforms. But that direct check was replaced long ago (see second link). Remove the (unused since 2016) include. [ dhansen: rewrite changelog to include some history ] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240305161024.1364098-1-andriy.shevchenko%40linux.intel.com Link: https://lore.kernel.org/all/1460592286-300-5-git-send-email-mcgrof@kernel.org
2024-04-03x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offlineReinette Chatre1-1/+2
Tony encountered this OOPS when the last CPU of a domain goes offline while running a kernel built with CONFIG_NO_HZ_FULL:
  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  ...
  RIP: 0010:__find_nth_andnot_bit+0x66/0x110
  ...
  Call Trace:
   <TASK>
   ? __die()
   ? page_fault_oops()
   ? exc_page_fault()
   ? asm_exc_page_fault()
   cpumask_any_housekeeping()
   mbm_setup_overflow_handler()
   resctrl_offline_cpu()
   resctrl_arch_offline_cpu()
   cpuhp_invoke_callback()
   cpuhp_thread_fun()
   smpboot_thread_fn()
   kthread()
   ret_from_fork()
   ret_from_fork_asm()
   </TASK>
The NULL pointer dereference is encountered while searching for another online CPU in the domain (of which there are none) that can be used to run the MBM overflow handler. Because the kernel is configured with CONFIG_NO_HZ_FULL the search for another CPU (in its effort to prefer those CPUs that aren't marked nohz_full) consults the mask representing the nohz_full CPUs, tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y tick_nohz_full_mask is not allocated unless the kernel is booted with the "nohz_full=" parameter and because of that any access to tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled(). Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled(). The latter ensures tick_nohz_full_mask can be accessed safely and can be used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.
[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]
Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow") Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
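The difference between a build-time check and a runtime guard can be sketched in userspace C. Everything below is a hypothetical stand-in: a 64-bit integer plays the role of a cpumask, nohz_mask models a mask that stays unallocated unless it was requested at boot, and pick_cpu() is only loosely modelled on the search described above, not the kernel's cpumask_any_housekeeping().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask: NULL unless it was "allocated" at boot. */
static uint64_t nohz_mask_storage;
static const uint64_t *nohz_mask;

/* Models booting with the parameter that allocates the mask. */
static void enable_nohz_mask(uint64_t mask)
{
	nohz_mask_storage = mask;
	nohz_mask = &nohz_mask_storage;
}

/* Runtime guard: true only when the mask can be dereferenced. */
static bool nohz_mask_enabled(void)
{
	return nohz_mask != NULL;
}

/* Returns the lowest CPU number set in mask, or -1 if mask is empty. */
static int lowest_set_bit(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Prefers a CPU outside the nohz mask; falls back to any candidate. */
static int pick_cpu(uint64_t candidates)
{
	uint64_t preferred = candidates;

	/* Touch the mask only behind the runtime guard. */
	if (nohz_mask_enabled())
		preferred &= ~*nohz_mask;

	if (preferred)
		return lowest_set_bit(preferred);
	return lowest_set_bit(candidates);
}
```

A compile-time switch cannot tell whether the mask was ever allocated, so in this sketch the dereference sits behind the runtime check, and an empty candidate set yields -1 instead of a fault.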
2024-04-03x86/of: Change x86_dtb_parse_smp_config() to staticSaurabh Sengar1-9/+9
x86_dtb_parse_smp_config() is only called locally, so make it static. Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/1712068830-4513-5-git-send-email-ssengar@linux.microsoft.com
2024-04-03x86/of: Map NUMA node to CPUs as per DeviceTreeSaurabh Sengar1-0/+2
Currently for DeviceTree bootup, x86 code does the default mapping of CPUs to NUMA, which is wrong. This can cause incorrect mapping and WARNs on SMT enabled systems: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d match_smt+0xf6/0xfc set_cpu_sibling_map.cold+0x24f/0x512 start_secondary+0x5c/0x110 Call the set_apicid_to_node() function in dtb_cpu_setup() for setting the NUMA to CPU mapping for DeviceTree platforms. Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/1712068830-4513-4-git-send-email-ssengar@linux.microsoft.com
2024-04-03x86/of: Set the parse_smp_cfg for all the DeviceTree platforms by defaultSaurabh Sengar1-2/+4
x86_dtb_parse_smp_config() must be set by each DeviceTree platform to parse the SMP configuration. Set the parse_smp_cfg pointer to x86_dtb_parse_smp_config() by default so that the dtb platforms need not assign it explicitly. Today there are only two platforms using DeviceTree in x86, ce4100 and hv_vtl. Remove the explicit assignment of x86_dtb_parse_smp_config() function from these. Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/1712068830-4513-3-git-send-email-ssengar@linux.microsoft.com
2024-04-02Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEADPaolo Bonzini12-128/+114
KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution are actually enabled before entering the guest in the nVHE configuration.
- Restore out-of-range handler for stage-2 translation faults.
- Several fixes to stage-2 TLB invalidations to avoid stale translations, possibly including partial walk caches.
- Fix early handling of architectural VHE-only systems to ensure E2H is appropriately set.
- Correct a format specifier warning in the arch_timer selftest.
- Make the KVM banner message correctly handle all of the possible configurations.
2024-04-02x86/bpf: Fix IP for relocating call depth accountingJoan Bruguera Micó1-2/+2
The commit: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()") made PER_CPU_VAR() use rip-relative addressing, hence the INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative asm code inside them. A follow-up commit: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template") changed x86_call_depth_emit_accounting() to use apply_relocation(), but mistakenly assumed that the code is being patched in-place (where the destination of the relocation matches the address of the code), using *pprog as the destination ip. This is not true for the call depth accounting emitted by the BPF JIT, so the calculated address was wrong: JIT-ed BPF progs on kernels with call depth tracking broke and usually caused a page fault. Pass the destination IP when the BPF JIT emits call depth accounting. Fixes: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template") Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Reviewed-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
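The role of the destination IP in a rip-relative fixup can be sketched with plain arithmetic. The helpers below are hypothetical and are not the kernel's apply_relocation(); for simplicity, each "ip" is taken to be the address of the byte following the instruction, which is what a rip-relative displacement is measured from.

```c
#include <assert.h>
#include <stdint.h>

/* Address referenced by a rip-relative operand. */
static uint64_t rip_target(uint64_t next_ip, int32_t disp)
{
	return next_ip + (uint64_t)(int64_t)disp;
}

/*
 * Adjusts a displacement that was correct at src_ip so that it still
 * reaches the same target once the instruction executes at dest_ip.
 */
static int32_t relocate_disp32(int32_t disp, uint64_t src_ip,
			       uint64_t dest_ip)
{
	uint64_t adjusted = (uint64_t)(int64_t)disp + src_ip - dest_ip;

	return (int32_t)(uint32_t)adjusted;
}
```

For example, a template at 0x1000 with displacement 0x200 refers to 0x1200. If the copy will run at 0x5000, relocating against 0x5000 keeps the target at 0x1200, whereas relocating against the address of a temporary buffer (say 0x9000) and then running the code at 0x5000 ends up at an unrelated address. That is the kind of mismatch the commit describes.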
2024-03-31Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+1
Pull x86 perf fixes from Borislav Petkov:
- Define the correct set of default hw events on AMD Zen4
- Use the correct stalled cycles PMCs on AMD Zen2 and newer
- Fix detection of the LBR freeze feature on AMD
* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
  perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
  perf/x86/amd/lbr: Use freeze based on availability
  x86/cpufeatures: Add new word for scattered features
2024-03-29x86/boot: Move kernel cmdline setup earlier in the boot process (again)Julian Stecklina1-16/+16
When split_lock_detect=off (or similar) is specified in CONFIG_CMDLINE, its effect is lost. The flow is currently this:
  setup_arch():
    -> early_cpu_init()
      -> early_identify_cpu()
        -> sld_setup()
          -> sld_state_setup()
            -> Looks for split_lock_detect in boot_command_line
    -> e820__memory_setup()
    -> Assemble final command line:
       boot_command_line = builtin_cmdline + boot_cmdline
    -> parse_early_param()
There were earlier attempts at fixing this in: 8d48bf8206f7 ("x86/boot: Pull up cmdline preparation and early param parsing") later reverted in: fbe618399854 ("Revert "x86/boot: Pull up cmdline preparation and early param parsing"") ... because parse_early_param() can't be called before e820__memory_setup(). In this patch, we just move the command line concatenation to the beginning of early_cpu_init(). This should fix sld_state_setup(), while not running into the same issues as the earlier attempt. The order is now:
  setup_arch():
    -> Assemble final command line:
       boot_command_line = builtin_cmdline + boot_cmdline
    -> early_cpu_init()
      -> early_identify_cpu()
        -> sld_setup()
          -> sld_state_setup()
            -> Looks for split_lock_detect in boot_command_line
    -> e820__memory_setup()
    -> parse_early_param()
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: linux-kernel@vger.kernel.org
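The ordering problem can be reproduced in a few lines of userspace C. The sketch below is hypothetical throughout: assemble_cmdline() and has_param() are illustrative helpers, the buffer size is arbitrary, and the substring match is far cruder than the kernel's parameter parsing. It only shows that a lookup done before the built-in and bootloader strings are joined cannot see a built-in option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CMDLINE_SIZE 256	/* arbitrary size for this sketch */

static char final_cmdline[CMDLINE_SIZE];

/*
 * Joins the built-in and bootloader-provided strings, built-in first.
 * Returns the assembled line, or NULL if it would not fit.
 */
static const char *assemble_cmdline(const char *builtin, const char *boot)
{
	int n = snprintf(final_cmdline, sizeof(final_cmdline), "%s %s",
			 builtin, boot);

	if (n < 0 || (size_t)n >= sizeof(final_cmdline))
		return NULL;
	return final_cmdline;
}

/* Naive lookup: true if param occurs anywhere in cmdline. */
static bool has_param(const char *cmdline, const char *param)
{
	return strstr(cmdline, param) != NULL;
}
```

Searching the bootloader string alone for "split_lock_detect=" finds nothing when the option only exists in the built-in part; searching the assembled line finds it, which is why the concatenation has to happen before the first consumer runs.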