summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2024-01-09Merge tag 'perf-core-2024-01-08' of ↵Linus Torvalds18-114/+550
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: - Add branch stack counters ABI extension to better capture the growing amount of information the PMU exposes via branch stack sampling. There's matching tooling support. - Fix race when creating the nr_addr_filters sysfs file - Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support - Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU support - Misc cleanups & fixes * tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Factor out topology_gidnid_map() perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology() perf/x86/amd: Reject branch stack for IBS events perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge perf/x86/intel/uncore: Support IIO free-running counters on GNR perf/x86/intel/uncore: Support Granite Rapids perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR perf: Fix the nr_addr_filters fix perf/x86/intel/cstate: Add Grand Ridge support perf/x86/intel/cstate: Add Sierra Forest support x86/smp: Export symbol cpu_clustergroup_mask() perf/x86/intel/cstate: Cleanup duplicate attr_groups perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file perf/x86/intel: Support branch counters logging perf/x86/intel: Reorganize attrs and is_visible perf: Add branch_sample_call_stack perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag perf: Add branch stack counters
2024-01-09Merge tag 'x86-entry-2024-01-08' of ↵Linus Torvalds2-14/+31
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Ingo Molnar: - Optimize common_interrupt_return() - Harden the return-to-user code by making a CONFIG_DEBUG_ENTRY=y check unconditional & moving it closer to the IRET. * tag 'x86-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Harden return-to-user x86/entry: Optimize common_interrupt_return()
2024-01-09Merge tag 'x86-core-2024-01-08' of ↵Linus Torvalds1-2/+18
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Ingo Molnar: - Add comments about the magic behind the shadow STI before MWAIT in __sti_mwait(). - Fix possible unintended timer delays caused by a race in mwait_idle_with_hints(). * tag 'x86-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram x86: Add a comment about the "magic" behind shadow sti before mwait
2024-01-09Merge tag 'x86-cleanups-2024-01-08' of ↵Linus Torvalds65-90/+88
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Change global variables to local - Add missing kernel-doc function parameter descriptions - Remove unused parameter from a macro - Remove obsolete Kconfig entry - Fix comments - Fix typos, mostly scripted, manually reviewed and a micro-optimization got misplaced as a cleanup: - Micro-optimize the asm code in secondary_startup_64_no_verify() * tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86: Fix typos x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify() x86/docs: Remove reference to syscall trampoline in PTI x86/Kconfig: Remove obsolete config X86_32_SMP x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro x86/mtrr: Document missing function parameters in kernel-doc x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
2024-01-09Merge tag 'x86-build-2024-01-08' of ↵Linus Torvalds5-41/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build updates from Ingo Molnar: - Update the objdump & instruction decoder self-test code for better LLVM toolchain compatibility - Rework CONFIG_X86_PAE dependencies, for better readability and higher robustness. - Misc cleanups * tag 'x86-build-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tools: objdump_reformat.awk: Skip bad instructions from llvm-objdump x86/Kconfig: Rework CONFIG_X86_PAE dependency x86/tools: Remove chkobjdump.awk x86/tools: objdump_reformat.awk: Allow for spaces x86/tools: objdump_reformat.awk: Ensure regex matches fwait
2024-01-09Merge tag 'x86-boot-2024-01-08' of ↵Linus Torvalds5-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: - Ignore NMIs during very early boot, to address kexec crashes - Remove redundant initialization in boot/string.c's strcmp() * tag 'x86-boot-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Remove redundant initialization of the 'delta' variable in strcmp() x86/boot: Ignore NMIs during very early boot
2024-01-09Merge tag 'x86-asm-2024-01-08' of ↵Linus Torvalds8-55/+102
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "Replace magic numbers in GDT descriptor definitions & handling: - Introduce symbolic names via macros for descriptor types/fields/flags, and then use these symbolic names. - Clean up definitions a bit, such as GDT_ENTRY_INIT() - Fix/clean up details that became visibly inconsistent after the symbol-based code was introduced: - Unify accessed flag handling - Set the D/B size flag consistently & according to the HW specification" * tag 'x86-asm-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Add DB flag to 32-bit percpu GDT entry x86/asm: Always set A (accessed) flag in GDT descriptors x86/asm: Replace magic numbers in GDT descriptors, script-generated change x86/asm: Replace magic numbers in GDT descriptors, preparations x86/asm: Provide new infrastructure for GDT descriptors
2024-01-09Merge tag 'x86-apic-2024-01-08' of ↵Linus Torvalds12-283/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Ingo Molnar: - Clean up 'struct apic': - Drop ::delivery_mode - Drop 'enum apic_delivery_modes' - Drop 'struct local_apic' - Fix comments * tag 'x86-apic-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioapic: Remove unfinished sentence from comment x86/apic: Drop struct local_apic x86/apic: Drop enum apic_delivery_modes x86/apic: Drop apic::delivery_mode
2024-01-09Merge tag 'ras_core_for_v6.8' of ↵Linus Torvalds7-254/+388
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Convert the hw error storm handling into a finer-grained, per-bank solution which allows for more timely detection and reporting of errors - Start a documentation section which will hold down relevant RAS features description and how they should be used - Add new AMD error bank types - Slim down and remove error type descriptions from the kernel side of error decoding to rasdaemon which can be used from now on to decode hw errors on AMD - Mark pages containing uncorrectable errors as poison so that kdump can avoid them and thus not cause another panic - The usual cleanups and fixlets * tag 'ras_core_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Handle Intel threshold interrupt storms x86/mce: Add per-bank CMCI storm mitigation x86/mce: Remove old CMCI storm mitigation code Documentation: Begin a RAS section x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types EDAC/mce_amd: Remove SMCA Extended Error code descriptions x86/mce/amd, EDAC/mce_amd: Move long names to decoder module x86/mce/inject: Clear test status value x86/mce: Remove redundant check from mce_device_create() x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
2024-01-09Merge tag 'x86_cpu_for_v6.8' of ↵Linus Torvalds7-155/+169
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu feature updates from Borislav Petkov: - Add synthetic X86_FEATURE flags for the different AMD Zen generations and use them everywhere instead of ad-hoc family/model checks. Drop an ancient AMD errata checking facility as a result - Fix a fragile initcall ordering in intel_epb - Do not issue the MFENCE+LFENCE barrier for the TSC deadline and X2APIC MSRs on AMD as it is not needed there * tag 'x86_cpu_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Add X86_FEATURE_ZEN1 x86/CPU/AMD: Drop now unused CPU erratum checking function x86/CPU/AMD: Get rid of amd_erratum_1485[] x86/CPU/AMD: Get rid of amd_erratum_400[] x86/CPU/AMD: Get rid of amd_erratum_383[] x86/CPU/AMD: Get rid of amd_erratum_1054[] x86/CPU/AMD: Move the DIV0 bug detection to the Zen1 init function x86/CPU/AMD: Move Zenbleed check to the Zen2 init function x86/CPU/AMD: Rename init_amd_zn() to init_amd_zen_common() x86/CPU/AMD: Call the spectral chicken in the Zen2 init function x86/CPU/AMD: Move erratum 1076 fix into the Zen1 init function x86/CPU/AMD: Move the Zen3 BTC_NO detection to the Zen3 init function x86/CPU/AMD: Carve out the erratum 1386 fix x86/CPU/AMD: Add ZenX generations flags x86/cpu/intel_epb: Don't rely on link order x86/barrier: Do not serialize MSR accesses on AMD
2024-01-09Merge tag 'x86_sev_for_v6.8' of ↵Linus Torvalds1-9/+22
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Convert the sev-guest plaform ->remove callback to return void - Move the SEV C-bit verification to the BSP as it needs to happen only once and not on every AP * tag 'x86_sev_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt: sev-guest: Convert to platform remove callback returning void x86/sev: Do the C-bit verification only on the BSP
2024-01-09Merge tag 'x86_paravirt_for_v6.8' of ↵Linus Torvalds13-289/+169
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt updates from Borislav Petkov: - Replace the paravirt patching functionality using the alternatives infrastructure and remove the former - Misc other improvements * tag 'x86_paravirt_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/alternative: Correct feature bit debug output x86/paravirt: Remove no longer needed paravirt patching code x86/paravirt: Switch mixed paravirt/alternative calls to alternatives x86/alternative: Add indirect call patching x86/paravirt: Move some functions and defines to alternative.c x86/paravirt: Introduce ALT_NOT_XEN x86/paravirt: Make the struct paravirt_patch_site packed x86/paravirt: Use relative reference for the original instruction offset
2024-01-09Merge tag 'x86_misc_for_v6.8' of ↵Linus Torvalds3-3/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Borislav Petkov: - Add an informational message which gets issued when IA32 emulation has been disabled on the cmdline - Clarify in detail how /proc/cpuinfo is used on x86 - Fix a theoretical overflow in num_digits() * tag 'x86_misc_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ia32: State that IA32 emulation is disabled Documentation/x86: Document what /proc/cpuinfo is for x86/lib: Fix overflow when counting digits
2024-01-09Merge tag 'x86_microcode_for_v6.8' of ↵Linus Torvalds1-13/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode updates from Borislav Petkov: - Correct minor issues after the microcode revision reporting sanitization * tag 'x86_microcode_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/intel: Set new revision only after a successful update x86/microcode/intel: Remove redundant microcode late updated message
2024-01-08Merge tag 'vfs-6.8.mount' of ↵Linus Torvalds2-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains the work to retrieve detailed information about mounts via two new system calls. This is hopefully the beginning of the end of the saga that started with fsinfo() years ago. The LWN articles in [1] and [2] can serve as a summary so we can avoid rehashing everything here. At LSFMM in May 2022 we got into a room and agreed on what we want to do about fsinfo(). Basically, split it into pieces. This is the first part of that agreement. Specifically, it is concerned with retrieving information about mounts. So this only concerns the mount information retrieval, not the mount table change notification, or the extended filesystem specific mount option work. That is separate work. Currently mounts have a 32bit id. Mount ids are already in heavy use by libmount and other low-level userspace but they can't be relied upon because they're recycled very quickly. We agreed that mounts should carry a unique 64bit id by which they can be referenced directly. This is now implemented as part of this work. The new 64bit mount id is exposed in statx() through the new STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is returned. If it is raised and the kernel supports the new 64bit mount id the flag is raised in the result mask and the new 64bit mount id is returned. New and old mount ids do not overlap so they cannot be conflated. Two new system calls are introduced that operate on the 64bit mount id: statmount() and listmount(). A summary of the api and usage can be found on LWN as well (cf. [3]) but of course, I'll provide a summary here as well. Both system calls rely on struct mnt_id_req. Which is the request struct used to pass the 64bit mount id identifying the mount to operate on. It is extensible to allow for the addition of new parameters and for future use in other apis that make use of mount ids. statmount() mimicks the semantics of statx() and exposes a set flags that userspace may raise in mnt_id_req to request specific information to be retrieved. A statmount() call returns a struct statmount filled in with information about the requested mount. Supported requests are indicated by raising the request flag passed in struct mnt_id_req in the @mask argument in struct statmount. Currently we do support: - STATMOUNT_SB_BASIC: Basic filesystem info - STATMOUNT_MNT_BASIC Mount information (mount id, parent mount id, mount attributes etc) - STATMOUNT_PROPAGATE_FROM Propagation from what mount in current namespace - STATMOUNT_MNT_ROOT Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla) - STATMOUNT_MNT_POINT Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt) - STATMOUNT_FS_TYPE Name of the filesystem type as the magic number isn't enough due to submounts The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE are appended to the end of the struct. Userspace can use the offsets in @fs_type, @mnt_root, and @mnt_point to reference those strings easily. The struct statmount reserves quite a bit of space currently for future extensibility. This isn't really a problem and if this bothers us we can just send a follow-up pull request during this cycle. listmount() is given a 64bit mount id via mnt_id_req just as statmount(). It takes a buffer and a size to return an array of the 64bit ids of the child mounts of the requested mount. Userspace can thus choose to either retrieve child mounts for a mount in batches or iterate through the child mounts. For most use-cases it will be sufficient to just leave space for a few child mounts. But for big mount tables having an iterator is really helpful. Iterating through a mount table works by setting @param in mnt_id_req to the mount id of the last child mount retrieved in the previous listmount() call" Link: https://lwn.net/Articles/934469 [1] Link: https://lwn.net/Articles/829212 [2] Link: https://lwn.net/Articles/950569 [3] * tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: add selftest for statmount/listmount fs: keep struct mnt_id_req extensible wire up syscalls for statmount/listmount add listmount(2) syscall statmount: simplify string option retrieval statmount: simplify numeric option retrieval add statmount(2) syscall namespace: extract show_path() helper mounts: keep list of mounts in an rbtree add unique mount ID
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2024-01-08x86/kvm: Do not try to disable kvmclock if it was not enabledKirill A. Shutemov1-4/+8
kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is present in the VM. It leads to write to a MSR that doesn't exist on some configurations, namely in TDX guest: unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000) at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30) kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt features. Do not disable kvmclock if it was not enabled. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown") Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Message-Id: <20231205004510.27164-6-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-08Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-63/+54
KVM x86 MMU changes for 6.8: - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read versus write so that KVM doesn't pass "bool shared" all over the place just to have precise assertions in paths that don't actually care about whether the caller is a reader or a writer.
2024-01-08Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2-6/+31
KVM Xen change for 6.8: To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior.
2024-01-08Merge tag 'kvm-x86-svm-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-19/+21
KVM SVM changes for 6.8: - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects a second, "simultaneous" NMI.
2024-01-08Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini18-30/+134
KVM x86 support for virtualizing Linear Address Masking (LAM) Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc.
2024-01-08Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini7-109/+137
KVM x86 PMU changes for 6.8: - Fix a variety of bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
2024-01-08Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini6-41/+70
KVM x86 misc changes for 6.8: - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds.
2024-01-08Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini27-742/+1050
KVM x86 Hyper-V changes for 6.8: - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time.
2024-01-08Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-2/+2
Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
2024-01-08KVM: x86: add missing "depends on KVM"Paolo Bonzini1-1/+1
Support for KVM software-protected VMs should not be configurable, if KVM is not available at all. Fixes: 89ea60c2c7b5 ("KVM: x86: Add support for "protected VMs" that can utilize private memory") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-08KVM: introduce CONFIG_KVM_COMMONPaolo Bonzini1-2/+1
CONFIG_HAVE_KVM is currently used by some architectures to either enabled the KVM config proper, or to enable host-side code that is not part of the KVM module. However, CONFIG_KVM's "select" statement in virt/kvm/Kconfig corresponds to a third meaning, namely to enable common Kconfigs required by all architectures that support KVM. These three meanings can be replaced respectively by an architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by a new Kconfig symbol that is in turn selected by the architecture-specific "config KVM". Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON. Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by architecture code, and it brings in all dependencies of common KVM code. In particular, INTERVAL_TREE was missing in loongarch and riscv, so that is another thing that is fixed. Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08) Reported-by: Randy Dunlap <rdunlap@infradead.org> Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/ Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-06Merge tag 'for-netdev' of ↵Jakub Kicinski1-25/+22
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-01-05 We've added 40 non-merge commits during the last 2 day(s) which contain a total of 73 files changed, 1526 insertions(+), 951 deletions(-). The main changes are: 1) Fix a memory leak when streaming AF_UNIX sockets were inserted into multiple sockmap slots/maps, from John Fastabend. 2) Fix gotol in s390 BPF JIT with large offsets, from Ilya Leoshkevich. 3) Fix reattachment branch in bpf_tracing_prog_attach() and reject the request if there is no valid attach_btf, from Jiri Olsa. 4) Remove deprecated bpfilter kernel leftovers given the project is developed in user space (https://github.com/facebook/bpfilter), from Quentin Deslandes. 5) Relax tracing BPF program recursive attach rules given right now it is not possible to create tracing program call cycles, from Dmitrii Dolgov. 6) Fix excessive memory consumption for the bpf_global_percpu_ma for systems with a large number of CPUs, from Yonghong Song. 7) Small x86 BPF JIT cleanup to reuse emit_nops instead of open-coding memcpy of x86_nops, from Leon Hwang. 8) Follow-up for libbpf to support __arg_ctx global function argument tag semantics to complement the merged kernel side, from Andrii Nakryiko. 9) Introduce "volatile compare" macros for BPF selftests in order to make the latter more robust against compiler optimization, from Alexei Starovoitov. 10) Small simplification in verifier's size checking of helper accesses along with additional selftests, from Andrei Matei. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (40 commits) selftests/bpf: Test re-attachment fix for bpf_tracing_prog_attach bpf: Fix re-attachment branch in bpf_tracing_prog_attach selftests/bpf: Add test for recursive attachment of tracing progs bpf: Relax tracing prog recursive attach rules bpf, x86: Use emit_nops to replace memcpy x86_nops selftests/bpf: Test gotol with large offsets selftests/bpf: Double the size of test_loader log s390/bpf: Fix gotol with large offsets bpfilter: remove bpfilter bpf: Remove unnecessary cpu == 0 check in memalloc selftests/bpf: add __arg_ctx BTF rewrite test selftests/bpf: add arg:ctx cases to test_global_funcs tests libbpf: implement __arg_ctx fallback logic libbpf: move BTF loading step after relocation step libbpf: move exception callbacks assignment logic into relocation step libbpf: use stable map placeholder FDs libbpf: don't rely on map->fd as an indicator of map being created libbpf: use explicit map reuse flag to skip map creation steps libbpf: make uniform use of btf__fd() accessor inside libbpf selftests/bpf: Add a selftest with > 512-byte percpu allocation size ... ==================== Link: https://lore.kernel.org/r/20240105170105.21070-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-06Merge tag 'mm-hotfixes-stable-2024-01-05-11-35' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc mm fixes from Andrew Morton: "12 hotfixes. Two are cc:stable and the remainder either address post-6.7 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2024-01-05-11-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() mailmap: add entries for Mathieu Othacehe MAINTAINERS: change vmware.com addresses to broadcom.com arch/mm/fault: fix major fault accounting when retrying under per-VMA lock mm/mglru: skip special VMAs in lru_gen_look_around() MAINTAINERS: hand over hwpoison maintainership to Miaohe Lin MAINTAINERS: remove hugetlb maintainer Mike Kravetz mm: fix unmap_mapping_range high bits shift bug mm: memcg: fix split queue list crash when large folio migration mm: fix arithmetic for max_prop_frac when setting max_ratio mm: fix arithmetic for bdi min_ratio mm: align larger anonymous mappings on THP boundaries
2024-01-05x86/crash: use SZ_1M macro instead of hardcoded valueYuntao Wang1-1/+1
Use SZ_1M macro instead of hardcoded 1<<20 to make code more readable. Link: https://lkml.kernel.org/r/20240102144905.110047-3-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05x86/crash: remove the unused image parameter from prepare_elf_headers()Yuntao Wang1-5/+5
Patch series "crash: Some cleanups and fixes", v2. This patchset includes two cleanups and one fix. This patch (of 3): The image parameter is no longer in use, remove it. Also, tidy up the code formatting. Link: https://lkml.kernel.org/r/20240102144905.110047-1-ytcoode@gmail.com Link: https://lkml.kernel.org/r/20240102144905.110047-2-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05mm/mglru: add dummy pmd_dirty()Kinsey Ho1-0/+1
Add dummy pmd_dirty() for architectures that don't provide it. This is similar to commit 6617da8fb565 ("mm: add dummy pmd_young() for architectures not having it"). Link: https://lkml.kernel.org/r/20231227141205.2200125-5-kinseyho@google.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312210606.1Etqz3M4-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202312210042.xQEiqlEh-lkp@intel.com/ Signed-off-by: Kinsey Ho <kinseyho@google.com> Suggested-by: Yu Zhao <yuzhao@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05mm/mglru: add CONFIG_ARCH_HAS_HW_PTE_YOUNGKinsey Ho2-6/+1
Patch series "mm/mglru: Kconfig cleanup", v4. This series is the result of the following discussion: https://lore.kernel.org/47066176-bd93-55dd-c2fa-002299d9e034@linux.ibm.com/ It mainly avoids building the code that walks page tables on CPUs that use it, i.e., those don't support hardware accessed bit. Specifically, it introduces a new Kconfig to guard some of functions added by commit bd74fdaea146 ("mm: multi-gen LRU: support page table walks") on CPUs like POWER9, on which the series was tested. This patch (of 5): Some architectures are able to set the accessed bit in PTEs when PTEs are used as part of linear address translations. Add CONFIG_ARCH_HAS_HW_PTE_YOUNG for such architectures to be able to override arch_has_hw_pte_young(). Link: https://lkml.kernel.org/r/20231227141205.2200125-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20231227141205.2200125-2-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Donet Tom <donettom@linux.vnet.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+6
Pull kvm fix from Paolo Bonzini: - Fix boolean logic in intel_guest_get_msrs * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
2024-01-05Merge tag 'probes-fixes-v6.7-rc8' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull kprobes/x86 fix from Masami Hiramatsu: - Fix to emulate indirect call which size is not 5 byte. Current code expects the indirect call instructions are 5 bytes, but that is incorrect. Usually indirect call based on register is shorter than that, thus the emulation causes a kernel crash by accessing wrong instruction boundary. This uses the instruction size to calculate the return address correctly. * tag 'probes-fixes-v6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: x86/kprobes: fix incorrect return address calculation in kprobe_emulate_call_indirect
2024-01-05um: Mark 32bit syscall helpers as clobbering memoryBenjamin Berg1-6/+12
The 64bit helper are marked to clobber the memory, but the 32bit ones are not. Add the appropriate clobber to the 32bit helper routines so that the compiler cannot do invalid optimizations. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-01-05um: Rely on PTRACE_SETREGSET to set FS/GS base registersBenjamin Berg6-68/+16
These registers are saved/restored together with the other general registers using ptrace. In arch_set_tls we then just need to set the register and it will be synced back normally. Most of this logic was introduced in commit f355559cf7845 ("[PATCH] uml: x86_64 thread fixes"). However, at least today we can rely on ptrace to restore the base registers for us. As such, only the part of the patch that tracks the FS register for use as thread local storage is actually needed. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-01-05bpf, x86: Use emit_nops to replace memcpy x86_nopsLeon Hwang1-25/+22
Move emit_nops() before emit_prologue() and replace memcpy(prog, x86_nops[5], X86_PATCH_SIZE) with emit_nops(&prog, X86_PATCH_SIZE). Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20240104142226.87869-2-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-76/+88
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()") 0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL") https://lore.kernel.org/all/20240105115509.225aa8a2@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-05x86/csum: clean up `csum_partial' furtherLinus Torvalds1-44/+37
Commit 688eb8191b47 ("x86/csum: Improve performance of `csum_partial`") ended up improving the code generation for the IP csum calculations, and in particular special-casing the 40-byte case that is a hot case for IPv6 headers. It then had _another_ special case for the 64-byte unrolled loop, which did two chains of 32-byte blocks, which allows modern CPU's to improve performance by doing the chains in parallel thanks to renaming the carry flag. This just unifies the special cases and combines them into just one single helper the 40-byte csum case, and replaces the 64-byte case by a 80-byte case that just does that single helper twice. It avoids having all these different versions of inline assembly, and actually improved performance further in my tests. There was never anything magical about the 64-byte unrolled case, even though it happens to be a common size (and typically is the cacheline size). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-05x86/csum: Remove unnecessary odd handlingNoah Goldstein1-32/+4
The special case for odd aligned buffers is unnecessary and mostly just adds overhead. Aligned buffers is the expectations, and even for unaligned buffer, the only case that was helped is if the buffer was 1-byte from word aligned which is ~1/7 of the cases. Overall it seems highly unlikely to be worth to extra branch. It was left in the previous perf improvement patch because I was erroneously comparing the exact output of `csum_partial(...)`, but really we only need `csum_fold(csum_partial(...))` to match so its safe to remove. All csum kunit tests pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Laight <david.laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-05um: Always inline stub functionsBenjamin Berg2-18/+20
The stub executable page is remapped to a different location in the userland process. As these functions may be used by the stub, they really need to be always inlined rather than permitting the compiler to emit a function. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-01-05um: Drop support for hosts without SYSEMU_SINGLESTEP supportBenjamin Berg4-61/+5
These features have existed since Linux 2.6.14 and can be considered widely available at this point. Also drop the backward compatibility code for PTRACE_SETOPTIONS. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> ---- v2: * Continue to define PTRACE_SYSEMU_SINGLESTEP as glibc only added it in version 2.27. Signed-off-by: Richard Weinberger <richard@nod.at>
2024-01-04KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRLPaolo Bonzini1-1/+6
When commit c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") switched the initialization of cpuc->guest_switch_msrs to use compound literals, it screwed up the boolean logic: + u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable; ... - arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask; - arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable); + .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask), Before the patch, the value of arr[0].guest would have been intel_ctrl & ~cpuc->intel_ctrl_host_mask & ~pebs_mask. The intent is to always treat PEBS events as host-only because, while the guest runs, there is no way to tell the processor about the virtual address where to put PEBS records intended for the host. Unfortunately, the new expression can be expanded to (intel_ctrl & ~cpuc->intel_ctrl_host_mask) | (intel_ctrl & ~pebs_mask) which makes no sense; it includes any bit that isn't *both* marked as exclude_guest and using PEBS. So, reinstate the old logic. Another way to write it could be "intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask)", presumably the intention of the author of the faulty. However, I personally find the repeated application of A AND NOT B to be a bit more readable. This shows up as guest failures when running concurrent long-running perf workloads on the host, and was reported to happen with rcutorture. All guests on a given host would die simultaneously with something like an instruction fault or a segmentation violation. Reported-by: Paul E. McKenney <paulmck@kernel.org> Analyzed-by: Sean Christopherson <seanjc@google.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-04x86/tools: objdump_reformat.awk: Skip bad instructions from llvm-objdumpNathan Chancellor1-1/+1
When running the instruction decoder selftest with LLVM=1 and CONFIG_PVH=y, there is a series of warnings: arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this. arch/x86/tools/insn_decoder_test: warning: ffffffff81000050 ea <unknown> arch/x86/tools/insn_decoder_test: warning: objdump says 1 bytes, but insn_get_length() says 7 arch/x86/tools/insn_decoder_test: warning: Decoded and checked 7214721 instructions with 1 failures GNU objdump outputs "(bad)" instead of "<unknown>", which is already handled in the bad_expr regex, so there is no warning. $ objdump -d arch/x86/platform/pvh/head.o | grep -E '50:\s+ea' 50: ea (bad) $ llvm-objdump -d arch/x86/platform/pvh/head.o | grep -E '50:\s+ea' 50: ea <unknown> Add "<unknown>" to the bad_expr regex to clear up the warning, allowing the instruction decoder selftest to fully pass with llvm-objdump. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231205-objdump_reformat-awk-handle-llvm-objdump-bad_expr-v1-1-b4a74f39396f@kernel.org
2024-01-04x86/kprobes: fix incorrect return address calculation in ↵Jinghao Jia1-1/+2
kprobe_emulate_call_indirect kprobe_emulate_call_indirect currently uses int3_emulate_call to emulate indirect calls. However, int3_emulate_call always assumes the size of the call to be 5 bytes when calculating the return address. This is incorrect for register-based indirect calls in x86, which can be either 2 or 3 bytes depending on whether REX prefix is used. At kprobe runtime, the incorrect return address causes control flow to land onto the wrong place after return -- possibly not a valid instruction boundary. This can lead to a panic like the following: [ 7.308204][ C1] BUG: unable to handle page fault for address: 000000000002b4d8 [ 7.308883][ C1] #PF: supervisor read access in kernel mode [ 7.309168][ C1] #PF: error_code(0x0000) - not-present page [ 7.309461][ C1] PGD 0 P4D 0 [ 7.309652][ C1] Oops: 0000 [#1] SMP [ 7.309929][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.7.0-rc5-trace-for-next #6 [ 7.310397][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 7.311068][ C1] RIP: 0010:__common_interrupt+0x52/0xc0 [ 7.311349][ C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3 [ 7.312512][ C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046 [ 7.312899][ C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001 [ 7.313334][ C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4 [ 7.313702][ C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482 [ 7.314146][ C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023 [ 7.314509][ C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000 [ 7.314951][ C1] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 [ 7.315396][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7.315691][ C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0 [ 7.316153][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7.316508][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 7.316948][ C1] Call Trace: [ 7.317123][ C1] <IRQ> [ 7.317279][ C1] ? __die_body+0x64/0xb0 [ 7.317482][ C1] ? page_fault_oops+0x248/0x370 [ 7.317712][ C1] ? __wake_up+0x96/0xb0 [ 7.317964][ C1] ? exc_page_fault+0x62/0x130 [ 7.318211][ C1] ? asm_exc_page_fault+0x22/0x30 [ 7.318444][ C1] ? __cfi_native_send_call_func_single_ipi+0x10/0x10 [ 7.318860][ C1] ? default_idle+0xb/0x10 [ 7.319063][ C1] ? __common_interrupt+0x52/0xc0 [ 7.319330][ C1] common_interrupt+0x78/0x90 [ 7.319546][ C1] </IRQ> [ 7.319679][ C1] <TASK> [ 7.319854][ C1] asm_common_interrupt+0x22/0x40 [ 7.320082][ C1] RIP: 0010:default_idle+0xb/0x10 [ 7.320309][ C1] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 66 90 0f 00 2d 09 b9 3b 00 fb f4 <fa> c3 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 e9 [ 7.321449][ C1] RSP: 0018:ffffc9000009bee8 EFLAGS: 00000256 [ 7.321808][ C1] RAX: ffff88813bca8b68 RBX: 0000000000000001 RCX: 000000000001ef0c [ 7.322227][ C1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000001ef0c [ 7.322656][ C1] RBP: ffffc9000009bef8 R08: 8000000000000000 R09: 00000000000008c2 [ 7.323083][ C1] R10: 0000000000000000 R11: ffffffff81058e70 R12: 0000000000000000 [ 7.323530][ C1] R13: ffff8881002b30c0 R14: 0000000000000000 R15: 0000000000000000 [ 7.323948][ C1] ? __cfi_lapic_next_deadline+0x10/0x10 [ 7.324239][ C1] default_idle_call+0x31/0x50 [ 7.324464][ C1] do_idle+0xd3/0x240 [ 7.324690][ C1] cpu_startup_entry+0x25/0x30 [ 7.324983][ C1] start_secondary+0xb4/0xc0 [ 7.325217][ C1] secondary_startup_64_no_verify+0x179/0x17b [ 7.325498][ C1] </TASK> [ 7.325641][ C1] Modules linked in: [ 7.325906][ C1] CR2: 000000000002b4d8 [ 7.326104][ C1] ---[ end trace 0000000000000000 ]--- [ 7.326354][ C1] RIP: 0010:__common_interrupt+0x52/0xc0 [ 7.326614][ C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3 [ 7.327570][ C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046 [ 7.327910][ C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001 [ 7.328273][ C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4 [ 7.328632][ C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482 [ 7.329223][ C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023 [ 7.329780][ C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000 [ 7.330193][ C1] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 [ 7.330632][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7.331050][ C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0 [ 7.331454][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7.331854][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 7.332236][ C1] Kernel panic - not syncing: Fatal exception in interrupt [ 7.332730][ C1] Kernel Offset: disabled [ 7.333044][ C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- The relevant assembly code is (from objdump, faulting address highlighted): ffffffff8102ed9d: 41 ff d3 call *%r11 ffffffff8102eda0: 65 48 <8b> 05 30 c7 ff mov %gs:0x7effc730(%rip),%rax The emulation incorrectly sets the return address to be ffffffff8102ed9d + 0x5 = ffffffff8102eda2, which is the 8b byte in the middle of the next mov. This in turn causes incorrect subsequent instruction decoding and eventually triggers the page fault above. Instead of invoking int3_emulate_call, perform push and jmp emulation directly in kprobe_emulate_call_indirect. At this point we can obtain the instruction size from p->ainsn.size so that we can calculate the correct return address. Link: https://lore.kernel.org/all/20240102233345.385475-1-jinghao7@illinois.edu/ Fixes: 6256e668b7af ("x86/kprobes: Use int3 instead of debug trap for single-step") Cc: stable@vger.kernel.org Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-01-03arch/x86: Fix typosBjorn Helgaas60-72/+72
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-03Merge branches 'apple/dart', 'arm/rockchip', 'arm/smmu', 'virtio', ↵Joerg Roedel2-2/+3
'x86/vt-d', 'x86/amd' and 'core' into next
2024-01-02Merge tag 'kvm-riscv-6.8-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini8-34/+55
KVM/riscv changes for 6.8 part #1 - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Steal time account support along with selftest
2024-01-02Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini30-174/+297
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.