path: root/arch/x86/lib
Age  Commit message  Author  Files  Lines
2023-06-28Merge tag 'locking-core-2023-06-27' of ↵Linus Torvalds3-33/+80
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()

   The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface. Instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types.

 - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations.

   The generated definitions are much cleaner now, and come with documentation.

 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type.

   This gets rid of one use of lockdep_set_novalidate_class() in the bcache code.

 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
  percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
  locking/atomic: treewide: delete arch_atomic_*() kerneldoc
  locking/atomic: docs: Add atomic operations to the driver basic API documentation
  locking/atomic: scripts: generate kerneldoc comments
  docs: scripts: kernel-doc: accept bitwise negation like ~@var
  locking/atomic: scripts: simplify raw_atomic*() definitions
  locking/atomic: scripts: simplify raw_atomic_long*() definitions
  locking/atomic: scripts: split pfx/name/sfx/order
  locking/atomic: scripts: restructure fallback ifdeffery
  locking/atomic: scripts: build raw_atomic_long*() directly
  locking/atomic: treewide: use raw_atomic*_<op>()
  locking/atomic: scripts: add trivial raw_atomic*_<op>()
  locking/atomic: scripts: factor out order template generation
  locking/atomic: scripts: remove leftover "${mult}"
  locking/atomic: scripts: remove bogus order parameter
  locking/atomic: xtensa: add preprocessor symbols
  locking/atomic: x86: add preprocessor symbols
  locking/atomic: sparc: add preprocessor symbols
  locking/atomic: sh: add preprocessor symbols
  ...
2023-06-27Merge tag 'x86_misc_for_v6.5' of ↵Linus Torvalds3-61/+96
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:

 - Remove the local symbols prefix of the get/put_user() exception handling symbols so that tools do not get confused by the presence of code belonging to the wrong symbol/not belonging to any symbol

 - Improve csum_partial()'s performance

 - Some improvements to the kcpuid tool

* tag 'x86_misc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib: Make get/put_user() exception handling a visible symbol
  x86/csum: Fix clang -Wuninitialized in csum_partial()
  x86/csum: Improve performance of `csum_partial`
  tools/x86/kcpuid: Add .gitignore
  tools/x86/kcpuid: Dump the correct CPUID function in error
2023-06-27Merge tag 'x86_cleanups_for_6.5' of ↵Linus Torvalds2-13/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Dave Hansen:
 "As usual, these are all over the map. The biggest cluster is work from Arnd to eliminate -Wmissing-prototype warnings:

   - Address -Wmissing-prototype warnings

   - Remove repeated 'the' in comments

   - Remove unused current_untag_mask()

   - Document urgent tip branch timing

   - Clean up MSR kernel-doc notation

   - Clean up paravirt_ops doc

   - Update Srivatsa S. Bhat's maintained areas

   - Remove unused extern declaration acpi_copy_wakeup_routine()"

* tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
  Documentation: virt: Clean up paravirt_ops doc
  x86/mm: Remove unused current_untag_mask()
  x86/mm: Remove repeated word in comments
  x86/lib/msr: Clean up kernel-doc notation
  x86/platform: Avoid missing-prototype warnings for OLPC
  x86/mm: Add early_memremap_pgprot_adjust() prototype
  x86/usercopy: Include arch_wb_cache_pmem() declaration
  x86/vdso: Include vdso/processor.h
  x86/mce: Add copy_mc_fragile_handle_tail() prototype
  x86/fbdev: Include asm/fb.h as needed
  x86/hibernate: Declare global functions in suspend.h
  x86/entry: Add do_SYSENTER_32() prototype
  x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
  x86/mm: Include asm/numa.h for set_highmem_pages_init()
  x86: Avoid missing-prototype warnings for doublefault code
  x86/fpu: Include asm/fpu/regset.h
  x86: Add dummy prototype for mk_early_pgtbl_32()
  x86/pci: Mark local functions as 'static'
  x86/ftrace: Move prepare_ftrace_return prototype to header
  ...
2023-06-27Merge tag 'x86_cpu_for_v6.5' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Compute the purposeful misalignment of zen_untrain_ret automatically and assert __x86_return_thunk's alignment so that future changes to the symbol macros do not accidentally break them.

 - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is pointless

* tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retbleed: Add __x86_return_thunk alignment checks
  x86/cpu: Remove X86_FEATURE_NAMES
  x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
2023-06-27Merge tag 'x86_alternatives_for_v6.5' of ↵Linus Torvalds1-8/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 instruction alternatives updates from Borislav Petkov:

 - Up until now the Fast Short Rep Mov optimizations implied the presence of the ERMS CPUID flag. AMD decoupled them with a BIOS setting so decouple that dependency in the kernel code too

 - Teach the alternatives machinery to handle relocations

 - Make debug_alternative accept flags in order to see only the set of patching one is interested in

 - Other fixes, cleanups and optimizations to the patching code

* tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternative: PAUSE is not a NOP
  x86/alternatives: Add cond_resched() to text_poke_bp_batch()
  x86/nospec: Shorten RESET_CALL_DEPTH
  x86/alternatives: Add longer 64-bit NOPs
  x86/alternatives: Fix section mismatch warnings
  x86/alternative: Optimize returns patching
  x86/alternative: Complicate optimize_nops() some more
  x86/alternative: Rewrite optimize_nops() some
  x86/lib/memmove: Decouple ERMS from FSRM
  x86/alternative: Support relocations in alternatives
  x86/alternative: Make debug-alternative selective
2023-06-06x86/lib/msr: Clean up kernel-doc notationRandy Dunlap1-13/+19
Convert x86/lib/msr.c comments to kernel-doc notation to eliminate kernel-doc warnings: msr.c:30: warning: This comment starts with '/**', but isn't \ a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst ... Fixes: 22085a66c2fa ("x86: Add another set of MSR accessor functions") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/oe-kbuild-all/202304120048.v4uqUq9Q-lkp@intel.com/
2023-06-05percpu: Wire up cmpxchg128Peter Zijlstra3-33/+80
In order to replace cmpxchg_double() with the newly minted cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.654945124@infradead.org
2023-06-02x86/lib: Make get/put_user() exception handling a visible symbolNadav Amit2-28/+28
The .L-prefixed exception handling symbols of get_user() and put_user() do get discarded from the symbol table of the final kernel image. This confuses tools which parse that symbol table and try to map the chunk of code to a symbol. And, in general, from toolchain perspective, it is a good practice to have all code belong to a symbol, and the correct one at that. ( Currently, objdump displays that exception handling chunk as part of the previous symbol which is a "fallback" of sorts and not correct. ) While at it, rename them to something more descriptive. [ bp: Rewrite commit message, rename symbols. ] Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230525184244.2311-1-namit@vmware.com
2023-05-29x86/csum: Fix clang -Wuninitialized in csum_partial()Nathan Chancellor1-4/+6
Clang warns:

  arch/x86/lib/csum-partial_64.c:74:20: error: variable 'result' is uninitialized when used here [-Werror,-Wuninitialized]
          return csum_tail(result, temp64, odd);
                           ^~~~~~
  arch/x86/lib/csum-partial_64.c:48:22: note: initialize the variable 'result' to silence this warning
          unsigned odd, result;
                              ^
                               = 0
  1 error generated.

The only initialization and uses of result in csum_partial() were moved into csum_tail() but result is still being passed by value to csum_tail() (clang's -Wuninitialized does not do interprocedural analysis to realize that result is always assigned in csum_tail() however).

Sink the declaration of result into csum_tail() to clear up the warning.

Closes: https://lore.kernel.org/202305262039.3HUYjWJk-lkp@intel.com/
Fixes: 688eb8191b47 ("x86/csum: Improve performance of `csum_partial`")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230526-csum_partial-wuninitialized-v1-1-ebc0108dcec1%40kernel.org
2023-05-26x86: re-introduce support for ERMS copies for user space accessesLinus Torvalds1-1/+9
I tried to streamline our user memory copy code fairly aggressively in commit adfcf4231b8c ("x86: don't use REP_GOOD or ERMS for user memory copies"), in order to then be able to clean up the code and inline the modern FSRM case in commit 577e6a7fd50d ("x86: inline the 'rep movs' in user copies for the FSRM case").

We had reports [1] of that causing regressions earlier with blogbench, but that turned out to be a horrible benchmark for that case, and not a sufficient reason for re-instating "rep movsb" on older machines.

However, now Eric Dumazet reported [2] a regression in performance that seems to be a rather more real benchmark, where due to the removal of "rep movs" a TCP stream over a 100Gbps network no longer reaches line speed.

And it turns out that with the simplified calling convention for the non-FSRM case in commit 427fda2c8a49 ("x86: improve on the non-rep 'copy_user' function"), re-introducing the ERMS case is actually fairly simple.

Of course, that "fairly simple" is glossing over several missteps due to having to fight our assembler alternative code. This code really wanted to rewrite a conditional branch to have two different targets, but that made objtool sufficiently unhappy that this instead just ended up doing a choice between "jump to the unrolled loop, or use 'rep movsb' directly".

Let's see if somebody finds a case where the kernel memory copies also care (see commit 68674f94ffc9: "x86: don't use REP_GOOD or ERMS for small memory copies"). But Eric does argue that the user copies are special because networking tries to copy up to 32KB at a time, if order-3 pages allocations are possible.

In-kernel memory copies are typically small, unless they are the special "copy pages at a time" kind that still use "rep movs".
Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1] Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2] Reported-and-tested-by: Eric Dumazet <edumazet@google.com> Fixes: adfcf4231b8c ("x86: don't use REP_GOOD or ERMS for user memory copies") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-25x86/csum: Improve performance of `csum_partial`Noah Goldstein1-32/+65
1) Add special case for len == 40 as that is the hottest value. This nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case.

2) Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations).

Results from benchmarking on Icelake. Times measured with rdtsc()

  len   lat_new   lat_old      r   tput_new  tput_old      r
    8      3.58      3.47  1.032       3.58      3.51  1.021
   16      4.14      4.02  1.028       3.96      3.78  1.046
   24      4.99      5.03  0.992       4.23      4.03  1.050
   32      5.09      5.08  1.001       4.68      4.47  1.048
   40      5.57      6.08  0.916       3.05      4.43  0.690
   48      6.65      6.63  1.003       4.97      4.69  1.059
   56      7.74      7.72  1.003       5.22      4.95  1.055
   64      6.65      7.22  0.921       6.38      6.42  0.994
   96      9.43      9.96  0.946       7.46      7.54  0.990
  128      9.39     12.15  0.773       8.90      8.79  1.012
  200     12.65     18.08  0.699      11.63     11.60  1.002
  272     15.82     23.37  0.677      14.43     14.35  1.005
  440     24.12     36.43  0.662      21.57     22.69  0.951
  952     46.20     74.01  0.624      42.98     53.12  0.809
 1024     47.12     78.24  0.602      46.36     58.83  0.788
 1552     72.01    117.30  0.614      71.92     96.78  0.743
 2048     93.07    153.25  0.607      93.28    137.20  0.680
 2600    114.73    194.30  0.590     114.28    179.32  0.637
 3608    156.34    268.41  0.582     154.97    254.02  0.610
 4096    175.01    304.03  0.576     175.89    292.08  0.602

There is no such thing as a free lunch, however, and the special case for len == 40 does add overhead to the len != 40 cases. This seems to amount to ~5% in throughput and slightly less in latency.

Testing: Part of this change is a new kunit test. The tests check all alignment X length pairs in [0, 64) X [0, 512). There are three cases.

  1) Precomputed random inputs/seed. The expected results were generated using the generic implementation (which is assumed to be non-buggy).
  2) An input of all 1s. The goal of this test is to catch any case where a carry is missing.
  3) An input that never carries. The goal of this test is to catch any case of incorrectly carrying.

More exhaustive tests that test all alignment X length pairs in [0, 8192) X [0, 8192] on random data are also available here: https://github.com/goldsteinn/csum-reproduction

The repository also has the code for reproducing the above benchmark numbers.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com
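The "multiple accumulators" idea in point 2 can be sketched in portable C. This is only an illustration under stated assumptions, not the code from the patch: the helper names (add_carry, sum_single, sum_multi, fold16) are hypothetical, the input is taken as whole 64-bit words, and the kernel's alignment, tail and len == 40 handling are left out. Because one's-complement addition is associative and commutative, four independent running sums can be merged at the end and give the same value as one long dependency chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement add: wrap the carry-out back into bit 0. */
static inline uint64_t add_carry(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;

    return s + (s < b);
}

/* Reference version: every add depends on the previous one. */
static uint64_t sum_single(const uint64_t *w, size_t n)
{
    uint64_t acc = 0;

    assert(w != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc = add_carry(acc, w[i]);
    return acc;
}

/* Four independent dependency chains, merged once at the end. */
static uint64_t sum_multi(const uint64_t *w, size_t n)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    assert(w != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        a0 = add_carry(a0, w[i]);
        a1 = add_carry(a1, w[i + 1]);
        a2 = add_carry(a2, w[i + 2]);
        a3 = add_carry(a3, w[i + 3]);
    }
    /* Leftover words go through a single accumulator. */
    for (; i < n; i++)
        a0 = add_carry(a0, w[i]);

    return add_carry(add_carry(a0, a1), add_carry(a2, a3));
}

/* Fold a 64-bit one's-complement sum down to 16 bits. */
static uint16_t fold16(uint64_t s)
{
    s = (s & 0xffffffffu) + (s >> 32);
    s = (s & 0xffffffffu) + (s >> 32);
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}
```

Whether the split actually helps depends on the CPU being able to run the independent chains in parallel, which is what the benchmark table above measures for the real implementation.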
2023-05-18x86/usercopy: Include arch_wb_cache_pmem() declarationArnd Bergmann1-0/+1
arch_wb_cache_pmem() is declared in a global header but defined by the architecture. On x86, the implementation needs to include the header to avoid this warning: arch/x86/lib/usercopy_64.c:39:6: error: no previous prototype for 'arch_wb_cache_pmem' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-18-arnd%40kernel.org
2023-05-17x86/retbleed: Add __x86_return_thunk alignment checksBorislav Petkov (AMD)1-1/+1
Add a linker assertion and compute the 0xcc padding dynamically so that __x86_return_thunk is always cacheline-aligned. Leave the SYM_START() macro in as the untraining doesn't need ENDBR annotations anyway. Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/r/20230515140726.28689-1-bp@alien8.de
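The arithmetic behind "compute the padding dynamically" can be sketched as follows. This is a plain C model, not the assembler or linker-script code from the patch: padding_before() and is_aligned() are hypothetical helpers, and the 1-byte prefix used in the usage note is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: starting from an aligned address, how many filler
 * bytes must be emitted so that a symbol preceded by prefix_len bytes of
 * code lands exactly on the next 'align' boundary?
 * 'align' must be a power of two.
 */
static size_t padding_before(size_t prefix_len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (prefix_len & (align - 1))) & (align - 1);
}

/* Analogue of the linker assertion: is the address on a boundary? */
static int is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

With a 64-byte boundary and an assumed 1-byte prefix, padding_before(1, 64) yields 63 filler bytes, and is_aligned() plays the role of the check that fails the build if a later macro change shifts the symbol.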
2023-05-13x86/retbleed: Fix return thunk alignmentBorislav Petkov (AMD)1-2/+2
SYM_FUNC_START_LOCAL_NOALIGN() adds an endbr leading to this layout (leaving only the last 2 bytes of the address):

  3bff <zen_untrain_ret>:
  3bff:       f3 0f 1e fa             endbr64
  3c03:       f6                      test   $0xcc,%bl

  3c04 <__x86_return_thunk>:
  3c04:       c3                      ret
  3c05:       cc                      int3
  3c06:       0f ae e8                lfence

However, "the RET at __x86_return_thunk must be on a 64 byte boundary, for alignment within the BTB."

Use SYM_START instead.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-10x86/lib/memmove: Decouple ERMS from FSRMBorislav Petkov (AMD)1-8/+5
Up until now it was perceived that FSRM is an improvement to ERMS and thus it was made dependent on the latter.

However, there are AMD BIOSes out there which allow for disabling of either feature, thus preventing kernels from booting due to the CMP disappearing and breaking the logic in the memmove() function.

A similar situation arises in some VM migration scenarios.

Patch the proper sequences depending on which feature is enabled.

Reported-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y/yK0dyzI0MMdTie@zn.tnic
2023-04-29Merge tag 'objtool-core-2023-04-27' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically

 - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it

 - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code

 - Generate ORC data for __pfx code

 - Add more __noreturn annotations to various kernel startup/shutdown and panic functions

 - Misc improvements & fixes

* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/hyperv: Mark hv_ghcb_terminate() as noreturn
  scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
  x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
  btrfs: Mark btrfs_assertfail() __noreturn
  objtool: Include weak functions in global_noreturns check
  cpu: Mark nmi_panic_self_stop() __noreturn
  cpu: Mark panic_smp_self_stop() __noreturn
  arm64/cpu: Mark cpu_park_loop() and friends __noreturn
  x86/head: Mark *_start_kernel() __noreturn
  init: Mark start_kernel() __noreturn
  init: Mark [arch_call_]rest_init() __noreturn
  objtool: Generate ORC data for __pfx code
  x86/linkage: Fix padding for typed functions
  objtool: Separate prefix code from stack validation code
  objtool: Remove superfluous dead_end_function() check
  objtool: Add symbol iteration helpers
  objtool: Add WARN_INSN()
  scripts/objdump-func: Support multiple functions
  context_tracking: Fix KCSAN noinstr violation
  objtool: Add stackleak instrumentation to uaccess safe list
  ...
2023-04-28Merge tag 'x86_mm_for_6.4' of ↵Linus Torvalds2-82/+55
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
 "Add support for the new Linear Address Masking CPU feature.

  This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use"

* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
  x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
  selftests/x86/lam: Add test cases for LAM vs thread creation
  selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
  selftests/x86/lam: Add inherit test cases for linear-address masking
  selftests/x86/lam: Add io_uring test cases for linear-address masking
  selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
  selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
  x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
  iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
  mm: Expose untagging mask in /proc/$PID/status
  x86/mm: Provide arch_prctl() interface for LAM
  x86/mm: Reduce untagged_addr() overhead for systems without LAM
  x86/uaccess: Provide untagged_addr() and remove tags before address check
  mm: Introduce untagged_addr_remote()
  x86/mm: Handle LAM on context switch
  x86: CPUID and CR3/CR4 flags for Linear Address Masking
  x86: Allow atomic MM_CONTEXT flags setting
  x86/mm: Rework address range check in get_user() and put_user()
2023-04-28Merge tag 'x86_cleanups_for_v6.4_rc1' of ↵Linus Torvalds1-9/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Unify duplicated __pa() and __va() definitions

 - Simplify sysctl tables registration

 - Remove unused symbols

 - Correct function name in comment

* tag 'x86_cleanups_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Centralize __pa()/__va() definitions
  x86: Simplify one-level sysctl registration for itmt_kern_table
  x86: Simplify one-level sysctl registration for abi_table2
  x86/platform/intel-mid: Remove unused definitions from intel-mid.h
  x86/uaccess: Remove memcpy_page_flushcache()
  x86/entry: Change stale function name in comment to error_return()
2023-04-21x86: rewrite '__copy_user_nocache' functionLinus Torvalds3-214/+243
I didn't really want to do this, but as part of all the other changes to the user copy loops, I've been looking at this horror. I tried to clean it up multiple times, but every time I just found more problems, and the way it's written, it's just too hard to fix them. For example, the code is written to do quad-word alignment, and will use regular byte accesses to get to that point. That's fairly simple, but it means that any initial 8-byte alignment will be done with cached copies. However, the code then is very careful to do any 4-byte _tail_ accesses using an uncached 4-byte write, and that was claimed to be relevant in commit a82eee742452 ("x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()"). So if you do a 4-byte copy using that function, it carefully uses a 4-byte 'movnti' for the destination. But if you were to do a 12-byte copy that is 4-byte aligned, it would _not_ do a 4-byte 'movnti' followed by a 8-byte 'movnti' to keep it all uncached. Instead, it would align the destination to 8 bytes using a byte-at-a-time loop, and then do a 8-byte 'movnti' for the final 8 bytes. The main caller that cares is __copy_user_flushcache(), which knows about this insanity, and has odd cases for it all. But I just can't deal with looking at this kind of "it does one case right, and another related case entirely wrong". And the code really wasn't fixable without hard drugs, which I try to avoid. So instead, rewrite it in a form that hopefully not only gets this right, but is a bit more maintainable. Knock wood. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
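The 12-byte example in the message can be made concrete with a small planner that picks store widths. This is a hypothetical model, not the algorithm of the rewritten assembly: plan_stores() is an invented name, and it simply chooses the widest naturally aligned store that still fits. In the real routine only the 4-byte and 8-byte widths would correspond to 'movnti'; the 1-byte and 2-byte widths here stand for ordinary cached stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical planner: record, in widths[], the size of each store used
 * to cover [dst, dst + len), always taking the widest naturally aligned
 * store (8, 4, 2 or 1 bytes) that does not overrun the range.
 * Returns the number of stores planned.
 */
static size_t plan_stores(uintptr_t dst, size_t len,
                          unsigned char *widths, size_t max_widths)
{
    size_t n = 0;

    while (len > 0) {
        unsigned char w = 8;

        /* Shrink until the store is aligned and fits; w == 1 always does. */
        while ((dst % w) != 0 || w > len)
            w /= 2;

        assert(n < max_widths);
        widths[n++] = w;
        dst += w;
        len -= w;
    }
    return n;
}
```

For a destination that is 4-byte but not 8-byte aligned and a length of 12, the plan is a 4-byte store followed by an 8-byte store, which is the all-uncached sequence the message says the old code failed to produce.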
2023-04-20x86: remove 'zerorest' argument from __copy_user_nocache()Linus Torvalds2-3/+3
Every caller passes in zero, meaning they don't want any partial copy to zero the remainder of the destination buffer. Which is just as well, because the implementation of that function didn't actually even look at that argument, and wasn't even aware it existed, although some misleading comments did mention it still. The 'zerorest' thing is a historical artifact of how "copy_from_user()" worked, in that it would zero the rest of the kernel buffer that it copied into. That zeroing still exists, but it's long since been moved to generic code, and the raw architecture-specific code doesn't do it. See _copy_from_user() in lib/usercopy.c for this all. However, while __copy_user_nocache() shares some history and superficial other similarities with copy_from_user(), it is in many ways also very different. In particular, while the code makes it *look* similar to the generic user copy functions that can copy both to and from user space, and take faults on both reads and writes as a result, __copy_user_nocache() does no such thing at all. __copy_user_nocache() always copies to kernel space, and will never take a page fault on the destination. What *can* happen, though, is that the non-temporal stores take a machine check because one of the use cases is for writing to stable memory, and any memory errors would then take synchronous faults. So __copy_user_nocache() does look a lot like copy_from_user(), but has faulting behavior that is more akin to our old copy_in_user() (which no longer exists, but copied from user space to user space and could fault on both source and destination). And it very much does not have the "zero the end of the destination buffer", since a problem with the destination buffer is very possibly the very source of the partial copy. So this whole thing was just a confusing historical artifact from having shared some code with a completely different function with completely different use cases. 
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
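The division of labour described above (the raw copy reports how many bytes it could not copy, and generic code zeroes that tail) can be modelled in user space. This is a sketch under stated assumptions: model_raw_copy() and model_copy_from_user() are hypothetical stand-ins, and the fault_at parameter simulates where a fault would occur, since a real fault cannot be produced portably here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the architecture-specific raw copy: copies until a
 * simulated fault at offset fault_at and returns the bytes NOT copied.
 * It never touches the uncopied part of the destination.
 */
static size_t model_raw_copy(void *to, const void *from, size_t n,
                             size_t fault_at)
{
    size_t done = n < fault_at ? n : fault_at;

    memcpy(to, from, done);
    return n - done;
}

/*
 * Stand-in for the generic wrapper: on a partial copy, zero the rest of
 * the destination so no stale data is left in the kernel buffer.
 */
static size_t model_copy_from_user(void *to, const void *from, size_t n,
                                   size_t fault_at)
{
    size_t left = model_raw_copy(to, from, n, fault_at);

    assert(left <= n);
    if (left)
        memset((char *)to + (n - left), 0, left);
    return left;
}
```

A nocache copy in this model would call only the raw routine and leave the tail alone, which matches the point that zeroing makes no sense when the destination itself may be the source of the failure.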
2023-04-19x86: improve on the non-rep 'copy_user' functionLinus Torvalds1-156/+133
The old 'copy_user_generic_unrolled' function was oddly implemented for largely historical reasons: it had been largely based on the uncached copy case, which has some other concerns. For example, the __copy_user_nocache() function uses 'movnti' for the destination stores, and those want the destination to be aligned. In contrast, the regular copy function doesn't really care, and trying to align things only complicates matters. Also, like the clear_user function, the copy function had some odd handling of the repeat counts, complicating the exception handling for no really good reason. So as with clear_user, just write it to keep all the byte counts in the %rcx register, exactly like the 'rep movs' functionality that this replaces. Unlike a real 'rep movs', we do allow for this to trash a few temporary registers to not have to unnecessarily save/restore registers on the stack. And like the clearing case, rename this to what it now clearly is: 'rep_movs_alternative', and make it one coherent function, so that it shows up as such in profiles (instead of the odd split between "copy_user_generic_unrolled" and "copy_user_short_string", the latter of which was not about strings at all, and which was shared with the uncached case). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19x86: improve on the non-rep 'clear_user' functionLinus Torvalds1-44/+70
The old version was oddly written to have the repeat count in multiple registers. So instead of taking advantage of %rax being zero, it had some sub-counts in it. All just for a "single word clearing" loop, which isn't even efficient to begin with. So get rid of those games, and just keep all the state in the same registers we got it in (and that we should return things in). That not only makes this act much more like 'rep stos' (which this function is replacing), but makes it much easier to actually do the obvious loop unrolling. Also rename the function from the now nonsensical 'clear_user_original' to what it now clearly is: 'rep_stos_alternative'. End result: if we don't have a fast 'rep stosb', at least we can have a fast fallback for it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
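The "keep all the state in one count" idea can be modelled in C. This is a hypothetical sketch, not a translation of the assembly: model_rep_stos_alternative() is an invented name, and the writable parameter simulates how many bytes can be stored before a fault. The single count is both the loop state and the return value (bytes left uncleared), much like the remaining count of a 'rep stos'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Clear up to 'count' bytes at dst, stopping at a simulated fault after
 * 'writable' bytes. Returns the number of bytes NOT cleared (0 on success).
 */
static size_t model_rep_stos_alternative(unsigned char *dst, size_t count,
                                         size_t writable)
{
    assert(dst != NULL || count == 0);

    /* Unrolled path: 64 bytes per iteration. */
    while (count >= 64 && writable >= 64) {
        memset(dst, 0, 64);
        dst += 64;
        count -= 64;
        writable -= 64;
    }

    /* Word path: 8 bytes per iteration. */
    while (count >= 8 && writable >= 8) {
        memset(dst, 0, 8);
        dst += 8;
        count -= 8;
        writable -= 8;
    }

    /* Byte tail; this also pins down the exact simulated fault offset. */
    while (count > 0 && writable > 0) {
        *dst++ = 0;
        count--;
        writable--;
    }

    return count;
}
```

Because nothing but the count needs fixing up, the fault path in this model is just "return what is left", which is the simplification the message describes.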
2023-04-19x86: inline the 'rep movs' in user copies for the FSRM caseLinus Torvalds1-34/+21
This does the same thing for the user copies as commit 0db7058e8e23 ("x86/clear_user: Make it faster") did for clear_user(). In other words, it inlines the "rep movs" case when X86_FEATURE_FSRM is set, avoiding the function call entirely. In order to do that, it makes the calling convention for the out-of-line case ("copy_user_generic_unrolled") match the 'rep movs' calling convention, although it does also end up clobbering a number of additional registers. Also, to simplify code sharing in the low-level assembly with the __copy_user_nocache() function (that uses the normal C calling convention), we end up with a kind of mixed return value for the low-level asm code: it will return the result in both %rcx (to work as an alternative for the 'rep movs' case), _and_ in %rax (for the nocache case). We could avoid this by wrapping __copy_user_nocache() callers in an inline asm, but since the cost is just an extra register copy, it's probably not worth it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19x86: move stac/clac from user copy routines into callersLinus Torvalds2-11/+5
This is preparatory work for inlining the 'rep movs' case, but also a cleanup. The __copy_user_nocache() function was mis-used by the rdma code to do uncached kernel copies that don't actually want user copies at all, and as a result doesn't want the stac/clac either. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19x86: don't use REP_GOOD or ERMS for user memory clearingLinus Torvalds1-75/+0
The modern target to use is FSRS (Fast Short REP STOS), and the other cases should only be used for bigger areas (ie mainly things like page clearing). Note! This changes the conditional for the inlining from FSRM ("fast short rep movs") to FSRS ("fast short rep stos"). We'll have a separate fixup for AMD microarchitectures that have a good 'rep stosb' yet do not set the new Intel-specific FSRS bit (because FSRM was there first). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19x86: don't use REP_GOOD or ERMS for user memory copiesLinus Torvalds1-44/+7
The modern target to use is FSRM (Fast Short REP MOVS), and the other cases should only be used for bigger areas (ie mainly things like page clearing). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19x86: don't use REP_GOOD or ERMS for small memory clearingLinus Torvalds1-36/+11
The modern target to use is FSRS (Fast Short REP STOS), and the other cases should only be used for bigger areas (ie mainly things like page clearing). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19x86: don't use REP_GOOD or ERMS for small memory copiesLinus Torvalds1-24/+10
The modern target to use is FSRM (Fast Short REP MOVS), and the other cases should only be used for bigger areas (ie mainly things like page copying and clearing). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-24x86,objtool: Split UNWIND_HINT_EMPTY in twoJosh Poimboeuf1-3/+3
Mark reported that the ORC unwinder incorrectly marks an unwind as reliable when the unwind terminates prematurely in the dark corners of return_to_handler() due to lack of information about the next frame.

The problem is UNWIND_HINT_EMPTY is used in two different situations:

  1) The end of the kernel stack unwind before hitting user entry, boot code, or fork entry

  2) A blind spot in ORC coverage where the unwinder has to bail due to lack of information about the next frame

The ORC unwinder has no way to tell the difference between the two. When it encounters an undefined stack state with 'end=1', it blindly marks the stack reliable, which can break the livepatch consistency model.

Fix it by splitting UNWIND_HINT_EMPTY into UNWIND_HINT_UNDEFINED and UNWIND_HINT_END_OF_STACK.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/fd6212c8b450d3564b855e1cb48404d6277b4d9f.1677683419.git.jpoimboe@kernel.org
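Why one hint cannot serve both situations can be shown with a toy unwinder. This is a hypothetical model and not the ORC data format or the kernel's unwind code: the enum, struct and function names are invented for illustration. With two distinct markers, only a walk that ends on the end-of-stack marker is reported as reliable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented per-frame markers standing in for the two new hint kinds. */
enum frame_hint {
    FRAME_REGULAR,      /* normal frame, keep walking */
    FRAME_END_OF_STACK, /* legitimate end of the kernel stack */
    FRAME_UNDEFINED     /* blind spot: nothing known about the next frame */
};

struct unwind_result {
    size_t frames;  /* regular frames walked before stopping */
    bool reliable;  /* true only if the walk ended at END_OF_STACK */
};

static struct unwind_result model_unwind(const enum frame_hint *stack, size_t n)
{
    struct unwind_result r = { 0, false };

    assert(stack != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (stack[i] == FRAME_END_OF_STACK) {
            r.reliable = true;
            return r;
        }
        if (stack[i] == FRAME_UNDEFINED)
            return r;   /* bail out; the trace stays unreliable */
        r.frames++;
    }
    return r;   /* ran out of data without an end marker: unreliable */
}
```

If both markers were the same value, as with the old single hint, the second case would be indistinguishable from the first and a truncated trace would be reported as complete.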
2023-03-16x86/mm: Rework address range check in get_user() and put_user()Kirill A. Shutemov2-82/+55
The functions get_user() and put_user() check that the target address range resides in the user space portion of the virtual address space. In order to perform this check, the functions compare the end of the range against TASK_SIZE_MAX. For kernels compiled with CONFIG_X86_5LEVEL, this process requires some additional trickery using ALTERNATIVE, as TASK_SIZE_MAX depends on the paging mode in use. Linus suggested that this check could be simplified for 64-bit kernels. It is sufficient to check bit 63 of the address to ensure that the range belongs to user space. Additionally, the use of branches can be avoided by setting the target address to all ones if bit 63 is set. There's no need to check the end of the access range as there's huge gap between end of userspace range and start of the kernel range. The gap consists of canonical hole and unused ranges on both kernel and userspace sides. If an address with bit 63 set is passed down, it will trigger a #GP exception. _ASM_EXTABLE_UA() complains about this. Replace it with plain _ASM_EXTABLE() as it is expected behaviour now. The updated get_user() and put_user() checks are also compatible with Linear Address Masking, which allows user space to encode metadata in the upper bits of pointers and eliminates the need to untag the address before handling it. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230312112612.31869-2-kirill.shutemov%40linux.intel.com
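The branch-free check described above can be sketched in C. The real code is assembly; mask_user_address() is a hypothetical helper used only to show the arithmetic. Any address with bit 63 set is turned into all ones, which is not a valid user address and so faults on access, while addresses with bit 63 clear pass through untouched, including ones that carry tag bits below bit 63.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): force any address with
 * bit 63 set to all ones, without a conditional branch.
 */
static inline uint64_t mask_user_address(uint64_t addr)
{
    /* All ones when bit 63 is set, zero otherwise. */
    uint64_t mask = (uint64_t)0 - (addr >> 63);
    uint64_t result = addr | mask;

    /* Postcondition: either unchanged or saturated to all ones. */
    assert(result == addr || result == UINT64_MAX);
    return result;
}
```

Only the start address is examined; as the message explains, the gap between the end of the user range and the start of the kernel range makes a separate check of the end of the access unnecessary.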
2023-03-16x86/uaccess: Remove memcpy_page_flushcache()Ira Weiny1-9/+0
Commit 21b56c847753 ("iov_iter: get rid of separate bvec and xarray callbacks") removed the calls to memcpy_page_flushcache(). In addition, memcpy_page_flushcache() uses the deprecated kmap_atomic(). Remove the unused x86 memcpy_page_flushcache() implementation and also get rid of one more kmap_atomic() user. [ dhansen: tweak changelog ] Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20221230-kmap-x86-v1-1-15f1ecccab50%40intel.com
2023-02-22Merge tag 'x86_cpu_for_v6.3_rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Cache the AMD debug registers in per-CPU variables to avoid MSR writes where possible, when supporting a debug registers swap feature for SEV-ES guests - Add support for AMD's version of eIBRS called Automatic IBRS which is a set-and-forget control of indirect branch restriction speculation resources on privilege change - Add support for a new x86 instruction - LKGS - Load kernel GS which is part of the FRED infrastructure - Reset SPEC_CTRL upon init to accommodate use cases like kexec which rediscover - Other smaller fixes and cleanups * tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd: Cache debug register values in percpu variables KVM: x86: Propagate the AMD Automatic IBRS feature to the guest x86/cpu: Support AMD Automatic IBRS x86/cpu, kvm: Add the SMM_CTL MSR not present feature x86/cpu, kvm: Add the Null Selector Clears Base feature x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code x86/cpu, kvm: Add support for CPUID_80000021_EAX x86/gsseg: Add the new <asm/gsseg.h> header to <asm/asm-prototypes.h> x86/gsseg: Use the LKGS instruction if available for load_gs_index() x86/gsseg: Move load_gs_index() to its own new header file x86/gsseg: Make asm_load_gs_index() take an u16 x86/opcode: Add the LKGS instruction to x86-opcode-map x86/cpufeature: Add the CPU feature bit for LKGS x86/bugs: Reset speculation control settings on init x86/cpu: Remove redundant extern x86_read_arch_cap_msr()
2023-02-21Merge tag 'x86-asm-2023-02-20' of ↵Linus Torvalds2-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "Header fixes and a DocBook fix" * tag 'x86-asm-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/lib: Fix compiler and kernel-doc warnings x86/lib: Include <asm/misc.h> to fix a missing prototypes warning at build time
2023-01-31Merge tag 'v6.2-rc6' into sched/core, to pick up fixesIngo Molnar2-11/+11
Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-13entry, kasan, x86: Disallow overriding mem*() functionsPeter Zijlstra3-5/+8
KASAN cannot just hijack the mem*() functions, it needs to emit __asan_mem*() variants if it wants instrumentation (other sanitizers already do this). vmlinux.o: warning: objtool: sync_regs+0x24: call to memcpy() leaves .noinstr.text section vmlinux.o: warning: objtool: vc_switch_off_ist+0xbe: call to memcpy() leaves .noinstr.text section vmlinux.o: warning: objtool: fixup_bad_iret+0x36: call to memset() leaves .noinstr.text section vmlinux.o: warning: objtool: __sev_get_ghcb+0xa0: call to memcpy() leaves .noinstr.text section vmlinux.o: warning: objtool: __sev_put_ghcb+0x35: call to memcpy() leaves .noinstr.text section Remove the weak aliases to ensure nobody hijacks these functions and add them to the noinstr section. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195542.028523143@infradead.org
2023-01-12x86/opcode: Add the LKGS instruction to x86-opcode-mapH. Peter Anvin (Intel)1-0/+1
Add the instruction opcode used by LKGS to x86-opcode-map. Opcode number is per public FRED draft spec v3.0. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230112072032.35626-3-xin3.li@intel.com
2023-01-03x86/lib: Fix compiler and kernel-doc warningsAnuradha Weeraman1-1/+3
Fix the following W=1 warnings: arch/x86/lib/cmdline.c: - Include <asm/cmdline.h> to fix missing-prototypes warnings. - Update comment for __cmdline_find_option_bool to fix a kernel-doc warning. Signed-off-by: Anuradha Weeraman <anuradha@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230103114725.108431-1-anuradha@debian.org
2023-01-03x86/insn: Avoid namespace clash by separating instruction decoder MMIO type ↵Jason A. Donenfeld1-10/+10
from MMIO trace type Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants, whose namespace overlaps. Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be used from the same source file. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2023-01-03x86/asm: Fix an assembler warning with current binutilsMikulas Patocka1-1/+1
Fix a warning: "found `movsd'; assuming `movsl' was meant" Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org
2023-01-03x86/lib: Include <asm/misc.h> to fix a missing prototypes warning at build timeAnuradha Weeraman1-0/+2
Signed-off-by: Anuradha Weeraman <anuradha@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230103034637.74679-1-anuradha@debian.org
2022-12-15Merge tag 'x86_core_for_v6.2' of ↵Linus Torvalds3-21/+149
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Borislav Petkov: - Add the call depth tracking mitigation for Retbleed which has been long in the making. It is a lighterweight software-only fix for Skylake-based cores where enabling IBRS is a big hammer and causes a significant performance impact. What it basically does is, it aligns all kernel functions to 16 bytes boundary and adds a 16-byte padding before the function, objtool collects all functions' locations and when the mitigation gets applied, it patches a call accounting thunk which is used to track the call depth of the stack at any time. When that call depth reaches a magical, microarchitecture-specific value for the Return Stack Buffer, the code stuffs that RSB and avoids its underflow which could otherwise lead to the Intel variant of Retbleed. This software-only solution brings a lot of the lost performance back, as benchmarks suggest: https://lore.kernel.org/all/20220915111039.092790446@infradead.org/ That page above also contains a lot more detailed explanation of the whole mechanism - Implement a new control flow integrity scheme called FineIBT which is based on the software kCFI implementation and uses hardware IBT support where present to annotate and track indirect branches using a hash to validate them - Other misc fixes and cleanups * tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits) x86/paravirt: Use common macro for creating simple asm paravirt functions x86/paravirt: Remove clobber bitmask from .parainstructions x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit x86/Kconfig: Enable kernel IBT by default x86,pm: Force out-of-line memcpy() objtool: Fix weak hole vs prefix symbol objtool: Optimize elf_dirty_reloc_sym() x86/cfi: Add boot time hash randomization 
x86/cfi: Boot time selection of CFI scheme x86/ibt: Implement FineIBT objtool: Add --cfi to generate the .cfi_sites section x86: Add prefix symbols for function padding objtool: Add option to generate prefix symbols objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf objtool: Slice up elf_create_section_symbol() kallsyms: Revert "Take callthunks into account" x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces x86/retpoline: Fix crash printing warning x86/paravirt: Fix a !PARAVIRT build warning ...
2022-12-14Merge tag 'x86_asm_for_v6.2' of ↵Linus Torvalds3-187/+201
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Borislav Petkov: - Move the 32-bit memmove() asm implementation out-of-line in order to fix a 32-bit full LTO build failure with clang where it would fail at register allocation. Move it to an asm file and clean it up while at it, similar to what has been already done on 64-bit * tag 'x86_asm_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mem: Move memmove to out of line assembler
2022-11-22Merge tag 'v6.1-rc6' into x86/core, to resolve conflictsIngo Molnar1-0/+3
Resolve conflicts between these commits in arch/x86/kernel/asm-offsets.c: # upstream: debc5a1ec0d1 ("KVM: x86: use a separate asm-offsets.c file") # retbleed work in x86/core: 5d8213864ade ("x86/retbleed: Add SKL return thunk") ... and these commits in include/linux/bpf.h: # upstream: 18acb7fac22f ("bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")") # x86/core commits: 931ab63664f0 ("x86/ibt: Implement FineIBT") bea75b33895f ("x86/Kconfig: Introduce function padding") The latter two modify BPF_DISPATCHER_ATTRIBUTES(), which was removed upstream. Conflicts: arch/x86/kernel/asm-offsets.c include/linux/bpf.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-11-09x86/uaccess: instrument copy_from_user_nmi()Alexander Potapenko1-0/+3
Make sure usercopy hooks from linux/instrumented.h are invoked for copy_from_user_nmi(). This fixes KMSAN false positives reported when dumping opcodes for a stack trace. Link: https://lkml.kernel.org/r/20221102110611.1085175-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Elver <elver@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-02x86/mem: Move memmove to out of line assemblerNick Desaulniers3-187/+201
When building ARCH=i386 with CONFIG_LTO_CLANG_FULL=y, it's possible (depending on additional configs which I have not been able to isolate) to observe a failure during register allocation: error: inline assembly requires more registers than available when memmove is inlined into tcp_v4_fill_cb() or tcp_v6_fill_cb(). memmove is quite large and probably shouldn't be inlined due to size alone. A noinline function attribute would be the simplest fix, but there's a few things that stand out with the current definition: In addition to having complex constraints that can't always be resolved, the clobber list seems to be missing %bx. By using numbered operands rather than symbolic operands, the constraints are quite obnoxious to refactor. Having a large function be 99% inline asm is a code smell that this function should simply be written in stand-alone out-of-line assembler. Moving this to out of line assembler guarantees that the compiler cannot inline calls to memmove. This has been done previously for 64b: commit 9599ec0471de ("x86-64, mem: Convert memmove() to assembly file and fix return value bug") That gives the opportunity for other cleanups like fixing the inconsistent use of tabs vs spaces and instruction suffixes, and the label 3 appearing twice. Symbolic operands, local labels, and additional comments would provide this code with a fresh coat of paint. Finally, add a test that tickles the `rep movsl` implementation to test it for correctness, since it has implicit operands. Suggested-by: Ingo Molnar <mingo@kernel.org> Suggested-by: David Laight <David.Laight@aculab.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/all/20221018172155.287409-1-ndesaulniers%40google.com
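The kind of correctness test mentioned at the end of this commit can be approximated in user space. This is a sketch under stated assumptions: it exercises the C library's `memmove()`, not the kernel's assembly routine, and the helper `overlap_move_ok()` and `BUF_SIZE` are hypothetical names. The idea is to compare an overlapping move against a reference result built through a scratch buffer, for both copy directions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 256

/*
 * Hypothetical helper: move len bytes from src_off to dst_off inside one
 * buffer and report whether the result matches an overlap-free reference.
 * Returns 1 on match, 0 on mismatch or out-of-range arguments.
 */
static int overlap_move_ok(size_t src_off, size_t dst_off, size_t len)
{
	unsigned char buf[BUF_SIZE], expect[BUF_SIZE], tmp[BUF_SIZE];

	if (len > BUF_SIZE || src_off > BUF_SIZE - len || dst_off > BUF_SIZE - len)
		return 0;

	/* Fill both buffers with the same distinct byte pattern. */
	for (size_t i = 0; i < BUF_SIZE; i++)
		buf[i] = expect[i] = (unsigned char)i;

	/* Reference: bounce through tmp so overlap cannot matter. */
	memcpy(tmp, expect + src_off, len);
	memcpy(expect + dst_off, tmp, len);

	/* Routine under test: must cope with overlapping ranges. */
	memmove(buf + dst_off, buf + src_off, len);

	return memcmp(buf, expect, BUF_SIZE) == 0;
}

/* Small self-check: one forward-overlapping and one backward-overlapping move. */
static void memmove_selftest(void)
{
	assert(overlap_move_ok(0, 4, 128)); /* destination above source */
	assert(overlap_move_ok(4, 0, 128)); /* destination below source */
}
```

Lengths that are not a multiple of four, such as 65, are worth including because a word-sized copy loop has to handle the leftover bytes separately.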
2022-10-17x86/calldepth: Add ret/call counting for debugThomas Gleixner1-1/+6
Add a debugfs mechanism to validate the accounting, e.g. vs. call/ret balance and to gather statistics about the stuffing to call ratio. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111148.204285506@infradead.org
2022-10-17x86/retpoline: Add SKL retthunk retpolinesPeter Zijlstra1-8/+63
Ensure that retpolines do the proper call accounting so that the return accounting works correctly. Specifically; retpolines are used to replace both 'jmp *%reg' and 'call *%reg', however these two cases do not have the same accounting requirements. Therefore split things up and provide two different retpoline arrays for SKL. The 'jmp *%reg' case needs no accounting, the __x86_indirect_jump_thunk_array[] covers this. The retpoline is changed to not use the return thunk; it's a simple call;ret construct. [ strictly speaking it should do: andq $(~0x1f), PER_CPU_VAR(__x86_call_depth) but we can argue this can be covered by the fuzz we already have in the accounting depth (12) vs the RSB depth (16) ] The 'call *%reg' case does need accounting, the __x86_indirect_call_thunk_array[] covers this. Again, this retpoline avoids the use of the return-thunk, in this case to avoid double accounting. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111147.996634749@infradead.org
2022-10-17x86/retbleed: Add SKL return thunkThomas Gleixner1-0/+31
To address the Intel SKL RSB underflow issue in software it's required to do call depth tracking. Provide a return thunk for call depth tracking on Intel SKL CPUs. The tracking does not use a counter. It uses arithmetic shift right on call entry and logical shift left on return. The depth tracking variable is initialized to 0x8000.... when the call depth is zero. The arithmetic shift right sign extends the MSB and saturates after the 12th call. The shift count is 5 so the tracking covers 12 nested calls. On return the variable is shifted left logically so it becomes zero again.

     CALL                RET
 0:  0x8000000000000000  0x0000000000000000
 1:  0xfc00000000000000  0xf000000000000000
 ...
11:  0xfffffffffffffff8  0xfffffffffffffc00
12:  0xffffffffffffffff  0xffffffffffffffe0

After a return buffer fill the depth is credited 12 calls before the next stuffing has to take place. There is an inaccuracy for situations like this: 10 calls 5 returns 3 calls 4 returns 3 calls .... The shift count might cause this to be off by one in either direction, but there is still a cushion vs. the RSB depth. The algorithm does not claim to be perfect, but it should obfuscate the problem enough to make exploitation extremely difficult. The theory behind this is: RSB is a stack with depth 16 which is filled on every call. On the return path speculation "pops" entries to speculate down the call chain. Once the speculative RSB is empty it switches to other predictors, e.g. the Branch History Buffer, which can be mistrained by user space and misguide the speculation path to a gadget. Call depth tracking is designed to break this speculation path by stuffing speculation trap calls into the RSB which are never getting a corresponding return executed. This stalls the prediction path until it gets resteered. The assumption is that stuffing at the 12th return is sufficient to break the speculation before it hits the underflow and the fallback to the other predictors. Testing confirms that it works. Johannes, one of the retbleed researchers, tried to attack this approach but failed. There is obviously no scientific proof that this will withstand future research progress, but all we can do right now is to speculate about it. The SAR/SHL usage was suggested by Andi Kleen. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111147.890071690@infradead.org
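The shift-based tracking described in this commit can be modelled in a few lines of C. This is a sketch, not the kernel's thunk: the names `DEPTH_SHIFT`, `DEPTH_INIT`, `depth_on_call()` and `depth_on_return()` are hypothetical, the arithmetic shift is written out by hand so it does not rely on implementation-defined signed shifts, and RSB stuffing and the per-CPU variable are left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define DEPTH_SHIFT 5
#define DEPTH_INIT  0x8000000000000000ULL /* value at call depth zero */

/* Call entry: arithmetic shift right, replicating the top bit. */
static uint64_t depth_on_call(uint64_t depth)
{
	/* Top DEPTH_SHIFT bits set if the sign bit is set, else zero. */
	uint64_t sign_fill = (depth >> 63) ? ~(~0ULL >> DEPTH_SHIFT) : 0;

	return (depth >> DEPTH_SHIFT) | sign_fill;
}

/* Return: logical shift left, dropping the top DEPTH_SHIFT bits. */
static uint64_t depth_on_return(uint64_t depth)
{
	return depth << DEPTH_SHIFT;
}

/* Small self-check: one call followed by one return restores the start value. */
static void calldepth_selftest(void)
{
	uint64_t d = depth_on_call(DEPTH_INIT);

	assert(d == 0xfc00000000000000ULL);
	assert(depth_on_return(d) == DEPTH_INIT);
}
```

In this model the variable saturates at all ones once enough calls have been nested, and a return from the initial value yields zero, which is the condition a return thunk could test before refilling.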
2022-10-17x86/putuser: Provide room for paddingThomas Gleixner1-13/+49
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111146.746429822@infradead.org
2022-10-17x86/error_inject: Align function properlyPeter Zijlstra1-0/+1
Ensure inline asm functions are consistently aligned with compiler generated and SYM_FUNC_START*() functions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111143.930201368@infradead.org