summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu/mce/core.c
AgeCommit message (Collapse)AuthorFilesLines
2022-01-27x86/mce: Mark mce_read_aux() noinstrBorislav Petkov1-1/+1
[ Upstream commit db6c996d6ce45dfb44891f0824a65ecec216f47a ] Fixes vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Mark mce_end() noinstrBorislav Petkov1-3/+11
[ Upstream commit b4813539d37fa31fed62cdfab7bd2dd8929c5b2e ] It is called by the #MC handler which is noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Mark mce_panic() noinstrBorislav Petkov1-3/+12
[ Upstream commit 3c7ce80a818fa7950be123cac80cd078e5ac1013 ] And allow instrumentation inside it because it does calls to other facilities which will not be tagged noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Allow instrumentation during task work queueingBorislav Petkov1-0/+11
[ Upstream commit 4fbce464db81a42f9a57ee242d6150ec7f996415 ] Fixes vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-14x86/mce: Avoid infinite loop for copy from user recoveryTony Luck1-11/+32
There are two cases for machine check recovery: 1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison. 2) The machine check was triggered in kernel code that is covered by an exception table entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fix up code address in the exception table entry. Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item. Specifically, the work is queued using the ->mce_kill_me callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user, it enters an infinite loop processing the same entry for ever. There are some legitimate scenarios where the kernel may take a second machine check before returning to the user. 1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault. 2) Copy from user code retries a copy in byte-at-time mode to check whether any additional bytes can be copied. On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed. Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever enforce a limit of 10. [ bp: Massage commit message, drop noinstr, fix typo, extend panic messages. ] Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
2021-08-24x86/mce: Defer processing of early errorsBorislav Petkov1-3/+8
When a fatal machine check results in a system reset, Linux does not clear the error(s) from machine check bank(s) - hardware preserves the machine check banks across a warm reset. During initialization of the kernel after the reboot, Linux reads, logs, and clears all machine check banks. But there is a problem. In: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") the call to mce_register_decode_chain() moved later in the boot sequence. This means that /dev/mcelog doesn't see those early error logs. This was partially fixed by: cd9c57cad3fe ("x86/MCE: Dump MCE to dmesg if no consumers") which made sure that the logs were not lost completely by printing to the console. But parsing console logs is error prone. Users of /dev/mcelog should expect to find any early errors logged to standard places. Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early machine check initialization to indicate that any errors found should just be queued to genpool. When mcheck_late_init() is called it will call mce_schedule_work() to actually log and flush any errors queued in the genpool. [ Based on an original patch, commit message by and completely productized by Tony Luck. ] Fixes: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") Reported-by: Sumanth Kamatala <skamatala@juniper.net> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
2021-06-29mm,hwpoison: send SIGBUS with error virutal addressNaoya Horiguchi1-2/+11
Now an action required MCE in already hwpoisoned address surely sends a SIGBUS to current process, but the SIGBUS doesn't convey error virtual address. That's not optimal for hwpoison-aware applications. To fix the issue, make memory_failure() call kill_accessing_process(), that does pagetable walk to find the error virtual address. It could find multiple virtual addresses for the same error page, and it seems hard to tell which virtual address is correct one. But that's rare and sending incorrect virtual address could be better than no address. So let's report the first found virtual address for now. [naoya.horiguchi@nec.com: fix walk_page_range() return] Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jue Wang <juew@google.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-18x86: Fix various typos in commentsIngo Molnar1-1/+1
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-02-21Merge tag 'ras_updates_for_v5.12' of ↵Linus Torvalds1-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - move therm_throt.c to the thermal framework, where it belongs. - identify CPUs which miss to enter the broadcast handler, as an additional debugging aid. * tag 'ras_updates_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: thermal: Move therm_throt there from x86/mce x86/mce: Get rid of mcheck_intel_therm_init() x86/mce: Make mce_timed_out() identify holdout CPUs
2021-02-08x86/mce: Get rid of mcheck_intel_therm_init()Borislav Petkov1-1/+0
Move the APIC_LVTTHMR read which needs to happen on the BSP, to intel_init_thermal(). One less boot dependency. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
2021-01-12x86/mce: Remove explicit/superfluous tracingPeter Zijlstra1-3/+4
There's some explicit tracing left in exc_machine_check_kernel(), remove it, as it's already implied by irqentry_nmi_enter(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210106144017.719310466@infradead.org
2021-01-08x86/mce: Make mce_timed_out() identify holdout CPUsPaul E. McKenney1-1/+14
The "Timeout: Not all CPUs entered broadcast exception handler" message will appear from time to time given enough systems, but this message does not identify which CPUs failed to enter the broadcast exception handler. This information would be valuable if available, for example, in order to correlate with other hardware-oriented error messages. Add a cpumask of CPUs which maintains which CPUs have entered this handler, and print out which ones failed to enter in the event of a timeout. [ bp: Massage. ] Reported-by: Jonathan Lemon <bsd@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
2020-12-15Merge tag 'core-entry-2020-12-14' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry/exit updates from Thomas Gleixner: "A set of updates for entry/exit handling: - More generalization of entry/exit functionality - The consolidation work to reclaim TIF flags on x86 and also for non-x86 specific TIF flags which are solely relevant for syscall related work and have been moved into their own storage space. The x86 specific part had to be merged in to avoid a major conflict. - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal delivery mode of task work and results in an impressive performance improvement for io_uring. The non-x86 consolidation of this is going to come seperate via Jens. - The selective syscall redirection facility which provides a clean and efficient way to support the non-Linux syscalls of WINE by catching them at syscall entry and redirecting them to the user space emulation. This can be utilized for other purposes as well and has been designed carefully to avoid overhead for the regular fastpath. This includes the core changes and the x86 support code. - Simplification of the context tracking entry/exit handling for the users of the generic entry code which guarantee the proper ordering and protection. - Preparatory changes to make the generic entry code accomodate S390 specific requirements which are mostly related to their syscall restart mechanism" * tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) entry: Add syscall_exit_to_user_mode_work() entry: Add exit_to_user_mode() wrapper entry_Add_enter_from_user_mode_wrapper entry: Rename exit_to_user_mode() entry: Rename enter_from_user_mode() docs: Document Syscall User Dispatch selftests: Add benchmark for syscall user dispatch selftests: Add kselftest for syscall user dispatch entry: Support Syscall User Dispatch on common syscall entry kernel: Implement selective syscall userspace redirection signal: Expose SYS_USER_DISPATCH si_code type x86: vdso: Expose sigreturn address on vdso to the kernel MAINTAINERS: Add entry for common entry code entry: Fix boot for !CONFIG_GENERIC_ENTRY x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs sched: Detect call to schedule from critical entry code context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK x86: Reclaim unused x86 TI flags ...
2020-12-01x86/mce: Rename kill_it to kill_current_taskGabriele Paoloni1-8/+8
Currently, if an MCE happens in user-mode or while the kernel is copying data from user space, 'kill_it' is used to check if execution of the interrupted task can be recovered or not; the flag name however is not very meaningful, hence rename it to match its goal. [ bp: Massage commit message, rename the queue_task_work() arg too. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
2020-12-01x86/mce: Remove redundant call to irq_work_queue()Gabriele Paoloni1-3/+0
Currently, __mc_scan_banks() in do_machine_check() does the following callchain: __mc_scan_banks()->mce_log()->irq_work_queue(&mce_irq_work). Hence, the call to irq_work_queue() below after __mc_scan_banks() seems redundant. Just remove it. Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201127161819.3106432-5-gabriele.paoloni@intel.com
2020-12-01x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3Gabriele Paoloni1-1/+1
Right now for LMCE, if no_way_out is set, mce_panic() is called regardless of mca_cfg.tolerant. This is not correct as, if mca_cfg.tolerant = 3, the code should never panic. Add that check. [ bp: use local ptr 'cfg'. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201127161819.3106432-4-gabriele.paoloni@intel.com
2020-12-01x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right placesGabriele Paoloni1-11/+4
Right now, for local MCEs the machine calls panic(), if needed, right after lmce is set. For MCE broadcasting, mce_reign() takes care of calling mce_panic(). Hence: - improve readability by moving the conditional evaluation of tolerant up to when kill_it is set first; - move the mce_panic() call up into the statement where mce_end() fails. [ bp: Massage, remove comment in the mce_end() failure case because it is superfluous; use local ptr 'cfg' in both tests. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201127161819.3106432-3-gabriele.paoloni@intel.com
2020-12-01Merge tag 'v5.10-rc6' into ras/coreBorislav Petkov1-2/+4
Merge the -rc6 tag to pick up dependent changes. Signed-off-by: Borislav Petkov <bp@suse.de>
2020-11-27x86/mce: Do not overwrite no_way_out if mce_end() failsGabriele Paoloni1-2/+4
Currently, if mce_end() fails, no_way_out - the variable denoting whether the machine can recover from this MCE - is determined by whether the worst severity that was found across the MCA banks associated with the current CPU, is of panic severity. However, at this point no_way_out could have been already set by mca_start() after looking at all severities of all CPUs that entered the MCE handler. If mce_end() fails, check first if no_way_out is already set and, if so, stick to it, otherwise use the local worst value. [ bp: Massage. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
2020-11-06x86/mce: Correct the detection of invalid notifier prioritiesZhen Lei1-1/+2
Commit c9c6d216ed28 ("x86/mce: Rename "first" function as "early"") changed the enumeration of MCE notifier priorities. Correct the check for notifier priorities to cover the new range. [ bp: Rewrite commit message, remove superfluous brackets in conditional. ] Fixes: c9c6d216ed28 ("x86/mce: Rename "first" function as "early"") Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201106141216.2062-2-thunder.leizhen@huawei.com
2020-11-06x86/mce: Assign boolean values to a bool variableKaixu Xia1-2/+2
Fix the following coccinelle warnings: ./arch/x86/kernel/cpu/mce/core.c:1765:3-20: WARNING: Assignment of 0/1 to bool variable ./arch/x86/kernel/cpu/mce/core.c:1584:2-9: WARNING: Assignment of 0/1 to bool variable [ bp: Massage commit message. ] Reported-by: Tosk Robot <tencent_os_robot@tencent.com> Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/1604654363-1463-1-git-send-email-kaixuxia@tencent.com
2020-11-05x86/entry: Move nmi entry/exit into common codeThomas Gleixner1-3/+3
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's not any different on other architectures. Also the extra state type is not necessary, irqentry_state_t can carry the necessary information as well. Move it to common code and extend irqentry_state_t to carry lockdep state. [ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive between the IRQ and NMI exceptions, and add kernel documentation for struct irqentry_state_t ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
2020-10-26x86/mce: Remove unneeded breakTom Rix1-2/+0
A break is not needed if it is preceded by a return. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201019200803.17619-1-trix@redhat.com
2020-10-18task_work: cleanup notification modesJens Axboe1-1/+1
A previous commit changed the notification mode from true/false to an int, allowing notify-no, notify-yes, or signal-notify. This was backwards compatible in the sense that any existing true/false user would translate to either 0 (on notification sent) or 1, the latter which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2. Clean this up properly, and define a proper enum for the notification mode. Now we have: - TWA_NONE. This is 0, same as before the original change, meaning no notification requested. - TWA_RESUME. This is 1, same as before the original change, meaning that we use TIF_NOTIFY_RESUME. - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the notification. Clean up all the callers, switching their 0/1/false/true to using the appropriate TWA_* mode for notifications. Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-12Merge tag 'ras_updates_for_v5.10' of ↵Linus Torvalds1-53/+129
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. - Misc fixes, improvements and cleanups, as always. * tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype
2020-10-07x86/mce: Decode a kernel instruction to determine if it is copying from userTony Luck1-3/+8
All instructions copying data between kernel and user memory are tagged with either _ASM_EXTABLE_UA or _ASM_EXTABLE_CPY entries in the exception table. ex_fault_handler_type() returns EX_HANDLER_UACCESS for both of these. Recovery is only possible when the machine check was triggered on a read from user memory. In this case the same strategy for recovery applies as if the user had made the access in ring3. If the fault was in kernel memory while copying to user there is no current recovery plan. For MOV and MOVZ instructions a full decode of the instruction is done to find the source address. For MOVS instructions the source address is in the %rsi register. The function fault_in_kernel_space() determines whether the source address is kernel or user, upgrade it from "static" so it can be used here. Co-developed-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201006210910.21062-7-tony.luck@intel.com
2020-10-07x86/mce: Recover from poison found while copying from user spaceTony Luck1-7/+20
Existing kernel code can only recover from a machine check on code that is tagged in the exception table with a fault handling recovery path. Add two new fields in the task structure to pass information from machine check handler to the "task_work" that is queued to run before the task returns to user mode: + mce_vaddr: will be initialized to the user virtual address of the fault in the case where the fault occurred in the kernel copying data from a user address. This is so that kill_me_maybe() can provide that information to the user SIGBUS handler. + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe() to determine if mce_vaddr is applicable to this error. Add code to recover from a machine check while copying data from user space to the kernel. Action for this case is the same as if the user touched the poison directly; unmap the page and send a SIGBUS to the task. Use a new helper function to share common code between the "fault in user mode" case and the "fault while copying from user" case. New code paths will be activated by the next patch which sets MCE_IN_KERNEL_COPYIN. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
2020-10-07x86/mce: Pass pointer to saved pt_regs to severity calculation routinesYouquan Song1-7/+7
New recovery features require additional information about processor state when a machine check occurred. Pass pt_regs down to the routines that need it. No functional change. Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201006210910.21062-2-tony.luck@intel.com
2020-10-06x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()Dan Williams1-6/+2
In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled. Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case: On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > However now I see that copy_user_generic() works for the wrong reason. > > It works because the exception on the source address due to poison > > looks no different than a write fault on the user address to the > > caller, it's still just a short copy. So it makes copy_to_user() work > > for the wrong reason relative to the name. > > Right. > > And it won't work that way on other architectures. On x86, we have a > generic function that can take faults on either side, and we use it > for both cases (and for the "in_user" case too), but that's an > artifact of the architecture oddity. > > In fact, it's probably wrong even on x86 - because it can hide bugs - > but writing those things is painful enough that everybody prefers > having just one function. Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel(). Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch. One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks. [ bp: Massage a bit. ] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
2020-09-30x86/mce: Use idtentry_nmi_enter/exit()Thomas Gleixner1-2/+4
The recent fix for NMI vs. IRQ state tracking missed to apply the cure to the MCE handler. Fixes: ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/87mu17ism2.fsf@nanos.tec.linutronix.de
2020-09-18x86/mce: Annotate mce_rd/wrmsrl() with noinstrBorislav Petkov1-6/+21
They do get called from the #MC handler which is already marked "noinstr". Commit e2def7d49d08 ("x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR") already got rid of the instrumentation in the MSR accessors, fix the annotation now too, in order to get rid of: vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200915194020.28807-1-bp@alien8.de
2020-09-14x86/mce: Stop mce_reign() from re-computing severity for every CPUTony Luck1-8/+8
Back in commit: 20d51a426fe9 ("x86/mce: Reuse one of the u16 padding fields in 'struct mce'") a field was added to "struct mce" to save the computed error severity. Make use of this in mce_reign() to avoid re-computing the severity for every CPU. In the case where the machine panics, one call to mce_severity() is still needed in order to provide the correct message giving the reason for the panic. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200908175519.14223-2-tony.luck@intel.com
2020-09-11x86/mce: Make mce_rdmsrl() panic on an inaccessible MSRBorislav Petkov1-12/+60
If an exception needs to be handled while reading an MSR - which is in most of the cases caused by a #GP on a non-existent MSR - then this is most likely the incarnation of a BIOS or a hardware bug. Such bug violates the architectural guarantee that MCA banks are present with all MSRs belonging to them. The proper fix belongs in the hardware/firmware - not in the kernel. Handling an #MC exception which is raised while an NMI is being handled would cause the nasty NMI nesting issue because of the shortcoming of IRET of reenabling NMIs when executed. And the machine is in an #MC context already so <Deity> be at its side. Tracing MSR accesses while in #MC is another no-no due to tracing being inherently a bad idea in atomic context: vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section so remove all that "additional" functionality from mce_rdmsrl() and provide it with a special exception handler which panics the machine when that MSR is not accessible. The exception handler prints a human-readable message explaining what the panic reason is but, what is more, it panics while in the #GP handler and latter won't have executed an IRET, thus opening the NMI nesting issue in the case when the #MC has happened while handling an NMI. (#MC itself won't be reenabled until MCG_STATUS hasn't been cleared). Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> [ Add missing prototypes for ex_handler_* ] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200906212130.GA28456@zn.tnic
2020-08-26x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()Tony Luck1-5/+4
A long time ago, Linux cleared IA32_MCG_STATUS at the very end of machine check processing. Then, some fancy recovery and IST manipulation was added in: d4812e169de4 ("x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks") and clearing IA32_MCG_STATUS was pulled earlier in the function. Next change moved the actual recovery out of do_machine_check() and just used task_work_add() to schedule it later (before returning to the user): 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Most recently the fancy IST footwork was removed as no longer needed: b052df3da821 ("x86/entry: Get rid of ist_begin/end_non_atomic()") At this point there is no reason remaining to clear IA32_MCG_STATUS early. It can move back to the very end of the function. Also move sync_core(). The comments for this function say that it should only be called when instructions have been changed/re-mapped. Recovery for an instruction fetch may change the physical address. But that doesn't happen until the scheduled work runs (which could be on another CPU). [ bp: Massage commit message. ] Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200824221237.5397-1-tony.luck@intel.com
2020-08-05Merge tag 'x86-entry-2020-08-04' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 conversion to generic entry code from Thomas Gleixner: "The conversion of X86 syscall, interrupt and exception entry/exit handling to the generic code. Pretty much a straight-forward 1:1 conversion plus the consolidation of the KVM handling of pending work before entering guest mode" * tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() x86/kvm: Use generic xfer to guest work function x86/entry: Cleanup idtentry_enter/exit x86/entry: Use generic interrupt entry/exit code x86/entry: Cleanup idtentry_entry/exit_user x86/entry: Use generic syscall exit functionality x86/entry: Use generic syscall entry function x86/ptrace: Provide pt_regs helper for entry/exit x86/entry: Move user return notifier out of loop x86/entry: Consolidate 32/64 bit syscall entry x86/entry: Consolidate check_user_regs() x86: Correct noinstr qualifiers x86/idtentry: Remove stale comment
2020-08-04Merge tag 'ras-core-2020-08-03' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Ingo Molnar: "Boris is on vacation and he asked us to send you the pending RAS bits: - Print the PPIN field on CPUs that fill them out - Fix an MCE injection bug - Simplify a kzalloc in dev_mcelog_init_device()" * tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce, EDAC/mce_amd: Print PPIN in machine check records x86/mce/dev-mcelog: Use struct_size() helper in kzalloc() x86/mce/inject: Fix a wrong assignment of i_mce.status
2020-07-27x86/cpu: Relocate sync_core() to sync_core.hRicardo Neri1-0/+1
Having sync_core() in processor.h is problematic since it is not possible to check for hardware capabilities via the *cpu_has() family of macros. The latter needs the definitions in processor.h. It also looks more intuitive to relocate the function to sync_core.h. This changeset does not make changes in functionality. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calderon@linux.intel.com
2020-07-24x86/entry: Cleanup idtentry_entry/exit_userThomas Gleixner1-2/+2
Cleanup the temporary defines and use irqentry_ instead of idtentry_. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220520.602603691@linutronix.de
2020-07-24x86: Correct noinstr qualifiersIra Weiny1-1/+1
The noinstr qualifier is to be specified before the return type in the same way inline is used. These 2 cases were missed by previous patches. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200723161405.852613-1-ira.weiny@intel.com
2020-07-05Merge tag 'x86-urgent-2020-07-05' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A series of fixes for x86: - Reset MXCSR in kernel_fpu_begin() to prevent using a stale user space value. - Prevent writing MSR_TEST_CTRL on CPUs which are not explicitly whitelisted for split lock detection. Some CPUs which do not support it crash even when the MSR is written to 0 which is the default value. - Fix the XEN PV fallout of the entry code rework - Fix the 32bit fallout of the entry code rework - Add more selftests to ensure that these entry problems don't come back. - Disable 16 bit segments on XEN PV. It's not supported because XEN PV does not implement ESPFIX64" * tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ldt: Disable 16-bit segments on Xen PV x86/entry/32: Fix #MC and #DB wiring on x86_32 x86/entry/xen: Route #DB correctly on Xen PV x86/entry, selftests: Further improve user entry sanity checks x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER selftests/x86: Consolidate and fix get/set_eflags() helpers selftests/x86/syscall_nt: Clear weird flags after each test selftests/x86/syscall_nt: Add more flag combinations x86/entry/64/compat: Fix Xen PV SYSENTER frame setup x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into C x86/entry: Assert that syscalls are on the right stack x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted x86/fpu: Reset MXCSR to default in kernel_fpu_begin()
2020-07-04x86/entry/32: Fix #MC and #DB wiring on x86_32Andy Lutomirski1-1/+3
DEFINE_IDTENTRY_MCE and DEFINE_IDTENTRY_DEBUG were wired up as non-RAW on x86_32, but the code expected them to be RAW. Get rid of all the macro indirection for them on 32-bit and just use DECLARE_IDTENTRY_RAW and DEFINE_IDTENTRY_RAW directly. Also add a warning to make sure that we only hit the _kernel paths in kernel mode. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/9e90a7ee8e72fd757db6d92e1e5ff16339c1ecf9.1593795633.git.luto@kernel.org
2020-06-23x86/mce, EDAC/mce_amd: Print PPIN in machine check recordsSmita Koralahalli1-0/+2
Print the Protected Processor Identification Number (PPIN) on processors which support it. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com
2020-06-15x86/entry, cpumask: Provide non-instrumented variant of cpu_is_offline()Peter Zijlstra1-1/+1
vmlinux.o: warning: objtool: exc_nmi()+0x12: call to cpumask_test_cpu.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: mce_check_crashing_cpu()+0x12: call to cpumask_test_cpu.constprop.0()leaves .noinstr.text section cpumask_test_cpu() test_bit() instrument_atomic_read() arch_test_bit() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-06-11x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisonedTony Luck1-4/+14
An interesting thing happened when a guest Linux instance took a machine check. The VMM unmapped the bad page from guest physical space and passed the machine check to the guest. Linux took all the normal actions to offline the page from the process that was using it. But then guest Linux crashed because it said there was a second machine check inside the kernel with this stack trace: do_memory_failure set_mce_nospec set_memory_uc _set_memory_uc change_page_attr_set_clr cpa_flush clflush_cache_range_opt This was odd, because a CLFLUSH instruction shouldn't raise a machine check (it isn't consuming the data). Further investigation showed that the VMM had passed in another machine check because is appeared that the guest was accessing the bad page. Fix is to check the scope of the poison by checking the MCi_MISC register. If the entire page is affected, then unmap the page. If only part of the page is affected, then mark the page as uncacheable. This assumes that VMMs will do the logical thing and pass in the "whole page scope" via the MCi_MISC register (since they unmapped the entire page). [ bp: Adjust to x86/entry changes. ] Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()") Reported-by: Jue Wang <juew@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
2020-06-11Merge branch 'x86/entry' into ras/coreThomas Gleixner1-50/+115
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow up patches can be applied without creating a horrible merge conflict afterwards.
2020-06-11x86/entry: Rename trace_hardirqs_off_prepare()Peter Zijlstra1-1/+1
The typical pattern for trace_hardirqs_off_prepare() is: ENTRY lockdep_hardirqs_off(); // because hardware ... do entry magic instrumentation_begin(); trace_hardirqs_off_prepare(); ... do actual work trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(); instrumentation_end(); ... do exit magic lockdep_hardirqs_on(); which shows that it's named wrong, rename it to trace_hardirqs_off_finish(), as it concludes the hardirq_off transition. Also, given that the above is the only correct order, make the traditional all-in-one trace_hardirqs_off() follow suit. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200529213321.415774872@infradead.org
2020-06-11x86/entry, mce: Disallow #DB during #MCPeter Zijlstra1-0/+12
#MC is fragile as heck, don't tempt fate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200529213321.131187767@infradead.org
2020-06-11x86/entry: Move paranoid irq tracing out of ASM codeThomas Gleixner1-0/+3
The last step to remove the irq tracing cruft from ASM. Ignore #DF as the maschine is going to die anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20200521202120.414043330@linutronix.de
2020-06-11x86/idtentry: Switch to conditional RCU handlingThomas Gleixner1-2/+2
Switch all idtentry_enter/exit() users over to the new conditional RCU handling scheme and make the user mode entries in #DB, #INT3 and #MCE use the user mode idtentry functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20200521202117.382387286@linutronix.de
2020-06-11x86/mce: Address objtools noinstr complaintsThomas Gleixner1-5/+15
Mark the relevant functions noinstr, use the plain non-instrumented MSR accessors. The only odd part is the instrumentation_begin()/end() pair around the indirect machine_check_vector() call as objtool can't figure that out. The possible invoked functions are annotated correctly. Also use notrace variant of nmi_enter/exit(). If MCEs happen then hardware latency tracing is the least of the worries. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200505135315.476734898@linutronix.de