From edfc8730ba45eac3cca20dba3799d6ae6c584b56 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Thu, 30 Sep 2021 11:44:51 +0200
Subject: ABI: sysfs-mce: add a new ABI file

Reduce the gap of missing ABIs for Intel servers with MCE
by adding a new ABI file.

The contents of this file comes from:
	Documentation/x86/x86_64/machinecheck.rst

Reviewed-by: Andi Kleen
Signed-off-by: Mauro Carvalho Chehab
Link: https://lore.kernel.org/r/801a26985e32589eb78ba4b728d3e19fdea18f04.1632994837.git.mchehab+huawei@kernel.org
Signed-off-by: Greg Kroah-Hartman
---
 Documentation/x86/x86_64/machinecheck.rst | 56 ++-----------------------------
 1 file changed, 2 insertions(+), 54 deletions(-)

(limited to 'Documentation/x86')

diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst
index b402e04bee60..cea12ee97200 100644
--- a/Documentation/x86/x86_64/machinecheck.rst
+++ b/Documentation/x86/x86_64/machinecheck.rst
@@ -21,60 +21,8 @@ from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
 Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
 (N = CPU number).
 
-The directory contains some configurable entries:
-
-bankNctl
-	(N bank number)
-
-	64bit Hex bitmask enabling/disabling specific subevents for bank N
-	When a bit in the bitmask is zero then the respective
-	subevent will not be reported.
-	By default all events are enabled.
-	Note that BIOS maintain another mask to disable specific events
-	per bank. This is not visible here
-
-The following entries appear for each CPU, but they are truly shared
-between all CPUs.
-
-check_interval
-	How often to poll for corrected machine check errors, in seconds
-	(Note output is hexadecimal). Default 5 minutes. When the poller
-	finds MCEs it triggers an exponential speedup (poll more often) on
-	the polling interval. When the poller stops finding MCEs, it
-	triggers an exponential backoff (poll less often) on the polling
-	interval. The check_interval variable is both the initial and
-	maximum polling interval. 0 means no polling for corrected machine
-	check errors (but some corrected errors might be still reported
-	in other ways)
-
-tolerant
-	Tolerance level. When a machine check exception occurs for a non
-	corrected machine check the kernel can take different actions.
-	Since machine check exceptions can happen any time it is sometimes
-	risky for the kernel to kill a process because it defies
-	normal kernel locking rules. The tolerance level configures
-	how hard the kernel tries to recover even at some risk of
-	deadlock. Higher tolerant values trade potentially better uptime
-	with the risk of a crash or even corruption (for tolerant >= 3).
-
-	0: always panic on uncorrected errors, log corrected errors
-	1: panic or SIGBUS on uncorrected errors, log corrected errors
-	2: SIGBUS or log uncorrected errors, log corrected errors
-	3: never panic or SIGBUS, log all errors (for testing only)
-
-	Default: 1
-
-	Note this only makes a difference if the CPU allows recovery
-	from a machine check exception. Current x86 CPUs generally do not.
-
-trigger
-	Program to run when a machine check event is detected.
-	This is an alternative to running mcelog regularly from cron
-	and allows to detect events faster.
-monarch_timeout
-	How long to wait for the other CPUs to machine check too on a
-	exception. 0 to disable waiting for other CPUs.
-	Unit: us
+The directory contains some configurable entries. See
+Documentation/ABI/testing/sysfs-mce for more details.
 
 TBD document entries for AMD threshold interrupt configuration
 
-- 
cgit v1.2.3