author: Yazen Ghannam <yazen.ghannam@amd.com> 2024-02-14 06:35:16 +0300
committer: Borislav Petkov (AMD) <bp@alien8.de> 2024-02-20 20:56:15 +0300
commit: 6f15e617cc99323339dc241d19956f0d640c4354
tree: 8144508a6e57975c6ad9d5eca4301647e79224bb /drivers/ras/Kconfig
parent: 3b566b30b41401888ee0e8eb904a1e7a6693794b
RAS: Introduce a FRU memory poison manager

Memory errors are an expected occurrence on systems with high memory density. Generally, errors within a small number of unique physical locations are acceptable, based on manufacturer and/or admin policy. During run time, memory with errors may be retired so it is no longer used by the system. This is done in mm through page poisoning, and the effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time error handling may occur in the next reboot cycle, leading to terminating jobs due to that already known bad memory. This could be prevented if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent storage. Their driver saves memory error information to the persistent storage during run time. The information is then restored after reset, and known bad memory will be retired before the hardware is used. A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should leverage industry-standard components as much as possible, rather than a bespoke platform driver.

Two components are needed: a record format and a persistent storage interface.

Implement a new module to manage the record formats on persistent storage. Use the requirements for an AMD MI300-based system to start. Vendor- and platform-specific details can be abstracted later as needed.

  [ bp: Massage commit message and code, squash 30-ish more fixes from Yazen and me. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: <naveenkrishna.chatradhi@amd.com>
Signed-off-by: <naveenkrishna.chatradhi@amd.com>
Co-developed-by: <muralidhara.mk@amd.com>
Signed-off-by: <muralidhara.mk@amd.com>
Tested-by: <sathyapriya.k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
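
Editorial illustration (not part of the commit): the save-and-restore flow described in the message can be sketched in user-space C. This is only a minimal model under stated assumptions, not the code this commit adds. `struct poison_record`, `record_add()`, `record_restore()` and `MAX_ENTRIES` are hypothetical names, and the in-memory struct stands in for a record that the real module would keep on persistent storage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on logged locations; a real limit would come from policy. */
#define MAX_ENTRIES 8

/* Stand-in for one record held on persistent storage (illustrative layout). */
struct poison_record {
	uint32_t nr_entries;
	uint64_t addr[MAX_ENTRIES];
};

/*
 * Run time: log a newly reported bad address.
 * Returns true if the record changed and would need to be written back.
 */
bool record_add(struct poison_record *rec, uint64_t addr)
{
	for (uint32_t i = 0; i < rec->nr_entries; i++) {
		if (rec->addr[i] == addr)
			return false;	/* already known, nothing to save */
	}

	if (rec->nr_entries >= MAX_ENTRIES)
		return false;		/* log full; policy decides what happens next */

	rec->addr[rec->nr_entries++] = addr;
	return true;
}

/*
 * Boot time: copy every logged address into the caller's retire list so the
 * memory can be taken out of use before jobs run on it.
 * Returns the number of addresses copied.
 */
size_t record_restore(const struct poison_record *rec, uint64_t *retire, size_t len)
{
	size_t n = 0;

	while (n < rec->nr_entries && n < len) {
		retire[n] = rec->addr[n];
		n++;
	}
	return n;
}
```

In this model, `record_add()` is called on each run-time error and `record_restore()` once after reset. Logging only unique addresses is what keeps the running log small across multiple resets.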
Diffstat (limited to 'drivers/ras/Kconfig')
-rw-r--r-- drivers/ras/Kconfig 12
1 files changed, 12 insertions, 0 deletions
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index 2e969f59c0ca..fc4f4bb94a4c 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -34,4 +34,16 @@ if RAS
 source "arch/x86/ras/Kconfig"
 source "drivers/ras/amd/atl/Kconfig"
 
+config RAS_FMPM
+	tristate "FRU Memory Poison Manager"
+	default m
+	depends on AMD_ATL && ACPI_APEI
+	help
+	  Support saving and restoring memory error information across reboot
+	  using ACPI ERST as persistent storage. Error information is saved with
+	  the UEFI CPER "FRU Memory Poison" section format.
+
+	  Memory will be retired during boot time and run time depending on
+	  platform-specific policies.
+
 endif
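
Editorial illustration (not part of the commit): a `.config` fragment along the following lines should satisfy the dependencies visible in the hunk. It assumes a configuration where `RAS`, `ACPI_APEI` and `AMD_ATL` are themselves selectable. Those symbols have further dependencies that this diff does not show.

```
# The new entry sits inside the "if RAS" block
CONFIG_RAS=y
# Both are required by "depends on AMD_ATL && ACPI_APEI"
CONFIG_ACPI_APEI=y
CONFIG_AMD_ATL=m
# With "default m", this is the value Kconfig proposes once the dependencies are met
CONFIG_RAS_FMPM=m
```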