path: root/drivers/edac/amd64_edac.c
Age | Commit message | Author | Files | Lines
2024-03-12 | Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras | Linus Torvalds | 1 | -277/+9
Pull EDAC updates from Borislav Petkov:

 - Add a FRU (Field Replaceable Unit) memory poison manager which collects and manages previously encountered hw errors in order to save them to persistent storage across reboots. Previously recorded errors are "replayed" upon reboot in order to poison memory which has caused said errors in the past. The main use case is stacked, on-chip memory which cannot simply be replaced so poisoning faulty areas of it and thus making them inaccessible is the only strategy to prolong its lifetime.

 - Add an AMD address translation library glue which converts the reported addresses of hw errors into system physical addresses in order to be used by other subsystems like memory failure, for example. Add support for MI300 accelerators to that library.

 - igen6: Add support for Alder Lake-N SoC

 - i10nm: Add Grand Ridge support

 - The usual fixlets and cleanups

* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/versal: Convert to platform remove callback returning void
  RAS/AMD/FMPM: Fix off by one when unwinding on error
  RAS/AMD/FMPM: Add debugfs interface to print record entries
  RAS/AMD/FMPM: Save SPA values
  RAS: Export helper to get ras_debugfs_dir
  RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
  RAS: Introduce a FRU memory poison manager
  RAS/AMD/ATL: Add MI300 row retirement support
  Documentation: Move RAS section to admin-guide
  EDAC/versal: Make the bit position of injected errors configurable
  EDAC/i10nm: Add Intel Grand Ridge micro-server support
  EDAC/igen6: Add one more Intel Alder Lake-N SoC support
  RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
  RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
  RAS/AMD/ATL: Add MI300 support
  Documentation: RAS: Add index and address translation section
  EDAC/amd64: Use new AMD Address Translation Library
  RAS: Introduce AMD Address Translation Library
  EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-02-16 | x86/cpu/amd: Provide a separate accessor for Node ID | Thomas Gleixner | 1 | -2/+2
AMD (ab)uses topology_die_id() to store the Node ID information and topology_max_dies_per_pkg to store the number of nodes per package. This collides with the proper processor die level enumeration which is coming on AMD with CPUID 8000_0026, unless there is a correlation between the two. There is zero documentation about that. So provide new storage and new accessors which for now still access die_id and topology_max_die_per_pkg(). Will be mopped up after AMD and HYGON are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Wang Wendy <wendy.wang@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20240212153624.956116738@linutronix.de
2024-01-24 | EDAC/amd64: Use new AMD Address Translation Library | Yazen Ghannam | 1 | -277/+9
Remove old address translation code and use the new AMD Address Translation Library. Use "imply" in Kconfig so that the "AMD_ATL" config option takes the value of "EDAC_AMD64" as its default. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240123041401.79812-3-yazen.ghannam@amd.com
2023-11-29 | EDAC/amd64: Add support for family 0x19, models 0x90-9f devices | Muralidhara M K | 1 | -18/+48
AMD Models 90h-9fh are APUs. They have built-in HBM3 memory. ECC support is enabled by default. APU models have a single Data Fabric (DF) per Package. Each DF is visible to the OS in the same way as chiplet-based systems like Zen2 CPUs and later. However, the Unified Memory Controllers (UMCs) are arranged in the same way as GPU-based MI200 devices rather than CPU-based systems. Use the existing gpu_ops for heterogeneous systems to support enumeration of nodes and memory topology with a few fixups. [ bp: Massage comments. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231102114225.2006878-5-muralimk@amd.com
2023-08-10 | EDAC/amd64: Add support for AMD family 1Ah models 00h-1Fh and 40h-4Fh | Avadhut Naik | 1 | -0/+15
Add support for family 1Ah-based models 00h-1Fh and 40h-4Fh. [ bp: Simplify. ] Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230809035244.2722455-4-avadhut.naik@amd.com
2023-06-27 | Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -31/+355
Pull RAS updates from Borislav Petkov:

 - Add initial support for RAS hardware found on AMD server GPUs (MI200). Those GPUs and CPUs are connected together through the coherent fabric and the GPU memory controllers report errors through x86's MCA so EDAC needs to support them. The amd64_edac driver now supports HBM (High Bandwidth Memory) and thus such heterogeneous memory controller systems

 - Other small cleanups and improvements

* tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/amd64: Cache and use GPU node map
  EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
  EDAC/amd64: Document heterogeneous system enumeration
  x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
  x86/amd_nb: Re-sort and re-indent PCI defines
  x86/amd_nb: Add MI200 PCI IDs
  ras/debugfs: Fix error checking for debugfs_create_dir()
  x86/MCE: Check a hw error's address to determine proper recovery action
2023-06-19 | EDAC/amd64: Cache and use GPU node map | Yazen Ghannam | 1 | -0/+76
AMD systems have historically provided an "AMD Node ID" that is a unique identifier for each die in a multi-die package. This was associated with a unique instance of the AMD Northbridge on a legacy system. And now it is associated with a unique instance of the AMD Data Fabric on modern systems. Each instance is referred to as a "Node"; this is an AMD-specific term not to be confused with NUMA nodes.

The data fabric provides a number of interfaces accessible through a set of functions in a single PCI device. There is one PCI device per Data Fabric (AMD Node), and multi-die systems will see multiple such PCI devices. The AMD Node ID matches a Node's position in the PCI hierarchy. For example, Node 0 is accessed using the first PCI device, Node 1 is accessed using the second, and so on. A logical CPU can find its AMD Node ID using CPUID. Furthermore, the AMD Node ID is used within the hardware fabric, so it is not purely a logical value.

Heterogeneous AMD systems, with a CPU Data Fabric connected to GPU data fabrics, follow a similar convention. Each CPU and GPU die has a unique AMD Node ID value, and each Node ID corresponds to PCI devices in sequential order. However, there are two caveats:

1) GPUs are not x86, and they don't have CPUID to read their AMD Node ID like on CPUs. This means the value is more implicit and based on PCI enumeration and hardware-specifics.

2) There is a gap in the hardware values for AMD Node IDs. Values 0-7 are for CPUs and values 8-15 are for GPUs. For example, a system with one CPU die and two GPU dies will have the following values:

  CPU0 -> AMD Node 0
  GPU0 -> AMD Node 8
  GPU1 -> AMD Node 9

EDAC is the only subsystem where this has a practical effect. Memory errors on AMD systems are commonly reported through MCA to a CPU on the local AMD Node. The error information is passed along to EDAC where the AMD EDAC modules use the AMD Node ID of the reporting logical CPU to access AMD Node information.
However, memory errors from a GPU die will be reported to the CPU die. Therefore, the logical CPU's AMD Node ID can't be used since it won't match the AMD Node ID of the GPU die. The AMD Node ID of the GPU die is provided as part of the MCA information, and the value will match the hardware enumeration (e.g. 8-15). Handle this situation by discovering GPU dies the same way as CPU dies in the AMD NB code. But do a "node id" fixup in AMD64 EDAC where it's needed. The GPU data fabrics provide a register with the base AMD Node ID for their local "type", i.e. GPU data fabric. This value is the same for all fabrics of the same type in a system. Read and cache the base AMD Node ID from one of the GPU devices during module initialization. Use this to fixup the "node id" when reporting memory errors at runtime. [ bp: Squash a fix making gpu_node_map static as reported by Tom Rix <trix@redhat.com>. Link: https://lore.kernel.org/r/20230610210930.174074-1-trix@redhat.com ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-6-muralimk@amd.com
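The "node id" fixup described in this commit message can be illustrated with a small user-space sketch. It is not the driver's code: the struct, its field names and the function name are hypothetical, and the mapping (GPU nodes follow directly after the CPU nodes in sequential order) is an assumption drawn from the CPU0/GPU0/GPU1 example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache of the GPU node layout, filled once at module init. */
struct gpu_node_map_sketch {
	uint8_t base_node_id;   /* first hardware AMD Node ID used by GPU dies, e.g. 8 */
	uint8_t cpu_node_count; /* assumed number of CPU dies enumerated before the GPUs */
};

/*
 * Convert the hardware AMD Node ID reported with an error into the
 * sequential index that matches PCI enumeration order.
 */
static int hw_node_to_logical_sketch(const struct gpu_node_map_sketch *map,
				     uint8_t hw_nid)
{
	assert(map != NULL);

	/* IDs below the GPU base belong to CPU dies and need no fixup. */
	if (hw_nid < map->base_node_id)
		return hw_nid;

	/* Close the gap between the last CPU node and the first GPU node. */
	return hw_nid - map->base_node_id + map->cpu_node_count;
}
```

With one CPU die and a GPU base of 8, hardware IDs 0, 8 and 9 map to sequential indexes 0, 1 and 2 in this sketch.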
2023-06-05 | EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh | Muralidhara M K | 1 | -31/+279
AMD Family 19h Model 30h-3Fh systems can be connected to AMD MI200 accelerator/GPU devices such that the CPU and GPU data fabrics are connected together. In this configuration, the CPU manages error logging and reporting for MCA banks located on the GPUs. This includes HBM memory errors reported from Unified Memory Controllers (UMCs) on the GPUs. The GPU memory errors are handled like CPU memory errors. AMD CPU UMC support in EDAC can be re-used for GPU UMC support. However, keeping them separate means drastic changes in one path (e.g. to support newer products) should have less impact on the other path. Also, simplify the "gpu_" helper functions where possible. GPU product configuration, like memory type and channel count, is fixed compared to CPU products. GPU UMCs each have four physical connections (phys) connected to eight channels. There is a single "chip select". This differs from CPUs where each UMC has one physical connection connected to one channel, and each channel has up to four "chip selects". Enumerate each UMC "phy" as an EDAC CSROW, since there is only a single chip select for each physical connection. This is similar to how a CPU UMC "phy" is enumerated as an EDAC CHANNEL, since there is only a single channel for each physical connection. Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-5-muralimk@amd.com
2023-05-15 | EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh | Hristo Venev | 1 | -0/+8
Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels instead of 12.

With two 32GB dual-rank DIMMs the sizes appear to be reported correctly:

  EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
  EDAC amd64: F19h_M60h detected (node 0).
  EDAC MC: UMC0 chip selects:
  EDAC amd64: MC: 0: 0MB 1: 0MB
  EDAC amd64: MC: 2: 16384MB 3: 16384MB
  EDAC MC: UMC1 chip selects:
  EDAC amd64: MC: 0: 0MB 1: 0MB
  EDAC amd64: MC: 2: 16384MB 3: 16384MB
  AMD64 EDAC driver v3.5.0

ECC errors can also be detected:

  mce: [Hardware Error]: Machine check events logged
  [Hardware Error]: Corrected error, no action required.
  [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
  [Hardware Error]: Error Addr: 0x00000007ff7e93c0
  [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
  [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
  EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
  [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

According to Mario Limonciello, the same code should also work for models 70h-7Fh (follow thread in Link).

[ bp: Massage, the translation logic updates are pending. ]

Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20230425201239.324476-1-hristo@venev.name
Link: https://lore.kernel.org/r/20230511174506.875153-2-hristo@venev.name
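The flag list printed next to MC21_STATUS in the log above can be reproduced from the raw value with a few bit tests. The following is a minimal sketch, not the kernel's decoder: the struct and function names are hypothetical, and the bit positions are assumptions based on the commonly documented AMD MCA_STATUS layout (Val 63, Over 62, UC 61, MiscV 59, AddrV 58, SyndV 53, CECC 46).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* Assumed MCA_STATUS bit positions; check the processor reference before relying on them. */
#define MCA_STATUS_VAL   BIT_ULL(63)
#define MCA_STATUS_OVER  BIT_ULL(62)
#define MCA_STATUS_UC    BIT_ULL(61)
#define MCA_STATUS_MISCV BIT_ULL(59)
#define MCA_STATUS_ADDRV BIT_ULL(58)
#define MCA_STATUS_SYNDV BIT_ULL(53)
#define MCA_STATUS_CECC  BIT_ULL(46)

/* Hypothetical container for the flags shown in the log line. */
struct mca_status_flags {
	bool valid, overflow, uncorrected, misc_valid, addr_valid, synd_valid, cecc;
};

static struct mca_status_flags decode_mca_status_sketch(uint64_t status)
{
	struct mca_status_flags f = {
		.valid       = !!(status & MCA_STATUS_VAL),
		.overflow    = !!(status & MCA_STATUS_OVER),
		.uncorrected = !!(status & MCA_STATUS_UC),    /* clear means "CE" */
		.misc_valid  = !!(status & MCA_STATUS_MISCV),
		.addr_valid  = !!(status & MCA_STATUS_ADDRV),
		.synd_valid  = !!(status & MCA_STATUS_SYNDV), /* MCA_SYND holds valid data */
		.cecc        = !!(status & MCA_STATUS_CECC),
	};

	return f;
}
```

Feeding it 0xdc2040000400011b yields a valid, overflowed, corrected error with MiscV, AddrV, SyndV and CECC set, which matches the bracketed flags in the log.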
2023-05-10 | EDAC/amd64: Remove module version string | Yazen Ghannam | 1 | -3/+1
The AMD64 EDAC module version information is not exposed through ABI like MODULE_VERSION(). Instead it is printed during module init. Version numbers can be confusing in cases where module updates are partly backported resulting in a difference between upstream and backported module versions. Remove the AMD64 EDAC module version information to avoid user confusion. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230410190959.3367528-1-yazen.ghannam@amd.com
2023-04-24 | Merge branches 'edac-drivers', 'edac-amd64' and 'edac-misc' into edac-updates | Borislav Petkov (AMD) | 1 | -573/+447
Combine all queued EDAC changes for submission into v6.4:

* ras/edac-drivers:
  EDAC/i10nm: Add Intel Sierra Forest server support
  EDAC/skx: Fix overflows on the DRAM row address mapping arrays

* ras/edac-amd64: (27 commits)
  EDAC/amd64: Fix indentation in umc_determine_edac_cap()
  EDAC/amd64: Add get_err_info() to pvt->ops
  EDAC/amd64: Split dump_misc_regs() into dct/umc functions
  EDAC/amd64: Split init_csrows() into dct/umc functions
  EDAC/amd64: Split determine_edac_cap() into dct/umc functions
  EDAC/amd64: Rename f17h_determine_edac_ctl_cap()
  EDAC/amd64: Split setup_mci_misc_attrs() into dct/umc functions
  EDAC/amd64: Split ecc_enabled() into dct/umc functions
  EDAC/amd64: Split read_mc_regs() into dct/umc functions
  EDAC/amd64: Split determine_memory_type() into dct/umc functions
  EDAC/amd64: Split read_base_mask() into dct/umc functions
  EDAC/amd64: Split prep_chip_selects() into dct/umc functions
  EDAC/amd64: Rework hw_info_{get,put}
  EDAC/amd64: Merge struct amd64_family_type into struct amd64_pvt
  EDAC/amd64: Do not discover ECC symbol size for Family 17h and later
  EDAC/amd64: Drop dbam_to_cs() for Family 17h and later
  EDAC/amd64: Split get_csrow_nr_pages() into dct/umc functions
  EDAC/amd64: Rename debug_display_dimm_sizes()

* ras/edac-misc:
  EDAC/altera: Remove MODULE_LICENSE in non-module
  EDAC: Sanitize MODULE_AUTHOR strings
  EDAC/amd81[13]1: Remove trailing newline from MODULE_AUTHOR
  EDAC/i5100: Fix typo in comment
  EDAC/altera: Remove redundant error logging

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-04-04 | EDAC/amd64: Fix indentation in umc_determine_edac_cap() | Yang Li | 1 | -10/+10
Use consistent indentation to improve the readability and fix: drivers/edac/amd64_edac.c:1279 umc_determine_edac_cap() warn: inconsistent indenting Fixes: f6a4b4a1aa16 ("EDAC/amd64: Split determine_edac_cap() into dct/umc functions") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230404022557.46409-1-yang.lee@linux.alibaba.com
2023-03-28 | EDAC: Sanitize MODULE_AUTHOR strings | Borislav Petkov (AMD) | 1 | -4/+2
Fixup the remaining MODULE_AUTHOR strings to not contain newlines. Shorten and unbreak others. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230328134309.23159-1-bp@alien8.de
2023-03-24 | EDAC/amd64: Add get_err_info() to pvt->ops | Muralidhara M K | 1 | -5/+8
GPU Nodes will use a different method to determine the chip select and channel of an error. A function pointer should be used rather than introduce another branching condition. Prepare for this by adding get_err_info() to pvt->ops. This function is only called from the modern code path, so a legacy function is not defined. Make sure to call this after MCA_STATUS[SyndV] is checked, since the csrow value is found in MCA_SYND. [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-23-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split dump_misc_regs() into dct/umc functions | Muralidhara M K | 1 | -13/+6
Add a function pointer to pvt->ops. No functional change is intended. [ Yazen: Rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-22-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split init_csrows() into dct/umc functions | Muralidhara M K | 1 | -16/+7
Call them from their respective setup_mci_misc_attrs() paths. Also, drop the check for an "empty" device, i.e. one without memory. This is redundant and already done in instance_has_memory() earlier in the init path. No functional change is intended. [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-21-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split determine_edac_cap() into dct/umc functions | Muralidhara M K | 1 | -13/+17
Call them from their respective setup_mci_misc_attrs() paths. No functional change is intended. [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-20-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Rename f17h_determine_edac_ctl_cap() | Yazen Ghannam | 1 | -2/+2
...to match the "umc_" prefix convention. No functional change is intended. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-19-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split setup_mci_misc_attrs() into dct/umc functions | Muralidhara M K | 1 | -13/+24
The init_one_instance() path is shared between legacy and modern systems. So add the new functions to a function pointer in pvt->ops. No functional change is intended. [ Yazen: Rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-18-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split ecc_enabled() into dct/umc functions | Muralidhara M K | 1 | -30/+39
Call them using a function pointer in pvt->ops. The "ECC enabled" check is done outside of the hardware information gathering done in hw_info_get(). So a high-level function pointer is needed to separate the legacy and modern paths. No functional change is intended. [Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-17-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split read_mc_regs() into dct/umc functions | Muralidhara M K | 1 | -13/+4
Call them from their respective hw_info_get() paths. ECC symbol size is not needed on UMC systems, so determine_ecc_sym_sz() is left out of the UMC path. Do not save TOP_MEM* values on modern controllers because they're not needed there (read: they were used only for debugging, if anything). [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-16-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split determine_memory_type() into dct/umc functions | Muralidhara M K | 1 | -9/+6
Call them from their respective hw_info_get() paths. Call them after all other hardware registers have been saved, since the memory type for a device will be determined based on the saved information. [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-15-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split read_base_mask() into dct/umc functions | Muralidhara M K | 1 | -6/+4
Call them from their respective hw_info_get() paths. Call the new functions after setting the chip select base and mask counts, since those are needed to read the correct number of chip select base and mask registers. And call the new functions before the remaining set up, because the base and mask register values will be needed later. [ Yazen: Rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-14-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split prep_chip_selects() into dct/umc functions | Muralidhara M K | 1 | -11/+13
Call them from their respective hw_info_get() function. Avoid the need for family/model-based function pointers. Add the calls before reading hardware registers from the memory controllers, since the number of chip select bases and masks needs to be known first. [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-13-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Rework hw_info_{get,put} | Yazen Ghannam | 1 | -45/+33
The bulk of system-specific information is gathered at init time with hw_info_get(). This function calls a number of helper functions, and many of these helper functions are split between a modern UMC/DF path and a legacy DCT path. Split hw_info_get() into legacy and modern versions. This creates two separate code paths early on, and legacy and modern helper functions can be called directly in the appropriate code path. Also, simplify hw_info_put() and share it between legacy and modern systems. NULL pointer checks are done in pci_dev_put() and kfree(), so they can be called unconditionally. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-12-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Merge struct amd64_family_type into struct amd64_pvt | Muralidhara M K | 1 | -195/+113
Future AMD systems will support heterogeneous "AMD Node" types, e.g. CPU and GPU types. Therefore, a global family type shared across all AMD nodes is no longer appropriate. Move struct low_ops routines and members of struct amd64_family_type to struct amd64_pvt. Currently, there are many code branches that split between "modern" and "legacy" systems. Another code branch will be needed in order to cover GPU cases. However, rather than introduce another branching case in multiple functions, the current branching code should be switched to a set of function pointers. This change makes the code more readable and simplifies adding support for new families/models. In order to reuse code, define two sets of function pointers. Use one for modern systems (Family 17h and later). This will not change between current CPU families. Use another set of function pointers for legacy systems (before Family 17h). Use the Family 16h versions as default for the legacy ops since these are the latest, and adjust the function pointers as needed for older families. [ Yazen: rebased/reworked patch and reworded commit message. ] [ bp: Fix rev8 or later check. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-11-yazen.ghannam@amd.com
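The move from per-function branching to two sets of function pointers could look roughly like the sketch below. It is not the driver's code: the `_sketch` types, the `PATH_*` enum and `per_family_init_sketch()` are hypothetical. Only the op names `hw_info_get` and `ecc_enabled`, the "dct"/"umc" naming and the Family 17h boundary are taken from the commit messages in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum path_sketch { PATH_DCT, PATH_UMC };

struct pvt_sketch;

/* One op table per code path instead of "if (modern) ... else ..." in every helper. */
struct ops_sketch {
	enum path_sketch (*hw_info_get)(struct pvt_sketch *pvt);
	bool (*ecc_enabled)(struct pvt_sketch *pvt);
};

struct pvt_sketch {
	unsigned int fam;
	const struct ops_sketch *ops;
};

/* Legacy (DCT) stand-ins; real helpers would read hardware registers. */
static enum path_sketch dct_hw_info_get(struct pvt_sketch *pvt) { (void)pvt; return PATH_DCT; }
static bool dct_ecc_enabled(struct pvt_sketch *pvt) { (void)pvt; return true; }

/* Modern (UMC) stand-ins. */
static enum path_sketch umc_hw_info_get(struct pvt_sketch *pvt) { (void)pvt; return PATH_UMC; }
static bool umc_ecc_enabled(struct pvt_sketch *pvt) { (void)pvt; return true; }

static const struct ops_sketch dct_ops = {
	.hw_info_get = dct_hw_info_get,
	.ecc_enabled = dct_ecc_enabled,
};

static const struct ops_sketch umc_ops = {
	.hw_info_get = umc_hw_info_get,
	.ecc_enabled = umc_ecc_enabled,
};

/* Pick the op table once; callers then go through pvt->ops without branching. */
static void per_family_init_sketch(struct pvt_sketch *pvt, unsigned int fam)
{
	assert(pvt != NULL);
	pvt->fam = fam;
	pvt->ops = (fam >= 0x17) ? &umc_ops : &dct_ops;
}
```

Adding a third node type, such as GPUs, then means adding one more op table rather than one more branch in each helper, which is the motivation the commit message gives.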
2023-03-24 | EDAC/amd64: Do not discover ECC symbol size for Family 17h and later | Yazen Ghannam | 1 | -18/+3
The ECC symbol size was needed on legacy systems to look up the ECC syndrome. This is not needed on modern systems because the ECC syndrome is explicitly provided in the MCA information. Remove the ECC symbol size discovery code for modern UMC-based systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-10-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Drop dbam_to_cs() for Family 17h and later | Yazen Ghannam | 1 | -105/+81
The same function is used to calculate chip select size for all Zen-based family/models. Therefore, a family/model function pointer is not necessary. Drop the dbam_to_cs() function pointer for Family 17h and later systems. Also, move the Family 17h function to avoid a forward declaration. Rename it to indicate that the UMC Address Mask is used rather than the legacy DBAM value. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-9-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Split get_csrow_nr_pages() into dct/umc functions | Yazen Ghannam | 1 | -14/+26
Split get_csrow_nr_pages() into legacy and modern versions in preparation for further legacy/modern refactoring. Also, rename f17_get_cs_mode() to match the new convention. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-8-yazen.ghannam@amd.com
2023-03-24 | EDAC/amd64: Rename debug_display_dimm_sizes() | Yazen Ghannam | 1 | -65/+63
Use the "dct" and "umc" prefixes for legacy and modern versions respectively. Also, move the "dct" version to avoid a forward declaration, and fixup some checkpatch warnings in the process. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-7-yazen.ghannam@amd.com
2023-02-14 | EDAC/amd64: Shut up an -Werror,-Wsometimes-uninitialized clang false positive | Yazen Ghannam | 1 | -1/+1
Reportedly, clang cannot do interprocedural analysis: https://lore.kernel.org/r/20230213-amd64_edac-wsometimes-uninitialized-v1-1-5bde32b89e02@kernel.org and see that those arguments won't be used uninitialized. So, yeah, the code's fine even without this. Normally, such a "fix" won't be applied but that warning gets automatically enabled in -Wall builds and when CONFIG_WERROR is set in allmodconfig builds, the build fails. So shut it up with a minimal fix as this code will see more reorganization very soon. [ bp: Write commit message. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/Y%2BqdVHidnrrKvxiD@dev-arch.thelio-3990X
2023-02-09 | EDAC/amd64: Remove early_channel_count() | Yazen Ghannam | 1 | -114/+2
The early_channel_count() function seems to have been useful in the past for knowing how many EDAC mci structures to populate. However, this is no longer needed as the maximum channel count for a system is used instead. Remove the early_channel_count() helper functions and related code. Use the size of the channel layer when iterating over channel structures. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-6-yazen.ghannam@amd.com
2023-02-09 | EDAC/amd64: Remove PCI Function 0 | Yazen Ghannam | 1 | -33/+5
PCI Function 0 is used on Family 17h and later only to read the "dhar" value. This value is printed and provided through a module-specific debug sysfs file. The value is not used for any Family 17h and later code, and it does not have any apparent debug value on these systems. Remove "dhar", Function 0 PCI IDs, and all related code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-5-yazen.ghannam@amd.com
2023-02-09 | EDAC/amd64: Remove PCI Function 6 | Yazen Ghannam | 1 | -21/+1
PCI Function 6 is used on Family 17h and later to access scrub registers. With scrub access removed, this function has no other use. Remove all Function 6 PCI IDs and related code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-4-yazen.ghannam@amd.com
2023-02-09 | EDAC/amd64: Remove scrub rate control for Family 17h and later | Yazen Ghannam | 1 | -28/+5
The scrub registers on AMD Family 17h and later may be inaccessible to the OS. Furthermore, hardware designers recommend that the scrubbing feature is managed by the firmware. Remove support for the sdram_scrub_rate interface for AMD Family 17h systems and later by not setting the scrub function pointers. The EDAC MC core will then not expose the scrub files in sysfs. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-3-yazen.ghannam@amd.com
2023-02-09 | EDAC/amd64: Don't set up EDAC PCI control on Family 17h+ | Yazen Ghannam | 1 | -4/+4
EDAC PCI control is used to detect/report legacy PCI errors like "Parity" and "SERROR". Modern AMD systems use PCIe Advanced Error Reporting (AER), and legacy PCI errors should not be reported. Remove EDAC PCI control setup on AMD Family 17h and later systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-2-yazen.ghannam@amd.com
2022-10-21 | EDAC: Check for GHES preference in the chipset-specific EDAC drivers | Jia He | 1 | -0/+3
Call ghes_get_devices() to check whether ghes_edac should be used on the platform where it is preferred over the corresponding chipset-specific EDAC driver. Unlike the existing edac_get_owner() check, the ghes_get_devices() check works independently of the module_init ordering. [ bp: Massage. ] Suggested-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221010023559.69655-6-justin.he@arm.com
2022-04-05 | x86/amd_nb: Unexport amd_cache_northbridges() | Muralidhara M K | 1 | -1/+1
amd_cache_northbridges() is exported by amd_nb.c and is called by amd64-agp.c and amd64_edac.c modules at module_init() time so that NB descriptors are properly cached before those drivers can use them. However, the init_amd_nbs() initcall already does call amd_cache_northbridges() unconditionally and thus makes sure the NB descriptors are enumerated. That initcall is a fs_initcall type which is in the 5th group (starting from 0) of initcalls that get run in increasing numerical order by the init code. The module_init() call is turned into an __initcall() in the MODULE=n case and those are device-level initcalls, i.e., group 6. Therefore, the northbridges caching is already finished by the time module initialization starts and thus the correct initialization order is retained. Unexport amd_cache_northbridges(), update dependent modules to call amd_nb_num() instead. While at it, simplify the checks in amd_cache_northbridges(). [ bp: Heavily massage and *actually* explain why the change is ok. ] Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324122729.221765-1-nchatrad@amd.com
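The ordering argument in this message can be modelled with a small userspace simulation. It is a sketch only: the `*_sketch` functions, `run_initcalls()` and the `initcall` struct are hypothetical, and only the group numbers 5 (fs) and 6 (device) come from the text above.

```c
#include <assert.h>
#include <stddef.h>

enum { LEVEL_FS = 5, LEVEL_DEVICE = 6, NUM_LEVELS = 8 };

struct initcall {
	int level;
	void (*fn)(void);
};

static int nb_cached;       /* set once the "NB descriptors" are cached */
static int edac_saw_cache;  /* what the built-in driver observed at init */

static void init_amd_nbs_sketch(void)
{
	nb_cached = 1;
}

static void amd64_edac_init_sketch(void)
{
	edac_saw_cache = nb_cached;
}

/* Run every registered call, lowest level first. */
static void run_initcalls(const struct initcall *calls, size_t n)
{
	assert(calls != NULL);
	for (int level = 0; level < NUM_LEVELS; level++)
		for (size_t i = 0; i < n; i++)
			if (calls[i].level == level)
				calls[i].fn();
}

static int edac_init_sees_cached_nbs(void)
{
	/* Registration order is deliberately reversed; the level decides. */
	const struct initcall calls[] = {
		{ LEVEL_DEVICE, amd64_edac_init_sketch },
		{ LEVEL_FS,     init_amd_nbs_sketch },
	};

	nb_cached = 0;
	edac_saw_cache = 0;
	run_initcalls(calls, sizeof(calls) / sizeof(calls[0]));
	return edac_saw_cache;
}
```

Because the level-5 call always runs before the level-6 call, the driver's init observes the cache as already populated, which is why the explicit call from the driver is redundant in the built-in case.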
2022-02-24 | EDAC/amd64: Add new register offset support and related changes | Yazen Ghannam | 1 | -16/+64
Introduce a "family flags" bitmask that can be used to indicate any special behavior needed on a per-family basis. Add a flag to indicate a system uses the new register offsets introduced with Family 19h Model 10h. Use this flag to account for register offset changes, a new bitfield indicating DDR5 use on a memory controller, and to set the proper number of chip select masks. Rework f17_addr_mask_to_cs_size() to properly handle the change in chip select masks. And update code comments to reflect the updated Chip Select, DIMM, and Mask relationships. [uninitialized variable warning] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: William Roche <william.roche@oracle.com> Link: https://lore.kernel.org/r/20220202144307.2678405-3-yazen.ghannam@amd.com
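A per-family flags bitmask of the kind described here might look like the following sketch. All names are hypothetical, the register offsets are illustrative placeholders rather than values from the commit, and the mask counts (2 versus 4) are an assumption; the message only says the count depends on the flag.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_NEW_REG_OFFSETS (1u << 0)  /* hypothetical flag name */

/* Illustrative placeholder offsets, not taken from the commit. */
#define REG_OFFSET_OLD 0x28u
#define REG_OFFSET_NEW 0x30u

/* Simplified detection: Family 19h Models 10h-1Fh set the flag. */
static uint32_t family_flags(unsigned int family, unsigned int model)
{
	if (family == 0x19 && model >= 0x10 && model <= 0x1f)
		return FLAG_NEW_REG_OFFSETS;
	return 0;
}

/* Pick the register offset from the flag instead of re-checking models. */
static uint32_t reg_offset(uint32_t flags)
{
	return (flags & FLAG_NEW_REG_OFFSETS) ? REG_OFFSET_NEW : REG_OFFSET_OLD;
}

/* Assumed counts: the number of chip select masks follows the flag. */
static unsigned int num_cs_masks(uint32_t flags)
{
	return (flags & FLAG_NEW_REG_OFFSETS) ? 4 : 2;
}
```

The point of the design is that family and model checks happen once, when the flags are set, and every later decision tests a single bit.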
2022-02-23 | EDAC/amd64: Set memory type per DIMM | Yazen Ghannam | 1 | -12/+31
Current AMD systems allow mixing of DIMM types within a system. However, DIMMs within a channel, i.e. managed by a single Unified Memory Controller (UMC), must be of the same type. Handle this possible configuration by checking and setting the memory type for each individual "UMC" structure. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: William Roche <william.roche@oracle.com> Link: https://lore.kernel.org/r/20220202144307.2678405-2-yazen.ghannam@amd.com
2022-01-10 | Merge tag 'edac_updates_for_v5.17_rc1' of ↵ | Linus Torvalds | 1 | -1/+35
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

- Add support for version 3 of the Synopsys DDR controller to synopsys_edac
- Add support for DDR5 and new models 0x10-0x1f and 0x50-0x5f of AMD family 0x19 CPUs to amd64_edac
- The usual set of fixes and cleanups

* tag 'edac_updates_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/amd64: Add support for family 19h, models 50h-5fh
  EDAC/sb_edac: Remove redundant initialization of variable rc
  RAS/CEC: Remove a repeated 'an' in a comment
  EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh
  EDAC: Add RDDR5 and LRDDR5 memory types
  EDAC/sifive: Fix non-kernel-doc comment
  dt-bindings: memory: Add entry for version 3.80a
  EDAC/synopsys: Enable the driver on Intel's N5X platform
  EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR
  EDAC/synopsys: Use the quirk for version instead of ddr version
2021-12-24 | EDAC/amd64: Add support for family 19h, models 50h-5fh | Marc Bevand | 1 | -0/+15
Add the new family 19h models 50h-5fh PCI IDs (device 18h functions 0 and 6) to support Ryzen 5000 APUs ("Cezanne"). Signed-off-by: Marc Bevand <m@zorinaq.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20211221233112.556927-1-m@zorinaq.com
2021-12-10 | EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh | Yazen Ghannam | 1 | -1/+20
Add a new family type for AMD Family 19h Models 10h to 1Fh. Use this new family type for Models A0h to AFh also. Increase the maximum number of controllers from 8 to 12. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208174356.1997855-3-yazen.ghannam@amd.com
2021-11-15 | EDAC/amd64: Add context struct | Yazen Ghannam | 1 | -42/+55
Define an address translation context struct. This will hold values that will be passed between multiple functions. Save return address, Node ID, and the Instance ID number to start. Currently, the UMC number is used as the Instance ID, but future DF versions may use another value. Also include a "tmp" field to use when reading registers. This is to avoid having to define a temporary variable in multiple functions. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-5-yazen.ghannam@amd.com
2021-11-15 | EDAC/amd64: Allow for DF Indirect Broadcast reads | Yazen Ghannam | 1 | -8/+21
The DF Indirect Access method allows for "Broadcast" accesses in which case no specific instance is targeted. Add support using a reserved instance ID of 0xFF to indicate a broadcast access. Set the FICAA register appropriately. Define helper functions for instance and broadcast reads and use them where appropriate. Drop the "amd_" prefix since these functions are all static. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-4-yazen.ghannam@amd.com
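How a reserved instance ID can select broadcast mode when composing the FICAA value is sketched below. Only the 0xFF broadcast ID comes from the message; the bit layout (bit 0 selects a single instance, register in bits [9:2], function in bits [13:11], instance ID from bit 16) and the names `build_ficaa()`, `ficaa_instance()` and `ficaa_broadcast()` are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define DF_BROADCAST 0xFF  /* reserved instance ID named in the commit */

/* Compose a FICAA value under the assumed field layout. */
static uint32_t build_ficaa(uint8_t func, uint16_t reg, uint8_t instance_id)
{
	uint32_t ficaa;

	assert(func <= 7);
	assert(reg <= 0x3FC && (reg & 0x3) == 0);

	/* Bit 0: 1 targets one instance, 0 means broadcast. */
	ficaa  = (instance_id == DF_BROADCAST) ? 0 : 1;
	ficaa |= reg & 0x3FC;
	ficaa |= (uint32_t)(func & 0x7) << 11;
	ficaa |= (uint32_t)instance_id << 16;
	return ficaa;
}

/* Thin helpers mirroring the "instance" and "broadcast" read variants. */
static uint32_t ficaa_instance(uint8_t func, uint16_t reg, uint8_t instance_id)
{
	return build_ficaa(func, reg, instance_id);
}

static uint32_t ficaa_broadcast(uint8_t func, uint16_t reg)
{
	return build_ficaa(func, reg, DF_BROADCAST);
}
```

Keeping one builder with two thin wrappers means callers never pass the magic 0xFF value themselves.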
2021-11-15 | x86/amd_nb, EDAC/amd64: Move DF Indirect Read to AMD64 EDAC | Yazen Ghannam | 1 | -0/+50
df_indirect_read() is used only for address translation. Move it to EDAC along with the translation code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-3-yazen.ghannam@amd.com
2021-11-15 | x86/MCE/AMD, EDAC/amd64: Move address translation to AMD64 EDAC | Yazen Ghannam | 1 | -0/+199
The address translation code used for current AMD systems is non-architectural. So move it to EDAC. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-2-yazen.ghannam@amd.com
2021-10-07 | EDAC/amd64: Handle three rank interleaving mode | Yazen Ghannam | 1 | -1/+21
AMD Rome systems and later support interleaving between three identical ranks within a channel. Check for this mode by counting the number of enabled chip selects and comparing their masks. If there are exactly three enabled chip selects and their masks are identical, then three rank interleaving is enabled. The size of a rank is determined from its mask value. However, three rank interleaving doesn't follow the method of swapping an interleave bit with the most significant bit. Rather, the interleave bit is flipped and the most significant bit remains the same. There is only a single interleave bit in this case. Account for this when determining the chip select size by keeping the most significant bit at its original value and ignoring any zero bits. This will return a full bitmask in [MSB:1]. Fixes: e53a3b267fb0 ("EDAC/amd64: Find Chip Select memory size using Address Mask") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211005154419.2060504-1-yazen.ghannam@amd.com
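The detection rule and the size calculation described above can be sketched in userspace C. The names `chip_select`, `three_rank_interleave()` and `cs_size_mb()` are hypothetical. The sketch also assumes that mask bit n corresponds to address bit n+8 (so the size in kB is `(mask >> 2) + 1`) and that bit 0 of the mask is always zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct chip_select {
	bool enabled;
	uint32_t mask;
};

/* True when exactly three chip selects are enabled with identical masks. */
static bool three_rank_interleave(const struct chip_select *cs, size_t n)
{
	size_t count = 0;
	uint32_t first = 0;

	for (size_t i = 0; i < n; i++) {
		if (!cs[i].enabled)
			continue;
		if (count == 0)
			first = cs[i].mask;
		else if (cs[i].mask != first)
			return false;
		count++;
	}
	return count == 3;
}

/* Chip select size in MB, derived from the address mask. */
static uint32_t cs_size_mb(uint32_t addr_mask, bool three_rank)
{
	int msb = -1, weight = 0, zero_bits;
	uint64_t full;

	assert(addr_mask != 0 && (addr_mask & 1u) == 0);

	for (int bit = 0; bit < 32; bit++) {
		if (addr_mask & (1u << bit)) {
			msb = bit;
			weight++;
		}
	}

	/* In three rank mode the single flipped interleave bit is ignored. */
	zero_bits = msb - weight - (three_rank ? 1 : 0);

	/* Full mask in [msb - zero_bits : 1]. */
	full = ((1ull << (msb - zero_bits + 1)) - 1) & ~1ull;

	return (uint32_t)(((full >> 2) + 1) >> 10);
}
```

For a mask of `0x03FFFDFE` (bits [25:1] set except one interleave bit), the three rank path keeps the most significant bit in place and reports 16384 MB, whereas treating the zero bit as a swapped interleave bit would halve the result to 8192 MB.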
2021-07-13 | EDAC/amd64: Use DEVICE_ATTR helper macros | Dwaipayan Ray | 1 | -13/+8
Instead of "open coding" DEVICE_ATTR, use the corresponding helper macros DEVICE_ATTR_{RW,RO,WO} in amd64_edac.c. Some function names needed to be changed to match the device conventions <foo>_show and <foo>_store, but the functionality itself is unchanged. The devices using EDAC_DCT_ATTR_SHOW() are left unchanged. Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210713065130.2151-1-dwaipayanray1@gmail.com
2021-05-10 | x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG | Brijesh Singh | 1 | -1/+1
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8 name from it. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com