summaryrefslogtreecommitdiff
path: root/drivers/edac
AgeCommit message (Collapse)AuthorFilesLines
2023-03-22EDAC/sysfs: move to use bus_get_dev_root()Greg Kroah-Hartman2-12/+18
Direct access to the struct bus_type dev_root pointer is going away soon so replace that with a call to bus_get_dev_root() instead, which is what it is there for. Cc: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rric@kernel.org> Cc: linux-edac@vger.kernel.org Link: https://lore.kernel.org/r/20230313182918.1312597-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22EDAC/altera: Remove redundant error loggingDeepak R Varma1-6/+3
A call to platform_get_irq() already prints an error on failure within its own implementation. So printing another error based on its return value in the caller is redundant and should be removed. The clean up also makes if condition block braces unnecessary. Remove that as well. Issue identified using platform_get_irq.cocci coccinelle semantic patch. Signed-off-by: Deepak R Varma <drv@mailo.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/Y/+j27kqdhflPtaj@ubun2204.myguest.virtualbox.org
2023-03-16qcom: llcc/edac: Support polling mode for ECC handlingManivannan Sadhasivam1-21/+29
Not all Qcom platforms support IRQ mode for ECC handling. For those platforms, the current EDAC driver will not be probed due to missing ECC IRQ in devicetree. So add support for polling mode so that the EDAC driver can be used on all Qcom platforms supporting LLCC. The polling delay of 5000ms is chosen based on Qcom downstream/vendor driver. Reported-by: Luca Weiss <luca.weiss@fairphone.com> Tested-by: Luca Weiss <luca.weiss@fairphone.com> Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230314080443.64635-14-manivannan.sadhasivam@linaro.org
2023-03-16qcom: llcc/edac: Fix the base address used for accessing LLCC banksManivannan Sadhasivam1-9/+5
The Qualcomm LLCC/EDAC drivers were using a fixed register stride for accessing the (Control and Status Registers) CSRs of each LLCC bank. This stride only works for some SoCs like SDM845 for which driver support was initially added. But the later SoCs use different register stride that vary between the banks with holes in-between. So it is not possible to use a single register stride for accessing the CSRs of each bank. By doing so could result in a crash. For fixing this issue, let's obtain the base address of each LLCC bank from devicetree and get rid of the fixed stride. This also means, there is no need to rely on reg-names property and the base addresses can be obtained using the index. First index is LLCC bank 0 and last index is LLCC broadcast. If the SoC supports more than one bank, then those need to be defined in devicetree for index from 1..N-1. Reported-by: Parikshit Pareek <quic_ppareek@quicinc.com> Tested-by: Luca Weiss <luca.weiss@fairphone.com> Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230314080443.64635-13-manivannan.sadhasivam@linaro.org
2023-03-13EDAC/skx: Fix overflows on the DRAM row address mapping arraysQiuxu Zhuo1-2/+2
The current DRAM row address mapping arrays skx_{open,close}_row[] only support ranks with sizes up to 16G. Decoding a rank address to a DRAM row address for a 32G rank by using either one of the above arrays by the skx_edac driver, will result in an overflow on the array. For a 32G rank, the most significant DRAM row address bit (the bit17) is mapped from the bit34 of the rank address. Add this new mapping item to both arrays to fix the overflow issue. Fixes: 4ec656bdf43a ("EDAC, skx_edac: Add EDAC driver for Skylake") Reported-by: Feng Xu <feng.f.xu@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230211011728.71764-1-qiuxu.zhuo@intel.com
2023-02-21Merge tag 'edac_updates_for_v6.3' of ↵Linus Torvalds9-359/+961
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add a driver for the RAS functionality on Xilinx's on chip memory controller - Add support for decoding errors from the first and second level memory on SKL-based hardware - Add support for the memory controllers in Intel Granite Rapids and Emerald Rapids machines - First round of amd64_edac driver simplification and removal of unneeded functionality - The usual cleanups and fixes * tag 'edac_updates_for_v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/amd64: Shut up an -Werror,-Wsometimes-uninitialized clang false positive EDAC/amd64: Remove early_channel_count() EDAC/amd64: Remove PCI Function 0 EDAC/amd64: Remove PCI Function 6 EDAC/amd64: Remove scrub rate control for Family 17h and later EDAC/amd64: Don't set up EDAC PCI control on Family 17h+ EDAC/i10nm: Add driver decoder for Sapphire Rapids server EDAC/i10nm: Add Intel Granite Rapids server support EDAC/i10nm: Make more configurations CPU model specific EDAC/i10nm: Add Intel Emerald Rapids server support EDAC/skx_common: Delete duplicated and unreachable code EDAC/skx_common: Enable EDAC support for the "near" memory EDAC/qcom: Add platform_device_id table for module autoloading EDAC/zynqmp: Add EDAC support for Xilinx ZynqMP OCM dt-bindings: edac: Add bindings for Xilinx ZynqMP OCM
2023-02-21Merge tag 'ras_core_for_v6.3_rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Add support for reporting more bits of the physical address on error, on newer AMD CPUs - Mask out bits which don't belong to the address of the error being reported * tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Mask out non-address bits from machine check bank x86/mce: Add support for Extended Physical Address MCA changes x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
2023-02-14EDAC/amd64: Shut up an -Werror,-Wsometimes-uninitialized clang false positiveYazen Ghannam1-1/+1
Reportedly, clang cannot do interprocedural analysis: https://lore.kernel.org/r/20230213-amd64_edac-wsometimes-uninitialized-v1-1-5bde32b89e02@kernel.org and see that those arguments won't be used uninitialized. So, yeah, the code's fine even without this. Normally, such a "fix" won't be applied but that warning gets automatically enabled in -Wall builds and when CONFIG_WERROR is set in allmodconfig builds, the build fails. So shut it up with a minimal fix as this code will see more reorganization very soon. [ bp: Write commit message. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/Y%2BqdVHidnrrKvxiD@dev-arch.thelio-3990X
2023-02-09EDAC/amd64: Remove early_channel_count()Yazen Ghannam2-116/+2
The early_channel_count() function seems to have been useful in the past for knowing how many EDAC mci structures to populate. However, this is no longer needed as the maximum channel count for a system is used instead. Remove the early_channel_count() helper functions and related code. Use the size of the channel layer when iterating over channel structures. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-6-yazen.ghannam@amd.com
2023-02-09EDAC/amd64: Remove PCI Function 0Yazen Ghannam2-43/+7
PCI Function 0 is used on Family 17h and later only to read the "dhar" value. This value is printed and provided through a module-specific debug sysfs file. The value is not used for any Family 17h and later code, and it does not have any apparent debug value on these systems. Remove "dhar", Function 0 PCI IDs, and all related code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-5-yazen.ghannam@amd.com
2023-02-09EDAC/amd64: Remove PCI Function 6Yazen Ghannam2-31/+3
PCI Function 6 is used on Family 17h and later to access scrub registers. With scrub access removed, this function has no other use. Remove all Function 6 PCI IDs and related code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-4-yazen.ghannam@amd.com
2023-02-09EDAC/amd64: Remove scrub rate control for Family 17h and laterYazen Ghannam2-30/+5
The scrub registers on AMD Family 17h and later may be inaccessible to the OS. Furthermore, hardware designers recommend that the scrubbing feature is managed by the firmware. Remove support for the sdram_scrub_rate interface for AMD Family 17h systems and later by not setting the scrub function pointers. The EDAC MC core will then not expose the scrub files in sysfs. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-3-yazen.ghannam@amd.com
2023-02-09EDAC/amd64: Don't set up EDAC PCI control on Family 17h+Yazen Ghannam1-4/+4
EDAC PCI control is used to detect/report legacy PCI errors like "Parity" and "SERROR". Modern AMD systems use PCIe Advanced Error Reporting (AER), and legacy PCI errors should not be reported. Remove EDAC PCI control setup on AMD Family 17h and later systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-2-yazen.ghannam@amd.com
2023-02-08EDAC/i10nm: Add driver decoder for Sapphire Rapids serverYouquan Song1-33/+69
Intel SDM (December 2022) vol3B 17.13.2 contains IMC MC error codes for Sapphire Rapids. Current i10nm_edac only supports firmware decoder (ACPI DSM methods) for Sapphire Rapids. So add the driver decoder (decoding DDR memory errors via extracting error information from the IMC MC error codes) for Sapphire Rapids for better decoding performance. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-01-25EDAC/i10nm: Add Intel Granite Rapids server supportQiuxu Zhuo2-25/+217
The Granite Rapids CPU model uses similar memory controller registers as Sapphire Rapids server but with some different configurations: - Various memory controller numbers for different Granite Rapids CPUs. So detect the number of present memory controllers at run time. - Different MMIO offsets of memory controllers. - Different triples of bus/dev/fun of some PCI devices used in i10nm_edac. Add above configurations and Granite Rapids CPU model ID for EDAC support. [Tony: Fixed 2 typos s/strcture/structure/] Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-25EDAC/i10nm: Make more configurations CPU model specificQiuxu Zhuo2-42/+121
The numbers of memory controllers per socket, channels per memory controller, DIMMs per channel and the triples of bus/device/function of PCI devices used in i10nm_edac can be CPU model specific. Add new fields to the structure res_config for above numbers and triples to make them CPU model specific. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-25EDAC/i10nm: Add Intel Emerald Rapids server supportQiuxu Zhuo1-0/+1
The Emerald Rapids CPU model uses similar memory controller registers as Sapphire Rapids server. Add Emerald Rapids CPU model number ID for EDAC support. Tested-by: Li Zhang <li4.zhang@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-25EDAC/skx_common: Delete duplicated and unreachable codeQiuxu Zhuo1-37/+21
skx_mce_check_error() returns early if the error isn't from memory. So when skx_mce_output_error() is invoked from skx_mce_check_error(), it doesn't need to re-check whether the error is from memory. Delete the duplicated and unreachable code from skx_mce_output_error(). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-25EDAC/skx_common: Enable EDAC support for the "near" memoryQiuxu Zhuo2-6/+36
The current {skx,i10nm}_edac miss the EDAC support to decode errors from the 1st level memory (the fast "near" memory as cache) of the 2-level memory system. Introduce a helper function skx_error_in_mem() to check whether errors are from memory at the beginning of skx_mce_check_error(). As long as the errors are from memory (either the 1-level memory system or the 2-level memory system), decode the errors. Reported-and-tested-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com
2023-01-20EDAC/qcom: Do not pass llcc_driv_data as edac_device_ctl_info's pvt_infoManivannan Sadhasivam1-3/+2
The memory for llcc_driv_data is allocated by the LLCC driver. But when it is passed as the private driver info to the EDAC core, it will get freed during the qcom_edac driver release. So when the qcom_edac driver gets probed again, it will try to use the freed data leading to the use-after-free bug. Hence, do not pass llcc_driv_data as pvt_info but rather reference it using the platform_data pointer in the qcom_edac driver. Fixes: 27450653f1db ("drivers: edac: Add EDAC driver support for QCOM SoCs") Reported-by: Steev Klimaszewski <steev@kali.org> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride Cc: <stable@vger.kernel.org> # 4.20 Link: https://lore.kernel.org/r/20230118150904.26913-4-manivannan.sadhasivam@linaro.org
2023-01-20EDAC/qcom: Add platform_device_id table for module autoloadingManivannan Sadhasivam1-0/+7
Add a device ID table so that the driver loads automatically when the associated platform_device gets registered. Reported-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride Link: https://lore.kernel.org/r/20230118150904.26913-3-manivannan.sadhasivam@linaro.org
2023-01-19EDAC/device: Respect any driver-supplied workqueue polling valueManivannan Sadhasivam1-8/+7
The EDAC drivers may optionally pass the poll_msec value. Use that value if available, else fall back to 1000ms. [ bp: Touchups. ] Fixes: e27e3dac6517 ("drivers/edac: add edac_device class") Reported-by: Luca Weiss <luca.weiss@fairphone.com> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride Cc: <stable@vger.kernel.org> # 4.9 Link: https://lore.kernel.org/r/COZYL8MWN97H.MROQ391BGA09@otso
2023-01-10x86/mce: Mask out non-address bits from machine check bankTony Luck1-1/+1
Systems that support various memory encryption schemes (MKTME, TDX, SEV) use high order physical address bits to indicate which key should be used for a specific memory location. When a memory error is reported, some systems may report those key bits in the IA32_MCi_ADDR machine check MSR. The Intel SDM has a footnote for the contents of the address register that says: "Useful bits in this field depend on the address methodology in use when the register state is saved." AMD Processor Programming Reference has a more explicit description of the MCA_ADDR register: "For physical addresses, the most significant bit is given by Core::X86::Cpuid::LongModeInfo[PhysAddrSize]." Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical address bits within the machine check bank address register. Use this mask for recoverable machine check handling and in the EDAC driver to ignore any key bits that may be present. [ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ] Reported-by: Isaku Yamahata <isaku.yamahata@intel.com> Reported-by: Fan Du <fan.du@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
2023-01-09EDAC/zynqmp: Add EDAC support for Xilinx ZynqMP OCMSai Krishna Potthuri3-0/+476
Add EDAC support for Xilinx ZynqMP OCM Controller, so this driver reports CE and UE errors upon interrupt generation. Also add debugfs files for error injection. On Xilinx ZynqMP platform, both OCM Controller driver(zynqmp_edac) and DDR Memory Controller driver(synopsys_edac) co-exist which means both can be loaded at a time. This scenario is tested on Xilinx ZynqMP platform. Fix following issue reported by the robot: "MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/edac/xlnx,zynqmp-ocmc.yaml" [ bp: - Massage commit message - s/EDAC_ZYNQMP_OCM/EDAC_ZYNQMP/ - Touchups ] Reported-by: kernel test robot <lkp@intel.com> Co-developed-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com> Signed-off-by: Sai Krishna Potthuri <sai.krishna.potthuri@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230104084512.1855243-3-sai.krishna.potthuri@amd.com
2023-01-03EDAC/highbank: Fix memory leak in highbank_mc_probe()Miaoqian Lin1-2/+5
When devres_open_group() fails, it returns -ENOMEM without freeing memory allocated by edac_mc_alloc(). Call edac_mc_free() on the error handling path to avoid a memory leak. [ bp: Massage commit message. ] Fixes: a1b01edb2745 ("edac: add support for Calxeda highbank memory controller") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20221229054825.1361993-1-linmq006@gmail.com
2022-12-30EDAC/device: Fix period calculation in edac_device_reset_delay_period()Eliav Farber2-10/+9
Fix period calculation in case user sets a value of 1000. The input of round_jiffies_relative() should be in jiffies and not in milli-seconds. [ bp: Use the same code pattern as in edac_device_workq_setup() for clarity. ] Fixes: c4cf3b454eca ("EDAC: Rework workqueue handling") Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221020124458.22153-1-farbere@amazon.com
2022-12-12Merge branches 'edac-ghes' and 'edac-misc' into edac-updates-for-v6.2Borislav Petkov (AMD)4-3/+28
Combine all queued EDAC changes for submission into v6.2: * ras/edac-ghes: EDAC/igen6: Return the correct error type when not the MC owner apei/ghes: Use xchg_release() for updating new cache slot instead of cmpxchg() EDAC: Check for GHES preference in the chipset-specific EDAC drivers EDAC/ghes: Make ghes_edac a proper module EDAC/ghes: Prepare to make ghes_edac a proper module EDAC/ghes: Add a notifier for reporting memory errors efi/cper: Export several helpers for ghes_edac to use * ras/edac-misc: EDAC/i10nm: fix refcount leak in pci_get_dev_wrapper() EDAC/i5400: Fix typo in comment: vaious -> various EDAC/mc_sysfs: Increase legacy channel support to 12 MAINTAINERS: Make Mauro EDAC reviewer MAINTAINERS: Make Manivannan Sadhasivam the maintainer of qcom_edac EDAC/i5000: Mark as BROKEN Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2022-11-28EDAC/i10nm: fix refcount leak in pci_get_dev_wrapper()Yang Yingliang1-2/+1
As the comment of pci_get_domain_bus_and_slot() says, it returns a PCI device with refcount incremented, so it doesn't need to call an extra pci_dev_get() in pci_get_dev_wrapper(), and the PCI device needs to be put in the error path. Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20221128065512.3572550-1-yangyingliang@huawei.com
2022-11-25EDAC/i5400: Fix typo in comment: vaious -> variousChen Zhang1-1/+2
Fix spelling typo in comment: vaious -> various. [ bp: Massage. ] Reported-by: k2ci <kernel-bot@kylinos.cn> Signed-off-by: Chen Zhang <chenzhang@kylinos.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221102081248.45694-1-chenzhang@kylinos.cn
2022-10-31EDAC/mc_sysfs: Increase legacy channel support to 12Yazen Ghannam1-0/+24
Newer AMD systems, such as Genoa, can support up to 12 channels per EDAC "mc" device. These are detected by the device's EDAC module, and the current EDAC interface is properly enumerated. However, the legacy EDAC sysfs interface provides device attributes only for channels 0 to 7. Therefore, channels 8 to 11 will not be visible in the legacy interface. This was overlooked in the initial support for AMD Genoa. Add additional device attributes so that up to 12 channels are visible in the legacy EDAC sysfs interface. Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20221018153630.14664-1-yazen.ghannam@amd.com
2022-10-25EDAC/igen6: Return the correct error type when not the MC ownerJia He1-1/+1
Return -EBUSY instead of -ENODEV just like the other EDAC drivers do. [ bp: Rewrite text. ] Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221018082214.569504-8-justin.he@arm.com
2022-10-21EDAC: Check for GHES preference in the chipset-specific EDAC driversJia He11-0/+31
Call ghes_get_devices() to check whether ghes_edac should be used on the platform where it is preferred over the corresponding chipset-specific EDAC driver. Unlike the existing edac_get_owner() check, the ghes_get_devices() check works independent to the module_init ordering. [ bp: Massage. ] Suggested-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221010023559.69655-6-justin.he@arm.com
2022-10-21EDAC/ghes: Make ghes_edac a proper moduleJia He2-4/+40
Commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in apci_init()") introduced a bug leading to ghes_edac_register() to be invoked before edac_init(). Because at that time the bus "edac" hadn't been even registered, this created sysfs nodes as /devices/mc0 instead of /sys/devices/system/edac/mc/mc0 on an Ampere eMag server. Fix this by turning ghes_edac into a proper module. The list of GHES devices returned is not protected from being modified concurrently but it is pretty static as it gets created only during GHES init and latter is not a module so... [ bp: Massage. ] Fixes: dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in apci_init()") Co-developed-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221010023559.69655-5-justin.he@arm.com
2022-10-21EDAC/ghes: Prepare to make ghes_edac a proper moduleJia He1-33/+2
To make ghes_edac a proper module, prepare to decouple its dependencies from GHES. Move the ghes_edac.force_load parameter to ghes.c in order to properly control whether ghes_edac should be force-loaded: In ghes_edac_register() it is too late to set the module flag. Introduce a helper ghes_get_devices(), which returns the list of GHES devices which got probed when the platform-check passes on the system. The previous force_load check is not needed in ghes_edac_unregister() since it will be checked in the module's init function of ghes_edac later. [ bp: Massage. ] Suggested-by: Toshi Kani <toshi.kani@hpe.com> Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221010023559.69655-4-justin.he@arm.com
2022-10-20EDAC/ghes: Add a notifier for reporting memory errorsJia He1-2/+17
In order to make it a proper module and disentangle it from facilities, add a notifier for reporting memory errors. Use an atomic notifier because calls sites like ghes_proc_in_irq() run in interrupt context. [ bp: Massage commit message. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221010023559.69655-3-justin.he@arm.com
2022-10-17EDAC/i5000: Mark as BROKENAristeu Rozanski1-0/+1
i5000_edac supports very old hardware which isn't available and it's been broken for single/dual channel for many years without anyone noticing. Marking as BROKEN. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220921181009.oxytvicy6sry6it7@redhat.com
2022-10-13Merge patch series "Use composable cache instead of L2 cache"Palmer Dabbelt2-7/+7
Zong Li <zong.li@sifive.com> says: Since composable cache may be L3 cache if private L2 cache exists, we should use its original name "composable cache" to prevent confusion. This patchset contains the modification which is related to ccache, such as DT binding and EDAC driver. * b4-shazam-merge: riscv: Add cache information in AUX vector soc: sifive: ccache: define the macro for the register shifts soc: sifive: ccache: use pr_fmt() to remove CCACHE: prefixes soc: sifive: ccache: reduce printing on init soc: sifive: ccache: determine the cache level from dts soc: sifive: ccache: Rename SiFive L2 cache to Composable cache. dt-bindings: sifive-ccache: change Sifive L2 cache to Composable cache Link: https://lore.kernel.org/r/20220913061817.22564-1-zong.li@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-10-13soc: sifive: ccache: Rename SiFive L2 cache to Composable cache.Greentime Hu2-7/+7
Since composable cache may be L3 cache if there is a L2 cache, we should use its original name composable cache to prevent confusion. There are some new lines were generated due to adding the compatible "sifive,ccache0" into ID table and indent requirement. The sifive L2 has been renamed to sifive CCACHE, EDAC driver needs to apply the change as well. Signed-off-by: Greentime Hu <greentime.hu@sifive.com> Signed-off-by: Zong Li <zong.li@sifive.com> Co-developed-by: Zong Li <zong.li@sifive.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20220913061817.22564-3-zong.li@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-10-04Merge branches 'edac-drivers' and 'edac-misc' into edac-updates-for-v6.1Borislav Petkov10-66/+451
Combine all queued EDAC changes for submission into v6.1: * edac-drivers: EDAC/ie31200: Add Skylake-S support * edac-misc: EDAC/i7300: Correct the i7300_exit() function name in comment x86/sb_edac: Add row column translation for Broadwell EDAC/i10nm: Print an extra register set of retry_rd_err_log EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM EDAC/skx_common: Add ChipSelect ADXL component EDAC/ppc_4xx: Reorder symbols to get rid of a few forward declarations EDAC: Remove obsolete declarations in edac_module.h EDAC/i10nm: Add driver decoder for Ice Lake and Tremont CPUs EDAC/skx_common: Make output format similar EDAC/skx_common: Use driver decoder first EDAC/mc: Drop duplicated dimm->nr_pages debug printout EDAC/mc: Replace spaces with tabs in memtype flags definition EDAC/wq: Remove unneeded flush_workqueue() Signed-off-by: Borislav Petkov <bp@suse.de>
2022-09-24EDAC/i7300: Correct the i7300_exit() function name in commentColin Ian King1-1/+1
The incorrect function name is being used in the comment for function i7300_exit. Correct this. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220805125008.2346559-1-colin.i.king@gmail.com
2022-09-23x86/sb_edac: Add row column translation for BroadwellYouquan Song1-10/+138
The sb_edac driver lacks translation for DIMM internal address. Add memory address translation for row/column/bank/bank_group on Broadwell. Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com
2022-09-23EDAC/i10nm: Print an extra register set of retry_rd_err_logQiuxu Zhuo2-11/+72
Sapphire Rapids server adds an extra register set for logging more retry_rd_err_log data. So add code to print the extra register set. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com
2022-09-23EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBMQiuxu Zhuo2-17/+71
An HBM memory channel is divided into two pseudo channels. Each pseudo channel has its own retry_rd_err_log registers. Retrieve and print retry_rd_err_log registers of the HBM pseudo channel if the memory error is from HBM. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com
2022-09-23EDAC/skx_common: Add ChipSelect ADXL componentQiuxu Zhuo2-0/+9
Each pseudo channel of HBM has its own retry_rd_err_log registers. The bit 0 of ChipSelect ADXL component encodes the pseudo channel number of HBM memory. So add ChipSelect ADXL component to get HBM pseudo channel number. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220722233338.341567-1-tony.luck@intel.com
2022-09-18EDAC/ppc_4xx: Reorder symbols to get rid of a few forward declarationsUwe Kleine-König1-14/+9
When moving the definition of ppc4xx_edac_driver further down, the forward declarations can just be dropped. Do this to reduce needless line repetition. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220917232013.489931-1-u.kleine-koenig@pengutronix.de
2022-09-11EDAC: Remove obsolete declarations in edac_module.hGaosheng Cui1-4/+0
Commit 4de78c6877ec ("drivers/edac: mod PCI poll names"), renamed the respective variables and accessors but left the old accessor declarations edac_get_log_ce(), edac_get_log_ue(), edac_get_poll_msec() and edac_get_panic_on_ue() in place. Remove them. [ bp: Masssage commit message. ] Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220911094038.3224365-1-cuigaosheng1@huawei.com
2022-09-08EDAC/i10nm: Add driver decoder for Ice Lake and Tremont CPUsYouquan Song3-2/+138
Current i10nm_edac only supports firmware decoder (ACPI DSM methods). MCA bank registers of Ice Lake or Tremont CPUs contain the information to decode DDR memory errors. To get better decoding performance, add the driver decoder (decoding DDR memory errors via extracting error information from MCA bank registers) for Ice Lake and Tremont CPUs. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/
2022-09-08EDAC/skx_common: Make output format similarQiuxu Zhuo1-2/+2
The decoded output format of driver decoder is different from the output format of firmware decoder. Make output format similar regardless of decode function (Align driver decoder's to firmware decoder's). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/
2022-09-08EDAC/skx_common: Use driver decoder firstQiuxu Zhuo3-9/+17
The performance of driver decoder[1] is better than the performance of firmware decoder[2], especially on frequent correctable errors. So use the driver decoder first, fall back to firmware decoder if the driver decoder is unavailable. Also rename the function pointer skx_decode to driver_decode (better name to contrast with adxl_decode). [1] Decode errors by extracting error information from registers of memory controllers and/or MCA bank registers. [2] Decode errors by calling ACPI DSM methods. Co-developed-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/
2022-09-01EDAC/mc: Drop duplicated dimm->nr_pages debug printoutSerge Semin1-1/+0
The duplicated edac_dbg()-based dimm->nr_pages print was introduced in 6e84d359b2be ("edac_mc: Cleanup per-dimm_info debug messages"). The duplicated line can be found even in the commit message text: [ 1011.380101] EDAC DEBUG: edac_mc_dump_dimm: dimm->nr_pages = 0x40000 [ 1011.380103] EDAC DEBUG: edac_mc_dump_dimm: dimm->grain = 8 [ 1011.380104] EDAC DEBUG: edac_mc_dump_dimm: dimm->nr_pages = 0x40000 Drop the second edac_dbg() call. [ bp: Massage commit message. ] Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220822190730.27277-14-Sergey.Semin@baikalelectronics.ru