summaryrefslogtreecommitdiff
path: root/tools/testing/selftests/powerpc/eeh
AgeCommit message (Collapse)AuthorFilesLines
2021-01-31selftests/powerpc: Add VF recovery testsOliver O'Halloran3-0/+188
The basic EEH test ignores VFs since we the way the eeh_dev_break debugfs interface works means that if multiple VFs are enabled we may cause errors on all them them. However, we can work around that by only enabling a single VF at a time. This patch adds some infrastructure for finding SR-IOV capable devices and enabling / disabling VFs so we can exercise the VF specific EEH recovery paths. Two new tests are added, one for testing EEH aware devices and one for EEH un-aware VFs. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201103044503.917128-3-oohall@gmail.com
2021-01-31selftests/powerpc: Use stderr for debug messages in eeh-functionsOliver O'Halloran1-8/+12
We want to use stdout to return lists of devices, etc so log debug / status messages to stderr rather than stdout. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201103044503.917128-2-oohall@gmail.com
2021-01-31selftests/powerpc: Hoist helper code out of eeh-basicOliver O'Halloran2-36/+51
Hoist some of the useful test environment checking and prep code into eeh-functions.sh so they can be reused in other tests. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201103044503.917128-1-oohall@gmail.com
2021-01-04selftests/powerpc: Make the test check in eeh-basic.sh posix compliantPo-Hsu Lin1-1/+1
The == operand is a bash extension, thus this will fail on Ubuntu with: ./eeh-basic.sh: 89: test: 2: unexpected operator As the /bin/sh on Ubuntu is pointed to DASH. Use -eq to fix this posix compatibility issue. Fixes: 996f9e0f93f162 ("selftests/powerpc: Fix eeh-basic.sh exit codes") Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201228043459.14281-1-po-hsu.lin@canonical.com
2020-11-19selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basicPo-Hsu Lin2-1/+2
The eeh-basic test got its own 60 seconds timeout (defined in commit 414f50434aa2 "selftests/eeh: Bump EEH wait time to 60s") per breakable device. And we have discovered that the number of breakable devices varies on different hardware. The device recovery time ranges from 0 to 35 seconds. In our test pool it will take about 30 seconds to run on a Power8 system that with 5 breakable devices, 60 seconds to run on a Power9 system that with 4 breakable devices. Extend the timeout setting in the kselftest framework to 5 minutes to give it a chance to finish. Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201023024539.9512-1-po-hsu.lin@canonical.com
2020-10-14selftests/powerpc: Fix eeh-basic.sh exit codesOliver O'Halloran1-3/+6
The kselftests test running infrastructure expects tests to finish with an exit code of 4 if the test decided it should be skipped. Currently eeh-basic.sh exits with the number of devices that failed to recover, so if four devices didn't recover we'll report a skip instead of a fail. Fix this by checking if the return code is non-zero and report success and failure by returning 0 or 1 respectively. For the cases where should actually skip return 4. Fixes: 85d86c8aa52e ("selftests/powerpc: Add basic EEH selftest") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201014024711.1138386-1-oohall@gmail.com
2020-07-29selftests/powerpc: Squash spurious errors due to device removalOliver O'Halloran1-3/+8
For drivers that don't have the error handling callbacks we implement recovery by removing the device and re-probing it. This causes the sysfs directory for the PCI device to be removed which causes the following spurious error to be printed when checking the PE state: Breaking 0005:03:00.0... ./eeh-basic.sh: line 13: can't open /sys/bus/pci/devices/0005:03:00.0/eeh_pe_state: no such file 0005:03:00.0, waited 0/60 0005:03:00.0, waited 1/60 0005:03:00.0, waited 2/60 0005:03:00.0, waited 3/60 0005:03:00.0, waited 4/60 0005:03:00.0, waited 5/60 0005:03:00.0, waited 6/60 0005:03:00.0, waited 7/60 0005:03:00.0, Recovered after 8 seconds We currently try to avoid this by checking if the PE state file exists before reading from it. This is however inherently racy so re-work the state checking so that we only read from the file once, and we squash any errors that occur while reading. Fixes: 85d86c8aa52e ("selftests/powerpc: Add basic EEH selftest") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200727010127.23698-1-oohall@gmail.com
2020-04-02selftests/eeh: Skip ahci adaptersMichael Ellerman1-0/+5
The ahci driver doesn't support error recovery, and if your root filesystem is attached to it the eeh-basic.sh test will likely kill your machine. So skip any device we see using the ahci driver. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200326061144.2006522-1-mpe@ellerman.id.au
2020-01-25selftests/eeh: Bump EEH wait time to 60sOliver O'Halloran1-3/+7
Some newer cards supported by aacraid can take up to 40s to recover after an EEH event. This causes spurious failures in the basic EEH self-test since the current maximim timeout is only 30s. Fix the immediate issue by bumping the timeout to a default of 60s, and allow the wait time to be specified via an environmental variable (EEH_MAX_WAIT). Reported-by: Steve Best <sbest@redhat.com> Suggested-by: Douglas Miller <dougmill@us.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200122031125.25991-1-oohall@gmail.com
2019-09-05selftests/powerpc: Add basic EEH selftestOliver O'Halloran3-0/+167
Use the new eeh_dev_check and eeh_dev_break interfaces to test EEH recovery. Historically this has been done manually using platform specific EEH error injection facilities (e.g. via RTAS). However, documentation on how to use these facilities is haphazard at best and non-existent at worst so it's hard to develop a cross-platform test. The new debugfs interfaces allow the kernel to handle the platform specific details so we can write a more generic set of sets. This patch adds the most basic of recovery tests where: a) Errors are injected and recovered from sequentially, b) Errors are not injected into PCI-PCI bridges, such as PCIe switches. c) Errors are only injected into device function zero. d) No errors are injected into Virtual Functions. a), b) and c) are largely due to limitations of Linux's EEH support. EEH recovery is serialised in the EEH recovery thread which forces a). Similarly, multi-function PCI devices are almost always grouped into the same PE so injecting an error on one function exercises the same code paths. c) is because we currently more or less ignore PCI bridges during recovery and assume that the recovered topology will be the same as the original. d) is due to the limits of the eeh_dev_break interface. With the current implementation we can't inject an error into a specific VF without potentially causing additional errors on other VFs. Due to the serialised recovery process we might end up timing out waiting for another function to recover before the function of interest is recovered. The platform specific error injection facilities are finer-grained and allow this capability, but doing that requires working out how to use those facilities first. Basicly, it's better than nothing and it's a base to build on. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190903101605.2890-15-oohall@gmail.com