summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
authorOded Gabbay <ogabbay@kernel.org>2021-04-02 01:43:18 +0300
committerOded Gabbay <ogabbay@kernel.org>2021-04-09 14:09:24 +0300
commit639781dcab8261f39c7028db4ed4fd0e760d69fa (patch)
treeadaa1b4c4e4eb50a89fe3a0de2c3ddda66795287 /Documentation
parente65448faf4cfeddd95a0e661aabf2fae1efc9831 (diff)
downloadlinux-639781dcab8261f39c7028db4ed4fd0e760d69fa.tar.xz
habanalabs/gaudi: add debugfs to DMA from the device
When trying to debug program, the user often needs to dump large parts of the device's DRAM, which can reach to tens of GBs. Because reading from the device's internal memory through the PCI BAR is extremely slow, the debug can take hours. Instead, we can provide the user to copy data through one of the DMA engines. This will make the operation much faster. Currently, only GAUDI is supported. In GAUDI, we need to find a PCI DMA engine that is IDLE and set the DMA as secured to be able to bypass our MMU as we currently don't map the temporary buffer to the MMU. Example bash one-line to dump entire HBM to file (~2 minutes): for (( i=0x0; i < 0x800000000; i+=0x8000000 )); do \ printf '0x%x\n' $i | sudo tee /sys/kernel/debug/habanalabs/hl0/addr ; \ echo 0x8000000 | sudo tee /sys/kernel/debug/habanalabs/hl0/dma_size ; \ sudo cat /sys/kernel/debug/habanalabs/hl0/data_dma >> hbm.txt ; done Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/debugfs-driver-habanalabs68
1 files changed, 53 insertions, 15 deletions
diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
index f9e233cbdc37..c78fc9282876 100644
--- a/Documentation/ABI/testing/debugfs-driver-habanalabs
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -82,6 +82,24 @@ Description: Allows the root user to read or write 64 bit data directly
If the IOMMU is disabled, it also allows the root user to read
or write from the host a device VA of a host mapped memory
+What: /sys/kernel/debug/habanalabs/hl<n>/data_dma
+Date: Apr 2021
+KernelVersion: 5.13
+Contact: ogabbay@kernel.org
+Description: Allows the root user to read from the device's internal
+ memory (DRAM/SRAM) through a DMA engine.
+ This property is a binary blob that contains the result of the
+ DMA transfer.
+ This custom interface is needed (instead of using the generic
+ Linux user-space PCI mapping) because the amount of internal
+ memory is huge (>32GB) and reading it via the PCI bar will take
+ a very long time.
+ This interface doesn't support concurrency in the same device.
+ In GAUDI and GOYA, this action can cause undefined behavior
+ in case the it is done while the device is executing user
+ workloads.
+ Only supported on GAUDI at this stage.
+
What: /sys/kernel/debug/habanalabs/hl<n>/device
Date: Jan 2019
KernelVersion: 5.1
@@ -90,6 +108,24 @@ Description: Enables the root user to set the device to specific state.
Valid values are "disable", "enable", "suspend", "resume".
User can read this property to see the valid values
+What: /sys/kernel/debug/habanalabs/hl<n>/dma_size
+Date: Apr 2021
+KernelVersion: 5.13
+Contact: ogabbay@kernel.org
+Description: Specify the size of the DMA transaction when using DMA to read
+ from the device's internal memory. The value can not be larger
+ than 128MB. Writing to this value initiates the DMA transfer.
+ When the write is finished, the user can read the "data_dma"
+ blob
+
+What: /sys/kernel/debug/habanalabs/hl<n>/dump_security_violations
+Date: Jan 2021
+KernelVersion: 5.12
+Contact: ogabbay@kernel.org
+Description: Dumps all security violations to dmesg. This will also ack
+ all security violations meanings those violations will not be
+ dumped next time user calls this API
+
What: /sys/kernel/debug/habanalabs/hl<n>/engines
Date: Jul 2019
KernelVersion: 5.3
@@ -154,6 +190,16 @@ Description: Displays the hop values and physical address for a given ASID
e.g. to display info about VA 0x1000 for ASID 1 you need to do:
echo "1 0x1000" > /sys/kernel/debug/habanalabs/hl0/mmu
+What: /sys/kernel/debug/habanalabs/hl<n>/mmu_error
+Date: Mar 2021
+KernelVersion: 5.12
+Contact: fkassabri@habana.ai
+Description: Check and display page fault or access violation mmu errors for
+ all MMUs specified in mmu_cap_mask.
+ e.g. to display error info for MMU hw cap bit 9, you need to do:
+ echo "0x200" > /sys/kernel/debug/habanalabs/hl0/mmu_error
+ cat /sys/kernel/debug/habanalabs/hl0/mmu_error
+
What: /sys/kernel/debug/habanalabs/hl<n>/set_power_state
Date: Jan 2019
KernelVersion: 5.1
@@ -161,6 +207,13 @@ Contact: ogabbay@kernel.org
Description: Sets the PCI power state. Valid values are "1" for D0 and "2"
for D3Hot
+What: /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
+Date: Mar 2020
+KernelVersion: 5.6
+Contact: ogabbay@kernel.org
+Description: Sets the stop-on_error option for the device engines. Value of
+ "0" is for disable, otherwise enable.
+
What: /sys/kernel/debug/habanalabs/hl<n>/userptr
Date: Jan 2019
KernelVersion: 5.1
@@ -175,18 +228,3 @@ KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Displays a list with information about all the active virtual
address mappings per ASID and all user mappings of HW blocks
-
-What: /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
-Date: Mar 2020
-KernelVersion: 5.6
-Contact: ogabbay@kernel.org
-Description: Sets the stop-on_error option for the device engines. Value of
- "0" is for disable, otherwise enable.
-
-What: /sys/kernel/debug/habanalabs/hl<n>/dump_security_violations
-Date: Jan 2021
-KernelVersion: 5.12
-Contact: ogabbay@kernel.org
-Description: Dumps all security violations to dmesg. This will also ack
- all security violations meanings those violations will not be
- dumped next time user calls this API