summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs
AgeCommit message (Collapse)AuthorFilesLines
2021-06-21habanalabs/gaudi: refactor hard-reset related codeKoby Elbaz4-47/+56
There is code related to hard-reset, which is done in gaudi specific code. However, this code can be used by future ASICs and therefore it is better to move it to the common code section. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-21habanalabs/gaudi: add support for NIC DERROfir Bitton3-5/+16
We add support for NIC DERR ECC error events, in case this error is received a device reset will be performed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-21habanalabs: add validity check for signal csfarah kassabri3-24/+75
In preparation for a new feature that allows the user to reserve signals ahead of submissions, we need to change a current assumption in the code. Currently, the driver uses 2 SOBs to support signal CS. When the first SOB reaches max value, the driver switches to the other one and assumes that when it will need to switch back to the first one, all of the signals have already been handled. This assumption won't hold when the new feature will be added, because using signal reservation, the driver can reach the max SOB value very fast. The change is to add a validity check when submitting a signal CS, to make sure the previous SOB is available (all the signals attached to it indeed finished). Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-21habanalabs: get lower/upper 32 bits via maskingKoby Elbaz2-2/+2
fix multiple similar occurrences of the following sparse warning: 'warning: cast truncates bits from constant value (7ffc113000 becomes fc113000)' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-21habanalabs: allow reset upon device releaseOfir Bitton4-5/+30
We introduce a new type of reset which is reset upon device release. This reset is very similar to soft reset except the fact it is performed only upon device release and not upon user sysfs request nor TDR. The purpose of this reset is to make sure the device is returned to IDLE state after the current user has finished working with the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18debugfs: add skip_reset_on_timeout optionYuri Nudelman3-0/+9
To be able to debug long-running CS better, without changing the userspace code, we are adding a new option through debugfs interface to skip the reset of the device in case of CS timeout. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: fix typoZvika Yehudai1-1/+1
fix a spelling error in comment Signed-off-by: Zvika Yehudai <zyehudai@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: correct driver events numberingOfir Bitton3-16/+20
Currently driver sends fc interrupt id to FW instead of using cpu interrupt id. We intend to fix that and keep backward compatibility by using the same interrupt values. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: remove a rogue #ifdefOded Gabbay1-5/+0
There was a rogue #ifdef that crept into the upstream code for backwards compatibility which isn't needed of course. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: print last QM PQEs on errorOhad Sharabi2-37/+182
In case QMAN has an error and stop_on_err is true, print specific information of the "offending" command buffer batch. If the error occurred on one of the higher CPs, the CQ pointer and size will be printed along with (up to) last 8 PQEs of the stream. If the error occurred in the lower CP, the CQ pointer and size will be printed along with (up to) last 8 PQEs of ALL upper CPs as we have no way to know which upper CP sent the job there. This is done so higher SW levels will be able to debug their CS by extracting the raw data of the offending command buffer batch and examine those offline to detect the issue. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/goya: add '__force' attribute to suppress false alarmKoby Elbaz1-1/+1
fix (suppress) the following sparse warnings: 'warning: cast removes address space of expression' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: added open_stats info ioctlYuri Nudelman4-0/+35
In a system with multiple ASICs, there is a need to provide monitoring tools with information on how long a device was opened and how many times a device was opened. Therefore, we add a new opcode to the INFO ioctl to provide that information. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: set the correct rc in case of errKoby Elbaz1-1/+3
fix the following smatch warnings: gaudi_internal_cb_pool_init() warn: missing error code 'rc' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: update coresight configurationTal Albo1-2/+2
Update STMTCSR and STMSYNCR values in order to reduce amount of sync packets Signed-off-by: Tal Albo <talbo@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: remove node from list before freeing the nodeKoby Elbaz2-0/+2
fix the following smatch warnings: goya_pin_memory_before_cs() warn: '&userptr->job_node' not removed from list gaudi_pin_memory_before_cs() warn: '&userptr->job_node' not removed from list Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: set rc as 'valid' in case of intentional func exitKoby Elbaz2-3/+7
fix the following smatch warnings: hl_fw_static_init_cpu() warn: missing error code 'rc' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: zero complex structures using memsetKoby Elbaz1-1/+2
fix the following sparse warnings: 'warning: Using plain integer as NULL pointer' 'warning: missing braces around initializer' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: print more info when failing to pin user memoryTomer Tayar1-2/+2
pin_user_pages_fast() might fail and return a negative number, or pin less pages than requested and return the number of the pages that were pinned. For the latter, it is informative to print also the memory size and the number of requested pages. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: Fix an error handling path in 'hl_pci_probe()'Christophe JAILLET1-0/+1
If an error occurs after a 'pci_enable_pcie_error_reporting()' call, it must be undone by a corresponding 'pci_disable_pcie_error_reporting()' call, as already done in the remove function. Fixes: 2e5eda4681f9 ("habanalabs: PCIe Advanced Error Reporting support") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: print firmware versionsOded Gabbay1-10/+95
Firmware in habanalabs devices is composed of several components. During device initialization, we read these versions from the device. Print them during device initialization to allow better visibility in automated systems. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: add hard reset timeout for PLDMOmer Shpigelman2-2/+8
Hard reset flow on PLDM might take more than 2 minutes. Hence add a dedicated hard reset timeout of 6 minutes for PLDM. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: enable dram scramble before linux f/wBharat Jauhari4-5/+21
In current code, for dynamic f/w loading flow, DRAM scrambling is enabled post Linux fit image is loaded to the card. This can cause the device CPU to go into reset state. The correct sequence should be: 1. Load boot fit image 2. Enable scrambling 3. Load Linux fit image This commit aligns the DRAM scrambling enabling with the static f/w load flow. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: enable stop on error for all QMANs and enginesOfir Bitton1-0/+1
If there is an error in the QMAN/engine, there is no point of trying to continue running the workload. It is better to stop to allow the user to debug the program. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: report EQ fault during heartbeatOhad Sharabi2-1/+27
In case we have EQ fault we would like to know about it. For this, a status bitmask was added in which EQ_FAULT bit is set by FW in case of EQ fault. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: small code refactoringKoby Elbaz4-7/+7
Use datatype defines instead of hard coded values, and rename set_fixed_properties function. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: use standard error codesOded Gabbay1-6/+6
When there is an ECC error in the HBM, return a standard error code, -EIO in this case, and not a positive value. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: fix mask to obtain page offsetOhad Sharabi1-3/+11
When converting virtual address to physical we need to add correct offset to the physical page. For this we need to use mask that include ALL bits of page offset. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: skip valid test for boot_dev_sts regsOhad Sharabi2-41/+40
Get rid of the need to check if boot_dev_sts is valid on every access to value read from these registers. This is done by storing the register value in hdev props ONLY if register is enabled. This way if register is NOT enabled all capability bits will not be set. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: reset device upon FD close if not idleOfir Bitton4-12/+19
If device is not idle after user closes the FD we must reset device as next user that will try to open FD will encounter a non-functional device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: add debug flag to prevent failure on timeoutYuri Nudelman2-5/+25
Sometimes it is useful to allow the command to continue running despite the timeout occurred, to differentiate between really stuck or just very time consuming commands. This can be achieved by passing a new debug flag alongside the cs, HL_CS_FLAGS_SKIP_RESET_ON_TIMEOUT. Anyway, if the timeout occurred, a warning print shall be issued, however this shall not fail the submission. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: add FW alive event supportOfir Bitton4-1/+32
In order for driver to be aware of process or thread crashes inside GAUDI's CPU, we introduce a new event which contains all relevant information. Upon event reception, driver will dump information and will reset the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: don't use disabled ports in collective waitOfir Bitton1-148/+71
In the collective wait, we put jobs on the QMANs of all the NICs. The code takes into account if a port is disabled only in case of PCI card. When this info arrives from the f/w, the code doesn't take it into account, and it tries to schedule jobs on NICs that aren't enabled and thats a bug. To fix this, after the f/w sends us the list of disabled ports, we update the state of the QMANs according to that list. In addition, we need to update the HW_CAP bits so the collective wait operation will not try to use those QMANs. We also need to update the collective master monitor mask. Moreover, we need to add a protection for such future cases and in case the user will try to submit work to those QMANs. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: update to latest f/w specsOded Gabbay2-10/+33
Update the firmware interface files to their latest version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: split host irq interfaces towards FWOfir Bitton4-8/+44
Current implementation uses a single interrupt interface towards FW, this interface is causing races between interrupt types. We split this interface to interface per interrupt type. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: prefer ASYNC device probingOded Gabbay1-1/+5
There is no dependency when probing multiple devices so indicate to the kernel that it can probe our devices in ASYNC fashion. This shortens insmod of the driver from ~2 minutes to 20 seconds on a system with 8 devices. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: add ARB to QM stop on error masksTomer Tayar2-15/+17
Update the QM stop on error masks to also stop on ARB errors. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: don't use nic_ports_mask in computeOded Gabbay1-12/+12
nic_ports_mask is used by the networking part of the driver. In the compute part, we use the HW_CAP bits to select what is active and what is not. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: set the correct cpu_id on MME2_QM failureKoby Elbaz1-1/+1
This fix was applied since there was an incorrect reported CPU ID to GIC such that an error in MME2 QMAN aliased to be an arriving from DMA0_QM. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: refactor reset codeOded Gabbay2-18/+34
After all the latest changes to the reset code, there were some redundancy and errors in the flows. If the Linux FIT is loaded to the ASIC CPU, we need to communicate with it only via GIC. If it is not loaded, we need to either use COMMS protocol (for newer f/w) or MSG_TO_CPU register (for older f/w). In addition, if we halted the device CPU then we need to mark that the driver will do the reset, regardless of the capabilities. Also, to prevent false errors, we need to keep track whether the device CPU was already halted. If so, we shouldn't try to halt it again. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: track security status using positive logicOhad Sharabi8-50/+51
Using negative logic (i.e. fw_security_disabled) is confusing. Modify the flag to use positive logic (fw_security_enabled). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: use COMMS to reset device / halt CPUKoby Elbaz3-4/+39
This is needed because legacy FW 'communication' protocol will soon become obsolete. Because COMMS is a boot protocol, communicating through it is supported only until Linux is loaded to the device CPU, where in that case we will fallback to the former implementation. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: disable GIC usage if security is enabledKoby Elbaz3-15/+29
Security is set based on PCI ID, and after reading preboot status bits. GIC usage is set in both scenarios since GIC can't be used when security is enabled. Moreover, writing to GIC/SP is enabled only after Linux is fully loaded. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: read preboot status bits in an earlier stageKoby Elbaz1-8/+2
On newer releases, host won't be able to trigger an interrupt directly to the ASIC GIC controller. To be able to decide whether GIC can/not be used, we must read device's preboot status bits in a stage that precedes the possible first use of GIC (when device is in dirty state). Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: check running index in eqe controlOded Gabbay5-5/+49
To harden the event queue mechanism, we add a running index to the control header of the entry. The firmware writes the index in each entry and the driver verifies that the index of the current entry is larger by 1 of the index of the previous entry. In case it isn't, the driver will treat the entry as if it wasn't valid (it won't process it but won't skip it). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: set memory scrubbing to disabled by defaultOded Gabbay1-2/+2
Scrubbing memory after every unmap is very costly in terms of performance. If a user wants it he can enable it but the default should prioritize performance. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: do not move HBM bar if iATU done by FWOfir Bitton1-0/+20
As iATU configuration is done by FW, driver should not try and move HBM bar. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: read GIC sts after FW is loadedKoby Elbaz3-28/+57
Reading of GIC privileged status will be done after F/W is loaded, because privileged GIC capability is only available with the correct ARMCP version, and after it's loaded. Such versions necessarily support COMMS, so GIC alternatives (SP regs) will be read directly from dynamic regs. As well, initiation of DMA QMANs will occur after F/W is loaded since it depends on GIC configuration. In case F/W isn't loaded there's no problem since either way there won't be any GIC IRQ handling. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: check if asic secured with asic typeOhad Sharabi1-1/+1
Fix issue in which the input to the function is_asic_secured was device PCI_IDS number instead of the asic_type enumeration. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs/gaudi: send hard reset cause to prebootKoby Elbaz6-8/+208
LKD should provide hard reset cause to preboot prior to loading any FW components (in case needed). Current implementation is based on the new FW 'COMMS' protocol In cased 'COMMS' is disabled - reset cause won't be sent. Currently, only 2 reset causes are shared: HEARTBEAT & TDR. Sending the reset cause will provide the missing watchdog info that the firmware needs to provide to the BMC. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18habanalabs: notify before f/w loadingOded Gabbay1-0/+3
An information print notifying on starting to load the f/w was removed by mistake when moving to the new dynamic f/w loading mechanism. Restore that print as the F/W loading usually takes between 10 to 20 seconds and this print helps the user know the status of the driver load. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>