Age | Commit message (Collapse) | Author | Files | Lines |
|
Due to a H/W issue, AXI drain event does not include a read/write
indication, hence we remove this print.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The glbl error cause handling has a wrong assumption that all error
bits are consecutive.
Fix the handling to check all relevant error bits per ASIC.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
interrupt info
The FW interrupt info for a PCIe addr_dec event is set correctly, so
check for either global errors or razwi according to the indications
there.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The hop size related properties is a MMU properties and not
asic properties.
As for PMMU and HMMU we could have different sizes.
Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
to device
gaudi2_memset_memory_chunk_using_edma_qm() calls the access_dev_mem()
ASIC function, but ignores its return value.
Add this missing check.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Currently the HMMU page tables reside in the host memory,
which will cause host access from the device for every page walk.
This can affect PCIe bandwidth in certain scenarios.
To prevent that problem, HMMU page tables will be moved to the device
memory so the miss transaction will read the hops from there instead of
going to the host.
Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
User interrupts are MSIx interrupts coming from Gaudi2, that have
specific range of IDs and are assigned to the sole use of the user
process that opened the Gaudi2 device (reminder: there can be only
a single user process running on Gaudi2 at any given time).
The interrupts are allocated and managed by the driver and therefore,
the user expects the driver to initialize them properly, which also
includes setting the affinity to the related CPU cores of the
device's NUMA node to get maximum performance.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Part of the undefined opcode data is updated in
gaudi2_handle_qman_err_generic() and some in
handle_lower_qman_data_on_err().
However, the 'write_enable' flag is checked only in
gaudi2_handle_qman_err_generic(), and information of more than a single
error can be mixed there.
Moreover, handle_lower_qman_data_on_err() is called only for the lower
QMAN, so for an error in the upper QMAN there is only a partial info.
Move all the data update to be done in a single place, protected by the
'write_enable' flag.
As mainly the lower QMAN's info is interesting, avoid saving the partial
info for the upper QMAN.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
QM instructions are in multiples of 64 bits and the command type is in
the upper bits of first QWORD.
To make it clearer that an undefined command is due to a type of 0x0,
always print all 64 bits and add a zero padding if needed.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The QM CQ PTR_LO/PTR_HI/TSIZE registers are for pushing a CQ entry, and
although they are updated by HW even when descriptors are fetched by PQ
and CB addresses are fed into CQ, the correct registers to use when
dumping the CQ info are the ones with the _STS suffix.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Upon a QM error, the address/size from both the CQ and the ARC_CQ are
printed, although the instruction that led to the error was received
from only one of them.
Moreover, in case of a QM undefined opcode, only one of these
address/size sets will be captured based on the value of ARC_CQ_PTR.
However, this value can be non-zero even if currently the CQ is used, in
case the CQ/ARC_CQ are alternately used.
Under the assumption of having a stop-on-error configuration, modify to
use CP_STS.CUR_CQ field to get the relevant CQ for the QM error.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
currently the undefined opcode event bit in set only for lower cp and
only if 'write_enable' is true. It should be set anyway and for all
streams in order to report that event to userspace.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When a PCIe AXI drain event happens, it is possible that the driver
cannot access the device through PCIe, and therefore cannot send a
hard-reset request to FW.
Starting from FW version 1.13, FW will initiate a hard-reset in such
a case without waiting for a reset request from the driver.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
event_types_num received from the user can be 0. In that case, the
event_mask should be 0.
In addition, to create a correct mask we need to match the number
of event types to the bit location such that bit 0 represents a single
event type, bit 1 represents 2 types and so on.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
|
Traces the DMA [un]map_sgtable using the new traces we added.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
asic_dma_{un}map_single() asic-specific functions are no longer called
from the common code, so delete these functions.
In addition, delete the gaudi2 implementation as they are also not
called.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
|
There are cases such when FW runs MBIST, that preboot is expected to take
longer than the usual. In such cases the firmware reports status
SECURITY_READY/IN_PREBOOT and we extend the timeout waiting for it.
This is currently implemented for Gaudi2 only.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Currently driver maintain one list for both pending user interrupts
which seeks to wait till CQ reaches it's target value and also the ones
that seeks to get timestamp records when the CQ reaches it's target
value.
This causes delay in handling the waiters which gets higher priority
than the timestamp records.
In order to solve this, let's split the list into two,
one for each case and each one is protected by it's own spinlock.
Waiters will be handled within the interrupt context first,
then the timestamp records will be set.
Freeing the timestamp related memory will be handled in a workqueue.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This is the second patch for applying the eq health check mechanism
which will add support for the interrupt flow for gaudi2 asic.
More info about the interrupt mechanism:
set a dedicated msix for the eq error interrupt, and add
interrupt handler for it.
when FW detects some issue with EQ like EQ_FULL, it'll
raise that interrupt and driver should reset the device.
Driver will inform the FW which msix index to use through
the already existing handshake mechanism which will
send msix info message to fw.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Add mechanism for fw eq health check. this will be done using two flows:
using the heartbeat mechanism and raising a dedicated interrupt to
indicate an eq failure like EQ full.
This patch will add implementation for the eq heartbeat for gaudi2 asic.
More info about the heartbeat mechanism:
Expand the heartbeat mechanism to monitor a new event that
will be sent from FW upon receiving heartbeat message.
that way driver can know that the eq is working or not.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Print to kernel log any device power mode changes events reported by
the FW.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The CPUCP interface is moved to a shared folder outside of accel as
a pre-requisite to upstream the NIC drivers that will also include
this file.
Signed-off-by: David Meriin <dmeriin@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
During ECC event handling, Memory wrapper id was mistakenly
printed as block id. Fix the print and in addition fetch the actual
block-id from firmware.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
coresight ETF blocks have different size. As a result, sync packets
need to be aligned based on fifo size.
Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
gaudi2_psoc_razwi_get_engines()
snprintf() returns the "number of characters which *would* be generated for
the given input", not the size *really* generated.
In order to avoid too large values for 'str_size' (and potential negative
values for "PSOC_RAZWI_ENG_STR_SIZE - str_size") use scnprintf()
instead of snprintf().
Fixes: c0e6df916050 ("accel/habanalabs: fix address decode RAZWI handling")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
We see that `prop->cpucp_info.card_name` is supposed to be
NUL-terminated based on its usage within `__hwmon_device_register()`
(wherein it's called "name"):
| if (name && (!strlen(name) || strpbrk(name, "-* \t\n")))
| dev_warn(dev,
| "hwmon: '%s' is not a valid name attribute, please fix\n",
| name);
A suitable replacement is `strscpy_pad` [2] due to the fact that it
guarantees both NUL-termination and NUL-padding on its destination
buffer.
NUL-padding on `prop->cpucp_info.card_name` is not strictly necessary as
`hdev->prop` is explicitly zero-initialized but should be used
regardless as it gets copied out to userspace directly -- as per Kees'
suggestion.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When config_etr or config_etf are called we need to validate the
parameters that are passed into them to make sure the requested
operation is valid.
Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Because firmware is blocking PSOC_ARC_DBG, we need to disable access
to this block.
Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The soft reset has transitioned to CPUCP packet instead of plain
register write and is about to be removed from the struct cpu_dyn_regs.
As a preparation for removing the cpu_rst_status field from
struct cpu_dyn_regs, switch to use the plain macro - this keeps the
backward compatibility.
Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
User gets notification for every engine error report, but he still
lacks the exact engine information. Hence, we allow user to query
for the exact engine reported an error.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
If we are initializing the kernel context when we have a Gaudi2 device,
we don't need to do any late initializing of that context with
specific Gaudi2 code.
Reviewed-by: Ofir Bitton <obitton@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The soft reset has transitioned to CPUCP packet instead of plain
register write and is about to be removed from the struct cpu_dyn_regs.
As a preparation for removing the gic_host_soft_rst_irq field from
struct cpu_dyn_regs, switch to use the plain macro - this keeps the
backward compatibility.
Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
As TPC kernels now must use those registers we unsecure them.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The F/W dynamically allocates one of the PSOC scratchpad registers for
the engine cores, so they can raise events towards the F/W.
To allow the engine cores to access this register, this register must be
non-secured.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Implement razwi handling for arc farm and add it to arc farm sei
event handler.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Because in this case we have only a single possible cause, we can
safely stop fetching the cause from firmware.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order for user to be aware of undefined opcode events, we must
store all relevant information and notify user about the failure.
The user will fetch the stored info via info ioctl.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order to increase reliability of the event queue interface,
we apply to Gaudi2 the same mechanism we have in Gaudi1.
The extra validation is basically checking that the received
event index matches the expected index.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order to utilize Engine Barrier padding, user must have access to
this register set.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
addr_dec info should always be fetched, regardless of cause value.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When receiving page fault from hmmu, the captured address is scrambled
both by HW and by driver. The driver part is unscrambled but the HW
part isn't getting unscrambled.
To avoid declaring wrong address, the HW scrambled part will be
masked.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
By default, the upper QMANs are not used, and instead engines ARCs
access the lower QMANs directly.
Errors for upper QMANs are therefore not expected, and the debug print
of the PQ entries is not needed.
Modify the QMAN debug data print on errors to include only information
for the lower QMAN.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The QMAN GLBL_ERR_STS_4 register has indications for errors also in the
lower CQ and the ARC CQ, and not just for errors in the lower CP.
Modify the relevant define/struct and the related print to use "lower
QM" instead of "lower CP".
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When receiving sei interrupt from tpc or decoder, we need to check
the binning mask because if the engine is binned, the razwi info
won't be in the router of the binned engine, instead will be in the
router of the substitute engine.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Initially, the driver used to read the error cause data directly from
the ASIC. However, the FW now clears it before the driver could read
it. Therefore we should use the error cause data that is extracted by
the FW.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The gaudi2_get_tpc_idle_status() function returned the incorrect variable
so it always returned true.
Fixes: d85f0531b928 ("accel/habanalabs: break is_idle function into per-engine sub-routines")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Due to missing indication of address decode source (LBW/HBW bus),
we should always try and fetch extended information.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Use a straight forward approach to get a conditional result.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
For some reason the last possible tpc interrupt cause in
gaudi2_tpc_interrupts_cause is missing from the code.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|