Age | Commit message (Collapse) | Author | Files | Lines |
|
hl_access_sram_dram_region() uses a region base which is set within the
hl_set_dram_bar() function. However, for SRAM access this function is
not called, and we end up with a wrong value of region base and with a
bad calculated address.
Fix it by initializing the region base value independently of whether
hl_set_dram_bar() is called or not.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
To avoid memory corruption in kernel memory while using timestamp
registration nodes, zero the kernel buff memory when its allocated.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Consecutive error protects a device reset loop from being triggered
due to h/w issues and enters the device into an unavailable state.
When user may cause the error, an unavailable state
will prevent the user from running its workloads.
The commit prevents entering consecutive state when a user context
is enabled.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Use graceful hard reset when detecting a CS timeout that requires a
device reset.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Add an option to control the timeout value for the driver's watchdog
of the reset process. The timeout represents the amount of the user
has to close his process once he gets a device reset notification from
the driver.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Calling hl_device_reset() for a hard reset will lead to a quite
immediate device reset and to killing user process.
For resets that follow errors, it disables the option to debug the
errors on both the device side and the user application side.
This patch adds a 'graceful hard reset' option and a new
hl_device_cond_reset() function.
Under some conditions, mainly if there is no user process or if he is
not registered to driver notifications, this function will execute hard
reset as usual.
Otherwise, the reset will be postponed and a notification will be sent
to user, to let him perform post-error actions and then to release the
device, after which reset will take place.
If device is not released by user in some defined time, a watchdog work
will execute the reset in any case.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Currently there is no verification whether the divisor is legal.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
As there are 2 types of user mappings, pmmu and hmmu, calculate
only the relevant mappings for the requested type.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Up until now the use-case in the driver was that the HBM is accessed
using the HBM BAR, yet the BAR sometimes cannot cover the whole HBM and
so we needed to set the BAR to other HBM offset.
Now we are facing the need to access other PCI memory regions that can
be covered by the HBM BAR.
To answer that we are allowing the caller to determine if the HBM BAR
need to be set or not regardless of the PCI memory region.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The code uses the pointer for trace purpose (without actually
dereference it) but still get static analysis warning.
This patch eliminate the warning.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The event notifier mechanism should not raise an empty
event (event equals zero).
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In case of HBM MMU page fault, capture its relevant mappings.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
'struct hl_device_reset_work' is used as a wrapper for the reset work
and its parameters, including the reset workqueue on which it runs.
In a future commit, another reset related work with similar parameters
is going to be added, but it won't use the reset workqueue.
As in any case there is a single reset workqueue, and to allow the resue
of this structure, move the reset workqueue to 'struct hl_device'.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Unregistering eventfd is for releasing host resources and doesn't
involve an access to the device. As such, there is no reason to disallow
it when device isn't operational.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
If reset upon device release is enabled, there is no need to check the
device idle status in hpriv_release(), because device is going to be
reset in any case.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
pf was an abbreviation for prefetch but because pf already stands
for 'physical function', we decided to change it to 'prefetch'.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Only the first page fault will be saved.
Besides the address which caused the page fault, the driver captures
all of the mmu user mappings.
User can retrieve this data via the new uapi (new opcode in INFO ioctl).
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This fixes sparse warning on doing cast to 32-bits
Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This event notification was compatible only with gaudi, where razwi
and page fault happens together.
To make it compatible with all ASICs, this refactor contains:
1. Razwi notification will only notify about razwi info.
New notification will be added in future patch, to retrieve data
about page fault error.
2. Changed razwi info structure to support all ASICs.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Use the simplified API that calculates distance between two devices.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Monitoring apps would like to query device state at any time so we
should allow it also during reset because it doesn't involve
accessing the h/w.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
If hl_cpu_accessible_dma_pool_alloc() fails, we should check
'req_cpu_addr', fix it.
Fixes: 0c88760f8f5e ("habanalabs/gaudi2: add secured attestation info uapi")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When doing sizeof() and giving as argument a dereference of
a pointer-to-a-pointer object, clang will issue a warning.
Eliminate the warning by passing struct <name>*
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
To be forward-backward compatible with the firmware in the initial
communication during preboot, we need to remove the validation of the
header size. This will allow us to add more fields to the
lkd_fw_comms_desc structure.
Instead of the validation of the header size, we just print warning
when some mismatch in descriptor has been revealed, and we calculate
the CRC base on descriptor size reported by the firmware instead of
calculating it ourselves.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
If the user wants to open the device, and the device is currently in
reset, the user will get an error from the open().
We don't need to display an error in the dmesg for that as it is
not a real error and we can spam the kernel log with this message.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
User will provide a nonce via the ioctl, and will retrieve
secured attestation data of the boot, generated using given
nonce.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
As a preparation for adding more errors to it,
change to more suitable name.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The code used the mmu mutex to protect access to the context's page
tables and invalidation of the MMU cache. Because pgt are per
context, the mmu mutex was a member of the context object.
The problem is that the device has a single MMU invalidation h/w
(per MMU). Therefore, the mmu mutex should not be a property of the
context but a property of the device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Current code does not takes into account the new DRAM region base
and so calculated address is wrong and can lead to crush.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Firmware now responds with a more detailed cpucp return codes.
Driver can now distinguish between error and debug return codes.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
F/W security status might change after every reset.
Add the reading of the preboot status to the hard reset sequence, which
among others reads this security indication.
As this preboot status reading includes the waiting for the preboot to
be ready, it can be removed from the CPU init which is done in a later
stage.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
cb_map_mem() uses gen_pool_alloc() to get virtual address for
mapping a CB.
The mapping is done in chunks of page size, so if the CB size is
larger, it is possible that the allocated virtual addresses won't
be consecutive.
User retrieves this device VA which returns the virtual address
in the first va_block. If there is a "hole" in the virtual addresses,
user can configure a HW block with a bad device VA.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
'Device activity open packet' should be sent outside of mutex as
there is no real necessity for a lock.
In addition 'device activity close packet' should be sent upon an
actual release of the device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
As part of the RAS that is done by the f/w, we should send a message
to the f/w when a user either acquires or releases the device.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
EEPROM errors reported by firmware are basically warnings and
should not fail the boot process.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Except Goya, none of our ASICs require context switch flow, hence we
enable this flow only where it is needed.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Set the addresses for userspace command buffer dynamically
instead of hard-coded. There is no reason for it to
be hard-coded.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This patch add tracepoints in the code for DMA allocation.
The main purpose is to be able to cross data with the map operations and
determine whether memory violation occurred, for example free DMA
allocation before unmapping it from device memory.
To achieve this the DMA alloc/free code flows were refactored so that a
single DMA tracepoint will catch many flows.
To get better understanding of what happened in the DMA allocations
the real allocating function is added to the trace as well.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This patch utilize the defined tracepoint to trace the MMU's pages
map/unmap operations.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This patch adds trace events for habanalabs driver to gain all the
benefits such an infrastructure can supply.
The following events were added:
- MMU map/unmap: to be able to track driver's memory allocations
- DMA alloc/free: to track our DMA allocation
the above trace points in conjunction will help us map the device memory
usage as well as to be able to track memory violations.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The calculation of the device DRAM base address before setting the
relevant PCIe BAR to point at it, has an assumption that this BAR is
used to access only the DRAM, and thus the covered DRAM size is a power
of 2.
In future ASICs it is not necessarily true, so need to update the
calculation to support also a non-power-of-2 size.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The original code tried to unmap a page that was not mapped as part of
the map page error path.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Instead of recalculating the cdev index, store it in a dedicated data
member. This data member is intended to be passed to other drivers using
the auxiliary bus infra and hence this new data member is necessary in
case that the calculation is changed in the future.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
the size of a block is always 'block->end - block->start + 1'
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order for the user to know if he is running on a secured device
or not, we add it also to the hw_ip info ioctl.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order for the user to know if he is running on a secured device
or not, a sysfs node is added.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Secured PCI ID will not be supported in new asics because the
security status can always be read from the f/w.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Several munmap() calls can be done or a mapped H/W block that has a
larger size than a page size.
Releasing the object should be done only when all mapped range is
unmapped.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Since hwmon fini code is common for all asics, unified it to common
function.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The current flow of halting the engine cores is implemented by command
buffers built by the user space and sent towards the Driver.
This current flow is broken since the user space does not know when
the cores actually halt as sending a workload is async op.
Therefore the application can not free the memory that is mapped
to the engine cores.
This new API allows the user space to control the running mode. The
API call is sync (returns after the cores are set to the
requested mode).
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|