summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs
AgeCommit message (Collapse)AuthorFilesLines
2023-01-26habanalabs: move driver to accel subsystemOded Gabbay424-333685/+0
Now that we have a subsystem for compute accelerators, move the habanalabs driver to it. This patch only moves the files and fixes the Makefiles. Future patches will change the existing code to register to the accel subsystem and expose the accel device char files instead of the habanalabs device char files. Update the MAINTAINERS file to reflect this change. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/uapi: move uapi file to drmOded Gabbay14-14/+15
Move the habanalabs.h uapi file from include/uapi/misc to include/uapi/drm, and rename it to habanalabs_accel.h. This is required before moving the actual driver to the accel subsystem. Update MAINTAINERS file accordingly. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: fix dma-buf release handling if dma_buf_fd() failsTomer Tayar1-2/+8
The dma-buf private object is freed if a call to dma_buf_fd() fails, and because a file was already associated with the dma-buf in dma_buf_export(), the release op will be called and will use this object. Mark the 'priv' field as NULL in this case, and avoid accessing it from the release op. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/gaudi2: dump event description even if no causeOfir Bitton1-2/+2
In order to have the no-cause error print be more informative, we add the event description in addition to the event id. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: pass-through request from user to f/wfarah kassabri4-7/+135
Add a uAPI, as part of the INFO IOCTL, to allow users to send requests directly to f/w, according to a pre-defined set of opcodes that the f/w exposes. The f/w will put the result in a kernel-allocated buffer, which the driver will then copy to the user-supplied buffer. This will allow f/w tools to communicate directly with the f/w without the need to add a new uAPI to the driver for each new type of request. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: support receiving ascii message from preboot f/wTal Cohen2-15/+78
An Ascii message that is sent from preboot towards the driver will indicate the specific error that occurred on the f/w. This commit supports that message and parse the ascii string in order to print it into the kernel log The commit also changes the way the descriptor struct is declared. While its size increased (it now above 1024 bytes), it will be allocated by using kmalloc instead of stack declaration. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: fix asic-specific functions documentationOhad Sharabi1-1/+2
- Add missing documentation of set DRAM props - fix typo Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: fix wrong variable type used for vzallocfarah kassabri1-1/+2
vzalloc expects void* and not void __iomem*. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/gaudi2: wait for preboot ready if HW state is dirtyOhad Sharabi3-2/+27
Instead of waiting for BTM indication we should wait for preboot ready. Consider the below scenario: 1. FW update is being triggered - setting the dirty bit 2. hard reset will be triggered due to the dirty bit 3. FW initiates the reset: - dirty bit cleared - BTM indication cleared - preboot ready indication cleared 4. during hard reset: - BTM indication will be set - BIST test performed and another reset triggered 5. only after this reset the preboot will set the preboot ready When polling on BTM indication alone we can lose sync with FW while trying to communicate with FW that is during reset. To overcome this we will always wait to preboot ready indication. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: put fences in case of unexpected wait statusTomer Tayar1-1/+2
Need to put fences even if an unexpected status value is received while waiting for a fence. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: fix handling of wait CS for interrupting signalsTomer Tayar1-6/+11
The -ERESTARTSYS return value is not handled correctly when a signal is received while waiting for CS completion. This can lead to bad output values to user when waiting for a single CS completion, and more severe, it can cause a non-stopping loop when waiting to multi-CS completion and until a CS timeout. Fix the handling and exit the waiting if this return value is received. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: fix dmabuf to export only required sizeOhad Sharabi2-9/+28
This patch fixes a bug that was found in the dmabuf flow. Bug description as found on Gaudi2 device: 1. User allocates 4MB of device memory - Note that although the allocation size was 4MB the HMMU allocated a full page of 768MB to back the request. - The user gets a memory handle that points to a single page (768MB) - Mapping the handle, the user gets virtual address to the start of the page. 2. User exports the buffer 3. User registers the exported buffer in the importer. This flow has a callback to the exporter which in turn converts the phys_page_pack to an SG list for the importer. This SG list is of single entry of size 768MB. However, the size that was passed to the importer was only 4MB. The solution for this is to make sure the importer gets exposure only to the exported size. This will be done by fixing the SG created by the exporter to be of the total size of the actual exported memory requested by the user. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: modify export dmabuf APIOhad Sharabi2-25/+203
A previous commit deprecated the option to export from handle, leaving the code with no support for devices with virtual memory. This commit modifies the export API in a way that unifies the uAPI to user address for both cases (i.e. with and without MMU support) and add the actual support for devices with virtual memory. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: helper function to validate export paramsOhad Sharabi1-35/+44
Validate export parameters in a dedicated function instead of in the main export flow. This will be useful later when support to export dmabuf for devices with virtual memory will be added. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: remove support to export dmabuf from handleOhad Sharabi2-139/+9
The API to the user which allows exporting DMA buffer from handle is deprecated here. It was never used as it is relevant only for Gaudi2, and the user stack has yet to add support for dmabuf in Gaudi2. Looking forward, a modified API to export DMA buffer for ASICs that supports virtual memory will be added. Until the new API will be ready- exporting DMA buffer will not be supported for ASICs with virtual memory. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: set log level for descriptor validation to debugfarah kassabri1-2/+2
This warning doesn't have real consequences, and therefore can be printed in debug level. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: trace COMMS protocolOhad Sharabi1-0/+31
Call COMMS tracepoints from within the dynamic CPU FW load. This can help debug failures or delays in the dynamic FW load flow. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/gaudi2: support abrupt device reset eventOfir Bitton3-0/+4
In certain scenarios, firmware might encounter a fatal event for which a device reset is required. Hence, a proper notification is needed for driver to be aware and initiate a reset sequence. In secured environments the reset will be performed by firmware without an explicit request from the driver. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: skip device idle check in hpriv_release if in resetTomer Tayar1-2/+4
When user context is released and hpriv_release() is called, there is a device idle status check, to understand if user has left the device not idle and then a reset is required. However, if the user process is killed because of device hard reset, the device at this point would always be not idle, because the device engines were already forcefully halted. Modify hpriv_release() to skip the idle check if reset is in progress. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: adjacent timestamps should be more accurateTamir Gilad-Raz1-3/+3
timestamp events that expire on the same interrupt will get the same timestamp value Signed-off-by: Tamir Gilad-Raz <tgiladraz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/gaudi2: remove duplicated event printsOfir Bitton1-190/+149
In order to reduce error log, we try to minimize the dumped rows while keeping all relevant error info. In addition we completely remove clock throttling debug logs. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/gaudi2: count interrupt causesOfir Bitton1-109/+252
During event handling we extract interrupt cause and count it. In case we could not find any cause we should add proper error. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: update DRAM props according to preboot dataOhad Sharabi1-0/+4
If the f/w reports the binning masks at the preboot stage, the driver must align its DRAM properties according to the new information. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: fix double assignment in MMU V1Marco Pagani1-1/+0
Removing double assignment of the hop2_pte_addr variable in dram_default_mapping_fini(). Dead store reported by clang-analyzer. Signed-off-by: Marco Pagani <marpagan@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: make set_dram_properties an ASIC functionOhad Sharabi4-1/+15
As ASICs are evolving, we will need to update the DRAM properties at various points because we may get different information from the f/w at different points of the initialization. This ASIC function is a foundation for this capability. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: use dev_dbg() when hl_mmap_mem_buf_get() failsTomer Tayar1-2/+1
As hl_mmap_mem_buf_get() is called also from IOCTLs which can have a bad handle from user, modify the print for "no match to handle" to use dev_dbg(). Calls to this function which are not dependent on user, already have an error print when the function fails. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: don't allow user to destroy CB handle more than onceTomer Tayar5-4/+32
The refcount of a CB buffer is initialized when user allocates a CB, and is decreased when he destroys the CB handle. If this refcount is increased also from kernel and user sends more than one destroy requests for the handle, the buffer will be released/freed and later be accessed when the refcount is put from kernel side. To avoid it, prevent user from destroying the handle more than once. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: don't notify user about clk throttling due to powerOfir Bitton2-6/+8
As clock throttling due to high power consumption can happen very frequently and there is no real reason to notify the user about it, we skip this notification in all asics. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: abort waiting user threads upon errorTomer Tayar3-3/+28
User should close the FD when being notified about an error, after which a device reset takes place. However, if the user has pending threads that wait for completions, the device release won't be called and eventually the watchdog timeout will expire, leading to hard reset and killing the user process. To avoid it, abort such waiting threads right after the error notification, and block following waiting operations. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: remove releasing of user threads from device releaseTomer Tayar1-5/+0
The device file is not in use when hl_device_release() is called, and there aren't any user threads that use IOCTLs to wait for interrupts. Therefore there is no need to release them at this point. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs: read binning info from prebootfarah kassabri3-10/+50
Sometimes we need the binning info at a very early state of the driver initialization. Therefore, support was added in preboot to provide the binning info as part of the f/w descriptor and the driver can now use that. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26habanalabs/gaudi2: fix BMON 3rd address rangetal albo1-4/+4
Fix programming incorrect value of address range Signed-off-by: tal albo <talbo@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-12-16Merge tag 'char-misc-6.2-rc1' of ↵Linus Torvalds19-512/+1184
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the large set of char/misc and other driver subsystem changes for 6.2-rc1. Nothing earth-shattering in here at all, just a lot of new driver development and minor fixes. Highlights include: - fastrpc driver updates - iio new drivers and updates - habanalabs driver updates for new hardware and features - slimbus driver updates - speakup module parameters added to aid in boot time configuration - i2c probe_new conversions for lots of different drivers - other small driver fixes and additions One semi-interesting change in here is the increase of the number of misc dynamic minors available to 1048448 to handle new huge-cpu systems. All of these have been in linux-next for a while with no reported problems" * tag 'char-misc-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (521 commits) extcon: usbc-tusb320: Convert to i2c's .probe_new() extcon: rt8973: Convert to i2c's .probe_new() extcon: fsa9480: Convert to i2c's .probe_new() extcon: max77843: Replace irqchip mask_invert with unmask_base chardev: fix error handling in cdev_device_add() mcb: mcb-parse: fix error handing in chameleon_parse_gdd() drivers: mcb: fix resource leak in mcb_probe() coresight: etm4x: fix repeated words in comments coresight: cti: Fix null pointer error on CTI init before ETM coresight: trbe: remove cpuhp instance node before remove cpuhp state counter: stm32-lptimer-cnt: fix the check on arr and cmp registers update misc: fastrpc: Add dma_mask to fastrpc_channel_ctx misc: fastrpc: Add mmap request assigning for static PD pool misc: fastrpc: Safekeep mmaps on interrupted invoke misc: fastrpc: Add support for audiopd misc: fastrpc: Rework fastrpc_req_munmap misc: fastrpc: Use fastrpc_map_put in fastrpc_map_create on fail misc: fastrpc: Add fastrpc_remote_heap_alloc misc: fastrpc: Add reserved mem support misc: fastrpc: Rename audio protection domain to root ...
2022-12-01habanalabs: remove FOLL_FORCE usageDavid Hildenbrand1-2/+1
FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages using unpin_user_pages_dirty_lock(true), the assumption is that all these pages are writable. FOLL_FORCE in this case seems to be due to copy-and-past from other drivers. Let's just remove it. Link: https://lkml.kernel.org/r/20221116102659.70287-20-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-23habanalabs: fix VA range calculationOhad Sharabi1-8/+4
Current implementation is fixing the page size to PAGE_SIZE whereas the input page size may be different. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: fail driver load if EEPROM errors detectedOfir Bitton1-12/+11
In case EEPROM is not burned, firmware sets default EEPROM values. As this is not valid in production, driver should fail load upon any EEPROM error reported by firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: make print of engines idle mask more readableTomer Tayar1-6/+21
The engines idle mask was increased to be an array of 4 u64 entries. To make the print of this mask more readable, remove the "0x" prefix, and zero-pad each u64 to 16 bytes if either it isn't zero or if any of the higher-order u64's is not zero. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: clear non-released encapsulated signalsTomer Tayar3-31/+71
Reserved encapsulated signals which were not released hold the context refcount, leading to a failure when killing the user process on device reset or device fini. Add the release of these left signals in the CS roll-back process. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: don't put context in hl_encaps_handle_do_release_sob()Tomer Tayar1-1/+0
hl_encaps_handle_do_release_sob() can be called only when the last reference to the context object is released and hl_ctx_do_release() is initiated, and therefore it shouldn't call hl_ctx_put(). Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: print context refcount value if hard reset failsTomer Tayar1-3/+15
Failing to kill a user process during a hard reset can be due to a reference to the user context which isn't released. To make it easier to understand if this the reason for the failure and not something else, add a print of the context refcount value. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: add RMWREG32_SHIFTED to set a val within a maskDafna Hirschfeld2-10/+6
This is similar to RMWREG32, but the given 'val' is already shifted according to the mask. This allows several 'ORed' vals and masks to be set at once The patch also fixes wrong usage of RMWREG32 by replacing it with RMWREG32_SHIFTED Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: fix rc when new CPUCP opcodes are not supportedTomer Tayar1-0/+1
When the new CPUCP opcodes are not supported and a CPUCP packet fails, the return value is the F/W error resposone which is a positive value. If this packet is sent from IOCTL and the positive value is used, the ICOTL will not be considered as unsuccessful. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs/gaudi2: added memset for the cq_size registerMarco Pagani1-0/+1
The clang-analyzer reported a warning: "Value stored to 'cq_size_addr' is never read". The cq_size register of dcore0 is not being zeroed using gaudi2_memset_device_lbw(), along with the other cq_* registers, even though the corresponding cq_size_addr variable is set. Signed-off-by: Marco Pagani <marpagan@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: added return value check for hl_fw_dynamic_send_clear_cmd()Marco Pagani1-0/+2
The clang-analyzer reported a warning: "Value stored to 'rc' is never read". The return value check for the first hl_fw_dynamic_send_clear_cmd() call in hl_fw_dynamic_send_protocol_cmd() appears to be missing. Signed-off-by: Marco Pagani <marpagan@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: increase the size of busy engines maskTomer Tayar1-4/+5
Increase the size of the busy engines mask in 'struct hl_info_hw_idle', for future ASICs with more than 128 engines. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs/gaudi2: change memory scrub mechanismfarah kassabri1-46/+83
Currently the scrubbing mechanism used the EDMA engines by directly setting the engine core registers to scrub a chunk of memory. Due to a sporadic failure with this mechanism, it was decided to initiate the engines via its QMAN using LIN-DMA packets. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: extend process wait timeout in device fineOded Gabbay2-5/+12
Processes that use our device are likely to use at the same time other devices such as remote storage. In case our device is removed and a user process is still using the device, we need to kill the user process. However, if that process has a thread waiting for i/o to complete on remote storage, for example, the process won't terminate. Let's give it enough time to terminate before giving up. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2022-11-23habanalabs: check schedule_hard_reset correctlyOded Gabbay1-12/+13
schedule_hard_reset can be true only if we didn't do hard-reset. Therefore, no point of checking it in case hard_reset is true. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2022-11-23habanalabs: reset device if still in use when releasedTomer Tayar1-3/+4
If the device file is released while a context is still held, it won't be possible to reopen it until the context is eventually released. If that doesn't happen, only a device reset will revert it back to an operational state, i.e. need to wait for a CS timeout or an error, or to wait for an external intervention of injecting a reset via sysfs. At this stage, after the device was released by user, context is held either because of CS which were left running on the device and are not relevant anymore, or due to missing cleanup steps from user side. All of this is in any case handled in the device reset flow, so initiate the reset at this point instead of waiting for it. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs/gaudi2: return to reset upon SM SEI BRESP errorTomer Tayar1-13/+6
Due to a H/W issue in the LBW path to the PCIE_DBI MSI-X doorbell, there were false sporadic error responses in SM when it was configured to write to there, and hence no reset was done as part of handling the relevant event. Now that the virtual MSI-X doorbell is used, such errors in SM are not expected and reset shouldn't be skipped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>