path: root/drivers/misc/habanalabs/common/habanalabs.h
Age | Commit message | Author | Files | Lines
2023-01-26 | habanalabs: move driver to accel subsystem | Oded Gabbay | 1 | -4002/+0
Now that we have a subsystem for compute accelerators, move the habanalabs driver to it. This patch only moves the files and fixes the Makefiles. Future patches will change the existing code to register to the accel subsystem and expose the accel device char files instead of the habanalabs device char files. Update the MAINTAINERS file to reflect this change. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs/uapi: move uapi file to drm | Oded Gabbay | 1 | -1/+1
Move the habanalabs.h uapi file from include/uapi/misc to include/uapi/drm, and rename it to habanalabs_accel.h. This is required before moving the actual driver to the accel subsystem. Update MAINTAINERS file accordingly. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: pass-through request from user to f/w | farah kassabri | 1 | -0/+2
Add a uAPI, as part of the INFO IOCTL, to allow users to send requests directly to f/w, according to a pre-defined set of opcodes that the f/w exposes. The f/w will put the result in a kernel-allocated buffer, which the driver will then copy to the user-supplied buffer. This will allow f/w tools to communicate directly with the f/w without the need to add a new uAPI to the driver for each new type of request. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: fix asic-specific functions documentation | Ohad Sharabi | 1 | -1/+2
- Add missing documentation of set DRAM props - fix typo Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs/gaudi2: wait for preboot ready if HW state is dirty | Ohad Sharabi | 1 | -0/+1
Instead of waiting for the BTM indication, we should wait for preboot ready. Consider the scenario below:
1. A FW update is triggered, setting the dirty bit.
2. A hard reset is triggered due to the dirty bit.
3. FW initiates the reset:
   - dirty bit cleared
   - BTM indication cleared
   - preboot ready indication cleared
4. During the hard reset:
   - BTM indication will be set
   - BIST test performed and another reset triggered
5. Only after this second reset will the preboot set the preboot ready indication.
When polling on the BTM indication alone, we can lose sync with the FW by trying to communicate with it while it is in the middle of a reset. To overcome this, we always wait for the preboot ready indication.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: fix dmabuf to export only required size | Ohad Sharabi | 1 | -0/+2
This patch fixes a bug that was found in the dmabuf flow. Bug description as found on a Gaudi2 device:
1. User allocates 4MB of device memory.
   - Note that although the allocation size was 4MB, the HMMU allocated a full page of 768MB to back the request.
   - The user gets a memory handle that points to a single page (768MB).
   - Mapping the handle, the user gets a virtual address to the start of the page.
2. User exports the buffer.
3. User registers the exported buffer in the importer. This flow has a callback to the exporter, which in turn converts the phys_page_pack to an SG list for the importer. This SG list has a single entry of size 768MB. However, the size that was passed to the importer was only 4MB.
The solution is to make sure the importer is exposed only to the exported size. This is done by fixing the SG list created by the exporter so that its total size equals the exported memory size requested by the user.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
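The size mismatch described in the dmabuf fix can be illustrated with a small user-space sketch. It is not the driver's code: `sg_lengths_for_export` and its parameters are hypothetical names, and the scatter-gather list is reduced to an array of entry lengths. The point is only that the entries should add up to the exported size rather than to the size of the backing pages.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a driver function): compute the length of each
 * scatter-gather entry so that the entries cover exactly exported_size bytes,
 * even when the backing pages are larger than the export.
 * Returns the number of entries written to len_out (at most npages).
 */
static size_t sg_lengths_for_export(uint64_t page_size, size_t npages,
                                    uint64_t exported_size, uint64_t *len_out)
{
    size_t n = 0;
    uint64_t left = exported_size;

    while (left > 0 && n < npages) {
        /* Never expose more of a page than the export still needs. */
        uint64_t chunk = left < page_size ? left : page_size;

        len_out[n++] = chunk;
        left -= chunk;
    }
    return n;
}
```

With a single 768MB page and a 4MB export, this yields one entry of 4MB instead of one entry of 768MB.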
2023-01-26 | habanalabs: modify export dmabuf API | Ohad Sharabi | 1 | -0/+9
A previous commit deprecated the option to export from handle, leaving the code with no support for devices with virtual memory. This commit modifies the export API so that the uAPI takes a user address in both cases (i.e. with and without MMU support), and adds the actual support for devices with virtual memory. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: remove support to export dmabuf from handle | Ohad Sharabi | 1 | -5/+0
The user API that allows exporting a DMA buffer from a handle is deprecated here. It was never used, as it is relevant only for Gaudi2 and the user stack has yet to add support for dmabuf in Gaudi2. Looking forward, a modified API to export DMA buffers for ASICs that support virtual memory will be added. Until the new API is ready, exporting DMA buffers will not be supported for ASICs with virtual memory. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: make set_dram_properties an ASIC function | Ohad Sharabi | 1 | -0/+1
As ASICs are evolving, we will need to update the DRAM properties at various points because we may get different information from the f/w at different points of the initialization. This ASIC function is a foundation for this capability. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: don't allow user to destroy CB handle more than once | Tomer Tayar | 1 | -1/+5
The refcount of a CB buffer is initialized when the user allocates a CB, and is decreased when the user destroys the CB handle. If this refcount is also increased from the kernel and the user sends more than one destroy request for the handle, the buffer will be released/freed and later accessed when the refcount is put from the kernel side. To avoid this, prevent the user from destroying the handle more than once. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
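The double-destroy problem can be sketched in user-space C11. This is only an illustration under assumed names (`cb_sketch`, `cb_destroy_handle` and the `handle_destroyed` flag are hypothetical, not the driver's structures): a one-shot flag ensures that the user's destroy request drops the user's reference exactly once, so references taken by the kernel side remain valid.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a command buffer object. */
struct cb_sketch {
    atomic_int refcount;          /* 1 for the user's handle, plus kernel gets */
    atomic_bool handle_destroyed; /* set once the user destroyed the handle */
};

static void cb_init(struct cb_sketch *cb)
{
    atomic_init(&cb->refcount, 1);
    atomic_init(&cb->handle_destroyed, false);
}

/* Kernel-side reference, e.g. while a submission uses the buffer. */
static void cb_get(struct cb_sketch *cb)
{
    atomic_fetch_add(&cb->refcount, 1);
}

/* Drops one reference and returns the number of references left. */
static int cb_put(struct cb_sketch *cb)
{
    return atomic_fetch_sub(&cb->refcount, 1) - 1;
}

/* User-triggered destroy: only the first call may drop the user's reference. */
static int cb_destroy_handle(struct cb_sketch *cb)
{
    if (atomic_exchange(&cb->handle_destroyed, true))
        return -EINVAL; /* handle was already destroyed */

    cb_put(cb);
    return 0;
}
```

Without the flag, a second destroy request would drop the kernel's reference as well, and the later kernel-side put would touch freed memory.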
2023-01-26 | habanalabs: abort waiting user threads upon error | Tomer Tayar | 1 | -0/+1
User should close the FD when being notified about an error, after which a device reset takes place. However, if the user has pending threads that wait for completions, the device release won't be called and eventually the watchdog timeout will expire, leading to hard reset and killing the user process. To avoid it, abort such waiting threads right after the error notification, and block following waiting operations. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 | habanalabs: read binning info from preboot | farah kassabri | 1 | -0/+5
Sometimes we need the binning info at a very early state of the driver initialization. Therefore, support was added in preboot to provide the binning info as part of the f/w descriptor and the driver can now use that. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: clear non-released encapsulated signals | Tomer Tayar | 1 | -1/+2
Reserved encapsulated signals that were not released hold the context refcount, leading to a failure when killing the user process on device reset or device fini. Release these leftover signals as part of the CS roll-back process. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: add RMWREG32_SHIFTED to set a val within a mask | Dafna Hirschfeld | 1 | -7/+3
This is similar to RMWREG32, but the given 'val' is already shifted according to the mask. This allows several 'ORed' vals and masks to be set at once. The patch also fixes wrong usage of RMWREG32 by replacing it with RMWREG32_SHIFTED. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
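The difference between the two read-modify-write flavors can be sketched on a plain variable instead of a device register. The helpers `rmw32` and `rmw32_shifted` below are hypothetical user-space stand-ins that follow the commit's description; they are not the driver's macros.

```c
#include <stdint.h>

/*
 * RMWREG32-like behavior: 'val' is the unshifted field value, so it is
 * first shifted to the position of the lowest set bit of 'mask'.
 */
static inline uint32_t rmw32(uint32_t reg, uint32_t val, uint32_t mask)
{
    unsigned int shift = 0;

    while (mask && !((mask >> shift) & 1u))
        shift++;

    return (reg & ~mask) | ((val << shift) & mask);
}

/*
 * RMWREG32_SHIFTED-like behavior: 'val' is already in position, so several
 * fields can be ORed into one 'val' and one 'mask' and written at once.
 */
static inline uint32_t rmw32_shifted(uint32_t reg, uint32_t val, uint32_t mask)
{
    return (reg & ~mask) | (val & mask);
}
```

For instance, `rmw32_shifted(reg, 0x300 | 0x5, 0xF00 | 0xF)` updates two fields in a single operation, which the unshifted variant cannot express because it assumes one contiguous field.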
2022-11-23 | habanalabs: extend process wait timeout in device fine | Oded Gabbay | 1 | -3/+8
Processes that use our device are likely to use at the same time other devices such as remote storage. In case our device is removed and a user process is still using the device, we need to kill the user process. However, if that process has a thread waiting for i/o to complete on remote storage, for example, the process won't terminate. Let's give it enough time to terminate before giving up. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2022-11-23 | habanalabs/gaudi: add page fault notify event | Dani Liberman | 1 | -0/+2
Each time a page fault happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs/gaudi: add razwi notify event | Dani Liberman | 1 | -0/+2
Each time a razwi (read as zero, write ignored) happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs/gaudi2: add PCI revision 2 support | Ofir Bitton | 1 | -0/+2
Add support for Gaudi2 devices with PCI revision 2. Functionality is exactly the same as revision 1; the only difference is the device name exposed to the user. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: remove redundant gaudi2_sec asic type | Ofir Bitton | 1 | -2/+0
As Gaudi2 has a single PCI id, the secured asic type is redundant. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: add support for graceful hard reset | Tomer Tayar | 1 | -2/+12
Calling hl_device_reset() for a hard reset leads to an almost immediate device reset and to killing the user process. For resets that follow errors, this removes the option to debug the errors on both the device side and the user application side. This patch adds a 'graceful hard reset' option and a new hl_device_cond_reset() function. Under some conditions, mainly if there is no user process or if the user is not registered to driver notifications, this function will execute a hard reset as usual. Otherwise, the reset will be postponed and a notification will be sent to the user, to let them perform post-error actions and then release the device, after which the reset will take place. If the device is not released by the user within some defined time, a watchdog work will execute the reset in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
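The decision described for the graceful hard reset can be summarized in a small sketch. It models only the two conditions the commit message names explicitly (the message says "mainly", so the real function may check more), and it does not model the watchdog. The enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical outcome of the conditional reset decision. */
enum reset_decision_sketch {
    RESET_IMMEDIATELY,           /* hard reset as usual */
    RESET_POSTPONED_NOTIFY_USER, /* notify user, reset after device release */
};

/*
 * Sketch of the condition only: reset right away when nobody can react to a
 * notification, otherwise postpone and let the user release the device.
 */
static enum reset_decision_sketch
cond_reset_decision(bool user_process_open, bool user_registered_for_events)
{
    if (!user_process_open || !user_registered_for_events)
        return RESET_IMMEDIATELY;

    return RESET_POSTPONED_NOTIFY_USER;
}
```

In the postponed case, the commit message adds a time limit: if the device is not released in time, the reset is executed anyway.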
2022-11-23 | habanalabs: allow setting HBM BAR to other regions | Ohad Sharabi | 1 | -0/+2
Up until now, the use case in the driver was that the HBM is accessed using the HBM BAR, yet the BAR sometimes cannot cover the whole HBM, so we needed to set the BAR to another HBM offset. Now we are facing the need to access other PCI memory regions that can be covered by the HBM BAR. To address that, we allow the caller to determine whether the HBM BAR needs to be set, regardless of the PCI memory region. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: move reset workqueue to be under hl_device | Tomer Tayar | 1 | -6/+6
'struct hl_device_reset_work' is used as a wrapper for the reset work and its parameters, including the reset workqueue on which it runs. In a future commit, another reset-related work with similar parameters is going to be added, but it won't use the reset workqueue. As there is a single reset workqueue in any case, and to allow the reuse of this structure, move the reset workqueue to 'struct hl_device'. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: replace 'pf' to 'prefetch' | Dafna Hirschfeld | 1 | -4/+4
pf was an abbreviation for prefetch but because pf already stands for 'physical function', we decided to change it to 'prefetch'. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: add page fault info uapi | Dani Liberman | 1 | -1/+21
Only the first page fault will be saved. Besides the address which caused the page fault, the driver captures all of the mmu user mappings. User can retrieve this data via the new uapi (new opcode in INFO ioctl). Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: use lower_32_bits() | Bharat Jauhari | 1 | -3/+3
This fixes sparse warning on doing cast to 32-bits Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
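For readers unfamiliar with the helper, the sketch below re-creates `lower_32_bits()` and its companion `upper_32_bits()` in user space, with `uint32_t` standing in for the kernel's `u32`. The `split_addr` function is a hypothetical usage example, not driver code. Masking before the cast makes the truncation explicit, which is what silences sparse's warning about a cast that truncates bits.

```c
#include <stdint.h>

/* User-space re-creation of the kernel helpers (u32 replaced by uint32_t). */
#define lower_32_bits(n) ((uint32_t)((n) & 0xffffffff))
#define upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))

/* Hypothetical example: split a 64-bit address into two 32-bit halves. */
static void split_addr(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
    *lo = lower_32_bits(addr);
    *hi = upper_32_bits(addr);
}
```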
2022-11-23 | habanalabs: refactor razwi event notification | Dani Liberman | 1 | -25/+6
This event notification was compatible only with gaudi, where razwi and page fault happen together. To make it compatible with all ASICs, this refactor contains: 1. The razwi notification will only notify about razwi info. A new notification will be added in a future patch to retrieve data about page fault errors. 2. The razwi info structure is changed to support all ASICs. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23 | habanalabs: allow control device open during reset | Ofir Bitton | 1 | -0/+2
Monitoring apps would like to query device state at any time so we should allow it also during reset because it doesn't involve accessing the h/w. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 | habanalabs/gaudi2: add secured attestation info uapi | Dani Liberman | 1 | -0/+3
User will provide a nonce via the ioctl, and will retrieve secured attestation data of the boot, generated using given nonce. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 | habanalabs: rename error info structure | Dani Liberman | 1 | -6/+6
As a preparation for adding more errors to it, change to more suitable name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 | habanalabs: MMU invalidation h/w is per device | Oded Gabbay | 1 | -3/+7
The code used the mmu mutex to protect access to the context's page tables and invalidation of the MMU cache. Because pgt are per context, the mmu mutex was a member of the context object. The problem is that the device has a single MMU invalidation h/w (per MMU). Therefore, the mmu mutex should not be a property of the context but a property of the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 | habanalabs: add support for new cpucp return codes | Ofir Bitton | 1 | -0/+2
Firmware now responds with more detailed cpucp return codes. The driver can now distinguish between error and debug return codes. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 | habanalabs: fix possible hole in device va | farah kassabri | 1 | -3/+4
cb_map_mem() uses gen_pool_alloc() to get virtual address for mapping a CB. The mapping is done in chunks of page size, so if the CB size is larger, it is possible that the allocated virtual addresses won't be consecutive. User retrieves this device VA which returns the virtual address in the first va_block. If there is a "hole" in the virtual addresses, user can configure a HW block with a bad device VA. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 | habanalabs: send device active message to f/w | farah kassabri | 1 | -0/+3
As part of the RAS that is done by the f/w, we should send a message to the f/w when a user either acquires or releases the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: perform context switch flow only if needed | Ofir Bitton | 1 | -0/+2
Except Goya, none of our ASICs require context switch flow, hence we enable this flow only where it is needed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: set command buffer host VA dynamically | Dafna Hirschfeld | 1 | -7/+3
Set the addresses for userspace command buffer dynamically instead of hard-coded. There is no reason for it to be hard-coded. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: trace DMA allocations | Ohad Sharabi | 1 | -9/+31
This patch adds tracepoints in the code for DMA allocations. The main purpose is to be able to cross-reference this data with the map operations and determine whether a memory violation occurred, for example freeing a DMA allocation before unmapping it from device memory. To achieve this, the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. To get a better understanding of what happened in the DMA allocations, the real allocating function is added to the trace as well. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: add cdev index data member | Omer Shpigelman | 1 | -1/+3
Instead of recalculating the cdev index, store it in a dedicated data member. This data member is intended to be passed to other drivers using the auxiliary bus infra and hence this new data member is necessary in case that the calculation is changed in the future. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: fix H/W block handling for partial unmappings | Tomer Tayar | 1 | -2/+4
Several munmap() calls can be done on a mapped H/W block that is larger than a page. Releasing the object should be done only when the whole mapped range is unmapped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
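One way to picture the partial-unmapping fix is to track how many mapped bytes remain and release the object only when that count reaches zero. The sketch below is a user-space illustration with hypothetical names (`hw_block_sketch`, `hw_block_unmap`); it is not the driver's VMA close handler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one mapped H/W block. */
struct hw_block_sketch {
    size_t mapped_size; /* bytes of the block that are still mapped */
    bool released;      /* set once the whole range has been unmapped */
};

/*
 * Called once per munmap() of 'len' bytes from the block.
 * Returns true only when this call unmapped the last remaining part.
 */
static bool hw_block_unmap(struct hw_block_sketch *blk, size_t len)
{
    if (blk->released)
        return false;

    if (len > blk->mapped_size)
        len = blk->mapped_size;

    blk->mapped_size -= len;
    if (blk->mapped_size > 0)
        return false; /* part of the block is still mapped, keep the object */

    blk->released = true;
    return true;
}
```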
2022-09-18 | habanalabs: unify hwmon resources clean up | Dani Liberman | 1 | -0/+1
Since the hwmon fini code is common to all asics, unify it into a common function. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs/gaudi2: new API to control engine cores running mode | Tal Cohen | 1 | -1/+7
The current flow of halting the engine cores is implemented by command buffers built by the user space and sent towards the Driver. This current flow is broken since the user space does not know when the cores actually halt as sending a workload is async op. Therefore the application can not free the memory that is mapped to the engine cores. This new API allows the user space to control the running mode. The API call is sync (returns after the cores are set to the requested mode). Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: remove all kdma locks | Oded Gabbay | 1 | -4/+0
We don't use KDMA concurrently in the driver. The only use is through debugfs and we don't protect concurrent access through it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: wrap macro arg with parentheses | Ohad Sharabi | 1 | -2/+2
The macro argument <val> is cast to u32 in some places. Because this argument can be an arithmetic expression (e.g. address + offset), the cast should apply to the whole expression. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
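The precedence problem behind this fix is easy to reproduce outside the driver. In the sketch below, the macro and function names are hypothetical: without parentheses around the argument, the cast binds only to the first operand of `address + offset`, so the sum is computed in 64 bits and is not truncated.

```c
#include <stdint.h>

/* Cast binds only to the first operand of an expression argument. */
#define LOW32_UNSAFE(val) ((uint32_t) val)
/* Cast applies to the whole expression. */
#define LOW32_SAFE(val)   ((uint32_t) (val))

static uint64_t low32_unsafe(uint64_t addr, uint64_t off)
{
    /* Expands to ((uint32_t) addr + off): the addition happens in 64 bits. */
    return LOW32_UNSAFE(addr + off);
}

static uint64_t low32_safe(uint64_t addr, uint64_t off)
{
    /* Expands to ((uint32_t) (addr + off)): the sum is truncated to 32 bits. */
    return LOW32_SAFE(addr + off);
}
```

With `addr = 0xFFFFFFFF` and `off = 1`, the unsafe variant yields `0x100000000` while the safe one yields `0`, the value a 32-bit register write would expect.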
2022-09-18 | habanalabs: fix spelling mistakes | Bharat Jauhari | 1 | -14/+13
Cosmetic commit, no logical changes. It just fixes the spelling mistakes. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: add return code field to module iterator | Ohad Sharabi | 1 | -2/+5
Up until now, the module iterator called void callback functions, so a caller whose callback may fail suffered from two issues: 1. The need to "plant" the return code in the private data. This is a drawback since the iterator itself should not be aware of the private data of the caller. 2. Due to 1, even on failure the iterator would keep iterating instead of breaking upon error. To overcome this, an optional rc field is added to the iterator context. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
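The idea of an optional return-code field in an iterator context can be sketched as follows. All names here (`iterate_ctx_sketch`, `iterate_modules_sketch`, `fail_at_cb`) are hypothetical, and the structure is only a guess at the general shape, not the driver's definition. The iterator looks at `rc` without knowing anything about the caller's private data, and stops at the first failure.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical iterator context with an optional return-code field. */
struct iterate_ctx_sketch {
    void (*fn)(struct iterate_ctx_sketch *ctx, int module_idx);
    void *data; /* caller's private data, opaque to the iterator */
    int *rc;    /* optional: NULL means the callback cannot fail */
};

/* Calls fn for each module and returns how many modules were visited. */
static int iterate_modules_sketch(struct iterate_ctx_sketch *ctx,
                                  int num_modules)
{
    int i;

    for (i = 0; i < num_modules; i++) {
        ctx->fn(ctx, i);
        if (ctx->rc && *ctx->rc)
            return i + 1; /* stop at the first failing module */
    }
    return i;
}

/* Example callback: reports -EIO at the index stored in ctx->data. */
static void fail_at_cb(struct iterate_ctx_sketch *ctx, int module_idx)
{
    int fail_idx = *(int *)ctx->data;

    if (ctx->rc && module_idx == fail_idx)
        *ctx->rc = -EIO;
}
```

Callers that cannot fail simply leave `rc` as NULL and get the previous iterate-everything behavior.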
2022-09-18 | habanalabs: rename non_hard_reset to compute_reset | Ofir Bitton | 1 | -2/+2
In order to be more explicit, we should use the term compute_reset to describe the reset in which only the compute engines get reset. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 | habanalabs: removed seq_file parameter from is_idle asic functions | Dani Liberman | 1 | -2/+15
Change the is_idle functions so they are more usable outside debugfs. Do this by replacing the seq_file parameter with a regular string. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 | habanalabs: rename soft reset to compute reset | Oded Gabbay | 1 | -8/+8
Doing compute reset can be the traditional inference soft reset that is supported only in Goya. Or it can be the new reset upon device release, which is supported in Gaudi2 and above. Therefore, wherever suitable, use the terminology of compute reset instead of soft reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 | habanalabs: add a value field to hl_fw_send_pci_access_msg() | Tomer Tayar | 1 | -1/+1
For gaudi2 we need to send a value to F/W as part of the PCI_ACCESS packet. As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value' field. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 | habanalabs: fixes to the poll-timeout macros | Ohad Sharabi | 1 | -29/+90
- use conventional internal macro variables (double underscore prefix)
- adjust address casting
- on register poll using ELBI use ELBI read rather than BAR read on error condition
- remove unused macro
Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
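The first bullet, double-underscore names for variables declared inside a macro, can be illustrated with a simplified polling macro. This is a user-space sketch, not the driver's macro: `POLL_UNTIL`, `fake_reg_read` and `wait_for_value` are hypothetical, and a try counter replaces real time-based timeouts. The `__tries` name follows the kernel convention of making a clash with a caller's variable unlikely (in strictly conforming user code such names are reserved for the implementation).

```c
#include <errno.h>
#include <stdint.h>

/*
 * Poll by calling read_fn(arg) until 'cond' holds or max_tries reads were
 * done. 'val' receives each read value; 'rc' is 0 on success, -ETIMEDOUT
 * otherwise. __tries is internal to the macro.
 */
#define POLL_UNTIL(read_fn, arg, val, cond, max_tries, rc)   \
do {                                                         \
    unsigned int __tries = (max_tries);                      \
    (rc) = -ETIMEDOUT;                                       \
    while (__tries--) {                                      \
        (val) = read_fn(arg);                                \
        if (cond) {                                          \
            (rc) = 0;                                        \
            break;                                           \
        }                                                    \
    }                                                        \
} while (0)

/* Fake register for the example: every read returns the next integer. */
static uint32_t fake_reg_read(uint32_t *counter)
{
    return ++(*counter);
}

/* Polls the fake register until it reads 'target' or the tries run out. */
static int wait_for_value(uint32_t target, unsigned int max_tries)
{
    uint32_t counter = 0, val = 0;
    int rc;

    POLL_UNTIL(fake_reg_read, &counter, val, val == target, max_tries, rc);
    return rc;
}
```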
2022-07-12 | habanalabs: save f/w preboot minor version | Sagiv Ozeri | 1 | -1/+3
We need this property for backward compatibility against the f/w. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>