summaryrefslogtreecommitdiff
path: root/include/drm/gpu_scheduler.h
AgeCommit message (Collapse)AuthorFilesLines
2021-07-01drm/sched: Allow using a dedicated workqueue for the timeout/fault tdrBoris Brezillon1-1/+22
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU reset. This leads to extra complexity when we need to synchronize timeout works with the reset work. One solution to address that is to have an ordered workqueue at the driver level that will be used by the different schedulers to queue their timeout work. Thanks to the serialization provided by the ordered workqueue we are guaranteed that timeout handlers are executed sequentially, and can thus easily reset the GPU from the timeout handler without extra synchronization. v5: * Add a new paragraph to the timedout_job() method v3: * New patch v4: * Actually use the timeout_wq to queue the timeout work Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Christian König <christian.koenig@amd.com> Cc: Qiang Yu <yuq825@gmail.com> Cc: Emma Anholt <emma@anholt.net> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210630062751.2832545-3-boris.brezillon@collabora.com
2021-07-01drm/sched: Document what the timedout_job method should doBoris Brezillon1-0/+14
The documentation is a bit vague and doesn't really describe what the ->timedout_job() is expected to do. Let's add a few more details. v5: * New patch Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20210630062751.2832545-2-boris.brezillon@collabora.com
2021-06-03drm/sched: Fix inverted comment for hang_limitAlyssa Rosenzweig1-1/+1
The hang_limit is the threshold after which the kernel no longer attempts to schedule a job. Its documentation stated the opposite due to a typo. Correct the wording to indicate the actual purpose of the field. Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20210528235152.38447-1-alyssa.rosenzweig@collabora.com
2021-04-09drm/amd/amdgpu implement tdr advanced modeJack Zhang1-0/+3
[Why] Previous tdr design treats the first job in job_timeout as the bad job. But sometimes a later bad compute job can block a good gfx job and cause an unexpected gfx job timeout because gfx and compute ring share internal GC HW mutually. [How] This patch implements an advanced tdr mode.It involves an additinal synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit step in order to find the real bad job. 1. At Step0 Resubmit stage, it synchronously submits and pends for the first job being signaled. If it gets timeout, we identify it as guilty and do hw reset. After that, we would do the normal resubmit step to resubmit left jobs. 2. For whole gpu reset(vram lost), do resubmit as the old way. v2: squash in build fix (Alex) Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-08drm/sched: add missing member documentationChristian König1-0/+1
Just fix a warning. Signed-off-by: Christian König <christian.koenig@amd.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: f2f12eb9c32b ("drm/scheduler: provide scheduler score externally") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210401125213.138855-1-christian.koenig@amd.com
2021-02-05drm/scheduler: provide scheduler score externallyChristian König1-2/+3
Allow multiple schedulers to share the load balancing score. This is useful when one engine has different hw rings. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-and-Tested-by: Leo Liu <leo.liu@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210204144405.2737-1-christian.koenig@amd.com
2021-01-29drm/scheduler: Job timeout handler returns status (v3)Luben Tuikov1-3/+15
This patch does not change current behaviour. The driver's job timeout handler now returns status indicating back to the DRM layer whether the device (GPU) is no longer available, such as after it's been unplugged, or whether all is normal, i.e. current behaviour. All drivers which make use of the drm_sched_backend_ops' .timedout_job() callback have been accordingly renamed and return the would've-been default value of DRM_GPU_SCHED_STAT_NOMINAL to restart the task's timeout timer--this is the old behaviour, and is preserved by this patch. v2: Use enum as the status of a driver's job timeout callback method. v3: Return scheduler/device information, rather than task information. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Russell King <linux+etnaviv@armlinux.org.uk> Cc: Christian Gmeiner <christian.gmeiner@gmail.com> Cc: Qiang Yu <yuq825@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Steven Price <steven.price@arm.com> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Eric Anholt <eric@anholt.net> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Steven Price <steven.price@arm.com> Signed-off-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/415095/
2020-12-10drm/sched: Add missing structure commentLuben Tuikov1-1/+1
Add a missing structure comment for the recently added @list member. Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Christian König <christian.koenig@amd.com> Fixes: 8935ff00e3b1 ("drm/scheduler: "node" --> "list"") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/406546/
2020-12-08gpu/drm: ring_mirror_list --> pending_listLuben Tuikov1-5/+5
Rename "ring_mirror_list" to "pending_list", to describe what something is, not what it does, how it's used, or how the hardware implements it. This also abstracts the actual hardware implementation, i.e. how the low-level driver communicates with the device it drives, ring, CAM, etc., shouldn't be exposed to DRM. The pending_list keeps jobs submitted, which are out of our control. Usually this means they are pending execution status in hardware, but the latter definition is a more general (inclusive) definition. Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/405573/ Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Christian König <christian.koenig@amd.com>
2020-12-08drm/scheduler: "node" --> "list"Luben Tuikov1-2/+2
Rename "node" to "list" in struct drm_sched_job, in order to make it consistent with what we see being used throughout gpu_scheduler.h, for instance in struct drm_sched_entity, as well as the rest of DRM and the kernel. Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/403515/ Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Christian König <christian.koenig@amd.com>
2020-09-08Merge tag 'amd-drm-next-5.10-2020-09-03' of ↵Dave Airlie1-6/+7
git://people.freedesktop.org/~agd5f/linux into drm-next amd-drm-next-5.10-2020-09-03: amdgpu: - RAS fixes - Sienna Cichlid updates - Navy Flounder updates - DCE6 (SI) support in DC - Enable plane rotation - Rework pre-OS vram reservation handling during driver init - Add standard interface to dump GPU metrics table from SMU - Rework tiling and tmz state handling in atomic commits - Pstate fixes - Add voltage and power hwmon interfaces for renoir - SW CTF fixes - S/G display fix for Raven - Print client strings for vmfaults for vega and newer - Manual fan control fixes - Display updates - Reorg power management directory structure - Misc bug fixes - Misc code cleanups amdkfd: - Topology fixes - Add SMI events for thermal throttling and GPU resets radeon: - switch from pci_* to dma_* for dma allocations - PLL fix Scheduler: - Clean up priority levels UAPI: - amdgpu INFO IOCTL query update for TMZ state https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6049 - amdkfd SMI event interface updates https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/therm_thrott From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200903222921.4152-1-alexander.deucher@amd.com
2020-08-19drm/scheduler: Remove priority macro INVALID (v2)Luben Tuikov1-1/+0
Remove DRM_SCHED_PRIORITY_INVALID. We no longer carry around an invalid priority and cut it off at the source. Backwards compatibility behaviour of AMDGPU CTX IOCTL passing in garbage for context priority from user space and then mapping that to DRM_SCHED_PRIORITY_NORMAL is preserved. v2: Revert "res" --> "r" and "prio" --> "priority". Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-19drm/scheduler: Scheduler priority fixes (v2)Luben Tuikov1-5/+7
Remove DRM_SCHED_PRIORITY_LOW, as it was used in only one place. Rename and separate by a line DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT as it represents a (total) count of said priorities and it is used as such in loops throughout the code. (0-based indexing is the the count number.) Remove redundant word HIGH in priority names, and rename *KERNEL* to *HIGH*, as it really means that, high. v2: Add back KERNEL and remove SW and HW, in lieu of a single HIGH between NORMAL and KERNEL. Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-06-26drm/scheduler: improve job distribution with multiple queuesNirmoy Das1-3/+3
This patch uses score to select a new drm scheduler for better loadbalance between multiple drm schedulers instead of num_jobs. Below are test results after running amdgpu_test for ~10 times. Before this patch: sched_name num of many times it got schedule ========= ================================== sdma0 1463 sdma1 198 comp_1.0.1 280 After this patch: sched_name num of many times it got schedule ========= ================================== sdma0 925 sdma1 928 comp_1.0.1 177 comp_1.1.1 44 comp_1.2.1 43 comp_1.3.1 44 Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/373000/ Signed-off-by: Christian König <christian.koenig@amd.com>
2020-04-05drm/sched: fix kernel-doc in gpu_scheduler.hSam Ravnborg1-0/+1
Fix following warning: gpu_scheduler.h:103: warning: Function parameter or member 'priority' not described in 'drm_sched_entity' Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Nirmoy Das <nirmoy.das@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20200328132025.19910-4-sam@ravnborg.org
2020-03-16drm/sched: implement and export drm_sched_pick_bestNirmoy Das1-0/+3
Remove drm_sched_entity_get_free_sched() and use the logic of picking the least loaded drm scheduler from a drm scheduler list to implement drm_sched_pick_best(). This patch also exports drm_sched_pick_best() so that it can be utilized by other drm drivers. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-16Revert "drm/scheduler: improve job distribution with multiple queues"changzhu1-3/+3
It needs to revert this patch to avoid amdgpu_test compute hang problem on picasso. This reverts commit 56822db194232c089601728d68ed078dccb97f8b. Signed-off-by: changzhu <Changfeng.Zhu@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-09drm/scheduler: implement a function to modify sched listNirmoy Das1-0/+4
Implement drm_sched_entity_modify_sched() which modifies existing sched_list with a different one. This is going to be helpful when userspace changes priority of a ctx/entity then the driver can switch to the corresponding HW scheduler list for that priority. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-28drm/amdgpu: fix doc by clarifying sched_list definitionNirmoy Das1-2/+3
expand sched_list definition for better understanding. Also fix a typo atleast -> at least Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/scheduler: fix documentation by replacing rq_list with sched_listNirmoy Das1-5/+5
This also replaces old artifacts with a correct one in drm_sched_entity_init() declaration Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/scheduler: improve job distribution with multiple queuesNirmoy Das1-3/+3
This patch uses score based logic to select a new rq for better loadbalance between multiple rq/scheds instead of num_jobs. Below are test results after running amdgpu_test from mesa drm Before this patch: sched_name num of many times it got scheduled ========= ================================== sdma0 314 sdma1 32 comp_1.0.0 56 comp_1.0.1 0 comp_1.1.0 0 comp_1.1.1 0 comp_1.2.0 0 comp_1.2.1 0 comp_1.3.0 0 comp_1.3.1 0 After this patch: sched_name num of many times it got scheduled ========= ================================== sdma0 216 sdma1 185 comp_1.0.0 39 comp_1.0.1 9 comp_1.1.0 12 comp_1.1.1 0 comp_1.2.0 12 comp_1.2.1 0 comp_1.3.0 12 comp_1.3.1 0 Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-19drm/scheduler: rework entity creationNirmoy Das1-3/+5
Entity currently keeps a copy of run_queue list and modify it in drm_sched_entity_set_priority(). Entities shouldn't modify run_queue list. Use drm_gpu_scheduler list instead of drm_sched_rq list in drm_sched_entity struct. In this way we can select a runqueue based on entity/ctx's priority for a drm scheduler. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-08drm/sched: struct completion requires linux/completion.h inclusionStephen Rothwell1-0/+1
Fixes: 83a7772ba223 ("drm/sched: Use completion to wait for sched->thread idle v2.") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-08drm/sched: Use completion to wait for sched->thread idle v2.Andrey Grodzovsky1-0/+2
Removes thread park/unpark hack from drm_sched_entity_fini and by this fixes reactivation of scheduler thread while the thread is supposed to be stopped. v2: Per sched entity completion. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-02drm/scheduler: Add flag to hint the release of guilty job.Andrey Grodzovsky1-0/+2
Problem: Sched thread's cleanup function races against TO handler and removes the guilty job from mirror list and we have no way of differentiating if the job was removed from within the TO handler or from the sched thread's clean-up function. Fix: Add a flag to scheduler to hint the TO handler that the guilty job needs to be explicitly released. v2: whitespace fix Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/1555599624-12285-5-git-send-email-andrey.grodzovsky@amd.com
2019-05-02drm/scheduler: rework job destructionChristian König1-5/+1
We now destroy finished jobs from the worker thread to make sure that we never destroy a job currently in timeout processing. By this we avoid holding lock around ring mirror list in drm_sched_stop which should solve a deadlock reported by a user. v2: Remove unused variable. v4: Move guilty job free into sched code. v5: Move sched->hw_rq_count to drm_sched_start to account for counter decrement in drm_sched_stop even when we don't call resubmit jobs if guily job did signal. v6: remove unused variable Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 Acked-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/1555599624-12285-3-git-send-email-andrey.grodzovsky@amd.com
2019-01-26drm/sched: Rework HW fence processing.Andrey Grodzovsky1-4/+2
Expedite job deletion from ring mirror list to the HW fence signal callback instead from finish_work, together with waiting for all such fences to signal in drm_sched_stop we garantee that already signaled job will not be processed twice. Remove the sched finish fence callback and just submit finish_work directly from the HW fence callback. v2: Fix comments. v3: Attach hw fence cb to sched_job v5: Rebase Suggested-by: Christian Koenig <Christian.Koenig@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-01-26drm/sched: Refactor ring mirror list handling.Andrey Grodzovsky1-3/+4
Decauple sched threads stop and start and ring mirror list handling from the policy of what to do about the guilty jobs. When stoppping the sched thread and detaching sched fences from non signaled HW fenes wait for all signaled HW fences to complete before rerunning the jobs. v2: Fix resubmission of guilty job into HW after refactoring. v4: Full restart for all the jobs, not only from guilty ring. Extract karma increase into standalone function. v5: Rework waiting for signaled jobs without relying on the job struct itself as those might already be freed for non 'guilty' job's schedulers. Expose karma increase to drivers. v6: Use list_for_each_entry_safe_continue and drm_sched_process_job in case fence already signaled. Call drm_sched_increase_karma only once for amdgpu and add documentation. v7: Wait only for the latest job's fence. Suggested-by: Christian Koenig <Christian.Koenig@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-12-06drm/scheduler: Add drm_sched_suspend/resume_timeout()Sharat Masetty1-0/+4
This patch adds two new functions to help client drivers suspend and resume the scheduler job timeout. This can be useful in cases where the hardware has preemption support enabled. Using this, it is possible to have the timeout active only for the ring which is active on the ringbuffer. This patch also makes the job_list_lock IRQ safe. Suggested-by: Christian Koenig <Christian.Koenig@amd.com> Signed-off-by: Sharat Masetty <smasetty@codeaurora.org> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-05drm/scheduler: Add drm_sched_job_cleanupSharat Masetty1-0/+1
This patch adds a new API to clean up the scheduler job resources. This is primarliy needed in cases the job was created but was not queued to the scheduler queue. Additionally with this change, the layer which creates the scheduler job also gets to free up the job's resources and this entails moving the dma_fence_put(finished_fence) to the drivers ops free handler routines. Signed-off-by: Sharat Masetty <smasetty@codeaurora.org> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-05drm/sched: Add boolean to mark if sched is ready to work v5Andrey Grodzovsky1-0/+3
Problem: A particular scheduler may become unsuable (underlying HW) after some event (e.g. GPU reset). If it's later chosen by the get free sched. policy a command will fail to be submitted. Fix: Add a driver specific callback to report the sched status so rq with bad sched can be avoided in favor of working one or none in which case job init will fail. v2: Switch from driver callback to flag in scheduler. v3: rebase v4: Remove ready paramter from drm_sched_init, set uncoditionally to true once init done. v5: fix missed change in v3d in v4 (Alex) Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-05drm/sched: add drm_sched_faultChristian König1-0/+1
Add a helper to immediately start timeout handling in case of a hardware fault. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-09-27drm/scheduler: remove timeout work_struct from drm_sched_job (v3)Nayan Deshmukh1-3/+3
having a delayed work item per job is redundant as we only need one per scheduler to track the time out the currently executing job. v2: the first element of the ring mirror list is the currently executing job so we don't need a additional variable for it v3: squash in fixes for v3d and etnaviv Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-27drm/scheduler: Add stopped flag to drm_sched_entityAndrey Grodzovsky1-0/+2
The flag will prevent another thread from same process to reinsert the entity queue into scheduler's rq after it was already removewd from there by another thread during drm_sched_entity_flush. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-27drm/scheduler: move entity handling into separate fileChristian König1-9/+19
This is complex enough on it's own. Move it into a separate C file. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-27drm/scheduler: fix setting the priorty for entities (v2)Christian König1-3/+2
Since we now deal with multiple rq we need to update all of them, not just the current one. v2: Trivial: Removed unused variable (Alex) Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-27drm/scheduler: add counter for total jobs in schedulerNayan Deshmukh1-0/+2
To keep track of the scheduler load. Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-27drm/scheduler: add a list of run queues to the entityNayan Deshmukh1-1/+6
These are the potential run queues on which the jobs from this entity can be scheduled. We will use this to do load balancing. Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-01drm/scheduler: only kill entity if last user is killed v2Christian König1-0/+2
Note which task is using the entity and only kill it if the last user of the entity is killed. This should prevent problems when entities are leaked to child processes. v2: add missing kernel doc Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-07-25drm/scheduler: remove sched field from the entityNayan Deshmukh1-2/+0
The scheduler of the entity is decided by the run queue on which it is queued. This patch avoids us the effort required to maintain a sync between rq and sched field when we start shifting entites among different rqs. Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Eric Anholt <eric@anholt.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-07-25drm/scheduler: modify API to avoid redundancyNayan Deshmukh1-7/+3
entity has a scheduler field and we don't need the sched argument in any of the functions where entity is provided. Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Eric Anholt <eric@anholt.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-07-13drm/scheduler: modify args of drm_sched_entity_initNayan Deshmukh1-3/+3
replace run queue by a list of run queues and remove the sched arg as that is part of run queue itself Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Eric Anholt <eric@anholt.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-07-13drm/scheduler: add a pointer to scheduler in the rqNayan Deshmukh1-0/+2
This patch is in preparation for a better load balancing in scheduler. It allows us to associate entities with the run queues instead of binding them to a scheduler. Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Eric Anholt <eric@anholt.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-07-06drm/scheduler: Rename cleanup functions v2.Andrey Grodzovsky1-3/+3
Everything in the flush code path (i.e. waiting for SW queue to become empty) names with *_flush() and everything in the release code path names *_fini() This patch also effect the amdgpu and etnaviv drivers which use those functions. v2: Also pplay the change to vd3. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Acked-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-06-15drm/scheduler: Avoid using wait_event_killable for dying process (V4)Andrey Grodzovsky1-3/+4
Dying process might be blocked from receiving any more signals so avoid using it. Also retire enity->fini_status and just check the SW queue, if it's not empty do the fallback cleanup. Also handle entity->last_scheduled == NULL use case which happens when HW ring is already hangged whem a new entity tried to enqeue jobs. v2: Return the remaining timeout and use that as parameter for the next call. This way when we need to cleanup multiple queues we don't wait for the entire TO period for each queue but rather in total. Styling comments. Rebase. v3: Update types from unsigned to long. Work with jiffies instead of ms. Return 0 when TO expires. Rebase. v4: Remove unnecessary timeout calculation. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-06-15drm/scheduler: add documentationNayan Deshmukh1-29/+124
convert existing raw comments into kernel-doc format as well as add new documentation v2: reword the overview Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Daniel Vetter <daniel@ffwll.ch>
2018-05-19drm/scheduler: Remove obsolete spinlock.Andrey Grodzovsky1-1/+0
This spinlock is superfluous, any call to drm_sched_entity_push_job should already be under a lock together with matching drm_sched_job_init to match the order of insertion into queue with job's fence seqence number. v2: Improve patch description. Add functions documentation describing the locking considerations Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Chunming Zhou <david1.zhou@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-05-15drm/scheduler: remove unused parameterNayan Deshmukh1-1/+1
this patch also effect the amdgpu and etnaviv drivers which use the function drm_sched_entity_init Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Suggested-by: Christian König <christian.koenig@amd.com> Acked-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-05-15drm/gpu-sched: fix force APP kill hang(v4)Emily Deng1-0/+7
issue: there are VMC page fault occurred if force APP kill during 3dmark test, the cause is in entity_fini we manually signal all those jobs in entity's queue which confuse the sync/dep mechanism: 1)page fault occurred in sdma's clear job which operate on shadow buffer, and shadow buffer's Gart table is cleaned by ttm_bo_release since the fence in its reservation was fake signaled by entity_fini() under the case of SIGKILL received. 2)page fault occurred in gfx' job because during the lifetime of gfx job we manually fake signal all jobs from its entity in entity_fini(), thus the unmapping/clear PTE job depend on those result fence is satisfied and sdma start clearing the PTE and lead to GFX page fault. fix: 1)should at least wait all jobs already scheduled complete in entity_fini() if SIGKILL is the case. 2)if a fence signaled and try to clear some entity's dependency, should set this entity guilty to prevent its job really run since the dependency is fake signaled. v2: splitting drm_sched_entity_fini() into two functions: 1)The first one is does the waiting, removes the entity from the runqueue and returns an error when the process was killed. 2)The second one then goes over the entity, install it as completion signal for the remaining jobs and signals all jobs with an error code. v3: 1)Replace the fini1 and fini2 with better name 2)Call the first part before the VM teardown in amdgpu_driver_postclose_kms() and the second part after the VM teardown 3)Keep the original function drm_sched_entity_fini to refine the code. v4: 1)Rename entity->finished to entity->last_scheduled; 2)Rename drm_sched_entity_fini_job_cb() to drm_sched_entity_kill_jobs_cb(); 3)Pass NULL to drm_sched_entity_fini_job_cb() if -ENOENT; 4)Replace the type of entity->fini_status with "int"; 5)Remove the check about entity->finished. Signed-off-by: Monk Liu <Monk.Liu@amd.com> Signed-off-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-04-11drm/sched: Extend the documentation.Eric Anholt1-4/+42
These comments answer all the questions I had for myself when implementing a driver using the GPU scheduler. Signed-off-by: Eric Anholt <eric@anholt.net> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>