path: root/kernel/sched
Age    Commit message    Author    Files    Lines
2023-10-31Merge tag 'sched-core-2023-10-28' of ↵Linus Torvalds14-844/+740
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_OTHER) improvements: - Remove the old and now unused SIS_PROP code & option - Scan cluster before LLC in the wake-up path - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup NUMA scheduling improvements: - Improve the VMA access-PID code to better skip/scan VMAs - Extend tracing to cover VMA-skipping decisions - Improve/fix the recently introduced sched_numa_find_nth_cpu() code - Generalize numa_map_to_online_node() Energy scheduling improvements: - Remove the EM_MAX_COMPLEXITY limit - Add tracepoints to track energy computation - Make the behavior of the 'sched_energy_aware' sysctl more consistent - Consolidate and clean up access to a CPU's max compute capacity - Fix uclamp code corner cases RT scheduling improvements: - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates - Drive the ->rto_mask with rt_rq->pushable_tasks updates Scheduler scalability improvements: - Rate-limit updates to tg->load_avg - On x86 disable IBRS when CPU is offline to improve single-threaded performance - Micro-optimize in_task() and in_interrupt() - Micro-optimize the PSI code - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes Core scheduler infrastructure improvements: - Use saved_state to reduce some spurious freezer wakeups - Bring in a handful of fast-headers improvements to scheduler headers - Make the scheduler UAPI headers more widely usable by user-space - Simplify the control flow of scheduler syscalls by using lock guards - Fix sched_setaffinity() vs. CPU hotplug race Scheduler debuggability improvements: - Disallow writing invalid values to sched_rt_period_us - Fix a race in the rq-clock debugging code triggering warnings - Fix a warning in the bandwidth distribution code - Micro-optimize in_atomic_preempt_off() checks - Enforce that the tasklist_lock is held in for_each_thread() - Print the TGID in sched_show_task() - Remove the /proc/sys/kernel/sched_child_runs_first sysctl ... and misc cleanups & fixes" * tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) sched/fair: Remove SIS_PROP sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup sched/fair: Scan cluster before scanning LLC in wake-up path sched: Add cpus_share_resources API sched/core: Fix RQCF_ACT_SKIP leak sched/fair: Remove unused 'curr' argument from pick_next_entity() sched/nohz: Update comments about NEWILB_KICK sched/fair: Remove duplicate #include sched/psi: Update poll => rtpoll in relevant comments sched: Make PELT acronym definition searchable sched: Fix stop_one_cpu_nowait() vs hotplug sched/psi: Bail out early from irq time accounting sched/topology: Rename 'DIE' domain to 'PKG' sched/psi: Delete the 'update_total' function parameter from update_triggers() sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes sched/headers: Remove comment referring to rq::cpu_load, since this has been removed sched/numa: Complete scanning of inactive VMAs when there is no alternative sched/numa: Complete scanning of partial VMAs regardless of PID activity sched/numa: Move up the access pid reset logic sched/numa: Trace decisions related to skipping VMAs ...
2023-10-31Merge tag 'locking-core-2023-10-28' of ↵Linus Torvalds1-12/+50
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Futex improvements:

   - Add the 'futex2' syscall ABI, which is an attempt to get away from
     the multiplex syscall and adds a little room for extensions, while
     lifting some limitations.

   - Fix futex PI recursive rt_mutex waiter state bug

   - Fix inter-process shared futexes on no-MMU systems

   - Use folios instead of pages

  Micro-optimizations of locking primitives:

   - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
     architectures, to improve lockref code generation

   - Improve the x86-32 lockref_get_not_zero() main loop by adding
     build-time CMPXCHG8B support detection for the relevant lockref code,
     and by better interfacing the CMPXCHG8B assembly code with the compiler

   - Introduce arch_sync_try_cmpxchg() on x86 to improve
     sync_try_cmpxchg() code generation. Convert some sync_cmpxchg()
     users to sync_try_cmpxchg().

   - Micro-optimize rcuref_put_slowpath()

  Locking debuggability improvements:

   - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well

   - Enforce atomicity of sched_submit_work(), which is de-facto atomic
     but was un-enforced previously.

   - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics

   - Fix ww_mutex self-tests

   - Clean up const-propagation in <linux/seqlock.h> and simplify the
     API-instantiation macros a bit

  RT locking improvements:

   - Provide the rt_mutex_*_schedule() primitives/helpers and use them
     in the rtmutex code to avoid recursion vs. rtlock on the PI state.

   - Add nested blocking lockdep asserts to rt_mutex_lock(),
     rtlock_lock() and rwbase_read_lock()

  ... plus misc fixes & cleanups"

* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  futex: Don't include process MM in futex key on no-MMU
  locking/seqlock: Fix grammar in comment
  alpha: Fix up new futex syscall numbers
  locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
  locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
  locking/seqlock: Change __seqprop() to return the function pointer
  locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
  locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
  locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
  locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
  locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
  locking/seqlock: Fix typo in comment
  futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
  locking/local, arch: Rewrite local_add_unless() as a static inline function
  locking/debug: Fix debugfs API return value checks to use IS_ERR()
  locking/ww_mutex/test: Make sure we bail out instead of livelock
  locking/ww_mutex/test: Fix potential workqueue corruption
  locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
  futex: Add sys_futex_requeue()
  futex: Add flags2 argument to futex_requeue()
  ...
2023-10-24sched/fair: Remove SIS_PROPPeter Zijlstra4-57/+0
SIS_UTIL seems to work well, let's remove the old thing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
2023-10-24sched/fair: Use candidate prev/recent_used CPU if scanning failed for ↵Yicong Yang1-1/+16
cluster wakeup

Chen Yu reports a hackbench regression with cluster wakeup when the number of hackbench threads equals the number of CPUs [1]. Analysis shows it's because we wake up more tasks on the target CPU even if the prev_cpu is a good wakeup candidate, which leads to a decrease in CPU utilization.

Generally, if the task's prev_cpu is idle we'll wake up the task on it without scanning. On cluster machines we'll try to wake up the task in the same cluster as the target for better cache affinity, so if the prev_cpu is idle but not sharing the same cluster with the target we'll still try to find an idle CPU within the cluster. This will improve the performance at low loads on cluster machines. But in the issue above, if the prev_cpu is idle but not in the cluster with the target CPU, we'll try to scan for an idle one in the cluster. And since the system is busy, we're likely to fail the scanning and use the target instead, even if the prev_cpu is idle. This then leads to the regression.

This patch solves this in 2 steps:
o record the prev_cpu/recent_used_cpu if they're good wakeup candidates but not sharing the cluster with the target.
o on scanning failure use the prev_cpu/recent_used_cpu if they're recorded as idle

[1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/

Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
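For illustration, a minimal sketch of the two steps above (the idle_fallback variable name is hypothetical, not taken from the patch):

    /* Step 1 -- in select_idle_sibling(): remember a good candidate that is
     * idle but outside the target's cluster (hypothetical variable name). */
    if (prev != target && cpus_share_cache(prev, target) &&
        (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
        !cpus_share_resources(prev, target))
            idle_fallback = prev;

    /* Step 2 -- in select_idle_cpu(): if the cluster scan found nothing,
     * prefer the recorded idle candidate over the busy target. */
    if (idle_fallback != -1)
            return idle_fallback;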
2023-10-24sched/fair: Scan cluster before scanning LLC in wake-up pathBarry Song3-4/+49
For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC to gain lower latency. This will be implemented in 2 steps in select_idle_sibling(): 1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them if they're sharing cluster with the target CPU. Otherwise trying to scan for an idle CPU in the target's cluster. 2. Scanning the cluster prior to the LLC of the target CPU for an idle CPU to wakeup. Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. With this patch, We noticed enhancement on tbench and netperf within one numa or cross two numa on top of tip-sched-core commit 9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()") tbench results (node 0): baseline patched 1: 327.2833 372.4623 ( 13.80%) 4: 1320.5933 1479.8833 ( 12.06%) 8: 2638.4867 2921.5267 ( 10.73%) 16: 5282.7133 5891.5633 ( 11.53%) 32: 9810.6733 9877.3400 ( 0.68%) 64: 7408.9367 7447.9900 ( 0.53%) 128: 6203.2600 6191.6500 ( -0.19%) tbench results (node 0-1): baseline patched 1: 332.0433 372.7223 ( 12.25%) 4: 1325.4667 1477.6733 ( 11.48%) 8: 2622.9433 2897.9967 ( 10.49%) 16: 5218.6100 5878.2967 ( 12.64%) 32: 10211.7000 11494.4000 ( 12.56%) 64: 13313.7333 16740.0333 ( 25.74%) 128: 13959.1000 14533.9000 ( 4.12%) netperf results TCP_RR (node 0): baseline patched 1: 76546.5033 90649.9867 ( 18.42%) 4: 77292.4450 90932.7175 ( 17.65%) 8: 77367.7254 90882.3467 ( 17.47%) 16: 78519.9048 90938.8344 ( 15.82%) 32: 72169.5035 72851.6730 ( 0.95%) 64: 25911.2457 25882.2315 ( -0.11%) 128: 10752.6572 10768.6038 ( 0.15%) netperf results TCP_RR (node 0-1): baseline patched 1: 76857.6667 90892.2767 ( 18.26%) 4: 78236.6475 90767.3017 ( 16.02%) 8: 77929.6096 90684.1633 ( 16.37%) 16: 77438.5873 90502.5787 ( 16.87%) 32: 74205.6635 88301.5612 ( 19.00%) 64: 69827.8535 71787.6706 ( 2.81%) 128: 25281.4366 25771.3023 ( 1.94%) netperf results UDP_RR (node 0): baseline patched 1: 96869.8400 110800.8467 ( 14.38%) 4: 97744.9750 109680.5425 ( 12.21%) 8: 98783.9863 110409.9637 ( 11.77%) 16: 99575.0235 110636.2435 ( 11.11%) 32: 95044.7250 97622.8887 ( 2.71%) 64: 32925.2146 32644.4991 ( -0.85%) 128: 12859.2343 12824.0051 ( -0.27%) netperf results UDP_RR (node 0-1): baseline patched 1: 97202.4733 110190.1200 ( 13.36%) 4: 95954.0558 106245.7258 ( 10.73%) 8: 96277.1958 105206.5304 ( 9.27%) 16: 97692.7810 107927.2125 ( 10.48%) 32: 79999.6702 103550.2999 ( 29.44%) 64: 80592.7413 87284.0856 ( 8.30%) 128: 27701.5770 29914.5820 ( 7.99%) Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch in the code has not been tested but it supposed to work. Chen Yu also noticed this will improve the performance of tbench and netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one cluster sharing L2 Cache. [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Gautham R. 
Shenoy <gautham.shenoy@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
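As a rough sketch of the scan ordering described above (cluster_span() and llc_span() are placeholder helpers for this illustration, not kernel APIs):

    /* Scan the target's cluster first: an idle CPU here shares the L2 /
     * low-latency interconnect with the target. */
    for_each_cpu_wrap(cpu, cluster_span(target), target + 1) {
            if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                    return cpu;
    }

    /* Only then fall back to the rest of the LLC. */
    for_each_cpu_wrap(cpu, llc_span(target), target + 1) {
            if (cpumask_test_cpu(cpu, cluster_span(target)))
                    continue;       /* already scanned above */
            if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                    return cpu;
    }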
2023-10-24sched: Add cpus_share_resources APIBarry Song3-0/+26
Add cpus_share_resources() API. This is preparation for the optimization of select_idle_cpu() on platforms with a cluster scheduler level.

On a machine with clusters cpus_share_resources() will test whether two CPUs are within the same cluster. On a non-cluster machine it will behave the same as cpus_share_cache(). So we use "resources" here for cache resources.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
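A minimal sketch of what such an API can look like, mirroring cpus_share_cache(); the sd_share_id per-CPU variable is an assumption based on the description above:

    bool cpus_share_resources(int this_cpu, int that_cpu)
    {
            if (this_cpu == that_cpu)
                    return true;

            /* Same cluster (e.g. shared L2); on non-cluster machines this id
             * falls back to the LLC id, i.e. it behaves like cpus_share_cache(). */
            return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
    }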
2023-10-24sched/core: Fix RQCF_ACT_SKIP leakHao Jia1-4/+1
Igor Raits and Bagas Sanjaya report an RQCF_ACT_SKIP leak warning.

This warning may be triggered in the following situation:

        CPU0                                      CPU1

__schedule()
  *rq->clock_update_flags <<= 1;*       unregister_fair_sched_group()
  pick_next_task_fair+0x4a/0x410          destroy_cfs_bandwidth()
  newidle_balance+0x115/0x3e0               for_each_possible_cpu(i) *i=0*
    rq_unpin_lock(this_rq, rf)              __cfsb_csd_unthrottle()
    raw_spin_rq_unlock(this_rq)
                                            rq_lock(*CPU0_rq*, &rf)
                                            rq_clock_start_loop_update()
                                            rq->clock_update_flags & RQCF_ACT_SKIP <--
    raw_spin_rq_lock(this_rq)

The purpose of RQCF_ACT_SKIP is to skip the rq clock update, but while that update happens very early in __schedule(), we clear RQCF_*_SKIP very late, causing it to span the gap above and trigger this warning.

In __schedule() we can clear the RQCF_*_SKIP flag immediately after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning, and set rq->clock_update_flags to RQCF_UPDATED to avoid the rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later.

Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
Reported-by: Igor Raits <igor.raits@gmail.com>
Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
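A sketch of the fix as described above -- clear the skip bits right after the clock update in __schedule() (surrounding code abbreviated):

    /* __schedule(): */
    rq_lock(rq, &rf);
    smp_mb__after_spinlock();

    rq->clock_update_flags <<= 1;           /* promote REQ to ACT */
    update_rq_clock(rq);
    /*
     * Clear RQCF_*_SKIP immediately and mark the clock as updated, so a
     * remote CPU taking this rq lock (e.g. __cfsb_csd_unthrottle()) can
     * no longer observe a stale RQCF_ACT_SKIP.
     */
    rq->clock_update_flags = RQCF_UPDATED;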
2023-10-23Merge tag 'v6.6-rc7' into sched/core, to pick up fixesIngo Molnar1-14/+60
Pick up recent sched/urgent fixes merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-20sched/fair: Remove unused 'curr' argument from pick_next_entity()Yiwei Lin1-4/+4
The 'curr' argument of pick_next_entity() has become unused after the EEVDF changes. [ mingo: Updated the changelog. ] Signed-off-by: Yiwei Lin <s921975628@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
2023-10-20sched/nohz: Update comments about NEWILB_KICKJoel Fernandes (Google)1-2/+13
How ILB is triggered without IPIs is cryptic. Out of mercy for future code readers, document it in code comments. The comments are derived from a discussion with Vincent in a past review. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
2023-10-18sched/fair: Remove duplicate #includeJiapeng Chong1-2/+0
./kernel/sched/fair.c: linux/sched/cond_resched.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231018062759.44375-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6907
2023-10-18sched/eevdf: Fix heap corruption morePeter Zijlstra1-1/+2
Because someone is a flaming idiot... and forgot we have current as se->on_rq but not actually in the tree itself, and walking rb_parent() on an entry not in the tree is 'funky' and KASAN complains. Fixes: 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap integrity") Reported-by: 0599jiangyc@gmail.com Reported-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020 Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com
2023-10-16sched/psi: Update poll => rtpoll in relevant commentsFan Yu1-16/+16
The PSI trigger code is now making a distinction between privileged and unprivileged triggers, after the following commit: 65457b74aa94 ("sched/psi: Rename existing poll members in preparation") But some comments have not been modified along with the code, so they need to be updated. This will help readers better understand the code. Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Ziljstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
2023-10-13sched: Make PELT acronym definition searchableMathieu Desnoyers1-1/+1
The PELT acronym definition can be found right at the top of kernel/sched/pelt.c (of course), but it cannot be found through use of grep -r PELT kernel/sched/ Add the acronym "(PELT)" after "Per Entity Load Tracking" at the top of the source file. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231012125824.1260774-1-mathieu.desnoyers@efficios.com
2023-10-13sched: Fix stop_one_cpu_nowait() vs hotplugPeter Zijlstra4-3/+17
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning.

Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete().

The race scenario is:

        CPU0                                    CPU1

                                        // doing _cpu_down()
  __set_cpus_allowed_ptr()
    task_rq_lock();
                                        takedown_cpu()
                                          stop_machine_cpuslocked(take_cpu_down..)

                                        <PREEMPT: cpu_stopper_thread()
                                          MULTI_STOP_PREPARE
                                          ...
    __set_cpus_allowed_ptr_locked()
      affine_move_task()
        task_rq_unlock();

  <PREEMPT: cpu_stopper_thread()\>
    ack_state()
                                          MULTI_STOP_RUN
                                            take_cpu_down()
                                              __cpu_disable();
                                              stop_machine_park();
                                                stopper->enabled = false;
                                        />
  />
        stop_one_cpu_nowait(.fn = migration_cpu_stop);
          if (stopper->enabled) // false!!!

That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete.

OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be run under the scheduler locks.

Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination:

    preempt_disable();
    task_rq_unlock(...);
    if (!stop_pending)
        stop_one_cpu_nowait(...)
    preempt_enable();

This respects the lock ordering constraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed.

Apply this pattern to all similar stop_one_cpu_nowait() invocations.

Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
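Spelled out, the pattern above looks roughly like this at a call site such as affine_move_task() (argument lists abbreviated; a sketch of the ordering, not a verbatim hunk):

    /* Keep preemption disabled across unlock + stopper queueing so the
     * stopper thread cannot run -- and the target CPU cannot finish going
     * down -- between the online check and the queueing. */
    preempt_disable();
    task_rq_unlock(rq, p, &rf);
    if (!stop_pending)
            stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop, ...);
    preempt_enable();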
2023-10-13sched/psi: Bail out early from irq time accountingHaifeng Xu1-0/+3
We could bail out early when psi was disabled. Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
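Presumably along these lines -- a hedged sketch, since the log above does not quote the function body; the psi_disabled static key is the existing idiom in kernel/sched/psi.c:

    /* In the irq time accounting path (e.g. psi_account_irqtime()): */
    if (static_branch_likely(&psi_disabled))
            return;         /* nothing to account when PSI is off */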
2023-10-12sched/topology: Rename 'DIE' domain to 'PKG'Peter Zijlstra2-5/+5
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain for the package mask. :-)

Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the name less ambiguous.

[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurrences. ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
2023-10-12sched/psi: Delete the 'update_total' function parameter from update_triggers()Yang Yang1-14/+3
The 'update_total' parameter of update_triggers() is always true after the previous commit: 80cc1d1d5ee3 ("sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes") If the 'changed_states & group->rtpoll_states' condition is true, 'new_stall' in update_triggers() will be true, and then 'update_total' should also be true. So update_total is redundant - remove it. [ mingo: Changelog updates ] Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Ziljstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310101645437859599@zte.com.cn
2023-10-12sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no ↵Yang Yang1-3/+4
state changes When psimon wakes up and there are no state changes for ->rtpoll_states, it's unnecessary to update triggers and ->rtpoll_total because the pressures being monitored by the user have not changed. This will help to slightly reduce unnecessary computations of PSI. [ mingo: Changelog updates ] Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Ziljstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310101641075436843@zte.com.cn
2023-10-11sched/headers: Remove comment referring to rq::cpu_load, since this has been ↵Colin Ian King1-4/+0
removed There is a comment that refers to cpu_load, however, this cpu_load was removed with: 55627e3cd22c ("sched/core: Remove rq->cpu_load[]") ... back in 2019. The comment does not make sense with respect to this removed array, so remove the comment. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010155744.1381065-1-colin.i.king@gmail.com
2023-10-11sched/numa: Complete scanning of inactive VMAs when there is no alternativeMel Gorman1-3/+52
VMAs are skipped if there is no recent fault activity but this represents a chicken-and-egg problem as there may be no fault activity if the PTEs are never updated to trap NUMA hints. There is an indirect reliance on scanning to be forced early in the lifetime of a task but this may fail to detect changes in phase behaviour. Force inactive VMAs to be scanned when all other eligible VMAs have been updated within the same scan sequence. Test results in general look good with some changes in performance, both negative and positive, depending on whether the additional scanning and faulting was beneficial or not to the workload. The autonuma benchmark workload NUMA01_THREADLOCAL was picked for closer examination. The workload creates two processes with numerous threads and thread-local storage that is zero-filled in a loop. It exercises the corner case where unrelated threads may skip VMAs that are thread-local to another thread and still has some VMAs that inactive while the workload executes. The VMA skipping activity frequency with and without the patch: 6.6.0-rc2-sched-numabtrace-v1 ============================= 649 reason=scan_delay 9,094 reason=unsuitable 48,915 reason=shared_ro 143,919 reason=inaccessible 193,050 reason=pid_inactive 6.6.0-rc2-sched-numabselective-v1 ============================= 146 reason=seq_completed 622 reason=ignore_pid_inactive 624 reason=scan_delay 6,570 reason=unsuitable 16,101 reason=shared_ro 27,608 reason=inaccessible 41,939 reason=pid_inactive Note that with the patch applied, the PID activity is ignored (ignore_pid_inactive) to ensure a VMA with some activity is completely scanned. In addition, a small number of VMAs are scanned when no other eligible VMA is available during a single scan window (seq_completed). The number of times a VMA is skipped due to no PID activity from the scanning task (pid_inactive) drops dramatically. It is expected that this will increase the number of PTEs updated for NUMA hinting faults as well as hinting faults but these represent PTEs that would otherwise have been missed. The tradeoff is scan+fault overhead versus improving locality due to migration. On a 2-socket Cascade Lake test machine, the time to complete the workload is as follows; 6.6.0-rc2 6.6.0-rc2 sched-numabtrace-v1 sched-numabselective-v1 Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%) Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%* Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%) CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%) Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%) The time to complete the workload is reduced by almost 30%: 6.6.0-rc2 6.6.0-rc2 sched-numabtrace-v1 sched-numabselective-v1 / Duration User 91201.80 63506.64 Duration System 2015.53 1819.78 Duration Elapsed 1234.77 868.37 In this specific case, system CPU time was not increased but it's not universally true. From vmstat, the NUMA scanning and fault activity is as follows; 6.6.0-rc2 6.6.0-rc2 sched-numabtrace-v1 sched-numabselective-v1 Ops NUMA base-page range updates 64272.00 26374386.00 Ops NUMA PTE updates 36624.00 55538.00 Ops NUMA PMD updates 54.00 51404.00 Ops NUMA hint faults 15504.00 75786.00 Ops NUMA hint local faults % 14860.00 56763.00 Ops NUMA hint local percent 95.85 74.90 Ops NUMA pages migrated 1629.00 6469222.00 Both the number of PTE updates and hint faults is dramatically increased. While this is superficially unfortunate, it represents ranges that were simply skipped without the patch. 
As a result of the scanning and hinting faults, many more pages were also migrated but as the time to completion is reduced, the overhead is offset by the gain. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
2023-10-11sched/numa: Complete scanning of partial VMAs regardless of PID activityMel Gorman1-3/+15
NUMA Balancing skips VMAs when the current task has not trapped a NUMA fault within the VMA. If the VMA is skipped then mm->numa_scan_offset advances and a task that is trapping faults within the VMA may never fully update PTEs within the VMA. Force tasks to update PTEs for partially scanned PTEs. The VMA will be tagged for NUMA hints by some task but this removes some of the benefit of tracking PID activity within a VMA. A follow-on patch will mitigate this problem. The test cases and machines evaluated did not trigger the corner case so the performance results are neutral with only small changes within the noise from normal test-to-test variance. However, the next patch makes the corner case easier to trigger. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Link: https://lore.kernel.org/r/20231010083143.19593-6-mgorman@techsingularity.net
2023-10-10sched/numa: Move up the access pid reset logicRaghavendra K T1-10/+7
Recent NUMA hinting faulting activity is reset approximately every VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not accessed a VMA then the reset check is missed and the reset is potentially deferred forever. Check if the PID activity information should be reset before checking if the current task recently trapped a NUMA hinting fault. [ mgorman@techsingularity.net: Rewrite changelog ] Suggested-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-5-mgorman@techsingularity.net
2023-10-10sched/numa: Trace decisions related to skipping VMAsMel Gorman1-4/+13
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation for completing scans of VMAs regardless of PID access, trace the reasons why a VMA was skipped. In a later patch, the tracing will be used to track if a VMA was forcibly scanned. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-4-mgorman@techsingularity.net
2023-10-10sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ↵Mel Gorman1-6/+6
::next_pid_reset => ::pids_active_reset The access_pids[] field name is somewhat ambiguous as no PIDs are accessed. Similarly, it's not clear that next_pid_reset is related to access_pids[]. Rename the fields to more accurately reflect their purpose. [ mingo: Rename in the comments too. ] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-3-mgorman@techsingularity.net
2023-10-09Merge tag 'v6.6-rc5' into locking/core, to pick up fixesIngo Molnar5-5/+18
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-09sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.hIngo Molnar2-1/+2
Move it out of the .c file into the shared scheduler-internal header file, to gain type-checking. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Cc: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
2023-10-09sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based ↵Shrikanth Hegde1-38/+74
on the platform

The 'sched_energy_aware' sysctl is available for the admin to disable/enable energy-aware scheduling (EAS). EAS is enabled only if a few conditions are met by the platform. They are: asymmetric CPU capacity, no SMT, the schedutil CPUfreq governor, frequency-invariant load tracking, etc. A platform may boot without EAS capability, but could gain such capability at runtime. For example, by changing/registering the cpufreq governor to schedutil.

At present, even though the platform doesn't support EAS, this sysctl returns 1, and it ends up calling build_perf_domains on a write of 1 and is a NOP when writing 0. That is confusing and unnecessary.

The desired behavior is for this sysctl to enable/disable EAS on supported platforms. On unsupported platforms, a write to the sysctl returns a "not supported" error and a read of the sysctl returns empty. So:

 sched_energy_aware returns empty - EAS is not possible at this moment.
   This includes EAS-capable platforms which have at least one EAS
   condition false during startup, e.g. not using the schedutil cpufreq
   governor.
 sched_energy_aware returns 0 - EAS is supported but disabled by admin.
 sched_energy_aware returns 1 - EAS is supported and enabled.

Users can find out the reason why EAS is not possible by checking info messages. sched_is_eas_possible() returns true if the platform can do EAS at this moment.

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
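A sketch of the resulting sysctl handler behaviour based on the changelog (exact signature, error codes and the simplification at the end are assumptions):

    static int sched_energy_aware_handler(struct ctl_table *table, int write,
                                          void *buffer, size_t *lenp, loff_t *ppos)
    {
            if (write && !capable(CAP_SYS_ADMIN))
                    return -EPERM;

            /* Platform cannot do EAS right now: reject writes, read as empty. */
            if (!sched_is_eas_possible(cpu_active_mask)) {
                    if (write)
                            return -EOPNOTSUPP;
                    *lenp = 0;
                    return 0;
            }

            /* Otherwise parse 0/1 and rebuild the perf domains as needed. */
            return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
    }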
2023-10-09sched/psi: Change update_triggers() to a 'void' functionYang Yang1-4/+3
Update_triggers() always returns now + group->rtpoll_min_period, and the return value is only used by psi_rtpoll_work(), so change update_triggers() to a void function, let group->rtpoll_next_update = now + group->rtpoll_min_period directly. This will avoid unnecessary function return value passing & simplifies the function. [ mingo: Updated changelog ] Suggested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/202310092024289721617@zte.com.cn
2023-10-09sched/topology: Remove the EM_MAX_COMPLEXITY limitPierre Gondois1-36/+3
The Energy Aware Scheduler (EAS) estimates the energy consumption of placing a task on different CPUs. The goal is to minimize this energy consumption. Estimating the energy of different task placements is increasingly complex with the size of the platform. To avoid having a slow wake-up path, EAS is only enabled if this complexity is low enough. The current complexity limit was set in: b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms") ... based on the first implementation of EAS, which was re-computing the power of the whole platform for each task placement scenario, see: 390031e4c309 ("sched/fair: Introduce an energy estimation helper function") ... but the complexity of EAS was reduced in: eb92692b2544d ("sched/fair: Speed-up energy-aware wake-ups") ... and find_energy_efficient_cpu() (feec) algorithm was updated in: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()") find_energy_efficient_cpu() (feec) is now doing: feec() \_ for_each_pd(pd) [0] // get max_spare_cap_cpu and compute_prev_delta \_ for_each_cpu(pd) [1] \_ eenv_pd_busy_time(pd) [2] \_ for_each_cpu(pd) // compute_energy(pd) without the task \_ eenv_pd_max_util(pd, -1) [3.0] \_ for_each_cpu(pd) \_ em_cpu_energy(pd, -1) \_ for_each_ps(pd) // compute_energy(pd) with the task on prev_cpu \_ eenv_pd_max_util(pd, prev_cpu) [3.1] \_ for_each_cpu(pd) \_ em_cpu_energy(pd, prev_cpu) \_ for_each_ps(pd) // compute_energy(pd) with the task on max_spare_cap_cpu \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2] \_ for_each_cpu(pd) \_ em_cpu_energy(pd, max_spare_cap_cpu) \_ for_each_ps(pd) [3.1] happens only once since prev_cpu is unique. With the same definitions for nr_pd, nr_cpus and nr_ps, the complexity is of: nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd])) + ([nr_cpus in pd] + [nr_ps in pd]) [0] * ( [1] + [2] + [3.0] + [3.2] ) + [3.1] = nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd]) + [nr_cpus in prev pd] + nr_ps The complexity limit was set to 2048 in: b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms") ... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8 performance states each". For the same platform, the complexity would actually be of: 16 * (4 + 2 * 7) + 1 + 7 = 296 Since the EAS complexity was greatly reduced since the limit was introduced, bigger platforms can handle EAS. For instance, a platform with 112 CPUs with 7 performance states each would not reach it: 112 * (4 + 2 * 7) + 1 + 7 = 2024 To reflect this improvement in the underlying EAS code, remove the EAS complexity check. Note that a limit on the number of CPUs still holds against EM_MAX_NUM_CPUS to avoid overflows during the energy estimation. [ mingo: Updates to the changelog. ] Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
2023-10-09sched/topology: Consolidate and clean up access to a CPU's max compute capacityVincent Guittot7-23/+18
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity() instead. The scheduler uses 3 methods to get access to a CPU's max compute capacity: - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity. - cpu_capacity_orig field which is periodically updated with arch_scale_cpu_capacity(). - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig. There is no real need to save the value returned by arch_scale_cpu_capacity() in struct rq. arch_scale_cpu_capacity() returns: - either a per_cpu variable. - or a const value for systems which have only one capacity. Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere. No functional changes. Some performance tests on Arm64: - small SMP device (hikey): no noticeable changes - HMP device (RB5): hackbench shows minor improvement (1-2%) - large smp (thx2): hackbench and tbench shows minor improvement (1%) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
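The consolidation in a nutshell (illustrative call-site change):

    /* Before: read a cached copy that had to be kept in sync with the arch. */
    capacity = capacity_orig_of(cpu);           /* rq->cpu_capacity_orig */

    /* After: ask the architecture directly; no cached field in struct rq. */
    capacity = arch_scale_cpu_capacity(cpu);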
2023-10-09sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to ↵Yajun Deng2-4/+4
'int' Doing this matches the natural type of 'int' based calculus in sched_rt_handler(), and also enables the adding in of a correct upper bounds check on the sysctl interface. [ mingo: Rewrote the changelog. ] Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev
2023-10-09sched/nohz: Remove unnecessarily complex error handling pattern from ↵Ingo Molnar1-3/+2
find_new_ilb()

find_new_ilb() returns nr_cpu_ids on failure - which is the usual cpumask bitops return pattern, but is weird & unnecessary in this context: not only is it a global variable, it is a +1 out-of-bounds CPU index and also has different signedness ...

Its only user, kick_ilb(), then checks the return against nr_cpu_ids to decide to return. There's no other use.

So instead of this, use a standard -1 return on failure to find an idle CPU, as the argument is signed already.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20231006102518.2452758-4-mingo@kernel.org
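A sketch of the new convention described above:

    static int find_new_ilb(void)
    {
            ...
            return -1;                      /* no idle load-balancer CPU found */
    }

    static void kick_ilb(unsigned int flags)
    {
            int ilb_cpu = find_new_ilb();

            if (ilb_cpu < 0)                /* was: if (ilb_cpu >= nr_cpu_ids) */
                    return;
            ...
    }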
2023-10-09sched/nohz: Use consistent variable names in find_new_ilb() and kick_ilb()Ingo Molnar1-5/+5
Use 'ilb_cpu' consistently in both functions. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-3-mingo@kernel.org
2023-10-09sched/nohz: Update idle load-balancing (ILB) commentsIngo Molnar1-14/+16
- Fix incorrect/misleading comments, - clarify some others, - fix typos & grammar, - and use more consistent style throughout. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-2-mingo@kernel.org
2023-10-09sched/eevdf: Fix pick_eevdf()Benjamin Segall1-14/+58
The old pick_eevdf() could fail to find the actual earliest eligible deadline when it descended to the right looking for min_deadline, but it turned out that that min_deadline wasn't actually eligible. In that case we need to go back and search through any left branches we skipped looking for the actual best _eligible_ min_deadline. This is more expensive, but still O(log n), and at worst should only involve descending two branches of the rbtree. I've run this through a userspace stress test (thank you tools/lib/rbtree.c), so hopefully this implementation doesn't miss any corner cases. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
2023-10-09sched/eevdf: Fix min_deadline heap integrityPeter Zijlstra1-0/+1
Marek and Biju reported instances of: "EEVDF scheduling fail, picking leftmost" which Mike correlated with cgroup scheduling and the min_deadline heap getting corrupted; some trace output confirms: > And yeah, min_deadline is hosed somehow: > > validate_cfs_rq: --- / > __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr) > __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31) > __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33) > validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256 Turns out that reweight_entity(), which tries really hard to be fast, does not do the normal dequeue+update+enqueue pattern but *does* scale the deadline. However, it then fails to propagate the updated deadline value up the heap. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Biju Das <biju.das.jz@bp.renesas.com> Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Tested-by: Mike Galbraith <efault@gmx.de> Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
2023-10-07sched/debug: Print 'tgid' in sched_show_task()Yajun Deng1-3/+3
Multiple blocked tasks are printed when the system hangs. They may have the same parent pid, but belong to different task groups. Printing tgid lets users better know whether these tasks are from the same task group or not. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
2023-10-07sched/core: Update stale comment in try_to_wake_up()Ingo Molnar1-1/+1
The following commit: 9b3c4ab3045e ("sched,rcu: Rework try_invoke_on_locked_down_task()") ... renamed try_invoke_on_locked_down_task() to task_call_func(), but forgot to update the comment in try_to_wake_up(). But it turns out that the smp_rmb() doesn't live in task_call_func() either, it was moved to __task_needs_rq_lock() in: 91dabf33ae5d ("sched: Fix race in task_call_func()") Fix that now. Also fix the s/smb/smp typo while at it. Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230731085759.11443-1-zhangqiao22@huawei.com
2023-10-07Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh ↵Ingo Molnar5-7/+43
the branch Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-05cpufreq: schedutil: Update next_freq when cpufreq_limits changeXuewen Yan1-1/+2
When cpufreq's policy is 'single', there is a scenario that will cause sg_policy's next_freq to be unable to update.

When the CPU's util is always max, the cpufreq will be max, and then if we change the policy's scaling_max_freq to a lower freq, the sg_policy's next_freq needs to change to the lower freq; however, because the CPU is busy, next_freq stays at the max freq.

For example, cpu7 has a single-CPU policy:

  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done&
  [1] 4737
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
  pid 4737's current affinity mask: ff
  pid 4737's new affinity mask: 80
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2171000

At this time, the sg_policy's next_freq would stay at 2301000, which is wrong.

To fix this, add a check for the ->need_freq_update flag.

[ mingo: Clarified the changelog. ]

Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
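A sketch of the kind of check the changelog describes, in schedutil's single-CPU path (treat the exact placement and the surrounding condition as assumptions):

    /*
     * If the CPU looks busy we normally keep the previous (higher) frequency,
     * but that shortcut must not be taken when the policy limits changed,
     * otherwise next_freq can stay stuck at the old maximum.
     */
    if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
        !sg_policy->need_freq_update)
            next_f = sg_policy->next_freq;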
2023-10-03sched/headers: Remove duplicate header inclusionsYu Liao2-2/+0
<linux/psi.h> and "autogroup.h" are included twice, remove the duplicate header inclusion. Signed-off-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230802021501.2511569-1-liaoyu15@huawei.com
2023-10-03sched/eevdf: Fix avg_vruntime()Peter Zijlstra1-1/+9
The expectation is that placing a task at avg_vruntime() makes it eligible. Turns out there is a corner case where this is not the case.

Specifically, avg_vruntime() relies on the fact that integer division is a flooring function (i.e. it discards the remainder). By this property the value returned is slightly left of the true average.

However, when the average is negative (relative to min_vruntime) the effect is flipped and it becomes a ceiling, with the result that the returned value is just right of the average and thus not eligible.

Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
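A worked illustration of the rounding issue and the usual floor adjustment it implies (a sketch, not necessarily the exact hunk):

    /*
     * C integer division truncates toward zero:
     *     7 / 2 ==  3   -> acts as floor, result lands left of the average
     *    -7 / 2 == -3   -> acts as ceil, result lands right of the average
     *
     * So when the weighted sum is negative relative to min_vruntime, bias it
     * down first so the division becomes a true floor:
     */
    if (avg < 0)
            avg -= (load - 1);              /* floor instead of ceil */
    avg = div_s64(avg, load);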
2023-10-03sched/eevdf: Also update slice on placementPeter Zijlstra1-2/+4
Tasks that never consume their full slice would not update their slice value. This means that tasks that are spawned before the sysctl scaling keep their original (UP) slice length. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
2023-10-02sched/rt: Disallow writing invalid values to sched_rt_period_usCyril Hrubis1-4/+5
The validation of the value written to sched_rt_period_us was broken because:

  - sysctl_sched_rt_period is declared as an unsigned int
  - it is parsed by proc_dointvec()
  - the range is asserted after the value is parsed by proc_dointvec()

Because of this, negative values written to the file were stored in an unsigned integer and later interpreted as large positive integers, which then passed the check:

  if (sysctl_sched_rt_period <= 0)
          return -EINVAL;

This commit fixes the parsing by setting explicit ranges for both period_us and runtime_us in the sched_rt_sysctls table and processing the values with proc_dointvec_minmax() instead.

Alternatively, if we wanted to use the full range of unsigned int for the period value we would have to split the proc_handler and use proc_douintvec() for it; however, even Documentation/scheduler/sched-rt-group.rst describes the range as 1 to INT_MAX.

As far as I can tell the only problem this causes is that the sysctl file allows writing negative values which, when read back, may confuse userspace.

There is also an LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
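For illustration, a sched_rt_sysctls entry using proc_dointvec_minmax() with an explicit range might look like this (the exact bounds chosen by the patch are an assumption here):

    {
            .procname       = "sched_rt_period_us",
            .data           = &sysctl_sched_rt_period,
            .maxlen         = sizeof(int),
            .mode           = 0644,
            .proc_handler   = proc_dointvec_minmax,
            .extra1         = SYSCTL_ONE,           /* assumed lower bound: 1 */
            .extra2         = SYSCTL_INT_MAX,       /* matches the documented 1..INT_MAX */
    },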
2023-09-29sched/debug: Add new tracepoint to track compute energy computationQais Yousef2-1/+7
It was useful to track feec() placement decision and debug the spare capacity and optimization issues vs uclamp_max. Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230916232955.2099394-4-qyousef@layalina.io
2023-09-29sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0Qais Yousef1-17/+1
find_energy_efficient_cpu() bails out early if effective util of the task is 0 as the delta at this point will be zero and there's nothing for EAS to do. When uclamp is being used, this could lead to wrong decisions when uclamp_max is set to 0. In this case the task is capped to performance point 0, but it is actually running and consuming energy and we can benefit from EAS energy calculations. Rework the condition so that it bails out when both util and uclamp_min are 0. We can do that without needing to use uclamp_task_util(); remove it. Fixes: d81304bc6193 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition") Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230916232955.2099394-3-qyousef@layalina.io
2023-09-29sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0Qais Yousef1-6/+5
When uclamp_max is being used, the util of the task could be higher than the spare capacity of the CPU, but due to the uclamp_max value we force-fit it there.

The way the condition for checking for max_spare_cap in find_energy_efficient_cpu() was constructed, it ignored any CPU that has its spare_cap less than or _equal_ to max_spare_cap. Since we initialize max_spare_cap to 0, this led to never setting max_spare_cap_cpu and hence never performing compute_energy() for this cluster, missing an opportunity for a more energy-efficient placement that honours the uclamp_max setting.

    max_spare_cap = 0;
    cpu_cap = capacity_of(cpu) - cpu_util(p);  // 0 if cpu_util(p) is high

    ...

    util_fits_cpu(...);  // will return true if uclamp_max forces it to fit

    ...

    // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
    if (cpu_cap > max_spare_cap) {
        max_spare_cap = cpu_cap;
        max_spare_cap_cpu = cpu;
    }

prev_spare_cap suffers from a similar problem.

Fix the logic by converting the variables into long and treating a -1 value as 'not populated' instead of 0, which is a viable and correct spare capacity value. We need to be careful that a signed comparison is used when comparing with cpu_cap in one of the conditions.

Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-2-qyousef@layalina.io
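Mirroring the snippet above, the fixed bookkeeping amounts to roughly the following (a sketch; the real patch may differ in detail):

    long max_spare_cap = -1, prev_spare_cap = -1;   /* -1 == "not populated" */
    unsigned long cpu_cap;

    ...

    cpu_cap = capacity_of(cpu) - cpu_util(p);       /* may legitimately be 0 */

    ...

    /* signed comparison, so a forced-fit CPU with 0 spare capacity still wins */
    if ((long)cpu_cap > max_spare_cap) {
            max_spare_cap = cpu_cap;
            max_spare_cap_cpu = cpu;
    }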
2023-09-29sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloadedValentin Schneider3-45/+14
dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has nr_cpus_allowed > 1. Unlike the pushable_dl_tasks tree, dl_rq->dl_nr_migratory includes a dl_rq's current task. This means a dl_rq can have a migratable current, N non-migratable queued tasks, and be flagged as overloaded and have its CPU set in the dlo_mask, despite having an empty pushable_tasks tree. Make an dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task(), in other words make DL RQs only be flagged as overloaded if they have at least one runnable-but-not-current migratable task. o push_dl_task() is unaffected, as it is a no-op if there are no pushable tasks. o pull_dl_task() now no longer scans runqueues whose sole migratable task is their current one, which it can't do anything about anyway. It may also now pull tasks to a DL RQ with dl_nr_running > 1 if only its current task is migratable. Since dl_rq->dl_nr_migratory becomes unused, remove it. RT had the exact same mechanism (rt_rq->rt_nr_migratory) which was dropped in favour of relying on rt_rq->pushable_tasks, see: 612f769edd06 ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask") Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20230928150251.463109-1-vschneid@redhat.com
2023-09-28sched/rt: Fix live lock between select_fallback_rq() and RT pushJoel Fernandes (Google)1-0/+1
During RCU-boost testing with the TREE03 rcutorture config, I found that after a few hours, the machine locks up. On tracing, I found that there is a live lock happening between 2 CPUs. One CPU has an RT task running, while another CPU is being offlined which also has an RT task running. During this offlining, all threads are migrated. The migration thread is repeatedly scheduled to migrate actively running tasks on the CPU being offlined. This results in a live lock because select_fallback_rq() keeps picking the CPU that an RT task is already running on only to get pushed back to the CPU being offlined. It is anyway pointless to pick CPUs for pushing tasks to if they are being offlined only to get migrated away to somewhere else. This could also add unwanted latency to this task. Fix these issues by not selecting CPUs in RT if they are not 'active' for scheduling, using the cpu_active_mask. Other parts in core.c already use cpu_active_mask to prevent tasks from being put on CPUs going offline. With this fix I ran the tests for days and could not reproduce the hang. Without the patch, I hit it in a few hours. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
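Conceptually, the fix amounts to a filter like the following when RT considers candidate CPUs (a sketch only; the actual one-line change may sit elsewhere in the RT push/pull path):

    /* Never pick a CPU that is no longer active for scheduling: it is on
     * its way offline and tasks pushed there will just be migrated away. */
    if (!cpumask_test_cpu(cpu, cpu_active_mask))
            continue;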