summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)AuthorFilesLines
2022-08-23cpufreq: schedutil: Move max CPU capacity to sugov_policyLukasz Luba1-15/+15
There is no need to keep the max CPU capacity in the per_cpu instance. Furthermore, there is no need to check and update that variable (sg_cpu->max) every time in the frequency change request, which is part of hot path. Instead use struct sugov_policy to store that information. Initialize the max CPU capacity during the setup and start callback. We can do that since all CPUs in the same frequency domain have the same max capacity (capacity setup and thermal pressure are based on that). Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-08-23sched/fair: Don't init util/runnable_avg for !fair taskChengming Zhou1-14/+14
post_init_entity_util_avg() init task util_avg according to the cpu util_avg at the time of fork, which will decay when switched_to_fair() some time later, we'd better to not set them at all in the case of !fair task. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-10-zhouchengming@bytedance.com
2022-08-23sched/fair: Move task sched_avg attach to enqueue_task_fair()Chengming Zhou1-8/+3
When wake_up_new_task(), we use post_init_entity_util_avg() to init util_avg/runnable_avg based on cpu's util_avg at that time, and attach task sched_avg to cfs_rq. Since enqueue_task_fair() -> enqueue_entity() -> update_load_avg() loop will do attach, we can move this work to update_load_avg(). wake_up_new_task(p) post_init_entity_util_avg(p) attach_entity_cfs_rq() --> (1) activate_task(rq, p) enqueue_task() := enqueue_task_fair() enqueue_entity() loop update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH) if (!se->avg.last_update_time && (flags & DO_ATTACH)) attach_entity_load_avg() --> (2) This patch move attach from (1) to (2), update related comments too. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-9-zhouchengming@bytedance.com
2022-08-23sched/fair: Allow changing cgroup of new forked taskChengming Zhou2-20/+12
commit 7dc603c9028e ("sched/fair: Fix PELT integrity for new tasks") introduce a TASK_NEW state and an unnessary limitation that would fail when changing cgroup of new forked task. Because at that time, we can't handle task_change_group_fair() for new forked fair task which hasn't been woken up by wake_up_new_task(), which will cause detach on an unattached task sched_avg problem. This patch delete this unnessary limitation by adding check before do detach or attach in task_change_group_fair(). So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks, only define it in #ifdef CONFIG_RT_GROUP_SCHED. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com
2022-08-23sched/fair: Fix another detach on unattached task corner caseChengming Zhou1-0/+11
commit 7dc603c9028e ("sched/fair: Fix PELT integrity for new tasks") fixed two load tracking problems for new task, including detach on unattached new task problem. There still left another detach on unattached task problem for the task which has been woken up by try_to_wake_up() and waiting for actually being woken up by sched_ttwu_pending(). try_to_wake_up(p) cpu = select_task_rq(p) if (task_cpu(p) != cpu) set_task_cpu(p, cpu) migrate_task_rq_fair() remove_entity_load_avg() --> unattached se->avg.last_update_time = 0; __set_task_cpu() ttwu_queue(p, cpu) ttwu_queue_wakelist() __ttwu_queue_wakelist() task_change_group_fair() detach_task_cfs_rq() detach_entity_cfs_rq() detach_entity_load_avg() --> detach on unattached task set_task_rq() attach_task_cfs_rq() attach_entity_cfs_rq() attach_entity_load_avg() The reason of this problem is similar, we should check in detach_entity_cfs_rq() that se->avg.last_update_time != 0, before do detach_entity_load_avg(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com
2022-08-23sched/fair: Combine detach into dequeue when migrating taskChengming Zhou1-12/+16
When we are migrating task out of the CPU, we can combine detach and propagation into dequeue_entity() to save the detach_entity_cfs_rq() in migrate_task_rq_fair(). This optimization is like combining DO_ATTACH in the enqueue_entity() when migrating task to the CPU. So we don't have to traverse the CFS tree extra time to do the detach_entity_cfs_rq() -> propagate_entity_cfs_rq(), which wouldn't be called anymore with this patch's change. detach_task() deactivate_task() dequeue_task_fair() for_each_sched_entity(se) dequeue_entity() update_load_avg() /* (1) */ detach_entity_load_avg() set_task_cpu() migrate_task_rq_fair() detach_entity_cfs_rq() /* (2) */ update_load_avg(); detach_entity_load_avg(); propagate_entity_cfs_rq(); for_each_sched_entity() update_load_avg() This patch save the detach_entity_cfs_rq() called in (2) by doing the detach_entity_load_avg() for a CPU migrating task inside (1) (the task being the first se in the loop) Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com
2022-08-23sched/fair: Update comments in enqueue/dequeue_entity()Chengming Zhou1-2/+4
When reading the sched_avg related code, I found the comments in enqueue/dequeue_entity() are not updated with the current code. We don't add/subtract entity's runnable_avg from cfs_rq->runnable_avg during enqueue/dequeue_entity(), those are done only for attach/detach. This patch updates the comments to reflect the current code working. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-5-zhouchengming@bytedance.com
2022-08-23sched/fair: Reset sched_avg last_update_time before set_task_rq()Chengming Zhou1-1/+1
set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked task's sched_avg when migrate, which is not needed for already detached task. task_change_group_fair() will detached the task sched_avg from prev cfs_rq first, so reset sched_avg last_update_time before set_task_rq() to avoid that. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-4-zhouchengming@bytedance.com
2022-08-23sched/fair: Remove redundant cpu_cgrp_subsys->fork()Chengming Zhou3-49/+6
We use cpu_cgrp_subsys->fork() to set task group for the new fair task in cgroup_post_fork(). Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races") has already set_task_rq() for the new fair task in sched_cgroup_fork(), so cpu_cgrp_subsys->fork() can be removed. cgroup_can_fork() --> pin parent's sched_task_group sched_cgroup_fork() __set_task_cpu() set_task_rq() cgroup_post_fork() ss->fork() := cpu_cgroup_fork() sched_change_group(..., TASK_SET_GROUP) task_set_group_fair() set_task_rq() --> can be removed After this patch's change, task_change_group_fair() only need to care about task cgroup migration, make the code much simplier. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com
2022-08-23sched/fair: Maintain task se depth in set_task_rq()Chengming Zhou2-8/+1
Previously we only maintain task se depth in task_move_group_fair(), if a !fair task change task group, its se depth will not be updated, so commit eb7a59b2c888 ("sched/fair: Reset se-depth when task switched to FAIR") fix the problem by updating se depth in switched_to_fair() too. Then commit daa59407b558 ("sched/fair: Unify switched_{from,to}_fair() and task_move_group_fair()") unified these two functions, moved se.depth setting to attach_task_cfs_rq(), which further into attach_entity_cfs_rq() with commit df217913e72e ("sched/fair: Factorize attach/detach entity"). This patch move task se depth maintenance from attach_entity_cfs_rq() to set_task_rq(), which will be called when CPU/cgroup change, so its depth will always be correct. This patch is preparation for the next patch. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-2-zhouchengming@bytedance.com
2022-08-16sched/psi: Remove unused parameter nbytes of psi_trigger_create()Hao Jia1-2/+2
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-16sched/psi: Zero the memory of struct psi_groupHao Jia1-5/+1
After commit 5f69a6577bc3 ("psi: dont alloc memory for psi by default"), the memory used by struct psi_group is no longer allocated and zeroed in cgroup_create(). Since the memory of struct psi_group is not zeroed, the data in this memory is random, which will lead to inaccurate psi statistics when creating a new cgroup. So we use kzlloc() to allocate and zero the struct psi_group and remove the redundant zeroing in group_init(). Steps to reproduce: 1. Use cgroup v2 and enable CONFIG_PSI 2. Create a new cgroup, and query psi statistics mkdir /sys/fs/cgroup/test cat /sys/fs/cgroup/test/cpu.pressure some avg10=0.00 avg60=0.00 avg300=47927752200.00 total=12884901 full avg10=561815124.00 avg60=125835394188.00 avg300=1077090462000.00 total=10273561772 cat /sys/fs/cgroup/test/io.pressure some avg10=1040093132823.95 avg60=1203770351379.21 avg300=3862252669559.46 total=4294967296 full avg10=921884564601.39 avg60=0.00 avg300=1984507298.35 total=442381631 cat /sys/fs/cgroup/test/memory.pressure some avg10=232476085778.11 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=2585658472280.57 total=12884901 Fixes: commit 5f69a6577bc3 ("psi: dont alloc memory for psi by default") Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-12sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()Ingo Molnar7-25/+26
There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
2022-08-07Merge tag 'sched-urgent-2022-08-06' of ↵Linus Torvalds2-8/+15
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix" * tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Do not requeue task on CPU excluded from cpus_mask sched/rt: Fix Sparse warnings due to undefined rt.c declarations exit: Fix typo in comment: s/sub-theads/sub-threads sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
2022-08-04sched/core: Do not requeue task on CPU excluded from cpus_maskMel Gorman1-2/+6
The following warning was triggered on a large machine early in boot on a distribution kernel but the same problem should also affect mainline. WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440 Call Trace: <TASK> rescuer_thread+0x1f6/0x360 kthread+0x156/0x180 ret_from_fork+0x22/0x30 </TASK> Commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") optimises ttwu by queueing a task that is descheduling on the wakelist, but does not check if the task descheduling is still allowed to run on that CPU. In this warning, the problematic task is a workqueue rescue thread which checks if the rescue is for a per-cpu workqueue and running on the wrong CPU. While this is early in boot and it should be possible to create workers, the rescue thread may still used if the MAYDAY_INITIAL_TIMEOUT is reached or MAYDAY_INTERVAL and on a sufficiently large machine, the rescue thread is being used frequently. Tracing confirmed that the task should have migrated properly using the stopper thread to handle the migration. However, a parallel wakeup from udev running on another CPU that does not share CPU cache observes p->on_cpu and uses task_cpu(p), queues the task on the old CPU and triggers the warning. Check that the wakee task that is descheduling is still allowed to run on its current CPU and if not, wait for the descheduling to complete and select an allowed CPU. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net
2022-08-04sched/core: Remove superfluous semicolonXin Gao1-1/+1
Signed-off-by: Xin Gao <gaoxin@cdjrlc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220719111044.7095-1-gaoxin@cdjrlc.com
2022-08-03sched/fair: Make per-cpu cpumasks staticBing Huang2-13/+9
The load_balance_mask and select_rq_mask percpu variables are only used in kernel/sched/fair.c. Make them static and move their allocation into init_sched_fair_class(). Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL class (local_cpu_mask_dl in init_sched_dl_class()). [ mingo: Tidied up changelog & touched up the code. ] Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com
2022-08-03sched/fair: Remove unused parameter idle of _nohz_idle_balance()Hao Jia1-4/+3
After commit 7a82e5f52a35 ("sched/fair: Merge for each idle cpu loop of ILB"), _nohz_idle_balance()'s 'idle' parameter is not used anymore, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220803130223.70419-1-jiahao.os@bytedance.com
2022-08-03Merge tag 'cgroup-for-5.20' of ↵Linus Torvalds1-6/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several core optimizations: - threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). - threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. - psi no longer allocates memory when disabled. ... along with some code cleanups" * tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Skip subtree root in cgroup_update_dfl_csses() cgroup: remove "no" prefixed mount options cgroup: Make !percpu threadgroup_rwsem operations optional cgroup: Add "no" prefixed mount options cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes psi: dont alloc memory for psi by default
2022-08-03sched/rt: Fix Sparse warnings due to undefined rt.c declarationsBen Dooks1-3/+4
There are several symbols defined in kernel/sched/sched.h but get wrapped in CONFIG_CGROUP_SCHED, even though dummy versions get built in rt.c and therefore trigger Sparse warnings: kernel/sched/rt.c:309:6: warning: symbol 'unregister_rt_sched_group' was not declared. Should it be static? kernel/sched/rt.c:311:6: warning: symbol 'free_rt_sched_group' was not declared. Should it be static? kernel/sched/rt.c:313:5: warning: symbol 'alloc_rt_sched_group' was not declared. Should it be static? Fix this by moving them outside the CONFIG_CGROUP_SCHED block. [ mingo: Refreshed to the latest scheduler tree, tweaked changelog. ] Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220721145155.358366-1-ben-linux@fluff.org
2022-08-03sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowedWaiman Long1-3/+5
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating that the cpuset will just use the effective CPUs of its parent. So cpuset_can_attach() can call task_can_attach() with an empty mask. This can lead to cpumask_any_and() returns nr_cpu_ids causing the call to dl_bw_of() to crash due to percpu value access of an out of bound CPU value. For example: [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0 : [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0 : [80468.207946] Call Trace: [80468.208947] cpuset_can_attach+0xa0/0x140 [80468.209953] cgroup_migrate_execute+0x8c/0x490 [80468.210931] cgroup_update_dfl_csses+0x254/0x270 [80468.211898] cgroup_subtree_control_write+0x322/0x400 [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0 [80468.213777] new_sync_write+0x11f/0x1b0 [80468.214689] vfs_write+0x1eb/0x280 [80468.215592] ksys_write+0x5f/0xe0 [80468.216463] do_syscall_64+0x5c/0x80 [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix that by using effective_cpus instead. For cgroup v1, effective_cpus is the same as cpus_allowed. For v2, effective_cpus is the real cpumask to be used by tasks within the cpuset anyway. Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to reflect the change. In addition, a check is added to task_can_attach() to guard against the possibility that cpumask_any_and() may return a value >= nr_cpu_ids. Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
2022-08-03Merge tag 'rcu.2022.07.26a' of ↵Linus Torvalds3-6/+39
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates - Miscellaneous fixes - Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms - Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods - Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead - Torture-test updates - Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y * tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits) rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings rcu: Diagnose extended sync_rcu_do_polled_gp() loops rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings rcutorture: Test polled expedited grace-period primitives rcu: Add polled expedited grace-period primitives rcutorture: Verify that polled GP API sees synchronous grace periods rcu: Make Tiny RCU grace periods visible to polled APIs rcu: Make polled grace-period API account for expedited grace periods rcu: Switch polled grace-period APIs to ->gp_seq_polled rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty rcu/nocb: Add option to opt rcuo kthreads out of RT priority rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread() rcu/nocb: Add an option to offload all CPUs on boot rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order rcu/nocb: Add/del rdp to iterate from rcuog itself rcu/tree: Add comment to describe GP-done condition in fqs loop rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs() rcu/kvfree: Remove useless monitor_todo flag rcu: Cleanup RCU urgency state for offline CPU ...
2022-08-02Merge tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull io_uring updates from Jens Axboe: - As per (valid) complaint in the last merge window, fs/io_uring.c has grown quite large these days. io_uring isn't really tied to fs either, as it supports a wide variety of functionality outside of that. Move the code to io_uring/ and split it into files that either implement a specific request type, and split some code into helpers as well. The code is organized a lot better like this, and io_uring.c is now < 4K LOC (me). - Deprecate the epoll_ctl opcode. It'll still work, just trigger a warning once if used. If we don't get any complaints on this, and I don't expect any, then we can fully remove it in a future release (me). - Improve the cancel hash locking (Hao) - kbuf cleanups (Hao) - Efficiency improvements to the task_work handling (Dylan, Pavel) - Provided buffer improvements (Dylan) - Add support for recv/recvmsg multishot support. This is similar to the accept (or poll) support for have for multishot, where a single SQE can trigger everytime data is received. For applications that expect to do more than a few receives on an instantiated socket, this greatly improves efficiency (Dylan). - Efficiency improvements for poll handling (Pavel) - Poll cancelation improvements (Pavel) - Allow specifiying a range for direct descriptor allocations (Pavel) - Cleanup the cqe32 handling (Pavel) - Move io_uring types to greatly cleanup the tracing (Pavel) - Tons of great code cleanups and improvements (Pavel) - Add a way to do sync cancelations rather than through the sqe -> cqe interface, as that's a lot easier to use for some use cases (me). - Add support to IORING_OP_MSG_RING for sending direct descriptors to a different ring. This avoids the usually problematic SCM case, as we disallow those. (me) - Make the per-command alloc cache we use for apoll generic, place limits on it, and use it for netmsg as well (me). - Various cleanups (me, Michal, Gustavo, Uros) * tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits) io_uring: ensure REQ_F_ISREG is set async offload net: fix compat pointer in get_compat_msghdr() io_uring: Don't require reinitable percpu_ref io_uring: fix types in io_recvmsg_multishot_overflow io_uring: Use atomic_long_try_cmpxchg in __io_account_mem io_uring: support multishot in recvmsg net: copy from user before calling __get_compat_msghdr net: copy from user before calling __copy_msghdr io_uring: support 0 length iov in buffer select in compat io_uring: fix multishot ending when not polled io_uring: add netmsg cache io_uring: impose max limit on apoll cache io_uring: add abstraction around apoll cache io_uring: move apoll cache to poll.c io_uring: consolidate hash_locked io-wq handling io_uring: clear REQ_F_HASH_LOCKED on hash removal io_uring: don't race double poll setting REQ_F_ASYNC_DATA io_uring: don't miss setting REQ_F_DOUBLE_POLL io_uring: disable multishot recvmsg io_uring: only trace one of complete or overflow ...
2022-08-02sched/debug: Print each field value left-aligned in sched_show_task()Zhen Lei1-1/+1
Currently, the values of some fields are printed right-aligned, causing the field value to be next to the next field name rather than next to its own field name. So print each field value left-aligned, to make it more readable. Before: stack: 0 pid: 307 ppid: 2 flags:0x00000008 After: stack:0 pid:308 ppid:2 flags:0x0000000a This also makes them print in the same style as the other two fields: task:demo0 state:R running task Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com
2022-08-02sched/deadline: Use sched_dl_entity's dl_density in dl_task_fits_capacity()Dietmar Eggemann1-15/+15
Save a multiplication in dl_task_fits_capacity() by using already maintained per-sched_dl_entity (i.e. per-task) `dl_runtime/dl_deadline` (dl_density). cap_scale(dl_deadline, cap) >= dl_runtime dl_deadline * cap >> SCHED_CAPACITY_SHIFT >= dl_runtime cap >= dl_runtime << SCHED_CAPACITY_SHIFT / dl_deadline cap >= (dl_runtime << BW_SHIFT / dl_deadline) >> BW_SHIFT - SCHED_CAPACITY_SHIFT cap >= dl_density >> BW_SHIFT - SCHED_CAPACITY_SHIFT __sched_setscheduler()->__checkparam_dl() ensures that the 2 corner cases (if conditions) `runtime == RUNTIME_INF (-1)` and `period == 0` of to_ratio(deadline, runtime) are not met when setting dl_density in __sched_setscheduler()-> __setscheduler_params()->__setparam_dl(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-4-dietmar.eggemann@arm.com
2022-08-02sched/deadline: Make dl_cpuset_cpumask_can_shrink() capacity-awareDietmar Eggemann1-13/+11
dl_cpuset_cpumask_can_shrink() is used to validate whether there is still enough CPU capacity for DL tasks in the reduced cpuset. Currently it still operates on `# remaining CPUs in the cpuset` (1). Change this to use the already capacity-aware DL admission control __dl_overflow() for the `cpumask can shrink` test. dl_b->bw = sched_rt_period << BW_SHIFT / sched_rt_period dl_b->bw * (1) >= currently allocated bandwidth in root_domain (rd) Replace (1) w/ `\Sum CPU capacity in rd >> SCHED_CAPACITY_SHIFT` Adapt __dl_bw_capacity() to take a cpumask instead of a CPU number argument so that `rd->span` and `cpumask of the reduced cpuset` can be used here. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-3-dietmar.eggemann@arm.com
2022-08-02sched/core: Introduce sched_asym_cpucap_active()Dietmar Eggemann5-9/+14
Create an inline helper for conditional code to be only executed on asymmetric CPU capacity systems. This makes these (currently ~10 and future) conditions a lot more readable. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com
2022-08-01Merge tag 'sched-core-2022-08-01' of ↵Linus Torvalds11-427/+791
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Load-balancing improvements: - Improve NUMA balancing on AMD Zen systems for affine workloads. - Improve the handling of reduced-capacity CPUs in load-balancing. - Energy Model improvements: fix & refine all the energy fairness metrics (PELT), and remove the conservative threshold requiring 6% energy savings to migrate a task. Doing this improves power efficiency for most workloads, and also increases the reliability of energy-efficiency scheduling. - Optimize/tweak select_idle_cpu() to spend (much) less time searching for an idle CPU on overloaded systems. There's reports of several milliseconds spent there on large systems with large workloads ... [ Since the search logic changed, there might be behavioral side effects. ] - Improve NUMA imbalance behavior. On certain systems with spare capacity, initial placement of tasks is non-deterministic, and such an artificial placement imbalance can persist for a long time, hurting (and sometimes helping) performance. The fix is to make fork-time task placement consistent with runtime NUMA balancing placement. Note that some performance regressions were reported against this, caused by workloads that are not memory bandwith limited, which benefit from the artificial locality of the placement bug(s). Mel Gorman's conclusion, with which we concur, was that consistency is better than random workload benefits from non-deterministic bugs: "Given there is no crystal ball and it's a tradeoff, I think it's better to be consistent and use similar logic at both fork time and runtime even if it doesn't have universal benefit." - Improve core scheduling by fixing a bug in sched_core_update_cookie() that caused unnecessary forced idling. - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly woken tasks. - Fix a newidle balancing bug that introduced unnecessary wakeup latencies. ABI improvements/fixes: - Do not check capabilities and do not issue capability check denial messages when a scheduler syscall doesn't require privileges. (Such as increasing niceness.) - Add forced-idle accounting to cgroups too. - Fix/improve the RSEQ ABI to not just silently accept unknown flags. (No existing tooling is known to have learned to rely on the previous behavior.) - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags. Optimizations: - Optimize & simplify leaf_cfs_rq_list() - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg(). Misc fixes & cleanups: - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems. - Fix a full-NOHZ bug that can in some cases result in the tick not being re-enabled when the last SCHED_RT task is gone from a runqueue but there's still SCHED_OTHER tasks around. - Various PREEMPT_RT related fixes. - Misc cleanups & smaller fixes" * tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) rseq: Kill process when unknown flags are encountered in ABI structures rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags sched/core: Fix the bug that task won't enqueue into core tree when update cookie nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt() sched/core: Always flush pending blk_plug sched/fair: fix case with reduced capacity CPU sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling sched/core: add forced idle accounting for cgroups sched/fair: Remove the energy margin in feec() sched/fair: Remove task_util from effective utilization in feec() sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() sched/fair: Rename select_idle_mask to select_rq_mask sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() sched/fair: Decay task PELT values during wakeup migration sched/fair: Provide u64 read for 32-bits arch helper sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg sched: only perform capability check on privileged operation sched: Remove unused function group_first_cpu() sched/fair: Remove redundant word " *" selftests/rseq: check if libc rseq support is registered ...
2022-07-25io_uring: move to separate directoryJens Axboe1-1/+1
In preparation for splitting io_uring up a bit, move it into its own top level directory. It didn't really belong in fs/ anyway, as it's not a file system only API. This adds io_uring/ and moves the core files in there, and updates the MAINTAINERS file for the new location. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-22Merge branch 'ctxt.2022.07.05a' into HEADPaul E. McKenney3-6/+7
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-21sched/core: Fix the bug that task won't enqueue into core tree when update ↵Cruz Zhao1-4/+5
cookie In function sched_core_update_cookie(), a task will enqueue into the core tree only when it enqueued before, that is, if an uncookied task is cookied, it will not enqueue into the core tree until it enqueue again, which will result in unnecessary force idle. Here follows the scenario: CPU x and CPU y are a pair of SMT siblings. 1. Start task a running on CPU x without sleeping, and task b and task c running on CPU y without sleeping. 2. We create a cookie and share it to task a and task b, and then we create another cookie and share it to task c. 3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched And we will find out that core_forceidle_sum of task a takes 30% time of the sampling period, which shouldn't happen as task a and b have the same cookie. Then we migrate task a to CPU x', migrate task b and c to CPU y', where CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we will found out that core_forceidle_sum of task a and b are almost zero. To solve this problem, we enqueue the task into the core tree if it's on rq. Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com
2022-07-21nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()Nicolas Saenz Julienne1-6/+9
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having called sched_update_tick_dependency() preventing it from re-enabling the tick on systems that no longer have pending SCHED_RT tasks but have multiple runnable SCHED_OTHER tasks: dequeue_task_rt() dequeue_rt_entity() dequeue_rt_stack() dequeue_top_rt_rq() sub_nr_running() // decrements rq->nr_running sched_update_tick_dependency() sched_can_stop_tick() // checks rq->rt.rt_nr_running, ... __dequeue_rt_entity() dec_rt_tasks() // decrements rq->rt.rt_nr_running ... Every other scheduler class performs the operation in the opposite order, and sched_update_tick_dependency() expects the values to be updated as such. So avoid the misbehaviour by inverting the order in which the above operations are performed in the RT scheduler. Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
2022-07-21sched/deadline: Fix BUG_ON condition for deboosted tasksJuri Lelli1-1/+4
Tasks the are being deboosted from SCHED_DEADLINE might enter enqueue_task_dl() one last time and hit an erroneous BUG_ON condition: since they are not boosted anymore, the if (is_dl_boosted()) branch is not taken, but the else if (!dl_prio) is and inside this one we BUG_ON(!is_dl_boosted), which is of course false (BUG_ON triggered) otherwise we had entered the if branch above. Long story short, the current condition doesn't make sense and always leads to triggering of a BUG. Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has to be present, but additional flags are not a problem. Fixes: 64be6f1f5f71 ("sched/deadline: Don't replenish from a !SCHED_DEADLINE entity") Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220714151908.533052-1-juri.lelli@redhat.com
2022-07-13sched/core: Always flush pending blk_plugJohn Keeping1-2/+6
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero): INFO: task kworker/u8:0:8 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 Workqueue: writeback wb_workfn (flush-7:0) [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174) [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] +(rt_mutex_slowlock.constprop.0+0xac/0x174) [<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54) [<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec) [<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c) [<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8) [<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4) [<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4) [<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c) [<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8) [<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178) [<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38) Exception stack(0xc10e3fb0 to 0xc10e3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 INFO: task tar:2083 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000 [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24) [<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30) [<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8) [<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0) [<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144) [<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4) [<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250) [<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148) [<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98) [<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec) [<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c) Exception stack(0xc2e1bfa8 to 0xc2e1bff0) bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000 bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190 bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc Here the kworker is waiting on msdos_sb_info::s_lock which is held by tar which is in turn waiting for a buffer which is locked waiting to be flushed, but this operation is plugged in the kworker. The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT. It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted. Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument"). As described in [1]: The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue. [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
2022-07-13sched/fair: fix case with reduced capacity CPUVincent Guittot1-12/+42
The capacity of the CPU available for CFS tasks can be reduced because of other activities running on the latter. In such case, it's worth trying to move CFS tasks on a CPU with more available capacity. The rework of the load balance has filtered the case when the CPU is classified to be fully busy but its capacity is reduced. Check if CPU's capacity is reduced while gathering load balance statistic and classify it group_misfit_task instead of group_fully_busy so we can try to move the load on another CPU. Reported-by: David Chen <david.chen@nutanix.com> Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Chen <david.chen@nutanix.com> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org
2022-07-05context_tracking: Take idle eqs entrypoints over RCUFrederic Weisbecker2-5/+6
The RCU dynticks counter is going to be merged into the context tracking subsystem. Start with moving the idle extended quiescent states entrypoints to context tracking. For now those are dumb redirections to existing RCU calls. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-04sched/core: Use try_cmpxchg in set_nr_{and_not,if}_pollingUros Bizjak1-15/+9
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg. The definition of cmpxchg based fetch_or was changed in the same way as atomic_fetch_##op definitions were changed in e6790e4b5d5e97dc287f3496dd2cf2dbabdfdb35. Also declare these two functions as inline to ensure inlining. In the case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart itself by constructing the boolean return value with logic operations on the fetched value, and these extra operations enlarge the function over the inlining threshold value. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
2022-07-04sched/core: add forced idle accounting for cgroupsJosh Don2-1/+20
4feee7d1260 previously added per-task forced idle accounting. This patch extends this to also include cgroups. rstat is used for cgroup accounting, except for the root, which uses kcpustat in order to bypass the need for doing an rstat flush when reading root stats. Only cgroup v2 is supported. Similar to the task accounting, the cgroup accounting requires that schedstats is enabled. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
2022-06-30context_tracking: Split user tracking KconfigFrederic Weisbecker1-1/+1
Context tracking is going to be used not only to track user transitions but also idle/IRQs/NMIs. The user tracking part will then become a separate feature. Prepare Kconfig for that. [ frederic: Apply Max Filippov feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-28sched/fair: Remove the energy margin in feec()Vincent Donnefort1-15/+8
find_energy_efficient_cpu() integrates a margin to protect tasks from bouncing back and forth from a CPU to another. This margin is set as being 6% of the total current energy estimated on the system. This however does not work for two reasons: 1. The energy estimation is not a good absolute value: compute_energy() used in feec() is a good estimation for task placement as it allows to compare the energy with and without a task. The computed delta will give a good overview of the cost for a certain task placement. It, however, doesn't work as an absolute estimation for the total energy of the system. First it adds the contribution to idle CPUs into the energy, second it mixes util_avg with util_est values. util_avg contains the near history for a CPU usage, it doesn't tell at all what the current utilization is. A system that has been quite busy in the near past will hold a very high energy and then a high margin preventing any task migration to a lower capacity CPU, wasting energy. It even creates a negative feedback loop: by holding the tasks on a less efficient CPU, the margin contributes in keeping the energy high. 2. The margin handicaps small tasks: On a system where the workload is composed mostly of small tasks (which is often the case on Android), the overall energy will be high enough to create a margin none of those tasks can cross. On a Pixel4, a small utilization of 5% on all the CPUs creates a global estimated energy of 140 joules, as per the Energy Model declaration of that same device. This means, after applying the 6% margin that any migration must save more than 8 joules to happen. No task with a utilization lower than 40 would then be able to migrate away from the biggest CPU of the system. The 6% of the overall system energy was brought by the following patch: (eb92692b2544 sched/fair: Speed-up energy-aware wake-ups) It was previously 6% of the prev_cpu energy. Also, the following one made this margin value conditional on the clusters where the task fits: (8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary) We could simply revert that margin change to what it was, but the original version didn't have strong grounds neither and as demonstrated in (1.) the estimated energy isn't a good absolute value. Instead, removing it completely. It is indeed, made possible by recent changes that improved energy estimation comparison fairness (sched/fair: Remove task_util from effective utilization in feec()) (PM: EM: Increase energy calculation precision) and task utilization stabilization (sched/fair: Decay task util_avg during migration) Without a margin, we could have feared bouncing between CPUs. But running LISA's eas_behaviour test coverage on three different platforms (Hikey960, RB-5 and DB-845) showed no issue. Removing the energy margin enables more energy-optimized placements for a more energy efficient system. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
2022-06-28sched/fair: Remove task_util from effective utilization in feec()Vincent Donnefort1-63/+139
The energy estimation in find_energy_efficient_cpu() (feec()) relies on the computation of the effective utilization for each CPU of a perf domain (PD). This effective utilization is then used as an estimation of the busy time for this pd. The function effective_cpu_util() which gives this value, scales the utilization relative to IRQ pressure on the CPU to take into account that the IRQ time is hidden from the task clock. The IRQ scaling is as follow: effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of the CPU and irq the IRQ avg time. If now we take as an example a task placement which doesn't raise the OPP on the candidate CPU, we can write the energy delta as: delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) - effective_cpu_util(cpu_util)) = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util We end-up with an energy delta depending on the IRQ avg time, which is a problem: first the time spent on IRQs by a CPU has no effect on the additional energy that would be consumed by a task. Second, we don't want to favour a CPU with a higher IRQ avg time value. Nonetheless, we need to take the IRQ avg time into account. If a task placement raises the PD's frequency, it will increase the energy cost for the entire time where the CPU is busy. A solution is to only use effective_cpu_util() with the CPU contribution part. The task contribution is added separately and scaled according to prev_cpu's IRQ time. No change for the FREQUENCY_UTIL component of the energy estimation. We still want to get the actual frequency that would be selected after the task placement. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
2022-06-28sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()Dietmar Eggemann1-9/+13
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays invariant after Energy Model creation, i.e. it is not updated after CPU hotplug operations. That's why the PD mask is used in conjunction with the cpu_online_mask (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched multiple times (in compute_energy()) during a run-queue selection for a task. cpu_online_mask may change during this time which can lead to wrong energy calculations. To be able to avoid this, use the select_rq_mask per-cpu cpumask to create a cpumask out of PD cpumask and cpu_online_mask and pass it through the function calls of the EAS run-queue selection path. The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection (find_energy_efficient_cpu()) is now ANDed not only with the SD mask but also with the cpu_online_mask. This is fine since this cpumask has to be in syc with the one used for energy computation (compute_energy()). An exclusive cpuset setup with at least one asymmetric CPU capacity island (hence the additional AND with the SD cpumask) is the obvious exception here. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
2022-06-28sched/fair: Rename select_idle_mask to select_rq_maskDietmar Eggemann2-7/+7
On 21/06/2022 11:04, Vincent Donnefort wrote: > From: Dietmar Eggemann <dietmar.eggemann@arm.com> https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered that this patch doesn't build anymore (on tip sched/core or linux-next) because of commit f5b2eeb499910 ("sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()"). New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to select_rq_mask below. -- >8 -- Decouple the name of the per-cpu cpumask select_idle_mask from its usage in select_idle_[cpu/capacity]() of the CFS run-queue selection (select_task_rq_fair()). This is to support the reuse of this cpumask in the Energy Aware Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue selection. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
2022-06-28sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()Dietmar Eggemann4-20/+19
effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
2022-06-28sched/fair: Decay task PELT values during wakeup migrationVincent Donnefort3-34/+172
Before being migrated to a new CPU, a task sees its PELT values synchronized with rq last_update_time. Once done, that same task will also have its sched_avg last_update_time reset. This means the time between the migration and the last clock update will not be accounted for in util_avg and a discontinuity will appear. This issue is amplified by the PELT clock scaling. It takes currently one tick after the CPU being idle to let clock_pelt catching up clock_task. This is especially problematic for asymmetric CPU capacity systems which need stable util_avg signals for task placement and energy estimation. Ideally, this problem would be solved by updating the runqueue clocks before the migration. But that would require taking the runqueue lock which is quite expensive [1]. Instead estimate the missing time and update the task util_avg with that value. To that end, we need sched_clock_cpu() but it is a costly function. Limit the usage to the case where the source CPU is idle as we know this is when the clock is having the biggest risk of being outdated. See comment in migrate_se_pelt_lag() for more details about how the PELT value is estimated. Notice though this estimation doesn't take into account IRQ and Paravirt time. [1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com
2022-06-28sched/fair: Provide u64 read for 32-bits arch helperVincent Donnefort2-71/+54
Introducing macro helpers u64_u32_{store,load}() to factorize lockless accesses to u64 variables for 32-bits architectures. Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To accommodate the later where the copy lies outside of the structure (cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy), use the _copy() version of those helpers. Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and therefore, have a small penalty for 32-bits machines in set_task_rq_fair() and init_cfs_rq(). Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
2022-06-28sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avgChen Yu2-1/+89
[Problem Statement] select_idle_cpu() might spend too much time searching for an idle CPU, when the system is overloaded. The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain: @usecs: [0] 533 | | [1] 5495 | | [2, 4) 12008 | | [4, 8) 239252 | | [8, 16) 4041924 |@@@@@@@@@@@@@@ | [16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [256, 512) 4507667 |@@@@@@@@@@@@@@@ | [512, 1K) 2600472 |@@@@@@@@@ | [1K, 2K) 927912 |@@@ | [2K, 4K) 218720 | | [4K, 8K) 98161 | | [8K, 16K) 37722 | | [16K, 32K) 6715 | | [32K, 64K) 477 | | [64K, 128K) 7 | | netperf latency usecs: ======= case load Lat_99th std% TCP_RR thread-224 257.39 ( 0.21) The time spent in select_idle_cpu() is visible to netperf and might have a negative impact. [Symptom analysis] The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. Copy the indicators here: SIS Search Efficiency(se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found. SIS Domain Search Efficiency(dom_eff%): Similar, except only for the slower SIS patch. SIS Fast Success Rate(fast_rate%): Percentage of SIS that used target, prev or recent CPUs. SIS Success rate(success_rate%): Percentage of scans that found an idle CPU. The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench. Test on vanilla kernel: schedstat_parse.py -f netperf_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% TCP_RR 28 threads 99.978 18.535 99.995 100.000 TCP_RR 56 threads 99.397 5.671 99.964 100.000 TCP_RR 84 threads 21.721 6.818 73.632 100.000 TCP_RR 112 threads 12.500 5.533 59.000 100.000 TCP_RR 140 threads 8.524 4.535 49.020 100.000 TCP_RR 168 threads 6.438 3.945 40.309 99.999 TCP_RR 196 threads 5.397 3.718 32.320 99.982 TCP_RR 224 threads 4.874 3.661 25.775 99.767 UDP_RR 28 threads 99.988 17.704 99.997 100.000 UDP_RR 56 threads 99.528 5.977 99.970 100.000 UDP_RR 84 threads 24.219 6.992 76.479 100.000 UDP_RR 112 threads 13.907 5.706 62.538 100.000 UDP_RR 140 threads 9.408 4.699 52.519 100.000 UDP_RR 168 threads 7.095 4.077 44.352 100.000 UDP_RR 196 threads 5.757 3.775 35.764 99.991 UDP_RR 224 threads 5.124 3.704 28.748 99.860 schedstat_parse.py -f schbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% normal 1 mthread 99.152 6.400 99.941 100.000 normal 2 mthreads 97.844 4.003 99.908 100.000 normal 3 mthreads 96.395 2.118 99.917 99.998 normal 4 mthreads 55.288 1.451 98.615 99.804 normal 5 mthreads 7.004 1.870 45.597 61.036 normal 6 mthreads 3.354 1.346 20.777 34.230 normal 7 mthreads 2.183 1.028 11.257 21.055 normal 8 mthreads 1.653 0.825 7.849 15.549 schedstat_parse.py -f hackbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% process-pipe 1 group 99.991 7.692 99.999 100.000 process-pipe 2 groups 99.934 4.615 99.997 100.000 process-pipe 3 groups 99.597 3.198 99.987 100.000 process-pipe 4 groups 98.378 2.464 99.958 100.000 process-pipe 5 groups 27.474 3.653 89.811 99.800 process-pipe 6 groups 20.201 4.098 82.763 99.570 process-pipe 7 groups 16.423 4.156 77.398 99.316 process-pipe 8 groups 13.165 3.920 72.232 98.828 process-sockets 1 group 99.977 5.882 99.999 100.000 process-sockets 2 groups 99.927 5.505 99.996 100.000 process-sockets 3 groups 99.397 3.250 99.980 100.000 process-sockets 4 groups 79.680 4.258 98.864 99.998 process-sockets 5 groups 7.673 2.503 63.659 92.115 process-sockets 6 groups 4.642 1.584 58.946 88.048 process-sockets 7 groups 3.493 1.379 49.816 81.164 process-sockets 8 groups 3.015 1.407 40.845 75.500 threads-pipe 1 group 99.997 0.000 100.000 100.000 threads-pipe 2 groups 99.894 2.932 99.997 100.000 threads-pipe 3 groups 99.611 4.117 99.983 100.000 threads-pipe 4 groups 97.703 2.624 99.937 100.000 threads-pipe 5 groups 22.919 3.623 87.150 99.764 threads-pipe 6 groups 18.016 4.038 80.491 99.557 threads-pipe 7 groups 14.663 3.991 75.239 99.247 threads-pipe 8 groups 12.242 3.808 70.651 98.644 threads-sockets 1 group 99.990 6.667 99.999 100.000 threads-sockets 2 groups 99.940 5.114 99.997 100.000 threads-sockets 3 groups 99.469 4.115 99.977 100.000 threads-sockets 4 groups 87.528 4.038 99.400 100.000 threads-sockets 5 groups 6.942 2.398 59.244 88.337 threads-sockets 6 groups 4.359 1.954 49.448 87.860 threads-sockets 7 groups 2.845 1.345 41.198 77.102 threads-sockets 8 groups 2.871 1.404 38.512 74.312 schedstat_parse.py -f tbench_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% loopback 28 threads 99.976 18.369 99.995 100.000 loopback 56 threads 99.222 7.799 99.934 100.000 loopback 84 threads 19.723 6.819 70.215 100.000 loopback 112 threads 11.283 5.371 55.371 99.999 loopback 140 threads 0.000 0.000 0.000 0.000 loopback 168 threads 0.000 0.000 0.000 0.000 loopback 196 threads 0.000 0.000 0.000 0.000 loopback 224 threads 0.000 0.000 0.000 0.000 According to the test above, if the system becomes busy, the SIS Search Efficiency(se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU(success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain. [Proposal] It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time. In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded. Introduce the quadratic function: y = SCHED_CAPACITY_SCALE - p * x^2 and y'= y / SCHED_CAPACITY_SCALE x is the ratio of sum_util compared to the CPU capacity: x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE) y' is the ratio of CPUs to be scanned in the LLC domain, and the number of CPUs to scan is calculated by: nr_scan = llc_weight * y' Choosing quadratic function is because: [1] Compared to the linear function, it scans more aggressively when the sum_util is low. [2] Compared to the exponential function, it is easier to calculate. [3] It seems that there is no accurate mapping between the sum of util_avg and the number of CPUs to be scanned. Use heuristic scan for now. For a platform with 112 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 112 111 108 102 93 81 65 47 25 1 0 ... For a platform with 16 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 16 15 15 14 13 11 9 6 3 0 0 ... Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off. Checking the util_avg in newidle load balance would be more frequent, but it brings overhead - multiple CPUs write/read the per-LLC shared variable and introduces cache contention. Tim also mentioned that, it is allowed to be non-optimal in terms of scheduling for the short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for that. When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default. This patch is based on the util_avg, which is very sensitive to the CPU frequency invariance. There is an issue that, when the max frequency has been clamp, the util_avg would decay insanely fast when the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting the arch_max_freq_ratio when turbo is disabled. But this issue is still not thoroughly fixed, because the current code is unaware of the user-specified max CPU frequency. [Test result] netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200% of CPU number respectively. Hackbench and schbench were launched by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times. The following is the benchmark result comparison between baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare% indicates better performance. Each netperf test is a: netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100 netperf.throughput ======= case load baseline(std%) compare%( std%) TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40) TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20) TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40) TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22) TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19) TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18) TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43) TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85) UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33) UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21) UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32) UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61) UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04) UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16) UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93) UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62) Take the 224 threads as an example, the SIS search metrics changes are illustrated below: vanilla patched 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested .try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock. Each hackbench test is a: hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100 hackbench.throughput ========= case load baseline(std%) compare%( std%) process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47) process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81) process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02) process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02) process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13) process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14) process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26) process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03) threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14) threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74) threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13) threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06) threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53) threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26) threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24) threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11) Each tbench test is a: tbench -t 100 $job 127.0.0.1 tbench.throughput ====== case load baseline(std%) compare%( std%) loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09) loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17) loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13) loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03) loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19) loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27) loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17) loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22) Each schbench test is a: schbench -m $job -t 28 -r 100 -s 30000 -c 30000 schbench.latency_90%_us ======== case load baseline(std%) compare%( std%) normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)* normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79) normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64) normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28) *Consider the Standard Deviation, this -7.36% regression might not be valid. Also, a OLTP workload with a commercial RDBMS has been tested, and there is no significant change. There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle: CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 util_avg 1024 1024 1024 1024 1024 1024 1024 0 Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%, select_idle_cpu() will not scan, thus CPU7 is undetected during scan. But according to Mel, it is unlikely the CPU7 will be idle all the time because CPU7 could pull some tasks via CPU_NEWLY_IDLE. lkp(kernel test robot) has reported a regression on stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this might introduce more context switch, especially involuntary preemption, which impacts a busy stress-ng. This regression has shown that, not all benchmarks in every scenario benefit from idle CPU scan limit, and it needs further investigation. Besides, there is slight regression in hackbench's 16 groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is negligible. The current patch aims to propose a generic solution and only considers the util_avg. Something like the below could be applied on top of the current patch to fulfill the requirement: if (llc_weight <= 16) nr_scan = nr_scan * 32 / llc_weight; For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large. The smaller the CPU number this LLC domain has, the larger nr_scan will be expanded. This needs further investigation. There is also ongoing work[2] from Abel to filter out the busy CPUs during wakeup, to further speed up the idle CPU scan. And it could be a following-up optimization on top of this change. Suggested-by: Tim Chen <tim.c.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
2022-06-28sched: only perform capability check on privileged operationChristian Göttsche1-55/+83
sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler() a CAP_SYS_NICE audit event unconditionally, even when the requested operation does not require that capability / is unprivileged, i.e. for reducing niceness. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to allow the task the capability CAP_SYS_NICE, while it does not need it, violating the principle of least privilege. Conduct privilged/unprivileged categorization first and perform a capable test (and at most once) only if needed. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
2022-06-28sched: Remove unused function group_first_cpu()Zhang Qiao1-9/+0
As of commit afe06efdf07c ("sched: Extend scheduler's asym packing") group_first_cpu() became an unused function, remove it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220617181151.29980-3-zhangqiao22@huawei.com
2022-06-28sched/fair: Remove redundant word " *"Zhang Qiao1-1/+1
" *" is redundant. so remove it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220617181151.29980-2-zhangqiao22@huawei.com