summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)AuthorFilesLines
2022-04-27sched/pelt: Fix attach_entity_load_avg() corner casekuyo chang1-5/+5
[ Upstream commit 40f5aa4c5eaebfeaca4566217cb9c468e28ed682 ] The warning in cfs_rq_is_decayed() triggered: SCHED_WARN_ON(cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || cfs_rq->avg.runnable_avg) There exists a corner case in attach_entity_load_avg() which will cause load_sum to be zero while load_avg will not be. Consider se_weight is 88761 as per the sched_prio_to_weight[] table. Further assume the get_pelt_divider() is 47742, this gives: se->avg.load_avg is 1. However, calculating load_sum: se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se)); se->avg.load_sum = 1*47742/88761 = 0. Then enqueue_load_avg() adds this to the cfs_rq totals: cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; Resulting in load_avg being 1 with load_sum is 0, which will trigger the WARN. Fixes: f207934fb79d ("sched/fair: Align PELT windows between cfs_rq and its se") Signed-off-by: kuyo chang <kuyo.chang@mediatek.com> [peterz: massage changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13sched: Teach the forced-newidle balancer about CPU affinity limitation.Sebastian Andrzej Siewior1-1/+1
commit 386ef214c3c6ab111d05e1790e79475363abaa05 upstream. try_steal_cookie() looks at task_struct::cpus_mask to decide if the task could be moved to `this' CPU. It ignores that the task might be in a migration disabled section while not on the CPU. In this case the task must not be moved otherwise per-CPU assumption are broken. Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a task can be moved. Fixes: d2dfa17bc7de6 ("sched: Trivial forced-newidle balancer") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08sched/rt: Plug rt_mutex_setprio() vs push_rt_task() raceValentin Schneider2-16/+28
[ Upstream commit 49bef33e4b87b743495627a529029156c6e09530 ] John reported that push_rt_task() can end up invoking find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS one), which causes mayhem down convert_prio(). This can happen when current gets demoted to e.g. CFS when releasing an rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before getting the chance to reschedule. Exactly who triggers this work isn't entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task() if there are no RT tasks on the local RQ, which means the local CPU can't be in the rto_mask. My current suspected sequence is something along the lines of the below, with the demoted task being current. mark_wakeup_next_waiter() rt_mutex_adjust_prio() rt_mutex_setprio() // deboost originally-CFS task check_class_changed() switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running switched_to_fair() // Sets need_resched __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above raw_spin_rq_unlock(rq) // need_resched is set, so task_woken_rt() can't // invoke push_rt_tasks(). Best I can come up with is // local CPU has rt_nr_migratory >= 2 after the demotion, so stays // in the rto_mask, and then: <some other CPU running rto_push_irq_work_func() queues rto_push_work on this CPU> push_rt_task() // breakage follows here as rq->curr is CFS Move an existing check to check rq->curr vs the next pushable task's priority before getting anywhere near find_lowest_rq(). While at it, add an explicit sched_class of rq->curr check prior to invoking find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless of next_task's migratability. Fixes: a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing") Reported-by: John Keeping <john@metanate.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: John Keeping <john@metanate.com> Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/cpuacct: Fix charge percpu cpuusageChengming Zhou1-1/+2
[ Upstream commit 248cc9993d1cc12b8e9ed716cc3fc09f6c3517dd ] The cpuacct_account_field() is always called by the current task itself, so it's ok to use __this_cpu_add() to charge the tick time. But cpuacct_charge() maybe called by update_curr() in load_balance() on a random CPU, different from the CPU on which the task is running. So __this_cpu_add() will charge that cputime to a random incorrect CPU. Fixes: 73e6aafd9ea8 ("sched/cpuacct: Simplify the cpuacct code") Reported-by: Minye Zhu <zhuminye@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220220051426.5274-1-zhouchengming@bytedance.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/fair: Improve consistency of allowed NUMA balance calculationsMel Gorman1-8/+10
[ Upstream commit 2cfb7a1b031b0e816af7a6ee0c6ab83b0acdf05a ] There are inconsistencies when determining if a NUMA imbalance is allowed that should be corrected. o allow_numa_imbalance changes types and is not always examining the destination group so both the type should be corrected as well as the naming. o find_idlest_group uses the sched_domain's weight instead of the group weight which is different to find_busiest_group o find_busiest_group uses the source group instead of the destination which is different to task_numa_find_cpu o Both find_idlest_group and find_busiest_group should account for the number of running tasks if a move was allowed to be consistent with task_numa_find_cpu Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/uclamp: Fix iowait boost escaping uclamp restrictionQais Yousef1-0/+1
[ Upstream commit d37aee9018e68b0d356195caefbb651910e0bbfa ] iowait_boost signal is applied independently of util and doesn't take into account uclamp settings of the rq. An io heavy task that is capped by uclamp_max could still request higher frequency because sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with() like effective_cpu_util() does. Make sure that iowait_boost honours uclamp requests by calling uclamp_rq_util_with() when applying the boost. Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/core: Export pelt_thermal_tpQais Yousef1-0/+1
[ Upstream commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b ] We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numaBharata B Rao1-10/+0
[ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ] The older format of /proc/pid/sched printed home node info which required the mempolicy and task lock around mpol_get(). However the format has changed since then and there is no need for sched_show_numa() any more to have mempolicy argument, asssociated mpol_get/put and task_lock/unlock. Remove them. Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-08sched: Fix yet more sched_fork() racesPeter Zijlstra1-13/+21
commit b1e8206582f9d680cff7d04828708c8b6ab32957 upstream. Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08sched/fair: Fix fault in reweight_entityTadeusz Struk1-5/+6
[ Upstream commit 13765de8148f71fa795e0a6607de37c49ea5915a ] Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads the copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which the sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task has been set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in the sched_post_fork(), which is called after that. To fix the issue the remove the update_load param from the update_load param() function and call reweight_task() only if the task flag doesn't have the TASK_NEW flag set. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16sched: Avoid double preemption in __cond_resched_*lock*()Peter Zijlstra1-9/+3
[ Upstream commit 7e406d1ff39b8ee574036418a5043c86723170cf ] For PREEMPT/DYNAMIC_PREEMPT the *_unlock() will already trigger a preemption, no point in then calling preempt_schedule_common() *again*. Use _cond_resched() instead, since this is a NOP for the preemptible configs while it provide a preemption point for the others. Reported-by: xuhaifeng <xuhaifeng@oppo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-01psi: fix "defined but not used" warnings when CONFIG_PROC_FS=nSuren Baghdasaryan1-38/+41
commit 44585f7bc0cb01095bc2ad4258049c02bbad21ef upstream. When CONFIG_PROC_FS is disabled psi code generates the following warnings: kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=] 1364 | static const struct proc_ops psi_cpu_proc_ops = { | ^~~~~~~~~~~~~~~~ kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=] 1355 | static const struct proc_ops psi_memory_proc_ops = { | ^~~~~~~~~~~~~~~~~~~ kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=] 1346 | static const struct proc_ops psi_io_proc_ops = { | ^~~~~~~~~~~~~~~ Make definitions of these structures and related functions conditional on CONFIG_PROC_FS config. Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com Fixes: 0e94682b73bf ("psi: introduce psi monitor") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01sched/pelt: Relax the sync of util_sum with util_avgVincent Guittot2-4/+16
[ Upstream commit 98b0d890220d45418cfbc5157b3382e6da5a12ab ] Rick reported performance regressions in bugzilla because of cpu frequency being lower than before: https://bugzilla.kernel.org/show_bug.cgi?id=215045 He bisected the problem to: commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") This commit forces util_sum to be synced with the new util_avg after removing the contribution of a task and before the next periodic sync. By doing so util_sum is rounded to its lower bound and might lost up to LOAD_AVG_MAX-1 of accumulated contribution which has not yet been reflected in util_avg. Instead of always setting util_sum to the low bound of util_avg, which can significantly lower the utilization of root cfs_rq after propagating the change down into the hierarchy, we revert the change of util_sum and propagate the difference. In addition, we also check that cfs's util_sum always stays above the lower bound for a given util_avg as it has been observed that sched_entity's util_sum is sometimes above cfs one. Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lkml.kernel.org/r/20220111134659.24961-2-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-01sched/membarrier: Fix membarrier-rseq fence command missing from query bitmaskMathieu Desnoyers1-4/+5
commit 809232619f5b15e31fb3563985e705454f32621f upstream. The membarrier command MEMBARRIER_CMD_QUERY allows querying the available membarrier commands. When the membarrier-rseq fence commands were added, a new MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK was introduced with the intent to expose them with the MEMBARRIER_CMD_QUERY command, the but it was never added to MEMBARRIER_CMD_BITMASK. The membarrier-rseq fence commands are therefore not wired up with the query command. Rename MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK to MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK (the bitmask is not a command per-se), and change the erroneous MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ_BITMASK (which does not actually exist) to MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ. Wire up MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK in MEMBARRIER_CMD_BITMASK. Fixing this allows discovering availability of the membarrier-rseq fence feature. Fixes: 2a36ab717e8f ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # 5.10+ Link: https://lkml.kernel.org/r/20220117203010.30129-1-mathieu.desnoyers@efficios.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01psi: Fix uaf issue when psi trigger is destroyed while being polledSuren Baghdasaryan1-37/+29
commit a06247c6804f1a7c86a2e5398a4c1f1db1471848 upstream. With write operation on psi files replacing old trigger with a new one, the lifetime of its waitqueue is totally arbitrary. Overwriting an existing trigger causes its waitqueue to be freed and pending poll() will stumble on trigger->event_wait which was destroyed. Fix this by disallowing to redefine an existing psi trigger. If a write operation is used on a file descriptor with an already existing psi trigger, the operation will fail with EBUSY error. Also bypass a check for psi_disabled in the psi_trigger_destroy as the flag can be flipped after the trigger is created, leading to a memory leak. Fixes: 0e94682b73bf ("psi: introduce psi monitor") Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Analyzed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27sched/cpuacct: Fix user/system in shown cpuacct.usage*Andrey Ryabinin1-47/+32
commit dd02d4234c9a2214a81c57a16484304a1a51872a upstream. cpuacct has 2 different ways of accounting and showing user and system times. The first one uses cpuacct_account_field() to account times and cpuacct.stat file to expose them. And this one seems to work ok. The second one is uses cpuacct_charge() function for accounting and set of cpuacct.usage* files to show times. Despite some attempts to fix it in the past it still doesn't work. Sometimes while running KVM guest the cpuacct_charge() accounts most of the guest time as system time. This doesn't match with user&system times shown in cpuacct.stat or proc/<pid>/stat. Demonstration: # git clone https://github.com/aryabinin/kvmsample # make # mkdir /sys/fs/cgroup/cpuacct/test # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks # ./kvmsample & # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done 1976535645 2979839428 3979832704 4983603153 5983604157 Use cpustats accounted in cpuacct_account_field() as the source of user/sys times for cpuacct.usage* files. Make cpuacct_charge() to account only summary execution time. Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-3-arbn@yandex-team.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27cputime, cpuacct: Include guest time in user time in cpuacct.statAndrey Ryabinin1-2/+2
commit 9731698ecb9c851f353ce2496292ff9fcea39dff upstream. cpuacct.stat in no-root cgroups shows user time without guest time included int it. This doesn't match with user time shown in root cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat in the same way. Make account_guest_time() to add user time to cgroup's cpustat to fix this. Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaimBrian Chen2-18/+32
[ Upstream commit cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf ] We've noticed cases where tasks in a cgroup are stalled on memory but there is little memory FULL pressure since tasks stay on the runqueue in reclaim. A simple example involves a single threaded program that keeps leaking and touching large amounts of memory. It runs in a cgroup with swap enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though there is significant CPU pressure and memory SOME, there is barely any memory FULL since the task enters reclaim and stays on the runqueue. However, this memory-bound task is effectively stalled on memory and we expect memory FULL to match memory SOME in this scenario. The code is confused about memstall && running, thinking there is a stalled task and a productive task when there's only one task: a reclaimer that's counted as both. To fix this, we redefine the condition for PSI_MEM_FULL to check that all running tasks are in an active memstall instead of checking that there are no running tasks. case PSI_MEM_FULL: - return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]); + return unlikely(tasks[NR_MEMSTALL] && + tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); This will capture reclaimers. It will also capture tasks that called psi_memstall_enter() and are about to sleep, but this should be negligible noise. Signed-off-by: Brian Chen <brianchen118@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27sched/rt: Try to restart rt period timer when rt runtime exceededLi Hua1-5/+18
[ Upstream commit 9b58e976b3b391c0cf02e038d53dd0478ed3013c ] When rt_runtime is modified from -1 to a valid control value, it may cause the task to be throttled all the time. Operations like the following will trigger the bug. E.g: 1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us 2. Run a FIFO task named A that executes while(1) 3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us When rt_runtime is -1, The rt period timer will not be activated when task A enqueued. And then the task will be throttled after setting rt_runtime to 950,000. The task will always be throttled because the rt period timer is not activated. Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Li Hua <hucool.lihua@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27sched/fair: Fix per-CPU kthread and wakee stacking for asym CPU capacityVincent Donnefort1-1/+2
[ Upstream commit 014ba44e8184e1acf93e0cbb7089ee847802f8f0 ] select_idle_sibling() has a special case for tasks woken up by a per-CPU kthread where the selected CPU is the previous one. For asymmetric CPU capacity systems, the assumption was that the wakee couldn't have a bigger utilization during task placement than it used to have during the last activation. That was not considering uclamp.min which can completely change between two task activations and as a consequence mandates the fitness criterion asym_fits_capacity(), even for the exit path described above. Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211129173115.4006346-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27sched/fair: Fix detection of per-CPU kthreads waking a taskVincent Donnefort1-0/+1
[ Upstream commit 8b4e74ccb582797f6f0b0a50372ebd9fd2372a27 ] select_idle_sibling() has a special case for tasks woken up by a per-CPU kthread, where the selected CPU is the previous one. However, the current condition for this exit path is incomplete. A task can wake up from an interrupt context (e.g. hrtimer), while a per-CPU kthread is running. A such scenario would spuriously trigger the special case described above. Also, a recent change made the idle task like a regular per-CPU kthread, hence making that situation more likely to happen (is_per_cpu_kthread(swapper) being true now). Checking for task context makes sure select_idle_sibling() will not interpret a wake up from any other context as a wake up by a per-CPU kthread. Fixes: 52262ee567ad ("sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211201143450.479472-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-14wait: add wake_up_pollfree()Eric Biggers1-0/+7
commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream. Several ->poll() implementations are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'. However, that has a bug: wake_up_poll() calls __wake_up() with nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters, and the wakeup function for the first one returns a positive value, only that one will be called. That's *not* what's needed for POLLFREE; POLLFREE is special in that it really needs to wake up everyone. Considering the three non-blocking poll systems: - io_uring poll doesn't handle POLLFREE at all, so it is broken anyway. - aio poll is unaffected, since it doesn't support exclusive waits. However, that's fragile, as someone could add this feature later. - epoll doesn't appear to be broken by this, since its wakeup function returns 0 when it sees POLLFREE. But this is fragile. Although there is a workaround (see epoll), it's better to define a function which always sends POLLFREE to all waiters. Add such a function. Also make it verify that the queue really becomes empty after all waiters have been woken up. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08sched/uclamp: Fix rq->uclamp_max not set on first enqueueQais Yousef1-1/+1
[ Upstream commit 315c4f884800c45cb6bd8c90422fad554a8b9588 ] Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit d81ae8aac85c changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-08preempt/dynamic: Fix setup_preempt_mode() return valueAndrew Halaney1-2/+2
[ Upstream commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86 ] __setup() callbacks expect 1 for success and 0 for failure. Correct the usage here to reflect that. Fixes: 826bfeb37bb4 ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-01sched/scs: Reset task stack state in bringup_cpu()Mark Rutland1-4/+0
[ Upstream commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3 ] To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames, and when shadow call stacks are in use. When shadow call stacks (SCS) are in use the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined than onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to so so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with: * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK ... offlining and onlining CPUS with: | while true; do | for C in /sys/devices/system/cpu/cpu*/online; do | echo 0 > $C; | echo 1 > $C; | done | done Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25sched/fair: Prevent dead task groups from regaining cfs_rq'sMathias Krause5-16/+49
[ Upstream commit b027789e5e50494c2325cc70c8642e7fd6059479 ] Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()Vincent Donnefort1-0/+3
[ Upstream commit 42dc938a590c96eeb429e1830123fef2366d9c80 ] Nothing protects the access to the per_cpu variable sd_llc_id. When testing the same CPU (i.e. this_cpu == that_cpu), a race condition exists with update_top_cache_domain(). One scenario being: CPU1 CPU2 ================================================================== per_cpu(sd_llc_id, CPUX) => 0 partition_sched_domains_locked() detach_destroy_domains() cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX) per_cpu(sd_llc_id, CPUX) => 0 per_cpu(sd_llc_id, CPUX) = CPUX per_cpu(sd_llc_id, CPUX) => CPUX return false ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result is a warning triggered from ttwu_queue_wakelist(). Avoid a such race in cpus_share_cache() by always returning true when this_cpu == that_cpu. Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18kernel/sched: Fix sched_fork() access an invalid sched_task_groupZhang Qiao1-21/+22
[ Upstream commit 4ef0c5c6b5ba1f38f0ea1cedad0cad722f00c14a ] There is a small race between copy_process() and sched_fork() where child->sched_task_group point to an already freed pointer. parent doing fork() | someone moving the parent | to another cgroup -------------------------------+------------------------------- copy_process() + dup_task_struct()<1> parent move to another cgroup, and free the old cgroup. <2> + sched_fork() + __set_task_cpu()<3> + task_fork_fair() + sched_slice()<4> In the worst case, this bug can lead to "use-after-free" and cause panic as shown above: (1) parent copy its sched_task_group to child at <1>; (2) someone move the parent to another cgroup and free the old cgroup at <2>; (3) the sched_task_group and cfs_rq that belong to the old cgroup will be accessed at <3> and <4>, which cause a panic: [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0 [] Oops: 0000 [#1] SMP PTI [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1 [] RIP: 0010:sched_slice+0x84/0xc0 [] Call Trace: [] task_fork_fair+0x81/0x120 [] sched_fork+0x132/0x240 [] copy_process.part.5+0x675/0x20e0 [] ? __handle_mm_fault+0x63f/0x690 [] _do_fork+0xcd/0x3b0 [] do_syscall_64+0x5d/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x65/0xca [] RIP: 0033:0x7f04418cd7e1 Between cgroup_can_fork() and cgroup_post_fork(), the cgroup membership and thus sched_task_group can't change. So update child's sched_task_group at sched_post_fork() and move task_fork() and __set_task_cpu() (where accees the sched_task_group) from sched_fork() to sched_post_fork(). Fixes: 8323f26ce342 ("sched: Fix race in task_group") Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-19sched/scs: Reset the shadow stack when idle_task_exitWoody Lin1-0/+1
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") removed the init_idle() call from idle_thread_get(). This was the sole call-path on hotplug that resets the Shadow Call Stack (scs) Stack Pointer (sp). Not resetting the scs-sp leads to scs overflow after enough hotplug cycles. Therefore add an explicit scs_task_reset() to the hotplug code to make sure the scs-sp does get reset on hotplug. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Woody Lin <woodylin@google.com> [peterz: Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
2021-10-01sched/fair: Null terminate buffer when updating tunable_scalingMel Gorman1-1/+7
This patch null-terminates the temporary buffer in sched_scaling_write() so kstrtouint() does not return failure and checks the value is valid. Before: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument $ cat /sys/kernel/debug/sched/tunable_scaling 1 After: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling $ cat /sys/kernel/debug/sched/tunable_scaling 0 $ echo 3 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
2021-10-01sched/fair: Add ancestors of unthrottled undecayed cfs_rqMichal Koutný1-1/+5
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") we add cfs_rqs with no runnable tasks but not fully decayed into the load (leaf) list. We may ignore adding some ancestors and therefore breaking tmp_alone_branch invariant. This broke LTP test cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling"). I noticed the named test still fails even with the fix (but with low probability, 1 in ~1000 executions of the test). The reason is when bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly. Fix this by adding ancestors if we notice the unthrottled cfs_rq was added to the load list. Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
2021-09-12Merge tag 'sched_urgent_for_v5.15_rc1' of ↵Linus Torvalds2-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the idle timer expires in hardirq context, on PREEMPT_RT - Make sure the run-queue balance callback is invoked only on the outgoing CPU * tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Prevent balance_push() on remote runqueues sched/idle: Make the idle timer expire in hard interrupt context
2021-09-09sched: Prevent balance_push() on remote runqueuesThomas Gleixner1-3/+3
sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance callback after changing priorities or the scheduling class of a task. The run-queue for which the callback is invoked can be local or remote. That's not a problem for the regular rq::push_work which is serialized with a busy flag in the run-queue struct, but for the balance_push() work which is only valid to be invoked on the outgoing CPU that's wrong. It not only triggers the debug warning, but also leaves the per CPU variable push_work unprotected, which can result in double enqueues on the stop machine list. Remove the warning and validate that the function is invoked on the outgoing CPU. Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()") Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx
2021-09-09sched/idle: Make the idle timer expire in hard interrupt contextSebastian Andrzej Siewior1-2/+2
The intel powerclamp driver will setup a per-CPU worker with RT priority. The worker will then invoke play_idle() in which it remains in the idle poll loop until it is stopped by the timer it started earlier. That timer needs to expire in hard interrupt context on PREEMPT_RT. Otherwise the timer will expire in ksoftirqd as a SOFT timer but that task won't be scheduled on the CPU because its priority is lower than the priority of the worker which is in the idle loop. Always expire the idle timer in hard interrupt context. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210906113034.jgfxrjdvxnjqgtmc@linutronix.de
2021-08-31Merge tag 'pm-5.15-rc1' of ↵Linus Torvalds1-5/+11
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These address some PCI device power management issues, add new hardware support to the RAPL power capping driver, add HWP guaranteed performance change notification support to the intel_pstate driver, replace deprecated CPU-hotplug functions in a few places, update CPU PM notifiers to use raw spinlocks, update the PM domains framework (new DT property support, Kconfig fix), do a couple of cleanups in code related to system sleep, and improve the energy model and the schedutil cpufreq governor. Specifics: - Address 3 PCI device power management issues (Rafael Wysocki). - Add Power Limit4 support for Alder Lake to the Intel RAPL power capping driver (Sumeet Pawnikar). - Add HWP guaranteed performance change notification support to the intel_pstate driver (Srinivas Pandruvada). - Replace deprecated CPU-hotplug functions in code related to power management (Sebastian Andrzej Siewior). - Update CPU PM notifiers to use raw spinlocks (Valentin Schneider). - Add support for 'required-opps' DT property to the generic power domains (genpd) framework and use this property for I2C on ARM64 sc7180 (Rajendra Nayak). - Fix Kconfig issue related to genpd (Geert Uytterhoeven). - Increase energy calculation precision in the Energy Model (Lukasz Luba). - Fix kobject deletion in the exit code of the schedutil cpufreq governor (Kevin Hao). - Unmark some functions as kernel-doc in the PM core to avoid false-positive documentation build warnings (Randy Dunlap). - Check RTC features instead of ops in suspend_test Alexandre Belloni)" * tag 'pm-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: domains: Fix domain attach for CONFIG_PM_OPP=n powercap: Add Power Limit4 support for Alder Lake SoC cpufreq: intel_pstate: Process HWP Guaranteed change notification thermal: intel: Allow processing of HWP interrupt notifier: Remove atomic_notifier_call_chain_robust() PM: cpu: Make notifier chain use a raw_spinlock_t PM: sleep: unmark 'state' functions as kernel-doc arm64: dts: sc7180: Add required-opps for i2c PM: domains: Add support for 'required-opps' to set default perf state opp: Don't print an error if required-opps is missing cpufreq: schedutil: Use kobject release() method to free sugov_tunables PM: EM: Increase energy calculation precision PM: sleep: check RTC features instead of ops in suspend_test PM: sleep: s2idle: Replace deprecated CPU-hotplug functions cpufreq: Replace deprecated CPU-hotplug functions powercap: intel_rapl: Replace deprecated CPU-hotplug functions PCI: PM: Enable PME if it can be signaled from D3cold PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently PCI: Use pci_update_current_state() in pci_enable_device_flags()
2021-08-31Merge tag 'locking-core-2021-08-30' of ↵Linus Torvalds1-17/+92
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
2021-08-30Merge tag 'sched-core-2021-08-30' of ↵Linus Torvalds6-129/+631
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...
2021-08-30Merge branch 'core-rcu.2021.08.28a' of ↵Linus Torvalds1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: "RCU changes for this cycle were: - Documentation updates - Miscellaneous fixes - Offloaded-callbacks updates - Updates to the nolibc library - Tasks-RCU updates - In-kernel torture-test updates - Torture-test scripting, perhaps most notably the pinning of torture-test guest OSes so as to force differences in memory latency. For example, in a two-socket system, a four-CPU guest OS will have one pair of its CPUs pinned to threads in a single core on one socket and the other pair pinned to threads in a single core on the other socket. This approach proved able to force race conditions that earlier testing missed. Some of these race conditions are still being tracked down" * 'core-rcu.2021.08.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (61 commits) torture: Replace deprecated CPU-hotplug functions. rcu: Replace deprecated CPU-hotplug functions rcu: Print human-readable message for schedule() in RCU reader rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCU rcu: Use per_cpu_ptr to get the pointer of per_cpu variable rcu: Remove useless "ret" update in rcu_gp_fqs_loop() rcu: Mark accesses in tree_stall.h rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack rcu: Mark lockless ->qsmask read in rcu_check_boost_fail() srcutiny: Mark read-side data races rcu: Start timing stall repetitions after warning complete rcu: Do not disable GP stall detection in rcu_cpu_stall_reset() rcu/tree: Handle VM stoppage in stall detection rculist: Unify documentation about missing list_empty_rcu() rcu: Mark accesses to ->rcu_read_lock_nesting rcu: Weaken ->dynticks accesses and updates rcu: Remove special bit at the bottom of the ->dynticks counter rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock rcu: Fix to include first blocked task in stall warning torture: Make kvm-test-1-run-qemu.sh check for reboot loops ...
2021-08-26sched: Fix get_push_task() vs migrate_disable()Sebastian Andrzej Siewior1-0/+3
push_rt_task() attempts to move the currently running task away if the next runnable task has migration disabled and therefore is pinned on the current CPU. The current task is retrieved via get_push_task() which only checks for nr_cpus_allowed == 1, but does not check whether the task has migration disabled and therefore cannot be moved either. The consequence is a pointless invocation of the migration thread which correctly observes that the task cannot be moved. Return NULL if the task has migration disabled and cannot be moved to another CPU. Fixes: a7c81556ec4d3 ("sched: Fix migrate_disable() vs rt/dl balancing") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210826133738.yiotqbtdaxzjsnfj@linutronix.de
2021-08-26sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED caseIngo Molnar1-1/+1
It's not actually used in the !CONFIG_FAIR_GROUP_SCHED case: kernel/sched/fair.c:488:12: warning: ‘tg_is_idle’ defined but not used [-Wunused-function] Keep around a placeholder nevertheless, for API completeness. Mark it inline, so the compiler doesn't think it must be used. Fixes: 304000390f88: ("sched: Cgroup SCHED_IDLE support") Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Don <joshdon@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org>
2021-08-23Merge back cpufreq changes for v5.15.Rafael J. Wysocki1-5/+11
2021-08-20sched: Introduce dl_task_check_affinity() to check proposed affinityWill Deacon1-17/+29
In preparation for restricting the affinity of a task during execve() on arm64, introduce a new dl_task_check_affinity() helper function to give an indication as to whether the restricted mask is admissible for a deadline task. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lore.kernel.org/r/20210730112443.23245-10-will@kernel.org
2021-08-20sched: Allow task CPU affinity to be restricted on asymmetric systemsWill Deacon2-18/+181
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. Although userspace can carefully manage the affinity masks for such tasks, one place where it is particularly problematic is execve() because the CPU on which the execve() is occurring may be incompatible with the new application image. In such a situation, it is desirable to restrict the affinity mask of the task and ensure that the new image is entered on a compatible CPU. From userspace's point of view, this looks the same as if the incompatible CPUs have been hotplugged off in the task's affinity mask. Similarly, if a subsequent execve() reverts to a compatible image, then the old affinity is restored if it is still valid. In preparation for restricting the affinity mask for compat tasks on arm64 systems without uniform support for 32-bit applications, introduce {force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict and restore the affinity mask for a task based on the compatible CPUs. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-9-will@kernel.org
2021-08-20sched: Split the guts of sched_setaffinity() into a helper functionWill Deacon1-48/+57
In preparation for replaying user affinity requests using a saved mask, split sched_setaffinity() up so that the initial task lookup and security checks are only performed when the request is coming directly from userspace. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-8-will@kernel.org
2021-08-20sched: Introduce task_struct::user_cpus_ptr to track requested affinityWill Deacon1-0/+20
In preparation for saving and restoring the user-requested CPU affinity mask of a task, add a new cpumask_t pointer to 'struct task_struct'. If the pointer is non-NULL, then the mask is copied across fork() and freed on task exit. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-7-will@kernel.org
2021-08-20sched: Reject CPU affinity changes based on task_cpu_possible_mask()Will Deacon1-1/+8
Reject explicit requests to change the affinity mask of a task via set_cpus_allowed_ptr() if the requested mask is not a subset of the mask returned by task_cpu_possible_mask(). This ensures that the 'cpus_mask' for a given task cannot contain CPUs which are incapable of executing it, except in cases where the affinity is forced. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-6-will@kernel.org
2021-08-20cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()Will Deacon1-2/+1
select_fallback_rq() only needs to recheck for an allowed CPU if the affinity mask of the task has changed since the last check. Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether the affinity mask was updated, and use this to elide the allowed check when the mask has been left alone. No functional change. Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org
2021-08-20sched: Introduce task_cpu_possible_mask() to limit fallback rq selectionWill Deacon1-6/+3
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. On such a system, we must take care not to migrate a task to an unsupported CPU when forcefully moving tasks in select_fallback_rq() in response to a CPU hot-unplug operation. Introduce a task_cpu_possible_mask() hook which, given a task argument, allows an architecture to return a cpumask of CPUs that are capable of executing that task. The default implementation returns the cpu_possible_mask, since sane machines do not suffer from per-cpu ISA limitations that affect scheduling. The new mask is used when selecting the fallback runqueue as a last resort before forcing a migration to the first active CPU. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-2-will@kernel.org
2021-08-20sched: Cgroup SCHED_IDLE supportJosh Don4-25/+208
This extends SCHED_IDLE to cgroups. Interface: cgroup/cpu.idle. 0: default behavior 1: SCHED_IDLE Extending SCHED_IDLE to cgroups means that we incorporate the existing aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its descendant threads towards the idle_h_nr_running count of all of its ancestor cgroups. Thus, sched_idle_rq() will work properly. Additionally, SCHED_IDLE cgroups are configured with minimum weight. There are two key differences between the per-task and per-cgroup SCHED_IDLE interface: - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to maintain their relative weights. The entity that is "idle" is the cgroup, not the tasks themselves. - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption decision is not made by comparing the current task with the woken task, but rather by comparing their matching sched_entity. A typical use-case for this is a user that creates an idle and a non-idle subtree. The non-idle subtree will dominate competition vs the idle subtree, but the idle subtree will still be high priority vs other users on the system. The latter is accomplished via comparing matching sched_entity in the waken preemption path (this could also be improved by making the sched_idle_rq() decision dependent on the perspective of a specific task). For now, we maintain the existing SCHED_IDLE semantics. Future patches may make improvements that extend how we treat SCHED_IDLE entities. The per-task_group idle field is an integer that currently only holds either a 0 or a 1. This is explicitly typed as an integer to allow for further extensions to this API. For example, a negative value may indicate a highly latency-sensitive cgroup that should be preferred for preemption/placement/etc. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
2021-08-20sched/topology: Skip updating masks for non-online nodesValentin Schneider1-0/+65
The scheduler currently expects NUMA node distances to be stable from init onwards, and as a consequence builds the related data structures once-and-for-all at init (see sched_init_numa()). Unfortunately, on some architectures node distance is unreliable for offline nodes and may very well change upon onlining. Skip over offline nodes during sched_init_numa(). Track nodes that have been onlined at least once, and trigger a build of a node's NUMA masks when it is first onlined post-init. Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com