From b57b84968865780262f191cc95d688d7f82e681c Mon Sep 17 00:00:00 2001 From: Tang Yizhou Date: Sat, 12 Mar 2022 15:07:51 +0800 Subject: docs: scheduler: Convert schedutil.txt to ReST All other scheduler documents have been converted to *.rst. Let's do the same for schedutil.txt. Also fixed some typos. Signed-off-by: Tang Yizhou Link: https://lore.kernel.org/r/20220312070751.16844-1-tangyizhou@huawei.com Signed-off-by: Jonathan Corbet --- Documentation/scheduler/index.rst | 1 + Documentation/scheduler/schedutil.rst | 173 ++++++++++++++++++++++++++++++++++ Documentation/scheduler/schedutil.txt | 169 --------------------------------- 3 files changed, 174 insertions(+), 169 deletions(-) create mode 100644 Documentation/scheduler/schedutil.rst delete mode 100644 Documentation/scheduler/schedutil.txt (limited to 'Documentation') diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst index 88900aabdbf7..693eaac769a7 100644 --- a/Documentation/scheduler/index.rst +++ b/Documentation/scheduler/index.rst @@ -14,6 +14,7 @@ Linux Scheduler sched-domains sched-capacity sched-energy + schedutil sched-nice-design sched-rt-group sched-stats diff --git a/Documentation/scheduler/schedutil.rst b/Documentation/scheduler/schedutil.rst new file mode 100644 index 000000000000..32c7d69fc86c --- /dev/null +++ b/Documentation/scheduler/schedutil.rst @@ -0,0 +1,173 @@ +========= +Schedutil +========= + +.. note:: + + All this assumes a linear relation between frequency and work capacity, + we know this is flawed, but it is the best workable approximation. + + +PELT (Per Entity Load Tracking) +=============================== + +With PELT we track some metrics across the various scheduler entities, from +individual tasks to task-group slices to CPU runqueues. As the basis for this +we use an Exponentially Weighted Moving Average (EWMA), each period (1024us) +is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute +half, while the rest of history contribute the other half. + +Specifically: + + ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ... + + ewma(u) = ewma_sum(u) / ewma_sum(1) + +Since this is essentially a progression of an infinite geometric series, the +results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property +is key, since it gives the ability to recompose the averages when tasks move +around. + +Note that blocked tasks still contribute to the aggregates (task-group slices +and CPU runqueues), which reflects their expected contribution when they +resume running. + +Using this we track 2 key metrics: 'running' and 'runnable'. 'Running' +reflects the time an entity spends on the CPU, while 'runnable' reflects the +time an entity spends on the runqueue. When there is only a single task these +two metrics are the same, but once there is contention for the CPU 'running' +will decrease to reflect the fraction of time each task spends on the CPU +while 'runnable' will increase to reflect the amount of contention. + +For more detail see: kernel/sched/pelt.c + + +Frequency / CPU Invariance +========================== + +Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU +for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on +a big CPU, we allow architectures to scale the time delta with two ratios, one +Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio. + +For simple DVFS architectures (where software is in full control) we trivially +compute the ratio as:: + + f_cur + r_dvfs := ----- + f_max + +For more dynamic systems where the hardware is in control of DVFS we use +hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio. +For Intel specifically, we use:: + + APERF + f_cur := ----- * P0 + MPERF + + 4C-turbo; if available and turbo enabled + f_max := { 1C-turbo; if turbo enabled + P0; otherwise + + f_cur + r_dvfs := min( 1, ----- ) + f_max + +We pick 4C turbo over 1C turbo to make it slightly more sustainable. + +r_cpu is determined as the ratio of highest performance level of the current +CPU vs the highest performance level of any other CPU in the system. + + r_tot = r_dvfs * r_cpu + +The result is that the above 'running' and 'runnable' metrics become invariant +of DVFS and CPU type. IOW. we can transfer and compare them between CPUs. + +For more detail see: + + - kernel/sched/pelt.h:update_rq_clock_pelt() + - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation." + - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" + + +UTIL_EST / UTIL_EST_FASTUP +========================== + +Because periodic tasks have their averages decayed while they sleep, even +though when running their expected utilization will be the same, they suffer a +(DVFS) ramp-up after they are running again. + +To alleviate this (a default enabled option) UTIL_EST drives an Infinite +Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is +highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR +filter to instantly increase and only decay on decrease. + +A further runqueue wide sum (of runnable tasks) is maintained of: + + util_est := \Sum_t max( t_running, t_util_est_ewma ) + +For more detail see: kernel/sched/fair.c:util_est_dequeue() + + +UCLAMP +====== + +It is possible to set effective u_min and u_max clamps on each CFS or RT task; +the runqueue keeps an max aggregate of these clamps for all running tasks. + +For more detail see: include/uapi/linux/sched/types.h + + +Schedutil / DVFS +================ + +Every time the scheduler load tracking is updated (task wakeup, task +migration, time progression) we call out to schedutil to update the hardware +DVFS state. + +The basis is the CPU runqueue's 'running' metric, which per the above it is +the frequency invariant utilization estimate of the CPU. From this we compute +a desired frequency like:: + + max( running, util_est ); if UTIL_EST + u_cfs := { running; otherwise + + clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK + u_clamp := { u_cfs + u_rt; otherwise + + u := u_clamp + u_irq + u_dl; [approx. see source for more detail] + + f_des := min( f_max, 1.25 u * f_max ) + +XXX IO-wait: when the update is due to a task wakeup from IO-completion we +boost 'u' above. + +This frequency is then used to select a P-state/OPP or directly munged into a +CPPC style request to the hardware. + +XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min +required to satisfy the workload. + +Because these callbacks are directly from the scheduler, the DVFS hardware +interaction should be 'fast' and non-blocking. Schedutil supports +rate-limiting DVFS requests for when hardware interaction is slow and +expensive, this reduces effectiveness. + +For more information see: kernel/sched/cpufreq_schedutil.c + + +NOTES +===== + + - On low-load scenarios, where DVFS is most relevant, the 'running' numbers + will closely reflect utilization. + + - In saturated scenarios task movement will cause some transient dips, + suppose we have a CPU saturated with 4 tasks, then when we migrate a task + to an idle CPU, the old CPU will have a 'running' value of 0.75 while the + new CPU will gain 0.25. This is inevitable and time progression will + correct this. XXX do we still guarantee f_max due to no idle-time? + + - Much of the above is about avoiding DVFS dips, and independent DVFS domains + having to re-learn / ramp-up when load shifts. + diff --git a/Documentation/scheduler/schedutil.txt b/Documentation/scheduler/schedutil.txt deleted file mode 100644 index 78f6b91e2291..000000000000 --- a/Documentation/scheduler/schedutil.txt +++ /dev/null @@ -1,169 +0,0 @@ - - -NOTE; all this assumes a linear relation between frequency and work capacity, -we know this is flawed, but it is the best workable approximation. - - -PELT (Per Entity Load Tracking) -------------------------------- - -With PELT we track some metrics across the various scheduler entities, from -individual tasks to task-group slices to CPU runqueues. As the basis for this -we use an Exponentially Weighted Moving Average (EWMA), each period (1024us) -is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute -half, while the rest of history contribute the other half. - -Specifically: - - ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ... - - ewma(u) = ewma_sum(u) / ewma_sum(1) - -Since this is essentially a progression of an infinite geometric series, the -results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property -is key, since it gives the ability to recompose the averages when tasks move -around. - -Note that blocked tasks still contribute to the aggregates (task-group slices -and CPU runqueues), which reflects their expected contribution when they -resume running. - -Using this we track 2 key metrics: 'running' and 'runnable'. 'Running' -reflects the time an entity spends on the CPU, while 'runnable' reflects the -time an entity spends on the runqueue. When there is only a single task these -two metrics are the same, but once there is contention for the CPU 'running' -will decrease to reflect the fraction of time each task spends on the CPU -while 'runnable' will increase to reflect the amount of contention. - -For more detail see: kernel/sched/pelt.c - - -Frequency- / CPU Invariance ---------------------------- - -Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU -for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on -a big CPU, we allow architectures to scale the time delta with two ratios, one -Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio. - -For simple DVFS architectures (where software is in full control) we trivially -compute the ratio as: - - f_cur - r_dvfs := ----- - f_max - -For more dynamic systems where the hardware is in control of DVFS we use -hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio. -For Intel specifically, we use: - - APERF - f_cur := ----- * P0 - MPERF - - 4C-turbo; if available and turbo enabled - f_max := { 1C-turbo; if turbo enabled - P0; otherwise - - f_cur - r_dvfs := min( 1, ----- ) - f_max - -We pick 4C turbo over 1C turbo to make it slightly more sustainable. - -r_cpu is determined as the ratio of highest performance level of the current -CPU vs the highest performance level of any other CPU in the system. - - r_tot = r_dvfs * r_cpu - -The result is that the above 'running' and 'runnable' metrics become invariant -of DVFS and CPU type. IOW. we can transfer and compare them between CPUs. - -For more detail see: - - - kernel/sched/pelt.h:update_rq_clock_pelt() - - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation." - - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" - - -UTIL_EST / UTIL_EST_FASTUP --------------------------- - -Because periodic tasks have their averages decayed while they sleep, even -though when running their expected utilization will be the same, they suffer a -(DVFS) ramp-up after they are running again. - -To alleviate this (a default enabled option) UTIL_EST drives an Infinite -Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is -highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR -filter to instantly increase and only decay on decrease. - -A further runqueue wide sum (of runnable tasks) is maintained of: - - util_est := \Sum_t max( t_running, t_util_est_ewma ) - -For more detail see: kernel/sched/fair.c:util_est_dequeue() - - -UCLAMP ------- - -It is possible to set effective u_min and u_max clamps on each CFS or RT task; -the runqueue keeps an max aggregate of these clamps for all running tasks. - -For more detail see: include/uapi/linux/sched/types.h - - -Schedutil / DVFS ----------------- - -Every time the scheduler load tracking is updated (task wakeup, task -migration, time progression) we call out to schedutil to update the hardware -DVFS state. - -The basis is the CPU runqueue's 'running' metric, which per the above it is -the frequency invariant utilization estimate of the CPU. From this we compute -a desired frequency like: - - max( running, util_est ); if UTIL_EST - u_cfs := { running; otherwise - - clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK - u_clamp := { u_cfs + u_rt; otherwise - - u := u_clamp + u_irq + u_dl; [approx. see source for more detail] - - f_des := min( f_max, 1.25 u * f_max ) - -XXX IO-wait; when the update is due to a task wakeup from IO-completion we -boost 'u' above. - -This frequency is then used to select a P-state/OPP or directly munged into a -CPPC style request to the hardware. - -XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min -required to satisfy the workload. - -Because these callbacks are directly from the scheduler, the DVFS hardware -interaction should be 'fast' and non-blocking. Schedutil supports -rate-limiting DVFS requests for when hardware interaction is slow and -expensive, this reduces effectiveness. - -For more information see: kernel/sched/cpufreq_schedutil.c - - -NOTES ------ - - - On low-load scenarios, where DVFS is most relevant, the 'running' numbers - will closely reflect utilization. - - - In saturated scenarios task movement will cause some transient dips, - suppose we have a CPU saturated with 4 tasks, then when we migrate a task - to an idle CPU, the old CPU will have a 'running' value of 0.75 while the - new CPU will gain 0.25. This is inevitable and time progression will - correct this. XXX do we still guarantee f_max due to no idle-time? - - - Much of the above is about avoiding DVFS dips, and independent DVFS domains - having to re-learn / ramp-up when load shifts. - -- cgit v1.2.3