summaryrefslogtreecommitdiff
path: root/kernel/time
AgeCommit message (Collapse)AuthorFilesLines
2023-09-02Merge tag 'timers-urgent-2023-09-02' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix false positive 'softirq work is pending' messages on -rt kernels, caused by a buggy factoring-out of existing code" * tag 'timers-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/rcu: Fix false positive "softirq work is pending" messages
2023-08-31Merge tag 'docs-6.6' of git://git.lwn.net/linuxLinus Torvalds1-11/+158
Pull documentation updates from Jonathan Corbet: "Documentation work keeps chugging along; this includes: - Work from Carlos Bilbao to integrate rustdoc output into the generated HTML documentation. This took some work to figure out how to do it without slowing the docs build and without creating people who don't have Rust installed, but Carlos got there - Move the loongarch and mips architecture documentation under Documentation/arch/ - Some more maintainer documentation from Jakub ... plus the usual assortment of updates, translations, and fixes" * tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits) Docu: genericirq.rst: fix irq-example input: docs: pxrc: remove reference to phoenix-sim Documentation: serial-console: Fix literal block marker docs/mm: remove references to hmm_mirror ops and clean typos docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add() Documentation: Fix typos Documentation/ABI: Fix typos scripts: kernel-doc: fix macro handling in enums scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN] Documentation: riscv: Update boot image header since EFI stub is supported Documentation: riscv: Add early boot document Documentation: arm: Add bootargs to the table of added DT parameters docs: kernel-parameters: Refer to the correct bitmap function doc: update params of memhp_default_state= docs: Add book to process/kernel-docs.rst docs: sparse: fix invalid link addresses docs: vfs: clean up after the iterate() removal docs: Add a section on surveys to the researcher guidelines docs: move mips under arch docs: move loongarch under arch ...
2023-08-30tick/rcu: Fix false positive "softirq work is pending" messagesPaul Gortmaker1-1/+1
In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from a one conditional: if (a && b && c) warn(); to a three conditional: if (!a) return; if (!b) return; if (!c) return; warn(); But that conversion got the condition for the RT specific local_bh_blocked() wrong. The original condition was: !local_bh_blocked() but the conversion failed to negate it so it ended up as: if (!local_bh_blocked()) return false; This issue lay dormant until another fixup for the same commit was added in commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition"). This commit realized the ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued. Once this commit was backported via linux-stable, both the v6.1 and v6.4 preempt-rt kernels started printing out 10 instances of this at boot: NOHZ tick-stop error: local softirq work is pending, handler #80!!! Remove the negation and return when local_bh_blocked() evaluates to true to bring the correct behaviour back. Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Wen Yang <wenyang.linux@foxmail.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230818200757.1808398-1-paul.gortmaker@windriver.com
2023-08-29Merge tag 'linux-kselftest-kunit-6.6-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit updates from Shuah Khan: - add support for running Rust documentation tests as KUnit tests - make init, str, sync, types doctests compilable/testable - add support for attributes API which include speed, modules attributes, ability to filter and report attributes - add support for marking tests slow using attributes API - add attributes API documentation - fix a wild-memory-access bug in kunit_filter_suites() and a possible memory leak in kunit_filter_suites() - add support for counting number of test suites in a module, list action to kunit test modules, and test filtering on module tests * tag 'linux-kselftest-kunit-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (25 commits) kunit: fix struct kunit_attr header kunit: replace KUNIT_TRIGGER_STATIC_STUB maro with KUNIT_STATIC_STUB_REDIRECT kunit: Allow kunit test modules to use test filtering kunit: Make 'list' action available to kunit test modules kunit: Report the count of test suites in a module kunit: fix uninitialized variables bug in attributes filtering kunit: fix possible memory leak in kunit_filter_suites() kunit: fix wild-memory-access bug in kunit_filter_suites() kunit: Add documentation of KUnit test attributes kunit: add tests for filtering attributes kunit: time: Mark test as slow using test attributes kunit: memcpy: Mark tests as slow using test attributes kunit: tool: Add command line interface to filter and report attributes kunit: Add ability to filter attributes kunit: Add module attribute kunit: Add speed attribute kunit: Add test attributes API structure MAINTAINERS: add Rust KUnit files to the KUnit entry rust: support running Rust documentation tests as KUnit ones rust: types: make doctests compilable/testable ...
2023-07-26kunit: time: Mark test as slow using test attributesRae Moar1-1/+1
Mark the time KUnit test, time64_to_tm_test_date_range, as slow using test attributes. This test ran relatively much slower than most other KUnit tests. By marking this test as slow, the test can now be filtered using the KUnit test attribute filtering feature. Example: --filter "speed>slow". This will run only the tests that have speeds faster than slow. The slow attribute will also be outputted in KTAP. Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Rae Moar <rmoar@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-07-15clocksource: Handle negative skews in "skew is too large" messagesPaul E. McKenney1-4/+4
The nanosecond-to-millisecond skew computation uses unsigned arithmetic, which produces user-unfriendly large positive numbers for negative skews. Therefore, use signed arithmetic for this computation in order to preserve the negativity. Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Reported-by: Feng Tang <feng.tang@intel.com> Fixes: dd029269947a ("clocksource: Improve "skew is too large" messages") Reviewed-by: Feng Tang <feng.tang@intel.com> Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14time: add kernel-doc in time.cRandy Dunlap1-11/+158
Add kernel-doc for all APIs that do not already have it. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: John Stultz <jstultz@google.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20230704052405.5089-3-rdunlap@infradead.org
2023-06-28Merge tag 'hardening-v6.5-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "There are three areas of note: A bunch of strlcpy()->strscpy() conversions ended up living in my tree since they were either Acked by maintainers for me to carry, or got ignored for multiple weeks (and were trivial changes). The compiler option '-fstrict-flex-arrays=3' has been enabled globally, and has been in -next for the entire devel cycle. This changes compiler diagnostics (though mainly just -Warray-bounds which is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_ coverage. In other words, there are no new restrictions, just potentially new warnings. Any new FORTIFY warnings we've seen have been fixed (usually in their respective subsystem trees). For more details, see commit df8fc4e934c12b. The under-development compiler attribute __counted_by has been added so that we can start annotating flexible array members with their associated structure member that tracks the count of flexible array elements at run-time. It is possible (likely?) that the exact syntax of the attribute will change before it is finalized, but GCC and Clang are working together to sort it out. Any changes can be made to the macro while we continue to add annotations. As an example of that last case, I have a treewide commit waiting with such annotations found via Coccinelle: https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b Also see commit dd06e72e68bcb4 for more details. Summary: - Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko) - Convert strreplace() to return string start (Andy Shevchenko) - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook) - Add missing function prototypes seen with W=1 (Arnd Bergmann) - Fix strscpy() kerndoc typo (Arne Welzel) - Replace strlcpy() with strscpy() across many subsystems which were either Acked by respective maintainers or were trivial changes that went ignored for multiple weeks (Azeem Shaikh) - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers) - Add KUnit tests for strcat()-family - Enable KUnit tests of FORTIFY wrappers under UML - Add more complete FORTIFY protections for strlcat() - Add missed disabling of FORTIFY for all arch purgatories. - Enable -fstrict-flex-arrays=3 globally - Tightening UBSAN_BOUNDS when using GCC - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays - Improve use of const variables in FORTIFY - Add requested struct_size_t() helper for types not pointers - Add __counted_by macro for annotating flexible array size members" * tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits) netfilter: ipset: Replace strlcpy with strscpy uml: Replace strlcpy with strscpy um: Use HOST_DIR for mrproper kallsyms: Replace all non-returning strlcpy with strscpy sh: Replace all non-returning strlcpy with strscpy of/flattree: Replace all non-returning strlcpy with strscpy sparc64: Replace all non-returning strlcpy with strscpy Hexagon: Replace all non-returning strlcpy with strscpy kobject: Use return value of strreplace() lib/string_helpers: Change returned value of the strreplace() jbd2: Avoid printing outside the boundary of the buffer checkpatch: Check for 0-length and 1-element arrays riscv/purgatory: Do not use fortified string functions s390/purgatory: Do not use fortified string functions x86/purgatory: Do not use fortified string functions acpi: Replace struct acpi_table_slit 1-element array with flex-array clocksource: Replace all non-returning strlcpy with strscpy string: use __builtin_memcpy() in strlcpy/strlcat staging: most: Replace all non-returning strlcpy with strscpy drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy ...
2023-06-28Merge tag 'sched-core-2023-06-27' of ↵Linus Torvalds2-9/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations - Fix task_struct::saved_state handling - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings" * tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle() sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() sched/core: Fixed missing rq clock update before calling set_rq_offline() sched/deadline: Update GRUB description in the documentation sched/deadline: Fix bandwidth reclaim equation in GRUB sched/wait: Fix a kthread_park race with wait_woken() sched/topology: Mark set_sched_topology() __init sched/fair: Rename variable cpu_util eff_util arm64/arch_timer: Fix MMIO byteswap sched/fair, cpufreq: Introduce 'runnable boosting' sched/fair: Refactor CPU utilization functions cpuidle: Use local_clock_noinstr() sched/clock: Provide local_clock_noinstr() x86/tsc: Provide sched_clock_noinstr() clocksource: hyper-v: Provide noinstr sched_clock() clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX x86/vdso: Fix gettimeofday masking math64: Always inline u128 version of mul_u64_u64_shr() s390/time: Provide sched_clock_noinstr() loongarch: Provide noinstr sched_clock_read() ...
2023-06-27Merge tag 'timers-core-2023-06-26' of ↵Linus Torvalds4-208/+326
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Time, timekeeping and related device driver updates: Core: - A set of fixes, cleanups and enhancements to the posix timer code: - Prevent another possible live lock scenario in the exit() path, which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures. - Fix a loop termination issue which was reported syzcaller/KSAN in the posix timer ID allocation code. That triggered a deeper look into the posix-timer code which unearthed more small issues. - Add missing READ/WRITE_ONCE() annotations - Fix or remove completely outdated comments - Document places which are subtle and completely undocumented. - Add missing hrtimer modes to the trace event decoder - Small cleanups and enhancements all over the place Drivers: - Rework the Hyper-V clocksource and sched clock setup code - Remove a deprecated clocksource driver - Small fixes and enhancements all over the place" * tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits) clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe dt-bindings: timers: Add Ralink SoCs timer clocksource/drivers/hyper-v: Rework clocksource and sched clock setup dt-bindings: timer: brcm,kona-timer: convert to YAML clocksource/drivers/imx-gpt: Fold <soc/imx/timer.h> into its only user clk: imx: Drop inclusion of unused header <soc/imx/timer.h> hrtimer: Add missing sparse annotations to hrtimer locking clocksource/drivers/imx-gpt: Use only a single name for functions clocksource/drivers/loongson1: Move PWM timer to clocksource framework dt-bindings: timer: Add Loongson-1 clocksource MIPS: Loongson32: Remove deprecated PWM timer clocksource clocksource/drivers/ingenic-timer: Use pm_sleep_ptr() macro tracing/timer: Add missing hrtimer modes to decode_hrtimer_mode(). posix-timers: Add sys_ni_posix_timers() prototype tick/rcu: Fix bogus ratelimit condition alarmtimer: Remove unnecessary (void *) cast alarmtimer: Remove unnecessary initialization of variable 'ret' posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERS posix-timers: Polish coding style in a few places posix-timers: Remove pointless comments ...
2023-06-22hrtimer: Add missing sparse annotations to hrtimer lockingBen Dooks1-0/+3
Sparse warns about lock imbalance vs. the hrtimer_base lock due to missing sparse annotations: kernel/time/hrtimer.c:175:33: warning: context imbalance in 'lock_hrtimer_base' - wrong count at exit kernel/time/hrtimer.c:1301:28: warning: context imbalance in 'hrtimer_start_range_ns' - unexpected unlock kernel/time/hrtimer.c:1336:28: warning: context imbalance in 'hrtimer_try_to_cancel' - unexpected unlock kernel/time/hrtimer.c:1457:9: warning: context imbalance in '__hrtimer_get_remaining' - unexpected unlock Add the annotations to the relevant functions. Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230621075928.394481-1-ben.dooks@codethink.co.uk
2023-06-18tick/rcu: Fix bogus ratelimit conditionWen Yang1-1/+1
The ratelimit logic in report_idle_softirq() is broken because the exit condition is always true: static int ratelimit; if (ratelimit < 10) return false; ---> always returns here ratelimit++; ---> no chance to run Make it check for >= 10 instead. Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/tencent_5AAA3EEAB42095C9B7740BE62FBF9A67E007@qq.com
2023-06-18alarmtimer: Remove unnecessary (void *) castLi zeming1-1/+1
Pointers of type void * do not require a type cast when they are assigned to a real pointer. Signed-off-by: Li zeming <zeming@nfschina.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230609182059.4509-1-zeming@nfschina.com
2023-06-18alarmtimer: Remove unnecessary initialization of variable 'ret'Li zeming1-1/+1
ret is assigned before checked, so it does not need to initialize the variable Signed-off-by: Li zeming <zeming@nfschina.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230609182856.4660-1-zeming@nfschina.com
2023-06-18posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERSLukas Bulwahn1-1/+1
Commit c78f261e5dcb ("posix-timers: Clarify posix_timer_fn() comments") turns an ifdef CONFIG_HIGH_RES_TIMERS into an conditional on "IS_ENABLED(CONFIG_HIGHRES_TIMERS)"; note that the new conditional refers to "HIGHRES_TIMERS" not "HIGH_RES_TIMERS" as before. Fix this typo introduced in that refactoring. Fixes: c78f261e5dcb ("posix-timers: Clarify posix_timer_fn() comments") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230609094643.26253-1-lukas.bulwahn@gmail.com
2023-06-18posix-timers: Polish coding style in a few placesThomas Gleixner1-7/+7
Make it consistent with the TIP tree documentation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.888493625@linutronix.de
2023-06-18posix-timers: Remove pointless commentsThomas Gleixner1-25/+0
Documenting the obvious is just consuming space for no value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.832240451@linutronix.de
2023-06-18posix-timers: Clarify posix_timer_fn() commentsThomas Gleixner1-30/+32
Make the issues vs. SIG_IGN understandable and remove the 15 years old promise that a proper solution is already on the horizon. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/874jnrdmrq.ffs@tglx
2023-06-18posix-timers: Clarify posix_timer_rearm() commentThomas Gleixner1-9/+3
Yet another incomprehensible piece of art. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.724863461@linutronix.de
2023-06-18posix-timers: Comment SIGEV_THREAD_ID properlyThomas Gleixner1-5/+2
Replace the word salad. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.672220780@linutronix.de
2023-06-18posix-timers: Add proper comments in do_timer_create()Thomas Gleixner1-9/+11
The comment about timer lifetime at the end of the function is misplaced and uncomprehensible. Make it understandable and put it at the right place. Add a new comment about the visibility of the new timer ID to user space. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.619897296@linutronix.de
2023-06-18posix-timers: Document nanosleep() detailsThomas Gleixner1-2/+7
The descriptions for common_nsleep() is wrong and common_nsleep_timens() lacks any form of comment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.567072835@linutronix.de
2023-06-18posix-timers: Document sys_clock_settime() permissions in placeThomas Gleixner1-7/+4
The documentation of sys_clock_settime() permissions is at a random place and mostly word salad. Remove it and add a concise comment into sys_clock_settime(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.514700292@linutronix.de
2023-06-18posix-timers: Document sys_clock_getoverrun()Thomas Gleixner1-8/+17
Document the syscall in detail and with coherent sentences. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.462051641@linutronix.de
2023-06-18posix-timers: Document common_clock_get() correctlyThomas Gleixner1-20/+30
Replace another confusing and inaccurate set of comments. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.409169321@linutronix.de
2023-06-18posix-timers: Document sys_clock_getres() correctlyThomas Gleixner1-8/+73
The decades old comment about Posix clock resolution is confusing at best. Remove it and add a proper explanation to sys_clock_getres(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.356427330@linutronix.de
2023-06-18posix-timers: Split release_posix_timers()Thomas Gleixner1-16/+15
release_posix_timers() is called for cleaning up both hashed and unhashed timers. The cases are differentiated by an argument and the usage is hideous. Seperate the actual free path out and use it for unhashed timers. Provide a function for hashed timers. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.301432503@linutronix.de
2023-06-18posix-timers: Remove pointless irqsafe from hash_lockThomas Gleixner1-3/+2
All usage of hash_lock is in thread context. No point in using spin_lock_irqsave()/irqrestore() for a single usage site. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.249063953@linutronix.de
2023-06-18posix-timers: Set k_itimer:: It_signal to NULL on exit()Thomas Gleixner1-0/+8
Technically it's not required to set k_itimer::it_signal to NULL on exit() because there is no other thread anymore which could lookup the timer concurrently. Set it to NULL for consistency sake and add a comment to that effect. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.196462644@linutronix.de
2023-06-18posix-timers: Annotate concurrent access to k_itimer:: It_signalThomas Gleixner1-7/+7
k_itimer::it_signal is read lockless in the RCU protected hash lookup, but it can be written concurrently in the timer_create() and timer_delete() path. Annotate these places with READ_ONCE() and WRITE_ONCE() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.143596887@linutronix.de
2023-06-18posix-timers: Add comments about timer lookupThomas Gleixner1-7/+32
Document how the timer ID validation in the hash table works. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.091081515@linutronix.de
2023-06-18posix-timers: Cleanup comments about timer ID trackingThomas Gleixner1-20/+8
Describe the hash table properly and remove the IDR leftover comments. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183313.038444551@linutronix.de
2023-06-18posix-timers: Clarify timer_wait_running() commentThomas Gleixner1-4/+12
Explain it better and add the CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y aspect for completeness. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230425183312.985681995@linutronix.de
2023-06-18posix-timers: Ensure timer ID search-loop limit is validThomas Gleixner1-13/+18
posix_timer_add() tries to allocate a posix timer ID by starting from the cached ID which was stored by the last successful allocation. This is done in a loop searching the ID space for a free slot one by one. The loop has to terminate when the search wrapped around to the starting point. But that's racy vs. establishing the starting point. That is read out lockless, which leads to the following problem: CPU0 CPU1 posix_timer_add() start = sig->posix_timer_id; lock(hash_lock); ... posix_timer_add() if (++sig->posix_timer_id < 0) start = sig->posix_timer_id; sig->posix_timer_id = 0; So CPU1 can observe a negative start value, i.e. -1, and the loop break never happens because the condition can never be true: if (sig->posix_timer_id == start) break; While this is unlikely to ever turn into an endless loop as the ID space is huge (INT_MAX), the racy read of the start value caught the attention of KCSAN and Dmitry unearthed that incorrectness. Rewrite it so that all id operations are under the hash lock. Reported-by: syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87bkhzdn6g.ffs@tglx
2023-06-18posix-timers: Prevent RT livelock in itimer_delete()Thomas Gleixner1-8/+35
itimer_delete() has a retry loop when the timer is concurrently expired. On non-RT kernels this just spin-waits until the timer callback has completed, except for posix CPU timers which have HAVE_POSIX_CPU_TIMERS_TASK_WORK enabled. In that case and on RT kernels the existing task could live lock when preempting the task which does the timer delivery. Replace spin_unlock() with an invocation of timer_wait_running() to handle it the same way as the other retry loops in the posix timer code. Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87v8g7c50d.ffs@tglx
2023-06-16tick/common: Align tick period during sched_timer setupThomas Gleixner2-13/+13
The tick period is aligned very early while the first clock_event_device is registered. At that point the system runs in periodic mode and switches later to one-shot mode if possible. The next wake-up event is programmed based on the aligned value (tick_next_period) but the delta value, that is used to program the clock_event_device, is computed based on ktime_get(). With the subtracted offset, the device fires earlier than the exact time frame. With a large enough offset the system programs the timer for the next wake-up and the remaining time left is too small to make any boot progress. The system hangs. Move the alignment later to the setup of tick_sched timer. At this point the system switches to oneshot mode and a high resolution clocksource is available. At this point it is safe to align tick_next_period because ktime_get() will now return accurate (not jiffies based) time. [bigeasy: Patch description + testing]. Fixes: e9523a0d81899 ("tick/common: Align tick period with the HZ tick.") Reported-by: Mathias Krause <minipli@grsecurity.net> Reported-by: "Bhatnagar, Rishabh" <risbhat@amazon.com> Suggested-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Richard W.M. Jones <rjones@redhat.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Acked-by: SeongJae Park <sj@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/5a56290d-806e-b9a5-f37c-f21958b5a8c0@grsecurity.net Link: https://lore.kernel.org/12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com Link: https://lore.kernel.org/r/20230615091830.RxMV2xf_@linutronix.de
2023-06-05time/sched_clock: Provide sched_clock_noinstr()Peter Zijlstra1-6/+16
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. As such, pull out the preempt_{dis,en}able_notrace() requirements from the sched_clock_read() implementations by explicitly providing it in the sched_clock() function. This further requires said sched_clock_read() functions to be noinstr themselves, for ARCH_WANTS_NO_INSTR users. See the next few patches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.302350330@infradead.org
2023-06-05seqlock/latch: Provide raw_read_seqcount_latch_retry()Peter Zijlstra2-3/+3
The read side of seqcount_latch consists of: do { seq = raw_read_seqcount_latch(&latch->seq); ... } while (read_seqcount_latch_retry(&latch->seq, seq)); which is asymmetric in the raw_ department, and sure enough, read_seqcount_latch_retry() includes (explicit) instrumentation where raw_read_seqcount_latch() does not. This inconsistency becomes a problem when trying to use it from noinstr code. As such, fix it by renaming and re-implementing raw_read_seqcount_latch_retry() without the instrumentation. Specifically the instrumentation in question is kcsan_atomic_next(0) in do___read_seqcount_retry(). Loosing this annotation is not a problem because raw_read_seqcount_latch() does not pass through kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.233598176@infradead.org
2023-06-01clocksource: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-1/+1
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Acked-by: John Stultz <jstultz@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230530163546.986188-1-azeemshaikh38@gmail.com
2023-05-09tick/broadcast: Make broadcast device replacement work correctlyThomas Gleixner1-32/+88
When a tick broadcast clockevent device is initialized for one shot mode then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode cpumask into the oneshot broadcast cpumask. This is required when switching from periodic broadcast mode to oneshot broadcast mode to ensure that CPUs which are waiting for periodic broadcast are woken up on the next tick. But it is subtly broken, when an active broadcast device is replaced and the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed this and debugged the issue. Then the OR of the periodic broadcast CPU mask is wrong as the periodic cpumask bits are sticky after tick_broadcast_enable() set it for a CPU unless explicitly cleared via tick_broadcast_disable(). That means that this sets all other CPUs which have tick broadcasting enabled at that point unconditionally in the oneshot broadcast mask. If the affected CPUs were already idle and had their bits set in the oneshot broadcast mask then this does no harm. But for non idle CPUs which were not set this corrupts their state. On their next invocation of tick_broadcast_enable() they observe the bit set, which indicates that the broadcast for the CPU is already set up. As a consequence they fail to update the broadcast event even if their earliest expiring timer is before the actually programmed broadcast event. If the programmed broadcast event is far in the future, then this can cause stalls or trigger the hung task detector. Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether this is the initial switch over from periodic to oneshot broadcast which must take the periodic broadcast mask into account. In the case of initialization of a replacement device this prevents that the broadcast oneshot mask is modified. There is a second problem with broadcast device replacement in this function. The broadcast device is only armed when the previous state of the device was periodic. That is correct for the switch from periodic broadcast mode to oneshot broadcast mode as the underlying broadcast device could operate in oneshot state already due to lack of periodic state in hardware. In that case it is already armed to expire at the next tick. For the replacement case this is wrong as the device is in shutdown state. That means that any already pending broadcast event will not be armed. This went unnoticed because any CPU which goes idle will observe that the broadcast device has an expiry time of KTIME_MAX and therefore any CPUs next timer event will be earlier and cause a reprogramming of the broadcast device. But that does not guarantee that the events of the CPUs which were already in idle are delivered on time. Fix this by arming the newly installed device for an immediate event which will reevaluate the per CPU expiry times and reprogram the broadcast device accordingly. This is simpler than caching the last expiry time in yet another place or saving it before the device exchange and handing it down to the setup function. Replacement of broadcast devices is not a frequent operation and usually happens once somewhere late in the boot process. Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode") Reported-by: Victor Hassan <victor@allwinnertech.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87pm7d2z1i.ffs@tglx
2023-04-29Merge tag 'timers-core-2023-04-28' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more timer updates from Thomas Gleixner: "Timekeeping and clocksource/event driver updates the second batch: - A trivial documentation fix in the timekeeping core - A really boring set of small fixes, enhancements and cleanups in the drivers code. No new clocksource/clockevent drivers for a change" * tag 'timers-core-2023-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix references to nonexistent ktime_get_fast_ns() dt-bindings: timer: rockchip: Add rk3588 compatible dt-bindings: timer: rockchip: Drop superfluous rk3288 compatible clocksource/drivers/ti: Use of_property_read_bool() for boolean properties clocksource/drivers/timer-ti-dm: Fix finding alwon timer clocksource/drivers/davinci: Fix memory leak in davinci_timer_register when init fails clocksource/drivers/stm32-lp: Drop of_match_ptr for ID table clocksource/drivers/timer-ti-dm: Convert to platform remove callback returning void clocksource/drivers/timer-tegra186: Convert to platform remove callback returning void clocksource/drivers/timer-ti-dm: Improve error message in .remove clocksource/drivers/timer-stm32-lp: Mark driver as non-removable clocksource/drivers/sh_mtu2: Mark driver as non-removable clocksource/drivers/timer-ti-dm: Use of_address_to_resource() clocksource/drivers/timer-imx-gpt: Remove non-DT function clocksource/drivers/timer-mediatek: Split out CPUXGPT timers clocksource/drivers/exynos_mct: Explicitly return 0 for shared timer
2023-04-27Merge tag 'driver-core-6.4-rc1' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ...
2023-04-27timekeeping: Fix references to nonexistent ktime_get_fast_ns()Geert Uytterhoeven1-2/+2
There was never a function named ktime_get_fast_ns(). Presumably these should refer to ktime_get_mono_fast_ns() instead. Fixes: c1ce406e80fb15fa ("timekeeping: Fix up function documentation for the NMI safe accessors") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/06df7b3cbd94f016403bbf6cd2b38e4368e7468f.1682516546.git.geert+renesas@glider.be
2023-04-25Merge tag 'timers-core-2023-04-24' of ↵Linus Torvalds5-112/+187
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: - Improve the VDSO build time checks to cover all dynamic relocations VDSO does not allow dynamic relocations, but the build time check is incomplete and fragile. It's based on architectures specifying the relocation types to search for and does not handle R_*_NONE relocation entries correctly. R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a]/dyn_size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too. Remove the architecture specific relocation types to check for and validate strictly that no other relocations than R_*_NONE end up in the VSDO .so file. - Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread. As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand. This not only avoids waking idle threads, it also distributes the signal delivery in case of multiple timers firing in the context of different threads close to each other better. - Align the tick period properly (again) For a long time the tick was starting at CLOCK_MONOTONIC zero, which allowed users space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignement of the tick period was more by chance than by intention as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there. The early enablement of sched_clock() broke this alignement as the time accumulated by sched_clock() is taken into account when timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is not longer a multiple of tick periods, which breaks applications which relied on that behaviour. Cure this by aligning the tick starting point to the next multiple of tick periods, i.e 1000ms/CONFIG_HZ. - A set of NOHZ fixes and enhancements: * Cure the concurrent writer race for idle and IO sleeptime statistics The statitic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse. Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value. * Protect idle/iowait sleep time with a sequence count Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may result in random and potentially going backwards values. Protect this by a sequence count, which fixes the idle time statistics issue, but cannot fix the iowait time problem because iowait time accounting races with remote wake ups decrementing the remote runqueues nr_iowait counter. The latter is impossible to fix, so the only way to deal with that is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that. * Restructure struct tick_sched for better cache layout * Some small cleanups and a better cache layout for struct tick_sched - Implement the missing timer_wait_running() callback for POSIX CPU timers For unknown reason the introduction of the timer_wait_running() callback missed to fixup posix CPU timers, which went unnoticed for almost four years. While initially only targeted to prevent livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks. The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT. CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out from hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped the task can be preempted and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this will end up in a livelock when the preempting task and the expiry task are pinned on the same CPU. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way. Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it accross the expiry function and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides. This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems * tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Implement the missing timer_wait_running callback selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity selftests/proc: Remove idle time monotonicity assertions MAINTAINERS: Remove stale email address timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick() timers/nohz: Add a comment about broken iowait counter update race timers/nohz: Protect idle/iowait sleep time under seqcount timers/nohz: Only ever update sleeptime from idle exit timers/nohz: Restructure and reshuffle struct tick_sched tick/common: Align tick period with the HZ tick. selftests/timers/posix_timers: Test delivery of signals across threads posix-timers: Prefer delivery of signals to the current thread vdso: Improve cmd_vdso_check to check all dynamic relocations
2023-04-21posix-cpu-timers: Implement the missing timer_wait_running callbackThomas Gleixner2-14/+71
For some unknown reason the introduction of the timer_wait_running callback missed to fixup posix CPU timers, which went unnoticed for almost four years. Marco reported recently that the WARN_ON() in timer_wait_running() triggers with a posix CPU timer test case. Posix CPU timers have two execution models for expiring timers depending on CONFIG_POSIX_CPU_TIMERS_TASK_WORK: 1) If not enabled, the expiry happens in hard interrupt context so spin waiting on the remote CPU is reasonably time bound. Implement an empty stub function for that case. 2) If enabled, the expiry happens in task work before returning to user space or guest mode. The expired timers are marked as firing and moved from the timer queue to a local list head with sighand lock held. Once the timers are moved, sighand lock is dropped and the expiry happens in fully preemptible context. That means the expiring task can be scheduled out, migrated, interrupted etc. So spin waiting on it is more than suboptimal. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way: - Add a mutex to struct posix_cputimers_work. This struct is per task and used to schedule the expiry task work from the timer interrupt. - Add a task_struct pointer to struct cpu_timer which is used to store a the task which runs the expiry. That's filled in when the task moves the expired timers to the local expiry list. That's not affecting the size of the k_itimer union as there are bigger union members already - Let the task take the expiry mutex around the expiry function - Let the waiter acquire a task reference with rcu_read_lock() held and block on the expiry mutex This avoids spin-waiting on a task which might not even be on a CPU and works nicely for RT too. Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT") Reported-by: Marco Elver <elver@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marco Elver <elver@google.com> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
2023-04-18timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()Frederic Weisbecker1-12/+8
There is no need for the __tick_nohz_idle_stop_tick() function between tick_nohz_idle_stop_tick() and its implementation. Remove that unnecessary step. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230222144649.624380-6-frederic@kernel.org
2023-04-18timers/nohz: Add a comment about broken iowait counter update raceFrederic Weisbecker1-2/+8
The per-cpu iowait task counter is incremented locally upon sleeping. But since the task can be woken to (and by) another CPU, the counter may then be decremented remotely. This is the source of a race involving readers VS writer of idle/iowait sleeptime. The following scenario shows an example where a /proc/stat reader observes a pending sleep time as IO whereas that pending sleep time later eventually gets accounted as non-IO. CPU 0 CPU 1 CPU 2 ----- ----- ------ //io_schedule() TASK A current->in_iowait = 1 rq(0)->nr_iowait++ //switch to idle // READ /proc/stat // See nr_iowait_cpu(0) == 1 return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime) //try_to_wake_up(TASK A) rq(0)->nr_iowait-- //idle exit // See nr_iowait_cpu(0) == 0 ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime) As a result subsequent reads on /proc/stat may expose backward progress. This is unfortunately hardly fixable. Just add a comment about that condition. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230222144649.624380-5-frederic@kernel.org
2023-04-18timers/nohz: Protect idle/iowait sleep time under seqcountFrederic Weisbecker2-6/+17
Reading idle/IO sleep time (eg: from /proc/stat) can race with idle exit updates because the state machine handling the stats is not atomic and requires a coherent read batch. As a result reading the sleep time may report irrelevant or backward values. Fix this with protecting the simple state machine within a seqcount. This is expected to be cheap enough not to add measurable performance impact on the idle path. Note this only fixes reader VS writer condition partitially. A race remains that involves remote updates of the CPU iowait task counter. It can hardly be fixed. Reported-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230222144649.624380-4-frederic@kernel.org
2023-04-18timers/nohz: Only ever update sleeptime from idle exitFrederic Weisbecker1-58/+37
The idle and IO sleeptime statistics appearing in /proc/stat can be currently updated from two sites: locally on idle exit and remotely by cpufreq. However there is no synchronization mechanism protecting concurrent updates. It is therefore possible to account the sleeptime twice, among all the other possible broken scenarios. To prevent from breaking the sleeptime accounting source, restrict the sleeptime updates to the local idle exit site. If there is a delta to add since the last update, IO/Idle sleep time readers will now only compute the delta without actually writing it back to the internal idle statistic fields. This fixes a writer VS writer race. Note there are still two known reader VS writer races to handle. A subsequent patch will fix one. Reported-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230222144649.624380-3-frederic@kernel.org
2023-04-18timers/nohz: Restructure and reshuffle struct tick_schedFrederic Weisbecker1-25/+41
Restructure and group fields by access in order to optimize cache layout. While at it, also add missing kernel doc for two fields: @last_jiffies and @idle_expires. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230222144649.624380-2-frederic@kernel.org