summaryrefslogtreecommitdiff
path: root/kernel/rcu
AgeCommit message (Collapse)AuthorFilesLines
2024-01-24rcu: Defer RCU kthreads wakeup when CPU is dyingFrederic Weisbecker2-3/+34
When the CPU goes idle for the last time during the CPU down hotplug process, RCU reports a final quiescent state for the current CPU. If this quiescent state propagates up to the top, some tasks may then be woken up to complete the grace period: the main grace period kthread and/or the expedited main workqueue (or kworker). If those kthreads have a SCHED_FIFO policy, the wake up can indirectly arm the RT bandwith timer to the local offline CPU. Since this happens after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the timer gets ignored. Therefore if the RCU kthreads are waiting for RT bandwidth to be available, they may never be actually scheduled. This triggers TREE03 rcutorture hangs: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved) rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6) rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000 Call Trace: <TASK> __schedule+0x2eb/0xa80 schedule+0x1f/0x90 schedule_timeout+0x163/0x270 ? __pfx_process_timeout+0x10/0x10 rcu_gp_fqs_loop+0x37c/0x5b0 ? __pfx_rcu_gp_kthread+0x10/0x10 rcu_gp_kthread+0x17c/0x200 kthread+0xde/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2b/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> The situation can't be solved with just unpinning the timer. The hrtimer infrastructure and the nohz heuristics involved in finding the best remote target for an unpinned timer would then also need to handle enqueues from an offline CPU in the most horrendous way. So fix this on the RCU side instead and defer the wake up to an online CPU if it's too late for the local one. Reported-by: Paul E. McKenney <paulmck@kernel.org> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-13Merge branches 'doc.2023.12.13a', 'torture.2023.11.23a', ↵Neeraj Upadhyay (AMD)8-25/+71
'fixes.2023.12.13a', 'rcu-tasks.2023.12.12b' and 'srcu.2023.12.13a' into rcu-merge.2023.12.13a
2023-12-13rcu: Force quiescent states only for ongoing grace periodZqiang1-0/+2
If an rcutorture test scenario creates an fqs_task kthread, it will periodically invoke rcu_force_quiescent_state() in order to start force-quiescent-state (FQS) operations. However, an FQS operation will be started even if there is no RCU grace period in progress. Although testing FQS operations startup when there is no grace period in progress is necessary, it need not happen all that often. This commit therefore causes rcu_force_quiescent_state() to take an early exit if there is no grace period in progress. Note that there will still be attempts to start an FQS scan in the absence of a grace period because the grace period might end right after the rcu_force_quiescent_state() function's check. In actual testing, this happens about once every ten minutes, which should provide adequate testing. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12srcu: Explain why callbacks invocations can't run concurrentlyFrederic Weisbecker1-0/+6
If an SRCU barrier is queued while callbacks are running and a new callbacks invocator for the same sdp were to run concurrently, the RCU barrier might execute too early. As this requirement is non-obvious, make sure to keep a record. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12srcu: No need to advance/accelerate if no callback enqueuedFrederic Weisbecker1-3/+5
While in grace period start, there is nothing to accelerate and therefore no need to advance the callbacks either if no callback is to be enqueued. Spare these needless operations in this case. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12srcu: Remove superfluous callbacks advancing from srcu_gp_start()Frederic Weisbecker1-10/+0
Callbacks advancing on SRCU must be performed on two specific places: 1) On enqueue time in order to make room for the acceleration of the new callback. 2) On invocation time in order to move the callbacks ready to invoke. Any other callback advancing callsite is needless. Remove the remaining one in srcu_gp_start(). Co-developed-by: Yong He <zhuangel570@gmail.com> Signed-off-by: Yong He <zhuangel570@gmail.com> Co-developed-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12rcu: Restrict access to RCU CPU stall notifiersPaul E. McKenney5-9/+53
Although the RCU CPU stall notifiers can be useful for dumping state when tracking down delicate forward-progress bugs where NUMA effects cause cache lines to be delivered to a given CPU regularly, but always in a state that prevents that CPU from making forward progress. These bugs can be detected by the RCU CPU stall-warning mechanism, but in some cases, the stall-warnings printk()s disrupt the forward-progress bug before any useful state can be obtained. Unfortunately, the notifier mechanism added by commit 5b404fdabacf ("rcu: Add RCU CPU stall notifier") can make matters worse if used at all carelessly. For example, if the stall warning was caused by a lock not being released, then any attempt to acquire that lock in the notifier will hang. This will prevent not only the notifier from producing any useful output, but it will also prevent the stall-warning message from ever appearing. This commit therefore hides this new RCU CPU stall notifier mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that depends on both DEBUG_KERNEL and RCU_EXPERT. In addition, the rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also be specified. The RCU_CPU_STALL_NOTIFIER Kconfig option's help text contains a warning and explains the dangers of careless use, recommending lockless notifier code. In addition, a WARN() is triggered each time that an attempt is made to register a stall-warning notifier in kernels built with CONFIG_RCU_CPU_STALL_NOTIFIER=y. This combination of measures will keep use of this mechanism confined to debug kernels and away from routine deployments. [ paulmck: Apply Dan Carpenter feedback. ] Fixes: 5b404fdabacf ("rcu: Add RCU CPU stall notifier") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-11rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpuPaul E. McKenney1-2/+2
The task_struct structure's ->rcu_tasks_idle_cpu can be concurrently read and written from the RCU Tasks grace-period kthread and from the CPU on which the task_struct structure's task is running. This commit therefore marks the accesses appropriately. Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-11-23rcutorture: Add fqs_holdoff check before fqs_task is createdZqiang1-1/+3
For rcutorture tests on RCU implementations that support force-quiescent-state operations and that set the fqs_duration module parameter greater than zero, the fqs_task kthread will be created. However, if the fqs_holdoff module parameter is not set, then its default value of zero will cause fqs_task enter a long-term busy loop until stopped by kthread_stop(). This commit therefore adds a fqs_holdoff check before the fqs_task is created, making sure that whenever the fqs_task is created, the fqs_holdoff will be greater than zero. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-11-08Merge tag 'rcu-fixes-v6.7' of ↵Linus Torvalds3-14/+63
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull RCU fixes from Frederic Weisbecker: - Fix a lock inversion between scheduler and RCU introduced in v6.2-rc4. The scenario could trigger on any user of RCU_NOCB (mostly Android but also nohz_full) - Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking some RCU-Tasks and RCU-Tasks-Trace expectations as to what exactly is an idle task. This resulted in potential spurious stalls and warnings. * tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: rcu/tasks-trace: Handle new PF_IDLE semantics rcu/tasks: Handle new PF_IDLE semantics rcu: Introduce rcu_cpu_online() rcu: Break rcu_node_0 --> &rq->__lock order
2023-11-03Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds2-18/+22
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-02rcu/tasks-trace: Handle new PF_IDLE semanticsFrederic Weisbecker1-1/+1
The commit: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") has changed the semantics of what is to be considered an idle task in such a way that the idle task of an offline CPU may not carry the PF_IDLE flag anymore. However RCU-tasks-trace tests the opposite assertion, still assuming that idle tasks carry the PF_IDLE flag during their whole lifecycle. Remove this assumption to avoid spurious warnings but keep the initial test verifying that the idle task is the current task on any offline CPU. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") Suggested-by: Joel Fernandes <joel@joelfernandes.org> Suggested-by: Paul E . McKenney" <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-11-02rcu/tasks: Handle new PF_IDLE semanticsFrederic Weisbecker1-2/+28
The commit: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") has changed the semantics of what is to be considered an idle task in such a way that CPU boot code preceding the actual idle loop is excluded from it. This has however introduced new potential RCU-tasks stalls when either: 1) Grace period is started before init/0 had a chance to set PF_IDLE, keeping it stuck in the holdout list until idle ever schedules. 2) Grace period is started when some possible CPUs have never been online, keeping their idle tasks stuck in the holdout list until the CPU ever boots up. 3) Similar to 1) but with secondary CPUs: Grace period is started concurrently with secondary CPU booting, putting its idle task in the holdout list because PF_IDLE isn't yet observed on it. It stays then stuck in the holdout list until that CPU ever schedules. The effect is mitigated here by the hotplug AP thread that must run to bring the CPU up. Fix this with handling the new semantics of PF_IDLE, keeping in mind that it may or may not be set on an idle task. Take advantage of that to strengthen the coverage of an RCU-tasks quiescent state within an idle task, excluding the CPU boot code from it. Only the code running within the idle loop is now a quiescent state, along with offline CPUs. Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") Suggested-by: Joel Fernandes <joel@joelfernandes.org> Suggested-by: Paul E . McKenney" <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-11-02rcu: Introduce rcu_cpu_online()Frederic Weisbecker2-0/+9
Export the RCU point of view as to when a CPU is considered offline (ie: when does RCU consider that a CPU is sufficiently down in the hotplug process to not feature any possible read side). This will be used by RCU-tasks whose vision of an offline CPU should reasonably match the one of RCU core. Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-11-01rcu: Break rcu_node_0 --> &rq->__lock orderPeter Zijlstra1-11/+25
Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") added a kfree() call to free any user provided affinity mask, if present. It was changed later to use kfree_rcu() in commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()") to avoid a circular locking dependency problem. It turns out that even kfree_rcu() isn't safe for avoiding circular locking problem. As reported by kernel test robot, the following circular locking dependency now exists: &rdp->nocb_lock --> rcu_node_0 --> &rq->__lock Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving the resched_cpu() out from under rcu_node lock. [peterz: heavily borrowed from Waiman's Changelog] [paulmck: applied Z qiang feedback] Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-23Merge branches 'rcu/torture', 'rcu/fixes', 'rcu/docs', 'rcu/refscale', ↵Frederic Weisbecker12-141/+377
'rcu/tasks' and 'rcu/stall' into rcu/next rcu/torture: RCU torture, locktorture and generic torture infrastructure rcu/fixes: Generic and misc fixes rcu/docs: RCU documentation updates rcu/refscale: RCU reference scalability test updates rcu/tasks: RCU tasks updates rcu/stall: Stall detection updates
2023-10-13srcu: Only accelerate on enqueue timeFrederic Weisbecker1-4/+2
Acceleration in SRCU happens on enqueue time for each new callback. This operation is expected not to fail and therefore any similar attempt from other places shouldn't find any remaining callbacks to accelerate. Moreover accelerations performed beyond enqueue time are error prone because rcu_seq_snap() then may return the snapshot for a new grace period that is not going to be started. Remove these dangerous and needless accelerations and introduce instead assertions reporting leaking unaccelerated callbacks beyond enqueue time. Co-developed-by: Yong He <alexyonghe@tencent.com> Signed-off-by: Yong He <alexyonghe@tencent.com> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Reviewed-by: Like Xu <likexu@tencent.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-10srcu: Fix callbacks acceleration mishandlingFrederic Weisbecker1-2/+29
SRCU callbacks acceleration might fail if the preceding callbacks advance also fails. This can happen when the following steps are met: 1) The RCU_WAIT_TAIL segment has callbacks (say for gp_num 8) and the RCU_NEXT_READY_TAIL also has callbacks (say for gp_num 12). 2) The grace period for RCU_WAIT_TAIL is observed as started but not yet completed so rcu_seq_current() returns 4 + SRCU_STATE_SCAN1 = 5. 3) This value is passed to rcu_segcblist_advance() which can't move any segment forward and fails. 4) srcu_gp_start_if_needed() still proceeds with callback acceleration. But then the call to rcu_seq_snap() observes the grace period for the RCU_WAIT_TAIL segment (gp_num 8) as completed and the subsequent one for the RCU_NEXT_READY_TAIL segment as started (ie: 8 + SRCU_STATE_SCAN1 = 9) so it returns a snapshot of the next grace period, which is 16. 5) The value of 16 is passed to rcu_segcblist_accelerate() but the freshly enqueued callback in RCU_NEXT_TAIL can't move to RCU_NEXT_READY_TAIL which already has callbacks for a previous grace period (gp_num = 12). So acceleration fails. 6) Note in all these steps, srcu_invoke_callbacks() hadn't had a chance to run srcu_invoke_callbacks(). Then some very bad outcome may happen if the following happens: 7) Some other CPU races and starts the grace period number 16 before the CPU handling previous steps had a chance. Therefore srcu_gp_start() isn't called on the latter sdp to fix the acceleration leak from previous steps with a new pair of call to advance/accelerate. 8) The grace period 16 completes and srcu_invoke_callbacks() is finally called. All the callbacks from previous grace periods (8 and 12) are correctly advanced and executed but callbacks in RCU_NEXT_READY_TAIL still remain. Then rcu_segcblist_accelerate() is called with a snaphot of 20. 9) Since nothing started the grace period number 20, callbacks stay unhandled. This has been reported in real load: [3144162.608392] INFO: task kworker/136:12:252684 blocked for more than 122 seconds. [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] [3144162.631192] Call Trace: [3144162.631202] __schedule+0x2ee/0x660 [3144162.631206] schedule+0x33/0xa0 [3144162.631209] schedule_timeout+0x1c4/0x340 [3144162.631214] ? update_load_avg+0x82/0x660 [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 [3144162.631218] wait_for_completion+0x119/0x180 [3144162.631220] ? wake_up_q+0x80/0x80 [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 [3144162.631227] synchronize_srcu+0x5f/0xc0 [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] [3144162.631239] ? __schedule+0x2f6/0x660 [3144162.631243] process_one_work+0x19a/0x3a0 [3144162.631244] worker_thread+0x37/0x3a0 [3144162.631247] kthread+0x117/0x140 [3144162.631247] ? process_one_work+0x3a0/0x3a0 [3144162.631248] ? __kthread_cancel_work+0x40/0x40 [3144162.631250] ret_from_fork+0x1f/0x30 Fix this with taking the snapshot for acceleration _before_ the read of the current grace period number. The only side effect of this solution is that callbacks advancing happen then _after_ the full barrier in rcu_seq_snap(). This is not a problem because that barrier only cares about: 1) Ordering accesses of the update side before call_srcu() so they don't bleed. 2) See all the accesses prior to the grace period of the current gp_num The only things callbacks advancing need to be ordered against are carried by snp locking. Reported-by: Yong He <alexyonghe@tencent.com> Co-developed-by:: Yong He <alexyonghe@tencent.com> Signed-off-by: Yong He <alexyonghe@tencent.com> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Link: http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com Fixes: da915ad5cf25 ("srcu: Parallelize callback handling") Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04rcu: Standardize explicit CPU-hotplug callsFrederic Weisbecker1-4/+8
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug functions. Also rcu_cpu_starting() and rcu_report_dead() have different naming conventions while they mirror each other's effects. Fix the headers and propose a naming that relates both functions and aligns with the prefix of other rcutree CPU-hotplug functions. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04rcu: Conditionally build CPU-hotplug teardown callbacksFrederic Weisbecker1-60/+54
Among the three CPU-hotplug teardown RCU callbacks, two of them early exit if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case all of them have an implementation when CONFIG_HOTPLUG_CPU=n. Align instead with the common way to deal with CPU-hotplug teardown callbacks and provide a proper stub when they are not supported. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04rcu: dynamically allocate the rcu-kfree shrinkerQi Zheng1-9/+12
Use new APIs to dynamically allocate the rcu-kfree shrinker. Link: https://lkml.kernel.org/r/20230911094444.68966-17-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04rcu: dynamically allocate the rcu-lazy shrinkerQi Zheng1-9/+10
Use new APIs to dynamically allocate the rcu-lazy shrinker. Link: https://lkml.kernel.org/r/20230911094444.68966-16-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04rcu: Assume rcu_report_dead() is always called locallyFrederic Weisbecker1-2/+2
rcu_report_dead() has to be called locally by the CPU that is going to exit the RCU state machine. Passing a cpu argument here is error-prone and leaves the possibility for a racy remote call. Use local access instead. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04rcu: Assume IRQS disabled from rcu_report_dead()Frederic Weisbecker1-4/+6
rcu_report_dead() is the last RCU word from the CPU down through the hotplug path. It is called in the idle loop right before the CPU shuts down for good. Because it removes the CPU from the grace period state machine and reports an ultimate quiescent state if necessary, no further use of RCU is allowed. Therefore it is expected that IRQs are disabled upon calling this function and are not to be re-enabled again until the CPU shuts down. Remove the IRQs disablement from that function and verify instead that it is actually called with IRQs disabled as it is expected at that special point in the idle path. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04rcu: Use rcu_segcblist_segempty() instead of open coding itFrederic Weisbecker1-2/+2
This makes the code more readable. Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objectsCatalin Marinas1-0/+9
Since the actual slab freeing is deferred when calling kvfree_rcu(), so is the kmemleak_free() callback informing kmemleak of the object deletion. From the perspective of the kvfree_rcu() caller, the object is freed and it may remove any references to it. Since kmemleak does not scan RCU internal data storing the pointer, it will report such objects as leaks during the grace period. Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note that the tiny RCU implementation does not have such issue since the objects can be tracked from the rcu_ctrlblk structure. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/ Cc: <stable@vger.kernel.org> Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-26srcu: Fix srcu_struct node grpmask overflow on 64-bit systemsDenis Arefev1-2/+2
The value of a bitwise expression 1 << (cpu - sdp->mynode->grplo) is subject to overflow due to a failure to cast operands to a larger data type before performing the bitwise operation. The maximum result of this subtraction is defined by the RCU_FANOUT_LEAF Kconfig option, which on 64-bit systems defaults to 16 (resulting in a maximum shift of 15), but which can be set up as high as 64 (resulting in a maximum shift of 63). A value of 31 can result in sign extension, resulting in 0xffffffff80000000 instead of the desired 0x80000000. A value of 32 or greater triggers undefined behavior per the C standard. This bug has not been known to cause issues because almost all kernels take the default CONFIG_RCU_FANOUT_LEAF=16. Furthermore, as long as a given compiler gives a deterministic non-zero result for 1<<N for N>=32, the code correctly invokes all SRCU callbacks, albeit wasting CPU time along the way. This commit therefore substitutes the correct 1UL for the buggy 1. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Denis Arefev <arefev@swemel.ru> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-24rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle()Zqiang1-1/+1
Currently, the maxcpu is set by traversing online CPUs, however, if the rcutorture.onoff_holdoff is set zero and onoff_interval is set non-zero, and the some CPUs with larger cpuid has been offline before setting maxcpu, for these CPUs, even if they are online again, also cannot be offload or deoffload. This can result in rcutorture attempting to (de-)offload CPUs that have never been online, but the (de-)offload code handles this. This commit therefore use for_each_possible_cpu() instead of for_each_online_cpu() in rcu_nocb_toggle(). Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-24rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20Joel Fernandes (Google)1-6/+6
In the past, spinning on schedule_timeout* with a wait of 1 jiffy has hung the kernel. See for example d52d3a2bf408 ("torture: Fix hang during kthread shutdown phase"). This issue recently recurred in torture's stutter code. The result is that the function instantly returns and never goes to sleep, preempting whatever might otherwise make useful forward progress. To prevent future issues, apply the commit-d52d3a2bf408 fix throughout rcutorture, moving from a 1-jiffy wait to a 50-millisecond wait. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-24torture: Move rcutorture_sched_setaffinity() out of rcutorturePaul E. McKenney3-9/+5
The rcutorture_sched_setaffinity() function is needed by locktorture, so move its declaration from rcu.h to torture.h and rename it to the more generic torture_sched_setaffinity() name. Please note that use of this function is still restricted to torture tests, and of those, currently only rcutorture and locktorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-24rcu: Include torture_sched_setaffinity() declarationArnd Bergmann1-0/+1
The prototype for torture_sched_setaffinity() will be moved to a different header, which will need to be included from update.c to avoid this W=1 warning: kernel/rcu/update.c:529:6: error: no previous prototype for 'torture_sched_setaffinity' [-Werror=missing-prototypes] 529 | long torture_sched_setaffinity(pid_t pid, const struct cpumask *in_mask) Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-13rcu: Eliminate rcu_gp_slow_unregister() false positivePaul E. McKenney1-1/+1
When using rcutorture as a module, there are a number of conditions that can abort the modprobe operation, for example, when attempting to run both RCU CPU stall warning tests and forward-progress tests. This can cause rcu_torture_cleanup() to be invoked on the unwind path out of rcu_rcu_torture_init(), which will mean that rcu_gp_slow_unregister() is invoked without a matching rcu_gp_slow_register(). This will cause a splat because rcu_gp_slow_unregister() is passed rcu_fwd_cb_nodelay, which does not match a NULL pointer. This commit therefore forgives a mismatch involving a NULL pointer, thus avoiding this false-positive splat. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-13rcu: Dump memory object info if callback function is invalidZhen Lei6-0/+12
When a structure containing an RCU callback rhp is (incorrectly) freed and reallocated after rhp is passed to call_rcu(), it is not unusual for rhp->func to be set to NULL. This defeats the debugging prints used by __call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which expect to identify the offending code using the identity of this function. And in kernels build without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things are even worse, as can be seen from this splat: Unable to handle kernel NULL pointer dereference at virtual address 0 ... ... PC is at 0x0 LR is at rcu_do_batch+0x1c0/0x3b8 ... ... (rcu_do_batch) from (rcu_core+0x1d4/0x284) (rcu_core) from (__do_softirq+0x24c/0x344) (__do_softirq) from (__irq_exit_rcu+0x64/0x108) (__irq_exit_rcu) from (irq_exit+0x8/0x10) (irq_exit) from (__handle_domain_irq+0x74/0x9c) (__handle_domain_irq) from (gic_handle_irq+0x8c/0x98) (gic_handle_irq) from (__irq_svc+0x5c/0x94) (__irq_svc) from (arch_cpu_idle+0x20/0x3c) (arch_cpu_idle) from (default_idle_call+0x4c/0x78) (default_idle_call) from (do_idle+0xf8/0x150) (do_idle) from (cpu_startup_entry+0x18/0x20) (cpu_startup_entry) from (0xc01530) This commit therefore adds calls to mem_dump_obj(rhp) to output some information, for example: slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256 This provides the rough size of the memory block and the offset of the rcu_head structure, which as least provides at least a few clues to help locate the problem. If the problem is reproducible, additional slab debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can provide significantly more information. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-13rcu: Add sysfs to provide throttled access to rcu_barrier()Paul E. McKenney1-0/+76
When running a series of stress tests all making heavy use of RCU, it is all too possible to OOM the system when the prior test's RCU callbacks don't get invoked until after the subsequent test starts. One way of handling this is just a timed wait, but this fails when a given CPU has so many callbacks queued that they take longer to invoke than allowed for by that timed wait. This commit therefore adds an rcutree.do_rcu_barrier module parameter that is accessible from sysfs. Writing one of the many synonyms for boolean "true" will cause an rcu_barrier() to be invoked, but will guarantee that no more than one rcu_barrier() will be invoked per sixteenth of a second via this mechanism. The flip side is that a given request might wait a second or three longer than absolutely necessary, but only when there are multiple uses of rcutree.do_rcu_barrier within a one-second time interval. This commit unnecessarily serializes the rcu_barrier() machinery, given that serialization is already provided by procfs. This has the advantage of allowing throttled rcu_barrier() from other sources within the kernel. Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-13rcu/tree: Remove superfluous return from void call_rcu* functionsJoel Fernandes (Google)1-2/+2
The return keyword is not needed here. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-13srcu: Fix error handling in init_srcu_struct_fields()Joel Fernandes (Google)1-15/+17
The current error handling in init_srcu_struct_fields() is a bit inconsistent. If init_srcu_struct_nodes() fails, the function either returns -ENOMEM or 0 depending on whether ssp->sda_is_static is true or false. This can make init_srcu_struct_fields() return 0 even if memory allocation failed! Simplify the error handling by always returning -ENOMEM if either init_srcu_struct_nodes() or the per-CPU allocation fails. This makes the control flow easier to follow and avoids the inconsistent return values. Add goto labels to avoid duplicating the error cleanup code. Link: https://lore.kernel.org/r/20230404003508.GA254019@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-12refscale: Print out additional module parametersPaul E. McKenney1-2/+2
The refscale.verbose_batched and refscale.lookup_instances module parameters are omitted from the ref_scale_print_module_parms() beginning-of-test output. This commit therefore adds them. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-12refscale: Fix misplaced data re-readPaul E. McKenney1-1/+1
This commit fixes a misplaced data re-read in the typesafe code. The reason that this was not noticed is that this is a performance test with no writers, so a mismatch could not occur. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu-tasks: Make rcu_tasks_lazy_ms staticJiapeng Chong1-1/+1
The rcu_tasks_lazy_ms variable is not used outside the file tasks.h, so this commit marks it static. kernel/rcu/tasks.h:1085:5: warning: symbol 'rcu_tasks_lazy_ms' was not declared. Should it be static? Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6086 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu-tasks: Pull sampling of ->percpu_dequeue_lim out of loopPaul E. McKenney1-1/+3
The rcu_tasks_need_gpcb() samples ->percpu_dequeue_lim as part of the condition clause of a "for" loop, which is a bit confusing. This commit therefore hoists this sampling out of the loop, using the result loaded in the condition clause. So why does this work in the face of a concurrent switch from single-CPU queueing to per-CPU queueing? o The call_rcu_tasks_generic() that makes the change has already enqueued its callback, which means that all of the other CPU's callback queues are empty. o For the call_rcu_tasks_generic() that first notices the switch to per-CPU queues, the smp_store_release() used to update ->percpu_enqueue_lim pairs with the raw_spin_trylock_rcu_node()'s full barrier that is between the READ_ONCE(rtp->percpu_enqueue_shift) and the rcu_segcblist_enqueue() that enqueues the callback. o Because this CPU's queue is empty (unless it happens to be the original single queue, in which case there is no need for synchronization), this call_rcu_tasks_generic() will do an irq_work_queue() to schedule a handler for the needed rcuwait_wake_up() call. This call will be ordered after the first call_rcu_tasks_generic() function's change to ->percpu_dequeue_lim. o This rcuwait_wake_up() will either happen before or after the set_current_state() in rcuwait_wait_event(). If it happens before, the "condition" argument's call to rcu_tasks_need_gpcb() will be ordered after the original change, and all callbacks on all CPUs will be visible. Otherwise, if it happens after, then the grace-period kthread's state will be set back to running, which will result in a later call to rcuwait_wait_event() and thus to rcu_tasks_need_gpcb(), which will again see the change. So it all works out. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu-tasks: Add printk()s to localize boot-time self-test hangPaul E. McKenney1-1/+3
Currently, rcu_tasks_initiate_self_tests() prints a message and then initiates self tests on up to three different RCU Tasks flavors. If one of the flavors has a grace-period hang, it is not easy to work out which of the three hung. This commit therefore prints a message prior to each individual test. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu/tree: Defer setting of jiffies during stall resetJoel Fernandes (Google)3-2/+34
There are instances where rcu_cpu_stall_reset() is called when jiffies did not get a chance to update for a long time. Before jiffies is updated, the CPU stall detector can go off triggering false-positives where a just-started grace period appears to be ages old. In the past, we disabled stall detection in rcu_cpu_stall_reset() however this got changed [1]. This is resulting in false-positives in KGDB usecase [2]. Fix this by deferring the update of jiffies to the third run of the FQS loop. This is more robust, as, even if rcu_cpu_stall_reset() is called just before jiffies is read, we would end up pushing out the jiffies read by 3 more FQS loops. Meanwhile the CPU stall detection will be delayed and we will not get any false positives. [1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/ [2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/ Tested with rcutorture.cpu_stall option as well to verify stall behavior with/without patch. Tested-by: Huacai Chen <chenhuacai@loongson.cn> Reported-by: Binbin Zhou <zhoubinbin@loongson.cn> Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/ Suggested-by: Paul McKenney <paulmck@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcutorture: Add test of RCU CPU stall notifiersPaul E. McKenney1-0/+21
This commit registers an RCU CPU stall notifier when testing RCU CPU stalls. The notifier logs a message similar to the following: rcu_torture_stall_nf: v=1, duration=21001. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu: Add RCU CPU stall notifierPaul E. McKenney3-2/+69
It is sometimes helpful to have a way for the subsystem causing the stall to dump its state when an RCU CPU stall occurs. This commit therefore bases rcu_stall_chain_notifier_register() and rcu_stall_chain_notifier_unregister() on atomic notifiers in order to provide this functionality. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu: Eliminate check_cpu_stall() duplicate codeZhen Lei1-27/+15
The code and comments of self-detected and other-detected RCU CPU stall warnings are identical except the output function. This commit therefore refactors so as to consolidate the duplicate code. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu: Don't redump the stalled CPU where RCU GP kthread last ranZhen Lei1-1/+3
The stacks of all stalled CPUs will be dumped in rcu_dump_cpu_stacks(). If the CPU on where RCU GP kthread last ran is stalled, its stack does not need to be dumped again. We can search the corresponding backtrace based on the printed CPU ID. For example: [ 87.328275] rcu: rcu_sched kthread starved for ... ->cpu=3 <--------| ... ... | [ 89.385007] NMI backtrace for cpu 3 <--------| [ 89.385179] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.10.0+ #22 <--| [ 89.385188] Hardware name: linux,dummy-virt (DT) [ 89.385196] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) [ 89.385204] pc : arch_cpu_idle+0x40/0xc0 [ 89.385211] lr : arch_cpu_idle+0x2c/0xc0 ... ... [ 89.385566] Call trace: [ 89.385574] arch_cpu_idle+0x40/0xc0 [ 89.385581] default_idle_call+0x100/0x450 [ 89.385589] cpuidle_idle_call+0x2f8/0x460 [ 89.385596] do_idle+0x1dc/0x3d0 [ 89.385604] cpu_startup_entry+0x5c/0xb0 [ 89.385613] secondary_start_kernel+0x35c/0x520 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11rcu: Delete a redundant check in rcu_check_gp_kthread_starvation()Zhen Lei1-7/+5
The rcu_check_gp_kthread_starvation() function uses task_cpu() to sample the last CPU that the grace-period kthread ran on, and task_cpu() samples the thread_info structure's ->cpu field. But this field will always contain a number corresponding to a CPU that was online some time in the past, thus never a negative number. This invariant is checked by a WARN_ON_ONCE() in set_task_cpu(). This means that if the grace-period kthread exists, that is, if the "gpk" local variable is non-NULL, the "cpu" local variable will be non-negative. This in turn means that the existing check for non-negative "cpu" is redundant with the enclosing check for non-NULL "gpk". This commit threefore removes the redundant check of "cpu". Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-01Merge tag 'tty-6.6-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Here is the big set of tty and serial driver changes for 6.6-rc1. Lots of cleanups in here this cycle, and some driver updates. Short summary is: - Jiri's continued work to make the tty code and apis be a bit more sane with regards to modern kernel coding style and types - cpm_uart driver updates - n_gsm updates and fixes - meson driver updates - sc16is7xx driver updates - 8250 driver updates for different hardware types - qcom-geni driver fixes - tegra serial driver change - stm32 driver updates - synclink_gt driver cleanups - tty structure size reduction All of these have been in linux-next this week with no reported issues. The last bit of cleanups from Jiri and the tty structure size reduction came in last week, a bit late but as they were just style changes and size reductions, I figured they should get into this merge cycle so that others can work on top of them with no merge conflicts" * tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits) tty: shrink the size of struct tty_struct by 40 bytes tty: n_tty: deduplicate copy code in n_tty_receive_buf_real_raw() tty: n_tty: extract ECHO_OP processing to a separate function tty: n_tty: unify counts to size_t tty: n_tty: use u8 for chars and flags tty: n_tty: simplify chars_in_buffer() tty: n_tty: remove unsigned char casts from character constants tty: n_tty: move newline handling to a separate function tty: n_tty: move canon handling to a separate function tty: n_tty: use MASK() for masking out size bits tty: n_tty: make n_tty_data::num_overrun unsigned tty: n_tty: use time_is_before_jiffies() in n_tty_receive_overrun() tty: n_tty: use 'num' for writes' counts tty: n_tty: use output character directly tty: n_tty: make flow of n_tty_receive_buf_common() a bool Revert "tty: serial: meson: Add a earlycon for the T7 SoC" Documentation: devices.txt: Fix minors for ttyCPM* Documentation: devices.txt: Remove ttySIOC* Documentation: devices.txt: Remove ttyIOC* serial: 8250_bcm7271: improve bcm7271 8250 port ...
2023-08-29Merge tag 'net-next-6.6' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Increase size limits for to-be-sent skb frag allocations. This allows tun, tap devices and packet sockets to better cope with large writes operations - Store netdevs in an xarray, to simplify iterating over netdevs - Refactor nexthop selection for multipath routes - Improve sched class lifetime handling - Add backup nexthop ID support for bridge - Implement drop reasons support in openvswitch - Several data races annotations and fixes - Constify the sk parameter of routing functions - Prepend kernel version to netconsole message Protocols: - Implement support for TCP probing the peer being under memory pressure - Remove hard coded limitation on IPv6 specific info placement inside the socket struct - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per socket scaling factor - Scaling-up the IPv6 expired route GC via a separated list of expiring routes - In-kernel support for the TLS alert protocol - Better support for UDP reuseport with connected sockets - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR header size - Get rid of additional ancillary per MPTCP connection struct socket - Implement support for BPF-based MPTCP packet schedulers - Format MPTCP subtests selftests results in TAP - Several new SMC 2.1 features including unique experimental options, max connections per lgr negotiation, max links per lgr negotiation BPF: - Multi-buffer support in AF_XDP - Add multi uprobe BPF links for attaching multiple uprobes and usdt probes, which is significantly faster and saves extra fds - Implement an fd-based tc BPF attach API (TCX) and BPF link support on top of it - Add SO_REUSEPORT support for TC bpf_sk_assign - Support new instructions from cpu v4 to simplify the generated code and feature completeness, for x86, arm64, riscv64 - Support defragmenting IPv(4|6) packets in BPF - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling - Introduce bpf map element count and enable it for all program types - Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress - Add uprobe support for the bpf_get_func_ip helper - Check skb ownership against full socket - Support for up to 12 arguments in BPF trampoline - Extend link_info for kprobe_multi and perf_event links Netfilter: - Speed-up process exit by aborting ruleset validation if a fatal signal is pending - Allow NLA_POLICY_MASK to be used with BE16/BE32 types Driver API: - Page pool optimizations, to improve data locality and cache usage - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need for raw ioctl() handling in drivers - Simplify genetlink dump operations (doit/dumpit) providing them the common information already populated in struct genl_info - Extend and use the yaml devlink specs to [re]generate the split ops - Introduce devlink selective dumps, to allow SF filtering SF based on handle and other attributes - Add yaml netlink spec for netlink-raw families, allow route, link and address related queries via the ynl tool - Remove phylink legacy mode support - Support offload LED blinking to phy - Add devlink port function attributes for IPsec New hardware / drivers: - Ethernet: - Broadcom ASP 2.0 (72165) ethernet controller - MediaTek MT7988 SoC - Texas Instruments AM654 SoC - Texas Instruments IEP driver - Atheros qca8081 phy - Marvell 88Q2110 phy - NXP TJA1120 phy - WiFi: - MediaTek mt7981 support - Can: - Kvaser SmartFusion2 PCI Express devices - Allwinner T113 controllers - Texas Instruments tcan4552/4553 chips - Bluetooth: - Intel Gale Peak - Qualcomm WCN3988 and WCN7850 - NXP AW693 and IW624 - Mediatek MT2925 Drivers: - Ethernet NICs: - nVidia/Mellanox: - mlx5: - support UDP encapsulation in packet offload mode - IPsec packet offload support in eswitch mode - improve aRFS observability by adding new set of counters - extends MACsec offload support to cover RoCE traffic - dynamic completion EQs - mlx4: - convert to use auxiliary bus instead of custom interface logic - Intel - ice: - implement switchdev bridge offload, even for LAG interfaces - implement SRIOV support for LAG interfaces - igc: - add support for multiple in-flight TX timestamps - Broadcom: - bnxt: - use the unified RX page pool buffers for XDP and non-XDP - use the NAPI skb allocation cache - OcteonTX2: - support Round Robin scheduling HTB offload - TC flower offload support for SPI field - Freescale: - add XDP_TX feature support - AMD: - ionic: add support for PCI FLR event - sfc: - basic conntrack offload - introduce eth, ipv4 and ipv6 pedit offloads - ST Microelectronics: - stmmac: maximze PTP timestamping resolution - Virtual NICs: - Microsoft vNIC: - batch ringing RX queue doorbell on receiving packets - add page pool for RX buffers - Virtio vNIC: - add per queue interrupt coalescing support - Google vNIC: - add queue-page-list mode support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add port range matching tc-flower offload - permit enslavement to netdevices with uppers - Ethernet embedded switches: - Marvell (mv88e6xxx): - convert to phylink_pcs - Renesas: - r8A779fx: add speed change support - rzn1: enables vlan support - Ethernet PHYs: - convert mv88e6xxx to phylink_pcs - WiFi: - Qualcomm Wi-Fi 7 (ath12k): - extremely High Throughput (EHT) PHY support - RealTek (rtl8xxxu): - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU), RTL8192EU and RTL8723BU - RealTek (rtw89): - Introduce Time Averaged SAR (TAS) support - Connector: - support for event filtering" * tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits) net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler net: stmmac: clarify difference between "interface" and "phy_interface" r8152: add vendor/device ID pair for D-Link DUB-E250 devlink: move devlink_notify_register/unregister() to dev.c devlink: move small_ops definition into netlink.c devlink: move tracepoint definitions into core.c devlink: push linecard related code into separate file devlink: push rate related code into separate file devlink: push trap related code into separate file devlink: use tracepoint_enabled() helper devlink: push region related code into separate file devlink: push param related code into separate file devlink: push resource related code into separate file devlink: push dpipe related code into separate file devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper devlink: push shared buffer related code into separate file devlink: push port related code into separate file devlink: push object register/unregister notifications into separate helpers inet: fix IP_TRANSPARENT error handling ...
2023-08-17Merge branches 'doc.2023.07.14b', 'fixes.2023.08.16a', ↵Paul E. McKenney7-39/+252
'rcu-tasks.2023.07.24a', 'rcuscale.2023.07.14b', 'refscale.2023.07.14b', 'torture.2023.08.14a' and 'torturescripts.2023.07.20a' into HEAD doc.2023.07.14b: Documentation updates. fixes.2023.08.16a: Miscellaneous fixes. rcu-tasks.2023.07.24a: RCU Tasks updates. rcuscale.2023.07.14b: RCU (updater) scalability test updates. refscale.2023.07.14b: Reference (reader) scalability test updates. torture.2023.08.14a: Other torture-test updates. torturescripts.2023.07.20a: Other torture-test scripting updates.