summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-04-27perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabledZhipeng Xie3-6/+6
[ Upstream commit 60490e7966659b26d74bf1fa4aa8693d9a94ca88 ] This problem can be reproduced with CONFIG_PERF_USE_VMALLOC enabled on both x86_64 and aarch64 arch when using sysdig -B(using ebpf)[1]. sysdig -B works fine after rebuilding the kernel with CONFIG_PERF_USE_VMALLOC disabled. I tracked it down to the if condition event->rb->nr_pages != nr_pages in perf_mmap is true when CONFIG_PERF_USE_VMALLOC is enabled where event->rb->nr_pages = 1 and nr_pages = 2048 resulting perf_mmap to return -EINVAL. This is because when CONFIG_PERF_USE_VMALLOC is enabled, rb->nr_pages is always equal to 1. Arch with CONFIG_PERF_USE_VMALLOC enabled by default: arc/arm/csky/mips/sh/sparc/xtensa Arch with CONFIG_PERF_USE_VMALLOC disabled by default: x86_64/aarch64/... Fix this problem by using data_page_nr() [1] https://github.com/draios/sysdig Fixes: 906010b2134e ("perf_event: Provide vmalloc() based mmap() backing") Signed-off-by: Zhipeng Xie <xiezhipeng1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220209145417.6495-1-xiezhipeng1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-27sched/pelt: Fix attach_entity_load_avg() corner casekuyo chang1-5/+5
[ Upstream commit 40f5aa4c5eaebfeaca4566217cb9c468e28ed682 ] The warning in cfs_rq_is_decayed() triggered: SCHED_WARN_ON(cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || cfs_rq->avg.runnable_avg) There exists a corner case in attach_entity_load_avg() which will cause load_sum to be zero while load_avg will not be. Consider se_weight is 88761 as per the sched_prio_to_weight[] table. Further assume the get_pelt_divider() is 47742, this gives: se->avg.load_avg is 1. However, calculating load_sum: se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se)); se->avg.load_sum = 1*47742/88761 = 0. Then enqueue_load_avg() adds this to the cfs_rq totals: cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; Resulting in load_avg being 1 with load_sum is 0, which will trigger the WARN. Fixes: f207934fb79d ("sched/fair: Align PELT windows between cfs_rq and its se") Signed-off-by: kuyo chang <kuyo.chang@mediatek.com> [peterz: massage changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-27dma-mapping: remove bogus test for pfn_valid from dma_map_resourceMike Rapoport1-4/+0
commit a9c38c5d267cb94871dfa2de5539c92025c855d7 upstream. dma_map_resource() uses pfn_valid() to ensure the range is not RAM. However, pfn_valid() only checks for availability of the memory map for a PFN but it does not ensure that the PFN is actually backed by RAM. As dma_map_resource() is the only method in DMA mapping APIs that has this check, simply drop the pfn_valid() test from dma_map_resource(). Link: https://lore.kernel.org/all/20210824173741.GC623@arm.com/ Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210930013039.11260-2-rppt@kernel.org Signed-off-by: Will Deacon <will@kernel.org> Fixes: 859a85ddf90e ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE") Link: https://lore.kernel.org/r/Yl0IZWT2nsiYtqBT@linux.ibm.com Signed-off-by: Georgi Djakov <quic_c_gdjako@quicinc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_stateSteven Price1-18/+18
commit b7ba6d8dc3569e49800ef0136799f26f43e237e8 upstream. Currently the setting of the 'cpu' member of struct cpuhp_cpu_state in cpuhp_create() is too late as it is used earlier in _cpu_up(). If kzalloc_node() in __smpboot_create_thread() fails then the rollback will be done with st->cpu==0 causing CPU0 to be erroneously set to be dying, causing the scheduler to get mightily confused and throw its toys out of the pram. However the cpu number is actually available directly, so simply remove the 'cpu' member and avoid the problem in the first place. Fixes: 2ea46c6fc945 ("cpumask/hotplug: Fix cpu_dying() state tracking") Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220411152233.474129-2-steven.price@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20dma-direct: avoid redundant memory sync for swiotlbChao Gao1-1/+2
commit 9e02977bfad006af328add9434c8bffa40e053bb upstream. When we looked into FIO performance with swiotlb enabled in VM, we found swiotlb_bounce() is always called one more time than expected for each DMA read request. It turns out that the bounce buffer is copied to original DMA buffer twice after the completion of a DMA request (one is done by in dma_direct_sync_single_for_cpu(), the other by swiotlb_tbl_unmap_single()). But the content in bounce buffer actually doesn't change between the two rounds of copy. So, one round of copy is redundant. Pass DMA_ATTR_SKIP_CPU_SYNC flag to swiotlb_tbl_unmap_single() to skip the memory copy in it. This fix increases FIO 64KB sequential read throughput in a guest with swiotlb=force by 5.6%. Fixes: 55897af63091 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code") Reported-by: Wang Zhaoyang1 <zhaoyang1.wang@intel.com> Reported-by: Gao Liang <liang.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20timers: Fix warning condition in __run_timers()Anna-Maria Behnsen1-4/+7
commit c54bc0fc84214b203f7a0ebfd1bd308ce2abe920 upstream. When the timer base is empty, base::next_expiry is set to base::clk + NEXT_TIMER_MAX_DELTA and base::next_expiry_recalc is false. When no timer is queued until jiffies reaches base::next_expiry value, the warning for not finding any expired timer and base::next_expiry_recalc is false in __run_timers() triggers. To prevent triggering the warning in this valid scenario base::timers_pending needs to be added to the warning condition. Fixes: 31cd0e119d50 ("timers: Recalculate next timer interrupt only when necessary") Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220405191732.7438-3-anna-maria@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20smp: Fix offline cpu check in flush_smp_call_function_queue()Nadav Amit1-1/+1
commit 9e949a3886356fe9112c6f6f34a6e23d1d35407f upstream. The check in flush_smp_call_function_queue() for callbacks that are sent to offline CPUs currently checks whether the queue is empty. However, flush_smp_call_function_queue() has just deleted all the callbacks from the queue and moved all the entries into a local list. This checks would only be positive if some callbacks were added in the short time after llist_del_all() was called. This does not seem to be the intention of this check. Change the check to look at the local list to which the entries were moved instead of the queue from which all the callbacks were just removed. Fixes: 8d056c48e4862 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline") Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220319072015.1495036-1-namit@vmware.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20tick/nohz: Use WARN_ON_ONCE() to prevent console saturationPaul Gortmaker1-1/+1
commit 40e97e42961f8c6cc7bd5fe67cc18417e02d78f1 upstream. While running some testing on code that happened to allow the variable tick_nohz_full_running to get set but with no "possible" NOHZ cores to back up that setting, this warning triggered: if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) WARN_ON(tick_nohz_full_running); The console was overwhemled with an endless stream of one WARN per tick per core and there was no way to even see what was going on w/o using a serial console to capture it and then trace it back to this. Change it to WARN_ON_ONCE(). Fixes: 08ae95f4fd3b ("nohz_full: Allow the boot CPU to be nohz_full") Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211206145950.10927-3-paul.gortmaker@windriver.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20genirq/affinity: Consider that CPUs on nodes can be unbalancedRei Yamamoto1-2/+3
commit 08d835dff916bfe8f45acc7b92c7af6c4081c8a7 upstream. If CPUs on a node are offline at boot time, the number of nodes is different when building affinity masks for present cpus and when building affinity masks for possible cpus. This causes the following problem: In the case that the number of vectors is less than the number of nodes there are cases where bits of masks for present cpus are overwritten when building masks for possible cpus. Fix this by excluding CPUs, which are not part of the current build mask (present/possible). [ tglx: Massaged changelog and added comment ] Fixes: b82592199032 ("genirq/affinity: Spread IRQs to all available NUMA nodes") Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220331003309.10891-1-yamamoto.rei@jp.fujitsu.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13stacktrace: move filter_irq_stacks() to kernel/stacktrace.cMarco Elver1-0/+30
commit f39f21b3ddc7fc0f87eb6dc75ddc81b5bbfb7672 upstream. filter_irq_stacks() has little to do with the stackdepot implementation, except that it is usually used by users (such as KASAN) of stackdepot to reduce the stack trace. However, filter_irq_stacks() itself is not useful without a stack trace as obtained by stack_trace_save() and friends. Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new users of filter_irq_stacks() do not have to start depending on STACKDEPOT only for filter_irq_stacks(). Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Jann Horn <jannh@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Taras Madan <tarasmadan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13static_call: Don't make __static_call_return0 staticChristophe Leroy3-542/+546
commit 8fd4ddda2f49a66bf5dd3d0c01966c4b1971308b upstream. System.map shows that vmlinux contains several instances of __static_call_return0(): c0004fc0 t __static_call_return0 c0011518 t __static_call_return0 c00d8160 t __static_call_return0 arch_static_call_transform() uses the middle one to check whether we are setting a call to __static_call_return0 or not: c0011520 <arch_static_call_transform>: c0011520: 3d 20 c0 01 lis r9,-16383 <== r9 = 0xc001 << 16 c0011524: 39 29 15 18 addi r9,r9,5400 <== r9 += 0x1518 c0011528: 7c 05 48 00 cmpw r5,r9 <== r9 has value 0xc0011518 here So if static_call_update() is called with one of the other instances of __static_call_return0(), arch_static_call_transform() won't recognise it. In order to work properly, global single instance of __static_call_return0() is required. Fixes: 3f2a8fc4b15d ("static_call/x86: Add __static_call_return0()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/30821468a0e7d28251954b578e5051dc09300d04.1647258493.git.christophe.leroy@csgroup.eu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13sched: Teach the forced-newidle balancer about CPU affinity limitation.Sebastian Andrzej Siewior1-1/+1
commit 386ef214c3c6ab111d05e1790e79475363abaa05 upstream. try_steal_cookie() looks at task_struct::cpus_mask to decide if the task could be moved to `this' CPU. It ignores that the task might be in a migration disabled section while not on the CPU. In this case the task must not be moved otherwise per-CPU assumption are broken. Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a task can be moved. Fixes: d2dfa17bc7de6 ("sched: Trivial forced-newidle balancer") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13perf/core: Inherit event_capsNamhyung Kim1-0/+3
commit e3265a4386428d3d157d9565bb520aabff8b4bf0 upstream. It was reported that some perf event setup can make fork failed on ARM64. It was the case of a group of mixed hw and sw events and it failed in perf_event_init_task() due to armpmu_event_init(). The ARM PMU code checks if all the events in a group belong to the same PMU except for software events. But it didn't set the event_caps of inherited events and no longer identify them as software events. Therefore the test failed in a child process. A simple reproducer is: $ perf stat -e '{cycles,cs,instructions}' perf bench sched messaging # Running 'sched/messaging' benchmark: perf: fork(): Invalid argument The perf stat was fine but the perf bench failed in fork(). Let's inherit the event caps from the parent. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20220328200112.457740-1-namhyung@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08swiotlb: Support aligned swiotlb buffersDavid Stevens1-5/+8
commit e81e99bacc9f9347bda7808a949c1ce9fcc2bbf4 upstream. Add an argument to swiotlb_tbl_map_single that specifies the desired alignment of the allocated buffer. This is used by dma-iommu to ensure the buffer is aligned to the iova granule size when using swiotlb with untrusted sub-granule mappings. This addresses an issue where adjacent slots could be exposed to the untrusted device if IO_TLB_SIZE < iova granule < PAGE_SIZE. Signed-off-by: David Stevens <stevensd@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210929023300.335969-7-stevensd@google.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Cc: Mario Limonciello <Mario.Limonciello@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08bpf: Adjust BPF stack helper functions to accommodate skip > 0Namhyung Kim1-32/+24
commit ee2a098851bfbe8bcdd964c0121f4246f00ff41e upstream. Let's say that the caller has storage for num_elem stack frames. Then, the BPF stack helper functions walk the stack for only num_elem frames. This means that if skip > 0, one keeps only 'num_elem - skip' frames. This is because it sets init_nr in the perf_callchain_entry to the end of the buffer to save num_elem entries only. I believe it was because the perf callchain code unwound the stack frames until it reached the global max size (sysctl_perf_event_max_stack). However it now has perf_callchain_entry_ctx.max_stack to limit the iteration locally. This simplifies the code to handle init_nr in the BPF callstack entries and removes the confusion with the perf_event's __PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0. Also change the comment on bpf_get_stack() in the header file to be more explicit what the return value means. Fixes: c195651e565a ("bpf: add bpf_get_stack helper") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com Link: https://lore.kernel.org/bpf/20220314182042.71025-1-namhyung@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Based-on-patch-by: Eugene Loh <eugene.loh@oracle.com>
2022-04-08tracing: Have type enum modifications copy the stringsSteven Rostedt (Google)1-1/+61
commit 795301d3c28996219d555023ac6863401b6076bc upstream. When an enum is used in the visible parts of a trace event that is exported to user space, the user space applications like perf and trace-cmd do not have a way to know what the value of the enum is. To solve this, at boot up (or module load) the printk formats are modified to replace the enum with their numeric value in the string output. Array fields of the event are defined by [<nr-elements>] in the type portion of the format file so that the user space parsers can correctly parse the array into the appropriate size chunks. But in some trace events, an enum is used in defining the size of the array, which once again breaks the parsing of user space tooling. This was solved the same way as the print formats were, but it modified the type strings of the trace event. This caused crashes in some architectures because, as supposed to the print string, is a const string value. This was not detected on x86, as it appears that const strings are still writable (at least in boot up), but other architectures this is not the case, and writing to a const string will cause a kernel fault. To fix this, use kstrdup() to copy the type before modifying it. If the trace event is for the core kernel there's no need to free it because the string will be in use for the life of the machine being on line. For modules, create a link list to store all the strings being allocated for modules and when the module is removed, free them. Link: https://lore.kernel.org/all/yt9dr1706b4i.fsf@linux.ibm.com/ Link: https://lkml.kernel.org/r/20220318153432.3984b871@gandalf.local.home Tested-by: Marc Zyngier <maz@kernel.org> Tested-by: Sven Schnelle <svens@linux.ibm.com> Reported-by: Sven Schnelle <svens@linux.ibm.com> Fixes: b3bc8547d3be ("tracing: Have TRACE_DEFINE_ENUM affect trace event types as well") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""Linus Torvalds1-4/+8
commit 901c7280ca0d5e2b4a8929fbe0bfb007ac2a6544 upstream. Halil Pasic points out [1] that the full revert of that commit (revert in bddac7c1e02b), and that a partial revert that only reverts the problematic case, but still keeps some of the cleanups is probably better.  And that partial revert [2] had already been verified by Oleksandr Natalenko to also fix the issue, I had just missed that in the long discussion. So let's reinstate the cleanups from commit aa6f8dcbab47 ("swiotlb: rework "fix info leak with DMA_FROM_DEVICE""), and effectively only revert the part that caused problems. Link: https://lore.kernel.org/all/20220328013731.017ae3e3.pasic@linux.ibm.com/ [1] Link: https://lore.kernel.org/all/20220324055732.GB12078@lst.de/ [2] Link: https://lore.kernel.org/all/4386660.LvFx2qVVIh@natalenko.name/ [3] Suggested-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08watch_queue: Free the page array when watch_queue is dismantledEric Dumazet1-0/+1
commit b490207017ba237d97b735b2aa66dc241ccd18f5 upstream. Commit 7ea1a0124b6d ("watch_queue: Free the alloc bitmap when the watch_queue is torn down") took care of the bitmap, but not the page array. BUG: memory leak unreferenced object 0xffff88810d9bc140 (size 32): comm "syz-executor335", pid 3603, jiffies 4294946994 (age 12.840s) hex dump (first 32 bytes): 40 a7 40 04 00 ea ff ff 00 00 00 00 00 00 00 00 @.@............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: kmalloc_array include/linux/slab.h:621 [inline] kcalloc include/linux/slab.h:652 [inline] watch_queue_set_size+0x12f/0x2e0 kernel/watch_queue.c:251 pipe_ioctl+0x82/0x140 fs/pipe.c:632 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] Reported-by: syzbot+25ea042ae28f3888727a@syzkaller.appspotmail.com Fixes: c73be61cede5 ("pipe: Add general notification queue support") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20220322004654.618274-1-eric.dumazet@gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08tracing: Have TRACE_DEFINE_ENUM affect trace event types as wellSteven Rostedt (Google)1-0/+28
[ Upstream commit b3bc8547d3be60898818885f5bf22d0a62e2eb48 ] The macro TRACE_DEFINE_ENUM is used to convert enums in the kernel to their actual value when they are exported to user space via the trace event format file. Currently only the enums in the "print fmt" (TP_printk in the TRACE_EVENT macro) have the enums converted. But the enums can be used to denote array size: field:unsigned int fc_ineligible_rc[EXT4_FC_REASON_MAX]; offset:12; size:36; signed:0; The EXT4_FC_REASON_MAX has no meaning to userspace but it needs to know that information to know how to parse the array. Have the array indexes also be parsed as well. Link: https://lore.kernel.org/all/cover.1646922487.git.riteshh@linux.ibm.com/ Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08locking/lockdep: Iterate lock_classes directly when reading lockdep filesWaiman Long3-15/+56
[ Upstream commit fb7275acd6fb988313dddd8d3d19efa70d9015ad ] When dumping lock_classes information via /proc/lockdep, we can't take the lockdep lock as the lock hold time is indeterminate. Iterating over all_lock_classes without holding lock can be dangerous as there is a slight chance that it may branch off to other lists leading to infinite loop or even access invalid memory if changes are made to all_lock_classes list in parallel. To avoid this problem, iteration of lock classes is now done directly on the lock_classes array itself. The lock_classes_in_use bitmap is checked to see if the lock class is being used. To avoid iterating the full array all the times, a new max_lock_class_idx value is added to track the maximum lock_class index that is currently being used. We can theoretically take the lockdep lock for iterating all_lock_classes when other lockdep files (lockdep_stats and lock_stat) are accessed as the lock hold time will be shorter for them. For consistency, they are also modified to iterate the lock_classes array directly. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220211035526.1329503-2-longman@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08rcu: Mark writes to the rcu_segcblist structure's ->flags fieldPaul E. McKenney1-2/+2
[ Upstream commit c09929031018913b5783872a8b8cdddef4a543c7 ] KCSAN reports data races between the rcu_segcblist_clear_flags() and rcu_segcblist_set_flags() functions, though misreporting the latter as a call to rcu_segcblist_is_enabled() from call_rcu(). This commit converts the updates of this field to WRITE_ONCE(), relying on the resulting unmarked reads to continue to detect buggy concurrent writes to this field. Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08kdb: Fix the putarea helper functionDaniel Thompson1-1/+1
[ Upstream commit c1cb81429df462eca1b6ba615cddd21dd3103c46 ] Currently kdb_putarea_size() uses copy_from_kernel_nofault() to write *to* arbitrary kernel memory. This is obviously wrong and means the memory modify ('mm') command is a serious risk to debugger stability: if we poke to a bad address we'll double-fault and lose our debug session. Fix this the (very) obvious way. Note that there are two Fixes: tags because the API was renamed and this patch will only trivially backport as far as the rename (and this is probably enough). Nevertheless Christoph's rename did not introduce this problem so I wanted to record that! Fixes: fe557319aa06 ("maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault") Fixes: 5d5314d6795f ("kdb: core for kgdb back end (1 of 2)") Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20220128144055.207267-1-daniel.thompson@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08dma-debug: fix return value of __setup handlersRandy Dunlap1-2/+2
[ Upstream commit 80e4390981618e290616dbd06ea190d4576f219d ] When valid kernel command line parameters dma_debug=off dma_debug_entries=100 are used, they are reported as Unknown parameters and added to init's environment strings, polluting it. Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc5 dma_debug=off dma_debug_entries=100", will be passed to user space. and Run /sbin/init as init process with arguments: /sbin/init with environment: HOME=/ TERM=linux BOOT_IMAGE=/boot/bzImage-517rc5 dma_debug=off dma_debug_entries=100 Return 1 from these __setup handlers to indicate that the command line option has been handled. Fixes: 59d3daafa1726 ("dma-debug: add kernel command line parameters") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08kernel/resource: fix kfree() of bootmem memory againMiaohe Lin1-33/+8
[ Upstream commit 0cbcc92917c5de80f15c24d033566539ad696892 ] Since commit ebff7d8f270d ("mem hotunplug: fix kfree() of bootmem memory"), we could get a resource allocated during boot via alloc_resource(). And it's required to release the resource using free_resource(). Howerver, many people use kfree directly which will result in kernel BUG. In order to fix this without fixing every call site, just leak a couple of bytes in such corner case. Link: https://lkml.kernel.org/r/20220217083619.19305-1-linmiaohe@huawei.com Fixes: ebff7d8f270d ("mem hotunplug: fix kfree() of bootmem memory") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08livepatch: Fix build failure on 32 bits processorsChristophe Leroy1-2/+2
[ Upstream commit 2f293651eca3eacaeb56747dede31edace7329d2 ] Trying to build livepatch on powerpc/32 results in: kernel/livepatch/core.c: In function 'klp_resolve_symbols': kernel/livepatch/core.c:221:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info); | ^ kernel/livepatch/core.c:221:21: error: assignment to 'Elf32_Sym *' {aka 'struct elf32_sym *'} from incompatible pointer type 'Elf64_Sym *' {aka 'struct elf64_sym *'} [-Werror=incompatible-pointer-types] 221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info); | ^ kernel/livepatch/core.c: In function 'klp_apply_section_relocs': kernel/livepatch/core.c:312:35: error: passing argument 1 of 'klp_resolve_symbols' from incompatible pointer type [-Werror=incompatible-pointer-types] 312 | ret = klp_resolve_symbols(sechdrs, strtab, symndx, sec, sec_objname); | ^~~~~~~ | | | Elf32_Shdr * {aka struct elf32_shdr *} kernel/livepatch/core.c:193:44: note: expected 'Elf64_Shdr *' {aka 'struct elf64_shdr *'} but argument is of type 'Elf32_Shdr *' {aka 'struct elf32_shdr *'} 193 | static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab, | ~~~~~~~~~~~~^~~~~~~ Fix it by using the right types instead of forcing 64 bits types. Fixes: 7c8e2bdd5f0d ("livepatch: Apply vmlinux-specific KLP relocations early") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Petr Mladek <pmladek@suse.com> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5288e11b018a762ea3351cc8fb2d4f15093a4457.1640017960.git.christophe.leroy@csgroup.eu Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08bpf: Fix UAF due to race between btf_try_get_module and load_moduleKumar Kartikeya Dwivedi1-2/+24
[ Upstream commit 18688de203b47e5d8d9d0953385bf30b5949324f ] While working on code to populate kfunc BTF ID sets for module BTF from its initcall, I noticed that by the time the initcall is invoked, the module BTF can already be seen by userspace (and the BPF verifier). The existing btf_try_get_module calls try_module_get which only fails if mod->state == MODULE_STATE_GOING, i.e. it can increment module reference when module initcall is happening in parallel. Currently, BTF parsing happens from MODULE_STATE_COMING notifier callback. At this point, the module initcalls have not been invoked. The notifier callback parses and prepares the module BTF, allocates an ID, which publishes it to userspace, and then adds it to the btf_modules list allowing the kernel to invoke btf_try_get_module for the BTF. However, at this point, the module has not been fully initialized (i.e. its initcalls have not finished). The code in module.c can still fail and free the module, without caring for other users. However, nothing stops btf_try_get_module from succeeding between the state transition from MODULE_STATE_COMING to MODULE_STATE_LIVE. This leads to a use-after-free issue when BPF program loads successfully in the state transition, load_module's do_init_module call fails and frees the module, and BPF program fd on close calls module_put for the freed module. Future patch has test case to verify we don't regress in this area in future. There are multiple points after prepare_coming_module (in load_module) where failure can occur and module loading can return error. We illustrate and test for the race using the last point where it can practically occur (in module __init function). An illustration of the race: CPU 0 CPU 1 load_module notifier_call(MODULE_STATE_COMING) btf_parse_module btf_alloc_id // Published to userspace list_add(&btf_mod->list, btf_modules) mod->init(...) ... ^ bpf_check | check_pseudo_btf_id | btf_try_get_module | returns true | ... ... | module __init in progress return prog_fd | ... ... V if (ret < 0) free_module(mod) ... close(prog_fd) ... bpf_prog_free_deferred module_put(used_btf.mod) // use-after-free We fix this issue by setting a flag BTF_MODULE_F_LIVE, from the notifier callback when MODULE_STATE_LIVE state is reached for the module, so that we return NULL from btf_try_get_module for modules that are not fully formed. Since try_module_get already checks that module is not in MODULE_STATE_GOING state, and that is the only transition a live module can make before being removed from btf_modules list, this is enough to close the race and prevent the bug. A later selftest patch crafts the race condition artifically to verify that it has been fixed, and that verifier fails to load program (with ENXIO). Lastly, a couple of comments: 1. Even if this race didn't exist, it seems more appropriate to only access resources (ksyms and kfuncs) of a fully formed module which has been initialized completely. 2. This patch was born out of need for synchronization against module initcall for the next patch, so it is needed for correctness even without the aforementioned race condition. The BTF resources initialized by module initcall are set up once and then only looked up, so just waiting until the initcall has finished ensures correct behavior. Fixes: 541c3bad8dc5 ("bpf: Support BPF ksym variables in kernel modules") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08printk: fix return value of printk.devkmsg __setup handlerRandy Dunlap1-2/+4
[ Upstream commit b665eae7a788c5e2bc10f9ac3c0137aa0ad1fc97 ] If an invalid option value is used with "printk.devkmsg=<value>", it is silently ignored. If a valid option value is used, it is honored but the wrong return value (0) is used, indicating that the command line option had an error and was not handled. This string is not added to init's environment strings due to init/main.c::unknown_bootoption() checking for a '.' in the boot option string and then considering that string to be an "Unused module parameter". Print a warning message if a bad option string is used. Always return 1 from the __setup handler to indicate that the command line option has been handled. Fixes: 750afe7babd1 ("printk: add kernel parameter to control writes to /dev/kmsg") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Cc: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Ogness <john.ogness@linutronix.de> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220228220556.23484-1-rdunlap@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/rt: Plug rt_mutex_setprio() vs push_rt_task() raceValentin Schneider2-16/+28
[ Upstream commit 49bef33e4b87b743495627a529029156c6e09530 ] John reported that push_rt_task() can end up invoking find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS one), which causes mayhem down convert_prio(). This can happen when current gets demoted to e.g. CFS when releasing an rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before getting the chance to reschedule. Exactly who triggers this work isn't entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task() if there are no RT tasks on the local RQ, which means the local CPU can't be in the rto_mask. My current suspected sequence is something along the lines of the below, with the demoted task being current. mark_wakeup_next_waiter() rt_mutex_adjust_prio() rt_mutex_setprio() // deboost originally-CFS task check_class_changed() switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running switched_to_fair() // Sets need_resched __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above raw_spin_rq_unlock(rq) // need_resched is set, so task_woken_rt() can't // invoke push_rt_tasks(). Best I can come up with is // local CPU has rt_nr_migratory >= 2 after the demotion, so stays // in the rto_mask, and then: <some other CPU running rto_push_irq_work_func() queues rto_push_work on this CPU> push_rt_task() // breakage follows here as rq->curr is CFS Move an existing check to check rq->curr vs the next pushable task's priority before getting anywhere near find_lowest_rq(). While at it, add an explicit sched_class of rq->curr check prior to invoking find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless of next_task's migratability. Fixes: a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing") Reported-by: John Keeping <john@metanate.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: John Keeping <john@metanate.com> Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/cpuacct: Fix charge percpu cpuusageChengming Zhou1-1/+2
[ Upstream commit 248cc9993d1cc12b8e9ed716cc3fc09f6c3517dd ] The cpuacct_account_field() is always called by the current task itself, so it's ok to use __this_cpu_add() to charge the tick time. But cpuacct_charge() maybe called by update_curr() in load_balance() on a random CPU, different from the CPU on which the task is running. So __this_cpu_add() will charge that cputime to a random incorrect CPU. Fixes: 73e6aafd9ea8 ("sched/cpuacct: Simplify the cpuacct code") Reported-by: Minye Zhu <zhuminye@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220220051426.5274-1-zhouchengming@bytedance.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/fair: Improve consistency of allowed NUMA balance calculationsMel Gorman1-8/+10
[ Upstream commit 2cfb7a1b031b0e816af7a6ee0c6ab83b0acdf05a ] There are inconsistencies when determining if a NUMA imbalance is allowed that should be corrected. o allow_numa_imbalance changes types and is not always examining the destination group so both the type should be corrected as well as the naming. o find_idlest_group uses the sched_domain's weight instead of the group weight which is different to find_busiest_group o find_busiest_group uses the source group instead of the destination which is different to task_numa_find_cpu o Both find_idlest_group and find_busiest_group should account for the number of running tasks if a move was allowed to be consistent with task_numa_find_cpu Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08perf/core: Fix address filter parser for multiple filtersAdrian Hunter1-0/+3
[ Upstream commit d680ff24e9e14444c63945b43a37ede7cd6958f9 ] Reset appropriate variables in the parser loop between parsing separate filters, so that they do not interfere with parsing the next filter. Fixes: 375637bc524952 ("perf/core: Introduce address range filtering") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220131072453.2839535-4-adrian.hunter@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08rseq: Remove broken uapi field layout on 32-bit little endianMathieu Desnoyers1-4/+4
[ Upstream commit bfdf4e6208051ed7165b2e92035b4bf11f43eb63 ] The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is entirely wrong on 32-bit little endian: a preprocessor logic mistake wrongly uses the big endian field layout on 32-bit little endian architectures. Fortunately, those ptr32 accessors were never used within the kernel, and only meant as a convenience for user-space. Remove those and replace the whole rseq_cs union by a __u64 type, as this is the only thing really needed to express the ABI. Document how 32-bit architectures are meant to interact with this field. Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220127152720.25898-1-mathieu.desnoyers@efficios.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/uclamp: Fix iowait boost escaping uclamp restrictionQais Yousef1-0/+1
[ Upstream commit d37aee9018e68b0d356195caefbb651910e0bbfa ] iowait_boost signal is applied independently of util and doesn't take into account uclamp settings of the rq. An io heavy task that is capped by uclamp_max could still request higher frequency because sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with() like effective_cpu_util() does. Make sure that iowait_boost honours uclamp requests by calling uclamp_rq_util_with() when applying the boost. Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/core: Export pelt_thermal_tpQais Yousef1-0/+1
[ Upstream commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b ] We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numaBharata B Rao1-10/+0
[ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ] The older format of /proc/pid/sched printed home node info which required the mempolicy and task lock around mpol_get(). However the format has changed since then and there is no need for sched_show_numa() any more to have mempolicy argument, asssociated mpol_get/put and task_lock/unlock. Remove them. Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08watch_queue: Actually free the watchDavid Howells1-0/+1
[ Upstream commit 3d8dcf278b1ee1eff1e90be848fa2237db4c07a7 ] free_watch() does everything barring actually freeing the watch object. Fix this by adding the missing kfree. kmemleak produces a report something like the following. Note that as an address can be seen in the first word, the watch would appear to have gone through call_rcu(). BUG: memory leak unreferenced object 0xffff88810ce4a200 (size 96): comm "syz-executor352", pid 3605, jiffies 4294947473 (age 13.720s) hex dump (first 32 bytes): e0 82 48 0d 81 88 ff ff 00 00 00 00 00 00 00 00 ..H............. 80 a2 e4 0c 81 88 ff ff 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8214e6cc>] kmalloc include/linux/slab.h:581 [inline] [<ffffffff8214e6cc>] kzalloc include/linux/slab.h:714 [inline] [<ffffffff8214e6cc>] keyctl_watch_key+0xec/0x2e0 security/keys/keyctl.c:1800 [<ffffffff8214ec84>] __do_sys_keyctl+0x3c4/0x490 security/keys/keyctl.c:2016 [<ffffffff84493a25>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff84493a25>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-and-tested-by: syzbot+6e2de48f06cdb2884bfc@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08watch_queue: Fix NULL dereference in error cleanupDavid Howells1-1/+1
[ Upstream commit a635415a064e77bcfbf43da413fd9dfe0bbed9cb ] In watch_queue_set_size(), the error cleanup code doesn't take account of the fact that __free_page() can't handle a NULL pointer when trying to free up buffer pages that did get allocated. Fix this by only calling __free_page() on the pages actually allocated. Without the fix, this can lead to something like the following: BUG: KASAN: null-ptr-deref in __free_pages+0x1f/0x1b0 mm/page_alloc.c:5473 Read of size 4 at addr 0000000000000034 by task syz-executor168/3599 ... Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 __kasan_report mm/kasan/report.c:446 [inline] kasan_report.cold+0x66/0xdf mm/kasan/report.c:459 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline] page_ref_count include/linux/page_ref.h:67 [inline] put_page_testzero include/linux/mm.h:717 [inline] __free_pages+0x1f/0x1b0 mm/page_alloc.c:5473 watch_queue_set_size+0x499/0x630 kernel/watch_queue.c:275 pipe_ioctl+0xac/0x2b0 fs/pipe.c:632 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-and-tested-by: syzbot+d55757faa9b80590767b@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08PM: suspend: fix return value of __setup handlerRandy Dunlap1-4/+4
[ Upstream commit 7a64ca17e4dd50d5f910769167f3553902777844 ] If an invalid option is given for "test_suspend=<option>", the entire string is added to init's environment, so return 1 instead of 0 from the __setup handler. Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc5 test_suspend=invalid" and Run /sbin/init as init process with arguments: /sbin/init with environment: HOME=/ TERM=linux BOOT_IMAGE=/boot/bzImage-517rc5 test_suspend=invalid Fixes: 2ce986892faf ("PM / sleep: Enhance test_suspend option with repeat capability") Fixes: 27ddcc6596e5 ("PM / sleep: Add state field to pm_states[] entries") Fixes: a9d7052363a6 ("PM: Separate suspend to RAM functionality from core") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08PM: hibernate: fix __setup handler error handlingRandy Dunlap1-1/+1
[ Upstream commit ba7ffcd4c4da374b0f64666354eeeda7d3827131 ] If an invalid value is used in "resumedelay=<seconds>", it is silently ignored. Add a warning message and then let the __setup handler return 1 to indicate that the kernel command line option has been handled. Fixes: 317cf7e5e85e3 ("PM / hibernate: convert simple_strtoul to kstrtoul") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08audit: log AUDIT_TIME_* records only from rulesRichard Guy Briggs2-20/+71
[ Upstream commit 272ceeaea355214b301530e262a0df8600bfca95 ] AUDIT_TIME_* events are generated when there are syscall rules present that are not related to time keeping. This will produce noisy log entries that could flood the logs and hide events we really care about. Rather than immediately produce the AUDIT_TIME_* records, store the data in the context and log it at syscall exit time respecting the filter rules. Note: This eats the audit_buffer, unlike any others in show_special(). Please see https://bugzilla.redhat.com/show_bug.cgi?id=1991919 Fixes: 7e8eda734d30 ("ntp: Audit NTP parameters adjustment") Fixes: 2d87a0674bd6 ("timekeeping: Audit clock adjustments") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: fixed style/whitespace issues] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08tracing: Have trace event string test handle zero length stringsSteven Rostedt (Google)1-2/+7
commit eca344a7362e0f34f179298fd8366bcd556eede1 upstream. If a trace event has in its TP_printk(): "%*.s", len, len ? __get_str(string) : NULL It is perfectly valid if len is zero and passing in the NULL. Unfortunately, the runtime string check at time of reading the trace sees the NULL and flags it as a bad string and produces a WARN_ON(). Handle this case by passing into the test function if the format has an asterisk (star) and if so, if the length is zero, then mark it as safe. Link: https://lore.kernel.org/all/YjsWzuw5FbWPrdqq@bfoster/ Cc: stable@vger.kernel.org Reported-by: Brian Foster <bfoster@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Fixes: 9a6944fee68e2 ("tracing: Add a verifier to check string pointers for trace events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZEJann Horn1-15/+32
commit ee1fee900537b5d9560e9f937402de5ddc8412f3 upstream. Setting PTRACE_O_SUSPEND_SECCOMP is supposed to be a highly privileged operation because it allows the tracee to completely bypass all seccomp filters on kernels with CONFIG_CHECKPOINT_RESTORE=y. It is only supposed to be settable by a process with global CAP_SYS_ADMIN, and only if that process is not subject to any seccomp filters at all. However, while these permission checks were done on the PTRACE_SETOPTIONS path, they were missing on the PTRACE_SEIZE path, which also sets user-specified ptrace flags. Move the permissions checks out into a helper function and let both ptrace_attach() and ptrace_setoptions() call it. Cc: stable@kernel.org Fixes: 13c4a90119d2 ("seccomp: add ptrace options for suspend/resume") Signed-off-by: Jann Horn <jannh@google.com> Link: https://lkml.kernel.org/r/20220319010838.1386861-1-jannh@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08locking/lockdep: Avoid potential access of invalid memory in lock_classWaiman Long1-9/+15
commit 61cc4534b6550997c97a03759ab46b29d44c0017 upstream. It was found that reading /proc/lockdep after a lockdep splat may potentially cause an access to freed memory if lockdep_unregister_key() is called after the splat but before access to /proc/lockdep [1]. This is due to the fact that graph_lock() call in lockdep_unregister_key() fails after the clearing of debug_locks by the splat process. After lockdep_unregister_key() is called, the lock_name may be freed but the corresponding lock_class structure still have a reference to it. That invalid memory pointer will then be accessed when /proc/lockdep is read by a user and a use-after-free (UAF) error will be reported if KASAN is enabled. To fix this problem, lockdep_unregister_key() is now modified to always search for a matching key irrespective of the debug_locks state and zap the corresponding lock class if a matching one is found. [1] https://lore.kernel.org/lkml/77f05c15-81b6-bddd-9650-80d5f23fe330@i-love.sakura.ne.jp/ Fixes: 8b39adbee805 ("locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Cc: Cheng-Jui Wang <cheng-jui.wang@mediatek.com> Link: https://lkml.kernel.org/r/20220103023558.1377055-1-longman@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08Revert "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""Linus Torvalds1-15/+8
commit bddac7c1e02ba47f0570e494c9289acea3062cc1 upstream. This reverts commit aa6f8dcbab473f3a3c7454b74caa46d36cdc5d13. It turns out this breaks at least the ath9k wireless driver, and possibly others. What the ath9k driver does on packet receive is to set up the DMA transfer with: int ath_rx_init(..) .. bf->bf_buf_addr = dma_map_single(sc->dev, skb->data, common->rx_bufsize, DMA_FROM_DEVICE); and then the receive logic (through ath_rx_tasklet()) will fetch incoming packets static bool ath_edma_get_buffers(..) .. dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr, common->rx_bufsize, DMA_FROM_DEVICE); ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data); if (ret == -EINPROGRESS) { /*let device gain the buffer again*/ dma_sync_single_for_device(sc->dev, bf->bf_buf_addr, common->rx_bufsize, DMA_FROM_DEVICE); return false; } and it's worth noting how that first DMA sync: dma_sync_single_for_cpu(..DMA_FROM_DEVICE); is there to make sure the CPU can read the DMA buffer (possibly by copying it from the bounce buffer area, or by doing some cache flush). The iommu correctly turns that into a "copy from bounce bufer" so that the driver can look at the state of the packets. In the meantime, the device may continue to write to the DMA buffer, but we at least have a snapshot of the state due to that first DMA sync. But that _second_ DMA sync: dma_sync_single_for_device(..DMA_FROM_DEVICE); is telling the DMA mapping that the CPU wasn't interested in the area because the packet wasn't there. In the case of a DMA bounce buffer, that is a no-op. Note how it's not a sync for the CPU (the "for_device()" part), and it's not a sync for data written by the CPU (the "DMA_FROM_DEVICE" part). Or rather, it _should_ be a no-op. That's what commit aa6f8dcbab47 broke: it made the code bounce the buffer unconditionally, and changed the DMA_FROM_DEVICE to just unconditionally and illogically be DMA_TO_DEVICE. [ Side note: purely within the confines of the swiotlb driver it wasn't entirely illogical: The reason it did that odd DMA_FROM_DEVICE -> DMA_TO_DEVICE conversion thing is because inside the swiotlb driver, it uses just a swiotlb_bounce() helper that doesn't care about the whole distinction of who the sync is for - only which direction to bounce. So it took the "sync for device" to mean that the CPU must have been the one writing, and thought it meant DMA_TO_DEVICE. ] Also note how the commentary in that commit was wrong, probably due to that whole confusion, claiming that the commit makes the swiotlb code "bounce unconditionally (that is, also when dir == DMA_TO_DEVICE) in order do avoid synchronising back stale data from the swiotlb buffer" which is nonsensical for two reasons: - that "also when dir == DMA_TO_DEVICE" is nonsensical, as that was exactly when it always did - and should do - the bounce. - since this is a sync for the device (not for the CPU), we're clearly fundamentally not coping back stale data from the bounce buffers at all, because we'd be copying *to* the bounce buffers. So that commit was just very confused. It confused the direction of the synchronization (to the device, not the cpu) with the direction of the DMA (from the device). Reported-and-bisected-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Olha Cherevyk <olha.cherevyk@gmail.com> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Kalle Valo <kvalo@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Toke Høiland-Jørgensen <toke@toke.dk> Cc: Maxime Bizon <mbizon@freebox.fr> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-28rcu: Don't deboost before reporting expedited quiescent statePaul E. McKenney1-4/+4
commit 10c535787436d62ea28156a4b91365fd89b5a432 upstream. Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx before reporting the expedited quiescent state. Under heavy real-time load, this can result in this function being preempted before the quiescent state is reported, which can in turn prevent the expedited grace period from completing. Tim Murray reports that the resulting expedited grace periods can take hundreds of milliseconds and even more than one second, when they should normally complete in less than a millisecond. This was fine given that there were no particular response-time constraints for synchronize_rcu_expedited(), as it was designed for throughput rather than latency. However, some users now need sub-100-millisecond response-time constratints. This patch therefore follows Neeraj's suggestion (seconded by Tim and by Uladzislau Rezki) of simply reversing the two operations. Reported-by: Tim Murray <timmurray@google.com> Reported-by: Joel Fernandes <joelaf@google.com> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Tim Murray <timmurray@google.com> Cc: Todd Kjos <tkjos@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16watch_queue: Make comment about setting ->defunct more accurateDavid Howells1-1/+1
commit 4edc0760412b0c4ecefc7e02cb855b310b122825 upstream. watch_queue_clear() has a comment stating that setting ->defunct to true preventing new additions as well as preventing notifications. Whilst the latter is true, the first bit is superfluous since at the time this function is called, the pipe cannot be accessed to add new event sources. Remove the "new additions" bit from the comment. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16watch_queue: Fix lack of barrier/sync/lock between post and readDavid Howells1-1/+1
commit 2ed147f015af2b48f41c6f0b6746aa9ea85c19f3 upstream. There's nothing to synchronise post_one_notification() versus pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the reader only takes pipe->mutex which cannot bar notification posting as that may need to be made from contexts that cannot sleep. Fix this by setting pipe->head with a barrier in post_one_notification() and reading pipe->head with a barrier in pipe_read(). If that's not sufficient, the rd_wait.lock will need to be taken, possibly in a ->confirm() op so that it only applies to notifications. The lock would, however, have to be dropped before copy_page_to_iter() is invoked. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16watch_queue: Free the alloc bitmap when the watch_queue is torn downDavid Howells1-0/+1
commit 7ea1a0124b6da246b5bc8c66cddaafd36acf3ecb upstream. Free the watch_queue note allocation bitmap when the watch_queue is destroyed. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16watch_queue: Fix the alloc bitmap size to reflect notes allocatedDavid Howells1-1/+2
commit 3b4c0371928c17af03e8397ac842346624017ce6 upstream. Currently, watch_queue_set_size() sets the number of notes available in wqueue->nr_notes according to the number of notes allocated, but sets the size of the bitmap to the unrounded number of notes originally asked for. Fix this by setting the bitmap size to the number of notes we're actually going to make available (ie. the number allocated). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16watch_queue: Fix to always request a pow-of-2 pipe ring sizeDavid Howells1-1/+1
commit 96a4d8912b28451cd62825fd7caa0e66e091d938 upstream. The pipe ring size must always be a power of 2 as the head and tail pointers are masked off by AND'ing with the size of the ring - 1. watch_queue_set_size(), however, lets you specify any number of notes between 1 and 511. This number is passed through to pipe_resize_ring() without checking/forcing its alignment. Fix this by rounding the number of slots required up to the nearest power of two. The request is meant to guarantee that at least that many notifications can be generated before the queue is full, so rounding down isn't an option, but, alternatively, it may be better to give an error if we aren't allowed to allocate that much ring space. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>