summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
10 daysmm/ksm: fix possible UAF of stable_nodeChengming Zhou1-1/+2
The commit 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit") introduced a possible failure case in the stable_tree_insert(), where we may free the new allocated stable_node_dup if we fail to prepare the missing chain node. Then that kfolio return and unlock with a freed stable_node set... And any MM activities can come in to access kfolio->mapping, so UAF. Fix it by moving folio_set_stable_node() to the end after stable_node is inserted successfully. Link: https://lkml.kernel.org/r/20240513-b4-ksm-stable-node-uaf-v1-1-f687de76f452@linux.dev Fixes: 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit") Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysmm/memory-failure: fix handling of dissolved but not taken off from buddy pagesMiaohe Lin1-2/+2
When I did memory failure tests recently, below panic occurs: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:1009! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:__del_page_from_free_list+0x151/0x180 RSP: 0018:ffffa49c90437998 EFLAGS: 00000046 RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0 RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69 R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80 R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009 FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0 Call Trace: <TASK> __rmqueue_pcplist+0x23b/0x520 get_page_from_freelist+0x26b/0xe40 __alloc_pages_noprof+0x113/0x1120 __folio_alloc_noprof+0x11/0xb0 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130 __alloc_fresh_hugetlb_folio+0xe7/0x140 alloc_pool_huge_folio+0x68/0x100 set_max_huge_pages+0x13d/0x340 hugetlb_sysctl_handler_common+0xe8/0x110 proc_sys_call_handler+0x194/0x280 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff916114887 RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887 RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003 RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0 R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004 R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- And before the panic, there had an warning about bad page state: BUG: Bad page state in process page-types pfn:8cee00 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) page_type: 0xffffff7f(buddy) raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount Modules linked in: mce_inject hwpoison_inject CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22 Call Trace: <TASK> dump_stack_lvl+0x83/0xa0 bad_page+0x63/0xf0 free_unref_page+0x36e/0x5c0 unpoison_memory+0x50b/0x630 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110 debugfs_attr_write+0x42/0x60 full_proxy_write+0x5b/0x80 vfs_write+0xcd/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f189a514887 RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887 RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003 RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8 R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040 </TASK> The root cause should be the below race: memory_failure try_memory_failure_hugetlb me_huge_page __page_handle_poison dissolve_free_hugetlb_folio drain_all_pages -- Buddy page can be isolated e.g. for compaction. take_page_off_buddy -- Failed as page is not in the buddy list. -- Page can be putback into buddy after compaction. page_ref_inc -- Leads to buddy page with refcnt = 1. Then unpoison_memory() can unpoison the page and send the buddy page back into buddy list again leading to the above bad page state warning. And bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page. Fix this issue by only treating __page_handle_poison() as successful when it returns 1. Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com Fixes: ceaf8fbea79a ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysmm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock againYuanyuan Zhong1-2/+7
After switching smaps_rollup to use VMA iterator, searching for next entry is part of the condition expression of the do-while loop. So the current VMA needs to be addressed before the continue statement. Otherwise, with some VMAs skipped, userspace observed memory consumption from /proc/pid/smaps_rollup will be smaller than the sum of the corresponding fields from /proc/pid/smaps. Link: https://lkml.kernel.org/r/20240523183531.2535436-1-yzhong@purestorage.com Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end") Signed-off-by: Yuanyuan Zhong <yzhong@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysnilfs2: fix potential hang in nilfs_detach_log_writer()Ryusuke Konishi1-3/+18
Syzbot has reported a potential hang in nilfs_detach_log_writer() called during nilfs2 unmount. Analysis revealed that this is because nilfs_segctor_sync(), which synchronizes with the log writer thread, can be called after nilfs_segctor_destroy() terminates that thread, as shown in the call trace below: nilfs_detach_log_writer nilfs_segctor_destroy nilfs_segctor_kill_thread --> Shut down log writer thread flush_work nilfs_iput_work_func nilfs_dispose_list iput nilfs_evict_inode nilfs_transaction_commit nilfs_construct_segment (if inode needs sync) nilfs_segctor_sync --> Attempt to synchronize with log writer thread *** DEADLOCK *** Fix this issue by changing nilfs_segctor_sync() so that the log writer thread returns normally without synchronizing after it terminates, and by forcing tasks that are already waiting to complete once after the thread terminates. The skipped inode metadata flushout will then be processed together in the subsequent cleanup work in nilfs_segctor_destroy(). Link: https://lkml.kernel.org/r/20240520132621.4054-4-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+e3973c409251e136fdd0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=e3973c409251e136fdd0 Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Cc: "Bai, Shuangpeng" <sjb7183@psu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysnilfs2: fix unexpected freezing of nilfs_segctor_sync()Ryusuke Konishi1-4/+13
A potential and reproducible race issue has been identified where nilfs_segctor_sync() would block even after the log writer thread writes a checkpoint, unless there is an interrupt or other trigger to resume log writing. This turned out to be because, depending on the execution timing of the log writer thread running in parallel, the log writer thread may skip responding to nilfs_segctor_sync(), which causes a call to schedule() waiting for completion within nilfs_segctor_sync() to lose the opportunity to wake up. The reason why waking up the task waiting in nilfs_segctor_sync() may be skipped is that updating the request generation issued using a shared sequence counter and adding an wait queue entry to the request wait queue to the log writer, are not done atomically. There is a possibility that log writing and request completion notification by nilfs_segctor_wakeup() may occur between the two operations, and in that case, the wait queue entry is not yet visible to nilfs_segctor_wakeup() and the wake-up of nilfs_segctor_sync() will be carried over until the next request occurs. Fix this issue by performing these two operations simultaneously within the lock section of sc_state_lock. Also, following the memory barrier guidelines for event waiting loops, move the call to set_current_state() in the same location into the event waiting loop to ensure that a memory barrier is inserted just before the event condition determination. Link: https://lkml.kernel.org/r/20240520132621.4054-3-konishi.ryusuke@gmail.com Fixes: 9ff05123e3bf ("nilfs2: segment constructor") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Cc: "Bai, Shuangpeng" <sjb7183@psu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysnilfs2: fix use-after-free of timer for log writer threadRyusuke Konishi1-6/+19
Patch series "nilfs2: fix log writer related issues". This bug fix series covers three nilfs2 log writer-related issues, including a timer use-after-free issue and potential deadlock issue on unmount, and a potential freeze issue in event synchronization found during their analysis. Details are described in each commit log. This patch (of 3): A use-after-free issue has been reported regarding the timer sc_timer on the nilfs_sc_info structure. The problem is that even though it is used to wake up a sleeping log writer thread, sc_timer is not shut down until the nilfs_sc_info structure is about to be freed, and is used regardless of the thread's lifetime. Fix this issue by limiting the use of sc_timer only while the log writer thread is alive. Link: https://lkml.kernel.org/r/20240520132621.4054-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240520132621.4054-2-konishi.ryusuke@gmail.com Fixes: fdce895ea5dd ("nilfs2: change sc_timer from a pointer to an embedded one in struct nilfs_sc_info") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: "Bai, Shuangpeng" <sjb7183@psu.edu> Closes: https://groups.google.com/g/syzkaller/c/MK_LYqtt8ko/m/8rgdWeseAwAJ Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysselftests/mm: fix build warnings on ppc64Michael Ellerman2-0/+2
Fix warnings like: In file included from uffd-unit-tests.c:8: uffd-unit-tests.c: In function `uffd_poison_handle_fault': uffd-common.h:45:33: warning: format `%llu' expects argument of type `long long unsigned int', but argument 3 has type `__u64' {aka `long unsigned int'} [-Wformat=] By switching to unsigned long long for u64 for ppc64 builds. Link: https://lkml.kernel.org/r/20240521030219.57439-1-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysarm64: patching: fix handling of execmem addressesWill Deacon1-1/+1
Klara Modin reported warnings for a kernel configured with BPF_JIT but without MODULES: [ 44.131296] Trying to vfree() bad address (000000004a17c299) [ 44.138024] WARNING: CPU: 1 PID: 193 at mm/vmalloc.c:3189 remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.146675] CPU: 1 PID: 193 Comm: kworker/1:2 Tainted: G D W 6.9.0-01786-g2c9e5d4a0082 #25 [ 44.158229] Hardware name: Raspberry Pi 3 Model B (DT) [ 44.164433] Workqueue: events bpf_prog_free_deferred [ 44.170492] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 44.178601] pc : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.183705] lr : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.188772] sp : ffff800082a13c70 [ 44.193112] x29: ffff800082a13c70 x28: 0000000000000000 x27: 0000000000000000 [ 44.201384] x26: 0000000000000000 x25: ffff00003a44efa0 x24: 00000000d4202000 [ 44.209658] x23: ffff800081223dd0 x22: ffff00003a198a40 x21: ffff8000814dd880 [ 44.217924] x20: 00000000d4202000 x19: ffff8000814dd880 x18: 0000000000000006 [ 44.226206] x17: 0000000000000000 x16: 0000000000000020 x15: 0000000000000002 [ 44.234460] x14: ffff8000811a6370 x13: 0000000020000000 x12: 0000000000000000 [ 44.242710] x11: ffff8000811a6370 x10: 0000000000000144 x9 : ffff8000811fe370 [ 44.250959] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000811fe370 [ 44.259206] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 44.267457] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000002203240 [ 44.275703] Call trace: [ 44.279158] remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.283858] vfree (mm/vmalloc.c:3322) [ 44.287835] execmem_free (mm/execmem.c:70) [ 44.292347] bpf_jit_free_exec+0x10/0x1c [ 44.297283] bpf_prog_pack_free (kernel/bpf/core.c:1006) [ 44.302457] bpf_jit_binary_pack_free (kernel/bpf/core.c:1195) [ 44.307951] bpf_jit_free (include/linux/filter.h:1083 arch/arm64/net/bpf_jit_comp.c:2474) [ 44.312342] bpf_prog_free_deferred (kernel/bpf/core.c:2785) [ 44.317785] process_one_work (kernel/workqueue.c:3273) [ 44.322684] worker_thread (kernel/workqueue.c:3342 (discriminator 2) kernel/workqueue.c:3429 (discriminator 2)) [ 44.327292] kthread (kernel/kthread.c:388) [ 44.331342] ret_from_fork (arch/arm64/kernel/entry.S:861) The problem is because bpf_arch_text_copy() silently fails to write to the read-only area as a result of patch_map() faulting and the resulting -EFAULT being chucked away. Update patch_map() to use CONFIG_EXECMEM instead of CONFIG_STRICT_MODULE_RWX to check for vmalloc addresses. Link: https://lkml.kernel.org/r/20240521213813.703309-1-rppt@kernel.org Fixes: 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reported-by: Klara Modin <klarasmodin@gmail.com> Closes: https://lore.kernel.org/all/7983fbbf-0127-457c-9394-8d6e4299c685@gmail.com Tested-by: Klara Modin <klarasmodin@gmail.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysselftests/mm: compaction_test: fix bogus test success and reduce probability ↵Dev Jain1-22/+49
of OOM-killer invocation Reset nr_hugepages to zero before the start of the test. If a non-zero number of hugepages is already set before the start of the test, the following problems arise: - The probability of the test getting OOM-killed increases. Proof: The test wants to run on 80% of available memory to prevent OOM-killing (see original code comments). Let the value of mem_free at the start of the test, when nr_hugepages = 0, be x. In the other case, when nr_hugepages > 0, let the memory consumed by hugepages be y. In the former case, the test operates on 0.8 * x of memory. In the latter, the test operates on 0.8 * (x - y) of memory, with y already filled, hence, memory consumed is y + 0.8 * (x - y) = 0.8 * x + 0.2 * y > 0.8 * x. Q.E.D - The probability of a bogus test success increases. Proof: Let the memory consumed by hugepages be greater than 25% of x, with x and y defined as above. The definition of compaction_index is c_index = (x - y)/z where z is the memory consumed by hugepages after trying to increase them again. In check_compaction(), we set the number of hugepages to zero, and then increase them back; the probability that they will be set back to consume at least y amount of memory again is very high (since there is not much delay between the two attempts of changing nr_hugepages). Hence, z >= y > (x/4) (by the 25% assumption). Therefore, c_index = (x - y)/z <= (x - y)/y = x/y - 1 < 4 - 1 = 3 hence, c_index can always be forced to be less than 3, thereby the test succeeding always. Q.E.D Link: https://lkml.kernel.org/r/20240521074358.675031-4-dev.jain@arm.com Fixes: bd67d5c15cc1 ("Test compaction of mlocked memory") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sri Jayaramappa <sjayaram@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysselftests/mm: compaction_test: fix incorrect write of zero to nr_hugepagesDev Jain1-0/+2
Currently, the test tries to set nr_hugepages to zero, but that is not actually done because the file offset is not reset after read(). Fix that using lseek(). Link: https://lkml.kernel.org/r/20240521074358.675031-3-dev.jain@arm.com Fixes: bd67d5c15cc1 ("Test compaction of mlocked memory") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sri Jayaramappa <sjayaram@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysselftests/mm: compaction_test: fix bogus test success on Aarch64Dev Jain1-7/+13
Patch series "Fixes for compaction_test", v2. The compaction_test memory selftest introduces fragmentation in memory and then tries to allocate as many hugepages as possible. This series addresses some problems. On Aarch64, if nr_hugepages == 0, then the test trivially succeeds since compaction_index becomes 0, which is less than 3, due to no division by zero exception being raised. We fix that by checking for division by zero. Secondly, correctly set the number of hugepages to zero before trying to set a large number of them. Now, consider a situation in which, at the start of the test, a non-zero number of hugepages have been already set (while running the entire selftests/mm suite, or manually by the admin). The test operates on 80% of memory to avoid OOM-killer invocation, and because some memory is already blocked by hugepages, it would increase the chance of OOM-killing. Also, since mem_free used in check_compaction() is the value before we set nr_hugepages to zero, the chance that the compaction_index will be small is very high if the preset nr_hugepages was high, leading to a bogus test success. This patch (of 3): Currently, if at runtime we are not able to allocate a huge page, the test will trivially pass on Aarch64 due to no exception being raised on division by zero while computing compaction_index. Fix that by checking for nr_hugepages == 0. Anyways, in general, avoid a division by zero by exiting the program beforehand. While at it, fix a typo, and handle the case where the number of hugepages may overflow an integer. Link: https://lkml.kernel.org/r/20240521074358.675031-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20240521074358.675031-2-dev.jain@arm.com Fixes: bd67d5c15cc1 ("Test compaction of mlocked memory") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sri Jayaramappa <sjayaram@akamai.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysmailmap: update email address for Satya PriyaSatya Priya Kakitapalli1-1/+1
Update mailmap with my latest email ID, quic_c_skakit@quicinc.com is no longer active. Link: https://lkml.kernel.org/r/20240515-mailmap-update-v1-1-df4853f757a3@quicinc.com Signed-off-by: Satya Priya Kakitapalli <quic_skakitap@quicinc.com> Cc: Ajit Pandey <quic_ajipan@quicinc.com> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Imran Shaik <quic_imrashai@quicinc.com> Cc: Jagadeesh Kona <quic_jkona@quicinc.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Taniya Das <quic_tdas@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysmm/huge_memory: don't unpoison huge_zero_folioMiaohe Lin1-0/+7
When I did memory failure tests recently, below panic occurs: kernel BUG at include/linux/mm.h:1135! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 137 Comm: kswapd1 Not tainted 6.9.0-rc4-00491-gd5ce28f156fe-dirty #14 RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0 RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246 RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0 RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492 R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00 FS: 0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0 Call Trace: <TASK> do_shrink_slab+0x14f/0x6a0 shrink_slab+0xca/0x8c0 shrink_node+0x2d0/0x7d0 balance_pgdat+0x33a/0x720 kswapd+0x1f3/0x410 kthread+0xd5/0x100 ret_from_fork+0x2f/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0 RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246 RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0 RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492 R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00 FS: 0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0 The root cause is that HWPoison flag will be set for huge_zero_folio without increasing the folio refcnt. But then unpoison_memory() will decrease the folio refcnt unexpectedly as it appears like a successfully hwpoisoned folio leading to VM_BUG_ON_PAGE(page_ref_count(page) == 0) when releasing huge_zero_folio. Skip unpoisoning huge_zero_folio in unpoison_memory() to fix this issue. We're not prepared to unpoison huge_zero_folio yet. Link: https://lkml.kernel.org/r/20240516122608.22610-1-linmiaohe@huawei.com Fixes: 478d134e9506 ("mm/huge_memory: do not overkill when splitting huge_zero_page") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Xu Yu <xuyu@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 dayskasan, fortify: properly rename memintrinsicsAndrey Konovalov1-4/+18
After commit 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") and the follow-up fixes, with CONFIG_FORTIFY_SOURCE enabled, even though the compiler instruments meminstrinsics by generating calls to __asan/__hwasan_ prefixed functions, FORTIFY_SOURCE still uses uninstrumented memset/memmove/memcpy as the underlying functions. As a result, KASAN cannot detect bad accesses in memset/memmove/memcpy. This also makes KASAN tests corrupt kernel memory and cause crashes. To fix this, use __asan_/__hwasan_memset/memmove/memcpy as the underlying functions whenever appropriate. Do this only for the instrumented code (as indicated by __SANITIZE_ADDRESS__). Link: https://lkml.kernel.org/r/20240517130118.759301-1-andrey.konovalov@linux.dev Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Fixes: 51287dcb00cc ("kasan: emit different calls for instrumentable memintrinsics") Fixes: 36be5cba99f6 ("kasan: treat meminstrinsic as builtins in uninstrumented files") Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reported-by: Erhard Furtner <erhard_f@mailbox.org> Reported-by: Nico Pache <npache@redhat.com> Closes: https://lore.kernel.org/all/20240501144156.17e65021@outsider.home/ Reviewed-by: Marco Elver <elver@google.com> Tested-by: Nico Pache <npache@redhat.com> Acked-by: Nico Pache <npache@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 dayslib: add version into /proc/allocinfo outputSuren Baghdasaryan2-17/+35
Add version string and a header at the beginning of /proc/allocinfo to allow later format changes. Example output: > head /proc/allocinfo allocinfo - version: 1.0 # <size> <calls> <tag info> 0 0 init/main.c:1314 func:do_initcalls 0 0 init/do_mounts.c:353 func:mount_nodev_root 0 0 init/do_mounts.c:187 func:mount_root_generic 0 0 init/do_mounts.c:158 func:do_mount_root 0 0 init/initramfs.c:493 func:unpack_to_rootfs 0 0 init/initramfs.c:492 func:unpack_to_rootfs 0 0 init/initramfs.c:491 func:unpack_to_rootfs 512 1 arch/x86/events/rapl.c:681 func:init_rapl_pmus 128 1 arch/x86/events/rapl.c:571 func:rapl_cpu_online [akpm@linux-foundation.org: remove stray newline from struct allocinfo_private] Link: https://lkml.kernel.org/r/20240514163128.3662251-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysmm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAILHailong.Liu1-3/+2
commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc") includes support for __GFP_NOFAIL, but it presents a conflict with commit dd544141b9eb ("vmalloc: back off when the current task is OOM-killed"). A possible scenario is as follows: process-a __vmalloc_node_range(GFP_KERNEL | __GFP_NOFAIL) __vmalloc_area_node() vm_area_alloc_pages() --> oom-killer send SIGKILL to process-a if (fatal_signal_pending(current)) break; --> return NULL; To fix this, do not check fatal_signal_pending() in vm_area_alloc_pages() if __GFP_NOFAIL set. This issue occurred during OPLUS KASAN TEST. Below is part of the log -> oom-killer sends signal to process [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198 [65731.259685] [T32454] Call trace: [65731.259698] [T32454] dump_backtrace+0xf4/0x118 [65731.259734] [T32454] show_stack+0x18/0x24 [65731.259756] [T32454] dump_stack_lvl+0x60/0x7c [65731.259781] [T32454] dump_stack+0x18/0x38 [65731.259800] [T32454] mrdump_common_die+0x250/0x39c [mrdump] [65731.259936] [T32454] ipanic_die+0x20/0x34 [mrdump] [65731.260019] [T32454] atomic_notifier_call_chain+0xb4/0xfc [65731.260047] [T32454] notify_die+0x114/0x198 [65731.260073] [T32454] die+0xf4/0x5b4 [65731.260098] [T32454] die_kernel_fault+0x80/0x98 [65731.260124] [T32454] __do_kernel_fault+0x160/0x2a8 [65731.260146] [T32454] do_bad_area+0x68/0x148 [65731.260174] [T32454] do_mem_abort+0x151c/0x1b34 [65731.260204] [T32454] el1_abort+0x3c/0x5c [65731.260227] [T32454] el1h_64_sync_handler+0x54/0x90 [65731.260248] [T32454] el1h_64_sync+0x68/0x6c [65731.260269] [T32454] z_erofs_decompress_queue+0x7f0/0x2258 --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL); kernel panic by NULL pointer dereference. erofs assume kvmalloc with __GFP_NOFAIL never return NULL. [65731.260293] [T32454] z_erofs_runqueue+0xf30/0x104c [65731.260314] [T32454] z_erofs_readahead+0x4f0/0x968 [65731.260339] [T32454] read_pages+0x170/0xadc [65731.260364] [T32454] page_cache_ra_unbounded+0x874/0xf30 [65731.260388] [T32454] page_cache_ra_order+0x24c/0x714 [65731.260411] [T32454] filemap_fault+0xbf0/0x1a74 [65731.260437] [T32454] __do_fault+0xd0/0x33c [65731.260462] [T32454] handle_mm_fault+0xf74/0x3fe0 [65731.260486] [T32454] do_mem_abort+0x54c/0x1b34 [65731.260509] [T32454] el0_da+0x44/0x94 [65731.260531] [T32454] el0t_64_sync_handler+0x98/0xb4 [65731.260553] [T32454] el0t_64_sync+0x198/0x19c Link: https://lkml.kernel.org/r/20240510100131.1865-1-hailong.liu@oppo.com Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL") Signed-off-by: Hailong.Liu <hailong.liu@oppo.com> Acked-by: Michal Hocko <mhocko@suse.com> Suggested-by: Barry Song <21cnbao@gmail.com> Reported-by: Oven <liyangouwen1@oppo.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Chao Yu <chao@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Gao Xiang <xiang@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysMerge tag 'riscv-for-linus-6.10-mw2' of ↵Linus Torvalds27-184/+395
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull more RISC-V updates from Palmer Dabbelt: - The compression format used for boot images is now configurable at build time, and these formats are shown in `make help` - access_ok() has been optimized - A pair of performance bugs have been fixed in the uaccess handlers - Various fixes and cleanups, including one for the IMSIC build failure and one for the early-boot ftrace illegal NOPs bug * tag 'riscv-for-linus-6.10-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Fix early ftrace nop patching irqchip: riscv-imsic: Fixup riscv_ipi_set_virq_range() conflict riscv: selftests: Add signal handling vector tests riscv: mm: accelerate pagefault when badaccess riscv: uaccess: Relax the threshold for fast path riscv: uaccess: Allow the last potential unrolled copy riscv: typo in comment for get_f64_reg Use bool value in set_cpu_online() riscv: selftests: Add hwprobe binaries to .gitignore riscv: stacktrace: fixed walk_stackframe() ftrace: riscv: move from REGS to ARGS riscv: do not select MODULE_SECTIONS by default riscv: show help string for riscv-specific targets riscv: make image compression configurable riscv: cpufeature: Fix extension subset checking riscv: cpufeature: Fix thead vector hwcap removal riscv: rewrite __kernel_map_pages() to fix sleeping in invalid context riscv: force PAGE_SIZE linear mapping if debug_pagealloc is enabled riscv: Define TASK_SIZE_MAX for __access_ok() riscv: Remove PGDIR_SIZE_L3 and TASK_SIZE_MIN
10 daysMerge tag 'for-linus-6.10a-rc1-tag' of ↵Linus Torvalds4-27/+67
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a small cleanup in the drivers/xen/xenbus Makefile - a fix of the Xen xenstore driver to improve connecting to a late started Xenstore - an enhancement for better support of ballooning in PVH guests - a cleanup using try_cmpxchg() instead of open coding it * tag 'for-linus-6.10a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: drivers/xen: Improve the late XenStore init protocol xen/xenbus: Use *-y instead of *-objs in Makefile xen/x86: add extra pages to unpopulated-alloc if available locking/x86/xen: Use try_cmpxchg() in xen_alloc_p2m_entry()
10 daysMerge tag 'for-6.10-tag' of ↵Linus Torvalds5-12/+45
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs updates from David Sterba: "A few more updates, mostly stability fixes or user visible changes: - fix race in zoned mode during device replace that can lead to use-after-free - update return codes and lower message levels for quota rescan where it's causing false alerts - fix unexpected qgroup id reuse under some conditions - fix condition when looking up extent refs - add option norecovery (removed in 6.8), the intended replacements haven't been used and some aplications still rely on the old one - build warning fixes" * tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: re-introduce 'norecovery' mount option btrfs: fix end of tree detection when searching for data extent ref btrfs: scrub: initialize ret in scrub_simple_mirror() to fix compilation warning btrfs: zoned: fix use-after-free due to race with dev replace btrfs: qgroup: fix qgroup id collision across mounts btrfs: qgroup: update rescan message levels and error codes
10 daysMerge tag 'erofs-for-6.10-rc1-2' of ↵Linus Torvalds8-88/+65
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull more erofs updates from Gao Xiang: "The main ones are metadata API conversion to byte offsets by Al Viro. Another patch gets rid of unnecessary memory allocation out of DEFLATE decompressor. The remaining one is a trivial cleanup. - Convert metadata APIs to byte offsets - Avoid allocating DEFLATE streams unnecessarily - Some erofs_show_options() cleanup" * tag 'erofs-for-6.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: avoid allocating DEFLATE streams before mounting z_erofs_pcluster_begin(): don't bother with rounding position down erofs: don't round offset down for erofs_read_metabuf() erofs: don't align offset for erofs_read_metabuf() (simple cases) erofs: mechanically convert erofs_read_metabuf() to offsets erofs: clean up erofs_show_options()
10 daysMerge tag 'bcachefs-2024-05-24' of https://evilpiepirate.org/git/bcachefsLinus Torvalds15-57/+96
Pull bcachefs fixes from Kent Overstreet: "Nothing exciting, just syzbot fixes (except for the one FMODE_CAN_ODIRECT patch). Looks like syzbot reports have slowed down; this is all catch up from two weeks of conferences. Next hardening project is using Thomas's error injection tooling to torture test repair" * tag 'bcachefs-2024-05-24' of https://evilpiepirate.org/git/bcachefs: bcachefs: Fix race path in bch2_inode_insert() bcachefs: Ensure we're RW before journalling bcachefs: Fix shutdown ordering bcachefs: Fix unsafety in bch2_dirent_name_bytes() bcachefs: Fix stack oob in __bch2_encrypt_bio() bcachefs: Fix btree_trans leak in bch2_readahead() bcachefs: Fix bogus verify_replicas_entry() assert bcachefs: Check for subvolues with bogus snapshot/inode fields bcachefs: bch2_checksum() returns 0 for unknown checksum type bcachefs: Fix bch2_alloc_ciphers() bcachefs: Add missing guard in bch2_snapshot_has_children() bcachefs: Fix missing parens in drop_locks_do() bcachefs: Improve bch2_assert_pos_locked() bcachefs: Fix shift overflows in replicas.c bcachefs: Fix shift overflow in btree_lost_data() bcachefs: Fix ref in trans_mark_dev_sbs() error path bcachefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method bcachefs: Fix rcu splat in check_fix_ptrs()
10 daysMerge tag 'input-for-v6.10-rc0' of ↵Linus Torvalds82-168/+328
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - a change to input core to trim amount of keys data in modalias string in case when a device declares too many keys and they do not fit in uevent buffer instead of reporting an error which results in uevent not being generated at all - support for Machenike G5 Pro Controller added to xpad driver - support for FocalTech FT5452 and FT8719 added to edt-ft5x06 - support for new SPMI vibrator added to pm8xxx-vibrator driver - missing locking added to cyapa touchpad driver - removal of unused fields in various driver structures - explicit initialization of i2c_device_id::driver_data to 0 dropped from input drivers - other assorted fixes and cleanups. * tag 'input-for-v6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (24 commits) Input: edt-ft5x06 - add support for FocalTech FT5452 and FT8719 dt-bindings: input: touchscreen: edt-ft5x06: Document FT5452 and FT8719 support Input: xpad - add support for Machenike G5 Pro Controller Input: try trimming too long modalias strings Input: drop explicit initialization of struct i2c_device_id::driver_data to 0 Input: zet6223 - remove an unused field in struct zet6223_ts Input: chipone_icn8505 - remove an unused field in struct icn8505_data Input: cros_ec_keyb - remove an unused field in struct cros_ec_keyb Input: lpc32xx-keys - remove an unused field in struct lpc32xx_kscan_drv Input: matrix_keypad - remove an unused field in struct matrix_keypad Input: tca6416-keypad - remove unused struct tca6416_drv_data Input: tca6416-keypad - remove an unused field in struct tca6416_keypad_chip Input: da7280 - remove an unused field in struct da7280_haptic Input: ff-core - prefer struct_size over open coded arithmetic Input: cyapa - add missing input core locking to suspend/resume functions input: pm8xxx-vibrator: add new SPMI vibrator support dt-bindings: input: qcom,pm8xxx-vib: add new SPMI vibrator module input: pm8xxx-vibrator: refactor to support new SPMI vibrator Input: pm8xxx-vibrator - correct VIB_MAX_LEVELS calculation Input: sur40 - convert le16 to cpu before use ...
10 daysMerge tag 'sound-fix-6.10-rc1' of ↵Linus Torvalds13-106/+86
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A collection of small fixes for 6.10-rc1. Most of changes are various device-specific fixes and quirks, while there are a few small changes in ALSA core timer and module / built-in fixes" * tag 'sound-fix-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda/realtek: fix mute/micmute LEDs don't work for ProBook 440/460 G11. ALSA: core: Enable proc module when CONFIG_MODULES=y ALSA: core: Fix NULL module pointer assignment at card init ALSA: hda/realtek: Enable headset mic of JP-IK LEAP W502 with ALC897 ASoC: dt-bindings: stm32: Ensure compatible pattern matches whole string ASoC: tas2781: Fix wrong loading calibrated data sequence ASoC: tas2552: Add TX path for capturing AUDIO-OUT data ALSA: usb-audio: Fix for sampling rates support for Mbox3 Documentation: sound: Fix trailing whitespaces ALSA: timer: Set lower bound of start tick time ASoC: codecs: ES8326: solve hp and button detect issue ASoC: rt5645: mic-in detection threshold modification ASoC: Intel: sof_sdw_rt_sdca_jack_common: Use name_prefix for `-sdca` detection
10 daysMerge tag 'char-misc-6.10-rc1-fix' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc fix from Greg KH: "Here is one remaining bugfix for 6.10-rc1 that missed the 6.9-final merge window, and has been sitting in my tree and linux-next for quite a while now, but wasn't sent to you (my fault, travels...) It is a bugfix to resolve an error in the speakup code that could overflow a buffer. It has been in linux-next for a while with no reported problems" * tag 'char-misc-6.10-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: speakup: Fix sizeof() vs ARRAY_SIZE() bug
10 daysMerge tag 'tty-6.10-rc1-fixes' of ↵Linus Torvalds4-91/+181
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some small TTY and Serial driver fixes that missed the 6.9-final merge window, but have been in my tree for weeks (my fault, travel caused me to miss this) These fixes include: - more n_gsm fixes for reported problems - 8520_mtk driver fix - 8250_bcm7271 driver fix - sc16is7xx driver fix All of these have been in linux-next for weeks without any reported problems" * tag 'tty-6.10-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: serial: sc16is7xx: fix bug in sc16is7xx_set_baud() when using prescaler serial: 8250_bcm7271: use default_mux_rate if possible serial: 8520_mtk: Set RTS on shutdown for Rx in-band wakeup tty: n_gsm: fix missing receive state reset after mode switch tty: n_gsm: fix possible out-of-bounds in gsm0_receive()
10 daysMerge tag 'hardening-v6.10-rc1-fixes' of ↵Linus Torvalds3-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - loadpin: Prevent SECURITY_LOADPIN_ENFORCE=y without module decompression (Stephen Boyd) - ubsan: Restore dependency on ARCH_HAS_UBSAN - kunit/fortify: Fix memcmp() test to be amplitude agnostic * tag 'hardening-v6.10-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kunit/fortify: Fix memcmp() test to be amplitude agnostic ubsan: Restore dependency on ARCH_HAS_UBSAN loadpin: Prevent SECURITY_LOADPIN_ENFORCE=y without module decompression
10 daysMerge tag 'trace-tracefs-v6.10' of ↵Linus Torvalds2-155/+116
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracefs/eventfs updates from Steven Rostedt: "Bug fixes: - The eventfs directories need to have unique inode numbers. Make sure that they do not get the default file inode number. - Update the inode uid and gid fields on remount. When a remount happens where a uid and/or gid is specified, all the tracefs files and directories should get the specified uid and/or gid. But this can be sporadic when some uids were assigned already. There's already a list of inodes that are allocated. Just update their uid and gid fields at the time of remount. - Update the eventfs_inodes on remount from the top level "events" descriptor. There was a bug where not all the eventfs files or directories where getting updated on remount. One fix was to clear the SAVED_UID/GID flags from the inode list during the iteration of the inodes during the remount. But because the eventfs inodes can be freed when the last referenced is released, not all the eventfs_inodes were being updated. This lead to the ownership selftest to fail if it was run a second time (the first time would leave eventfs_inodes with no corresponding tracefs_inode). Instead, for eventfs_inodes, only process the "events" eventfs_inode from the list iteration, as it is guaranteed to have a tracefs_inode (it's never freed while the "events" directory exists). As it has a list of its children, and the children have a list of their children, just iterate all the eventfs_inodes from the "events" descriptor and it is guaranteed to get all of them. - Clear the EVENT_INODE flag from the tracefs_drop_inode() callback. Currently the EVENTFS_INODE FLAG is cleared in the tracefs_d_iput() callback. But this is the wrong location. The iput() callback is called when the last reference to the dentry inode is hit. There could be a case where two dentry's have the same inode, and the flag will be cleared prematurely. The flag needs to be cleared when the last reference of the inode is dropped and that happens in the inode's drop_inode() callback handler. Cleanups: - Consolidate the creation of a tracefs_inode for an eventfs_inode A tracefs_inode is created for both files and directories of the eventfs system. It is open coded. Instead, consolidate it into a single eventfs_get_inode() function call. - Remove the eventfs getattr and permission callbacks. The permissions for the eventfs files and directories are updated when the inodes are created, on remount, and when the user sets them (via setattr). The inodes hold the current permissions so there is no need to have custom getattr or permissions callbacks as they will more likely cause them to be incorrect. The inode's permissions are updated when they should be updated. Remove the getattr and permissions inode callbacks. - Do not update eventfs_inode attributes on creation of inodes. The eventfs_inodes attribute field is used to store the permissions of the directories and files for when their corresponding inodes are freed and are created again. But when the creation of the inodes happen, the eventfs_inode attributes are recalculated. The recalculation should only happen when the permissions change for a given file or directory. Currently, the attribute changes are just being set to their current files so this is not a bug, but it's unnecessary and error prone. Stop doing that. - The events directory inode is created once when the events directory is created and deleted when it is deleted. It is now updated on remount and when the user changes the permissions. There's no need to use the eventfs_inode of the events directory to store the events directory permissions. But using it to store the default permissions for the files within the directory that have not been updated by the user can simplify the code" * tag 'trace-tracefs-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Do not use attributes for events directory eventfs: Cleanup permissions in creation of inodes eventfs: Remove getattr and permission callbacks eventfs: Consolidate the eventfs_inode update in eventfs_get_inode() tracefs: Clear EVENT_INODE flag in tracefs_drop_inode() eventfs: Update all the eventfs_inodes from the events descriptor tracefs: Update inode permissions on remount eventfs: Keep the directories from having the same inode number as files
10 daysgenirq/irqdesc: Prevent use-after-free in irq_find_at_or_after()dicken.ding1-1/+4
irq_find_at_or_after() dereferences the interrupt descriptor which is returned by mt_find() while neither holding sparse_irq_lock nor RCU read lock, which means the descriptor can be freed between mt_find() and the dereference: CPU0 CPU1 desc = mt_find() delayed_free_desc(desc) irq_desc_get_irq(desc) The use-after-free is reported by KASAN: Call trace: irq_get_next_irq+0x58/0x84 show_stat+0x638/0x824 seq_read_iter+0x158/0x4ec proc_reg_read_iter+0x94/0x12c vfs_read+0x1e0/0x2c8 Freed by task 4471: slab_free_freelist_hook+0x174/0x1e0 __kmem_cache_free+0xa4/0x1dc kfree+0x64/0x128 irq_kobj_release+0x28/0x3c kobject_put+0xcc/0x1e0 delayed_free_desc+0x14/0x2c rcu_do_batch+0x214/0x720 Guard the access with a RCU read lock section. Fixes: 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management") Signed-off-by: dicken.ding <dicken.ding@mediatek.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240524091739.31611-1-dicken.ding@mediatek.com
10 daysfs/ntfs3: Break dir enumeration if directory contents errorKonstantin Komarov1-0/+1
If we somehow attempt to read beyond the directory size, an error is supposed to be returned. However, in some cases, read requests do not stop and instead enter into a loop. To avoid this, we set the position in the directory to the end. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: stable@vger.kernel.org
10 daysfs/ntfs3: Fix case when index is reused during tree transformationKonstantin Komarov1-0/+6
In most cases when adding a cluster to the directory index, they are placed at the end, and in the bitmap, this cluster corresponds to the last bit. The new directory size is calculated as follows: data_size = (u64)(bit + 1) << indx->index_bits; In the case of reusing a non-final cluster from the index, data_size is calculated incorrectly, resulting in the directory size differing from the actual size. A check for cluster reuse has been added, and the size update is skipped. Fixes: 82cae269cfa95 ("fs/ntfs3: Add initialization of super block") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: stable@vger.kernel.org
11 daysselftest mm/mseal read-only elf memory segmentJeff Xu4-36/+275
Sealing read-only of elf mapping so it can't be changed by mprotect. [jeffxu@chromium.org: style change] Link: https://lkml.kernel.org/r/20240416220944.2481203-2-jeffxu@chromium.org [amer.shanawany@gmail.com: fix linker error for inline function] Link: https://lkml.kernel.org/r/20240420202346.546444-1-amer.shanawany@gmail.com [jeffxu@chromium.org: fix compile warning] Link: https://lkml.kernel.org/r/20240420003515.345982-2-jeffxu@chromium.org [jeffxu@chromium.org: fix arm build] Link: https://lkml.kernel.org/r/20240502225331.3806279-2-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-6-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Signed-off-by: Amer Al Shanawany <amer.shanawany@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysmseal: add documentationJeff Xu2-0/+200
Add documentation for mseal(). Link: https://lkml.kernel.org/r/20240415163527.626541-5-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysselftest mm/mseal memory sealingJeff Xu3-0/+1838
selftest for memory sealing change in mmap() and mseal(). Link: https://lkml.kernel.org/r/20240415163527.626541-4-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysmseal: add mseal syscallJeff Xu8-1/+432
The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. Following input during RFC are incooperated into this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Linus Torvalds: assisting in defining system call signature and scope. Liam R. Howlett: perf optimization. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. Finally, the idea that inspired this patch comes from Stephen Röttger's work in Chrome V8 CFI. [jeffxu@chromium.org: add branch prediction hint, per Pedro] Link: https://lkml.kernel.org/r/20240423192825.1273679-2-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-3-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysmseal: wire up mseal syscallJeff Xu19-2/+23
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysMerge tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds16-68/+129
Pull NFS client updates from Trond Myklebust: "Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default" * tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix undefined behavior in nfs_block_bits() pNFS: rework pnfs_generic_pg_check_layout to check IO range pNFS/filelayout: check layout segment range pNFS/filelayout: fixup pNfs allocation modes rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL NFS: Don't enable NFS v2 by default NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS sunrpc: fix NFSACL RPC retry on soft mount SUNRPC: fix handling expired GSS context nfs: keep server info for remounts NFSv4: Fixup smatch warning for ambiguous return NFS: make sure lock/nolock overriding local_lock mount option NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly. pNFS/filelayout: Specify the layout segment range in LAYOUTGET pNFS/filelayout: Remove the whole file layout requirement
11 daysMerge tag 'block-6.10-20240523' of git://git.kernel.dk/linuxLinus Torvalds26-186/+314
Pull more block updates from Jens Axboe: "Followup block updates, mostly due to NVMe being a bit late to the party. But nothing major in there, so not a big deal. In detail, this contains: - NVMe pull request via Keith: - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi) - ublk DMA alignment fix (me) - null_blk sparse warning fixes (Bart) - Discard support for brd (Keith) - blk-cgroup list corruption fixes (Ming) - blk-cgroup stat propagation fix (Waiman) - Regression fix for plugging stall with md (Yu) - Misc fixes or cleanups (David, Jeff, Justin)" * tag 'block-6.10-20240523' of git://git.kernel.dk/linux: (24 commits) null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues' blk-throttle: remove unused struct 'avg_latency_bucket' block: fix lost bio for plug enabled bio based device block: t10-pi: add MODULE_DESCRIPTION() blk-mq: add helper for checking if one CPU is mapped to specified hctx blk-cgroup: Properly propagate the iostat update up the hierarchy blk-cgroup: fix list corruption from reorder of WRITE ->lqueued blk-cgroup: fix list corruption from resetting io stat cdrom: rearrange last_media_change check to avoid unintentional overflow nbd: Fix signal handling nbd: Remove a local variable from nbd_send_cmd() nbd: Improve the documentation of the locking assumptions nbd: Remove superfluous casts nbd: Use NULL to represent a pointer brd: implement discard support null_blk: Fix two sparse warnings ublk_drv: set DMA alignment mask to 3 nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures ...
11 daysMerge tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linuxLinus Torvalds2-6/+6
Pull io_uring fixes from Jens Axboe: "Single fix here for a regression in 6.9, and then a simple cleanup removing some dead code" * tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linux: io_uring: remove checks for NULL 'sq_offset' io_uring/sqpoll: ensure that normal task_work is also run timely
11 daysMerge tag 'regulator-fix-v6.10-merge-window' of ↵Linus Torvalds6-77/+48
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A bunch of fixes that came in during the merge window. Matti found several issues with some of the more complexly configured Rohm regulators and the helpers they use and there were some errors in the specification of tps6594 when regulators are grouped together" * tag 'regulator-fix-v6.10-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: tps6594-regulator: Correct multi-phase configuration regulator: tps6287x: Force writing VSEL bit regulator: pickable ranges: don't always cache vsel regulator: rohm-regulator: warn if unsupported voltage is set regulator: bd71828: Don't overwrite runtime voltages
11 daysMerge tag 'regmap-fix-v6.10-merge-window' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap fix from Mark Brown: "Guenter ran with memory sanitisers and found an issue in the new KUnit tests that Richard added where an assumption in older test code was exposed, this was fixed quickly by Richard" * tag 'regmap-fix-v6.10-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: kunit: Fix array overflow in stride() test
11 daysgenirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offlineDongli Zhang2-11/+14
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of interrupt affinity reconfiguration via procfs. Instead, the change is deferred until the next instance of the interrupt being triggered on the original CPU. When the interrupt next triggers on the original CPU, the new affinity is enforced within __irq_move_irq(). A vector is allocated from the new CPU, but the old vector on the original CPU remains and is not immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming process is delayed until the next trigger of the interrupt on the new CPU. Upon the subsequent triggering of the interrupt on the new CPU, irq_complete_move() adds a task to the old CPU's vector_cleanup list if it remains online. Subsequently, the timer on the old CPU iterates over its vector_cleanup list, reclaiming old vectors. However, a rare scenario arises if the old CPU is outgoing before the interrupt triggers again on the new CPU. In that case irq_force_complete_move() is not invoked on the outgoing CPU to reclaim the old apicd->prev_vector because the interrupt isn't currently affine to the outgoing CPU, and irq_needs_fixup() returns false. Even though __vector_schedule_cleanup() is later called on the new CPU, it doesn't reclaim apicd->prev_vector; instead, it simply resets both apicd->move_in_progress and apicd->prev_vector to 0. As a result, the vector remains unreclaimed in vector_matrix, leading to a CPU vector leak. To address this issue, move the invocation of irq_force_complete_move() before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the interrupt is currently or used to be affine to the outgoing CPU. Additionally, reclaim the vector in __vector_schedule_cleanup() as well, following a warning message, although theoretically it should never see apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU. Fixes: f0383c24b485 ("genirq/cpuhotplug: Add support for cleaning up move in progress") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
11 daysMerge tag 'net-6.10-rc1' of ↵Linus Torvalds26-245/+243
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ...
11 daysMerge tag 'trace-fixes-v6.10' of ↵Linus Torvalds3-13/+15
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Minor last minute fixes: - Fix a very tight race between the ring buffer readers and resizing the ring buffer - Correct some stale comments in the ring buffer code - Fix kernel-doc in the rv code - Add a MODULE_DESCRIPTION to preemptirq_delay_test" * tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Update rv_en(dis)able_monitor doc to match kernel-doc tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test ring-buffer: Fix a race between readers and resize checks ring-buffer: Correct stale comments related to non-consuming readers
11 daysMerge tag 'trace-tools-v6.10-2' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing tool fix from Steven Rostedt: "Fix printf format warnings in latency-collector. Use the printf format string with %s to take a string instead of taking in a string directly" * tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tools/latency-collector: Fix -Wformat-security compile warns
11 daysMerge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds147-808/+794
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
11 daysMerge tag 'sparc-for-6.10-tag1' of ↵Linus Torvalds24-114/+76
git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc Pull sparc updates from Andreas Larsson: - Avoid on-stack cpumask variables in a number of places - Move struct termio to asm/termios.h, matching other architectures and allowing certain user space applications to build also for sparc - Fix missing prototype warnings for sparc64 - Fix version generation warnings for sparc32 - Fix bug where non-consecutive CPU IDs lead to some CPUs not starting - Simplification using swap and cleanup using NULL for pointer - Convert sparc parport and chmc drivers to use remove callbacks returning void * tag 'sparc-for-6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc: sparc/leon: Remove on-stack cpumask var sparc/pci_msi: Remove on-stack cpumask var sparc/of: Remove on-stack cpumask var sparc/irq: Remove on-stack cpumask var sparc/srmmu: Remove on-stack cpumask var sparc: chmc: Convert to platform remove callback returning void sparc: parport: Convert to platform remove callback returning void sparc: Compare pointers to NULL instead of 0 sparc: Use swap() to fix Coccinelle warning sparc32: Fix version generation failed warnings sparc64: Fix number of online CPUs sparc64: Fix prototype warning for sched_clock sparc64: Fix prototype warnings in adi_64.c sparc64: Fix prototype warning for dma_4v_iotsb_bind sparc64: Fix prototype warning for uprobe_trap sparc64: Fix prototype warning for alloc_irqstack_bootmem sparc64: Fix prototype warning for vmemmap_free sparc64: Fix prototype warnings in traps_64.c sparc64: Fix prototype warning for init_vdso_image sparc: move struct termio to asm/termios.h
11 daysMerge tag 'arm64-fixes' of ↵Linus Torvalds12-25/+125
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The major fix here is for a filesystem corruption issue reported on Apple M1 as a result of buggy management of the floating point register state introduced in 6.8. I initially reverted one of the offending patches, but in the end Ard cooked a proper fix so there's a revert+reapply in the series. Aside from that, we've got some CPU errata workarounds and misc other fixes. - Fix broken FP register state tracking which resulted in filesystem corruption when dm-crypt is used - Workarounds for Arm CPU errata affecting the SSBS Spectre mitigation - Fix lockdep assertion in DMC620 memory controller PMU driver - Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is disabled" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/fpsimd: Avoid erroneous elide of user state reload Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY perf/arm-dmc620: Fix lockdep assert in ->event_init() Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: errata: Add workaround for Arm errata 3194386 and 3312417 arm64: cputype: Add Neoverse-V3 definitions arm64: cputype: Add Cortex-X4 definitions arm64: barrier: Restore spec_bar() macro
11 daysMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds40-175/+353
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
11 daysirqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflictPalmer Dabbelt1-1/+1
There was a semantic conflict between 21a8f8a0eb35 ("irqchip: Add RISC-V incoming MSI controller early driver") and dc892fb44322 ("riscv: Use IPIs for remote cache/TLB flushes by default") due to an API change. This manifests as a build failure post-merge. Fixes: 0bfbc914d943 ("Merge tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux") Reported-by: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240522184953.28531-3-palmer@rivosinc.com Link: https://lore.kernel.org/all/mhng-10b71228-cf3e-42ca-9abf-5464b15093f1@palmer-ri-x1c9/
11 daysriscv: Fix early ftrace nop patchingAlexandre Ghiti2-0/+9
Commit c97bf629963e ("riscv: Fix text patching when IPI are used") converted ftrace_make_nop() to use patch_insn_write() which does not emit any icache flush relying entirely on __ftrace_modify_code() to do that. But we missed that ftrace_make_nop() was called very early directly when converting mcount calls into nops (actually on riscv it converts 2B nops emitted by the compiler into 4B nops). This caused crashes on multiple HW as reported by Conor and Björn since the booting core could have half-patched instructions in its icache which would trigger an illegal instruction trap: fix this by emitting a local flush icache when early patching nops. Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reported-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20240523115134.70380-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>