summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-09NFS: Fix an off by one in root_nfs_cat()Christophe JAILLET1-2/+2
The intent is to check if 'dest' is truncated or not. So, >= should be used instead of >, because strlcat() returns the length of 'dest' and 'src' excluding the trailing NULL. Fixes: 56463e50d1fc ("NFS: Use super.c for NFSROOT mount option parsing") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: make the rpc_stat per net namespaceJosef Bacik4-4/+9
Now that we're exposing the rpc stats on a per-network namespace basis, move this struct into struct nfs_net and use that to make sure only the per-network namespace stats are exposed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: expose /proc/net/sunrpc/nfs in net namespacesJosef Bacik1-4/+2
We're using nfs mounts inside of containers in production and noticed that the nfs stats are not exposed in /proc. This is a problem for us as we use these stats for monitoring, and have to do this awkward bind mount from the main host into the container in order to get to these states. Add the rpc_proc_register call to the pernet operations entry and exit points so these stats can be exposed inside of network namespaces. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09sunrpc: add a struct rpc_stats arg to rpc_create_argsJosef Bacik2-1/+5
We want to be able to have our rpc stats handled in a per network namespace manner, so add an option to rpc_create_args to specify a different rpc_stats struct instead of using the one on the rpc_program. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: remove unused NFS_CALL macroJeff Layton1-7/+0
Nothing uses this, and thank goodness, as the syntax looks horrid. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFSv4.1: add tracepoint to trunked nfs4_exchange_id callsOlga Kornievskaia3-2/+35
Add a tracepoint to track when the client sends EXCHANGE_ID to test a new transport for session trunking. nfs4_detect_session_trunking() tests for trunking and returns EINVAL if trunking can't be done, add EINVAL mapping to show_nfs4_status() in tracepoints. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interruptDave Wysochanski1-7/+2
The loop inside nfs_netfs_issue_read() currently does not disable interrupts while iterating through pages in the xarray to submit for NFS read. This is not safe though since after taking xa_lock, another page in the mapping could be processed for writeback inside an interrupt, and deadlock can occur. The fix is simple and clean if we use xa_for_each_range(), which handles the iteration with RCU while reducing code complexity. The problem is easily reproduced with the following test: mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 echo 3 > /proc/sys/vm/drop_caches dd if=/mnt/nfs/file1.bin of=/dev/null umount /mnt/nfs On the console with a lockdep-enabled kernel a message similar to the following will be seen: ================================ WARNING: inconsistent lock state 6.7.0-lockdbg+ #10 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at: nfs_netfs_issue_read+0x1b2/0x4b0 [nfs] {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x144/0x380 _raw_spin_lock_irqsave+0x4e/0xa0 __folio_end_writeback+0x17e/0x5c0 folio_end_writeback+0x93/0x1b0 iomap_finish_ioend+0xeb/0x6a0 blk_update_request+0x204/0x7f0 blk_mq_end_request+0x30/0x1c0 blk_complete_reqs+0x7e/0xa0 __do_softirq+0x113/0x544 __irq_exit_rcu+0xfe/0x120 irq_exit_rcu+0xe/0x20 sysvec_call_function_single+0x6f/0x90 asm_sysvec_call_function_single+0x1a/0x20 pv_native_safe_halt+0xf/0x20 default_idle+0x9/0x20 default_idle_call+0x67/0xa0 do_idle+0x2b5/0x300 cpu_startup_entry+0x34/0x40 start_secondary+0x19d/0x1c0 secondary_startup_64_no_verify+0x18f/0x19b irq event stamp: 176891 hardirqs last enabled at (176891): [<ffffffffa67a0be4>] _raw_spin_unlock_irqrestore+0x44/0x60 hardirqs last disabled at (176890): [<ffffffffa67a0899>] _raw_spin_lock_irqsave+0x79/0xa0 softirqs last enabled at (176646): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120 softirqs last disabled at (176633): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&xa->xa_lock#4); <Interrupt> lock(&xa->xa_lock#4); *** DEADLOCK *** 2 locks held by test5/1708: #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at: nfs_start_io_read+0x28/0x90 [nfs] #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: page_cache_ra_unbounded+0xa4/0x280 stack backtrace: CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+ Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 Call Trace: dump_stack_lvl+0x5b/0x90 mark_lock+0xb3f/0xd20 __lock_acquire+0x77b/0x3360 _raw_spin_lock+0x34/0x80 nfs_netfs_issue_read+0x1b2/0x4b0 [nfs] netfs_begin_read+0x77f/0x980 [netfs] nfs_netfs_readahead+0x45/0x60 [nfs] nfs_readahead+0x323/0x5a0 [nfs] read_pages+0xf3/0x5c0 page_cache_ra_unbounded+0x1c8/0x280 filemap_get_pages+0x38c/0xae0 filemap_read+0x206/0x5e0 nfs_file_read+0xb7/0x140 [nfs] vfs_read+0x2a9/0x460 ksys_read+0xb7/0x140 Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned intDai Ngo1-1/+1
When the NFS client is under extreme load the rpc_wait_queue.qlen counter can be overflowed. Here is an instant of the backlog queue overflow in a real world environment shown by drgn helper: rpc_task_stats(rpc_clnt): ------------------------- rpc_clnt: 0xffff92b65d2bae00 rpc_xprt: 0xffff9275db64f000 Queue: sending[64887] pending[524] backlog[30441] binding[0] XMIT task: 0xffff925c6b1d8e98 WRITE: 750654 __dta_call_status_580: 65463 __dta_call_transmit_status_579: 1 call_reserveresult: 685189 nfs_client_init_is_complete: 1 COMMIT: 584 call_reserveresult: 573 __dta_call_status_580: 11 ACCESS: 1 __dta_call_status_580: 1 GETATTR: 10 __dta_call_status_580: 4 call_reserveresult: 6 751249 tasks for server 111.222.333.444 Total tasks: 751249 count_rpc_wait_queues(xprt): ---------------------------- **** rpc_xprt: 0xffff9275db64f000 num_reqs: 65511 wait_queue: xprt_binding[0] cnt: 0 wait_queue: xprt_binding[1] cnt: 0 wait_queue: xprt_binding[2] cnt: 0 wait_queue: xprt_binding[3] cnt: 0 rpc_wait_queue[xprt_binding].qlen: 0 maxpriority: 0 wait_queue: xprt_sending[0] cnt: 0 wait_queue: xprt_sending[1] cnt: 64887 wait_queue: xprt_sending[2] cnt: 0 wait_queue: xprt_sending[3] cnt: 0 rpc_wait_queue[xprt_sending].qlen: 64887 maxpriority: 3 wait_queue: xprt_pending[0] cnt: 524 wait_queue: xprt_pending[1] cnt: 0 wait_queue: xprt_pending[2] cnt: 0 wait_queue: xprt_pending[3] cnt: 0 rpc_wait_queue[xprt_pending].qlen: 524 maxpriority: 0 wait_queue: xprt_backlog[0] cnt: 0 wait_queue: xprt_backlog[1] cnt: 685801 wait_queue: xprt_backlog[2] cnt: 0 wait_queue: xprt_backlog[3] cnt: 0 rpc_wait_queue[xprt_backlog].qlen: 30441 maxpriority: 3 [task cnt mismatch] There is no effect on operations when this overflow occurs. However it causes confusion when trying to diagnose the performance problem. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29nfs: fix regression in handling of fsc= option in NFSv4Chen Hanxiao1-0/+24
Setting the uniquifier for fscache via the fsc= mount option is currently broken in NFSv4. Fix this by passing fscache_uniq to root_fc if possible. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29pnfs/filelayout: add tracepoint to getdeviceinfoOlga Kornievskaia3-0/+32
While decoding filelayout getdeviceinfo received, print out the information about the location of data servers (IPs). Generic getdeviceinfo tracepoints prints the MDS's ip for the dstaddr. In this patch, separate the MDS's address from the DS's addresses. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29NFSv4.2: fix listxattr maximum XDR buffer sizeJorge Mora1-2/+5
Switch order of operations to avoid creating a short XDR buffer: e.g., buflen = 12, old xdrlen = 12, new xdrlen = 20. Having a short XDR buffer leads to lxa_maxcount be a few bytes less than what is needed to retrieve the whole list when using a buflen as returned by a call with size = 0: buflen = listxattr(path, NULL, 0); buf = malloc(buflen); buflen = listxattr(path, buf, buflen); For a file with one attribute (name = '123456'), the first call with size = 0 will return buflen = 12 ('user.123456\x00'). The second call with size = 12, sends LISTXATTRS with lxa_maxcount = 12 + 8 (cookie) + 4 (array count) = 24. The XDR buffer needs 8 (cookie) + 4 (array count) + 4 (name count) + 6 (name len) + 2 (padding) + 4 (eof) = 28 which is 4 bytes shorter than the lxa_maxcount provided in the call. Fixes: 04a5da690e8f ("NFSv4.2: define limits and sizes for user xattr handling") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29NFSv4.2: fix nfs4_listxattr kernel BUG at mm/usercopy.c:102Jorge Mora1-6/+10
A call to listxattr() with a buffer size = 0 returns the actual size of the buffer needed for a subsequent call. When size > 0, nfs4_listxattr() does not return an error because either generic_listxattr() or nfs4_listxattr_nfs4_label() consumes exactly all the bytes then size is 0 when calling nfs4_listxattr_nfs4_user() which then triggers the following kernel BUG: [ 99.403778] kernel BUG at mm/usercopy.c:102! [ 99.404063] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 99.408463] CPU: 0 PID: 3310 Comm: python3 Not tainted 6.6.0-61.fc40.aarch64 #1 [ 99.415827] Call trace: [ 99.415985] usercopy_abort+0x70/0xa0 [ 99.416227] __check_heap_object+0x134/0x158 [ 99.416505] check_heap_object+0x150/0x188 [ 99.416696] __check_object_size.part.0+0x78/0x168 [ 99.416886] __check_object_size+0x28/0x40 [ 99.417078] listxattr+0x8c/0x120 [ 99.417252] path_listxattr+0x78/0xe0 [ 99.417476] __arm64_sys_listxattr+0x28/0x40 [ 99.417723] invoke_syscall+0x78/0x100 [ 99.417929] el0_svc_common.constprop.0+0x48/0xf0 [ 99.418186] do_el0_svc+0x24/0x38 [ 99.418376] el0_svc+0x3c/0x110 [ 99.418554] el0t_64_sync_handler+0x120/0x130 [ 99.418788] el0t_64_sync+0x194/0x198 [ 99.418994] Code: aa0003e3 d000a3e0 91310000 97f49bdb (d4210000) Issue is reproduced when generic_listxattr() returns 'system.nfs4_acl', thus calling lisxattr() with size = 16 will trigger the bug. Add check on nfs4_listxattr() to return ERANGE error when it is called with size > 0 and the return value is greater than size. Fixes: 012a211abd5d ("NFSv4.2: hook in the user extended attribute handlers") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29NFS: Display the "fsc=" mount option if it is setChen Hanxiao1-1/+9
With this patch, mount command will show fsc=xxx if set: If -o fsc=6666 clientaddr=192.168.122.208,fsc=6666,local_lock=none If only -o fsc clientaddr=192.168.122.208,fsc,local_lock=none Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29SUNRPC: add xrpt id to rpc_stats_latency tracepointOlga Kornievskaia1-2/+6
In order to get the latency per xprt under the same clientid this patch adds xprt_id to the tracepoint output. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-29net: sunrpc: Fix an off by one in rpc_sockaddr2uaddr()Christophe JAILLET1-2/+2
The intent is to check if the strings' are truncated or not. So, >= should be used instead of >, because strlcat() and snprintf() return the length of the output, excluding the trailing NULL. Fixes: a02d69261134 ("SUNRPC: Provide functions for managing universal addresses") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-28SUNRPC: Add a transport callback to handle dequeuing of an RPC requestTrond Myklebust3-0/+20
Add a transport level callback to allow it to handle the consequences of dequeuing the request that was in the process of being transmitted. For something like a TCP connection, we may need to disconnect if the request was partially transmitted. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-28SUNRPC: Don't try to send when the connection is shutting downTrond Myklebust1-0/+3
If the connection has been scheduled to shut down, we must assume that the socket is not in a state to accept further transmissions until the connection has been re-established. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-28SUNRPC: Don't retry using the same source port if connection failedTrond Myklebust1-1/+9
If the TCP connection attempt fails without ever establishing a connection, then assume the problem may be the server is rejecting us due to port reuse. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-02-26Linux 6.8-rc6v6.8-rc6Linus Torvalds1-1/+1
2024-02-26Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefsLinus Torvalds7-22/+25
Pull bcachefs fixes from Kent Overstreet: "Some more mostly boring fixes, but some not User reported ones: - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty performance bug; user reported an untar initially taking two seconds and then ~2 minutes - kill a __GFP_NOFAIL in the buffered read path; this was a leftover from the trickier fix to kill __GFP_NOFAIL in readahead, where we can't return errors (and have to silently truncate the read ourselves). bcachefs can't use GFP_NOFAIL for folio state unlike iomap based filesystems because our folio state is just barely too big, 2MB hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL. additionally, the flags argument was just buggy, we weren't supplying GFP_KERNEL previously (!)" * tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs: bcachefs: fix bch2_save_backtrace() bcachefs: Fix check_snapshot() memcpy bcachefs: Fix bch2_journal_flush_device_pins() bcachefs: fix iov_iter count underflow on sub-block dio read bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree bcachefs: Kill __GFP_NOFAIL in buffered read path bcachefs: fix backpointer_to_text() when dev does not exist
2024-02-25bcachefs: fix bch2_save_backtrace()Kent Overstreet1-1/+1
Missed a call in the previous fix. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25Merge tag 'docs-6.8-fixes3' of git://git.lwn.net/linuxLinus Torvalds2-6/+10
Pull two documentation build fixes from Jonathan Corbet: - The XFS online fsck documentation uses incredibly deeply nested subsection and list nesting; that broke the PDF docs build. Tweak a parameter to tell LaTeX to allow the deeper nesting. - Fix a 6.8 PDF-build regression * tag 'docs-6.8-fixes3' of git://git.lwn.net/linux: docs: translations: use attribute to store current language docs: Instruct LaTeX to cope with deeper nesting
2024-02-25Merge tag 'usb-6.8-rc6' of ↵Linus Torvalds12-26/+75
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some small USB fixes for 6.8-rc6 to resolve some reported problems. These include: - regression fixes with typec tpcm code as reported by many - cdnsp and cdns3 driver fixes - usb role setting code bugfixes - build fix for uhci driver - ncm gadget driver bugfix - MAINTAINERS entry update All of these have been in linux-next all week with no reported issues and there is at least one fix in here that is in Thorsten's regression list that is being tracked" * tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: typec: tpcm: Fix issues with power being removed during reset MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role" usb: gadget: omap_udc: fix USB gadget regression on Palm TE usb: dwc3: gadget: Don't disconnect if not started usb: cdns3: fix memory double free when handle zero packet usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable() usb: roles: don't get/set_role() when usb_role_switch is unregistered usb: roles: fix NULL pointer issue when put module's reference usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers usb: cdnsp: blocked some cdns3 specific code usb: uhci-grlib: Explicitly include linux/platform_device.h
2024-02-25Merge tag 'tty-6.8-rc6' of ↵Linus Torvalds3-34/+38
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver fixes from Greg KH: "Here are three small serial/tty driver fixes for 6.8-rc6 that resolve the following reported errors: - riscv hvc console driver fix that was reported by many - amba-pl011 serial driver fix for RS485 mode - stm32 serial driver fix for RS485 mode All of these have been in linux-next all week with no reported problems" * tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: serial: amba-pl011: Fix DMA transmission in RS485 mode serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled tty: hvc: Don't enable the RISC-V SBI console by default
2024-02-25Merge tag 'x86_urgent_for_v6.8_rc6' of ↵Linus Torvalds13-46/+112
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure clearing CPU buffers using VERW happens at the latest possible point in the return-to-userspace path, otherwise memory accesses after the VERW execution could cause data to land in CPU buffers again * tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/VMX: Move VERW closer to VMentry for MDS mitigation KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key x86/entry_32: Add VERW just before userspace transition x86/entry_64: Add VERW just before userspace transition x86/bugs: Add asm helpers for executing VERW
2024-02-25Merge tag 'irq_urgent_for_v6.8_rc6' of ↵Linus Torvalds4-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel from silently failing to set it up - Do not call bus_get_dev_root() for the mbigen irqchip as it always returns NULL - use NULL directly - Fix hardware interrupt number truncation when assigning MSI interrupts - Correct sending end-of-interrupt messages to disabled interrupts lines on RISC-V PLIC * tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Do not assume vPE tables are preallocated irqchip/mbigen: Don't use bus_get_dev_root() to find the parent PCI/MSI: Prevent MSI hardware interrupt number truncation irqchip/sifive-plic: Enable interrupt if needed before EOI
2024-02-25Merge tag 'erofs-for-6.8-rc6-fixes' of ↵Linus Torvalds1-14/+14
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: - Fix page refcount leak when looking up specific inodes introduced by metabuf reworking * tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix refcount on the metabuf used for inode lookup
2024-02-25Merge tag 'pull-fixes.pathwalk-rcu-2' of ↵Linus Torvalds22-63/+88
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull RCU pathwalk fixes from Al Viro: "We still have some races in filesystem methods when exposed to RCU pathwalk. This series is a result of code audit (the second round of it) and it should deal with most of that stuff. Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate(). Up to maintainers (a note for NTFS folks - when documentation says that a method may not block, it *does* imply that blocking allocations are to be avoided. Really)" [ More explanations for people who aren't familiar with the vagaries of RCU path walking: most of it is hidden from filesystems, but if a filesystem actively participates in the low-level path walking it needs to make sure the fields involved in that walk are RCU-safe. That "actively participate in low-level path walking" includes things like having its own ->d_hash()/->d_compare() routines, or by having its own directory permission function that doesn't just use the common helpers. Having a ->d_revalidate() function will also have this issue. Note that instead of making everything RCU safe you can also choose to abort the RCU pathwalk if your operation cannot be done safely under RCU, but that obviously comes with a performance penalty. One common pattern is to allow the simple cases under RCU, and abort only if you need to do something more complicated. So not everything needs to be RCU-safe, and things like the inode etc that the VFS itself maintains obviously already are. But these fixes tend to be about properly RCU-delaying things like ->s_fs_info that are maintained by the filesystem and that got potentially released too early. - Linus ] * tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ext4_get_link(): fix breakage in RCU mode cifs_get_link(): bail out in unsafe case fuse: fix UAF in rcu pathwalks procfs: make freeing proc_fs_info rcu-delayed procfs: move dropping pde and pid from ->evict_inode() to ->free_inode() nfs: fix UAF on pathwalk running into umount nfs: make nfs_set_verifier() safe for use in RCU pathwalk afs: fix __afs_break_callback() / afs_drop_open_mmap() race hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper affs: free affs_sb_info with kfree_rcu() rcu pathwalk: prevent bogus hard errors from may_lookup() fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
2024-02-25Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds3-4/+9
Pull vfs fixes from Al Viro: "A couple of fixes - revert of regression from this cycle and a fix for erofs failure exit breakage (had been there since way back)" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: erofs: fix handling kern_mount() failure Revert "get rid of DCACHE_GENOCIDE"
2024-02-25ext4_get_link(): fix breakage in RCU modeAl Viro1-3/+5
1) errors from ext4_getblk() should not be propagated to caller unless we are really sure that we would've gotten the same error in non-RCU pathwalk. 2) we leak buffer_heads if ext4_getblk() is successful, but bh is not uptodate. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25cifs_get_link(): bail out in unsafe caseAl Viro1-0/+3
->d_revalidate() bails out there, anyway. It's not enough to prevent getting into ->get_link() in RCU mode, but that could happen only in a very contrieved setup. Not worth trying to do anything fancy here unless ->d_revalidate() stops kicking out of RCU mode at least in some cases. Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25fuse: fix UAF in rcu pathwalksAl Viro3-6/+13
->permission(), ->get_link() and ->inode_get_acl() might dereference ->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns as well) when called from rcu pathwalk. Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info and dropping ->user_ns rcu-delayed too. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25procfs: make freeing proc_fs_info rcu-delayedAl Viro2-1/+2
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns() is still synchronous, but that's not a problem - it does rcu-delay everything that needs to be) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()Al Viro2-13/+8
that keeps both around until struct inode is freed, making access to them safe from rcu-pathwalk Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25nfs: fix UAF on pathwalk running into umountAl Viro2-3/+12
NFS ->d_revalidate(), ->permission() and ->get_link() need to access some parts of nfs_server when called in RCU mode: server->flags server->caps *(server->io_stats) and, worst of all, call server->nfs_client->rpc_ops->have_delegation (the last one - as NFS_PROTO(inode)->have_delegation()). We really don't want to RCU-delay the entire nfs_free_server() (it would have to be done with schedule_work() from RCU callback, since it can't be made to run from interrupt context), but actual freeing of nfs_server and ->io_stats can be done via call_rcu() just fine. nfs_client part is handled simply by making nfs_free_client() use kfree_rcu(). Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25nfs: make nfs_set_verifier() safe for use in RCU pathwalkAl Viro1-2/+2
nfs_set_verifier() relies upon dentry being pinned; if that's the case, grabbing ->d_lock stabilizes ->d_parent and guarantees that ->d_parent points to a positive dentry. For something we'd run into in RCU mode that is *not* true - dentry might've been through dentry_kill() just as we grabbed ->d_lock, with its parent going through the same just as we get to into nfs_set_verifier_locked(). It might get to detaching inode (and zeroing ->d_inode) before nfs_set_verifier_locked() gets to fetching that; we get an oops as the result. That can happen in nfs{,4} ->d_revalidate(); the call chain in question is nfs_set_verifier_locked() <- nfs_set_verifier() <- nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate(). We have checked that the parent had been positive, but that's done before we get to nfs_set_verifier() and it's possible for memory pressure to pick our dentry as eviction candidate by that time. If that happens, back-to-back attempts to kill dentry and its parent are quite normal. Sure, in case of eviction we'll fail the ->d_seq check in the caller, but we need to survive until we return there... Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25afs: fix __afs_break_callback() / afs_drop_open_mmap() raceAl Viro1-2/+6
In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero do queue_work(&vnode->cb_work). In afs_drop_open_mmap() we decrement ->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero. The trouble is, there's nothing to prevent __afs_break_callback() from seeing ->cb_nr_mmap before the decrement and do queue_work() after both the decrement and flush_work(). If that happens, we might be in trouble - vnode might get freed before the queued work runs. __afs_break_callback() is always done under ->cb_lock, so let's make sure that ->cb_nr_mmap can change from non-zero to zero while holding ->cb_lock (the spinlock component of it - it's a seqlock and we don't need to mess with the counter). Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_infoAl Viro2-3/+10
->d_hash() and ->d_compare() use those, so we need to delay freeing them. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helperAl Viro3-19/+16
That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem that is in process of getting shut down. Besides, having nls and upcase table cleanup moved from ->put_super() towards the place where sbi is freed makes for simpler failure exits. Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25affs: free affs_sb_info with kfree_rcu()Al Viro2-1/+2
one of the flags in it is used by ->d_hash()/->d_compare() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25rcu pathwalk: prevent bogus hard errors from may_lookup()Al Viro1-1/+5
If lazy call of ->permission() returns a hard error, check that try_to_unlazy() succeeds before returning it. That both makes life easier for ->permission() instances and closes the race in ENOTDIR handling - it is possible that positive d_can_lookup() seen in link_path_walk() applies to the state *after* unlink() + mkdir(), while nd->inode matches the state prior to that. Normally seeing e.g. EACCES from permission check in rcu pathwalk means that with some timings non-rcu pathwalk would've run into the same; however, running into a non-executable regular file in the middle of a pathname would not get to permission check - it would fail with ENOTDIR instead. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25fs/super.c: don't drop ->s_user_ns until we free struct super_block itselfAl Viro1-9/+4
Avoids fun races in RCU pathwalk... Same goes for freeing LSM shite hanging off super_block's arse. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25bcachefs: Fix check_snapshot() memcpyKent Overstreet1-1/+1
check_snapshot() copies the bch_snapshot to a temporary to easily handle older versions that don't have all the fields of the current version, but it lacked a min() to correctly handle keys newer and larger than the current version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25bcachefs: Fix bch2_journal_flush_device_pins()Kent Overstreet1-3/+5
If a journal write errored, the list of devices it was written to could be empty - we're not supposed to mark an empty replicas list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25bcachefs: fix iov_iter count underflow on sub-block dio readBrian Foster1-0/+2
bch2_direct_IO_read() checks the request offset and size for sector alignment and then falls through to a couple calculations to shrink the size of the request based on the inode size. The problem is that these checks round up to the fs block size, which runs the risk of underflowing iter->count if the block size happens to be large enough. This is triggered by fstest generic/361 with a 4k block size, which subsequently leads to a crash. To avoid this crash, check that the shorten length doesn't exceed the overall length of the iter. Fixes: Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btreeKent Overstreet1-1/+3
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the keyspace where no keys are visible in the current snapshot, we have a problem - we'll scan for a very long time before scanning terminates. Awhile back, this was fixed for most cases with peek_upto() (and assertions that enforce that it's being used). But the fix missed the fact that the inodes btree is different - every key offset is in a different snapshot tree, not just the inode field. Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25bcachefs: Kill __GFP_NOFAIL in buffered read pathKent Overstreet1-13/+8
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the easy one in read_single_folio() (where wa can return an error) was missed - oops. Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25bcachefs: fix backpointer_to_text() when dev does not existKent Overstreet1-3/+5
Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25Merge tag 'powerpc-6.8-4' of ↵Linus Torvalds5-10/+73
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a crash when hot adding a PCI device to an LPAR since recent changes - Fix nested KVM level-2 guest reboot failure due to empty 'arch_compat' Thanks to Amit Machhiwal, Aneesh Kumar K.V (IBM), Brian King, Gaurav Batra, and Vaibhav Jain. * tag 'powerpc-6.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat' powerpc/pseries/iommu: DLPAR add doesn't completely initialize pci_controller
2024-02-25Merge tag 'iommu-fixes-v6.8-rc5' of ↵Linus Torvalds9-93/+225
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - Intel VT-d fixes for nested domain handling: - Cache invalidation for changes in a parent domain - Dirty tracking setting for parent and nested domains - Fix a constant-out-of-range warning - ARM SMMU fixes: - Fix CD allocation from atomic context when using SVA with SMMUv3 - Revert the conversion of SMMUv2 to domain_alloc_paging(), as it breaks the boot for Qualcomm MSM8996 devices - Restore SVA handle sharing in core code as it turned out there are still drivers relying on it * tag 'iommu-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/sva: Restore SVA handle sharing iommu/arm-smmu-v3: Do not use GFP_KERNEL under as spinlock iommu/vt-d: Fix constant-out-of-range warning iommu/vt-d: Set SSADE when attaching to a parent with dirty tracking iommu/vt-d: Add missing dirty tracking set for parent domain iommu/vt-d: Wrap the dirty tracking loop to be a helper iommu/vt-d: Remove domain parameter for intel_pasid_setup_dirty_tracking() iommu/vt-d: Add missing device iotlb flush for parent domain iommu/vt-d: Update iotlb in nested domain attach iommu/vt-d: Add missing iotlb flush for parent domain iommu/vt-d: Add __iommu_flush_iotlb_psi() iommu/vt-d: Track nested domains in parent Revert "iommu/arm-smmu: Convert to domain_alloc_paging()"