summaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)AuthorFilesLines
13 daysNFSv4.1 another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia2-5/+3
[ Upstream commit 4840c00003a2275668a13b82c9f5b1aed80183aa ] Previously in order to mark the communication with the DS server, we tried to use NFS_CS_DS in cl_flags. However, this flag would only be saved for the DS server and in case where DS equals MDS, the client would not find a matching nfs_client in nfs_match_client that represents the MDS (but is also a DS). Instead, don't rely on the NFS_CS_DS but instead use NFS_CS_PNFS. Fixes: 379e4adfddd6 ("NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25nfs: don't invalidate dentries on transient errorsScott Mayhew1-14/+13
[ Upstream commit 0c8c7c559740d2d8b66048162af6c4dba8f0c88c ] This is a slight variation on a patch previously proposed by Neil Brown that never got merged. Prior to commit 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()"), any error from nfs_lookup_verify_inode() other than -ESTALE would result in nfs_lookup_revalidate() returning that error (-ESTALE is mapped to zero). Since that commit, all errors result in nfs_lookup_revalidate() returning zero, resulting in dentries being invalidated where they previously were not (particularly in the case of -ERESTARTSYS). Fix it by passing the actual error code to nfs_lookup_revalidate_done(), and leaving the decision on whether to map the error code to zero or one to nfs_lookup_revalidate_done(). A simple reproducer is to run the following python code in a subdirectory of an NFS mount (not in the root of the NFS mount): ---8<--- import os import multiprocessing import time if __name__=="__main__": multiprocessing.set_start_method("spawn") count = 0 while True: try: os.getcwd() pool = multiprocessing.Pool(10) pool.close() pool.terminate() count += 1 except Exception as e: print(f"Failed after {count} iterations") print(e) break ---8<--- Prior to commit 5ceb9d7fdaaf, the above code would run indefinitely. After commit 5ceb9d7fdaaf, it fails almost immediately with -ENOENT. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25nfs: Avoid flushing many pages with NFS_FILE_SYNCJan Kara1-0/+5
[ Upstream commit a527c3ba41c4c61e2069bfce4091e5515f06a8dd ] When we are doing WB_SYNC_ALL writeback, nfs submits write requests with NFS_FILE_SYNC flag to the server (which then generally treats it as an O_SYNC write). This helps to reduce latency for single requests but when submitting more requests, additional fsyncs on the server side hurt latency. NFS generally avoids this additional overhead by not setting NFS_FILE_SYNC if desc->pg_moreio is set. However this logic doesn't always work. When we do random 4k writes to a huge file and then call fsync(2), each page writeback is going to be sent with NFS_FILE_SYNC because after preparing one page for writeback, we start writing back next, nfs_do_writepage() will call nfs_pageio_cond_complete() which finds the page is not contiguous with previously prepared IO and submits is *without* setting desc->pg_moreio. Hence NFS_FILE_SYNC is used resulting in poor performance. Fix the problem by setting desc->pg_moreio in nfs_pageio_cond_complete() before submitting outstanding IO. This improves throughput of fsync-after-random-writes on my test SSD from ~70MB/s to ~250MB/s. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25nfs: propagate readlink errors in nfs_symlink_fillerSagi Grimberg1-1/+1
[ Upstream commit 134d0b3f2440cdddd12fc3444c9c0f62331ce6fc ] There is an inherent race where a symlink file may have been overriden (by a different client) between lookup and readlink, resulting in a spurious EIO error returned to userspace. Fix this by propagating back ESTALE errors such that the vfs will retry the lookup/get_link (similar to nfs4_file_open) at least once. Cc: Dan Aloni <dan.aloni@vastdata.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25NFSv4: Fix memory leak in nfs4_set_security_labelDmitry Mastykin1-0/+1
[ Upstream commit aad11473f8f4be3df86461081ce35ec5b145ba68 ] We leak nfs_fattr and nfs4_label every time we set a security xattr. Signed-off-by: Dmitry Mastykin <mastichi@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05nfs: drop the incorrect assertion in nfs_swap_rw()Christoph Hellwig1-2/+0
commit 54e7d59841dab977f6cb1183d658b1b82c9f4e94 upstream. Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space"), we can plug multiple pages then unplug them all together. That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it actually equals the size of iov_iter_npages(iter, INT_MAX). Note this issue has nothing to do with large folios as we don't support THP_SWPOUT to non-block devices. [v-songbaohua@oppo.com: figure out the cause and correct the commit message] Link: https://lkml.kernel.org/r/20240618065647.21791-1-21cnbao@gmail.com Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/ Reviewed-by: Martin Wege <martin.l.wege@gmail.com> Cc: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <anna@kernel.org> Cc: Steve French <sfrench@samba.org> Cc: Trond Myklebust <trondmy@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-21NFS: add barriers when testing for NFS_FSDATA_BLOCKEDNeilBrown1-15/+32
[ Upstream commit 99bc9f2eb3f79a2b4296d9bf43153e1d10ca50d3 ] dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or renaming-over a file to ensure that no open succeeds while the NFS operation progressed on the server. Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock after checking the refcount is not elevated. Any attempt to open the file (through that name) will go through lookp_open() which will take ->d_lock while incrementing the refcount, we can be sure that once the new value is set, __nfs_lookup_revalidate() *will* see the new value and will block. We don't have any locking guarantee that when we set ->d_fsdata to NULL, the wait_var_event() in __nfs_lookup_revalidate() will notice. wait/wake primitives do NOT provide barriers to guarantee order. We must use smp_load_acquire() in wait_var_event() to ensure we look at an up-to-date value, and must use smp_store_release() before wake_up_var(). This patch adds those barrier functions and factors out block_revalidate() and unblock_revalidate() far clarity. There is also a hypothetical bug in that if memory allocation fails (which never happens in practice) we might leave ->d_fsdata locked. This patch adds the missing call to unblock_revalidate(). Reported-and-tested-by: Richard Kojedzinszky <richard+debian+bugreport@kojedz.in> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501 Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-21NFSv4.1 enforce rootpath check in fs_location queryOlga Kornievskaia1-1/+22
[ Upstream commit 28568c906c1bb5f7560e18082ed7d6295860f1c2 ] In commit 4ca9f31a2be66 ("NFSv4.1 test and add 4.1 trunking transport"), we introduce the ability to query the NFS server for possible trunking locations of the existing filesystem. However, we never checked the returned file system path for these alternative locations. According to the RFC, the server can say that the filesystem currently known under "fs_root" of fs_location also resides under these server locations under the following "rootpath" pathname. The client cannot handle trunking a filesystem that reside under different location under different paths other than what the main path is. This patch enforces the check that fs_root path and rootpath path in fs_location reply is the same. Fixes: 4ca9f31a2be6 ("NFSv4.1 test and add 4.1 trunking transport") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-16NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUSAnna Schumaker1-1/+1
commit f06d1b10cb016d5aaecdb1804fefca025387bd10 upstream. Olga showed me a case where the client was sending multiple READ_PLUS calls to the server in parallel, and the server replied NFS4ERR_OPNOTSUPP to each. The client would fall back to READ for the first reply, but fail to retry the other calls. I fix this by removing the test for NFS_CAP_READ_PLUS in nfs4_read_plus_not_supported(). This allows us to reschedule any READ_PLUS call that has a NFS4ERR_OPNOTSUPP return value, even after the capability has been cleared. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: c567552612ec ("NFS: Add READ_PLUS data segment support") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16nfs: fix undefined behavior in nfs_block_bits()Sergey Shtylyov1-2/+2
commit 3c0a2e0b0ae661457c8505fecc7be5501aa7a715 upstream. Shifting *signed int* typed constant 1 left by 31 bits causes undefined behavior. Specify the correct *unsigned long* type by using 1UL instead. Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Cc: stable@vger.kernel.org Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-12pNFS/filelayout: fixup pNfs allocation modesOlga Kornievskaia1-2/+2
[ Upstream commit 3ebcb24646f8c5bfad2866892d3f3cff05514452 ] Change left over allocation flags. Fixes: a245832aaa99 ("pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12nfs: keep server info for remountsMartin Kaiser1-3/+6
[ Upstream commit b322bf9e983addedff0894c55e92d58f4d16d92a ] With newer kernels that use fs_context for nfs mounts, remounts fail with -EINVAL. $ mount -t nfs -o nolock 10.0.0.1:/tmp/test /mnt/test/ $ mount -t nfs -o remount /mnt/test/ mount: mounting 10.0.0.1:/tmp/test on /mnt/test failed: Invalid argument For remounts, the nfs server address and port are populated by nfs_init_fs_context and later overwritten with 0x00 bytes by nfs23_parse_monolithic. The remount then fails as the server address is invalid. Fix this by not overwriting nfs server info in nfs23_parse_monolithic if we're doing a remount. Fixes: f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Martin Kaiser <martin@kaiser.cx> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12NFSv4: Fixup smatch warning for ambiguous returnBenjamin Coddington1-7/+5
[ Upstream commit 37ffe06537af3e3ec212e7cbe941046fce0a822f ] Dan Carpenter reports smatch warning for nfs4_try_migration() when a memory allocation failure results in a zero return value. In this case, a transient allocation failure error will likely be retried the next time the server responds with NFS4ERR_MOVED. We can fixup the smatch warning with a small refactor: attempt all three allocations before testing and returning on a failure. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: c3ed222745d9 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17nfs: Handle error of rpc_proc_register() in nfs_net_init().Kuniyuki Iwashima1-1/+6
[ Upstream commit 24457f1be29f1e7042e50a7749f5c2dde8c433c8 ] syzkaller reported a warning [0] triggered while destroying immature netns. rpc_proc_register() was called in init_nfs_fs(), but its error has been ignored since at least the initial commit 1da177e4c3f4 ("Linux-2.6.12-rc2"). Recently, commit d47151b79e32 ("nfs: expose /proc/net/sunrpc/nfs in net namespaces") converted the procfs to per-netns and made the problem more visible. Even when rpc_proc_register() fails, nfs_net_init() could succeed, and thus nfs_net_exit() will be called while destroying the netns. Then, remove_proc_entry() will be called for non-existing proc directory and trigger the warning below. Let's handle the error of rpc_proc_register() properly in nfs_net_init(). [0]: name 'nfs' WARNING: CPU: 1 PID: 1710 at fs/proc/generic.c:711 remove_proc_entry+0x1bb/0x2d0 fs/proc/generic.c:711 Modules linked in: CPU: 1 PID: 1710 Comm: syz-executor.2 Not tainted 6.8.0-12822-gcd51db110a7e #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:remove_proc_entry+0x1bb/0x2d0 fs/proc/generic.c:711 Code: 41 5d 41 5e c3 e8 85 09 b5 ff 48 c7 c7 88 58 64 86 e8 09 0e 71 02 e8 74 09 b5 ff 4c 89 e6 48 c7 c7 de 1b 80 84 e8 c5 ad 97 ff <0f> 0b eb b1 e8 5c 09 b5 ff 48 c7 c7 88 58 64 86 e8 e0 0d 71 02 eb RSP: 0018:ffffc9000c6d7ce0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8880422b8b00 RCX: ffffffff8110503c RDX: ffff888030652f00 RSI: ffffffff81105045 RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffffffff81bb62cb R12: ffffffff84807ffc R13: ffff88804ad6fcc0 R14: ffffffff84807ffc R15: ffffffff85741ff8 FS: 00007f30cfba8640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff51afe8000 CR3: 000000005a60a005 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> rpc_proc_unregister+0x64/0x70 net/sunrpc/stats.c:310 nfs_net_exit+0x1c/0x30 fs/nfs/inode.c:2438 ops_exit_list+0x62/0xb0 net/core/net_namespace.c:170 setup_net+0x46c/0x660 net/core/net_namespace.c:372 copy_net_ns+0x244/0x590 net/core/net_namespace.c:505 create_new_namespaces+0x2ed/0x770 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xae/0x160 kernel/nsproxy.c:228 ksys_unshare+0x342/0x760 kernel/fork.c:3322 __do_sys_unshare kernel/fork.c:3393 [inline] __se_sys_unshare kernel/fork.c:3391 [inline] __x64_sys_unshare+0x1f/0x30 kernel/fork.c:3391 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0x7f30d0febe5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007f30cfba7cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f30d0febe5d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000006c020600 RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 000000000000000b R14: 00007f30d104c530 R15: 0000000000000000 </TASK> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17nfs: make the rpc_stat per net namespaceJosef Bacik4-4/+9
[ Upstream commit 1548036ef1204df65ca5a16e8b199c858cb80075 ] Now that we're exposing the rpc stats on a per-network namespace basis, move this struct into struct nfs_net and use that to make sure only the per-network namespace stats are exposed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Stable-dep-of: 24457f1be29f ("nfs: Handle error of rpc_proc_register() in nfs_net_init().") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17nfs: expose /proc/net/sunrpc/nfs in net namespacesJosef Bacik1-4/+2
[ Upstream commit d47151b79e3220e72ae323b8b8e9d6da20dc884e ] We're using nfs mounts inside of containers in production and noticed that the nfs stats are not exposed in /proc. This is a problem for us as we use these stats for monitoring, and have to do this awkward bind mount from the main host into the container in order to get to these states. Add the rpc_proc_register call to the pernet operations entry and exit points so these stats can be exposed inside of network namespaces. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Stable-dep-of: 24457f1be29f ("nfs: Handle error of rpc_proc_register() in nfs_net_init().") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-03NFS: Read unlock folio on nfs_page_create_from_folio() errorBenjamin Coddington1-0/+2
[ Upstream commit 11974eec839c167362af685aae5f5e1baaf979eb ] The netfs conversion lost a folio_unlock() for the case where nfs_page_create_from_folio() returns an error (usually -ENOMEM). Restore it. Reported-by: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> # 6.4+ Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Acked-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-03nfs: fix UAF in direct writesJosef Bacik2-3/+10
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ] In production we have been hitting the following warning consistently ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30 This is because we're completing the nfs_direct_request twice in a row. The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq); However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice. The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a nfs_commit_begin(); nfs_commit_end(); Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths. Fix this by using the same pattern for the commit requests. Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27nfs: fix panic when nfs4_ff_layout_prepare_ds() failsJosef Bacik1-1/+1
[ Upstream commit 719fcafe07c12646691bd62d7f8d94d657fa0766 ] We've been seeing the following panic in production BUG: kernel NULL pointer dereference, address: 0000000000000065 PGD 2f485f067 P4D 2f485f067 PUD 2cc5d8067 PMD 0 RIP: 0010:ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles] Call Trace: <TASK> ? __die+0x78/0xc0 ? page_fault_oops+0x286/0x380 ? __rpc_execute+0x2c3/0x470 [sunrpc] ? rpc_new_task+0x42/0x1c0 [sunrpc] ? exc_page_fault+0x5d/0x110 ? asm_exc_page_fault+0x22/0x30 ? ff_layout_free_layoutreturn+0x110/0x110 [nfs_layout_flexfiles] ? ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles] ? ff_layout_cancel_io+0x6f/0x90 [nfs_layout_flexfiles] pnfs_mark_matching_lsegs_return+0x1b0/0x360 [nfsv4] pnfs_error_mark_layout_for_return+0x9e/0x110 [nfsv4] ? ff_layout_send_layouterror+0x50/0x160 [nfs_layout_flexfiles] nfs4_ff_layout_prepare_ds+0x11f/0x290 [nfs_layout_flexfiles] ff_layout_pg_init_write+0xf0/0x1f0 [nfs_layout_flexfiles] __nfs_pageio_add_request+0x154/0x6c0 [nfs] nfs_pageio_add_request+0x26b/0x380 [nfs] nfs_do_writepage+0x111/0x1e0 [nfs] nfs_writepages_callback+0xf/0x30 [nfs] write_cache_pages+0x17f/0x380 ? nfs_pageio_init_write+0x50/0x50 [nfs] ? nfs_writepages+0x6d/0x210 [nfs] ? nfs_writepages+0x6d/0x210 [nfs] nfs_writepages+0x125/0x210 [nfs] do_writepages+0x67/0x220 ? generic_perform_write+0x14b/0x210 filemap_fdatawrite_wbc+0x5b/0x80 file_write_and_wait_range+0x6d/0xc0 nfs_file_fsync+0x81/0x170 [nfs] ? nfs_file_mmap+0x60/0x60 [nfs] __x64_sys_fsync+0x53/0x90 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Inspecting the core with drgn I was able to pull this >>> prog.crashed_thread().stack_trace()[0] #0 at 0xffffffffa079657a (ff_layout_cancel_io+0x3a/0x84) in ff_layout_cancel_io at fs/nfs/flexfilelayout/flexfilelayout.c:2021:27 >>> prog.crashed_thread().stack_trace()[0]['idx'] (u32)1 >>> prog.crashed_thread().stack_trace()[0]['flseg'].mirror_array[1].mirror_ds (struct nfs4_ff_layout_ds *)0xffffffffffffffed This is clear from the stack trace, we call nfs4_ff_layout_prepare_ds() which could error out initializing the mirror_ds, and then we go to clean it all up and our check is only for if (!mirror->mirror_ds). This is inconsistent with the rest of the users of mirror_ds, which have if (IS_ERR_OR_NULL(mirror_ds)) to keep from tripping over this exact scenario. Fix this up in ff_layout_cancel_io() to make sure we don't panic when we get an error. I also spot checked all the other instances of checking mirror_ds and we appear to be doing the correct checks everywhere, only unconditionally dereferencing mirror_ds when we know it would be valid. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Fixes: b739a5bd9d9f ("NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27NFSv4.1/pnfs: fix NFS with TLS in pnfsOlga Kornievskaia1-1/+43
[ Upstream commit a35518cae4b325632840bc8c3aa9ad9bac430038 ] Currently, even though xprtsec=tls is specified and used for operations to MDS, any operations that go to DS travel over unencrypted connection. Or additionally, if more than 1 DS can serve the data, then trunked connections are also done unencrypted. IN GETDEVINCEINFO, we get an entry for the DS which carries a protocol type (which is TCP), then nfs4_set_ds_client() gets called with TCP instead of TCP with TLS. Currently, each trunked connection is created and uses clp->cl_hostname value which if TLS is used would get passed up in the handshake upcall, but instead we need to pass in the appropriate trunked address value. Fixes: c8407f2e560c ("NFS: Add an "xprtsec=" NFS mount option") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27NFS: Fix an off by one in root_nfs_cat()Christophe JAILLET1-2/+2
[ Upstream commit 698ad1a538da0b6bf969cfee630b4e3a026afb87 ] The intent is to check if 'dest' is truncated or not. So, >= should be used instead of >, because strlcat() returns the length of 'dest' and 'src' excluding the trailing NULL. Fixes: 56463e50d1fc ("NFS: Use super.c for NFSROOT mount option parsing") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interruptDave Wysochanski1-7/+2
[ Upstream commit fd5860ab6341506004219b080aea40213b299d2e ] The loop inside nfs_netfs_issue_read() currently does not disable interrupts while iterating through pages in the xarray to submit for NFS read. This is not safe though since after taking xa_lock, another page in the mapping could be processed for writeback inside an interrupt, and deadlock can occur. The fix is simple and clean if we use xa_for_each_range(), which handles the iteration with RCU while reducing code complexity. The problem is easily reproduced with the following test: mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 echo 3 > /proc/sys/vm/drop_caches dd if=/mnt/nfs/file1.bin of=/dev/null umount /mnt/nfs On the console with a lockdep-enabled kernel a message similar to the following will be seen: ================================ WARNING: inconsistent lock state 6.7.0-lockdbg+ #10 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at: nfs_netfs_issue_read+0x1b2/0x4b0 [nfs] {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x144/0x380 _raw_spin_lock_irqsave+0x4e/0xa0 __folio_end_writeback+0x17e/0x5c0 folio_end_writeback+0x93/0x1b0 iomap_finish_ioend+0xeb/0x6a0 blk_update_request+0x204/0x7f0 blk_mq_end_request+0x30/0x1c0 blk_complete_reqs+0x7e/0xa0 __do_softirq+0x113/0x544 __irq_exit_rcu+0xfe/0x120 irq_exit_rcu+0xe/0x20 sysvec_call_function_single+0x6f/0x90 asm_sysvec_call_function_single+0x1a/0x20 pv_native_safe_halt+0xf/0x20 default_idle+0x9/0x20 default_idle_call+0x67/0xa0 do_idle+0x2b5/0x300 cpu_startup_entry+0x34/0x40 start_secondary+0x19d/0x1c0 secondary_startup_64_no_verify+0x18f/0x19b irq event stamp: 176891 hardirqs last enabled at (176891): [<ffffffffa67a0be4>] _raw_spin_unlock_irqrestore+0x44/0x60 hardirqs last disabled at (176890): [<ffffffffa67a0899>] _raw_spin_lock_irqsave+0x79/0xa0 softirqs last enabled at (176646): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120 softirqs last disabled at (176633): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&xa->xa_lock#4); <Interrupt> lock(&xa->xa_lock#4); *** DEADLOCK *** 2 locks held by test5/1708: #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at: nfs_start_io_read+0x28/0x90 [nfs] #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: page_cache_ra_unbounded+0xa4/0x280 stack backtrace: CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+ Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 Call Trace: dump_stack_lvl+0x5b/0x90 mark_lock+0xb3f/0xd20 __lock_acquire+0x77b/0x3360 _raw_spin_lock+0x34/0x80 nfs_netfs_issue_read+0x1b2/0x4b0 [nfs] netfs_begin_read+0x77f/0x980 [netfs] nfs_netfs_readahead+0x45/0x60 [nfs] nfs_readahead+0x323/0x5a0 [nfs] read_pages+0xf3/0x5c0 page_cache_ra_unbounded+0x1c8/0x280 filemap_get_pages+0x38c/0xae0 filemap_read+0x206/0x5e0 nfs_file_read+0xb7/0x140 [nfs] vfs_read+0x2a9/0x460 ksys_read+0xb7/0x140 Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27NFSv4.2: fix listxattr maximum XDR buffer sizeJorge Mora1-2/+5
[ Upstream commit bcac8bff90a6ee1629f90669cdb9d28fb86049b0 ] Switch order of operations to avoid creating a short XDR buffer: e.g., buflen = 12, old xdrlen = 12, new xdrlen = 20. Having a short XDR buffer leads to lxa_maxcount be a few bytes less than what is needed to retrieve the whole list when using a buflen as returned by a call with size = 0: buflen = listxattr(path, NULL, 0); buf = malloc(buflen); buflen = listxattr(path, buf, buflen); For a file with one attribute (name = '123456'), the first call with size = 0 will return buflen = 12 ('user.123456\x00'). The second call with size = 12, sends LISTXATTRS with lxa_maxcount = 12 + 8 (cookie) + 4 (array count) = 24. The XDR buffer needs 8 (cookie) + 4 (array count) + 4 (name count) + 6 (name len) + 2 (padding) + 4 (eof) = 28 which is 4 bytes shorter than the lxa_maxcount provided in the call. Fixes: 04a5da690e8f ("NFSv4.2: define limits and sizes for user xattr handling") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27NFSv4.2: fix nfs4_listxattr kernel BUG at mm/usercopy.c:102Jorge Mora1-6/+10
[ Upstream commit 251a658bbfceafb4d58c76b77682c8bf7bcfad65 ] A call to listxattr() with a buffer size = 0 returns the actual size of the buffer needed for a subsequent call. When size > 0, nfs4_listxattr() does not return an error because either generic_listxattr() or nfs4_listxattr_nfs4_label() consumes exactly all the bytes then size is 0 when calling nfs4_listxattr_nfs4_user() which then triggers the following kernel BUG: [ 99.403778] kernel BUG at mm/usercopy.c:102! [ 99.404063] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 99.408463] CPU: 0 PID: 3310 Comm: python3 Not tainted 6.6.0-61.fc40.aarch64 #1 [ 99.415827] Call trace: [ 99.415985] usercopy_abort+0x70/0xa0 [ 99.416227] __check_heap_object+0x134/0x158 [ 99.416505] check_heap_object+0x150/0x188 [ 99.416696] __check_object_size.part.0+0x78/0x168 [ 99.416886] __check_object_size+0x28/0x40 [ 99.417078] listxattr+0x8c/0x120 [ 99.417252] path_listxattr+0x78/0xe0 [ 99.417476] __arm64_sys_listxattr+0x28/0x40 [ 99.417723] invoke_syscall+0x78/0x100 [ 99.417929] el0_svc_common.constprop.0+0x48/0xf0 [ 99.418186] do_el0_svc+0x24/0x38 [ 99.418376] el0_svc+0x3c/0x110 [ 99.418554] el0t_64_sync_handler+0x120/0x130 [ 99.418788] el0t_64_sync+0x194/0x198 [ 99.418994] Code: aa0003e3 d000a3e0 91310000 97f49bdb (d4210000) Issue is reproduced when generic_listxattr() returns 'system.nfs4_acl', thus calling lisxattr() with size = 16 will trigger the bug. Add check on nfs4_listxattr() to return ERANGE error when it is called with size > 0 and the return value is greater than size. Fixes: 012a211abd5d ("NFSv4.2: hook in the user extended attribute handlers") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-06NFS: Fix data corruption caused by congestion.NeilBrown1-1/+3
when AOP_WRITEPAGE_ACTIVATE is returned (as NFS does when it detects congestion) it is important that the folio is redirtied. nfs_writepage_locked() doesn't do this, so files can become corrupted as writes can be lost. Note that this is not needed in v6.8 as AOP_WRITEPAGE_ACTIVATE cannot be returned. It is needed for kernels v5.18..v6.7. Prior to 6.3 the patch is different as it needs to mention "page", not "folio". Reported-and-tested-by: Jacek Tomaka <Jacek.Tomaka@poczta.fm> Fixes: 6df25e58532b ("nfs: remove reliance on bdi congestion") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-26pNFS: Fix the pnfs block driver's calculation of layoutget sizeTrond Myklebust4-7/+8
[ Upstream commit 8a6291bf3b0eae1bf26621e6419a91682f2d6227 ] Instead of relying on the value of the 'bytes_left' field, we should calculate the layout size based on the offset of the request that is being written out. Reported-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling") Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-26NFSv4.1/pnfs: Ensure we handle the error NFS4ERR_RETURNCONFLICTTrond Myklebust1-0/+3
[ Upstream commit 037e56a22ff37f9a9c2330b66cff55d3d1ff9b90 ] Once the client has processed the CB_LAYOUTRECALL, but has not yet successfully returned the layout, the server is supposed to switch to returning NFS4ERR_RETURNCONFLICT. This patch ensures that we handle that return value correctly. Fixes: 183d9e7b112a ("pnfs: rework LAYOUTGET retry handling") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-26NFS: Use parent's objective cred in nfs_access_login_time()Scott Mayhew1-1/+1
[ Upstream commit a10a9233073d984b239e22358ba21825e27e2e88 ] The subjective cred (task->cred) can potentially be overridden and subsquently freed in non-RCU context, which could lead to a panic if we try to use it in cred_fscmp(). Use __task_cred(), which returns the objective cred (task->real_cred) instead. Fixes: 0eb43812c027 ("NFS: Clear the file access cache upon login") Fixes: 5e9a7b9c2ea1 ("NFS: Fix up a sparse warning") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-26blocklayoutdriver: Fix reference leak of pnfs_device_nodeBenjamin Coddington1-0/+2
[ Upstream commit 1530827b90025cdf80c9b0d07a166d045a0a7b81 ] The error path for blocklayout's device lookup is missing a reference drop for the case where a lookup finds the device, but the device is marked with NFS_DEVICEID_UNAVAILABLE. Fixes: b3dce6a2f060 ("pnfs/blocklayout: handle transient devices") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28NFSv4.1: fix SP4_MACH_CRED protection for pnfs IOOlga Kornievskaia1-2/+3
[ Upstream commit 5cc7688bae7f0757c39c1d3dfdd827b724061067 ] If the client is doing pnfs IO and Kerberos is configured and EXCHANGEID successfully negotiated SP4_MACH_CRED and WRITE/COMMIT are on the list of state protected operations, then we need to make sure to choose the DS's rpc_client structure instead of the MDS's one. Fixes: fb91fb0ee7b2 ("NFS: Move call to nfs4_state_protect_write() to nfs4_write_setup()") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28NFSv4.1: fix handling NFS4ERR_DELAY when testing for session trunkingOlga Kornievskaia1-1/+6
[ Upstream commit 6bd1a77dc72dea0b0d8b6014f231143984d18f6d ] Currently when client sends an EXCHANGE_ID for a possible trunked connection, for any error that happened, the trunk will be thrown out. However, an NFS4ERR_DELAY is a transient error that should be retried instead. Fixes: e818bd085baf ("NFSv4.1 remove xprt from xprt_switch if session trunking test fails") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-18NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia1-2/+0
This patches fixes commit 51d674a5e488 "NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that commit was to mark EXCHANGE_ID to the DS with the appropriate flag. However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that to the MDS. Instead, just mark the pnfs path exclusively. Fixes: 51d674a5e488 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_statsTrond Myklebust1-7/+10
Ensure that we check the layout pointer and validity after dereferencing it in ff_layout_mirror_prepare_stats. Fixes: 08e2e5bc6c9a ("pNFS/flexfiles: Clean up layoutstats") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18pNFS: Fix a hang in nfs4_evict_inode()Trond Myklebust1-10/+23
We are not allowed to call pnfs_mark_matching_lsegs_return() without also holding a reference to the layout header, since doing so could lead to the reference count going to zero when we call pnfs_layout_remove_lseg(). This again can lead to a hang when we get to nfs4_evict_inode() and are unable to clear the layout pointer. pnfs_layout_return_unused_byserver() is guilty of this behaviour, and has been seen to trigger the refcount warning prior to a hang. Fixes: b6d49ecd1081 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-11NFS: Fix potential oops in nfs_inode_remove_request()Scott Mayhew1-1/+3
Once a folio's private data has been cleared, it's possible for another process to clear the folio->mapping (e.g. via invalidate_complete_folio2 or evict_mapping_folio), so it wouldn't be safe to call nfs_page_to_inode() after that. Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-11nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE opDai Ngo1-1/+2
The Linux NFS server strips the SUID and SGID from the file mode on ALLOCATE op. Modify _nfs42_proc_fallocate to add NFS_INO_REVAL_FORCED to nfs_set_cache_invalid's argument to force update of the file mode suid/sgid bit. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-28nfs: decrement nrequests counter before releasing the reqJeff Layton1-1/+1
I hit this panic in testing: [ 6235.500016] run fstests generic/464 at 2023-09-18 22:51:24 [ 6288.410761] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 6288.412174] #PF: supervisor read access in kernel mode [ 6288.413160] #PF: error_code(0x0000) - not-present page [ 6288.413992] PGD 0 P4D 0 [ 6288.414603] Oops: 0000 [#1] PREEMPT SMP PTI [ 6288.415419] CPU: 0 PID: 340798 Comm: kworker/u18:8 Not tainted 6.6.0-rc1-gdcf620ceebac #95 [ 6288.416538] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 [ 6288.417701] Workqueue: nfsiod rpc_async_release [sunrpc] [ 6288.418676] RIP: 0010:nfs_inode_remove_request+0xc8/0x150 [nfs] [ 6288.419836] Code: ff ff 48 8b 43 38 48 8b 7b 10 a8 04 74 5b 48 85 ff 74 56 48 8b 07 a9 00 00 08 00 74 58 48 8b 07 f6 c4 10 74 50 e8 c8 44 b3 d5 <48> 8b 00 f0 48 ff 88 30 ff ff ff 5b 5d 41 5c c3 cc cc cc cc 48 8b [ 6288.422389] RSP: 0018:ffffbd618353bda8 EFLAGS: 00010246 [ 6288.423234] RAX: 0000000000000000 RBX: ffff9a29f9a25280 RCX: 0000000000000000 [ 6288.424351] RDX: ffff9a29f9a252b4 RSI: 000000000000000b RDI: ffffef41448e3840 [ 6288.425345] RBP: ffffef41448e3840 R08: 0000000000000038 R09: ffffffffffffffff [ 6288.426334] R10: 0000000000033f80 R11: ffff9a2a7fffa000 R12: ffff9a29093f98c4 [ 6288.427353] R13: 0000000000000000 R14: ffff9a29230f62e0 R15: ffff9a29230f62d0 [ 6288.428358] FS: 0000000000000000(0000) GS:ffff9a2a77c00000(0000) knlGS:0000000000000000 [ 6288.429513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6288.430427] CR2: 0000000000000000 CR3: 0000000264748002 CR4: 0000000000770ef0 [ 6288.431553] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6288.432715] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6288.433698] PKRU: 55555554 [ 6288.434196] Call Trace: [ 6288.434667] <TASK> [ 6288.435132] ? __die+0x1f/0x70 [ 6288.435723] ? page_fault_oops+0x159/0x450 [ 6288.436389] ? try_to_wake_up+0x98/0x5d0 [ 6288.437044] ? do_user_addr_fault+0x65/0x660 [ 6288.437728] ? exc_page_fault+0x7a/0x180 [ 6288.438368] ? asm_exc_page_fault+0x22/0x30 [ 6288.439137] ? nfs_inode_remove_request+0xc8/0x150 [nfs] [ 6288.440112] ? nfs_inode_remove_request+0xa0/0x150 [nfs] [ 6288.440924] nfs_commit_release_pages+0x16e/0x340 [nfs] [ 6288.441700] ? __pfx_call_transmit+0x10/0x10 [sunrpc] [ 6288.442475] ? _raw_spin_lock_irqsave+0x23/0x50 [ 6288.443161] nfs_commit_release+0x15/0x40 [nfs] [ 6288.443926] rpc_free_task+0x36/0x60 [sunrpc] [ 6288.444741] rpc_async_release+0x29/0x40 [sunrpc] [ 6288.445509] process_one_work+0x171/0x340 [ 6288.446135] worker_thread+0x277/0x3a0 [ 6288.446724] ? __pfx_worker_thread+0x10/0x10 [ 6288.447376] kthread+0xf0/0x120 [ 6288.447903] ? __pfx_kthread+0x10/0x10 [ 6288.448500] ret_from_fork+0x2d/0x50 [ 6288.449078] ? __pfx_kthread+0x10/0x10 [ 6288.449665] ret_from_fork_asm+0x1b/0x30 [ 6288.450283] </TASK> [ 6288.450688] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nls_iso8859_1 nls_cp437 vfat fat 9p netfs ext4 kvm_intel crc16 mbcache jbd2 joydev kvm xfs irqbypass virtio_net pcspkr net_failover psmouse failover 9pnet_virtio cirrus drm_shmem_helper virtio_balloon drm_kms_helper button evdev drm loop dm_mod zram zsmalloc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic virtio_blk nvme aesni_intel crypto_simd cryptd nvme_core t10_pi i6300esb crc64_rocksoft_generic crc64_rocksoft crc64 virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring serio_raw btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq autofs4 [ 6288.460211] CR2: 0000000000000000 [ 6288.460787] ---[ end trace 0000000000000000 ]--- [ 6288.461571] RIP: 0010:nfs_inode_remove_request+0xc8/0x150 [nfs] [ 6288.462500] Code: ff ff 48 8b 43 38 48 8b 7b 10 a8 04 74 5b 48 85 ff 74 56 48 8b 07 a9 00 00 08 00 74 58 48 8b 07 f6 c4 10 74 50 e8 c8 44 b3 d5 <48> 8b 00 f0 48 ff 88 30 ff ff ff 5b 5d 41 5c c3 cc cc cc cc 48 8b [ 6288.465136] RSP: 0018:ffffbd618353bda8 EFLAGS: 00010246 [ 6288.465963] RAX: 0000000000000000 RBX: ffff9a29f9a25280 RCX: 0000000000000000 [ 6288.467035] RDX: ffff9a29f9a252b4 RSI: 000000000000000b RDI: ffffef41448e3840 [ 6288.468093] RBP: ffffef41448e3840 R08: 0000000000000038 R09: ffffffffffffffff [ 6288.469121] R10: 0000000000033f80 R11: ffff9a2a7fffa000 R12: ffff9a29093f98c4 [ 6288.470109] R13: 0000000000000000 R14: ffff9a29230f62e0 R15: ffff9a29230f62d0 [ 6288.471106] FS: 0000000000000000(0000) GS:ffff9a2a77c00000(0000) knlGS:0000000000000000 [ 6288.472216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6288.473059] CR2: 0000000000000000 CR3: 0000000264748002 CR4: 0000000000770ef0 [ 6288.474096] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6288.475097] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6288.476148] PKRU: 55555554 [ 6288.476665] note: kworker/u18:8[340798] exited with irqs disabled Once we've released "req", it's not safe to dereference it anymore. Decrement the nrequests counter before dropping the reference. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-27NFSv4: Fix a state manager thread deadlock regressionTrond Myklebust2-13/+29
Commit 4dc73c679114 reintroduces the deadlock that was fixed by commit aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") because it prevents the setup of new threads to handle reboot recovery, while the older recovery thread is stuck returning delegations. Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-27NFSv4: Fix a nfs4_state_manager() raceTrond Myklebust1-0/+7
If the NFS4CLNT_RUN_MANAGER flag got set just before we cleared NFS4CLNT_MANAGER_RUNNING, then we might have won the race against nfs4_schedule_state_manager(), and are responsible for handling the recovery situation. Fixes: aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-15NFSv4.1: fix zero value filehandle in post open getattrOlga Kornievskaia1-1/+5
Currently, if the OPEN compound experiencing an error and needs to get the file attributes separately, it will send a stand alone GETATTR but it would use the filehandle from the results of the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh is zero value. That generate a GETATTR that's sent with a zero value filehandle, and results in the server returning an error. Instead, for the CLAIM_FH OPEN, take the filehandle that was used in the PUTFH of the OPEN compound. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFSv4.1: fix pnfs MDS=DS session trunkingOlga Kornievskaia1-1/+5
Currently, when GETDEVICEINFO returns multiple locations where each is a different IP but the server's identity is same as MDS, then nfs4_set_ds_client() finds the existing nfs_client structure which has the MDS's max_connect value (and if it's 1), then the 1st IP on the DS's list will get dropped due to MDS trunking rules. Other IPs would be added as they fall under the pnfs trunking rules. For the list of IPs the 1st goes thru calling nfs4_set_ds_client() which will eventually call nfs4_add_trunk() and call into rpc_clnt_test_and_add_xprt() which has the check for MDS trunking. The other IPs (after the 1st one), would call rpc_clnt_add_xprt() which doesn't go thru that check. nfs4_add_trunk() is called when MDS trunking is happening and it needs to enforce the usage of max_connect mount option of the 1st mount. However, this shouldn't be applied to pnfs flow. Instead, this patch proposed to treat MDS=DS as DS trunking and make sure that MDS's max_connect limit does not apply to the 1st IP returned in the GETDEVICEINFO list. It does so by marking the newly created client with a new flag NFS_CS_PNFS which then used to pass max_connect value to use into the rpc_clnt_test_and_add_xprt() instead of the existing rpc client's max_connect value set by the MDS connection. For example, mount was done without max_connect value set so MDS's rpc client has cl_max_connect=1. Upon calling into rpc_clnt_test_and_add_xprt() and using rpc client's value, the caller passes in max_connect value which is previously been set in the pnfs path (as a part of handling GETDEVICEINFO list of IPs) in nfs4_set_ds_client(). However, when NFS_CS_PNFS flag is not set and we know we are doing MDS trunking, comparing a new IP of the same server, we then set the max_connect value to the existing MDS's value and pass that into rpc_clnt_test_and_add_xprt(). Fixes: dc48e0abee24 ("SUNRPC enforce creation of no more than max_connect xprts") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS/pNFS: Report EINVAL errors from connect() to the serverTrond Myklebust1-0/+1
With IPv6, connect() can occasionally return EINVAL if a route is unavailable. If this happens during I/O to a data server, we want to report it using LAYOUTERROR as an inability to connect. Fixes: dd52128afdde ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: More fixes for nfs_direct_write_reschedule_io()Trond Myklebust1-6/+11
Ensure that all requests are put back onto the commit list so that they can be rescheduled. Fixes: 4daaeba93822 ("NFS: Fix nfs_direct_write_reschedule_io()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: Use the correct commit info in nfs_join_page_group()Trond Myklebust2-14/+17
Ensure that nfs_clear_request_commit() updates the correct counters when it removes them from the commit list. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: More O_DIRECT accounting fixes for error pathsTrond Myklebust1-16/+31
If we hit a fatal error when retransmitting, we do need to record the removal of the request from the count of written bytes. Fixes: 031d73ed768a ("NFS: Fix O_DIRECT accounting of number of bytes read/written") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: Fix O_DIRECT locking issuesTrond Myklebust1-4/+4
The dreq fields are protected by the dreq->lock. Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-11NFS: Fix error handling for O_DIRECT write schedulingTrond Myklebust1-16/+46
If we fail to schedule a request for transmission, there are 2 possibilities: 1) Either we hit a fatal error, and we just want to drop the remaining requests on the floor. 2) We were asked to try again, in which case we should allow the outstanding RPC calls to complete, so that we can recoalesce requests and try again. Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-01Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds19-43/+88
Pull NFS client updates from Anna Schumaker: "New Features: - Enable the NFS v4.2 READ_PLUS operation by default Stable Fixes: - NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info - NFS: Fix a potential data corruption Bugfixes: - Fix various READ_PLUS issues including: - smatch warnings - xdr size calculations - scratch buffer handling - 32bit / highmem xdr page handling - Fix checkpatch errors in file.c - Fix redundant readdir request after an EOF - Fix handling of COPY ERR_OFFLOAD_NO_REQ - Fix assignment of xprtdata.cred Cleanups: - Remove unused xprtrdma function declarations - Clean up an integer overflow check to avoid a warning - Clean up #includes in dns_resolve.c - Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page() - Clean up sunrpc TCP socket timeout configuration - Guard against READDIR loops when entry names are too long - Use EXCHID4_FLAG_USE_PNFS_DS for DS servers" * tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (22 commits) pNFS: Fix assignment of xprtdata.cred NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt() SUNRPC: Allow specification of TCP client connect timeout at setup SUNRPC: Refactor and simplify connect timeout SUNRPC: Set the TCP_SYNCNT to match the socket timeout NFS: Fix a potential data corruption nfs: fix redundant readdir request after get eof nfs/blocklayout: Use the passed in gfp flags filemap: Fix errors in file.c NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info NFS: Move common includes outside ifdef SUNRPC: clean up integer overflow check xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv() NFS: Enable the READ_PLUS operation by default SUNRPC: kmap() the xdr pages during decode NFSv4.2: Rework scratch handling for READ_PLUS (again) ...
2023-09-01Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds1-19/+4
Pull nfsd updates from Chuck Lever: "I'm thrilled to announce that the Linux in-kernel NFS server now offers NFSv4 write delegations. A write delegation enables a client to cache data and metadata for a single file more aggressively, reducing network round trips and server workload. Many thanks to Dai Ngo for contributing this facility, and to Jeff Layton and Neil Brown for reviewing and testing it. This release also sees the removal of all support for DES- and triple-DES-based Kerberos encryption types in the kernel's SunRPC implementation. These encryption types have been deprecated by the Internet community for years and are considered insecure. This change affects both the in-kernel NFS client and server. The server's UDP and TCP socket transports have now fully adopted David Howells' new bio_vec iterator so that no more than one sendmsg() call is needed to transmit each RPC message. In particular, this helps kTLS optimize record boundaries when sending RPC-with-TLS replies, and it takes the server a baby step closer to handling file I/O via folios. We've begun work on overhauling the SunRPC thread scheduler to remove a costly linked-list walk when looking for an idle RPC service thread to wake. The pre-requisites are included in this release. Thanks to Neil Brown for his ongoing work on this improvement" * tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits) Documentation: Add missing documentation for EXPORT_OP flags SUNRPC: Remove unused declaration rpc_modcount() SUNRPC: Remove unused declarations NFSD: da_addr_body field missing in some GETDEVICEINFO replies SUNRPC: Remove return value of svc_pool_wake_idle_thread() SUNRPC: make rqst_should_sleep() idempotent() SUNRPC: Clean up svc_set_num_threads SUNRPC: Count ingress RPC messages per svc_pool SUNRPC: Deduplicate thread wake-up code SUNRPC: Move trace_svc_xprt_enqueue SUNRPC: Add enum svc_auth_status SUNRPC: change svc_xprt::xpt_flags bits to enum SUNRPC: change svc_rqst::rq_flags bits to enum SUNRPC: change svc_pool::sp_flags bits to enum SUNRPC: change cache_head.flags bits to enum SUNRPC: remove timeout arg from svc_recv() SUNRPC: change svc_recv() to return void. SUNRPC: call svc_process() from svc_recv(). nfsd: separate nfsd_last_thread() from nfsd_put() nfsd: Simplify code around svc_exit_thread() call in nfsd() ...
2023-08-31NFS: switch back to using kill_anon_superChristoph Hellwig1-3/+1
NFS switch to open coding kill_anon_super in 7b14a213890a ("nfs: don't call bdi_unregister") to avoid the extra bdi_unregister call. At that point bdi_destroy was called in nfs_free_server and thus it required a later freeing of the anon dev_t. But since 0db10944a76b ("nfs: Convert to separately allocated bdi") the bdi has been free implicitly by the sb destruction, so this isn't needed anymore. By not open coding kill_anon_super, nfs now inherits the fix in dc3216b14160 ("super: ensure valid info"), and we remove the only open coded version of kill_anon_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230831052940.256193-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>