summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma
AgeCommit message (Collapse)AuthorFilesLines
2023-06-18svcrdma: Fix stale commentChuck Lever1-4/+2
Commit 7d81ee8722d6 ("svcrdma: Single-stage RDMA Read") changed the behavior of svc_rdma_recvfrom() but neglected to update the documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt()Chuck Lever1-4/+3
Clean up. Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17svcrdma: trace cc_release callsChuck Lever1-0/+2
This event brackets the svcrdma_post_* trace points. If this trace event is enabled but does not appear as expected, that indicates a chunk_ctxt leak. Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17svcrdma: Convert "might sleep" comment into a code annotationChuck Lever2-2/+5
Try to catch incorrect calling contexts mechanically rather than by code review. Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17SUNRPC: Optimize page release in svc_rdma_sendto()Chuck Lever1-2/+2
Now that we have bulk page allocation and release APIs, it's more efficient to use those than it is for nfsd threads to wait for send completions. Previous patches have eliminated the calls to wait_for_completion() and complete(), in order to avoid scheduler overhead. Now release pages-under-I/O in the send completion handler using the efficient bulk release API. I've measured a 7% reduction in cumulative CPU utilization in svc_rdma_sendto(), svc_rdma_wc_send(), and svc_xprt_release(). In particular, using release_pages() instead of complete() cuts the time per svc_rdma_wc_send() call by two-thirds. This helps improve scalability because svc_rdma_wc_send() is single-threaded per connection. Reviewed-by: Tom Talpey <tom@talpey.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17svcrdma: Prevent page release when nothing was receivedChuck Lever1-6/+6
I noticed that svc_rqst_release_pages() was still unnecessarily releasing a page when svc_rdma_recvfrom() returns zero. Fixes: a53d5cb0646a ("svcrdma: Avoid releasing a page in svc_xprt_release()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12svcrdma: Revert 2a1e4f21d841 ("svcrdma: Normalize Send page handling")Chuck Lever2-22/+13
Get rid of the completion wait in svc_rdma_sendto(), and release pages in the send completion handler again. A subsequent patch will handle releasing those pages more efficiently. Reverted by hand: patch -R would not apply cleanly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12SUNRPC: Revert 579900670ac7 ("svcrdma: Remove unused sc_pages field")Chuck Lever1-0/+25
Pre-requisite for releasing pages in the send completion handler. Reverted by hand: patch -R would not apply cleanly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12SUNRPC: Revert cc93ce9529a6 ("svcrdma: Retain the page backing ↵Chuck Lever1-5/+0
rq_res.head[0].iov_base") Pre-requisite for releasing pages in the send completion handler. Reverted by hand: patch -R would not apply cleanly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12svcrdma: Clean up allocation of svc_rdma_rw_ctxtChuck Lever1-4/+6
The physical device's favored NUMA node ID is available when allocating a rw_ctxt. Use that value instead of relying on the assumption that the memory allocation happens to be running on a node close to the device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12svcrdma: Clean up allocation of svc_rdma_send_ctxtChuck Lever1-5/+4
The physical device's favored NUMA node ID is available when allocating a send_ctxt. Use that value instead of relying on the assumption that the memory allocation happens to be running on a node close to the device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12svcrdma: Clean up allocation of svc_rdma_recv_ctxtChuck Lever1-11/+7
The physical device's favored NUMA node ID is available when allocating a recv_ctxt. Use that value instead of relying on the assumption that the memory allocation happens to be running on a node close to the device. This clean up eliminates the hack of destroying recv_ctxts that were not created by the receive CQ thread -- recv_ctxts are now always allocated on a "good" node. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12svcrdma: Allocate new transports on device's NUMA nodeChuck Lever1-9/+9
The physical device's NUMA node ID is available when allocating an svc_xprt for an incoming connection. Use that value to ensure the svc_xprt structure is allocated on the NUMA node closest to the device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-17Merge tag 'nfsd-6.4-1' of ↵Linus Torvalds2-7/+6
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - A collection of minor bug fixes * tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Remove open coding of string copy SUNRPC: Fix trace_svc_register() call site SUNRPC: always free ctxt when freeing deferred request SUNRPC: double free xprt_ctxt while still in use SUNRPC: Fix error handling in svc_setup_socket() SUNRPC: Fix encoding of accepted but unsuccessful RPC replies lockd: define nlm_port_min,max with CONFIG_SYSCTL nfsd: define exports_proc_ops with CONFIG_PROC_FS SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-14SUNRPC: always free ctxt when freeing deferred requestNeilBrown2-7/+6
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not been sufficient to use kfree() to free a deferred request. We may need to free the ctxt as well. As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose it to explicit do that even when the ctxt is not stored in an rqst. So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt, which may have been taken either from an rqst or from a dreq. The caller is now responsible for clearing that pointer after the call to ->xpo_release_ctxt. We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when revisiting a deferred request. This ensures there is only one pointer to the ctxt, so the risk of double freeing in future is reduced. The new code in svc_xprt_release which releases both the ctxt and any rq_deferred depends on this. Fixes: 773f91b2cf3f ("SUNRPC: Fix NFSD's request deferral on RDMA transports") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-29Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds1-19/+2
Pull nfsd updates from Chuck Lever: "The big ticket item for this release is that support for RPC-with-TLS [RFC 9289] has been added to the Linux NFS server. The goal is to provide a simple-to-deploy, low-overhead in-transit confidentiality and peer authentication mechanism. It can supplement NFS Kerberos and it can protect the use of legacy non-cryptographic user authentication flavors such as AUTH_SYS. The TLS Record protocol is handled entirely by kTLS, meaning it can use either software encryption or offload encryption to smart NICs. Aside from that, work continues on improving NFSD's open file cache. Among the many clean-ups in that area is a patch to convert the rhashtable to use the list-hashing version of that data structure" * tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits) NFSD: Handle new xprtsec= export option SUNRPC: Support TLS handshake in the server-side TCP socket code NFSD: Clean up xattr memory allocation flags NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop SUNRPC: Clear rq_xid when receiving a new RPC Call SUNRPC: Recognize control messages in server-side TCP socket code SUNRPC: Be even lazier about releasing pages SUNRPC: Convert svc_xprt_release() to the release_pages() API SUNRPC: Relocate svc_free_res_pages() nfsd: simplify the delayed disposal list code SUNRPC: Ignore return value of ->xpo_sendto SUNRPC: Ensure server-side sockets have a sock->file NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor() sunrpc: simplify two-level sysctl registration for svcrdma_parm_table SUNRPC: return proper error from get_expiry() lockd: add some client-side tracepoints nfs: move nfs_fhandle_hash to common include file lockd: server should unlock lock if client rejects the grant lockd: fix races in client GRANTED_MSG wait logic lockd: move struct nlm_wait to lockd.h ...
2023-04-26sunrpc: simplify two-level sysctl registration for svcrdma_parm_tableLuis Chamberlain1-19/+2
There is no need to declare two tables to just create directories, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-11sunrpc: simplify one-level sysctl registration for xr_tunables_tableLuis Chamberlain1-10/+1
There is no need to declare an extra tables to just create directory, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-20SUNRPC: Remove ->xpo_secure_port()Chuck Lever2-7/+1
There's no need for the cost of this extra virtual function call during every RPC transaction: the RQ_SECURE bit can be set properly in ->xpo_recvfrom() instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-01-11Merge tag 'nfsd-6.2-3' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a race when creating NFSv4 files - Revert the use of relaxed bitops * tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Use set_bit(RQ_DROPME) Revert "SUNRPC: Use RMW bitops in single-threaded hot paths" nfsd: fix handling of cached open files in nfsd4_open codepath
2023-01-06Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"Chuck Lever1-1/+1
The premise that "Once an svc thread is scheduled and executing an RPC, no other processes will touch svc_rqst::rq_flags" is false. svc_xprt_enqueue() examines the RQ_BUSY flag in scheduled nfsd threads when determining which thread to wake up next. Found via KCSAN. Fixes: 28df0988815f ("SUNRPC: Use RMW bitops in single-threaded hot paths") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-12-06xprtrdma: Fix regbuf data not freed in rpcrdma_req_create()Zhang Xiaoxu1-1/+1
If rdma receive buffer allocate failed, should call rpcrdma_regbuf_free() to free the send buffer, otherwise, the buffer data will be leaked. Fixes: bb93a1ae2bf4 ("xprtrdma: Allocate req's regbufs at xprt create time") Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-10-05xprtrdma: Fix uninitialized variableChuck Lever1-2/+1
net/sunrpc/xprtrdma/frwr_ops.c:151:32: warning: variable 'rc' is uninitialized when used here [-Wuninitialized] trace_xprtrdma_frwr_alloc(mr, rc); ^~ net/sunrpc/xprtrdma/frwr_ops.c:127:8: note: initialize the variable 'rc' to silence this warning int rc; ^ = 0 1 warning generated. The tracepoint is intended to record the error returned from ib_alloc_mr(). In the current code there is no other purpose for @rc, so simply replace it. Reported-by: kernel test robot <lkp@intel.com> Fixes: d8cf39a280c3b0 ('xprtrdma: MR-related memory allocation should be allowed to fail') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05xprtrdma: Prevent memory allocations from driving a reclaimChuck Lever1-4/+4
Many memory allocations that xprtrdma does can fail safely. Let's use this fact to avoid some potential deadlocks: Replace GFP_KERNEL with GFP flags that do not try hard to acquire memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05xprtrdma: Memory allocation should be allowed to fail during connectChuck Lever1-3/+3
An attempt to establish a connection can always fail and then be retried. GFP_KERNEL allocation is not necessary here. Like MR allocation, establishing a connection is always done in a worker thread. The new GFP flags align with the flags that would be returned by rpc_task_gfp_mask() in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05xprtrdma: MR-related memory allocation should be allowed to failChuck Lever3-11/+17
xprtrdma always drives a retry of MR allocation if it should fail. It should be safe to not use GFP_KERNEL for this purpose rather than sleeping in the memory allocator. In theory, if these weaker allocations are attempted first, memory exhaustion is likely to cause xprtrdma to fail fast and not then invoke the RDMA core APIs, which still might use GFP_KERNEL. Also note that rpc_task_gfp_mask() always sets __GFP_NORETRY and __GFP_NOWARN when an RPC-related allocation is being done in a worker thread. MR allocation is already always done in worker threads. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05xprtrdma: Clean up synopsis of rpcrdma_regbuf_alloc()Chuck Lever1-11/+8
Currently all rpcrdma_regbuf_alloc() call sites pass the same value as their third argument. That argument can therefore be eliminated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05xprtrdma: Clean up synopsis of rpcrdma_req_create()Chuck Lever3-11/+11
Commit 1769e6a816df ("xprtrdma: Clean up rpcrdma_create_req()") added rpcrdma_req_create() with a GFP flags argument in case a caller might want to avoid waiting for memory. There has never been a caller that does not pass GFP_KERNEL as the third argument. That argument can therefore be eliminated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05svcrdma: Clean up RPCRDMA_DEF_GFPChuck Lever2-4/+2
xprt_rdma_bc_allocate() is now the only user of RPCRDMA_DEF_GFP. Replace that macro with the raw flags. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05SUNRPC: Replace the use of the xprtiod WQ in rpcrdmaChuck Lever2-9/+4
While setting up a new lab, I accidentally misconfigured the Ethernet port for a system that tried an NFS mount using RoCE. This made the NFS server unreachable. The following WARNING popped on the NFS client while waiting for the mount attempt to time out: kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI> kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma> kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13 kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017 kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] kernel: RIP: 0010:check_flush_dependency+0xbf/0xca kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be> kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092 kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027 kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731 kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0 kernel: FS: 0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 kernel: Call Trace: kernel: <TASK> kernel: __flush_work.isra.0+0xaf/0x188 kernel: ? _raw_spin_lock_irqsave+0x2c/0x37 kernel: ? lock_timer_base+0x38/0x5f kernel: __cancel_work_timer+0xea/0x13d kernel: ? preempt_latency_start+0x2b/0x46 kernel: rdma_addr_cancel+0x70/0x81 [ib_core] kernel: _destroy_id+0x1a/0x246 [rdma_cm] kernel: rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma] kernel: ? _raw_spin_unlock+0x14/0x29 kernel: ? raw_spin_rq_unlock_irq+0x5/0x10 kernel: ? finish_task_switch.isra.0+0x171/0x249 kernel: xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma] kernel: process_one_work+0x1d8/0x2d4 kernel: worker_thread+0x18b/0x24f kernel: ? rescuer_thread+0x280/0x280 kernel: kthread+0xf4/0xfc kernel: ? kthread_complete_and_exit+0x1b/0x1b kernel: ret_from_fork+0x22/0x30 kernel: </TASK> SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that one of its work items tries to cancel has to be WQ_MEM_RECLAIM to prevent a priority inversion. The internal workqueues in the RDMA/core are currently non-MEM_RECLAIM. Jason Gunthorpe says this about the current state of RDMA/core: > If you attempt to do a reconnection/etc from within a RECLAIM > context it will deadlock on one of the many allocations that are > made to support opening the connection. > > The general idea of reclaim is that the entire task context > working under the reclaim is marked with an override of the gfp > flags to make all allocations under that call chain reclaim safe. > > But rdmacm does allocations outside this, eg in the WQs processing > the CM packets. So this doesn't work and we will deadlock. > > Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM > here and there. So we will change the ULP in this case to avoid the use of WQ_MEM_RECLAIM where possible. Deadlocks that were possible before are not fixed, but at least we no longer have a false sense of confidence that the stack won't allocate memory during memory reclaim. Suggested-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-07-11SUNRPC: Fix an RPC/RDMA performance regressionTrond Myklebust1-5/+1
Use the standard gfp mask instead of using GFP_NOWAIT. The latter causes issues when under memory pressure. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-06-11Merge tag 'nfsd-5.19-1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Notable changes: - There is now a backup maintainer for NFSD Notable fixes: - Prevent array overruns in svc_rdma_build_writes() - Prevent buffer overruns when encoding NFSv3 READDIR results - Fix a potential UAF in nfsd_file_put()" * tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_commit_encode() SUNRPC: Optimize xdr_reserve_space() SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer() SUNRPC: Trap RDMA segment overflows NFSD: Fix potential use-after-free in nfsd_file_put() MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
2022-06-02SUNRPC: Trap RDMA segment overflowsChuck Lever1-2/+2
Prevent svc_rdma_build_writes() from walking off the end of a Write chunk's segment array. Caught with KASAN. The test that this fix replaces is invalid, and might have been left over from an earlier prototype of the PCL work. Fixes: 7a1cbfa18059 ("svcrdma: Use parsed chunk lists to construct RDMA Writes") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-06-01Merge tag 'nfs-for-5.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-0/+5
Pull NFS client updates from Anna Schumaker: "New Features: - Add support for 'dacl' and 'sacl' attributes Bugfixes and Cleanups: - Fixes for reporting mapping errors - Fixes for memory allocation errors - Improve warning message when locks are lost - Update documentation for the nfs4_unique_id parameter - Add an explanation of NFSv4 client identifiers - Ensure the i_size attribute is written to the fscache storage - Fix freeing uninitialized nfs4_labels - Better handling when xprtrdma bc_serv is NULL - Mark qualified async operations as MOVEABLE tasks" * tag 'nfs-for-5.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1 mark qualified async operations as MOVEABLE tasks xprtrdma: treat all calls not a bcall when bc_serv is NULL NFSv4: Fix free of uninitialized nfs4_label on referral lookup. NFS: Pass i_size to fscache_unuse_cookie() when a file is released Documentation: Add an explanation of NFSv4 client identifiers NFS: update documentation for the nfs4_unique_id parameter NFS: Improve warning message when locks are lost. NFSv4.1: Enable access to the NFSv4.1 'dacl' and 'sacl' attributes NFSv4: Add encoders/decoders for the NFSv4.1 dacl and sacl attributes NFSv4: Specify the type of ACL to cache NFSv4: Don't hold the layoutget locks across multiple RPC calls pNFS/files: Fall back to I/O through the MDS on non-fatal layout errors NFS: Further fixes to the writeback error handling NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout NFS: Memory allocation failures are not server fatal errors NFS: Don't report errors from nfs_pageio_complete() more than once NFS: Do not report flush errors in nfs_write_end() NFS: Don't report ENOSPC write errors twice NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS NFS: Do not report EINTR/ERESTARTSYS as mapping errors
2022-06-01xprtrdma: treat all calls not a bcall when bc_serv is NULLKinglong Mee1-0/+5
When a rdma server returns a fault format reply, nfs v3 client may treats it as a bcall when bc service is not exist. The debug message at rpcrdma_bc_receive_call are, [56579.837169] RPC: rpcrdma_bc_receive_call: callback XID 00000001, length=20 [56579.837174] RPC: rpcrdma_bc_receive_call: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 After that, rpcrdma_bc_receive_call will meets NULL pointer as, [ 226.057890] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8 ... [ 226.058704] RIP: 0010:_raw_spin_lock+0xc/0x20 ... [ 226.059732] Call Trace: [ 226.059878] rpcrdma_bc_receive_call+0x138/0x327 [rpcrdma] [ 226.060011] __ib_process_cq+0x89/0x170 [ib_core] [ 226.060092] ib_cq_poll_work+0x26/0x80 [ib_core] [ 226.060257] process_one_work+0x1a7/0x360 [ 226.060367] ? create_worker+0x1a0/0x1a0 [ 226.060440] worker_thread+0x30/0x390 [ 226.060500] ? create_worker+0x1a0/0x1a0 [ 226.060574] kthread+0x116/0x130 [ 226.060661] ? kthread_flush_work_fn+0x10/0x10 [ 226.060724] ret_from_fork+0x35/0x40 ... Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-27Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-1/+1
Pull rdma updates from Jason Gunthorpe: "Small collection of incremental improvement patches: - Minor code cleanup patches, comment improvements, etc from static tools - Clean the some of the kernel caps, reducing the historical stealth uAPI leftovers - Bug fixes and minor changes for rdmavt, hns, rxe, irdma - Remove unimplemented cruft from rxe - Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer - flush_workqueue(system_unbound_wq) removal - Ensure rxe waits for objects to be unused before allowing the core to free them - Several rc quality bug fixes for hfi1" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits) RDMA/rtrs-clt: Fix one kernel-doc comment RDMA/hfi1: Remove all traces of diagpkt support RDMA/hfi1: Consolidate software versions RDMA/hfi1: Remove pointless driver version RDMA/hfi1: Fix potential integer multiplication overflow errors RDMA/hfi1: Prevent panic when SDMA is disabled RDMA/hfi1: Prevent use of lock before it is initialized RDMA/rxe: Fix an error handling path in rxe_get_mcg() IB/core: Fix typo in comment RDMA/core: Fix typo in comment IB/hf1: Fix typo in comment IB/qib: Fix typo in comment IB/iser: Fix typo in comment RDMA/mlx4: Avoid flush_scheduled_work() usage IB/isert: Avoid flush_scheduled_work() usage RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr() RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq() RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx() RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx() RDMA/irdma: Add SW mechanism to generate completions on error ...
2022-05-24Merge tag 'v5.18' into rdma.git for-nextJason Gunthorpe1-1/+1
Following patches have dependencies. Resolve the merge conflict in drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names for the fs functions following linux-next: https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-05-23SUNRPC: Use RMW bitops in single-threaded hot pathsChuck Lever1-1/+1
I noticed CPU pipeline stalls while using perf. Once an svc thread is scheduled and executing an RPC, no other processes will touch svc_rqst::rq_flags. Thus bus-locked atomics are not needed outside the svc thread scheduler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19SUNRPC: Remove svc_rqst::rq_xprt_hlenChuck Lever1-1/+0
Clean up: This field is now always set to zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-04-13Merge tag 'nfsd-5.18-1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a write performance regression - Fix crashes during request deferral on RDMA transports * tag 'nfsd-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Fix the svc_deferred_event trace class SUNRPC: Fix NFSD's request deferral on RDMA transports nfsd: Clean up nfsd_file_put() nfsd: Fix a write performance regression SUNRPC: Return true/false (not 1/0) from bool functions
2022-04-06RDMA: Split kernel-only global device caps from uverbs device capsJason Gunthorpe1-1/+1
Split out flags from ib_device::device_cap_flags that are only used internally to the kernel into kernel_cap_flags that is not part of the uapi. This limits the device_cap_flags to being the same bitmap that will be copied to userspace. This cleanly splits out the uverbs flags from the kernel flags to avoid confusion in the flags bitmap. Add some short comments describing which each of the kernel flags is connected to. Remove unused kernel flags. Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-06SUNRPC: Fix NFSD's request deferral on RDMA transportsChuck Lever1-1/+1
Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further investigation shows that the crash occurred while NFSD was handling a deferred request. This patch addresses two inter-related issues that prevent request deferral from working correctly for RPC/RDMA requests: 1. Prevent the crash by ensuring that the original svc_rqst::rq_xprt_ctxt value is available when the request is revisited. Otherwise svc_rdma_sendto() does not have a Receive context available with which to construct its reply. 2. Possibly since before commit 71641d99ce03 ("svcrdma: Properly compute .len and .buflen for received RPC Calls"), svc_rdma_recvfrom() did not include the transport header in the returned xdr_buf. There should have been no need for svc_defer() and friends to save and restore that header, as of that commit. This issue is addressed in a backport-friendly way by simply having svc_rdma_recvfrom() set rq_xprt_hlen to zero unconditionally, just as svc_tcp_recvfrom() does. This enables svc_deferred_recv() to correctly reconstruct an RPC message received via RPC/RDMA. Reported-by: Trond Myklebust <trondmy@hammerspace.com> Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@hammerspace.com/ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org>
2022-03-30Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-6/+10
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Switch NFS to use readahead instead of the obsolete readpages. - Readdir fixes to improve cacheability of large directories when there are multiple readers and writers. - Readdir performance improvements when doing a seekdir() immediately after opening the directory (common when re-exporting NFS). - NFS swap improvements from Neil Brown. - Loosen up memory allocation to permit direct reclaim and write back in cases where there is no danger of deadlocking the writeback code or NFS swap. - Avoid sillyrename when the NFSv4 server claims to support the necessary features to recover the unlinked but open file after reboot. Bugfixes: - Patch from Olga to add a mount option to control NFSv4.1 session trunking discovery, and default it to being off. - Fix a lockup in nfs_do_recoalesce(). - Two fixes for list iterator variables being used when pointing to the list head. - Fix a kernel memory scribble when reading from a non-socket transport in /sys/kernel/sunrpc. - Fix a race where reconnecting to a server could leave the TCP socket stuck forever in the connecting state. - Patch from Neil to fix a shutdown race which can leave the SUNRPC transport timer primed after we free the struct xprt itself. - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy offload. - Sunrpc patch from Olga to avoid resending a task on an offlined transport. Cleanups: - Patches from Dave Wysochanski to clean up the fscache code" * tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits) NFSv4/pNFS: Fix another issue with a list iterator pointing to the head NFS: Don't loop forever in nfs_do_recoalesce() SUNRPC: Don't return error values in sysfs read of closed files SUNRPC: Do not dereference non-socket transports in sysfs NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error SUNRPC don't resend a task on an offlined transport NFS: replace usage of found with dedicated list iterator variable SUNRPC: avoid race between mod_timer() and del_timer_sync() pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod NFS: Avoid writeback threads getting stuck in mempool_alloc() NFS: nfsiod should not block forever in mempool_alloc() SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent SUNRPC: Fix unx_lookup_cred() allocation NFS: Fix memory allocation in rpc_alloc_task() NFS: Fix memory allocation in rpc_malloc() SUNRPC: Improve accuracy of socket ENOBUFS determination SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE SUNRPC: Fix socket waits for write buffer space ...
2022-03-13SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOCNeilBrown1-2/+4
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC to be used for some allocations. This is needed in some cases, but not in all where it is currently provided, and in some where it isn't provided. Currently *all* tasks associated with a rpc_client on which swap is enabled get the flag and hence some GFP_MEMALLOC support. GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it. However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does need it. xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks, then it blocks all other queued tasks. So this allocation needs GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used for any swap writes. Similarly, if the transport is not connected, that will block all requests including swap writes, so memory allocations should get GFP_MEMALLOC if swap writes are possible. So with this patch: 1/ we ONLY set RPC_TASK_SWAPPER for swap writes. 2/ __rpc_execute() sets PF_MEMALLOC while handling any task with RPC_TASK_SWAPPER set, or when handling any task that holds the XPRT_LOCKED lock on an xprt used for swap. This removes the need for the RPC_IS_SWAPPER() test in ->buf_alloc handlers. 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking any task to a swapper xprt. __rpc_execute() will clear it. 3/ PF_MEMALLOC is set for all the connect workers. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts) Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/xprt: async tasks mustn't block waiting for memoryNeilBrown1-1/+1
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. xprt_dynamic_alloc_slot can block indefinitely. This can tie up all workqueue threads and NFS can deadlock. So when called from a workqueue, set __GFP_NORETRY. The rdma alloc_slot already does not block. However it sets the error to -EAGAIN suggesting this will trigger a sleep. It does not. As we can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN causes immediate retry. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/call_alloc: async tasks mustn't block waiting for memoryNeilBrown1-1/+3
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. rpc_malloc() can block, and this might cause deadlocks. So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if blocking is acceptable. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-28SUNRPC: Rename svc_close_xprt()Chuck Lever1-1/+1
Clean up: Use the "svc_xprt_<task>" function naming convention as is used for other external APIs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-26SUNRPC/xprtrdma: Convert GFP_NOFS to GFP_KERNELTrond Myklebust2-3/+3
Assume that the upper layers have set memalloc_nofs_save/restore as appropriate. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-08xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_createDan Aloni1-0/+3
If there are failures then we must not leave the non-NULL pointers with the error value, otherwise `rpcrdma_ep_destroy` gets confused and tries free them, resulting in an Oops. Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-14xprtrdma: Remove definitions of RPCDBG_FACILITYChuck Lever5-27/+0
Deprecated. dprintk is no longer used in xprtrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>