summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprt.c
AgeCommit message (Collapse)AuthorFilesLines
2022-11-24timers: Get rid of del_singleshot_timer_sync()Thomas Gleixner1-1/+1
del_singleshot_timer_sync() used to be an optimization for deleting timers which are not rearmed from the timer callback function. This optimization turned out to be broken and got mapped to del_timer_sync() about 17 years ago. Get rid of the undocumented indirection and use del_timer_sync() directly. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20221123201624.706987932@linutronix.de
2022-10-17Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1
2022-10-12treewide: use get_random_u32() when possibleJason A. Donenfeld1-1/+1
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-03SUNRPC: use max_t() to simplify open codeZiyang Xuan1-4/+1
Use max_t() to simplify open code which uses "if...else" to get maximum of two values. Generated by coccinelle script: scripts/coccinelle/misc/minmax.cocci Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-03SUNRPC: Directly use ida_alloc()/free()Bo Liu1-2/+2
Use ida_alloc()/ida_free() instead of ida_simple_get()/ida_simple_remove(). The latter is deprecated and more verbose. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-09-01SUNRPC: Fix call completion races with call_decode()Trond Myklebust1-4/+4
We need to make sure that the req->rq_private_buf is completely up to date before we make req->rq_reply_bytes_recvd visible to the call_decode() routine in order to avoid triggering the WARN_ON(). Reported-by: Benjamin Coddington <bcodding@redhat.com> Fixes: 72691a269f0b ("SUNRPC: Don't reuse bvec on retransmission of the request") Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-27SUNRPC: Don't reuse bvec on retransmission of the requestTrond Myklebust1-9/+18
If a request is re-encoded and then retransmitted, we need to make sure that we also re-encode the bvec, in case the page lists have changed. Fixes: ff053dbbaffe ("SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-25SUNRPC create an rpc function that allows xprt removal from rpc_clntOlga Kornievskaia1-1/+1
Expose a function that allows a removal of xprt from the rpc_clnt. When called from NFS that's running a trunked transport then don't decrement the active transport counter. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-25SUNRPC expose functions for offline remote xprt functionalityOlga Kornievskaia1-0/+32
Re-arrange the code that make offline transport and delete transport callable functions. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-08Merge tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-14/+16
Pull NFS client fixes from Trond Myklebust: "Stable fixes: - SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() Bugfixes: - Fix an Oopsable condition due to SLAB_ACCOUNT setting in the NFSv4.2 xattr code. - Fix for open() using an file open mode of '3' in NFSv4 - Replace readdir's use of xxhash() with hash_64() - Several patches to handle malloc() failure in SUNRPC" * tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg() SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec() SUNRPC: Handle allocation failure in rpc_new_task() NFS: Ensure rpc_run_task() cannot fail in nfs_async_rename() NFSv4/pnfs: Handle RPC allocation errors in nfs4_proc_layoutget SUNRPC: Handle low memory situations in call_status() SUNRPC: Handle ENOMEM in call_transmit_status() NFSv4.2: Fix missing removal of SLAB_ACCOUNT on kmem_cache allocation SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() NFS: Replace readdir's use of xxhash() with hash_64() SUNRPC: handle malloc failure in ->request_prepare NFSv4: fix open failure with O_ACCMODE flag Revert "NFSv4: Handle the special Linux file open access mode"
2022-04-07SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()Trond Myklebust1-6/+1
We must ensure that all sockets are closed before we call xprt_free() and release the reference to the net namespace. The problem is that calling fput() will defer closing the socket until delayed_fput() gets called. Let's fix the situation by allowing rpciod and the transport teardown code (which runs on the system wq) to call __fput_sync(), and directly close the socket. Reported-by: Felix Fu <foyjog@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: a73881c96d73 ("SUNRPC: Fix an Oops in udp_poll()") Cc: stable@vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect Cc: stable@vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket Cc: stable@vger.kernel.org # 5.1.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-30SUNRPC: handle malloc failure in ->request_prepareNeilBrown1-8/+15
If ->request_prepare() detects an error, it sets ->rq_task->tk_status. This is easy for callers to ignore. The only caller is xprt_request_enqueue_receive() and it does ignore the error, as does call_encode() which calls it. This can result in a request being queued to receive a reply without an allocated receive buffer. So instead of setting rq_task->tk_status, return an error, and store in ->tk_status only in call_encode(); The call to xprt_request_enqueue_receive() is now earlier in call_encode(), where the error can still be handled. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-30Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-12/+11
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Switch NFS to use readahead instead of the obsolete readpages. - Readdir fixes to improve cacheability of large directories when there are multiple readers and writers. - Readdir performance improvements when doing a seekdir() immediately after opening the directory (common when re-exporting NFS). - NFS swap improvements from Neil Brown. - Loosen up memory allocation to permit direct reclaim and write back in cases where there is no danger of deadlocking the writeback code or NFS swap. - Avoid sillyrename when the NFSv4 server claims to support the necessary features to recover the unlinked but open file after reboot. Bugfixes: - Patch from Olga to add a mount option to control NFSv4.1 session trunking discovery, and default it to being off. - Fix a lockup in nfs_do_recoalesce(). - Two fixes for list iterator variables being used when pointing to the list head. - Fix a kernel memory scribble when reading from a non-socket transport in /sys/kernel/sunrpc. - Fix a race where reconnecting to a server could leave the TCP socket stuck forever in the connecting state. - Patch from Neil to fix a shutdown race which can leave the SUNRPC transport timer primed after we free the struct xprt itself. - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy offload. - Sunrpc patch from Olga to avoid resending a task on an offlined transport. Cleanups: - Patches from Dave Wysochanski to clean up the fscache code" * tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits) NFSv4/pNFS: Fix another issue with a list iterator pointing to the head NFS: Don't loop forever in nfs_do_recoalesce() SUNRPC: Don't return error values in sysfs read of closed files SUNRPC: Do not dereference non-socket transports in sysfs NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error SUNRPC don't resend a task on an offlined transport NFS: replace usage of found with dedicated list iterator variable SUNRPC: avoid race between mod_timer() and del_timer_sync() pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod NFS: Avoid writeback threads getting stuck in mempool_alloc() NFS: nfsiod should not block forever in mempool_alloc() SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent SUNRPC: Fix unx_lookup_cred() allocation NFS: Fix memory allocation in rpc_alloc_task() NFS: Fix memory allocation in rpc_malloc() SUNRPC: Improve accuracy of socket ENOBUFS determination SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE SUNRPC: Fix socket waits for write buffer space ...
2022-03-23SUNRPC: avoid race between mod_timer() and del_timer_sync()NeilBrown1-0/+7
xprt_destory() claims XPRT_LOCKED and then calls del_timer_sync(). Both xprt_unlock_connect() and xprt_release() call ->release_xprt() which drops XPRT_LOCKED and *then* xprt_schedule_autodisconnect() which calls mod_timer(). This may result in mod_timer() being called *after* del_timer_sync(). When this happens, the timer may fire long after the xprt has been freed, and run_timer_softirq() will probably crash. The pairing of ->release_xprt() and xprt_schedule_autodisconnect() is always called under ->transport_lock. So if we take ->transport_lock to call del_timer_sync(), we can be sure that mod_timer() will run first (if it runs at all). Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Make the rpciod and xprtiod slab allocation modes consistentTrond Myklebust1-4/+1
Make sure that rpciod and xprtiod are always using the same slab allocation modes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOCNeilBrown1-0/+3
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC to be used for some allocations. This is needed in some cases, but not in all where it is currently provided, and in some where it isn't provided. Currently *all* tasks associated with a rpc_client on which swap is enabled get the flag and hence some GFP_MEMALLOC support. GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it. However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does need it. xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks, then it blocks all other queued tasks. So this allocation needs GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used for any swap writes. Similarly, if the transport is not connected, that will block all requests including swap writes, so memory allocations should get GFP_MEMALLOC if swap writes are possible. So with this patch: 1/ we ONLY set RPC_TASK_SWAPPER for swap writes. 2/ __rpc_execute() sets PF_MEMALLOC while handling any task with RPC_TASK_SWAPPER set, or when handling any task that holds the XPRT_LOCKED lock on an xprt used for swap. This removes the need for the RPC_IS_SWAPPER() test in ->buf_alloc handlers. 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking any task to a swapper xprt. __rpc_execute() will clear it. 3/ PF_MEMALLOC is set for all the connect workers. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts) Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC: remove scheduling boost for "SWAPPER" tasks.NeilBrown1-11/+0
Currently, tasks marked as "swapper" tasks get put to the front of non-priority rpc_queues, and are sorted earlier than non-swapper tasks on the transport's ->xmit_queue. This is pointless as currently *all* tasks for a mount that has swap enabled on *any* file are marked as "swapper" tasks. So the net result is that the non-priority rpc_queues are reverse-ordered (LIFO). This scheduling boost is not necessary to avoid deadlocks, and hurts fairness, so remove it. If there were a need to expedite some requests, the tk_priority mechanism is a more appropriate tool. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/xprt: async tasks mustn't block waiting for memoryNeilBrown1-1/+4
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. xprt_dynamic_alloc_slot can block indefinitely. This can tie up all workqueue threads and NFS can deadlock. So when called from a workqueue, set __GFP_NORETRY. The rdma alloc_slot already does not block. However it sets the error to -EAGAIN suggesting this will trigger a sleep. It does not. As we can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN causes immediate retry. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-26SUNRPC: Convert GFP_NOFS to GFP_KERNELTrond Myklebust1-1/+1
The sections which should not re-enter the filesystem are already protected with memalloc_nofs_save/restore calls, so it is better to use GFP_KERNEL in these calls to allow better performance for synchronous RPC calls. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-01-28SUNRPC: add netns refcount tracker to struct rpc_xprtEric Dumazet1-2/+2
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-05SUNRPC: Prevent immediate close+reconnectTrond Myklebust1-1/+2
If we have already set up the socket and are waiting for it to connect, then don't immediately close and retry. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-05SUNRPC: Fix races when closing the socketTrond Myklebust1-0/+2
Ensure that we bump the xprt->connect_cookie when we set the XPRT_CLOSE_WAIT flag so that another call to xprt_conditional_disconnect() won't race with the reconnection. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-04SUNRPC: xprt_clear_locked() only needs release memory semanticsTrond Myklebust1-5/+3
The clearing of the XPRT_LOCKED bit has to happen after we clear xprt->snd_task, but we don't require any extra memory barriers after that. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-04SUNRPC: Partial revert of commit 6f9f17287e78Trond Myklebust1-13/+15
The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in xprt_transmit()") was that cond_resched() is expensive and unnecessary when there has been just a single send. The point of cond_resched() is to ensure that tasks that should pre-empt this one get a chance to do so when it is safe to do so. The code prior to commit 6f9f17287e78 failed to take into account that it was keeping a rpc_task pinned for longer than it needed to, and so rather than doing a full revert, let's just move the cond_resched. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-09-04Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-12/+20
Pull NFS client updates from Anna Schumaker: "New Features: - Better client responsiveness when server isn't replying - Use refcount_t in sunrpc rpc_client refcount tracking - Add srcaddr and dst_port to the sunrpc sysfs info files - Add basic support for connection sharing between servers with multiple NICs` Bugfixes and Cleanups: - Sunrpc tracepoint cleanups - Disconnect after ib_post_send() errors to avoid deadlocks - Fix for tearing down rpcrdma_reps - Fix a potential pNFS layoutget livelock loop - pNFS layout barrier fixes - Fix a potential memory corruption in rpc_wake_up_queued_task_set_status() - Fix reconnection locking - Fix return value of get_srcport() - Remove rpcrdma_post_sends() - Remove pNFS dead code - Remove copy size restriction for inter-server copies - Overhaul the NFS callback service - Clean up sunrpc TCP socket shutdowns - Always provide aligned buffers to RPC read layers" * tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits) NFS: Always provide aligned buffers to the RPC read layers NFSv4.1 add network transport when session trunking is detected SUNRPC enforce creation of no more than max_connect xprts NFSv4 introduce max_connect mount options SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs SUNRPC keep track of number of transports to unique addresses NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox SUNRPC: Tweak TCP socket shutdown in the RPC client SUNRPC: Simplify socket shutdown when not reusing TCP ports NFSv4.2: remove restriction of copy size for inter-server copy. NFS: Clean up the synopsis of callback process_op() NFS: Extract the xdr_init_encode/decode() calls from decode_compound NFS: Remove unused callback void decoder NFS: Add a private local dispatcher for NFSv4 callback operations SUNRPC: Eliminate the RQ_AUTHERR flag SUNRPC: Set rq_auth_stat in the pg_authenticate() callout SUNRPC: Add svc_rqst::rq_auth_stat SUNRPC: Add dst_port to the sysfs xprt info file SUNRPC: Add srcaddr as a file in sysfs sunrpc: Fix return value of get_srcport() ...
2021-08-20SUNRPC: Move client-side disconnect injectionChuck Lever1-0/+14
Disconnect injection stress-tests the ability for both client and server implementations to behave resiliently in the face of network instability. Convert the existing client-side disconnect injection infrastructure to use the kernel's generic error injection facility. The generic facility has a richer set of injection criteria. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-09SUNRPC/xprtrdma: Fix reconnection lockingTrond Myklebust1-0/+2
The xprtrdma client code currently relies on the task that initiated the connect to hold the XPRT_LOCK for the duration of the connection attempt. If the task is woken early, due to some other event, then that lock could get released early. Avoid races by using the same mechanism that the socket code uses of transferring lock ownership to the RDMA connect worker itself. That frees us to call rpcrdma_xprt_disconnect() directly since we're now guaranteed exclusion w.r.t. other callers. Fixes: 4cf44be6f1e8 ("xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09SUNRPC: Clean up scheduling of autocloseTrond Myklebust1-12/+16
Consolidate duplicated code in xprt_force_disconnect() and xprt_conditional_disconnect(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09SUNRPC: Fix potential memory corruptionTrond Myklebust1-2/+4
We really should not call rpc_wake_up_queued_task_set_status() with xprt->snd_task as an argument unless we are certain that is actually an rpc_task. Fixes: 0445f92c5d53 ("SUNRPC: Fix disconnection races") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-07-08sunrpc: add dst_attr attributes to the sysfs xprt directoryOlga Kornievskaia1-1/+3
Allow to query and set the destination's address of a transport. Setting of the destination address is allowed only for TCP or RDMA based connections. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-07-08sunrpc: add xprt idOlga Kornievskaia1-0/+26
This adds a unique identifier for a sunrpc transport in sysfs, which is similarly managed to the unique IDs of clients. Signed-off-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26SUNRPC: More fixes for backlog congestionTrond Myklebust1-30/+28
Ensure that we fix the XPRT_CONGESTED starvation issue for RDMA as well as socket based transports. Ensure we always initialise the request after waking up from the backlog list. Fixes: e877a88d1f06 ("SUNRPC in case of backlog, hand free slots directly to waiting task") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20SUNRPC in case of backlog, hand free slots directly to waiting taskNeilBrown1-21/+47
If sunrpc.tcp_max_slot_table_entries is small and there are tasks on the backlog queue, then when a request completes it is freed and the first task on the queue is woken. The expectation is that it will wake and claim that request. However if it was a sync task and the waiting process was killed at just that moment, it will wake and NOT claim the request. As long as TASK_CONGESTED remains set, requests can only be claimed by tasks woken from the backlog, and they are woken only as requests are freed, so when a task doesn't claim a request, no other task can ever get that request until TASK_CONGESTED is cleared. Each time this happens the number of available requests is decreased by one. With a sufficiently high workload and sufficiently low setting of max_slot (16 in the case where this was seen), TASK_CONGESTED can remain set for an extended period, and the above scenario (of a process being killed just as its task was woken) can repeat until no requests can be allocated. Then traffic stops. This patch addresses the problem by introducing a positive handover of a request from a completing task to a backlog task - the request is never freed when there is a backlog. When a task is woken it might not already have a request attached in which case it is *not* freed (as with current code) but is initialised (if needed) and used. If it isn't used it will eventually be freed by rpc_exit_task(). xprt_release() is enhanced to be able to correctly release an uninitialised request. Fixes: ba60eb25ff6b ("SUNRPC: Fix a livelock problem in the xprt->backlog queue") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14SUNRPC: Handle major timeout in xprt_adjust_timeout()Chris Dion1-2/+2
Currently if a major timeout value is reached, but the minor value has not been reached, an ETIMEOUT will not be sent back to the caller. This can occur if the v4 server is not responding to requests and retrans is configured larger than the default of two. For example, A TCP mount with a configured timeout value of 50 and a retransmission count of 3 to a v4 server which is not responding: 1. Initial value and increment set to 5s, maxval set to 20s, retries at 3 2. Major timeout is set to 20s, minor timeout set to 5s initially 3. xport_adjust_timeout() is called after 5s, retry with 10s timeout, minor timeout is bumped to 10s 4. And again after another 10s, 15s total time with minor timeout set to 15s 5. After 20s total time xport_adjust_timeout is called as major timeout is reached, but skipped because the minor timeout is not reached - After this time the cpu spins continually calling xport_adjust_timeout() and returning 0 for 10 seconds. As seen on perf sched: 39243.913182 [0005] mount.nfs[3794] 4607.938 0.017 9746.863 6. This continues until the 15s minor timeout condition is reached (in this case for 10 seconds). After which the ETIMEOUT is processed back to the caller, the cpu spinning stops, and normal operations continue Fixes: 7de62bc09fe6 ("SUNRPC dont update timeout value on connection reset") Signed-off-by: Chris Dion <Christopher.Dion@dell.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14SUNRPC: Remove trace_xprt_transmit_queuedChuck Lever1-2/+0
This tracepoint can crash when dereferencing snd_task because when some transports connect, they put a cookie in that field instead of a pointer to an rpc_task. BUG: KASAN: use-after-free in trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc] Read of size 2 at addr ffff8881a83bd3a0 by task git/331872 CPU: 11 PID: 331872 Comm: git Tainted: G S 5.12.0-rc2-00007-g3ab6e585a7f9 #1453 Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Call Trace: dump_stack+0x9c/0xcf print_address_description.constprop.0+0x18/0x239 kasan_report+0x174/0x1b0 trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc] xprt_prepare_transmit+0x8e/0xc1 [sunrpc] call_transmit+0x4d/0xc6 [sunrpc] Fixes: 9ce07ae5eb1d ("SUNRPC: Replace dprintk() call site in xprt_prepare_transmit") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14SUNRPC: Add tracepoint that fires when an RPC is retransmittedChuck Lever1-1/+3
A separate tracepoint can be left enabled all the time to capture rare but important retransmission events. So for example: kworker/u26:3-568 [009] 156.967933: xprt_retransmit: task:44093@5 xid=0xa25dbc79 nfsv3 WRITE ntrans=2 Or, for example, enable all nfs and nfs4 tracepoints, and set up a trigger to disable tracing when xprt_retransmit fires to capture everything that leads up to it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14SUNRPC: Move fault injection call sitesChuck Lever1-2/+4
I've hit some crashes that occur in the xprt_rdma_inject_disconnect path. It appears that, for some provides, rdma_disconnect() can take so long that the transport can disconnect and release its hardware resources while rdma_disconnect() is still running, resulting in a UAF in the provider. The transport's fault injection method may depend on the stability of transport data structures. That means it needs to be invoked only from contexts that hold the transport write lock. Fixes: 4a0682583988 ("SUNRPC: Transport fault injection") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-05SUNRPC: Set TCP_CORK until the transmit queue is emptyTrond Myklebust1-0/+2
When we have multiple RPC requests queued up, it makes sense to set the TCP_CORK option while the transmit queue is non-empty. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Remove unused function xprt_load_transport()Trond Myklebust1-15/+0
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Add a helper to return the transport identifier given a netidTrond Myklebust1-4/+21
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Close a race with transport setup and module putTrond Myklebust1-11/+33
After we've looked up the transport module, we need to ensure it can't go away until we've finished running the transport setup code. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: xprt_load_transport() needs to support the netid "rdma6"Trond Myklebust1-16/+49
According to RFC5666, the correct netid for an IPv6 addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is when we try to load the module "xprtrdma6", that will fail, since there is no modulealias of that name. Fixes: 181342c5ebe8 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-09-21SUNRPC: Mitigate cond_resched() in xprt_transmit()Chuck Lever1-2/+4
The original purpose of this expensive call is to prevent a long queue of requests from blocking other work. The cond_resched() call is unnecessary after just a single send operation. For longer queues, instead of invoking the kernel scheduler, simply release the transport send lock and return to the RPC scheduler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Replace connect dprintk call sites with a tracepointChuck Lever1-2/+1
This trace event can be used to audit transport connections from the client. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Replace dprintk() call site in xprt_prepare_transmitChuck Lever1-2/+2
Generate a trace event when an RPC request is queued without being sent immediately. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Update debugging instrumentation in xprt_do_reserve()Chuck Lever1-6/+2
Replace a dprintk() with a tracepoint. The tracepoint marks the point where an RPC request is assigned an XID. Additional clean up: Remove trace_xprt_enq_xmit, which reports much the same thing. That tracepoint was added for debugging commit 918f3c1fe83c ("SUNRPC: Improve latency for interactive tasks"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Remove debugging instrumentation from xprt_releaseChuck Lever1-1/+0
These instruments don't appear to add any substantial value. We already have this at the termination of each RPC: iozone-2617 [002] 975.713126: rpc_stats_latency: task:418@5 xid=0x260eab5d nfsv3 LOOKUP backlog=15 rtt=32 execute=58 iozone-2617 [002] 975.713127: xprt_release_cong: task:418@5 snd_task:4294967295 cong=256 cwnd=16384 iozone-2617 [002] 975.713127: xprt_put_cong: task:418@5 snd_task:4294967295 cong=0 cwnd=16384 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Remove trace_xprt_complete_rqst()Chuck Lever1-2/+0
Request completion is already recorded by an "rpc_task_wakeup queue=xprt_pending" trace record. A subsequent rpc_xdr_recvfrom trace record shows the number of bytes received. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-08-24treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-05SUNRPC dont update timeout value on connection resetOlga Kornievskaia1-0/+9
Current behaviour: every time a v3 operation is re-sent to the server we update (double) the timeout. There is no distinction between whether or not the previous timer had expired before the re-sent happened. Here's the scenario: 1. Client sends a v3 operation 2. Server RST-s the connection (prior to the timeout) (eg., connection is immediately reset) 3. Client re-sends a v3 operation but the timeout is now 120sec. As a result, an application sees 2mins pause before a retry in case server again does not reply. Instead, this patch proposes to keep track off when the minor timeout should happen and if it didn't, then don't update the new timeout. Value is updated based on the previous value to make timeouts predictable. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>