summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma/transport.c
AgeCommit message (Collapse)AuthorFilesLines
2024-05-03net: sunrpc: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados1-1/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-11sunrpc: simplify one-level sysctl registration for xr_tunables_tableLuis Chamberlain1-10/+1
There is no need to declare an extra tables to just create directory, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05SUNRPC: Replace the use of the xprtiod WQ in rpcrdmaChuck Lever1-2/+1
While setting up a new lab, I accidentally misconfigured the Ethernet port for a system that tried an NFS mount using RoCE. This made the NFS server unreachable. The following WARNING popped on the NFS client while waiting for the mount attempt to time out: kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI> kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma> kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13 kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017 kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] kernel: RIP: 0010:check_flush_dependency+0xbf/0xca kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be> kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092 kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027 kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731 kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0 kernel: FS: 0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 kernel: Call Trace: kernel: <TASK> kernel: __flush_work.isra.0+0xaf/0x188 kernel: ? _raw_spin_lock_irqsave+0x2c/0x37 kernel: ? lock_timer_base+0x38/0x5f kernel: __cancel_work_timer+0xea/0x13d kernel: ? preempt_latency_start+0x2b/0x46 kernel: rdma_addr_cancel+0x70/0x81 [ib_core] kernel: _destroy_id+0x1a/0x246 [rdma_cm] kernel: rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma] kernel: ? _raw_spin_unlock+0x14/0x29 kernel: ? raw_spin_rq_unlock_irq+0x5/0x10 kernel: ? finish_task_switch.isra.0+0x171/0x249 kernel: xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma] kernel: process_one_work+0x1d8/0x2d4 kernel: worker_thread+0x18b/0x24f kernel: ? rescuer_thread+0x280/0x280 kernel: kthread+0xf4/0xfc kernel: ? kthread_complete_and_exit+0x1b/0x1b kernel: ret_from_fork+0x22/0x30 kernel: </TASK> SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that one of its work items tries to cancel has to be WQ_MEM_RECLAIM to prevent a priority inversion. The internal workqueues in the RDMA/core are currently non-MEM_RECLAIM. Jason Gunthorpe says this about the current state of RDMA/core: > If you attempt to do a reconnection/etc from within a RECLAIM > context it will deadlock on one of the many allocations that are > made to support opening the connection. > > The general idea of reclaim is that the entire task context > working under the reclaim is marked with an override of the gfp > flags to make all allocations under that call chain reclaim safe. > > But rdmacm does allocations outside this, eg in the WQs processing > the CM packets. So this doesn't work and we will deadlock. > > Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM > here and there. So we will change the ULP in this case to avoid the use of WQ_MEM_RECLAIM where possible. Deadlocks that were possible before are not fixed, but at least we no longer have a false sense of confidence that the stack won't allocate memory during memory reclaim. Suggested-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-07-11SUNRPC: Fix an RPC/RDMA performance regressionTrond Myklebust1-5/+1
Use the standard gfp mask instead of using GFP_NOWAIT. The latter causes issues when under memory pressure. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOCNeilBrown1-2/+4
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC to be used for some allocations. This is needed in some cases, but not in all where it is currently provided, and in some where it isn't provided. Currently *all* tasks associated with a rpc_client on which swap is enabled get the flag and hence some GFP_MEMALLOC support. GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it. However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does need it. xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks, then it blocks all other queued tasks. So this allocation needs GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used for any swap writes. Similarly, if the transport is not connected, that will block all requests including swap writes, so memory allocations should get GFP_MEMALLOC if swap writes are possible. So with this patch: 1/ we ONLY set RPC_TASK_SWAPPER for swap writes. 2/ __rpc_execute() sets PF_MEMALLOC while handling any task with RPC_TASK_SWAPPER set, or when handling any task that holds the XPRT_LOCKED lock on an xprt used for swap. This removes the need for the RPC_IS_SWAPPER() test in ->buf_alloc handlers. 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking any task to a swapper xprt. __rpc_execute() will clear it. 3/ PF_MEMALLOC is set for all the connect workers. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts) Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/xprt: async tasks mustn't block waiting for memoryNeilBrown1-1/+1
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. xprt_dynamic_alloc_slot can block indefinitely. This can tie up all workqueue threads and NFS can deadlock. So when called from a workqueue, set __GFP_NORETRY. The rdma alloc_slot already does not block. However it sets the error to -EAGAIN suggesting this will trigger a sleep. It does not. As we can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN causes immediate retry. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/call_alloc: async tasks mustn't block waiting for memoryNeilBrown1-1/+3
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. rpc_malloc() can block, and this might cause deadlocks. So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if blocking is acceptable. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-01-14xprtrdma: Remove definitions of RPCDBG_FACILITYChuck Lever1-4/+0
Deprecated. dprintk is no longer used in xprtrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09SUNRPC/xprtrdma: Fix reconnection lockingTrond Myklebust1-6/+5
The xprtrdma client code currently relies on the task that initiated the connect to hold the XPRT_LOCK for the duration of the connection attempt. If the task is woken early, due to some other event, then that lock could get released early. Avoid races by using the same mechanism that the socket code uses of transferring lock ownership to the RDMA connect worker itself. That frees us to call rpcrdma_xprt_disconnect() directly since we're now guaranteed exclusion w.r.t. other callers. Fixes: 4cf44be6f1e8 ("xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-08-09xprtrdma: Eliminate rpcrdma_post_sends()Chuck Lever1-1/+1
Clean up. Now that there is only one registration mode, there is only one target "post_send" method: frwr_send(). rpcrdma_post_sends() no longer adds much value, especially since all of its call sites ignore the return code value except to check if it's non-zero. Just have them call frwr_send() directly instead. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-07-08sunrpc: keep track of the xprt_class in rpc_xprt structureOlga Kornievskaia1-0/+2
We need to keep track of the type for a given transport. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26SUNRPC: More fixes for backlog congestionTrond Myklebust1-6/+6
Ensure that we fix the XPRT_CONGESTED starvation issue for RDMA as well as socket based transports. Ensure we always initialise the request after waking up from the backlog list. Fixes: e877a88d1f06 ("SUNRPC in case of backlog, hand free slots directly to waiting task") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14SUNRPC: Move fault injection call sitesChuck Lever1-2/+4
I've hit some crashes that occur in the xprt_rdma_inject_disconnect path. It appears that, for some provides, rdma_disconnect() can take so long that the transport can disconnect and release its hardware resources while rdma_disconnect() is still running, resulting in a UAF in the provider. The transport's fault injection method may depend on the stability of transport data structures. That means it needs to be invoked only from contexts that hold the transport write lock. Fixes: 4a0682583988 ("SUNRPC: Transport fault injection") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-16Merge tag 'nfs-rdma-for-5.11-1' of ↵Trond Myklebust1-3/+4
git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next NFSoRDmA Client updates for Linux 5.11 Cleanups and improvements: - Remove use of raw kernel memory addresses in tracepoints - Replace dprintk() call sites in ERR_CHUNK path - Trace unmap sync calls - Optimize MR DMA-unmapping Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: xprt_load_transport() needs to support the netid "rdma6"Trond Myklebust1-0/+1
According to RFC5666, the correct netid for an IPv6 addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is when we try to load the module "xprtrdma6", that will fail, since there is no modulealias of that name. Fixes: 181342c5ebe8 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-11-11xprtrdma: Trace unmap_sync callsChuck Lever1-3/+4
->buf_free is called nearly once per RPC. Only rarely does xprt_rdma_free() have to do anything, thus tracing every one of these calls seems unnecessary. Instead, just throw a trace event when that one occasional RPC still has MRs that need to be released. xprt_rdma_free() is further micro-optimized to reduce the amount of work done in the common case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Hoist trace_xprtrdma_op_setport into generic codeChuck Lever1-3/+0
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Remove debugging instrumentation from xprt_releaseChuck Lever1-2/+0
These instruments don't appear to add any substantial value. We already have this at the termination of each RPC: iozone-2617 [002] 975.713126: rpc_stats_latency: task:418@5 xid=0x260eab5d nfsv3 LOOKUP backlog=15 rtt=32 execute=58 iozone-2617 [002] 975.713127: xprt_release_cong: task:418@5 snd_task:4294967295 cong=256 cwnd=16384 iozone-2617 [002] 975.713127: xprt_put_cong: task:418@5 snd_task:4294967295 cong=0 cwnd=16384 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-09-21SUNRPC: Hoist trace_xprtrdma_op_allocate into generic codeChuck Lever1-2/+0
Introduce a tracepoint in call_allocate that reports the exact sizes in the RPC buffer allocation request and the status of the result. This helps catch problems with XDR buffer provisioning, and replaces transport-specific debugging instrumentation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-07-13xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()Chuck Lever1-0/+5
Both Dan and I have observed two processes invoking rpcrdma_xprt_disconnect() concurrently. In my case: 1. The connect worker invokes rpcrdma_xprt_disconnect(), which drains the QP and waits for the final completion 2. This causes the newly posted Receive to flush and invoke xprt_force_disconnect() 3. xprt_force_disconnect() sets CLOSE_WAIT and wakes up the RPC task that is holding the transport lock 4. The RPC task invokes xprt_connect(), which calls ->ops->close 5. xprt_rdma_close() invokes rpcrdma_xprt_disconnect(), which tries to destroy the QP. Deadlock. To prevent xprt_force_disconnect() from waking anything, handle the clean up after a failed connection attempt in the xprt's sndtask. The retry loop is removed from rpcrdma_xprt_connect() to ensure that the newly allocated ep and id are properly released before a REJECTED connection attempt can be retried. Reported-by: Dan Aloni <dan@kernelim.com> Fixes: e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-22xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()Chuck Lever1-1/+1
Clean up: Sometimes creating a fresh rpcrdma_ep can fail. That's why xprt_rdma_connect() always checks if the r_xprt->rx_ep pointer is valid before dereferencing it. Instead, xprt_rdma_connect() can simply check rpcrdma_xprt_connect()'s return value. Also, there's no need to set re_connect_status to zero just after the rpcrdma_ep is created, since it is allocated with kzalloc. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11xprtrdma: Make xprt_rdma_slot_table_entries staticZou Wei1-1/+1
Fix the following sparse warning: net/sunrpc/xprtrdma/transport.c:71:14: warning: symbol 'xprt_rdma_slot_table_entries' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11SUNRPC: Trace transport lifetime eventsChuck Lever1-8/+0
Refactor: Hoist create/destroy/disconnect tracepoints out of xprtrdma and into the generic RPC client. Some benefits include: - Enable tracing of xprt lifetime events for the socket transport types - Expose the different types of disconnect to help run down issues with lingering connections Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprtChuck Lever1-20/+17
Change the rpcrdma_xprt_disconnect() function so that it no longer waits for the DISCONNECTED event. This prevents blocking if the remote is unresponsive. In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is detached. Upon return from rpcrdma_xprt_disconnect(), the transport (r_xprt) is ready immediately for a new connection. The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now handled almost identically. However, because the lifetimes of rpcrdma_xprt structures and rpcrdma_ep structures are now independent, creating an rpcrdma_ep needs to take a module ref count. The ep now owns most of the hardware resources for a transport. Also, a kref is needed to ensure that rpcrdma_ep sticks around long enough for the cm_event_handler to finish. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_epChuck Lever1-3/+6
I eventually want to allocate rpcrdma_ep separately from struct rpcrdma_xprt so that on occasion there can be more than one ep per xprt. The new struct rpcrdma_ep will contain all the fields currently in rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings for the connection, in addition to per-connection settings negotiated with the remote. Take this opportunity to rename the existing ep fields from rep_* to re_* to disambiguate these from struct rpcrdma_rep. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Remove rpcrdma_ia::ri_flagsChuck Lever1-15/+0
Clean up: The upper layer serializes calls to xprt_rdma_close, so there is no need for an atomic bit operation, saving 8 bytes in rpcrdma_ia. This enables merging rpcrdma_ia_remove directly into the disconnect logic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Invoke rpcrdma_ia_open in the connect workerChuck Lever1-7/+0
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now responsible for allocating all per-connection hardware resources. With this clean-up, all three arms of the switch statement in rpcrdma_ep_connect are exactly the same now, thus the switch can be removed. Because device removal behaves a little differently than disconnection, there is a little more work to be done before rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So it is not quite symmetrical with rpcrdma_ep_create() yet. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()Chuck Lever1-3/+3
Clean up: Simplify the synopses of functions in the connect and disconnect paths in preparation for combining the rpcrdma_ia and struct rpcrdma_ep structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Clean up the post_send pathChuck Lever1-1/+1
Clean up: Simplify the synopses of functions in the post_send path by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Invoke rpcrdma_ep_create() in the connect workerChuck Lever1-8/+2
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and rpcrdma_ep_destroy(). rpcrdma_ep_create will be invoked at connect time instead of at transport set-up time. It will be responsible for allocating per- connection resources. In this patch it allocates the CQs and creates a QP. More to come. rpcrdma_ep_destroy() is the inverse functionality that is invoked at disconnect time. It will be responsible for releasing the CQs and QP. These changes should be safe to do because both connect and disconnect is guaranteed to be serialized by the transport send lock. This takes us another step closer to resolving the address and route only at connect time so that connection failover to another device will work correctly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Eliminate per-transport "max pages"Chuck Lever1-10/+4
To support device hotplug and migrating a connection between devices of different capabilities, we have to guarantee that all in-kernel devices can support the same max NFS payload size (1 megabyte). This means that possibly one or two in-tree devices are no longer supported for NFS/RDMA because they cannot support 1MB rsize/wsize. The only one I confirmed was cxgb3, but it has already been removed from the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Refactor initialization of ep->rep_max_requestsChuck Lever1-1/+2
Clean up: there is no need to keep two copies of the same value. Also, in subsequent patches, rpcrdma_ep_create() will be called in the connect worker rather than at set-up time. Minor fix: Initialize the transport's sendctx to the value based on the capabilities of the underlying device, not the maximum setting. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Replace dprintk in xprt_rdma_set_portChuck Lever1-6/+3
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Report the computed connect delayChuck Lever1-2/+1
For debugging, the op_connect trace point should report the computed connect delay. We can then ensure that the delay is computed at the proper times, for example. As a further clean-up, remove a few low-value "heartbeat" trace points in the connect path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Wake tasks after connect worker failsChuck Lever1-9/+6
Pending tasks are currently never awoken when the connect worker fails. The reason is that XPRT_CONNECTED is always clear after a failure return of rpcrdma_ep_connect, thus the xprt_test_and_clear_connected() check in xprt_rdma_connect_worker() always fails. - xprt_rdma_close always clears XPRT_CONNECTED. - rpcrdma_ep_connect always clears XPRT_CONNECTED. After reviewing the TCP connect worker, it appears that there's no need for extra test_and_set paranoia in xprt_rdma_connect_worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Initialize rb_credits in one placeChuck Lever1-9/+0
Clean up/code de-duplication. Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd. This mistake does not appear to have operational consequences, since the cwnd value is replaced with a valid value upon the first Receive completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Connection becomes unstable after a reconnectChuck Lever1-0/+3
This is because xprt_request_get_cong() is allowing more than one RPC Call to be transmitted before the first Receive on the new connection. The first Receive fills the Receive Queue based on the server's credit grant. Before that Receive, there is only a single Receive WR posted because the client doesn't know the server's credit grant. Solution is to clear rq_cong on all outstanding rpc_rqsts when the the cwnd is reset. This is because an RPC/RDMA credit is good for one connection instance only. Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26xprtrdma: Clear xprt->reestablish_timeout on closeChuck Lever1-2/+1
Ensure that the re-establishment delay does not grow exponentially on each good reconnect. This probably should have been part of commit 675dd90ad093 ("xprtrdma: Modernize ops->connect"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Clean up xprt_rdma_set_connect_timeout()Chuck Lever1-4/+4
Clean up: The function name should match the documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Toggle XPRT_CONGESTED in xprtrdma's slot methodsChuck Lever1-1/+3
Commit 48be539dd44a ("xprtrdma: Introduce ->alloc_slot call-out for xprtrdma") added a separate alloc_slot and free_slot to the RPC/RDMA transport. Later, commit 75891f502f5f ("SUNRPC: Support for congestion control when queuing is enabled") modified the generic alloc/free_slot methods, but neglected the methods in xprtrdma. Found via code review. Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-2/+1
Merge yet more updates from Andrew Morton: "The rest of MM and a kernel-wide procfs cleanup. Summary of the more significant patches: - Patch series "mm/memory_hotplug: Factor out memory block devicehandling", v3. David Hildenbrand. Some spring-cleaning of the memory hotplug code, notably in drivers/base/memory.c - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang Shi. Fix /proc/pid/smaps output for THP pages used in shmem. - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit. Bugfix and speedup for kernel/resource.c - Patch series "mm: Further memory block device cleanups", David Hildenbrand. More spring-cleaning of the memory hotplug code. - Patch series "mm: Sub-section memory hotplug support". Dan Williams. Generalise the memory hotplug code so that pmem can use it more completely. Then remove the hacks from the libnvdimm code which were there to work around the memory-hotplug code's constraints. - "proc/sysctl: add shared variables for range check", Matteo Croce. We have about 250 instances of int zero; ... .extra1 = &zero, in the tree. This is a tree-wide sweep to make all those private "zero"s and "one"s use global variables. Alas, it isn't practical to make those two global integers const" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits) proc/sysctl: add shared variables for range check mm: migrate: remove unused mode argument mm/sparsemem: cleanup 'section number' data types libnvdimm/pfn: stop padding pmem namespaces to section alignment libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields mm/devm_memremap_pages: enable sub-section remap mm: document ZONE_DEVICE memory-model implications mm/sparsemem: support sub-section hotplug mm/sparsemem: prepare for sub-section ranges mm: kill is_dev_zone() helper mm/hotplug: kill is_dev_zone() usage in __remove_pages() mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal mm/sparsemem: add helpers track active portions of a section at boot mm/sparsemem: introduce a SECTION_IS_EARLY flag mm/sparsemem: introduce struct mem_section_usage drivers/base/memory.c: get rid of find_memory_block_hinted() mm/memory_hotplug: move and simplify walk_memory_blocks() mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns mm: make register_mem_sect_under_node() static ...
2019-07-19proc/sysctl: add shared variables for range checkMatteo Croce1-2/+1
In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18SUNRPC: Fix up backchannel slot table accountingTrond Myklebust1-0/+1
Add a per-transport maximum limit in the socket case, and add helpers to allow the NFSv4 code to discover that limit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-09xprtrdma: Modernize ops->connectChuck Lever1-15/+51
Adapt and apply changes that were made to the TCP socket connect code. See the following commits for details on the purpose of these changes: Commit 7196dbb02ea0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly") Commit 3851f1cdb2b8 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout") Commit 02910177aede ("SUNRPC: Fix reconnection timeouts") Some common transport code is moved to xprt.c to satisfy the code duplication police. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Remove rpcrdma_req::rl_bufferChuck Lever1-1/+4
Clean up. There is only one remaining function, rpcrdma_buffer_put(), that uses this field. Its caller can supply a pointer to the correct rpcrdma_buffer, enabling the removal of an 8-byte pointer field from a frequently-allocated shared data structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Wake RPCs directly in rpcrdma_wc_send pathChuck Lever1-1/+9
Eliminate a context switch in the path that handles RPC wake-ups when a Receive completion has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flagChuck Lever1-3/+1
Commit 9590d083c1bb ("xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they can be left in the pending requests queue while they are being processed without introducing a race between ->buf_free and the transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25xprtrdma: Eliminate struct rpcrdma_create_data_internalChuck Lever1-23/+4
Clean up. Move the remaining field in rpcrdma_create_data_internal so the structure can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25xprtrdma: Aggregate the inline settings in struct rpcrdma_epChuck Lever1-3/+1
Clean up. The inline settings are actually a characteristic of the endpoint, and not related to the device. They are also modified after the transport instance is created, so they do not belong in the cdata structure either. Lastly, let's use names that are more natural to RDMA than to NFS: inline_write -> inline_send and inline_read -> inline_recv. The /proc files retain their names to avoid breaking user space. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25xprtrdma: Remove rpcrdma_create_data_internal::rsize and wsizeChuck Lever1-9/+0
Clean up. xprt_rdma_max_inline_{read,write} cannot be set to large values by virtue of proc_dointvec_minmax. The current maximum is RPCRDMA_MAX_INLINE, which is much smaller than RPCRDMA_MAX_SEGS * PAGE_SIZE. The .rsize and .wsize fields are otherwise unused in the current code base, and thus can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>