From: Chuck Lever
Date: Thu, 4 Jun 2026 17:06:36 +0000 (-0400)
Subject: xprtrdma: Resize reply buffers before reposting receives
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=234c0ff695ef3ffb656931000e6b823d0c2f30fd;p=thirdparty%2Flinux.git

xprtrdma: Resize reply buffers before reposting receives

Commit 0e13dd9ea8be ("xprtrdma: Remove temp allocation of rpcrdma_rep
objects") made rpcrdma_rep objects survive disconnects. That is
normally fine, but it also means their receive regbufs keep the size
they had when they were first allocated.

Each rep's receive buffer is sized to ep->re_inline_recv when the rep
is created. rpcrdma_ep_create() resets that threshold to the
rdma_max_inline_read ceiling for every new endpoint, and the connect
handshake then shrinks it to the peer's advertised inline send size.
A rep allocated under a smaller negotiated threshold keeps that size:
on disconnect, rpcrdma_xprt_disconnect() drains and DMA-unmaps the
surviving reps but does not free or resize them.

The threshold can come back larger on the next connection. The first
peer may supply no RPC-over-RDMA CM private data, defaulting its send
size to 1024, while the reconnect target is an ordinary server
offering 4096; or, with rdma_max_inline_read raised above its default,
the reconnect target may advertise a larger svcrdma_max_req_size than
the first. rpcrdma_post_recvs() then reposts a surviving rep whose SGE
length is still the old, smaller value, and a larger inline Reply hits
a receive length error and forces another disconnect.

The undersized rep returns to the free list when its failed Receive
flushes, so the following reconnect reposts the same rep and fails the
same way. The transport flaps without making forward progress for as
long as the peer keeps advertising the larger inline size.

This is local/admin-triggerable rather than remote-triggerable: a
local administrator must create and maintain the NFS/RDMA mount, while
the server or reconnect target has to advertise a larger inline send
size and return a reply that uses it.

Fix this by checking each rep before it is reposted. If the receive
regbuf is smaller than the current endpoint's inline receive size,
reallocate it on the current RDMA device's NUMA node and reinitialize
the rep's xdr_buf before DMA-mapping and posting the Receive WR.

Fixes: 0e13dd9ea8be ("xprtrdma: Remove temp allocation of rpcrdma_rep objects")
Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker
---

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 74d173e5681db..8392ba4bcdcae 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -81,6 +81,8 @@ rpcrdma_regbuf_alloc_node(size_t size, enum dma_data_direction direction,
 			  int node);
 static struct rpcrdma_regbuf *
 rpcrdma_regbuf_alloc(size_t size, enum dma_data_direction direction);
+static bool rpcrdma_regbuf_realloc_node(struct rpcrdma_regbuf *rb,
+					size_t size, gfp_t flags, int node);
 static void rpcrdma_regbuf_dma_unmap(struct rpcrdma_regbuf *rb);
 static void rpcrdma_regbuf_free(struct rpcrdma_regbuf *rb);
 
@@ -1353,10 +1355,16 @@ rpcrdma_regbuf_alloc(size_t size, enum dma_data_direction direction)
  * returned, @rb is left untouched.
  */
 bool rpcrdma_regbuf_realloc(struct rpcrdma_regbuf *rb, size_t size, gfp_t flags)
+{
+	return rpcrdma_regbuf_realloc_node(rb, size, flags, NUMA_NO_NODE);
+}
+
+static bool rpcrdma_regbuf_realloc_node(struct rpcrdma_regbuf *rb,
+					size_t size, gfp_t flags, int node)
 {
 	void *buf;
 
-	buf = kmalloc(size, flags);
+	buf = kmalloc_node(size, flags, node);
 	if (!buf)
 		return false;
 
@@ -1368,6 +1376,23 @@ bool rpcrdma_regbuf_realloc(struct rpcrdma_regbuf *rb, size_t size, gfp_t flags)
 	return true;
 }
 
+static bool rpcrdma_rep_resize(struct rpcrdma_xprt *r_xprt,
+			       struct rpcrdma_rep *rep)
+{
+	struct rpcrdma_regbuf *rb = rep->rr_rdmabuf;
+	struct rpcrdma_ep *ep = r_xprt->rx_ep;
+	size_t size = ep->re_inline_recv;
+
+	if (likely(rdmab_length(rb) >= size))
+		return true;
+	if (!rpcrdma_regbuf_realloc_node(rb, size, XPRTRDMA_GFP_FLAGS,
+					 ibdev_to_node(ep->re_id->device)))
+		return false;
+
+	xdr_buf_init(&rep->rr_hdrbuf, rdmab_data(rb), rdmab_length(rb));
+	return true;
+}
+
 /**
  * __rpcrdma_regbuf_dma_map - DMA-map a regbuf
  * @r_xprt: controlling transport instance
@@ -1451,6 +1476,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed)
 			break;
 		/* I1: a rep on rb_free_reps must carry no rqst pointer. */
 		WARN_ON_ONCE(rep->rr_rqst);
+		if (!rpcrdma_rep_resize(r_xprt, rep)) {
+			rpcrdma_rep_put(buf, rep);
+			break;
+		}
 		if (!rpcrdma_regbuf_dma_map(r_xprt, rep->rr_rdmabuf)) {
 			rpcrdma_rep_put(buf, rep);
 			break;
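
The resize-before-repost check described in the commit message can be modeled outside the kernel. The following is a minimal userspace sketch, not the kernel code: every model_* name is hypothetical, malloc() stands in for kmalloc_node(), and DMA mapping, NUMA placement, and xdr_buf setup are left out. It only illustrates that a rep surviving a disconnect is regrown when the newly negotiated inline threshold is larger, and left alone otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rpcrdma_regbuf: a buffer and its length. */
struct model_regbuf {
	void *data;
	size_t len;
};

/* Hypothetical stand-in for struct rpcrdma_rep: owns one receive buffer. */
struct model_rep {
	struct model_regbuf rb;
};

/* Swap in a buffer of @size bytes; on failure @rb is left untouched. */
bool model_regbuf_realloc(struct model_regbuf *rb, size_t size)
{
	void *buf = malloc(size);

	if (!buf)
		return false;
	free(rb->data);
	rb->data = buf;
	rb->len = size;
	return true;
}

/* Grow the rep's buffer only when it is smaller than @inline_recv. */
bool model_rep_resize(struct model_rep *rep, size_t inline_recv)
{
	assert(rep != NULL);
	if (rep->rb.len >= inline_recv)
		return true;
	return model_regbuf_realloc(&rep->rb, inline_recv);
}

/*
 * Size a rep for the first connection, keep it across a simulated
 * disconnect, then run the check again with the second connection's
 * threshold. Returns the final buffer length, or 0 on allocation failure.
 */
size_t model_reconnect(size_t first_inline, size_t second_inline)
{
	struct model_rep rep = { { NULL, 0 } };
	size_t len = 0;

	if (model_rep_resize(&rep, first_inline) &&
	    model_rep_resize(&rep, second_inline))
		len = rep.rb.len;
	free(rep.rb.data);
	return len;
}
```

With these assumptions, model_reconnect(1024, 4096) returns 4096 because the surviving buffer is regrown before reuse, and model_reconnect(4096, 1024) also returns 4096 because a buffer that is already large enough is not touched.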