From: Chuck Lever Date: Fri, 13 Mar 2026 19:41:58 +0000 (-0400) Subject: RDMA/rw: Fall back to direct SGE on MR pool exhaustion X-Git-Tag: v7.0-rc6~28^2~10 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=00da250c21b074ea9494c375d0117b69e5b1d0a4;p=thirdparty%2Fkernel%2Flinux.git RDMA/rw: Fall back to direct SGE on MR pool exhaustion When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs() produces no coalescing: each scatterlist page maps 1:1 to a DMA entry, so sgt.nents equals the raw page count. A 1 MB transfer yields 256 DMA entries. If that count exceeds the device's max_sgl_rd threshold (an optimization hint from mlx5 firmware), rdma_rw_io_needs_mr() steers the operation into the MR registration path. Each such operation consumes one or more MRs from a pool sized at max_rdma_ctxs -- roughly one MR per concurrent context. Under write-intensive workloads that issue many concurrent RDMA READs, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. Upper layer protocols treat this as a fatal DMA mapping failure and tear down the connection. The max_sgl_rd check is a performance optimization, not a correctness requirement: the device can handle large SGE counts via direct posting, just less efficiently than with MR registration. When the MR pool cannot satisfy a request, falling back to the direct SGE (map_wrs) path avoids the connection reset while preserving the MR optimization for the common case where pool resources are available. Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from rdma_rw_init_mr_wrs() triggers direct SGE posting instead of propagating the error. iWARP devices, which mandate MR registration for RDMA READs, and force_mr debug mode continue to treat -EAGAIN as terminal. Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org Signed-off-by: Leon Romanovsky --- diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index fc45c384833f..c01d5e605053 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -608,14 +608,29 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num, if (rdma_rw_io_needs_mr(qp->device, port_num, dir, sg_cnt)) { ret = rdma_rw_init_mr_wrs(ctx, qp, port_num, sg, sg_cnt, sg_offset, remote_addr, rkey, dir); - } else if (sg_cnt > 1) { + /* + * If MR init succeeded or failed for a reason other + * than pool exhaustion, that result is final. + * + * Pool exhaustion (-EAGAIN) from the max_sgl_rd + * optimization is recoverable: fall back to + * direct SGE posting. iWARP and force_mr require + * MRs unconditionally, so -EAGAIN is terminal. + */ + if (ret != -EAGAIN || + rdma_protocol_iwarp(qp->device, port_num) || + unlikely(rdma_rw_force_mr)) + goto out; + } + + if (sg_cnt > 1) ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset, remote_addr, rkey, dir); - } else { + else ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset, remote_addr, rkey, dir); - } +out: if (ret < 0) goto out_unmap_sg; return ret;