From: Chuck Lever Date: Tue, 26 May 2026 14:14:05 +0000 (-0400) Subject: xprtrdma: Document and assert reply-handler invariants X-Git-Tag: v7.2-rc1~46^2~25 X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=797943e8bd1ffcc63bfe79d24faad9a77054ec40;p=thirdparty%2Fkernel%2Flinux.git xprtrdma: Document and assert reply-handler invariants The xprtrdma reply path has been the subject of recurring LLM-driven review claims that 'an RPC can complete while receive buffers are still DMA-mapped' or that 'the req can be freed while the HCA still owns the send buffer.' No runtime reproducer has surfaced, but the absence of a written-down invariant set lets each pass of automated review reach the same hypothetical conclusion. Subsequent fixes against ce2f9a4d9ccc ('xprtrdma: Decouple req recycling from RPC completion') closed the underlying races but did not document the closure where future readers will look for it. State the invariants explicitly in a comment above rpcrdma_reply_handler() and back four of them with WARN_ON_ONCE() probes positioned where each invariant is locally checkable on the previous patch's cleaned-up ownership state: - I1 (Receive WR ownership): WARN at rpcrdma_post_recvs() that a rep pulled from rb_free_reps carries rr_rqst == NULL. - I2 (rep attachment): WARN at rpcrdma_reply_put() that req->rl_reply was NULLed before the matching rep_put. - I3 (Registered-MR fence): WARN at rpcrdma_complete_rqst() that req->rl_registered is empty. Strong send-queue ordering of the LocalInv WR chain makes the last completion observe the ib_dma_unmap_sg() of every earlier MR, so 'list empty' implies 'all MRs unmapped'. - I4 (Send-buffer release): WARN at rpcrdma_req_release() that req->rl_sendctx is NULL. Reaching the kref release callback requires both the RPC-layer and Send-side references to have dropped; the Send-side drop runs in rpcrdma_sendctx_unmap(), which clears rl_sendctx (previous patch). A non-NULL rl_sendctx here would mean the Send-side owner had not run -- a contradiction. The XXX comment in xprt_rdma_free() about signal-driven release racing the Send completion described the pre-decouple state. Replace it with a one-line note pointing at the invariant set, since the kref scheme now holds the req across the in-flight Send regardless of which path released the rpc_task. I5 (req lifecycle) is stated in the comment but not probed: making it locally assertible would require moving kref_init out of rpcrdma_req_release(), which in turn requires adding kref_init to the bc_pa_list and backlog-wake reuse paths. That restructuring is deferred -- the invariant is unchanged either way. Signed-off-by: Chuck Lever Signed-off-by: Anna Schumaker --- diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c index f4b4abefc4e05..626cadec4555d 100644 --- a/net/sunrpc/xprtrdma/rpc_rdma.c +++ b/net/sunrpc/xprtrdma/rpc_rdma.c @@ -1336,6 +1336,11 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep) struct rpc_rqst *rqst = rep->rr_rqst; int status; + /* I3: every registered MR has been invalidated and + * ib_dma_unmap_sg()'d before complete_rqst runs. + */ + WARN_ON_ONCE(!list_empty(&rpcr_to_rdmar(rqst)->rl_registered)); + switch (rep->rr_proc) { case rdma_msg: status = rpcrdma_decode_msg(r_xprt, rep, rqst); @@ -1367,6 +1372,70 @@ out_badheader: goto out; } +/* Reply-side ownership invariants + * + * I1 (Receive WR ownership). A struct rpcrdma_rep is owned by the + * HCA between ib_post_recv() and the matching Receive completion. + * After ib_dma_sync_single_for_cpu() in rpcrdma_wc_receive() it is + * owned by the CPU until rpcrdma_rep_put() returns it to + * rb_free_reps; a rep on rb_free_reps is not re-posted until + * rpcrdma_post_recvs() pulls it off. Asserted: rpcrdma_post_recvs() + * WARNs that a pulled rep has rr_rqst == NULL. + * + * I2 (rep attachment). While req->rl_reply == rep, the rep cannot be + * re-posted. rpcrdma_reply_put() NULLs req->rl_reply before handing + * the rep to rpcrdma_rep_put(). Asserted: rpcrdma_reply_put() WARNs + * that rl_reply is NULL after the put. + * + * I3 (Registered-MR fence). On entry to rpcrdma_complete_rqst() every + * MR that was on req->rl_registered has had its rkey invalidated + * (remotely via IB_WC_WITH_INVALIDATE or locally via IB_WR_LOCAL_INV) + * and its pages ib_dma_unmap_sg()'d. The LocalInv chain is posted + * on a single QP; strong send-queue ordering makes the last + * completion (frwr_wc_localinv_done) observe the + * ib_dma_unmap_sg() that ran from each earlier completion's + * frwr_mr_put() before complete_rqst is called. The inline + * frwr_reminv() path unmaps its one MR synchronously before + * rpcrdma_reply_handler() reaches complete_rqst. Asserted: + * rpcrdma_complete_rqst() WARNs that rl_registered is empty. + * + * I4 (Send-buffer release). req->rl_kref carries two unconditional + * owners while a Send is outstanding: the RPC-layer reference (set + * at xprt_rdma_alloc_slot / xprt_rdma_bc_rqst_get / rpcrdma_req_release + * pool-entry) and the Send-side reference (kref_get() in + * rpcrdma_prepare_send_sges()). rpcrdma_req_release() runs only + * after both have dropped, so the req does not return to its free + * pool until rpcrdma_sendctx_unmap() has fired -- the HCA has + * released the send buffer before the req can be reused. Asserted: + * rpcrdma_req_release() WARNs that rl_sendctx is NULL. + * + * I5 (req lifecycle). A req is owned by the RPC layer between slot + * acquisition and the matching xprt_rdma_free_slot() (or, for the + * backchannel, xprt_rdma_bc_free_rqst()). While owned, rl_kref >= 1. + * The pools (rb_send_bufs, bc_pa_list, backlog wake target) never + * contain a req with outstanding Send-side or Reply-side work. + * + * Non-hazards. The following claims have been raised by adversarial + * review and are each closed by the invariants above: + * + * * "Reply completes the RPC while the HCA still holds the send + * buffer" -- excluded by I4. The Send-side kref reference is held + * until rpcrdma_sendctx_unmap() runs from Send completion. + * + * * "Signal-driven release races the in-flight Send" -- same + * resolution. xprt_rdma_free() does not touch rl_kref; the + * Send-side reference keeps the req out of its pool until Send + * completion fires. + * + * * "Receive completion races rep reuse" -- excluded by I1. A rep + * is on rb_free_reps only after rpcrdma_rep_put() has been called + * and rpcrdma_post_recvs() owns the next transition back to the HCA. + * + * * "Pages still DMA-mapped when call_decode reads them" -- excluded + * by I3. The matching ib_dma_unmap_sg() for every MR has run on + * the same CPU thread that calls rpcrdma_complete_rqst(). + */ + /** * rpcrdma_reply_handler - Process received RPC/RDMA messages * @rep: Incoming rpcrdma_rep object to process diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c index 875ad852a11da..d4e6746d8ecd3 100644 --- a/net/sunrpc/xprtrdma/transport.c +++ b/net/sunrpc/xprtrdma/transport.c @@ -510,6 +510,11 @@ static void rpcrdma_req_release(struct kref *kref) struct rpc_xprt *xprt = rqst->rq_xprt; struct rpcrdma_xprt *r_xprt; + /* I4: both the RPC-layer and Send-side owners have dropped, + * so rpcrdma_sendctx_unmap() has cleared rl_sendctx. + */ + WARN_ON_ONCE(req->rl_sendctx); + kref_init(&req->rl_kref); #if defined(CONFIG_SUNRPC_BACKCHANNEL) @@ -653,10 +658,10 @@ xprt_rdma_free(struct rpc_task *task) frwr_unmap_sync(rpcx_to_rdmax(rqst->rq_xprt), req); } - /* XXX: If the RPC is completing because of a signal and - * not because a reply was received, we ought to ensure - * that the Send completion has fired, so that memory - * involved with the Send is not still visible to the NIC. + /* The Send-side rl_kref owner keeps req out of its free pool + * until rpcrdma_sendctx_unmap() has fired -- see I4 above + * rpcrdma_reply_handler() -- so signal-driven release here + * does not let the HCA touch a recycled send buffer. */ } diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 60cbc14c52995..da2c6fa44154e 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -1259,6 +1259,10 @@ void rpcrdma_reply_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req) req->rl_reply = NULL; rpcrdma_rep_put(buffers, rep); } + /* I2: rl_reply NULL after the put closes the + * 'rep on rb_free_reps still referenced by req' window. + */ + WARN_ON_ONCE(req->rl_reply); } /** @@ -1435,6 +1439,8 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed) rep = rpcrdma_rep_create(r_xprt); if (!rep) break; + /* I1: a rep on rb_free_reps must carry no rqst pointer. */ + WARN_ON_ONCE(rep->rr_rqst); if (!rpcrdma_regbuf_dma_map(r_xprt, rep->rr_rdmabuf)) { rpcrdma_rep_put(buf, rep); break;