]> git.ipfire.org Git - thirdparty/kernel/stable.git/commit
RDMA/mlx5: Fix a race for DMABUF MR which can lead to CQE with error
authorYishai Hadas <yishaih@nvidia.com>
Mon, 3 Feb 2025 12:50:59 +0000 (14:50 +0200)
committerGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Fri, 7 Mar 2025 17:25:25 +0000 (18:25 +0100)
commita14b5e690abaa52fee4885b361488bd5dd0802d0
treeeee6d71e56ba0f7e56f8747fee505212482bba90
parented3a682157ae7a3fdc8c1475ac6732f8bad6c282
RDMA/mlx5: Fix a race for DMABUF MR which can lead to CQE with error

[ Upstream commit cc668a11e6ac8adb0e016711080d3f314722cc91 ]

This patch addresses a potential race condition for a DMABUF MR that can
result in a CQE with an error on the UMR QP.

During the __mlx5_ib_dereg_mr() flow, the following sequence of calls
occurs:
mlx5_revoke_mr()
mlx5r_umr_revoke_mr()
mlx5r_umr_post_send_wait()
At this point, the lkey is freed from the hardware's perspective.

However, concurrently, mlx5_ib_dmabuf_invalidate_cb() might be triggered
by another task attempting to invalidate the MR having that freed lkey.

Since the lkey has already been freed, this can lead to a CQE error,
causing the UMR QP to enter an error state.

To resolve this race condition, the dma_resv_lock() which was hold as
part of the mlx5_ib_dmabuf_invalidate_cb() is now also acquired as part
of the mlx5_revoke_mr() scope.

Upon a successful revoke, we set umem_dmabuf->private which points to
that MR to NULL, preventing any further invalidation attempts on its
lkey.

Fixes: e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Artemy Kovalyov <artemyko@mnvidia.com>
Link: https://patch.msgid.link/70617067abbfaa0c816a2544c922e7f4346def58.1738587016.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/infiniband/hw/mlx5/mr.c