Jason Gunthorpe [Wed, 27 Jul 2022 19:09:25 +0000 (16:09 -0300)]
Merge branch 'erdma' into rdma.git for-next
Cheng Xu says
====================
This v14 patch set introduces the Elastic RDMA Adapter (ERDMA) driver,
which released in Apsara Conference 2021 by Alibaba. The PR of ERDMA
userspace provider has already been created [1].
ERDMA enables large-scale RDMA acceleration capability in Alibaba ECS
environment, initially offered in g7re instance. It can improve the
efficiency of large-scale distributed computing and communication
significantly and expand dynamically with the cluster scale of Alibaba
Cloud.
ERDMA is a RDMA networking adapter based on the Alibaba MOC hardware. It
works in the VPC network environment (overlay network), and uses iWarp
transport protocol. ERDMA supports reliable connection (RC). ERDMA also
supports both kernel space and user space verbs. Now we have already
supported HPC/AI applications with libfabric, NoF and some other internal
verbs libraries, such as xrdma, epsl, etc,.
For the ECS instance with RDMA enabled, our MOC hardware generates two
kinds of PCI devices: one for ERDMA, and one for the original net device
(virtio-net). They are separated PCI devices.
====================
* branch 'erdma':
RDMA/erdma: Add driver to kernel build environment
RDMA/erdma: Add the ABI definitions
RDMA/erdma: Add the erdma module
RDMA/erdma: Add connection management (CM) support
RDMA/erdma: Add verbs implementation
RDMA/erdma: Add verbs header file
RDMA/erdma: Add event queue implementation
RDMA/erdma: Add cmdq implementation
RDMA/erdma: Add main include file
RDMA/erdma: Add the hardware related definitions
RDMA: Add ERDMA to rdma_driver_id definition
Add the main erdma module, which provides interface to infiniband
subsystem.
This commit includes a modification from Christophe, that using the bitmap
API to allocate bitmaps instead of hand-writing. And the commit also fixes
warnings reported by static checkers.
Link: https://lore.kernel.org/r/20220727014927.76564-10-chengyou@linux.alibaba.com Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
RDMA/erdma: Add connection management (CM) support
ERDMA's transport protocol is iWarp, so the driver must support CM
interface. In CM part, we use the same way as SoftiWarp: using kernel
socket to set up the connection, then performing MPA negotiation in
kernel. So, this part of code mainly comes from SoftiWarp, base on it,
we add some more features, such as non-blocking iw_connect implementation.
This commit also fixes a duplicated include issue reported by Abaci Robot.
The RDMA verbs implementation of erdma is divided into three files:
erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internal used functions and
datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c, the rest
is in erdma_verbs.c.
This commit also fixes some static check warnings.
Link: https://lore.kernel.org/r/20220727014927.76564-8-chengyou@linux.alibaba.com Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Event queue (EQ) is the main notification way from erdma hardware to its
driver. Each erdma device contains 2 kinds EQs: asynchronous EQ (AEQ) and
completion EQ (CEQ). Per device has 1 AEQ, which used for RDMA async event
report, and max to 32 CEQs (numbered for CEQ0 to CEQ31). CEQ0 is used for
cmdq completion event report, and the rest CEQs are used for RDMA
completion event report.
Cmdq is the main control plane channel between erdma driver and hardware.
After erdma device is initialized, the cmdq channel will be active in the
whole lifecycle of this driver.
This commit also includes two modifications from Christophe, one is using
the bitmap API to allocate bitmaps instead of hand-writing, and another
is using the non-atomic bitmap API when applicable.
Add ERDMA driver main header file, defining internal used data structures
and operations. The defined data structures includes *cmdq*, which is used
as the communication channel between ERDMA driver and hardware.
ERDMA is a PCIe device, and this file provides ERDMA hardware related
definitions, mainly including PCIe device capabilities and restrictions,
device registers definitions, doorbell space, doorbell structure
definitions and WQE definitions.
RDMA/mlx5: Store in the cache mkeys instead of mrs
Currently, the driver stores mlx5_ib_mr struct in the cache entries,
although the only use of the cached MR is the mkey. Store only the mkey in
the cache.
RDMA/mlx5: Store the number of in_use cache mkeys instead of total_mrs
total_mrs is used only to calculate the number of mkeys currently in
use. To simplify things, replace it with a new member called "in_use" and
directly store the number of mkeys currently in use.
The Xarray allows us to store the cached mkeys in memory efficient way.
Entries are reserved in the Xarray using xa_cmpxchg before calling to the
upcoming callbacks to avoid allocations in interrupt context. The
xa_cmpxchg can sleep when using GFP_KERNEL, so we call it in a loop to
ensure one reserved entry for each process trying to reserve.
In the next patch, ent->list will be replaced with an xarray. The xarray
uses an internal lock to protect the indexes. Use it to protect all the
entry fields, and get rid of ent->lock.
Li Zhijian [Tue, 26 Jul 2022 03:16:26 +0000 (03:16 +0000)]
Revert "RDMA/rxe: Create duplicate mapping tables for FMRs"
Below 2 commits will be reverted:
commit 8ff5f5d9d8cf ("RDMA/rxe: Prevent double freeing rxe_map_set()")
commit 647bf13ce944 ("RDMA/rxe: Create duplicate mapping tables for FMRs")
The community has a few bug reports which pointed this commit at last.
Some proposals are raised up in the meantime but all of them have no
follow-up operation.
The previous commit led the map_set of FMR to be not available any more if
the MR is registered again after invalidating. Although the mentioned
patch try to fix a potential race in building/accessing the same table
for fast memory regions, it broke rtrs etc ULPs. Since the latter could
be worse, revert this patch.
With previous commit, it's observed that a same MR in rnbd server will
trigger below code path:
-> rxe_mr_init_fast()
|-> alloc map_set() # map_set is uninitialized
|...-> rxe_map_mr_sg() # build the map_set
|-> rxe_mr_set_page()
|...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
# we can access host memory(such rxe_mr_copy)
|...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
|...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
# but map_set was not built again
|...-> rxe_mr_copy() # kernel crash due to access wild addresses
# that lookup from the map_set
Bob Pearson [Thu, 30 Jun 2022 19:04:26 +0000 (14:04 -0500)]
RDMA/rxe: Replace __rxe_do_task by rxe_run_task
In rxe_req.c replace calls to __rxe_do_task() by calls to rxe_run_task(..,
0). Using __rxe_do_task is an error because the completer tasklet is not
designed to be re-entrant and __rxe_do_task() should only be called when
it is clear that no one else could be calling the completer tasklet as is
the case in rxe_qp.c where this call is used in safe environments.
Bob Pearson [Thu, 30 Jun 2022 19:04:25 +0000 (14:04 -0500)]
RDMA/rxe: Limit the number of calls to each tasklet
Limit the maximum number of calls to each tasklet from rxe_do_task()
before yielding the cpu. When the limit is reached reschedule the tasklet
and exit the calling loop. This patch prevents one tasklet from consuming
100% of a cpu core and causing a deadlock or soft lockup.
Bob Pearson [Thu, 30 Jun 2022 19:04:22 +0000 (14:04 -0500)]
RDMA/rxe: Fix rnr retry behavior
Currently the completer tasklet when retransmit timer or the rnr timer
fires the same flag (qp->req.need_retry) is set so that if either timer
fires it will attempt to perform a retry flow on the send queue. This has
the effect of responding to an RNR NAK at the first retransmit timer event
which might not allow the requested rnr timeout.
This patch adds a new flag (qp->req.wait_for_rnr_timer) which, if set,
prevents a retry flow until the rnr nak timer fires.
This patch fixes rnr retry errors which can be observed by running the
pyverbs test_rdmacm_async_traffic_external_qp multiple times. With this
patch applied they do not occur.
Bob Pearson [Thu, 30 Jun 2022 19:04:18 +0000 (14:04 -0500)]
RDMA/rxe: Add rxe_is_fenced() subroutine
The code thc that decides whether to defer execution of a wqe in
rxe_requester.c is isolated into a subroutine rxe_is_fenced() and removed
from the call to req_next_wqe(). The condition whether a wqe should be
fenced is changed to comply with the IBA. Currently an operation is fenced
if the fence bit is set in the wqe flags and the last wqe has not
completed. For normal operations the IBA actually only requires that the
last read or atomic operation is complete.
RDMA/rxe: For invalidate compare according to set keys in mr
The 'rkey' input can be an lkey or rkey, and in rxe the lkey or rkey have
the same value, including the variant bits.
So, if mr->rkey is set, compare the invalidate key with it, otherwise
compare with the mr->lkey.
Since we already did a lookup on the non-varient bits to get this far, the
check's only purpose is to confirm that the wqe has the correct variant
bits.
Fixes: 001345339f4c ("RDMA/rxe: Separate HW and SW l/rkeys") Link: https://lore.kernel.org/r/20220707073006.328737-1-haris.phnx@gmail.com Signed-off-by: Md Haris Iqbal <haris.phnx@gmail.com> Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Bob Pearson [Thu, 14 Jul 2022 20:46:20 +0000 (15:46 -0500)]
RDMA/rxe: Fix mw bind to allow any consumer key portion
The current implementation of rxe_check_bind_mw() in rxe_mw.c is incorrect
since it requires the new key portion provided by the mw consumer to be
different than the previous key portion. This is not required by the
IBA. Remove the test.
Mark Bloch [Sun, 3 Jul 2022 20:54:07 +0000 (13:54 -0700)]
RDMA/mlx5: Expose steering anchor to userspace
Expose a steering anchor per priority to allow users to re-inject
packets back into default NIC pipeline for additional processing.
MLX5_IB_METHOD_STEERING_ANCHOR_CREATE returns a flow table ID which
a user can use to re-inject packets at a specific priority.
A FTE (flow table entry) can be created and the flow table ID
used as a destination.
When a packet is taken into a RDMA-controlled steering domain (like
software steering) there may be a need to insert the packet back into
the default NIC pipeline. This exposes a flow table ID to the user that can
be used as a destination in a flow table entry.
With this new method priorities that are exposed to users via
MLX5_IB_METHOD_FLOW_MATCHER_CREATE can be reached from a non-zero UID.
As user-created flow tables (via RDMA DEVX) are created with a non-zero UID
thus it's impossible to point to a NIC core flow table (core driver flow tables
are created with UID value of zero) from userspace.
Create flow tables that are exposed to users with the shared UID, this
allows users to point to default NIC flow tables.
Steering loops are prevented at FW level as FW enforces that no flow
table at level X can point to a table at level lower than X.
Mark Bloch [Sun, 3 Jul 2022 20:54:06 +0000 (13:54 -0700)]
RDMA/mlx5: Refactor get flow table function
_get_flow_table() requires the entire matcher being passed
while all it needs is the priority and namespace type.
Pass the priority and namespace type directly instead.
Leon Romanovsky [Tue, 19 Jul 2022 17:40:28 +0000 (20:40 +0300)]
Merge branch 'mlx5-next' into wip/leon-for-next
Mark Bloch Says:
================
Expose steering anchor
Expose a steering anchor per priority to allow users to re-inject
packets back into default NIC pipeline for additional processing.
MLX5_IB_METHOD_STEERING_ANCHOR_CREATE returns a flow table ID which
a user can use to re-inject packets at a specific priority.
A FTE (flow table entry) can be created and the flow table ID
used as a destination.
When a packet is taken into a RDMA-controlled steering domain (like
software steering) there may be a need to insert the packet back into
the default NIC pipeline. This exposes a flow table ID to the user that can
be used as a destination in a flow table entry.
With this new method priorities that are exposed to users via
MLX5_IB_METHOD_FLOW_MATCHER_CREATE can be reached from a non-zero UID.
As user-created flow tables (via RDMA DEVX) are created with a non-zero UID
thus it's impossible to point to a NIC core flow table (core driver flow tables
are created with UID value of zero) from userspace.
Create flow tables that are exposed to users with the shared UID, this
allows users to point to default NIC flow tables.
Steering loops are prevented at FW level as FW enforces that no flow
table at level X can point to a table at level lower than X.
Jianglei Nie [Mon, 11 Jul 2022 07:07:18 +0000 (15:07 +0800)]
RDMA/hfi1: fix potential memory leak in setup_base_ctxt()
setup_base_ctxt() allocates a memory chunk for uctxt->groups with
hfi1_alloc_ctxt_rcv_groups(). When init_user_ctxt() fails, uctxt->groups
is not released, which will lead to a memory leak.
We should release the uctxt->groups with hfi1_free_ctxt_rcv_groups()
when init_user_ctxt() fails.
Fixes: e87473bc1b6c ("IB/hfi1: Only set fd pointer when base context is completely initialized") Link: https://lore.kernel.org/r/20220711070718.2318320-1-niejianglei2021@163.com Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Xiao Yang [Tue, 5 Jul 2022 14:52:11 +0000 (22:52 +0800)]
RDMA/rxe: Add common rxe_prepare_res()
It's redundant to prepare resources for Read and Atomic
requests by different functions. Replace them by a common
rxe_prepare_res() with different parameters. In addition,
the common rxe_prepare_res() can also be used by new Flush
and Atomic Write requests in the future.
RDMA/rxe: Fix BUG: KASAN: null-ptr-deref in rxe_qp_do_cleanup
The function rxe_create_qp calls rxe_qp_from_init. If some error
occurs, the error handler of function rxe_qp_from_init will set
both scq and rcq to NULL.
Then rxe_create_qp calls rxe_put to handle qp. In the end,
rxe_qp_do_cleanup is called by rxe_put. rxe_qp_do_cleanup directly
accesses scq and rcq before checking them. This will cause
null-ptr-deref error.
The call graph is as below:
rxe_create_qp {
...
rxe_qp_from_init {
...
err1:
...
qp->rcq = NULL; <---rcq is set to NULL
qp->scq = NULL; <---scq is set to NULL
...
}
qp_init:
rxe_put{
...
rxe_qp_do_cleanup {
...
atomic_dec(&qp->scq->num_wq); <--- scq is accessed
...
atomic_dec(&qp->rcq->num_wq); <--- rcq is accessed
}
}
Fixes: 4703b4f0d94a ("RDMA/rxe: Enforce IBA C11-17") Link: https://lore.kernel.org/r/20220705225414.315478-1-yanjun.zhu@linux.dev Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com> Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
If siw_recv_mpa_rr returns -EAGAIN, it means that the MPA reply hasn't
been received completely, and should not report IW_CM_EVENT_CONNECT_REPLY
in this case. This may trigger a call trace in iw_cm. A simple way to
trigger this:
server: ib_send_lat
client: ib_send_lat -R <server_ip>
Since ECC memory maintains a memory system immune to single-bit errors,
add support for correcting the 1bit-ECC error, which prevents a 1bit-ECC
error become an uncorrected type error. When a 1bit-ECC error happens in
the internal ram of the ROCE engine, such as the QPC table, as a 1bit-ECC
error caused by reading, the ROCE engine only corrects those 1bit ECC
errors by writing.
RDMA/hns: Fix incorrect clearing of interrupt status register
The driver will clear all the interrupts in the same area
when the driver handles the interrupt of type AEQ overflow.
It should only set the interrupt status bit of type AEQ overflow.
Fixes: a5073d6054f7 ("RDMA/hns: Add eq support of hip08") Link: https://lore.kernel.org/r/20220714134353.16700-4-liangwenpeng@huawei.com Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com> Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA/hns: Remove unused abnormal interrupt of type RAS
The HNS NIC driver receives and handles the abnormal interrupt of the RAS
type generated by ROCEE, and the HNS RDMA driver does not need to handle
this type of interrupt. Therefore, delete unused codes in the HNS RDMA
driver.
Jianglei Nie [Thu, 14 Jul 2022 06:15:05 +0000 (14:15 +0800)]
RDMA/qedr: Fix potential memory leak in __qedr_alloc_mr()
__qedr_alloc_mr() allocates a memory chunk for "mr->info.pbl_table" with
init_mr_info(). When rdma_alloc_tid() and rdma_register_tid() fail, "mr"
is released while "mr->info.pbl_table" is not released, which will lead
to a memory leak.
We should release the "mr->info.pbl_table" with qedr_free_pbl() when error
occurs to fix the memory leak.
Fixes: e0290cce6ac0 ("qedr: Add support for memory registeration verbs") Link: https://lore.kernel.org/r/20220714061505.2342759-1-niejianglei2021@163.com Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Acked-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Both hfi1 and UML depend on x86_64, this can trigger build errors.
This driver must depends on !UML because it accesses x86_64
features that are not supported by UML.
Jack Wang [Tue, 12 Jul 2022 10:31:13 +0000 (12:31 +0200)]
RDMA/rtrs-srv: Do not use mempool for page allocation
The mempool is for guaranteed memory allocation during
extreme VM load (see the header of mempool.c of the kernel).
But rtrs-srv allocates pages only when creating new session.
There is no need to use the mempool.
With the removal of mempool, rtrs-server no longer need to reserve
huge mount of memory, this will avoid error like this:
https://lore.kernel.org/lkml/20220620020727.GA3669@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/r/20220712103113.617754-6-haris.iqbal@ionos.com Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA/rtrs-clt: Replace list_next_or_null_rr_rcu with an inline function
removes list_next_or_null_rr_rcu macro to fix below warnings.
That macro is used only twice.
CHECK:MACRO_ARG_REUSE: Macro argument reuse 'head' - possible side-effects?
CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ptr' - possible side-effects?
CHECK:MACRO_ARG_REUSE: Macro argument reuse 'memb' - possible side-effects?
Replaces that macro with an inline function.
Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality") Cc: jinpu.wang@ionos.com Link: https://lore.kernel.org/r/20220712103113.617754-5-haris.iqbal@ionos.com Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA/rtrs-srv: Use per-cpu variables for rdma stats
Convert server stat counters from atomic to per-cpu variables.
Link: https://lore.kernel.org/r/20220712103113.617754-4-haris.iqbal@ionos.com Signed-off-by: Santosh Kumar Pradhan <santosh.pradhan@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Jack Wang [Tue, 12 Jul 2022 10:31:09 +0000 (12:31 +0200)]
RDMA/rtrs-srv: Fix modinfo output for stringify
stringify works with define, not enum.
Fixes: 91fddedd439c ("RDMA/rtrs: private headers with rtrs protocol structs and helpers") Cc: jinpu.wang@ionos.com Link: https://lore.kernel.org/r/20220712103113.617754-2-haris.iqbal@ionos.com Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Reviewed-by: Aleksei Marov <aleksei.marov@ionos.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Mustafa Ismail [Tue, 5 Jul 2022 23:08:15 +0000 (18:08 -0500)]
RDMA/irdma: Fix setting of QP context err_rq_idx_valid field
Setting err_rq_idx_valid field in QP context when the AE source of the
AEQE is not associated with an RQ causes the firmware flush to fail.
Set err_rq_idx_valid field in QP context only if it is associated with an
RQ. Additionally, cleanup the redundant setting of this field in
irdma_process_aeq.
Mustafa Ismail [Tue, 5 Jul 2022 23:08:14 +0000 (18:08 -0500)]
RDMA/irdma: Fix VLAN connection with wildcard address
When an application listens on a wildcard address, and there are VLAN and
non-VLAN IP addresses, iWARP connection establishemnt can fail if the listen
node VLAN ID does not match.
Fix this by checking the vlan_id only if not a wildcard listen node.
Fixes: 146b9756f14c ("RDMA/irdma: Add connection manager") Link: https://lore.kernel.org/r/20220705230815.265-7-shiraz.saleem@intel.com Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Mustafa Ismail [Tue, 5 Jul 2022 23:08:13 +0000 (18:08 -0500)]
RDMA/irdma: Fix a window for use-after-free
During a destroy CQ an interrupt may cause processing of a CQE after CQ
resources are freed by irdma_cq_free_rsrc(). Fix this by moving the call
to irdma_cq_free_rsrc() after the irdma_sc_cleanup_ceqes(), which is
called under the cq_lock.
RDMA/irdma: Make resource distribution algorithm more QP oriented
Adapt the resource distribution algorithm in irdma_cfg_fpm_val to be more
QP oriented. If the configuration is too big for the available memory,
trim the MR and PBLE's first before trimming the QPs. This also avoids
having to double QPs requested as input to algorithm for GEN1 devices.
Jakub Kicinski [Tue, 5 Jul 2022 23:02:08 +0000 (16:02 -0700)]
ipoib: switch to netif_napi_add_weight()
We want to remove the weight argument from the basic
netif_napi_add() API and just default to 64.
Switch ipoib to the new API for explicitly specifying
the weight.
Jakub Kicinski [Tue, 5 Jul 2022 23:02:07 +0000 (16:02 -0700)]
IB/hfi1: switch to netif_napi_add_weight()
Since we'll remove the last argument from netif_napi_add()
soon switch this RDMA driver to netif_napi_add_weight()
for now to avoid cross-tree patches.
Bob Pearson [Thu, 30 Jun 2022 19:04:20 +0000 (14:04 -0500)]
RDMA/rxe: Remove unnecessary include statement
rxe_verbs.h includes the file <rdma/rdma_user_rxe.h>. It should have been
<uapi/rdma/rdma_user_rxe.h>, however, it is not used and not required in
this file.
Bob Pearson [Thu, 30 Jun 2022 19:04:21 +0000 (14:04 -0500)]
RDMA/rxe: Replace include statement
rxe_queue.h currently includes <uapi/rdma/rdma_user_rxe.h> for a
definition of struct rxe_queue_buf. But it is only used as a pointer so
the definition is not needed.
This patch replaces the include statement with the declaration
Bob Pearson [Thu, 30 Jun 2022 19:04:19 +0000 (14:04 -0500)]
RDMA/rxe: Convert pr_warn/err to pr_debug in pyverbs
The pyverbs test suite generates a few dmesg traces from intentional error
tests. This patch replaces those messages with pr_debug() calls which
improves the usefullness of the tests.
Bob Pearson [Mon, 23 May 2022 22:32:52 +0000 (17:32 -0500)]
RDMA/rxe: Fix deadlock in rxe_do_local_ops()
When a local operation (invalidate mr, reg mr, bind mw) is finished there
will be no ack packet coming from a responder to cause the wqe to be
completed. This may happen anyway if a subsequent wqe performs
IO. Currently if the wqe is signalled the completer tasklet is scheduled
immediately but not otherwise.
This leads to a deadlock if the next wqe has the fence bit set in send
flags and the operation is not signalled. This patch removes the condition
that the wqe must be signalled in order to schedule the completer tasklet
which is the simplest fix for this deadlock and is fairly low cost. This
is the analog for local operations of always setting the ackreq bit in all
last or only request packets even if the operation is not signalled.
Bob Pearson [Mon, 6 Jun 2022 14:38:37 +0000 (09:38 -0500)]
RDMA/rxe: Merge normal and retry atomic flows
Make the execution of the atomic operation in rxe_atomic_reply()
conditional on res->replay and make duplicate_request() call into
rxe_atomic_reply() to merge the two flows. This is modeled on the behavior
of read reply. Delete the skb from the atomic responder resource since it
is no longer used. Adjust the reference counting of the qp in
send_atomic_ack() for this flow.
Bob Pearson [Mon, 6 Jun 2022 14:38:36 +0000 (09:38 -0500)]
RDMA/rxe: Move atomic original value to res
Move the saved original value to the atomic responder resource. This
replaces saving it in the qp. In preparation for merging the normal and
retry atomic responder flows.
Bob Pearson [Mon, 6 Jun 2022 14:38:35 +0000 (09:38 -0500)]
RDMA/rxe: Move atomic responder res to atomic_reply
Move the allocation of the atomic responder resource up into
rxe_atomic_reply() from send_atomic_ack(). In preparation for merging the
normal and retry atomic responder flows.
Bob Pearson [Mon, 6 Jun 2022 14:38:34 +0000 (09:38 -0500)]
RDMA/rxe: Add a responder state for atomic reply
Add a responder state for atomic reply similar to read reply and rename
process_atomic() rxe_atomic_reply(). In preparation for merging the normal
and retry atomic responder flows.
Bob Pearson [Mon, 6 Jun 2022 14:38:33 +0000 (09:38 -0500)]
RDMA/rxe: Move code to rxe_prepare_atomic_res()
Separate the code that prepares the atomic responder resource into a
subroutine. This is preparation for merging the normal and retry atomic
responder flows.
Bob Pearson [Sun, 12 Jun 2022 22:34:34 +0000 (17:34 -0500)]
RDMA/rxe: Stop lookup of partially built objects
Currently the rdma_rxe driver has a security weakness due to giving
objects which are partially initialized indices allowing external actors
to gain access to them by sending packets which refer to their
index (e.g. qpn, rkey, etc) causing unpredictable results.
This patch adds a new API rxe_finalize(obj) which enables looking up pool
objects from indices using rxe_pool_get_index() for AH, QP, MR, and
MW. They are added in create verbs only after the objects are fully
initialized.
It also adds wait for completion to destroy/dealloc verbs to assure that
all references have been dropped before returning to rdma_core by
implementing a new rxe_pool API rxe_cleanup() which drops a reference to
the object and then waits for all other references to be dropped. When
the last reference is dropped the object is completed by kref. After that
it cleans up the object and if locally allocated frees the memory. In the
special case of address handle objects the delay is implemented separately
if the destroy_ah call is not sleepable.
Combined with deferring cleanup code to type specific cleanup routines
this allows all pending activity referring to objects to complete before
returning to rdma_core.
Leon Romanovsky [Thu, 16 Jun 2022 17:09:11 +0000 (20:09 +0300)]
Merge branch 'mlx5-next' into wip/leon-for-next
This merge commit includes shared updates to net-next and rdma-next
for upcoming mlx5 features.
1) Updated HW bits and definitions for upcoming features
1.1) vport debug counters
1.2) flow meter
1.3) Execute ASO action for flow entry
1.4) enhanced CQE compression
2) Add ICM header-modify-pattern RDMA API
Leon Says
=========
SW steering manipulates packet's header using "modifying header" actions.
Many of these actions do the same operation, but use different data each time.
Currently we create and keep every one of these actions, which use expensive
and limited resources.
Now we introduce a new mechanism - pattern and argument, which splits
a modifying action into two parts:
1. action pattern: contains the operations to be applied on packet's header,
mainly set/add/copy of fields in the packet
2. action data/argument: contains the data to be used by each operation
in the pattern.
This way we reuse same patterns with different arguments to create new
modifying actions, and since many actions share the same operations, we end
up creating a small number of patterns that we keep in a dedicated cache.
These modify header patterns are implemented as new type of ICM memory,
so the following kernel patch series add the support for this new ICM type.
==========
Dongliang Mu [Thu, 9 Jun 2022 07:06:56 +0000 (15:06 +0800)]
RDMA/rxe: fix xa_alloc_cycle() error return value check again
Currently rxe_alloc checks ret to indicate error, but 1 is also a valid
return and just indicates that the allocation succeeded with a wrap.
Fix this by modifying the check to be < 0.
Link: https://lore.kernel.org/r/20220609070656.1446121-1-dzm91@hust.edu.cn Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by xarrays") Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add a netevent callback for cma, mainly to catch NETEVENT_NEIGH_UPDATE.
Previously, when a system with failover MAC mechanism change its MAC address
during a CM connection attempt, the RDMA-CM would take a lot of time till
it disconnects and timesout due to the incorrect MAC address.
Now when we get a NETEVENT_NEIGH_UPDATE we check if it is due to a failover
MAC change and if so, we instantly destroy the CM and notify the user in order
to spare the unnecessary waiting for the timeout.
RDMA/core: Add an rb_tree that stores cm_ids sorted by ifindex and remote IP
Add to the cma, a tree that keeps track of all rdma_id_private channels
that were created while in RoCE mode.
The IDs are sorted first according to their netdevice ifindex then their
destination IP. And for IDs with matching IP they would be at the same node
in the tree, since the tree data is a list of all ids with matching destination IP.
The tree allows fast and efficient lookup of ids using an ifindex and
IP address which is useful for identifying relevant net_events promptly.
Ofer Levi [Wed, 8 Jun 2022 20:04:52 +0000 (13:04 -0700)]
net/mlx5: Add bits and fields to support enhanced CQE compression
Expose ifc bits and add needed structure fields and methods to
support enhanced CQE compression feature.
The enhanced CQE compression feature improves cpu utiliziation with
better packet latency from nic to host.
Signed-off-by: Ofer Levi <oferle@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Shay Drory [Wed, 8 Jun 2022 20:04:51 +0000 (13:04 -0700)]
net/mlx5: Remove not used MLX5_CAP_BITS_RW_MASK
Remove not used MLX5_CAP_BITS_RW_MASK.
While at it, remove CAP_MASK, MLX5_CAP_OFF_CMDIF_CSUM
and MLX5_DEV_CAP_FLAG_*, since MLX5_CAP_BITS_RW_MASK
was their only user.
Shay Drory [Wed, 8 Jun 2022 20:04:50 +0000 (13:04 -0700)]
net/mlx5: group fdb cleanup to single function
Currently, the allocation of fdb software objects are done is single
function, oppose to the cleanup of them.
Group the cleanup of fdb software objects to single function.