Avihai Horon [Wed, 9 Jun 2021 11:05:03 +0000 (14:05 +0300)]
RDMA/mlx5: Enable Relaxed Ordering by default for kernel ULPs
Relaxed Ordering is a capability that can only benefit users that support
it. All kernel ULPs should support Relaxed Ordering, as they are designed
to read data only after observing the CQE and use the DMA API correctly.
Hence, implicitly enable Relaxed Ordering by default for MR transfers in
kernel ULPs.
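As a rough illustration, a kernel ULP's fast-registration WR looks like this; note that it does not pass IB_ACCESS_RELAXED_ORDERING itself, the driver now adds it on the ULP's behalf (a minimal sketch, not the actual mlx5 code):

#include <rdma/ib_verbs.h>

/* Sketch of a typical kernel ULP fast-registration WR. The ULP does not
 * request IB_ACCESS_RELAXED_ORDERING; after this change mlx5 enables
 * Relaxed Ordering for such kernel MRs implicitly.
 */
static int ulp_post_reg_mr(struct ib_qp *qp, struct ib_mr *mr)
{
	struct ib_reg_wr reg_wr = {
		.wr.opcode = IB_WR_REG_MR,
		.mr = mr,
		.key = mr->lkey,
		.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ,
	};

	return ib_post_send(qp, &reg_wr.wr, NULL);
}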
Xi Wang [Fri, 11 Jun 2021 06:14:49 +0000 (14:14 +0800)]
RDMA/hns: Clear extended doorbell info before using
Both HIP08 and HIP09 require the extended doorbell information to be
cleared before being used.
Fixes: 6b63597d3540 ("RDMA/hns: Add TSQ link table support")
Link: https://lore.kernel.org/r/1623392089-35639-1-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jack Wang [Mon, 14 Jun 2021 09:03:37 +0000 (11:03 +0200)]
RDMA/rtrs: Check device max_qp_wr limit when create QP
Currently we only check the device's max_qp_wr limit for IO connections,
but not for service connections. We should check it for both.
So save the max_qp_wr device limit in wr_limit, and use it for both IO
connections and service connections.
While at it, also remove an outdated comment.
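A minimal sketch of the clamping (rtrs-internal names approximated):

#include <rdma/ib_verbs.h>
#include <linux/minmax.h>

/* Sketch: read the device limit once and clamp the WR count of both the
 * service and the IO connections against it.
 */
static u32 rtrs_clamp_wr(struct ib_device *ib_dev, u32 wanted)
{
	u32 wr_limit = ib_dev->attrs.max_qp_wr;

	return min_t(u32, wr_limit, wanted);
}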
Link: https://lore.kernel.org/r/20210614090337.29557-6-jinpu.wang@ionos.com
Suggested-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Guoqing Jiang [Mon, 14 Jun 2021 09:03:36 +0000 (11:03 +0200)]
RDMA/rtrs: Rename cq_size/queue_size to cq_num/queue_num
These variables are passed to create_cq, create_qp, rtrs_iu_alloc and
rtrs_iu_free, where *_size actually means the number of units, and
cq_size likewise means the number of CQ elements, so rename them to *_num.
Also move the setting of cq_num to the common path.
Link: https://lore.kernel.org/r/20210614090337.29557-5-jinpu.wang@ionos.com
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Md Haris Iqbal [Mon, 14 Jun 2021 09:03:35 +0000 (11:03 +0200)]
RDMA/rtrs: RDMA_RXE requires more number of WR
When using rdma_rxe, post_one_recv() returns an ENOMEM error due to the
recv queue being full. This patch increases the number of WRs for the
receive queue to support all devices.
Link: https://lore.kernel.org/r/20210614090337.29557-4-jinpu.wang@ionos.com
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jack Wang [Mon, 14 Jun 2021 09:03:34 +0000 (11:03 +0200)]
RDMA/rtrs-clt: Use minimal max_send_sge when create qp
We use the device limit max_send_sge, which is suboptimal for memory
usage. We don't need that much: for the user connection, 1 is enough,
and for the IO connection, sess->max_segments + 1 is enough.
Link: https://lore.kernel.org/r/20210614090337.29557-3-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jack Wang [Mon, 14 Jun 2021 09:03:33 +0000 (11:03 +0200)]
RDMA/rtrs-srv: Set minimal max_send_wr and max_recv_wr
Currently rtrs uses coarse (generally larger) numbers when creating a QP,
which leads the hardware to create more resources than needed and only
wastes memory with no benefit.
For max_send_wr, we don't really need the max_qp_wr size when creating
the QP; reduce it to cq_size.
For max_recv_wr, cq_size is enough.
With this patch, when sess_queue_depth=128, per-session (2 paths) memory
consumption is reduced from 188 MB to 65 MB.
When always_invalidate is enabled, we need to send more WRs, so treat
that case specially.
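A sketch of the resulting QP sizing (field values illustrative, not the exact rtrs-srv code):

#include <rdma/ib_verbs.h>
#include <linux/string.h>

/* Sketch: size the QP from the actual queue depth (cq_size here)
 * instead of the device maximum.
 */
static void init_qp_attr(struct ib_qp_init_attr *attr, struct ib_cq *cq,
			 u32 cq_size)
{
	memset(attr, 0, sizeof(*attr));
	attr->cap.max_send_wr = cq_size;	/* was dev->attrs.max_qp_wr */
	attr->cap.max_recv_wr = cq_size;
	attr->cap.max_send_sge = 1;
	attr->cap.max_recv_sge = 1;
	attr->send_cq = cq;
	attr->recv_cq = cq;
	attr->qp_type = IB_QPT_RC;
	attr->sq_sig_type = IB_SIGNAL_REQ_WR;
}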
Fixes: 9cb837480424e ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20210614090337.29557-2-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:33 +0000 (19:00 +0300)]
RDMA/core: Allow port_groups to be used with namespaces
Now that the port_groups data is being destroyed and managed by the core
code this restriction is no longer needed. All the ib_port_attrs are
compatible with the core's sysfs lifecycle.
When the main device is destroyed and moved to another namespace the
driver's port sysfs can be created/destroyed as well due to it now being a
simple attribute list.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:32 +0000 (19:00 +0300)]
RDMA: Change ops->init_port to ops->port_groups
init_port was only being used to register sysfs attributes against the
port kobject. Now that all users are creating static attribute_group's we
can simply set the attribute_group list in the ops and the core code can
just handle it directly.
This makes all the sysfs management quite straightforward and prevents any
driver from abusing the naked port kobject in future because no driver
code can access it.
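In rough terms, a driver now declares its port files statically and lets the core wire them up (a hedged sketch; the "hw_rev" attribute and its value are invented):

#include <rdma/ib_verbs.h>
#include <rdma/ib_sysfs.h>

static ssize_t hw_rev_show(struct ib_device *ibdev, u32 port_num,
			   struct ib_port_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%u\n", port_num);	/* example value */
}
static IB_PORT_ATTR_RO(hw_rev);

static struct attribute *drv_port_attrs[] = {
	&ib_port_attr_hw_rev.attr,
	NULL,
};

static const struct attribute_group drv_port_group = {
	.attrs = drv_port_attrs,
};

static const struct attribute_group *drv_port_groups[] = {
	&drv_port_group,
	NULL,
};

/* in the driver's ib_device_ops: .port_groups = drv_port_groups, */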
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:31 +0000 (19:00 +0300)]
RDMA/hfi1: Use attributes for the port sysfs
hfi1 should not be creating a mess of kobjects to attach to the port
kobject - this is all attributes. The proper API is to create an
attribute_group list and create it against the port's kobject.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:30 +0000 (19:00 +0300)]
RDMA/qib: Use attributes for the port sysfs
qib should not be creating a mess of kobjects to attach to the port
kobject - this is all attributes. The proper API is to create an
attribute_group list and create it against the port's kobject.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:29 +0000 (19:00 +0300)]
RDMA/cm: Use an attribute_group on the ib_port_attribute instead of kobj's
This code is trying to attach a list of counters grouped into 4 groups to
the ib_port sysfs. Instead of creating a bunch of kobjects simply express
everything naturally as an ib_port_attribute and add a single
attribute_groups list.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:27 +0000 (19:00 +0300)]
RDMA/core: Remove the kobject_uevent() NOP
This call does nothing because the ib_port kobj is nested under a struct
device kobject, and the dev_uevent_filter() function of the struct device
blocks uevents for any child kobjs that are not also struct devices.
A uevent for the struct device will be triggered after
ib_setup_port_attrs() returns which causes udev to pick up all the deep
"attributes" which are implemented as kobjects nested under a struct
device and assign them to the udev object for the struct device:
$ udevadm info -a /sys/class/infiniband/ibp0s9
ATTR{ports/1/counters/excessive_buffer_overrun_errors}=="0"
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:25 +0000 (19:00 +0300)]
RDMA/core: Simplify how the port sysfs is created
Use the same technique as gid_attrs now uses to manage the port
sysfs. Bundle everything into three allocations and use a single
sysfs_create_groups() to build everything in one shot.
All the memory is always freed in the kobj release function, removing most
of the error unwinding.
The gid_attr technique and the hw_counters are very similar, merge the two
together and combine the sysfs_create_group() call for hw_counters with
the single sysfs group setup.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:24 +0000 (19:00 +0300)]
RDMA/core: Simplify how the gid_attrs sysfs is created
Instead of having a whole bunch of different allocations to create the
gid_attr kobjects, reduce it to three: one for the kobj struct plus the
attributes, and one for the attribute list of each of the two
groups.
Move the freeing of all allocations to the release function.
Reorder the operations so that all the allocations happen first and the
kobject & sysfs operations come last.
This removes the majority of the complicated error unwind since the
release function will always undo all the memory allocations. Freeing the
memory is also much simpler since there is a lot less of it.
Consolidate creating the "group of array indexes" pattern into one helper
function. Ensure kobject_del is used.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:23 +0000 (19:00 +0300)]
RDMA/core: Split gid_attrs related sysfs from add_port()
The gid_attrs directory is a dedicated kobj nested under the port,
construct/destruct it with its own pair of functions for
understandability. This is much more readable than having it weirdly
inlined out of order into the add_port() function.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:22 +0000 (19:00 +0300)]
RDMA/core: Split port and device counter sysfs attributes
This code creates a 'struct hw_stats_attribute' for each sysfs entry that
contains a naked 'struct attribute' inside.
It then proceeds to attach this same structure to a 'struct device' kobj
and a 'struct ib_port' kobj. However, this violates the typing
requirements. 'struct device' requires the attribute to be a 'struct
device_attribute' and 'struct ib_port' requires the attribute to be
'struct port_attribute'.
This happens to work because the show/store function pointers in all three
structures happen to be at the same offset and have nearly the same
signature. This means that when container_of() was used to convert between
the wrong two types it still managed to work.
However, clang CFI detection notices that the function pointers have a
slightly different signature. As with show/store, this was only working
because the device and port struct layouts happened to have the kobj at
the front.
Correct this by having two independent sets of data structures for the
port and device cases. The two different attributes correctly include the
port/device_attribute struct, and everything from there up is kept
split. The show/store function call chains start with device/port-unique
functions that invoke a common show/store function pointer.
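The port side of the split then looks roughly like this (simplified, names approximated):

#include <rdma/ib_sysfs.h>

/* The stat attribute embeds the attribute type of the kobject it is
 * attached to, so container_of() is type-correct.
 */
struct hw_stats_port_attr {
	struct ib_port_attribute attr;	/* matches the ib_port kobject */
	unsigned int index;		/* which counter to show */
};

static ssize_t port_stat_show(struct ib_device *ibdev, u32 port_num,
			      struct ib_port_attribute *attr, char *buf)
{
	struct hw_stats_port_attr *stat_attr =
		container_of(attr, struct hw_stats_port_attr, attr);

	/* the common helper would look the value up by stat_attr->index */
	return sysfs_emit(buf, "%u\n", stat_attr->index);
}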
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:21 +0000 (19:00 +0300)]
RDMA/core: Replace the ib_port_data hw_stats pointers with a ib_port pointer
It is much saner to store a pointer to the kobject structure that contains
the canonical stats pointer than to copy the stats pointers into a public
structure.
Future patches will require the sysfs pointer for other purposes.
Jason Gunthorpe [Fri, 11 Jun 2021 16:00:20 +0000 (19:00 +0300)]
RDMA: Split the alloc_hw_stats() ops to port and device variants
This is being used to implement both the port and device global stats,
which is causing some confusion in the drivers. For instance EFA and i40iw
both seem to be misusing the device stats.
Split it into two ops so drivers that don't support one or the other can
leave the op NULL'd, making the calling code a little simpler to
understand.
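A sketch of the driver side after the split (the counter names are assumed):

#include <rdma/ib_verbs.h>

static const char * const drv_stat_names[] = {
	"rx_pkts",	/* assumed counters for illustration */
	"tx_pkts",
};

static struct rdma_hw_stats *drv_alloc_hw_port_stats(struct ib_device *ibdev,
						     u32 port_num)
{
	return rdma_alloc_hw_stats_struct(drv_stat_names,
					  ARRAY_SIZE(drv_stat_names),
					  RDMA_HW_STATS_DEFAULT_LIFESPAN);
}

static const struct ib_device_ops drv_ops = {
	.alloc_hw_port_stats = drv_alloc_hw_port_stats,
	/* .alloc_hw_device_stats left NULL: no device-wide counters */
};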
Bob Pearson [Tue, 8 Jun 2021 04:25:50 +0000 (23:25 -0500)]
RDMA/rxe: Add support for bind MW work requests
Add support for bind MW work requests from user space. Since rdma/core
does not support bind mw in ib_send_wr there is no way to support bind mw
in kernel space.
Added a bind_mw local operation in rxe_req.c, a bind_mw WR operation in
rxe_opcode.c, and a bind_mw WC in rxe_comp.c. Added additional fields to
rxe_mw in rxe_verbs.h. Added an rxe_do_dealloc_mw() subroutine to clean up
an MW when rxe_dealloc_mw is called. Added code to implement the bind_mw
operation in rxe_mw.c.
Bob Pearson [Tue, 8 Jun 2021 04:25:49 +0000 (23:25 -0500)]
RDMA/rxe: Move local ops to subroutine
Simplify rxe_requester() by moving the local operations to a subroutine.
Add an error return for an illegal send WR opcode. Moved next_index ahead
of rxe_run_task, which fixed a small bug where work completions were
delayed until after the next wqe, which was not the intended behavior. Let
errors return their own WC status; previously all errors were reported as
protection errors, which was incorrect. Changed the return of errors from
rxe_do_local_ops() to err:, which causes an immediate completion; without
this an error on a last WR may get lost. Changed fill_packet() to
finish_packet(), which is more accurate.
Bob Pearson [Tue, 8 Jun 2021 04:25:48 +0000 (23:25 -0500)]
RDMA/rxe: Replace WR_REG_MASK by WR_LOCAL_OP_MASK
Rxe has two mask bits WR_LOCAL_MASK and WR_REG_MASK with WR_REG_MASK used
to indicate any local operation and WR_LOCAL_MASK unused. This patch
replaces both of these with one mask bit WR_LOCAL_OP_MASK which is
clearer.
Bob Pearson [Tue, 8 Jun 2021 04:25:47 +0000 (23:25 -0500)]
RDMA/rxe: Add ib_alloc_mw and ib_dealloc_mw verbs
Add ib_alloc_mw and ib_dealloc_mw verbs APIs.
Added a new file, rxe_mw.c, focused on MWs. Changed the 8-bit random key
generator. Added a cleanup routine for MWs and added the verbs routines to
ib_device_ops.
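The shape of the new verbs, heavily simplified from rxe_mw.c:

#include <rdma/ib_verbs.h>

static int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
{
	/* add the MW to the rxe object pool, generate its 8-bit key */
	return 0;
}

static int rxe_dealloc_mw(struct ib_mw *ibmw)
{
	/* run the MW cleanup routine, drop the pool reference */
	return 0;
}

/* hooked up through ib_device_ops:
 *	.alloc_mw	= rxe_alloc_mw,
 *	.dealloc_mw	= rxe_dealloc_mw,
 */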
Bob Pearson [Tue, 8 Jun 2021 04:25:46 +0000 (23:25 -0500)]
RDMA/rxe: Enable MW object pool
Currently the rxe driver has a rxe_mw struct object but nothing about
memory windows is enabled. This patch turns on memory windows and does
some minor cleanup.
Set device attribute in rxe.c so max_mw = MAX_MW. Change parameters in
rxe_param.h so that MAX_MW is the same as MAX_MR. Reduce the number of
MRs and MWs to 4K from 256K. Add device capability bits for 2a and 2b
memory windows. Removed RXE_MR_TYPE_MW from the rxe_mr_type enum.
Bob Pearson [Tue, 8 Jun 2021 04:25:45 +0000 (23:25 -0500)]
RDMA/rxe: Return errors for add index and key
Modify rxe_add_index() and rxe_add_key() to return an error if the index
or key is already present in the pool. Currently they print a warning and
silently fail, with bad consequences for the caller.
Bob Pearson [Fri, 4 Jun 2021 23:05:59 +0000 (18:05 -0500)]
RDMA/rxe: Fix qp reference counting for atomic ops
Currently the rdma_rxe driver attempts to protect atomic responder
resources by taking a reference to the qp which is only freed when the
resource is recycled for a new read or atomic operation. This means that
in normal circumstances there is almost always an extra qp reference once
an atomic operation has been executed which prevents cleaning up the qp
and associated pd and cqs when the qp is destroyed.
This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
destroyed while a peer is retrying an atomic op it will cause the
operation to fail which is acceptable.
Link: https://lore.kernel.org/r/20210604230558.4812-1-rpearsonhpe@gmail.com
Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Xi Wang [Tue, 1 Jun 2021 09:57:07 +0000 (17:57 +0800)]
RDMA/hns: Support getting max QP number from firmware
All functions of HIP09's ROCEE share on-chip resources for all QPs, so the
driver needs to configure the resource index and number for each function
during the init stage.
Leon Romanovsky [Mon, 31 May 2021 16:04:44 +0000 (19:04 +0300)]
RDMA/mlx5: Don't add slave port to unaffiliated list
mlx5_ib_bind_slave_port() doesn't remove the multiport device from the
unaffiliated list, but mlx5_ib_unbind_slave_port() does. This unbalanced
flow led to a situation where mlx5_ib_unaffiliated_port_list was changed
during iteration.
Shiraz Saleem [Wed, 9 Jun 2021 23:49:24 +0000 (18:49 -0500)]
RDMA/irdma: Store PBL info address as a pointer type
The level1 PBL info address is stored as a u64. This requires casting
through a uintptr_t before it can be used as a pointer type, and leads to
warnings such as this when the uintptr_t is missing:
drivers/infiniband/hw/irdma/hw.c: In function 'irdma_destroy_virt_aeq':
drivers/infiniband/hw/irdma/hw.c:579:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
579 | dma_addr_t *pg_arr = (dma_addr_t *)aeq->palloc.level1.addr;
This can be fixed using an intermediate uintptr_t, but it is better to fix
the structure irdma_pble_info to store the address as a u64 * and the
VA it is assigned in irdma_chunk as a void *. This greatly reduces the
casting on this address.
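In miniature (field names simplified):

#include <linux/types.h>

/* Before: the VA lives in a u64, so every use needs a double cast:
 *	dma_addr_t *pg_arr = (dma_addr_t *)(uintptr_t)info->addr;
 */
struct pble_info_old {
	u64 addr;
};

/* After (sketch): store it with a pointer type and use it directly:
 *	dma_addr_t *pg_arr = (dma_addr_t *)info->addr;
 */
struct pble_info_new {
	u64 *addr;	/* level-1 PBL kernel VA */
};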
Jason Gunthorpe [Wed, 9 Jun 2021 10:59:25 +0000 (13:59 +0300)]
IB/cm: Remove dgid from the cm_id_priv av
It turns out this is only being used to store the LID for SIDR mode to
search the RB tree for request de-duplication. Store the LID value
directly and don't pretend it is a GID.
Weihang Li [Fri, 28 May 2021 09:37:33 +0000 (17:37 +0800)]
RDMA/core: Use refcount_t instead of atomic_t on refcount of iwpm_admin_data
The refcount_t API will WARN on underflow and overflow of a reference
counter and avoids use-after-free risks. Increasing a refcount_t from 0 to
1 is regarded as a use-after-free risk, so the counter should be set to 1
directly during initialization.
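The pattern being applied, in miniature:

#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
	refcount_t ref;
};

static void obj_init(struct obj *o)
{
	/* Start at 1: going 0 -> 1 with refcount_inc() would WARN,
	 * since reviving a zero-ref object is a use-after-free pattern.
	 */
	refcount_set(&o->ref, 1);
}

static void obj_put(struct obj *o)
{
	if (refcount_dec_and_test(&o->ref))	/* WARNs on underflow */
		kfree(o);
}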
Colin Ian King [Sat, 5 Jun 2021 12:20:59 +0000 (13:20 +0100)]
RDMA/irdma: Fix issues with u8 left shift operation
Shifting the u8 integer info->map[i] to the left promotes it to a 32-bit
signed int, which is then sign-extended to a u64. If the shift sets the
top bit of the 32-bit value, all the upper 32 bits of the u64 end up
being set because of the sign-extension.
Fix this by casting the u8 values to a u64 before the left shift.
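The bug in miniature:

#include <linux/types.h>

static void shift_demo(void)
{
	u8 map = 0x80;

	/* map is promoted to a signed 32-bit int; the shift sets bit 31
	 * and sign-extension then fills the upper 32 bits of the u64.
	 */
	u64 bad = map << 24;		/* 0xffffffff80000000 */

	/* cast before shifting and the upper bits stay clear */
	u64 good = (u64)map << 24;	/* 0x0000000080000000 */
}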
Link: https://lore.kernel.org/r/20210605122059.25105-1-colin.king@canonical.com
Addresses-Coverity: ("Unintentional integer overflow / bad shift operation")
Fixes: 3f49d6842569 ("RDMA/irdma: Implement HW Admin Queue OPs")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Kamal Heib [Thu, 3 Jun 2021 09:01:12 +0000 (12:01 +0300)]
RDMA/rxe: Fix failure during driver load
To avoid the following failure when trying to load the rdma_rxe module
while IPv6 is disabled, add a check for EAFNOSUPPORT and ignore the
failure; also delete the needless debug print from rxe_setup_udp_tunnel().
$ modprobe rdma_rxe
modprobe: ERROR: could not insert 'rdma_rxe': Operation not permitted
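The fix boils down to this, simplified from the change in rxe_net_ipv6_init():

/* Treat "IPv6 disabled" as non-fatal instead of failing module load. */
recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
					htons(ROCE_V2_UDP_DPORT), true);
if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
	recv_sockets.sk6 = NULL;
	pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
	return 0;
}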
Fixes: dfdd6158ca2c ("IB/rxe: Fix kernel panic in udp_setup_tunnel")
Link: https://lore.kernel.org/r/20210603090112.36341-1-kamalheib1@gmail.com
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Bob Pearson [Thu, 27 May 2021 19:47:48 +0000 (14:47 -0500)]
RDMA/rxe: Protect kernel index from user space
In order to prevent user space from modifying the index that belongs to
the kernel for shared queues, let the kernel use a local copy of the index
and copy any new values of that index to the shared rxe_queue_buf struct.
This adds more switch statements, which decreases the performance of the
queue API. Move the type into the parameter list for these functions so
that the compiler can optimize out the switch statements when the explicit
type is known. Modify all the calls in the driver on performance paths to
pass in the explicit queue type.
Bob Pearson [Thu, 27 May 2021 19:47:47 +0000 (14:47 -0500)]
RDMA/rxe: Protect user space index loads/stores
Modify the queue APIs to protect all user space index loads with
smp_load_acquire() and all user space index stores with
smp_store_release(). Base this on the types of the queues which can be one
of ..KERNEL, ..FROM_USER, ..TO_USER. Kernel space indices are protected by
locks which also provide memory barriers.
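A simplified sketch of the resulting access pattern:

/* Only indices owned by user space get the acquire/release treatment;
 * kernel-owned indices rely on the queue lock for ordering.
 */
static u32 queue_get_producer(struct rxe_queue *q, enum queue_type type)
{
	switch (type) {
	case QUEUE_TYPE_FROM_USER:
		/* user space writes this index: pair with its release */
		return smp_load_acquire(&q->buf->producer_index);
	default:
		return q->buf->producer_index;
	}
}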
Bob Pearson [Thu, 27 May 2021 19:47:46 +0000 (14:47 -0500)]
RDMA/rxe: Add a type flag to rxe_queue structs
To create optimal code we only want to use smp_load_acquire() and
smp_store_release() for user indices in the rxe_queue APIs, since kernel
indices are protected by locks which also act as memory barriers. By
adding a type to the queues we can determine which indices need to be
protected.
Jason Gunthorpe [Wed, 2 Jun 2021 22:59:10 +0000 (19:59 -0300)]
Merge branch 'irdma' into rdma.git for-next
Shiraz Saleem says:
====================
Add Intel Ethernet Protocol Driver for RDMA (irdma)
The following patch series introduces a unified Intel Ethernet Protocol
Driver for RDMA (irdma) for the X722 iWARP device and a new E810 device
which supports iWARP and RoCEv2. The irdma module replaces the legacy
i40iw module for X722 and extends the ABI already defined for i40iw. It is
backward compatible with the legacy X722 rdma-core provider (libi40iw).
X722 and E810 are PCI network devices that are RDMA capable. The RDMA
block of this parent device is represented via an auxiliary device
exported to 'irdma' using the core auxiliary bus infrastructure recently
added for 5.11 kernel. The parent PCI netdev drivers 'i40e' and 'ice'
register auxiliary RDMA devices with private data/ops encapsulated that
bind to auxiliary drivers registered in irdma module.
Currently, the default for E810 is RoCEv2. Runtime support for switching
the protocol to iWARP will be made available via devlink in a future patch.
====================
Link: https://lore.kernel.org/r/20210602205138.889-1-shiraz.saleem@intel.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* branch 'irdma':
RDMA/irdma: Update MAINTAINERS file
RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw
RDMA/irdma: Add ABI definitions
RDMA/irdma: Add dynamic tracing for CM
RDMA/irdma: Add miscellaneous utility definitions
RDMA/irdma: Add user/kernel shared libraries
RDMA/irdma: Add RoCEv2 UD OP support
RDMA/irdma: Implement device supported verb APIs
RDMA/irdma: Add PBLE resource manager
RDMA/irdma: Add connection manager
RDMA/irdma: Add QoS definitions
RDMA/irdma: Add privileged UDA queue implementation
RDMA/irdma: Add HMC backing store setup functions
RDMA/irdma: Implement HW Admin Queue OPs
RDMA/irdma: Implement device initialization definitions
RDMA/irdma: Register auxiliary driver and implement private channel OPs
i40e: Register auxiliary devices to provide RDMA
i40e: Prep i40e header for aux bus conversion
ice: Register auxiliary device to provide RDMA
ice: Implement iidc operations
ice: Initialize RDMA support
iidc: Introduce iidc.h
i40e: Replace one-element array with flexible-array member
Mustafa Ismail [Wed, 2 Jun 2021 20:51:28 +0000 (15:51 -0500)]
RDMA/irdma: Add QoS definitions
Add definitions for managing the RDMA HW work scheduler (WS) tree.
A WS node is created via a control QP operation with the bandwidth
allocation, arbitration scheme, and traffic class of the QP specified.
The Qset handle returned associates the QoS parameters for the QP.
The Qset is registered with the LAN and an equivalent node is created
in the LAN packet scheduler tree.
Mustafa Ismail [Wed, 2 Jun 2021 20:51:26 +0000 (15:51 -0500)]
RDMA/irdma: Add HMC backing store setup functions
HW uses host memory as a backing store for a number of
protocol context objects and queue state tracking.
The Host Memory Cache (HMC) is a component responsible for
managing these objects stored in host memory.
Add the functions and data structures to manage the allocation
of backing pages used by the HMC for the various objects
Mustafa Ismail [Wed, 2 Jun 2021 20:51:25 +0000 (15:51 -0500)]
RDMA/irdma: Implement HW Admin Queue OPs
The driver posts privileged commands to the HW
Admin Queue (Control QP or CQP) to request administrative
actions from the HW. Implement create/destroy of CQP
and the supporting functions, data structures and headers
to handle the different CQP commands
Mustafa Ismail [Wed, 2 Jun 2021 20:51:23 +0000 (15:51 -0500)]
RDMA/irdma: Register auxiliary driver and implement private channel OPs
Register auxiliary drivers which can attach to auxiliary RDMA
devices from Intel PCI netdev drivers i40e and ice. Implement the private
channel ops, and register net notifiers.
Link: https://lore.kernel.org/r/20210602205138.889-2-shiraz.saleem@intel.com
[flexible array transformation]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Mark Zhang [Wed, 2 Jun 2021 10:27:08 +0000 (13:27 +0300)]
IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock
During cm_dev deregistration in cm_remove_one(), the cm_device and
cm_ports will be freed; after that they should not be accessed. The
mad_agent needs to be protected as well.
This patch adds a cm_device kref to protect cm_dev and cm_ports, and a
mad_agent_lock spinlock to protect mad_agent.
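In outline (struct abbreviated, names illustrative):

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct cm_device_sketch {
	struct kref kref;
	spinlock_t mad_agent_lock;	/* protects mad_agent */
	struct ib_mad_agent *mad_agent;
};

static void cm_device_release(struct kref *kref)
{
	kfree(container_of(kref, struct cm_device_sketch, kref));
}

/* users: kref_get(&cm_dev->kref);
 *	  ... access mad_agent only under mad_agent_lock ...
 *	  kref_put(&cm_dev->kref, cm_device_release);
 */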
Mark Zhang [Wed, 2 Jun 2021 10:27:07 +0000 (13:27 +0300)]
IB/cm: Improve the calling of cm_init_av_for_lap and cm_init_av_by_path
The cm_init_av_for_lap() and cm_init_av_by_path() function calls have the
following issues:
1. Both of them might sleep and should not be called under a spinlock.
2. The access of cm_id_priv->av should be under cm_id_priv->lock, which
means it can't be initialized directly.
This patch splits the calling of the two functions into two parts: the
first initializes an AV outside of the spinlock, the second copies the AV
to cm_id_priv->av under the spinlock.
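The resulting shape is roughly this (simplified; the real code also transfers sgid_attr references, and the helper below is illustrative):

static int set_av_locked(struct cm_id_private *cm_id_priv,
			 struct sa_path_rec *path,
			 const struct ib_gid_attr *sgid_attr)
{
	struct cm_av av = {};
	int ret;

	ret = cm_init_av_by_path(path, sgid_attr, &av);	/* may sleep */
	if (ret)
		return ret;

	spin_lock_irq(&cm_id_priv->lock);
	cm_id_priv->av = av;	/* publish under the lock */
	spin_unlock_irq(&cm_id_priv->lock);
	return 0;
}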
Jason Gunthorpe [Wed, 2 Jun 2021 10:27:04 +0000 (13:27 +0300)]
IB/cm: Tidy remaining cm_msg free paths
Now that all the free paths are explicit, cm_free_msg() will only be
called for msgs allocated with cm_alloc_msg(), so we can assume the
context is set. Place it after the allocation function it is paired with,
for clarity.
Also remove a bogus NULL assignment in one place after a cancel. This did
nothing other than prevent completions from becoming events, and changing
the state already did that.
Jason Gunthorpe [Wed, 2 Jun 2021 10:27:02 +0000 (13:27 +0300)]
IB/cm: Split cm_alloc_msg()
This is being used with two quite different flows: one attaches the
message to the priv and the other does not.
Ensure the message attach is consistently done under the spinlock and
ensure that the free on error always detaches the message from the
cm_id_priv, also always under lock.
This makes read/write to the cm_id_priv->msg consistently locked and
consistently NULL'd when the message is freed, even in all error paths.
Jason Gunthorpe [Wed, 2 Jun 2021 10:27:01 +0000 (13:27 +0300)]
IB/cm: Pair cm_alloc_response_msg() with a cm_free_response_msg()
This is not a functional change, but it helps make the purpose of all the
cm_free_msg() calls clearer. In this case a response msg has a NULL
context[0], and is never placed in cm_id_priv->msg.
Leon Romanovsky [Wed, 19 May 2021 08:37:31 +0000 (11:37 +0300)]
RDMA/core: Sanitize WQ state received from the userspace
mlx4 and mlx5 implemented the WQ input checks differently. Instead of
duplicating the mlx4 logic in mlx5, prepare the input in one central
place.
The mlx5 implementation didn't check the validity of the state input. It
is not a real bug because the FW checks it, but it is still worth fixing.
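Conceptually the central check is tiny (illustrative, not the exact uverbs code):

/* Reject an out-of-range WQ state before any driver sees it. */
static int ib_check_wq_state(u32 wq_state)
{
	return wq_state > IB_WQS_ERR ? -EINVAL : 0;
}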
drivers/infiniband/ulp/rtrs/rtrs-clt.c:1786:19: warning: result of comparison of
constant 'MAX_SESS_QUEUE_DEPTH' (65536) with expression of type 'u16'
(aka 'unsigned short') is always false [-Wtautological-constant-out-of-range-compare]
To fix it, limit MAX_SESS_QUEUE_DEPTH to 65535, the u16 maximum, and
drop the check in rtrs-clt, since a u16 value can never exceed it.
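The removed check, in miniature:

#define MAX_SESS_QUEUE_DEPTH 65536	/* old value */

static bool depth_invalid(u16 queue_depth)
{
	/* a u16 tops out at 65535, so this was always false */
	return queue_depth > MAX_SESS_QUEUE_DEPTH;
}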
Shiraz Saleem [Fri, 21 May 2021 17:11:20 +0000 (10:11 -0700)]
i40e: Register auxiliary devices to provide RDMA
Convert i40e to use the auxiliary bus infrastructure to export
the RDMA functionality of the device to the RDMA driver.
Register i40e client auxiliary RDMA device on the auxiliary bus per
PCIe device function for the new auxiliary rdma driver (irdma) to
attach to.
The global i40e_register_client and i40e_unregister_client symbols
will be obsoleted once irdma replaces i40iw in the kernel
for the X722 device.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Shiraz Saleem [Fri, 21 May 2021 17:10:59 +0000 (10:10 -0700)]
i40e: Prep i40e header for aux bus conversion
Add the definitions to the i40e client header file in
preparation to convert i40e to use the new auxiliary bus
infrastructure. This header is shared between the 'i40e'
Intel networking driver providing RDMA support and the
'irdma' driver.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Dave Ertman [Thu, 20 May 2021 14:37:51 +0000 (09:37 -0500)]
ice: Register auxiliary device to provide RDMA
Register ice client auxiliary RDMA device on the auxiliary bus per
PCIe device function for the auxiliary driver (irdma) to attach to.
This allows a single RDMA driver (irdma) to work with multiple netdev
drivers over multiple generations of Intel HW that support RDMA.
There are no load ordering dependencies between ice and irdma.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Dave Ertman [Thu, 20 May 2021 14:37:49 +0000 (09:37 -0500)]
ice: Initialize RDMA support
Probe the device's capabilities to see if it supports RDMA. If so, allocate
and reserve resources to support its operation; populate structures with
initial values.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Gioh Kim [Fri, 28 May 2021 11:30:18 +0000 (13:30 +0200)]
RDMA/rtrs-clt: Fix memory leak of not-freed sess->stats and stats->pcpu_stats
sess->stats and sess->stats->pcpu_stats are freed when the sysfs entry
is removed. If something goes wrong and the session is closed before the
sysfs entry is created, these objects are not freed.
This patch adds freeing of them in three places:
1. When the client uses a wrong address and session creation fails.
2. When the client fails to create a sysfs entry.
3. When the client adds a wrong address via sysfs add_path.
Fixes: 215378b838df0 ("RDMA/rtrs: client: sysfs interface functions")
Link: https://lore.kernel.org/r/20210528113018.52290-21-jinpu.wang@ionos.com
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Md Haris Iqbal [Fri, 28 May 2021 11:30:17 +0000 (13:30 +0200)]
RDMA/rtrs-clt: Check if the queue_depth has changed during a reconnection
The queue_depth is a module parameter of rtrs_server. It is used on the
client side to determine the queue_depth of the request queue for the RNBD
virtual block device.
During a reconnection event for an already mapped device, if the
rtrs_server module queue_depth has changed, fail the reconnect attempt and
stop further automatic reconnection attempts. A manual reconnect via
sysfs has to be triggered.
Fixes: 6a98d71daea18 ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/20210528113018.52290-20-jinpu.wang@ionos.com
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The problem is that we increase the device refcount via get_device in
process_info_req for each path, but only call put_device for the last
path, which leads to a memory leak.
To fix it, also call put_device when dev_ref is not 0.
Fixes: e2853c49477d1 ("RDMA/rtrs-srv-sysfs: fix missing put_device")
Link: https://lore.kernel.org/r/20210528113018.52290-19-jinpu.wang@ionos.com
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Gioh Kim [Fri, 28 May 2021 11:30:15 +0000 (13:30 +0200)]
RDMA/rtrs-srv: Fix memory leak of unfreed rtrs_srv_stats object
When closing a session, currently the rtrs_srv_stats object in the
closing session is freed by kobject release. But if it failed
to create a session by various reasons, it must free the rtrs_srv_stats
object directly because kobject is not created yet.
This problem is found by kmemleak as below:
1. One client machine maps /dev/nullb0 with session name 'bla':
root@test1:~# echo "sessname=bla path=ip:192.168.122.190 \
device_path=/dev/nullb0" > /sys/devices/virtual/rnbd-client/ctl/map_device
2. Another machine failed to create a session with the same name 'bla':
root@test2:~# echo "sessname=bla path=ip:192.168.122.190 \
device_path=/dev/nullb1" > /sys/devices/virtual/rnbd-client/ctl/map_device
-bash: echo: write error: Connection reset by peer
Fixes: 39c2d639ca183 ("RDMA/rtrs-srv: Set .release function for rtrs srv device during device init")
Link: https://lore.kernel.org/r/20210528113018.52290-18-jinpu.wang@ionos.com
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Gioh Kim [Fri, 28 May 2021 11:30:14 +0000 (13:30 +0200)]
RDMA/rtrs-srv: Duplicated session name is not allowed
If two clients try to use the same session name, rtrs-server generates a
kernel error saying that it failed to create the sysfs entry because the
filename is duplicated.
This patch adds code to check whether the same session name already exists
with a different UUID. When a client adds a session it sends the UUID
along with the session name, so it is fine if the same session name
already exists with the same UUID. The rtrs-server must fail only if the
same session name exists with a different UUID.
Link: https://lore.kernel.org/r/20210528113018.52290-17-jinpu.wang@ionos.com
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Aleksei Marov <aleksei.marov@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>