Currently there isn't a clear layer separation between the counters and
the steering code, whereas the steering code is doing redundant access
to the counter struct.
Separate the fs.c and counters.c, where fs code won't access or be
aware of counter structs but only the objects it needs.
As a result, move mlx5_rdma_counter struct from the header file to be
an internal struct for the counters file only.
RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr
As part of this enhancement, allow the creation of an MKEY associated
with a DMA handle.
Additional notes:
MKEYs with TPH (i.e. TLP Processing Hints) attributes are currently not
UMR-capable unless explicitly enabled by firmware or hardware.
Therefore, to maintain such MKEYs in the MR cache, the TPH fields have
been added to the rb_key structure, with a dedicated hash bucket.
The ability to bypass the kernel verbs flow and create an MKEY with TPH
attributes using DEVX has been restricted. TPH must follow the standard
InfiniBand flow, where a DMAH is created with the appropriate security
checks and management mechanisms in place.
DMA handles are currently not supported in conjunction with On-Demand
Paging (ODP).
Re-registration of memory regions originally created with TPH attributes
is currently not supported.
RDMA/core: Introduce a DMAH object and its alloc/free APIs
Introduce a new DMA handle (DMAH) object along with its corresponding
allocation and deallocation APIs.
This DMAH object encapsulates attributes intended for use in DMA
transactions.
While its initial purpose is to support TPH functionality, it is
designed to be extensible for future features such as DMA PCI multipath,
PCI UIO configurations, PCI traffic class selection, and more.
Further details:
----------------
We ensure that a caller requesting a DMA handle for a specific CPU ID is
permitted to be scheduled on it. This prevent a potential security issue
where a non privilege user may trigger DMA operations toward a CPU that
it's not allowed to run on.
We manage reference counting for the DMAH object and its consumers
(e.g., memory regions) as will be detailed in subsequent patches in the
series.
IB/core: Add UVERBS_METHOD_REG_MR on the MR object
This new method enables us to use a single ioctl from user space which
supports the below variants of reg_mr [1].
The method will be extended in the next patches from the series with an
extra attribute to let us pass DMA handle to be used as part of the
registration.
Leon Romanovsky [Wed, 23 Jul 2025 05:38:56 +0000 (01:38 -0400)]
RDMA support for DMA handle
From Yishai:
This patch series introduces a new DMA Handle (DMAH) object, along with
corresponding APIs for its allocation and deallocation.
The DMAH object encapsulates attributes relevant for DMA transactions.
While initially intended to support TLP Processing Hints (TPH) [1], the
design is extensible to accommodate future features such as PCI
multipath for DMA, PCI UIO configurations, traffic class selection, and
more.
Additionally, we introduce a new ioctl method on the MR object:
UVERBS_METHOD_REG_MR.
This method consolidates multiple reg_mr variants under a single
user-space ioctl interface, supporting: ibv_reg_mr(), ibv_reg_mr_iova(),
ibv_reg_mr_iova2() and ibv_reg_dmabuf_mr(). It also enables passing a
DMA handle as part of the registration process.
Throughout the patch series, the following DMAH-related stuff can also
be observed in the IB layer:
- Association with a CPU ID and its memory type, for use with Steering
Tags [2].
- Inclusion of Processing Hints (PH) data for TPH functionality [3].
- Enforces security by ensuring that only tasks allowed to run on a
given CPU may request a DMA handle for it.
- Reference counting for DMAH life cycle management and safe usage
across memory regions.
mlx5 driver implementation:
--------------------------
The series includes implementation of the above functionality in the
mlx5 driver.
In mlx5_core:
- Enables TPH over PCIe when both firmware and OS support it.
- Manages Steering Tags and corresponding indices by writing tag values
to the PCI configuration space.
- Exposes APIs to upper layers (e.g., mlx5_ib) to enable the PCIe TPH
functionality.
In mlx5_ib:
- Adds full support for DMAH operations.
- Utilizes mlx5_core's Steering Tag APIs to derive tag indices from
input.
- Stores the resulting index in a mlx5_dmah structure for use during
MKEY creation with a DMA handle.
- Adds support for allowing MKEYs to be created in conjunction with DMA
handles.
Additional details are provided in the commit messages.
[1] Background, from PCIe specification 6.2.
TLP Processing Hints (TPH)
--------------------------
TLP Processing Hints is an optional feature that provides hints in
Request TLP headers to facilitate optimized processing of Requests that
target Memory Space. These Processing Hints enable the system hardware
(e.g., the Root Complex and/ or Endpoints) to optimize platform
resources such as system and memory interconnect on a per TLP basis.
Steering Tags are system-specific values used to identify a processing
resource that a Requester explicitly targets. System software discovers
and identifies TPH capabilities to determine the Steering Tag allocation
for each Function that supports TPH
[2] Steering Tags
Functions that intend to target a TLP towards a specific processing
resource such as a host processor or system cache hierarchy require
topological information of the target cache (e.g., which host cache).
Steering Tags are system-specific values that provide information about
the host or cache structure in the system cache hierarchy. These values
are used to associate processing elements within the platform with the
processing of Requests.
[3] Processing Hints
The Requester provides hints to the Root Complex or other targets about
the intended use of data and data structures by the host and/or device.
The hints are provided by the Requester, which has knowledge of upcoming
Request patterns, and which the Completer would not be able to deduce
autonomously (with good accuracy)
Yishai
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: Add support for device steering tag
net/mlx5: Expose IFC bits for TPH
PCI/TPH: Expose pcie_tph_get_st_table_size()
net/mlx5: Expose cable_length field in PFCC register
net/mlx5: Add IFC bits and enums for buf_ownership
net/mlx5: Add IFC bits to support RSS for IPSec offload
net/mlx5: IFC updates for disabled host PF
net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc
TLP Processing Hints (TPH)
--------------------------
TLP Processing Hints is an optional feature that provides hints in
Request TLP headers to facilitate optimized processing of Requests that
target Memory Space. These Processing Hints enable the system hardware
(e.g., the Root Complex and/or Endpoints) to optimize platform
resources such as system and memory interconnect on a per TLP basis.
Steering Tags are system-specific values used to identify a processing
resource that a Requester explicitly targets. System software discovers
and identifies TPH capabilities to determine the Steering Tag allocation
for each Function that supports TPH.
This patch adds steering tag support for mlx5 based NICs by:
- Enabling the TPH functionality over PCI if both FW and OS support it.
- Managing steering tags and their matching steering indexes by
writing a ST to an ST index over the PCI configuration space.
- Exposing APIs to upper layers (e.g.,mlx5_ib) to allow usage of
the PCI TPH infrastructure.
Further details:
- Upon probing of a device, the feature will be enabled based
on both capability detection and OS support.
- It will retrieve the appropriate ST for a given CPU ID and memory
type using the pcie_tph_get_cpu_st() API.
- It will track available ST indices according to the configuration
space table size (expected to be 63 entries), reserving index 0 to
indicate non-TPH use.
- It will assign a free ST index with a ST using the
pcie_tph_set_st_entry() API.
- It will reuse the same index for identical (CPU ID + memory type)
combinations by maintaining a reference count per entry.
- It will expose APIs to upper layers (e.g., mlx5_ib) to allow usage of
the PCI TPH infrastructure.
Leon Romanovsky [Sun, 20 Jul 2025 09:25:35 +0000 (12:25 +0300)]
RDMA/mlx5: Fix incorrect MKEY masking
mkey_mask is __be64 type, while MLX5_MKEY_MASK_PAGE_SIZE is declared as
unsigned long long. This causes to the static checkers errors:
drivers/infiniband/hw/mlx5/umr.c:663:49: warning: invalid assignment: &=
drivers/infiniband/hw/mlx5/umr.c:663:49: left side has type restricted __be64
drivers/infiniband/hw/mlx5/umr.c:663:49: right side has type int
Leon Romanovsky [Sun, 20 Jul 2025 09:25:34 +0000 (12:25 +0300)]
RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()
As Colin reported:
"The variable zapped_blocks is a size_t type and is being assigned a int
return value from the call to _mlx5r_umr_zap_mkey. Since zapped_blocks is an
unsigned type, the error check for zapped_blocks < 0 will never be true."
So separate return error and nblocks assignment.
Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size") Reported-by: Colin King (gmail) <colin.i.king@gmail.com> Closes: https://lore.kernel.org/all/79166fb1-3b73-4d37-af02-a17b22eb8e64@gmail.com Link: https://patch.msgid.link/71d8ea208ac7eaa4438af683b9afaed78625e419.1753003467.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
net/mlx5: Expose cable_length field in PFCC register
Introduce new "cable_length" field in PFCC register and related fields
to enhance rx buffer configuration management:
1. cable_length: Shifts cable length handling to fw by storing a
manually entered length from user in PFCC.cable_length
2. lane_rate_oper: In a case where PFCC.cable_length is not supported,
helps compute a default cable length
Jianbo Liu [Thu, 17 Jul 2025 06:48:13 +0000 (09:48 +0300)]
net/mlx5: Add IFC bits to support RSS for IPSec offload
This adds the capabilities, ipsec_next_header and inner/outer
l4_type_ext fields to support RSS for the decrypted packets.
These fields are specifically for firmware steering. HWS validation
logic is updated to correctly handle the changes, ensuring the
unsupported fields are not set.
Besides, reserved_at_c4 is fixed to reserved_at_d4 to reflect the
accurate offset within the structure.
Colin Ian King [Thu, 17 Jul 2025 11:21:08 +0000 (12:21 +0100)]
RDMA/mlx5: remove redundant check on err on return expression
Currently all paths that set err and then check it for an error
perform immediate returns, hence err always zero at the end of
the function _mlx5r_umr_zap_mkey. The return expression
err ? err : nblocks has a redundant check on the err since err
is always zero, so just return nblocks instead.
Add an option to create CQ using external memory instead of allocating
in the driver. The memory can be passed from userspace by dmabuf fd and
an offset or a VA. One of the possible usages is creating CQs that
reside in accelerator memory, allowing low latency asynchronous direct
polling from the accelerator device. Add a capability bit to reflect on
the feature support.
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250708202308.24783-4-mrgolin@amazon.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers
In some cases drivers may need to check if a given umem is contiguous.
Add a helper function in core code so that drivers don't need to deal
with umem or scatter-gather lists structure.
Additionally add a helper for getting umem's start DMA address and use
it in other helper functions that open code it.
RDMA/uverbs: Add a common way to create CQ with umem
Add ioctl command attributes and a common handling for the option to
create CQs with memory buffers passed from userspace. When required
attributes are supplied, create umem and provide it for driver's use.
The extension enables creation of CQs on top of preallocated CPU
virtual or device memory buffers, by supplying VA or dmabuf fd, in a
common way.
Drivers can support this flow by initializing a new create_cq_umem fp
field in their ops struct, with a function that can handle the new
parameter.
Edward Srouji [Wed, 9 Jul 2025 06:42:11 +0000 (09:42 +0300)]
RDMA/mlx5: Optimize DMABUF mkey page size
The current implementation of DMABUF memory registration uses a fixed
page size for the memory key (mkey), which can lead to suboptimal
performance when the underlying memory layout may offer better page
size.
The optimization improves performance by reducing the number of page
table entries required for the mkey, leading to less MTT/KSM descriptors
that the HCA must go through to find translations, fewer cache-lines,
and shorter UMR work requests on mkey updates such as when
re-registering or reusing a cacheable mkey.
To ensure safe page size updates, the implementation uses a 5-step
process:
1. Make the first X entries non-present, while X is calculated to be
minimal according to a large page shift that can be used to cover the
MR length.
2. Update the page size to the large supported page size
3. Load the remaining N-X entries according to the (optimized)
page shift
4. Update the page size according to the (optimized) page shift
5. Load the first X entries with the correct translations
This ensures that at no point is the MR accessible with a partially
updated translation table, maintaining correctness and preventing
access to stale or inconsistent mappings, such as having an mkey
advertising the new page size while some of the underlying page table
entries still contain the old page size translations.
RDMA/mlx5: Align mkc page size capability check to PRM
Align the capabilities checked when using the log_page_size 6th bit in the
mkey context to the PRM definition. The upper and lower bounds are set by
max/min caps, and modification of the 6th bit by UMR is allowed only when
a specific UMR cap is set.
Current implementation falsely assumes all page sizes up-to 2^63 are
supported when the UMR cap is set. In case the upper bound cap is lower
than 63, this might result a FW syndrome on mkey creation, e.g:
mlx5_core 0000:c1:00.0: mlx5_cmd_out_err:832:(pid 0): CREATE_MKEY(0×200) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0×38a711), err(-22)
Previous cap enforcement is still correct for all current HW, FW and
driver combinations. However, this patch aligns the code to be PRM
compliant in the general case.
Leon Romanovsky [Sun, 13 Jul 2025 07:00:21 +0000 (03:00 -0400)]
Optimize DMABUF mkey page size in mlx5
From Edward:
This patch series enables the mlx5 driver to dynamically choose the
optimal page size for a DMABUF-based memory key (mkey), rather than
always registering with a fixed page size.
Previously, DMABUF memory registration used a fixed 4K page size for
mkeys which could lead to suboptimal performance when the underlying
memory layout may offer better page sizes.
This approach did not take advantage of larger page size capabilities
advertised by the HCA, and the driver was not setting the proper page
size mask in the mkey mask when performing page size changes,
potentially
leading to invalid registrations when updating to a very large pages.
This series improves DMABUF performance by dynamically selecting the
best page size for a given memory region (MR) both at creation time and
on page fault occurrences, based on the underlying layout and fixing
related gaps and bugs.
By doing so, we reduce the number of page table entries (and thus MTT/
KSM descriptors) that the HCA must traverse, which in turn reduces
cache-line fetches.
Thanks
* mlx5-next:
RDMA/mlx5: Fix UMR modifying of mkey page size
net/mlx5: Expose HCA capability bits for mkey max page size
Edward Srouji [Wed, 9 Jul 2025 06:42:09 +0000 (09:42 +0300)]
RDMA/mlx5: Fix UMR modifying of mkey page size
When changing the page size on an mkey, the driver needs to set the
appropriate bits in the mkey mask to indicate which fields are being
modified.
The 6th bit of a page size in mlx5 driver is considered an extension,
and this bit has a dedicated capability and mask bits.
Previously, the driver was not setting this mask in the mkey mask when
performing page size changes, regardless of its hardware support,
potentially leading to an incorrect page size updates.
This fixes the issue by setting the relevant bit in the mkey mask when
performing page size changes on an mkey and the 6th bit of this field is
supported by the hardware.
net/mlx5: Expose HCA capability bits for mkey max page size
Expose the HCA capability for maximal page size that can be configured
for an mkey. Used for enforcing capabilities when working with highly
contiguous memory and using large page sizes.
The call to rdma_uattrs_has_raw_cap() is placed in mlx5 fs.c file,
which is compiled without relation to CONFIG_INFINIBAND_USER_ACCESS.
Despite the check is used only in flows with CONFIG_INFINIBAND_USER_ACCESS=y|m,
the compilers generate the following error for CONFIG_INFINIBAND_USER_ACCESS=n
builds.
Vlad Dumitrescu [Mon, 30 Jun 2025 10:16:44 +0000 (13:16 +0300)]
IB/cm: Use separate agent w/o flow control for REP
Most responses (e.g., RTU) are not subject to flow control, as there is
no further response expected. However, REPs are both requests (waiting
for RTUs) and responses (being waited by REQs).
With agent-level flow control added to the MAD layer, REPs can get
delayed by outstanding REQs. This can cause a problem in a scenario
such as 2 hosts connecting to each other at the same time. Both hosts
fill the flow control outstanding slots with REQs. The corresponding
REPs are now blocked behind those REQs, and neither side can make
progress until REQs time out.
Add a separate MAD agent which is only used to send REPs. This agent
does not have a recv_handler as it doesn't process responses nor does it
register to receive requests. Disable flow control for agents w/o a
recv_handler, as they aren't waiting for responses. This allows the
newly added REP agent to send even when clients are slow to generate
RTU, which would be needed to unblock flow control outstanding slots.
Relax check in ib_post_send_mad to allow retries for this agent. REPs
will be retried by the MAD layer until CM layer receives a response
(e.g., RTU) on the normal agent and cancels them.
Suggested-by: Sean Hefty <shefty@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://patch.msgid.link/9ac12d0842b849e2c8537d6e291ee0af9f79855c.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
Or Har-Toov [Mon, 30 Jun 2025 10:16:43 +0000 (13:16 +0300)]
IB/mad: Add flow control for solicited MADs
Currently, MADs sent via an agent are being forwarded directly to the
corresponding MAD QP layer.
MADs with a timeout value set and requiring a response (solicited MADs)
will be resent if the timeout expires without receiving a response.
In a congested subnet, flooding MAD QP layer with more solicited send
requests from the agent will only worsen the situation by triggering
more timeouts and therefore more retries.
Thus, add flow control for non-user solicited MADs to block agents from
issuing new solicited MAD requests to the MAD QP until outstanding
requests are completed and the MAD QP is ready to process additional
requests. While at it, keep track of the total outstanding solicited
MAD work requests in send or wait list. The number of outstanding send
WRs will be limited by a fraction of the RQ size, and any new send WR
that exceeds that limit will be held in a backlog list.
Backlog MADs will be forwarded to agent send list only once the total
number of outstanding send WRs falls below the limit.
Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are
not subject to this flow control mechanism and will not be affected by
this change.
For this purpose, a new state is introduced:
- 'IB_MAD_STATE_QUEUED': MAD is in backlog list
Or Har-Toov [Mon, 30 Jun 2025 10:16:42 +0000 (13:16 +0300)]
IB/mad: Add state machine to MAD layer
Replace the use of refcount, timeout and status with a 'state'
field to track the status of MADs send work requests (WRs).
The state machine better represents the stages in the MAD lifecycle,
specifically indicating whether the MAD is waiting for a completion,
waiting for a response, was canceld or is done.
The existing refcount only takes two values:
1 : MAD is waiting either for completion or for response.
2 : MAD is waiting for both response and completion. Also when a
response was received before a completion notification.
The status field represents if the MAD was canceled at some point
in the flow.
The timeout is used to represent if a response was received.
The current state transitions are not clearly visible, and developers
needs to infer the state from the refcount's, timeout's or status's
value, which is error-prone and difficult to follow.
Thus, replace with a state machine as the following:
- 'IB_MAD_STATE_INIT': MAD is in the making and is not yet in any list
- 'IB_MAD_STATE_SEND_START': MAD was sent to the QP and is waiting for
completion notification in send list
- 'IB_MAD_STATE_WAIT_RESP': MAD send completed successfully, waiting for
a response in wait list
- 'IB_MAD_STATE_EARLY_RESP': Response came early, before send
completion notification, MAD is in the send list
- 'IB_MAD_STATE_CANCELED': MAD was canceled while in send or wait list
- 'IB_MAD_STATE_DONE': MAD processing completed, MAD is in no list
Adding the state machine also make it possible to remove the double
call for ib_mad_complete_send_wr in case of an early response and the
use of a done list in case of a regular response.
While at it, define a helper to clear error MADs which will handle
freeing MADs that timed out or have been cancelled.
bnxt_qplib_put_sges is calculating the length in
a signed int. So handling the 2G message size
is not working since it is considered as negative.
Use a unsigned number to calculate the total message
length. As per the spec, IB message size shall be
between zero and 2^31 bytes (inclusive). So adding
a check for 2G message size.
hr_dev->pgdir_list and hr_dev->pgdir_mutex won't be initialized if
CQ/QP record db are not enabled, but they are also needed when using
SRQ with SRQ record db enabled. Simplified the logic by always
initailizing the reosurces.
ACK_REQ_FREQ indicates the number of packets (after MTU fragmentation)
HW sends before setting an ACK request. When MTU is greater than or
equal to 1024, the current ACK_REQ_FREQ value causes HW to request an
ACK for every MTU fragment. The processing of a large number of ACKs
severely impacts HW performance when sending large size payloads.
Get message length of ack_req from FW so that we can adjust this
parameter according to different situations. There are several
constraints for ACK_REQ_FREQ:
1. mtu * (2 ^ ACK_REQ_FREQ) should not be too large, otherwise it may
cause some unexpected retries when sending large payload.
2. ACK_REQ_FREQ should be larger than or equal to LP_PKTN_INI.
3. ACK_REQ_FREQ must be equal to LP_PKTN_INI when using LDCP
or HC3 congestion control algorithm.
RDMA/hns: Fix HW configurations not cleared in error flow
hns_roce_clear_extdb_list_info() will eventually do some HW
configurations through FW, and they need to be cleared by
calling hns_roce_function_clear() when the initialization
fails.
Fixes: 7e78dd816e45 ("RDMA/hns: Clear extended doorbell info before using") Signed-off-by: wenglianfa <wenglianfa@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-3-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
rsv_qp may be double destroyed in error flow, first in free_mr_init(),
and then in hns_roce_exit(). Fix it by moving the free_mr_init() call
into hns_roce_v2_init().
Parav Pandit [Thu, 26 Jun 2025 18:58:12 +0000 (21:58 +0300)]
RDMA/counter: Check CAP_NET_RAW check in user namespace for RDMA counters
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:11 +0000 (21:58 +0300)]
RDMA/nldev: Check CAP_NET_RAW in user namespace for QP modify
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to modify
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:10 +0000 (21:58 +0300)]
RDMA/mlx5: Check CAP_NET_RAW in user namespace for devx create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the devx object.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:09 +0000 (21:58 +0300)]
RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:08 +0000 (21:58 +0300)]
RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:07 +0000 (21:58 +0300)]
RDMA/uverbs: Check CAP_NET_RAW in user namespace for QP create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:06 +0000 (21:58 +0300)]
RDMA/mlx5: Check CAP_NET_RAW in user namespace for anchor create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the anchor.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:05 +0000 (21:58 +0300)]
RDMA/mlx5: Check CAP_NET_RAW in user namespace for flow create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the flow.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Parav Pandit [Thu, 26 Jun 2025 18:58:04 +0000 (21:58 +0300)]
RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the flow resource.
Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.
Mark Bloch [Tue, 17 Jun 2025 08:44:03 +0000 (11:44 +0300)]
RDMA/ipoib: Use parent rdma device net namespace
Use the net namespace of the underlying rdma device.
After honoring the rdma device's namespace, the ipoib
netdev now also runs in the same net namespace of the
rdma device.
Add an API to read the net namespace of the rdma device
so that ULP such as IPoIB can use it to initialize its
netdev.
Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Mark Bloch [Tue, 17 Jun 2025 08:44:02 +0000 (11:44 +0300)]
RDMA/mlx5: Allocate IB device with net namespace supplied from core dev
Use the new ib_alloc_device_with_net() API to allocate the IB device
so that it is properly bound to the network namespace obtained via
mlx5_core_net(). This change ensures correct namespace association
(e.g., for containerized setups).
Additionally, expose mlx5_core_net so that RDMA driver can use it.
Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Mark Bloch [Tue, 17 Jun 2025 08:44:01 +0000 (11:44 +0300)]
RDMA/core: Extend RDMA device registration to be net namespace aware
Presently, RDMA devices are always registered within the init network
namespace, even if the associated devlink device's namespace was
changed via a devlink reload. This mismatch leads to discrepancies
between the network namespace of the devlink device and that of the
RDMA device.
Therefore, extend the RDMA device allocation API to optionally take
the net namespace. This isn't limited to devices that support devlink
but allows all users to provide the network namespace if they need to
do so.
If a network namespace is provided during device allocation, it's up
to the caller to make sure the namespace stays valid until
ib_register_device() is called.
Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The real problem here is the array of kvec structures in siw_tx_hdt that
makes up the majority of the consumed stack space.
Ideally there would be a way to avoid allocating the array on the
stack, but that would require a larger rework. Add a noinline_for_stack
annotation to avoid the warning for now, and make clang behave the same
way as gcc here. The combined stack usage is still similar, but is spread
over multiple functions now.
The function protects the for loop with affinity->num_core_siblings > 0
condition, which is redundant because the loop will break immediately in
that case.
RDMA: hfi1: use rounddown in find_hw_thread_mask()
num_cores_per_socket is calculated by dividing by
node_affinity.num_online_nodes, but all users of this variable multiply
it by node_affinity.num_online_nodes again. This effectively is the same
as rounding it down by node_affinity.num_online_nodes.
The function opencodes cpumask_nth() and cpumask_clear_cpus(). The
dedicated helpers are easier to use and usually much faster than
opencoded for-loops.
The function opencodes cpumask_nth() and cpumask_clear_cpus(). The
dedicated helpers are easier to use and usually much faster than
opencoded for-loops.
RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()
The function divides number of online CPUs by num_core_siblings, and
later checks the divider by zero. This implies a possibility to get
and divide-by-zero runtime error. Fix it by moving the check prior to
division. This also helps to save one indentation level.
net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
RDMA TRANSPORT domains were initially limited to a single priority.
This change allows the domains to have multiple priorities, making
it possible to add several rules and control the order in which
they're evaluated.
Mark Zhang [Mon, 16 Jun 2025 08:26:20 +0000 (11:26 +0300)]
RDMA/core: Add driver APIs pre_destroy_cq() and post_destroy_cq()
Currently in ib_free_cq, it disables IRQ or cancel the CQ work before
driver destroy_cq. This isn't good as a new IRQ or a CQ work can be
submitted immediately after disabling IRQ or canceling CQ work, which
may run concurrently with destroy_cq and cause crashes.
The right flow should be:
1. Driver disables CQ to make sure no new CQ event will be submitted;
2. Disables IRQ or Cancels CQ work in core layer, to make sure no CQ
polling work is running;
3. Free all resources to destroy the CQ.
This patch adds 2 driver APIs:
- pre_destroy_cq(): Disable a CQ to prevent it from generating any new
work completions, but not free any kernel resources;
- post_destroy_cq(): Free all kernel resources.
In ib_free_cq, the IRQ is disabled or CQ work is canceled after
pre_destroy_cq, and before post_destroy_cq.
The qib Infiniband device has long been decommissioned from distros and
unsupported by the company. We have kept it going in a compile only mode
for a while now and it's time to remove the driver. It has a number of
issues that are never going to be addressed and is a hindrance to
improving hfi1 and rdmavt.
Linus Torvalds [Sun, 15 Jun 2025 16:14:27 +0000 (09:14 -0700)]
Merge tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Move warnings about linux/export.h from W=1 to W=2
- Fix structure type overrides in gendwarfksyms
* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
gendwarfksyms: Fix structure type overrides
kbuild: move warnings about linux/export.h from W=1 to W=2
Sami Tolvanen [Sat, 14 Jun 2025 00:55:33 +0000 (00:55 +0000)]
gendwarfksyms: Fix structure type overrides
As we always iterate through the entire die_map when expanding
type strings, recursively processing referenced types in
type_expand_child() is not actually necessary. Furthermore,
the type_string kABI rule added in commit c9083467f7b9
("gendwarfksyms: Add a kABI rule to override type strings") can
fail to override type strings for structures due to a missing
kabi_get_type_string() check in this function.
Fix the issue by dropping the unnecessary recursion and moving
the override check to type_expand(). Note that symbol versions
are otherwise unchanged with this patch.
Fixes: c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings") Reported-by: Giuliano Procida <gprocida@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Linus Torvalds [Sat, 14 Jun 2025 16:25:22 +0000 (09:25 -0700)]
Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a deadlock on queue freeze with zoned writes
- Fix for zoned append emulation
- Two bio folio fixes, for sparsemem and for very large folios
- Fix for a performance regression introduced in 6.13 when plug
insertion was changed
- Fix for NVMe passthrough handling for polled IO
- Document the ublk auto registration feature
- loop lockdep warning fix
* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze
Linus Torvalds [Sat, 14 Jun 2025 15:44:54 +0000 (08:44 -0700)]
Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for a race between SQPOLL exit and fdinfo reading.
It's slim and I was only able to reproduce this with an artificial
delay in the kernel. Followup sparse fix as well to unify the access
to ->thread.
- Fix for multiple buffer peeking, avoiding truncation if possible.
- Run local task_work for IOPOLL reaping when the ring is exiting.
This currently isn't done due to an assumption that polled IO will
never need task_work, but a fix on the block side is going to change
that.
* tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux:
io_uring: run local task_work from ring exit IOPOLL reaping
io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
io_uring: consistently use rcu semantics with sqpoll thread
io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
Linus Torvalds [Sat, 14 Jun 2025 15:18:09 +0000 (08:18 -0700)]
Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues
or aren't considered necessary for -stable kernels. Only 4 are for MM"
* tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: add mmap_prepare() compatibility layer for nested file systems
init: fix build warnings about export.h
MAINTAINERS: add Barry as a THP reviewer
drivers/rapidio/rio_cm.c: prevent possible heap overwrite
mm: close theoretical race where stale TLB entries could linger
mm/vma: reset VMA iterator on commit_merge() OOM failure
docs: proc: update VmFlags documentation in smaps
scatterlist: fix extraneous '@'-sign kernel-doc notation
selftests/mm: skip failed memfd setups in gup_longterm
Linus Torvalds [Fri, 13 Jun 2025 23:49:39 +0000 (16:49 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"All fixes for drivers.
The core change in the error handler is simply to translate an ALUA
specific sense code into a retry the ALUA components can handle and
won't impact any other devices"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: error: alua: I/O errors for ALUA state transitions
scsi: storvsc: Increase the timeouts to storvsc_timeout
scsi: s390: zfcp: Ensure synchronous unit_add
scsi: iscsi: Fix incorrect error path labels for flashnode operations
scsi: mvsas: Fix typos in per-phy comments and SAS cmd port registers
scsi: core: ufs: Fix a hang in the error handler
Linus Torvalds [Fri, 13 Jun 2025 23:27:27 +0000 (16:27 -0700)]
Merge tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Quiet week, only two pull requests came my way, xe has a couple of
fixes and then a bunch of fixes across the board, vc4 probably fixes
the biggest problem:
vc4:
- Fix infinite EPROBE_DEFER loop in vc4 probing
amdxdna:
- Fix amdxdna firmware size
meson:
- modesetting fixes
sitronix:
- Kconfig fix for st7171-i2c
dma-buf:
- Fix -EBUSY WARN_ON_ONCE in dma-buf
udmabuf:
- Use dma_sync_sgtable_for_cpu in udmabuf
xe:
- Fix regression disallowing 64K SVM migration
- Use a bounce buffer for WA BB"
* tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/lrc: Use a temporary buffer for WA BB
udmabuf: use sgtable-based scatterlist wrappers
dma-buf: fix compare in WARN_ON_ONCE
drm/sitronix: st7571-i2c: Select VIDEOMODE_HELPERS
drm/meson: fix more rounding issues with 59.94Hz modes
drm/meson: use vclk_freq instead of pixel_freq in debug print
drm/meson: fix debug log statement when setting the HDMI clocks
drm/vc4: fix infinite EPROBE_DEFER loop
drm/xe/svm: Fix regression disallowing 64K SVM migration
accel/amdxdna: Fix incorrect PSP firmware size
Jens Axboe [Fri, 13 Jun 2025 21:24:53 +0000 (15:24 -0600)]
io_uring: run local task_work from ring exit IOPOLL reaping
In preparation for needing to shift NVMe passthrough to always use
task_work for polled IO completions, ensure that those are suitably
run at exit time. See commit:
9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work")
Jens Axboe [Fri, 13 Jun 2025 19:37:41 +0000 (13:37 -0600)]
nvme: always punt polled uring_cmd end_io work to task_work
Currently NVMe uring_cmd completions will complete locally, if they are
polled. This is done because those completions are always invoked from
task context. And while that is true, there's no guarantee that it's
invoked under the right ring context, or even task. If someone does
NVMe passthrough via multiple threads and with a limited number of
poll queues, then ringA may find completions from ringB. For that case,
completing the request may not be sound.
Always just punt the passthrough completions via task_work, which will
redirect the completion, if needed.
Cc: stable@vger.kernel.org Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 13 Jun 2025 20:39:15 +0000 (13:39 -0700)]
Merge tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix an ACPI APEI error injection driver failure that started to
occur after switching it over to using a faux device, address an EC
driver issue related to invalid ECDT tables, clean up the usage of
mwait_idle_with_hints() in the ACPI PAD driver, add a new IRQ override
quirk, and fix a NULL pointer dereference related to nosmp:
- Update the faux device handling code in the driver core and address
an ACPI APEI error injection driver failure that started to occur
after switching it over to using a faux device on top of that (Dan
Williams)
- Update data types of variables passed as arguments to
mwait_idle_with_hints() in the ACPI PAD (processor aggregator
device) driver to match the function definition after recent
changes (Uros Bizjak)
- Fix a NULL pointer dereference in the ACPI CPPC library that occurs
when nosmp is passed to the kernel in the command line (Yunhui Cui)
- Ignore ECDT tables with an invalid ID string to prevent using an
incorrect GPE for signaling events on some systems (Armin Wolf)
- Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan)"
* tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: resource: Use IRQ override on MACHENIKE 16P
ACPI: EC: Ignore ECDT tables with an invalid ID string
ACPI: CPPC: Fix NULL pointer dereference when nosmp is used
ACPI: PAD: Update arguments of mwait_idle_with_hints()
ACPI: APEI: EINJ: Do not fail einj_init() on faux_device_create() failure
driver core: faux: Quiet probe failures
driver core: faux: Suppress bind attributes
Linus Torvalds [Fri, 13 Jun 2025 20:27:41 +0000 (13:27 -0700)]
Merge tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix the cpupower utility installation, fix up the recently added
Rust abstractions for cpufreq and OPP, restore the x86 update
eliminating mwait_play_dead_cpuid_hint() that has been reverted during
the 6.16 merge window along with preventing the failure caused by it
from happening, and clean up mwait_idle_with_hints() usage in
intel_idle:
- Implement CpuId Rust abstraction and use it to fix doctest failure
related to the recently introduced cpumask abstraction (Viresh
Kumar)
- Do minor cleanups in the `# Safety` sections for cpufreq
abstractions added recently (Viresh Kumar)
- Unbreak cpupower systemd service units installation on some systems
by adding a unitdir variable for specifying the location to install
them (Francesco Poli)
- Eliminate mwait_play_dead_cpuid_hint() again after reverting its
elimination during the 6.16 merge window due to a problem with
handling "dead" SMT siblings, but this time prevent leaving them in
C1 after initialization by taking them online and back offline when
a proper cpuidle driver for the platform has been registered
(Rafael Wysocki)
- Update data types of variables passed as arguments to
mwait_idle_with_hints() to match the function definition after
recent changes (Uros Bizjak)"
* tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
rust: cpu: Add CpuId::current() to retrieve current CPU ID
rust: Use CpuId in place of raw CPU numbers
rust: cpu: Introduce CpuId abstraction
intel_idle: Update arguments of mwait_idle_with_hints()
cpufreq: Convert `/// SAFETY` lines to `# Safety` sections
cpupower: split unitdir from libdir in Makefile
Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
ACPI: processor: Rescan "dead" SMT siblings during initialization
intel_idle: Rescan "dead" SMT siblings during initialization
x86/smp: PM/hibernate: Split arch_resume_nosmt()
intel_idle: Use subsys_initcall_sync() for initialization
Merge branches 'acpi-pad', 'acpi-cppc', 'acpi-ec' and 'acpi-resource'
Merge assorted ACPI updates for 6.16-rc2:
- Update data types of variables passed as arguments to
mwait_idle_with_hints() in the ACPI PAD (processor aggregator device)
driver to match the function definition after recent changes (Uros
Bizjak).
- Fix a NULL pointer dereference in the ACPI CPPC library that occurs
when nosmp is passed to the kernel in the command line (Yunhui Cui).
- Ignore ECDT tables with an invalid ID string to prevent using an
incorrect GPE for signaling events on some systems (Armin Wolf).
- Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan).
* acpi-pad:
ACPI: PAD: Update arguments of mwait_idle_with_hints()
* acpi-cppc:
ACPI: CPPC: Fix NULL pointer dereference when nosmp is used
* acpi-ec:
ACPI: EC: Ignore ECDT tables with an invalid ID string
* acpi-resource:
ACPI: resource: Use IRQ override on MACHENIKE 16P
- Update data types of variables passed as arguments to
mwait_idle_with_hints() to match the function definition
after recent changes (Uros Bizjak).
- Eliminate mwait_play_dead_cpuid_hint() again after reverting its
elimination during the merge window due to a problem with handling
"dead" SMT siblings, but this time prevent leaving them in C1 after
initialization by taking them online and back offline when a proper
cpuidle driver for the platform has been registered (Rafael Wysocki).
* pm-cpuidle:
intel_idle: Update arguments of mwait_idle_with_hints()
Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
ACPI: processor: Rescan "dead" SMT siblings during initialization
intel_idle: Rescan "dead" SMT siblings during initialization
x86/smp: PM/hibernate: Split arch_resume_nosmt()
intel_idle: Use subsys_initcall_sync() for initialization
Linus Torvalds [Fri, 13 Jun 2025 18:01:44 +0000 (11:01 -0700)]
Merge tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A collection of driver specific fixes, most minor apart from the OMAP
ones which disable some recent performance optimisations in some
non-standard cases where we could start driving the bus incorrectly.
The change to the stm32-ospi driver to use the newer reset APIs is a
fix for interactions with other IP sharing the same reset line in some
SoCs"
* tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine
spi: stm32-ospi: clean up on error in probe()
spi: stm32-ospi: Make usage of reset_control_acquire/release() API
spi: offload: check offload ops existence before disabling the trigger
spi: spi-pci1xxxx: Fix error code in probe
spi: loongson: Fix build warnings about export.h
spi: omap2-mcspi: Disable multi-mode when the previous message kept CS asserted
spi: omap2-mcspi: Disable multi mode when CS should be kept asserted after message
Linus Torvalds [Fri, 13 Jun 2025 18:00:19 +0000 (11:00 -0700)]
Merge tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One minor fix for a leak in the DT parsing code in the max20086 driver"
* tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: max20086: Fix refcount leak in max20086_parse_regulators_dt()