git.ipfire.org Git - thirdparty/linux.git/log

RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()

As Colin reported:
"The variable zapped_blocks is a size_t type and is being assigned a int
return value from the call to _mlx5r_umr_zap_mkey. Since zapped_blocks is an
unsigned type, the error check for zapped_blocks < 0 will never be true."

So separate return error and nblocks assignment.

Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Reported-by: Colin King (gmail) <colin.i.king@gmail.com>
Closes: https://lore.kernel.org/all/79166fb1-3b73-4d37-af02-a17b22eb8e64@gmail.com
Link: https://patch.msgid.link/71d8ea208ac7eaa4438af683b9afaed78625e419.1753003467.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/mlx5: remove redundant check on err on return expression

Currently all paths that set err and then check it for an error
perform immediate returns, hence err always zero at the end of
the function _mlx5r_umr_zap_mkey. The return expression
err ? err : nblocks has a redundant check on the err since err
is always zero, so just return nblocks instead.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250717112108.4036171-1-colin.i.king@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mana_ib: add additional port counters

Add packet and request port counters to mana_ib.

Signed-off-by: Zhiyue Qiu <zhiyueqiu@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1752143395-5324-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mana_ib: Fix DSCP value in modify QP

Convert the traffic_class in GRH to a DSCP value as required by the HW.

Fixes: e095405b45bb ("RDMA/mana_ib: Modify QP state")
Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1752143085-4169-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/efa: Add CQ with external memory support

Add an option to create CQ using external memory instead of allocating
in the driver. The memory can be passed from userspace by dmabuf fd and
an offset or a VA. One of the possible usages is creating CQs that
reside in accelerator memory, allowing low latency asynchronous direct
polling from the accelerator device. Add a capability bit to reflect on
the feature support.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250708202308.24783-4-mrgolin@amazon.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers

In some cases drivers may need to check if a given umem is contiguous.
Add a helper function in core code so that drivers don't need to deal
with umem or scatter-gather lists structure.
Additionally add a helper for getting umem's start DMA address and use
it in other helper functions that open code it.

Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250708202308.24783-3-mrgolin@amazon.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Add a common way to create CQ with umem

Add ioctl command attributes and a common handling for the option to
create CQs with memory buffers passed from userspace. When required
attributes are supplied, create umem and provide it for driver's use.
The extension enables creation of CQs on top of preallocated CPU
virtual or device memory buffers, by supplying VA or dmabuf fd, in a
common way.
Drivers can support this flow by initializing a new create_cq_umem fp
field in their ops struct, with a function that can handle the new
parameter.

Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250708202308.24783-2-mrgolin@amazon.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Optimize DMABUF mkey page size

The current implementation of DMABUF memory registration uses a fixed
page size for the memory key (mkey), which can lead to suboptimal
performance when the underlying memory layout may offer better page
size.

The optimization improves performance by reducing the number of page
table entries required for the mkey, leading to less MTT/KSM descriptors
that the HCA must go through to find translations, fewer cache-lines,
and shorter UMR work requests on mkey updates such as when
re-registering or reusing a cacheable mkey.

To ensure safe page size updates, the implementation uses a 5-step
process:
1. Make the first X entries non-present, while X is calculated to be
   minimal according to a large page shift that can be used to cover the
   MR length.
2. Update the page size to the large supported page size
3. Load the remaining N-X entries according to the (optimized)
   page shift
4. Update the page size according to the (optimized) page shift
5. Load the first X entries with the correct translations

This ensures that at no point is the MR accessible with a partially
updated translation table, maintaining correctness and preventing
access to stale or inconsistent mappings, such as having an mkey
advertising the new page size while some of the underlying page table
entries still contain the old page size translations.

Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/bc05a6b2142c02f96a90635f9a4458ee4bbbf39f.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Align mkc page size capability check to PRM

Align the capabilities checked when using the log_page_size 6th bit in the
mkey context to the PRM definition. The upper and lower bounds are set by
max/min caps, and modification of the 6th bit by UMR is allowed only when
a specific UMR cap is set.
Current implementation falsely assumes all page sizes up-to 2^63 are
supported when the UMR cap is set. In case the upper bound cap is lower
than 63, this might result a FW syndrome on mkey creation, e.g:
mlx5_core 0000:c1:00.0: mlx5_cmd_out_err:832:(pid 0): CREATE_MKEY(0×200) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0×38a711), err(-22)

Previous cap enforcement is still correct for all current HW, FW and
driver combinations. However, this patch aligns the code to be PRM
compliant in the general case.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/eab4eeb4785105a4bb5eb362dc0b3662cd840412.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Optimize DMABUF mkey page size in mlx5

From Edward:

This patch series enables the mlx5 driver to dynamically choose the
optimal page size for a DMABUF-based memory key (mkey), rather than
always registering with a fixed page size.

Previously, DMABUF memory registration used a fixed 4K page size for
mkeys which could lead to suboptimal performance when the underlying
memory layout may offer better page sizes.

This approach did not take advantage of larger page size capabilities
advertised by the HCA, and the driver was not setting the proper page
size mask in the mkey mask when performing page size changes,
potentially
leading to invalid registrations when updating to a very large pages.

This series improves DMABUF performance by dynamically selecting the
best page size for a given memory region (MR) both at creation time and
on page fault occurrences, based on the underlying layout and fixing
related gaps and bugs.

By doing so, we reduce the number of page table entries (and thus MTT/
KSM descriptors) that the HCA must traverse, which in turn reduces
cache-line fetches.

Thanks

* mlx5-next:
RDMA/mlx5: Fix UMR modifying of mkey page size
net/mlx5: Expose HCA capability bits for mkey max page size

Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Fix UMR modifying of mkey page size

When changing the page size on an mkey, the driver needs to set the
appropriate bits in the mkey mask to indicate which fields are being
modified.
The 6th bit of a page size in mlx5 driver is considered an extension,
and this bit has a dedicated capability and mask bits.

Previously, the driver was not setting this mask in the mkey mask when
performing page size changes, regardless of its hardware support,
potentially leading to an incorrect page size updates.

This fixes the issue by setting the relevant bit in the mkey mask when
performing page size changes on an mkey and the 6th bit of this field is
supported by the hardware.

Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Expose HCA capability bits for mkey max page size

Expose the HCA capability for maximal page size that can be configured
for an mkey. Used for enforcing capabilities when working with highly
contiguous memory and using large page sizes.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/3e4d3fda37934430f65f72601519e22bf396fd05.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Add empty rdma_uattrs_has_raw_cap() declaration

The call to rdma_uattrs_has_raw_cap() is placed in mlx5 fs.c file,
which is compiled without relation to CONFIG_INFINIBAND_USER_ACCESS.

Despite the check is used only in flows with CONFIG_INFINIBAND_USER_ACCESS=y|m,
the compilers generate the following error for CONFIG_INFINIBAND_USER_ACCESS=n
builds.

>> ERROR: modpost: "rdma_uattrs_has_raw_cap" [drivers/infiniband/hw/mlx5/mlx5_ib.ko] undefined!

Fixes: f458ccd2aa2c ("RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507080725.bh7xrhpg-lkp@intel.com/
Link: https://patch.msgid.link/72dee6b379bd709255a5d8e8010b576d50e47170.1751967071.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/efa: Add Network HW statistics counters

Update device API and request network counters. Expose newly added
counters through ib core counters mechanism.

Reviewed-by: David Shoolman <shoolman@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Basel Nassar <baselna@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250706070740.22534-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/cm: Use separate agent w/o flow control for REP

Most responses (e.g., RTU) are not subject to flow control, as there is
no further response expected.  However, REPs are both requests (waiting
for RTUs) and responses (being waited by REQs).

With agent-level flow control added to the MAD layer, REPs can get
delayed by outstanding REQs.  This can cause a problem in a scenario
such as 2 hosts connecting to each other at the same time.  Both hosts
fill the flow control outstanding slots with REQs.  The corresponding
REPs are now blocked behind those REQs, and neither side can make
progress until REQs time out.

Add a separate MAD agent which is only used to send REPs.  This agent
does not have a recv_handler as it doesn't process responses nor does it
register to receive requests.  Disable flow control for agents w/o a
recv_handler, as they aren't waiting for responses.  This allows the
newly added REP agent to send even when clients are slow to generate
RTU, which would be needed to unblock flow control outstanding slots.

Relax check in ib_post_send_mad to allow retries for this agent.  REPs
will be retried by the MAD layer until CM layer receives a response
(e.g., RTU) on the normal agent and cancels them.

Suggested-by: Sean Hefty <shefty@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://patch.msgid.link/9ac12d0842b849e2c8537d6e291ee0af9f79855c.1751278420.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/mad: Add flow control for solicited MADs

Currently, MADs sent via an agent are being forwarded directly to the
corresponding MAD QP layer.

MADs with a timeout value set and requiring a response (solicited MADs)
will be resent if the timeout expires without receiving a response.
In a congested subnet, flooding MAD QP layer with more solicited send
requests from the agent will only worsen the situation by triggering
more timeouts and therefore more retries.

Thus, add flow control for non-user solicited MADs to block agents from
issuing new solicited MAD requests to the MAD QP until outstanding
requests are completed and the MAD QP is ready to process additional
requests. While at it, keep track of the total outstanding solicited
MAD work requests in send or wait list. The number of outstanding send
WRs will be limited by a fraction of the RQ size, and any new send WR
that exceeds that limit will be held in a backlog list.
Backlog MADs will be forwarded to agent send list only once the total
number of outstanding send WRs falls below the limit.

Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are
not subject to this flow control mechanism and will not be affected by
this change.

For this purpose, a new state is introduced:
- 'IB_MAD_STATE_QUEUED': MAD is in backlog list

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/c0ecaa1821badee124cd13f3bf860f67ce453beb.1751278420.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/mad: Add state machine to MAD layer

Replace the use of refcount, timeout and status with a 'state'
field to track the status of MADs send work requests (WRs).
The state machine better represents the stages in the MAD lifecycle,
specifically indicating whether the MAD is waiting for a completion,
waiting for a response, was canceld or is done.

The existing refcount only takes two values:
1 : MAD is waiting either for completion or for response.
2 : MAD is waiting for both response and completion. Also when a
     response was received before a completion notification.
The status field represents if the MAD was canceled at some point
in the flow.
The timeout is used to represent if a response was received.

The current state transitions are not clearly visible, and developers
needs to infer the state from the refcount's, timeout's or status's
value, which is error-prone and difficult to follow.

Thus, replace with a state machine as the following:
- 'IB_MAD_STATE_INIT': MAD is in the making and is not yet in any list
- 'IB_MAD_STATE_SEND_START': MAD was sent to the QP and is waiting for
   completion notification in send list
- 'IB_MAD_STATE_WAIT_RESP': MAD send completed successfully, waiting for
  a response in wait list
- 'IB_MAD_STATE_EARLY_RESP': Response came early, before send
   completion notification, MAD is in the send list
- 'IB_MAD_STATE_CANCELED': MAD was canceled while in send or wait list
- 'IB_MAD_STATE_DONE': MAD processing completed, MAD is in no list

Adding the state machine also make it possible to remove the double
call for ib_mad_complete_send_wr in case of an early response and the
use of a done list in case of a regular response.

While at it, define a helper to clear error MADs which will handle
freeing MADs that timed out or have been cancelled.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/48e6ae8689dc7bb8b4ba6e5ec562e1b018db88a8.1751278420.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Merge branch 'mlx5-next' into wip/leon-for-next

* mlx5-next:
  net/mlx5: Check device memory pointer before usage
  net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
  net/mlx5: Add IFC bits for PCIe Congestion Event object
  net/mlx5: Small refactor for general object capabilities

RDMA/bnxt_re: Use macro instead of hard coded value

1. Defined a macro for the hard coded value.
2. "access" field in the request structure is of type "u8".
Updated the mask accordingly.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Link: https://patch.msgid.link/20250704043857.19158-4-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Support 2G message size

bnxt_qplib_put_sges is calculating the length in
a signed int. So handling the 2G message size
is not working since it is considered as negative.

Use a unsigned number to calculate the total message
length. As per the spec, IB message size shall be
between zero and 2^31 bytes (inclusive). So adding
a check for 2G message size.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Link: https://patch.msgid.link/20250704043857.19158-3-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Fix size of uverbs_copy_to() in BNXT_RE_METHOD_GET_TOGGLE_MEM

Since both "length" and "offset" are of type u32, there is
no functional issue here.

Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250704043857.19158-2-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix -Wframe-larger-than issue

Fix -Wframe-larger-than issue by allocating memory for qpc struct
with kzalloc() instead of using stack memory.

Fixes: 606bf89e98ef ("RDMA/hns: Refactor for hns_roce_v2_modify_qp function")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506240032.CSgIyFct-lkp@intel.com/
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Drop GFP_NOWARN

GFP_NOWARN silences all warnings on dma_alloc_coherent() failure,
which might otherwise help with troubleshooting.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix accessing uninitialized resources

hr_dev->pgdir_list and hr_dev->pgdir_mutex won't be initialized if
CQ/QP record db are not enabled, but they are also needed when using
SRQ with SRQ record db enabled. Simplified the logic by always
initailizing the reosurces.

Fixes: c9813b0b9992 ("RDMA/hns: Support SRQ record doorbell")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Get message length of ack_req from FW

ACK_REQ_FREQ indicates the number of packets (after MTU fragmentation)
HW sends before setting an ACK request. When MTU is greater than or
equal to 1024, the current ACK_REQ_FREQ value causes HW to request an
ACK for every MTU fragment. The processing of a large number of ACKs
severely impacts HW performance when sending large size payloads.

Get message length of ack_req from FW so that we can adjust this
parameter according to different situations. There are several
constraints for ACK_REQ_FREQ:

1. mtu * (2 ^ ACK_REQ_FREQ) should not be too large, otherwise it may
cause some unexpected retries when sending large payload.

2. ACK_REQ_FREQ should be larger than or equal to LP_PKTN_INI.

3. ACK_REQ_FREQ must be equal to LP_PKTN_INI when using LDCP
or HC3 congestion control algorithm.

Fixes: 56518a603fd2 ("RDMA/hns: Modify the value of long message loopback slice")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix HW configurations not cleared in error flow

hns_roce_clear_extdb_list_info() will eventually do some HW
configurations through FW, and they need to be cleared by
calling hns_roce_function_clear() when the initialization
fails.

Fixes: 7e78dd816e45 ("RDMA/hns: Clear extended doorbell info before using")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix double destruction of rsv_qp

rsv_qp may be double destroyed in error flow, first in free_mr_init(),
and then in hns_roce_exit(). Fix it by moving the free_mr_init() call
into hns_roce_v2_init().

list_del corruption, ffff589732eb9b50->next is LIST_POISON1 (dead000000000100)
WARNING: CPU: 8 PID: 1047115 at lib/list_debug.c:53 __list_del_entry_valid+0x148/0x240
...
Call trace:
__list_del_entry_valid+0x148/0x240
hns_roce_qp_remove+0x4c/0x3f0 [hns_roce_hw_v2]
hns_roce_v2_destroy_qp_common+0x1dc/0x5f4 [hns_roce_hw_v2]
hns_roce_v2_destroy_qp+0x22c/0x46c [hns_roce_hw_v2]
free_mr_exit+0x6c/0x120 [hns_roce_hw_v2]
hns_roce_v2_exit+0x170/0x200 [hns_roce_hw_v2]
hns_roce_exit+0x118/0x350 [hns_roce_hw_v2]
__hns_roce_hw_v2_init_instance+0x1c8/0x304 [hns_roce_hw_v2]
hns_roce_hw_v2_reset_notify_init+0x170/0x21c [hns_roce_hw_v2]
hns_roce_hw_v2_reset_notify+0x6c/0x190 [hns_roce_hw_v2]
hclge_notify_roce_client+0x6c/0x160 [hclge]
hclge_reset_rebuild+0x150/0x5c0 [hclge]
hclge_reset+0x10c/0x140 [hclge]
hclge_reset_subtask+0x80/0x104 [hclge]
hclge_reset_service_task+0x168/0x3ac [hclge]
hclge_service_task+0x50/0x100 [hclge]
process_one_work+0x250/0x9a0
worker_thread+0x324/0x990
kthread+0x190/0x210
ret_from_fork+0x10/0x18

Fixes: fd8489294dd2 ("RDMA/hns: Fix Use-After-Free of rsv_qp on HIP08")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Check device memory pointer before usage

Add a NULL check before accessing device memory to prevent a crash if
dev->dm allocation in mlx5_init_once() fails.

Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core")
Signed-off-by: Stav Aviram <saviram@nvidia.com>
Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow

Failing during the initialization of root_namespace didn't cleanup
the priorities of the namespace on which the failure occurred.

Properly cleanup said priorities on failure.

Fixes: 52931f55159e ("net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://patch.msgid.link/78cf89b5d8452caf1e979350b30ada6904362f66.1751451780.git.leon@kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Fix dma_unmap_sg() nents value

The dma_unmap_sg() functions should be called with the same nents as the
dma_map_sg(), not the value the map function returned.

Fixes: ed10435d3583 ("RDMA/erdma: Implement hierarchical MTT")
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Link: https://patch.msgid.link/20250630092346.81017-2-fourier.thomas@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/counter: Check CAP_NET_RAW check in user namespace for RDMA counters

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 1bd8e0a9d0fd ("RDMA/counter: Allow manual mode configuration support")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/68e2064e72e94558a576fdbbb987681a64f6fea8.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/nldev: Check CAP_NET_RAW in user namespace for QP modify

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to modify
the QP.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 0cadb4db79e1 ("RDMA/uverbs: Restrict usage of privileged QKEYs")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/099eb263622ccdd27014db7e02fec824a3307829.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Check CAP_NET_RAW in user namespace for devx create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the devx object.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: a8b92ca1b0e5 ("IB/mlx5: Introduce DEVX")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/36ee87e92defd81410c6a2b33f9d6c0d6dcfd64c.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/3914ef9702b01de8843a391ce397fca67d0fc7af.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 6d1e7ba241e9 ("IB/uverbs: Introduce create/destroy QP commands over ioctl")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/7b6b87505ccc28a1f7b4255af94d898d2df0fff5.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Check CAP_NET_RAW in user namespace for QP create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the QP.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 2dee0e545894 ("IB/uverbs: Enable QP creation with a given source QP number")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/0e5920d1dfe836817bb07576b192da41b637130b.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Check CAP_NET_RAW in user namespace for anchor create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the anchor.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/c2376ca75e7658e2cbd1f619cf28fbe98c906419.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Check CAP_NET_RAW in user namespace for flow create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the flow.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 322694412400 ("IB/mlx5: Introduce driver create and destroy flow methods")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/a4dcd5e3ac6904ef50b19e56942ca6ab0728794c.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create

Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the flow resource.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Link: https://patch.msgid.link/6df6f2f24627874c4f6d041c19dc1f6f29f68f84.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/ipoib: Use parent rdma device net namespace

Use the net namespace of the underlying rdma device.
After honoring the rdma device's namespace, the ipoib
netdev now also runs in the same net namespace of the
rdma device.

Add an API to read the net namespace of the rdma device
so that ULP such as IPoIB can use it to initialize its
netdev.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/mlx5: Allocate IB device with net namespace supplied from core dev

Use the new ib_alloc_device_with_net() API to allocate the IB device
so that it is properly bound to the network namespace obtained via
mlx5_core_net(). This change ensures correct namespace association
(e.g., for containerized setups).

Additionally, expose mlx5_core_net so that RDMA driver can use it.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/core: Extend RDMA device registration to be net namespace aware

Presently, RDMA devices are always registered within the init network
namespace, even if the associated devlink device's namespace was
changed via a devlink reload. This mismatch leads to discrepancies
between the network namespace of the devlink device and that of the
RDMA device.

Therefore, extend the RDMA device allocation API to optionally take
the net namespace. This isn't limited to devices that support devlink
but allows all users to provide the network namespace if they need to
do so.

If a network namespace is provided during device allocation, it's up
to the caller to make sure the namespace stays valid until
ib_register_device() is called.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/rxe: Fix a couple IS_ERR() vs NULL bugs

The lookup_mr() function returns NULL on error. It never returns error
pointers.

Fixes: 9284bc34c773 ("RDMA/rxe: Enable asynchronous prefetch for ODP MRs")
Fixes: 3576b0df1588 ("RDMA/rxe: Implement synchronous prefetch for ODP MRs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/685c1430.050a0220.18b0ef.da83@mx.google.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/siw: work around clang stack size warning

clang inlines a lot of functions into siw_qp_sq_process(), with the
aggregate stack frame blowing the warning limit in some configurations:

drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than]

The real problem here is the array of kvec structures in siw_tx_hdt that
makes up the majority of the consumed stack space.

Ideally there would be a way to avoid allocating the array on the
stack, but that would require a larger rework. Add a noinline_for_stack
annotation to avoid the warning for now, and make clang behave the same
way as gcc here. The combined stack usage is still similar, but is spread
over multiple functions now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250620114332.4072051-1-arnd@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMI: hfi1: drop cpumask_empty() call in hfi1/affinity.c

In few places, the driver tests a cpumask for emptiness immediately
before calling functions that report emptiness themself.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-8-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA: hfi1: simplify hfi1_get_proc_affinity()

The function protects the for loop with affinity->num_core_siblings > 0
condition, which is redundant because the loop will break immediately in
that case.

Drop it and save one indentation level.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-7-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA: hfi1: use rounddown in find_hw_thread_mask()

num_cores_per_socket is calculated by dividing by
node_affinity.num_online_nodes, but all users of this variable multiply
it by node_affinity.num_online_nodes again. This effectively is the same
as rounding it down by node_affinity.num_online_nodes.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-6-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA: hfi1: simplify init_real_cpu_mask()

The function opencodes cpumask_nth() and cpumask_clear_cpus(). The
dedicated helpers are easier to use and usually much faster than
opencoded for-loops.

While there, drop useless clear of real_cpu_mask.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-5-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA: hfi1: simplify find_hw_thread_mask()

The function opencodes cpumask_nth() and cpumask_clear_cpus(). The
dedicated helpers are easier to use and usually much faster than
opencoded for-loops.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-4-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()

The function divides number of online CPUs by num_core_siblings, and
later checks the divider by zero. This implies a possibility to get
and divide-by-zero runtime error. Fix it by moving the check prior to
division. This also helps to save one indentation level.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-3-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

cpumask: add cpumask_clear_cpus()

When user wants to clear a range in cpumask, the only option the API
provides now is a for-loop, like:

for_each_cpu_from(cpu, mask) {
if (cpu >= ncpus)
break;
__cpumask_clear_cpu(cpu, mask);
}

In the bitmap API we have bitmap_clear() for that, which is
significantly faster than a for-loop. Propagate it to cpumasks.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-2-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Add IFC bits for PCIe Congestion Event object

Add definitions for the PCIe Congestion Event object
and the relevant FW command structures.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250619113721.60201-3-mbloch@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Small refactor for general object capabilities

Make enum for capability bits of general object types depend on
the type definitions themselves.

Make sure that capabilities in the [64,127] bit range are
properly calculated (type id - 64).

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250619113721.60201-2-mbloch@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: reduce stack using in nldev_stat_get_doit()

In the s390 defconfig, gcc-10 and earlier end up inlining three functions
into nldev_stat_get_doit(), and each of them uses some 600 bytes of stack.

The result is a function with an overly large stack frame and a warning:

drivers/infiniband/core/nldev.c:2466:1: error: the frame size of 1720 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]

Mark the three functions noinline_for_stack to prevent this, ensuring
that only one copy of the nlattr array is on the stack of each function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250620113335.3776965-1-arnd@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Add multiple priorities support to RDMA TRANSPORT userspace tables

Support the creation of RDMA TRANSPORT tables over multiple priorities
via matcher creation.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/bb38e50ae4504e979c6568d41939402a4cf15635.1750148083.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Add multiple priorities support to mlx5 RDMA TRANSPORT tables

From Patrisious:

This short series from Patrisious extends mlx5 flow steering logic to
allow creation rule creation with priorities in RDMA TRANSPORT tables.

Thanks

Link: https://lore.kernel.org/all/cover.1750148083.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next: (200 commits)
  net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
  Linux 6.16-rc2
  ...

net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain

RDMA TRANSPORT domains were initially limited to a single priority.
This change allows the domains to have multiple priorities, making
it possible to add several rules and control the order in which
they're evaluated.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/b299cbb4c8678a33da6e6b6988b5bf6145c54b88.1750148083.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Support driver APIs pre_destroy_cq and post_destroy_cq

- pre_destroy_cq: Destroy FW CQ object so that no new CQ event would
                  be generated;
- post_destroy_cq: Release all resources.

This patch, along with last one, fixes the crash below.

Unable to handle kernel paging request at virtual address ffff8000114b1180
Mem abort info:
   ESR = 0x96000047
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
Data abort info:
   ISV = 0, ISS = 0x00000047
   CM = 0, WnR = 1
swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f4582000
[ffff8000114b1180] pgd=00000447fffff003, p4d=00000447fffff003, pud=00000447ffffe003, pmd=00000447ffffb003, pte=0000000000000000
Internal error: Oops: 96000047 [#1] SMP
Modules linked in: udp_diag uio_pci_generic uio tcp_diag inet_diag binfmt_misc sn_core_odd(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) kpatch_9658536(OK) kpatch_9322385(OK) kpatch_8843421(OK) kpatch_8636216(OK) vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sm4_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt sg acpi_ipmi ipmi_si ipmi_msghandler m1_uncore_ddrss_pmu m1_uncore_cmn_pmu team_yosemite9rc6(OE) vnic(OE) ip_tables mlx5_ib(OE) sd_mod ast mlx5_core(OE) i2c_algo_bit drm_vram_helper psample drm_kms_helper mlxdevm(OE) auxiliary(OE) mlxfw(OE) syscopyarea sysfillrect tls sysimgblt fb_sys_fops drm_ttm_helper nvme ttm nvme_core drm t10_pi i2c_designware_platform i2c_designware_core i2c_core ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ipmi_devintf]
CPU: 83 PID: 59375 Comm: kworker/u253:1 Kdump: loaded Tainted: G           OE K   5.10.84-004.ali5000.alios7.aarch64 #1
Hardware name: Inspur AliServer-Xuanwu2.0AM-02-2UM1P-5B/AS1221MG1, BIOS 1.2.M1.AL.P.158.00 08/31/2023
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
pstate: 82c00089 (Nzcv daIf +PAN +UAO +TCO BTYPE=--)
pc : native_queued_spin_lock_slowpath+0x1c4/0x31c
lr : mlx5_ib_poll_cq+0x18c/0x2f8 [mlx5_ib]
sp : ffff80002be1bc80
x29: ffff80002be1bc80 x28: ffff000810e69000
x27: ffff000810e69000 x26: ffff000810e69200
x25: 0000000000000000 x24: ffff8000117db000
x23: ffff04000156b780 x22: 0000000000000000
x21: ffff04000ce6c160 x20: ffff0008196a4000
x19: 0000000000000010 x18: 0000000000000020
x17: 0000000000000000 x16: 0000000000000000
x15: ffff040055a364e8 x14: ffffffffffffffff
x13: ffff80002318bda8 x12: ffff0400358836e8
x11: 0000000000000040 x10: 0000000000000eb0
x9 : 0000000000000000 x8 : 0000000000000000
x7 : ffff04477fa20140 x6 : ffff8000114b1140
x5 : ffff04477fa20140 x4 : ffff8000114b1180
x3 : ffff000810e69200 x2 : ffff8000114b1180
x1 : 0000000001500000 x0 : ffff04477fa20148
Call trace:
  native_queued_spin_lock_slowpath+0x1c4/0x31c
  __ib_process_cq+0x74/0x1b8 [ib_core]
  ib_cq_poll_work+0x34/0xa0 [ib_core]
  process_one_work+0x1d8/0x4b0
  worker_thread+0x180/0x440
  kthread+0x114/0x120
Code: 910020e0 8b0400c4 f862d929 aa0403e2 (f8296847)
---[ end trace 387be2290557729c ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x9850817,7a60aa38
Memory Limit: none
Starting crashdump kernel...
Bye!

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://patch.msgid.link/aaf0072f350d1c7e8731f43b79e11a560bafb9e0.1750070205.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: Add driver APIs pre_destroy_cq() and post_destroy_cq()

Currently in ib_free_cq, it disables IRQ or cancel the CQ work before
driver destroy_cq. This isn't good as a new IRQ or a CQ work can be
submitted immediately after disabling IRQ or canceling CQ work, which
may run concurrently with destroy_cq and cause crashes.
The right flow should be:
1. Driver disables CQ to make sure no new CQ event will be submitted;
2. Disables IRQ or Cancels CQ work in core layer, to make sure no CQ
polling work is running;
3. Free all resources to destroy the CQ.

This patch adds 2 driver APIs:
- pre_destroy_cq(): Disable a CQ to prevent it from generating any new
work completions, but not free any kernel resources;
- post_destroy_cq(): Free all kernel resources.

In ib_free_cq, the IRQ is disabled or CQ work is canceled after
pre_destroy_cq, and before post_destroy_cq.

Fixes: 14d3a3b2498e ("IB: add a proper completion queue abstraction")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://patch.msgid.link/b5f7ae3d75f44a3e15ff3f4eb2bbdea13e06b97f.1750062328.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Maintainers: Remove QIB

qib driver has been removed. Take entry out of MAINTAINERS file.

Link: https://patch.msgid.link/r/174836077280.2436819.8306178407677103841.stgit@awdrv-04.cornelisnetworks.com
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/qib: Remove outdated driver

The qib Infiniband device has long been decommissioned from distros and
unsupported by the company. We have kept it going in a compile only mode
for a while now and it's time to remove the driver. It has a number of
issues that are never going to be addressed and is a hindrance to
improving hfi1 and rdmavt.

Link: https://patch.msgid.link/r/174836076755.2436819.5981097575800950899.stgit@awdrv-04.cornelisnetworks.com
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Linux 6.16-rc2

Merge tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- Move warnings about linux/export.h from W=1 to W=2

- Fix structure type overrides in gendwarfksyms

* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
gendwarfksyms: Fix structure type overrides
kbuild: move warnings about linux/export.h from W=1 to W=2

gendwarfksyms: Fix structure type overrides

As we always iterate through the entire die_map when expanding
type strings, recursively processing referenced types in
type_expand_child() is not actually necessary. Furthermore,
the type_string kABI rule added in commit c9083467f7b9
("gendwarfksyms: Add a kABI rule to override type strings") can
fail to override type strings for structures due to a missing
kabi_get_type_string() check in this function.

Fix the issue by dropping the unnecessary recursion and moving
the override check to type_expand(). Note that symbol versions
are otherwise unchanged with this patch.

Fixes: c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings")
Reported-by: Giuliano Procida <gprocida@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

kbuild: move warnings about linux/export.h from W=1 to W=2

This hides excessive warnings, as nobody builds with W=2.

Fixes: a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1")
Fixes: 7d95680d64ac ("scripts/misc-check: check unnecessary #include <linux/export.h> when W=1")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>

Merge tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- SMB3.1.1 POSIX extensions fix for char remapping

- Fix for repeated directory listings when directory leases enabled

- deferred close handle reuse fix

* tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: improve directory cache reuse for readdir operations
  smb: client: fix perf regression with deferred closes
  smb: client: disable path remapping with POSIX extensions

Merge tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu fix from Joerg Roedel:

- Fix PTE size calculation for NVidia Tegra

* tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/tegra: Fix incorrect size calculation

Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- Fix for a deadlock on queue freeze with zoned writes

- Fix for zoned append emulation

- Two bio folio fixes, for sparsemem and for very large folios

- Fix for a performance regression introduced in 6.13 when plug
   insertion was changed

- Fix for NVMe passthrough handling for polled IO

- Document the ublk auto registration feature

- loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze

Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

- Fix for a race between SQPOLL exit and fdinfo reading.

   It's slim and I was only able to reproduce this with an artificial
   delay in the kernel. Followup sparse fix as well to unify the access
   to ->thread.

- Fix for multiple buffer peeking, avoiding truncation if possible.

- Run local task_work for IOPOLL reaping when the ring is exiting.

   This currently isn't done due to an assumption that polled IO will
   never need task_work, but a fix on the block side is going to change
   that.

* tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux:
  io_uring: run local task_work from ring exit IOPOLL reaping
  io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
  io_uring: consistently use rcu semantics with sqpoll thread
  io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()

Merge tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull Rust fix from Miguel Ojeda:

  - 'hrtimer': fix future compile error when the 'impl_has_hr_timer!'
    macro starts to get called

* tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  rust: time: Fix compile error in impl_has_hr_timer macro

Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues
  or aren't considered necessary for -stable kernels. Only 4 are for MM"

* tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: add mmap_prepare() compatibility layer for nested file systems
  init: fix build warnings about export.h
  MAINTAINERS: add Barry as a THP reviewer
  drivers/rapidio/rio_cm.c: prevent possible heap overwrite
  mm: close theoretical race where stale TLB entries could linger
  mm/vma: reset VMA iterator on commit_merge() OOM failure
  docs: proc: update VmFlags documentation in smaps
  scatterlist: fix extraneous '@'-sign kernel-doc notation
  selftests/mm: skip failed memfd setups in gup_longterm

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"All fixes for drivers.

  The core change in the error handler is simply to translate an ALUA
  specific sense code into a retry the ALUA components can handle and
  won't impact any other devices"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: error: alua: I/O errors for ALUA state transitions
  scsi: storvsc: Increase the timeouts to storvsc_timeout
  scsi: s390: zfcp: Ensure synchronous unit_add
  scsi: iscsi: Fix incorrect error path labels for flashnode operations
  scsi: mvsas: Fix typos in per-phy comments and SAS cmd port registers
  scsi: core: ufs: Fix a hang in the error handler

Merge tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Quiet week, only two pull requests came my way, xe has a couple of
  fixes and then a bunch of fixes across the board, vc4 probably fixes
  the biggest problem:

  vc4:
   - Fix infinite EPROBE_DEFER loop in vc4 probing

  amdxdna:
   - Fix amdxdna firmware size

  meson:
   - modesetting fixes

  sitronix:
   - Kconfig fix for st7171-i2c

  dma-buf:
   - Fix -EBUSY WARN_ON_ONCE in dma-buf

  udmabuf:
   - Use dma_sync_sgtable_for_cpu in udmabuf

  xe:
   - Fix regression disallowing 64K SVM migration
   - Use a bounce buffer for WA BB"

* tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe/lrc: Use a temporary buffer for WA BB
  udmabuf: use sgtable-based scatterlist wrappers
  dma-buf: fix compare in WARN_ON_ONCE
  drm/sitronix: st7571-i2c: Select VIDEOMODE_HELPERS
  drm/meson: fix more rounding issues with 59.94Hz modes
  drm/meson: use vclk_freq instead of pixel_freq in debug print
  drm/meson: fix debug log statement when setting the HDMI clocks
  drm/vc4: fix infinite EPROBE_DEFER loop
  drm/xe/svm: Fix regression disallowing 64K SVM migration
  accel/amdxdna: Fix incorrect PSP firmware size

io_uring: run local task_work from ring exit IOPOLL reaping

In preparation for needing to shift NVMe passthrough to always use
task_work for polled IO completions, ensure that those are suitably
run at exit time. See commit:

9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work")

for details on why that is necessary.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: always punt polled uring_cmd end_io work to task_work

Currently NVMe uring_cmd completions will complete locally, if they are
polled. This is done because those completions are always invoked from
task context. And while that is true, there's no guarantee that it's
invoked under the right ring context, or even task. If someone does
NVMe passthrough via multiple threads and with a limited number of
poll queues, then ringA may find completions from ringB. For that case,
completing the request may not be sound.

Always just punt the passthrough completions via task_work, which will
redirect the completion, if needed.

Cc: stable@vger.kernel.org
Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These fix an ACPI APEI error injection driver failure that started to
  occur after switching it over to using a faux device, address an EC
  driver issue related to invalid ECDT tables, clean up the usage of
  mwait_idle_with_hints() in the ACPI PAD driver, add a new IRQ override
  quirk, and fix a NULL pointer dereference related to nosmp:

   - Update the faux device handling code in the driver core and address
     an ACPI APEI error injection driver failure that started to occur
     after switching it over to using a faux device on top of that (Dan
     Williams)

   - Update data types of variables passed as arguments to
     mwait_idle_with_hints() in the ACPI PAD (processor aggregator
     device) driver to match the function definition after recent
     changes (Uros Bizjak)

   - Fix a NULL pointer dereference in the ACPI CPPC library that occurs
     when nosmp is passed to the kernel in the command line (Yunhui Cui)

   - Ignore ECDT tables with an invalid ID string to prevent using an
     incorrect GPE for signaling events on some systems (Armin Wolf)

   - Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan)"

* tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: resource: Use IRQ override on MACHENIKE 16P
  ACPI: EC: Ignore ECDT tables with an invalid ID string
  ACPI: CPPC: Fix NULL pointer dereference when nosmp is used
  ACPI: PAD: Update arguments of mwait_idle_with_hints()
  ACPI: APEI: EINJ: Do not fail einj_init() on faux_device_create() failure
  driver core: faux: Quiet probe failures
  driver core: faux: Suppress bind attributes

Merge tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix the cpupower utility installation, fix up the recently added
  Rust abstractions for cpufreq and OPP, restore the x86 update
  eliminating mwait_play_dead_cpuid_hint() that has been reverted during
  the 6.16 merge window along with preventing the failure caused by it
  from happening, and clean up mwait_idle_with_hints() usage in
  intel_idle:

   - Implement CpuId Rust abstraction and use it to fix doctest failure
     related to the recently introduced cpumask abstraction (Viresh
     Kumar)

   - Do minor cleanups in the `# Safety` sections for cpufreq
     abstractions added recently (Viresh Kumar)

   - Unbreak cpupower systemd service units installation on some systems
     by adding a unitdir variable for specifying the location to install
     them (Francesco Poli)

   - Eliminate mwait_play_dead_cpuid_hint() again after reverting its
     elimination during the 6.16 merge window due to a problem with
     handling "dead" SMT siblings, but this time prevent leaving them in
     C1 after initialization by taking them online and back offline when
     a proper cpuidle driver for the platform has been registered
     (Rafael Wysocki)

   - Update data types of variables passed as arguments to
     mwait_idle_with_hints() to match the function definition after
     recent changes (Uros Bizjak)"

* tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  rust: cpu: Add CpuId::current() to retrieve current CPU ID
  rust: Use CpuId in place of raw CPU numbers
  rust: cpu: Introduce CpuId abstraction
  intel_idle: Update arguments of mwait_idle_with_hints()
  cpufreq: Convert `/// SAFETY` lines to `# Safety` sections
  cpupower: split unitdir from libdir in Makefile
  Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
  ACPI: processor: Rescan "dead" SMT siblings during initialization
  intel_idle: Rescan "dead" SMT siblings during initialization
  x86/smp: PM/hibernate: Split arch_resume_nosmt()
  intel_idle: Use subsys_initcall_sync() for initialization

Merge branches 'acpi-pad', 'acpi-cppc', 'acpi-ec' and 'acpi-resource'

Merge assorted ACPI updates for 6.16-rc2:

- Update data types of variables passed as arguments to
   mwait_idle_with_hints() in the ACPI PAD (processor aggregator device)
   driver to match the function definition after recent changes (Uros
   Bizjak).

- Fix a NULL pointer dereference in the ACPI CPPC library that occurs
   when nosmp is passed to the kernel in the command line (Yunhui Cui).

- Ignore ECDT tables with an invalid ID string to prevent using an
   incorrect GPE for signaling events on some systems (Armin Wolf).

- Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan).

* acpi-pad:
  ACPI: PAD: Update arguments of mwait_idle_with_hints()

* acpi-cppc:
  ACPI: CPPC: Fix NULL pointer dereference when nosmp is used

* acpi-ec:
  ACPI: EC: Ignore ECDT tables with an invalid ID string

* acpi-resource:
  ACPI: resource: Use IRQ override on MACHENIKE 16P

Merge branch 'pm-cpuidle'

Merge cpuidle updates for 6.16-rc2:

- Update data types of variables passed as arguments to
   mwait_idle_with_hints() to match the function definition
   after recent changes (Uros Bizjak).

- Eliminate mwait_play_dead_cpuid_hint() again after reverting its
   elimination during the merge window due to a problem with handling
   "dead" SMT siblings, but this time prevent leaving them in C1 after
   initialization by taking them online and back offline when a proper
   cpuidle driver for the platform has been registered (Rafael Wysocki).

* pm-cpuidle:
  intel_idle: Update arguments of mwait_idle_with_hints()
  Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
  ACPI: processor: Rescan "dead" SMT siblings during initialization
  intel_idle: Rescan "dead" SMT siblings during initialization
  x86/smp: PM/hibernate: Split arch_resume_nosmt()
  intel_idle: Use subsys_initcall_sync() for initialization

Merge branch 'pm-tools'

Merge a cpupower utility fix for 6.16-rc2 that unbreaks systemd service
units installation on some sysems (Francesco Poli).

* pm-tools:
cpupower: split unitdir from libdir in Makefile

Merge tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A collection of driver specific fixes, most minor apart from the OMAP
  ones which disable some recent performance optimisations in some
  non-standard cases where we could start driving the bus incorrectly.

  The change to the stm32-ospi driver to use the newer reset APIs is a
  fix for interactions with other IP sharing the same reset line in some
  SoCs"

* tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine
  spi: stm32-ospi: clean up on error in probe()
  spi: stm32-ospi: Make usage of reset_control_acquire/release() API
  spi: offload: check offload ops existence before disabling the trigger
  spi: spi-pci1xxxx: Fix error code in probe
  spi: loongson: Fix build warnings about export.h
  spi: omap2-mcspi: Disable multi-mode when the previous message kept CS asserted
  spi: omap2-mcspi: Disable multi mode when CS should be kept asserted after message

Merge tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
"One minor fix for a leak in the DT parsing code in the max20086 driver"

* tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: max20086: Fix refcount leak in max20086_parse_regulators_dt()

posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del()

If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().

If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.

Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.

This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.

Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:

- Do not free "head" variable in filter_free_subsystem_filters()

   The first error path jumps to "free_now" label but first frees the
   newly allocated "head" variable. But the "free_now" code checks this
   variable, and if it is not NULL, it will iterate the list. As this
   list variable was already initialized, the "free_now" code will not
   do anything as it is empty. But freeing it will cause a UAF bug.

   The error path should simply jump to the "free_now" label and leave
   the "head" variable alone.

* tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Do not free "head" on error path of filter_free_subsystem_filters()

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Rework of system register accessors for system registers that are
     directly writen to memory, so that sanitisation of the in-memory
     value happens at the correct time (after the read, or before the
     write). For convenience, RMW-style accessors are also provided.

   - Multiple fixes for the so-called "arch-timer-edge-cases' selftest,
     which was always broken.

  x86:

   - Make KVM_PRE_FAULT_MEMORY stricter for TDX, allowing userspace to
     pass only the "untouched" addresses and flipping the shared/private
     bit in the implementation.

   - Disable SEV-SNP support on initialization failure

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY
  KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
  KVM: SEV: Disable SEV-SNP support on initialization failure
  KVM: arm64: selftests: Determine effective counter width in arch_timer_edge_cases
  KVM: arm64: selftests: Fix xVAL init in arch_timer_edge_cases
  KVM: arm64: selftests: Fix thread migration in arch_timer_edge_cases
  KVM: arm64: selftests: Fix help text for arch_timer_edge_cases
  KVM: arm64: Make __vcpu_sys_reg() a pure rvalue operand
  KVM: arm64: Don't use __vcpu_sys_reg() to get the address of a sysreg
  KVM: arm64: Add RMW specific sysreg accessor
  KVM: arm64: Add assignment-specific sysreg accessor

io_uring/kbuf: don't truncate end buffer for multiple buffer peeks

If peeking a bunch of buffers, normally io_ring_buffers_peek() will
truncate the end buffer. This isn't optimal as presumably more data will
be arriving later, and hence it's better to stop with the last full
buffer rather than truncate the end buffer.

Cc: stable@vger.kernel.org
Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'v6.16-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
"Fix a broken self-test in hkdf (new regression)"

* tag 'v6.16-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: hkdf - move to late_initcall

Merge tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
"As usual, highlighting the ones users have been noticing:

   - Fix a small issue with has_case_insensitive not being propagated on
     snapshot creation; this led to fsck errors, which we're harmless
     because we're not using this flag yet (it's for overlayfs +
     casefolding).

   - Log the error being corrected in the journal when we're doing fsck
     repair: this was one of the "lessons learned" from the i_nlink 0 ->
     subvolume deletion bug, where reconstructing what had happened by
     analyzing the journal was a bit more difficult than it needed to
     be.

   - Don't schedule btree node scan to run in the superblock: this fixes
     a regression from the 6.16 recovery passes rework, and let to it
     running unnecessarily.

     The real issue here is that we don't have online, "self healing"
     style topology repair yet: topology repair currently has to run
     before we go RW, which means that we may schedule it unnecessarily
     after a transient error. This will be fixed in the future.

   - We now track, in btree node flags, the reason it was scheduled to
     be rewritten. We discovered a deadlock in recovery when many btree
     nodes need to be rewritten because they're degraded: fully fixing
     this will take some work but it's now easier to see what's going
     on.

     For the bug report where this came up, a device had been kicked RO
     due to transient errors: manually setting it back to RW was
     sufficient to allow recovery to succeed.

   - Mark a few more fsck errors as autofix: as a reminder to users,
     please do keep reporting cases where something needs to be repaired
     and is not repaired automatically (i.e. cases where -o fix_errors
     or fsck -y is required).

   - rcu_pending.c now works with PREEMPT_RT

   - 'bcachefs device add', then umount, then remount wasn't working -
     we now emit a uevent so that the new device's new superblock is
     correctly picked up

   - Assorted repair fixes: btree node scan will no longer incorrectly
     update sb->version_min,

   - Assorted syzbot fixes"

* tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs: (23 commits)
  bcachefs: Don't trace should_be_locked unless changing
  bcachefs: Ensure that snapshot creation propagates has_case_insensitive
  bcachefs: Print devices we're mounting on multi device filesystems
  bcachefs: Don't trust sb->nr_devices in members_to_text()
  bcachefs: Fix version checks in validate_bset()
  bcachefs: ioctl: avoid stack overflow warning
  bcachefs: Don't pass trans to fsck_err() in gc_accounting_done
  bcachefs: Fix leak in bch2_fs_recovery() error path
  bcachefs: Fix rcu_pending for PREEMPT_RT
  bcachefs: Fix downgrade_table_extra()
  bcachefs: Don't put rhashtable on stack
  bcachefs: Make sure opts.read_only gets propagated back to VFS
  bcachefs: Fix possible console lock involved deadlock
  bcachefs: mark more errors autofix
  bcachefs: Don't persistently run scan_for_btree_nodes
  bcachefs: Read error message now prints if self healing
  bcachefs: Only run 'increase_depth' for keys from btree node csan
  bcachefs: Mark need_discard_freespace_key_bad autofix
  bcachefs: Update /dev/disk/by-uuid on device add
  bcachefs: Add more flags to btree nodes for rewrite reason
  ...

Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists

Stephen Rothwell reports htmldocs warning on ublk docs:

Documentation/block/ublk.rst:414: ERROR: Unexpected indentation. [docutils]

Fix the warning by separating sublists of auto buffer registration
fallback behavior from their appropriate parent list item.

Fixes: ff20c516485e ("ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250612132638.193de386@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20250613023857.15971-1-bagasdotme@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

iommu/tegra: Fix incorrect size calculation

This driver uses a mixture of ways to get the size of a PTE,
tegra_smmu_set_pde() did it as sizeof(*pd) which became wrong when pd
switched to a struct tegra_pd.

Switch pd back to a u32* in tegra_smmu_set_pde() so the sizeof(*pd)
returns 4.

Fixes: 50568f87d1e2 ("iommu/terga: Do not use struct page as the handle for as->pd memory")
Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Closes: https://lore.kernel.org/all/62e7f7fe-6200-4e4f-ad42-d58ad272baa6@tecnico.ulisboa.pt/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Tested-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Link: https://lore.kernel.org/r/0-v1-da7b8b3d57eb+ce-iommu_terga_sizeof_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

block: Fix bvec_set_folio() for very large folios

Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if
we attempt to add a folio that is larger than 4GB, we'll silently
truncate the offset and len. Widen the parameters to size_t, assert
that the length is less than 4GB and set the first page that contains
the interesting data rather than the first page of the folio.

Fixes: 26db5ee15851 (block: add a bvec_set_folio helper)
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP

It is possible for physically contiguous folios to have discontiguous
struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not.
This is correctly handled by folio_page_idx(), so remove this open-coded
implementation.

Fixes: 640d1930bef4 (block: Add bio_for_each_folio_all())
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine

Removes MSI-X from the interrupt request path, as the DMA engine used by
the SPI controller does not support MSI-X interrupts.

Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Link: https://patch.msgid.link/20250612023059.71726-1-thangaraj.s@microchip.com
Signed-off-by: Mark Brown <broonie@kernel.org>

Merge tag 'drm-misc-fixes-2025-06-12' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

drm-misc-fixes for v6.16-rc2:
- Fix infinite EPROBE_DEFER loop in vc4 probing.
- Fix amdxdna firmware size.
- mode fixes for meson.
- Kconfig fix for st7171-i2c.
- Fix -EBUSY WARN_ON_ONCE in dma-buf
- Use dma_sync_sgtable_for_cpu in udmabuf.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/62c06195-8bc1-4dae-8777-e86d94e4d9d9@linux.intel.com

mm: add mmap_prepare() compatibility layer for nested file systems

Nested file systems, that is those which invoke call_mmap() within their
own f_op->mmap() handlers, may encounter underlying file systems which
provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83
("mm: introduce new .mmap_prepare() file callback").

We have a chicken-and-egg scenario here - until all file systems are
converted to using .mmap_prepare(), we cannot convert these nested
handlers, as we can't call f_op->mmap from an .mmap_prepare() hook.

So we have to do it the other way round - invoke the .mmap_prepare() hook
from an .mmap() one.

in order to do so, we need to convert VMA state into a struct vm_area_desc
descriptor, invoking the underlying file system's f_op->mmap_prepare()
callback passing a pointer to this, and then setting VMA state accordingly
and safely.

This patch achieves this via the compat_vma_mmap_prepare() function, which
we invoke from call_mmap() if f_op->mmap_prepare() is specified in the
passed in file pointer.

We place the fundamental logic into mm/vma.h where VMA manipulation
belongs. We also update the VMA userland tests to accommodate the
changes.

The compat_vma_mmap_prepare() function and its associated machinery is
temporary, and will be removed once the conversion of file systems is
complete.

We carefully place this code so it can be used with CONFIG_MMU and also
with cutting edge nommu silicon.

[akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build]
[lorenzo.stoakes@oracle.com: remove unused declarations]
Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local
Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com
Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback").
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Merge tag 'drm-xe-fixes-2025-06-12' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Fix regression disallowing 64K SVM migration (Maarten)
- Use a bounce buffer for WA BB (Lucas)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/aEsBQoh5Si3ouPgE@fedora

Merge tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux

Pull bitmap fix from Yury Norov:
"Fix for __GENMASK() and __GENMASK_ULL() in UAPI"

* tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux:
uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again

smb: improve directory cache reuse for readdir operations

Currently, cached directory contents were not reused across subsequent
'ls' operations because the cache validity check relied on comparing
the ctx pointer, which changes with each readdir invocation. As a
result, the cached dir entries was not marked as valid and the cache was
not utilized for subsequent 'ls' operations.

This change uses the file pointer, which remains consistent across all
readdir calls for a given directory instance, to associate and validate
the cache. As a result, cached directory contents can now be
correctly reused, improving performance for repeated directory listings.

Performance gains with local windows SMB server:

Without the patch and default actimeo=1:
1000 directory enumeration operations on dir with 10k files took 135.0s

With this patch and actimeo=0:
1000 directory enumeration operations on dir with 10k files took just 5.1s

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix perf regression with deferred closes

Customer reported that one of their applications started failing to
open files with STATUS_INSUFFICIENT_RESOURCES due to NetApp server
hitting the maximum number of opens to same file that it would allow
for a single client connection.

It turned out the client was failing to reuse open handles with
deferred closes because matching ->f_flags directly without masking
off O_CREAT|O_EXCL|O_TRUNC bits first broke the comparision and then
client ended up with thousands of deferred closes to same file.  Those
bits are already satisfied on the original open, so no need to check
them against existing open handles.

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>

#define NR_THREADS      4
#define NR_ITERATIONS   2500
#define TEST_FILE       "/mnt/1/test/dir/foo"

static char buf[64];

static void *worker(void *arg)
{
         int i, j;
         int fd;

         for (i = 0; i < NR_ITERATIONS; i++) {
                 fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_APPEND, 0666);
                 for (j = 0; j < 16; j++)
                         write(fd, buf, sizeof(buf));
                 close(fd);
         }
}

int main(int argc, char *argv[])
{
         pthread_t t[NR_THREADS];
         int fd;
         int i;

         fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_TRUNC, 0666);
         close(fd);
         memset(buf, 'a', sizeof(buf));
         for (i = 0; i < NR_THREADS; i++)
                 pthread_create(&t[i], NULL, worker, NULL);
         for (i = 0; i < NR_THREADS; i++)
                 pthread_join(t[i], NULL);
         return 0;
}

Before patch:

$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1391

After patch:

$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1

Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Pierguido Lambri <plambri@redhat.com>
Fixes: b8ea3b1ff544 ("smb: enable reuse of deferred file handles for write operations")
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

Merge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and wireless.

  Current release - regressions:

   - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD

  Current release - new code bugs:

   - eth: airoha: correct enable mask for RX queues 16-31

   - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
     disappears under traffic

   - ipv6: move fib6_config_validate() to ip6_route_add(), prevent
     invalid routes

  Previous releases - regressions:

   - phy: phy_caps: don't skip better duplex match on non-exact match

   - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0

   - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused
     transient packet loss, exact reason not fully understood, yet

  Previous releases - always broken:

   - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)

   - sched: sfq: fix a potential crash on gso_skb handling

   - Bluetooth: intel: improve rx buffer posting to avoid causing issues
     in the firmware

   - eth: intel: i40e: make reset handling robust against multiple
     requests

   - eth: mlx5: ensure FW pages are always allocated on the local NUMA
     node, even when device is configure to 'serve' another node

   - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
     prevent kernel crashes

   - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
     for 3 sec if fw_stats_done is not set"

* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
  net: ethtool: Don't check if RSS context exists in case of context 0
  af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
  ipv6: Move fib6_config_validate() to ip6_route_add().
  net: drv: netdevsim: don't napi_complete() from netpoll
  net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
  veth: prevent NULL pointer dereference in veth_xdp_rcv
  net_sched: remove qdisc_tree_flush_backlog()
  net_sched: ets: fix a race in ets_qdisc_change()
  net_sched: tbf: fix a race in tbf_change()
  net_sched: red: fix a race in __red_change()
  net_sched: prio: fix a race in prio_tune()
  net_sched: sch_sfq: reject invalid perturb period
  net: phy: phy_caps: Don't skip better duplex macth on non-exact match
  MAINTAINERS: Update Kuniyuki Iwashima's email address.
  selftests: net: add test case for NAT46 looping back dst
  net: clear the dst when changing skb protocol
  net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
  net/mlx5e: Fix leak of Geneve TLV option object
  net/mlx5: HWS, make sure the uplink is the last destination
  ...