git.ipfire.org Git - thirdparty/kernel/linux.git/log

RDMA/mlx5: Use IB set_netdev and get_netdev functions

The IB layer provides a common interface to store and get net
devices associated to an IB device port (ib_device_set_netdev()
and ib_device_get_netdev()).
Previously, mlx5_ib stored and managed the associated net devices
internally.

Replace internal net device management in mlx5_ib with
ib_device_set_netdev() when attaching/detaching a net device and
ib_device_get_netdev() when retrieving the net device.

Export ib_device_get_netdev().

For mlx5 representors/PFs/VFs and lag creation we replace the netdev
assignments with the IB set/get netdev functions.

In active-backup mode lag the active slave net device is stored in the
lag itself. To assure the net device stored in a lag bond IB device is
the active slave we implement the following:
- mlx5_core: when modifying the slave of a bond we send the internal driver event
MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE.
- mlx5_ib: when catching the event call ib_device_set_netdev()

This patch also ensures the correct IB events are sent in switchdev lag.

While at it, when in multiport eswitch mode, only a single IB device is
created for all ports. The said IB device will receive all netdev events
of its VFs once loaded, thus to avoid overwriting the mapping of PF IB
device to PF netdev, ignore NETDEV_REGISTER events if the ib device has
already been mapped to a netdev.

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-6-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/device: Remove optimization in ib_device_get_netdev()

The caller of ib_device_get_netdev() relies on its result to accurately
match a given netdev with the ib device associated netdev.

ib_device_get_netdev returns NULL when the IB device associated
netdev is unregistering, preventing the caller of matching netdevs properly.

Thus, remove this optimization and return the netdev even if
it is undergoing unregistration, allowing matching by the caller.

This change ensures proper netdev matching and reference count handling
by the caller of ib_device_get_netdev/ib_device_set_netdev API.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-5-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation

phys_port_cnt of the IB device must be initialized before calling
ib_device_set_netdev().

Previously, phys_port_cnt was initialized in the mlx5_ib init function.
Remove this initialization to allow setting it separately, providing
the flexibility to call ib_device_set_netdev before registering the
IB device.

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-4-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Obtain upper net device only when needed

Report the upper device's state as the RDMA port state only in RoCE LAG or
switchdev LAG.

Fixes: 27f9e0ccb6da ("net/mlx5: Lag, Add single RDMA device in multiport mode")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-3-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Check RoCE LAG status before getting netdev

Check if RoCE LAG is active before calling the LAG layer for netdev.
This clarifies if LAG is active. No behavior changes with this patch.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-2-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Consider the query_vuid cap for data_direct

Consider also the query_vuid cap before enabling the data_direct
functionality.

This may prevent a syndrome from the FW in case the query_vuid command
is not supported. (e.g. migratable VF)

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@nvidia.com>
Link: https://patch.msgid.link/274c4f6f1ac0b1078243dd296695a49dbe58e7d1.1725907637.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Handle memory scheme ODP capabilities

When running over new FW that supports the new memory scheme ODP, set
the cap in the FW to signal the FW we are working in the new scheme.

In the memory scheme ODP the per_transport_service capabilities are RO
for the driver so we skip their setting.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-9-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Add implicit MR handling to ODP memory scheme

Implicit MRs in ODP memory scheme require allocating a private null mkey
and assigning the mkey and va differently in the KSM mkey.
The page faults are received on the null mkey so we also add storing the
null mkey in the odp_mkey xarray.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-8-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Add handling for memory scheme page fault events

The memory scheme page fault event is a new approch in handling page fault
on mkeys using the on-demand-paging feature.
The major shift in handling the page fault in this scheme is that the HW
is taking responsibilty for parsing the faulted mkey instead of the
previous approach where the driver would read and parse the wqes and
query the mkeys to get to the direct mkey that we need to handle.

Therefore, the event we get from FW in this scheme will contain the
direct mkey and address we need to handle and require much less work
from driver.

Additionally, to optimize performance, the FW can generate the event on
a memory area that is larger than the faulted memory operation is
requiring, to 'prefetch' memory that is around it and will likely be
used soon.

Unlike previous types of page fault, the memory page scheme fault does
not always require a resume command after handling the page fault as the FW
can post multiple events on same mkey and will set the 'last' flag only on
the page fault that requires the resume command.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-7-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Split ODP mkey search logic

Split the search for the ODP mkey when handling an rdma type page fault to
a helper function, later to be used in other page fault types.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-6-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Enforce umem boundaries for explicit ODP page faults

The new memory scheme page faults are requesting the driver to fetch
additinal pages to the faulted memory access.
This is done in order to prefetch pages before and after the area that
got the page fault, assuming this will reduce the total amount of page
faults.

The driver should ensure it handles only the pages that are within the
umem range.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-5-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Add new ODP memory scheme eqe format

Add new fields to support the new memory scheme page fault and extend
the token field to u64 as in the new scheme the token is 48 bit.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-4-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Expose HW bits for Memory scheme ODP

Expose IFC bits to support the new memory scheme on demand paging.
Change the macro reading odp capabilities to be able to read from the
new IFC layout and align the code in upper layers to be compiled.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-3-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Expand mkey page size to support 6 bits

Protect the usage of the 6th bit with the relevant capability to ensure
we are using the new page sizes with FW that supports the bit extension.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-2-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix restricted __le16 degrades to integer issue

Fix sparse warnings: restricted __le16 degrades to integer.

Fixes: 5a87279591a1 ("RDMA/hns: Support hns HW stats")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409080508.g4mNSLwy-lkp@intel.com/
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240909065331.3950268-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/qib: Remove unused declarations in header file

The definition of qib_rc_rnr_retry() has been removed since
commit b4238e70579c ("IB/qib: Use new rdmavt timers"). Also, the definition
of mr_rcu_callback() has been remove since commit 7c2e11fe2dbe ("IB/qib:
Remove qp and mr functionality from qib"). So, let's remove the unused
declartions.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Link: https://patch.msgid.link/20240909121408.80079-3-zhangzekun11@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/iser: Remove unused declaration in header file

The definition of iser_finalize_rdma_unaligned_sg() has been removed
since commit dd0107a08996 ("IB/iser: set block queue_virt_boundary").
Let's remove the unused declaration in header file.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Link: https://patch.msgid.link/20240909121408.80079-2-zhangzekun11@huawei.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Optimize hem allocation performance

When allocating MTT hem, for each hop level of each hem that is being
allocated, the driver iterates the hem list to find out whether the
bt page has been allocated in this hop level. If not, allocate a new
one and splice it to the list. The time complexity is O(n^2) in worst
cases.

Currently the allocation for-loop uses 'unit' as the step size. This
actually has taken into account the reuse of last-hop-level MTT bt
pages by multiple buffer pages. Thus pages of last hop level will
never have been allocated, so there is no need to iterate the hem list
in last hop level.

Removing this unnecessary iteration can reduce the time complexity to
O(n).

Fixes: 38389eaa4db1 ("RDMA/hns: Add mtr support for mixed multihop addressing")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix 1bit-ECC recovery address in non-4K OS

The 1bit-ECC recovery address read from HW only contain bits 64:12, so
it should be fixed left-shifted 12 bits when used.

Currently, the driver will shift the address left by PAGE_SHIFT when
used, which is wrong in non-4K OS.

Fixes: 2de949abd6a5 ("RDMA/hns: Recover 1bit-ECC error of RAM on chip")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-8-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix VF triggering PF reset in abnormal interrupt handler

In abnormal interrupt handler, a PF reset will be triggered even if
the device is a VF. It should be a VF reset.

Fixes: 2b9acb9a97fe ("RDMA/hns: Add the process of AEQ overflow for hip08")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix spin_unlock_irqrestore() called with IRQs enabled

Fix missuse of spin_lock_irq()/spin_unlock_irq() when
spin_lock_irqsave()/spin_lock_irqrestore() was hold.

This was discovered through the lock debugging, and the corresponding
log is as follows:

raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 96 PID: 2074 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x30/0x40
...
Call trace:
warn_bogus_irq_restore+0x30/0x40
_raw_spin_unlock_irqrestore+0x84/0xc8
add_qp_to_list+0x11c/0x148 [hns_roce_hw_v2]
hns_roce_create_qp_common.constprop.0+0x240/0x780 [hns_roce_hw_v2]
hns_roce_create_qp+0x98/0x160 [hns_roce_hw_v2]
create_qp+0x138/0x258
ib_create_qp_kernel+0x50/0xe8
create_mad_qp+0xa8/0x128
ib_mad_port_open+0x218/0x448
ib_mad_init_device+0x70/0x1f8
add_client_context+0xfc/0x220
enable_device_and_get+0xd0/0x140
ib_register_device.part.0+0xf4/0x1c8
ib_register_device+0x34/0x50
hns_roce_register_device+0x174/0x3d0 [hns_roce_hw_v2]
hns_roce_init+0xfc/0x2c0 [hns_roce_hw_v2]
__hns_roce_hw_v2_init_instance+0x7c/0x1d0 [hns_roce_hw_v2]
hns_roce_hw_v2_init_instance+0x9c/0x180 [hns_roce_hw_v2]

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix the overflow risk of hem_list_calc_ba_range()

The max value of 'unit' and 'hop_num' is 2^24 and 2, so the value of
'step' may exceed the range of u32. Change the type of 'step' to u64.

Fixes: 38389eaa4db1 ("RDMA/hns: Add mtr support for mixed multihop addressing")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix Use-After-Free of rsv_qp on HIP08

Currently rsv_qp is freed before ib_unregister_device() is called
on HIP08. During the time interval, users can still dereg MR and
rsv_qp will be used in this process, leading to a UAF. Move the
release of rsv_qp after calling ib_unregister_device() to fix it.

Fixes: 70f92521584f ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Don't modify rq next block addr in HIP09 QPC

The field 'rq next block addr' in QPC can be updated by driver only
on HIP08. On HIP09 HW updates this field while driver is not allowed.

Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Fix the max WQE size for static WQE support

When variable size WQE is supported, max_qp_sges reported
is more than 6. For devices that supports variable size WQE,
the Send WQE size calculation is wrong when an an older library
that doesn't support variable size WQE is used.

Set the WQE size to 128 when static WQE is supported.

Fixes: de1d364c3815 ("RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725444253-13221-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Fix the compatibility flag for variable size WQE

For older adapters that doesn't support variable size WQE,
driver is wrongly reporting that variable WQE is supported,
when the latest library is used.

Report the variable WQE capability only if the driver supports
it.

Fixes: 10a104c0debb ("RDMA/bnxt_re: Enable variable size WQEs for user space applications")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725444253-13221-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Fix MR cache temp entries cleanup

Fix the cleanup of the temp cache entries that are dynamically created
in the MR cache.

The cleanup of the temp cache entries is currently scheduled only when a
new entry is created. Since in the cleanup of the entries only the mkeys
are destroyed and the cache entry stays in the cache, subsequent
registrations might reuse the entry and it will eventually be filled with
new mkeys without cleanup ever getting scheduled again.

On workloads that register and deregister MRs with a wide range of
properties we see the cache ends up holding many cache entries, each
holding the max number of mkeys that were ever used through it.

Additionally, as the cleanup work is scheduled to run over the whole
cache, any mkey that is returned to the cache after the cleanup was
scheduled will be held for less than the intended 30 seconds timeout.

Solve both issues by dropping the existing remove_ent_work and reusing
the existing per-entry work to also handle the temp entries cleanup.

Schedule the work to run with a 30 seconds delay every time we push an
mkey to a clean temp entry.
This ensures the cleanup runs on each entry only 30 seconds after the
first mkey was pushed to an empty entry.

As we have already been distinguishing between persistent and temp entries
when scheduling the cache_work_func, it is not being scheduled in any
other flows for the temp entries.

Another benefit from moving to a per-entry cleanup is we now not
required to hold the rb_tree mutex, thus enabling other flow to run
concurrently.

Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache

When searching the MR cache for suitable cache entries, don't use mkeys
larger than twice the size required for the MR.
This should ensure the usage of mkeys closer to the minimal required size
and reduce memory waste.

On driver init we create entries for mkeys with clear attributes and
powers of 2 sizes from 4 to the max supported size.
This solves the issue for anyone using mkeys that fit these
requirements.

In the use case where an MR is registered with different attributes,
like an access flag we can't UMR, we'll create a new cache entry to store
it upon dereg.
Without this fix, any later registration with same attributes and smaller
size will use the newly created cache entry and it's mkeys, disregarding
the memory waste of using mkeys larger than required.

For example, one worst-case scenario can be when registering and
deregistering a 1GB mkey with ATS enabled which will cause the creation of
a new cache entry to hold those type of mkeys. A user registering a 4k MR
with ATS will end up using the new cache entry and an mkey that can
support a 1GB MR, thus wasting x250k memory than actually needed in the HW.

Additionally, allow all small registration to use the smallest size
cache entry that is initialized on driver load even if size is larger
than twice the required size.

Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Fix counter update on MR cache mkey creation

After an mkey is created, update the counter for pending mkeys before
reshceduling the work that is filling the cache.

Rescheduling the work with a full MR cache entry and a wrong 'pending'
counter will cause us to miss disabling the fill_to_high_water flag.
Thus leaving the cache full but with an indication that it's still
needs to be filled up to it's full size (2 * limit).
Next time an mkey will be taken from the cache, we'll unnecessarily
continue the process of filling the cache to it's full size.

Fixes: 57e7071683ef ("RDMA/mlx5: Implement mkeys management via LIFO queue")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/0f44f462ba22e45f72cb3d0ec6a748634086b8d0.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Drop redundant work canceling from clean_keys()

The canceling of dealyed work in clean_keys() is a leftover from years
back and was added to prevent races in the cleanup process of MR cache.
The cleanup process was rewritten a few years ago and the canceling of
delayed work and flushing of workqueue was added before the call to
clean_keys().

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/943d21f5a9dba7b98a3e1d531e3561ffe9745d71.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/erdma: Return QP state in erdma_query_qp

Fix qp_state and cur_qp_state to return correct values in
struct ib_qp_attr.

Fixes: 155055771704 ("RDMA/erdma: Add verbs implementation")
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-4-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/erdma: Add disassociate ucontext support

All IO pages mapped to user space are handled by rdma_user_mmap_io,
so add empty stub for disassociate ucontext.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/erdma: Refactor the initialization and destruction of EQ

We extracted the common parts of the initialization/destruction
process to make the code cleaner.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Enable ATS when allocating kernel MRs

When creating kernel MRs, it is not definitive whether they will be used
for peer-to-peer transactions or for other usecases, since address
mapping is performed only after the MR is created.

Since peer-to-peer transactions benefit significantly from ATS
performance-wise, enable ATS on newly-allocated kernel MRs when
supported.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@nvidia.com>
Link: https://patch.msgid.link/fafd4c9f14cf438d2882d88649c2947e1d05d0b4.1725273403.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/core: Fix ib_cache_setup_one error flow cleanup

When ib_cache_update return an error, we exit ib_cache_setup_one
instantly with no proper cleanup, even though before this we had
already successfully done gid_table_setup_one, that results in
the kernel WARN below.

Do proper cleanup using gid_table_cleanup_one before returning
the err in order to fix the issue.

WARNING: CPU: 4 PID: 922 at drivers/infiniband/core/cache.c:806 gid_table_release_one+0x181/0x1a0
Modules linked in:
CPU: 4 UID: 0 PID: 922 Comm: c_repro Not tainted 6.11.0-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:gid_table_release_one+0x181/0x1a0
Code: 44 8b 38 75 0c e8 2f cb 34 ff 4d 8b b5 28 05 00 00 e8 23 cb 34 ff 44 89 f9 89 da 4c 89 f6 48 c7 c7 d0 58 14 83 e8 4f de 21 ff <0f> 0b 4c 8b 75 30 e9 54 ff ff ff 48 8    3 c4 10 5b 5d 41 5c 41 5d 41
RSP: 0018:ffffc90002b835b0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff811c8527
RDX: 0000000000000000 RSI: ffffffff811c8534 RDI: 0000000000000001
RBP: ffff8881011b3d00 R08: ffff88810b3abe00 R09: 205d303839303631
R10: 666572207972746e R11: 72746e6520444947 R12: 0000000000000001
R13: ffff888106390000 R14: ffff8881011f2110 R15: 0000000000000001
FS:  00007fecc3b70800(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000340 CR3: 000000010435a001 CR4: 00000000003706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? show_regs+0x94/0xa0
? __warn+0x9e/0x1c0
? gid_table_release_one+0x181/0x1a0
? report_bug+0x1f9/0x340
? gid_table_release_one+0x181/0x1a0
? handle_bug+0xa2/0x110
? exc_invalid_op+0x31/0xa0
? asm_exc_invalid_op+0x16/0x20
? __warn_printk+0xc7/0x180
? __warn_printk+0xd4/0x180
? gid_table_release_one+0x181/0x1a0
ib_device_release+0x71/0xe0
? __pfx_ib_device_release+0x10/0x10
device_release+0x44/0xd0
kobject_put+0x135/0x3d0
put_device+0x20/0x30
rxe_net_add+0x7d/0xa0
rxe_newlink+0xd7/0x190
nldev_newlink+0x1b0/0x2a0
? __pfx_nldev_newlink+0x10/0x10
rdma_nl_rcv_msg+0x1ad/0x2e0
rdma_nl_rcv_skb.constprop.0+0x176/0x210
netlink_unicast+0x2de/0x400
netlink_sendmsg+0x306/0x660
__sock_sendmsg+0x110/0x120
____sys_sendmsg+0x30e/0x390
___sys_sendmsg+0x9b/0xf0
? kstrtouint+0x6e/0xa0
? kstrtouint_from_user+0x7c/0xb0
? get_pid_task+0xb0/0xd0
? proc_fail_nth_write+0x5b/0x140
? __fget_light+0x9a/0x200
? preempt_count_add+0x47/0xa0
__sys_sendmsg+0x61/0xd0
do_syscall_64+0x50/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 1901b91f9982 ("IB/core: Fix potential NULL pointer dereference in pkey cache")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/79137687d829899b0b1c9835fcb4b258004c439a.1725273354.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/mlx5: Fix UMR pd cleanup on error flow of driver init

The cited commit moves the pd allocation from function
mlx5r_umr_resource_cleanup() to a new function mlx5r_umr_cleanup().
So the fix in commit [1] is broken. In error flow, will hit panic [2].

Fix it by checking pd pointer to avoid panic if it is NULL;

[1] RDMA/mlx5: Fix UMR cleanup on error flow of driver init
[2]
[  347.567063] infiniband mlx5_0: Couldn't register device with driver model
[  347.591382] BUG: kernel NULL pointer dereference, address: 0000000000000020
[  347.593438] #PF: supervisor read access in kernel mode
[  347.595176] #PF: error_code(0x0000) - not-present page
[  347.596962] PGD 0 P4D 0
[  347.601361] RIP: 0010:ib_dealloc_pd_user+0x12/0xc0 [ib_core]
[  347.604171] RSP: 0018:ffff888106293b10 EFLAGS: 00010282
[  347.604834] RAX: 0000000000000000 RBX: 000000000000000e RCX: 0000000000000000
[  347.605672] RDX: ffff888106293ad0 RSI: 0000000000000000 RDI: 0000000000000000
[  347.606529] RBP: 0000000000000000 R08: ffff888106293ae0 R09: ffff888106293ae0
[  347.607379] R10: 0000000000000a06 R11: 0000000000000000 R12: 0000000000000000
[  347.608224] R13: ffffffffa0704dc0 R14: 0000000000000001 R15: 0000000000000001
[  347.609067] FS:  00007fdc720cd9c0(0000) GS:ffff88852c880000(0000) knlGS:0000000000000000
[  347.610094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  347.610727] CR2: 0000000000000020 CR3: 0000000103012003 CR4: 0000000000370eb0
[  347.611421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  347.612113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  347.612804] Call Trace:
[  347.613130]  <TASK>
[  347.613417]  ? __die+0x20/0x60
[  347.613793]  ? page_fault_oops+0x150/0x3e0
[  347.614243]  ? free_msg+0x68/0x80 [mlx5_core]
[  347.614840]  ? cmd_exec+0x48f/0x11d0 [mlx5_core]
[  347.615359]  ? exc_page_fault+0x74/0x130
[  347.615808]  ? asm_exc_page_fault+0x22/0x30
[  347.616273]  ? ib_dealloc_pd_user+0x12/0xc0 [ib_core]
[  347.616801]  mlx5r_umr_cleanup+0x23/0x90 [mlx5_ib]
[  347.617365]  mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x36/0x40 [mlx5_ib]
[  347.618025]  __mlx5_ib_add+0x96/0xd0 [mlx5_ib]
[  347.618539]  mlx5r_probe+0xe9/0x310 [mlx5_ib]
[  347.619032]  ? kernfs_add_one+0x107/0x150
[  347.619478]  ? __mlx5_ib_add+0xd0/0xd0 [mlx5_ib]
[  347.619984]  auxiliary_bus_probe+0x3e/0x90
[  347.620448]  really_probe+0xc5/0x3a0
[  347.620857]  __driver_probe_device+0x80/0x160
[  347.621325]  driver_probe_device+0x1e/0x90
[  347.621770]  __driver_attach+0xec/0x1c0
[  347.622213]  ? __device_attach_driver+0x100/0x100
[  347.622724]  bus_for_each_dev+0x71/0xc0
[  347.623151]  bus_add_driver+0xed/0x240
[  347.623570]  driver_register+0x58/0x100
[  347.623998]  __auxiliary_driver_register+0x6a/0xc0
[  347.624499]  ? driver_register+0xae/0x100
[  347.624940]  ? 0xffffffffa0893000
[  347.625329]  mlx5_ib_init+0x16a/0x1e0 [mlx5_ib]
[  347.625845]  do_one_initcall+0x4a/0x2a0
[  347.626273]  ? gcov_event+0x2e2/0x3a0
[  347.626706]  do_init_module+0x8a/0x260
[  347.627126]  init_module_from_file+0x8b/0xd0
[  347.627596]  __x64_sys_finit_module+0x1ca/0x2f0
[  347.628089]  do_syscall_64+0x4c/0x100

Fixes: 638420115cc4 ("IB/mlx5: Create UMR QP just before first reg_mr occurs")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://patch.msgid.link/778c40c60287992da5d6ec92bb07b67f7bb5e6ef.1725273295.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Add support for MR Relaxed Ordering

Some of the adapters support Relaxed Ordering for the MRs.
Driver queries support for Memory region relax ordering support from
firmware and set relax ordering bit in REGISTER_MR request, if the users
request for the support. Also, this is supported only if the PCIe device
has enabled relaxed ordering attribute.

Reviewed-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Vijay Kumar Mandadapu <vijaykumar.mandadapu@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Avoid an extra hwrm per MR creation

Firmware now have a new mr registration command where
both MR allocation and registration can be done in a
single hwrm command. Driver has to issue this new hwrm
command whenever the support flag is set. This reduces
the number of hwrm issued per MR creation and speed up
the MR creation.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Rename a variable

Renaming flags to access_flags for clarity.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Update HW interface headers

Updating the HW structures for the pcie relax ordering support.
Newly added interface structures will be used in the
followup patch.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mana_ib: use the correct page size for mapping user-mode doorbell page

When mapping doorbell page from user-mode, the driver should use the system
page size as this memory is allocated via mmap() from user-mode.

Cc: stable@vger.kernel.org
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1725030993-16213-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mana_ib: use the correct page table index based on hardware page size

MANA hardware uses 4k page size. When calculating the page table index,
it should use the hardware page size, not the system page size.

Cc: stable@vger.kernel.org
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1725030993-16213-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Share a page to expose per SRQ info with userspace

Gen P7 adapters needs to share a toggle bits information received
in kernel driver with the user space. User space needs this
info to arm the SRQ.

User space application can get this page using the
UAPI routines. Library will mmap this page and get the
toggle bits to be used in the next ARM Doorbell.

Uses a hash list to map the SRQ structure from the SRQ ID.
SRQ structure is retrieved from the hash list while the
library calls the UAPI routine to get the toggle page
mapping. Currently the full page is mapped per SRQ. This
can be optimized to enable multiple SRQs from the same
application share the same page and different offsets
in the page

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Refactor the BNXT_RE_METHOD_GET_TOGGLE_MEM method

Refactor the code in this function to have common code.
This is used in subsequent patches.

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Get the toggle bits from SRQ events

SRQ arming requires the toggle bits received from hardware.
Get the toggle bits from SRQ notification for the
gen p7 adapters. This value will be zero for the older adapters.

Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rdmavt: Convert to use ERR_CAST()

As opposed to open-code, using the ERR_CAST macro clearly indicates that
this is a pointer to an error value and a type conversion was performed.

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Link: https://patch.msgid.link/20240828082720.33231-1-shenlichuan@vivo.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/cxgb4: Remove unused declarations

Since commit be4c9bad9d0e ("MAINTAINERS: Add cxgb4 and iw_cxgb4 entries")
c4iw_post_terminate() declaration is not used anymore.

And other declarations were never implemented since introduction in
commit cfdda9d76436 ("RDMA/cxgb4: Add driver for Chelsio T4 RNIC").

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20240824091629.3659565-1-yuehaibing@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Remove an extra space

No functional changes.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Alexei Pastuchov <alexei.pastuchov@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-12-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Do local invalidate after write io completion

Switch local invalidate after write io completion avoid the
chain usage of WR, this fixed the local protection error on
LOCAL INVALIDATE WR.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-11-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: Register ib event handler

Use ib_register_event_handler() to register event handlers for both
client and server side. For now, all those handlers do, is to print
type of incoming event.

Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-10-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-srv: Avoid null pointer deref during path establishment

For RTRS path establishment, RTRS client initiates and completes con_num
of connections. After establishing all its connections, the information
is exchanged between the client and server through the info_req message.
During this exchange, it is essential that all connections have been
established, and the state of the RTRS srv path is CONNECTED.

So add these sanity checks, to make sure we detect and abort process in
error scenarios to avoid null pointer deref.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-9-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Print request type for errors

Extend the output to print also the request type.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-8-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Reset cid to con_num - 1 to stay in bounds

In the function init_conns(), after the create_con() and create_cm() for
loop if something fails. In the cleanup for loop after the destroy tag, we
access out of bound memory because cid is set to clt_path->s.con_num.

This commits resets the cid to clt_path->s.con_num - 1, to stay in bounds
in the cleanup loop later.

Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality")
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Reuse need_inval from mr

mr has a member need_inval, which can be used to indicate if
local invalidate is needed, switch to it and remove need_inv
from rtrs_clt_io_req.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-6-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: Reset hb_missed_cnt after receiving other traffic from peer

Reset hb_missed_cnt after receiving traffic from other peer, so
hb is more robust again high load on host or network.

Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-5-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Rate limit errors in IO path

On network errors, a large number of these logs are printed due to all the
inflight IOs, rate limit them so they do not clutter kernel log.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-4-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Fix need_inv setting in error case

In some cases need_inv can be missed for write requests, additionally
driver has to handle missing invalidates for write requests. While at
it, remove the else case from write invalidate path as it is possible
to reach there.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-3-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: For HB error add additional clt/srv specific logging

In case of HB error, we need to know the specific path on which it
happened, for better debugging. Since the clt/srv path structures are not
available in rtrs.c, it needs to be done in the individual HB error
handler.

This commit add those loging. A sample kernel log output after this commit:

rtrs_core L357: <blya>: HB missed max reached.
rtrs_server L717: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x
.
.
rtrs_core L357: <blya>: HB missed max reached.
rtrs_client L1519: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-2-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Merge branch 'bnxt_re_variable_wqes' into rdma.git for-next

Selvin Xavier says:

=============
Enable the Variable size Work Queue entry support for Gen P7
adapters. This would help in the better utilization of the queue memory
and pci bandwidth due to the smaller send queue Work entries.
=============

Based on v6.11-rc5 for dependencies.

* bnxt_re_variable_wqes: (829 commits)
  RDMA/bnxt_re: Enable variable size WQEs for user space applications
  RDMA/bnxt_re: Handle variable WQE support for user applications
  RDMA/bnxt_re: Fix the table size for PSN/MSN entries
  RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs
  RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters
  Linux 6.11-rc5
  ...

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/bnxt_re: Enable variable size WQEs for user space applications

Add backward compatibility code to enable variable size WQEs only if the
user lib supports it.

Link: https://patch.msgid.link/r/1724042847-1481-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/bnxt_re: Handle variable WQE support for user applications

User library calculates the number of slots required for user applications
and it can pass that information to the driver. Driver can use this value
and update the HW directly. This mechanism is currently used only for the
newly introduced variable size WQEs.

Extend the bnxt_re_qp_req structure to pass the Send Queue slot count.
Reorganize the code to get the sq_slots before initializing the Send Queue
attributes.

Link: https://patch.msgid.link/r/1724042847-1481-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/bnxt_re: Fix the table size for PSN/MSN entries

HW MSN table size is always a power of 2. So the pages should be mapped
accordingly.

Use the power of two calculation while get the number of PSN/MSN entries.

Fixes: 6f6bfbc595fb ("RDMA/bnxt_re: Expose the MSN table capability for user library")
Link: https://patch.msgid.link/r/1724042847-1481-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs

While reporting the completions, SQ Work Queue index is required to
identify the WQE that generated the completions. In variable WQE mode, FW
returns the slot index for Error completions. Driver need to walk through
the shadow queue between the consumer index and producer index and matches
the slot index returned by FW. If a match is found, the next index of the
shadow queue is the WQE index to be considered for remaining poll_cq loop.

Link: https://patch.msgid.link/r/1724042847-1481-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters

Variable size WQE means that each send Work Queue Entry to HW can use
different WQE sizes as opposed to the static WQE size on the current
devices. Set variable WQE mode for Gen P7 devices. Depth of the Queue will
be a multiple of slot which is 16 bytes. The number of slots should be a
multiple of 256 as per the HW requirement.

Initialize the Software shadow queue to hold requests equal to the number
of slots. Also, do not expose the variable size WQE capability until the
last patch in the series.

Link: https://patch.msgid.link/r/1724042847-1481-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Linux 6.11-rc5

Merge tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:

- assorted syzbot fixes

- some upgrade fixes for old (pre 1.0) filesystems

- fix for moving data off a device that was switched to durability=0
   after data had been written to it.

- nocow deadlock fix

- fix for new rebalance_work accounting

* tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits)
  bcachefs: Fix rebalance_work accounting
  bcachefs: Fix failure to flush moves before sleeping in copygc
  bcachefs: don't use rht_bucket() in btree_key_cache_scan()
  bcachefs: add missing inode_walker_exit()
  bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
  bcachefs: Fix double assignment in check_dirent_to_subvol()
  bcachefs: Fix refcounting in discard path
  bcachefs: Fix compat issue with old alloc_v4 keys
  bcachefs: Fix warning in bch2_fs_journal_stop()
  fs/super.c: improve get_tree() error message
  bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
  bcachefs: Fix replay_now_at() assert
  bcachefs: Fix locking in bch2_ioc_setlabel()
  bcachefs: fix failure to relock in btree_node_fill()
  bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
  bcachefs: unlock_long() before resort in journal replay
  bcachefs: fix missing bch2_err_str()
  bcachefs: fix time_stats_to_text()
  bcachefs: Fix bch2_bucket_gens_init()
  bcachefs: Fix bch2_trigger_alloc assert
  ...

Merge tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

- query directory flex array fix

- fix potential null ptr reference in open

- fix error message in some open cases

- two minor cleanups

* tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd:
  smb/server: update misguided comment of smb2_allocate_rsp_buf()
  smb/server: remove useless assignment of 'file_present' in smb2_open()
  smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open()
  smb/server: fix return value of smb2_open()
  ksmbd: the buffer of smb2 query dir response has at least 1 byte

Merge tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Vasily Gorbik:

- Fix KASLR base offset to account for symbol offsets in the vmlinux
   ELF file, preventing tool breakages like the drgn debugger

- Fix potential memory corruption of physmem_info during kernel
   physical address randomization

- Fix potential memory corruption due to overlap between the relocated
   lowcore and identity mapping by correctly reserving lowcore memory

- Fix performance regression and avoid randomizing identity mapping
   base by default

- Fix unnecessary delay of AP bus binding complete uevent to prevent
   startup lag in KVM guests using AP

* tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/boot: Fix KASLR base offset off by __START_KERNEL bytes
  s390/boot: Avoid possible physmem_info segment corruption
  s390/ap: Refine AP bus bindings complete processing
  s390/mm: Pin identity mapping base to zero
  s390/mm: Prevent lowcore vs identity mapping overlap

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"The important core fix is another tweak to our discard discovery
  issues. The off by 512 in logical block count seems bad, but in fact
  the inline was only ever used in debug prints, which is why no-one
  noticed"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: Do not attempt to configure discard unless LBPME is set
  scsi: MAINTAINERS: Add header files to SCSI SUBSYSTEM
  scsi: ufs: qcom: Add UFSHCD_QUIRK_BROKEN_LSDBS_CAP for SM8550 SoC
  scsi: ufs: core: Add a quirk for handling broken LSDBS field in controller capabilities register
  scsi: core: Fix the return value of scsi_logical_block_count()
  scsi: MAINTAINERS: Update HiSilicon SAS controller driver maintainer

bcachefs: Fix rebalance_work accounting

rebalance_work was keying off of the presence of rebelance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available

Fixes: 20ac515a9cc7 ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix failure to flush moves before sleeping in copygc

This fixes an apparent deadlock - rebalance would get stuck trying to
take nocow locks because they weren't being released by copygc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Merge tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
"Three patches addressing cpuset corner cases"

* tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug
  cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
  cgroup/cpuset: fix panic caused by partcmd_update

Merge tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:
"Nothing too interesting. One patch to remove spurious warning and
  others to address static checker warnings"

* tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Correct declaration of cpu_pwq in struct workqueue_struct
  workqueue: Fix spruious data race in __flush_work()
  workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker
  workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
  workqueue: doc: Fix function name, remove markers

Merge tag 'mips-fixes_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:

- Set correct timer mode on Loongson64

- Only request r4k clockevent interrupt on one CPU

* tag 'mips-fixes_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: cevt-r4k: Don't call get_c0_compare_int if timer irq is installed
MIPS: Loongson64: Set timer mode in cpu-probe

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 kvm fixes from Catalin Marinas:

- Don't drop references on LPIs that weren't visited by the vgic-debug
   iterator

- Cure lock ordering issue when unregistering vgic redistributors

- Fix for misaligned stage-2 mappings when VMs are backed by hugetlb
   pages

- Treat SGI registers as UNDEFINED if a VM hasn't been configured for
   GICv3

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  KVM: arm64: Make ICC_*SGI*_EL1 undef in the absence of a vGICv3
  KVM: arm64: Ensure canonical IPA is hugepage-aligned when handling fault
  KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors
  KVM: arm64: vgic-debug: Don't put unmarked LPIs

Merge tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

- Fix rpcrdma refcounting in xa_alloc

- Fix rpcrdma usage of XA_FLAGS_ALLOC

- Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS

- Fix attribute bitmap decoder to handle a 3rd word

- Add reschedule points when returning delegations to avoid soft lockups

- Fix clearing layout segments in layoutreturn

- Avoid unnecessary rescanning of the per-server delegation list

* tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Avoid unnecessary rescanning of the per-server delegation list
  NFSv4: Fix clearing of layout segments in layoutreturn
  NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations
  nfs: fix bitmap decoder to handle a 3rd word
  nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS
  rpcrdma: Trace connection registration and unregistration
  rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1
  rpcrdma: Device kref is over-incremented on error from xa_alloc

Merge tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- fix refcount leak (can cause rmmod fail)

- fix byte range locking problem with cached reads

- fix for mount failure if reparse point unrecognized

- minor typo

* tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock
  smb: client: ignore unhandled reparse tags
  smb3: fix problem unloading module due to leaked refcount on shutdown
  smb3: fix broken cached reads when posix locks

Merge tag 'input-for-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

- a tweak to uinput interface to reject requests with abnormally large
   number of slots. 100 slots/contacts should be enough for real devices

- support for FocalTech FT8201 added to the edt-ft5x06 driver

- tweaks to i8042 to handle more devices that have issue with its
   emulation

- Synaptics touchpad switched to native SMbus/RMI mode on HP Elitebook
   840 G2

- other minor fixes

* tag 'input-for-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: himax_hx83112b - fix incorrect size when reading product ID
  Input: i8042 - use new forcenorestore quirk to replace old buggy quirk combination
  Input: i8042 - add forcenorestore quirk to leave controller untouched even on s3
  Input: i8042 - add Fujitsu Lifebook E756 to i8042 quirk table
  Input: uinput - reject requests with unreasonable number of slots
  Input: edt-ft5x06 - add support for FocalTech FT8201
  dt-bindings: input: touchscreen: edt-ft5x06: Document FT8201 support
  Input: adc-joystick - fix optional value handling
  Input: synaptics - enable SMBus for HP Elitebook 840 G2
  Input: ads7846 - ratelimit the spi_sync error message

Merge tag 'drm-fixes-2024-08-24' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly fixes. xe and msm are the major groups, with
  amdgpu/i915/nouveau having smaller bits. xe has a bunch of hw
  workaround fixes that were found to be missing, so that is why there
  are a bunch of scattered fixes, and one larger one. But overall size
  doesn't look too out of the ordinary.

  msm:
   - virtual plane fixes:
      - drop yuv on hw where not supported
      - csc vs yuv format fix
      - rotation fix
   - fix fb cleanup on close
   - reset phy before link training
   - fix visual corruption at 4K
   - fix NULL ptr crash on hotplug
   - simplify debug macros
   - sc7180 fix
   - adreno firmware name error path fix

  amdgpu:
   - GFX10 firmware loading fix
   - SDMA 5.2 fix
   - Debugfs parameter validation fix
   - eGPU hotplug fix

  i915:
   - fix HDCP timeouts

  nouveau:
   - fix SG_DEBUG crash

  xe:
   - Fix OA format masks which were breaking build with gcc-5
   - Fix opregion leak (Lucas)
   - Fix OA sysfs entry (Ashutosh)
   - Fix VM dma-resv lock (Brost)
   - Fix tile fini sequence (Brost)
   - Prevent UAF around preempt fence (Auld)
   - Fix DGFX display suspend/resume (Maarten)
   - Many Xe/Xe2 critical workarounds (Auld, Ngai-Mint, Bommu, Tejas, Daniele)
   - Fix devm/drmm issues (Daniele)
   - Fix missing workqueue destroy in xe_gt_pagefault (Stuart)
   - Drop HW fence pointer to HW fence ctx (Brost)
   - Free job before xe_exec_queue_put (Brost)"

* tag 'drm-fixes-2024-08-24' of https://gitlab.freedesktop.org/drm/kernel: (35 commits)
  drm/xe: Free job before xe_exec_queue_put
  drm/xe: Drop HW fence pointer to HW fence ctx
  drm/xe: Fix missing workqueue destroy in xe_gt_pagefault
  drm/amdgpu: fix eGPU hotplug regression
  drm/amdgpu: Validate TA binary size
  drm/amdgpu/sdma5.2: limit wptr workaround to sdma 5.2.1
  drm/amdgpu: fixing rlc firmware loading failure issue
  drm/xe/uc: Use devm to register cleanup that includes exec_queues
  drm/xe: use devm instead of drmm for managed bo
  drm/xe/xe2hpg: Add Wa_14021821874
  drm/xe: fix WA 14018094691
  drm/xe/xe2: Add Wa_15015404425
  drm/xe/xe2: Make subsequent L2 flush sequential
  drm/xe/xe2lpg: Extend workaround 14021402888
  drm/xe/xe2lpm: Extend Wa_16021639441
  drm/xe/bmg: implement Wa_16023588340
  drm/xe/oa/uapi: Make bit masks unsigned
  drm/xe/display: Make display suspend/resume work on discrete
  drm/xe: prevent UAF around preempt fence
  drm/xe: Fix tile fini sequence
  ...

Merge tag 'block-6.11-20240823' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- NVMe pull request via Keith
     - Remove unused struct field (Nilay)
     - Fix fabrics keep-alive teardown order (Ming)

- Write zeroes fixes (John)

* tag 'block-6.11-20240823' of git://git.kernel.dk/linux:
  nvme: Remove unused field
  nvme: move stopping keep-alive into nvme_uninit_ctrl()
  block: Drop NULL check in bdev_write_zeroes_sectors()
  block: Read max write zeroes once for __blkdev_issue_write_zeroes()

Merge tag 'io_uring-6.11-20240823' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
"Just a single fix for provided buffer validation"

* tag 'io_uring-6.11-20240823' of git://git.kernel.dk/linux:
io_uring/kbuf: sanitize peek buffer setup

Merge tag 'acpi-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"Fix backlight control on a Dell All In One system where a backlight
  controller board is attached to a UART port and the dell-uart
  backlight driver binds to it, but the backlight is actually controlled
  by other means (Hans de Goede)"

* tag 'acpi-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: video: Add backlight=native quirk for Dell OptiPlex 7760 AIO
  platform/x86: dell-uart-backlight: Use acpi_video_get_backlight_type()
  ACPI: video: Add Dell UART backlight controller detection

Merge tag 'thermal-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull thermal control fixes from Rafael Wysocki:
"These fix error handling in the thermal debug code and OF node
  reference leaks in the thermal OF driver.

  Specifics:

   - Use IS_ERR() in checks of debugfs_create_dir() return value instead
     of checking it against NULL in the thermal debug code (Yang Ruibin)

   - Fix three OF node reference leaks in thermal_of (Krzysztof
     Kozlowski)"

* tag 'thermal-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: of: Fix OF node leak in of_thermal_zone_find() error paths
  thermal: of: Fix OF node leak in thermal_of_zone_register()
  thermal: of: Fix OF node leak in thermal_of_trips_init() error path
  thermal/debugfs: Fix the NULL vs IS_ERR() confusion in debugfs_create_dir()

Merge tag 'mmc-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull mmc fixes from Ulf Hansson:
"MMC core:
   - Fix NULL dereference for mmc_test on allocation failure

  MMC host:
   - dw_mmc: Fix support for deferred probe for biu/ciu clocks
   - mtk-sd: Fix CMD8 support when fragile tuning settings"

* tag 'mmc-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: mmc_test: Fix NULL dereference on allocation failure
  mmc: dw_mmc: allow biu and ciu clocks to defer
  mmc: mtk-sd: receive cmd8 data when hs400 tuning fail

Merge tag 'spi-fix-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A small collection of fixes here, all driver specific and none of them
  too serious. For whatever reason runtime PM seems to have been causing
  a bunch of issues recently"

* tag 'spi-fix-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: pxa2xx: Move PM runtime handling to the glue drivers
  spi: pxa2xx: Do not override dev->platform_data on probe
  spi: spi-fsl-lpspi: limit PRESCALE bit in TCR register
  spi: spi-cadence-quadspi: Fix OSPI NOR failures during system resume
  spi: zynqmp-gqspi: Scale timeout by data size

RDMA/efa: Add support for node guid

Propagate the unique, per device, ID in the device attributes to the
standard node_guid value in IB device.

Link: https://patch.msgid.link/r/20240822171143.2800-1-mrgolin@amazon.com
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Yehuda Yitschak <yehuday@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/iwcm: Fix WARNING:at_kernel/workqueue.c:#check_flush_dependency

In the commit aee2424246f9 ("RDMA/iwcm: Fix a use-after-free related to
destroying CM IDs"), the function flush_workqueue is invoked to flush the
work queue iwcm_wq.

But at that time, the work queue iwcm_wq was created via the function
alloc_ordered_workqueue without the flag WQ_MEM_RECLAIM.

Because the current process is trying to flush the whole iwcm_wq, if
iwcm_wq doesn't have the flag WQ_MEM_RECLAIM, verify that the current
process is not reclaiming memory or running on a workqueue which doesn't
have the flag WQ_MEM_RECLAIM as that can break forward-progress guarantee
leading to a deadlock.

The call trace is as below:

[  125.350876][ T1430] Call Trace:
[  125.356281][ T1430]  <TASK>
[ 125.361285][ T1430] ? __warn (kernel/panic.c:693)
[ 125.367640][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.375689][ T1430] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 125.382505][ T1430] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 125.388987][ T1430] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[ 125.395831][ T1430] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 125.403125][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.410984][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.418764][ T1430] __flush_workqueue (kernel/workqueue.c:3970)
[ 125.426021][ T1430] ? __pfx___might_resched (kernel/sched/core.c:10151)
[ 125.433431][ T1430] ? destroy_cm_id (drivers/infiniband/core/iwcm.c:375) iw_cm
[ 125.441209][ T1430] ? __pfx___flush_workqueue (kernel/workqueue.c:3910)
[ 125.473900][ T1430] ? _raw_spin_lock_irqsave (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
[ 125.473909][ T1430] ? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
[ 125.482537][ T1430] _destroy_id (drivers/infiniband/core/cma.c:2044) rdma_cm
[ 125.495072][ T1430] nvme_rdma_free_queue (drivers/nvme/host/rdma.c:656 drivers/nvme/host/rdma.c:650) nvme_rdma
[ 125.505827][ T1430] nvme_rdma_reset_ctrl_work (drivers/nvme/host/rdma.c:2180) nvme_rdma
[ 125.505831][ T1430] process_one_work (kernel/workqueue.c:3231)
[ 125.515122][ T1430] worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
[ 125.515127][ T1430] ? __pfx_worker_thread (kernel/workqueue.c:3339)
[ 125.531837][ T1430] kthread (kernel/kthread.c:389)
[ 125.539864][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
[ 125.550628][ T1430] ret_from_fork (arch/x86/kernel/process.c:147)
[ 125.558840][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
[ 125.558844][ T1430] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
[  125.566487][ T1430]  </TASK>
[  125.566488][ T1430] ---[ end trace 0000000000000000 ]---

Fixes: aee2424246f9 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
Link: https://patch.msgid.link/r/20240820113336.19860-1-yanjun.zhu@linux.dev
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202408151633.fc01893c-oliver.sang@intel.com
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/rxe: Fix __bth_set_resv6a

__bth_set_resv6a is used to clear BIT [24, 29] of rxe_bth::qpn, the
wrong expression leads other BITs into 1.

Link: https://patch.msgid.link/r/20240822065223.1117056-4-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/rxe: Fix misspelling of 'rmda'

Fix 'rmda' into 'RDMA'.

Link: https://patch.msgid.link/r/20240822065223.1117056-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/rxe: Use sizeof instead of hard code number

Use 'sizeof(union rdma_network_hdr)' instead of hard code GRH length
for GSI and UD.

Link: https://patch.msgid.link/r/20240822065223.1117056-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx4: Simplify an alloc_ordered_workqueue() invocation

Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-5-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx4: Simplify an alloc_ordered_workqueue() invocation

Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-4-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mad: Simplify an alloc_ordered_workqueue() invocation

Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-3-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/qib: Simplify an alloc_ordered_workqueue() invocation

Let alloc_ordered_workqueue() format the workqueue name instead of
calling snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-2-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Merge tag 'pwrseq-fixes-for-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull power sequencing fix from Bartosz Golaszewski:

- request the wlan-enable GPIO "as-is" to fix an issue with the wifi
module being already powered up before linux boots

* tag 'pwrseq-fixes-for-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
power: sequencing: request the WLAN enable GPIO as-is

Merge tag 'pmdomain-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm

Pull pmdomain fixes from Ulf Hansson:

- imx: Remove duplicated clocks for scu power domain

- imx: Wait for SSAR to complete power-on for i.MX93 power domain

* tag 'pmdomain-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: imx: wait SSAR when i.MX93 power domain on
pmdomain: imx: scu-pd: Remove duplicated clocks

Merge tag 'kvmarm-fixes-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into for-next/fixes

KVM/arm64 fixes for 6.11, round #2

- Don't drop references on LPIs that weren't visited by the
   vgic-debug iterator

- Cure lock ordering issue when unregistering vgic redistributors

- Fix for misaligned stage-2 mappings when VMs are backed by hugetlb
   pages

- Treat SGI registers as UNDEFINED if a VM hasn't been configured for
   GICv3

* tag 'kvmarm-fixes-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm:
  KVM: arm64: Make ICC_*SGI*_EL1 undef in the absence of a vGICv3
  KVM: arm64: Ensure canonical IPA is hugepage-aligned when handling fault
  KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors
  KVM: arm64: vgic-debug: Don't put unmarked LPIs
  KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
  KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
  KVM: arm64: Tidying up PAuth code in KVM
  KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
  KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
  docs: KVM: Fix register ID of SPSR_FIQ
  KVM: arm64: vgic: fix unexpected unlock sparse warnings
  KVM: arm64: fix kdoc warnings in W=1 builds
  KVM: arm64: fix override-init warnings in W=1 builds
  KVM: arm64: free kvm->arch.nested_mmus with kvfree()

Merge tag 'ata-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fixes from Damien Le Moal:

- Fix the max segment size and max number of segments supported by the
   pata_macio driver to fix issues with BIO splitting leading to an
   overflow of the adapter DMA table (from Michael)

- Related to the previous fix, change BUG_ON() calls for incorrect
   command buffer segmentation into WARN_ON() and an error return (from
   Michael)

* tag 'ata-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: pata_macio: Use WARN instead of BUG
  ata: pata_macio: Fix DMA table overflow

Merge tag 'net-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and netfilter.

  Current release - regressions:

   - virtio_net: avoid crash on resume - move netdev_tx_reset_queue()
     call before RX napi enable

  Current release - new code bugs:

   - net/mlx5e: fix page leak and incorrect header release w/ HW GRO

  Previous releases - regressions:

   - udp: fix receiving fraglist GSO packets

   - tcp: prevent refcount underflow due to concurrent execution of
     tcp_sk_exit_batch()

  Previous releases - always broken:

   - ipv6: fix possible UAF when incrementing error counters on output

   - ip6: tunnel: prevent merging of packets with different L2

   - mptcp: pm: fix IDs not being reusable

   - bonding: fix potential crashes in IPsec offload handling

   - Bluetooth: HCI:
      - MGMT: add error handling to pair_device() to avoid a crash
      - invert LE State quirk to be opt-out rather then opt-in
      - fix LE quote calculation

   - drv: dsa: VLAN fixes for Ocelot driver

   - drv: igb: cope with large MAX_SKB_FRAGS Kconfig settings

   - drv: ice: fi Rx data path on architectures with PAGE_SIZE >= 8192

  Misc:

   - netpoll: do not export netpoll_poll_[disable|enable]()

   - MAINTAINERS: update the list of networking headers"

* tag 'net-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (82 commits)
  s390/iucv: Fix vargs handling in iucv_alloc_device()
  net: ovs: fix ovs_drop_reasons error
  net: xilinx: axienet: Fix dangling multicast addresses
  net: xilinx: axienet: Always disable promiscuous mode
  MAINTAINERS: Mark JME Network Driver as Odd Fixes
  MAINTAINERS: Add header files to NETWORKING sections
  MAINTAINERS: Add limited globs for Networking headers
  MAINTAINERS: Add net_tstamp.h to SOCKET TIMESTAMPING section
  MAINTAINERS: Add sonet.h to ATM section of MAINTAINERS
  octeontx2-af: Fix CPT AF register offset calculation
  net: phy: realtek: Fix setting of PHY LEDs Mode B bit on RTL8211F
  net: ngbe: Fix phy mode set to external phy
  netfilter: flowtable: validate vlan header
  bnxt_en: Fix double DMA unmapping for XDP_REDIRECT
  ipv6: prevent possible UAF in ip6_xmit()
  ipv6: fix possible UAF in ip6_finish_output2()
  ipv6: prevent UAF in ip6_send_skb()
  netpoll: do not export netpoll_poll_[disable|enable]()
  selftests: mlxsw: ethtool_lanes: Source ethtool lib from correct path
  udp: fix receiving fraglist GSO packets
  ...

Merge tag 'kbuild-fixes-v6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- Eliminate the fdtoverlay command duplication in scripts/Makefile.lib

- Fix 'make compile_commands.json' for external modules

- Ensure scripts/kconfig/merge_config.sh handles missing newlines

- Fix some build errors on macOS

* tag 'kbuild-fixes-v6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: fix typos "prequisites" to "prerequisites"
  Documentation/llvm: turn make command for ccache into code block
  kbuild: avoid scripts/kallsyms parsing /dev/null
  treewide: remove unnecessary <linux/version.h> inclusion
  scripts: kconfig: merge_config: config files: add a trailing newline
  Makefile: add $(srctree) to dependency of compile_commands.json target
  kbuild: clean up code duplication in cmd_fdtoverlay