git.ipfire.org Git - thirdparty/kernel/linux.git/log

RDMA/bnxt_re: Report QP rate limit in debugfs

Update QP info debugfs hook to report the rate limit applied
on the QP. 0 means unlimited.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-4-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Report packet pacing capabilities when querying device

Enable the support to report packet pacing capabilities
from kernel to user space. Packet pacing allows to limit
the rate to any number between the maximum and minimum.

The capabilities are exposed to user space through query_device.
The following capabilities are reported:

1. The maximum and minimum rate limit in kbps.
2. Bitmap showing which QP types support rate limit.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-3-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/bnxt_re: Add support for QP rate limiting

Broadcom P7 chips supports applying rate limit to RC QPs.
It allows adjust shaper rate values during the INIT -> RTR,
RTR -> RTS, RTS -> RTS state changes or after QP transitions
to RTR or RTS.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-2-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

MAINTAINERS: Drop RDMA files from Hyper-V section

MAINTAINERS entries are organized by subsystem ownership, and the RDMA
files belong under drivers/infiniband. Remove the overly broad mana_ib
entries from the Hyper-V section, and instead add the Hyper-V mailing list
to CC on mana_ib patches.

This makes get_maintainer.pl behave more sensibly when running it on
mana_ib patches.

Fixes: 428ca2d4c6aa ("MAINTAINERS: Add Long Li as a Hyper-V maintainer")
Link: https://patch.msgid.link/20260128-get-maintainers-fix-v1-1-fc5e58ce9f02@nvidia.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc

Since wqe_size in ib_uverbs_unmarshall_recv() is user-provided and already
validated, but can still be large, add __GFP_NOWARN to suppress memory
allocation warnings for large sizes, consistent with the similar fix in
ib_uverbs_post_send().

Fixes: 67cdb40ca444 ("[IB] uverbs: Implement more commands")
Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn>
Link: https://patch.msgid.link/20260129094900.3517706-1-liuy22@mails.tsinghua.edu.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>

svcrdma: use bvec-based RDMA read/write API

Convert svcrdma to the bvec-based RDMA API introduced earlier in
this series.

The bvec-based RDMA API eliminates the intermediate scatterlist
conversion step, allowing direct DMA mapping from bio_vec arrays.
This simplifies the svc_rdma_rw_ctxt structure by removing the
chained SG table management.

The structure retains an inline array approach similar to the
previous scatterlist implementation: an inline bvec array sized
to max_send_sge handles most I/O operations without additional
allocation. Larger requests fall back to dynamic allocation.
This preserves the allocation-free fast path for typical NFS
operations while supporting arbitrarily large transfers.

The bvec API handles all device types internally, including iWARP
devices which require memory registration. No explicit fallback
path is needed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-6-cel@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: add rdma_rw_max_sge() helper for SQ sizing

svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the
number of rdma_rw contexts (ctxts). This value is used to allocate the
Send CQ and to initialize the sc_sq_avail credit pool.

However, when the device uses memory registration for RDMA operations,
rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three
per context to account for REG and INV work requests. The Send CQ and
credit pool remain sized for only one work request per context,
causing Send Queue exhaustion under heavy NFS WRITE workloads.

Introduce rdma_rw_max_sge() to compute the actual number of Send Queue
entries required for a given number of rdma_rw contexts. Upper layer
protocols call this helper before creating a Queue Pair so that their
Send CQs and credit accounting match the QP's true capacity.

Update svc_rdma_accept() to use rdma_rw_max_sge() when computing
sc_sq_depth, ensuring the credit pool reflects the work requests
that rdma_rw_init_qp() will reserve.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: add MR support for bvec-based RDMA operations

The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
Region registration is required. This prevents iWARP devices from
using the bvec path, since iWARP requires MR registration for RDMA
READ operations. The force_mr debug parameter is also unusable with
bvec input.

Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec
arrays. The approach creates a synthetic scatterlist populated with
DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg()
infrastructure. This avoids driver changes while keeping the
implementation small.

The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup.
On destroy, the MRs are returned to the pool and the bvec DMA
mappings are released using the stored addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-4-cel@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations

The bvec RDMA API maps each bvec individually via dma_map_phys(),
requiring an IOTLB sync for each mapping. For large I/O operations
with many bvecs, this overhead becomes significant.

The two-step IOVA API (dma_iova_try_alloc / dma_iova_link /
dma_iova_sync) allocates a contiguous IOVA range upfront, links
all physical pages without IOTLB syncs, then performs a single
sync at the end. This reduces IOTLB flushes from O(n) to O(1).

It also requires only a single output dma_addr_t compared to extra
per-input element storage in struct scatterlist.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-3-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: add bio_vec based RDMA read/write API

The existing rdma_rw_ctx_init() API requires callers to construct a
scatterlist, which is then DMA-mapped page by page. Callers that
already have data in bio_vec form (such as the NVMe-oF target) must
first convert to scatterlist, adding overhead and complexity.

Introduce rdma_rw_ctx_init_bvec() and rdma_rw_ctx_destroy_bvec() to
accept bio_vec arrays directly. The new helpers use dma_map_phys()
for hardware RDMA devices and virtual addressing for software RDMA
devices (rxe, siw), avoiding intermediate scatterlist construction.

Memory registration (MR) path support is deferred to a follow-up
series; callers requiring MR-based transfers (iWARP devices or
force_mr=1) receive -EOPNOTSUPP and should use the scatterlist API.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-2-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/irdma: Use kvzalloc for paged memory DMA address array

Allocate array chunk->dmainfo.dmaaddrs using kvzalloc() to allow the
allocation to fall back to vmalloc when contiguous memory is unavailable
(instead of failing and logging page allocation warnings).

Acked-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Carlos Bilbao (Lambda) <carlos.bilbao@kernel.org>
Link: https://patch.msgid.link/20260128014446.405247-1-carlos.bilbao@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rxe: Fix race condition in QP timer handlers

I encontered the following warning:
WARNING: drivers/infiniband/sw/rxe/rxe_task.c:249 at rxe_sched_task+0x1c8/0x238 [rdma_rxe], CPU#0: swapper/0/0
...
  libsha1 [last unloaded: ip6_udp_tunnel]
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G         C          6.19.0-rc5-64k-v8+ #37 PREEMPT
Tainted: [C]=CRAP
Hardware name: Raspberry Pi 4 Model B Rev 1.2
Call trace:
  rxe_sched_task+0x1c8/0x238 [rdma_rxe] (P)
  retransmit_timer+0x130/0x188 [rdma_rxe]
  call_timer_fn+0x68/0x4d0
  __run_timers+0x630/0x888
...
WARNING: drivers/infiniband/sw/rxe/rxe_task.c:38 at rxe_sched_task+0x1c0/0x238 [rdma_rxe], CPU#0: swapper/0/0
...
WARNING: drivers/infiniband/sw/rxe/rxe_task.c:111 at do_work+0x488/0x5c8 [rdma_rxe], CPU#3: kworker/u17:4/93400
...
refcount_t: underflow; use-after-free.
WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x138/0x1a0, CPU#3: kworker/u17:4/93400

The issue is caused by a race condition between retransmit_timer() and
rxe_destroy_qp, leading to the Queue Pair's (QP) reference count dropping
to zero during timer handler execution.

It seems this warning is harmless because rxe_qp_do_cleanup() will flush
all pending timers and requests.

Example of flow causing the issue:

CPU0                                   CPU1
retransmit_timer() {
    spin_lock_irqsave
                           rxe_destroy_qp()
                            __rxe_cleanup()
                              __rxe_put() // qp->ref_count decrease to 0
                            rxe_qp_do_cleanup() {
    if (qp->valid) {
        rxe_sched_task() {
            WARN_ON(rxe_read(task->qp) <= 0);
        }
    }
    spin_unlock_irqrestore
}
                              spin_lock_irqsave
                              qp->valid = 0
                              spin_unlock_irqrestore
                            }

Ensure the QP's reference count is maintained and its validity is checked
within the timer callbacks by adding calls to rxe_get(qp) and corresponding
rxe_put(qp) after use.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Fixes: d94671632572 ("RDMA/rxe: Rewrite rxe_task.c")
Link: https://patch.msgid.link/20260120074437.623018-1-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mana_ib: Add device‑memory support

Introduce a basic DM implementation that enables creating and
registering device memory, and using the associated memory keys
for networking operations.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/20260127082649.429018-1-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Fix memory leak in GET_DATA_DIRECT_SYSFS_PATH handler

The UVERBS_HANDLER(MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH) function
allocates memory for the device path using kobject_get_path(). If the
length of the device path exceeds the output buffer length, the function
returns -ENOSPC but does not free the allocated memory, resulting in a
memory leak.

Add a kfree() call to the error path to ensure the allocated memory is
properly freed.

Compile tested only. Issue found using a prototype static analysis tool
and code review.

Fixes: ec7ad6530909 ("RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://patch.msgid.link/20260126074801.627898-1-zilin@seu.edu.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/uverbs: Validate wqe_size before using it in ib_uverbs_post_send

ib_uverbs_post_send() uses cmd.wqe_size from userspace without any
validation before passing it to kmalloc() and using the allocated
buffer as struct ib_uverbs_send_wr.

If a user provides a small wqe_size value (e.g., 1), kmalloc() will
succeed, but subsequent accesses to user_wr->opcode, user_wr->num_sge,
and other fields will read beyond the allocated buffer, resulting in
an out-of-bounds read from kernel heap memory. This could potentially
leak sensitive kernel information to userspace.

Additionally, providing an excessively large wqe_size can trigger a
WARNING in the memory allocation path, as reported by syzkaller.

This is inconsistent with ib_uverbs_unmarshall_recv() which properly
validates that wqe_size >= sizeof(struct ib_uverbs_recv_wr) before
proceeding.

Add the same validation for ib_uverbs_post_send() to ensure wqe_size
is at least sizeof(struct ib_uverbs_send_wr).

Fixes: c3bea3d2dc53 ("RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()")
Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn>
Link: https://patch.msgid.link/20260122142900.2356276-2-liuy22@mails.tsinghua.edu.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/irdma: Use CQ ID for CEQE context

The hardware allows for an opaque CQ context field to be carried
over into CEQEs for the CQ. Previously, a pointer to the CQ was used
for this context. In the normal CQ destroy flow, the CEQ ring is
scrubbed to remove any preexisting CEQEs for the CQ that may not have
been processed yet so that the CQ structure is not dereferenced in the
CEQ ISR after the CQ has been freed.

However, in some cases, it is possible for a CEQE to be in flight in
HW even after the CQ destroy command completion is received, so it
could be missed during the scrub.

To protect against this, we can take advantage of the CQ table that
already exists and use the CQ ID for this context rather than a CQ
pointer.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260120212546.1893076-2-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/irdma: Add enum defs for reserved CQs/QPs

Added definitions for the special reserved CQs and QPs.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260120212546.1893076-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE

The current implementation incorrectly handles memory regions (MRs) with
page sizes different from the system PAGE_SIZE. The core issue is that
rxe_set_page() is called with mr->page_size step increments, but the
page_list stores individual struct page pointers, each representing
PAGE_SIZE of memory.

ib_sg_to_page() has ensured that when i>=1 either
a) SG[i-1].dma_end and SG[i].dma_addr are contiguous
or
b) SG[i-1].dma_end and SG[i].dma_addr are mr->page_size aligned.

This leads to incorrect iova-to-va conversion in scenarios:

1) page_size < PAGE_SIZE (e.g., MR: 4K, system: 64K):
   ibmr->iova = 0x181800
   sg[0]: dma_addr=0x181800, len=0x800
   sg[1]: dma_addr=0x173000, len=0x1000

   Access iova = 0x181800 + 0x810 = 0x182010
   Expected VA: 0x173010 (second SG, offset 0x10)
   Before fix:
     - index = (0x182010 >> 12) - (0x181800 >> 12) = 1
     - page_offset = 0x182010 & 0xFFF = 0x10
     - xarray[1] stores system page base 0x170000
     - Resulting VA: 0x170000 + 0x10 = 0x170010 (wrong)

2) page_size > PAGE_SIZE (e.g., MR: 64K, system: 4K):
   ibmr->iova = 0x18f800
   sg[0]: dma_addr=0x18f800, len=0x800
   sg[1]: dma_addr=0x170000, len=0x1000

   Access iova = 0x18f800 + 0x810 = 0x190010
   Expected VA: 0x170010 (second SG, offset 0x10)
   Before fix:
     - index = (0x190010 >> 16) - (0x18f800 >> 16) = 1
     - page_offset = 0x190010 & 0xFFFF = 0x10
     - xarray[1] stores system page for dma_addr 0x170000
     - Resulting VA: system page of 0x170000 + 0x10 = 0x170010 (wrong)

Yi Zhang reported a kernel panic[1] years ago related to this defect.

Solution:
1. Replace xarray with pre-allocated rxe_mr_page array for sequential
   indexing (all MR page indices are contiguous)
2. Each rxe_mr_page stores both struct page* and offset within the
   system page
3. Handle MR page_size != PAGE_SIZE relationships:
   - page_size > PAGE_SIZE: Split MR pages into multiple system pages
   - page_size <= PAGE_SIZE: Store offset within system page
4. Add boundary checks and compatibility validation

This ensures correct iova-to-va conversion regardless of MR page size
and system PAGE_SIZE relationship, while improving performance through
array-based sequential access.

Tests on 4K and 64K PAGE_SIZE hosts:
- rdma-core/pytests
  $ ./build/bin/run_tests.py  --dev eth0_rxe
- blktest:
  $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd

[1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/

Fixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://patch.msgid.link/20260116032753.2574363-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rxe: Remove unused page_offset member

In rxe_map_mr_sg(), the `page_offset` member of the `rxe_mr` struct
was initialized based on `ibmr.iova`, which will be updated inside
ib_sg_to_pages() later.

Consequently, the value assigned to `page_offset` was incorrect. However,
since `page_offset` was never utilized throughout the code, it can be safely
removed to clean up the codebase and avoid future confusion.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://patch.msgid.link/20260116032833.2574627-1-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/mlx5: Fix port speed query for representors

When querying speed information for a representor in switchdev mode,
the code previously used the first device in the eswitch, which may not
match the device that actually owns the representor. In setups such as
multi-port eswitch or LAG, this led to incorrect port attributes being
reported.

Fix this by retrieving the correct core device from the representor's
eswitch before querying its port attributes.

Fixes: 27f9e0ccb6da ("net/mlx5: Lag, Add single RDMA device in multiport mode")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260115-port-speed-query-fix-v2-1-3bde6a3c78e7@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Fix UMR hang in LAG error state unload

During firmware reset in LAG mode, a race condition causes the driver
to hang indefinitely while waiting for UMR completion during device
unload. See [1].

In LAG mode the bond device is only registered on the master, so it
never sees sys_error events from the slave.
During firmware reset this causes UMR waits to hang forever on unload
as the slave is dead but the master hasn't entered error state yet, so
UMR posts succeed but completions never arrive.

Fix this by adding a sys_error notifier that gets registered before
MLX5_IB_STAGE_IB_REG and stays alive until after ib_unregister_device().
This ensures error events reach the bond device throughout teardown.

[1]
Call Trace:
__schedule+0x2bd/0x760
schedule+0x37/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.6+0x2b5/0x4a0
__mlx5_ib_dereg_mr+0x606/0x870 [mlx5_ib]
? __xa_erase+0x4a/0xa0
? _cond_resched+0x15/0x30
? wait_for_completion+0x31/0x100
ib_dereg_mr_user+0x48/0xc0 [ib_core]
? rdmacg_uncharge_hierarchy+0xa0/0x100
destroy_hw_idr_uobject+0x20/0x50 [ib_uverbs]
uverbs_destroy_uobject+0x37/0x150 [ib_uverbs]
__uverbs_cleanup_ufile+0xda/0x140 [ib_uverbs]
uverbs_destroy_ufile_hw+0x3a/0xf0 [ib_uverbs]
ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs]
remove_client_context+0x8b/0xd0 [ib_core]
disable_device+0x8c/0x130 [ib_core]
__ib_unregister_device+0x10d/0x180 [ib_core]
ib_unregister_device+0x21/0x30 [ib_core]
__mlx5_ib_remove+0x1e4/0x1f0 [mlx5_ib]
auxiliary_bus_remove+0x1e/0x30
device_release_driver_internal+0x103/0x1f0
bus_remove_device+0xf7/0x170
device_del+0x181/0x410
mlx5_rescan_drivers_locked.part.10+0xa9/0x1d0 [mlx5_core]
mlx5_disable_lag+0x253/0x260 [mlx5_core]
mlx5_lag_disable_change+0x89/0xc0 [mlx5_core]
mlx5_eswitch_disable+0x67/0xa0 [mlx5_core]
mlx5_unload+0x15/0xd0 [mlx5_core]
mlx5_unload_one+0x71/0xc0 [mlx5_core]
mlx5_sync_reset_reload_work+0x83/0x100 [mlx5_core]
process_one_work+0x1a7/0x360
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x116/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x22/0x40

Fixes: ede132a5cf55 ("RDMA/mlx5: Move events notifier registration to be after device registration")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260113-umr-hand-lag-fix-v1-1-3dc476e00cd9@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mana_ib: Take CQ type from the device type

Get CQ type from the used gdma device. The MANA_IB_CREATE_RNIC_CQ
flag is ignored. It was used in older kernel versions where
the mana_ib was shared between ethernet and rnic.

Fixes: d4293f96ce0b ("RDMA/mana_ib: unify mana_ib functions to support any gdma device")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/20260115093625.177306-1-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/iwcm: Fix workqueue list corruption by removing work_list

The commit e1168f0 ("RDMA/iwcm: Simplify cm_event_handler()")
changed the work submission logic to unconditionally call
queue_work() with the expectation that queue_work() would
have no effect if work was already pending. The problem is
that a free list of struct iwcm_work is used (for which
struct work_struct is embedded), so each call to queue_work()
is basically unique and therefore does indeed queue the work.

This causes a problem in the work handler which walks the work_list
until it's empty to process entries. This means that a single
run of the work handler could process item N+1 and release it
back to the free list while the actual workqueue entry is still
queued. It could then get reused (INIT_WORK...) and lead to
list corruption in the workqueue logic.

Fix this by just removing the work_list. The workqueue already
does this for us.

This fixes the following error that was observed when stress
testing with ucmatose on an Intel E830 in iWARP mode:

[  151.465780] list_del corruption. next->prev should be ffff9f0915c69c08, but was ffff9f0a1116be08. (next=ffff9f0a15b11c08)
[  151.466639] ------------[ cut here ]------------
[  151.466986] kernel BUG at lib/list_debug.c:67!
[  151.467349] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  151.467753] CPU: 14 UID: 0 PID: 2306 Comm: kworker/u64:18 Not tainted 6.19.0-rc4+ #1 PREEMPT(voluntary)
[  151.468466] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  151.469192] Workqueue:  0x0 (iw_cm_wq)
[  151.469478] RIP: 0010:__list_del_entry_valid_or_report+0xf0/0x100
[  151.469942] Code: c7 58 5f 4c b2 e8 10 50 aa ff 0f 0b 48 89 ef e8 36 57 cb ff 48 8b 55 08 48 89 e9 48 89 de 48 c7 c7 a8 5f 4c b2 e8 f0 4f aa ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90
[  151.471323] RSP: 0000:ffffb15644e7bd68 EFLAGS: 00010046
[  151.471712] RAX: 000000000000006d RBX: ffff9f0915c69c08 RCX: 0000000000000027
[  151.472243] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f0a37d9c600
[  151.472768] RBP: ffff9f0a15b11c08 R08: 0000000000000000 R09: c0000000ffff7fff
[  151.473294] R10: 0000000000000001 R11: ffffb15644e7bba8 R12: ffff9f092339ee68
[  151.473817] R13: ffff9f0900059c28 R14: ffff9f092339ee78 R15: 0000000000000000
[  151.474344] FS:  0000000000000000(0000) GS:ffff9f0a847b5000(0000) knlGS:0000000000000000
[  151.474934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.475362] CR2: 0000559e233a9088 CR3: 000000020296b004 CR4: 0000000000770ef0
[  151.475895] PKRU: 55555554
[  151.476118] Call Trace:
[  151.476331]  <TASK>
[  151.476497]  move_linked_works+0x49/0xa0
[  151.476792]  __pwq_activate_work.isra.46+0x2f/0xa0
[  151.477151]  pwq_dec_nr_in_flight+0x1e0/0x2f0
[  151.477479]  process_scheduled_works+0x1c8/0x410
[  151.477823]  worker_thread+0x125/0x260
[  151.478108]  ? __pfx_worker_thread+0x10/0x10
[  151.478430]  kthread+0xfe/0x240
[  151.478671]  ? __pfx_kthread+0x10/0x10
[  151.478955]  ? __pfx_kthread+0x10/0x10
[  151.479240]  ret_from_fork+0x208/0x270
[  151.479523]  ? __pfx_kthread+0x10/0x10
[  151.479806]  ret_from_fork_asm+0x1a/0x30
[  151.480103]  </TASK>

Fixes: e1168f09b331 ("RDMA/iwcm: Simplify cm_event_handler()")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260112020006.1352438-1-jmoroni@google.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rxe: Fix double free in rxe_srq_from_init

In rxe_srq_from_init(), the queue pointer 'q' is assigned to
'srq->rq.queue' before copying the SRQ number to user space.
If copy_to_user() fails, the function calls rxe_queue_cleanup()
to free the queue, but leaves the now-invalid pointer in
'srq->rq.queue'.

The caller of rxe_srq_from_init() (rxe_create_srq) eventually
calls rxe_srq_cleanup() upon receiving the error, which triggers
a second rxe_queue_cleanup() on the same memory, leading to a
double free.

The call trace looks like this:
   kmem_cache_free+0x.../0x...
   rxe_queue_cleanup+0x1a/0x30 [rdma_rxe]
   rxe_srq_cleanup+0x42/0x60 [rdma_rxe]
   rxe_elem_release+0x31/0x70 [rdma_rxe]
   rxe_create_srq+0x12b/0x1a0 [rdma_rxe]
   ib_create_srq_user+0x9a/0x150 [ib_core]

Fix this by moving 'srq->rq.queue = q' after copy_to_user.

Fixes: aae0484e15f0 ("IB/rxe: avoid srq memory leak")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Link: https://patch.msgid.link/20260112015412.29458-1-jiashengjiangcool@gmail.com
Reviewed-by: Zhu Yanjun <yanjun.Zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Support drain SQ and RQ

Some ULPs, e.g. rpcrdma, rely on drain_qp() to ensure all outstanding
requests are completed before releasing related memory. If drain_qp()
fails, ULPs may release memory directly, and in-flight WRs may later be
flushed after the memory is freed, potentially leading to UAF.

drain_qp() failures can happen when HW enters an error state or is
reset. Add support to drain SQ and RQ in such cases by posting a
fake WR during reset, so the driver can process all remaining WRs in
sequence and generate corresponding completions.

Always invoke comp_handler() in drain process to ensure completions
are not lost under concurrency (e.g. concurrent post_send() and
reset, or QPs created during reset). If the CQ is already processed,
cancel any already scheduled comp_handler() to avoid concurrency
issues.

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260108113032.856306-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/irdma: Remove fixed 1 ms delay during AH wait loop

The AH CQP command wait loop executes in an atomic context and was
using a fixed 1 ms delay. Since many AH create commands can complete
much faster than 1 ms, use poll_timeout_us_atomic with a 1 us delay.

Also, use the timeout value indicated during the capability exchange
rather than a hard-coded value.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260105180550.2907858-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/irdma: Remove redundant dma_wmb() before writel()

A dma_wmb() is not necessary before a writel() because writel()
already has an even stronger store barrier. A dma_wmb() is only
required to order writes to consistent/DMA memory whereas the
barrier in writel() is specified to order writes to DMA memory as
well as MMIO.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260103172517.2088895-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-srv: Fix error print in process_info_req()

rtrs_srv_change_state() returns bool (true on success) therefore
there is no reason to print error when it fails as it always will
be 0.

Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-11-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: For conn rejection use actual err number

When the connection establishment request is rejected from the server
side, then the actual error number sent back should be used.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-10-haris.iqbal@ionos.com
Reviewed-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: Extend log message when a port fails

Add HCA name and port of this HCA.
This would help with analysing and debugging the logs.

The logs would looks something like this,

rtrs_server L2516: Handling event: port error (10).
HCA name: mlx4_0, port num: 2
rtrs_client L3326: Handling event: port error (10).
HCA name: mlx4_0, port num: 1

Signed-off-by: Kim Zhu <zhu.yanjun@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-9-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-srv: Rate-limit I/O path error logging

Excessive error logging is making it difficult to identify the root
cause of issues. Implement rate limiting to improve log clarity.

Signed-off-by: Kim Zhu <zhu.yanjun@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-8-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-srv: Add check and closure for possible zombie paths

During several network incidents, a number of RTRS paths for a session
went through disconnect and reconnect phase. However, some of those did
not auto-reconnect successfully. Instead they failed with the following
logs,

On client,
kernel: rtrs_client L1991: <sess-name>: Connect rejected: status 28
  (consumer defined), rtrs errno -104
kernel: rtrs_client L2698: <sess-name>: init_conns() failed: err=-104
  path=gid:<gid1>@gid:<gid2> [mlx4_0:1]

On server, (log a)
kernel: ibtrs_server L1868: <>: Connection already exists: 0

When the misbehaving path was removed, and add_path was called to re-add
the path, the log on client side changed to, (log b)
kernel: rtrs_client L1991: <sess-name>: Connect rejected: status 28
  (consumer defined), rtrs errno -17

There was no log on the server side for this, which is expected since
there is no logging in that path,
if (unlikely(__is_path_w_addr_exists(srv, &cm_id->route.addr))) {
err = -EEXIST;
goto err;

Because of the following check on server side,
if (unlikely(sess->state != IBTRS_SRV_CONNECTING)) {
ibtrs_err(s, "Session in wrong state: %s\n",

.. we know that the path in (log a) was in CONNECTING state.

The above state of the path persists for as long as we leave the session
be. This means that the path is in some zombie state, probably waiting
for the info_req packet to arrive, which never does.

The changes in this commits does 2 things.

1) Add logs at places where we see the errors happening. The logs would
shed more light at the state and lifetime of such zombie paths.

2) Close such zombie sessions, only if they are in CONNECTING state, and
after an inactivity period of 30 seconds.
  i) The state check prevents closure of paths which are CONNECTED.
Also, from the above logs and code, we already know that the path could
only be on CONNECTING state, so we play safe and narrow our impact surface
area by closing only CONNECTING paths.
  ii) The inactivity period is to allow requests for other cid to finish
processing, or for any stray packets to arrive/fail.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-clt: Remove unused members in rtrs_clt_io_req

Remove unused members from rtrs_clt_io_req.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-6-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: Improve error logging for RDMA cm events

The member variable status in the struct rdma_cm_event is used for both
linux errors and the errors definded in rdma stack.

Signed-off-by: Kim Zhu <zhu.yanjun@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-5-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: Add optional support for IB_MR_TYPE_SG_GAPS

Support IB_MR_TYPE_SG_GAPS, which has less limitations
than standard IB_MR_TYPE_MEM_REG, a few ULP support this.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Kim Zhu <zhu.yanjun@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-4-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs: Add error description to the logs

Print error description instead of the error number.

Signed-off-by: Kim Zhu <zhu.yanjun@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-3-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/rtrs-srv: fix SG mapping

This fixes the following error on the server side:

RTRS server session allocation failed: -EINVAL

caused by the caller of the `ib_dma_map_sg()`, which does not expect
less mapped entries, than requested, which is in the order of things
and can be easily reproduced on the machine with enabled IOMMU.

The fix is to treat any positive number of mapped sg entries as a
successful mapping and cache DMA addresses by traversing modified
SG table.

Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality")
Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20260107161517.56357-2-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/ocrdma: Remove unused OCRDMA_UVERBS definition

The OCRDMA_UVERBS() macro is unused, so remove it to clean up the code.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-6-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/qedr: Remove unused defines

Perform basic cleanup by removing unused defines from qedr.h.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-5-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/mlx5: Avoid direct access to DMA device pointer

The dma_device field is marked as internal and must not be accessed by
drivers or ULPs. Remove all direct mlx5 references to this field.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-4-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/mlx5: Fix ucaps init error flow

In mlx5_ib_stage_caps_init(), if mlx5_ib_init_ucaps() fails after
mlx5_ib_init_var_table() succeeds, the VAR bitmap is leaked since
the function returns without cleanup.

Thus, cleanup the var table bitmap in case of error of initializing
ucaps before exiting, preventing the leak above.

Fixes: cf7174e8982f ("RDMA/mlx5: Create UCAP char devices for supported device capabilities")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/20260104-ib-core-misc-v1-3-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/core: Avoid exporting module local functions and remove not-used ones

Some of the functions are local to the module and some are not used
starting from commit 36783dec8d79 ("RDMA/rxe: Delete deprecated module
parameters interface"). Delete and avoid exporting them.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20260104-ib-core-misc-v1-2-00367f77f3a8@nvidia.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/umem: Remove redundant DMABUF ops check

ib_umem_dmabuf_get_with_dma_device() is an in-kernel function and does
not require a defensive check for the .move_notify callback. All current
callers guarantee that this callback is always present.

Link: https://patch.msgid.link/20260104-ib-core-misc-v1-1-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

RDMA/mlx5: Implement query_port_speed callback

Implement the query_port_speed callback for mlx5 driver to support
querying effective port bandwidth.

For LAG configurations, query the aggregated speed from the LAG layer
or from the modified vport max_tx_speed.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/mlx5: Raise async event on device speed change

Raise IB_EVENT_DEVICE_SPEED_CHANGE whenever the speed of one of the
device's ports changes. Usually all ports of the device changes
together.

This ensures user applications and upper-layer software are immediately
notified when bandwidth changes, improving traffic management in dynamic
environments. This is especially useful for vports which are part of a
LAG configuration, to know if the effective speed of the LAG was
changed.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/core: Add query_port_speed verb

Add new ibv_query_port_speed() verb to enable applications to query
the effective bandwidth of a port.

This verb is particularly useful when the speed is not a multiplication
of IB speed and width where width is 2^n.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/core: Refactor rate_show to use ib_port_attr_to_rate()

Update sysfs rate_show() to rely on ib_port_attr_to_speed_info() for
converting IB port speed and width attributes to data rate and speed
string.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/core: Add helper to convert port attributes to data rate

Introduce ib_port_attr_to_rate() to compute the data rate in 100 Mbps
units (deci-Gb/sec) from a port's active_speed and active_width
attributes. This generic helper removes duplicated speed-to-rate
calculations, which are used by sysfs and the upcoming new verb.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/core: Add async event on device speed change

Add IB_EVENT_DEVICE_SPEED_CHANGE for notifying user applications on
device's ports speed changes.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Support effective VF bandwidth query in LAG mode

Currently, mlx5 driver exposes only the parent function's speed to VFs,
providing no way to query the actual effective bandwidth in LAG and
MPESW configurations. This limitation prevents userspace and
upper-layer software from obtaining accurate bandwidth information,
which impacts traffic scheduling decisions.

This series addresses this by:

1. Adding mlx5 internal logic to calculate and propagate the effective
   aggregated LAG speed to all attached vports. The vport speeds are
   dynamically updated when LAG member link states change.

2. Extending RDMA core with a new ib_query_port_speed() verb and an
   IB_EVENT_DEVICE_SPEED_CHANGE async event. These interfaces expose
   the effective port speed to userspace, supporting speeds that are
   not expressible as IB speed * width (where width is 2^n).

This series enables userspace applications to query the effective
port speed and receive notifications on speed changes in real-time.
In LAG configurations, each mlx5 port reports the aggregated bandwidth
of all active LAG members.

Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Add support for querying bond speed

Add mlx5_lag_query_bond_speed() to query the aggregated speed of
lag configurations with a bond device.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Handle port and vport speed change events in MPESW

Add port change event handling logic for MPESW LAG mode, ensuring
VFs are updated when the speed of LAG physical ports changes.
This triggers a speed update workflow when relevant port state changes
occur, enabling consistent and accurate reporting of VF bandwidth.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Propagate LAG effective max_tx_speed to vports

Currently, vports report only their parent's uplink speed, which in LAG
setups does not reflect the true aggregated bandwidth. This makes it
hard for upper-layer software to optimize load balancing decisions
based on accurate bandwidth information.

Fix the issue by calculating the possible maximum speed of a LAG as
the sum of speeds of all active uplinks that are part of the LAG.
Propagate this effective max speed to vports associated with the LAG
whenever a relevant event occurs, such as physical port link state
changes or LAG creation/modification.

With this change, upper-layer components receive accurate bandwidth
information corresponding to the active members of the LAG and can
make better load balancing decisions.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

net/mlx5: Add max_tx_speed and its CAP bit to IFC

Introduce the max_tx_speed field to the query and modify_vport_state
structures.

Add the esw_vport_state_max_tx_speed capability bit, indicating
the firmware support modifying the max_tx_speed field via the
MODIFY_VPORT_STATE command.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Notify ULP of remaining soft-WCs during reset

During a reset, software-generated WCs cannot be reported via
interrupts. This may cause the ULP to miss some WCs.

To avoid this, add check in the CQ arm process: if a hardware reset
has occurred and there are still unreported soft-WCs, notify the ULP
to handle the remaining WCs, thereby preventing any loss of completions.

Fixes: 626903e9355b ("RDMA/hns: Add support for reporting wc as software mode")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix RoCEv1 failure due to DSCP

DSCP is not supported in RoCEv1, but get_dscp() is still called. If
get_dscp() returns an error, it'll eventually cause create_ah to fail
even when using RoCEv1.

Correct the return value and avoid calling get_dscp() when using
RoCEv1.

Fixes: ee20cc17e9d8 ("RDMA/hns: Support DSCP")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Return actual error code instead of fixed EINVAL

query_cqc() and query_mpt() may return various error codes in
different cases. Return actual error code instead of fixed EINVAL.

Fixes: f2b070f36d1b ("RDMA/hns: Support CQ's restrack raw ops for hns driver")
Fixes: 3d67e7e236ad ("RDMA/hns: Support MR's restrack raw ops for hns driver")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Fix WQ_MEM_RECLAIM warning

When sunrpc is used, if a reset triggered, our wq may lead the
following trace:

workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma]
is flushing !WQ_MEM_RECLAIM hns_roce_irq_workq:flush_work_handle
[hns_roce_hw_v2]
WARNING: CPU: 0 PID: 8250 at kernel/workqueue.c:2644 check_flush_dependency+0xe0/0x144
Call trace:
  check_flush_dependency+0xe0/0x144
  start_flush_work.constprop.0+0x1d0/0x2f0
  __flush_work.isra.0+0x40/0xb0
  flush_work+0x14/0x30
  hns_roce_v2_destroy_qp+0xac/0x1e0 [hns_roce_hw_v2]
  ib_destroy_qp_user+0x9c/0x2b4
  rdma_destroy_qp+0x34/0xb0
  rpcrdma_ep_destroy+0x28/0xcc [rpcrdma]
  rpcrdma_ep_put+0x74/0xb4 [rpcrdma]
  rpcrdma_xprt_disconnect+0x1d8/0x260 [rpcrdma]
  xprt_rdma_connect_worker+0xc0/0x120 [rpcrdma]
  process_one_work+0x1cc/0x4d0
  worker_thread+0x154/0x414
  kthread+0x104/0x144
  ret_from_fork+0x10/0x18

Since QP destruction frees memory, this wq should have the WQ_MEM_RECLAIM.

Fixes: ffd541d45726 ("RDMA/hns: Add the workqueue framework for flush cqe handler")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20260104064057.1582216-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

IB/cache: update gid cache on client reregister event

Some HCAs (e.g: ConnectX4) do not trigger a IB_EVENT_GID_CHANGE on
subnet prefix update from SM (PortInfo).

Since the commit d58c23c92548 ("IB/core: Only update PKEY and GID caches
on respective events"), the GID cache is updated exclusively on
IB_EVENT_GID_CHANGE. If this event is not emitted, the subnet prefix in the
IPoIB interface’s hardware address remains set to its default value
(0xfe80000000000000).

Then rdma_bind_addr() failed because it relies on hardware address to
find the port GID (subnet_prefix + port GUID).

This patch fixes this issue by updating the GID cache on
IB_EVENT_CLIENT_REREGISTER event (emitted on PortInfo::ClientReregister=1).

Fixes: d58c23c92548 ("IB/core: Only update PKEY and GID caches on respective events")
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Link: https://patch.msgid.link/aVUfsO58QIDn5bGX@eaujamesFR0130
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

RDMA/hns: Introduce limit_bank mode with better performance

In limit_bank mode, QPs/CQs are restricted to using half of the banks.
HW concentrates resources on these banks, thereby improving performance
compared to the default mode.

Switch between limit_bank mode and default mode by setting the cap
flag in FW. Since the number of QPs and CQs will be halved, this mode
is suitable for scenarios where fewer QPs and CQs are required.

Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251230154911.3397584-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

Linux 6.19-rc3

Merge tag 'usb-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB fixes, and bunch of reverts for 6.19-rc3.
  Included in here are:

   - reverts of some typec ucsi driver changes that had a lot of
     regression reports after -rc1. Let's just revert it all for now and
     it will come back in a way that is better tested.

   - other typec bugfixes

   - usb-storage quirk fixups

   - dwc3 driver fix

   - other minor USB fixes for reported problems.

  All of these have passed 0-day testing and individual testing"

* tag 'usb-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (22 commits)
  Revert "usb: typec: ucsi: Update UCSI structure to have message in and message out fields"
  Revert "usb: typec: ucsi: Add support for message out data structure"
  Revert "usb: typec: ucsi: Enable debugfs for message_out data structure"
  Revert "usb: typec: ucsi: Add support for SET_PDOS command"
  Revert "usb: typec: ucsi: Fix null pointer dereference in ucsi_sync_control_common"
  Revert "usb: typec: ucsi: Get connector status after enable notifications"
  usb: ohci-nxp: clean up probe error labels
  usb: gadget: lpc32xx_udc: clean up probe error labels
  usb: ohci-nxp: fix device leak on probe failure
  usb: phy: isp1301: fix non-OF device reference imbalance
  usb: gadget: lpc32xx_udc: fix clock imbalance in error path
  usb: typec: ucsi: Get connector status after enable notifications
  usb: usb-storage: Maintain minimal modifications to the bcdDevice range.
  usb: dwc3: of-simple: fix clock resource leak in dwc3_of_simple_probe
  usb: typec: ucsi: Fix null pointer dereference in ucsi_sync_control_common
  USB: lpc32xx_udc: Fix error handling in probe
  usb: typec: altmodes/displayport: Drop the device reference in dp_altmode_probe()
  usb: phy: fsl-usb: Fix use-after-free in delayed work during device removal
  usb: renesas_usbhs: Fix a resource leak in usbhs_pipe_malloc()
  usb: typec: ucsi: huawei-gaokin: add DRM dependency
  ...

Merge tag 'tty-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull serial driver fixes from Greg KH:
"Here are some small serial driver fixes for some reported issues.
  Included in here are:

   - serial sysfs fwnode fix that was much reported

   - sh-sci driver fix

   - serial device init bugfix

   - 8250 bugfix

   - xilinx_uartps bugfix

  All of these have passed 0-day testing and individual testing"

* tag 'tty-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: xilinx_uartps: fix rs485 delay_rts_after_send
  serial: sh-sci: Check that the DMA cookie is valid
  serial: core: Fix serial device initialization
  serial: 8250: longson: Fix NULL vs IS_ERR() bug in probe
  serial: core: Restore sysfs fwnode information

Merge tag 'firewire-fixes-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fix from Takashi Sakamoto:
"A fix for PCI driver for Texas Instruments PCILyx series.

  The driver had a bug where it allocated a DMA-coherent buffer of 16 KB
  but released it using PAGE_SIZE. This disproportion was reported in
  2020, but the fix was never merged. It is finally resolved"

* tag 'firewire-fixes-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: nosy: Fix dma_free_coherent() size

Merge tag 'riscv-for-linus-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Paul Walmsley:
"Nothing exotic here; these are the cleanup and new ISA extension
  probing patches (not including CFI):

   - Add probing and userspace reporting support for the standard RISC-V
     ISA extensions Zilsd and Zclsd, which implement load/store dual
     instructions on RV32

   - Abstract the register saving code in setup_sigcontext() so it can
     be used for stateful RISC-V ISA extensions beyond the vector
     extension

   - Add the SBI extension ID and some initial data structure
     definitions for the RISC-V standard SBI debug trigger extension

   - Clean up some code slightly: change some page table functions to
     avoid atomic operations oinn !SMP and to avoid unnecessary casts to
     atomic_long_t; and use the existing RISCV_FULL_BARRIER macro in
     place of some open-coded 'fence rw,rw' instructions"

* tag 'riscv-for-linus-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Add SBI debug trigger extension and function ids
  riscv/atomic.h: use RISCV_FULL_BARRIER in _arch_atomic* function.
  riscv: hwprobe: export Zilsd and Zclsd ISA extensions
  riscv: add ISA extension parsing for Zilsd and Zclsd
  dt-bindings: riscv: add Zilsd and Zclsd extension descriptions
  riscv: mm: use xchg() on non-atomic_long_t variables, not atomic_long_xchg()
  riscv: mm: ptep_get_and_clear(): avoid atomic ops when !CONFIG_SMP
  riscv: mm: pmdp_huge_get_and_clear(): avoid atomic ops when !CONFIG_SMP
  riscv: signal: abstract header saving for setup_sigcontext

Merge tag 'powerpc-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Madhavan Srinivasan:

- Fix for kexec warning due to SMT disable or partial SMT enabled

- Handle font bitmap pointer with reloc_offset to fix boot crash

- Fix to enable cpuidle state for Power11

- Couple of misc fixes

Thanks to Aboorva Devarajan, Aditya Bodkhe, Cedar Maxwell, Christian
Zigotzky, Christophe Leroy, Christophe Leroy (CS GROUP), Finn Thain,
Gopi Krishna Menon, Guenter Roeck, Jan Stancek, Joe Lawrence, Josh
Poimboeuf, Justin M. Forbes, Madadi Vineeth Reddy, Naveen N Rao (AMD),
Nysal Jan K.A., Sachin P Bappalige, Samir M, Sourabh Jain, Srikar
Dronamraju, and Stan Johnson

* tag 'powerpc-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32: Restore disabling of interrupts at interrupt/syscall exit
  powerpc/powernv: Enable cpuidle state detection for POWER11
  powerpc: Add reloc_offset() to font bitmap pointer used for bootx_printf()
  powerpc/tools: drop `-o pipefail` in gcc check scripts
  selftests/powerpc/pmu/: Add check_extended_reg_test to .gitignore
  powerpc/kexec: Enable SMT before waking offline CPUs

Merge tag 'spi-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"We've got more fixes here for the Cadence QSPI controller, this time
  fixing some issues that come up when working with slower flashes on
  some platforms plus a general race condition.

  We also add support for the Allwinner A523, this is just some new
  compatibles"

* tag 'spi-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: cadence-quadspi: Improve CQSPI_SLOW_SRAM quirk if flash is slow
  spi: cadence-quadspi: Prevent lost complete() call during indirect read
  spi: sun6i: Support A523's SPI controllers
  spi: dt-bindings: sun6i: Add compatibles for A523's SPI controllers

Merge tag 'regulator-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
"A couple of fixes from Thomas, making the UAPI headers more robustly
  correct and ensuring they are covered by checkpatch, and one from
  Andreas fixing an update for a change to the DT bindings that I missed
  was requested during bindings review in the newly added fp9931 driver"

* tag 'regulator-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: fp9931: fix regulator node pointer
  regulator: Add UAPI headers to MAINTAINERS
  regulator: uapi: Use UAPI integer type

Merge tag 'drm-fixes-2025-12-27' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Post overeating fixes, only msm for this week has anything, so quiet
  as expected.

  msm:
   - GPU:
      - Fix crash on a7xx GPUs not supporting IFPC
      - Fix perfcntr use with IFPC
      - Concurrent binning fix
   - DPU:
      - Fixed DSC and SSPP fetching issues
      - Switched to scnprint instead of snprintf
      - Added missing NULL checks in pingpong code"

* tag 'drm-fixes-2025-12-27' of https://gitlab.freedesktop.org/drm/kernel: (27 commits)
  drm/msm: Replace unsafe snprintf usage with scnprintf
  drm/msm/dpu: Add missing NULL pointer check for pingpong interface
  Revert "drm/msm/dpu: Enable quad-pipe for DSC and dual-DSI case"
  Revert "drm/msm/dpu: support plane splitting in quad-pipe case"
  drm/msm: msm_iommu.c: fix all kernel-doc warnings
  drm/msm: msm_gpu.h: fix all kernel-doc warnings
  drm/msm: msm_gem_vma.c: fix all kernel-doc warnings
  drm/msm: msm_fence.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_wb.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_vbif.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_top.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_sspp.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_pingpong.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_merge3d.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_lm.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_intf.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_dspp.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_dsc.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_cwb.h: fix all kernel-doc warnings
  drm/msm/dpu: dpu_hw_ctl.h: fix all kernel-doc warnings
  ...

Merge tag 'drm-msm-fixes-2025-12-26' of https://gitlab.freedesktop.org/drm/msm into drm-fixes

Fixes for v6.19:

GPU:
- Fix crash on a7xx GPUs not supporting IFPC
- Fix perfcntr use with IFPC
- Concurrent binning fix

DPU:
- Fixed DSC and SSPP fetching issues
- Switched to scnprint instead of snprintf
- Added missing NULL checks in pingpong code

Also documentation fixes.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <rob.clark@oss.qualcomm.com>
Link: https://patch.msgid.link/CACSVV01jcLLChsFtmqc4VDNoQ2ic2q+d86n3wdoSUdmW6xaSdQ@mail.gmail.com

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Three HBA driver and one upper level driver (sg) fix.

  The sg change is the largest, but that results mostly from moving code
  to avoid the described race condition"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: core: Add ufshcd_update_evt_hist() for UFS suspend error
  scsi: sg: Fix occasional bogus elapsed time that exceeds timeout
  scsi: mpi3mr: Read missing IOCFacts flag for reply queue full overflow
  scsi: scsi_debug: Fix atomic write enable module param description

Merge tag 'v6.19-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fix from Steve French:

- Fix potential memory leak

* tag 'v6.19-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix memory and information leak in smb3_reconfigure()

Merge tag 'driver-core-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core fixes from Danilo Krummrich:

- Introduce DMA Rust helpers to avoid build errors when !CONFIG_HAS_DMA

- Remove unnecessary (and hence incorrect) endian conversion in the
   Rust PCI driver sample code

- Fix memory leak in the unwind path of debugfs_change_name()

- Support non-const struct software_node pointers in
   SOFTWARE_NODE_REFERENCE(), after introducing _Generic()

- Avoid NULL pointer dereference in the unwind path of
   simple_xattrs_free()

* tag 'driver-core-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  fs/kernfs: null-ptr deref in simple_xattrs_free()
  software node: Also support referencing non-constant software nodes
  debugfs: Fix memleak in debugfs_change_name().
  samples: rust: fix endianness issue in rust_driver_pci
  rust: dma: add helpers for architectures without CONFIG_HAS_DMA

Merge tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:
"A couple of fixes for EFI regressions introduced this cycle:

   - Make EDID handling in the EFI stub mixed mode safe

   - Ensure that efi_mm.user_ns has a sane value - this is needed now
     that EFI runtime calls are preemptible on arm64"

* tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()
  arm64: efi: Fix NULL pointer dereference by initializing user_ns
  efi/libstub: gop: Fix EDID support in mixed-mode

Merge tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

- Fix for a signedness issue introduced in this kernel release for rnbd

- Fix up user copy references for ublk when the server exits

* tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: rnbd-clt: Fix signedness bug in init_dev()
ublk: clean up user copy references on ublk server exit

Merge tag 'io_uring-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fix from Jens Axboe:
"Just a single fix for a bug that can cause a leak of the filename with
  IORING_OP_OPENAT, if direct descriptors are asked for and O_CLOEXEC
  has been set in the request flags"

* tag 'io_uring-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: fix filename leak in __io_openat_prep()

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
"Just a bunch of fixes, mostly trivial ones in tools/virtio"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vhost/vsock: improve RCU read sections around vhost_vsock_get()
  tools/virtio: add device, device_driver stubs
  tools/virtio: fix up oot build
  virtio_features: make it self-contained
  tools/virtio: switch to kernel's virtio_config.h
  tools/virtio: stub might_sleep and synchronize_rcu
  tools/virtio: add struct cpumask to cpumask.h
  tools/virtio: pass KCFLAGS to module build
  tools/virtio: add ucopysize.h stub
  tools/virtio: add dev_WARN_ONCE and is_vmalloc_addr stubs
  tools/virtio: stub DMA mapping functions
  tools/virtio: add struct module forward declaration
  tools/virtio: use kernel's virtio.h
  virtio: make it self-contained
  tools/virtio: fix up compiler.h stub

Merge tag 'v6.19-rc2-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

- Fix parsing of SMB1 negotiate request by adjusting offsets affected
   by the removal of the RFC1002 length field from the SMB header

- Update minimum PDU size macros for both SMB1 and SMB2

- Rename smb2_get_msg function to smb_get_msg to better reflect its
   role in handling both SMB1 and SMB2 requests

* tag 'v6.19-rc2-smb3-server-fixes' of git://git.samba.org/ksmbd:
  smb/server: fix minimum SMB2 PDU size
  smb/server: fix minimum SMB1 PDU size
  ksmbd: rename smb2_get_msg to smb_get_msg
  ksmbd: Fix to handle removal of rfc1002 header from smb_hdr

firewire: nosy: Fix dma_free_coherent() size

It looks like the buffer allocated and mapped in add_card() is done
with size RCV_BUFFER_SIZE which is 16 KB and 4KB.

Fixes: 286468210d83 ("firewire: new driver: nosy - IEEE 1394 traffic sniffer")
Co-developed-by: Thomas Fourier <fourier.thomas@gmail.com>
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Co-developed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20251216165420.38355-2-fourier.thomas@gmail.com
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>

io_uring: fix filename leak in __io_openat_prep()

__io_openat_prep() allocates a struct filename using getname(). However,
for the condition of the file being installed in the fixed file table as
well as having O_CLOEXEC flag set, the function returns early. At that
point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this,
the memory for the newly allocated struct filename is not cleaned up,
causing a memory leak.

Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the
successful getname() call, so that when the request is torn down, the
filename will be cleaned up, along with other resources needing cleanup.

Reported-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f
Tested-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Fixes: b9445598d8c6 ("io_uring: openat directly into fixed fd table")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()

Add a WARN_ON_ONCE() check to detect mm_struct instances that are
missing user_ns initialization when passed to kthread_use_mm().

When a kthread adopts an mm via kthread_use_mm(), LSM hooks and
capability checks may access current->mm->user_ns for credential
validation. If user_ns is NULL, this leads to a NULL pointer
dereference crash.

This was observed with efi_mm on arm64, where commit a5baf582f4c0
("arm64/efi: Call EFI runtime services without disabling preemption")
introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns
initialization, causing crashes during /proc access.

Adding this warning helps catch similar bugs early during development
rather than waiting for hard-to-debug NULL pointer crashes in
production.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

arm64: efi: Fix NULL pointer dereference by initializing user_ns

Linux 6.19-rc2 (9448598b22c5 ("Linux 6.19-rc2")) is crashing with a NULL
pointer dereference on arm64 hosts:

  Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c8
   pc : cap_capable (security/commoncap.c:82 security/commoncap.c:128)
   Call trace:
    cap_capable (security/commoncap.c:82 security/commoncap.c:128) (P)
    security_capable (security/security.c:?)
    ns_capable_noaudit (kernel/capability.c:342 kernel/capability.c:381)
    __ptrace_may_access (./include/linux/rcupdate.h:895 kernel/ptrace.c:326)
    ptrace_may_access (kernel/ptrace.c:353)
    do_task_stat (fs/proc/array.c:467)
    proc_tgid_stat (fs/proc/array.c:673)
    proc_single_show (fs/proc/base.c:803)

I've bissected the problem to commit a5baf582f4c0 ("arm64/efi: Call EFI
runtime services without disabling preemption").

>From my analyzes, the crash occurs because efi_mm lacks a user_ns field
initialization. This was previously harmless, but commit a5baf582f4c0
("arm64/efi: Call EFI runtime services without disabling preemption")
changed the EFI runtime call path to use kthread_use_mm(&efi_mm), which
temporarily adopts efi_mm as the current mm for the calling kthread.

When a thread has an active mm, LSM hooks like cap_capable() expect
mm->user_ns to be valid for credential checks. With efi_mm.user_ns being
NULL, capability checks during possible /proc access dereference the
NULL pointer and crash.

Fix by initializing efi_mm.user_ns to &init_user_ns.

Fixes: a5baf582f4c0 ("arm64/efi: Call EFI runtime services without disabling preemption")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

efi/libstub: gop: Fix EDID support in mixed-mode

The efi_edid_discovered_protocol and efi_edid_active_protocol have mixed
mode fields. So all their attributes should be accessed through
the efi_table_attr() helper.

Doing so fixes the upper 32 bits of the 64 bit gop_edid pointer getting
set to random values (followed by a crash at boot) when booting a x86_64
kernel on a machine with 32 bit UEFI like the Asus T100TA.

Fixes: 17029cdd8f9d ("efi/libstub: gop: Add support for reading EDID")
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Javier Martinez Canillas <javierm@redhat.com>
Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Merge tag 'nfsd-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
"A set of NFSD fixes that arrived just a bit late for the 6.19 merge
  window.

  Regression fixes:
   - Mark variable __maybe_unused to avoid W=1 build break

  Stable fixes:
   - NFSv4 file creation neglects setting ACL
   - Clear TIME_DELEG in the suppattr_exclcreat bitmap
   - Clear SECLABEL in the suppattr_exclcreat bitmap
   - Fix memory leak in nfsd_create_serv error paths
   - Bound check rq_pages index in inline path
   - Return 0 on success from svc_rdma_copy_inline_range
   - Use rc_pageoff for memcpy byte offset
   - Avoid NULL deref on zero length gss_token in gss_read_proxy_verf"

* tag 'nfsd-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: NFSv4 file creation neglects setting ACL
  NFSD: Clear TIME_DELEG in the suppattr_exclcreat bitmap
  NFSD: Clear SECLABEL in the suppattr_exclcreat bitmap
  nfsd: fix memory leak in nfsd_create_serv error paths
  nfsd: Mark variable __maybe_unused to avoid W=1 build break
  svcrdma: bound check rq_pages index in inline path
  svcrdma: return 0 on success from svc_rdma_copy_inline_range
  svcrdma: use rc_pageoff for memcpy byte offset
  SUNRPC: svcauth_gss: avoid NULL deref on zero length gss_token in gss_read_proxy_verf

Merge tag 'erofs-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fix from Gao Xiang:
"Junbeom reported that synchronous reads could hit unintended EIOs
  under memory pressure due to incorrect error propagation in
  z_erofs_decompress_queue(), where earlier physical clusters in the
  same decompression queue may be served for another readahead.

  This addresses the issue by decompressing each physical cluster
  independently as long as disk I/Os succeed, rather than being impacted
  by the error status of previous physical clusters in the same queue.

  Summary:

   - Fix unexpected EIOs under memory pressure caused by recent
     incorrect error propagation logic"

* tag 'erofs-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix unexpected EIO under memory pressure

cifs: Fix memory and information leak in smb3_reconfigure()

In smb3_reconfigure(), if smb3_sync_session_ctx_passwords() fails, the
function returns immediately without freeing and erasing the newly
allocated new_password and new_password2. This causes both a memory leak
and a potential information leak.

Fix this by calling kfree_sensitive() on both password buffers before
returning in this error case.

Fixes: 0f0e357902957 ("cifs: during remount, make sure passwords are in sync")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>

drm/msm: Replace unsafe snprintf usage with scnprintf

The refill_buf function uses snprintf to append to a fixed-size buffer.
snprintf returns the length that would have been written, which can
exceed the remaining buffer size. If this happens, ptr advances beyond
the buffer and rem becomes negative. In the 2nd iteration, rem is
treated as a large unsigned integer, causing snprintf to write oob.

While this behavior is technically mitigated by num_perfcntrs being
locked at 5, it's still unsafe if num_perfcntrs were ever to change/a
second source was added.

Signed-off-by: Evan Lambert <veyga@veygax.dev>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/696358/
Link: https://lore.kernel.org/r/20251224124254.17920-3-veyga@veygax.dev
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>

vhost/vsock: improve RCU read sections around vhost_vsock_get()

vhost_vsock_get() uses hash_for_each_possible_rcu() to find the
`vhost_vsock` associated with the `guest_cid`. hash_for_each_possible_rcu()
should only be called within an RCU read section, as mentioned in the
following comment in include/linux/rculist.h:

/**
* hlist_for_each_entry_rcu - iterate over rcu list of given type
* @pos: the type * to use as a loop cursor.
* @head: the head for your list.
* @member: the name of the hlist_node within the struct.
* @cond: optional lockdep expression if called from non-RCU protection.
*
* This list-traversal primitive may safely run concurrently with
* the _rcu list-mutation primitives such as hlist_add_head_rcu()
* as long as the traversal is guarded by rcu_read_lock().
*/

Currently, all calls to vhost_vsock_get() are between rcu_read_lock()
and rcu_read_unlock() except for calls in vhost_vsock_set_cid() and
vhost_vsock_reset_orphans(). In both cases, the current code is safe,
but we can make improvements to make it more robust.

About vhost_vsock_set_cid(), when building the kernel with
CONFIG_PROVE_RCU_LIST enabled, we get the following RCU warning when the
user space issues `ioctl(dev, VHOST_VSOCK_SET_GUEST_CID, ...)` :

  WARNING: suspicious RCU usage
  6.18.0-rc7 #62 Not tainted
  -----------------------------
  drivers/vhost/vsock.c:74 RCU-list traversed in non-reader section!!

  other info that might help us debug this:

  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by rpc-libvirtd/3443:
   #0: ffffffffc05032a8 (vhost_vsock_mutex){+.+.}-{4:4}, at: vhost_vsock_dev_ioctl+0x2ff/0x530 [vhost_vsock]

  stack backtrace:
  CPU: 2 UID: 0 PID: 3443 Comm: rpc-libvirtd Not tainted 6.18.0-rc7 #62 PREEMPT(none)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-7.fc42 06/10/2025
  Call Trace:
   <TASK>
   dump_stack_lvl+0x75/0xb0
   dump_stack+0x14/0x1a
   lockdep_rcu_suspicious.cold+0x4e/0x97
   vhost_vsock_get+0x8f/0xa0 [vhost_vsock]
   vhost_vsock_dev_ioctl+0x307/0x530 [vhost_vsock]
   __x64_sys_ioctl+0x4f2/0xa00
   x64_sys_call+0xed0/0x1da0
   do_syscall_64+0x73/0xfa0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   ...
   </TASK>

This is not a real problem, because the vhost_vsock_get() caller, i.e.
vhost_vsock_set_cid(), holds the `vhost_vsock_mutex` used by the hash
table writers. Anyway, to prevent that warning, add lockdep_is_held()
condition to hash_for_each_possible_rcu() to verify that either the
caller is in an RCU read section or `vhost_vsock_mutex` is held when
CONFIG_PROVE_RCU_LIST is enabled; and also clarify the comment for
vhost_vsock_get() to better describe the locking requirements and the
scope of the returned pointer validity.

About vhost_vsock_reset_orphans(), currently this function is only
called via vsock_for_each_connected_socket(), which holds the
`vsock_table_lock` spinlock (which is also an RCU read-side critical
section). However, add an explicit RCU read lock there to make the code
more robust and explicit about the RCU requirements, and to prevent
issues if the calling context changes in the future or if
vhost_vsock_reset_orphans() is called from other contexts.

Fixes: 834e772c8db0 ("vhost/vsock: fix use-after-free in network stack callers")
Cc: stefanha@redhat.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20251126133826.142496-1-sgarzare@redhat.com>
Message-ID: <20251126210313.GA499503@fedora>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: add device, device_driver stubs

Add stubs needed by virtio.h

Message-ID: <0fabf13f6ea812ebc73b1c919fb17d4dec1545db.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: fix up oot build

oot build tends to help uncover bugs so it's worth keeping around,
as long as it's low effort.
add stubs for a couple of macros virtio gained recently,
and disable vdpa in the test build.

Message-ID: <33968faa7994b86d1f78057358a50b8f460c7a23.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio_features: make it self-contained

virtio_features.h uses WARN_ON_ONCE and memset so it must
include linux/bug.h and linux/string.h

Message-ID: <579986aa9b8d023844990d2a0e267382f8ad85d5.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: switch to kernel's virtio_config.h

Drops stubs in virtio_config.h, use the kernel's version instead - we
are now activly developing it, so the stub became too hard to maintain.

Message-ID: <8e5c85dc8aad001f161f7e2d8799ffbccfc31381.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: stub might_sleep and synchronize_rcu

Add might_sleep() and synchronize_rcu() stubs needed by virtio_config.h.

might_sleep() is a no-op, synchronize_rcu doesn't work but we don't
need it to.

Created using Cursor CLI.

Message-ID: <5557e026335d808acd7b890693ee1382e73dd33a.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: add struct cpumask to cpumask.h

Add struct cpumask stub used by virtio_config.h.

Created using Cursor CLI.

Message-ID: <eacf56399ba220513ebcd610f4a5115dc768db80.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: pass KCFLAGS to module build

Update the mod target to pass KCFLAGS with the in-tree vhost driver
include path. This way vhost_test can find vhost headers.

Created using Cursor CLI.

Message-ID: <5473e5a5dfd2fcd261a778f2017cac669c031f23.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: add ucopysize.h stub

Add ucopysize.h with stub implementations of check_object_size,
copy_overflow, and check_copy_size.

Created using Cursor CLI.

Message-ID: <5046df90002bb744609248404b81d33b559fe813.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: add dev_WARN_ONCE and is_vmalloc_addr stubs

Add dev_WARN_ONCE and is_vmalloc_addr stubs needed by virtio_ring.c.
is_vmalloc_addr stub always returns false - that's fine since it's
merely a sanity check.

Created using Cursor CLI.

Message-ID: <749e7a03b7cd56baf50a27efc3b05e50cf8f36b6.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: stub DMA mapping functions

Add dma_map_page_attrs and dma_unmap_page_attrs stubs.
Follow the same pattern as existing DMA mapping stubs.

Created using Cursor CLI.

Message-ID: <3512df1fe0e2129ea493434a21c940c50381cc93.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: add struct module forward declaration

Declarate struct module in our linux/module.h stub.

Created using Cursor CLI.

Message-ID: <c01b8d24159664cc8c49354088efa342ae9e7321.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

tools/virtio: use kernel's virtio.h

Replace virtio stubs with an include of the kernel header.

Message-ID: <33daf1033fc447eb8e3e54d21013ccfd99550e37.1764873799.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>