git.ipfire.org Git - thirdparty/kernel/linux.git/log

hv_netvsc: Fix panic during namespace deletion with VF

The existing code move the VF NIC to new namespace when NETDEV_REGISTER is
received on netvsc NIC. During deletion of the namespace,
default_device_exit_batch() >> default_device_exit_net() is called. When
netvsc NIC is moved back and registered to the default namespace, it
automatically brings VF NIC back to the default namespace. This will cause
the default_device_exit_net() >> for_each_netdev_safe loop unable to detect
the list end, and hit NULL ptr:

[  231.449420] mana 7870:00:00.0 enP30832s1: Moved VF to namespace with: eth0
[  231.449656] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  231.450246] #PF: supervisor read access in kernel mode
[  231.450579] #PF: error_code(0x0000) - not-present page
[  231.450916] PGD 17b8a8067 P4D 0
[  231.451163] Oops: Oops: 0000 [#1] SMP NOPTI
[  231.451450] CPU: 82 UID: 0 PID: 1394 Comm: kworker/u768:1 Not tainted 6.16.0-rc4+ #3 VOLUNTARY
[  231.452042] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/21/2024
[  231.452692] Workqueue: netns cleanup_net
[  231.452947] RIP: 0010:default_device_exit_batch+0x16c/0x3f0
[  231.453326] Code: c0 0c f5 b3 e8 d5 db fe ff 48 85 c0 74 15 48 c7 c2 f8 fd ca b2 be 10 00 00 00 48 8d 7d c0 e8 7b 77 25 00 49 8b 86 28 01 00 00 <48> 8b 50 10 4c 8b 2a 4c 8d 62 f0 49 83 ed 10 4c 39 e0 0f 84 d6 00
[  231.454294] RSP: 0018:ff75fc7c9bf9fd00 EFLAGS: 00010246
[  231.454610] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 61c8864680b583eb
[  231.455094] RDX: ff1fa9f71462d800 RSI: ff75fc7c9bf9fd38 RDI: 0000000030766564
[  231.455686] RBP: ff75fc7c9bf9fd78 R08: 0000000000000000 R09: 0000000000000000
[  231.456126] R10: 0000000000000001 R11: 0000000000000004 R12: ff1fa9f70088e340
[  231.456621] R13: ff1fa9f70088e340 R14: ffffffffb3f50c20 R15: ff1fa9f7103e6340
[  231.457161] FS:  0000000000000000(0000) GS:ff1faa6783a08000(0000) knlGS:0000000000000000
[  231.457707] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  231.458031] CR2: 0000000000000010 CR3: 0000000179ab2006 CR4: 0000000000b73ef0
[  231.458434] Call Trace:
[  231.458600]  <TASK>
[  231.458777]  ops_undo_list+0x100/0x220
[  231.459015]  cleanup_net+0x1b8/0x300
[  231.459285]  process_one_work+0x184/0x340

To fix it, move the ns change to a workqueue, and take rtnl_lock to avoid
changing the netdev list when default_device_exit_net() is using it.

Cc: stable@vger.kernel.org
Fixes: 4c262801ea60 ("hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1754511711-11188-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hamradio: ignore ops-locked netdevs

Syzkaller managed to trigger lock dependency in xsk_notify via
register_netdevice. As discussed in [0], using register_netdevice
in the notifiers is problematic so skip adding hamradio for ops-locked
devices.

       xsk_notifier+0x89/0x230 net/xdp/xsk.c:1664
       notifier_call_chain+0x1b6/0x3e0 kernel/notifier.c:85
       call_netdevice_notifiers_extack net/core/dev.c:2267 [inline]
       call_netdevice_notifiers net/core/dev.c:2281 [inline]
       unregister_netdevice_many_notify+0x14d7/0x1ff0 net/core/dev.c:12156
       unregister_netdevice_many net/core/dev.c:12219 [inline]
       unregister_netdevice_queue+0x33c/0x380 net/core/dev.c:12063
       register_netdevice+0x1689/0x1ae0 net/core/dev.c:11241
       bpq_new_device drivers/net/hamradio/bpqether.c:481 [inline]
       bpq_device_event+0x491/0x600 drivers/net/hamradio/bpqether.c:523
       notifier_call_chain+0x1b6/0x3e0 kernel/notifier.c:85
       call_netdevice_notifiers_extack net/core/dev.c:2267 [inline]
       call_netdevice_notifiers net/core/dev.c:2281 [inline]
       __dev_notify_flags+0x18d/0x2e0 net/core/dev.c:-1
       netif_change_flags+0xe8/0x1a0 net/core/dev.c:9608
       dev_change_flags+0x130/0x260 net/core/dev_api.c:68
       devinet_ioctl+0xbb4/0x1b50 net/ipv4/devinet.c:1200
       inet_ioctl+0x3c0/0x4c0 net/ipv4/af_inet.c:1001

0: https://lore.kernel.org/netdev/20250625140357.6203d0af@kernel.org/
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+e6300f66a999a6612477@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e6300f66a999a6612477
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250806213726.1383379-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lapbether: ignore ops-locked netdevs

Syzkaller managed to trigger lock dependency in xsk_notify via
register_netdevice. As discussed in [0], using register_netdevice
in the notifiers is problematic so skip adding lapbeth for ops-locked
devices.

       xsk_notifier+0xa4/0x280 net/xdp/xsk.c:1645
       notifier_call_chain+0xbc/0x410 kernel/notifier.c:85
       call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:2230
       call_netdevice_notifiers_extack net/core/dev.c:2268 [inline]
       call_netdevice_notifiers net/core/dev.c:2282 [inline]
       unregister_netdevice_many_notify+0xf9d/0x2700 net/core/dev.c:12077
       unregister_netdevice_many net/core/dev.c:12140 [inline]
       unregister_netdevice_queue+0x305/0x3f0 net/core/dev.c:11984
       register_netdevice+0x18f1/0x2270 net/core/dev.c:11149
       lapbeth_new_device drivers/net/wan/lapbether.c:420 [inline]
       lapbeth_device_event+0x5b1/0xbe0 drivers/net/wan/lapbether.c:462
       notifier_call_chain+0xbc/0x410 kernel/notifier.c:85
       call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:2230
       call_netdevice_notifiers_extack net/core/dev.c:2268 [inline]
       call_netdevice_notifiers net/core/dev.c:2282 [inline]
       __dev_notify_flags+0x12c/0x2e0 net/core/dev.c:9497
       netif_change_flags+0x108/0x160 net/core/dev.c:9526
       dev_change_flags+0xba/0x250 net/core/dev_api.c:68
       devinet_ioctl+0x11d5/0x1f50 net/ipv4/devinet.c:1200
       inet_ioctl+0x3a7/0x3f0 net/ipv4/af_inet.c:1001

0: https://lore.kernel.org/netdev/20250625140357.6203d0af@kernel.org/
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+e67ea9c235b13b4f0020@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e67ea9c235b13b4f0020
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250806213726.1383379-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: Fix KSZ8863 reset problem

ksz8873_valid_regs[] was added for register access for KSZ8863/KSZ8873
switches, but the reset register is not in the list so
ksz8_reset_switch() does not take any effect.

Replace regmap_update_bits() using ksz_regmap_8 with ksz_rmw8() so that
an error message will be given if the register is not defined.

A side effect of not resetting the switch is the static MAC table is not
cleared. Further additions to the table will show write error as there
are only 8 entries in the table.

Fixes: d0dec3333040 ("net: dsa: microchip: Add register access control for KSZ8873 chip")
Signed-off-by: Tristram Ha <tristram.ha@microchip.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250807005453.8306-1-Tristram.Ha@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: linearize cloned gso packets in sctp_rcv

A cloned head skb still shares these frag skbs in fraglist with the
original head skb. It's not safe to access these frag skbs.

syzbot reported two use-of-uninitialized-memory bugs caused by this:

  BUG: KMSAN: uninit-value in sctp_inq_pop+0x15b7/0x1920 net/sctp/inqueue.c:211
   sctp_inq_pop+0x15b7/0x1920 net/sctp/inqueue.c:211
   sctp_assoc_bh_rcv+0x1a7/0xc50 net/sctp/associola.c:998
   sctp_inq_push+0x2ef/0x380 net/sctp/inqueue.c:88
   sctp_backlog_rcv+0x397/0xdb0 net/sctp/input.c:331
   sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1122
   __release_sock+0x1da/0x330 net/core/sock.c:3106
   release_sock+0x6b/0x250 net/core/sock.c:3660
   sctp_wait_for_connect+0x487/0x820 net/sctp/socket.c:9360
   sctp_sendmsg_to_asoc+0x1ec1/0x1f00 net/sctp/socket.c:1885
   sctp_sendmsg+0x32b9/0x4a80 net/sctp/socket.c:2031
   inet_sendmsg+0x25a/0x280 net/ipv4/af_inet.c:851
   sock_sendmsg_nosec net/socket.c:718 [inline]

and

  BUG: KMSAN: uninit-value in sctp_assoc_bh_rcv+0x34e/0xbc0 net/sctp/associola.c:987
   sctp_assoc_bh_rcv+0x34e/0xbc0 net/sctp/associola.c:987
   sctp_inq_push+0x2a3/0x350 net/sctp/inqueue.c:88
   sctp_backlog_rcv+0x3c7/0xda0 net/sctp/input.c:331
   sk_backlog_rcv+0x142/0x420 include/net/sock.h:1148
   __release_sock+0x1d3/0x330 net/core/sock.c:3213
   release_sock+0x6b/0x270 net/core/sock.c:3767
   sctp_wait_for_connect+0x458/0x820 net/sctp/socket.c:9367
   sctp_sendmsg_to_asoc+0x223a/0x2260 net/sctp/socket.c:1886
   sctp_sendmsg+0x3910/0x49f0 net/sctp/socket.c:2032
   inet_sendmsg+0x269/0x2a0 net/ipv4/af_inet.c:851
   sock_sendmsg_nosec net/socket.c:712 [inline]

This patch fixes it by linearizing cloned gso packets in sctp_rcv().

Fixes: 90017accff61 ("sctp: Add GSO support")
Reported-by: syzbot+773e51afe420baaf0e2b@syzkaller.appspotmail.com
Reported-by: syzbot+70a42f45e76bede082be@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/dd7dc337b99876d4132d0961f776913719f7d225.1754595611.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/power turbostat: Fix bogus SysWatt for forked program

Similar to delta_cpu(), delta_platform() is called in turbostat main
loop. This ensures accurate SysWatt readings in periodic monitoring mode
$ sudo turbostat -S -q --show power -i 1
CoreTmp PkgTmp PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_% SysWatt
60 61 6.21 1.13 0.16 0.00 0.00 0.00 13.07
58 61 6.00 1.07 0.18 0.00 0.00 0.00 12.75
58 61 5.74 1.05 0.17 0.00 0.00 0.00 12.22
58 60 6.27 1.11 0.24 0.00 0.00 0.00 13.55

However, delta_platform() is missing for forked program and causes bogus
SysWatt reporting,
$ sudo turbostat -S -q --show power sleep 1
1.004736 sec
CoreTmp PkgTmp PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_% SysWatt
57 58 6.05 1.02 0.16 0.00 0.00 0.00 0.03

Add missing delta_platform() for forked program.

Fixes: e5f687b89bc2 ("tools/power turbostat: Add RAPL psys as a built-in counter")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

tools/power turbostat: Handle cap_get_proc() ENOSYS

Kernels configured with CONFIG_MULTIUSER=n have no cap_get_proc().
Check for ENOSYS to recognize this case, and continue on to
attempt to access the requested MSRs (such as temperature).

Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Len Brown <len.brown@intel.com>

tools/power turbostat: Fix build with musl

turbostat.c: In function 'parse_int_file':
    turbostat.c:5567:19: error: 'PATH_MAX' undeclared (first use in this function)
     5567 |         char path[PATH_MAX];
          |                   ^~~~~~~~

    turbostat.c: In function 'probe_graphics':
    turbostat.c:6787:19: error: 'PATH_MAX' undeclared (first use in this function)
     6787 |         char path[PATH_MAX];
          |                   ^~~~~~~~

Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

tools/power turbostat: verify arguments to params --show and --hide

$ sudo turbostat --quiet --show junk
turbostat: Counter 'junk' can not be added.

Previously, invalid arguments to --show and --hide were silently ignored

Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

vsock: Do not allow binding to VMADDR_PORT_ANY

It is possible for a vsock to autobind to VMADDR_PORT_ANY. This can
cause a use-after-free when a connection is made to the bound socket.
The socket returned by accept() also has port VMADDR_PORT_ANY but is not
on the list of unbound sockets. Binding it will result in an extra
refcount decrement similar to the one fixed in fcdd2242c023 (vsock: Keep
the binding until socket destruction).

Modify the check in __vsock_bind_connectible() to also prevent binding
to VMADDR_PORT_ANY.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250807041811.678-1-markovicbudimir@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icss-iep: Fix incorrect type for return value in extts_enable()

The variable ret in icss_iep_extts_enable() was incorrectly declared
as u32, while the function returns int and may return negative error
codes. This will cause sign extension issues and incorrect error
propagation. Update ret to be int to fix error handling.

This change corrects the declaration to avoid potential type mismatch.

Fixes: c1e0230eeaab ("net: ti: icss-iep: Add IEP driver")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250805142323.1949406-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: page_pool: allow enabling recycling late, fix false positive warning

Page pool can have pages "directly" (locklessly) recycled to it,
if the NAPI that owns the page pool is scheduled to run on the same CPU.
To make this safe we check that the NAPI is disabled while we destroy
the page pool. In most cases NAPI and page pool lifetimes are tied
together so this happens naturally.

The queue API expects the following order of calls:
-> mem_alloc
    alloc new pp
-> stop
    napi_disable
-> start
    napi_enable
-> mem_free
    free old pp

Here we allocate the page pool in ->mem_alloc and free in ->mem_free.
But the NAPIs are only stopped between ->stop and ->start. We created
page_pool_disable_direct_recycling() to safely shut down the recycling
in ->stop. This way the page_pool_destroy() call in ->mem_free doesn't
have to worry about recycling any more.

Unfortunately, the page_pool_disable_direct_recycling() is not enough
to deal with failures which necessitate freeing the _new_ page pool.
If we hit a failure in ->mem_alloc or ->stop the new page pool has
to be freed while the NAPI is active (assuming driver attaches the
page pool to an existing NAPI instance and doesn't reallocate NAPIs).

Freeing the new page pool is technically safe because it hasn't been
used for any packets, yet, so there can be no recycling. But the check
in napi_assert_will_not_race() has no way of knowing that. We could
check if page pool is empty but that'd make the check much less likely
to trigger during development.

Add page_pool_enable_direct_recycling(), pairing with
page_pool_disable_direct_recycling(). It will allow us to create the new
page pools in "disabled" state and only enable recycling when we know
the reconfig operation will not fail.

Coincidentally it will also let us re-enable the recycling for the old
pool, if the reconfig failed:

-> mem_alloc (new)
-> stop (old)
    # disables direct recycling for old
-> start (new)
    # fail!!
-> start (old)
    # go back to old pp but direct recycling is lost :(
-> mem_free (new)

The new helper is idempotent to make the life easier for drivers,
which can operate in HDS mode and support zero-copy Rx.
The driver can call the helper twice whether there are two pools
or it has multiple references to a single pool.

Fixes: 40eca00ae605 ("bnxt_en: unlink page pool when stopping Rx queue")
Tested-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250805003654.2944974-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icssg-prueth: Fix emac link speed handling

When link settings are changed emac->speed is populated by
emac_adjust_link(). The link speed and other settings are then written into
the DRAM. However if both ports are brought down after this and brought up
again or if the operating mode is changed and a firmware reload is needed,
the DRAM is cleared by icssg_config(). As a result the link settings are
lost.

Fix this by calling emac_adjust_link() after icssg_config(). This re
populates the settings in the DRAM after a new firmware load.

Fixes: 9facce84f406 ("net: ti: icssg-prueth: Fix firmware load sequence.")
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Message-ID: <20250805173812.2183161-1-danishanwar@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'there-are-some-bugfix-for-hibmcge-ethernet-driver'

Jijie Shao says:

====================
There are some bugfix for hibmcge ethernet driver

This patch set is intended to fix several issues for hibmcge driver:
1. Holding the rtnl_lock in pci_error_handlers->reset_prepare()
   may lead to a deadlock issue.
   2. A division by zero issue caused by debugfs when the port is down.
   3. A probabilistic false positive issue with np_link_fail.

v2: https://lore.kernel.org/20250805181446.3deaceb9@kernel.org
v1: https://lore.kernel.org/20250731134749.4090041-1-shaojijie@huawei.com
====================

Link: https://patch.msgid.link/20250806102758.3632674-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hibmcge: fix the np_link_fail error reporting issue

Currently, after modifying device port mode, the np_link_ok state
is immediately checked. At this point, the device may not yet ready,
leading to the querying of an intermediate state.

This patch will poll to check if np_link is ok after
modifying device port mode, and only report np_link_fail upon timeout.

Fixes: e0306637e85d ("net: hibmcge: Add support for mac link exception handling feature")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hibmcge: fix the division by zero issue

When the network port is down, the queue is released, and ring->len is 0.
In debugfs, hbg_get_queue_used_num() will be called,
which may lead to a division by zero issue.

This patch adds a check, if ring->len is 0,
hbg_get_queue_used_num() directly returns 0.

Fixes: 40735e7543f9 ("net: hibmcge: Implement .ndo_start_xmit function")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hibmcge: fix rtnl deadlock issue

Currently, the hibmcge netdev acquires the rtnl_lock in
pci_error_handlers.reset_prepare() and releases it in
pci_error_handlers.reset_done().

However, in the PCI framework:
pci_reset_bus - __pci_reset_slot - pci_slot_save_and_disable_locked -
pci_dev_save_and_disable - err_handler->reset_prepare(dev);

In pci_slot_save_and_disable_locked():
list_for_each_entry(dev, &slot->bus->devices, bus_list) {
if (!dev->slot || dev->slot!= slot)
continue;
pci_dev_save_and_disable(dev);
if (dev->subordinate)
pci_bus_save_and_disable_locked(dev->subordinate);
}

This will iterate through all devices under the current bus and execute
err_handler->reset_prepare(), causing two devices of the hibmcge driver
to sequentially request the rtnl_lock, leading to a deadlock.

Since the driver now executes netif_device_detach()
before the reset process, it will not concurrently with
other netdev APIs, so there is no need to hold the rtnl_lock now.

Therefore, this patch removes the rtnl_lock during the reset process and
adjusts the position of HBG_NIC_STATE_RESETTING to ensure
that multiple resets are not executed concurrently.

Fixes: 3f5a61f6d504f ("net: hibmcge: Add reset supported in this module")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-25-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Reinstantiate Florian Westphal as a Netfilter maintainer.

2) Depend on both NETFILTER_XTABLES and NETFILTER_XTABLES_LEGACY,
   from Arnd Bergmann.

3) Use id to annotate last conntrack/expectation visited to resume
   netlink dump, patches from Florian Westphal.

4) Fix bogus element in nft_pipapo avx2 lookup, introduced in
   the last nf-next batch of updates, also from Florian.

5) Return 0 instead of recycling ret variable in
   nf_conntrack_log_invalid_sysctl(), introduced in the last
   nf-next batch of updates, from Dan Carpenter.

6) Fix WARN_ON_ONCE triggered by syzbot with larger cgroup level
   in nft_socket.

* tag 'nf-25-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_socket: remove WARN_ON_ONCE with huge level value
  netfilter: conntrack: clean up returns in nf_conntrack_log_invalid_sysctl()
  netfilter: nft_set_pipapo: don't return bogus extension pointer
  netfilter: ctnetlink: remove refcounting in expectation dumpers
  netfilter: ctnetlink: fix refcount leak on table dump
  netfilter: add back NETFILTER_XTABLES dependencies
  MAINTAINERS: resurrect my netfilter maintainer entry
====================

Link: https://patch.msgid.link/20250807112948.1400523-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

io_uring/memmap: cast nr_pages to size_t before shifting

If the allocated size exceeds UINT_MAX, then it's necessary to cast
the mr->nr_pages value to size_t to prevent it from overflowing. In
practice this isn't much of a concern as the required memory size will
have been validated upfront, and accounted to the user. And > 4GB sizes
will be necessary to make the lack of a cast a problem, which greatly
exceeds normal user locked_vm settings that are generally in the kb to
mb range. However, if root is used, then accounting isn't done, and
then it's possible to hit this issue.

Link: https://lore.kernel.org/all/6895b298.050a0220.7f033.0059.GAE@google.com/
Cc: stable@vger.kernel.org
Reported-by: syzbot+23727438116feb13df15@syzkaller.appspotmail.com
Fixes: 087f997870a9 ("io_uring/memmap: implement mmap for regions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'xfrm: some fixes for GSO with SW crypto'

Sabrina Dubroca says:

====================
This series fixes a few issues with GSO. Some recent patches made the
incorrect assumption that GSO is only used by offload. The first two
patches in this series restore the old behavior.

The final patch is in the UDP GSO code, but fixes an issue with IPsec
that is currently masked by the lack of GSO for SW crypto. With GSO,
VXLAN over IPsec doesn't get checksummed.
====================

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

mailbox/pcc: support mailbox management of the shared buffer

Define a new, optional, callback that allows the driver to
specify how the return data buffer is allocated.  If that callback
is set,  mailbox/pcc.c is now responsible for reading from and
writing to the PCC shared buffer.

This also allows for proper checks of the Commnand complete flag
between the PCC sender and receiver.

For Type 4 channels, initialize the command complete flag prior
to accepting messages.

Since the mailbox does not know what memory allocation scheme
to use for response messages, the client now has an optional
callback that allows it to allocate the buffer for a response
message.

When an outbound message is written to the buffer, the mailbox
checks for the flag indicating the client wants an tx complete
notification via IRQ.  Upon receipt of the interrupt It will
pair it with the outgoing message. The expected use is to
free the kernel memory buffer for the previous outgoing message.

Signed-off-by: Adam Young <admiyo@os.amperecomputing.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>

Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
  Previous releases - regressions:

   - netlink: avoid infinite retry looping in netlink_unicast()

  Previous releases - always broken:

   - packet: fix a race in packet_set_ring() and packet_notifier()

   - ipv6: reject malicious packets in ipv6_gso_segment()

   - sched: mqprio: fix stack out-of-bounds write in tc entry parsing

   - net: drop UFO packets (injected via virtio) in udp_rcv_segment()

   - eth: mlx5: correctly set gso_segs when LRO is used, avoid false
     positive checksum validation errors

   - netpoll: prevent hanging NAPI when netcons gets enabled

   - phy: mscc: fix parsing of unicast frames for PTP timestamping

   - a number of device tree / OF reference leak fixes"

* tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
  pptp: fix pptp_xmit() error path
  net: ti: icssg-prueth: Fix skb handling for XDP_PASS
  net: Update threaded state in napi config in netif_set_threaded
  selftests: netdevsim: Xfail nexthop test on slow machines
  eth: fbnic: Lock the tx_dropped update
  eth: fbnic: Fix tx_dropped reporting
  eth: fbnic: remove the debugging trick of super high page bias
  net: ftgmac100: fix potential NULL pointer access in ftgmac100_phy_disconnect
  dt-bindings: net: Replace bouncing Alexandru Tachici emails
  dpll: zl3073x: ZL3073X_I2C and ZL3073X_SPI should depend on NET
  net/sched: mqprio: fix stack out-of-bounds write in tc entry parsing
  Revert "net: mdio_bus: Use devm for getting reset GPIO"
  selftests: net: packetdrill: xfail all problems on slow machines
  net/packet: fix a race in packet_set_ring() and packet_notifier()
  benet: fix BUG when creating VFs
  net: airoha: npu: Add missing MODULE_FIRMWARE macros
  net: devmem: fix DMA direction on unmapping
  ipa: fix compile-testing with qcom-mdt=m
  eth: fbnic: unlink NAPIs from queues on error to open
  net: Add locking to protect skb->dev access in ip_output
  ...

Merge tag 's390-6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Alexander Gordeev:

- Support MMIO read/write tracing

- Enable THP swapping and THP migration

- Unmask SLCF bit ("stateless command filtering") introduced with CEX8
   cards, so that user space applications like lszcrypt could evaluate
   and list this feature

- Fix the value of high_memory variable, so it considers possible
   tailing offline memory blocks

- Make vmem_pte_alloc() consistent and always allocate memory of
   PAGE_SIZE for page tables. This ensures a page table occupies the
   whole page, as the rest of the code assumes

- Fix kernel image end address in the decompressor debug output

- Fix a typo in debug_sprintf_format_fn() comment

* tag 's390-6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/debug: Fix typo in debug_sprintf_format_fn() comment
  s390/boot: Fix startup debugging log
  s390/mm: Allocate page table with PAGE_SIZE granularity
  s390/mm: Enable THP_SWAP and THP_MIGRATION
  s390: Support CONFIG_TRACE_MMIO_ACCESS
  s390/mm: Set high_memory at the end of the identity mapping
  s390/ap: Unmask SLCF bit in card and queue ap functions sysfs

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull vhost fix from Michael Tsirkin:
"A single fix for a regression in vhost"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: initialize vq->nheads properly

Merge tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"This is the fixes that built up in the merge window, mostly amdgpu and
  xe with one i915 display fix, seems like things are pretty good for
  rc1.

  i915:
   - DP LPFS fixes

  xe:
   - SRIOV: PF fixes and removal of need of module param
   - Fix driver unbind around Devcoredump
   - Mark xe driver as BROKEN if kernel page size is not 4kB

  amdgpu:
   - GC 9.5.0 fixes
   - SMU fix
   - DCE 6 DC fixes
   - mmhub client ID fixes
   - VRR fix
   - Backlight fix
   - UserQ fix
   - Legacy reset fix
   - Misc fixes

  amdkfd:
   - CRIU fix
   - Debugfs fix"

* tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel: (28 commits)
  drm/amdgpu: add missing vram lost check for LEGACY RESET
  drm/amdgpu/discovery: fix fw based ip discovery
  drm/amdkfd: Destroy KFD debugfs after destroy KFD wq
  amdgpu/amdgpu_discovery: increase timeout limit for IFWI init
  drm/amdgpu: Update SDMA firmware version check for user queue support
  drm/amdgpu: Add NULL check for asic_funcs
  drm/amd/display: Revert "drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value"
  drm/amd/display: fix a Null pointer dereference vulnerability
  drm/amd/display: Add primary plane to commits for correct VRR handling
  drm/amdgpu: update mmhub 3.3 client id mappings
  drm/amdgpu: update mmhub 3.0.1 client id mappings
  drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
  drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming.
  drm/amd/display: Don't overwrite dce60_clk_mgr
  drm/amdkfd: Fix checkpoint-restore on multi-xcc
  drm/amd: Restore cached manual clock settings during resume
  drm/amd: Restore cached power limit during resume
  drm/amdgpu: Update external revid for GC v9.5.0
  drm/amdgpu: Update supported modes for GC v9.5.0
  Mark xe driver as BROKEN if kernel page size is not 4kB
  ...

Merge tag 'fbdev-for-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes for 6.17-rc1:

- Revert a patch which broke VGA console

- Fix an out-of-bounds access bug which may happen during console
   resizing when a console is mapped to a frame buffer

* tag 'fbdev-for-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()"
  fbdev: Fix vmalloc out-of-bounds write in fast_imageblit

Merge tag 'loongarch-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch updates from Huacai Chen:

- Complete KSave registers definition

- Support the mem=<size> kernel parameter

- Support BPF dynamic modification & trampoline

- Add MMC/SDIO controller nodes in dts

- Some bug fixes and other small changes

* tag 'loongarch-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: vDSO: Remove -nostdlib complier flag
  LoongArch: dts: Add eMMC/SDIO controller support to Loongson-2K2000
  LoongArch: dts: Add SDIO controller support to Loongson-2K1000
  LoongArch: dts: Add SDIO controller support to Loongson-2K0500
  LoongArch: BPF: Set bpf_jit_bypass_spec_v1/v4()
  LoongArch: BPF: Fix the tailcall hierarchy
  LoongArch: BPF: Fix jump offset calculation in tailcall
  LoongArch: BPF: Add struct ops support for trampoline
  LoongArch: BPF: Add basic bpf trampoline support
  LoongArch: BPF: Add dynamic code modification support
  LoongArch: BPF: Rename and refactor validate_code()
  LoongArch: Add larch_insn_gen_{beq,bne} helpers
  LoongArch: Don't use %pK through printk() in unwinder
  LoongArch: Avoid in-place string operation on FDT content
  LoongArch: Support mem=<size> kernel parameter
  LoongArch: Make relocate_new_kernel_size be a .quad value
  LoongArch: Complete KSave registers definition

smb: server: Fix extension string in ksmbd_extract_shortname()

In ksmbd_extract_shortname(), strscpy() is incorrectly called with the
length of the source string (excluding the NUL terminator) rather than
the size of the destination buffer. This results in "__" being copied
to 'extension' rather than "___" (two underscores instead of three).

Use the destination buffer size instead to ensure that the string "___"
(three underscores) is copied correctly.

Cc: stable@vger.kernel.org
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ksmbd: limit repeated connections from clients with the same IP

Repeated connections from clients with the same IP address may exhaust
the max connections and prevent other normal client connections.
This patch limit repeated connections from clients with the same IP.

Reported-by: tianshuo han <hantianshuo233@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

Merge tag 'amd-drm-fixes-6.17-2025-08-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-fixes-6.17-2025-08-07:

amdgpu:
- GC 9.5.0 fixes
- SMU fix
- DCE 6 DC fixes
- mmhub client ID fixes
- VRR fix
- Backlight fix
- UserQ fix
- Legacy reset fix
- Misc fixes

amdkfd:
- CRIU fix
- Debugfs fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250807132030.1168068-1-alexander.deucher@amd.com

Merge tag 'drm-xe-next-fixes-2025-08-06' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next

- SRIOV: PF fixes and removal of need of module param (Michal)
- Fix driver unbind around Devcoredump (Bala)
- Mark xe driver as BROKEN if kernel page size is not 4kB (Simon)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/aJNXnIAp2Cq-2pZj@intel.com

smb: client: only use a single wait_queue to monitor smbdirect connection status

There's no need for separate conn_wait and disconn_wait queues.

This will simplify the move to common code, the server code
already a single wait_queue for this.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: don't call init_waitqueue_head(&info->conn_wait) twice in _smbd_get_connection

It is already called long before we may hit this cleanup code path.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: improve logging in smbd_conn_upcall()

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: return an error if rdma_connect does not return within 5 seconds

This matches the timeout for tcp connections.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

PCI: vmd: Fix wrong kfree() in vmd_msi_free()

vmd_msi_alloc() allocates struct vmd_irq and stashes it into
irq_data->chip_data associated with the VMD's interrupt domain.
vmd_msi_free() extracts the pointer by calling irq_get_chip_data() and
frees it.

irq_get_chip_data() returns the chip_data associated with the top interrupt
domain. This worked in the past because VMD's interrupt domain was the top
domain.

But d7d8ab87e3e7 ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
changed the interrupt domain hierarchy so VMD's interrupt domain is not the
top domain anymore. irq_get_chip_data() now returns the chip_data at the
MSI devices' interrupt domains. It is therefore broken for vmd_msi_free()
to kfree() this chip_data.

Fix by extracting the chip_data associated with the VMD's interrupt domain.

Fixes: d7d8ab87e3e7 ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
Reported-by: Kenneth Crudup <kenny@panix.com>
Closes: https://lore.kernel.org/linux-pci/dfa40e48-8840-4e61-9fda-25cdb3ad81c1@panix.com/
Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Closes: https://lore.kernel.org/linux-pci/ed53280ed15d1140700b96cca2734bf327ee92539e5eb68e80f5bbbf0f01@linux.gnuweeb.org/
Tested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Tested-by: Kenneth Crudup <kenny@panix.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20250807081051.2253962-1-namcao@linutronix.de

Merge branch 'perf-s390-regression-move-uid-filtering-to-bpf-filters'

Ilya Leoshkevich says:

====================
perf/s390: Regression: Move uid filtering to BPF filters

v4: https://lore.kernel.org/bpf/20250806114227.14617-1-iii@linux.ibm.com/
v4 -> v5: Fix a typo in the commit message (Yonghong).

v3: https://lore.kernel.org/bpf/20250805130346.1225535-1-iii@linux.ibm.com/
v3 -> v4: Rename the new field to dont_enable (Alexei, Eduard).
Switch the Fixes: tag in patch 2 (Alexander, Thomas).
Fix typos in the cover letter (Thomas).

v2: https://lore.kernel.org/bpf/20250728144340.711196-1-tmricht@linux.ibm.com/
v2 -> v3: Use no_ioctl_enable in perf.

v1: https://lore.kernel.org/bpf/20250725093405.3629253-1-tmricht@linux.ibm.com/
v1 -> v2: Introduce no_ioctl_enable (Jiri).

Hi,

This series fixes a regression caused by moving UID filtering to BPF.
The regression affects all events that support auxiliary data, most
notably, "cycles" events on s390, but also PT events on Intel. The
symptom is missing events when UID filtering is enabled.

Patch 1 introduces a new option for the
bpf_program__attach_perf_event_opts() function.
Patch 2 makes use of it in perf, and also contains a lot of technical
details of why exactly the problem is occurring.

Thanks to Thomas Richter for the investigation and the initial version
of this fix, and to Jiri Olsa for suggestions.

Best regards,
Ilya
====================

Link: https://patch.msgid.link/20250806162417.19666-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

perf bpf-filter: Enable events manually

On s390, and, in general, on all platforms where the respective event
supports auxiliary data gathering, the command:

   # ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
   [ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.011 MB perf.data ]
   # ./perf report --stats | grep SAMPLE
   #

does not generate samples in the perf.data file. On x86 the command:

  # sudo perf record -e intel_pt// -u 0 ls

is broken too.

Looking at the sequence of calls in 'perf record' reveals this
behavior:

1. The event 'cycles' is created and enabled:

   record__open()
   +-> evlist__apply_filters()
       +-> perf_bpf_filter__prepare()
   +-> bpf_program.attach_perf_event()
       +-> bpf_program.attach_perf_event_opts()
           +-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...)

   The event 'cycles' is enabled and active now. However the event's
   ring-buffer to store the samples generated by hardware is not
   allocated yet.

2. The event's fd is mmap()ed to create the ring buffer:

   record__open()
   +-> record__mmap()
       +-> record__mmap_evlist()
   +-> evlist__mmap_ex()
       +-> perf_evlist__mmap_ops()
           +-> mmap_per_cpu()
               +-> mmap_per_evsel()
                   +-> mmap__mmap()
                       +-> perf_mmap__mmap()
                           +-> mmap()

   This allocates the ring buffer for the event 'cycles'. With mmap()
   the kernel creates the ring buffer:

   perf_mmap(): kernel function to create the event's ring
   |            buffer to save the sampled data.
   |
   +-> ring_buffer_attach(): Allocates memory for ring buffer.
       |        The PMU has auxiliary data setup function. The
       |        has_aux(event) condition is true and the PMU's
       |        stop() is called to stop sampling. It is not
       |        restarted:
       |
       |        if (has_aux(event))
       |                perf_event_stop(event, 0);
       |
       +-> cpumsf_pmu_stop():

   Hardware sampling is stopped. No samples are generated and saved
   anymore.

3. After the event 'cycles' has been mapped, the event is enabled a
   second time in:

   __cmd_record()
   +-> evlist__enable()
       +-> __evlist__enable()
   +-> evsel__enable_cpu()
       +-> perf_evsel__enable_cpu()
           +-> perf_evsel__run_ioctl()
               +-> perf_evsel__ioctl()
                   +-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .)

   The second

      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

   is just a NOP in this case. The first invocation in (1.) sets the
   event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions

   perf_ioctl()
   +-> _perf_ioctl()
       +-> _perf_event_enable()
           +-> __perf_event_enable()

   return immediately because event::state is already set to
   PERF_EVENT_STATE_ACTIVE.

This happens on s390, because the event 'cycles' offers the possibility
to save auxilary data. The PMU callbacks setup_aux() and free_aux() are
defined. Without both callback functions, cpumsf_pmu_stop() is not
invoked and sampling continues.

To remedy this, remove the first invocation of

   ioctl(..., PERF_EVENT_IOC_ENABLE, ...).

in step (1.) Create the event in step (1.) and enable it in step (3.)
after the ring buffer has been mapped.

Output after:

# ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.876 MB perf.data ]
# ./perf  report --stats | grep SAMPLE
              SAMPLE events:      16200  (99.5%)
              SAMPLE events:      16200
#

The software event succeeded both before and after the patch:

# ./perf record -e cpu-clock -aB --synth=no -u 0 -- \
  ./perf test -w thloop 2
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 2.870 MB perf.data ]
# ./perf  report --stats | grep SAMPLE
              SAMPLE events:      53506  (99.8%)
              SAMPLE events:      53506
#

Fixes: b4c658d4d63d61 ("perf target: Remove uid from target")
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250806162417.19666-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Add the ability to suppress perf event enablement

Automatically enabling a perf event after attaching a BPF prog to it is
not always desirable.

Add a new "dont_enable" field to struct bpf_perf_event_opts. While
introducing "enable" instead would be nicer in that it would avoid
a double negation in the implementation, it would make
DECLARE_LIBBPF_OPTS() less efficient.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250806162417.19666-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

btrfs: fix iteration bug in __qgroup_excl_accounting()

__qgroup_excl_accounting() uses the qgroup iterator machinery to
update the account of one qgroups usage for all its parent hierarchy,
when we either add or remove a relation and have only exclusive usage.

However, there is a small bug there: we loop with an extra iteration
temporary qgroup called `cur` but never actually refer to that in the
body of the loop. As a result, we redundantly account the same usage to
the first qgroup in the list.

This can be reproduced in the following way:

  mkfs.btrfs -f -O squota <dev>
  mount <dev> <mnt>
  btrfs subvol create <mnt>/sv
  dd if=/dev/zero of=<mnt>/sv/f bs=1M count=1
  sync
  btrfs qgroup create 1/100 <mnt>
  btrfs qgroup create 2/200 <mnt>
  btrfs qgroup assign 1/100 2/200 <mnt>
  btrfs qgroup assign 0/256 1/100 <mnt>
  btrfs qgroup show <mnt>

and the broken result is (note the 2MiB on 1/100 and 0Mib on 2/100):

  Qgroupid    Referenced    Exclusive   Path
  --------    ----------    ---------   ----
  0/5           16.00KiB     16.00KiB   <toplevel>
  0/256          1.02MiB      1.02MiB   sv

  Qgroupid    Referenced    Exclusive   Path
  --------    ----------    ---------   ----
  0/5           16.00KiB     16.00KiB   <toplevel>
  0/256          1.02MiB      1.02MiB   sv
  1/100          2.03MiB      2.03MiB   2/100<1 member qgroup>
  2/100            0.00B        0.00B   <0 member qgroups>

With this fix, which simply re-uses `qgroup` as the iteration variable,
we see the expected result:

  Qgroupid    Referenced    Exclusive   Path
  --------    ----------    ---------   ----
  0/5           16.00KiB     16.00KiB   <toplevel>
  0/256          1.02MiB      1.02MiB   sv

  Qgroupid    Referenced    Exclusive   Path
  --------    ----------    ---------   ----
  0/5           16.00KiB     16.00KiB   <toplevel>
  0/256          1.02MiB      1.02MiB   sv
  1/100          1.02MiB      1.02MiB   2/100<1 member qgroup>
  2/100          1.02MiB      1.02MiB   <0 member qgroups>

The existing fstests did not exercise two layer inheritance so this bug
was missed. I intend to add that testing there, as well.

Fixes: a0bdc04b0732 ("btrfs: qgroup: use qgroup_iterator in __qgroup_excl_accounting()")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zoned: do not select metadata BG as finish target

We call btrfs_zone_finish_one_bg() to zone finish one block group and make
room to activate another block group. Currently, we can choose a metadata
block group as a target. But, as we reserve an active metadata block group,
we no longer want to select a metadata block group. So, skip it in the
loop.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: do not allow relocation of partially dropped subvolumes

[BUG]
There is an internal report that balance triggered transaction abort,
with the following call trace:

  item 85 key (594509824 169 0) itemoff 12599 itemsize 33
          extent refs 1 gen 197740 flags 2
          ref#0: tree block backref root 7
  item 86 key (594558976 169 0) itemoff 12566 itemsize 33
          extent refs 1 gen 197522 flags 2
          ref#0: tree block backref root 7
...
BTRFS error (device loop0): extent item not found for insert, bytenr 594526208 num_bytes 16384 parent 449921024 root_objectid 934 owner 1 offset 0
BTRFS error (device loop0): failed to run delayed ref for logical 594526208 num_bytes 16384 type 182 action 1 ref_mod 1: -117
------------[ cut here ]------------
BTRFS: Transaction aborted (error -117)
WARNING: CPU: 1 PID: 6963 at ../fs/btrfs/extent-tree.c:2168 btrfs_run_delayed_refs+0xfa/0x110 [btrfs]

And btrfs check doesn't report anything wrong related to the extent
tree.

[CAUSE]
The cause is a little complex, firstly the extent tree indeed doesn't
have the backref for 594526208.

The extent tree only have the following two backrefs around that bytenr
on-disk:

        item 65 key (594509824 METADATA_ITEM 0) itemoff 13880 itemsize 33
                refs 1 gen 197740 flags TREE_BLOCK
                tree block skinny level 0
                (176 0x7) tree block backref root CSUM_TREE
        item 66 key (594558976 METADATA_ITEM 0) itemoff 13847 itemsize 33
                refs 1 gen 197522 flags TREE_BLOCK
                tree block skinny level 0
                (176 0x7) tree block backref root CSUM_TREE

But the such missing backref item is not an corruption on disk, as the
offending delayed ref belongs to subvolume 934, and that subvolume is
being dropped:

        item 0 key (934 ROOT_ITEM 198229) itemoff 15844 itemsize 439
                generation 198229 root_dirid 256 bytenr 10741039104 byte_limit 0 bytes_used 345571328
                last_snapshot 198229 flags 0x1000000000001(RDONLY) refs 0
                drop_progress key (206324 EXTENT_DATA 2711650304) drop_level 2
                level 2 generation_v2 198229

And that offending tree block 594526208 is inside the dropped range of
that subvolume.  That explains why there is no backref item for that
bytenr and why btrfs check is not reporting anything wrong.

But this also shows another problem, as btrfs will do all the orphan
subvolume cleanup at a read-write mount.

So half-dropped subvolume should not exist after an RW mount, and
balance itself is also exclusive to subvolume cleanup, meaning we
shouldn't hit a subvolume half-dropped during relocation.

The root cause is, there is no orphan item for this subvolume.
In fact there are 5 subvolumes from around 2021 that have the same
problem.

It looks like the original report has some older kernels running, and
caused those zombie subvolumes.

Thankfully upstream commit 8d488a8c7ba2 ("btrfs: fix subvolume/snapshot
deletion not triggered on mount") has long fixed the bug.

[ENHANCEMENT]
For repairing such old fs, btrfs-progs will be enhanced.

Considering how delayed the problem will show up (at run delayed ref
time) and at that time we have to abort transaction already, it is too
late.

Instead here we reject any half-dropped subvolume for reloc tree at the
earliest time, preventing confusion and extra time wasted on debugging
similar bugs.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: error on missing block group when unaccounting log tree extent buffers

Currently we only log an error message if we can't find the block group
for a log tree extent buffer when unaccounting it (while freeing a log
tree). A missing block group means something is seriously wrong and we
end up leaking space from the metadata space info. So return -ENOENT in
case we don't find the block group.

CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix wrong length parameter for btrfs_cleanup_ordered_extents()

Inside nocow_one_range(), if the checksum cloning for data reloc inode
failed, we call btrfs_cleanup_ordered_extents() to cleanup the just
allocated ordered extents.

But unlike extent_clear_unlock_delalloc(),
btrfs_cleanup_ordered_extents() requires a length, not an inclusive end
bytenr.

This can be problematic, as the @end is normally way larger than @len.

This means btrfs_cleanup_ordered_extents() can be called on folios
out of the correct range, and if the out-of-range folio is under
writeback, we can incorrectly clear the ordered flag of the folio, and
trigger the DEBUG_WARN() inside btrfs_writepage_cow_fixup().

Fix the wrong parameter with correct length instead.

Fixes: 94f6c5c17e52 ("btrfs: move ordered extent cleanup to where they are allocated")
CC: stable@vger.kernel.org # 6.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_cleanup_ordered_extents() support large folios

When hitting a large folio, btrfs_cleanup_ordered_extents() will get the
same large folio multiple times, and clearing the same range again and
again.

Thankfully this is not causing anything wrong, just inefficiency.

This is caused by the fact that we're iterating folios using the old
page index, thus can hit the same large folio again and again.

Enhance it by increasing @index to the index of the folio end, and only
increase @index by 1 if we failed to grab a folio.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()

There is a potential deadlock that can happen in
try_release_subpage_extent_buffer() because the irq-safe xarray spin
lock fs_info->buffer_tree is being acquired before the irq-unsafe
eb->refs_lock.

This leads to the potential race:
// T1 (random eb->refs user)                  // T2 (release folio)

spin_lock(&eb->refs_lock);
// interrupt
end_bbio_meta_write()
  btrfs_meta_folio_clear_writeback()
                                              btree_release_folio()
                                                folio_test_writeback() //false
                                                try_release_extent_buffer()
                                                  try_release_subpage_extent_buffer()
                                                    xa_lock_irq(&fs_info->buffer_tree)
                                                    spin_lock(&eb->refs_lock); // blocked; held by T1
  buffer_tree_clear_mark()
    xas_lock_irqsave() // blocked; held by T2

I believe that the spin lock can safely be replaced by an rcu_read_lock.
The xa_for_each loop does not need the spin lock as it's already
internally protected by the rcu_read_lock. The extent buffer is also
protected by the rcu_read_lock so it won't be freed before we take the
eb->refs_lock and check the ref count.

The rcu_read_lock is taken and released every iteration, just like the
spin lock, which means we're not protected against concurrent
insertions into the xarray. This is fine because we rely on
folio->private to detect if there are any ebs remaining in the folio.

There is already some precedent for this with find_extent_buffer_nolock,
which loads an extent buffer from the xarray with only rcu_read_lock.

lockdep warning:

            =====================================================
            WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
            6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E    N
            -----------------------------------------------------
            kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
            ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560

and this task is already holding:
            ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
            which would create a new lock dependency:
             (&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3}

but this new dependency connects a HARDIRQ-irq-safe lock:
             (&buffer_xa_class){-.-.}-{3:3}

... which became HARDIRQ-irq-safe at:
              lock_acquire+0x178/0x358
              _raw_spin_lock_irqsave+0x60/0x88
              buffer_tree_clear_mark+0xc4/0x160
              end_bbio_meta_write+0x238/0x398
              btrfs_bio_end_io+0x1f8/0x330
              btrfs_orig_write_end_io+0x1c4/0x2c0
              bio_endio+0x63c/0x678
              blk_update_request+0x1c4/0xa00
              blk_mq_end_request+0x54/0x88
              virtblk_request_done+0x124/0x1d0
              blk_mq_complete_request+0x84/0xa0
              virtblk_done+0x130/0x238
              vring_interrupt+0x130/0x288
              __handle_irq_event_percpu+0x1e8/0x708
              handle_irq_event+0x98/0x1b0
              handle_fasteoi_irq+0x264/0x7c0
              generic_handle_domain_irq+0xa4/0x108
              gic_handle_irq+0x7c/0x1a0
              do_interrupt_handler+0xe4/0x148
              el1_interrupt+0x30/0x50
              el1h_64_irq_handler+0x14/0x20
              el1h_64_irq+0x6c/0x70
              _raw_spin_unlock_irq+0x38/0x70
              __run_timer_base+0xdc/0x5e0
              run_timer_softirq+0xa0/0x138
              handle_softirqs.llvm.13542289750107964195+0x32c/0xbd0
              ____do_softirq.llvm.17674514681856217165+0x18/0x28
              call_on_irq_stack+0x24/0x30
              __irq_exit_rcu+0x164/0x430
              irq_exit_rcu+0x18/0x88
              el1_interrupt+0x34/0x50
              el1h_64_irq_handler+0x14/0x20
              el1h_64_irq+0x6c/0x70
              arch_local_irq_enable+0x4/0x8
              do_idle+0x1a0/0x3b8
              cpu_startup_entry+0x60/0x80
              rest_init+0x204/0x228
              start_kernel+0x394/0x3f0
              __primary_switched+0x8c/0x8958

to a HARDIRQ-irq-unsafe lock:
             (&eb->refs_lock){+.+.}-{3:3}

... which became HARDIRQ-irq-unsafe at:
            ...
              lock_acquire+0x178/0x358
              _raw_spin_lock+0x4c/0x68
              free_extent_buffer_stale+0x2c/0x170
              btrfs_read_sys_array+0x1b0/0x338
              open_ctree+0xeb0/0x1df8
              btrfs_get_tree+0xb60/0x1110
              vfs_get_tree+0x8c/0x250
              fc_mount+0x20/0x98
              btrfs_get_tree+0x4a4/0x1110
              vfs_get_tree+0x8c/0x250
              do_new_mount+0x1e0/0x6c0
              path_mount+0x4ec/0xa58
              __arm64_sys_mount+0x370/0x490
              invoke_syscall+0x6c/0x208
              el0_svc_common+0x14c/0x1b8
              do_el0_svc+0x4c/0x60
              el0_svc+0x4c/0x160
              el0t_64_sync_handler+0x70/0x100
              el0t_64_sync+0x168/0x170

other info that might help us debug this:
             Possible interrupt unsafe locking scenario:
                   CPU0                    CPU1
                   ----                    ----
              lock(&eb->refs_lock);
                                           local_irq_disable();
                                           lock(&buffer_xa_class);
                                           lock(&eb->refs_lock);
              <Interrupt>
                lock(&buffer_xa_class);

  *** DEADLOCK ***
            2 locks held by kswapd0/66:
             #0: ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50
             #1: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560

Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=Multi%2Dlock%20dependency%20rules%3A
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

pptp: fix pptp_xmit() error path

I accidentally added a bug in pptp_xmit() that syzbot caught for us.

Only call ip_rt_put() if a route has been allocated.

BUG: unable to handle page fault for address: ffffffffffffffdb
PGD df3b067 P4D df3b067 PUD df3d067 PMD 0
Oops: Oops: 0002 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6346 Comm: syz.0.336 Not tainted 6.16.0-next-20250804-syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025
RIP: 0010:arch_atomic_add_return arch/x86/include/asm/atomic.h:85 [inline]
RIP: 0010:raw_atomic_sub_return_release include/linux/atomic/atomic-arch-fallback.h:846 [inline]
RIP: 0010:atomic_sub_return_release include/linux/atomic/atomic-instrumented.h:327 [inline]
RIP: 0010:__rcuref_put include/linux/rcuref.h:109 [inline]
RIP: 0010:rcuref_put+0x172/0x210 include/linux/rcuref.h:173
Call Trace:
<TASK>
dst_release+0x24/0x1b0 net/core/dst.c:167
ip_rt_put include/net/route.h:285 [inline]
pptp_xmit+0x14b/0x1a90 drivers/net/ppp/pptp.c:267
__ppp_channel_push+0xf2/0x1c0 drivers/net/ppp/ppp_generic.c:2166
ppp_channel_push+0x123/0x660 drivers/net/ppp/ppp_generic.c:2198
ppp_write+0x2b0/0x400 drivers/net/ppp/ppp_generic.c:544
vfs_write+0x27b/0xb30 fs/read_write.c:684
ksys_write+0x145/0x250 fs/read_write.c:738
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: de9c4861fb42 ("pptp: ensure minimal skb length in pptp_xmit()")
Reported-by: syzbot+27d7cfbc93457e472e00@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/689095a5.050a0220.1fc43d.0009.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250807142146.2877060-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

lib/sbitmap: make sbitmap_get_shallow() internal

Because it's only used in sbitmap.c

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250807032413.1469456-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lib/sbitmap: convert shallow_depth from one word to the whole sbitmap

Currently elevators will record internal 'async_depth' to throttle
asynchronous requests, and they both calculate shallow_dpeth based on
sb->shift, with the respect that sb->shift is the available tags in one
word.

However, sb->shift is not the availbale tags in the last word, see
__map_depth:

if (index == sb->map_nr - 1)
return sb->depth - (index << sb->shift);

For consequence, if the last word is used, more tags can be get than
expected, for example, assume nr_requests=256 and there are four words,
in the worst case if user set nr_requests=32, then the first word is
the last word, and still use bits per word, which is 64, to calculate
async_depth is wrong.

One the ohter hand, due to cgroup qos, bfq can allow only one request
to be allocated, and set shallow_dpeth=1 will still allow the number
of words request to be allocated.

Fix this problems by using shallow_depth to the whole sbitmap instead
of per word, also change kyber, mq-deadline and bfq to follow this,
a new helper __map_depth_with_shallow() is introduced to calculate
available bits in each word.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvmet: exit debugfs after discovery subsystem exits

Commit 528589947c180 ("nvmet: initialize discovery subsys after debugfs
is initialized") changed nvmet_init() to initialize nvme discovery after
"nvmet" debugfs directory is initialized. The change broke nvmet_exit()
because discovery subsystem now depends on debugfs. Debugfs should be
destroyed after discovery subsystem. Fix nvmet_exit() to do that.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs96AfFQpyDKF_MdfJsnOEo=2V7dQgqjFv+k3t7H-=yGhA@mail.gmail.com/
Fixes: 528589947c180 ("nvmet: initialize discovery subsys after debugfs is initialized")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lore.kernel.org/r/20250807053507.2794335-1-mkhalfella@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

netfilter: nft_socket: remove WARN_ON_ONCE with huge level value

syzbot managed to reach this WARN_ON_ONCE by passing a huge level
value, remove it.

WARNING: CPU: 0 PID: 5853 at net/netfilter/nft_socket.c:220 nft_socket_init+0x2f4/0x3d0 net/netfilter/nft_socket.c:220

Reported-by: syzbot+a225fea35d7baf8dbdc3@syzkaller.appspotmail.com
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: clean up returns in nf_conntrack_log_invalid_sysctl()

Smatch complains that these look like error paths with missing error
codes, especially the one where we return if nf_log_is_registered() is
true:

    net/netfilter/nf_conntrack_standalone.c:575 nf_conntrack_log_invalid_sysctl()
    warn: missing error code? 'ret'

In fact, all these return zero deliberately.  Change them to return a
literal instead which helps readability as well as silencing the warning.

Fixes: e89a68046687 ("netfilter: load nf_log_syslog on enabling nf_conntrack_log_invalid")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_set_pipapo: don't return bogus extension pointer

Dan Carpenter says:
Commit 17a20e09f086 ("netfilter: nft_set: remove one argument from
lookup and update functions") [..] leads to the following Smatch
static checker warning:

net/netfilter/nft_set_pipapo_avx2.c:1269 nft_pipapo_avx2_lookup()
error: uninitialized symbol 'ext'.

Fix this by initing ext to NULL and set it only once we've found
a match.

Fixes: 17a20e09f086 ("netfilter: nft_set: remove one argument from lookup and update functions")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/netfilter-devel/aJBzc3V5wk-yPOnH@stanley.mountain/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: remove refcounting in expectation dumpers

Same pattern as previous patch: do not keep the expectation object
alive via refcount, only store a cookie value and then use that
as the skip hint for dump resumption.

AFAICS this has the same issue as the one resolved in the conntrack
dumper, when we do
if (!refcount_inc_not_zero(&exp->use))

to increment the refcount, there is a chance that exp == last, which
causes a double-increment of the refcount and subsequent memory leak.

Fixes: cf6994c2b981 ("[NETFILTER]: nf_conntrack_netlink: sync expectation dumping with conntrack table dumping")
Fixes: e844a928431f ("netfilter: ctnetlink: allow to dump expectation per master conntrack")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: fix refcount leak on table dump

There is a reference count leak in ctnetlink_dump_table():
      if (res < 0) {
                nf_conntrack_get(&ct->ct_general); // HERE
                cb->args[1] = (unsigned long)ct;
                ...

While its very unlikely, its possible that ct == last.
If this happens, then the refcount of ct was already incremented.
This 2nd increment is never undone.

This prevents the conntrack object from being released, which in turn
keeps prevents cnet->count from dropping back to 0.

This will then block the netns dismantle (or conntrack rmmod) as
nf_conntrack_cleanup_net_list() will wait forever.

This can be reproduced by running conntrack_resize.sh selftest in a loop.
It takes ~20 minutes for me on a preemptible kernel on average before
I see a runaway kworker spinning in nf_conntrack_cleanup_net_list.

One fix would to change this to:
        if (res < 0) {
if (ct != last)
                nf_conntrack_get(&ct->ct_general);

But this reference counting isn't needed in the first place.
We can just store a cookie value instead.

A followup patch will do the same for ctnetlink_exp_dump_table,
it looks to me as if this has the same problem and like
ctnetlink_dump_table, we only need a 'skip hint', not the actual
object so we can apply the same cookie strategy there as well.

Fixes: d205dc40798d ("[NETFILTER]: ctnetlink: fix deadlock in table dumping")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add back NETFILTER_XTABLES dependencies

Some Kconfig symbols were changed to depend on the 'bool' symbol
NETFILTER_XTABLES_LEGACY, which means they can now be set to built-in
when the xtables code itself is in a loadable module:

x86_64-linux-ld: vmlinux.o: in function `arpt_unregister_table_pre_exit':
(.text+0x1831987): undefined reference to `xt_find_table'
x86_64-linux-ld: vmlinux.o: in function `get_info.constprop.0':
arp_tables.c:(.text+0x1831aab): undefined reference to `xt_request_find_table_lock'
x86_64-linux-ld: arp_tables.c:(.text+0x1831bea): undefined reference to `xt_table_unlock'
x86_64-linux-ld: vmlinux.o: in function `do_arpt_get_ctl':
arp_tables.c:(.text+0x183205d): undefined reference to `xt_find_table_lock'
x86_64-linux-ld: arp_tables.c:(.text+0x18320c1): undefined reference to `xt_table_unlock'
x86_64-linux-ld: arp_tables.c:(.text+0x183219a): undefined reference to `xt_recseq'

Change these to depend on both NETFILTER_XTABLES and
NETFILTER_XTABLES_LEGACY.

Fixes: 9fce66583f06 ("netfilter: Exclude LEGACY TABLES on PREEMPT_RT.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Florian Westphal <fw@strlen.de>
Tested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

MAINTAINERS: resurrect my netfilter maintainer entry

This reverts commit b5048d27872a9734d142540ea23c3e897e47e05c.
Its been more than a year, hope my motivation lasts a bit longer than
last time :-)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

treewide: rename GPIO set callbacks back to their original names

The conversion of all GPIO drivers to using the .set_rv() and
.set_multiple_rv() callbacks from struct gpio_chip (which - unlike their
predecessors - return an integer and allow the controller drivers to
indicate failures to users) is now complete and the legacy ones have
been removed. Rename the new callbacks back to their original names in
one sweeping change.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>

gpio: remove legacy GPIO line value setter callbacks

With no more users of the legacy GPIO line value setters - .set() and
.set_multiple() - we can now remove them from the kernel.

Link: https://lore.kernel.org/r/20250725074651.14002-1-brgl@bgdev.pl
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>

Merge commit '6e64f4580381e32c06ee146ca807c555b8f73e24' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into gpio/for-next

ALSA: hda/cirrus: Restrict prompt only for CONFIG_EXPERT

The split of Cirrus HD-audio codec driver may confuse users when
migrating from the previous kernel configs and leave the needed
drivers disabled. Although we've already set y as default, it's still
safer to paper over the wrong choices.

This patch marks the prompt of split CS420x and CS421x codec drivers
with CONFIG_EXPERT, so that they are all enabled when the top-level
CONFIG_SND_HDA_CODEC_CIRRUS is set.

For users who really care about the minimalistic configuration, they
can turn each driver on/off individually after setting
CONFIG_EXPERT=y.

This patch adds the missing help text to the top-level
CONFIG_SND_HDA_CIRRUS_CODEC together with the explanation of
individual choices, and corrects the help texts that don't fit well
nowadays, too.

Fixes: 1cb8744a36c7 ("ALSA: hda/cirrus: Split to cs420x and cs421x drivers")
Link: https://lore.kernel.org/10172c80-daec-4e20-ab57-a483cf1afc02@molgen.mpg.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20250806192541.21949-4-tiwai@suse.de

ALSA: hda/hdmi: Restrict prompt only for CONFIG_EXPERT

The split of HDMI codec driver may confuse users when migrating from
the previous kernel configs and leave some drivers disabled
unexpectedly. Although we've already set y to all HDMI codec drivers
as default, it's still safer to paper over the wrong choices.

This patch marks the prompt of each HDMI codec driver with
CONFIG_EXPERT, so that they are all enabled when the top-level
CONFIG_SND_HDA_CODEC_HDMI is set.

For users who really care about the minimalistic configuration, they
can turn each driver on/off individually after setting
CONFIG_EXPERT=y.

The patch also adds the missing help text to the top-level
CONFIG_SND_HDA_CODEC_HDMI together with the explanation of individual
choices, too.

Fixes: 73cd0490819d ("ALSA: hda/hdmi: Split vendor codec drivers")
Link: https://lore.kernel.org/10172c80-daec-4e20-ab57-a483cf1afc02@molgen.mpg.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20250806192541.21949-3-tiwai@suse.de

ALSA: hda/realtek: Restrict prompt only for CONFIG_EXPERT

The split of Realtek HD-audio codec driver may cause confusions
especially when migrating from the previous kernel configurations
because it's hard to know which driver to be enabled.
Although we've already set default=y for those codec drivers, it may
still make people changing the stuff unnecessarily without knowing its
side effect.

This patch is for avoiding such pitfalls by marking the prompt of each
Realtek codec driver with CONFIG_EXPERT. For "normal" users (that is,
unless CONFIG_EXPERT is set), all Realtek HD-audio codecs are enabled
together with CONFIG_SND_HDA_CODEC_REALTEK; this is the very same
situation like the previous kernels, after all.

For users who really care about the minimalistic configuration, they
can turn each driver on/off individually after setting
CONFIG_EXPERT=y.

The patch also adds the missing help text to the top-level
CONFIG_SND_HDA_CODEC_REALTEK together with the explanation of
individual choices, too.

Fixes: aeeb85f26c3b ("ALSA: hda: Split Realtek HD-audio codec driver")
Link: https://lore.kernel.org/10172c80-daec-4e20-ab57-a483cf1afc02@molgen.mpg.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20250806192541.21949-2-tiwai@suse.de

udp: also consider secpath when evaluating ipsec use for checksumming

Commit b40c5f4fde22 ("udp: disable inner UDP checksum offloads in
IPsec case") tried to fix checksumming in UFO when the packets are
going through IPsec, so that we can't rely on offloads because the UDP
header and payload will be encrypted.

But when doing a TCP test over VXLAN going through IPsec transport
mode with GSO enabled (esp4_offload module loaded), I'm seeing broken
UDP checksums on the encap after successful decryption.

The skbs get to udp4_ufo_fragment/__skb_udp_tunnel_segment via
__dev_queue_xmit -> validate_xmit_skb -> skb_gso_segment and at this
point we've already dropped the dst (unless the device sets
IFF_XMIT_DST_RELEASE, which is not common), so need_ipsec is false and
we proceed with checksum offload.

Make need_ipsec also check the secpath, which is not dropped on this
callpath.

Fixes: b40c5f4fde22 ("udp: disable inner UDP checksum offloads in IPsec case")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm: bring back device check in validate_xmit_xfrm

This is partial revert of commit d53dda291bbd993a29b84d358d282076e3d01506.

This change causes traffic using GSO with SW crypto running through a
NIC capable of HW offload to no longer get segmented during
validate_xmit_xfrm, and is unrelated to the bonding use case mentioned
in the commit.

Fixes: d53dda291bbd ("xfrm: Remove unneeded device check from validate_xmit_xfrm")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm: restore GSO for SW crypto

Commit 49431af6c4ef incorrectly assumes that the GSO path is only used
by HW offload, but it's also useful for SW crypto.

This patch re-enables GSO for SW crypto. It's not an exact revert to
preserve the other changes made to xfrm_dev_offload_ok afterwards, but
it reverts all of its effects.

Fixes: 49431af6c4ef ("xfrm: rely on XFRM offload")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Merge tag 'input-for-v6.17-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

- updates to several drivers consuming GPIO APIs to use setters
   returning error codes

- an infrastructure allowing to define "overlays" for touchscreens
   carving out regions implementing buttons and other elements from a
   bigger sensors and a corresponding update to st1232 driver

- an update to AT/PS2 keyboard driver to map F13-F24 by default

- Samsung keypad driver got a facelift

- evdev input handler will now bind to all devices using EV_SYN event
   instead of abusing id->driver_info

- two new sub-drivers implementing 1A (capacitive buttons) and 21
   (forcepad button) functions in Synaptics RMI driver

- support for polling mode in Goodix touchscreen driver

- support for support for FocalTech FT8716 in edt-ft5x06 driver

- support for MT6359 in mtk-pmic-keys driver

- removal of pcf50633-input driver since platform it was used on is
   gone

- new definitions for game controller "grip" buttons (BTN_GRIP*) and
   corresponding changes to xpad and hid-steam controller drivers

- a new definition for "performance" key

* tag 'input-for-v6.17-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (38 commits)
  HID: hid-steam: Use new BTN_GRIP* buttons
  Input: add keycode for performance mode key
  Input: max77693 - convert to atomic pwm operation
  Input: st1232 - add touch-overlay handling
  dt-bindings: input: touchscreen: st1232: add touch-overlay example
  Input: touch-overlay - add touchscreen overlay handling
  dt-bindings: touchscreen: add touch-overlay property
  Input: atkbd - correctly map F13 - F24
  Input: xpad - use new BTN_GRIP* buttons
  Input: Add and document BTN_GRIP*
  Input: xpad - change buttons the D-Pad gets mapped as to BTN_DPAD_*
  Documentation: Fix capitalization of XBox -> Xbox
  Input: synaptics-rmi4 - add support for F1A
  dt-bindings: input: syna,rmi4: Document F1A function
  Input: synaptics-rmi4 - add support for Forcepads (F21)
  Input: mtk-pmic-keys - add support for MT6359 PMIC keys
  Input: remove special handling of id->driver_info when matching
  Input: evdev - switch matching to EV_SYN
  Input: samsung-keypad - use BIT() and GENMASK() where appropriate
  Input: samsung-keypad - use per-chip parameters
  ...

Merge tag 'for-linus-6.17-1' of https://github.com/cminyard/linux-ipmi

Pull ipmi updates from Corey Minyard:
"Some small fixes for the IPMI driver

  Nothing huge, some rate limiting on logs, a strncpy fix where the
  source and destination could be the same, and removal of some unused
  cruft"

* tag 'for-linus-6.17-1' of https://github.com/cminyard/linux-ipmi:
  ipmi: Use dev_warn_ratelimited() for incorrect message warnings
  char: ipmi: remove redundant variable 'type' and check
  ipmi: Fix strcpy source and destination the same

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fix from Jason Gunthorpe:
"Single fix to correct the iov_iter construction in soft iwarp. This
avoids blktest crashes with recent changes to the allocators"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/siw: Fix the sendmsg byte count in siw_tcp_sendpages

Merge tag 'vfio-v6.17-rc1-v2' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

- Fix imbalance where the no-iommu/cdev device path skips too much on
   open, failing to increment a reference, but still decrements the
   reference on close. Add bounds checking to prevent such underflows
   (Jacob Pan)

- Fill missing detach_ioas op for pds_vfio_pci, fixing probe failure
   when used with IOMMUFD (Brett Creeley)

- Split SR-IOV VFs to separate dev_set, avoiding unnecessary
   serialization between VFs that appear on the same bus (Alex
   Williamson)

- Fix a theoretical integer overflow is the mlx5-vfio-pci variant
   driver (Artem Sadovnikov)

- Implement missing VF token checking support via vfio cdev/IOMMUFD
   interface (Jason Gunthorpe)

- Update QAT vfio-pci variant driver to claim latest VF devices
   (Małgorzata Mielnik)

- Add a cond_resched() call to avoid holding the CPU too long during
   DMA mapping operations (Keith Busch)

* tag 'vfio-v6.17-rc1-v2' of https://github.com/awilliam/linux-vfio:
  vfio/type1: conditional rescheduling while pinning
  vfio/qat: add support for intel QAT 6xxx virtual functions
  vfio/qat: Remove myself from VFIO QAT PCI driver maintainers
  vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD
  vfio/mlx5: fix possible overflow in tracking max message size
  vfio/pci: Separate SR-IOV VF dev_set
  vfio/pds: Fix missing detach_ioas op
  vfio: Prevent open_count decrement to negative
  vfio: Fix unbalanced vfio_df_close call in no-iommu mode

drm/amdgpu: add missing vram lost check for LEGACY RESET

Legacy resets reset the memory controllers so VRAM contents
may be unreliable after reset.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit aae94897b6661a2a4b1de2d328090fc388b3e0af)
Cc: stable@vger.kernel.org

drm/amdgpu/discovery: fix fw based ip discovery

We only need the fw based discovery table for sysfs. No
need to parse it. Additionally parsing some of the board
specific tables may result in incorrect data on some boards.
just load the binary and don't parse it on those boards.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441
Fixes: 80a0e8282933 ("drm/amdgpu/discovery: optionally use fw based ip discovery")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 62eedd150fa11aefc2d377fc746633fdb1baeb55)
Cc: stable@vger.kernel.org

drm/amdkfd: Destroy KFD debugfs after destroy KFD wq

Since KFD proc content was moved to kernel debugfs, we can't destroy KFD
debugfs before kfd_process_destroy_wq. Move kfd_process_destroy_wq prior
to kfd_debugfs_fini to fix a kernel NULL pointer problem. It happens
when /sys/kernel/debug/kfd was already destroyed in kfd_debugfs_fini but
kfd_process_destroy_wq calls kfd_debugfs_remove_process. This line
debugfs_remove_recursive(entry->proc_dentry);
tries to remove /sys/kernel/debug/kfd/proc/<pid> while
/sys/kernel/debug/kfd is already gone. It hangs the kernel by kernel
NULL pointer.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0333052d90683d88531558dcfdbf2525cc37c233)
Cc: stable@vger.kernel.org

amdgpu/amdgpu_discovery: increase timeout limit for IFWI init

With a timeout of only 1 second, my rx 5700XT fails to initialize,
so this increases the timeout to 2s.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3697
Signed-off-by: Xaver Hugl <xaver.hugl@kde.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9ed3d7bdf2dcdf1a1196630fab89a124526e9cc2)
Cc: stable@vger.kernel.org

cpu: Remove obsolete comment from takedown_cpu()

takedown_cpu() has a comment about "all preempt/rcu users must observe
!cpu_active()" which is kind of meaningless in this function. This
comment was originally introduced by commit 6acce3ef8452 ("sched: Remove
get_online_cpus() usage") when _cpu_down() was setting cpu_active_mask
and synchronize_rcu()/synchronize_sched() were added after that.

Later commit 40190a78f85f ("sched/hotplug: Convert cpu_[in]active
notifiers to state machine") added a new CPUHP_AP_ACTIVE hotplug
state to set/clear cpu_active_mask. The following commit b2454caa8977
("sched/hotplug: Move sync_rcu to be with set_cpu_active(false)")
move the synchronize_*() calls to sched_cpu_deactivate() associated
with the new hotplug state, but left the comment behind.

Remove this comment as it is no longer relevant in takedown_cpu().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250729191232.664931-1-longman@redhat.com

smb: client: make use of smbdirect_socket.{send,recv}_io.mem.{cache,pool}

This will allow common helper functions to be created later.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: add smbdirect_socket.{send,recv}_io.mem.{cache,pool}

This will be the common location memory caches and pools.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make use of struct smbdirect_send_io

The server will also use this soon, so that we can
split out common helper functions in future.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce struct smbdirect_send_io

This will be used in client and server soon
in order to replace smbd_request/smb_direct_sendmsg.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make use of SMBDIRECT_RECV_IO_MAX_SGE

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: add SMBDIRECT_RECV_IO_MAX_SGE

This will allow the client and server specific defines to be
replaced.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: remove unused enum smbd_connection_status

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make use of smbdirect_socket.recv_io.reassembly.*

This will be used by the server too and will allow us to
create common helper functions.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce smbdirect_socket.recv_io.reassembly.*

This will be used in common between client and server soon.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make use of smb: smbdirect_socket.recv_io.free.{list,lock}

This will be used by the server too in order to have common
helper functions in future.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce smbdirect_socket.recv_io.free.{list,lock}

This will allow the list of free smbdirect_recv_io messages including
the spinlock to be in common between client and server in order
to split out common helper functions in future.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make use of struct smbdirect_recv_io

This is the shared structure that will be used in
the server too and will allow us to move helper functions
into common code soon.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce struct smbdirect_recv_io

This will be used in client and server soon
in order to replace smbd_response/smb_direct_recvmsg.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make use of smbdirect_socket->recv_io.expected

The expected incoming message type can be per connection.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce smbdirect_socket.recv_io.expected

The expected message type can be global as they never change
during the after negotiation process.

This will replace smbd_response->type and smb_direct_recvmsg->type
in future.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: remove unused smbd_connection->fragment_reassembly_remaining

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: let recv_done() avoid touching data_transfer after cleanup/move

Calling enqueue_reassembly() and wake_up_interruptible(&info->wait_reassembly_queue)
or put_receive_buffer() means the response/data_transfer pointer might
get re-used by another thread, which means these should be
the last operations before calling return.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: let recv_done() cleanup before notifying the callers.

We should call put_receive_buffer() before waking up the callers.

For the internal error case of response->type being unexpected,
we now also call smbd_disconnect_rdma_connection() instead
of not waking up the callers at all.

Note that the SMBD_TRANSFER_DATA case still has problems,
which will be addressed in the next commit in order to make
it easier to review this one.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: make sure we call ib_dma_unmap_single() only if we called ib_dma_map_single already

In case of failures either ib_dma_map_single() might not be called yet
or ib_dma_unmap_single() was already called.

We should make sure put_receive_buffer() only calls
ib_dma_unmap_single() if needed.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: remove separate empty_packet_queue

There's no need to maintain two lists, we can just
have a single list of receive buffers, which are free to use.

It just added unneeded complexity and resulted in
ib_dma_unmap_single() not being called from recv_done()
for empty keepalive packets.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: let send_done() cleanup before calling smbd_disconnect_rdma_connection()

We should call ib_dma_unmap_single() and mempool_free() before calling
smbd_disconnect_rdma_connection().

And smbd_disconnect_rdma_connection() needs to be the last function to
call as all other state might already be gone after it returns.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: Fix null-ptr-deref by static initializing global lock

A kernel panic can be triggered by reading /proc/fs/cifs/debug_dirs.
The crash is a null-ptr-deref inside spin_lock(), caused by the use of the
uninitialized global spinlock cifs_tcp_ses_lock.

init_cifs()
└── cifs_proc_init()
      └── // User can access /proc/fs/cifs/debug_dirs here
           └── cifs_debug_dirs_proc_show()
                └── spin_lock(&cifs_tcp_ses_lock); // Uninitialized!

KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[dfff800000000000] address between user and kernel address ranges
Internal error: Oops: 0000000096000005 [#1] SMP
Modules linked in:
CPU: 3 UID: 0 PID: 16435 Comm: stress-ng-procf Not tainted 6.16.0-10385-g79f14b5d84c6 #37 PREEMPT
Hardware name: QEMU KVM Virtual Machine, BIOS 2025.02-8ubuntu1 06/11/2025
pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : do_raw_spin_lock+0x84/0x2cc
lr : _raw_spin_lock+0x24/0x34
sp : ffff8000966477e0
x29: ffff800096647860 x28: ffff800096647b88 x27: ffff0001c0c22070
x26: ffff0003eb2b60c8 x25: ffff0001c0c22018 x24: dfff800000000000
x23: ffff0000f624e000 x22: ffff0003eb2b6020 x21: ffff0000f624e768
x20: 0000000000000004 x19: 0000000000000000 x18: 0000000000000000
x17: 0000000000000000 x16: ffff8000804b9600 x15: ffff700012cc8f04
x14: 1ffff00012cc8f04 x13: 0000000000000004 x12: ffffffffffffffff
x11: 1ffff00012cc8f00 x10: ffff80008d9af0d2 x9 : f3f3f304f1f1f1f1
x8 : 0000000000000000 x7 : 7365733c203e6469 x6 : 20656572743c2023
x5 : ffff0000e0ce0044 x4 : ffff80008a4deb6e x3 : ffff8000804b9718
x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
do_raw_spin_lock+0x84/0x2cc (P)
_raw_spin_lock+0x24/0x34
cifs_debug_dirs_proc_show+0x1ac/0x4c0
seq_read_iter+0x3b0/0xc28
proc_reg_read_iter+0x178/0x2a8
vfs_read+0x5f8/0x88c
ksys_read+0x120/0x210
__arm64_sys_read+0x7c/0x90
invoke_syscall+0x98/0x2b8
el0_svc_common+0x130/0x23c
do_el0_svc+0x48/0x58
el0_svc+0x40/0x140
el0t_64_sync_handler+0x84/0x12c
el0t_64_sync+0x1ac/0x1b0
Code: aa0003f3 f9000feb f2fe7e69 f8386969 (38f86908)
---[ end trace 0000000000000000 ]---

The root cause is an initialization order problem. The lock is declared
as a global variable and intended to be initialized during module startup.
However, the procfs entry that uses this lock can be accessed by userspace
before the spin_lock_init() call has run. This creates a race window where
reading the proc file will attempt to use the lock before it is
initialized, leading to the crash.

For a global lock with a static lifetime, the correct and robust approach
is to use compile-time initialization.

Fixes: 844e5c0eb176 ("smb3 client: add way to show directory leases for improved debugging")
Signed-off-by: Yunseong Kim <ysk@kzalloc.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: let recv_done() avoid touching data_transfer after cleanup/move

Calling enqueue_reassembly() and wake_up_interruptible(&t->wait_reassembly_queue)
or put_receive_buffer() means the recvmsg/data_transfer pointer might
get re-used by another thread, which means these should be
the last operations before calling return.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: let recv_done() consistently call put_recvmsg/smb_direct_disconnect_rdma_connection

We should call put_recvmsg() before smb_direct_disconnect_rdma_connection()
in order to call it before waking up the callers.

In all error cases we should call smb_direct_disconnect_rdma_connection()
in order to avoid stale connections.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make sure we call ib_dma_unmap_single() only if we called ib_dma_map_single already

In case of failures either ib_dma_map_single() might not be called yet
or ib_dma_unmap_single() was already called.

We should make sure put_recvmsg() only calls ib_dma_unmap_single() if needed.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>