Jens Axboe [Thu, 18 Dec 2025 22:21:28 +0000 (15:21 -0700)]
af_unix: don't post cmsg for SO_INQ unless explicitly asked for
A previous commit added SO_INQ support for AF_UNIX (SOCK_STREAM), but it
posts a SCM_INQ cmsg even if just msg->msg_get_inq is set. This is
incorrect, as ->msg_get_inq is just the caller asking for the remainder
to be passed back in msg->msg_inq, it has nothing to do with cmsg. The
original commit states that this is done to make sockets
io_uring-friendly", but it's actually incorrect as io_uring doesn't use
cmsg headers internally at all, and it's actively wrong as this means
that cmsg's are always posted if someone does recvmsg via io_uring.
Fix that up by only posting a cmsg if u->recvmsg_inq is set.
Additionally, mirror how TCP handles inquiry handling in that it should
only be done for a successful return. This makes the logic for the two
identical.
Dipayaan Roy [Thu, 18 Dec 2025 13:10:54 +0000 (05:10 -0800)]
net: mana: Fix use-after-free in reset service rescan path
When mana_serv_reset() encounters -ETIMEDOUT or -EPROTO from
mana_gd_resume(), it performs a PCI rescan via mana_serv_rescan().
mana_serv_rescan() calls pci_stop_and_remove_bus_device(), which can
invoke the driver's remove path and free the gdma_context associated
with the device. After returning, mana_serv_reset() currently jumps to
the out label and attempts to clear gc->in_service, dereferencing a
freed gdma_context.
The issue was observed with the following call logs:
[ 698.942636] BUG: unable to handle page fault for address: ff6c2b638088508d
[ 698.943121] #PF: supervisor write access in kernel mode
[ 698.943423] #PF: error_code(0x0002) - not-present page
[S[ 698.943793] Pat Dec 6 07:GD5 100000067 P4D 1002f7067 PUD 1002f8067 PMD 101bef067 PTE 0
0:56 2025] hv_[n e 698.944283] Oops: Oops: 0002 [#1] SMP NOPTI
tvsc f8615163-00[ 698.944611] CPU: 28 UID: 0 PID: 249 Comm: kworker/28:1
...
[Sat Dec 6 07:50:56 2025] R10: [ 699.121594] mana 7870:00:00.0 enP30832s1: Configured vPort 0 PD 18 DB 16 000000000000001b R11: 0000000000000000 R12: ff44cf3f40270000
[Sat Dec 6 07:50:56 2025] R13: 0000000000000001 R14: ff44cf3f402700c8 R15: ff44cf3f4021b405
[Sat Dec 6 07:50:56 2025] FS: 0000000000000000(0000) GS:ff44cf7e9fcf9000(0000) knlGS:0000000000000000
[Sat Dec 6 07:50:56 2025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Dec 6 07:50:56 2025] CR2: ff6c2b638088508d CR3: 000000011fe43001 CR4: 0000000000b73ef0
[Sat Dec 6 07:50:56 2025] Call Trace:
[Sat Dec 6 07:50:56 2025] <TASK>
[Sat Dec 6 07:50:56 2025] mana_serv_func+0x24/0x50 [mana]
[Sat Dec 6 07:50:56 2025] process_one_work+0x190/0x350
[Sat Dec 6 07:50:56 2025] worker_thread+0x2b7/0x3d0
[Sat Dec 6 07:50:56 2025] kthread+0xf3/0x200
[Sat Dec 6 07:50:56 2025] ? __pfx_worker_thread+0x10/0x10
[Sat Dec 6 07:50:56 2025] ? __pfx_kthread+0x10/0x10
[Sat Dec 6 07:50:56 2025] ret_from_fork+0x21a/0x250
[Sat Dec 6 07:50:56 2025] ? __pfx_kthread+0x10/0x10
[Sat Dec 6 07:50:56 2025] ret_from_fork_asm+0x1a/0x30
[Sat Dec 6 07:50:56 2025] </TASK>
Fix this by returning immediately after mana_serv_rescan() to avoid
accessing GC state that may no longer be valid.
net: nfc: fix deadlock between nfc_unregister_device and rfkill_fop_write
A deadlock can occur between nfc_unregister_device() and rfkill_fop_write()
due to lock ordering inversion between device_lock and rfkill_global_mutex.
The problematic lock order is:
Thread A (rfkill_fop_write):
rfkill_fop_write()
mutex_lock(&rfkill_global_mutex)
rfkill_set_block()
nfc_rfkill_set_block()
nfc_dev_down()
device_lock(&dev->dev) <- waits for device_lock
Thread B (nfc_unregister_device):
nfc_unregister_device()
device_lock(&dev->dev)
rfkill_unregister()
mutex_lock(&rfkill_global_mutex) <- waits for rfkill_global_mutex
This creates a classic ABBA deadlock scenario.
Fix this by moving rfkill_unregister() and rfkill_destroy() outside the
device_lock critical section. Store the rfkill pointer in a local variable
before releasing the lock, then call rfkill_unregister() after releasing
device_lock.
This change is safe because rfkill_fop_write() holds rfkill_global_mutex
while calling the rfkill callbacks, and rfkill_unregister() also acquires
rfkill_global_mutex before cleanup. Therefore, rfkill_unregister() will
wait for any ongoing callback to complete before proceeding, and
device_del() is only called after rfkill_unregister() returns, preventing
any use-after-free.
The similar lock ordering in nfc_register_device() (device_lock ->
rfkill_global_mutex via rfkill_register) is safe because during
registration the device is not yet in rfkill_list, so no concurrent
rfkill operations can occur on this device.
The ASIX driver reads the PHY address from the USB device via
asix_read_phy_addr(). A malicious or faulty device can return an
invalid address (>= PHY_MAX_ADDR), which causes a warning in
mdiobus_get_phy():
addr 207 out of range
WARNING: drivers/net/phy/mdio_bus.c:76
Validate the PHY address in asix_read_phy_addr() and remove the
now-redundant check in ax88172a.c.
Reported-by: syzbot+3d43c9066a5b54902232@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3d43c9066a5b54902232 Tested-by: syzbot+3d43c9066a5b54902232@syzkaller.appspotmail.com Fixes: 7e88b11a862a ("net: usb: asix: refactor asix_read_phy_addr() and handle errors on return") Link: https://lore.kernel.org/all/20251217085057.270704-1-kartikey406@gmail.com/T/ Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251218011156.276824-1-kartikey406@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For i40e:
Przemyslaw immediately schedules service task following changes to
filters to ensure timely setup for PTP.
Gregory Herrero adjusts VF descriptor size checks to be device specific.
For iavf:
Kohei Enju corrects a couple of condition checks which caused off-by-one
issues.
For idpf:
Larysa fixes LAN memory region call to follow expected requirements.
Brian Vazquez reduces mailbox wait time during init to avoid lengthy
delays.
For e1000:
Guangshuo Li adds validation of data length to prevent out-of-bounds
access.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000: fix OOB in e1000_tbi_should_accept()
idpf: reduce mbx_task schedule delay to 300us
idpf: fix LAN memory regions command on some NVMs
iavf: fix off-by-one issues in iavf_config_rss_reg()
i40e: validate ring_len parameter against hardware-specific values
i40e: fix scheduling in set_rx_mode
====================
This happens because smc_special_trylock() disables IRQs even on PREEMPT_RT,
but smc_special_unlock() does not restore IRQs on PREEMPT_RT.
The reason is that smc_special_unlock() calls spin_unlock_irqrestore(),
and rcu_read_unlock_bh() in __dev_queue_xmit() cannot invoke
rcu_read_unlock() through __local_bh_enable_ip() when current->softirq_disable_cnt becomes zero.
To address this issue, replace smc_special_trylock() with spin_trylock_irqsave().
Fixes: 342a93247e08 ("locking/spinlock: Provide RT variant header: <linux/spinlock_rt.h>") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251217085115.1730036-1-yeoreum.yun@arm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 23 Dec 2025 11:55:39 +0000 (12:55 +0100)]
Merge tag 'for-net-2025-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- MGMT: report BIS capability flags in supported settings
- btusb: revert use of devm_kzalloc in btusb
* tag 'for-net-2025-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btusb: revert use of devm_kzalloc in btusb
Bluetooth: MGMT: report BIS capability flags in supported settings
====================
Arnd Bergmann [Tue, 16 Dec 2025 21:35:42 +0000 (22:35 +0100)]
net: wangxun: move PHYLINK dependency
The LIBWX library code is what calls into phylink, so any user of
it has to select CONFIG_PHYLINK at the moment, with NGBEVF missing this:
x86_64-linux-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_nway_reset':
wx_ethtool.c:(.text+0x613): undefined reference to `phylink_ethtool_nway_reset'
x86_64-linux-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_get_link_ksettings':
wx_ethtool.c:(.text+0x62b): undefined reference to `phylink_ethtool_ksettings_get'
x86_64-linux-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_set_link_ksettings':
wx_ethtool.c:(.text+0x643): undefined reference to `phylink_ethtool_ksettings_set'
x86_64-linux-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_get_pauseparam':
wx_ethtool.c:(.text+0x65b): undefined reference to `phylink_ethtool_get_pauseparam'
x86_64-linux-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_set_pauseparam':
wx_ethtool.c:(.text+0x677): undefined reference to `phylink_ethtool_set_pauseparam'
Add the 'select PHYLINK' line in the libwx option directly so this will
always be enabled for all current and future wangxun drivers, and remove
the now duplicate lines.
selftests: net: fix "buffer overflow detected" for tap.c
When the selftest 'tap.c' is compiled with '-D_FORTIFY_SOURCE=3',
the strcpy() in rtattr_add_strsz() is replaced with a checked
version which causes the test to consistently fail when compiled
with toolchains for which this option is enabled by default.
TAP version 13
1..3
# Starting 3 tests from 1 test cases.
# RUN tap.test_packet_valid_udp_gso ...
*** buffer overflow detected ***: terminated
# test_packet_valid_udp_gso: Test terminated by assertion
# FAIL tap.test_packet_valid_udp_gso
not ok 1 tap.test_packet_valid_udp_gso
# RUN tap.test_packet_valid_udp_csum ...
*** buffer overflow detected ***: terminated
# test_packet_valid_udp_csum: Test terminated by assertion
# FAIL tap.test_packet_valid_udp_csum
not ok 2 tap.test_packet_valid_udp_csum
# RUN tap.test_packet_crash_tap_invalid_eth_proto ...
*** buffer overflow detected ***: terminated
# test_packet_crash_tap_invalid_eth_proto: Test terminated by assertion
# FAIL tap.test_packet_crash_tap_invalid_eth_proto
not ok 3 tap.test_packet_crash_tap_invalid_eth_proto
# FAILED: 0 / 3 tests passed.
# Totals: pass:0 fail:3 xfail:0 xpass:0 skip:0 error:0
A buffer overflow is detected by the fortified glibc __strcpy_chk()
since the __builtin_object_size() of `RTA_DATA(rta)` is incorrectly
reported as 1, even though there is ample space in its bounding
buffer `req`.
Additionally, given that IFLA_IFNAME also expects a null-terminated
string, callers of rtaddr_add_str{,sz}() could simply use the
rtaddr_add_strsz() variant. (which has been renamed to remove the
trailing `sz`) memset() has been used for this function since it
is unchecked and thus circumvents the issue discussed in the
previous paragraph.
Fixes: 2e64fe4624d1 ("selftests: add few test cases for tap driver") Signed-off-by: Alice C. Munduruca <alice.munduruca@canonical.com> Reviewed-by: Cengiz Can <cengiz.can@canonical.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251216170641.250494-1-alice.munduruca@canonical.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Deepakkumar Karn [Tue, 16 Dec 2025 15:13:05 +0000 (20:43 +0530)]
net: usb: rtl8150: fix memory leak on usb_submit_urb() failure
In async_set_registers(), when usb_submit_urb() fails, the allocated
async_req structure and URB are not freed, causing a memory leak.
The completion callback async_set_reg_cb() is responsible for freeing
these allocations, but it is only called after the URB is successfully
submitted and completes (successfully or with error). If submission
fails, the callback never runs and the memory is leaked.
Fix this by freeing both the URB and the request structure in the error
path when usb_submit_urb() fails.
====================
selftests: drv-net: psp: fix templated test names in psp.py
The templated test names in psp.py had a bug that was not exposed
until 80970e0fc07e ("selftests: net: py: extract the case generation
logic") changed the order of test case evaluation and test case name
extraction.
The test cases created in psp_ip_ver_test_builder() and
ipver_test_builder() were only assigning formatted names to the test
cases they returned, when the test itself was run. This series moves
the test case naming to the point where the test function is created.
Using netdevsim psp:
Before:
./tools/testing/selftests/drivers/net/psp.py
TAP version 13
1..28
ok 1 psp.test_case
ok 2 psp.test_case
ok 3 psp.test_case
ok 4 psp.test_case
ok 5 psp.test_case
ok 6 psp.test_case
ok 7 psp.test_case
ok 8 psp.test_case
ok 9 psp.test_case
ok 10 psp.test_case
ok 11 psp.dev_list_devices
...
ok 28 psp.removal_device_bi
# Totals: pass:28 fail:0 xfail:0 xpass:0 skip:0 error:0
#
# Responder logs (0):
# STDERR:
# Set PSP enable on device 3 to 0xf
# Set PSP enable on device 3 to 0x0
After:
./tools/testing/selftests/drivers/net/psp.py
TAP version 13
1..28
ok 1 psp.data_basic_send_v0_ip4
ok 2 psp.data_basic_send_v0_ip6
ok 3 psp.data_basic_send_v1_ip4
ok 4 psp.data_basic_send_v1_ip6
ok 5 psp.data_basic_send_v2_ip4
ok 6 psp.data_basic_send_v2_ip6
ok 7 psp.data_basic_send_v3_ip4
ok 8 psp.data_basic_send_v3_ip6
ok 9 psp.data_mss_adjust_ip4
ok 10 psp.data_mss_adjust_ip6
ok 11 psp.dev_list_devices
...
ok 28 psp.removal_device_bi
# Totals: pass:28 fail:0 xfail:0 xpass:0 skip:0 error:0
#
# Responder logs (0):
# STDERR:
# Set PSP enable on device 3 to 0xf
# Set PSP enable on device 3 to 0x0
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
====================
Daniel Zahka [Tue, 16 Dec 2025 14:21:36 +0000 (06:21 -0800)]
selftests: drv-net: psp: fix test names in ipver_test_builder()
test_case will only take on the formatted name after being
called. This does not work with the way ksft_run() currently
works. Assign the name after the test_case is created.
Daniel Zahka [Tue, 16 Dec 2025 14:21:35 +0000 (06:21 -0800)]
selftests: drv-net: psp: fix templated test names in psp_ip_ver_test_builder()
test_case will only take on its formatted name after it is called by
the test runner. Move the assignment to test_case.__name__ to when the
test_case is constructed, not called.
Raju Rangoju [Mon, 15 Dec 2025 15:17:28 +0000 (20:47 +0530)]
amd-xgbe: reset retries and mode on RX adapt failures
During the stress tests, early RX adaptation handshakes can fail, such
as missing the RX_ADAPT ACK or not receiving a coefficient update before
block lock is established. Continuing to retry RX adaptation in this
state is often ineffective if the current mode selection is not viable.
Resetting the RX adaptation retry counter when an RX_ADAPT request fails
to receive ACK or a coefficient update prior to block lock, and clearing
mode_set so the next bring-up performs a fresh mode selection rather
than looping on a likely invalid configuration.
Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://patch.msgid.link/20251215151728.311713-1-Raju.Rangoju@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Vladimir Oltean [Mon, 15 Dec 2025 15:02:36 +0000 (17:02 +0200)]
net: dsa: fix missing put_device() in dsa_tree_find_first_conduit()
of_find_net_device_by_node() searches net devices by their /sys/class/net/,
entry. It is documented in its kernel-doc that:
* If successful, returns a pointer to the net_device with the embedded
* struct device refcount incremented by one, or NULL on failure. The
* refcount must be dropped when done with the net_device.
We are missing a put_device(&conduit->dev) which we could place at the
end of dsa_tree_find_first_conduit(). But to explain why calling
put_device() right away is safe is the same as to explain why the chosen
solution is different.
The code is very poorly split: dsa_tree_find_first_conduit() was first
introduced in commit 95f510d0b792 ("net: dsa: allow the DSA master to be
seen and changed through rtnetlink") but was first used several commits
later, in commit acc43b7bf52a ("net: dsa: allow masters to join a LAG").
Assume there is a switch with 2 CPU ports and 2 conduits, eno2 and eno3.
When we create a LAG (bonding or team device) and place eno2 and eno3
beneath it, we create a 3rd conduit (the LAG device itself), but this is
slightly different than the first two.
Namely, the cpu_dp->conduit pointer of the CPU ports does not change,
and remains pointing towards the physical Ethernet controllers which are
now LAG ports. Only 2 things change:
- the LAG device has a dev->dsa_ptr which marks it as a DSA conduit
- dsa_port_to_conduit(user port) finds the LAG and not the physical
conduit, because of the dp->cpu_port_in_lag bit being set.
When the LAG device is destroyed, dsa_tree_migrate_ports_from_lag_conduit()
is called and this is where dsa_tree_find_first_conduit() kicks in.
This is the logical mistake and the reason why introducing code in one
patch and using it from another is bad practice. I didn't realize that I
don't have to call of_find_net_device_by_node() again; the cpu_dp->conduit
association was never undone, and is still available for direct (re)use.
There's only one concern - maybe the conduit disappeared in the
meantime, but the netdev_hold() call we made during dsa_port_parse_cpu()
(see previous change) ensures that this was not the case.
Therefore, fixing the code means reimplementing it in the simplest way.
I am blaming the time of use, since this is what "git blame" would show
if we were to monitor for the conduit's kobject's refcount remaining
elevated instead of being freed.
Tested on the NXP LS1028A, using the steps from
Documentation/networking/dsa/configuration.rst section "Affinity of user
ports to CPU ports", followed by (extra prints added by me):
$ ip link del bond0
mscc_felix 0000:00:00.5 swp3: Link is Down
bond0 (unregistering): (slave eno2): Releasing backup interface
fsl_enetc 0000:00:00.2 eno2: Link is Down
mscc_felix 0000:00:00.5 swp0: bond0 disappeared, migrating to eno2
mscc_felix 0000:00:00.5 swp1: bond0 disappeared, migrating to eno2
mscc_felix 0000:00:00.5 swp2: bond0 disappeared, migrating to eno2
mscc_felix 0000:00:00.5 swp3: bond0 disappeared, migrating to eno2
Vladimir Oltean [Mon, 15 Dec 2025 15:02:35 +0000 (17:02 +0200)]
net: dsa: properly keep track of conduit reference
Problem description
-------------------
DSA has a mumbo-jumbo of reference handling of the conduit net device
and its kobject which, sadly, is just wrong and doesn't make sense.
There are two distinct problems.
1. The OF path, which uses of_find_net_device_by_node(), never releases
the elevated refcount on the conduit's kobject. Nominally, the OF and
non-OF paths should result in objects having identical reference
counts taken, and it is already suspicious that
dsa_dev_to_net_device() has a put_device() call which is missing in
dsa_port_parse_of(), but we can actually even verify that an issue
exists. With CONFIG_DEBUG_KOBJECT_RELEASE=y, if we run this command
"before" and "after" applying this patch:
(unbind the conduit driver for net device eno2)
echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind
we see these lines in the output diff which appear only with the patch
applied:
2. After we find the conduit interface one way (OF) or another (non-OF),
it can get unregistered at any time, and DSA remains with a long-lived,
but in this case stale, cpu_dp->conduit pointer. Holding the net
device's underlying kobject isn't actually of much help, it just
prevents it from being freed (but we never need that kobject
directly). What helps us to prevent the net device from being
unregistered is the parallel netdev reference mechanism (dev_hold()
and dev_put()).
Actually we actually use that netdev tracker mechanism implicitly on
user ports since commit 2f1e8ea726e9 ("net: dsa: link interfaces with
the DSA master to get rid of lockdep warnings"), via netdev_upper_dev_link().
But time still passes at DSA switch probe time between the initial
of_find_net_device_by_node() code and the user port creation time, time
during which the conduit could unregister itself and DSA wouldn't know
about it.
So we have to run of_find_net_device_by_node() under rtnl_lock() to
prevent that from happening, and release the lock only with the netdev
tracker having acquired the reference.
Do we need to keep the reference until dsa_unregister_switch() /
dsa_switch_shutdown()?
1: Maybe yes. A switch device will still be registered even if all user
ports failed to probe, see commit 86f8b1c01a0a ("net: dsa: Do not
make user port errors fatal"), and the cpu_dp->conduit pointers
remain valid. I haven't audited all call paths to see whether they
will actually use the conduit in lack of any user port, but if they
do, it seems safer to not rely on user ports for that reference.
2. Definitely yes. We support changing the conduit which a user port is
associated to, and we can get into a situation where we've moved all
user ports away from a conduit, thus no longer hold any reference to
it via the net device tracker. But we shouldn't let it go nonetheless
- see the next change in relation to dsa_tree_find_first_conduit()
and LAG conduits which disappear.
We have to be prepared to return to the physical conduit, so the CPU
port must explicitly keep another reference to it. This is also to
say: the user ports and their CPU ports may not always keep a
reference to the same conduit net device, and both are needed.
As for the conduit's kobject for the /sys/class/net/ entry, we don't
care about it, we can release it as soon as we hold the net device
object itself.
History and blame attribution
-----------------------------
The code has been refactored so many times, it is very difficult to
follow and properly attribute a blame, but I'll try to make a short
history which I hope to be correct.
We have two distinct probing paths:
- one for OF, introduced in 2016 in commit 83c0afaec7b7 ("net: dsa: Add
new binding implementation")
- one for non-OF, introduced in 2017 in commit 71e0bbde0d88 ("net: dsa:
Add support for platform data")
These are both complete rewrites of the original probing paths (which
used struct dsa_switch_driver and other weird stuff, instead of regular
devices on their respective buses for register access, like MDIO, SPI,
I2C etc):
- one for OF, introduced in 2013 in commit 5e95329b701c ("dsa: add
device tree bindings to register DSA switches")
- one for non-OF, introduced in 2008 in commit 91da11f870f0 ("net:
Distributed Switch Architecture protocol support")
except for tiny bits and pieces like dsa_dev_to_net_device() which were
seemingly carried over since the original commit, and used to this day.
The point is that the original probing paths received a fix in 2015 in
the form of commit 679fb46c5785 ("net: dsa: Add missing master netdev
dev_put() calls"), but the fix never made it into the "new" (dsa2)
probing paths that can still be traced to today, and the fixed probing
path was later deleted in 2019 in commit 93e86b3bc842 ("net: dsa: Remove
legacy probing support").
That is to say, the new probing paths were never quite correct in this
area.
The existence of the legacy probing support which was deleted in 2019
explains why dsa_dev_to_net_device() returns a conduit with elevated
refcount (because it was supposed to be released during
dsa_remove_dst()). After the removal of the legacy code, the only user
of dsa_dev_to_net_device() calls dev_put(conduit) immediately after this
function returns. This pattern makes no sense today, and can only be
interpreted historically to understand why dev_hold() was there in the
first place.
Change details
--------------
Today we have a better netdev tracking infrastructure which we should
use. Logically netdev_hold() belongs in common code
(dsa_port_parse_cpu(), where dp->conduit is assigned), but there is a
tradeoff to be made with the rtnl_lock() section which would become a
bit too long if we did that - dsa_port_parse_cpu() also calls
request_module(). So we duplicate a bit of logic in order for the
callers of dsa_port_parse_cpu() to be the ones responsible of holding
the conduit reference and releasing it on error. This shortens the
rtnl_lock() section significantly.
In the dsa_switch_probe() error path, dsa_switch_release_ports() will be
called in a number of situations, one being where dsa_port_parse_cpu()
maybe didn't get the chance to run at all (a different port failed
earlier, etc). So we have to test for the conduit being NULL prior to
calling netdev_put().
There have still been so many transformations to the code since the
blamed commits (rename master -> conduit, commit 0650bf52b31f ("net:
dsa: be compatible with masters which unregister on shutdown")), that it
only makes sense to fix the code using the best methods available today
and see how it can be backported to stable later. I suspect the fix
cannot even be backported to kernels which lack dsa_switch_shutdown(),
and I suspect this is also maybe why the long-lived conduit reference
didn't make it into the new DSA probing paths at the time (problems
during shutdown).
Because dsa_dev_to_net_device() has a single call site and has to be
changed anyway, the logic was just absorbed into the non-OF
dsa_port_parse().
Tested on the ocelot/felix switch and on dsa_loop, both on the NXP
LS1028A with CONFIG_DEBUG_KOBJECT_RELEASE=y.
Reported-by: Ma Ke <make24@iscas.ac.cn> Closes: https://lore.kernel.org/netdev/20251214131204.4684-1-make24@iscas.ac.cn/ Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Fixes: 71e0bbde0d88 ("net: dsa: Add support for platform data") Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251215150236.3931670-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lorenzo Bianconi [Sun, 14 Dec 2025 09:30:07 +0000 (10:30 +0100)]
net: airoha: Move net_devs registration in a dedicated routine
Since airoha_probe() is not executed under rtnl lock, there is small race
where a given device is configured by user-space while the remaining ones
are not completely loaded from the dts yet. This condition will allow a
hw device misconfiguration since there are some conditions (e.g. GDM2 check
in airoha_dev_init()) that require all device are properly loaded from the
device tree. Fix the issue moving net_devices registration at the end of
the airoha_probe routine.
Frode Nordahl [Sat, 13 Dec 2025 10:13:36 +0000 (10:13 +0000)]
erspan: Initialize options_len before referencing options.
The struct ip_tunnel_info has a flexible array member named
options that is protected by a counted_by(options_len)
attribute.
The compiler will use this information to enforce runtime bounds
checking deployed by FORTIFY_SOURCE string helpers.
As laid out in the GCC documentation, the counter must be
initialized before the first reference to the flexible array
member.
After scanning through the files that use struct ip_tunnel_info
and also refer to options or options_len, it appears the normal
case is to use the ip_tunnel_info_opts_set() helper.
Said helper would initialize options_len properly before copying
data into options, however in the GRE ERSPAN code a partial
update is done, preventing the use of the helper function.
Before this change the handling of ERSPAN traffic in GRE tunnels
would cause a kernel panic when the kernel is compiled with
GCC 15+ and having FORTIFY_SOURCE configured:
Paolo Abeni [Fri, 12 Dec 2025 12:54:04 +0000 (13:54 +0100)]
mptcp: ensure context reset on disconnect()
After the blamed commit below, if the MPC subflow is already in TCP_CLOSE
status or has fallback to TCP at mptcp_disconnect() time,
mptcp_do_fastclose() skips setting the `send_fastclose flag` and the later
__mptcp_close_ssk() does not reset anymore the related subflow context.
Any later connection will be created with both the `request_mptcp` flag
and the msk-level fallback status off (it is unconditionally cleared at
MPTCP disconnect time), leading to a warning in subflow_data_ready():
The TCP subflow can process the simult-connect syn-ack packet after
transitioning to TCP_FIN1 state, bypassing the MPTCP fallback check,
as the sk_state_change() callback is not invoked for * -> FIN_WAIT1
transitions.
That will move the msk socket to an inconsistent status and the next
incoming data will hit the reported splat.
Close the race moving the simult-fallback check at the earliest possible
stage - that is at syn-ack generation time.
About the fixes tags: [2] was supposed to also fix this issue introduced
by [3]. [1] is required as a dependence: it was not explicitly marked as
a fix, but it is one and it has already been backported before [3]. In
other words, this commit should be backported up to [3], including [2]
and [1] if that's not already there.
Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") [1] Fixes: 4fd19a307016 ("mptcp: fix inconsistent state on fastopen race") [2] Fixes: 1e777f39b4d7 ("mptcp: add MSG_FASTOPEN sendmsg flag support") [3] Cc: stable@vger.kernel.org Reported-by: syzbot+0ff6b771b4f7a5bce83b@syzkaller.appspotmail.com Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/586 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251212-net-mptcp-subflow_data_ready-warn-v1-1-d1f9fd1c36c8@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The problem is in this flow:
1) Port is enabled, queue_id != 0, in qom_list
2) Port gets disabled
-> team_port_disable()
-> team_queue_override_port_del()
-> del (removed from list)
3) Port is disabled, queue_id != 0, not in any list
4) Priority changes
-> team_queue_override_port_prio_changed()
-> checks: port disabled && queue_id != 0
-> calls del - hits the BUG as it is removed already
To fix this, change the check in team_queue_override_port_prio_changed()
so it returns early if port is not enabled.
Reported-by: syzbot+422806e5f4cce722a71f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=422806e5f4cce722a71f Fixes: 6c31ff366c11 ("team: remove synchronize_rcu() called during queue override change") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251212102953.167287-1-jiri@resnulli.us Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Wang Liang [Fri, 12 Dec 2025 01:27:23 +0000 (09:27 +0800)]
net/handshake: Fix null-ptr-deref in handshake_complete()
A null pointer dereference in handshake_complete() was observed [1].
When handshake_req_next() return NULL in handshake_nl_accept_doit(),
function handshake_complete() will be called unexpectedly which triggers
this crash. Fix it by goto out_status when req is NULL.
Eric Dumazet [Thu, 11 Dec 2025 17:35:50 +0000 (17:35 +0000)]
ip6_gre: make ip6gre_header() robust
Over the years, syzbot found many ways to crash the kernel
in ip6gre_header() [1].
This involves team or bonding drivers ability to dynamically
change their dev->needed_headroom and/or dev->hard_header_len
In this particular crash mld_newpack() allocated an skb
with a too small reserve/headroom, and by the time mld_sendpack()
was called, syzbot managed to attach an ip6gre device.
Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Reported-by: syzbot+43a2ebcf2a64b1102d64@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/693b002c.a70a0220.33cd7b.0033.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251211173550.2032674-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: openvswitch: Avoid needlessly taking the RTNL on vport destroy
The openvswitch teardown code will immediately call
ovs_netdev_detach_dev() in response to a NETDEV_UNREGISTER notification.
It will then start the dp_notify_work workqueue, which will later end up
calling the vport destroy() callback. This callback takes the RTNL to do
another ovs_netdev_detach_port(), which in this case is unnecessary.
This causes extra pressure on the RTNL, in some cases leading to
"unregister_netdevice: waiting for XX to become free" warnings on
teardown.
We can straight-forwardly avoid the extra RTNL lock acquisition by
checking the device flags before taking the lock, and skip the locking
altogether if the IFF_OVS_DATAPATH flag has already been unset.
Jacky Chou [Thu, 11 Dec 2025 06:24:58 +0000 (14:24 +0800)]
net: mdio: aspeed: add dummy read to avoid read-after-write issue
The Aspeed MDIO controller may return incorrect data when a read operation
follows immediately after a write. Due to a controller bug, the subsequent
read can latch stale data, causing the polling logic to terminate earlier
than expected.
To work around this hardware issue, insert a dummy read after each write
operation. This ensures that the next actual read returns the correct
data and prevents premature polling exit.
This workaround has been verified to stabilize MDIO transactions on
affected Aspeed platforms.
net: usb: sr9700: support devices with virtual driver CD
Some SR9700 devices have an SPI flash chip containing a virtual driver
CD, in which case they appear as a device with two interfaces and
product ID 0x9702. Interface 0 is the driver CD and interface 1 is the
Ethernet device.
Bluetooth: btusb: revert use of devm_kzalloc in btusb
This reverts commit 98921dbd00c4e ("Bluetooth: Use devm_kzalloc in
btusb.c file").
In btusb_probe(), we use devm_kzalloc() to allocate the btusb data. This
ties the lifetime of all the btusb data to the binding of a driver to
one interface, INTF. In a driver that binds to other interfaces, ISOC
and DIAG, this is an accident waiting to happen.
The issue is revealed in btusb_disconnect(), where calling
usb_driver_release_interface(&btusb_driver, data->intf) will have devm
free the data that is also being used by the other interfaces of the
driver that may not be released yet.
To fix this, revert the use of devm and go back to freeing memory
explicitly.
Fixes: 98921dbd00c4e ("Bluetooth: Use devm_kzalloc in btusb.c file") Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pauli Virtanen [Thu, 4 Dec 2025 20:40:20 +0000 (22:40 +0200)]
Bluetooth: MGMT: report BIS capability flags in supported settings
MGMT_SETTING_ISO_BROADCASTER and MGMT_SETTING_ISO_RECEIVER flags are
missing from supported_settings although they are in current_settings.
Report them also in supported_settings to be consistent.
Fixes: ae7533613133 ("Bluetooth: Check for ISO support in controller") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
- Jozsef Kadlecsik retires from maintaining netfilter
- tools: ynl: fix build on systems with old kernel headers"
* tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: hns3: add VLAN id validation before using
net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx
net: hns3: using the num_tqps in the vf driver to apply for resources
net: enetc: do not transmit redirected XDP frames when the link is down
selftests/tc-testing: Test case exercising potential mirred redirect deadlock
net/sched: act_mirred: fix loop detection
sctp: Clear inet_opt in sctp_v6_copy_ip_options().
sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock().
net/handshake: duplicate handshake cancellations leak socket
net/mlx5e: Don't include PSP in the hard MTU calculations
net/mlx5e: Do not update BQL of old txqs during channel reconfiguration
net/mlx5e: Trigger neighbor resolution for unresolved destinations
net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init
net/mlx5: Serialize firmware reset with devlink
net/mlx5: fw_tracer, Handle escaped percent properly
net/mlx5: fw_tracer, Validate format string parameters
net/mlx5: Drain firmware reset in shutdown callback
net/mlx5: fw reset, clear reset requested on drain_fw_reset
net: dsa: mxl-gsw1xx: manually clear RANEG bit
net: dsa: mxl-gsw1xx: fix .shutdown driver operation
...
Linus Torvalds [Thu, 18 Dec 2025 19:50:20 +0000 (07:50 +1200)]
Merge tag 'v6.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- important fix for reconnect problem
- minor cleanup
* tag 'v6.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
smb: move some SMB1 definitions into common/smb1pdu.h
smb: align durable reconnect v2 context to 8 byte boundary
Linus Torvalds [Thu, 18 Dec 2025 19:41:17 +0000 (07:41 +1200)]
Merge tag 'fsnotify_for_v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"Two fsnotify fixes.
The fix from Ahelenia makes sure we generate event when modifying
inode flags, the fix from Amir disables sending of events from device
inodes to their parent directory as it could concievably create a
usable side channel attack in case of some devices and so far we
aren't aware of anybody depending on the functionality"
* tag 'fsnotify_for_v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: send fsnotify_xattr()/IN_ATTRIB from vfs_fileattr_set()/chattr(1)
fsnotify: do not generate ACCESS/MODIFY events on child for special files
Paolo Abeni [Thu, 18 Dec 2025 16:23:07 +0000 (17:23 +0100)]
Merge tag 'linux-can-fixes-for-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-12-18
this is a pull request of 3 patches for net/main.
Tetsuo Handa contributes 2 patches to fix race windows in the j1939
protocol to properly handle disappearing network devices.
The last patch is by me, it fixes a build dependency with the CAN
drivers, that got introduced while fixing a dependency between the CAN
protocol and CAN device code.
* tag 'linux-can-fixes-for-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: fix build dependency
can: j1939: make j1939_sk_bind() fail if device is no longer registered
can: j1939: make j1939_session_activate() fail if device is no longer registered
====================
Jian Shen [Thu, 11 Dec 2025 02:37:37 +0000 (10:37 +0800)]
net: hns3: add VLAN id validation before using
Currently, the VLAN id may be used without validation when
receive a VLAN configuration mailbox from VF. The length of
vlan_del_fail_bmap is BITS_TO_LONGS(VLAN_N_VID). It may cause
out-of-bounds memory access once the VLAN id is bigger than
or equal to VLAN_N_VID.
Therefore, VLAN id needs to be checked to ensure it is within
the range of VLAN_N_VID.
Fixes: fe4144d47eef ("net: hns3: sync VLAN filter entries when kill VLAN ID failed") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251211023737.2327018-4-shaojijie@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jian Shen [Thu, 11 Dec 2025 02:37:36 +0000 (10:37 +0800)]
net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx
Currently, rss_size = num_tqps / tc_num. If tc_num is 1, then num_tqps
equals rss_size. However, if the tc_num is greater than 1, then rss_size
will be less than num_tqps, causing the tqp_index check for subsequent TCs
using rss_size to always fail.
This patch uses the num_tqps to check whether tqp_index is out of range,
instead of rss_size.
Fixes: 326334aad024 ("net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251211023737.2327018-3-shaojijie@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jian Shen [Thu, 11 Dec 2025 02:37:35 +0000 (10:37 +0800)]
net: hns3: using the num_tqps in the vf driver to apply for resources
Currently, hdev->htqp is allocated using hdev->num_tqps, and kinfo->tqp
is allocated using kinfo->num_tqps. However, kinfo->num_tqps is set to
min(new_tqps, hdev->num_tqps); Therefore, kinfo->num_tqps may be smaller
than hdev->num_tqps, which causes some hdev->htqp[i] to remain
uninitialized in hclgevf_knic_setup().
Thus, this patch allocates hdev->htqp and kinfo->tqp using hdev->num_tqps,
ensuring that the lengths of hdev->htqp and kinfo->tqp are consistent
and that all elements are properly initialized.
Wei Fang [Thu, 11 Dec 2025 02:09:19 +0000 (10:09 +0800)]
net: enetc: do not transmit redirected XDP frames when the link is down
In the current implementation, the enetc_xdp_xmit() always transmits
redirected XDP frames even if the link is down, but the frames cannot
be transmitted from TX BD rings when the link is down, so the frames
are still kept in the TX BD rings. If the XDP program is uninstalled,
users will see the following warning logs.
fsl_enetc 0000:00:00.0 eno0: timeout for tx ring #6 clear
More worse, the TX BD ring cannot work properly anymore, because the
HW PIR and CIR are not equal after the re-initialization of the TX
BD ring. At this point, the BDs between CIR and PIR are invalid,
which will cause a hardware malfunction.
Another reason is that there is internal context in the ring prefetch
logic that will retain the state from the first incarnation of the ring
and continue prefetching from the stale location when we re-initialize
the ring. The internal context is only reset by an FLR. That is to say,
for LS1028A ENETC, software cannot set the HW CIR and PIR when
initializing the TX BD ring.
It does not make sense to transmit redirected XDP frames when the link is
down. Add a link status check to prevent transmission in this condition.
This fixes part of the issue, but more complex cases remain. For example,
the TX BD ring may still contain unsent frames when the link goes down.
Those situations require additional patches, which will build on this
one.
Fixes: 9d2b68cc108d ("net: enetc: add support for XDP_REDIRECT") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251211020919.121113-1-wei.fang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Victor Nogueira [Wed, 10 Dec 2025 16:22:55 +0000 (11:22 -0500)]
selftests/tc-testing: Test case exercising potential mirred redirect deadlock
Add a test case that reproduces deadlock scenario where the user has
a drr qdisc attached to root and has a mirred action that redirects to
self on egress
sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock().
syzbot reported the lockdep splat below. [0]
sctp_clone_sock() sets the child socket's ipv6_mc_list to NULL,
but somehow sock_release() in an error path finally acquires
lock_sock() in ipv6_sock_mc_close().
The root cause is that sctp_clone_sock() fetches inet6_sk(newsk)
before setting newinet->pinet6, meaning that the parent's
ipv6_mc_list was actually cleared.
Also, sctp_v6_copy_ip_options() uses inet6_sk() but is called
before newinet->pinet6 is set.
Let's use inet6_sk() only after setting newinet->pinet6.
[0]:
WARNING: possible recursive locking detected
syzkaller #0 Not tainted
syz.0.17/5996 is trying to acquire lock: ffff888031af4c60 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1700 [inline] ffff888031af4c60 (sk_lock-AF_INET6){+.+.}-{0:0}, at: ipv6_sock_mc_close+0xd3/0x140 net/ipv6/mcast.c:348
but task is already holding lock: ffff888031af4320 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1700 [inline] ffff888031af4320 (sk_lock-AF_INET6){+.+.}-{0:0}, at: sctp_getsockopt+0x135/0xb60 net/sctp/socket.c:8131
other info that might help us debug this:
Possible unsafe locking scenario:
When a handshake request is cancelled it is removed from the
handshake_net->hn_requests list, but it is still present in the
handshake_rhashtbl until it is destroyed.
If a second cancellation request arrives for the same handshake request,
then remove_pending() will return false... and assuming
HANDSHAKE_F_REQ_COMPLETED isn't set in req->hr_flags, we'll continue
processing through the out_true label, where we put another reference on
the sock and a refcount underflow occurs.
This can happen for example if a handshake times out - particularly if
the SUNRPC client sends the AUTH_TLS probe to the server but doesn't
follow it up with the ClientHello due to a problem with tlshd. When the
timeout is hit on the server, the server will send a FIN, which triggers
a cancellation request via xs_reset_transport(). When the timeout is
hit on the client, another cancellation request happens via
xs_tls_handshake_sync().
Add a test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED) in the pending cancel
path so duplicate cancels can be detected.
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Suggested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20251209193015.3032058-1-smayhew@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 18 Dec 2025 12:55:01 +0000 (13:55 +0100)]
Merge tag 'nf-25-12-16' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: updates for net
The following patchset contains Netfilter fixes for *net*:
1) Jozsef Kadlecsik is retiring. Fortunately Jozsef will still keep an
eye on ipset patches.
2) remove a bogus direction check from nat core, this caused spurious
flakes in the 'reverse clash' selftest, from myself.
3) nf_tables doesn't need to do chain validation on register store,
from Pablo Neira Ayuso.
4) nf_tables shouldn't revisit chains during ruleset (graph) validation
if possible. Both 3 and 4 were slated for -next initially but there
are now two independent reports of people hitting soft lockup errors
during ruleset validation, so it makes no sense anymore to route
this via -next given this is -stable material. From myself.
5) call cond_resched() in a more frequently visited place during nf_tables
chain validation, this wasn't possible earlier due to rcu read lock,
but nowadays its not held anymore during set walks.
6) Don't fail conntrack packetdrill test with HZ=100 kernels.
netfilter pull request nf-25-12-16
* tag 'nf-25-12-16' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: packetdrill: avoid failure on HZ=100 kernel
netfilter: nf_tables: avoid softlockup warnings in nft_chain_validate
netfilter: nf_tables: avoid chain re-validation if possible
netfilter: nf_tables: remove redundant chain validation on register store
netfilter: nf_nat: remove bogus direction check
MAINTAINERS: Remove Jozsef Kadlecsik from MAINTAINERS file
====================
Cosmin Ratiu [Tue, 9 Dec 2025 12:56:17 +0000 (14:56 +0200)]
net/mlx5e: Don't include PSP in the hard MTU calculations
Commit [1] added the 40 bytes required by the PSP header+trailer and the
UDP header to MLX5E_ETH_HARD_MTU, which limits the device-wide max
software MTU that could be set. This is not okay, because most packets
are not PSP packets and it doesn't make sense to always reserve space
for headers which won't get added in most cases.
As it turns out, for TCP connections, PSP overhead is already taken into
account in the TCP MSS calculations via inet_csk(sk)->icsk_ext_hdr_len.
This was added in commit [2]. This means that the extra space reserved
in the hard MTU for mlx5 ends up unused and wasted.
Remove the unnecessary 40 byte reservation from hard MTU.
[1] commit e5a1861a298e ("net/mlx5e: Implement PSP Tx data path")
[2] commit e97269257fe4 ("net: psp: update the TCP MSS to reflect PSP
packet overhead")
Fixes: e5a1861a298e ("net/mlx5e: Implement PSP Tx data path") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1765284977-1363052-10-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tariq Toukan [Tue, 9 Dec 2025 12:56:16 +0000 (14:56 +0200)]
net/mlx5e: Do not update BQL of old txqs during channel reconfiguration
During channel reconfiguration (e.g., ethtool private flags changes),
the driver can trigger a kernel BUG_ON in dql_completed() with the error
"kernel BUG at lib/dynamic_queue_limits.c:99".
The issue occurs in the following sequence:
During mlx5e_safe_switch_params(), old channels are deactivated via
mlx5e_deactivate_txqsq(). New channels are created and activated, taking
ownership of the netdev_queues and their BQL state.
When old channels are closed via mlx5e_close_txqsq(), there may be
pending TX descriptors (sq->cc != sq->pc) that were in-flight during the
deactivation.
mlx5e_free_txqsq_descs() frees these pending descriptors and attempts to
complete them via netdev_tx_completed_queue().
However, the BQL state (dql->num_queued and dql->num_completed) have
been reset in mlx5e_activate_txqsq and belong to the new queue owner,
leading to dql->num_queued - dql->num_completed < nbytes.
This triggers BUG_ON(count > num_queued - num_completed) in
dql_completed().
Fixes: 3b88a535a8e1 ("net/mlx5e: Defer channels closure to reduce interface down time") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/1765284977-1363052-9-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Tue, 9 Dec 2025 12:56:15 +0000 (14:56 +0200)]
net/mlx5e: Trigger neighbor resolution for unresolved destinations
When initializing the MAC addresses for an outbound IPsec packet offload
rule in mlx5e_ipsec_init_macs, the call to dst_neigh_lookup is used to
find the next-hop neighbor (typically the gateway in tunnel mode).
This call might create a new neighbor entry if one doesn't already
exist. This newly created entry starts in the INCOMPLETE state, as the
kernel hasn't yet sent an ARP or NDISC probe to resolve the MAC
address. In this case, neigh_ha_snapshot will correctly return an
all-zero MAC address.
IPsec packet offload requires the actual next-hop MAC address to
program the rule correctly. If the neighbor state is INCOMPLETE when
the rule is created, the hardware rule is programmed with an all-zero
destination MAC address. Packets sent using this rule will be
subsequently dropped by the receiving network infrastructure or host.
This patch adds a check specifically for the outbound offload path. If
neigh_ha_snapshot returns an all-zero MAC address, it proactively
calls neigh_event_send(n, NULL). This ensures the kernel immediately
sends the initial ARP or NDISC probe if one isn't already pending,
accelerating the resolution process. This helps prevent the hardware
rule from being programmed with an invalid MAC address and avoids
packet drops due to unresolved neighbors.
Fixes: 71670f766b8f ("net/mlx5e: Support routed networks during IPsec MACs initialization") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1765284977-1363052-8-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Tue, 9 Dec 2025 12:56:14 +0000 (14:56 +0200)]
net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init
Replace ipv6_stub->ipv6_dst_lookup_flow() with ip6_dst_lookup() in
mlx5e_ipsec_init_macs() since IPsec transformations are not needed
during Security Association setup - only basic routing information is
required for nexthop MAC address resolution.
This resolves an issue where XfrmOutNoStates error counter would be
incremented when xfrm policy is configured before xfrm state, as the
IPsec-aware routing function would attempt policy checks during SA
initialization.
Fixes: 71670f766b8f ("net/mlx5e: Support routed networks during IPsec MACs initialization") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1765284977-1363052-7-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shay Drory [Tue, 9 Dec 2025 12:56:13 +0000 (14:56 +0200)]
net/mlx5: Serialize firmware reset with devlink
The firmware reset mechanism can be triggered by asynchronous events,
which may race with other devlink operations like devlink reload or
devlink dev eswitch set, potentially leading to inconsistent states.
This patch addresses the race by using the devl_lock to serialize the
firmware reset against other devlink operations. When a reset is
requested, the driver attempts to acquire the lock. If successful, it
sets a flag to block devlink reload or eswitch changes, ACKs the reset
to firmware and then releases the lock. If the lock is already held by
another operation, the driver NACKs the firmware reset request,
indicating that the reset cannot proceed.
Firmware reset does not keep the devl_lock and instead uses an internal
firmware reset bit. This is because firmware resets can be triggered by
asynchronous events, and processed in different threads. It is illegal
and unsafe to acquire a lock in one thread and attempt to release it in
another, as lock ownership is intrinsically thread-specific.
This change ensures that firmware resets and other devlink operations
are mutually exclusive during the critical reset request phase,
preventing race conditions.
The firmware tracer's format string validation and parameter counting
did not properly handle escaped percent signs (%%). This caused
fw_tracer to count more parameters when trace format strings contained
literal percent characters.
To fix it, allow %% to pass string validation and skip %% sequences when
counting parameters since they represent literal percent signs rather
than format specifiers.
Shay Drory [Tue, 9 Dec 2025 12:56:11 +0000 (14:56 +0200)]
net/mlx5: fw_tracer, Validate format string parameters
Add validation for format string parameters in the firmware tracer to
prevent potential security vulnerabilities and crashes from malformed
format strings received from firmware.
The firmware tracer receives format strings from the device firmware and
uses them to format trace messages. Without proper validation, bad
firmware could provide format strings with invalid format specifiers
(e.g., %s, %p, %n) that could lead to crashes, or other undefined
behavior.
Add mlx5_tracer_validate_params() to validate that all format specifiers
in trace strings are limited to safe integer/hex formats (%x, %d, %i,
%u, %llx, %lx, etc.). Reject strings containing other format types that
could be used to access arbitrary memory or cause crashes.
Invalid format strings are added to the trace output for visibility with
"BAD_FORMAT: " prefix.
Moshe Shemesh [Tue, 9 Dec 2025 12:56:09 +0000 (14:56 +0200)]
net/mlx5: fw reset, clear reset requested on drain_fw_reset
drain_fw_reset() waits for ongoing firmware reset events and blocks new
event handling, but does not clear the reset requested flag, and may
keep sync reset polling.
To fix it, call mlx5_sync_reset_clear_reset_requested() to clear the
flag, stop sync reset polling, and resume health polling, ensuring
health issues are still detected after the firmware reset drain.
Paolo Abeni [Thu, 18 Dec 2025 11:53:23 +0000 (12:53 +0100)]
Merge branch 'net-dsa-lantiq-a-bunch-of-fixes'
Daniel Golle says:
====================
net: dsa: lantiq: a bunch of fixes
This series is the continuation and result of comments received for a fix
for the SGMII restart-an bit not actually being self-clearing, which was
reported by by Rasmus Villemoes.
A closer investigation and testing the .remove and the .shutdown paths
of the mxl-gsw1xx.c and lantiq_gswip.c drivers has revealed a couple of
existing problems, which are also addressed in this series.
====================
Daniel Golle [Tue, 9 Dec 2025 01:29:34 +0000 (01:29 +0000)]
net: dsa: mxl-gsw1xx: manually clear RANEG bit
Despite being documented as self-clearing, the RANEG bit sometimes
remains set, preventing auto-negotiation from happening.
Manually clear the RANEG bit after 10ms as advised by MaxLinear.
In order to not hold RTNL during the 10ms of waiting schedule
delayed work to take care of clearing the bit asynchronously, which
is similar to the self-clearing behavior.
The .shutdown operation should call dsa_switch_shutdown() just like
it is done also by the sibling lantiq_gswip driver. Not doing that
results in shutdown or reboot hanging and waiting for the CPU port
becoming free, which introduces a longer delay and a WARNING before
shutdown or reboot in case the driver is built-into the kernel.
Fix this by calling dsa_switch_shutdown() in the driver's shutdown
operation, harmonizing it with what is done in the lantiq_gswip
driver. As a side-effect this now allows to remove the previously
exported gswip_disable_switch() function which no longer got any
users.
Daniel Golle [Tue, 9 Dec 2025 01:28:49 +0000 (01:28 +0000)]
net: dsa: mxl-gsw1xx: fix order in .remove operation
The driver's .remove operation was calling gswip_disable_switch() which
clears the GSWIP_MDIO_GLOB_ENABLE bit before calling
dsa_unregister_switch() and thereby violating a Golden Rule of driver
development to always unpublish userspace interfaces before disabling
hardware, as pointed out by Russell King.
Fix this by relying in GSWIP_MDIO_GLOB_ENABLE being cleared by the
.teardown operation introduced by the previous commit
("net: dsa: lantiq_gswip: fix teardown order").
Daniel Golle [Tue, 9 Dec 2025 01:28:20 +0000 (01:28 +0000)]
net: dsa: lantiq_gswip: fix order in .remove operation
Russell King pointed out that disabling the switch by clearing
GSWIP_MDIO_GLOB_ENABLE before calling dsa_unregister_switch() is
problematic, as it violates a Golden Rule of driver development to
always first unpublish userspace interfaces and then disable the
hardware.
Fix this, and also simplify the probe() function, by introducing a
dsa_switch_ops teardown() operation which takes care of clearing the
GSWIP_MDIO_GLOB_ENABLE bit.
Gal Pressman [Mon, 8 Dec 2025 12:19:01 +0000 (14:19 +0200)]
ethtool: Avoid overflowing userspace buffer on stats query
The ethtool -S command operates across three ioctl calls:
ETHTOOL_GSSET_INFO for the size, ETHTOOL_GSTRINGS for the names, and
ETHTOOL_GSTATS for the values.
If the number of stats changes between these calls (e.g., due to device
reconfiguration), userspace's buffer allocation will be incorrect,
potentially leading to buffer overflow.
Drivers are generally expected to maintain stable stat counts, but some
drivers (e.g., mlx5, bnx2x, bna, ksz884x) use dynamic counters, making
this scenario possible.
Some drivers try to handle this internally:
- bnad_get_ethtool_stats() returns early in case stats.n_stats is not
equal to the driver's stats count.
- micrel/ksz884x also makes sure not to write anything beyond
stats.n_stats and overflow the buffer.
However, both use stats.n_stats which is already assigned with the value
returned from get_sset_count(), hence won't solve the issue described
here.
Change ethtool_get_strings(), ethtool_get_stats(),
ethtool_get_phy_stats() to not return anything in case of a mismatch
between userspace's size and get_sset_size(), to prevent buffer
overflow.
The returned n_stats value will be equal to zero, to reflect that
nothing has been returned.
This could result in one of two cases when using upstream ethtool,
depending on when the size change is detected:
1. When detected in ethtool_get_strings():
# ethtool -S eth2
no stats available
2. When detected in get stats, all stats will be reported as zero.
Both cases are presumably transient, and a subsequent ethtool call
should succeed.
Other than the overflow avoidance, these two cases are very evident (no
output/cleared stats), which is arguably better than presenting
incorrect/shifted stats.
I also considered returning an error instead of a "silent" response, but
that seems more destructive towards userspace apps.
Notes:
- This patch does not claim to fix the inherent race, it only makes sure
that we do not overflow the userspace buffer, and makes for a more
predictable behavior.
- RTNL lock is held during each ioctl, the race window exists between
the separate ioctl calls when the lock is released.
- Userspace ethtool always fills stats.n_stats, but it is likely that
these stats ioctls are implemented in other userspace applications
which might not fill it. The added code checks that it's not zero,
to prevent any regressions.
Arnd Bergmann's patch [1] fixed the build dependency problem introduced by
bugfix commit cb2dc6d2869a ("can: Kconfig: select CAN driver infrastructure
by default"). This ended up as commit 6abd4577bccc ("can: fix build
dependency"), but I broke Arnd's fix by removing a dependency that we
thought was superfluous.
Guangshuo Li [Mon, 1 Dec 2025 03:40:58 +0000 (11:40 +0800)]
e1000: fix OOB in e1000_tbi_should_accept()
In e1000_tbi_should_accept() we read the last byte of the frame via
'data[length - 1]' to evaluate the TBI workaround. If the descriptor-
reported length is zero or larger than the actual RX buffer size, this
read goes out of bounds and can hit unrelated slab objects. The issue
is observed from the NAPI receive path (e1000_clean_rx_irq):
==================================================================
BUG: KASAN: slab-out-of-bounds in e1000_tbi_should_accept+0x610/0x790
Read of size 1 at addr ffff888014114e54 by task sshd/363
CPU: 0 PID: 363 Comm: sshd Not tainted 5.18.0-rc1 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x5a/0x74
print_address_description+0x7b/0x440
print_report+0x101/0x200
kasan_report+0xc1/0xf0
e1000_tbi_should_accept+0x610/0x790
e1000_clean_rx_irq+0xa8c/0x1110
e1000_clean+0xde2/0x3c10
__napi_poll+0x98/0x380
net_rx_action+0x491/0xa20
__do_softirq+0x2c9/0x61d
do_softirq+0xd1/0x120
</IRQ>
<TASK>
__local_bh_enable_ip+0xfe/0x130
ip_finish_output2+0x7d5/0xb00
__ip_queue_xmit+0xe24/0x1ab0
__tcp_transmit_skb+0x1bcb/0x3340
tcp_write_xmit+0x175d/0x6bd0
__tcp_push_pending_frames+0x7b/0x280
tcp_sendmsg_locked+0x2e4f/0x32d0
tcp_sendmsg+0x24/0x40
sock_write_iter+0x322/0x430
vfs_write+0x56c/0xa60
ksys_write+0xd1/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f511b476b10
Code: 73 01 c3 48 8b 0d 88 d3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d f9 2b 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 8e 9b 01 00 48 89 04 24
RSP: 002b:00007ffc9211d4e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000004024 RCX: 00007f511b476b10
RDX: 0000000000004024 RSI: 0000559a9385962c RDI: 0000000000000003
RBP: 0000559a9383a400 R08: fffffffffffffff0 R09: 0000000000004f00
R10: 0000000000000070 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc9211d57f R14: 0000559a9347bde7 R15: 0000000000000003
</TASK>
Allocated by task 1:
__kasan_krealloc+0x131/0x1c0
krealloc+0x90/0xc0
add_sysfs_param+0xcb/0x8a0
kernel_add_sysfs_param+0x81/0xd4
param_sysfs_builtin+0x138/0x1a6
param_sysfs_init+0x57/0x5b
do_one_initcall+0x104/0x250
do_initcall_level+0x102/0x132
do_initcalls+0x46/0x74
kernel_init_freeable+0x28f/0x393
kernel_init+0x14/0x1a0
ret_from_fork+0x22/0x30
The buggy address belongs to the object at ffff888014114000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1620 bytes to the right of
2048-byte region [ffff888014114000, ffff888014114800]
The buggy address belongs to the physical page:
page:ffffea0000504400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14110
head:ffffea0000504400 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x100000000010200(slab|head|node=0|zone=1)
raw: 01000000000102000000000000000000dead000000000001ffff888013442000
raw: 0000000000000000000000000008000800000001ffffffff0000000000000000
page dumped because: kasan: bad access detected
==================================================================
This happens because the TBI check unconditionally dereferences the last
byte without validating the reported length first:
u8 last_byte = *(data + length - 1);
Fix by rejecting the frame early if the length is zero, or if it exceeds
adapter->rx_buffer_len. This preserves the TBI workaround semantics for
valid frames and prevents touching memory beyond the RX buffer.
Fixes: 2037110c96d5 ("e1000: move tbi workaround code into helper function") Cc: stable@vger.kernel.org Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Brian Vazquez [Mon, 10 Nov 2025 20:58:37 +0000 (20:58 +0000)]
idpf: reduce mbx_task schedule delay to 300us
During the IDPF init phase, the mailbox runs in poll mode until it is
configured to properly handle interrupts. The previous delay of 300ms is
excessively long for the mailbox polling mechanism, which causes a slow
initialization of ~2s:
[ 52.444239] idpf 0000:06:12.4: enabling device (0000 -> 0002)
[ 52.485005] idpf 0000:06:12.4: Device HW Reset initiated
[ 54.177181] idpf 0000:06:12.4: PTP init failed, err=-EOPNOTSUPP
[ 54.206177] idpf 0000:06:12.4: Minimum RX descriptor support not provided, using the default
[ 54.206182] idpf 0000:06:12.4: Minimum TX descriptor support not provided, using the default
Changing the delay to 300us avoids the delays during the initial mailbox
transactions, making the init phase much faster:
[ 83.342590] idpf 0000:06:12.4: enabling device (0000 -> 0002)
[ 83.384402] idpf 0000:06:12.4: Device HW Reset initiated
[ 83.518323] idpf 0000:06:12.4: PTP init failed, err=-EOPNOTSUPP
[ 83.547430] idpf 0000:06:12.4: Minimum RX descriptor support not provided, using the default
[ 83.547435] idpf 0000:06:12.4: Minimum TX descriptor support not provided, using the default
Fixes: 4930fbf419a7 ("idpf: add core init and interrupt request") Signed-off-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Larysa Zaremba [Tue, 7 Oct 2025 11:46:22 +0000 (13:46 +0200)]
idpf: fix LAN memory regions command on some NVMs
IPU SDK versions 1.9 through 2.0.5 require send buffer to contain a single
empty memory region. Set number of regions to 1 and use appropriate send
buffer size to satisfy this requirement.
Fixes: 6aa53e861c1a ("idpf: implement get LAN MMIO memory regions") Suggested-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kohei Enju [Sat, 25 Oct 2025 16:58:50 +0000 (01:58 +0900)]
iavf: fix off-by-one issues in iavf_config_rss_reg()
There are off-by-one bugs when configuring RSS hash key and lookup
table, causing out-of-bounds reads to memory [1] and out-of-bounds
writes to device registers.
Before commit 43a3d9ba34c9 ("i40evf: Allow PF driver to configure RSS"),
the loop upper bounds were:
i <= I40E_VFQF_{HKEY,HLUT}_MAX_INDEX
which is safe since the value is the last valid index.
That commit changed the bounds to:
i <= adapter->rss_{key,lut}_size / 4
where `rss_{key,lut}_size / 4` is the number of dwords, so the last
valid index is `(rss_{key,lut}_size / 4) - 1`. Therefore, using `<=`
accesses one element past the end.
Fix the issues by using `<` instead of `<=`, ensuring we do not exceed
the bounds.
[1] KASAN splat about rss_key_size off-by-one
BUG: KASAN: slab-out-of-bounds in iavf_config_rss+0x619/0x800
Read of size 4 at addr ffff888102c50134 by task kworker/u8:6/63
The buggy address belongs to the object at ffff888102c50100
which belongs to the cache kmalloc-64 of size 64
The buggy address is located 0 bytes to the right of
allocated 52-byte region [ffff888102c50100, ffff888102c50134)
Memory state around the buggy address: ffff888102c50000: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc ffff888102c50080: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
>ffff888102c50100: 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc fc
^ ffff888102c50180: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc ffff888102c50200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Fixes: 43a3d9ba34c9 ("i40evf: Allow PF driver to configure RSS") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Gregory Herrero [Fri, 12 Dec 2025 21:06:43 +0000 (22:06 +0100)]
i40e: validate ring_len parameter against hardware-specific values
The maximum number of descriptors supported by the hardware is
hardware-dependent and can be retrieved using
i40e_get_max_num_descriptors(). Move this function to a shared header
and use it when checking for valid ring_len parameter rather than using
hardcoded value.
By fixing an over-acceptance issue, behavior change could be seen where
ring_len could now be rejected while configuring rx and tx queues if its
size is larger than the hardware-dependent maximum number of
descriptors.
Fixes: 55d225670def ("i40e: add validation for ring_len param") Signed-off-by: Gregory Herrero <gregory.herrero@oracle.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Przemyslaw Korba [Thu, 20 Nov 2025 12:07:28 +0000 (13:07 +0100)]
i40e: fix scheduling in set_rx_mode
Add service task schedule to set_rx_mode.
In some cases there are error messages printed out in PTP application
(ptp4l):
ptp4l[13848.762]: port 1 (ens2f3np3): received SYNC without timestamp
ptp4l[13848.825]: port 1 (ens2f3np3): received SYNC without timestamp
ptp4l[13848.887]: port 1 (ens2f3np3): received SYNC without timestamp
This happens when service task would not run immediately after
set_rx_mode, and we need it for setup tasks. This service task checks, if
PTP RX packets are hung in firmware, and propagate correct settings such
as multicast address for IEEE 1588 Precision Time Protocol.
RX timestamping depends on some of these filters set. Bug happens only
with high PTP packets frequency incoming, and not every run since
sometimes service task is being ran from a different place immediately
after starting ptp4l.
Fixes: 0e4425ed641f ("i40e: fix: do not sleep in netdev_ops") Reviewed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tetsuo Handa [Tue, 25 Nov 2025 13:43:12 +0000 (22:43 +0900)]
can: j1939: make j1939_sk_bind() fail if device is no longer registered
There is a theoretical race window in j1939_sk_netdev_event_unregister()
where two j1939_sk_bind() calls jump in between read_unlock_bh() and
lock_sock().
The assumption jsk->priv == priv can fail if the first j1939_sk_bind()
call once made jsk->priv == NULL due to failed j1939_local_ecu_get() call
and the second j1939_sk_bind() call again made jsk->priv != NULL due to
successful j1939_local_ecu_get() call.
Since the socket lock is held by both j1939_sk_netdev_event_unregister()
and j1939_sk_bind(), checking ndev->reg_state with the socket lock held can
reliably make the second j1939_sk_bind() call fail (and close this race
window).
Tetsuo Handa [Tue, 25 Nov 2025 13:39:59 +0000 (22:39 +0900)]
can: j1939: make j1939_session_activate() fail if device is no longer registered
syzbot is still reporting
unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
even after commit 93a27b5891b8 ("can: j1939: add missing calls in
NETDEV_UNREGISTER notification handler") was added. A debug printk() patch
found that j1939_session_activate() can succeed even after
j1939_cancel_active_session() from j1939_netdev_notify(NETDEV_UNREGISTER)
has completed.
Since j1939_cancel_active_session() is processed with the session list lock
held, checking ndev->reg_state in j1939_session_activate() with the session
list lock held can reliably close the race window.
- Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu)
- Fix warnings in libbpf when built with -Wdiscarded-qualifiers under
C23 (Mikhail Gavrilov)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: add regression test for bpf_d_path()
bpf: Fix verifier assumptions of bpf_d_path's output buffer
selftests/bpf: Add test for truncated dmabuf_iter reads
bpf: Fix truncated dmabuf iterator reads
x86/unwind/orc: Support reliable unwinding through BPF stack frames
bpf: Add bpf_has_frame_pointer()
bpf, arm64: Do not audit capability check in do_jit()
libbpf: Fix -Wdiscarded-qualifiers under C23
bpftool: Fix build warnings due to MS extensions
net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT
selftests/bpf: Add -fms-extensions to bpf build flags
Linus Torvalds [Wed, 17 Dec 2025 03:48:30 +0000 (15:48 +1200)]
Merge tag 's390-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- clear 'Search boot program' flag when 'bootprog' sysfs file is
written to override a value set from Hardware Management Console
- fix cyclic dead-lock in zpci_zdev_put() and zpci_scan_devices()
functions when triggering PCI device recovery using sysfs
- annotate the expected lock context imbalance in zpci_release_device()
function to fix a sparse complaint
- fix the logic to fallback to the return address register value in the
topmost frame when stack tracing uses a back chain
* tag 's390-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/stacktrace: Do not fallback to RA register
s390/pci: Annotate lock context imbalance in zpci_release_device()
s390/pci: Fix cyclic dead-lock in zpci_zdev_put() and zpci_scan_devices()
s390/ipl: Clear SBP flag when bootprog is set
Bharath SM [Tue, 16 Dec 2025 15:56:05 +0000 (21:26 +0530)]
smb: align durable reconnect v2 context to 8 byte boundary
Add a 4-byte Pad to create_durable_handle_reconnect_v2 so the DH2C
create context is 8 byte aligned.
This avoids malformed CREATE contexts on reconnect.
Recent change removed this Padding, adding it back.
Fixes: 81a45de432c6 ("smb: move create_durable_handle_reconnect_v2 to common/smb2pdu.h") Signed-off-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Linus Torvalds [Tue, 16 Dec 2025 07:44:36 +0000 (19:44 +1200)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull shmem rename fixes from Al Viro:
"A couple of shmem rename fixes - recent regression from tree-in-dcache
series and older breakage from stable directory offsets stuff"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
shmem: fix recovery on rename failures
shmem_whiteout(): fix regression from tree-in-dcache series
Linus Torvalds [Tue, 16 Dec 2025 07:34:09 +0000 (19:34 +1200)]
Merge tag 'v6.19-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix set xattr name validation
- Fix session refcount leak
- Minor cleanup
- smbdirect (RDMA) fixes: improve receive completion, and connect
* tag 'v6.19-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix buffer validation by including null terminator size in EA length
ksmbd: Fix refcount leak when invalid session is found on session lookup
ksmbd: remove redundant DACL check in smb_check_perm_dacl
ksmbd: convert comma to semicolon
smb: server: defer the initial recv completion logic to smb_direct_negotiate_recv_work()
smb: server: initialize recv_io->cqe.done = recv_done just once
smb: smbdirect: introduce smbdirect_socket.connect.{lock,work}
Linus Torvalds [Tue, 16 Dec 2025 07:28:20 +0000 (19:28 +1200)]
Merge tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix missing btrfs_path release after printing a relocation error
message
- fix extent changeset leak on mmap write after failure to reserve
metadata
- fix fs devices list structure freeing, it could be potentially leaked
under some circumstances
- tree log fixes:
- fix incremental directory logging where inodes for new dentries
were incorrectly skipped
- don't log conflicting inode if it's a directory moved in the
current transaction
- regression fixes:
- fix incorrect btrfs_path freeing when it's auto-cleaned
- revert commit simplifying preallocation of temporary structures
in qgroup functions, some cases were not handled properly
* tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix changeset leak on mmap write after failure to reserve metadata
btrfs: fix memory leak of fs_devices in degraded seed device path
btrfs: fix a potential path leak in print_data_reloc_error()
Revert "btrfs: add ASSERTs on prealloc in qgroup functions"
btrfs: do not skip logging new dentries when logging a new name
btrfs: don't log conflicting inode if it's a dir moved in the current transaction
btrfs: tests: fix double btrfs_path free in remove_extent_ref()
Linus Torvalds [Tue, 16 Dec 2025 07:24:35 +0000 (19:24 +1200)]
Merge tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix memory leak when destroying helper kthread workers during
scheduler disable
- Fix bypass depth accounting on scx_enable() failure which could leave
the system permanently in bypass mode
- Fix missing preemption handling when moving tasks to local DSQs via
scx_bpf_dsq_move()
- Misc fixes including NULL check for put_prev_task(), flushing stdout
in selftests, and removing unused code
* tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unused code in the do_pick_task_scx()
selftests/sched_ext: flush stdout before test to avoid log spam
sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()
sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()
sched_ext: Fix bypass depth leak on scx_enable() failure
sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next
sched_ext: Fix the memleak for sch->helper objects
Linus Torvalds [Tue, 16 Dec 2025 07:21:17 +0000 (19:21 +1200)]
Merge tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
- Fix a race condition in css_rstat_updated() where CMPXCHG without
LOCK prefix could cause lnode corruption when the flusher runs
concurrently on another CPU. The issue was introduced in 6.17 and
causes memcg stats to become corrupted in production.
* tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
Al Viro [Sat, 13 Dec 2025 22:50:23 +0000 (17:50 -0500)]
shmem: fix recovery on rename failures
maple_tree insertions can fail if we are seriously short on memory;
simple_offset_rename() does not recover well if it runs into that.
The same goes for simple_offset_rename_exchange().
Moreover, shmem_whiteout() expects that if it succeeds, the caller will
progress to d_move(), i.e. that shmem_rename2() won't fail past the
successful call of shmem_whiteout().
Not hard to fix, fortunately - mtree_store() can't fail if the index we
are trying to store into is already present in the tree as a singleton.
For simple_offset_rename_exchange() that's enough - we just need to be
careful about the order of operations.
For simple_offset_rename() solution is to preinsert the target into the
tree for new_dir; the rest can be done without any potentially failing
operations.
That preinsertion has to be done in shmem_rename2() rather than in
simple_offset_rename() itself - otherwise we'd need to deal with the
possibility of failure after successful shmem_whiteout().
Fixes: a2e459555c5f ("shmem: stable directory offsets") Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Florian Westphal [Thu, 11 Dec 2025 11:55:19 +0000 (12:55 +0100)]
netfilter: nf_tables: avoid softlockup warnings in nft_chain_validate
This reverts commit 314c82841602 ("netfilter: nf_tables: can't schedule in nft_chain_validate"):
Since commit a60a5abe19d6 ("netfilter: nf_tables: allow iter callbacks to sleep")
the iterator callback is invoked without rcu read lock held, so this
cond_resched() is now valid.
Currently nf_tables will traverse the entire table (chain graph), starting
from the entry points (base chains), exploring all possible paths
(chain jumps). But there are cases where we could avoid revalidation.
Then the second rule does not need to revalidate j2, and, by extension j3,
because this was already checked during validation of the first rule.
We need to validate it only for rule 3.
This is needed because chain loop detection also ensures we do not exceed
the jump stack: Just because we know that j2 is cycle free, its last jump
might now exceed the allowed stack size. We also need to update all
reachable chains with the new largest observed call depth.
Care has to be taken to revalidate even if the chain depth won't be an
issue: chain validation also ensures that expressions are not called from
invalid base chains. For example, the masquerade expression can only be
called from NAT postrouting base chains.
Therefore we also need to keep record of the base chain context (type,
hooknum) and revalidate if the chain becomes reachable from a different
hook location.
Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Closes: https://lore.kernel.org/netfilter-devel/20251118221735.GA5477@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/ Tested-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Signed-off-by: Florian Westphal <fw@strlen.de>
fs: send fsnotify_xattr()/IN_ATTRIB from vfs_fileattr_set()/chattr(1)
Currently it seems impossible to observe these changes to the file's
attributes. It's useful to be able to do this to see when the file
becomes immutable, for example, so emit IN_ATTRIB via fsnotify_xattr(),
like when changing other inode attributes.
Amir Goldstein [Sun, 7 Dec 2025 10:44:55 +0000 (11:44 +0100)]
fsnotify: do not generate ACCESS/MODIFY events on child for special files
inotify/fanotify do not allow users with no read access to a file to
subscribe to events (e.g. IN_ACCESS/IN_MODIFY), but they do allow the
same user to subscribe for watching events on children when the user
has access to the parent directory (e.g. /dev).
Users with no read access to a file but with read access to its parent
directory can still stat the file and see if it was accessed/modified
via atime/mtime change.
The same is not true for special files (e.g. /dev/null). Users will not
generally observe atime/mtime changes when other users read/write to
special files, only when someone sets atime/mtime via utimensat().
Align fsnotify events with this stat behavior and do not generate
ACCESS/MODIFY events to parent watchers on read/write of special files.
The events are still generated to parent watchers on utimensat(). This
closes some side-channels that could be possibly used for information
exfiltration [1].
Reported-by: Sudheendra Raghav Neela <sneela@tugraz.at> CC: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
Namjae Jeon [Sun, 14 Dec 2025 06:06:34 +0000 (15:06 +0900)]
ksmbd: fix buffer validation by including null terminator size in EA length
The smb2_set_ea function, which handles Extended Attributes (EA),
was performing buffer validation checks that incorrectly omitted the size
of the null terminating character (+1 byte) for EA Name.
This patch fixes the issue by explicitly adding '+ 1' to EaNameLength where
the null terminator is expected to be present in the buffer, ensuring
the validation accurately reflects the total required buffer size.
Cc: stable@vger.kernel.org Reported-by: Roger <roger.andersen@protonmail.com> Reported-by: Stanislas Polu <spolu@dust.tt> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Namjae Jeon [Sun, 14 Dec 2025 06:05:56 +0000 (15:05 +0900)]
ksmbd: Fix refcount leak when invalid session is found on session lookup
When a session is found but its state is not SMB2_SESSION_VALID, It
indicates that no valid session was found, but it is missing to decrement
the reference count acquired by the session lookup, which results in
a reference count leak. This patch fixes the issue by explicitly calling
ksmbd_user_session_put to release the reference to the session.
Cc: stable@vger.kernel.org Reported-by: Alexandre <roger.andersen@protonmail.com> Reported-by: Stanislas Polu <spolu@dust.tt> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Chen Ni [Tue, 18 Nov 2025 01:32:29 +0000 (09:32 +0800)]
ksmbd: convert comma to semicolon
Replace comma between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it is seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
smb: server: defer the initial recv completion logic to smb_direct_negotiate_recv_work()
The previous change to relax WARN_ON_ONCE(SMBDIRECT_SOCKET_*) checks in
recv_done() and smb_direct_cm_handler() seems to work around the
problem that the order of initial recv completion and
RDMA_CM_EVENT_ESTABLISHED is random, but it's still
a bit ugly.
This implements a better solution deferring the recv completion
processing to smb_direct_negotiate_recv_work(), which is queued
only if both events arrived.
In order to avoid more basic changes to the main recv_done
callback, I introduced a smb_direct_negotiate_recv_done,
which is only used for the first pdu, this will allow
further cleanup and simplifications in recv_done
as a future patch.
smb_direct_negotiate_recv_work() is also very basic
with only basic error checking and the transition
from SMBDIRECT_SOCKET_NEGOTIATE_NEEDED to
SMBDIRECT_SOCKET_NEGOTIATE_RUNNING, which allows
smb_direct_prepare() to continue as before.
Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
smb: server: initialize recv_io->cqe.done = recv_done just once
smbdirect_recv_io structures are pre-allocated so we can set the
callback function just once.
This will make it easy to move smb_direct_post_recv to common code
soon.
Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>