Add tests that verify PSP notifications are delivered to listeners in
associated namespaces:
- _key_rotation_notify_multi_ns_netkit: triggers key rotation and
verifies the notification is received in both main and guest namespaces
- _dev_change_notify_multi_ns_netkit: triggers dev_set and verifies the
dev_change notification is received in both namespaces
Wei Wang [Mon, 8 Jun 2026 23:31:16 +0000 (16:31 -0700)]
selftests/net: psp: add dev-assoc data path test
Add _assoc_check_list() test that associates nk_guest with the PSP
device and verifies the assoc-list is correctly populated.
Add _data_basic_send_netkit_psp_assoc() which tests PSP data send
through a netkit interface associated with a PSP device. The test
associates nk_guest with the PSP device, then sends PSP-encrypted
traffic from the guest namespace.
Wei Wang [Mon, 8 Jun 2026 23:31:15 +0000 (16:31 -0700)]
selftests/net: psp: support PSP in NetDrvContEnv infrastructure
Add infrastructure to support PSP tests across network namespaces
using NetDrvContEnv with netkit pairs. This enables testing PSP device
association, where a non-PSP-capable device (e.g. netkit) in a guest
namespace is associated with a real PSP device in the host namespace,
allowing the guest to perform PSP encryption/decryption through the
host's PSP hardware.
env.py:
- nk_guest_ifindex is queried after moving the device into the guest
namespace, so tests can use it directly for dev-assoc
psp.py:
- PSP device lookup supports container environments where the PSP
device is on the physical interface, not the test interface
- Association helpers handle dev-assoc/dev-disassoc with defer-based
cleanup to prevent state leaks on test assertion failures
- main() tries NetDrvContEnv with primary_rx_redirect and falls back
to NetDrvEpEnv, so existing tests continue to work without the
container environment
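The defer-based cleanup described for the association helpers can be sketched as below. This is only an illustration of the pattern: it uses contextlib.ExitStack as a stand-in for the selftest defer machinery, and run_assoc_test(), assoc, disassoc and body are hypothetical names, not helpers from psp.py.

```python
from contextlib import ExitStack


def run_assoc_test(assoc, disassoc, body):
    """Associate, run the test body, and always disassociate afterwards."""
    with ExitStack() as cleanup:
        assoc()
        # Register the undo step right after the state change, so it runs
        # even if body() raises (for example on a failed assertion).
        cleanup.callback(disassoc)
        body()
```

Registering the cleanup immediately after the association succeeds is what prevents a failing assertion from leaking the association into the next test.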
Wei Wang [Mon, 8 Jun 2026 23:31:14 +0000 (16:31 -0700)]
selftests/net: rename _nk_host_ifname to nk_host_ifname
Rename _nk_host_ifname to nk_host_ifname in NetDrvContEnv to make it
a public attribute, matching the nk_guest_ifname rename. Tests that
access the host-side netkit interface name (e.g. for cleanup after
deleting the netkit pair) no longer trigger pylint protected-access
warnings.
Wei Wang [Mon, 8 Jun 2026 23:31:13 +0000 (16:31 -0700)]
selftests/net: add _find_bpf_obj() to search hw/ for BPF objects
Add _find_bpf_obj() helper to NetDrvContEnv that searches the test
directory first, then falls back to the hw/ subdirectory. This allows
tests outside drivers/net/hw/ (e.g. psp.py in drivers/net/) to find
BPF objects built in the hw/ directory.
Update _attach_bpf() and _attach_primary_rx_redirect_bpf() to use
_find_bpf_obj() for BPF object discovery.
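A minimal sketch of the search order described above, assuming the helper receives the test directory and an object file name. The function name, its signature and the None return on a miss are assumptions for illustration; the real _find_bpf_obj() may report a missing object differently.

```python
from pathlib import Path
from typing import Optional


def find_bpf_obj(test_dir: Path, name: str) -> Optional[Path]:
    """Return the first existing path for `name`, or None if not found.

    Search order follows the commit text: the test directory first,
    then its hw/ subdirectory.
    """
    for candidate in (test_dir / name, test_dir / "hw" / name):
        if candidate.is_file():
            return candidate
    return None
```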
Wei Wang [Mon, 8 Jun 2026 23:31:12 +0000 (16:31 -0700)]
selftests/net: psp: refactor test builders to use ksft_variants
Replace the manual psp_ip_ver_test_builder() and ipver_test_builder()
functions with @ksft_variants decorators for data_basic_send and
data_mss_adjust. This is a pure refactor with no behavior change.
Wei Wang [Mon, 8 Jun 2026 23:31:10 +0000 (16:31 -0700)]
psp: add new netlink cmd for dev-assoc and dev-disassoc
The main purpose of this cmd is to be able to associate a
non-psp-capable device (e.g. veth or netkit) with a psp device.
One use case is if we create a pair of veth/netkit, and assign 1 end
inside a netns, while leaving the other end within the default netns,
with a real PSP device, e.g. netdevsim or a physical PSP-capable NIC.
With this command, we could associate the veth/netkit inside the netns
with PSP device, so the virtual device could act as PSP-capable device
to initiate PSP connections, and perform PSP encryption/decryption on
the real PSP device.
Wei Wang [Mon, 8 Jun 2026 23:31:09 +0000 (16:31 -0700)]
psp: add admin/non-admin version of psp_device_get_locked
Introduce 2 versions of psp_device_get_locked:
1. psp_device_get_locked_admin(): This version is used for operations
that would change the status of the psd, and is currently used for
dev-set and key-rotation.
2. psp_device_get_locked(): This is the non-admin version, which is
used for broader user-issued operations including: dev-get, rx-assoc,
tx-assoc, get-stats.
A following commit will implement both of the checks.
Generic XDP devmap multi redirect can leave cloned skbs sharing packet
data. When a devmap egress program mutates packet data, another
destination sharing the same data may observe that mutation.
Fix this by making cloned skbs private before running the generic devmap
egress program. The private copy is made in dev_map_generic_redirect()
so dev_map_bpf_prog_run_skb() can keep returning the XDP action directly.
Add selftest coverage for the last-destination case, where the final
destination runs on the original skb while earlier destinations use
cloned skbs. The test records the source MAC observed by an earlier
destination and checks that it is neither the sentinel value left in the
result map nor the MAC written by the final destination.
---
v5:
- Move the skb_copy() check back to dev_map_generic_redirect() to keep
dev_map_bpf_prog_run_skb() returning only the XDP action.
- Preserve mac_len after skb_copy().
- Use __be64 temporary values when updating mac_map from userspace.
- Initialize rx_mac with a sentinel in the last-destination test instead
of relying on -ENOENT for ARRAY map lookups.
- Adjust the last-destination test topology so the checked earlier
destination is not the ingress/source veth.
- Split the last-destination check into two assertions: one for store_mac_1
updating rx_mac and one for detecting last-destination rewrite leakage.
v4: https://lore.kernel.org/bpf/20260611080850.536996-1-sun.jian.kdev@gmail.com/T/#mf830f03d362f33e0941d1b0e425169698fce76e5
- Preserve mac_len after skb_copy().
- Separate errno return from XDP action output in
dev_map_bpf_prog_run_skb().
- Zero-initialize net_config in the new selftest.
v3: https://lore.kernel.org/bpf/20260611043317.512843-1-sun.jian.kdev@gmail.com/
- Split the kernel fix and selftest into separate patches.
- Move the private-copy logic into dev_map_bpf_prog_run_skb().
- Use deterministic DEVMAP_HASH keys in the last-destination selftest.
- Fix the Fixes tag.
v2: https://lore.kernel.org/bpf/08c35c70-a59e-4e0e-91db-22b5ec30b611@linux.dev/
- Move the private-copy step into dev_map_generic_redirect() so the
last-destination path is covered as well.
- Use skb_copy() instead of skb_unshare() to keep caller ownership
unchanged on allocation failure.
- Add a generic XDP last-destination selftest case.
Strengthen xdp_veth_egress to check that each destination observes the
MAC selected for its own egress ifindex, instead of only checking that
the observed MAC differs from a single magic value.
Add a generic XDP last-destination test where an earlier destination does
not have a devmap egress program while the final destination does. This
covers the case where the final destination runs on the original skb and
could otherwise rewrite packet data still shared with an earlier cloned
skb.
Use deterministic DEVMAP_HASH keys for the egress map so the intended
last destination is stable. Initialize the result map with a sentinel
value and check that store_mac_1 overwrites it before checking that the
earlier destination did not observe the MAC written by the final
destination.
Sun Jian [Fri, 12 Jun 2026 11:40:31 +0000 (19:40 +0800)]
bpf: Run generic devmap egress prog on private skb
Generic XDP devmap multi redirect uses skb_clone() for intermediate
destinations and sends the last destination with the original skb. This
can leave multiple destinations sharing the same packet data.
This becomes visible after generic devmap egress-program support was
added: a devmap egress program may mutate packet data, and another
destination sharing the same data can observe that mutation.
Native XDP broadcast redirect does not have this issue because
xdpf_clone() copies the frame data for each destination. Generic XDP
should provide the same per-destination isolation before running a
devmap egress program.
Fix this by making cloned skbs private before running the generic devmap
egress program. Use skb_copy() instead of skb_unshare() so allocation
failure does not consume the skb and the existing caller error paths keep
their ownership semantics.
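The sharing problem and the fix can be illustrated with a toy model. This is not kernel code: ToySkb, egress_prog() and run_egress_private() are hypothetical stand-ins, and the sketch copies unconditionally for simplicity.

```python
class ToySkb:
    """Toy stand-in for an skb: a header object pointing at packet data."""

    def __init__(self, data: bytearray):
        self.data = data

    def clone(self) -> "ToySkb":
        # Like skb_clone(): new header, same underlying packet data.
        return ToySkb(self.data)

    def private_copy(self) -> "ToySkb":
        # Like skb_copy(): new header and a private copy of the data.
        return ToySkb(bytearray(self.data))


def egress_prog(skb: ToySkb, mac: bytes) -> None:
    """Toy devmap egress program: rewrite the source MAC (bytes 6..11)."""
    skb.data[6:12] = mac


def run_egress_private(skb: ToySkb, mac: bytes) -> ToySkb:
    """Run the egress program on a private copy, leaving shared data alone."""
    private = skb.private_copy()
    egress_prog(private, mac)
    return private
```

Running egress_prog() directly on the original skb also changes what an earlier clone sees, while run_egress_private() leaves the clone's view of the packet untouched.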
This series continues the rework of the KSZ driver initiated by two previous
series (see [1] & [2]).
The KSZ driver handles more than 20 switches split in several families.
This was previously handled through a common set of dsa_switch_ops
operations that used device-specific ksz_dev_ops callbacks. The two
previous series have split this common struct dsa_switch_ops into 5
to connect the ksz_dev_ops's implementations directly to the new
dsa_switch_ops.
This series continues in the same vein and removes the dsa_switch_ops
operations that aren't used.
On top of this on-going rework I added PTP and periodic output support for
the KSZ8463 (which was my first goal). There are still more than 20 patches
left for all this, so this series will be followed by three others. If you
want to see the full picture, you can check my GitHub ([3]).
FYI, I only have a KSZ8463 so, unfortunately, I can't test other switches.
The next series is going to move out of ksz_common.c the last remaining
functions that aren't truly common to all KSZ switches. The series after
that will add PTP support for the KSZ8463 and the final one will add
periodic output support for the KSZ8463.
net: dsa: microchip: implement port_teardown only if needed
The port_teardown() operation is optional. Yet, it is implemented by all
the KSZ switches through a common function that doesn't do anything for
the switches that aren't part of the ksz9477 family.
Remove the implementation from the switches that don't need it.
Implement instead a ksz9477-specific port_teardown.
All the switches use a common mdio_register() function that uses two
ksz_dev_ops callbacks (.mdio_bus_preinit() and .create_phy_addr_map())
to handle the lan937x specific case. These two callbacks are used only
at this place in the code.
Implement a new lan937x-specific MDIO registration function that uses
these two lan937x-specific functions. The lan937x bindings don't
have any 'interrupts' property so this lan937x_mdio_register() doesn't
call ksz_irq_phy_setup().
Expose the common ksz_*_mdio_{read/write} functions so they can be used
in lan937x.c
Remove the callbacks from ksz_dev_ops.
net: dsa: microchip: implement .{get/set}_wol only if needed
All the KSZ switches use common {get/set}_wol operations while only the
ksz9477 and the ksz87xx families really support it. These operations are
optional so there is no point implementing them to return -EOPNOTSUPP.
Remove the {get/set}_wol callbacks from the switch operations for the
ksz88xx, the ksz8463 and the lan937x families.
Remove the family check from the common {get/set}_wol implementation.
Note that is_ksz9477() is only true for the KSZ9477 so this change will
also add WoL support for the other switches using the
ksz9477_switch_ops. I checked their datasheets: they implement the same
PME_WOL registers, at the same addresses, so this should work fine.
Modify the ksz_wol_pre_shutdown() initial check to ensure consistency in
the WoL handling for these non-KSZ9477 switches using ksz9477_switch_ops.
net: dsa: microchip: implement .support_eee() only if needed
The .support_eee() operation is optional. Yet, it is implemented by the
KSZ switches through a common function that reports false for every chip
except for KSZ8563, KSZ9563 and KSZ9893 from the KSZ9477 family.
Remove the implementation from the switches that don't support EEE.
Also remove .set_mac_eee() for them as .set_mac_eee() is gated by the
`support_eee` presence in the core.
Implement instead a ksz9477-specific support_eee for these three supported
switches.
Note that the comment /* KSZ879x/KSZ877x/KSZ876x Errata DS80000687C Module 2 */
is removed completely because it concerns the KSZ87xx family, which doesn't
support EEE at all.
setup_rgmii_delay() operation is only used once during the common phylink
MAC configuration. Only the lan937x switch implements this
setup_rgmii_delay().
Remove the setup_rgmii_delay operation from ksz_dev_ops.
Implement a lan937x-specific phylink MAC configuration that does this
RGMII delay setup.
Export ksz_set_xmii since it's needed by the lan937x implementation.
net: dsa: microchip: wrap the MAC configuration checks in a function
The common .mac_config() implementation checks some conditions before
doing any register access. As this common implementation is about to be
split in the upcoming patch, these checks would lead to code
duplication.
Wrap all the checks in a need_config() function that returns true when
the driver really needs to access the switch registers to configure the
MAC.
net: dsa: microchip: implement get_phy_flags only if needed
The common ksz_get_phy_flags() is used by all the switches to implement
the optional .get_phy_flags DSA operation. It always returns 0 except
for KSZ88X3 switches where an errata has to be handled.
Make ksz_get_phy_flags() ksz88xx-specific.
Remove the get_phy_flags implementation for the switches that don't need
it.
net: dsa: microchip: remove useless common cls_flower_{add/del} operations
All the KSZ switches share a common implementation of the
cls_flower_{add/del} operations. These common implementations return
ksz9477-specific implementations for the KSZ9477 family and -EOPNOTSUPP
for the others. -EOPNOTSUPP is already returned by the DSA core when
the operation isn't implemented.
Remove the common implementations.
Directly link the ksz9477_cls_flower_{add/del}() to the KSZ9477 callback.
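The reasoning above, that an unimplemented optional operation already yields -EOPNOTSUPP from the core, can be modelled as follows. This is a toy sketch, not the DSA core: ToySwitchOps and core_cls_flower_add() are hypothetical names.

```python
import errno
from typing import Callable, Optional


class ToySwitchOps:
    """Toy ops table: an optional callback is simply left as None."""

    def __init__(self, cls_flower_add: Optional[Callable[[str], int]] = None):
        self.cls_flower_add = cls_flower_add


def core_cls_flower_add(ops: ToySwitchOps, rule: str) -> int:
    """Toy core dispatch: report -EOPNOTSUPP when the op is not implemented."""
    if ops.cls_flower_add is None:
        return -errno.EOPNOTSUPP
    return ops.cls_flower_add(rule)
```

With this dispatch, a driver-side wrapper that only returns -EOPNOTSUPP adds nothing, which is why the common implementations can go.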
Victor Nogueira [Thu, 11 Jun 2026 20:58:49 +0000 (17:58 -0300)]
net/sched: sch_dualpi2: Add missing module alias
When a qdisc is added by name, the kernel tries to autoload its module
via request_qdisc_module(), which calls:
request_module(NET_SCH_ALIAS_PREFIX "%s", name);
i.e. it asks modprobe to resolve the "net-sch-<kind>" alias (e.g.
"net-sch-dualpi2") rather than the module's file name. Since dualpi2
was shipped without this alias, the autoload fails:
tc qdisc add dev lo root handle 1: dualpi2
Error: Specified qdisc kind is unknown.
Fix this by adding the missing alias so the qdisc is autoloaded on demand
like the others.
Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20260611205849.3287640-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
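The alias lookup described in this commit can be modelled with a small sketch. The "net-sch-" prefix comes from the commit text; the alias table, resolve_alias() and the sch_dualpi2 module name are assumptions used only to illustrate why a missing alias makes autoload fail.

```python
from typing import Dict, Optional, Set

NET_SCH_ALIAS_PREFIX = "net-sch-"


def qdisc_module_alias(kind: str) -> str:
    """Build the alias that the kernel asks modprobe to resolve."""
    return NET_SCH_ALIAS_PREFIX + kind


def resolve_alias(alias: str, modules: Dict[str, Set[str]]) -> Optional[str]:
    """Toy modprobe: return the module that declares `alias`, if any."""
    for module, aliases in modules.items():
        if alias in aliases:
            return module
    return None
```

Before the fix the toy table has no "net-sch-dualpi2" entry and the lookup returns None. Adding the alias to the module's entry makes it resolve.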
Jakub Kicinski [Thu, 11 Jun 2026 17:21:49 +0000 (10:21 -0700)]
docs: networking: add guidance on what to push via extack
Every now and then someone tries to duplicate extack
messages to dmesg. Document our guidance against this.
Also indicate that system level faults should continue
to go to system logs. The high level thinking is to try
to distinguish between what's important to the user vs
system admin.
Vadim Fedorenko [Thu, 11 Jun 2026 19:03:33 +0000 (19:03 +0000)]
ptp: ocp: add shutdown callback
The shutdown callback was never implemented for this driver, but it's
needed because .remove() callback is never called during kexec/reboot
process. That leaves HW with some interrupts enabled and may cause
a spurious interrupt while booting into a new kernel with kexec.
If it happens that I2C interrupt fires during kexec, the whole I2C bus
is disabled leaving TimeCard with no devlink communication. The same
happens if timestampers were enabled, leaving the card without
timestamper interrupts until full reboot cycle.
Implement .shutdown() callback with the same function as remove
callback.
Ido Schimmel [Thu, 11 Jun 2026 15:46:05 +0000 (18:46 +0300)]
selftests: fib_tests: Add test cases for route lookup with oif
Test that both address families respect the oif parameter when a
matching multipath route is found, regardless of the presence of a
source address.
Output without "ipv6: Select best matching nexthop object in
fib6_table_lookup()" and "ipv6: Honor oif when choosing nexthop for
locally generated traffic":
IPv4 multipath oif test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with nexthop object test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with VRF test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [FAIL]
TEST: IPv6 multipath via second nexthop with source address [FAIL]
IPv6 multipath oif with nexthop object test
TEST: IPv6 multipath via first nexthop [FAIL]
TEST: IPv6 multipath via second nexthop [FAIL]
TEST: IPv6 multipath via first nexthop with source address [FAIL]
TEST: IPv6 multipath via second nexthop with source address [FAIL]
IPv6 multipath oif with VRF test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [FAIL]
TEST: IPv6 multipath via second nexthop with source address [FAIL]
Output with both patches:
IPv4 multipath oif test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with nexthop object test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with VRF test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [ OK ]
TEST: IPv6 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif with nexthop object test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [ OK ]
TEST: IPv6 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif with VRF test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [ OK ]
TEST: IPv6 multipath via second nexthop with source address [ OK ]
Ido Schimmel [Thu, 11 Jun 2026 15:46:04 +0000 (18:46 +0300)]
ipv6: Honor oif when choosing nexthop for locally generated traffic
Commit 741a11d9e410 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is
set") made the kernel honor the oif parameter when specified as part of
output route lookup:
# ip route add 2001:db8:1::/64 dev dummy1
# ip route add ::/0 dev dummy2
# ip route get 2001:db8:1::1 oif dummy2 fibmatch
default dev dummy2 metric 1024 pref medium
Due to regression reports, the behavior was partially reverted in commit
d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr
set") to only honor the oif if source address is not specified:
# ip route get 2001:db8:1::1 from 2001:db8:2::1 oif dummy2 fibmatch
2001:db8:1::/64 dev dummy1 metric 1024 pref medium
That is, when source address is specified, the kernel will choose the
most specific route even if its nexthop device does not match the
specified oif.
This creates a problem for multipath routes. After looking up a route,
when source address is not specified, the kernel will choose a nexthop
whose nexthop device matches the specified oif:
# sysctl -wq net.ipv6.conf.all.forwarding=1
# ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
# for i in {1..100}; do ip route get 2001:db8:10::${i} oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
But will disregard the oif when source address is specified despite the
fact that a matching nexthop exists:
# for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
53 dummy1
47 dummy2
This behavior differs from IPv4:
# ip address add 192.0.2.1/32 dev lo
# ip route add 198.51.100.0/24 nexthop via inet6 fe80::1 dev dummy1 nexthop via inet6 fe80::2 dev dummy2
# for i in {1..100}; do ip route get 198.51.100.${i} from 192.0.2.1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
What happens is that fib6_table_lookup() returns a route with a matching
nexthop device (assuming it exists):
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
But it is later overwritten during path selection in fib6_select_path()
which instead chooses a nexthop according to the calculated hash.
Solve this by telling fib6_select_path() to skip path selection if we
have an oif match during output route lookup (iif being
LOOPBACK_IFINDEX).
Behavior after the change:
# sysctl -wq net.ipv6.conf.all.forwarding=1
# ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
# for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
Note that enabling forwarding is only needed because we did not add
neighbor entries for the gateway addresses. When forwarding is disabled
and CONFIG_IPV6_ROUTER_PREF is not enabled in kernel config, the kernel
will treat non-existing neighbor entries as errors and perform
round-robin between the nexthops:
# sysctl -wq net.ipv6.conf.all.forwarding=0
# for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
50 dummy1
50 dummy2
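The change to path selection can be pictured with a toy model. It is a sketch under simplifying assumptions: select_nexthop() and its parameters are hypothetical, and the modulo-based hash selection only stands in for the real selection done by fib6_select_path().

```python
from typing import Optional, Sequence


def select_nexthop(nexthops: Sequence[str], lookup_dev: str, flow_hash: int,
                   oif: Optional[str], output_lookup: bool) -> str:
    """Pick the nexthop device for a multipath route.

    lookup_dev is the device of the nexthop the table lookup returned.
    """
    # New behavior: on an output lookup with an oif match, keep the
    # nexthop found by the table lookup and skip hash-based selection.
    if output_lookup and oif is not None and lookup_dev == oif:
        return lookup_dev
    # Otherwise spread flows across nexthops according to the hash.
    return nexthops[flow_hash % len(nexthops)]
```

With oif set to dummy2 and a lookup result on dummy2, every hash value yields dummy2, matching the "100 dummy2" output after the change.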
Ido Schimmel [Thu, 11 Jun 2026 15:46:03 +0000 (18:46 +0300)]
ipv6: Select best matching nexthop object in fib6_table_lookup()
Currently, when using multipath routes without nexthop objects,
fib6_table_lookup() selects the nexthop with the highest score. This
means that when both a source address and an oif are specified, the
nexthop that is chosen is the one that matches in terms of oif:
# sysctl -wq net.ipv6.conf.all.forwarding=1
# ip address add 2001:db8:2::1/64 dev lo
# ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
When using nexthop objects, fib6_table_lookup() selects the first
matching nexthop and not necessarily the one with the highest score:
# ip nexthop add id 1 via fe80::1 dev dummy1
# ip nexthop add id 2 via fe80::2 dev dummy2
# ip nexthop add id 3 group 1/2
# ip route add 2001:db8:20::/64 nhid 3
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
This is not very significant right now because the nexthop is later
overwritten during path selection in fib6_select_path(). However, the
next patch is going to skip path selection when we have an oif match
during output route lookup.
As a preparation for this change, align the nexthop object behavior with
the legacy one and make sure that fib6_table_lookup() always selects the
best matching nexthop. Do that by always returning 0 from
rt6_nh_find_match() in order not to terminate the loop in
nexthop_for_each_fib6_nh() and storing in arg->nh the best matching
nexthop so far.
Behavior after the change:
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
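The difference between stopping at the first match and keeping the best match can be sketched as follows. This is a toy model: the score values and all function names are hypothetical, and only the iteration contract (a nonzero return ends the loop) mirrors what the commit describes for nexthop_for_each_fib6_nh().

```python
def score(nh_dev, oif):
    """Hypothetical scoring: a nexthop on the requested oif scores higher."""
    return 2 if nh_dev == oif else 1


def for_each_nh(nexthops, cb, arg):
    """Call cb for each nexthop; a nonzero return terminates the loop."""
    for nh in nexthops:
        ret = cb(nh, arg)
        if ret:
            return ret
    return 0


def find_first_match(nh, arg):
    # Old behavior: accept the first usable nexthop and stop iterating.
    arg["nh"] = nh
    return 1


def find_best_match(nh, arg):
    # New behavior: remember the best nexthop so far and keep iterating.
    s = score(nh, arg["oif"])
    if s > arg["best_score"]:
        arg["best_score"] = s
        arg["nh"] = nh
    return 0


def lookup(nexthops, oif, cb):
    arg = {"oif": oif, "nh": None, "best_score": -1}
    for_each_nh(nexthops, cb, arg)
    return arg["nh"]
```

With nexthops dummy1 and dummy2 and oif dummy2, the first-match callback returns dummy1 while the best-match callback returns dummy2, matching the before and after outputs above.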
Breno Leitao [Wed, 10 Jun 2026 14:26:04 +0000 (07:26 -0700)]
netconsole: clear cached dev_name on resume-window cleanup
When process_resume_target() catches a device that was unregistered
while the target was off target_list, it calls do_netpoll_cleanup() to
release the reference but leaves the cached np.dev_name in place. The
other cleanup path, netconsole_process_cleanups_core(), already wipes
dev_name for MAC-bound targets because the name was only a cache of the
device that last carried the MAC and may no longer match.
The pattern is the same in both spots, so fold it into a small helper
netcons_release_dev() and route both call sites through it. This makes
the resume-window cleanup consistent with the notifier-driven one so a
later enable does not let netpoll_setup() pick a stale interface by name
when the user bound the target by MAC.
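The shared helper can be pictured with a toy model. ToyTarget, its fields and release_dev() are hypothetical. The sketch only shows the rule from the text: release the device, and forget the cached name when the target was bound by MAC.

```python
from dataclasses import dataclass


@dataclass
class ToyTarget:
    dev_name: str
    bound_by_mac: bool
    dev_held: bool = True


def release_dev(target: ToyTarget) -> None:
    """Release the device reference and drop a stale cached name."""
    # Stands in for do_netpoll_cleanup() releasing the reference.
    target.dev_held = False
    # For MAC-bound targets the name was only a cache of the device that
    # last carried the MAC, so it must not be reused on a later enable.
    if target.bound_by_mac:
        target.dev_name = ""
```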
Eric Dumazet [Thu, 11 Jun 2026 15:27:37 +0000 (15:27 +0000)]
net: watchdog: fix refcount tracking races
Blamed commit converted the untracked dev_hold()/dev_put() calls
in the watchdog code to use the tracked dev_hold_track()/dev_put_track()
(which were later renamed/interfaced to netdev_hold() and netdev_put()).
By introducing dev->watchdog_dev_tracker to store the
reference tracking information without adding synchronization
between netdev_watchdog_up() and dev_watchdog(), it enabled the
race condition where this pointer could be overwritten or freed
concurrently, leading to the list corruption crash syzbot reported.
Fix this in three steps:
1) Add dev->watchdog_lock and dev->watchdog_ref_held to serialize watchdog operations.
2) Remove netdev_watchdog_up() call from netif_carrier_on():
This ensures netdev_watchdog_up() is only called from process/BH context
(via linkwatch workqueue dev_activate()), allowing us to use
spin_lock_bh() for synchronization.
3) Synchronize watchdog up and watchdog timer:
Protect netdev_watchdog_up() with tx_global_lock and watchdog_lock.
Only allocate a new tracker in netdev_watchdog_up() if one is
not already present.
In dev_watchdog(), ensure we don't release the tracker if the
timer was rescheduled either by dev_watchdog() itself or concurrently
by netdev_watchdog_up().
Fixes: f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
Reported-by: syzbot+381d82bbf0253710b35d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a26b751.c25708ab.1b19ef.0013.GAE@google.com/T/#u
Tested-by: syzbot+3479efbc2821cb2a79f2@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260611152737.2580480-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dragos Tatulea [Thu, 11 Jun 2026 16:03:41 +0000 (19:03 +0300)]
selftests: iou-zcrx: defer listen() until after zcrx setup
The server binds the queues for zero-copy after listen(). If the client
does a connect() during this time it can fail with EHOSTUNREACH on
a cold system. This was encountered with the mlx5 driver where binding
the .ndo_queue_start() is a slow operation during which no packets
can be exchanged.
This change moves listen() after queue binding, when the test server is
fully operational.
====================
net: mana: fix error-path issues in queue setup
Two error-path fixes in MANA queue setup, both surfaced during Sashiko
AI review of a recently upstreamed patch series.
Patch 1 initializes queue->id to INVALID_QUEUE_ID in
mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
firmware id is assigned does not NULL gc->cq_table[0] and silently
break whichever real CQ owns that slot. This mirrors the existing
pattern in mana_gd_create_eq().
Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
and a spurious error in dmesg.
====================
Aditya Garg [Mon, 8 Jun 2026 10:13:41 +0000 (03:13 -0700)]
net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
mana_create_txq() has several error paths (after mana_alloc_queues() or
mana_create_wq_obj() failure) where tx_qp[i].tx_object stays as the
INVALID_MANA_HANDLE sentinel set at allocation. mana_destroy_txq() then
unconditionally calls mana_destroy_wq_obj() with (u64)-1, which firmware
rejects and logs an error.
Mirror the RX-side pattern in mana_destroy_rxq() and skip the destroy
when the handle is still INVALID_MANA_HANDLE.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260608101345.2267320-3-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
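The guard described in this commit amounts to a sentinel check before the destroy call. A minimal sketch, where the (u64)-1 value comes from the commit text and destroy_txq() with its callback parameter is a hypothetical stand-in for the driver code:

```python
from typing import Callable

INVALID_MANA_HANDLE = (1 << 64) - 1  # (u64)-1, the sentinel set at allocation


def destroy_txq(tx_object: int, destroy_wq_obj: Callable[[int], None]) -> bool:
    """Destroy the WQ object only if a real handle was ever assigned.

    Returns True when the destroy request was issued.
    """
    if tx_object == INVALID_MANA_HANDLE:
        # Creation failed before a handle was assigned: nothing to destroy.
        return False
    destroy_wq_obj(tx_object)
    return True
```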
Aditya Garg [Mon, 8 Jun 2026 10:13:40 +0000 (03:13 -0700)]
net: mana: initialize gdma queue id to INVALID_QUEUE_ID
mana_gd_create_mana_wq_cq() leaves queue->id as 0 (from kzalloc_obj())
until mana_create_wq_obj() assigns the firmware-returned id. If creation
fails before that, cleanup calls mana_gd_destroy_cq() with id 0, NULLing
gc->cq_table[0] and silently breaking whichever real CQ owns that slot.
Initialize queue->id to INVALID_QUEUE_ID right after allocation, matching
mana_gd_create_eq(). The existing (id >= max_num_cqs) guard then
short-circuits cleanly.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260608101345.2267320-2-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
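Why a zero-initialized id is dangerous, and why the sentinel is safe, can be seen in a toy model of the CQ table. The numeric value of INVALID_QUEUE_ID below is an assumption, and destroy_cq() is a hypothetical stand-in that keeps only the bounds check mentioned in the text.

```python
INVALID_QUEUE_ID = 0xFFFFFFFF  # assumed value, only needs to be out of range


def destroy_cq(cq_table: list, queue_id: int) -> None:
    """Toy cleanup: clear the table slot unless the id is out of range."""
    if queue_id >= len(cq_table):
        # Mirrors the existing (id >= max_num_cqs) guard.
        return
    cq_table[queue_id] = None
```

Cleaning up a never-created queue whose id is still 0 wipes slot 0, which belongs to a real CQ. The same cleanup with INVALID_QUEUE_ID falls through the guard and leaves the table intact.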
====================
net: mdio: realtek-rtl9300: Add RTL931x support
The Realtek Otto switch platform consists of four different series
- RTL838x aka maple : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango : 56 port 1G/2.5G/10G Switches
This patch series adds support for the RTL931x devices. To that end:
- Enhance the device tree binding.
- Implement final cleanups and enhancements for the driver.
- Add the RTL931x code.
Remark: Instead of this series it was planned to bring support for
hardware polling configuration first. It turns out that more testing
is needed - especially for the RTL83xx SoCs. Instead add the lineup
of the RTL931x devices, that are known to have no obvious bus and
polling issues (at least from testing and vendor SDK perspective).
====================
net: mdio: realtek-rtl9300: Add support for RTL931x
The MDIO driver has been prepared for multiple device support. Add all
required bits for the RTL931x (aka mango) series. This is straightforward
but a few points are worth mentioning.
- In contrast to RTL930x the I/O register has the input/output fields
swapped. Upper 16 bits are for read/outputs, and the lower 16 bits
are for write/inputs.
- There are 8192 supported "pages", so the raw page is 8191.
- The devices support up to 56 ports. Thus the MAX_PORTS definition
is increased by this commit.
- There are multiple global SMI controller registers with a different
layout from RTL930x devices. Therefore a separate setup_controller()
callback is added.
net: mdio: realtek-rtl9300: Add registers for high port count models
The high port count models of the Realtek Otto switches have additional
registers to instrument the MDIO controller. These are:
- High port mask: A bitfield that extends the already existing low port
mask to select ports starting from 32.
- Broadcast: This takes the port number during reads on the RTL931x.
- Extended page: Some additional page info. The SDK does not give much
information about this. Basically some fixed value must be written
into it during access.
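As a rough illustration of how a high port mask extends the low one, the sketch below splits a 64-bit port bitmap into two 32-bit register values. The struct and function names are hypothetical, and the real register layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of 32-bit mask registers. */
struct port_mask_regs {
	uint32_t low;   /* ports 0..31 */
	uint32_t high;  /* ports 32 and up */
};

/* Split a 64-bit port bitmap into the two register values. */
static struct port_mask_regs split_port_mask(uint64_t ports)
{
	struct port_mask_regs r = {
		.low  = (uint32_t)(ports & 0xffffffffu),
		.high = (uint32_t)(ports >> 32),
	};
	return r;
}

/* Build the register pair that selects a single port. */
static struct port_mask_regs mask_for_port(unsigned int port)
{
	assert(port < 64);
	return split_port_mask(UINT64_C(1) << port);
}
```

Port 5 lands in bit 5 of the low register, while port 32 becomes bit 0 of the high register.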
net: mdio: realtek-rtl9300: Make otto_emdio_read_cmd() generic
The otto_emdio_read_cmd() helper still uses RTL9300 specific properties.
This cannot be made generic as the I/O register has different layouts for
the different SoCs. E.g.
- RTL930x: data in bits 31-16, data out bits 15-0
- RTL931x: data in bits 15-0, data out bits 31-16
Add a mask parameter to the function signature and fill it properly
in the callers. As the masks will always have bits set from constant
defines, there is no need for a consistency check.
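A minimal sketch of the mask-parameter idea, assuming "data out" is the field holding the read result. The macro and function names are made up for illustration; only the bit positions come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the commit message; names are hypothetical. */
#define TOY_RTL930X_DATA_OUT 0x0000ffffu   /* bits 15-0  */
#define TOY_RTL931X_DATA_OUT 0xffff0000u   /* bits 31-16 */

/* Extract the field selected by a contiguous, non-zero mask. */
static uint16_t toy_field_get(uint32_t mask, uint32_t reg)
{
	unsigned int shift = 0;

	assert(mask != 0);
	while (!((mask >> shift) & 1u))
		shift++;                /* find the lowest set bit of the mask */
	return (uint16_t)((reg & mask) >> shift);
}
```

The same helper serves both layouts: only the mask passed by the caller changes.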
net: mdio: realtek-rtl9300: Add prefix to register field defines
The current Realtek Otto MDIO driver has some define leftovers without
a SoC prefix. When adding new devices there will be an overlap for some
of them. Sort this out as follows:
- PHY_CTRL_CMD/PHY_CTRL_MMD_DEVAD/PHY_CTRL_MMD_REG are common for all
series. Leave them as is but move them into a separate block.
- Add RTL9300 prefix to all other defines and adapt the callers.
dt-bindings: net: realtek,rtl9301-mdio: Add RTL931x series
The 10G Realtek Otto switches are divided into two series
- Longan: RTL930x up to 28 ports
- Mango : RTL931x up to 56 ports
The Mango based devices have 3 different SoCs RTL9311, RTL9312 and RTL9313.
The MDIO controller of these switches works like the existing RTL930x
logic but has different characteristics and different registers. Add new
compatibles in the device tree.
Linus Torvalds [Sat, 13 Jun 2026 00:23:05 +0000 (17:23 -0700)]
Merge tag 'pinctrl-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Two fixes for the mcp23s08 driver.
- Revert an earlier fix to the AMD pin controller that was all wrong. A
proper fix is being developed.
* tag 'pinctrl-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
Revert "pinctrl-amd: enable IRQ for WACF2200 touchscreen on Lenovo Yoga 7 14AGP11"
pinctrl: mcp23s08: Read spi-present-mask as u8 not u32
pinctrl: mcp23s08: Initialize mcp->dev and mcp->addr before regmap init
====================
Avoid mistaken parent class deactivation during peek
Several qdiscs (fq_codel, codel and dualpi2) may drop packets while
peeking at their queue. When that happens they call
qdisc_tree_reduce_backlog() to notify the parent of the backlog/qlen
change. The problem is that they do so *before* reincrementing the qlen
that peek had temporarily decremented.
If the qlen momentarily drops to zero while peek still has an skb to
return, qdisc_tree_reduce_backlog() ends up invoking the parent's
qlen_notify() callback even though the child is not actually empty. The
parent then deactivates the class, while the child still holds a packet.
For parents such as QFQ this desync corrupts the active class list and
leads to wild memory accesses and NULL pointer dereferences (see the
per-patch splats). For HFSC it might lead to stalls [1].
Fix all three qdiscs the same way: only call qdisc_tree_reduce_backlog()
once the qlen has been restored, so the parent never observes a
transient empty child during peek.
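The ordering problem can be reproduced with a toy model. Everything below is a hypothetical simplification: toy_reduce_backlog() merely stands in for qdisc_tree_reduce_backlog() and deactivates the parent class whenever it sees a zero qlen.

```c
#include <assert.h>

struct toy_parent { int class_active; };
struct toy_child  { int qlen; struct toy_parent *parent; };

/* Stand-in for the parent notification: an empty child deactivates the class. */
static void toy_reduce_backlog(struct toy_child *c)
{
	if (c->qlen == 0)
		c->parent->class_active = 0;
}

/* Peek removes the dropped packets plus the peeked one, then re-adds the
 * peeked packet. notify_first selects the buggy ordering. */
static void toy_peek(struct toy_child *c, int dropped, int notify_first)
{
	assert(c->qlen >= dropped + 1);
	c->qlen -= dropped + 1;
	if (dropped && notify_first)
		toy_reduce_backlog(c);  /* buggy: qlen may be transiently 0 */
	c->qlen++;                      /* the peeked skb is still held by the child */
	if (dropped && !notify_first)
		toy_reduce_backlog(c);  /* fixed: parent sees the restored qlen */
}

/* Two packets queued, one dropped during peek, one returned by peek. */
static int class_active_after_peek(int notify_first)
{
	struct toy_parent p = { .class_active = 1 };
	struct toy_child c = { .qlen = 2, .parent = &p };

	toy_peek(&c, 1, notify_first);
	return p.class_active;
}
```

With the buggy ordering the class is deactivated although the child still holds one packet; with the fixed ordering it stays active.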
Patch 1 fixes this for fq_codel, patch 2 for codel, patch 3 for dualpi2
and patch 4 adds test cases for these 3 setups.
Note: Patch 1 is one of two fixes for the stall reported in [1]; the
companion fix is "net/sched: sch_hfsc: Don't make class passive twice",
sent separately.
Note2: A possibly cleaner fix is to create a new peek helper that only
calls qdisc_tree_reduce_backlog() after reincrementing the qlen, and to
call it from the 3 vulnerable qdiscs. We thought this might make
backporting harder, so, if people agree, we can submit that cleaner
version to net-next after this one is merged.
Victor Nogueira [Wed, 10 Jun 2026 19:28:55 +0000 (16:28 -0300)]
selftests/tc-testing: Verify child qdisc will not mistakenly deactivate QFQ parent
Create 3 test cases:
- Verify fq_codel won't mistakenly deactivate QFQ parent class during peek
- Verify codel won't mistakenly deactivate QFQ parent class during peek
- Verify dualpi2 won't mistakenly deactivate QFQ parent class during peek
Verify that these 3 qdiscs (fq_codel, codel, dualpi2) will not call
qdisc_tree_reduce_backlog with an incorrect qlen (0) during peek and
mistakenly deactivate a parent class.
Victor Nogueira [Wed, 10 Jun 2026 19:28:54 +0000 (16:28 -0300)]
net/sched: sch_dualpi2: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen
Whenever dualpi2 drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will be
executed even though dualpi2 still has 1 packet on the queue and, thus,
mistakenly deactivates the parent's class which leads to a null-ptr-deref:
Victor Nogueira [Wed, 10 Jun 2026 19:28:53 +0000 (16:28 -0300)]
net/sched: sch_codel: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen
Whenever codel drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will
be executed even though codel still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class causing issues like a wild
memory access when qfq has codel as a child:
Victor Nogueira [Wed, 10 Jun 2026 19:28:52 +0000 (16:28 -0300)]
net/sched: sch_fq_codel: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen
Whenever fq_codel drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will be
executed even though fq_codel still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class causing issues like a recent
report [1] and a wild memory access in qfq:
====================
ipv6: mcast: annotate data races in /proc/net/igmp6
/proc/net/igmp6 walks IPv6 multicast memberships under RCU without
holding idev->mc_lock, taking a lockless snapshot of two fields that
writers update under the lock: mca_flags and mca_work.timer.expires.
Patch 1 adds WRITE_ONCE() to all mca_flags update sites and READ_ONCE()
to the procfs reader. Patch 2 does the same for the timer.expires read
in the procfs path.
====================
Yuyang Huang [Tue, 9 Jun 2026 08:11:13 +0000 (17:11 +0900)]
ipv6: mcast: annotate igmp6 timer expiry race
/proc/net/igmp6 walks IPv6 multicast memberships under RCU and reads
mca_work.timer.expires to print the remaining multicast timer. The
delayed-work timer can be updated concurrently.
Annotate the intentional lockless procfs snapshot with READ_ONCE().
Yuyang Huang [Tue, 9 Jun 2026 08:11:12 +0000 (17:11 +0900)]
ipv6: mcast: annotate data-races around mca_flags
/proc/net/igmp6 walks IPv6 multicast memberships under RCU and
prints mca_flags without holding idev->mc_lock. The multicast paths
update the field while holding idev->mc_lock.
Annotate this intentional lockless snapshot with READ_ONCE() and the
matching writers with WRITE_ONCE().
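For readers unfamiliar with the annotation pattern, here is a user-space sketch. The TOY_ macros are simplified stand-ins built on volatile accesses and the GCC/Clang __typeof__ extension; the kernel's real READ_ONCE()/WRITE_ONCE() do more. The struct is a hypothetical reduction of the multicast entry.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, not the kernel definitions. */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct toy_mca { unsigned int mca_flags; };

/* Writer side: assumed to run with the lock held, so the plain read of the
 * old value is fine; only the store is annotated for lockless readers. */
static void toy_set_flag(struct toy_mca *mc, unsigned int flag)
{
	assert(mc != NULL);
	TOY_WRITE_ONCE(mc->mca_flags, mc->mca_flags | flag);
}

/* Reader side: lockless snapshot, as a procfs-style dump would take. */
static unsigned int toy_snapshot_flags(struct toy_mca *mc)
{
	return TOY_READ_ONCE(mc->mca_flags);
}
```

The annotations do not add synchronization; they mark the access as intentionally racy and make each access a single, non-elided load or store.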
Jakub Kicinski [Fri, 12 Jun 2026 23:48:57 +0000 (16:48 -0700)]
Merge branch 'rxrpc-miscellaneous-fixes'
David Howells says:
====================
rxrpc: Miscellaneous fixes
Here are some miscellaneous AF_RXRPC fixes:
(1) Make sure rxrpc_verify_data() allocates a buffer, even if the DATA
packet being looked at is zero length to avoid potential NULL-pointer
exceptions.
(2) Don't move an OOB message (e.g. an RxGK CHALLENGE) off the receive
queue onto the pending queue in recvmsg() if MSG_PEEK is specified.
(3) Fix a potential UAF in rxgk_issue_challenge() in which a tracepoint
refers to memory just freed by a different pointer.
(4) Fix afs net namespace teardown to cancel the incoming call
preallocation charger before we disable listening (which will delete
the preallocation queue).
(5) Fix rxrpc_kernel_charge_accept() to use the socket mutex to defend
against listen(0)/shutdown simultaneously deleting the preallocation
queue.
====================
Li Daming [Tue, 9 Jun 2026 14:09:09 +0000 (15:09 +0100)]
rxrpc: serialize kernel accept preallocation with socket teardown
rxrpc_kernel_charge_accept() reads rx->backlog without any
socket/backlog synchronization and passes that raw pointer into
rxrpc_service_prealloc_one(). A concurrent rxrpc_discard_prealloc()
sets rx->backlog = NULL and frees the backlog rings, so a kernel
preallocation worker can keep using a freed struct rxrpc_backlog
while updating *_backlog_head/tail and array slots.
Serialize the state check and backlog lookup with the socket lock,
and reject kernel preallocation once teardown has disabled
listening or discarded the service backlog.
Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Li Daming <d4n.for.sec@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-6-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 9 Jun 2026 14:09:08 +0000 (15:09 +0100)]
afs: Fix netns teardown to cancel the preallocation charger
Fix the teardown of an afs network namespace to make sure it cancels the
work item that keeps the preallocated rxrpc call/conn/peer queue charged
before incoming calls are disabled (i.e. listen 0).
Also, if net->live is false because the afs netns is being deleted, make
afs_charge_preallocation() skip charging and make afs_rx_new_call() avoid
requeuing the charger.
(This was found by AI review).
Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Simon Horman <horms@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 9 Jun 2026 14:09:07 +0000 (15:09 +0100)]
rxrpc: Fix UAF in rxgk_issue_challenge()
Fix rxgk_issue_challenge() to free the page containing the challenge
content after invoking the tracepoint as the whdr passed to the tracepoint
points into the page just freed.
Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-4-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hyunwoo Kim [Tue, 9 Jun 2026 14:09:06 +0000 (15:09 +0100)]
rxrpc: Don't move a peeked OOB message onto the pending queue
rxrpc_recvmsg_oob() takes a received oob message off recvmsg_oobq and,
if a response is needed, moves it onto the pending_oobq tree. However,
only the unlink from recvmsg_oobq is guarded by MSG_PEEK; the move onto
pending_oobq always runs.
As a result, reading a challenge with MSG_PEEK leaves the skb on
recvmsg_oobq while also adding it to pending_oobq. Since struct
sk_buff's rbnode shares storage with its next and prev pointers,
rb_insert_color() overwrites the list linkage, and the skb, which holds
a single reference, becomes reachable from both queues at once.
When the socket is closed both queues are drained in turn. While
draining recvmsg_oobq, __skb_unlink() follows the next and prev
pointers that rbnode has overwritten and writes to a bad address. Also,
as the skb holds a single reference but is freed from each queue, both
the skb and the connection reference it holds are released twice. This
leads to memory corruption and to a use-after-free caused by the
connection refcount underflow.
MSG_PEEK does not consume the message from the queue, so only unlink it
from recvmsg_oobq and then move it onto pending_oobq or free it when
the message is actually consumed.
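The intended control flow can be sketched with a toy state model. The struct and function below are hypothetical; the booleans only track which queue a message is on and say nothing about the real skb handling.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_oob_msg {
	bool on_recvmsg_oobq;
	bool on_pending_oobq;
	bool freed;
};

/* Model of the receive path after the fix: every queue manipulation is
 * skipped when the caller only peeks. */
static void toy_recv_oob(struct toy_oob_msg *m, bool peek, bool needs_response)
{
	assert(m->on_recvmsg_oobq);
	/* ... the message would be copied to the caller here ... */
	if (peek)
		return;                       /* MSG_PEEK: leave the message in place */
	m->on_recvmsg_oobq = false;           /* consumed: unlink first ... */
	if (needs_response)
		m->on_pending_oobq = true;    /* ... then park it until the response */
	else
		m->freed = true;              /* ... or release it */
}
```

In this model a message is never on both queues at once, which is the invariant the shared rbnode/list storage requires.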
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxrpc_recvmsg_data() calls rxrpc_verify_data() whenever the
rxrpc_call.rx_dec_buffer is unallocated and assumes that, upon
successful return, rx_dec_buffer has been allocated.
However, rxrpc_verify_data() does not request an allocation if
the rxrpc_skb_priv.len is zero.
In addition, failure to allocate rx_dec_buffer will result in a
call to skb_copy_bits() with a NULL destination which can
trigger a NULL pointer dereference.
To prevent these issues rxrpc_verify_data() is modified to
always attempt to allocate the rxrpc_call.rx_dec_buffer if it
is NULL.
This issue was identified with assistance of a private
sashiko instance.
Fixes: d2bc90cf6c75cb ("rxrpc: Fix DATA decrypt vs splice() by copying data to buffer in recvmsg")
Reported-by: Simon Horman <simon.horman@redhat.com>
Signed-off-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jiayuan Chen <jiayuan.chen@linux.dev>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sabrina Dubroca [Thu, 11 Jun 2026 10:21:33 +0000 (12:21 +0200)]
tls: remove tls_toe and the related driver
The tls_toe feature and its single user (chelsio chtls) have been
unmaintained for multiple years. It also hooks into the core of the
TCP implementation, and bypasses most of the networking stack.
Jakub Kicinski [Thu, 11 Jun 2026 20:03:55 +0000 (13:03 -0700)]
ethtool: tsconfig: always take rtnl_lock
mlx5 throws ASSERT_RTNL() warnings on timestamp config, because
it tries to update features. mlx5e_hwtstamp_set() calls
netdev_update_features().
I missed this while grepping the drivers because tsconfig goes
through ndo_hwtstamp_set/get, not ethtool ops, even though the new
uAPI is in ethtool Netlink. We could add a dedicated opt out bit
for mlx5, but NDOs were not supposed to be part of the ethtool locking
conversion in the first place.
The mlx5 features update is related to the "compressed CQE" format
which lacks timestamp, apparently. See commit c0194e2d0ef0 ("net/mlx5e:
Disable rxhash when CQE compress is enabled").
Fixes: f9a3e05114b8 ("net: ethtool: optionally skip rtnl_lock on Netlink path for SET ops")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260611200355.2020663-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Thu, 11 Jun 2026 16:36:48 +0000 (18:36 +0200)]
virtio_net: do not allow tunnel csum offload for non GSO packets
Fiona reports broken connectivity for a virtio net setup using a UDP
tunnel inside the guest and a NIC without UDP tunnel TSO support in the
host.
Currently the virtio_net driver exposes csum offload for UDP-tunneled,
TCP non GSO packets. Such packets reach the host as CSUM_PARTIAL ones
with the 'encapsulation' flag cleared, as the virtio specification does
not support this specific kind of offload.
HW NICs with UDP tunnel TSO support - and those drivers directly
accessing skb->csum_start/csum_offset - are still capable of computing
the needed csum correctly, but otherwise the packets reach the wire with
bad csum on both the inner and outer transport header.
Address the issue by explicitly disabling csum offload for UDP tunneled,
non GSO packets via the ndo_features_check op.
Fixes: 56a06bd40fab ("virtio_net: enable gso over UDP tunnel support.")
Reported-by: Fiona Ebner <f.ebner@proxmox.com>
Closes: https://bugzilla.proxmox.com/show_bug.cgi?id=7627
Tested-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Gabriel Goller <g.goller@proxmox.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Gabriel Goller <g.goller@proxmox.com>
Tested-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/6c3b6c47fb05c100f384630dc48f3975cf37b67a.1781195144.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: atm: reject out-of-range traffic classes in QoS validation
Reject ATM traffic classes above ATM_ANYCLASS in check_tp().
SO_ATMQOS stores the supplied QoS after check_qos() succeeds, so
accepting larger values leaves invalid traffic_class values in
vcc->qos.
That bad state later reaches pvc_info(), which indexes class_name[]
with vcc->qos.{rx,tp}.traffic_class. Values above ATM_ANYCLASS cause
an out-of-bounds read when /proc/net/atm/pvc is read.
Tighten the existing QoS validation so invalid traffic_class values
are rejected at the point where user supplied QoS is accepted.
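A small sketch of validating at the point of acceptance so that later table lookups stay in bounds. The enum values, the name table and both functions are hypothetical illustrations and do not reproduce the kernel's check_tp() or class_name[].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical traffic classes; the last one plays the role of ATM_ANYCLASS. */
enum {
	TOY_ATM_NONE, TOY_ATM_UBR, TOY_ATM_CBR,
	TOY_ATM_VBR, TOY_ATM_ABR, TOY_ATM_ANYCLASS
};

/* Illustrative name table with one entry per valid class. */
static const char *const toy_class_name[TOY_ATM_ANYCLASS + 1] = {
	"off", "UBR", "CBR", "VBR", "ABR", "any"
};

/* Reject anything beyond the highest defined class. */
static int toy_check_tp(unsigned int traffic_class)
{
	return traffic_class > TOY_ATM_ANYCLASS ? -1 : 0;
}

/* Only values that passed validation are ever used as an index. */
static const char *toy_class_to_name(unsigned int traffic_class)
{
	if (toy_check_tp(traffic_class))
		return NULL;
	return toy_class_name[traffic_class];
}
```

Rejecting the value when it is stored means the reader side needs no bounds check of its own.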
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/58f02c6f73d9818fd5d2022e1116759fdde6116b.1780965530.git.zcliangcn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sechang Lim [Thu, 11 Jun 2026 09:29:18 +0000 (09:29 +0000)]
tcp: clear sock_ops cb flags before force-closing a child socket
A child socket inherits the listener's bpf_sock_ops_cb_flags via
sk_clone_lock(). If its setup fails in tcp_v4_syn_recv_sock() /
tcp_v6_syn_recv_sock(), the child is freed through put_and_exit, where
inet_csk_prepare_forced_close() drops the socket lock and tcp_done() runs
without it.
If BPF_SOCK_OPS_STATE_CB_FLAG was inherited, tcp_done() -> tcp_set_state()
calls tcp_call_bpf(), which expects the lock and trips sock_owned_by_me():
The child is freed before it is ever established, so it should run no
sock_ops callback. Clear its cb flags in inet_csk_prepare_for_destroy_sock(),
the common point for the IPv4, IPv6 and chtls forced-close paths and for the
MPTCP ->syn_recv_sock() failure path (dispose_child), which reaches tcp_done()
on a child that was never established too.
Jakub Kicinski [Tue, 9 Jun 2026 20:12:22 +0000 (13:12 -0700)]
docs: net: fix minor issues with XDP metadata docs
Minor updates to the XDP metadata documentation:
- s/union/struct/ for xsk_tx_metadata
- document nested request and completion metadata fields
- point capability queries at the xsk-features attribute
- fix grammar in the XDP RX metadata guide
- typos
The test test_xdp_native_adjst_head_grow_data tests a case where the
head is adjusted by -256.
When this test runs, data_ptr is shifted to frag_start + 2 (where
frag_start = page_address(page) + offset).
Then, bnxt_rx_multi_page_skb is invoked and the napi_build_skb
expression subtracts 258, landing at an address before frag_start. This
could be either the previous fragment or the previous physical page when
the offset is < 256 (e.g. if the fragment started at offset 0).
When the skb is freed, the page pool fragment reference is dropped on
either the wrong page or the wrong frag of the right page. In either
case, the corrupted reference count can lead to the page being
prematurely recycled while still in use. Once (incorrectly) recycled, it
can be handed out again and on driver teardown this would result in a
double free.
The commit named in the Fixes tag updated this code to handle the case
where the native page size is >= 64k, but it unintentionally broke the
head grow case.
To fix this, add an offset field to struct bnxt_sw_rx_bd, mirroring the
existing offset field in struct bnxt_sw_rx_agg_bd. Populate it on
allocation and preserve it on reuse.
In bnxt_rx_multi_page_skb, use the newly added offset field to compute
the fragment start and pass that to napi_build_skb. Adjust the layout
with skb_reserve.
There are two cases, the non-adjustment case and the adjustment case.
In both cases, the skb is built at page_address(page) + offset, which
accounts for the case where the native page size >= 64K, and skb_reserve
is called with data_ptr - (page_address(page) + offset). That
difference equals bp->rx_offset when data_ptr was not moved, or
bp->rx_offset + xdp_adjust when XDP adjusted the head.
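The arithmetic can be checked with plain integers standing in for addresses. This is a sketch: the helper names are hypothetical, and the value 258 for bp->rx_offset is inferred from the "subtracts 258" remark above rather than taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Length that would be handed to skb_reserve(): distance from the start of
 * the fragment to the (possibly XDP-adjusted) data pointer. */
static intptr_t toy_reserve_len(uintptr_t frag_start, uintptr_t data_ptr)
{
	assert(data_ptr >= frag_start);
	return (intptr_t)(data_ptr - frag_start);
}

/* Old-style build address: data pointer minus a fixed rx_offset. */
static uintptr_t toy_old_build_addr(uintptr_t data_ptr, unsigned int rx_offset)
{
	return data_ptr - rx_offset;
}
```

For example, with frag_start at 0x10000 and the head grown by 256, data_ptr is frag_start + 2. The old computation yields frag_start - 256, which lies before the fragment, whereas the new scheme builds at frag_start and reserves 2 bytes (258 - 256).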
Re-running the failing test with this commit applied causes the test to
run successfully to completion.
The other rx_skb_func implementations don't have this issue.
Fixes: f6974b4c2d8e ("bnxt_en: Fix page pool logic for page size >= 64K")
Signed-off-by: Joe Damato <joe@dama.to>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260609204458.2237787-2-joe@dama.to
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nazim Amirul [Tue, 9 Jun 2026 12:17:03 +0000 (05:17 -0700)]
net: stmmac: xgmac2: disable RBUE in default RX interrupt mask
Enabling the RX Buffer Unavailable (RBUE) interrupt is counterproductive
and can trigger a MAC interrupt storm under heavy RX pressure. When the
DMA runs out of RX descriptors it fires RBUE continuously until software
refills the ring.
However, RBUE is redundant: the normal RX completion interrupt (RIE)
already triggers NAPI, which processes completed descriptors and refills
the ring, causing the DMA to resume. The RBUE handler itself only sets
handle_rx - the same outcome as RIE.
On Agilex5 under heavy RX pressure, the MAC interrupt (which includes
RBUE) was observed firing 1,821,811,555 times against only 2,618,627
actual RX completions - a ~695x ratio - confirming the severity of the
storm.
RBUE does not provide OOM recovery. If page_pool is exhausted,
stmmac_rx_refill() cannot advance the DMA tail pointer, the DMA stays
suspended, and RBUE fires again on the next NAPI completion - a storm
with no forward progress. This patch trades that storm for a clean
stall with the same RX outcome. Proper OOM recovery is a pre-existing
gap outside the scope of this fix.
Note: as a consequence of disabling RBUE, the rx_buf_unav_irq ethtool
counter will always read 0 on XGMAC2 devices. This behaviour is already
inconsistent across DWMAC core versions.
Remove RBUE from XGMAC_DMA_INT_DEFAULT_EN and XGMAC_DMA_INT_DEFAULT_RX
to prevent the interrupt storm while keeping normal RX handling intact.
Linus Torvalds [Fri, 12 Jun 2026 22:51:16 +0000 (15:51 -0700)]
Merge tag 'drm-fixes-2026-06-13' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Looks like it's settled down a bit more thankfully. Small changes
across the board, amdgpu/xe leading with some colorop changes in the
core/amd. Otherwise some misc driver fixes.
amdkfd:
- Fix an event information leak
- Events bounds check fix
- Trap cleanup fix
i915:
- Check supported link rates DPCD read
- Fix phys BO pread/pwrite with offset
xe:
- fix oops in suspend/shutdown without display
- RAS fixes
- Use HW_ERR prefix in log
- include all registered queues in TLB invalidation
- Fix refcount leak in xe_range_tree in error paths
- fix job timeout recovery for unstarted jobs and kernel queues
amdxdna:
- fix possible leak of mm_struct
ivpu:
- fix integer truncation
vc4:
- fix leak in krealloc() error handling
virtio:
- fix dma_fence ref-count leak"
* tag 'drm-fixes-2026-06-13' of https://gitlab.freedesktop.org/drm/kernel: (24 commits)
accel/amdxdna: Fix mm_struct reference leak in aie2_populate_range()
drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
drm/xe: fix refcount leak in xe_range_fence_insert()
drm/xe: include all registered queues in TLB invalidation
drm/xe/hw_error: Use HW_ERR prefix in log
drm/xe/drm_ras: Add per node cleanup action
drm/xe/drm_ras: Make counter allocation drm managed
drm/xe/display: fix oops in suspend/shutdown without display
drm/amd/display: use plane color_mgmt_changed to track colorop changes
drm/atomic: track individual colorop updates
drm/colorop: make lut(1/3)d_interpolation props correctly behave as mutable
drm/colorop: Remove read-only comments from interpolation fields
drm/i915/gem: Fix phys BO pread/pwrite with offset
drm/vc4: fix krealloc() memory leak
drm/virtio: Fix driver removal with disabled KMS
drm/i915/edp: Check supported link rates DPCD read
accel/ivpu: Fix signed integer truncation in IPC receive
drm/virtio: fix dma_fence refcount leak on error in virtio_gpu_dma_fence_wait()
drm/amd/display: Consult MCCS FreeSync cap only if requested & supported
drm/amdkfd: Unwind debug trap enable on copy_to_user failure
...
Chuck Lever [Tue, 9 Jun 2026 14:18:31 +0000 (10:18 -0400)]
handshake: Require admin permission for DONE command
ACCEPT and DONE are the two downcalls of the handshake genl
family, both intended for use by the trusted handshake agent
(tlshd). ACCEPT already requires GENL_ADMIN_PERM; DONE has
no privilege check at all.
The fd-lookup in handshake_nl_done_doit() only confirms that
some pending handshake request exists for the supplied sockfd;
it does not authenticate the sender. An unprivileged process
that guesses or observes a valid sockfd can therefore submit
a DONE with HANDSHAKE_A_DONE_STATUS == 0, leaving the kernel
consumer to proceed as if the handshake succeeded. A non-zero
status on a forged DONE tears down a legitimate in-flight
handshake before tlshd can report its real result.
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260609141831.90694-1-cel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This series simplifies UMEM handling in selftests/xsk.
It centralizes UMEM property setup through helpers, moves UMEM ownership
from ifobject to socket-owned state, and normalizes umem_size/mmap_size
usage across the touched paths.
====================
UMEM teardown currently recomputes the munmap() length from frame
geometry, shared-UMEM adjustment, and hugepage rounding. This duplicates
setup-time logic in cleanup and relies on re-deriving the mapping size
instead of using the size originally established for the mapping.
Store the final mapping length in xsk_umem_info as mmap_size when the
UMEM mapping is created, and use that value during teardown.
Also join the RX worker thread before cleanup in the single-thread
path. This establishes synchronization before reading umem->mmap_size
in teardown and avoids a potential visibility race.
This removes duplicated size arithmetic in cleanup and makes munmap()
use the canonical mapping size recorded at setup time.
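The pattern of recording the mapping length once and reusing it at teardown might look like the sketch below. The struct and helpers are hypothetical and much simpler than the selftest's xsk_umem_info; an anonymous mapping is used so the example runs without any AF_XDP setup.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

struct toy_umem {
	void *buffer;
	size_t mmap_size;   /* final mapping length, recorded once at setup */
};

/* Create the mapping and remember exactly how long it is. */
static int toy_umem_map(struct toy_umem *u, size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	u->buffer = p;
	u->mmap_size = len;
	return 0;
}

/* Teardown uses the recorded length instead of re-deriving it. */
static int toy_umem_unmap(struct toy_umem *u)
{
	int ret = munmap(u->buffer, u->mmap_size);

	u->buffer = NULL;
	u->mmap_size = 0;
	return ret;
}
```

Any rounding or adjustment applied at setup is then automatically reflected at teardown, because both sides read the same field.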
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260608130938.958793-5-tushar.vyavahare@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selftests/xsk: Move UMEM state from ifobject to xsk_socket_info
Move UMEM ownership from ifobject to xsk_socket_info and access it
through xsk->umem.
Allocate one shared umem_real in ifobject_create() and let all
sockets reference it through xsk->umem, while keeping ownership in
xsk_arr[0]. Keep the existing goto-based error path in
ifobject_create() and free the allocation once in ifobject_delete().
Reset the existing umem_real in __test_spec_init() with memset()
instead of reallocating it.
Preserve shared-UMEM behavior by copying RX UMEM state into a TX-local
UMEM state in thread_common_ops_tx() and reset base_addr/next_buffer
before TX socket configuration.
selftests/xsk: Introduce helpers for setting UMEM properties
UMEM properties are set via open-coded field assignments in multiple test
paths, which makes updates noisy and error-prone.
Introduce two helpers to set UMEM properties through a single interface.
This keeps setup logic consistent across tests and makes future refactoring
simpler.
No functional behavior change is intended.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260608130938.958793-2-tushar.vyavahare@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a tdc test that checks the act_pedit extended L4 header mode does not
edit a packet whose IPv4 protocol does not match the selected transport
header.
The test installs an ingress pedit rule that sets the UDP destination
port, then injects a TCP packet with dport 2222. The UDP and TCP
destination ports sit at the same L4 offset, so a buggy kernel rewrites
the TCP dport. A second flower filter matches TCP dport 2222 and drops
the packet through an indexed gact action; the test then verifies via
JSON that this action saw exactly one packet, i.e. the dport was left
untouched and still matched 2222.
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The extended IPv4 L4 header mode in act_pedit can select TCP or UDP
header fields without confirming that the IPv4 protocol field matches
the selected transport header.
That lets a rule written for TCP or UDP modify unrelated payload bytes
in a packet carrying a different protocol.
Verify that the IPv4 header is long enough, that the protocol matches
the selected TCP or UDP header, and that the packet is not a non-initial
fragment before applying TCP or UDP extended header edits.
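The three checks listed above can be sketched against a raw IPv4 header. This is an illustration only: the function name and the TOY_ constants are hypothetical, and the actual act_pedit code works on skbs rather than byte arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IPPROTO_TCP 6
#define TOY_IPPROTO_UDP 17

/* pkt points at an IPv4 header; len is the number of bytes available. */
static bool toy_l4_edit_allowed(const uint8_t *pkt, size_t len, uint8_t want_proto)
{
	size_t ihl;
	uint16_t frag;

	assert(pkt != NULL);
	if (len < 20)
		return false;           /* too short for a minimal IPv4 header */
	ihl = (size_t)(pkt[0] & 0x0f) * 4;
	if (ihl < 20 || ihl > len)
		return false;           /* bogus or truncated header length */
	if (pkt[9] != want_proto)
		return false;           /* protocol does not match the rule */
	frag = (uint16_t)((pkt[6] << 8) | pkt[7]);
	if (frag & 0x1fff)
		return false;           /* non-initial fragment: no L4 header here */
	return true;
}
```

A UDP rule applied to a TCP packet is refused here, which corresponds to the dport-rewrite scenario exercised by the tdc test.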
Cc: stable+noautosel@kernel.org # in real rule sets the match confirms this before calling the action
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dave Airlie [Fri, 12 Jun 2026 21:58:44 +0000 (07:58 +1000)]
Merge tag 'drm-misc-next-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next-fixes for v7.2:
- Fix agp_amd64_probe error propagation.
- Require carveout when PASID is not enabled in amdxdna.
- Clear variable to prevent second unbind in amdxdna.
- Add separate Kconfig option for DMABUF_HEAPS_SYSTEM_CC_SHARED.
Steffen Persvold [Fri, 12 Jun 2026 16:40:41 +0000 (18:40 +0200)]
fbdev: modedb: Fix misaligned fields in the 1920x1080-60 mode
The 1920x1080@60 modedb entry has one too many initializers before
its sync field: a stray "0" occupies the sync slot, which shifts the
remaining values by one field. The entry therefore decodes as
sync = 0, vmode = FB_SYNC_HOR_HIGH_ACT | FB_SYNC_VERT_HIGH_ACT (0x3,
i.e. FB_VMODE_INTERLACED | FB_VMODE_DOUBLE), and flag =
FB_VMODE_NONINTERLACED, instead of the intended sync = positive H/V,
vmode = non-interlaced.
fb_find_mode() then returns a 1920x1080 mode flagged as interlaced +
doublescan with active-low syncs. Drivers that honour var->vmode and
var->sync when programming display timing enable doublescan and the
wrong sync polarity, corrupting the output.
Drop the stray initializer so sync and vmode hold their intended
values (positive H/V sync, non-interlaced), matching the adjacent
1920x1200 entry.
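The field shift described above can be reproduced with a Python ctypes sketch of a positional initializer. The structure below is a trimmed stand-in for struct fb_videomode (the name, pixclock and margin fields are omitted), and the timing values are illustrative assumptions rather than the actual modedb entry; the flag values follow the 0x3 equivalence stated in the commit message.

```python
import ctypes

FB_SYNC_HOR_HIGH_ACT = 1
FB_SYNC_VERT_HIGH_ACT = 2
FB_VMODE_NONINTERLACED = 0
FB_VMODE_INTERLACED = 1
FB_VMODE_DOUBLE = 2


class Mode(ctypes.Structure):
    """Trimmed model of the tail of struct fb_videomode (field order matters)."""
    _fields_ = [
        ("refresh", ctypes.c_uint32),
        ("xres", ctypes.c_uint32),
        ("yres", ctypes.c_uint32),
        ("hsync_len", ctypes.c_uint32),
        ("vsync_len", ctypes.c_uint32),
        ("sync", ctypes.c_uint32),
        ("vmode", ctypes.c_uint32),
        ("flag", ctypes.c_uint32),
    ]


POSITIVE_SYNC = FB_SYNC_HOR_HIGH_ACT | FB_SYNC_VERT_HIGH_ACT

# Buggy: a stray 0 sits in the sync slot, pushing every later value one field on.
buggy = Mode(60, 1920, 1080, 44, 5, 0, POSITIVE_SYNC, FB_VMODE_NONINTERLACED)

# Fixed: the stray 0 is dropped; the trailing flag field defaults to 0.
fixed = Mode(60, 1920, 1080, 44, 5, POSITIVE_SYNC, FB_VMODE_NONINTERLACED)

for label, m in (("buggy", buggy), ("fixed", fixed)):
    print(label, "sync =", m.sync, "vmode =", m.vmode, "flag =", m.flag)
```

In the buggy instance the sync flags land in vmode, where the same bit pattern reads as interlaced plus doublescan; in the fixed instance sync carries the positive polarity bits and vmode is non-interlaced.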
Daniel Pereira [Mon, 1 Jun 2026 19:23:44 +0000 (16:23 -0300)]
docs: pt_BR: Translate 3.Early-stage.rst into Portuguese
Translate the documentation file '3.Early-stage.rst' into Portuguese.
This section addresses corporate kernel development constraints,
the balance between company secrecy and the open-loop approach,
and the use of NDAs or Linux Foundation programs to avoid
integration issues.
Signed-off-by: Daniel Pereira <danielmaraboo@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260601192346.192752-1-danielmaraboo@gmail.com>
Manuel Ebner [Fri, 5 Jun 2026 19:00:56 +0000 (21:00 +0200)]
Documentation: bug-hunting.rst: fix grammar
Fix a grammar issue to improve readability.
Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260605190055.15921-2-manuelebner@mailbox.org>
Translate the "Use trimmed interleaved replies in email discussions"
and "Don't get discouraged - or impatient" sections in
Documentation/translations/ja_JP/process/submitting-patches.rst.
Keep the wording close to the English text and wrap lines to match
the style used in the surrounding Japanese translation.
docs/{it_it,sp_SP,zh_CN,zh_TW}: update references to removed CONFIG_DEBUG_SLAB
CONFIG_DEBUG_SLAB was removed in commit 2a19be61a651 ("mm/slab: remove
CONFIG_SLAB from all Kconfig and Makefile"), but references to it
remained in documentation. The English documentation was updated to
refer to CONFIG_SLUB_DEBUG in commit 5969fbf30274 ("docs:
submit-checklist: structure by category"), but these translations were
never similarly updated. Update them.
Discovered while searching for CONFIG_* symbols referenced in the
kernel but not defined in any Kconfig file.
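A search of this kind could be approximated as follows. The commit does not describe the tooling the author actually used, so this Python sketch is only one possible approach; the function name, the regular expressions and the sample strings are all assumptions.

```python
import re

# References such as CONFIG_DEBUG_SLAB in source or documentation text.
CONFIG_REF = re.compile(r"\bCONFIG_([A-Za-z0-9_]+)\b")

# Definitions such as "config SLUB_DEBUG" or "menuconfig FOO" in Kconfig text.
KCONFIG_DEF = re.compile(
    r"^[ \t]*(?:menu)?config[ \t]+([A-Za-z0-9_]+)[ \t]*$", re.MULTILINE
)


def undefined_symbols(kconfig_text: str, referencing_text: str) -> list:
    """Return referenced CONFIG_* names that have no Kconfig definition."""
    defined = set(KCONFIG_DEF.findall(kconfig_text))
    referenced = set(CONFIG_REF.findall(referencing_text))
    return sorted(referenced - defined)


# Hypothetical inputs standing in for a Kconfig file and a translated document.
sample_kconfig = "config SLUB_DEBUG\n\tbool \"Enable SLUB debugging support\"\n"
sample_doc = "Build with CONFIG_DEBUG_SLAB and CONFIG_SLUB_DEBUG enabled."

print(undefined_symbols(sample_kconfig, sample_doc))
```

A real scan would read every Kconfig file and every tracked file in the tree, and would need to filter out false positives such as symbols mentioned only in historical notes.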
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260611010014.412841-1-enelsonmoore@gmail.com>
Manuel Ebner [Fri, 12 Jun 2026 09:54:22 +0000 (11:54 +0200)]
Documentation: arch: fix brackets
Add missing and remove needless parentheses, brackets and curly braces.
Fix typos.
Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260612095432.177759-2-manuelebner@mailbox.org>
Linus Torvalds [Fri, 12 Jun 2026 18:06:16 +0000 (11:06 -0700)]
Merge tag 'spi-fix-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A couple of driver specific fixes: a small targeted fix for hardware
error handling on DesignWare controllers and another for handling of
custom chip select management on Qualcomm GENI controllers"
* tag 'spi-fix-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: dw: fix race between IRQ handler and error handler on SMP
spi: qcom-geni: Fix cs_change handling on the last transfer