The s2io driver supports Exar (formerly Neterion and S2io) PCI-X 10
Gigabit Ethernet cards. Hardware supporting PCI-X has not been
manufactured in years. On x86, it was quickly replaced by PCIe. While
it stuck around longer on POWER hardware, the last POWER hardware to
support it was POWER7, which is not supported by ppc64le Linux
distributions. The last supported mainstream ppc64 Linux distribution
was RHEL 7; while it is still supported under ELS, ELS is only
available for x86 and IBM Z. It is possible to use many PCI-X cards in
standard PCI slots (which are still available on new motherboards), but
it does not make sense to do so for 10 Gigabit Ethernet because the
maximum bandwidth of standard PCI is only 1067 Mbps. It is therefore
highly unlikely that this driver is still being used. Remove the
driver, and move the former maintainer to the CREDITS file (restoring
credit for the vxge driver, which was removed in commit f05643a0f60b
("eth: remove neterion/vxge").
Daniel Zahka [Tue, 27 Jan 2026 16:30:55 +0000 (08:30 -0800)]
selftests: drv-net: psp: fix test flakes from racy connection close
There is a bug in assoc_sk_only_mismatch() and
assoc_sk_only_mismatch_tx() that creates a race condition which
triggers test flakes in later test cases e.g. data_send_bad_key().
The problem is that the client uses the "conn clr" rpc to setup a data
connection with psp_responder, but never uses a matching "data close"
rpc. This creates a race condition where if the client can queue
another data sock request, like in data_send_bad_key(), before the
server can accept the old connection from the backlog we end up in a
situation where we have two connections in the backlog: one for the
closed connection we have received a FIN for, and one for the new PSP
connection which is expecting to do key exchange.
From there the server pops the closed connection from the backlog, but
the data_send_bad_key() test case in psp.py hangs waiting to perform
key exchange.
The fix is to properly use _conn_close, which fill force the server to
remove the closed connection from the backlog before sending the RPC
ack to the client.
Passing IRQF_ONESHOT ensures that the interrupt source is masked until
the secondary (threaded) handler is done. If only a primary handler is
used then the flag makes no sense because the interrupt can not fire
(again) while its handler is running.
The flag also disallows force-threading of the primary handler and the
irq-core will warn about this as of commit aef30c8d569c0 ("genirq: Warn
about using IRQF_ONESHOT without a threaded handler").
The IRQF_ONESHOT flag was added in commit 0fabe1021f8bc ("MIPS:
DECstation I/O ASIC DMA interrupt classes"). It moved
clear_ioasic_dma_irq() from the driver into the irq-chip.
For EOI interrupts the clear_ioasic_dma_irq() callback is now invoked as
->irq_eoi() which is invoked after the IRQ was handled while the
interrupt is masked due to IRQF_ONESHOT. Without IRQF_ONESHOT it would
be invoked while interrupt is unmasked (but interrupts are disabled).
If it is *required* to invoke EOI-ack while the interrupt is masked (and
not a misunderstanding) due to irq-chip cascading/ hierarchical reasons
then using handle_fasteoi_mask_irq() as flow-handler would be the right
way to do so.
Remove IRQF_ONESHOT to irqflags.
Cc: Andrew Lunn <andrew+netdev@lunn.ch> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Maciej W. Rozycki <macro@orcam.me.uk> Link: https://patch.msgid.link/20260127135334.qUEaYP9G@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Test tcp_tx_timestamp() behavior after ("tcp: tcp_tx_timestamp()
must look at the rtx queue").
Without the fix, this new test fails like this:
tcp_timestamping_tcp_tx_timestamp_bug.pkt:55: runtime error in recvmsg call: Expected result 0 but got -1 with errno 11 (Resource temporarily unavailable)
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 4168 Comm: syz.4.206 Not tainted syzkaller #0 PREEMPT(voluntary)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Reported-by: syzbot+d24f940f770afda885cf@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69783ead.050a0220.c9109.0013.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260127043528.514160-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Junjie Cao [Mon, 26 Jan 2026 06:15:32 +0000 (14:15 +0800)]
selftests: ptp: treat unsupported PHC operations as skip
Some PTP hardware clock (PHC) devices may return -EOPNOTSUPP for
operations like settime, adjtime, or adjfreq. This commonly occurs
with timestamp-only PHC implementations that don't support full clock
control.
For background, syzbot previously exposed a crash risk when PTP clock
drivers lacked required callbacks[1]. Subsequent work[2] made callback
presence a registration requirement. As a result, some drivers (like
iwlwifi MVM/MLD[3]) now provide stub callbacks that return -EOPNOTSUPP
for unsupported operations.
When phc_ctl encounters such devices, the "Operation not supported"
error should be treated as a skip (device limitation) rather than a
test failure. This patch:
- Adds [SKIP] output handling in log_test()
- Detects "Operation not supported" from phc_ctl and returns ksft_skip
- Returns ksft_skip if all tests are skipped, preventing false-positive
results when testing timestamp-only PHC implementations
Junjie Cao [Mon, 26 Jan 2026 06:15:31 +0000 (14:15 +0800)]
selftests: ptp: use KSFT_SKIP exit code for skip scenarios
The kselftest framework defines KSFT_SKIP=4 as the standard exit code
for skipped tests. However, phc.sh currently uses a mix of 'exit 0' and
'exit 1' to indicate skip conditions, which can confuse test harnesses
and CI systems.
This patch introduces ksft_skip=4 variable and unifies all skip exit
paths to use 'exit $ksft_skip', consistent with other selftests like
net/lib.sh and net/fib_nexthops.sh.
The int51x1 driver uses the same requests as USB CDC to handle packet
filtering, but provides its own definitions and function to handle it.
The chip datasheet says the requests are CDC compliant. Replace this
unnecessary code with a reference to usbnet_cdc_update_filter.
Also fix the broken datasheet link and remove an empty comment.
Jakub Kicinski [Wed, 28 Jan 2026 01:33:59 +0000 (17:33 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-01-26 (ice, idpf)
For ice:
Jake converts ring stats to utilize u64_stats APIs and performs some
cleanups along the way.
Alexander reorganizes layout of Tx and Rx rings for cacheline
locality and utilizes __cacheline_group* macros on the new layouts.
For idpf:
YiFei Zhu adds support for BPF kfunc reporting of hardware Rx timestamps.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: export RX hardware timestamping information to XDP
ice: reshuffle and group Rx and Tx queue fields by cachelines
ice: convert all ring stats to u64_stats_t
ice: shorten ring stat names and add accessors
ice: use u64_stats API to access pkts/bytes in dim sample
ice: remove ice_q_stats struct and use struct_group
ice: pass pointer to ice_fetch_u64_stats_per_ring
====================
Breno Leitao [Mon, 26 Jan 2026 10:00:15 +0000 (02:00 -0800)]
ethtool: remove ETHTOOL_GRXRINGS fallback through get_rxnfc
All drivers that need to report the RX ring count now implement the
get_rx_ring_count callback directly. Remove the legacy fallback path
that obtained this information by calling get_rxnfc with ETHTOOL_GRXRINGS.
This simplifies the code and makes get_rx_ring_count the only way
to retrieve the RX ring count.
Note: ethtool_get_rx_ring_count() returns int to allow returning
-EOPNOTSUPP, while the callback returns u32. The implicit conversion
is safe since RX ring counts will not exceed INT_MAX while we are still
alive.
====================
Single MSS length in UDP GSO_PARTIAL
This series addresses an inconsistency in how UDP GSO_PARTIAL handles
the UDP header length field.
Currently, when GSO_PARTIAL segmentation is used, the UDP header length
contains the large MSS size, requiring drivers to manually adjust it
before transmitting. This is inconsistent with how tunnel GSO_PARTIAL
handles outer headers in UDP tunnels, where the length is set to the
single segment size.
This was originally suggested by Alexander Duyck back in 2018:
https://lore.kernel.org/netdev/CAKgT0UcdnUWgr3KQ=RnLKigokkiUuYefmL-ePpDvJOBNpKScFA@mail.gmail.com/
The first patch fixes the core UDP offload code to set the UDP length
field to the single segment size (gso_size + UDP header) instead of the
large MSS size. This provides hardware with a proper template length
value for final segmentation.
The subsequent patches remove the now redundant UDP header length
adjustments from the mlx5e and aquantia drivers, as the core code now
handles this correctly.
I couldn't find any other drivers that support UDP GSO_PARTIAL; idpf
supports UDP segmentation, but it does not use GSO_PARTIAL.
====================
Gal Pressman [Sun, 25 Jan 2026 12:16:47 +0000 (14:16 +0200)]
udp: gso: Use single MSS length in UDP header for GSO_PARTIAL
In GSO_PARTIAL segmentation, set the UDP length field to the single
segment size (gso_size + UDP header) instead of the large MSS size.
This provides hardware with a template length value for final
segmentation, similar to how tunnel GSO_PARTIAL handles outer headers
in UDP tunnels.
This will remove the need to manually adjust the UDP header length in
the drivers, as can be seen in subsequent patches.
This was suggested by Alex in 2018:
https://lore.kernel.org/netdev/CAKgT0UcdnUWgr3KQ=RnLKigokkiUuYefmL-ePpDvJOBNpKScFA@mail.gmail.com/
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260125121649.778086-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: don't pass ioaddr to fix_soc_reset() method
As the stmmac_priv struct is passed to the fix_soc_reset() method which
has the ioaddr, there is no need to pass ioaddr separately. Pass just
the stmmac_priv struct. Fix up the glues that use it.
net: usb: sr9700: replace magic numbers with register bit macros
The first byte of the Rx frame is a copy of the Rx status register, so
0x40 corresponds to RSR_MF (meaning the frame is multicast). Replace
0x40 with RSR_MF for clarity. (All other bits of the RSR indicate
errors. The fact that the driver ignores these errors will be fixed by
a later patch.)
The first byte of the status URB is a copy of the NSR, so 0x40
corresponds to NSR_LINKST. Replace 0x40 with NSR_LINKST for clarity.
This series updates net/ipv6/addrconf.c to use the regular SHA-1
functions, then removes sha1_init_raw() and sha1_transform().
(These were originally patches 25-26 of the series
https://lore.kernel.org/linux-crypto/20250712232329.818226-1-ebiggers@kernel.org/ )
====================
Eric Biggers [Fri, 23 Jan 2026 05:16:56 +0000 (21:16 -0800)]
lib/crypto: sha1: Remove low-level functions from API
Now that there are no users of the low-level SHA-1 interface, remove it.
Specifically:
- Remove SHA1_DIGEST_WORDS (no longer used)
- Remove sha1_init_raw() (no longer used)
- Rename sha1_transform() to sha1_block_generic() and make it static
- Move SHA1_WORKSPACE_WORDS into lib/crypto/sha1.c
Eric Biggers [Fri, 23 Jan 2026 05:16:55 +0000 (21:16 -0800)]
ipv6: Switch to higher-level SHA-1 functions
There's now a proper SHA-1 API that follows the usual conventions for
hash function APIs: sha1_init(), sha1_update(), sha1_final(), sha1().
The only remaining user of the older low-level SHA-1 API,
sha1_init_raw() and sha1_transform(), is ipv6_generate_stable_address().
I'd like to remove this older API, which is too low-level.
Unfortunately, ipv6_generate_stable_address() does in fact skip the
SHA-1 finalization for some reason. So the values it computes are not
standard SHA-1 values, and it sort of does want the low-level API.
Still, it's still possible to use the higher-level functions sha1_init()
and sha1_update() to get the same result, provided that the resulting
state is used directly, skipping sha1_final().
So, let's do that instead. This will allow removing the low-level API.
net: stmmac: rk: convert rk3328 to use bsp_priv->id
rk3328 contains two GMAC instances - gmac2io and gmac2phy. While the
gmac2io instance may be connected to an external PHY, the gmac2phy
instance is permanently connected via RMII to an on-SoC integrated PHY.
The driver currently tests for the gmac2phy instance by checking
bsp_priv->integrated_phy (determined from the PHY's phy-is-integrated
property) and sometimes that the interface mode is RMII. This works
because the rk3328.dtsi has:
The driver contains a mechanism to look up the MMIO address in a table
to determine bsp_priv->id, which is used for every other Rockchip
device. Switch rk3328 to use this mechanism to determine bsp_priv->id
and use that to select which GRF register is used for configuration,
similarly to how the other Rockchip SoCs handle such differences.
It is not worth having a common rk_phy_power_ctl() when the only
difference is which regulator function is called. Also, passing
true/false is non-descriptive. Split this function, moving the code
appropriately into rk_phy_powerup() and rk_phy_powerdown().
In https://lore.kernel.org/netdev/aDne1Ybuvbk0AwG0@shell.armlinux.org.uk/
I requested that a follow-up patch to change the name of dwmac-rk's
phy_power_on() function, which clashes with the drivers/phy function
of the same name. This can cause confusion when grepping for this
function name, or when reviewing code. Thankfully, stmmac doesn't make
use of drivers/phy which saves this from compile errors.
However, as is the usual case when a request is made as part of a
review, if the review leads to successful application of the patch the
author doesn't bother following up with any such requests, and so the
problem falls back onto the reviewer to address... so here is the
solution.
Rename dwmac-rk's function to rk_phy_power_ctl(), as the function both
powers up and down.
Jijie Shao [Fri, 23 Jan 2026 09:47:56 +0000 (17:47 +0800)]
net: hns3: extend HCLGE_FD_AD_COUNTER_NUM to 8 bits
Currently, HCLGE_FD_AD_COUNTER_NUM has only 7 bits and supports a
maximum of 127 counter_id. However, there are actually scenarios
where the counter_id exceeds 127.
This patch adds an additional bit to HCLGE_FD_AD_QID to ensure
that counter_id greater than 127 are supported.
Jijie Shao [Fri, 23 Jan 2026 09:47:55 +0000 (17:47 +0800)]
net: hns3: extend HCLGE_FD_AD_QID to 11 bits
Currently, HCLGE_FD_AD_QID has only 10 bits and supports a
maximum of 1023 queues. However, there are actually scenarios
where the queue_id exceeds 1023.
This patch adds an additional bit to HCLGE_FD_AD_QID to ensure
that queue_id greater than 1023 are supported.
====================
net: dsa: lantiq: add support for Intel GSW150
The Intel GSW150 Ethernet Switch (aka. Lantiq PEB7084) is the predecessor of
MaxLinear's GSW1xx series of switches. It shares most features, but has a
slightly different port layout and different MII interfaces.
Adding support for this switch to the mxl-gsw1xx driver is quite trivial.
====================
Daniel Golle [Thu, 22 Jan 2026 16:39:30 +0000 (16:39 +0000)]
net: dsa: mxl-gsw1xx: add support for Intel GSW150
Add support for the Intel GSW150 (aka. Lantiq PEB7084) switch IC to
the mxl-gsw1xx driver. This switch comes with 5 Gigabit Ethernet
copper ports (Intel XWAY PHY11G (xRX v1.2 integrated) PHYs) as well as
one GMII/RGMII and one RGMII port.
Daniel Golle [Thu, 22 Jan 2026 16:39:23 +0000 (16:39 +0000)]
net: dsa: mxl-gsw1xx: only setup SerDes PCS if it exists
Older Intel GSW150 chip doesn't have a SGMII/1000Base-X/2500Base-X PCS.
Prepare for supporting Intel GSW150 by skipping PCS reset and
initialization in case no .mac_select_pcs operation is defined.
Daniel Golle [Thu, 22 Jan 2026 16:39:09 +0000 (16:39 +0000)]
net: dsa: lantiq: allow arbitrary MII registers
The Lantiq GSWIP and MaxLinear GSW1xx drivers are currently relying on a
hard-coded mapping of MII ports to their respective MII_CFG and MII_PCDU
registers and only allow applying an offset to the port index.
While this is sufficient for the currently supported hardware, the very
similar Intel GSW150 (aka. Lantiq PEB7084) cannot be described using
this arrangement.
Introduce two arrays to specify the MII_CFG and MII_PCDU registers for
each port, replacing the current bitmap used to safeguard MII ports as
well as the port index offset.
====================
vsock: add namespace support to vhost-vsock and loopback
This series adds namespace support to vhost-vsock and loopback. It does
not add namespaces to any of the other guest transports (virtio-vsock,
hyperv, or vmci).
The current revision supports two modes: local and global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).
The mode is set using the parent namespace's
/proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
created. The mode of the current namespace can be queried by reading
/proc/sys/net/vsock/ns_mode. The mode can not change after the namespace
has been created.
Modes are per-netns. This allows a system to configure namespaces
independently (some may share CIDs, others are completely isolated).
This also supports future possible mixed use cases, where there may be
namespaces in global mode spinning up VMs while there are mixed mode
namespaces that provide services to the VMs, but are not allowed to
allocate from the global CID pool (this mode is not implemented in this
series).
Additionally, added tests for the new namespace features:
tools/testing/selftests/vsock/vmtest.sh
1..25
ok 1 vm_server_host_client
ok 2 vm_client_host_server
ok 3 vm_loopback
ok 4 ns_host_vsock_ns_mode_ok
ok 5 ns_host_vsock_child_ns_mode_ok
ok 6 ns_global_same_cid_fails
ok 7 ns_local_same_cid_ok
ok 8 ns_global_local_same_cid_ok
ok 9 ns_local_global_same_cid_ok
ok 10 ns_diff_global_host_connect_to_global_vm_ok
ok 11 ns_diff_global_host_connect_to_local_vm_fails
ok 12 ns_diff_global_vm_connect_to_global_host_ok
ok 13 ns_diff_global_vm_connect_to_local_host_fails
ok 14 ns_diff_local_host_connect_to_local_vm_fails
ok 15 ns_diff_local_vm_connect_to_local_host_fails
ok 16 ns_diff_global_to_local_loopback_local_fails
ok 17 ns_diff_local_to_global_loopback_fails
ok 18 ns_diff_local_to_local_loopback_fails
ok 19 ns_diff_global_to_global_loopback_ok
ok 20 ns_same_local_loopback_ok
ok 21 ns_same_local_host_connect_to_local_vm_ok
ok 22 ns_same_local_vm_connect_to_local_host_ok
ok 23 ns_delete_vm_ok
ok 24 ns_delete_host_ok
ok 25 ns_delete_both_ok
SUMMARY: PASS=25 SKIP=0 FAIL=0
Bobby Eshleman [Wed, 21 Jan 2026 22:11:52 +0000 (14:11 -0800)]
selftests/vsock: add tests for namespace deletion
Add tests that validate vsock sockets are resilient to deleting
namespaces. The vsock sockets should still function normally.
The function check_ns_delete_doesnt_break_connection() is added to
re-use the step-by-step logic of 1) setup connections, 2) delete ns,
3) check that the connections are still ok.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:51 +0000 (14:11 -0800)]
selftests/vsock: add tests for host <-> vm connectivity with namespaces
Add tests to validate namespace correctness using vsock_test and socat.
The vsock_test tool is used to validate expected success tests, but
socat is used for expected failure tests. socat is used to ensure that
connections are rejected outright instead of failing due to some other
socket behavior (as tested in vsock_test). Additionally, socat is
already required for tunneling TCP traffic from vsock_test. Using only
one of the vsock_test tests like 'test_stream_client_close_client' would
have yielded a similar result, but doing so wouldn't remove the socat
dependency.
Additionally, check for the dependency socat. socat needs special
handling beyond just checking if it is on the path because it must be
compiled with support for both vsock and unix. The function
check_socat() checks that this support exists.
Add more padding to test name printf strings because the tests added in
this patch would otherwise overflow.
Add vm_dmesg_* helpers to encapsulate checking dmesg
for oops and warnings.
Add ability to pass extra args to host-side vsock_test so that tests
that cause false positives may be skipped with arg --skip.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:50 +0000 (14:11 -0800)]
selftests/vsock: add namespace tests for CID collisions
Add tests to verify CID collision rules across different vsock namespace
modes.
1. Two VMs with the same CID cannot start in different global namespaces
(ns_global_same_cid_fails)
2. Two VMs with the same CID can start in different local namespaces
(ns_local_same_cid_ok)
3. VMs with the same CID can coexist when one is in a global namespace
and another is in a local namespace (ns_global_local_same_cid_ok and
ns_local_global_same_cid_ok)
The tests ns_global_local_same_cid_ok and ns_local_global_same_cid_ok
make sure that ordering does not matter.
The tests use a shared helper function namespaces_can_boot_same_cid()
that attempts to start two VMs with identical CIDs in the specified
namespaces and verifies whether VM initialization failed or succeeded.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:49 +0000 (14:11 -0800)]
selftests/vsock: add tests for proc sys vsock ns_mode
Add tests for the /proc/sys/net/vsock/{ns_mode,child_ns_mode}
interfaces. Namely, that they accept/report "global" and "local" strings
and enforce their access policies.
Start a convention of commenting the test name over the test
description. Add test name comments over test descriptions that existed
before this convention.
Add a check_netns() function that checks if the test requires namespaces
and if the current kernel supports namespaces. Skip tests that require
namespaces if the system does not have namespace support.
This patch is the first to add tests that do *not* re-use the same
shared VM. For that reason, it adds a run_ns_tests() function to run
these tests and filter out the shared VM tests.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:48 +0000 (14:11 -0800)]
selftests/vsock: use ss to wait for listeners instead of /proc/net
Replace /proc/net parsing with ss(8) for detecting listening sockets in
wait_for_listener() functions and add support for TCP, VSOCK, and Unix
socket protocols.
The previous implementation parsed /proc/net/tcp using awk to detect
listening sockets, but this approach could not support vsock because
vsock does not export socket information to /proc/net/.
Instead, use ss so that we can detect listeners on tcp, vsock, and unix.
The protocol parameter is now required for all wait_for_listener family
functions (wait_for_listener, vm_wait_for_listener,
host_wait_for_listener) to explicitly specify which socket type to wait
for.
ss is added to the dependency check in check_deps().
These functions are reused by the VM tests to collect and compare dmesg
warnings and oops counts. The future VM-specific tests use them heavily.
This patches relies on vm_ssh() already supporting namespaces.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:46 +0000 (14:11 -0800)]
selftests/vsock: prepare vm management helpers for namespaces
Add namespace support to vm management, ssh helpers, and vsock_test
wrapper functions. This enables running VMs and test helpers in specific
namespaces, which is required for upcoming namespace isolation tests.
The functions still work correctly within the init ns, though the caller
must now pass "init_ns" explicitly.
No functional changes for existing tests. All have been updated to pass
"init_ns" explicitly.
Affected functions (such as vm_start() and vm_ssh()) now wrap their
commands with 'ip netns exec' when executing commands in non-init
namespaces.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:45 +0000 (14:11 -0800)]
selftests/vsock: add namespace helpers to vmtest.sh
Add functions for initializing namespaces with the different vsock NS
modes. Callers can use add_namespaces() and del_namespaces() to create
namespaces global0, global1, local0, and local1.
The add_namespaces() function initializes global0, local0, etc... with
their respective vsock NS mode by toggling child_ns_mode before creating
the namespace.
Remove namespaces upon exiting the program in cleanup(). This is
unlikely to be needed for a healthy run, but it is useful for tests that
are manually killed mid-test.
This patch is in preparation for later namespace tests.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:44 +0000 (14:11 -0800)]
selftests/vsock: increase timeout to 1200
Increase the timeout from 300s to 1200s. On a modern bare metal server
my last run showed the new set of tests taking ~400s. Multiply by an
(arbitrary) factor of three to account for slower/nested runners.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:42 +0000 (14:11 -0800)]
virtio: set skb owner of virtio_transport_reset_no_sock() reply
Associate reply packets with the sending socket. When vsock must reply
with an RST packet and there exists a sending socket (e.g., for
loopback), setting the skb owner to the socket correctly handles
reference counting between the skb and sk (i.e., the sk stays alive
until the skb is freed).
This allows the net namespace to be used for socket lookups for the
duration of the reply skb's lifetime, preventing race conditions between
the namespace lifecycle and vsock socket search using the namespace
pointer.
Bobby Eshleman [Wed, 21 Jan 2026 22:11:41 +0000 (14:11 -0800)]
vsock: add netns to vsock core
Add netns logic to vsock core. Additionally, modify transport hook
prototypes to be used by later transport-specific patches (e.g.,
*_seqpacket_allow()).
Namespaces are supported primarily by changing socket lookup functions
(e.g., vsock_find_connected_socket()) to take into account the socket
namespace and the namespace mode before considering a candidate socket a
"match".
This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
for new namespaces.
Add netns functionality (initialization, passing to transports, procfs,
etc...) to the af_vsock socket layer. Later patches that add netns
support to transports depend on this patch.
This patch changes the allocation of random ports for connectible vsocks
in order to avoid leaking the random port range starting point to other
namespaces.
dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
modified to take a vsk in order to perform logic on namespace modes. In
future patches, the net will also be used for socket
lookups in these functions.
Gal Pressman [Sun, 25 Jan 2026 10:55:24 +0000 (12:55 +0200)]
selftests: net: fix wrong boolean evaluation in __exit__
The __exit__ method receives ex_type as the exception class when an
exception occurs. The previous code used implicit boolean evaluation:
terminate = self.terminate or (self._exit_wait and ex_type)
^^^^^^^^^^^
In Python, the and operator can be used with non-boolean values, but it
does not always return a boolean result.
This is probably not what we want, because 'self._exit_wait and ex_type'
could return the actual ex_type value (the exception class) rather than
a boolean True when an exception occurs.
Use explicit `ex_type is not None` check to properly evaluate whether
an exception occurred, returning a boolean result.
Javen Xu [Sat, 24 Jan 2026 21:27:24 +0000 (22:27 +0100)]
r8169: add support for extended chip version id and RTL9151AS
The bits in register TxConfig used for chip identification aren't
sufficient for the number of upcoming chip versions. Therefore a register
is added with extended chip version information, for compatibility
purposes it's called TX_CONFIG_V2. First chip to use the extended chip
identification is RTL9151AS.
Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
[hkallweit1@gmail.com: add support for extended XID where XID is printed] Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/a3525b74-a1aa-43f6-8413-56615f6fa795@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: usb: replace unnecessary get_link functions with usbnet_get_link
usbnet_get_link calls mii_link_ok if the device has a MII defined in
its usbnet struct and no check_connect function defined there. This is
true of these drivers, so their custom get_link functions which call
mii_link_ok are useless. Remove them in favor of usbnet_get_link.
net: usb: smsc95xx: use phy_do_ioctl_running function
The smsc95xx_ioctl function behaves identically to the
phy_do_ioctl_running function. Remove it and use the
phy_do_ioctl_running function directly instead.
Justin Chen [Thu, 22 Jan 2026 19:49:49 +0000 (11:49 -0800)]
net: bcmasp: streamline early exit in probe
Streamline the bcmasp_probe early exit. As support for other
functionality is added(i.e. ptp), it is easier to keep track of early
exit cleanup when it is all in one place.
Justin Chen [Thu, 22 Jan 2026 19:49:48 +0000 (11:49 -0800)]
net: bcmasp: clean up some legacy logic
Removed wol_irq check. This was needed for brcm,asp-v2.0, which was
removed in previous commits.
Removed bcmasp_intf_ops. These function pointers were added to make
it easier to implement pseudo channels. These channels were removed
in newer versions of the hardware and were never implemented.
Eric Dumazet [Mon, 26 Jan 2026 17:47:31 +0000 (17:47 +0000)]
net: include <linux/hex.h> from sysctl_net_core.c
Needed for hex_byte_pack().
x86_64 was already including it, but some arches were not.
Fixes: 37b0ea8fef56 ("net: expand NETDEV_RSS_KEY_LEN to 256 bytes") Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/netdev/aXeka0KYBnrkwUcF@sirena.org.uk/ Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260126174731.2767372-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
YiFei Zhu [Tue, 23 Dec 2025 19:46:46 +0000 (19:46 +0000)]
idpf: export RX hardware timestamping information to XDP
The logic is similar to idpf_rx_hwtstamp, but the data is exported
as a BPF kfunc instead of appended to an skb to support grabbing
timestamps in xsk packets.
A idpf_queue_has(PTP, rxq) condition is added to check the queue
supports PTP similar to idpf_rx_process_skb_fields.
Tested using an xsk connection and checking xdp timestamps are
retrievable in received packets.
Cc: intel-wired-lan@lists.osuosl.org Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice: reshuffle and group Rx and Tx queue fields by cachelines
Place the fields in ice_{rx,tx}_ring used in the same pieces of
hotpath code closer to each other and use
__cacheline_group_{begin,end}_aligned() to isolate the read mostly,
read-write, and cold groups into separate cachelines similarly
to idpf.
Suggested-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 20 Nov 2025 20:20:46 +0000 (12:20 -0800)]
ice: convert all ring stats to u64_stats_t
After several cleanups, the ice driver is now finally ready to convert all
Tx and Rx ring stats to the u64_stats_t and proper use of the u64 stats
APIs.
The final remaining part to cleanup is the VSI stats accumulation logic in
ice_update_vsi_ring_stats().
Refactor the function and its helpers so that all stat values (and not
just pkts and bytes) use the u64_stats APIs. The
ice_fetch_u64_(tx|rx)_stats functions read the stat values using
u64_stats_read and then copy them into local ice_vsi_(tx|rx)_stats
structures. This does require making a new struct with the stat fields as
u64.
The ice_update_vsi_(tx|rx)_ring_stats functions call the fetch functions
per ring and accumulate the result into one copy of the struct. This
accumulated total is then used to update the relevant VSI fields.
Since these are relatively small, the contents are all stored on the stack
rather than allocating and freeing memory.
Once the accumulator side is updated, the helper ice_stats_read and
ice_stats_inc and other related helper functions all easily translate to
use of u64_stats_read and u64_stats_inc. This completes the refactor and
ensures that all stats accesses now make proper use of the API.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 20 Nov 2025 20:20:45 +0000 (12:20 -0800)]
ice: shorten ring stat names and add accessors
The ice Tx/Rx hotpath has a few statistics counters for tracking unexpected
events. These values are stored as u64 but are not accumulated using the
u64_stats API. This could result in load/tear stores on some architectures.
Even some 64-bit architectures could have issues since the fields are not
read or written using ACCESS_ONCE or READ_ONCE.
A following change is going to refactor the stats accumulator code to use
the u64_stats API for all of these stats, and to use u64_stats_read and
u64_stats_inc properly to prevent load/store tears on all architectures.
Using u64_stats_inc and the syncp pointer is slightly verbose and would be
duplicated in a number of places in the Tx and Rx hot path. Add accessor
macros for the cases where only a single stat value is touched at once. To
keep lines short, also shorten the stats names and convert ice_txq_stats
and ice_rxq_stats to struct_group.
This will ease the transition to properly using the u64_stats API in the
following change.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 20 Nov 2025 20:20:44 +0000 (12:20 -0800)]
ice: use u64_stats API to access pkts/bytes in dim sample
The __ice_update_sample and __ice_get_ethtool_stats functions directly
accesses the pkts and bytes counters from the ring stats. A following
change is going to update the fields to be u64_stats_t type, and will need
to be accessed appropriately. This will ensure that the accesses do not
cause load/store tearing.
Add helper functions similar to the ones used for updating the stats
values, and use them. This ensures use of the syncp pointer on 32-bit
architectures. Once the fields are updated to u64_stats_t, it will then
properly avoid tears on all architectures.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 20 Nov 2025 20:20:43 +0000 (12:20 -0800)]
ice: remove ice_q_stats struct and use struct_group
The ice_qp_reset_stats function resets the stats for all rings on a VSI. It
currently behaves differently for Tx and Rx rings. For Rx rings, it only
clears the rx_stats which do not include the pkt and byte counts. For Tx
rings and XDP rings, it clears only the pkt and byte counts.
We could add extra memset calls to cover both the stats and relevant
tx/rx stats fields. Instead, lets convert stats into a struct_group which
contains both the pkts and bytes fields as well as the Tx or Rx stats, and
remove the ice_q_stats structure entirely.
The only remaining user of ice_q_stats is the ice_q_stats_len function in
ice_ethtool.c, which just counts the number of fields. Replace this with a
simple multiplication by 2. I find this to be simpler to reason about than
relying on knowing the layout of the ice_q_stats structure.
Now that the stats field of the ice_ring_stats covers all of the statistic
values, the ice_qp_reset_stats function will properly zero out all of the
fields.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Thu, 20 Nov 2025 20:20:42 +0000 (12:20 -0800)]
ice: pass pointer to ice_fetch_u64_stats_per_ring
The ice_fetch_u64_stats_per_ring function takes a pointer to the syncp from
the ring stats to synchronize reading of the packet stats. It also takes a
*copy* of the ice_q_stats fields instead of a pointer to the stats. This
completely defeats the point of using the u64_stats API. We pass the stats
by value, so they are static at the point of reading within the
u64_stats_fetch_retry loop.
Simplify the function to take a pointer to the ice_ring_stats instead of
two separate parameters. Additionally, since we never call this outside of
ice_main.c, make it a static function.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Queue and vector resources for a given vport, are stored in the
idpf_vport structure. At the time of configuration, these
resources are accessed using vport pointer. Meaning, all the
config path functions are tied to the default queue and vector
resources of the vport.
There are use cases which can make use of config path functions
to configure queue and vector resources that are not tied to any
vport. One such use case is PTP secondary mailbox creation
(it would be in a followup series). To configure queue and interrupt
resources for such cases, we can make use of the existing config
infrastructure by passing the necessary queue and vector resources info.
To achieve this, group the existing queue and vector resources into
default resource group and refactor the code to pass the resource
pointer to the config path functions.
This series also includes patches which generalizes the send virtchnl
message APIs and mailbox API that are necessary for the implementation
of PTP secondary mailbox.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: generalize mailbox API
idpf: avoid calling get_rx_ptypes for each vport
idpf: generalize send virtchnl message API
idpf: remove vport pointer from queue sets
idpf: add rss_data field to RSS function parameters
idpf: reshuffle idpf_vport struct members to avoid holes
idpf: move some iterator declarations inside for loops
idpf: move queue resources to idpf_q_vec_rsrc structure
idpf: introduce idpf_q_vec_rsrc struct and move vector resources to it
idpf: introduce local idpf structure to store virtchnl queue chunks
====================
net: usb: sr9700: use ETH_ALEN instead of magic number
The driver hardcodes the number 6 as the number of bytes to write to
the SR_PAR register, which stores the MAC address. Use ETH_ALEN instead
to make the code clearer.
Andy Roulin and Francesco Ruggeri have apparently independently both hit an
issue with the current neighbor notification scheme. Francesco reported the
issue in [1]. In a response[2] to that report, Andy said:
neigh_update sends a rtnl notification if an update, e.g.,
nud_state change, was done but there is no guarantee of
ordering of the rtnl notifications. Consider the following
scenario:
userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = STALE
read_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = REACHABLE
read_unlock_bh(n->lock)
rtnl_nofify
RTNL REACHABLE sent
<--------
rtnl_notify
RTNL STALE sent
In this scenario, the kernel neigh is updated first to STALE and
then REACHABLE but the netlink notifications are sent out of order,
first REACHABLE and then STALE.
The solution presented in [2] was to extend the critical region to include
both the call to neigh_fill_info(), as well as rtnl_notify(). Then we have
a guarantee that whatever state was captured by neigh_fill_info(), will be
sent right away. The above scenario can thus not happen.
This is how this patchset begins: patches #1 and #2 add helper duals to
neigh_fill_info() and __neigh_notify() such that the __-prefixed function
assumes the neighbor lock is held, and the unprefixed one is a thin wrapper
that manages locking. This extends locking further than Andy's patch, but
makes for a clear code and supports the following part.
At that point, the original race is gone. But what can happen is the
following race, where the notification does not reflect the change that was
made:
Here, even though neigh_update() made a change to STALE, it later sends a
notification with a NUD of REACHABLE. The obvious solution to fix this race
is to move the notifier to the same critical section that actually makes
the change.
Sending a notification in fact involves two things: invoking the internal
notifier chain, and sending the netlink notification. The overall approach
in this patchset is to move the netlink notification to the critical
section of the change, while keeping the internal notifier intact. Since
the motion is not obviously correct, the patchset presents the change in
series of incremental steps with discussion in commit messages. Please see
details in the patches themselves.
Reproducer
==========
To consistently reproduce, I injected an mdelay before the rtnl_notify()
call. Since only one thread should delay, a bit of instrumentation was
needed to see where the call originates. The mdelay was then only issued on
the call stack rooted in the RTNL request.
Then the general idea is to issue an "ip neigh replace" to mark a neighbor
entry as failed. In parallel to that, inject an ARP burst that validates
the entry. This is all observed with an "ip monitor neigh", where one can
see either a REACHABLE->FAILED transition, or FAILED->REACHABLE, while the
actual state at the end of the sequence is always REACHABLE.
With the patchset, only FAILED->REACHABLE is ever observed in the monitor.
Alternatives
============
Another approach to solving the issue would be to have a per-neighbor queue
of notification digests, each with a set of fields necessary for formatting
a notification. In pseudocode, a neighbor update would look something like
this:
neighbor_update:
- lock
- do update
- allocate notification digest, fill partially, mark not-committed
- unlock
- critical-section-breaking stuff (probes, ARP Q, etc.)
- lock
- fill in missing details to the digest (notably neigh->probes)
- mark the digest as committed
- while (front of the digest queue is committed)
- pop it, convert to notifier, send the notification
- unlock
This adds more complexity and would imply more changes to the code, which
is why I think the approach presented in this patchset is better. But it
would allow us to retain the overall structure of the code while giving us
accurate notifications.
A third approach would be to consider the second race not very serious and
be OK with seeing a notification that does not reflect the change that
prompted it. Then a two-patch prefix of this patchset would be all that is
needed.
Petr Machata [Wed, 21 Jan 2026 16:43:42 +0000 (17:43 +0100)]
net: core: neighbour: Make another netlink notification atomically
Similarly to the issue from the previous patch, neigh_timer_handler() also
updates the neighbor separately from formatting and sending the netlink
notification message. We have not seen reports to the effect of this
causing trouble, but in theory, the same sort of issues could have come up:
neigh_timer_handler() would make changes as necessary, but before
formatting and sending a notification, is interrupted before sending by
another thread, which makes a parallel change and sends its own message.
The message send that is prompted by an earlier change thus contains
information that does not reflect the change having been made.
To solve this, the netlink notification needs to be in the same critical
section that updates the neighbor. The critical section is ended by the
neigh_probe() call which drops the lock before calling solicit. Stretching
the critical section over the solicit call is problematic, because that can
then involved all sorts of forwarding callbacks. Therefore, like in the
previous patch, split the netlink notification away from the internal one
and move it ahead of the probe call.
Petr Machata [Wed, 21 Jan 2026 16:43:41 +0000 (17:43 +0100)]
net: core: neighbour: Make one netlink notification atomically
As noted in a previous patch, one race remains in the current code. A
kernel thread might interrupt a userspace thread after the change is done,
but before formatting and sending the message. Then what we would see is
two messages with the same contents:
The solution is to send the netlink message inside the critical section
where the neighbor is changed, so that it reflects the notified-upon
neighbor state.
To that end, in __neigh_update(), move the current neigh_notify() call up
to said critical section, and convert it to __neigh_notify(), because the
lock is held. This motion crosses calls to neigh_update_managed_list(),
neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which
potentially unlock and give an opportunity for the above race.
This also crosses a call to neigh_update_process_arp_queue() which calls
neigh->output(), which might be neigh_resolve_output() calls
neigh_event_send() calls neigh_event_send_probe() calls
__neigh_event_send() calls neigh_probe(), which touches neigh->probes,
an update which will now not be visible in the notification.
However, there is indication that there is no promise that these changes
will be accurately projected to notifications: fib6_table_lookup()
indirectly calls route.c's find_match() calls rt6_probe(), which looks up a
neighbor and call __neigh_set_probe_once(), which sets neigh->probes to 0,
but neither this nor the caller seems to send a notification.
Additionally, the neighbor object that the neigh_probe() mentioned above is
called on, might be the alternative neighbor looked up for the ARP queue
packet destination. If that is the case, the changed value of n1->probes is
not notified anywhere.
So at least in some circumstances, the reported number of probes needs to
be assumed to change without notification.
The netlink message needs to be send inside the critical section where the
neighbor is changed, so that it reflects the notified-upon neighbor state.
On the other hand, there is no such need in case of notifier chain: the
listeners do not assume lock, and often in fact just schedule a delayed
work to act on the neighbor later. At least one in fact also takes the
neighbor lock.
This requires that the netlink notification be done before the internal
notifier chain message is sent. That is safe to do, because the current
listeners, as well as __neigh_notify(), only read the updated neighbor
fields, and never modify them. (Apart from locking.)
The obvious idea behind the helper is to keep together the two bits that
should be done either both or neither: the internal notifier chain message,
and the netlink notification.
To make sure that the notification sent reflects the change being made, the
netlink message needs to be send inside the critical section where the
neighbor is changed. But for the notifier chain, there is no such need: the
listeners do not assume lock, and often in fact just schedule a delayed
work to act on the neighbor later. At least one in fact also takes the
neighbor lock. Therefore these two items have each different locking needs.
Now we could unlock inside the helper, but I find that error prone, and the
fact that the notification is conditional in the first place does not help
to make the call site obvious.
So in this patch, the helper is instead removed and the body, which is just
these two calls, inlined. That way we can use each notifier independently.
Petr Machata [Wed, 21 Jan 2026 16:43:38 +0000 (17:43 +0100)]
net: core: neighbour: Process ARP queue later
ARP queue processing unlocks the neighbor lock, which can allow another
thread to asynchronously perform a neighbor update and send an out of order
notification. Therefore this needs to be done after the notification is
sent.
Move it just before the end of the critical section. Since
neigh_update_process_arp_queue() unlocks, it does not form a part of the
critical section anymore but it can benefit from the lock being taken. The
intention is to eventually do the RTNL notification before this call.
This motion crosses a call to neigh_update_is_router(), which should not
influence processing of the ARP queue.
Petr Machata [Wed, 21 Jan 2026 16:43:36 +0000 (17:43 +0100)]
net: core: neighbour: Call __neigh_notify() under a lock
Andy Roulin has described an issue with the current neighbor notification
scheme as follows. This was also presented publicly at the link below.
neigh_update sends a rtnl notification if an update, e.g.,
nud_state change, was done but there is no guarantee of
ordering of the rtnl notifications. Consider the following
scenario:
userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = STALE
read_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = REACHABLE
read_unlock_bh(n->lock)
rtnl_nofify
RTNL REACHABLE sent
<--------
rtnl_notify
RTNL STALE sent
In this scenario, the kernel neigh is updated first to STALE and
then REACHABLE but the netlink notifications are sent out of order,
first REACHABLE and then STALE.
The solution is to send the netlink message inside the same critical
section that formats the message. That way both the contents and ordering
of the message reflect the same state, and we cannot see the abovementioned
out-of-order delivery.
Even with this patch, a remaining issue that the contents of the message
may not reflect the changes made to the neighbor. A kernel thread might
still interrupt a userspace thread after the change is done, but before
formatting and sending the message. Then what we would see is two messages
with the same contents. The following patches will attempt to address that
issue.
To support those future patches, convert __neigh_notify() to a helper that
assumes that the neighbor lock is already taken by having it call
__neigh_fill_info() instead of neigh_fill_info(). Add a new helper,
neigh_notify(), which takes the lock before calling __neigh_notify().
Migrate all callers to use the latter.
Petr Machata [Wed, 21 Jan 2026 16:43:35 +0000 (17:43 +0100)]
net: core: neighbour: Add a neigh_fill_info() helper for when lock not held
The netlink message needs to be formatted and sent inside the critical
section where the neighbor is changed, so that it reflects the
notified-upon neighbor state. Because it will happen inside an already
existing critical section, it has to assume that the neighbor lock is held.
Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but
makes this assumption. Convert neigh_fill_info() to a wrapper around this
new helper.
Eric Dumazet [Thu, 22 Jan 2026 16:50:49 +0000 (16:50 +0000)]
ipvlan: remove ipvlan_ht_addr_lookup()
ipvlan_ht_addr_lookup() is called four times and not inlined.
Split it to ipvlan_ht_addr_lookup6() and ipvlan_ht_addr_lookup4()
and rework ipvlan_addr_lookup() to call these helpers once,
so that they are (auto)inlined.
After this change, ipvlan_addr_lookup() is faster, and we save
350 bytes of text on x86_64.
Eric Dumazet [Thu, 22 Jan 2026 04:57:18 +0000 (04:57 +0000)]
net: inline net_is_devmem_iov()
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
====================
David Yang [Tue, 20 Jan 2026 09:21:32 +0000 (17:21 +0800)]
vxlan: vnifilter: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
David Yang [Tue, 20 Jan 2026 09:21:31 +0000 (17:21 +0800)]
macsec: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.