git.ipfire.org Git - thirdparty/kernel/stable.git/log
3 weeks ago 8021q: delete cleared egress QoS mappings
Longxuan Yu [Mon, 20 Apr 2026 03:18:46 +0000 (11:18 +0800)] 
8021q: delete cleared egress QoS mappings

vlan_dev_set_egress_priority() currently keeps cleared egress
priority mappings in the hash as tombstones. Repeated set/clear cycles
with distinct skb priorities therefore accumulate mapping nodes until
device teardown and leak memory.

Delete mappings when vlan_prio is cleared instead of keeping tombstones.
Now that the egress mapping lists are RCU protected, the node can be
unlinked safely and freed after a grace period.
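
The difference between keeping tombstones and deleting cleared mappings can be sketched in user space. This is a minimal model, not the kernel code: the names prio_map, map_set and map_count are hypothetical, a plain singly linked list stands in for the RCU-protected hash buckets, a vlan_prio of 0 is assumed to mean "cleared", and free() is called immediately where the kernel would wait for a grace period.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space model of one egress mapping node. */
struct prio_map {
    uint32_t skb_prio;      /* lookup key */
    uint16_t vlan_prio;     /* mapped value; 0 is assumed to mean "cleared" */
    struct prio_map *next;
};

/* Set or clear one mapping. Returns 0 on success, -1 on allocation failure. */
int map_set(struct prio_map **head, uint32_t skb_prio, uint16_t vlan_prio)
{
    struct prio_map **pp;
    struct prio_map *m;

    for (pp = head; *pp; pp = &(*pp)->next) {
        m = *pp;
        if (m->skb_prio != skb_prio)
            continue;
        if (vlan_prio == 0) {
            *pp = m->next;  /* unlink instead of leaving a tombstone */
            free(m);        /* the kernel would defer this past a grace period */
        } else {
            m->vlan_prio = vlan_prio;
        }
        return 0;
    }

    if (vlan_prio == 0)
        return 0;           /* nothing to clear, so allocate nothing */

    m = malloc(sizeof(*m));
    if (!m)
        return -1;
    m->skb_prio = skb_prio;
    m->vlan_prio = vlan_prio;
    m->next = *head;
    *head = m;
    return 0;
}

/* Number of nodes currently linked into the list. */
size_t map_count(const struct prio_map *head)
{
    size_t n = 0;

    for (; head; head = head->next)
        n++;
    return n;
}
```

With this shape, a loop of set/clear cycles over distinct skb priorities leaves map_count() at zero, whereas a tombstone scheme would keep one node per priority ever used.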

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Link: https://patch.msgid.link/ecfa6f6ce2467a42647ff4c5221238ae85b79a59.1776647968.git.yuantan098@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago 8021q: use RCU for egress QoS mappings
Longxuan Yu [Mon, 20 Apr 2026 03:18:45 +0000 (11:18 +0800)] 
8021q: use RCU for egress QoS mappings

The TX fast path and reporting paths walk egress QoS mappings without
RTNL. Convert the mapping lists to RCU-protected pointers, use RCU
reader annotations in readers, and defer freeing mapping nodes with an
embedded rcu_head.

This prepares the egress QoS mapping code for safe removal of mapping
nodes in a follow-up change while preserving the current behavior.

Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Link: https://patch.msgid.link/9136768189f8c6d3f824f476c62d2fa1111688e8.1776647968.git.yuantan098@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago Merge tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 23 Apr 2026 09:20:38 +0000 (11:20 +0200)] 
Merge tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following batch contains Netfilter/IPVS fixes for net:

1) nft_osf actually only supports IPv4, restrict it.

2) Address possible division by zero in nfnetlink_osf, from Xiang Mei.

3) Remove unsafe use of sprintf to fix possible buffer overflow
   in the SIP NAT helper, from Florian Westphal.

4) Restrict xt_mac, xt_owner and xt_physdev to inet families only;
   xt_realm is only for ipv4, otherwise null-pointer-deref is possible.

5) Use kfree_rcu() in nat core to release hooks, this can be an issue
   once nfnetlink_hook gets support to dump NAT hook information, not
   currently a real issue but better fix it now. From Florian Westphal.

6) Fix MTU checks in IPVS, from Yingnan Zhang.

7) Fix possible out-of-bounds when matching TCP options in
   nfnetlink_osf, from Fernando Fernandez Mancera.

8) Fix potential nul-ptr-deref in ttl check in nfnetlink_osf,
   remove useless loop to fix this, also from Fernando.

This is a smaller batch, there are more patches pending in the queue
to arm another pull request as soon as this is considered good enough.

AI might complain again about one more issue regarding osf on
big-endian arches, but this batch is targeting crash fixes for
osf at this stage.

netfilter pull request 26-04-20

* tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
  netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
  ipvs: fix MTU check for GSO packets in tunnel mode
  netfilter: nat: use kfree_rcu to release ops
  netfilter: xtables: restrict several matches to inet family
  netfilter: conntrack: remove sprintf usage
  netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
  netfilter: nft_osf: restrict it to ipv4
====================

Link: https://patch.msgid.link/20260420220215.111510-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago net: dsa: realtek: rtl8365mb: fix mode mask calculation
Mieczyslaw Nalewaj [Sun, 19 Apr 2026 19:37:07 +0000 (21:37 +0200)] 
net: dsa: realtek: rtl8365mb: fix mode mask calculation

The RTL8365MB_DIGITAL_INTERFACE_SELECT_MODE_MASK macro was shifting
the 4-bit mask (0xF) by only (_extint % 2) bits instead of
(_extint % 2) * 4. This caused the mask to overlap with the adjacent
nibble when configuring odd-numbered external interfaces, selecting
the wrong bits entirely.

Align the shift calculation with the existing ...MODE_OFFSET macro.
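
The arithmetic behind the bug can be checked with a small stand-alone sketch. The macro names below are illustrative stand-ins, not copies of the driver macros; only the two shift expressions, (_extint % 2) and (_extint % 2) * 4, are taken from the description above.

```c
#include <stdint.h>

/* Illustrative stand-ins for the driver macros. */
#define MODE_OFFSET(extint)      (((extint) % 2) * 4)
#define MODE_MASK_BUGGY(extint)  (0xFu << ((extint) % 2))
#define MODE_MASK_FIXED(extint)  (0xFu << MODE_OFFSET(extint))

/* Old behaviour: shifts the nibble mask by 0 or 1 bit. */
uint32_t mode_mask_buggy(unsigned int extint)
{
    return MODE_MASK_BUGGY(extint);
}

/* New behaviour: shifts the nibble mask by 0 or 4 bits. */
uint32_t mode_mask_fixed(unsigned int extint)
{
    return MODE_MASK_FIXED(extint);
}
```

For an odd interface the old expression gives 0xF << 1 = 0x1E, which straddles both nibbles, while the corrected one gives 0xF << 4 = 0xF0. Even interfaces yield 0x0F either way, which is consistent with the bug only showing up on odd-numbered external interfaces.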

Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
Signed-off-by: Abdulkader Alrezej <alrazj.abdulkader@gmail.com>
Signed-off-by: Mieczyslaw Nalewaj <namiltd@yahoo.com>
Reviewed-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Link: https://patch.msgid.link/400a6387-a444-4576-af6d-26be5410bce3@yahoo.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago Merge branch 'net-airoha-fix-airoha_qdma_cleanup_tx_queue-processing'
Paolo Abeni [Thu, 23 Apr 2026 07:08:00 +0000 (09:08 +0200)] 
Merge branch 'net-airoha-fix-airoha_qdma_cleanup_tx_queue-processing'

Lorenzo Bianconi says:

====================
net: airoha: Fix airoha_qdma_cleanup_tx_queue() processing

Add missing bits in airoha_qdma_cleanup_tx_queue routine.
Fix airoha_qdma_cleanup_tx_queue processing errors introduced in commit
'3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order
DMA tx descriptors")'.

v3: https://lore.kernel.org/r/20260416-airoha_qdma_cleanup_tx_queue-fix-net-v3-0-2b69f5788580@kernel.org
v2: https://lore.kernel.org/r/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022@kernel.org
v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
====================

Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-0-e04bcc2c9642@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
Lorenzo Bianconi [Fri, 17 Apr 2026 06:36:32 +0000 (08:36 +0200)] 
net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()

Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-2-e04bcc2c9642@kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx()
Lorenzo Bianconi [Fri, 17 Apr 2026 06:36:31 +0000 (08:36 +0200)] 
net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx()

If queue entry list allocation fails in the airoha_qdma_init_tx_queue
routine, airoha_qdma_cleanup_tx_queue() will trigger a NULL pointer
dereference when accessing the queue entry array. The issue is due to the
early ndesc initialization in airoha_qdma_init_tx_queue(). Fix the issue by
moving the ndesc initialization to the end of the airoha_qdma_init_tx routine.
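
The ordering issue can be illustrated with a simplified user-space sketch. The struct layout and the names tx_queue, queue_init and queue_cleanup are hypothetical, and the fail_alloc parameter only exists to simulate an allocation failure. The point shown is that a cleanup loop bounded by ndesc is harmless as long as ndesc is published only after the entry array exists.

```c
#include <stddef.h>
#include <stdlib.h>

struct tx_entry {
    void *skb;
};

struct tx_queue {
    struct tx_entry *entry; /* NULL until allocation succeeds */
    int ndesc;              /* number of valid entries */
};

/* fail_alloc simulates an allocation failure for the demonstration. */
int queue_init(struct tx_queue *q, int size, int fail_alloc)
{
    struct tx_entry *e;

    q->entry = NULL;
    q->ndesc = 0;

    e = fail_alloc ? NULL : calloc((size_t)size, sizeof(*e));
    if (!e)
        return -1;

    q->entry = e;
    q->ndesc = size;        /* published last, once the array exists */
    return 0;
}

/* Walks ndesc entries and returns how many were visited. */
int queue_cleanup(struct tx_queue *q)
{
    int visited = 0;

    for (int i = 0; i < q->ndesc; i++) {
        q->entry[i].skb = NULL;
        visited++;
    }
    free(q->entry);
    q->entry = NULL;
    q->ndesc = 0;
    return visited;
}
```

Had ndesc been set before the allocation, a failed queue_init() followed by queue_cleanup() would index into a NULL entry array.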

Fixes: 3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order DMA tx descriptors")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-1-e04bcc2c9642@kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks ago net/sched: sch_sfb: annotate data-races in sfb_dump_stats()
Eric Dumazet [Tue, 21 Apr 2026 14:16:55 +0000 (14:16 +0000)] 
net/sched: sch_sfb: annotate data-races in sfb_dump_stats()

sfb_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Alternative would be to acquire the qdisc spinlock, but our long-term
goal is to make qdisc dump operations lockless as much as we can.

tc_sfb_xstats fields don't need to be latched atomically,
otherwise this bug would have been caught earlier.
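
As a rough illustration of the annotation pattern, the sketch below uses simplified macros fixed to uint32_t. They are stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), not the real definitions, and the struct and function names are hypothetical. A single writer in the fast path is assumed; each field is read independently in the dump path, so no snapshot across fields is implied.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros, fixed to uint32_t. */
#define READ_ONCE_U32(x)      (*(const volatile uint32_t *)&(x))
#define WRITE_ONCE_U32(x, v)  (*(volatile uint32_t *)&(x) = (v))

/* Counters updated in the fast path (hypothetical layout). */
struct q_stats {
    uint32_t earlydrop;
    uint32_t marked;
};

/* Copy handed to the dump path (hypothetical layout). */
struct q_xstats {
    uint32_t earlydrop;
    uint32_t marked;
};

/* Fast path: assumed to be the only writer of the counters. */
void stats_on_drop(struct q_stats *s)
{
    WRITE_ONCE_U32(s->earlydrop, s->earlydrop + 1);
}

/* Dump path: reads each field exactly once without taking a lock. */
void stats_dump(const struct q_stats *s, struct q_xstats *out)
{
    out->earlydrop = READ_ONCE_U32(s->earlydrop);
    out->marked = READ_ONCE_U32(s->marked);
}
```

The volatile accesses keep the compiler from tearing, fusing or re-reading the loads and stores; they do not make the two fields mutually consistent, which matches the statement above that the xstats fields do not need to be latched atomically.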

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260421141655.3953721-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago net/sched: sch_red: annotate data-races in red_dump_stats()
Eric Dumazet [Tue, 21 Apr 2026 14:23:09 +0000 (14:23 +0000)] 
net/sched: sch_red: annotate data-races in red_dump_stats()

red_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Alternative would be to acquire the qdisc spinlock, but our long-term
goal is to make qdisc dump operations lockless as much as we can.

tc_red_xstats fields don't need to be latched atomically,
otherwise this bug would have been caught earlier.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260421142309.3964322-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats()
Eric Dumazet [Tue, 21 Apr 2026 14:25:09 +0000 (14:25 +0000)] 
net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats()

fq_codel_dump_stats() acquires the qdisc spinlock a bit too late.

Move this acquisition before we fill st.qdisc_stats with live data.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260421142509.3967231-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago net/sched: sch_pie: annotate data-races in pie_dump_stats()
Eric Dumazet [Tue, 21 Apr 2026 14:29:44 +0000 (14:29 +0000)] 
net/sched: sch_pie: annotate data-races in pie_dump_stats()

pie_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Alternative would be to acquire the qdisc spinlock, but our long-term
goal is to make qdisc dump operations lockless as much as we can.

tc_pie_xstats fields don't need to be latched atomically,
otherwise this bug would have been caught earlier.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260421142944.4009941-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago net_sched: sch_hhf: annotate data-races in hhf_dump_stats()
Eric Dumazet [Tue, 21 Apr 2026 14:33:49 +0000 (14:33 +0000)] 
net_sched: sch_hhf: annotate data-races in hhf_dump_stats()

hhf_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260421143349.4052215-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago Merge branch 'intel-wired-lan-driver-updates-2026-04-20-ice'
Jakub Kicinski [Thu, 23 Apr 2026 04:10:12 +0000 (21:10 -0700)] 
Merge branch 'intel-wired-lan-driver-updates-2026-04-20-ice'

Jacob Keller says:

====================
Intel Wired LAN Driver Updates 2026-04-20 (ice)

Since this is a set of related fixes for just the ice driver, Jake provides
the following description for the series:

We recently ran into a nasty corner case issue with a customer operating
E825C cards seeing some strange behavior with missing Tx timestamps. This
series contains a few fixes found during the course of debugging that
issue.

The primary issue discovered in the investigation is a misconfiguration of
the E825C PHY timestamp interrupt register, PHY_REG_TS_INT_CONFIG. This
register is responsible for programming the Tx timestamp behavior of a PHY
port. The driver programs two values here: a threshold for when to
interrupt and whether the interrupt is enabled.

The threshold value is used by hardware to determine when to trigger a Tx
timestamp interrupt. The interrupt cause for the port is raised when the
number of outstanding timestamps in the PHY port timestamp memory meets the
threshold. The interrupt cause is not cleared until the number of
outstanding timestamps drops *below* the threshold.

It is considered a misconfiguration if the threshold is programmed to 0. If
the interrupt is enabled while the threshold is zero, hardware will raise
the interrupt cause at the next time it checks. Once raised, the interrupt
cause for the port will never lower, since you cannot have fewer than zero
outstanding timestamps.

Worse, the timestamp status for the port will remain high even if the
PHY_REG_TS_INT_CONFIG is reprogrammed with a new threshold. The PHY is a
separate hardware block from the MAC, and thus the interrupt status for the
port will remain high even if you reset the device MAC with a PF reset,
CORE reset, or GLOBAL reset.

PHY ports are connected together into quads. Each quad muxes the PHY
interrupt status for the 4 ports on the quad together before connecting
that to the MAC's miscellaneous interrupt vector. As a result, if a single
PHY port in the quad is stuck, no timestamp interrupts will be generated
for any timestamp on any port on that quad.
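
The raise/lower rule and the quad muxing described above can be modeled in a few lines. This is only a toy model under stated assumptions: the function names are hypothetical, the interrupt is assumed to be enabled on every port, and the muxing is assumed to behave like a logical OR. It does not model the latching that keeps the status high after the threshold is reprogrammed.

```c
#include <stdbool.h>

/*
 * Cause is high while the number of outstanding timestamps meets the
 * threshold, and low once it drops below the threshold.
 */
bool port_cause(unsigned int outstanding, unsigned int threshold)
{
    return outstanding >= threshold;
}

/* The four port causes of a quad are combined (assumed: logical OR). */
bool quad_status(const unsigned int outstanding[4],
                 const unsigned int threshold[4])
{
    for (int i = 0; i < 4; i++) {
        if (port_cause(outstanding[i], threshold[i]))
            return true;
    }
    return false;
}
```

Because an unsigned count is never below zero, port_cause(n, 0) is true for every n, so one port with a zero threshold keeps the combined quad status high even when no timestamps are outstanding anywhere in the quad.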

The ice driver never directly writes a value of 0 for the threshold.
Indeed, the desired behavior is to set the threshold to 1, so that
interrupts are generated as soon as a single timestamp is captured.
Unfortunately, it turns out that for the E825C PHY, programming the
threshold and enable bit in the same write may cause a race in the PHY
timestamp block. The PHY may "see" the interrupt as enabled first before it
sees the threshold value. If the previous threshold value is zero (such as
when the register is initialized to zero at a cold power on), the hardware
may race with programming the threshold and set the PHY interrupt status to
high as described above.

The first patch in this series corrects that programming order, ensuring
that the threshold is always written first in a separate transaction from
enabling the interrupt bit. Additionally, an explicit check against writing
a 0 is added to make it clear to future readers that writing 0 to the
threshold while enabling the interrupt is not safe.

The PHY timestamp block does not reset with the MAC, and seems to only
reset during cold power on. This makes recovery from the faulty
configuration difficult. To address this, perform an explicit reset of the
PHY PTP block during initialization. This is achieved by writing the
PHY_REG_GLOBAL register. This performs a PHY soft reset, which completely
resets the timestamp block. This includes clearing the timestamp memory,
the PHY timestamp interrupt status, and the PHY PTP counter. A soft reset
of all ports on the device is done as part of ice_ptp_init_phc() during
early initialization of the PTP functionality by the PTP clock owner, prior
to programming each PHY. The ice_ptp_init_phc() function is called at
driver init and during reinitialization after all forms of device reset.
This ensures that the driver begins operation at a clean slate, rather than
carrying over the stale and potentially buggy configuration of a previous
driver.

While attempting to root cause the issue with the PHY timestamp interrupt,
we also discovered that the driver incorrectly assumes that it is operating
on E822 hardware when reading the PHY timestamp memory status registers in
a few places. This includes the check at the end of the interrupt handler,
as well as the check done inside the PTP auxiliary function. This prevented
the driver from detecting waiting timestamps on ports other than the first
two.

Finally, the ice_ptp_read_tx_hwtstamp_status_eth56g() function was
discovered to only read the timestamp interrupt status value from the first
quad due to mistaking the port index for a PHY quad index. This resulted in
reporting the timestamp status for the second quad as identical to the
first quad instead of properly reporting its value. This is a minor fix
since the function currently is only used for diagnostic purposes and does
not impact driver decision logic.
====================

Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-0-bc2240f42251@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago ice: fix ice_ptp_read_tx_hwtstamp_status_eth56g
Jacob Keller [Tue, 21 Apr 2026 00:51:28 +0000 (17:51 -0700)] 
ice: fix ice_ptp_read_tx_hwtstamp_status_eth56g

The ice_ptp_read_tx_hwtstamp_status_eth56g function calls
ice_read_phy_eth56g with a PHY index. However the function actually expects
a port index. This causes the function to read the wrong PHY_PTP_INT_STATUS
registers, and effectively makes the status wrong for the second set of
ports from 4 to 7.

The ice_read_phy_eth56g function uses the provided port index to determine
which PHY device to read. We could refactor the entire chain to take a PHY
index, but this would impact many code sites. Instead, multiply the PHY
index by the number of ports, so that we read from the first port of each
PHY.
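
The index mix-up can be shown with a small sketch. The names are hypothetical and the value of four ports per PHY is an assumption inferred from the "ports from 4 to 7" wording above. The array status_by_port stands for whatever a read addressed by port index would return.

```c
#include <stdint.h>

#define PORTS_PER_PHY 4u    /* assumption for this sketch */

/* Old behaviour: the PHY index is passed where a port index is expected. */
uint32_t read_status_buggy(const uint32_t status_by_port[], unsigned int phy)
{
    return status_by_port[phy];
}

/* Fixed behaviour: address the first port that belongs to the PHY. */
uint32_t read_status_fixed(const uint32_t status_by_port[], unsigned int phy)
{
    return status_by_port[phy * PORTS_PER_PHY];
}
```

With PHY index 1, the old lookup lands on port 1, which still belongs to the first PHY, so the second PHY's status is reported as a copy of the first one.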

Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Petr Oros <poros@redhat.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-4-bc2240f42251@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago ice: fix ready bitmap check for non-E822 devices
Jacob Keller [Tue, 21 Apr 2026 00:51:27 +0000 (17:51 -0700)] 
ice: fix ready bitmap check for non-E822 devices

The E800 hardware (apart from E810) has a ready bitmap for the PHY
indicating which timestamp slots currently have an outstanding timestamp
waiting to be read by software.

This bitmap is checked in multiple places using the
ice_get_phy_tx_tstamp_ready():

 * ice_ptp_process_tx_tstamp() calls it to determine which timestamps to
   attempt reading from the PHY
 * ice_ptp_tx_tstamps_pending() calls it in a loop at the end of the
   miscellaneous IRQ to check if new timestamps came in while the interrupt
   handler was executing.
 * ice_ptp_maybe_trigger_tx_interrupt() calls it in the auxiliary work task
   to trigger a software interrupt in the event that the hardware logic
   gets stuck.

For E82X devices, multiple PHYs share the same block, and the parameter
passed to the ready bitmap is a block number associated with the given
port. For E825-C devices, the PHYs have their own independent blocks and do
not share, so the parameter passed needs to be the port number. For E810
devices, the ice_get_phy_tx_tstamp_ready() always returns all 1s regardless
of what port, since this hardware does not have a ready bitmap. Finally,
for E830 devices, each PF has its own ready bitmap accessible via register,
and the block parameter is unused.

The first call correctly uses the Tx timestamp tracker block parameter to
check the appropriate timestamp block. This works because the tracker is
setup correctly for each timestamp device type.

The second two callers behave incorrectly for all device types other than
the older E822 devices. They both iterate in a loop using
ICE_GET_QUAD_NUM() which is a macro only used by E822 devices. This logic
is incorrect for devices other than the E822 devices.

For E810 the calls would always return true, causing E810 devices to always
attempt to trigger a software interrupt even when they have no reason to.
For E830, this results in duplicate work as the ready bitmap is checked
once per number of quads. Finally, for E825-C, this results in the pending
checks failing to detect timestamps on ports other than the first two.

Fix this by introducing a new hardware API function to ice_ptp_hw.c,
ice_check_phy_tx_tstamp_ready(). This function will check if any timestamps
are available and returns a positive value if any timestamps are pending.
For E810, the function always returns false, so that the re-trigger checks
never happen. For E830, check the ready bitmap just once. For E82x
hardware, check each quad. Finally, for E825-C, check every port.

The interface function returns an integer to enable reporting of an error
code if the driver is unable to read the ready bitmap. This enables callers to
handle this case properly. The previous implementation assumed that
timestamps are available if they failed to read the bitmap. This is
problematic as it could lead to continuous software IRQ triggering if the
PHY timestamp registers somehow become inaccessible.
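
A rough sketch of the per-device dispatch and the three-way return value follows. Everything here is a hypothetical user-space model: the fake_hw structure, the read_ready() helper and the block counts are assumptions, and only the behavior per device family (never pending on E810, one check on E830, one check per quad or per port otherwise) and the error propagation come from the description above.

```c
#include <errno.h>
#include <stdint.h>

enum dev_type { DEV_E810, DEV_E82X, DEV_E825C, DEV_E830 };

/* Hypothetical stand-in for the hardware state. */
struct fake_hw {
    enum dev_type type;
    unsigned int num_blocks;    /* quads on E82X, ports on E825C; at most 8 */
    uint64_t ready[8];          /* ready bitmap per block */
    int read_fails;             /* simulate an inaccessible register */
};

/* Reads one ready bitmap; returns 0 or a negative errno value. */
static int read_ready(const struct fake_hw *hw, unsigned int block,
                      uint64_t *bitmap)
{
    if (hw->read_fails)
        return -EIO;
    *bitmap = hw->ready[block];
    return 0;
}

/* Returns <0 on error, 0 if nothing is pending, 1 if a timestamp waits. */
int check_tx_tstamp_ready(const struct fake_hw *hw)
{
    unsigned int n;

    switch (hw->type) {
    case DEV_E810:
        return 0;               /* no ready bitmap, never re-trigger */
    case DEV_E830:
        n = 1;                  /* one bitmap per PF, check it once */
        break;
    default:
        n = hw->num_blocks;     /* every quad (E82X) or port (E825C) */
        break;
    }

    for (unsigned int i = 0; i < n; i++) {
        uint64_t bitmap;
        int err = read_ready(hw, i, &bitmap);

        if (err)
            return err;         /* do not treat a failed read as "pending" */
        if (bitmap)
            return 1;
    }
    return 0;
}
```

Returning the error instead of assuming "pending" lets a caller stop re-triggering the software interrupt when the registers cannot be read.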

This change is especially important for E825-C devices, as the missing
checks could leave a window open where a new timestamp could arrive while
the existing timestamps aren't completed. As a result, the hardware
threshold logic would not trigger a new interrupt. Without the check, the
timestamp is left unhandled, and new timestamps will not cause an interrupt
again until the timestamp is handled. Since both the interrupt check and
the backup check in the auxiliary task do not function properly, the device
may have Tx timestamps permanently stuck failing on a given port.

The faulty checks originate from commit d938a8cca88a ("ice: Auxbus devices
& driver for E822 TS") and commit 712e876371f8 ("ice: periodically kick Tx
timestamp interrupt"), however at the time of the original coding, both
functions only operated on E822 hardware. This is no longer the case, and
hasn't been since the introduction of the ETH56G PHY model in commit
7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products").

Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Petr Oros <poros@redhat.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-3-bc2240f42251@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago ice: perform PHY soft reset for E825C ports at initialization
Grzegorz Nitka [Tue, 21 Apr 2026 00:51:26 +0000 (17:51 -0700)] 
ice: perform PHY soft reset for E825C ports at initialization

In some cases the PHY timestamp block of the E825C can become stuck. This
is known to occur if the software writes 0 to the Tx timestamp threshold,
and with older versions of the ice driver the threshold configuration is
buggy and can race such that hardware briefly operates with a zero
threshold enabled. There are no other known ways to trigger this behavior,
but once it occurs, the hardware is not recovered by normal reset, a driver
reload, or even a warm power cycle of the system. A cold power cycle is
sufficient to recover hardware, but this is extremely invasive and can
result in significant downtime on customer deployments.

The PHY for each port has a timestamping block which has its own reset
functionality accessible by programming the PHY_REG_GLOBAL register.
Writing to the PHY_REG_GLOBAL_SOFT_RESET_BIT triggers the hardware to
perform a complete reset of the timestamping block of the PHY. This
includes clearing the timestamp status for the port, clearing all
outstanding timestamps in the memory bank, and resetting the PHY timer.

The new ice_ptp_phy_soft_reset_eth56g() function toggles the
PHY_REG_GLOBAL soft reset bit with the required delays, ensuring the
PHY is properly reinitialized without requiring a full device reset.
The sequence clears the reset bit, asserts it, then clears it again,
with short waits between transitions to allow hardware stabilization.
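
The clear, assert, clear sequence can be sketched against a fake register. The bit position, the structure and the helper names are hypothetical, and settle_delay() is an empty placeholder for the hardware wait. The write log exists only so the sequence can be inspected afterwards.

```c
#include <stdint.h>

#define SOFT_RESET_BIT (1u << 11)   /* hypothetical bit position */
#define MAX_LOG 8

/* Fake register that records the values written to it. */
struct fake_phy_global {
    uint32_t reg;
    uint32_t log[MAX_LOG];
    unsigned int nwrites;
};

static void global_write(struct fake_phy_global *g, uint32_t val)
{
    g->reg = val;
    if (g->nwrites < MAX_LOG)
        g->log[g->nwrites] = val;
    g->nwrites++;
}

static void settle_delay(void)
{
    /* placeholder for the short hardware stabilization wait */
}

/* Toggle the soft reset bit: clear it, assert it, then clear it again. */
void phy_soft_reset(struct fake_phy_global *g)
{
    uint32_t val = g->reg;

    global_write(g, val & ~SOFT_RESET_BIT);
    settle_delay();
    global_write(g, val | SOFT_RESET_BIT);
    settle_delay();
    global_write(g, val & ~SOFT_RESET_BIT);
    settle_delay();
}
```

Clearing the bit first gives a defined starting point even if a previous driver left it set, and the other register bits are carried through all three writes unchanged.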

Call this function in the new ice_ptp_init_phc_e825c(), implementing the
E825C device specific variant of the ice_ptp_init_phc(). Note that if
ice_ptp_init_phc() fails, PTP functionality may be disabled, but the driver
will still load to allow basic functionality to continue.

This causes the clock owning PF driver to perform a PHY soft reset for
every port during initialization. This ensures the driver begins life in a
known functional state regardless of how it was previously programmed.

This ensures that we properly reconfigure the hardware after a device reset
or when loading the driver, even if it was previously misconfigured with an
out-of-date or modified driver.

Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
Signed-off-by: Timothy Miskell <timothy.miskell@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Petr Oros <poros@redhat.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-2-bc2240f42251@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago ice: fix timestamp interrupt configuration for E825C
Grzegorz Nitka [Tue, 21 Apr 2026 00:51:25 +0000 (17:51 -0700)] 
ice: fix timestamp interrupt configuration for E825C

The E825C ice_phy_cfg_intr_eth56g() function is responsible for programming
the PHY interrupt for a given port. This function writes to the
PHY_REG_TS_INT_CONFIG register of the port. The register is responsible for
configuring whether the port interrupt logic is enabled, as well as
programming the threshold of waiting timestamps that will trigger an
interrupt from this port.

This threshold value must not be programmed to zero while the interrupt is
enabled. Doing so puts the port in a misconfigured state where the PHY
timestamp interrupt for the quad of connected ports will become stuck.

This occurs because a threshold of zero results in the timestamp interrupt
status for the port becoming stuck high. The four ports in the connected
quad have their timestamp status indicators muxed together. A new interrupt
cannot be generated until the timestamp status indicators return low for
all four ports.

Normally, the timestamp status for a port will clear once there are fewer
timestamps in that port's timestamp memory bank than the threshold. A
threshold of zero makes this impossible, so the timestamp status for the
port does not clear.

The ice driver never intentionally programs the threshold to zero, indeed
the driver always programs it to a value of 1, intending to get an
interrupt immediately as soon as even a single packet is waiting for a
timestamp.

However, there is a subtle flaw in the programming logic in the
ice_phy_cfg_intr_eth56g() function, due to the way that the hardware
handles enabling the PHY interrupt. If the threshold value is modified at
the same time as the interrupt is enabled, the HW PHY state machine might
enable the interrupt before the new threshold value is actually updated.
This leaves a potential race condition caused by the hardware logic where
a PHY timestamp interrupt might be triggered before the non-zero threshold
is written, resulting in the PHY timestamp logic becoming stuck.

Once the PHY timestamp status is stuck high, it will remain stuck even
after attempting to reprogram the PHY block by changing its threshold or
disabling the interrupt. Even a typical PF or CORE reset will not reset the
particular block of the PHY that becomes stuck. Even a warm power cycle is
not guaranteed to cause the PHY block to reset, and a cold power cycle is
required.

Prevent this by always writing the PHY_REG_TS_INT_CONFIG in two stages.
First write the threshold value with the interrupt disabled, and only write
the enable bit after the threshold has been programmed. When disabling the
interrupt, leave the threshold unchanged. Additionally, re-read the
register after writing it to guarantee that the write to the PHY has been
flushed upon exit of the function.

While we're modifying this function implementation, explicitly reject
programming a threshold of 0 when enabling the interrupt. No caller does
this today, but the consequences of doing so are significant. An explicit
rejection in the code makes this clear.
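
The two-stage programming order can be sketched against a fake register. The bit layout (TS_INT_ENA, TS_INT_THRESH_M), the structure and the helper names are hypothetical; only the ordering, the unchanged threshold on disable, the read-back and the rejection of a zero threshold follow the description above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TS_INT_ENA       (1u << 15)  /* hypothetical enable bit */
#define TS_INT_THRESH_M  0x3Fu       /* hypothetical threshold field */

/* Fake register that counts the writes it receives. */
struct fake_phy {
    uint32_t ts_int_config;
    unsigned int writes;
};

static void phy_write(struct fake_phy *phy, uint32_t val)
{
    phy->ts_int_config = val;
    phy->writes++;
}

static uint32_t phy_read(const struct fake_phy *phy)
{
    return phy->ts_int_config;
}

/* Returns 0 on success or -EINVAL for an unsafe threshold. */
int phy_cfg_intr(struct fake_phy *phy, bool ena, uint32_t threshold)
{
    uint32_t val = phy_read(phy);

    if (!ena) {
        val &= ~TS_INT_ENA;         /* threshold is left unchanged */
        phy_write(phy, val);
        (void)phy_read(phy);        /* read back to flush the write */
        return 0;
    }

    if (threshold == 0 || threshold > TS_INT_THRESH_M)
        return -EINVAL;             /* never enable with a zero threshold */

    /* Stage 1: program the threshold with the interrupt disabled. */
    val &= ~(TS_INT_ENA | TS_INT_THRESH_M);
    val |= threshold;
    phy_write(phy, val);

    /* Stage 2: set the enable bit in a separate write. */
    val |= TS_INT_ENA;
    phy_write(phy, val);

    (void)phy_read(phy);            /* read back to flush the write */
    return 0;
}
```

Splitting the update means the hardware can never observe the enable bit together with a stale threshold, at the cost of one extra register write per enable.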

Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Petr Oros <poros@redhat.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-1-bc2240f42251@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks ago net/rds: zero per-item info buffer before handing it to visitors
Michael Bommarito [Sat, 18 Apr 2026 14:10:47 +0000 (10:10 -0400)] 
net/rds: zero per-item info buffer before handing it to visitors

rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
caller-allocated on-stack u64 buffer to a per-connection visitor and
then copy the full item_len bytes back to user space via
rds_info_copy() regardless of how much of the buffer the visitor
actually wrote.

rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
write a subset of their output struct when the underlying
rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
and the two GIDs via explicit memsets). Several u32 fields
(max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
cache_allocs) and the 2-byte alignment hole between sl and
cache_allocs remain as whatever stack contents preceded the visitor
call and are then memcpy_to_user()'d out to user space.

struct rds_info_rdma_connection and struct rds6_info_rdma_connection
are the only rds_info_* structs in include/uapi/linux/rds.h that are
not marked __attribute__((packed)), so they have a real alignment
hole. The other info visitors (rds_conn_info_visitor,
rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
their packed output struct today and are not known to be vulnerable,
but a future visitor that adds a conditional write-path would have
the same bug.

Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
any netdev is sufficient), sendto()'s any peer on the same subnet
(fails cleanly but installs an rds_connection in the global hash in
RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26
bytes of stack garbage including kernel text/data pointers:

    0..7   0a 63 00 01 0a 63 00 02     src=10.99.0.1 dst=10.99.0.2
    8..39  00 ...                      gids (memset-zeroed)
    40..47 e0 92 a3 81 ff ff ff ff     kernel pointer (max_send_wr)
    48..55 7f 37 b5 81 ff ff ff ff     kernel pointer (rdma_mr_max)
    56..59 01 00 08 00                 rdma_mr_size (garbage)
    60..61 00 00                       tos, sl
    62..63 00 00                       alignment padding
    64..67 18 00 00 00                 cache_allocs (garbage)

Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
and rds_walk_conn_path_info() before invoking the visitor. This
covers the IPv4/IPv6 IB visitors and hardens all current and future
visitors against the same class of bug.

No functional change for visitors that fully populate their output.
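
The effect of the fix can be illustrated with a small userspace sketch.
This is not the RDS code: struct demo_info, demo_visitor() and
demo_walk() are hypothetical names, and the 0xAA fill only simulates
stale stack contents that would otherwise be copied out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an info struct with conditionally written fields. */
struct demo_info {
	uint32_t src;
	uint32_t dst;
	uint32_t max_send_wr;   /* only written when the connection is up */
	uint8_t  tos;
	uint8_t  sl;
	uint32_t cache_allocs;  /* only written when the connection is up */
};

/* Visitor that fills only part of the struct when the connection is not up. */
static void demo_visitor(struct demo_info *info, int conn_up)
{
	info->src = 0x0a630001;
	info->dst = 0x0a630002;
	info->tos = 0;
	info->sl = 0;
	if (conn_up) {
		info->max_send_wr = 128;
		info->cache_allocs = 0;
	}
}

/* Walker that copies the whole item out, optionally zeroing it first. */
static void demo_walk(unsigned char *out, int conn_up, int zero_first)
{
	struct demo_info item;

	memset(&item, 0xAA, sizeof(item));      /* simulate stale stack bytes */
	if (zero_first)
		memset(&item, 0, sizeof(item)); /* the hardening step */
	demo_visitor(&item, conn_up);
	memcpy(out, &item, sizeof(item));       /* full item is always copied */
}
```

With zero_first set, the fields the visitor skipped read back as zero;
without it they carry the simulated stale bytes.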

Changes in v2:
- retarget at the net tree (subject prefix "[PATCH net v2]",
  net/rds: prefix in the title)
- pick up Reviewed-by tags from Sharath Srinivasan and
  Allison Henderson

Fixes: ec16227e1414 ("RDS/IB: Infiniband transport")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Assisted-by: Claude:claude-opus-4-7
Link: https://patch.msgid.link/20260418141047.3398203-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoseg6: fix seg6 lwtunnel output redirect for L2 reduced encap mode
Andrea Mayer [Sat, 18 Apr 2026 16:28:38 +0000 (18:28 +0200)] 
seg6: fix seg6 lwtunnel output redirect for L2 reduced encap mode

When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the
condition in seg6_build_state() that excludes L2 encap modes from
setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for
the new mode.
As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output()
on the output path, where the packet is silently dropped because
skb_mac_header_was_set() fails on L3 packets.

Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP.
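
A minimal sketch of the extended check, assuming a simplified mode
enum. The DEMO_MODE_* names and demo_wants_output_redirect() are
illustrative only; the real code tests the SEG6_IPTUN_MODE_* constants
in seg6_build_state().

```c
#include <stdbool.h>

/* Illustrative tunnel modes; not the kernel's definitions. */
enum demo_mode {
	DEMO_MODE_INLINE,
	DEMO_MODE_ENCAP,
	DEMO_MODE_L2ENCAP,
	DEMO_MODE_ENCAP_RED,
	DEMO_MODE_L2ENCAP_RED,
};

/* Returns true when the output-redirect flag should be set for this mode. */
static bool demo_wants_output_redirect(enum demo_mode mode)
{
	switch (mode) {
	case DEMO_MODE_L2ENCAP:
	case DEMO_MODE_L2ENCAP_RED:	/* the mode the old check did not cover */
		return false;
	default:
		return true;
	}
}
```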

Fixes: 13f0296be8ec ("seg6: add support for SRv6 H.L2Encaps.Red behavior")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agosctp: fix sockets_allocated imbalance after sk_clone()
Xin Long [Fri, 17 Apr 2026 21:09:40 +0000 (17:09 -0400)] 
sctp: fix sockets_allocated imbalance after sk_clone()

sk_clone() increments sockets_allocated and sets the socket refcount to 2.
SCTP performs additional accounting in sctp_clone_sock(), so the clone-time
increment must be undone to avoid double counting.

Note we cannot simply remove the SCTP-side increment, because the SCTP
destroy path in sctp_destroy_sock() only decrements sockets_allocated when
sp->ep is set, which may not be true for all failure paths in
sctp_clone_sock().

Fixes: 16942cf4d3e3 ("sctp: Use sk_clone() in sctp_accept().")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/af8d66f928dec3e9fcbee8d4a85b7d5a6b86f515.1776460180.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'bnge-fixes'
Jakub Kicinski [Thu, 23 Apr 2026 03:30:49 +0000 (20:30 -0700)] 
Merge branch 'bnge-fixes'

Vikas Gupta says:

====================
bnge fixes

Patch-1:
    Due to a wrong HWRM sequence, the driver does not get the correct
    information regarding resources and capabilities.
    The patch fixes the initial HWRM sequence.
Patch-2:
    Remove the initialization of a backing store type that is not
    supported on Thor Ultra devices.
====================

Link: https://patch.msgid.link/20260418023438.1597876-1-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agobnge: remove unsupported backing store type
Vikas Gupta [Sat, 18 Apr 2026 02:34:38 +0000 (08:04 +0530)] 
bnge: remove unsupported backing store type

The backing store type BNGE_CTX_MRAV is not applicable to Thor Ultra
devices. Remove it from the backing store configuration: the firmware
does not populate entities for this backing store type, which causes
the driver load to fail.

Fixes: 29c5b358f385 ("bng_en: Add backing store support")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Dharmender Garg <dharmender.garg@broadcom.com>
Link: https://patch.msgid.link/20260418023438.1597876-3-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agobnge: fix initial HWRM sequence
Vikas Gupta [Sat, 18 Apr 2026 02:34:37 +0000 (08:04 +0530)] 
bnge: fix initial HWRM sequence

Firmware may not advertise correct resources if the backing store is
not enabled before resource information is queried.
Fix the initial sequence of HWRMs so that the driver gets capabilities
and resource information correctly.

Fixes: 3fa9e977a0cd ("bng_en: Initialize default configuration")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rahul Gupta <rahul-rg.gupta@broadcom.com>
Link: https://patch.msgid.link/20260418023438.1597876-2-vikas.gupta@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodocs: maintainer-netdev: fix typo in "targeting"
Ariful Islam Shoikot [Mon, 20 Apr 2026 11:45:53 +0000 (17:45 +0600)] 
docs: maintainer-netdev: fix typo in "targeting"

Fix spelling mistake "targgeting" -> "targeting" in
maintainer-netdev.rst

No functional change.

Signed-off-by: Ariful Islam Shoikot <islamarifulshoikat@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260420114554.1026-1-islamarifulshoikat@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd()
Bingquan Chen [Sat, 18 Apr 2026 11:20:06 +0000 (19:20 +0800)] 
net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd()

In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points
directly into the mmap'd TX ring buffer shared with userspace. The
kernel validates the header via __packet_snd_vnet_parse() but then
re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent
userspace thread can modify the vnet_hdr fields between validation
and use, bypassing all safety checks.

The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr
to a stack-local variable. All other vnet_hdr consumers in the kernel
(tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX
path is the only caller of virtio_net_hdr_to_skb() that reads directly
from user-controlled shared memory.

Fix this by copying vnet_hdr from the mmap'd ring buffer to a
stack-local variable before validation and use, consistent with the
approach used in packet_snd() and all other callers.
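
The copy-then-validate pattern can be sketched in userspace as follows.
struct demo_hdr, the validation rule and the racer callback are
invented for illustration; the callback stands in for a concurrent
userspace thread rewriting the shared ring between validation and use.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header living in memory that another thread can rewrite. */
struct demo_hdr {
	uint16_t hdr_len;
	uint16_t gso_size;
};

/* Validation rule made up for the sketch: hdr_len must fit in the frame. */
static int demo_validate(const struct demo_hdr *h, uint16_t frame_len)
{
	return h->hdr_len <= frame_len ? 0 : -1;
}

/* Simulated attacker: corrupts the shared header after validation. */
static void demo_racer(struct demo_hdr *shared)
{
	shared->hdr_len = 0xffff;
}

/*
 * Snapshot the shared header once, then validate and consume only the
 * private copy. Returns the validated header length, or -1 on rejection.
 */
static int demo_send(struct demo_hdr *shared, uint16_t frame_len,
		     void (*racer)(struct demo_hdr *))
{
	struct demo_hdr local;

	memcpy(&local, shared, sizeof(local));	/* single read of shared memory */
	if (demo_validate(&local, frame_len) < 0)
		return -1;
	if (racer)
		racer(shared);			/* concurrent write happens here */
	return local.hdr_len;			/* use the copy, never *shared */
}
```

Because every later read goes through the local copy, the simulated
write to the shared header cannot alter a value that already passed
validation.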

Fixes: 1d036d25e560 ("packet: tpacket_snd gso and checksum offload")
Signed-off-by: Bingquan Chen <patzilla007@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260418112006.78823-1-patzilla007@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: validate skb->napi_id in RX tracepoints
Kohei Enju [Mon, 20 Apr 2026 10:54:23 +0000 (10:54 +0000)] 
net: validate skb->napi_id in RX tracepoints

Since commit 2bd82484bb4c ("xps: fix xps for stacked devices"),
skb->napi_id shares storage with sender_cpu. RX tracepoints using
net_dev_rx_verbose_template read skb->napi_id directly and can therefore
report sender_cpu values as if they were NAPI IDs.

For example, on the loopback path this can report 1 as napi_id, where 1
comes from raw_smp_processor_id() + 1 in the XPS path:

  # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }'
  # taskset -c 0 ping -c 1 ::1

Report only valid NAPI IDs in these tracepoints and use 0 otherwise.
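
A minimal sketch of such a filter, assuming that valid NAPI IDs are
allocated above the range that sender_cpu values (CPU number + 1) can
occupy. DEMO_NR_CPUS, DEMO_MIN_NAPI_ID and both helpers are
hypothetical names chosen for the example, not the kernel's.

```c
#include <stdbool.h>

/* Assumed CPU count for the sketch; a real kernel has its own limit. */
#define DEMO_NR_CPUS 8u
/* Values up to the CPU count may be sender_cpu (cpu + 1), not NAPI IDs. */
#define DEMO_MIN_NAPI_ID (DEMO_NR_CPUS + 1u)

static bool demo_napi_id_valid(unsigned int id)
{
	return id >= DEMO_MIN_NAPI_ID;
}

/* Value a tracepoint would report: the ID if valid, otherwise 0. */
static unsigned int demo_trace_napi_id(unsigned int raw)
{
	return demo_napi_id_valid(raw) ? raw : 0;
}
```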

Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260420105427.162816-1-kohei@enjuk.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/sched: sch_dualpi2: drain both C-queue and L-queue in dualpi2_change()
Chia-Yu Chang [Fri, 17 Apr 2026 15:25:51 +0000 (17:25 +0200)] 
net/sched: sch_dualpi2: drain both C-queue and L-queue in dualpi2_change()

Fix dualpi2_change() to correctly enforce updated limit and memlimit
values after a configuration change of the dualpi2 qdisc.

Before this patch, dualpi2_change() always attempted to dequeue packets
via the root qdisc (C-queue) when reducing backlog or memory usage, and
unconditionally assumed that a valid skb will be returned. When traffic
classification results in packets being queued in the L-queue while the
C-queue is empty, this leads to a NULL skb dereference during limit or
memlimit enforcement.

This is fixed by first dequeuing from the C-queue path if it is
non-empty. Once the C-queue is empty, packets are dequeued directly from
the L-queue. Return values from qdisc_dequeue_internal() are checked for
both queues. When dequeuing from the L-queue, the parent qdisc qlen and
backlog counters are updated explicitly to keep overall qdisc statistics
consistent.
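
The drain order described above can be sketched with two simple
userspace queues. This is a simplified model, not the qdisc code: the
demo_* types and helpers are hypothetical, and each queue keeps its own
qlen and backlog rather than sharing counters with a parent qdisc.

```c
#include <stddef.h>

struct demo_pkt {
	struct demo_pkt *next;
	unsigned int len;
};

struct demo_queue {
	struct demo_pkt *head;
	unsigned int qlen;
	unsigned int backlog;
};

static void demo_enqueue(struct demo_queue *q, struct demo_pkt *p)
{
	p->next = q->head;
	q->head = p;
	q->qlen++;
	q->backlog += p->len;
}

/* Returns NULL when the queue is empty; callers must check. */
static struct demo_pkt *demo_dequeue(struct demo_queue *q)
{
	struct demo_pkt *p = q->head;

	if (!p)
		return NULL;
	q->head = p->next;
	p->next = NULL;
	q->qlen--;
	q->backlog -= p->len;
	return p;
}

/* Drop packets until the combined length fits: C-queue first, then L-queue. */
static unsigned int demo_enforce_limit(struct demo_queue *c,
				       struct demo_queue *l,
				       unsigned int limit)
{
	unsigned int dropped = 0;

	while (c->qlen + l->qlen > limit) {
		struct demo_pkt *p = demo_dequeue(c);

		if (!p)
			p = demo_dequeue(l);	/* C-queue empty: fall back to L */
		if (!p)
			break;			/* both empty: nothing to drop */
		dropped++;
	}
	return dropped;
}
```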

Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com>
Closes: https://lore.kernel.org/netdev/20260413075740.2234828-1-hxzene@gmail.com/
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Link: https://patch.msgid.link/20260417152551.71648-1-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: airoha: Fix PPE cpu port configuration for GDM2 loopback path
Lorenzo Bianconi [Fri, 17 Apr 2026 15:24:41 +0000 (17:24 +0200)] 
net: airoha: Fix PPE cpu port configuration for GDM2 loopback path

When QoS loopback is enabled for GDM3 or GDM4, incoming packets are
forwarded to GDM2. However, the PPE cpu port for GDM2 is not configured
in this path, causing traffic originating from GDM3/GDM4, which may
be set up as WAN ports backed by QDMA1, to be incorrectly directed
to QDMA0 instead.
Configure the PPE cpu port for GDM2 when QoS loopback is active on
GDM3 or GDM4 to ensure traffic is routed to the correct QDMA instance.

Fixes: 9cd451d414f6 ("net: airoha: Add loopback support for GDM2")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260417-airoha-ppe-cpu-port-for-gdm2-loopback-v1-1-c7a9de0f6f57@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoMerge branch 'net-sleepable-ndo_set_rx_mode'
Paolo Abeni [Tue, 21 Apr 2026 10:50:26 +0000 (12:50 +0200)] 
Merge branch 'net-sleepable-ndo_set_rx_mode'

Stanislav Fomichev says:

====================
net: sleepable ndo_set_rx_mode

This series adds a new ndo_set_rx_mode_async callback that enables
drivers to handle address list updates in a sleepable context. The
current ndo_set_rx_mode is called under the netif_addr_lock spinlock
with BHs disabled, which prevents drivers from sleeping. This is
problematic for ops-locked drivers that need to sleep.

The approach:
1. Add snapshot/reconcile infrastructure for address lists
2. Introduce dev_rx_mode_work that takes snapshots under the lock,
   drops the lock, calls the driver, then reconciles changes back
3. Move promiscuity handling into the scheduled work as well
4. Convert existing ops-locked drivers to ndo_set_rx_mode_async
5. Add a warning for ops-locked drivers still using ndo_set_rx_mode
6. Add a selftest exercising the team+bridge+macvlan topology that
   triggers the addr_lock -> ops_lock ordering issue
====================

Link: https://patch.msgid.link/20260416185712.2155425-1-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests: net: use ip commands instead of teamd in team rx_mode test
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:12 +0000 (11:57 -0700)] 
selftests: net: use ip commands instead of teamd in team rx_mode test

Replace teamd daemon usage with ip link commands for team device
setup. teamd -d daemonizes and returns to the shell before port
addition completes, creating a race: the test may create the macvlan
(and check for its address on a slave) before teamd has finished
adding ports. This makes the test inherently dependent on scheduling
timing.

Using ip commands makes port addition synchronous, removing the race
and making the test deterministic.

Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jay Vosburgh <jv@jvosburgh.net>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-16-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoselftests: net: add team_bridge_macvlan rx_mode test
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:11 +0000 (11:57 -0700)] 
selftests: net: add team_bridge_macvlan rx_mode test

Add a test that exercises the ndo_change_rx_flags path through a
macvlan -> bridge -> team -> dummy stack. This triggers dev_uc_add
under addr_list_lock which flips promiscuity on the lower device.
With the new work queue approach, this must not deadlock.

Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-15-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: warn ops-locked drivers still using ndo_set_rx_mode
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:10 +0000 (11:57 -0700)] 
net: warn ops-locked drivers still using ndo_set_rx_mode

Now that all in-tree ops-locked drivers have been converted to
ndo_set_rx_mode_async, add a warning in register_netdevice to catch
any remaining or newly added drivers that use ndo_set_rx_mode with
ops locking. This ensures future driver authors are guided toward
the async path.

Also route ops-locked devices through netdev_rx_mode_work even if they
lack rx_mode NDOs, to ensure netdev_ops_assert_locked() does not fire
on the legacy path where only RTNL is held.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-14-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonetkit: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:09 +0000 (11:57 -0700)] 
netkit: convert to ndo_set_rx_mode_async

Convert netkit driver from ndo_set_rx_mode to ndo_set_rx_mode_async.
The netkit driver's set_multicast_list is a no-op, presumably for the
same reason as the one in dummy (fake multicast ability).

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-13-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agodummy: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:08 +0000 (11:57 -0700)] 
dummy: convert to ndo_set_rx_mode_async

Convert dummy driver from ndo_set_rx_mode to ndo_set_rx_mode_async.
The dummy driver's set_multicast_list is a no-op, so the conversion
is straightforward: update the signature and the ops assignment.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-12-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonetdevsim: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:07 +0000 (11:57 -0700)] 
netdevsim: convert to ndo_set_rx_mode_async

Convert netdevsim from ndo_set_rx_mode to ndo_set_rx_mode_async.
The callback is a no-op stub so just update the signature and
ops struct wiring.

Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-11-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoiavf: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:06 +0000 (11:57 -0700)] 
iavf: convert to ndo_set_rx_mode_async

Convert iavf from ndo_set_rx_mode to ndo_set_rx_mode_async.
iavf_set_rx_mode now takes explicit uc/mc list parameters and
uses __hw_addr_sync_dev on the snapshots instead of __dev_uc_sync
and __dev_mc_sync.

The iavf_configure internal caller passes the real lists directly.

Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-10-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agobnxt: use snapshot in bnxt_cfg_rx_mode
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:05 +0000 (11:57 -0700)] 
bnxt: use snapshot in bnxt_cfg_rx_mode

With the introduction of ndo_set_rx_mode_async (as discussed in [1])
we can call bnxt_cfg_rx_mode directly. Convert bnxt_cfg_rx_mode to
use uc/mc snapshots and move its call in bnxt_sp_task to the
section that resets BNXT_STATE_IN_SP_TASK. Switch to direct call in
bnxt_set_rx_mode.

Link: https://lore.kernel.org/netdev/CACKFLi=5vj8hPqEUKDd8RTw3au5G+zRgQEqjF+6NZnyoNm90KA@mail.gmail.com/
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-9-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agobnxt: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:04 +0000 (11:57 -0700)] 
bnxt: convert to ndo_set_rx_mode_async

Convert bnxt from ndo_set_rx_mode to ndo_set_rx_mode_async.
bnxt_set_rx_mode, bnxt_mc_list_updated and bnxt_uc_list_updated
now take explicit uc/mc list parameters and iterate with
netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr.

The bnxt_cfg_rx_mode internal caller passes the real lists under
netif_addr_lock_bh.

BNXT_RX_MASK_SP_EVENT is still used here; the next patch converts to
the direct call.

Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-8-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agomlx5: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:03 +0000 (11:57 -0700)] 
mlx5: convert to ndo_set_rx_mode_async

Convert mlx5 from ndo_set_rx_mode to ndo_set_rx_mode_async. The
driver's mlx5e_set_rx_mode now receives uc/mc snapshots and calls
mlx5e_fs_set_rx_mode_work directly instead of queueing work.

mlx5e_sync_netdev_addr and mlx5e_handle_netdev_addr now take
explicit uc/mc list parameters and iterate with
netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr.

Fall back to the netdev's uc/mc lists in a few places and grab the
addr lock.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-7-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agofbnic: convert to ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:02 +0000 (11:57 -0700)] 
fbnic: convert to ndo_set_rx_mode_async

Convert fbnic from ndo_set_rx_mode to ndo_set_rx_mode_async. The
driver's __fbnic_set_rx_mode() now takes explicit uc/mc list
parameters and uses __hw_addr_sync_dev() on the snapshots instead
of __dev_uc_sync/__dev_mc_sync on the netdev directly.

Update callers in fbnic_up, fbnic_fw_config_after_crash,
fbnic_bmc_rpc_check and fbnic_set_mac to pass the real address
lists when calling __fbnic_set_rx_mode outside the async work path.

Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: kernel-team@meta.com
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-6-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: move promiscuity handling into netdev_rx_mode_work
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:01 +0000 (11:57 -0700)] 
net: move promiscuity handling into netdev_rx_mode_work

Move unicast promiscuity tracking into netdev_rx_mode_work so it runs
under netdev_ops_lock instead of under the addr_lock spinlock. This
is required because __dev_set_promiscuity calls dev_change_rx_flags
and __dev_notify_flags, both of which may need to sleep.

Change ASSERT_RTNL() to netdev_ops_assert_locked() in
__dev_set_promiscuity, netif_set_allmulti and __dev_change_flags
since these are now called from the work queue under the ops lock.

Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/
Fixes: 78cd408356fe ("net: add missing instance lock to dev_set_promiscuity")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-5-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: cache snapshot entries for ndo_set_rx_mode_async
Stanislav Fomichev [Thu, 16 Apr 2026 18:57:00 +0000 (11:57 -0700)] 
net: cache snapshot entries for ndo_set_rx_mode_async

Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that
allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to
reuse previously allocated entries instead of hitting GFP_ATOMIC on
every snapshot cycle.

snapshot pops entries from the cache when available, falling back to
__hw_addr_create(). reconcile splices both snapshot lists back into
the cache via __hw_addr_splice(). The cache is flushed in
free_netdev().

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-4-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: introduce ndo_set_rx_mode_async and netdev_rx_mode_work
Stanislav Fomichev [Thu, 16 Apr 2026 18:56:59 +0000 (11:56 -0700)] 
net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work

Add ndo_set_rx_mode_async callback that drivers can implement instead
of the legacy ndo_set_rx_mode. The legacy callback runs under the
netif_addr_lock spinlock with BHs disabled, preventing drivers from
sleeping. The async variant runs from a work queue with rtnl_lock and
netdev_lock_ops held, in fully sleepable context.

When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules
netdev_rx_mode_work instead of calling the driver inline. The work
function takes two snapshots of each address list (uc/mc) under
the addr_lock, then drops the lock and calls the driver with the
work copies. After the driver returns, it reconciles the snapshots
back to the real lists under the lock.

Add netif_rx_mode_sync() to opportunistically execute the pending
workqueue update inline, so that rx mode changes are committed
before returning to userspace:
  - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK)
  - dev_set_promiscuity
  - dev_set_allmulti
  - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI
  - do_setlink (RTM_SETLINK)

Note that some deep hierarchies still do skip the lower updates via:
  - dev_uc_sync
  - dev_mc_sync

If we do end up hitting user-visible issues, we can add more calls to
netif_rx_mode_sync in specific places. Hopefully that will not be
needed: the actual user-visible lists are still synced, and it is only
the HW state that might be lagging.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: add address list snapshot and reconciliation infrastructure
Stanislav Fomichev [Thu, 16 Apr 2026 18:56:58 +0000 (11:56 -0700)] 
net: add address list snapshot and reconciliation infrastructure

Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile()
for use by the upcoming ndo_set_rx_mode_async callback.

The async rx_mode path needs to snapshot the device's unicast and
multicast address lists under the addr_lock, hand those snapshots
to the driver (which may sleep), and then propagate any sync_cnt
changes back to the real lists. Two identical snapshots are taken:
a work copy for the driver to pass to __hw_addr_sync_dev() and a
reference copy to compute deltas against.

__hw_addr_list_reconcile() walks the reference snapshot comparing
each entry against the work snapshot to determine what the driver
synced or unsynced. It then applies those deltas to the real list,
handling concurrent modifications:

  - If the real entry was concurrently removed but the driver synced
    it to hardware (delta > 0), re-insert a stale entry so the next
    work run properly unsyncs it from hardware.
  - If the entry still exists, apply the delta normally. An entry
    whose refcount drops to zero is removed.
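
The two reconcile cases can be sketched for a single address as below.
This is a heavily simplified model under stated assumptions: struct
demo_entry and demo_reconcile_one() are hypothetical, and the sketch
assumes each sync holds one reference, so the delta is applied to both
sync_cnt and refcount.

```c
/* Hypothetical per-address state; not the kernel's netdev_hw_addr. */
struct demo_entry {
	int present;	/* 1 if the entry is on the real list */
	int refcount;
	int sync_cnt;
};

/* Apply what the driver did to the work copy back onto the real entry. */
static void demo_reconcile_one(struct demo_entry *real,
			       const struct demo_entry *ref,
			       const struct demo_entry *work)
{
	int delta = work->sync_cnt - ref->sync_cnt;

	if (delta == 0)
		return;			/* driver did not touch this address */

	if (!real->present) {
		if (delta > 0) {
			/* Removed concurrently but now in hardware: keep a
			 * stale entry so a later run can unsync it. */
			real->present = 1;
			real->refcount = delta;
			real->sync_cnt = delta;
		}
		return;
	}

	real->sync_cnt += delta;
	real->refcount += delta;
	if (real->refcount <= 0)
		real->present = 0;	/* last reference gone: remove */
}
```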

  # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter
  # dev_addr_test_snapshot_benchmark.speed: slow

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoslip: bound decode() reads against the compressed packet length
Weiming Shi [Thu, 16 Apr 2026 10:01:51 +0000 (18:01 +0800)] 
slip: bound decode() reads against the compressed packet length

slhc_uncompress() parses a VJ-compressed TCP header by advancing a
pointer through the packet via decode() and pull16(). Neither helper
bounds-checks against isize, and decode() masks its return with
& 0xffff so it can never return the -1 that callers test for -- those
error paths are dead code.

A short compressed frame whose change byte requests optional fields
lets decode() read past the end of the packet. The over-read bytes
are folded into the cached cstate and reflected into subsequent
reconstructed packets.

Make decode() and pull16() take the packet end pointer and return -1
when exhausted. Add a bounds check before the TCP-checksum read.
The existing == -1 tests now do what they were always meant to.
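
A minimal sketch of bounds-checked helpers in this style. The names
demo_pull16() and demo_decode() are hypothetical and the code is not
the slhc implementation; it only assumes the field encoding in which a
zero byte announces a following 16-bit big-endian value.

```c
/* Read a big-endian 16-bit value, or return -1 if fewer than 2 bytes remain. */
static long demo_pull16(const unsigned char **cpp, const unsigned char *end)
{
	long v;

	if (end - *cpp < 2)
		return -1;
	v = ((long)(*cpp)[0] << 8) | (*cpp)[1];
	*cpp += 2;
	return v;
}

/*
 * Decode one variable-length field: a non-zero byte is the value itself,
 * a zero byte means a 16-bit value follows. Returns -1 when the packet
 * is exhausted, so the caller's error test can actually trigger.
 */
static long demo_decode(const unsigned char **cpp, const unsigned char *end)
{
	int x;

	if (*cpp >= end)
		return -1;
	x = *(*cpp)++;
	if (x == 0)
		return demo_pull16(cpp, end);
	return x;
}
```

Returning the value unmasked keeps -1 distinguishable from every valid
result, which lies in the range 0..0xffff.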

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20260414134126.758795-2-horms@kernel.org/
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260416100147.531855-5-bestswngs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoslip: reject VJ receive packets on instances with no rstate array
Weiming Shi [Wed, 15 Apr 2026 20:41:31 +0000 (04:41 +0800)] 
slip: reject VJ receive packets on instances with no rstate array

slhc_init() accepts rslots == 0 as a valid configuration, with the
documented meaning of 'no receive compression'. In that case the
allocation loop in slhc_init() is skipped, so comp->rstate stays
NULL and comp->rslot_limit stays 0 (from the kzalloc of struct
slcompress).

The receive helpers do not defend against that configuration.
slhc_uncompress() dereferences comp->rstate[x] when the VJ header
carries an explicit connection ID, and slhc_remember() later assigns
cs = &comp->rstate[...] after only comparing the packet's slot number
to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the
range check, and the code dereferences a NULL rstate.

The configuration is reachable in-tree through PPP. PPPIOCSMAXCID
stores its argument in a signed int, and (val >> 16) uses arithmetic
shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1
is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because
/dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path
is reachable from an unprivileged user namespace. Once the malformed
VJ state is installed, any inbound VJ-compressed or VJ-uncompressed
frame that selects slot 0 crashes the kernel in softirq context:

 Oops: general protection fault, probably for non-canonical
       address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
 RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519)
 Call Trace:
  <TASK>
  ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466)
  ppp_input (drivers/net/ppp/ppp_generic.c:2359)
  ppp_async_process (drivers/net/ppp/ppp_async.c:492)
  tasklet_action_common (kernel/softirq.c:926)
  handle_softirqs (kernel/softirq.c:623)
  run_ksoftirqd (kernel/softirq.c:1055)
  smpboot_thread_fn (kernel/smpboot.c:160)
  kthread (kernel/kthread.c:436)
  ret_from_fork (arch/x86/kernel/process.c:164)
  </TASK>
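
The PPPIOCSMAXCID arithmetic described above can be reproduced in a
small userspace sketch. demo_split_maxcid() and struct demo_slots are
hypothetical, the default of 15 for the upper half is an assumption of
the sketch, and the result relies on the compiler implementing right
shift of a negative int as an arithmetic shift (as GCC and Clang do).

```c
/* Slot counts that would be handed to the VJ init routine. */
struct demo_slots {
	int rslots;
	int tslots;
};

/* Split a signed ioctl argument into receive and transmit slot counts. */
static struct demo_slots demo_split_maxcid(int val)
{
	int val2 = 15;	/* assumed default when the upper half is zero */
	struct demo_slots s;

	if ((val >> 16) != 0) {
		val2 = val >> 16;	/* 0xffff0000 as int shifts to -1 */
		val &= 0xffff;
	}
	s.rslots = val2 + 1;		/* -1 + 1 == 0: no receive slots */
	s.tslots = val + 1;
	return s;
}
```

Calling demo_split_maxcid(-65536), the int with bit pattern 0xffff0000,
yields rslots 0 and tslots 1, matching the slhc_init(0, 1) call named
in the text.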

Reject the receive side on such instances instead of touching rstate.
slhc_uncompress() falls through to its existing 'bad' label, which
bumps sls_i_error and enters the toss state. slhc_remember() mirrors
that with an explicit sls_i_error increment followed by slhc_toss();
the sls_i_runt counter is not used here because a missing rstate is
an internal configuration state, not a runt packet.

The transmit path is unaffected: the only in-tree caller that picks
rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and
slip.c always calls slhc_init(16, 16), so comp->tstate remains valid
and slhc_compress() continues to work.

Fixes: 4ab42d78e37a ("ppp, slip: Validate VJ compression slot parameters completely")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260415204130.258866-2-bestswngs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonetfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
Fernando Fernandez Mancera [Fri, 17 Apr 2026 16:20:57 +0000 (18:20 +0200)] 
netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check

The nf_osf_ttl() function accessed skb->dev to perform a local interface
address lookup without verifying that the device pointer was valid.

Additionally, the implementation utilized an in_dev_for_each_ifa_rcu
loop to match the packet source address against local interface
addresses. It assumed that packets from the same subnet should not see a
decrement on the initial TTL. However, a packet might appear to come
from the same subnet when it actually does not, especially in modern
environments with containers and virtual switching.

Remove the device dereference and interface loop. Replace the logic with
a switch statement that evaluates the TTL according to the ttl_check.

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Kito Xu (veritas501) <hxzene@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonetfilter: nfnetlink_osf: fix out-of-bounds read on option matching
Fernando Fernandez Mancera [Fri, 17 Apr 2026 16:20:56 +0000 (18:20 +0200)] 
netfilter: nfnetlink_osf: fix out-of-bounds read on option matching

In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once
and passed by reference to nf_osf_match_one() for each fingerprint
checked. During TCP option parsing, nf_osf_match_one() advances the
shared ctx->optp pointer.

If a fingerprint perfectly matches, the function returns early without
restoring ctx->optp to its initial state. If the user has configured
NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint.
However, because ctx->optp was not restored, the next call to
nf_osf_match_one() starts parsing from the end of the options buffer.
This causes subsequent matches to read garbage data and fail
immediately, so that at most one match can be logged, or incorrect
matches are logged.

Instead of using a shared ctx->optp pointer, pass the context as a
constant pointer and use a local pointer (optp) for TCP option
traversal. This makes nf_osf_match_one() strictly stateless from the
caller's perspective, ensuring every fingerprint check starts at the
correct option offset.
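
A minimal userspace sketch of the difference, assuming simplified
stand-in types: struct hdr_ctx, match_one_shared() and match_one_local()
are hypothetical and are not the kernel definitions. A matcher that
advances a cursor stored in the shared context leaves it at the end of
the buffer for the next call, while a matcher that works on a local copy
of the cursor does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parsing context; not the kernel struct. */
struct hdr_ctx {
	const unsigned char *optp;	/* start of the TCP options */
	size_t optsize;			/* number of option bytes */
};

/* Buggy pattern: walks the options through the shared cursor. */
static size_t match_one_shared(struct hdr_ctx *ctx, const unsigned char *end)
{
	size_t seen = 0;

	while (ctx->optp < end) {
		ctx->optp++;
		seen++;
	}
	return seen;
}

/* Fixed pattern: the context is const and the cursor is local. */
static size_t match_one_local(const struct hdr_ctx *ctx)
{
	const unsigned char *optp = ctx->optp;
	const unsigned char *end = ctx->optp + ctx->optsize;
	size_t seen = 0;

	while (optp < end) {
		optp++;
		seen++;
	}
	return seen;
}
```

With the shared cursor, a second call sees zero option bytes; with the
local cursor, every call sees the full option area.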

Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agoipvs: fix MTU check for GSO packets in tunnel mode
Yingnan Zhang [Wed, 15 Apr 2026 14:40:29 +0000 (22:40 +0800)] 
ipvs: fix MTU check for GSO packets in tunnel mode

Currently, IPVS skips MTU checks for GSO packets by excluding them with
the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel
mode encapsulates GSO packets with IPIP headers.

The issue manifests in two ways:

1. MTU violation after encapsulation:
   When a GSO packet passes through IPVS tunnel mode, the original MTU
   check is bypassed. After adding the IPIP tunnel header, the packet
   size may exceed the outgoing interface MTU, leading to unexpected
   fragmentation at the IP layer.

2. Fragmentation with problematic IP IDs:
   When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments
   is fragmented after encapsulation, each segment gets a sequentially
   incremented IP ID (0, 1, 2, ...). This happens because:

   a) The GSO packet bypasses MTU check and gets encapsulated
   b) At __ip_finish_output, the oversized GSO packet is split into
      separate SKBs (one per segment), with IP IDs incrementing
   c) Each SKB is then fragmented again based on the actual MTU

   This sequential IP ID allocation differs from the expected behavior
   and can cause issues with fragment reassembly and packet tracking.

Fix this by properly validating GSO packets using
skb_gso_validate_network_len(). This function correctly validates
whether the GSO segments will fit within the MTU after segmentation. If
validation fails, send an ICMP Fragmentation Needed message to enable
proper PMTU discovery.

Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling")
Signed-off-by: Yingnan Zhang <342144303@qq.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonetfilter: nat: use kfree_rcu to release ops
Pablo Neira Ayuso [Wed, 15 Apr 2026 15:29:45 +0000 (17:29 +0200)] 
netfilter: nat: use kfree_rcu to release ops

Florian Westphal says:

"Historically this is not an issue, even for normal base hooks: the data
path doesn't use the original nf_hook_ops that are used to register the
callbacks.

However, in v5.14 I added the ability to dump the active netfilter
hooks from userspace.

This code will peek back into the nf_hook_ops that are available
at the tail of the pointer-array blob used by the datapath.

The nat hooks are special, because they are called indirectly from
the central nat dispatcher hook. They are currently invisible to
the nfnl hook dump subsystem though.

But once that changes the nat ops structures have to be deferred too."

Update nf_nat_register_fn() to handle partial exposure of the hooks on
the error path, which can also be an issue for nfnetlink_hook.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonetfilter: xtables: restrict several matches to inet family
Pablo Neira Ayuso [Wed, 15 Apr 2026 10:21:00 +0000 (12:21 +0200)] 
netfilter: xtables: restrict several matches to inet family

This is a partial revert of:

  commit ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")

to allow ipv4 and ipv6 only.

- xt_mac
- xt_owner
- xt_physdev

These extensions are not used by ebtables in userspace.

Moreover, xt_realm is only for ipv4, since dst->tclassid is ipv4
specific.

Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonetfilter: conntrack: remove sprintf usage
Florian Westphal [Tue, 14 Apr 2026 17:13:46 +0000 (19:13 +0200)] 
netfilter: conntrack: remove sprintf usage

Replace it with scnprintf. The buffer sizes are expected to be large
enough to hold the result, so there is no need for snprintf plus an
overflow check.
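
For illustration only, a userspace approximation of the scnprintf
contract, built on the standard vsnprintf(). The helper name
scn_printf() is hypothetical. Unlike snprintf, which returns the length
the output would have had, it returns the number of characters actually
stored, so the result can never index past the buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace approximation of the kernel's scnprintf(). */
static int scn_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;
	/* On truncation, report what fits, excluding the trailing NUL. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```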

Increase buffer size in mangle_content_len() while at it.

BUG: KASAN: stack-out-of-bounds in vsnprintf+0xea5/0x1270
Write of size 1 at addr [..]
 vsnprintf+0xea5/0x1270
 sprintf+0xb1/0xe0
 mangle_content_len+0x1ac/0x280
 nf_nat_sdp_session+0x1cc/0x240
 process_sdp+0x8f8/0xb80
 process_invite_request+0x108/0x2b0
 process_sip_msg+0x5da/0xf50
 sip_help_tcp+0x45e/0x780
 nf_confirm+0x34d/0x990
 [..]

Fixes: 9fafcd7b2032 ("[NETFILTER]: nf_conntrack/nf_nat: add SIP helper port")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonetfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
Xiang Mei [Tue, 14 Apr 2026 22:14:01 +0000 (15:14 -0700)] 
netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO

nf_osf_match_one() computes ctx->window % f->wss.val in the
OSF_WSS_MODULO branch with no guard for f->wss.val == 0. A
CAP_NET_ADMIN user can add such a fingerprint via nfnetlink; a
subsequent matching TCP SYN divides by zero and panics the kernel.

Reject the bogus fingerprint in nfnl_osf_add_callback() above the
per-option for-loop. f->wss is per-fingerprint, not per-option, so
the check must run regardless of f->opt_num (including 0). Also
reject wss.wc >= OSF_WSS_MAX; nf_osf_match_one() already treats that
as "should not happen".
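
A minimal sketch of this kind of insertion-time validation, assuming
simplified stand-in types: enum wss_check, struct wss and validate_wss()
are hypothetical and only mirror the shape of the check described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the window-size check kinds. */
enum wss_check { WSS_PLAIN, WSS_MSS, WSS_MTU, WSS_MODULO, WSS_MAX };

struct wss {
	uint32_t wc;	/* which check to apply */
	uint32_t val;	/* operand for that check */
};

/* Validate once at insertion so the packet path never divides by zero. */
static int validate_wss(const struct wss *w)
{
	if (w->wc >= WSS_MAX)
		return -EINVAL;
	if (w->wc == WSS_MODULO && w->val == 0)
		return -EINVAL;
	return 0;
}
```

Because the check depends only on the fingerprint, it runs once per
fingerprint and not inside any per-option loop.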

Crash:
 Oops: divide error: 0000 [#1] SMP KASAN NOPTI
 RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
 Call Trace:
 <IRQ>
  nf_osf_match (net/netfilter/nfnetlink_osf.c:220)
  xt_osf_match_packet (net/netfilter/xt_osf.c:32)
  ipt_do_table (net/ipv4/netfilter/ip_tables.c:348)
  nf_hook_slow (net/netfilter/core.c:622)
  ip_local_deliver (net/ipv4/ip_input.c:265)
  ip_rcv (include/linux/skbuff.h:1162)
  __netif_receive_skb_one_core (net/core/dev.c:6181)
  process_backlog (net/core/dev.c:6642)
  __napi_poll (net/core/dev.c:7710)
  net_rx_action (net/core/dev.c:7945)
  handle_softirqs (kernel/softirq.c:622)

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonetfilter: nft_osf: restrict it to ipv4
Pablo Neira Ayuso [Tue, 14 Apr 2026 11:06:38 +0000 (13:06 +0200)] 
netfilter: nft_osf: restrict it to ipv4

This expression only supports ipv4; restrict it.

Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3 weeks agonet: mctp: fix don't require received header reserved bits to be zero
Yuan Zhaoming [Fri, 17 Apr 2026 14:13:40 +0000 (22:13 +0800)] 
net: mctp: fix don't require received header reserved bits to be zero

From the MCTP Base specification (DSP0236 v1.2.1), the first byte of
the MCTP header contains a 4 bit reserved field, and 4 bit version.

On our current receive path, we require those 4 reserved bits to be
zero, but the 9500-8i card is non-conformant, and may set these
reserved bits.

DSP0236 states that the reserved bits must be written as zero, and
ignored when read. While the device might not conform to the former,
we should accept these messages to conform to the latter.

Relax our check on the MCTP version byte to allow non-zero bits in the
reserved field.
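
A minimal sketch of the relaxed check, assuming the version occupies the
low four bits of the first header byte and that 1 is the only supported
version. HDR_VER_MASK, VER_SUPPORTED and both helpers are hypothetical
names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_VER_MASK	0x0f	/* assumption: version in the low nibble */
#define VER_SUPPORTED	1	/* assumption: single supported version */

/* Strict check: any set reserved bit makes the whole byte mismatch. */
static bool ver_ok_strict(uint8_t first_byte)
{
	return first_byte == VER_SUPPORTED;
}

/* Relaxed check: ignore the reserved bits and compare only the version. */
static bool ver_ok_relaxed(uint8_t first_byte)
{
	return (first_byte & HDR_VER_MASK) == VER_SUPPORTED;
}
```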

Fixes: 889b7da23abf ("mctp: Add initial routing framework")
Signed-off-by: Yuan Zhaoming <yuanzm2@lenovo.com>
Cc: stable@vger.kernel.org
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260417141340.5306-1-yuanzhaoming901030@126.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agogtp: disable BH before calling udp_tunnel_xmit_skb()
David Carlier [Fri, 17 Apr 2026 05:54:08 +0000 (06:54 +0100)] 
gtp: disable BH before calling udp_tunnel_xmit_skb()

gtp_genl_send_echo_req() runs as a generic netlink doit handler in
process context with BH not disabled. It calls udp_tunnel_xmit_skb(),
which eventually invokes iptunnel_xmit(); that uses __this_cpu_inc/dec
on softnet_data.xmit.recursion to track the tunnel xmit recursion level.

Without local_bh_disable(), the task may migrate between
dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the
per-CPU counter pairing. The result is stale or negative recursion
levels that can later produce false-positive
SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU.

The other udp_tunnel_xmit_skb() call sites in gtp.c are unaffected:
the data path runs under ndo_start_xmit and the echo response handlers
run from the UDP encap rx softirq, both with BH already disabled.

Fix it by disabling BH around the udp_tunnel_xmit_skb() call, mirroring
commit 2cd7e6971fc2 ("sctp: disable BH before calling
udp_tunnel_xmit_skb()").

Fixes: 6f1a9140ecda ("net: add xmit recursion limit to tunnel xmit functions")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260417055408.4667-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agohv_sock: Report EOF instead of -EIO for FIN
Dexuan Cui [Thu, 16 Apr 2026 19:14:33 +0000 (12:14 -0700)] 
hv_sock: Report EOF instead of -EIO for FIN

Commit f0c5827d07cb introduced a regression in FIN handling: the final
read syscall gets an error rather than 0.

Ideally, we would want to fix hvs_channel_readable_payload() so that it
could return 0 in the FIN scenario, but it's not good for the hv_sock
driver to use the VMBus ringbuffer's cached priv_read_index, which is
internal data in the VMBus driver.

Fix the regression in hv_sock by returning 0 rather than -EIO.

Fixes: f0c5827d07cb ("hv_sock: Return the readable bytes in hvs_stream_has_data()")
Cc: stable@vger.kernel.org
Reported-by: Ben Hillis <Ben.Hillis@microsoft.com>
Reported-by: Mitchell Levy <levymitchell0@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260416191433.840637-1-decui@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: airoha: Fix possible TX queue stall in airoha_qdma_tx_napi_poll()
Lorenzo Bianconi [Thu, 16 Apr 2026 10:30:12 +0000 (12:30 +0200)] 
net: airoha: Fix possible TX queue stall in airoha_qdma_tx_napi_poll()

Since multiple net_device TX queues can share the same hw QDMA TX queue,
a net_device TX queue can be stopped in the xmit path because the hw QDMA
TX queue is full, with no guarantee that any of its own packets are
inflight in hw. In this corner case the net_device TX queue is never
re-activated. To avoid any potential net_device TX queue stall, wake all
the net_device TX queues feeding the same hw QDMA TX queue in the
airoha_qdma_tx_napi_poll routine.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260416-airoha-txq-potential-stall-v2-1-42c732074540@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoopenvswitch: cap upcall PID array size and pre-size vport replies
Weiming Shi [Thu, 16 Apr 2026 02:46:54 +0000 (19:46 -0700)] 
openvswitch: cap upcall PID array size and pre-size vport replies

The vport netlink reply helpers allocate a fixed-size skb with
nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID
array via ovs_vport_get_upcall_portids().  Since
ovs_vport_set_upcall_portids() accepts any non-zero multiple of
sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID
array large enough to overflow the reply buffer, causing nla_put() to
fail with -EMSGSIZE and hitting BUG_ON(err < 0).  On systems with
unprivileged user namespaces enabled (e.g., Ubuntu default), this is
reachable via unshare -Urn since OVS vport mutation operations use
GENL_UNS_ADMIN_PERM.

 kernel BUG at net/openvswitch/datapath.c:2414!
 Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
 CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1
 RIP: 0010:ovs_vport_cmd_set+0x34c/0x400
 Call Trace:
  <TASK>
  genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116)
  genl_rcv_msg (net/netlink/genetlink.c:1194)
  netlink_rcv_skb (net/netlink/af_netlink.c:2550)
  genl_rcv (net/netlink/genetlink.c:1219)
  netlink_unicast (net/netlink/af_netlink.c:1344)
  netlink_sendmsg (net/netlink/af_netlink.c:1894)
  __sys_sendto (net/socket.c:2206)
  __x64_sys_sendto (net/socket.c:2209)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  </TASK>
 Kernel panic - not syncing: Fatal exception

Reject attempts to set more PIDs than nr_cpu_ids in
ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply
size in ovs_vport_cmd_msg_size() based on that bound, similar to the
existing ovs_dp_cmd_msg_size().  nr_cpu_ids matches the cap already
used by the per-CPU dispatch configuration on the datapath side
(ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the
two sides stay consistent.
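
A minimal sketch of the bound, assuming a helper that only sees the
attribute length in bytes and the CPU count. check_portids_len() is a
hypothetical name, and the -EINVAL return value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Accept only a non-zero multiple of sizeof(u32) that holds at most
 * nr_cpus entries, so the worst-case reply size is known in advance.
 */
static int check_portids_len(size_t len, unsigned int nr_cpus)
{
	if (len == 0 || len % sizeof(uint32_t))
		return -EINVAL;
	if (len / sizeof(uint32_t) > nr_cpus)
		return -EINVAL;
	return 0;
}
```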

Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260416024653.153456-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/mlx5: Fix HCA caps leak on notifier init failure
Prathamesh Deshpande [Wed, 15 Apr 2026 00:49:37 +0000 (01:49 +0100)] 
net/mlx5: Fix HCA caps leak on notifier init failure

mlx5_mdev_init() allocates HCA caps via mlx5_hca_caps_alloc() before
calling mlx5_notifiers_init(). If notifier initialization fails, the
error path jumps to err_hca_caps and skips mlx5_hca_caps_free(), leaking
allocated caps.

Add a dedicated unwind label for notifier-init failure that frees HCA
caps before continuing the existing cleanup sequence.
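
The unwind pattern can be sketched in userspace as follows. All names
here (caps_alloc(), notifiers_init(), dev_init(), the labels) are
hypothetical stand-ins, and a counter replaces the real allocation so
the leak is observable.

```c
#include <assert.h>
#include <errno.h>

static int caps_live;	/* number of outstanding caps allocations */

static int caps_alloc(void)
{
	caps_live++;
	return 0;
}

static void caps_free(void)
{
	caps_live--;
}

static int notifiers_init(int fail)
{
	return fail ? -ENOMEM : 0;
}

static int dev_init(int notifier_fails)
{
	int err;

	err = caps_alloc();
	if (err)
		goto err_caps;

	err = notifiers_init(notifier_fails);
	if (err)
		goto err_notifiers;	/* dedicated label frees the caps */

	return 0;

err_notifiers:
	caps_free();
err_caps:
	/* the earlier cleanup steps would continue here */
	return err;
}
```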

Fixes: b6b03097f982 ("net/mlx5: Initialize events outside devlink lock")
Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260415005022.34764-1-prathameshdeshpande7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agopppoe: drop PFC frames
Qingfang Deng [Wed, 15 Apr 2026 02:24:51 +0000 (10:24 +0800)] 
pppoe: drop PFC frames

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the current PPPoE driver assumes an
uncompressed (2-byte) protocol field. However, the generic PPP layer
function ppp_input() is not aware of the negotiation result, and still
accepts PFC frames.

If a peer with a broken implementation or an attacker sends a frame with
a compressed (1-byte) protocol field, the subsequent PPP payload is
shifted by one byte. This causes the network header to be 4-byte
misaligned, which may trigger unaligned access exceptions on some
architectures.

To reduce the attack surface, drop PPPoE PFC frames. Introduce
ppp_skb_is_compressed_proto() helper function to be used in both
ppp_generic.c and pppoe.c to avoid open-coding.
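
As a rough illustration of how such a check can work: PPP protocol
numbers are assigned so that the first octet of the 2-byte form is even
and the last octet is odd, so a frame whose first protocol byte is odd
carries a compressed 1-byte field. The helper name proto_is_compressed()
is hypothetical and operates on a plain byte pointer, not an skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the protocol field starting at data[0] is PFC-compressed. */
static bool proto_is_compressed(const uint8_t *data)
{
	return data[0] & 0x01;
}
```

For example, IPv4 (0x0021) sent uncompressed starts with 0x00, while the
compressed form starts with 0x21.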

Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoflow_dissector: do not dissect PPPoE PFC frames
Qingfang Deng [Wed, 15 Apr 2026 02:24:50 +0000 (10:24 +0800)] 
flow_dissector: do not dissect PPPoE PFC frames

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the flow dissector driver has assumed an
uncompressed frame until the blamed commit.

During the review process of that commit [1], support for PFC was
suggested. However, having a compressed (1-byte) protocol field means
the subsequent PPP payload is shifted by one byte, causing 4-byte
misalignment for the network header and an unaligned access exception
on some architectures.

The exception can be reproduced by sending a PPPoE PFC frame to an
ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE
session is active on that interface:

$ 0   : 00000000 80c40000 00000000 85144817
$ 4   : 00000008 00000100 80a75758 81dc9bb8
$ 8   : 00000010 8087ae2c 0000003d 00000000
$12   : 000000e0 00000039 00000000 00000000
$16   : 85043240 80a75758 81dc9bb8 00006488
$20   : 0000002f 00000007 85144810 80a70000
$24   : 81d1bda0 00000000
$28   : 81dc8000 81dc9aa8 00000000 805ead08
Hi    : 00009d51
Lo    : 2163358a
epc   : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50
ra    : 805ead08 __skb_get_hash_net+0x74/0x12c
Status: 11000403        KERNEL EXL IE
Cause : 40800010 (ExcCode 04)
BadVA : 85144817
PrId  : 0001992f (MIPS 1004Kc)
Call Trace:
[<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50
[<805ead08>] __skb_get_hash_net+0x74/0x12c
[<805ef330>] get_rps_cpu+0x1b8/0x3fc
[<805fca70>] netif_receive_skb_list_internal+0x324/0x364
[<805fd120>] napi_complete_done+0x68/0x2a4
[<8058de5c>] mtk_napi_rx+0x228/0xfec
[<805fd398>] __napi_poll+0x3c/0x1c4
[<805fd754>] napi_threaded_poll_loop+0x234/0x29c
[<805fd848>] napi_threaded_poll+0x8c/0xb0
[<80053544>] kthread+0x104/0x12c
[<80002bd8>] ret_from_kernel_thread+0x14/0x1c

Code: 02d51821  1060045b  00000000 <8c640000> 3084000f  2c820005  144001a2  00042080  8e220000

To reduce the attack surface and maintain performance, do not process
PPPoE PFC frames.

[1] https://lore.kernel.org/r/20220630231016.GA392@debian.home
Fixes: 46126db9c861 ("flow_dissector: Add PPPoE dissectors")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260415022456.141758-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agosctp: fix OOB write to userspace in sctp_getsockopt_peer_auth_chunks
Michael Bommarito [Thu, 16 Apr 2026 03:19:03 +0000 (23:19 -0400)] 
sctp: fix OOB write to userspace in sctp_getsockopt_peer_auth_chunks

sctp_getsockopt_peer_auth_chunks() checks that the caller's optval
buffer is large enough for the peer AUTH chunk list with

    if (len < num_chunks)
            return -EINVAL;

but then writes num_chunks bytes to p->gauth_chunks, which lives
at offset offsetof(struct sctp_authchunks, gauth_chunks) == 8
inside optval.  The check is missing the sizeof(struct
sctp_authchunks) = 8-byte header.  When the caller supplies
len == num_chunks (for any num_chunks > 0) the test passes but
copy_to_user() writes sizeof(struct sctp_authchunks) = 8 bytes
past the declared buffer.

The sibling function sctp_getsockopt_local_auth_chunks() at the
next line already has the correct check:

    if (len < sizeof(struct sctp_authchunks) + num_chunks)
            return -EINVAL;

Align the peer variant with its sibling.
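
The two checks can be compared in a small userspace sketch. struct
authchunks_hdr is a hypothetical 8-byte stand-in for the fixed part of
struct sctp_authchunks, and both helper names are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte header preceding the chunk list. */
struct authchunks_hdr {
	uint32_t assoc_id;
	uint32_t number_of_chunks;
};

/* Old peer check: forgets the header in front of the chunk list. */
static int check_len_old(size_t len, size_t num_chunks)
{
	return len < num_chunks ? -EINVAL : 0;
}

/* Fixed check: the buffer must hold the header plus the chunk list. */
static int check_len_fixed(size_t len, size_t num_chunks)
{
	return len < sizeof(struct authchunks_hdr) + num_chunks ? -EINVAL : 0;
}
```

With len == num_chunks == 3, the old check passes although 11 bytes are
needed; the fixed check rejects anything shorter than 11.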

Reproducer confirms on v7.0-13-generic: an unprivileged userspace
caller that opens a loopback SCTP association with AUTH enabled,
queries num_chunks with a short optval, then issues the real
getsockopt with len == num_chunks and sentinel bytes painted past
the buffer observes those sentinel bytes overwritten with the
peer's AUTH chunk type.  The bytes written are under the peer's
control but land in the caller's own userspace; this is not a
kernel memory corruption, but it is a kernel-side contract
violation that can silently corrupt adjacent userspace data.

Fixes: 65b07e5d0d09 ("[SCTP]: API updates to suport SCTP-AUTH extensions.")
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20260416031903.1447072-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: ks8851: Avoid excess softirq scheduling
Marek Vasut [Wed, 15 Apr 2026 23:09:45 +0000 (01:09 +0200)] 
net: ks8851: Avoid excess softirq scheduling

The code injects a packet into netif_rx() repeatedly, which will add
it to its internal NAPI and schedule a softirq, and process it. It is
more efficient to queue multiple packets and process them all at the
local_bh_enable() time.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: e0863634bf9f ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs")
Cc: stable@vger.kernel.org
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260415231020.455298-2-marex@nabladev.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: ks8851: Reinstate disabling of BHs around IRQ handler
Marek Vasut [Wed, 15 Apr 2026 23:09:44 +0000 (01:09 +0200)] 
net: ks8851: Reinstate disabling of BHs around IRQ handler

If the driver executes ks8851_irq() AND a TX packet has been sent, then
the driver enables TX queue via netif_wake_queue() which schedules TX
softirq to queue packets for this device.

If CONFIG_PREEMPT_RT=y is set AND a packet has also been received by
the MAC, then ks8851_rx_pkts() calls netdev_alloc_skb_ip_align() to
allocate SKBs for the received packets. If netdev_alloc_skb_ip_align()
is called with BH enabled, then local_bh_enable() at the end of
netdev_alloc_skb_ip_align() will trigger the pending softirq processing,
which may ultimately call the .xmit callback ks8851_start_xmit_par().
The ks8851_start_xmit_par() will try to lock struct ks8851_net_par
.lock spinlock, which is already locked by ks8851_irq() from which
ks8851_start_xmit_par() was called. This leads to a deadlock, which
is reported by the kernel, including a trace listed below.

If CONFIG_PREEMPT_RT is not set, then since commit 0913ec336a6c0
("net: ks8851: Fix deadlock with the SPI chip variant") the deadlock
can also be triggered without received packet in the RX FIFO. The
pending softirqs will be processed on return from
spin_unlock_bh(&ks->statelock) in ks8851_irq(), which triggers the
deadlock as well.

Fix the problem by disabling BH around critical sections, including the
IRQ handler, thus preventing the net_tx_action() softirq from triggering
during these critical sections. The net_tx_action() softirq is triggered
once BH are re-enabled and at the end of the IRQ handler, once all the
other IRQ handler actions have been completed.

 __schedule from schedule_rtlock+0x1c/0x34
 schedule_rtlock from rtlock_slowlock_locked+0x548/0x904
 rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c
 rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8
 ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44
 netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188
 dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c
 sch_direct_xmit from __qdisc_run+0x1f8/0x4ec
 __qdisc_run from qdisc_run+0x1c/0x28
 qdisc_run from net_tx_action+0x1f0/0x268
 net_tx_action from handle_softirqs+0x1a4/0x270
 handle_softirqs from __local_bh_enable_ip+0xcc/0xe0
 __local_bh_enable_ip from __alloc_skb+0xd8/0x128
 __alloc_skb from __netdev_alloc_skb+0x3c/0x19c
 __netdev_alloc_skb from ks8851_irq+0x388/0x4d4
 ks8851_irq from irq_thread_fn+0x24/0x64
 irq_thread_fn from irq_thread+0x178/0x28c
 irq_thread from kthread+0x12c/0x138
 kthread from ret_from_fork+0x14/0x28

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: e0863634bf9f ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs")
Cc: stable@vger.kernel.org
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260415231020.455298-1-marex@nabladev.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoaf_unix: Drop all SCM attributes for SOCKMAP.
Kuniyuki Iwashima [Wed, 15 Apr 2026 18:48:29 +0000 (18:48 +0000)] 
af_unix: Drop all SCM attributes for SOCKMAP.

SOCKMAP can hide inflight fd from AF_UNIX GC.

When a socket in SOCKMAP receives skb with inflight fd,
sk_psock_verdict_data_ready() looks up the mapped socket and
enqueues skb to its psock->ingress_skb.

Since neither the old nor the new GC can inspect the psock
queue, the hidden skb leaks the inflight sockets.  Note that
this cannot be detected via kmemleak because inflight sockets
are linked to a global list.

In addition, SOCKMAP redirect breaks the Tarjan-based GC's
assumption that unix_edge.successor is always alive, which
is no longer true once skb is redirected, resulting in
use-after-free below. [0]

Moreover, SOCKMAP does not call scm_stat_del() properly,
so unix_show_fdinfo() could report an incorrect fd count.

sk_msg_recvmsg() does not support any SCM attributes in the
first place.

Let's drop all SCM attributes before passing skb to the
SOCKMAP layer.

[0]:
BUG: KASAN: slab-use-after-free in unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251)
Read of size 8 at addr ffff888125362670 by task kworker/56:1/496

CPU: 56 UID: 0 PID: 496 Comm: kworker/56:1 Not tainted 7.0.0-rc7-00263-gb9d8b856689d #3 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
Workqueue: events sk_psock_backlog
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:379)
 kasan_report (mm/kasan/report.c:597)
 unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251)
 unix_destroy_fpl (net/unix/garbage.c:317)
 unix_destruct_scm (./include/net/scm.h:80 ./include/net/scm.h:86 net/unix/af_unix.c:1976)
 sk_psock_backlog (./include/linux/skbuff.h:?)
 process_scheduled_works (kernel/workqueue.c:?)
 worker_thread (kernel/workqueue.c:?)
 kthread (kernel/kthread.c:438)
 ret_from_fork (arch/x86/kernel/process.c:164)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
 </TASK>

Allocated by task 955:
 kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
 __kasan_slab_alloc (mm/kasan/common.c:369)
 kmem_cache_alloc_noprof (mm/slub.c:4539)
 sk_prot_alloc (net/core/sock.c:2240)
 sk_alloc (net/core/sock.c:2301)
 unix_create1 (net/unix/af_unix.c:1099)
 unix_create (net/unix/af_unix.c:1169)
 __sock_create (net/socket.c:1606)
 __sys_socketpair (net/socket.c:1811)
 __x64_sys_socketpair (net/socket.c:1863 net/socket.c:1860 net/socket.c:1860)
 do_syscall_64 (arch/x86/entry/syscall_64.c:?)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 496:
 kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
 kasan_save_free_info (mm/kasan/generic.c:587)
 __kasan_slab_free (mm/kasan/common.c:287)
 kmem_cache_free (mm/slub.c:6165)
 __sk_destruct (net/core/sock.c:2282 net/core/sock.c:2384)
 sk_psock_destroy (./include/net/sock.h:?)
 process_scheduled_works (kernel/workqueue.c:?)
 worker_thread (kernel/workqueue.c:?)
 kthread (kernel/kthread.c:438)
 ret_from_fork (arch/x86/kernel/process.c:164)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:258)

Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Fixes: 77462de14a43 ("af_unix: Add read_sock for stream socket types")
Reported-by: Xingyu Jin <xingyuj@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260415184830.3988432-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: Update default_an_inband before passing value to phylink_config
KhaiWenTan [Thu, 16 Apr 2026 10:26:09 +0000 (18:26 +0800)] 
net: stmmac: Update default_an_inband before passing value to phylink_config

get_interfaces() updates both plat->phy_interfaces and
mdio_bus_data->default_an_inband based on reading a SERDES register.
Because get_interfaces() was called after default_an_inband had already
been read, dwmac-intel regressed, with an incorrect default_an_inband
value in phylink_config.

Therefore, call priv->plat->get_interfaces() before assigning
priv->plat->default_an_inband to config->default_an_inband, so that
default_an_inband holds the correct value.

Fixes: d3836052fe09 ("net: stmmac: intel: convert speed_mode_2500() to get_interfaces()")
Signed-off-by: KhaiWenTan <khai.wen.tan@linux.intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260416102609.7953-1-khai.wen.tan@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoipv6: fix possible UAF in icmpv6_rcv()
Eric Dumazet [Thu, 16 Apr 2026 10:35:05 +0000 (10:35 +0000)] 
ipv6: fix possible UAF in icmpv6_rcv()

Caching saddr and daddr before pskb_pull() is problematic
since skb->head can change.

Remove these temporary variables:

- We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr
  when net_dbg_ratelimited() is called in the slow path.

- Avoid potential future misuse after pskb_pull() call.
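
The hazard can be sketched in userspace with a buffer whose backing
storage may move, loosely analogous to skb->head. struct buf, buf_grow()
and buf_field() are hypothetical; the point is only that field addresses
are derived from the current head on each use, never cached across a
call that can reallocate.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer whose backing storage can move. */
struct buf {
	unsigned char *head;
	size_t len;
};

/* Grow the buffer by allocating new storage; the old block is freed. */
static int buf_grow(struct buf *b, size_t new_len)
{
	unsigned char *n = malloc(new_len);

	if (!n)
		return -1;
	memcpy(n, b->head, b->len < new_len ? b->len : new_len);
	free(b->head);
	b->head = n;
	b->len = new_len;
	return 0;
}

/* Safe accessor: compute the field address from the current head. */
static const unsigned char *buf_field(const struct buf *b, size_t off)
{
	return b->head + off;
}
```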

Fixes: 4b3418fba0fe ("ipv6: icmp: include addresses in debug messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260416103505.2380753-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'intel-wired-lan-driver-updates-2026-04-14-ice-i40e-iavf-idpf-e1000e'
Jakub Kicinski [Sat, 18 Apr 2026 19:01:41 +0000 (12:01 -0700)] 
Merge branch 'intel-wired-lan-driver-updates-2026-04-14-ice-i40e-iavf-idpf-e1000e'

Jacob Keller says:

====================
Intel Wired LAN Driver Updates 2026-04-14 (ice, i40e, iavf, e1000e)

Grzegorz updates the logic for adjusting the PTP hardware clock on E830,
fixing a bug that prevented adjustments below S32_MAX/MIN nanoseconds.

Grzegorz and Zoli update the PCS latency settings for E825 devices at 10GbE
and 25GbE, improving the accuracy of timestamps based on data from
production hardware.

Michal Schmidt fixes a double-free that could happen if a particular error
path is taken in ice_xmit_frame_ring().

Guangshuo fixes a double-free that could happen during error paths in the
ice_sf_eth_activate() function.

Paul Greenwalt fixes the PHY link configuration when the link-down-on-close
driver parameter is enabled and new media is inserted.

Paul Greenwalt fixes the ICE_AQ_LINK_SPEED_M macro for 200G, enabling 200G
link speed advertisement.

Keita Morisaki fixes a race condition in the ice Tx timestamp ring cleanup,
preventing a possible NULL pointer dereference.

Kohei Enju fixes a potential NULL pointer dereference in ice_set_ring_param().

Kohei Enju fixes i40e to stop advertising IFF_SUPP_NOFCS, when the driver
does not actually support the feature.

Petr fixes the VLAN L2TAG2 mask when the iAVF VF and a PF negotiate use of
the legacy Rx descriptor format.

Matt fixes the unrolling logic for PTP when the e1000e probe fails after
the PTP clock has been registered.

 **A note to stable backports**

  The patches [7/12] ("ice: fix race condition in TX timestamp ring
  cleanup") and [8/12] ("ice: fix potential NULL pointer deref in error
  path of ice_set_ringparam()") must be backported together. Otherwise the
  fix in patch 8 will not work properly.
====================

Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoe1000e: Unroll PTP in probe error handling
Matt Vollrath [Fri, 17 Apr 2026 00:53:36 +0000 (17:53 -0700)] 
e1000e: Unroll PTP in probe error handling

If probe fails after registering the PTP clock and its delayed work,
these resources must be released.

This was not an issue until a 2016 fix moved the e1000e_ptp_init() call
before the jump to err_register.

Fixes: aa524b66c5ef ("e1000e: don't modify SYSTIM registers during SIOCSHWTSTAMP ioctl")
Signed-off-by: Matt Vollrath <tactii@gmail.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-12-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoiavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2
Petr Oros [Fri, 17 Apr 2026 00:53:34 +0000 (17:53 -0700)] 
iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2

The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as
GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the
16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2
(VLAN tag) occupies bits 63:48 of qw2, not 63:32.

The oversized mask causes FIELD_GET to return a 32-bit value where the
actual VLAN tag sits in bits 31:16. When this value is passed to
iavf_receive_skb() as a u16 parameter, it gets truncated to the lower
16 bits (which contain the 1st L2TAG2, typically zero). As a result,
__vlan_hwaccel_put_tag() is never called and software VLAN interfaces
on VFs receive no traffic.

This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF
advertises VLAN stripping into L2TAG2_2 and legacy descriptors are
used.

The flex descriptor path already uses the correct mask
(IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)).
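
The mask arithmetic above can be checked with a small model. The Python
sketch below is only an illustration, not driver code: genmask_ull() and
field_get() are hypothetical stand-ins for the kernel's GENMASK_ULL() and
FIELD_GET() macros, and qw2 is a made-up descriptor word that carries
VLAN 198 in bits 63:48 with bits 47:32 left at zero.

```python
U16_MASK = 0xFFFF  # models passing the extracted value through a u16 parameter


def genmask_ull(high, low):
    """Hypothetical stand-in for GENMASK_ULL(high, low)."""
    return ((1 << (high - low + 1)) - 1) << low


def field_get(mask, value):
    """Hypothetical stand-in for FIELD_GET(): apply mask, shift down to bit 0."""
    shift = (mask & -mask).bit_length() - 1
    return (value & mask) >> shift


# Made-up legacy descriptor qw2: VLAN 198 in bits 63:48, bits 47:32 are zero.
qw2 = 198 << 48

# Oversized mask: the tag ends up in bits 31:16 and is lost by u16 truncation.
wrong_tag = field_get(genmask_ull(63, 32), qw2) & U16_MASK
# Corrected mask: the tag lands in bits 15:0 and survives.
right_tag = field_get(genmask_ull(63, 48), qw2) & U16_MASK

print(wrong_tag, right_tag)
```

In this model the oversized mask yields 0 and the corrected mask yields
198, which matches the truncation described above.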

Reproducer:
 1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs)
 2. Disable spoofchk on both VFs
 3. Move each VF into a separate network namespace
 4. On each VF: create VLAN interface (e.g. vlan 198), assign IP,
    bring up
 5. Set rx-vlan-offload OFF on both VFs
 6. Ping between VLAN interfaces -> expect PASS
    (VLAN tag stays in packet data, kernel matches in-band)
 7. Set rx-vlan-offload ON on both VFs
 8. Ping between VLAN interfaces -> expect FAIL if bug present
    (HW strips VLAN tag into descriptor L2TAG2 field, wrong mask
    extracts bits 47:32 instead of 63:48, truncated to u16 -> zero,
    __vlan_hwaccel_put_tag() never called, packet delivered to parent
    interface, not VLAN interface)

The reproducer requires legacy Rx descriptors. On modern ice + iavf
with full PTP support, flex descriptors are always negotiated and the
buggy legacy path is never reached. Flex descriptors require all of:
 - CONFIG_PTP_1588_CLOCK enabled
 - VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF
 - PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP)
 - VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported
 - VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile

If any condition is not met, iavf_select_rx_desc_format() falls back
to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit.

Fixes: 2dc8e7c36d80 ("iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-10-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoi40e: don't advertise IFF_SUPP_NOFCS
Kohei Enju [Fri, 17 Apr 2026 00:53:33 +0000 (17:53 -0700)] 
i40e: don't advertise IFF_SUPP_NOFCS

i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS
socket option. However, this option is silently ignored, as the driver
does not check skb->no_fcs, and always enables FCS insertion offload.

Fix this by removing the advertisement of IFF_SUPP_NOFCS.

This behavior can be reproduced with a simple AF_PACKET socket:

  import socket
  s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
  s.setsockopt(socket.SOL_SOCKET, 43, 1) # SO_NOFCS
  s.bind(("eth0", 0))
  s.send(b'\xff' * 64)

Previously, send() succeeds but the driver ignores SO_NOFCS.
With this change, send() fails with -EPROTONOSUPPORT, as expected.

Fixes: 41c445ff0f48 ("i40e: main driver core")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-9-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix potential NULL pointer deref in error path of ice_set_ringparam()
Kohei Enju [Fri, 17 Apr 2026 00:53:32 +0000 (17:53 -0700)] 
ice: fix potential NULL pointer deref in error path of ice_set_ringparam()

ice_set_ringparam nullifies tstamp_ring of temporary tx_rings, without
clearing ICE_TX_RING_FLAGS_TXTIME bit.
When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent
ice_setup_tx_ring() call fails, a NULL pointer dereference could happen
in the unwinding sequence:

ice_clean_tx_ring()
-> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set)
-> ice_free_tx_tstamp_ring()
  -> ice_free_tstamp_ring()
    -> tstamp_ring->desc (NULL deref)

Clear ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue.

Note that this potential issue is found by manual code review.
Compile test only since unfortunately I don't have E830 devices.

Fixes: ccde82e90946 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-8-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix race condition in TX timestamp ring cleanup
Keita Morisaki [Fri, 17 Apr 2026 00:53:31 +0000 (17:53 -0700)] 
ice: fix race condition in TX timestamp ring cleanup

Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.

ice_free_tx_tstamp_ring currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. This could allow a concurrent ice_tx_map
call on another CPU to dereference the tstamp_ring, which could lead to
a NULL pointer dereference.

  CPU A:ice_free_tx_tstamp_ring() | CPU B:ice_tx_map()
  --------------------------------|---------------------------------
  tx_ring->tstamp_ring = NULL     |
                                  | ice_is_txtime_cfg() -> true
                                  | tstamp_ring = tx_ring->tstamp_ring
                                  | tstamp_ring->count  // NULL deref!
  flags &= ~ICE_TX_FLAGS_TXTIME   |

Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before
   NULLing the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
   flag read before the pointer read, using READ_ONCE() for the
   pointer, and adding a NULL check as a safety net.
3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using
   atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag
   operations throughout the driver:
   - ICE_TX_RING_FLAGS_XDP
   - ICE_TX_RING_FLAGS_VLAN_L2TAG1
   - ICE_TX_RING_FLAGS_VLAN_L2TAG2
   - ICE_TX_RING_FLAGS_TXTIME
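
The ordering argument can be explored with a toy model. The Python sketch
below is an assumption-laden illustration, not the driver code: it treats
every step as atomic and sequentially consistent, so it says nothing
about smp_wmb()/smp_rmb() themselves, and all names in it are
hypothetical. It enumerates every interleaving of the two CPUs from the
diagram and counts the schedules in which CPU B would dereference NULL.

```python
from itertools import combinations


def interleavings(writer, reader):
    """Yield every merge of the two step lists that keeps each list's own order."""
    total = len(writer) + len(reader)
    for slots in combinations(range(total), len(writer)):
        w, r = iter(writer), iter(reader)
        yield [next(w) if i in slots else next(r) for i in range(total)]


# CPU A steps (model of the free path)
def clear_flag(ring, cpu_b):
    ring["txtime"] = False


def null_ring(ring, cpu_b):
    ring["tstamp_ring"] = None


# CPU B steps (model of the transmit path)
def check_flag(ring, cpu_b):
    cpu_b["saw_flag"] = ring["txtime"]


def load_ring(ring, cpu_b):
    cpu_b["ptr"] = ring["tstamp_ring"]


def make_use(null_check):
    def use_ring(ring, cpu_b):
        if not cpu_b["saw_flag"]:
            return  # TxTime path not taken
        if null_check and cpu_b["ptr"] is None:
            return  # safety net: skip a ring that is already gone
        if cpu_b["ptr"] is None:
            cpu_b["crashed"] = True  # models tstamp_ring->count on NULL
    return use_ring


def crashing_schedules(writer, reader):
    """Count interleavings in which the reader dereferences NULL."""
    count = 0
    for schedule in interleavings(writer, reader):
        ring = {"txtime": True, "tstamp_ring": object()}
        cpu_b = {"saw_flag": False, "ptr": None, "crashed": False}
        for step in schedule:
            step(ring, cpu_b)
        count += cpu_b["crashed"]
    return count


old_order = crashing_schedules([null_ring, clear_flag],
                               [check_flag, load_ring, make_use(False)])
new_order = crashing_schedules([clear_flag, null_ring],
                               [check_flag, load_ring, make_use(True)])
print(old_order, new_order)
```

In this model the old order has at least one crashing schedule, while the
reordered writer combined with the NULL-checked reader has none.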

Fixes: ccde82e909467 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-7-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix ICE_AQ_LINK_SPEED_M for 200G
Paul Greenwalt [Fri, 17 Apr 2026 00:53:30 +0000 (17:53 -0700)] 
ice: fix ICE_AQ_LINK_SPEED_M for 200G

When setting PHY configuration during driver initialization, 200G link
speed is not being advertised even when the PHY is capable. This is
because the get PHY capabilities link speed response is being masked by
ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit.

ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only
covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so
that it covers all defined link speed bits including 200G.
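
A quick way to see the effect of the mask change is to apply both masks
to a capability word. The Python sketch below is illustrative only:
bit() and genmask() are hypothetical stand-ins for the kernel's BIT() and
GENMASK() macros, and phy_caps is a made-up response with bit 11 (200G)
and one lower speed bit set.

```python
def bit(n):
    """Hypothetical stand-in for BIT(n)."""
    return 1 << n


def genmask(high, low):
    """Hypothetical stand-in for GENMASK(high, low)."""
    return ((1 << (high - low + 1)) - 1) << low


ICE_AQ_LINK_SPEED_200GB = bit(11)
OLD_MASK = 0x7FF            # covers bits 0-10 only
NEW_MASK = genmask(11, 0)   # covers bits 0-11

# Made-up PHY capability response: 200G plus one lower speed bit.
phy_caps = ICE_AQ_LINK_SPEED_200GB | bit(10)

old_advertised = phy_caps & OLD_MASK
new_advertised = phy_caps & NEW_MASK

print(hex(old_advertised), hex(new_advertised))
```

With the old mask the 200G bit is dropped from the result; with the new
mask it is preserved.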

Fixes: 24407a01e57c ("ice: Add 200G speed/phy type use")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-6-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix PHY config on media change with link-down-on-close
Paul Greenwalt [Fri, 17 Apr 2026 00:53:29 +0000 (17:53 -0700)] 
ice: fix PHY config on media change with link-down-on-close

Commit 1a3571b5938c ("ice: restore PHY settings on media insertion")
introduced separate flows for setting PHY configuration on media
present: ice_configure_phy() when link-down-on-close is disabled, and
ice_force_phys_link_state() when enabled. The latter incorrectly uses
the previous configuration even after module change, causing link
issues such as wrong speed or no link.

Unify PHY configuration into a single ice_phy_cfg() function with a
link_en parameter, ensuring PHY capabilities are always fetched fresh
from hardware.

Fixes: 1a3571b5938c ("ice: restore PHY settings on media insertion")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-5-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix double-free of tx_buf skb
Michal Schmidt [Fri, 17 Apr 2026 00:53:28 +0000 (17:53 -0700)] 
ice: fix double-free of tx_buf skb

If ice_tso() or ice_tx_csum() fail, the error path in
ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still points
to it and is marked as valid (ICE_TX_BUF_SKB).
'next_to_use' remains unchanged, so the potential problem will
likely fix itself when the next packet is transmitted and the tx_buf
gets overwritten. But if there is no next packet and the interface is
brought down instead, ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf()
will find the tx_buf and free the skb for the second time.

The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error
path, so that ice_unmap_and_free_tx_buf() no longer treats the tx_buf
as holding an skb.
Move the initialization of 'first' up, to ensure it's already valid in
case we hit the linearization error path.
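
The sequence can be modeled in a few lines. The Python sketch below is a
simplified illustration under stated assumptions, not the driver logic:
the dict-based tx_buf, the Skb class and the helper names are all
hypothetical, and only the "type" bookkeeping described above is modeled.

```python
ICE_TX_BUF_EMPTY = "empty"
ICE_TX_BUF_SKB = "skb"


class Skb:
    """Hypothetical skb that only counts how often it was freed."""

    def __init__(self):
        self.free_count = 0


def free_skb(skb):
    skb.free_count += 1


def xmit_with_offload_failure(tx_buf, skb, reset_type_on_error):
    """Model of a transmit attempt whose offload setup fails."""
    tx_buf["skb"] = skb
    tx_buf["type"] = ICE_TX_BUF_SKB
    # Offload setup fails here: the error path frees the skb.
    free_skb(skb)
    if reset_type_on_error:
        tx_buf["type"] = ICE_TX_BUF_EMPTY  # the behavior added by the fix


def clean_tx_ring(tx_buf):
    """Model of ring cleanup at interface down."""
    if tx_buf["type"] == ICE_TX_BUF_SKB:
        free_skb(tx_buf["skb"])
        tx_buf["type"] = ICE_TX_BUF_EMPTY


def frees_after_failed_xmit_and_ifdown(reset_type_on_error):
    tx_buf = {"skb": None, "type": ICE_TX_BUF_EMPTY}
    skb = Skb()
    xmit_with_offload_failure(tx_buf, skb, reset_type_on_error)
    clean_tx_ring(tx_buf)
    return skb.free_count


print(frees_after_failed_xmit_and_ifdown(False),
      frees_after_failed_xmit_and_ifdown(True))
```

Without the reset the modeled skb is freed twice; with the reset it is
freed once.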

The bug was spotted by AI while I had it looking for something else.
It also proposed an initial version of the patch.

I reproduced the bug and tested the fix by adding code to inject
failures, on a build with KASAN.

I looked for similar bugs in related Intel drivers and did not find any.

Fixes: d76a60ba7afb ("ice: Add support for VLANs and offloads")
Assisted-by: Claude:claude-4.6-opus-high Cursor
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-4-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix double free in ice_sf_eth_activate() error path
Guangshuo Li [Fri, 17 Apr 2026 00:53:27 +0000 (17:53 -0700)] 
ice: fix double free in ice_sf_eth_activate() error path

When auxiliary_device_add() fails, ice_sf_eth_activate() jumps to
aux_dev_uninit and calls auxiliary_device_uninit(&sf_dev->adev).

The device release callback ice_sf_dev_release() frees sf_dev, but
the current error path falls through to sf_dev_free and calls
kfree(sf_dev) again, causing a double free.

Keep kfree(sf_dev) for the auxiliary_device_init() failure path, but
avoid falling through to sf_dev_free after auxiliary_device_uninit().
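
The two error paths can be sketched as a free counter. The Python model
below is hypothetical and heavily simplified: it only mirrors the label
fall-through described above, with fail_at naming the step that fails and
fixed selecting the behavior after this change.

```python
def frees_on_error(fail_at, fixed):
    """Return how many times the modeled sf_dev is freed on an error path."""
    frees = 0

    if fail_at == "init":
        # auxiliary_device_init() failed: sf_dev_free does kfree(sf_dev).
        frees += 1
        return frees

    if fail_at == "add":
        # aux_dev_uninit: auxiliary_device_uninit() runs the release
        # callback, which frees sf_dev.
        frees += 1
        if not fixed:
            # Old code fell through to sf_dev_free and freed it again.
            frees += 1
        return frees

    return frees  # no failure, nothing freed on an error path


print(frees_on_error("add", fixed=False), frees_on_error("add", fixed=True))
```

The init failure path frees once in both variants; only the add failure
path changes.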

Fixes: 13acc5c4cdbe ("ice: subfunction activation and base devlink ops")
Cc: stable@vger.kernel.org
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-3-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: update PCS latency settings for E825 10G/25Gb modes
Grzegorz Nitka [Fri, 17 Apr 2026 00:53:26 +0000 (17:53 -0700)] 
ice: update PCS latency settings for E825 10G/25Gb modes

Update MAC Rx/Tx offset registers settings (PHY_MAC_[RX|TX]_OFFSET
registers) with the data obtained with the latest research. It applies
to PCS latency settings for the following speeds/modes:
* 10Gb NO-FEC
        - TX latency changed from 71.25 ns to 73 ns
        - RX latency changed from -25.6 ns to -28 ns
* 25Gb NO-FEC
        - TX latency changed from 28.17 ns to 33 ns
        - RX latency changed from -12.45 ns to -12 ns
* 25Gb RS-FEC
        - TX latency changed from 64.5 ns to 69 ns
        - RX latency changed from -3.6 ns to -3 ns

The original data came from simulation and pre-production hardware.
The new data measures the actual delays and as such is more accurate.

Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
Co-developed-by: Zoltan Fodor <zoltan.fodor@intel.com>
Signed-off-by: Zoltan Fodor <zoltan.fodor@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-2-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoice: fix 'adjust' timer programming for E830 devices
Grzegorz Nitka [Fri, 17 Apr 2026 00:53:25 +0000 (17:53 -0700)] 
ice: fix 'adjust' timer programming for E830 devices

Fix the incorrect 'adjust the timer' programming sequence for the E830
device series. The current implementation programs only the GLTSYN_SHADJ
shadow registers. According to the specification [1], a write to the
GLTSYN_CMD command register with the CMD field set to "Adjust the Time"
is also required for the timer adjustment to take effect.

The flow was broken for adjustments within the S32_MIN..S32_MAX
nanosecond range (around +/- 2 seconds). Bigger adjustments use the
non-atomic programming flow, which involves set timer programming and is
implemented correctly.
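
To make the "around +/- 2 seconds" figure concrete, the range check can
be sketched as follows. This Python snippet is illustrative only:
fits_s32() is a hypothetical helper, and whether the driver performs
exactly this comparison is an assumption.

```python
S32_MAX = 2**31 - 1
S32_MIN = -(2**31)
NSEC_PER_SEC = 1_000_000_000


def fits_s32(delta_ns):
    """True if a nanosecond adjustment fits a signed 32-bit value."""
    return S32_MIN <= delta_ns <= S32_MAX


# S32_MAX nanoseconds is a little over 2.147 seconds.
print(S32_MAX / NSEC_PER_SEC)
print(fits_s32(2 * NSEC_PER_SEC), fits_s32(3 * NSEC_PER_SEC))
```

A 2 second adjustment, as used in the testing hint below, still fits the
signed 32-bit range, while a 3 second adjustment does not.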

Testing hints:
Run command:
phc_ctl /dev/ptpX get adj 2 get
Expected result:
Returned timestamps differ at least by 2 seconds

[1] Intel® Ethernet Controller E830 Datasheet rev 1.3, chapter 9.7.5.4
https://cdrdv2.intel.com/v1/dl/getContent/787353?explicitVersion=true

Fixes: f00307522786 ("ice: Implement PTP support for E830 devices")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-1-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge tag 'ovpn-net-20260417' of https://github.com/OpenVPN/ovpn-net-next
Jakub Kicinski [Sat, 18 Apr 2026 18:44:11 +0000 (11:44 -0700)] 
Merge tag 'ovpn-net-20260417' of https://github.com/OpenVPN/ovpn-net-next

Antonio Quartulli says:

====================
This batch includes only fixes to the selftest harness:
* switch to TAP test orchestration
* parse slurped notifications as returned by jq -s
* add ovpn_ prefix to helpers and global variables to avoid clashes
* fail test in case of netlink notification mismatch
* add missing kernel config dependencies
* add delay when launching multiple ynl/cli.py listeners

* tag 'ovpn-net-20260417' of https://github.com/OpenVPN/ovpn-net-next:
  selftests: ovpn: serialize YNL listener startup
  selftests: ovpn: align command flow with TAP
  selftests: ovpn: add prefix to helpers and shared variables
  selftests: ovpn: flatten slurped notification JSON before filtering
  selftests: ovpn: fail notification check on mismatch
  selftests: ovpn: add nftables config dependencies for test-mark
====================

Link: https://patch.msgid.link/20260417090305.2775723-1-antonio@openvpn.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'tcp-take-care-of-tcp_get_timestamping_opt_stats-races'
Jakub Kicinski [Sat, 18 Apr 2026 18:10:15 +0000 (11:10 -0700)] 
Merge branch 'tcp-take-care-of-tcp_get_timestamping_opt_stats-races'

Eric Dumazet says:

====================
tcp: take care of tcp_get_timestamping_opt_stats() races

tcp_get_timestamping_opt_stats() does not own the socket lock,
this is intentional.

It calls tcp_get_info_chrono_stats() while other threads could
change chrono fields in tcp_chrono_set(). It also reads many
tcp socket fields that can be modified by other cpus/threads.

I do not think we need a coherent TCP socket state snapshot
in tcp_get_timestamping_opt_stats().

Add READ_ONCE()/WRITE_ONCE() or data_race() annotations.

Note that icsk_ca_state is a bitfield, thus not covered
in this series.
====================

Link: https://patch.msgid.link/20260416200319.3608680-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->plb_rehash
Eric Dumazet [Thu, 16 Apr 2026 20:03:19 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->plb_rehash

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 29c1c44646ae ("tcp: add u32 counter in tcp_sock and an SNMP counter for PLB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-15-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around (tp->write_seq - tp->snd_nxt)
Eric Dumazet [Thu, 16 Apr 2026 20:03:18 +0000 (20:03 +0000)] 
tcp: annotate data-races around (tp->write_seq - tp->snd_nxt)

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() annotations to keep KCSAN happy.

WRITE_ONCE() annotations are already present.

Fixes: e08ab0b377a1 ("tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-14-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->timeout_rehash
Eric Dumazet [Thu, 16 Apr 2026 20:03:17 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->timeout_rehash

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 32efcc06d2a1 ("tcp: export count for rehash attempts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-13-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->srtt_us
Eric Dumazet [Thu, 16 Apr 2026 20:03:16 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->srtt_us

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: e8bd8fca6773 ("tcp: add SRTT to SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->reord_seen
Eric Dumazet [Thu, 16 Apr 2026 20:03:15 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->reord_seen

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7ec65372ca53 ("tcp: add stat of data packet reordering events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->dsack_dups
Eric Dumazet [Thu, 16 Apr 2026 20:03:14 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->dsack_dups

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7e10b6554ff2 ("tcp: add dsack blocks received stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->bytes_retrans
Eric Dumazet [Thu, 16 Apr 2026 20:03:13 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->bytes_retrans

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: fb31c9b9f6c8 ("tcp: add data bytes retransmitted stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->bytes_sent
Eric Dumazet [Thu, 16 Apr 2026 20:03:12 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->bytes_sent

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: ba113c3aa79a ("tcp: add data bytes sent stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: add data-race annotations for TCP_NLA_SNDQ_SIZE
Eric Dumazet [Thu, 16 Apr 2026 20:03:11 +0000 (20:03 +0000)] 
tcp: add data-race annotations for TCP_NLA_SNDQ_SIZE

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 87ecc95d81d9 ("tcp: add send queue size stat in SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->delivered and tp->delivered_ce
Eric Dumazet [Thu, 16 Apr 2026 20:03:10 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->delivered and tp->delivered_ce

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: feb5f2ec6464 ("tcp: export packets delivery info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races around tp->snd_ssthresh
Eric Dumazet [Thu, 16 Apr 2026 20:03:09 +0000 (20:03 +0000)] 
tcp: annotate data-races around tp->snd_ssthresh

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7156d194a077 ("tcp: add snd_ssthresh stat in SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: add data-races annotations around tp->reordering, tp->snd_cwnd
Eric Dumazet [Thu, 16 Apr 2026 20:03:08 +0000 (20:03 +0000)] 
tcp: add data-races annotations around tp->reordering, tp->snd_cwnd

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE(), WRITE_ONCE() and data_race() annotations to keep KCSAN
happy.

Fixes: bb7c19f96012 ("tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: add data-race annotations around tp->data_segs_out and tp->total_retrans
Eric Dumazet [Thu, 16 Apr 2026 20:03:07 +0000 (20:03 +0000)] 
tcp: add data-race annotations around tp->data_segs_out and tp->total_retrans

tcp_get_timestamping_opt_stats() intentionally runs lockless, we must
add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy.

Fixes: 7e98102f4897 ("tcp: record pkts sent and retransmistted")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: annotate data-races in tcp_get_info_chrono_stats()
Eric Dumazet [Thu, 16 Apr 2026 20:03:06 +0000 (20:03 +0000)] 
tcp: annotate data-races in tcp_get_info_chrono_stats()

tcp_get_timestamping_opt_stats() does not own the socket lock,
this is intentional.

It calls tcp_get_info_chrono_stats() while other threads could
change chrono fields in tcp_chrono_set().

I do not think we need a coherent TCP socket state snapshot
in tcp_get_timestamping_opt_stats(), I chose to only
add annotations to keep KCSAN happy.

Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260416200319.3608680-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoselftests: ovpn: serialize YNL listener startup
Ralf Lici [Thu, 16 Apr 2026 07:19:28 +0000 (09:19 +0200)] 
selftests: ovpn: serialize YNL listener startup

Starting one background YNL notification listener per peer back-to-back
can intermittently stall the test setup before the listeners even reach
the Python main function.

This was reproducible in a reduced test.sh setup-only loop: a single
listener stayed stable across repeated runs, while starting listeners
for all peers could hang early in the listener launch phase. Adding a
short delay between listener launches makes the listeners start cleanly
and eliminates the reproduced hangs in repeated normal and slow-runner
tests.

Serialize listener startup with a small sleep between setup_listener
calls.

Fixes: 77de28cd7cf1 ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
4 weeks agoselftests: ovpn: align command flow with TAP
Ralf Lici [Mon, 23 Mar 2026 14:12:32 +0000 (15:12 +0100)] 
selftests: ovpn: align command flow with TAP

Current tests do not properly adhere to the TAP infrastructure, so
they do not report failures properly, which leads to hangs of the CI
machinery.

Restructure the ovpn selftests to use the TAP infrastructure: split each
test into stages, execute stage bodies with fail-fast semantics, and emit
KTAP pass/fail for each stage.

Centralize behavior control in common.sh and make the scripts use
dedicated wrappers for required-success, expected-failure, and non-fatal
commands. Also add the OVPN_VERBOSE mode that exposes captured command
output for debugging.
This way tests no longer hang on failure when executed within the CI
machinery.
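
The staged, fail-fast flow can be sketched as follows. The actual
selftests are shell scripts built around common.sh; the Python model
below is only a hypothetical illustration of the control flow, and
run_stages(), succeed() and fail() are made-up names.

```python
def run_stages(stages):
    """Run (name, commands) stages; return the KTAP lines a harness would print."""
    lines = ["KTAP version 1", f"1..{len(stages)}"]
    for number, (name, commands) in enumerate(stages, start=1):
        # all() stops at the first failing command: fail-fast within a stage.
        passed = all(command() for command in commands)
        status = "ok" if passed else "not ok"
        lines.append(f"{status} {number} {name}")
    return lines


calls = []  # records which hypothetical commands actually ran


def succeed(label):
    def command():
        calls.append(label)
        return True
    return command


def fail(label):
    def command():
        calls.append(label)
        return False
    return command


stages = [
    ("setup", [succeed("add-peers")]),
    ("ping", [fail("ping-peer1"), succeed("never-run")]),
]
report = run_stages(stages)
print("\n".join(report))
```

The failing stage is reported as "not ok" and the command after the
failure is never executed, so a broken step cannot leave the run waiting
on later commands.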

This change also makes default OVPN_CLI and YNL resolution
independent from the caller CWD by anchoring both to COMMON_DIR, so
behavior is stable across direct execution and run_tests-style
execution.

Fixes: 959bc330a439 ("testing/selftests: add test tool and scripts for ovpn module")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
4 weeks agoselftests: ovpn: add prefix to helpers and shared variables
Ralf Lici [Fri, 20 Mar 2026 16:29:38 +0000 (17:29 +0100)] 
selftests: ovpn: add prefix to helpers and shared variables

The current naming of shared variables, helpers and network namespaces
is unfortunate, as it lacks a clear prefix.
This proved problematic in case of name clashes with external scripts
or abrupt test termination (leftover netns could not easily be traced
back to ovpn).

Rename common helper entry points and all shared globals in the ovpn
selftests to ovpn_ or OVPN_ names so test scripts and wrappers use a
single explicit prefix. Also rename the temporary network namespaces
created by the tests from peerN to ovpn_peerN. This makes leaked
namespaces easier to identify.

This is a mechanical refactor only, behavior is unchanged.

Fixes: 959bc330a439 ("testing/selftests: add test tool and scripts for ovpn module")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
4 weeks agoselftests: ovpn: flatten slurped notification JSON before filtering
Ralf Lici [Tue, 24 Mar 2026 14:54:18 +0000 (15:54 +0100)] 
selftests: ovpn: flatten slurped notification JSON before filtering

Notification comparison uses jq -s, which slurps all inputs into an
array. Some inputs can be arrays themselves, and applying the .msg.peer
filter directly on those entries triggers jq type errors.

Expand any array-valued JSON items returned by jq -s before selecting
.msg.peer, so the filter handles both normal notification objects and []
entries without type errors.
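
The flattening step can be sketched outside of jq. The Python snippet
below is an illustration only: the test scripts use jq, peer_entries() is
a hypothetical helper, and the sample notifications are made up. It
expands array-valued items before selecting msg.peer, so both plain
objects and [] entries are handled.

```python
import json


def peer_entries(slurped):
    """Select msg.peer from slurped items, expanding nested arrays first."""
    peers = []
    for item in slurped:
        # A slurped item may be a notification object or an array of them.
        entries = item if isinstance(item, list) else [item]
        for entry in entries:
            peers.append(entry["msg"]["peer"])
    return peers


# Made-up slurped input: an object, an empty array, and a one-element array.
raw = '[{"msg": {"peer": {"id": 1}}}, [], [{"msg": {"peer": {"id": 2}}}]]'
peers = peer_entries(json.loads(raw))
print(peers)
```

Indexing the [] entry directly with ["msg"] would raise a TypeError in
Python, which is the analogue of the jq type error described above.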

Fixes: 77de28cd7cf1 ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>