Jiawen Wu [Mon, 25 May 2026 10:05:43 +0000 (18:05 +0800)]
net: txgbe: rework service event handling
Convert to use test_and_clear_bit() for link event subtasks. Only re-arm
the WX_FLAG_NEED_MODULE_RESET flag when the module is absent. Unsupported or
invalid modules no longer cause the service task to continuously retry
module identification.
Additionally, explicitly cancel service_task during device teardown to
ensure no pending asynchronous service work survives after the device
has entered the DOWN state.
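
The consume-once event handling described above can be modelled in
userspace. This is only a sketch using C11 atomics, not the txgbe code:
flag_test_and_clear() stands in for the kernel's test_and_clear_bit(),
and the flag index, the module_status enum and module_subtask() are
hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit index; the driver's flag is WX_FLAG_NEED_MODULE_RESET. */
enum { FLAG_NEED_MODULE_RESET = 0 };

enum module_status { MODULE_OK, MODULE_ABSENT, MODULE_UNSUPPORTED };

/* Userspace stand-in for test_and_clear_bit(): atomically clear the bit
 * and report whether it was set beforehand. */
static bool flag_test_and_clear(atomic_ulong *flags, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

static void flag_set(atomic_ulong *flags, unsigned int bit)
{
	atomic_fetch_or(flags, 1UL << bit);
}

/* One pass of a module-identification subtask. A pending event is
 * consumed exactly once and re-armed only when the module is absent. */
static bool module_subtask(atomic_ulong *flags, enum module_status status)
{
	if (!flag_test_and_clear(flags, FLAG_NEED_MODULE_RESET))
		return false;	/* no pending event, nothing to do */

	if (status == MODULE_ABSENT)
		flag_set(flags, FLAG_NEED_MODULE_RESET);	/* retry later */

	return true;	/* the event was handled in this pass */
}
```

With this shape, an unsupported module leaves the flag cleared, so the
next pass of the task does nothing until a new event sets the flag again.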
Jiawen Wu [Mon, 25 May 2026 10:05:42 +0000 (18:05 +0800)]
net: wangxun: avoid statistics updates during device teardown
After introducing WX_STATE_DOWN, wx_update_stats() now explicitly skips
statistics collection while the device is in teardown or reset state.
Calling wx_update_stats() from the device disable path therefore becomes
redundant.
Remove wx_update_stats() calls from ngbe_disable_device() and
txgbe_disable_device().
Jiawen Wu [Mon, 25 May 2026 10:05:41 +0000 (18:05 +0800)]
net: wangxun: introduce WX_STATE_DOWN to serialize device shutdown state
Replace various netif_running() checks with an explicit WX_STATE_DOWN
state bit to track whether the device datapath and interrupt handling
are operational.
The previous logic relied on netif_running() to gate interrupt
reenablement, queue wakeups, statistics updates, and service task
execution. However, netif_running() only reflects the administrative
state of the netdevice and does not fully serialize against teardown
and reset paths. During device shutdown and reset flows, asynchronous
contexts such as interrupt handlers, NAPI poll, and service work could
still observe netif_running() as true while device resources were
already being disabled or freed.
Willem de Bruijn [Tue, 26 May 2026 13:40:37 +0000 (09:40 -0400)]
net: sch_fq: update flow delivery time on earlier EDT packet
When inserting an EDT packet with time before flow->time_next_packet,
update the flow and possibly queue next delivery time.
Reinsert the flow into the q->delayed rb-tree to position correctly
and to have fq_check_throttled set wake-up at the right next time.
Factor the rb-tree insertion out of fq_flow_set_throttled to avoid
open-coding it twice.
EDT packets do not take precedence over queue rate limit. Skip this
new step if a queue limit is set. EDT packets do take precedence over
per-socket rate limits, as can be seen from fq_dequeue reading
sk_pacing_rate if !skb->tstamp.
With this change the so_txtime selftest sends packets in the expected
order.
====================
Introduce Airoha AN8801R series Gigabit Ethernet PHY driver
This series introduces the Airoha AN8801R Gigabit Ethernet PHY initial
support.
The Airoha AN8801R is a low power single-port Ethernet PHY Transceiver
with Single-port serdes interface for 1000Base-X/RGMII.
This chip is compliant with 10Base-T, 100Base-TX and 1000Base-T IEEE
802.3(u,ab) and supports:
- Energy Efficient Ethernet (802.3az)
- Full Duplex Flow Control (802.3x)
- auto-negotiation
- crossover detection and autocorrection
- Wake-on-LAN with Magic Packet
- Jumbo Frame up to 9 Kilobytes
This PHY also supports up to three user-configurable LEDs, which are
usually used for LAN Activity, 100M, 1000M indication.
The series provides the devicetree binding and the driver that have been
written by AngeloGioacchino Del Regno, based on downstream
implementation ([1]). The driver allows setting up PHY LEDs, 10/100M,
1000M speeds, and Wake on LAN and PHY interrupts.
Since v2, the series also adds the air_phy_lib library, whose goal is to
share common code between air_en8811h and air_an8801 drivers, and its use
in them. The first shared functions are the existing BuckPbus register
accessors and air_phy_read/write_page functions coming from air_en8811h
driver.
The series is based on net-next kernel tree (sha1: 90d03ee2c5dc) and
I have tested it on Mediatek Genio 720-EVK board (that integrates an
Airoha AN8801RIN/A Ethernet PHY) with early board hardware enablement
patches.
net: phy: air_an8801: ensure maximum available speed link use
To ensure that the Airoha AN8801R PHY uses the maximum available link
speed, an additional register write is needed to configure the function
mode for either 1G or 100M/10M operation after link detection.
So, in air_an8801 driver, implement a custom read_status callback, that
after genphy_read_status determines the link speed, sets the bit 0 of
the link mode register (REG_LINK_MODE) if the detected speed is 1Gbps,
or unsets it otherwise.
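
The register update performed after link detection can be sketched as a
pure function. The macro names, the 32-bit register width and the helper
name below are assumptions made for illustration; only the rule "bit 0
set for 1Gbps, cleared otherwise" comes from the description above.

```c
#include <stdint.h>

/* Assumed names: bit 0 of the link mode register selects 1G operation. */
#define LINK_MODE_1G	(UINT32_C(1) << 0)
#define SPEED_1000	1000

/* Return the new link mode register value for the detected speed,
 * leaving all other bits of the register untouched. */
static uint32_t link_mode_for_speed(uint32_t reg, int speed)
{
	if (speed == SPEED_1000)
		return reg | LINK_MODE_1G;

	return reg & ~LINK_MODE_1G;
}
```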
Introduce a driver for the Airoha AN8801R Series Gigabit Ethernet
PHY; this currently supports setting up PHY LEDs, 10/100M, 1000M
speeds, and Wake on LAN and PHY interrupts.
net: phy: Rename Airoha common BuckPBus register accessors
Rename the BuckPBus register accessors functions present in air_phy_lib
and their calls in air_en8811h driver, so all exported functions start
with the same prefix.
In preparation for Airoha AN8801R PHY support, move the BuckPBus
register accessors and definitions, present in air_en8811h driver,
into the Airoha PHY shared code (air_phy_lib), so they will be usable
by the new driver without duplicating them.
In preparation for Airoha AN8801R PHY support, split out the interface
functions that will be common between the already present air_en8811h
driver and the new one, and put them into a new library named
air_phy_lib.
Jiayuan Chen [Tue, 26 May 2026 02:55:29 +0000 (10:55 +0800)]
net/sched: cls_bpf: prevent unbounded recursion in offload rollback
Quan Sun reported [1] a stack overflow in cls_bpf_offload_cmd().
Reproducer on netdevsim: add a skip_sw cls_bpf filter, set the
bpf_tc_accept debugfs knob to 0, then `tc filter replace`. The replace
calls tc_setup_cb_replace() which fails. cls_bpf_offload_cmd() then
swaps prog/oldprog and recursively calls itself to roll back. But
bpf_tc_accept=0 makes the rollback fail too, which triggers yet another
rollback frame with the same arguments, and so on until the stack is
exhausted.
bpf_tc_accept is just a convenient knob for the reproducer. Any driver
whose tc_setup_cb_replace() fails twice in a row can hit the same loop,
so this is not a netdevsim-only issue.
Two ways to fix it:
1) Have the rollback call tc_setup_cb_add() on oldprog instead of
re-entering cls_bpf_offload_cmd().
2) Mark the rollback frame with a flag and skip a second-level
rollback from inside it.
Go with (2). It is the smaller change and keeps the original behaviour:
the rollback still goes through tc_setup_cb_replace(), so the driver
gets one real chance to restore its state. If that attempt also fails,
we just return the original error instead of recursing.
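
Option (2) can be illustrated with a small userspace model. This is a
sketch, not the cls_bpf code: struct fake_driver, driver_replace() and
offload_cmd() are hypothetical, programs are represented by plain
integers (0 meaning "no old program"), and the is_rollback parameter
plays the role of the flag that marks the rollback frame.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a driver: the next fail_count calls fail. */
struct fake_driver {
	int fail_count;
	int calls;
	int active_prog;
};

/* Returns 0 on success or a negative error code. */
static int driver_replace(struct fake_driver *drv, int new_prog)
{
	drv->calls++;
	if (drv->fail_count > 0) {
		drv->fail_count--;
		return -1;
	}
	drv->active_prog = new_prog;
	return 0;
}

/* The rollback re-enters the same function, but the frame is marked so
 * that a failing rollback cannot start yet another rollback. */
static int offload_cmd(struct fake_driver *drv, int prog, int oldprog,
		       bool is_rollback)
{
	int err = driver_replace(drv, prog);

	if (err && !is_rollback && oldprog)
		offload_cmd(drv, oldprog, prog, true);	/* one attempt only */

	return err;	/* always the error of the original request */
}
```

Even when the driver fails twice in a row, the call depth is bounded at
two frames and the caller sees the error from the first attempt.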
Zhao Dongdong [Tue, 26 May 2026 06:51:56 +0000 (14:51 +0800)]
net: page_pool: silence static analysis warnings in page_pool_nl_stats_fill()
nla_nest_start() can return NULL if the skb runs out of space.
Jakub:
There is no bug here, if nla_nest_start() failed there's no space
left in the message. Next nla_put_uint() will also fail and we will
exit via nla_nest_cancel() which handles NULL just fine.
Various people keep sending us this patch so let's commit this.
Eric Dumazet [Tue, 26 May 2026 14:55:29 +0000 (14:55 +0000)]
ipv6: frags: cleanup __IP6_INC_STATS() confusion
After commits e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original
netdev") and bdb7cc643fc9 ("ipv6: Count interface receive statistics
on the ingress netdev") net/ipv6/reassembly.c uses three different
ways to reach idev in various __IP6_INC_STATS() calls.
Let's centralize this from ipv6_frag_rcv() and use __in6_dev_stats_get().
Note that ipv6_frag_rcv() tests if skb->dev could be NULL already, so
I chose to also guard against NULL, but we probably can remove the
tests in a followup patch, because I do not think skb->dev could be NULL.
iif = skb->dev ? skb->dev->ifindex : 0;
idev can be NULL, __IP6_INC_STATS() deals with this possibility.
Small code size reduction as a bonus.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-145 (-145)
Function                                     old     new   delta
ipv6_frag_rcv                               2399    2362     -37
ip6_frag_reasm                               705     597    -108
Total: Before=31455552, After=31455407, chg -0.00%
Eric Dumazet [Tue, 26 May 2026 14:55:28 +0000 (14:55 +0000)]
ipv6: guard against possible NULL deref in __in6_dev_stats_get()
dev_get_by_index_rcu() could return NULL if the original physical
device is unregistered.
Found by Sashiko.
Fixes: e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original netdev")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526145529.3587126-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yuyang Huang [Sun, 24 May 2026 02:24:56 +0000 (11:24 +0900)]
ipv6: mcast: annotate data-races around mca_users
/proc/net/igmp6 walks IPv6 multicast memberships under RCU and prints
mca_users without holding idev->mc_lock, while multicast join and leave
paths update the field while holding idev->mc_lock. Annotate this
intentional lockless snapshot with READ_ONCE() and the matching writers
with WRITE_ONCE().
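
The annotation pattern can be sketched in userspace as follows. The two
macros are simplified stand-ins for the kernel's READ_ONCE() and
WRITE_ONCE() and rely on the GNU __typeof__ extension (GCC and Clang);
struct mc_entry and both helpers are hypothetical names, not the
ipv6 multicast code.

```c
/* Simplified stand-ins for the kernel macros: force exactly one access
 * to the object through a volatile-qualified pointer. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

struct mc_entry {
	int users;	/* written under a lock, read without it */
};

/* Writer side: the caller is assumed to hold the lock that serializes
 * all updates, so the plain read of e->users here cannot race with
 * another writer. */
static void mc_entry_get(struct mc_entry *e)
{
	WRITE_ONCE(e->users, e->users + 1);
}

/* Lockless reader: takes a single snapshot of the counter. */
static int mc_entry_users_snapshot(struct mc_entry *e)
{
	return READ_ONCE(e->users);
}
```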
Haoxiang Li [Mon, 25 May 2026 08:26:11 +0000 (16:26 +0800)]
net: thunderx: fix PTP device ref leak in nicvf_probe()
cavium_ptp_get() acquires a reference to the PTP PCI device
through pci_get_device(). If any initialization step fails
after cavium_ptp_get(), the PTP PCI device reference is leaked.
Add a common error path to release the PTP reference before
returning from probe failures.
Luka Gejak [Sat, 23 May 2026 13:04:20 +0000 (15:04 +0200)]
net: hsr: require valid EOT supervision TLV
Supervision frames are only valid if terminated with a zero-length EOT
TLV. The current check fails to reject non-EOT entries as the terminal
TLV, potentially allowing malformed supervision traffic.
Fix this by strictly requiring the terminal TLV to be HSR_TLV_EOT with
a length of zero.
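
The rule can be illustrated with a generic TLV walk. This is a sketch of
the validation idea, not the HSR receive path: the one-byte type and
one-byte length layout, the TLV_EOT value of 0 and the function name are
assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: the EOT TLV has type 0. */
#define TLV_EOT 0

/* Each TLV is one type byte, one length byte, then `len` value bytes.
 * The sequence is valid only if it ends with an EOT TLV of length 0. */
static bool sup_tlvs_valid(const uint8_t *buf, size_t size)
{
	size_t off = 0;

	while (size - off >= 2) {
		uint8_t type = buf[off];
		uint8_t len = buf[off + 1];

		if (type == TLV_EOT)
			return len == 0;	/* terminator must be empty */

		off += 2;
		if (len > size - off)
			return false;		/* value runs past the buffer */
		off += len;
	}

	return false;	/* ran out of data without seeing an EOT TLV */
}
```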
Jakub Kicinski [Wed, 27 May 2026 01:07:28 +0000 (18:07 -0700)]
Merge tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter fixes and small enhancements:
1) Disable 32-bit x_tables compatibility (32bit binaries on 64bit
kernel) interface in user namespaces. This is 'last warning'
before this is removed for good.
2) Add a configuration toggle for netfilter GCOV profiling. Provide
dedicated toggles for ipset and ipvs.
3) Remove modular support for nfnetlink and restrict it to built-in only.
From Pablo Neira Ayuso.
4) Use per-rule hash initval in nf_conncount. This avoids unnecessary
lock contention with short keys (e.g. conntrack zones) in different
namespaces.
5) Use nf_ct_exp_net() in ctnetlink expectation dumps.
From Pratham Gupta.
6) Remove a dead conditional in nft_set_rbtree.
7) Fix conntrack helper policy updates to apply per-class values correctly.
From David Carlier.
8) Fix an off-by-one OOB read in nf_conntrack_irc:parse_dcc(). Use strict
less-than comparison in the newline search loop to respect the
exclusive-end pointer convention. From Muhammad Bilal.
9) Fix typos in nf_conntrack_proto_tcp comments. From Avinash Duduskar.
10) Restore performance optimization in nft_set_pipapo_avx2 by passing
the next map index. Refactor lookup logic for clarity and add a
DEBUG_NET check to document this.
11) Avoid (harmless) u16 overflow in nf_conntrack_ftp when parsing FTP PORT
and EPRT commands. Ignore commands where a single octet exceeds 255.
From Giuseppe Caruso.
Patch 12, which removes incorrect (and obviously unused) code from
nft_byteorder was kept back to avoid a net -> net-next merge conflict.
* tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_conntrack_ftp: avoid u16 overflows
netfilter: nft_set_pipapo_avx2: restore performance optimization
netfilter: nf_conntrack_proto_tcp: fix typos in comments
netfilter: nf_conntrack_irc: fix parse_dcc() off-by-one OOB read
netfilter: nfnl_cthelper: apply per-class values when updating policies
netfilter: nft_set_rbtree: remove dead conditional
netfilter: ctnetlink: use nf_ct_exp_net() in expectation dump
netfilter: nf_conncount: use per-rule hash initval
netfilter: allow nfnetlink built-in only
netfilter: add option for GCOV profiling
netfilter: x_tables: disable 32bit compat interface in user namespaces
====================
netconsole: Constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
  64259   24272     608   89139   15c33 drivers/net/netconsole.o
After:
=====
   text    data     bss     dec     hex filename
  64579   23952     608   89139   15c33 drivers/net/netconsole.o
Maoyi Xie [Mon, 25 May 2026 07:17:59 +0000 (15:17 +0800)]
mlxsw: spectrum_fid: use a dedicated list head pointer for sorted insert
mlxsw_sp_fid_port_vid_list_add() inserts into a list sorted by
local_port. It walks the list to find the first entry with a
larger local_port, then inserts the new entry before it.
If the loop falls through (the new local_port is the largest),
tmp_port_vid runs off the end of the list. &tmp_port_vid->list
then ends up at the list head itself (container_of() offsets
cancel), and list_add_tail() inserts at the tail. So the code
works today.
It is fragile though. Anyone who later adds a read of another
field of tmp_port_vid will hit memory outside the list head.
Track the insertion point with a dedicated list_head pointer.
Initialise insert_before to &fid->port_vid_list, set it to
&tmp_port_vid->list only on early break, and pass insert_before
to list_add_tail(). The cursor is no longer touched after the
loop. Behaviour is unchanged.
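
The insertion scheme can be sketched with a minimal userspace list. The
list helpers below only imitate the kernel's list API, and struct
port_vid and port_vid_add_sorted() are simplified, hypothetical
stand-ins for the mlxsw structures; the point is that the loop cursor is
never used after the loop.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

/* Insert `new` immediately before `pos`, like the kernel's
 * list_add_tail(). Passing the list head as `pos` appends at the tail. */
static void list_add_tail(struct list_head *new, struct list_head *pos)
{
	new->prev = pos->prev;
	new->next = pos;
	pos->prev->next = new;
	pos->prev = new;
}

struct port_vid {
	struct list_head list;	/* first member, so a cast recovers the entry */
	unsigned int local_port;
};

static void port_vid_add_sorted(struct list_head *head, struct port_vid *pv)
{
	struct list_head *insert_before = head;	/* default: append at tail */
	struct list_head *cur;

	for (cur = head->next; cur != head; cur = cur->next) {
		struct port_vid *tmp = (struct port_vid *)cur;

		if (tmp->local_port > pv->local_port) {
			insert_before = cur;	/* set only on early break */
			break;
		}
	}

	list_add_tail(&pv->list, insert_before);
}
```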
Wei Fang [Sun, 24 May 2026 07:03:10 +0000 (15:03 +0800)]
net: dsa: netc: fix unmet Kconfig dependencies for NET_DSA_NETC_SWITCH
NET_DSA_NETC_SWITCH selects NXP_NTMP, NXP_NETC_LIB and FSL_ENETC_MDIO,
but these symbols depend on NET_VENDOR_FREESCALE which may not be
enabled. This results in Kconfig warnings and linker errors like:
undefined reference to `ntmp_bpt_update_entry'
undefined reference to `ntmp_fdbt_search_port_entry'
undefined reference to `ntmp_free_cbdr'
undefined reference to `enetc_hw_alloc'
...
Therefore, add "depends on NET_VENDOR_FREESCALE" to NET_DSA_NETC_SWITCH,
ensuring that the selected symbols NXP_NTMP, NXP_NETC_LIB and
FSL_ENETC_MDIO, which all depend on NET_VENDOR_FREESCALE, can only be
selected when that dependency is already satisfied.
ipv6: addrconf: fix temp address generation after prefix deprecation
When a router temporarily deprecates an IPv6 prefix (either by sending a
Router Advertisement with Preferred Lifetime = 0 or by letting the
lifetime expire) and later restores it, the kernel permanently loses its
ability to generate temporary privacy addresses (RFC 8981) for that
prefix.
This happens because the address worker attempts to generate a
replacement temporary address when the current one nears expiration. As
the base prefix is deprecated already, the generation fails after
marking the temporary address as already having spawned a replacement
(ifp->regen_count++).
When the router eventually restores the prefix, the temporary address
becomes active again. However, once it naturally expires, the address
worker sees this temporary address already tried to generate one and
skips the regeneration.
Fix the issue by resetting the regen_count check of the latest temp
address generated for the prefix updated by the incoming RA.
Martin Karsten [Sat, 23 May 2026 01:22:20 +0000 (21:22 -0400)]
net: napi: Skip last poll when arming gro timer in busy poll
Skip the extra call to napi->poll(), if the gro timer is armed at the
end of busy polling. This removes the need for having a separate
__busy_poll_stop() routine and its code is moved directly into the
relevant places in busy_poll_stop(). Remove obsolete comment about
ndo_busy_poll_stop().
This is a follow-up to commit 58e2330bd455 ("net: napi: Avoid gro timer
misfiring at end of busypoll"), which has deferred arming the gro timer
to the end of __busy_poll_stop() to eliminate a race condition between
a short timer and long poll that could leave the queue stuck with
interrupts disabled and no timer armed.
Co-developed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Link: https://patch.msgid.link/20260523012247.1574691-1-mkarsten@uwaterloo.ca
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tim Bird [Fri, 22 May 2026 22:55:08 +0000 (16:55 -0600)]
llc: Add SPDX id lines to some llc source files
Most of the llc source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove other license
info from the header. In one case, leave the existing id line
and just remove the license reference text.
====================
Add OVS packet family YNL spec and unicast notification support
This series adds a YAML netlink spec for the OVS_PACKET_FAMILY genetlink
family and a bind-only ntf_bind() helper for receiving unicast
notifications.
====================
Minxi Hou [Fri, 22 May 2026 17:41:54 +0000 (01:41 +0800)]
tools: ynl: add unicast notification receive support
Add ntf_bind() method to YnlFamily for binding the netlink
socket without joining a multicast group. This enables receiving
unicast notifications through the existing poll_ntf/check_ntf
path.
The OVS packet family sends MISS and ACTION upcalls via
genlmsg_unicast() to a per-vport PID rather than through a
multicast group. The existing ntf_subscribe() couples bind()
with setsockopt(ADD_MEMBERSHIP), which does not fit the unicast
case. ntf_bind() provides the bind-only alternative, with the
address defaulting to (0, 0) but exposed as an explicit argument.
Minxi Hou [Fri, 22 May 2026 17:41:53 +0000 (01:41 +0800)]
netlink: specs: add OVS packet family specification
Add YAML netlink spec for the OVS_PACKET_FAMILY (ovs_packet).
This completes the set of OVS genetlink family specs (ovs_datapath,
ovs_flow, ovs_vport already exist).
The spec defines three operations: MISS (event), ACTION (event),
and EXECUTE (do). MISS and ACTION are kernel-to-userspace upcalls
sent via genlmsg_unicast(); EXECUTE is the only registered genl
operation.
Key, actions, and egress-tun-key attributes are typed as binary
rather than nest because the nested attribute definitions belong
to the ovs_flow spec and cross-spec references are not supported
by the YNL framework.
s390/ism: Drop superfluous zeros in pci_device_id array
The .driver_data members of the struct pci_device_id array were
initialized to zero by list expressions without making use of that
value. In this case it's better to not specify a value at all and let
the compiler fill in the zeros. The same applies to the list terminator,
which can be left completely empty.
This patch doesn't introduce changes to the compiled array.
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20260522153010.777081-2-u.kleine-koenig@baylibre.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
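
Why dropping the explicit zeros does not change the compiled array can
be shown with a simplified structure. struct dev_id and the ID values
below are made up for illustration and are not the ism driver's table.
The kernel usually writes the terminator as "{ }", which is GNU C (and
C23); the sketch uses "{ 0 }" so that it also builds as strict C99.

```c
/* Simplified, hypothetical stand-in for struct pci_device_id. */
struct dev_id {
	unsigned int vendor;
	unsigned int device;
	unsigned long driver_data;
};

/* Explicit zeros for driver_data and for the terminator entry. */
static const struct dev_id ids_verbose[] = {
	{ 0x1234, 0x0001, 0 },
	{ 0, 0, 0 },
};

/* Members that are not named in an initializer are zero-initialized,
 * so this table holds the same values as the one above. */
static const struct dev_id ids_terse[] = {
	{ .vendor = 0x1234, .device = 0x0001 },
	{ 0 },	/* terminator */
};
```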
====================
net: enetc: Prepare for ENETC v4 VF support
This patch series refactors and extends the ENETC driver infrastructure
to prepare for upcoming ENETC v4 Virtual Function (VF) support. The main
focus is on code commonization, improved VF-PF communication, and dynamic
resource allocation.
The ENETC IP has evolved across different revisions, and the existing
driver architecture was primarily designed around v1 hardware. To support
v4 VFs efficiently, we need to share common code between PF drivers of
different IP versions while maintaining compatibility.
Key changes in this series:
1. VF-PF Messaging Infrastructure:
- Convert mailbox messages to new formats
- Use read_poll_timeout() for simplifying VF mailbox polling
- Add support for IP minor revision query via messaging
2. Code Commonization:
- Relocate SR-IOV configuration helpers to common PF code
- Move VF message handlers to dedicated enetc_msg.c
- Integrate enetc_msg.c into enetc-pf-common driver
3. CBDR (Control Buffer Descriptor Ring) Improvements:
- Align v1 CBDR API with v4 for VF driver sharing
- Add CBDR setup/teardown hooks to enetc_si_ops
4. Dynamic Resource Management:
- Dynamically allocate rxmsg based on actual VF count
- Use MADDR_TYPE constant for MAC filter array sizing
This refactoring lays the groundwork for cleanly integrating ENETC v4 VF
support in subsequent patch series, allowing code reuse between v1 and v4
PF drivers while maintaining a clean separation of version-specific
logic.
Wei Fang [Fri, 22 May 2026 09:24:38 +0000 (17:24 +0800)]
net: enetc: dynamically allocate rxmsg based on VF count
The constant ENETC_MAX_NUM_VFS is defined as 2 when enabling support for
LS1028A. This works for LS1028A because its ENETC hardware supports up
to 2 VFs. However, ENETC v4 has varying VF capabilities depending on the
SoC.
Using a fixed ENETC_MAX_NUM_VFS for memory allocation leads to
over-allocation on SoCs with fewer or no VF support. To better match
hardware capabilities and avoid unnecessary memory usage, change rxmsg
memory allocation from a fixed-size array to dynamic allocation based
on the actual VF count retrieved via pci_sriov_get_totalvfs().
Wei Fang [Fri, 22 May 2026 09:24:37 +0000 (17:24 +0800)]
net: enetc: use MADDR_TYPE for MAC filter array size
The mac_filter array in struct enetc_pf is sized as
ENETC_MAX_NUM_MAC_FLT, defined as (ENETC_MAX_NUM_VFS + 1) * MADDR_TYPE.
This resulted in an array of 6 elements (for 2 VFs), but only the first
2 entries are actually used.
The PF driver maintains MAC filters for unicast (UC) and multicast (MC)
addresses, indexed by the enum enetc_mac_addr_type (UC=0, MC=1). The
code only iterates over MADDR_TYPE (2) entries and directly accesses
mac_filter[UC] and mac_filter[MC]. The extra space allocated for
(ENETC_MAX_NUM_VFS * MADDR_TYPE) entries is never used because VF MAC
filtering is not implemented yet.
Remove the ENETC_MAX_NUM_MAC_FLT macro and size the array as
MADDR_TYPE, reducing the allocation from 6 to 2 entries. This saves 48
bytes per PF and better reflects the actual usage.
This change has no functional impact. Future VF MAC filtering support
will move mac_filter into struct enetc_si, allowing each SI (PF or VF)
to maintain its own independent filter table.
Wei Fang [Fri, 22 May 2026 09:24:36 +0000 (17:24 +0800)]
net: enetc: add generic helper to initialize SR-IOV resources
The upcoming ENETC v4 PF driver will support SR-IOV, and its logic for
initializing VF resources is identical to the existing ENETC v1 PF
implementation. To avoid code duplication across PF drivers, factor out
the common SR-IOV initialization logic into the enetc-pf-common driver.
Add enetc_init_sriov_resources() to handle:
- Querying the total number of VFs supported by the device via
pci_sriov_get_totalvfs()
- Allocating memory for the VF state array (struct enetc_vf_state)
The implementation uses devm_kcalloc() instead of kzalloc() to simplify
memory management. This automatically frees VF state memory when the PF
device is removed, eliminating the need for explicit cleanup in error
and remove paths.
Wei Fang [Fri, 22 May 2026 09:24:35 +0000 (17:24 +0800)]
net: enetc: add CBDR setup/teardown hooks to enetc_si_ops for VF support
The upcoming ENETC v4 VF will share the enetc-vf driver with the
existing v1 VF. However, ENETC v4 uses a revised CBDR (command BD ring)
setup/teardown API that differs from v1.
To support both versions in the same driver, add setup_cbdr() and
teardown_cbdr() function pointers to struct enetc_si_ops. This allows
each hardware version to register its own CBDR implementation.
Update the enetc-vf driver to call CBDR operations through si->ops
instead of directly invoking the v1 functions. This enables runtime
selection of the correct CBDR backend based on hardware version.
Changes:
- Add setup_cbdr() and teardown_cbdr() hooks to struct enetc_si_ops
- Register v1 CBDR functions in enetc_vsi_ops
- Replace direct calls with si->ops->setup_cbdr() and
si->ops->teardown_cbdr() in enetc_vf.c
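
The hook-based dispatch can be sketched as below. All types and
functions here are simplified stand-ins, not the enetc structures: the
cbdr_backend field, the v4 functions and vf_probe()/vf_remove() are
hypothetical and exist only to show a shared driver selecting the CBDR
backend through an ops table.

```c
struct si;

/* Per-version operations, in the spirit of struct enetc_si_ops. */
struct si_ops {
	int (*setup_cbdr)(struct si *si);
	void (*teardown_cbdr)(struct si *si);
};

struct si {
	const struct si_ops *ops;
	int cbdr_backend;	/* 0 = none, otherwise the version in use */
};

static int v1_setup_cbdr(struct si *si)
{
	si->cbdr_backend = 1;
	return 0;
}

static int v4_setup_cbdr(struct si *si)
{
	si->cbdr_backend = 4;
	return 0;
}

static void common_teardown_cbdr(struct si *si)
{
	si->cbdr_backend = 0;
}

static const struct si_ops v1_ops = {
	.setup_cbdr = v1_setup_cbdr,
	.teardown_cbdr = common_teardown_cbdr,
};

static const struct si_ops v4_ops = {
	.setup_cbdr = v4_setup_cbdr,
	.teardown_cbdr = common_teardown_cbdr,
};

/* Shared VF code never names a version-specific function directly. */
static int vf_probe(struct si *si)
{
	return si->ops->setup_cbdr(si);
}

static void vf_remove(struct si *si)
{
	si->ops->teardown_cbdr(si);
}
```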
Wei Fang [Fri, 22 May 2026 09:24:34 +0000 (17:24 +0800)]
net: enetc: align v1 CBDR API with v4 for VF driver sharing
The upcoming ENETC v4 VF will share the enetc-vf driver with the v1 VF.
However, ENETC v4 introduces different CBDR (command BD ring) setup and
teardown semantics that are incompatible with v1.
To support both versions in the same driver, the .setup_cbdr() and
.teardown_cbdr() hooks will be added to struct enetc_si_ops, allowing
the driver to register version-specific implementations. So refactor the
v1 CBDR functions to match the v4-style interface (taking struct enetc_si*
instead of individual parameters), enabling them to be registered via
si_ops in the subsequent patch.
Changes:
- Update enetc_setup_cbdr() and enetc_teardown_cbdr() prototypes to
take 'struct enetc_si *' as the sole parameter
- Extract parameters (dev, hw) from the enetc_si structure within the
function implementations
- ENETC_CBDR_DEFAULT_SIZE has always been used as the number of command
BDs, and there is no need to adjust the size of the command BD ring.
Therefore, ENETC_CBDR_DEFAULT_SIZE is moved into enetc_setup_cbdr()
- Update all call sites in enetc_pf.c and enetc_vf.c
No functional changes. This prepares for adding v4-specific CBDR handling
in subsequent patches.
Wei Fang [Fri, 22 May 2026 09:24:33 +0000 (17:24 +0800)]
net: enetc: add VF-PF messaging support for IP minor revision query
For ENETC v4, different SoCs use different minor revisions, such as
i.MX95 v4.1, i.MX94 v4.3, and i.MX952 v4.6. Unlike the PF, the VF does
not have access to a global register that exposes the IP minor revision.
In the current driver model, the VF must select the appropriate driver
data based on this revision information.
To support this requirement, the VF now sends a minor revision query
message to the PF through the VSI-to-PSI mailbox mechanism. The PF
responds with the IP minor revision so that the VF can match the correct
driver data.
This patch adds PF-side support for replying to the minor revision
message and VF-side support for sending the query.
Wei Fang [Fri, 22 May 2026 09:24:32 +0000 (17:24 +0800)]
net: enetc: convert mailbox messages to new formats
On the LS1028A platform, the PF-VF mailbox was only used to update the
VF's MAC address. The original message format is minimal, lacks a clear
structure, and provides no means for the receiver to validate message
integrity, making it difficult to extend for new features.
With the introduction of i.MX ENETC v4, the interaction between PF and
VF has become significantly more complex. Typical deployments now include
scenarios where the PF is controlled by an M core while the VF is driven
by either the Linux kernel or DPDK, or where the PF is controlled by the
Linux kernel while the VF is controlled by DPDK. These heterogeneous
driver combinations require a unified and extensible message format to
ensure compatibility across different operating environments.
This patch introduces a newly defined PF-VF message structure and
converts the existing MAC-update mechanism to use the new format. The
redesigned message layout provides:
- extensibility to support future PF-VF features on ENETC v4,
- consistent framing for all message types,
- improved data integrity checking,
- a common protocol usable across Linux, M core firmware, and DPDK.
Additional PF-VF message types will be added in subsequent patches.
Note that switching to the new message format will not affect ENETC v1
(LS1028A). Due to a hardware limitation of ENETC v1, the ENETC PF and
VFs can only be controlled by the same OS. If the PF is controlled by
the Linux kernel driver, then the VFs must also be controlled by the
Linux kernel driver.
Wei Fang [Fri, 22 May 2026 09:24:30 +0000 (17:24 +0800)]
net: enetc: integrate enetc_msg.c into enetc-pf-common driver
Move enetc_msg.c from the fsl-enetc driver to the nxp-enetc-pf-common
driver so that SR-IOV mailbox handling can be shared between ENETC v1
and v4 PF drivers.
Changes:
- Move enetc_msg.o compilation from fsl-enetc to nxp-enetc-pf-common
- Export enetc_sriov_configure() with EXPORT_SYMBOL_GPL for use by
both PF drivers
The fsl-enetc driver now depends on nxp-enetc-pf-common for SR-IOV
functionality.
Wei Fang [Fri, 22 May 2026 09:24:29 +0000 (17:24 +0800)]
net: enetc: relocate SR-IOV configuration helper for common PF support
Move enetc_sriov_configure() from enetc_pf.c to enetc_msg.c to prepare
for integrating enetc_msg.c into the enetc-pf-common driver, where it
will be shared between ENETC v1 and v4 PF drivers.
Since enetc_msg_psi_init() and enetc_msg_psi_free() are now only called
from enetc_sriov_configure() within the same file, make them static.
Wei Fang [Fri, 22 May 2026 09:24:27 +0000 (17:24 +0800)]
net: enetc: use enetc_set_si_hw_addr() for setting MAC address
Replace enetc_pf_set_primary_mac_addr() with the generic
enetc_set_si_hw_addr() function. This prepares for moving
enetc_msg_pf_set_vf_primary_mac_addr() to the enetc-pf-common driver,
where it can be shared between ENETC v1 and v4 PF drivers.
====================
mv88e6xxx: Cache scratch config3 of 6352
On the mv88e6352, a scratch register in the Global Control 2 set of
registers reports which port is attached to the SERDES. This value is a
pin strapping value and is set after the switch is released from reset.
Thus, it can be cached during chip setup instead of reading the
register every time a SERDES check is needed.
The series consists of 4 parts:
1. Add new mv88e6352_reset function as ops->reset
2. Cache the register value in this reset function
3. Refactor mv88e6352_g2_scratch_port_has_serdes to use the cached
value.
4. Remove the locks surrounding mv88e6352_g2_scratch_port_has_serdes.
Fidan Aliyeva [Thu, 21 May 2026 20:29:23 +0000 (22:29 +0200)]
mv88e6xxx: Use cached config3 in 6352 has_serdes
1. Refactor mv88e6352_g2_scratch_port_has_serdes to use the cached
scratch config3 value instead of reading it every time.
2. Remove the err < 0 check from mv88e6352_phylink_get_caps, as it can
never be true anymore.
Co-developed-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com> Signed-off-by: Thomas Eckerman <thomas.eckerman.ext@ericsson.com> Signed-off-by: Fidan Aliyeva <fidan.aliyeva.ext@ericsson.com> Link: https://patch.msgid.link/20260521202924.727929-4-fidan.aliyeva.ext@ericsson.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net: mdio: realtek-rtl9300: Groundwork for multi SOC support
The Realtek Otto switch platform consists of four different series:
- RTL838x aka maple : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango : 56 port 1G/2.5G/10G Switches
The existing realtek-rtl9300 MDIO driver was only designed for RTL930x
devices. The three other SOCs are not supported although they basically
incorporate a very similar MDIO controller.
This series is the first step in a multi-stage approach to also support
the missing SOCs. Device specific properties and registers are converted
into designated initializers. Based on this, new devices can be added
much more easily in future commits.
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
====================
net: mdio: realtek-rtl9300: Link I/O functions in info structure
The MDIO controller registers of the different devices of the
Realtek Otto switch series are very similar. Nevertheless, each
device needs to feed the command data, which is distributed over
the controller registers, slightly differently.
E.g. the combined C22/command register has different field layouts.
On RTL930x bits 24-20 define the to-be-accessed C22 register number
while on RTL839x this is stored in bits 9-5.
Thus there need to be device specific read/write functions that
are called dynamically. Add them into the info structure and make
use of them where needed.
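A minimal sketch of the idea, not the driver's actual code: per-device helpers place the C22 register number at the bit positions quoted above, and generic code reaches them only through function pointers in an info structure. All identifiers (sketch_*) are hypothetical; only the bit ranges 24-20 and 9-5 come from the description.

```c
#include <stdint.h>

/* Hypothetical info structure: one hook per device-specific encoding. */
struct sketch_emdio_info {
	/* Place a C22 register number into the device's command word. */
	uint32_t (*c22_regnum)(uint32_t regnum);
};

/* RTL930x: the C22 register number occupies bits 24-20. */
static uint32_t sketch_9300_c22_regnum(uint32_t regnum)
{
	return (regnum & 0x1f) << 20;
}

/* RTL839x: the C22 register number occupies bits 9-5. */
static uint32_t sketch_8390_c22_regnum(uint32_t regnum)
{
	return (regnum & 0x1f) << 5;
}

static const struct sketch_emdio_info sketch_9300_info = {
	.c22_regnum = sketch_9300_c22_regnum,
};

static const struct sketch_emdio_info sketch_8390_info = {
	.c22_regnum = sketch_8390_c22_regnum,
};

/* Generic code never hardcodes a layout; it goes through the info hooks. */
static uint32_t sketch_build_cmd(const struct sketch_emdio_info *info,
				 uint32_t regnum)
{
	return info->c22_regnum(regnum);
}
```

Calling sketch_build_cmd() with the same register number and a different info pointer yields a differently laid out command word, which is the point of the indirection.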
net: mdio: realtek-rtl9300: Add port mask register
MDIO controller commands work on ports. These are converted by
the driver and hardware back and forth to bus/address. For write
commands a port mask register needs to be filled. Each bit tells the
controller to which PHY the write will be issued. Setting multiple
bits allows programming multiple PHYs in one step. The driver will
not make use of this parallel write feature. But it must at least
fill the bit of the target port that it wants to write to.
Depending on the SOC type and the number of supported PHYs this is
either one or two 32 bit port mask registers. The driver currently only
supports the 28 port RTL930x SOCs. So provide only the mask register
for the lower 32 ports. Add it to the register structure and make use
of it where needed.
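As an illustration only, a single-port write mask for a 28 port device could be computed as below. The names are hypothetical; the one-bit-per-port rule and the 28 port count are taken from the description.

```c
#include <stdint.h>

#define SKETCH_NUM_PORTS 28	/* RTL930x port count, per the description */

/* Returns the mask word for a single-port write, or 0 for an invalid port. */
static uint32_t sketch_port_mask(unsigned int port)
{
	if (port >= SKETCH_NUM_PORTS)
		return 0;
	return UINT32_C(1) << port;
}
```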
The MDIO data that needs to be written or read to registers of the
controller is handled by an I/O register. Add that to the register
structure and make use of it where needed.
Command issuing/status bits and C22 data share the same register. In the
future the number of places where this register is used will be:
- One generic command helper/runner for all devices that will access the
command bits of the register
- 8 device specific C22 read/write functions that will access the C22
data fields.
Thus name the register c22_data to align with the existing c45_data
register. This way all device specific helpers will have a common
view on the to-be-fed data. Add the register to the existing structure
and make use of it where needed.
The MDIO controller of the Realtek Otto switches has either 4 or 7 command
registers. This depends on the number of supported ports. These registers
are "scattered" around the MMIO block and their addresses depend on the
specific model.
Nevertheless all command registers share a common pattern:
- A mask register with one bit per addressed port
(remark: the driver internally works on ports instead of bus/address)
- An I/O data register that transfers the to be read/written data
- A C45 register that takes devnum and regnum
- A C22 register that also includes run and status bits
(remark: this also takes the Realtek proprietary C22 PHY page)
Provide an additional structure for these command registers so it can be
reused in two places.
1. For defining the register addresses in the regmap.
2. For defining the to be read/written register data
net: mdio: realtek-rtl9300: Add pages to info structure
The Realtek ethernet MDIO controller has a proprietary paging feature
that is closely aligned with Realtek-based PHYs. These PHYs know "pages"
for C22 access. Those can be switched via reads/writes to register 31.
Usually the paged access must be programmed in four steps.
1. read/save page register
2. change "page" register 31
3. read/write data register (on the given page)
4. restore page register
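The four steps can be sketched in software against a toy PHY model. This is an illustration under assumptions, not driver code: the sketch_* names, the tiny page count and the in-memory register file are all made up; only the role of register 31 as page selector comes from the text.

```c
#include <stdint.h>

#define SKETCH_PAGE_REG		31
#define SKETCH_NUM_PAGES	4	/* tiny on purpose; real PHYs have more */
#define SKETCH_NUM_REGS		32

/* Toy PHY model: register 31 selects the page seen by registers 0-30. */
struct sketch_phy {
	uint16_t page;
	uint16_t regs[SKETCH_NUM_PAGES][SKETCH_NUM_REGS];
};

static uint16_t sketch_phy_read(struct sketch_phy *phy, unsigned int reg)
{
	if (reg == SKETCH_PAGE_REG)
		return phy->page;
	return phy->regs[phy->page][reg % SKETCH_NUM_REGS];
}

static void sketch_phy_write(struct sketch_phy *phy, unsigned int reg,
			     uint16_t val)
{
	if (reg == SKETCH_PAGE_REG)
		phy->page = val % SKETCH_NUM_PAGES;
	else
		phy->regs[phy->page][reg % SKETCH_NUM_REGS] = val;
}

/* Paged read done "by hand", following the four steps listed above. */
static uint16_t sketch_read_paged(struct sketch_phy *phy, uint16_t page,
				  unsigned int reg)
{
	uint16_t saved = sketch_phy_read(phy, SKETCH_PAGE_REG);	/* step 1 */
	uint16_t val;

	sketch_phy_write(phy, SKETCH_PAGE_REG, page);		/* step 2 */
	val = sketch_phy_read(phy, reg);			/* step 3 */
	sketch_phy_write(phy, SKETCH_PAGE_REG, saved);		/* step 4 */
	return val;
}
```

Each paged access costs four bus transactions when done this way.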
The controller can run all this in hardware with one single request
from the driver. It is given the page, the register and the data
and takes care of all the rest. This reduces CPU load. The number
of supported pages depends on the model. This is either 4096 for low
port count SOCs (up to 28 ports) or 8192 for high port count SOCs
(up to 56 ports).
There is however one special page that allows passing all C22
commands directly through to the PHY, without any caching. This so-called
raw page is dependent on the hardware. It is the highest supported page
number minus 1.
Provide the number of supported pages as a device specific property.
This new "num_pages" aligns with the existing properties and gives
a better insight into the hardware layout than just defining the
number of the raw page. The latter directly derives from that and
can be accessed with the new RAW_PAGE() macro. Make use of it where
needed.
net: mdio: realtek-rtl9300: Add ports to info structure
The ethernet MDIO controller in the Realtek Otto series has a very special
command register style. Instead of working with bus/address it works on
ethernet port numbers. For this the controller is initialized via mapping
registers that tell which port is mapped to which bus/address. Every
request to the driver is then converted as follows:
1. Kernel calls driver with bus/address
2. Driver converts bus/address to port and issues command
3. Hardware maps port back to bus/address
The number of ports is different for each device. Make this configurable
by adding a property to the info structure. Switch the existing usage of
MAX_PORTS to this new property where needed.
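Step 2 of the conversion can be pictured as a table lookup. The sketch below is an assumption about shape only: the mapping table, the sketch_* names and the linear search are hypothetical, and the port count is passed in to mirror the idea of a per-device property instead of a fixed MAX_PORTS.

```c
#include <stdint.h>

/* Hypothetical mirror of the hardware mapping: port -> bus/address. */
struct sketch_port_map {
	uint8_t bus;
	uint8_t addr;
};

/* Returns the port mapped to bus/addr, or -1 if no port is mapped there. */
static int sketch_to_port(const struct sketch_port_map *map, int num_ports,
			  int bus, int addr)
{
	int port;

	for (port = 0; port < num_ports; port++)
		if (map[port].bus == bus && map[port].addr == addr)
			return port;
	return -1;
}
```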
net: mdio: realtek-rtl9300: Add device specific info structure
Device properties of the RTL930x SOCs are hardcoded into the MDIO driver.
This must be relaxed to support additional devices like the RTL838x or
RTL839x. These do not have 4 SMI buses but 1 or 2 instead.
To support multiple devices establish an info structure that contains
individual variations of each series. As a first use case add the number
of buses into this structure and use it where needed.
The Realtek ethernet MDIO driver currently only serves SOCs from the
Realtek RTL930x series. This is only one lineup of the Realtek Otto
switch series that also knows RTL838x, RTL839x, RTL931x devices.
All of these share similar hardware with comparable MMIO access logic
but have individual variations. Important to note:
- Controller works on switch ports instead of buses and addresses.
- Devices incorporate additional MDIO hardware. E.g.
- an auxiliary MDIO controller for GPIO expanders [1]
- an MDIO-style SerDes controller [2]
To avoid future confusion enhance the driver documentation and
function naming. Make clear what this driver is about and what
parts are generic and what parts are device specific. For this
rename the function and structure prefix as follows:
- for generic functions use otto_emdio_
- for device specific helpers use e.g. otto_emdio_9300_
This prefix naming tries to align with the watchdog timer [3].
It paves the way so that drivers for the other Realtek Otto MDIO
controllers can be added in future commits using the same naming
convention.
Remark 1: The read/write functions are kept device specific for now
because they will only fit the RTL930x SOCs. Renaming will take place
as soon as the I/O handling will be generalized.
Remark 2: The driver name "mdio-rtl9300" is kept for now.
Len Bao [Sat, 23 May 2026 15:07:35 +0000 (15:07 +0000)]
eth: dpaa2: constify dpaa2_ethtool_stats and dpaa2_ethtool_extras
The 'dpaa2_ethtool_stats' and 'dpaa2_ethtool_extras' structures are
initialized in their declarations and never changed. So, constify them
to reduce the attack surface.
Rosen Penev [Thu, 21 May 2026 21:59:08 +0000 (14:59 -0700)]
net: ibm: emac: Use napi_gro_receive() for Rx packets
emac_poll_rx() already runs in NAPI context and TAH-equipped EMACs set
CHECKSUM_UNNECESSARY on verified frames, which lets GRO coalesce TCP
segments without a software checksum on the merge path. Replace the
per-poll rx_list batching via netif_receive_skb_list() with direct
napi_gro_receive() calls so the stack can merge segments into super-skbs
and skip a full traversal per packet -- a meaningful win on the slow
4xx-class CPUs this driver targets.
Small routing speed improvement tested on a Cisco Meraki MX60W:
Patch 1 reduces stack usage in mlx5e_pcie_cong_get_thresh_config()
by reusing a single union devlink_param_value across four
devl_param_driverinit_value_get() calls (instead of
union devlink_param_value val[4] on the stack) and assigning each
vu16 into mlx5e_pcie_cong_thresh, so the helper stays under the
frame-size warning limit as the union grows.
Patch 2 changes devlink_nl_param_value_put() and
devlink_nl_param_value_fill_one() to pass union devlink_param_value
by pointer instead of by value. Passing two copies of the union
by value in the param netlink path consumes over 500 bytes of argument
stack and risks CONFIG_FRAME_WARN as the union grows beyond its
historical size.
====================
Picking a couple of uncontroversial changes from the series
since it's making very slow progress.
Ratheesh Kannoth [Thu, 21 May 2026 09:52:57 +0000 (15:22 +0530)]
devlink: pass param values by pointer
union devlink_param_value grows substantially once U64 array
parameters are added to devlink (from 32 bytes to over 264 bytes).
devlink_nl_param_value_fill_one() and devlink_nl_param_value_put()
copy the union by value in several places. Passing two instances as
value arguments alone consumes over 528 bytes of stack; combined with
deeper call chains the parameter stack can approach 800 bytes and trip
CONFIG_FRAME_WARN more easily.
Switch internal helpers and exported driver APIs to pass pointers to
union devlink_param_value rather than passing the union by value.
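The cost difference can be shown with a stand-in union. This is a sketch under assumptions: union sketch_param_value is not the real devlink union, and its 272 byte size is only chosen to land above the 264 bytes mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a large parameter union (272 bytes here). */
union sketch_param_value {
	uint16_t vu16;
	uint64_t vu64_arr[34];
};

/* By value: every call copies two full unions for the callee. */
static int sketch_equal_by_value(union sketch_param_value a,
				 union sketch_param_value b)
{
	return a.vu16 == b.vu16;
}

/* By pointer: every call passes two pointer-sized arguments. */
static int sketch_equal_by_ptr(const union sketch_param_value *a,
			       const union sketch_param_value *b)
{
	return a->vu16 == b->vu16;
}

/* Bytes of argument data handed over per call in each style. */
static size_t sketch_by_value_arg_bytes(void)
{
	return 2 * sizeof(union sketch_param_value);
}

static size_t sketch_by_ptr_arg_bytes(void)
{
	return 2 * sizeof(union sketch_param_value *);
}
```

Both comparison helpers return the same result; only the amount of data moved per call differs.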
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Acked-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Arthur Kiyanovski <akiyano@amazon.com> # for ena
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260521095303.2395584-4-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ratheesh Kannoth [Thu, 21 May 2026 09:52:56 +0000 (15:22 +0530)]
net/mlx5e: Reduce stack use reading PCIe congestion thresholds
union devlink_param_value grew when U64 array parameters were added.
Keeping union devlink_param_value val[4] in
mlx5e_pcie_cong_get_thresh_config() exceeded the compiler's
-Wframe-larger-than limit.
Reuse one union: call devl_param_driverinit_value_get() once per
MLX5 PCIe congestion threshold and assign each vu16 to the
corresponding mlx5e_pcie_cong_thresh member.
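A rough sketch of the "reuse one union" pattern follows. Everything here is hypothetical: sketch_value_get() merely stands in for devl_param_driverinit_value_get(), and the threshold structure, its layout and the default values are invented for the example.

```c
#include <stdint.h>

/* Illustrative stand-in for a large parameter union. */
union sketch_param_value {
	uint16_t vu16;
	uint64_t vu64_arr[34];
};

/* Hypothetical threshold container; the real structure may differ. */
struct sketch_thresh {
	uint16_t t[4];
};

/* Hypothetical getter standing in for devl_param_driverinit_value_get(). */
static int sketch_value_get(unsigned int id, union sketch_param_value *val)
{
	static const uint16_t defaults[4] = { 90, 75, 90, 75 };

	if (id >= 4)
		return -1;
	val->vu16 = defaults[id];
	return 0;
}

static int sketch_get_thresh(struct sketch_thresh *out)
{
	union sketch_param_value val;	/* one scratch union, reused each pass */
	unsigned int i;
	int err;

	for (i = 0; i < 4; i++) {
		err = sketch_value_get(i, &val);
		if (err)
			return err;
		out->t[i] = val.vu16;	/* keep only the u16 that is needed */
	}
	return 0;
}
```

The frame holds one union instead of four, so its size grows by one union, not four, when the union gets larger.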
Jakub Kicinski [Mon, 25 May 2026 20:48:20 +0000 (13:48 -0700)]
Merge branch 'net-mlx5-add-satellite-pf-support'
Tariq Toukan says:
====================
net/mlx5: Add satellite PF support
A satellite PF is a new SmartNIC configuration that adds another
physical function on the DPU that is not an eswitch manager and not a
page manager. The satellite PF can have its own SFs and can be passed
through to a VM on the DPU, providing an isolated function for users who
should not have access to the privileged ECPF. The ECPF handles the
satellite PF and the host PF in a similar way, using the same management
framework.
This series adds support for satellite PFs (SPFs) in the mlx5 eswitch.
SPFs are discovered through the v1 response layout of the
query_esw_functions command, introduced in the previous infrastructure
preparation series.
The first four patches discover satellite PFs, allocate eswitch vports
for them and their SFs, and extend the SF hardware table to manage SPF
SF entries.
The next five patches expose PF numbers from firmware, map SF
controllers to their pfnum, register devlink ports with proper
attributes, and register SF resource on satellite PF ports.
The final four patches add devlink port state management, FDB peer miss
rules, dedicated page accounting, and SF resource registration for
satellite PF vports.
This series builds on the eswitch infrastructure preparation series
previously submitted.
====================
Moshe Shemesh [Thu, 21 May 2026 11:08:43 +0000 (14:08 +0300)]
net/mlx5: Add SPF function type for page management
Add MLX5_SPF to enum mlx5_func_type so SPFs get their own page counter,
and add the corresponding WARN check at page cleanup. Wait for SPF pages
to be reclaimed during ECPF teardown, alongside the existing host PF and
VF page waits.
SPF page requests are always identified by vhca_id, so the legacy
func_id_to_type() path is not reached for satellite PFs.
Moshe Shemesh [Thu, 21 May 2026 11:08:41 +0000 (14:08 +0300)]
net/mlx5: Support state get/set for satellite PF ports
Extend mlx5_devlink_pf_port_fn_state_get() to support satellite PF
vports by querying their vhca_state from the query_esw_functions output
using the vport's vhca_id.
Extend mlx5_devlink_pf_port_fn_state_set() to support satellite PFs by
using the generic mlx5_esw_pf_enable/disable_hca() functions.
Moshe Shemesh [Thu, 21 May 2026 11:08:39 +0000 (14:08 +0300)]
net/mlx5: Register devlink ports for satellite PFs
Include satellite PFs in mlx5_eswitch_is_pf_vf_vport() so they receive
the standard PF/VF devlink port operations. Update
mlx5_esw_devlink_port_supported() and devlink port attribute setup to
register SPF devlink ports with controller number and PF number.
Add mlx5_esw_spf_vport_to_idx() to look up the SPF array index by vport
number, and mlx5_esw_is_spf_vport() boolean wrapper to identify
satellite PF vports.
Moshe Shemesh [Thu, 21 May 2026 11:08:38 +0000 (14:08 +0300)]
net/mlx5: Map SF controller to pfnum for satellite PFs
SF devlink port creation and registration used the ECPF's PCI function
as pfnum. Extend this to support satellite PF controllers by introducing
mlx5_esw_sf_controller_to_pfnum() that maps a controller number to the
corresponding PF number, and use it in SF port attribute setup and SF
creation validation.
Reorder the checks in mlx5_devlink_sf_port_new() so that
mlx5_sf_table_supported() runs before attribute validation, since the
new helper requires the eswitch to be initialized.
Moshe Shemesh [Thu, 21 May 2026 11:08:37 +0000 (14:08 +0300)]
net/mlx5: Expose PF number from query_esw_functions
Extract pci_device_function from the query_esw_functions output for both
the host PF and satellite PFs, storing it alongside the existing
host_number field.
Add mlx5_esw_get_hpf_pf_num() helper that returns the host PF's actual
PCI device function when the new query format is supported, falling back
to PCI_FUNC(dev->pdev->devfn) for older firmware. Use it in devlink port
attribute setup so that host PF and VF devlink ports report the correct
PF number rather than the ECPF's own PCI function number.
Moshe Shemesh [Thu, 21 May 2026 11:08:36 +0000 (14:08 +0300)]
net/mlx5: Support SPF SFs in SF hardware table
Convert the SF hardware table from a fixed-size hwc array to a
dynamically allocated one, supporting satellite PF (SPF) SFs alongside
local and external host SFs. Initialize hwc entries for each SPF using
its host_number as controller. Rename MLX5_SF_HWC_EXTERNAL to
MLX5_SF_HWC_EXT_HOST and add MLX5_SF_HWC_FIRST_SPF for clarity.
Moshe Shemesh [Thu, 21 May 2026 11:08:35 +0000 (14:08 +0300)]
net/mlx5: Initialize satellite PF SF vports
Extend satellite PF (SPF) initialization to allocate SF vports for each
SPF. For each discovered SPF, query its SF capabilities, allocate SF
vports, and store the host_number for controller identification.
Add accessor APIs mlx5_esw_get_num_spfs(),
mlx5_esw_spf_get_host_number(), mlx5_esw_sf_max_spf_functions(), and
mlx5_esw_has_spf_sfs() for use by the SF hardware table in a subsequent
patch. Also extend mlx5_esw_offloads_controller_valid() to accept SPF
controllers in addition to the host PF controller.
Moshe Shemesh [Thu, 21 May 2026 11:08:34 +0000 (14:08 +0300)]
net/mlx5: Initialize host PF host number earlier
Move host_number from esw->offloads to esw->esw_funcs as hpf_host_number
and initialize it during vports_init instead of offloads_enable. This
makes the host PF host number available earlier in the initialization
sequence, which is required for upcoming SF hardware table support for
satellite PFs.
Add a mlx5_esw_get_hpf_host_number() accessor to retrieve the stored
host number.
Moshe Shemesh [Thu, 21 May 2026 11:08:33 +0000 (14:08 +0300)]
net/mlx5: Introduce generic helper for PF SFs info
Introduce mlx5_esw_sf_max_pf_functions() that queries a PF's max_num_sf
and sf_base_id using mlx5_vport_get_other_func_general_cap(), which
supports both function_id and vhca_id based addressing.
Refactor mlx5_esw_sf_max_hpf_functions() into a thin wrapper that adds
the host PF precondition checks and calls the new generic helper. Remove
mlx5_query_hca_cap_host_pf() as it is not used anymore.
This prepares for querying SFs info of Satellite PFs.
Moshe Shemesh [Thu, 21 May 2026 11:08:32 +0000 (14:08 +0300)]
net/mlx5: Add satellite PF vport support
Discover satellite PFs from query_esw_functions output and allocate
eswitch vports for them. For each satellite PF, create a vport via the
CREATE_ESW_VPORT command using its vhca_id and allocate it in the
eswitch vport table.
When enabling switchdev mode, the ECPF acting as the eswitch manager
activates each satellite PF with enable_hca, loads its vport and adds
a representor. Since satellite PF devlink ports are registered in a
later patch, guard mlx5_esw_offloads_devlink_port() against vports
with no devlink port to avoid NULL dereference during representor
attach.
Dan Carpenter [Thu, 21 May 2026 12:49:36 +0000 (15:49 +0300)]
net: lan966x: cleanup error handling in lan966x_fdma_rx_alloc_page_pool()
This code works, but there are a few things to tidy up:
1. No need for an unlikely() because IS_ERR() already has an unlikely()
built in.
2. No need to use PTR_ERR_OR_ZERO() because it's not an error pointer.
3. Use the returned error code directly instead of groveling in
rx->page_pool to find it.
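The tidied-up shape might look like the userspace sketch below. It is not the lan966x code: the sketch_* helpers only imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention, and the pool allocator is a hypothetical stub.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static void *sketch_err_ptr(intptr_t err)
{
	return (void *)err;
}

static intptr_t sketch_ptr_err(const void *p)
{
	return (intptr_t)p;
}

static int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical allocator: returns a pool or an encoded error. */
static void *sketch_pool_create(int fail)
{
	static int pool;

	return fail ? sketch_err_ptr(-ENOMEM) : (void *)&pool;
}

struct sketch_rx {
	void *page_pool;
};

static int sketch_rx_alloc(struct sketch_rx *rx, int fail)
{
	void *pool = sketch_pool_create(fail);

	if (sketch_is_err(pool))
		return (int)sketch_ptr_err(pool);	/* use the error directly */

	rx->page_pool = pool;	/* only a valid pointer is ever stored */
	return 0;
}
```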
The handlers do not filter by the caller's network namespace.
rds_info_getsockopt() has no netns or capable() check, and
rds_create() has no capable() check, so AF_RDS is reachable from
an unprivileged user namespace. As a result, an unprivileged
caller in a fresh user_ns plus netns can read the bound address
and sock inode of every RDS socket on the host, the peer address
of incoming messages on every RDS socket on the host, the peer
address and TCP sequence numbers of every rds-tcp connection on
the host, and the peer address and RDS sequence numbers of every
RDS connection on the host.
The rds-tcp transport is reachable from a non-initial netns (see
rds_set_transport()), so a one-shot init_net gate at
rds_info_getsockopt() would deny legitimate per-netns visibility
to rds-tcp callers. Instead, filter at each handler by comparing
the netns of the caller's socket to the netns of the list entry,
or to rds_conn_net(conn) for connection paths. Only copy entries
whose netns matches the caller. Counters (RDS_INFO_COUNTERS) are
aggregate statistics and remain global.
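The per-handler filter boils down to a pointer comparison per list entry. A simplified sketch, with hypothetical types and names (the real handlers walk RDS socket and connection lists and copy richer records):

```c
#include <stddef.h>

struct sketch_net {
	int id;
};

struct sketch_entry {
	const struct sketch_net *net;	/* namespace the entry lives in */
	unsigned int port;
};

/*
 * Copies the ports of entries in the caller's namespace into out[]
 * (at most out_len of them) and returns how many entries matched.
 */
static size_t sketch_copy_visible(const struct sketch_entry *list, size_t n,
				  const struct sketch_net *caller,
				  unsigned int *out, size_t out_len)
{
	size_t i, matched = 0;

	for (i = 0; i < n; i++) {
		if (list[i].net != caller)
			continue;	/* other namespace: not visible */
		if (matched < out_len)
			out[matched] = list[i].port;
		matched++;
	}
	return matched;
}
```

A caller in a fresh namespace gets a count of zero even when other namespaces hold entries.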
Reproducer (KASAN VM, rds and rds_tcp loaded): an AF_RDS socket
binds 127.0.0.1:4242 in init_net as root. A child process enters
a fresh user_ns plus netns and opens AF_RDS there, then calls
getsockopt(SOL_RDS, RDS_INFO_SOCKETS). Before this change, the
child sees the init_net socket. After this change, the child
sees zero entries.
Drop the rds_sock_count, rds_tcp_tc_count, and rds6_tcp_tc_count
globals. v2 used them for the size precheck and lens->nr; v3
replaced the precheck with a per-ns count from a first pass over
the list, so the globals have no remaining readers. The matching
increments and decrements in rds_create()/rds_destroy_sock() and
rds_tcp_set_callbacks()/rds_tcp_restore_callbacks() go away with
them. Reported by the kernel test robot under clang W=1.
Jiayuan Chen [Fri, 22 May 2026 01:16:20 +0000 (09:16 +0800)]
rds: annotate data-race around rs_seen_congestion
rs_seen_congestion is read in rds_poll() and written in rds_sendmsg()
and rds_poll() without any lock. Use READ_ONCE()/WRITE_ONCE() to
annotate these lockless accesses and silence KCSAN.
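For readers outside the kernel tree, the annotation pattern can be approximated as follows. The macros are simplified userspace stand-ins for READ_ONCE()/WRITE_ONCE() (they rely on the GCC/Clang __typeof__ extension), and the two functions are hypothetical, not the rds code.

```c
/* Simplified stand-ins: force exactly one access through a volatile cast. */
#define SKETCH_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct sketch_sock {
	int seen_congestion;
};

/* Writer side: the send path records that congestion was seen. */
static void sketch_note_congestion(struct sketch_sock *rs)
{
	SKETCH_WRITE_ONCE(rs->seen_congestion, 1);
}

/* Poll side: samples the flag once, then clears it if it was set. */
static int sketch_poll_congestion(struct sketch_sock *rs)
{
	int seen = SKETCH_READ_ONCE(rs->seen_congestion);

	if (seen)
		SKETCH_WRITE_ONCE(rs->seen_congestion, 0);
	return seen;
}
```

The annotations do not add locking or ordering; they mark the accesses as intentionally lockless and keep the compiler from tearing or repeating them.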
Yuyang Huang [Fri, 22 May 2026 09:39:06 +0000 (18:39 +0900)]
ipv4: igmp: annotate data-races around im->users
/proc/net/igmp walks IPv4 multicast memberships under RCU and
prints im->users without holding RTNL, while multicast join and leave
paths update the field while holding RTNL. Annotate this intentional
lockless snapshot with READ_ONCE() and the matching writers with
WRITE_ONCE().
Giuseppe Caruso [Fri, 10 Apr 2026 13:57:33 +0000 (09:57 -0400)]
netfilter: nf_conntrack_ftp: avoid u16 overflows
get_port() and try_number() parse comma-separated decimal values from FTP PORT
and EPRT commands into a u_int32_t array, but do not validate that each
value fits in a single octet. RFC 959 specifies that PORT parameters
are decimal integers in the range 0-255, representing the four octets
of an IP address followed by two octets encoding the port number.
Values exceeding 255 are silently accepted. In try_rfc959(), the raw
u32 values are combined via shift-and-OR to form the IP and port:
When array elements exceed 255, bits from one field bleed into adjacent
fields after shifting, producing IP addresses and port numbers that
differ from what the text representation suggests. For example,
"PORT 10,0,1,2,256,22" yields port (256<<8)|22 = 65558, truncated to
u16 = 22. This mismatch between the textual and computed values can
confuse network monitoring tools that parse FTP commands independently.
Ignore the command by returning 0 (no match) when any accumulated
value exceeds 255 so that no expectation is created.
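The arithmetic from the example, plus one possible form of the range check, is sketched below. The function names are hypothetical and the check is placed at combination time purely for illustration; the actual fix may validate the values elsewhere in the parser.

```c
#include <stdint.h>

/* The unchecked combination, kept only to reproduce the truncation. */
static uint16_t sketch_unchecked_port(uint32_t hi, uint32_t lo)
{
	return (uint16_t)((hi << 8) | lo);
}

/*
 * Returns 1 and fills ip/port when all six values are valid octets,
 * 0 (no match) otherwise.
 */
static int sketch_combine(const uint32_t array[6], uint32_t *ip,
			  uint16_t *port)
{
	int i;

	for (i = 0; i < 6; i++)
		if (array[i] > 255)
			return 0;

	*ip = (array[0] << 24) | (array[1] << 16) | (array[2] << 8) | array[3];
	*port = (uint16_t)((array[4] << 8) | array[5]);
	return 1;
}
```

With "PORT 10,0,1,2,256,22" the unchecked helper computes 65558 and truncates it to 22, while sketch_combine() rejects the command.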
Signed-off-by: Giuseppe Caruso <giuseppecaruso0990@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
The avx2 lookup routines get the next map index to process passed as a
function argument, but this isn't obvious because it's hidden in the
lookup macro.
Additionally, a recent LLM review pointed out the following "bug":
-------------------------------------------------------------
> b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
> if (last)
> - return b;
> + ret = b;
>
> if (unlikely(ret == -1))
> ret = b / XSAVE_YMM_SIZE;
Does this change introduce a logic error when last=true and no match is
found? [..]
Should this be changed to an else-if structure instead?
-------------------------------------------------------------
LLM sees a control-flow change, but there is none:
All call sites invoke nft_pipapo_avx2_refill() only when at least one
bit in the map is set, i.e. nft_pipapo_avx2_refill() never returns -1.
Add a runtime debug check that fires if we'd return -1 as additional
documentation and also make the suggested change, code might be easier
to understand this way.
In commit 17a20e09f086 ("netfilter: nft_set: remove one argument from
lookup and update functions") I incorrectly moved the "ret" scope into
the loop.
This has no effect on the correctness, but it can (depending on map sizes)
cause a redundant repeat of an earlier processing step.
Restore the intended 'pass map index' instead of always-0. Note that I
did not see any change in performance numbers, but Stefano correctly
points out that the existing perf test likely lacks a sparse intermediate
bitmap (between fields) with a lot of leading zeroes.
parse_dcc() treats data_end as an inclusive end pointer, but its only
caller passes data_limit = ib_ptr + datalen, which points one past the
last valid byte.
The newline search loop iterates while tmp <= data_end, so when no
newline is present, *tmp is read at tmp == data_end, one byte beyond
the region filled by skb_header_pointer().
irc_buffer is kmalloc'd as MAX_SEARCH_SIZE + 1 bytes and datalen is
capped at MAX_SEARCH_SIZE, so the stray read does not fault. The byte
is uninitialized or stale; if it contains an ASCII digit, simple_strtoul
will consume it and produce a wrong DCC IP or port in the conntrack
expectation. The extra allocation byte is also a fragile guard: if the
cap or allocation size changes, this becomes a real out-of-bounds read.
Change the loop and its post-loop check to use strict less-than,
consistent with the caller's exclusive-end convention. Update the
function comment accordingly.
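The exclusive-end convention can be sketched as a small search helper. This is an illustration with a hypothetical name, not the parse_dcc() code itself.

```c
#include <stddef.h>

/*
 * Returns a pointer to the first '\n' in [data, data_end), or NULL.
 * data_end points one past the last valid byte and is never dereferenced.
 */
static const char *sketch_find_newline(const char *data, const char *data_end)
{
	const char *tmp;

	for (tmp = data; tmp < data_end; tmp++)	/* strict less-than */
		if (*tmp == '\n')
			return tmp;
	return NULL;
}
```

With "tmp <= data_end" instead, the loop would inspect the byte at data_end and could report a newline that lies outside the valid region.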
David Carlier [Sat, 11 Apr 2026 18:57:21 +0000 (19:57 +0100)]
netfilter: nfnl_cthelper: apply per-class values when updating policies
When a userspace conntrack helper with multiple expectation classes is
updated via nfnetlink, every class ends up with the first class's
max_expected and timeout values.
nfnl_cthelper_update_policy_all() validates each new policy into the
corresponding slot of the temporary new_policy array, but the second
loop that commits the values into the live helper dereferences
new_policy as a pointer instead of indexing it, so every iteration
reads new_policy[0] regardless of i. An update that changes per-class
values is silently collapsed onto class 0's values with no error
returned to userspace.
Index the temporary array by i in the commit loop so each class gets
its own validated values.
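The difference between dereferencing and indexing the temporary array is easy to reproduce in isolation. The structure and function names below are hypothetical simplifications of the commit loop described above.

```c
struct sketch_policy {
	unsigned int max_expected;
	unsigned int timeout;
};

/* Buggy form: new_policy-> always reads element 0, whatever i is. */
static void sketch_commit_buggy(struct sketch_policy *live,
				const struct sketch_policy *new_policy, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		live[i].max_expected = new_policy->max_expected;
		live[i].timeout = new_policy->timeout;
	}
}

/* Fixed form: each class takes the values validated for its own slot. */
static void sketch_commit_fixed(struct sketch_policy *live,
				const struct sketch_policy *new_policy, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		live[i].max_expected = new_policy[i].max_expected;
		live[i].timeout = new_policy[i].timeout;
	}
}
```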
netfilter: nft_set_rbtree: remove dead conditional
net/netfilter/nft_set_rbtree.c:399 __nft_rbtree_insert()
warn: 'removed_end' is not an error pointer
Since commit 087388278e0f ("netfilter: nf_tables: nft_set_rbtree: fix
spurious insertion failure") __nft_rbtree_insert() can no longer fail
and this condition is always false. Remove it.
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/adjSaolTji0mPgqx@stanley.mountain/
Signed-off-by: Florian Westphal <fw@strlen.de>
Pratham Gupta [Tue, 5 May 2026 05:11:57 +0000 (22:11 -0700)]
netfilter: ctnetlink: use nf_ct_exp_net() in expectation dump
Commit 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
introduced exp->net so RCU-only expectation paths no longer need to
dereference exp->master for netns lookups.
Commit 3db5647984de ("netfilter: nf_conntrack_expect: skip expectations in other netns via proc")
updated the proc path accordingly, but ctnetlink_exp_dump_table() still
compares against nf_ct_net(exp->master).
Use nf_ct_exp_net(exp) here as well so the netlink dump path matches
the rest of the March 2026 expectation netns/RCU cleanup.
Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
Cc: stable@vger.kernel.org
Signed-off-by: Pratham Gupta <pratham36gupta@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: nf_conncount: use per-rule hash initval
As-is, different netns will use same slots if the key is the same.
OVS uses this infrastructure to limit conntrack counts per zone.
Those can easily overlap. Make them hash to different slots internally.