Tariq Toukan [Thu, 14 May 2026 11:10:37 +0000 (14:10 +0300)]
net/mlx5e: Reduce branches in napi poll
Reduce the number of branches in napi poll, based on the following list
of dependencies:
1. xsk_open=t only if c->xdp and c->async_icosq.
2. c->xdpsq only if c->xdp.
3. c->xdp implies c->async_icosq.
4. ktls_rx_was_enabled implies c->async_icosq.
Costa Shulyupin [Fri, 15 May 2026 18:45:25 +0000 (21:45 +0300)]
include: Remove unused ks8851_mll.h
The last user was removed in commit 72628da6d634 ("net: ks8851:
Remove ks8851_mll.c") which consolidated the driver into a
common implementation. No file includes this header.
DIBS support (DIBS) [N/m/y/?]
intra-OS shortcut with dibs loopback (DIBS_LO) [N/y/?]
Clarify the DIBS prompt by expanding the acronym.
Capitalize the first character of the DIBS_LO prompt.
While at it, join the first two lines of the DIBS help text into a real
sentence.
Daniel Zahka [Fri, 15 May 2026 17:08:52 +0000 (10:08 -0700)]
netdevsim: psp: reset spi on key rotation and check for exhaustion on alloc
The PSP spec states that the lower 31b of the SPI need to be
non-zero. Though not in the spec, I think it is reasonable to reset
the lower 31b of the spi space after a key rotation, and to also
decline to generate session keys when the lower 31b saturate.
This series adds observability for mlx5 fragment buffer DMA pools and
improves the default NUMA placement policy for fragment buffer
allocations.
Patch 1 adds a debugfs interface exposing per-node DMA pool usage
statistics for mlx5_frag_buf allocations, helping with debugging and
visibility into pool utilization.
Patch 2 improves locality of default fragment buffer allocations by
using numa_mem_id() when no explicit NUMA node is requested, allowing
allocations to prefer the current CPU's local memory node.
Together, these changes improve both introspection and memory locality
behavior of mlx5 fragment buffer allocations.
====================
Linmao Li [Wed, 13 May 2026 02:55:09 +0000 (10:55 +0800)]
ipv6: addrconf: bail out of dad_failure when state is no longer POSTDAD
addrconf_dad_failure() transitions ifp->state from DAD to POSTDAD
via addrconf_dad_end(), which drops ifp->lock on return. The lock
is re-acquired after net_info_ratelimited(). A concurrent
ipv6_del_addr() can take the lock in that window, set ifp->state
to DEAD and run list_del_rcu(&ifp->if_list).
addrconf_dad_failure() then overwrites DEAD with ERRDAD at errdad:
and schedules a new dad_work. The work calls ipv6_del_addr()
again, hitting the already-poisoned list entry:
Fold the addrconf_dad_end() logic into addrconf_dad_failure() under
a single ifp->lock critical section. The STABLE_PRIVACY branch
temporarily drops ifp->lock around address regeneration, so at
lock_errdad: verify the state is still POSTDAD before transitioning
to ERRDAD; bail out otherwise to avoid overwriting a state set by
another path while the lock was released.
Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing to workqueue") Signed-off-by: Linmao Li <lilinmao@kylinos.cn> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260513025509.3776405-1-lilinmao@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Avinash Duduskar [Wed, 13 May 2026 09:22:53 +0000 (14:52 +0530)]
llc: avoid sparse cast-truncates warning in counter clamps
llc_conn_ac_inc_npta_value() and llc_conn_ac_inc_tx_win_size()
clamp their counters to the maximum valid 7-bit value via
(u8) ~LLC_2_SEQ_NBR_MODULO. LLC_2_SEQ_NBR_MODULO is defined as
((u8) 128) in include/net/llc_pdu.h, but the (u8) cast does not
prevent integer promotion of the operand of ~: ~128 is computed
as int (0xffffff7f), and the surrounding (u8) cast truncates
back to 0x7f. The result is correct (127), but the implicit
truncation is flagged by sparse:
net/llc/llc_c_ac.c:1008:38: warning: cast truncates bits from
constant value (ffffff7f becomes 7f)
(and three more at lines 1009, 1099, 1100)
Replace the (u8) ~LLC_2_SEQ_NBR_MODULO expression with
LLC_2_SEQ_NBR_MODULO - 1, which evaluates to 127 directly and
silences sparse.
The same ~LLC_2_SEQ_NBR_MODULO pattern also appears in
include/net/llc_pdu.h:148 as part of PDU_GET_NEXT_Vr, but there
the result is immediately &-masked, so the int promotion is
harmless and sparse does not flag it; it is left alone.
This patch is the minimum diff to silence the warning. The
counter-clamp idiom itself could be modernized to
min_t(u8, ..., LLC_2_SEQ_NBR_MODULO - 1), but that is a
separate cleanup left for another patch.
Eric Dumazet [Thu, 14 May 2026 09:55:06 +0000 (09:55 +0000)]
net: always declare __sock_wfree() and tcp_wfree()
Even if guarded by IS_ENABLED(CONFIG_INET) compilers need to know
what __sock_wfree() and tcp_wfree() are:
include/net/sock.h:1861:63: note: each undeclared identifier is reported only once for each function it appears in
include/net/sock.h:1862:63: error: 'tcp_wfree' undeclared (first use in this function); did you mean 'sock_wfree'?
1862 | (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
Fixes: f0de88303d5e ("net: make is_skb_wmem() available to modules") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605141607.mDXnYFKY-lkp@intel.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260514095506.3919094-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Luka Gejak [Wed, 13 May 2026 18:26:57 +0000 (20:26 +0200)]
net: hsr: reject unresolved interlink ifindex
In hsr_newlink(), a provided but invalid IFLA_HSR_INTERLINK attribute
was silently ignored if __dev_get_by_index() returned NULL. This leads
to incorrect RedBox topology creation without notifying the user.
Fix this by returning -EINVAL and an extack message when the
interlink attribute is present but cannot be resolved.
Reviewed-by: Felix Maurer <fmaurer@redhat.com> Signed-off-by: Luka Gejak <luka.gejak@linux.dev> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260513182657.20346-3-luka.gejak@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update the docs to match the code (include/linux/netlink.h):
/*
* skb should fit one page. This choice is good for headerless malloc.
* But we should limit to 8K so that userspace does not have to
* use enormous buffer sizes on recvmsg() calls just to avoid
* MSG_TRUNC when PAGE_SIZE is very large.
*/
#if PAGE_SIZE < 8192UL
#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(PAGE_SIZE)
#else
#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(8192UL)
#endif
This series continues the rework of the KSZ driver initiated by a previous
series (see [1]), following the discussion we had here [2].
The KSZ driver got way too convoluted over time because it uses a common
framework to handle more than 20 switches split in 5 families (see below
table)
The previous series ([1]) replaced the unique dsa_swicth_ops struct used
by all the KSZ families with one dsa_switch_ops struct for each family.
These dsa_switch_ops structs still rely on common functions that redirect
the calls to ksz_dev_ops operations which are custom to each switch
family. Many of hese ksz_dev_ops callbacks have a direct equivalent in the
struct dsa_switch_ops. This series directly connects the implementations of
these ksz_dev_ops operations to the relevant dsa_switch_ops attribute
to get rid of one unnecessary level of indirection.
====================
Vladimir Oltean [Tue, 12 May 2026 13:06:29 +0000 (15:06 +0200)]
net: dsa: microchip: bypass dev_ops for phylink_get_caps()
ksz_phylink_get_caps() is a bit different from other generic methods.
It has a dev_ops->get_caps() call in the middle of the function, and it
does other stuff before (set some supported_interfaces) and after (set
lpi_interfaces from supported_interfaces).
Whereas the dev_ops->get_caps() methods set mac_capabilities and
(optionally) logically OR the supported_interfaces with that of the PCS.
The idea is that this can be expressed simpler, and avoid a indirect
function call to dev_ops->get_caps(). If we tail-call the common
ksz_phylink_get_caps() from individual phylink_get_caps() methods, we do
reorder the settings, but in an inconsequential way (the transfer from
supported_interfaces to lpi_interfaces still sees a complete list of the
supported_interfaces).
Remove the no longer used get_caps() callbacl the ksz_dev_ops.
Vladimir Oltean [Tue, 12 May 2026 13:06:28 +0000 (15:06 +0200)]
net: dsa: microchip: bypass dev_ops for mirror operations
Mirror operations are handled through a common function that redirects
the treatment to ksz_dev_ops callbacks. This layer of indirection isn't
needed since we now have a dsa_switch_ops for each switch family.
Remove this indirection layer for KSZ switches, by connecting the
ksz_dev_ops :: mirror_add() and mirror_del() operations directly to
dsa_switch_ops.
Remove the now unused mirror callbacks from ksz_dev_ops.
Vladimir Oltean [Tue, 12 May 2026 13:06:27 +0000 (15:06 +0200)]
net: dsa: microchip: bypass dev_ops for FDB and MDB operations
FDB and MDB operations are handled through a common function that
redirects the treatment to ksz_dev_ops callbacks. This layer of
indirection isn't needed since we now have a dsa_switch_ops for each kind
of switch.
Remove one indirection layer for KSZ switches, by connecting the
ksz_dev_ops :: fdb_dump(), fdb_add(), fdb_del(), mdb_add() and mdb_del()
operations directly to dsa_switch_ops.
Remove the FDB and MDB operations from ksz_dev_ops.
Vladimir Oltean [Tue, 12 May 2026 13:06:26 +0000 (15:06 +0200)]
net: dsa: microchip: bypass dev_ops for VLAN operations
VLAN operations are handled through a common function that redirects the
treatment to ksz_dev_ops callbacks. This level of indirection isn't
needed since we now have a dsa_switch_ops for each kind of switch.
Remove this useless layer of indirection by connecting directly the VLAN
operations to the relevant dsa_switch_ops.
Adapt their prototypes to match dsa_switch_ops expectations.
Remove the now unused VLAN callbacks from ksz_dev_ops.
Vladimir Oltean [Tue, 12 May 2026 13:06:25 +0000 (15:06 +0200)]
net: dsa: microchip: bypass dev_ops for change_mtu() operation
MTU changing is done through a common function that redirects the
treatment to a specific ksz_dev_ops callback. This layer of indirection
isn't needed since we now have a dsa_switch_ops struct for each switch
family.
Remove this indirection layer in MTU changing for KSZ switches, by
directly connecting the ksz_dev_ops :: change_mtu() implementations to
dsa_switch_ops.
Remove the no longer used change_mtu() callback from ksz_dev_ops
Vladimir Oltean [Tue, 12 May 2026 13:06:24 +0000 (15:06 +0200)]
net: dsa: microchip: bypass dev_ops for FDB ageing operations
dsa_switch_ops :: set_ageing_time() goes through ksz_set_ageing_time(),
further dispatched through ksz_dev_ops :: set_ageing_time(). Only
ksz9477 and lan937x provide an implementation for this, so remove the
(optional) method from ksz8463_switch_ops, ksz87xx_switch_ops,
ksz88xx_switch_ops. Also, hook up ksz9477 and lan937x dsa_switch_ops
directly to their respective implementations.
Every switch family provides a dsa_switch_ops :: port_fast_age()
implementation, which is dispatched through ksz_dev_ops ::
flush_dyn_mac_table(). Remove the dev_ops indirection and connect the
flush_dyn_mac_table() methods directly to their respective dsa_switch_ops.
Jann Horn [Tue, 12 May 2026 14:02:03 +0000 (16:02 +0200)]
net: block MSG_NO_SHARED_FRAGS in sendmsg()
This change should cause no difference in behavior; it just cleans up some
hazardous code that could have become a problem in the future.
MSG_NO_SHARED_FRAGS is a kernel-internal flag that cancels the effect of
MSG_SPLICE_PAGES, another kernel-internal flag that influences the
data-sharing semantics of SKBs.
Prevent passing this flag in from userspace via sendmsg() by adding it to
MSG_INTERNAL_SENDMSG_FLAGS.
This is not currently an observable problem because MSG_NO_SHARED_FRAGS
only has an effect if kernel code adds MSG_SPLICE_PAGES to it.
The only codepath that adds MSG_SPLICE_PAGES to user-supplied flags from
which MSG_NO_SHARED_FRAGS hasn't been cleared is the path
tcp_bpf_sendmsg -> tcp_bpf_send_verdict -> tcp_bpf_push, and that is not a
problem because tcp_bpf_sendmsg always intentionally sets
MSG_NO_SHARED_FRAGS anyway.
Minxi Hou [Tue, 12 May 2026 07:08:41 +0000 (15:08 +0800)]
selftests: openvswitch: add pop_vlan test
Add test_pop_vlan() to verify OVS kernel datapath pop_vlan action
correctly strips 802.1Q VLAN tags from frames.
Test structure:
- Baseline: untagged forwarding validates basic connectivity.
- Negative: forward without pop_vlan, tagged frame is invisible
to ns2 (no VLAN sub-interface), ping fails.
- Positive: pop_vlan strips tag on forward path, push_vlan
restores tag on return path, ping succeeds.
Use static ARP entries to avoid VLAN-tagged ARP complexity.
Rely on ping success/failure for verification -- no tcpdump or
pcap files needed.
Minxi Hou [Tue, 12 May 2026 07:08:40 +0000 (15:08 +0800)]
selftests: openvswitch: add vlan() and encap() flow string parsing
Add VLAN TCI formatting and parsing support to ovs-dpctl.py:
- Add _vlan_dpstr() to decompose TCI into vid/pcp/cfi fields,
with raw tci=0x%04x fallback when cfi=0 for round-trip safety.
- Add _parse_vlan_from_flowstr() boundary check for missing ')'.
- Add encap_ovskey subclass restricting nla_map to L2-L4 attributes
(slots 0-21) that appear inside 802.1Q ENCAP, with metadata
attributes set to "none".
- Check encap parse() return value for unrecognized trailing content.
- Support callable format functions in dpstr() output.
- Change OVS_KEY_ATTR_VLAN type from uint16 to be16 to match the
kernel __be16 wire format; uint16 decodes in host byte order,
which gives wrong values on little-endian architectures.
- Change OVS_KEY_ATTR_ENCAP type from none to encap_ovskey to
enable recursive parsing of 802.1Q encapsulated flow keys.
- Add push_vlan action class with fields matching kernel struct
ovs_action_push_vlan (vlan_tpid, vlan_tci as network-order u16).
- Add push_vlan dpstr format and parse with range validation
(vid 0-4095, pcp 0-7, tpid 0-0xFFFF) and CFI forced to 1.
- ipv6: flowlabel: enforce per-netns limit for unprivileged callers
- tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
- smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
- sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL
- batman-adv:
- reject new tp_meter sessions during teardown
- purge non-released claims
- eth:
- i40e: cleanup PTP registration on probe failure
- idpf: fix double free and use-after-free in aux device error paths
- ena: fix potential use-after-free in get_timestamp"
* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
net: phy: DP83TC811: add reading of abilities
net: tls: prevent chain-after-chain in plain text SG
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
macsec: introduce dedicated workqueue for SA crypto cleanup
net: net_failover: Fix the deadlock in slave register
MAINTAINERS: update atlantic driver maintainer
selftests/tc-testing: Add QFQ/CBS qlen underflow test
net/sched: sch_cbs: Call qdisc_reset for child qdisc
FDDI: defza: Sanitise the reset safety timer
net: ethernet: ravb: Do not check URAM suspension when WoL is active
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
net: atm: fix skb leak in sigd_send() default branch
net: ethtool: phy: avoid NULL deref when PHY driver is unbound
net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
net: shaper: reject QUEUE scope handle with missing id
...
Linus Torvalds [Thu, 14 May 2026 15:53:24 +0000 (08:53 -0700)]
Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
- Correctly log the inheritable capabilities
- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands
* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records
Linus Torvalds [Wed, 13 May 2026 18:37:18 +0000 (11:37 -0700)]
ptrace: slightly saner 'get_dumpable()' logic
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.
And almost all users do in fact use it only for the case where the task
has a mm pointer.
But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).
It's not what this flag was designed for, but it is what it is.
The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.
Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.
Sven Schuchmann [Tue, 12 May 2026 07:19:47 +0000 (09:19 +0200)]
net: phy: DP83TC811: add reading of abilities
At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.
Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy") Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:18 +0000 (10:49 -0700)]
net: tls: prevent chain-after-chain in plain text SG
Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:
This would create a chain where the wrap link points directly
to another chain link. The scatterlist API sg_next iterator
does not recursively resolve consecutive chain links.
meaning this is illegal input to crypto.
The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:
end start i
v v v
[ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap]
Skip the wrapping in this case.
TLS 1.3 can use the "wrapping slot" for it's chaining if end = 0.
This avoids the chain-after-chain.
Move the wrap chaining before marking END and chaining off content
type, that feels like more logical ordering to me, but should not
matter from functional perspective.
Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:17 +0000 (10:49 -0700)]
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:
struct sk_msg_sg {
[...]
/* The extra two elements:
* 1) used for chaining the front and sections when the list becomes
* partitioned (e.g. end < start). The crypto APIs require the
* chaining;
* 2) to chain tailer SG entries after the message.
*/
struct scatterlist data[MAX_MSG_FRAGS + 2];
The current code uses MAX_SKB_FRAGS + 1 as the ring size:
instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")
Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).
Reported-by: é’±ä¸€é“ <yimingqian591@gmail.com> Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
bridge: Add selective forwarding of gratuitous neighbor announcements
The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.
This series adds a new neigh_forward_grat option that provides
independent control of gratuitous ARP and unsolicited NA forwarding.
When neigh_suppress is enabled but neigh_forward_grat is enabled,
regular neighbor discovery is suppressed while gratuitous announcements
are forwarded.
The implementation marks gratuitous ARPs and unsolicited NAs in
BR_INPUT_SKB_CB during input processing, then checks the per-output-port
neigh_forward_grat setting during flooding. This allows gratuitous
announcements from any input port to be selectively forwarded based on
each output port's individual configuration.
Both port-level control (via IFLA_BRPORT_NEIGH_FORWARD_GRAT) and
per-VLAN control (via BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT) are
provided. The default value of OFF preserves existing behavior.
This behavior is in accordance with RFC 9161 (Section 3.6), which
recommends that VTEPs forward gratuitous ARP and unsolicited NA messages
to avoid traffic disruption during host mobility events.
The new attributes use NLA_U8, although the kernel netlink guideline
recommends NLA_U32 as the minimum integer type on the grounds that
alignment makes smaller types equivalent on the wire. For a simple
on/off attribute there is no technical advantage to u32 over u8, and
keeping u8 preserves consistency with all surrounding bridge port
attributes and avoids introducing new helpers alongside the existing
infrastructure.
Patchset overview:
Patch #1: adds uapi headers.
Patches #2-#3: support selective forwarding of gratuitous ARP.
Patches #4-#5: add netlink handling.
Patch #6: adds tests.
Please see iproute related patches in the last 3 commits of:
https://github.com/daniellerts/iproute2
====================
Danielle Ratson [Mon, 11 May 2026 06:59:36 +0000 (09:59 +0300)]
selftests: net: Add tests for neigh_forward_grat option
Add tests to validate the neigh_forward_grat bridge option for selective
forwarding of gratuitous neighbor announcements.
The tests verify per-port and per-VLAN control of gratuitous neighbor
announcement forwarding for both IPv4 (gratuitous ARP) and IPv6
(unsolicited NA):
- When neigh_suppress is enabled with neigh_forward_grat off (default),
gratuitous announcements are suppressed
- When neigh_forward_grat is enabled, gratuitous announcements are
forwarded while regular neighbor discovery remains suppressed
For IPv4, use arping to send gratuitous ARP packets. For IPv6, use
mausezahn to craft unsolicited Neighbor Advertisement packets.
For the per-port tests, the IPv4 test exercises the ip link interface,
while the IPv6 test exercises the bridge link interface.
The per-VLAN tests use the bridge interface throughout, as per-VLAN
attributes are only accessible via 'bridge vlan'.
Danielle Ratson [Mon, 11 May 2026 06:59:35 +0000 (09:59 +0300)]
bridge: Add per-VLAN netlink handling for neigh_forward_grat
Add netlink handlers for the per-VLAN neigh_forward_grat option via
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT attribute.
The per-VLAN option provides fine-grained control, allowing different
VLANs on the same port to have different gratuitous ARP/unsolicited NA
forwarding behavior.
This enables control via 'bridge' commands:
# bridge vlan set dev eth0 vid 10 neigh_suppress on
# bridge vlan set dev eth0 vid 10 neigh_forward_grat on
Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Danielle Ratson <danieller@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260511065936.4173106-6-danieller@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Danielle Ratson [Mon, 11 May 2026 06:59:34 +0000 (09:59 +0300)]
bridge: Add port-level netlink handling for neigh_forward_grat
Add netlink handlers for the port-level neigh_forward_grat option via
IFLA_BRPORT_NEIGH_FORWARD_GRAT attribute.
The default value of OFF preserves existing behavior, i.e. gratuitous ARP
and unsolicited NA are suppressed when neigh_suppress is enabled. Users can
explicitly set it to ON to allow these packets through.
Example for enabling control via 'bridge link' command:
# bridge link set dev eth0 neigh_suppress on
# bridge link set dev eth0 neigh_forward_grat on
Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Danielle Ratson <danieller@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260511065936.4173106-5-danieller@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Danielle Ratson [Mon, 11 May 2026 06:59:33 +0000 (09:59 +0300)]
bridge: Add selective forwarding of gratuitous neighbor announcements
The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.
Add the neigh_forward_grat option to allow selective control of gratuitous
neighbor announcements. When neigh_suppress is enabled but
neigh_forward_grat is disabled (default), gratuitous announcements are
suppressed. When neigh_forward_grat is enabled, gratuitous announcements
are forwarded while regular neighbor discovery remains suppressed.
The implementation provides per-output-port control by:
1. Adding a 'grat_arp' flag to BR_INPUT_SKB_CB to mark gratuitous ARPs and
unsolicited NAs.
2. Setting both grat_arp and proxyarp_replied flags in
br_do_proxy_suppress_arp() and br_do_suppress_nd() when gratuitous
packets are detected.
3. Checking neigh_forward_grat per output port during flooding:
- For gratuitous ARPs/NAs: suppress unless the output port has
neigh_forward_grat enabled.
- For regular ARPs/NDs: maintain existing behavior.
This allows gratuitous announcements from any input port to be selectively
forwarded based on each output port's individual neigh_forward_grat
setting, enabling gratuitous neighbor announcements to be flooded to the
VXLAN fabric.
Regular neighbor discovery (ARP requests, NS queries, solicited replies)
remains controlled by neigh_suppress and is unaffected.
Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Danielle Ratson <danieller@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260511065936.4173106-4-danieller@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add netlink attributes for controlling gratuitous ARP and unsolicited NA
forwarding when neighbor suppression is enabled.
Add IFLA_BRPORT_NEIGH_FORWARD_GRAT for port-level control and
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT for per-VLAN control.
The new attributes provide independent control of gratuitous ARP and
unsolicited NA packets. Operators can enable forwarding for those packets
for fast mobility across VTEPs while keeping general neighbor suppression
active.
Xiang Mei [Mon, 11 May 2026 06:21:38 +0000 (23:21 -0700)]
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.
smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:
BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
Write of size 4 at addr 0000000000000068 by task exploit/144
Call Trace:
_raw_spin_lock_bh
smc_conn_create (net/smc/smc_core.c:1997)
__smc_connect (net/smc/af_smc.c:1447)
smc_connect (net/smc/af_smc.c:1720)
__sys_connect
__x64_sys_connect
do_syscall_64
Require ism_dev[i] to be non-NULL before accepting a CHID match.
Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2") Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add 64-bit counters for each impairment netem applies (delay, loss,
ECN marking, corruption, duplication, reordering) and for skb
allocation failures during enqueue. Exposed through TCA_STATS_APP
as struct tc_netem_xstats.
Counters increment when an impairment is occurs, independent of later
events that may mask its on-wire effect. Added allocation_errors
(similar to sch_fq) to account for when impairment could not be
applied due to memory pressure, etc.
net/sched: netem: handle multi-segment skb in corruption
The packet corruption code only flipped bits in the linear
header portion of the skb, skipping corruption when
skb_headlen() was zero.
Linearize the whole skb if necessary before corruption.
Extends d64cb81dcbd5 ("net/sched: sch_netem: fix out-of-bounds access
in packet corruption") with a more general solution.
net/sched: netem: replace pr_info with netlink extack error messages
Use netlink extack to report errors instead of sending them
to the kernel log with pr_info(). The error message can them be seen
with tc commands; and avoids log spam.
The current layout of struct netem_sched_data can be improved
by optimizing cache locality, compacting data types (use u8
for enum) and eliminating unused elements.
Reorganize the struct as follows:
- Cacheline 0 holds the tfifo state (t_root/t_head/t_tail/t_len),
counter, and the unconditional enqueue scalars
latency/jitter/rate/gap/loss.
- Cacheline 1 holds the remaining zero-check scalars
(duplicate/reorder/corrupt/ecn), all five crndstate correlation
structures, and loss_model.
- Cacheline 2 holds prng, delay_dist, the slot dequeue state,
slot_dist, and the inner classful qdisc pointer.
- Rate-shaping fields, q->limit (config-only; the fast path reads
sch->limit), and the CLG Markov state move to the warm tail.
- tc_netem_slot slot_config and qdisc_watchdog (only consulted on
slot reschedule and watchdog wake) move to the cold tail.
Also reorder struct clgstate to place the u8 state member after the
u32 transition probabilities. This removes the 3-byte interior hole
without changing the struct's size.
====================
net/mlx5e: improve RSS indirection table sizing and resizing
This series by Yael improves mlx5e RSS indirection table handling around
channel count changes and large RSS configurations.
The series:
* removes the XOR8-specific channel count limitation,
* advertises the maximum supported RSS indirection table size,
* fixes resizing of non-default RSS contexts,
* allows resizing configured default RSS contexts during channel
changes,
* and increases the default RSS spread factor from 2x to 4x to improve
traffic distribution for large channel counts.
Together, these changes make RSS table sizing more flexible and robust,
while improving load balancing behavior on large systems.
====================
Increase the RQT uniform spread factor from 2 to 4 so that each channel
gets more indirection table entries and traffic is spread more evenly.
For num_channels > 64 imbalance drops from up to ~50% to up to ~25%.
For 64 or fewer channels the 256 entry minimum already provides at least
4x coverage and the table size is unchanged by this commit.
This satisfies the minimum 4x coverage requirement validated by the
generic RSS selftest commit 9e3d4dae9832 ("selftests: drv-net: rss:
validate min RSS table size").
The 4x spread factor is best-effort and the table size is always capped by
the device's log_max_rqt_size capability.
Yael Chemla [Mon, 11 May 2026 17:27:18 +0000 (20:27 +0300)]
net/mlx5e: resize configured default RSS context table on channel change
mlx5e_ethtool_set_channels() rejected channel count changes that
required a different RQT size when the default context indirection
table was user-configured. This restriction was introduced by
commit ee3572409f74 ("net/mlx5e: RSS, Block changing channels number
when RXFH is configured").
Lift the restriction. Validate the resize upfront with
ethtool_rxfh_indir_can_resize(), then fold or unfold the table
in-place via ethtool_rxfh_indir_resize() inside state_lock, before
mlx5e_safe_switch_params(), so the preactivate callback sees the
correct table content when it programs the HW.
Yael Chemla [Mon, 11 May 2026 17:27:17 +0000 (20:27 +0300)]
net/mlx5e: resize non-default RSS indirection tables on channel change
When the channel count changes and the RQT size changes with it, a
problem arise for non-default RSS contexts. The driver-side indirection
table grows actual_table_size without filling the new entries; stale
entries from a prior larger configuration may be re-exposed, causing
mlx5e_calc_indir_rqns() to WARN on an out-of-range index.
Replace mlx5e_rss_params_indir_modify_actual_size() with
mlx5e_rss_ctx_resize(), which fills new entries by replicating
the existing pattern, matching what ethtool_rxfh_ctxs_resize() does
for the same case. And restrict the loop to non-default contexts.
Call ethtool_rxfh_ctxs_can_resize() before acquiring state_lock to
validate that all non-default contexts can be resized, and
ethtool_rxfh_ctxs_resize() after releasing it to fold or unfold their
indirection tables. Both functions acquire rss_lock internally and
cannot be called under state_lock. RTNL, held by all set_channels
callers, serialises context creation and deletion making the pre-lock
check safe.
Guard both ethtool calls on mlx5e_rx_res_rss_cnt() > 1: skip the
validation and resize when no non-default contexts exist. This
naturally covers representors and IPoIB, which share
mlx5e_ethtool_set_channels() but cannot have non-default RSS contexts.
Yael Chemla [Mon, 11 May 2026 17:27:16 +0000 (20:27 +0300)]
net/mlx5e: advertise max RSS indirection table size to ethtool
Set rxfh_indir_space to the maximum indirection table size the driver
can support: the next power of two above MLX5E_MAX_NUM_CHANNELS times
MLX5E_UNIFORM_SPREAD_RQT_FACTOR.
Without this, ethtool_rxfh_ctxs_can_resize() returns -EINVAL, blocking
non-default RSS contexts from tracking indirection table size changes
when the channel count changes.
Yael Chemla [Mon, 11 May 2026 17:27:15 +0000 (20:27 +0300)]
net/mlx5e: remove channel count limit for XOR8 RSS hash
mlx5e_ethtool_set_channels() and mlx5e_rxfh_hfunc_check() rejected
channel counts that would produce an indirection table larger than 256
entries when the XOR8 hash function was active. This check was
introduced in commit 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash
with over 128 channels").
XOR8 yields an 8-bit hash, so in practice only up to 256 entries in the
indirection table can be reached due to limited entropy. However, this
does not provide a strong justification for prohibiting larger
indirection tables. Remove the limitation.
====================
macsec: use rcu_work to fix crypto cleanup in softirq context
From: Jinliang Zheng <alexjlzheng@tencent.com>
crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.
This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.
Two design decisions worth noting:
1. rcu_work instead of schedule_work() + synchronize_rcu()
An alternative would be to call schedule_work() directly from
macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at
the start of the work handler to replace the grace period previously
provided by call_rcu(). However, synchronize_rcu() blocks the worker
thread for the duration of a full RCU grace period. Under high SA
churn (e.g. tearing down an interface with many SAs), each SA would
occupy a worker thread while waiting, and multiple concurrent calls
cannot share the same grace period — leading to unnecessary latency
and resource waste.
rcu_work uses call_rcu_hurry() internally, which is fully asynchronous:
the worker thread is only dispatched after the grace period has elapsed,
and multiple concurrent queue_rcu_work() calls naturally batch under the
same grace period via the RCU subsystem's existing coalescing mechanism.
2. Dedicated workqueue instead of system_wq
Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain
exactly the work items belonging to this module — by calling
destroy_workqueue() after rcu_barrier(). If system_wq were used,
flush_scheduled_work() would drain all pending work items across the
entire system, creating unnecessary coupling with unrelated subsystems
and potentially causing unexpected delays. The dedicated workqueue
provides a clean, contained teardown path.
====================
Jinliang Zheng [Mon, 11 May 2026 15:31:00 +0000 (23:31 +0800)]
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.
Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.
Jinliang Zheng [Mon, 11 May 2026 15:30:59 +0000 (23:30 +0800)]
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash:
Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
blocks waiting, and concurrent releases of multiple SAs naturally
share the same grace period.
Jinliang Zheng [Mon, 11 May 2026 15:30:58 +0000 (23:30 +0800)]
macsec: introduce dedicated workqueue for SA crypto cleanup
Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.
Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.
rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.
While at it, rename the error labels in macsec_init() from the
resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style
(err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader
kernel convention.
Faicker Mo [Mon, 11 May 2026 14:05:51 +0000 (22:05 +0800)]
net: net_failover: Fix the deadlock in slave register
There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
failover_slave_register() in failover_existing_slave_register() adds lock
and unlock ops too.
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP") Signed-off-by: Faicker Mo <faicker.mo@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
netpoll: move out netconsole-specific functions
netpoll and netconsole were created together and their code has
been intermixed in net/core/netpoll.c for decades. The result is
that netpoll exposes two send-side interfaces:
* a generic "give me an sk_buff" path used by every stacked-device
driver (bonding, team, vlan, bridge, macvlan, dsa),
* a second path that takes raw bytes and builds a UDP/IP/Ethernet
packet -- exclusively for netconsole.
The packet builder, an skb pool allocator, and several
netconsole-specific helpers all live next to the generic plumbing even
though no other consumer ever touches them.
Worse, every netpoll user pays for that overlap: struct netpoll carries
an skb_pool and a refill work_struct that only netconsole's find_skb()
ever reads from, and net-core has to review unrelated changes (TTL, hop
limit, IP ID generation, source MAC selection, pool sizing) just because
they happen to be coded inside netpoll.
This is a waste of memory for something useless.
This series splits the netconsole-specific code out:
* netpoll_send_udp() and its private helpers (push_ipv6, push_ipv4,
push_eth, push_udp, netpoll_udp_checksum, find_skb) move into
drivers/net/netconsole.c, leaving netpoll with a single skb-only
send interface that is the same for every user.
The moves are one function per patch for reviewability; helpers are
temporarily EXPORT_SYMBOL_GPL'd while netpoll_send_udp() is still in
netpoll calling them, then those exports are dropped together once
netpoll_send_udp() itself moves.
The only new permanent export is zap_completion_queue(), needed because
find_skb() still drains the per-CPU TX completion queue before
allocating.
struct netpoll is unchanged in this series; making the pool itself
netconsole-private (and reclaiming the skb_pool / refill_wq fields for
the rest of netpoll's users) is the natural follow-up, once this patchset
lands.
====================
Breno Leitao [Tue, 12 May 2026 10:46:42 +0000 (03:46 -0700)]
netconsole: move find_skb() from netpoll
find_skb() is the netconsole-specific entry into the netpoll skb
pool: every other netpoll consumer (bonding, team, vlan, bridge,
macvlan, dsa) builds its own sk_buff and never touches the pool.
With netpoll_send_udp() (its only caller) now living in netconsole,
find_skb() can join it.
Move find_skb() into drivers/net/netconsole.c as a file-static
helper, drop EXPORT_SYMBOL_GPL(find_skb) and remove its prototype
from include/linux/netpoll.h. find_skb() drains TX completions via
netpoll_zap_completion_queue(), which is already exported in the
NETDEV_INTERNAL namespace, so netconsole picks up
MODULE_IMPORT_NS("NETDEV_INTERNAL") to consume it.
The skb pool's lifecycle (np->skb_pool, np->refill_wq, refill_skbs(),
refill_skbs_work_handler(), skb_pool_flush()) stays in netpoll: it
is initialised in __netpoll_setup() and torn down in
__netpoll_cleanup(), both of which remain netpoll's responsibility.
The refill work queued via schedule_work(&np->refill_wq) from the
moved find_skb() runs refill_skbs_work_handler() in netpoll without
any further plumbing.
This is pure code motion: the function body is unchanged and its
sole caller (netpoll_send_udp(), already moved by an earlier patch)
keeps invoking it the same way. Pre-existing concerns about
find_skb() running from NMI/printk context (zap_completion_queue()
re-entry, skb_pool spinlocks, GFP_ATOMIC allocation, fallback skb
sizing vs. MAX_SKB_SIZE, PREEMPT_RT semantics of __kfree_skb()) are
inherited as-is and are not addressed here; they predate this
series and are out of scope. Fixing them is left for follow-up
work.
Breno Leitao [Tue, 12 May 2026 10:46:41 +0000 (03:46 -0700)]
netpoll: rename and export netpoll_zap_completion_queue()
zap_completion_queue() drains the per-CPU softnet completion queue.
Rename it with the netpoll_ prefix shared by the rest of the
subsystem's public API, and promote it from file-static to
EXPORT_SYMBOL_NS_GPL in the NETDEV_INTERNAL namespace so the upcoming
netconsole-side find_skb() can call it once the function moves out.
A forward declaration is added to include/linux/netpoll.h, and the
old file-static forward declaration is dropped.
Breno Leitao [Tue, 12 May 2026 10:46:40 +0000 (03:46 -0700)]
netconsole: move netpoll_udp_checksum() from netpoll
netpoll_udp_checksum() computes the UDP checksum for netconsole's
packets. Move it into drivers/net/netconsole.c as a file-static
helper; drop its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
This was the last csum_ipv6_magic() consumer in net/core/netpoll.c,
so drop the now-stale <net/ip6_checksum.h> include there. Pull it
into netconsole.c so the moved code keeps building.
It was also the last udp_hdr() consumer in net/core/netpoll.c. The
file no longer needs anything from <net/udp.h> (the UDP socket-layer
helpers); MAX_SKB_SIZE only needs struct udphdr, which is provided
by the lighter <linux/udp.h>. Swap the include accordingly.
Breno Leitao [Tue, 12 May 2026 10:46:39 +0000 (03:46 -0700)]
netconsole: move push_udp() from netpoll
push_udp() builds the UDP header (and triggers the checksum) for
netconsole's UDP packets. Move it into drivers/net/netconsole.c as
a file-static helper; drop its EXPORT_SYMBOL_GPL and remove the
prototype from include/linux/netpoll.h.
Breno Leitao [Tue, 12 May 2026 10:46:38 +0000 (03:46 -0700)]
netconsole: move push_eth() from netpoll
push_eth() builds the Ethernet header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
Breno Leitao [Tue, 12 May 2026 10:46:37 +0000 (03:46 -0700)]
netconsole: move push_ipv4() from netpoll
push_ipv4() builds the IPv4 header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
put_unaligned() is no longer used in net/core/netpoll.c, so drop
the now-stale <linux/unaligned.h> include from there. Pull it into
netconsole.c so the moved code keeps building.
Breno Leitao [Tue, 12 May 2026 10:46:36 +0000 (03:46 -0700)]
netconsole: move push_ipv6() from netpoll
push_ipv6() builds the IPv6 header for netconsole's UDP packets.
Its only caller, netpoll_send_udp(), now lives in netconsole, so
the helper can move there as a file-static function. Drop its
EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
Breno Leitao [Tue, 12 May 2026 10:46:35 +0000 (03:46 -0700)]
netconsole: move netpoll_send_udp() from netpoll
Move netpoll_send_udp() from net/core/netpoll.c into
drivers/net/netconsole.c as a static helper, drop EXPORT_SYMBOL(),
and remove the prototype from include/linux/netpoll.h.
netconsole was the only in-tree caller of this entry point. Every
other netpoll consumer (bonding, team, vlan, bridge, macvlan, dsa)
already builds its own sk_buff and hands it to netpoll_send_skb(),
so the netpoll send-side interface is now skb-only.
The helpers it depends on (find_skb(), push_ipv6(), push_ipv4(),
push_udp(), push_eth(), netpoll_udp_checksum()) were exposed in
the previous patches and stay in net/core/netpoll.c for now.
Subsequent patches move each of them into netconsole one at a time
and drop the corresponding EXPORT_SYMBOL_GPL.
Pull <linux/ip.h>, <linux/ipv6.h> and <linux/udp.h> into netconsole.c
so the moved code can name the header structures.
Breno Leitao [Tue, 12 May 2026 10:46:34 +0000 (03:46 -0700)]
netpoll: expose UDP packet builder helpers for netconsole
Promote each from file-static to EXPORT_SYMBOL_GPL and forward-
declare them in include/linux/netpoll.h so netconsole can call
them once netpoll_send_udp() moves out.
These exports are kept until the end of the series, when
al of them move into netconsole.
Sajal Gupta [Sat, 9 May 2026 09:51:48 +0000 (15:21 +0530)]
net: usb: pegasus: replace simple_strtoul with kstrtouint
simple_strtoul() is deprecated as it has no error checking. Replace it
with kstrtouint() which returns an error code on invalid input, and add
appropriate error handling.
Also add a NULL check before parsing flags, since strsep() can set id
to NULL if the input has fewer tokens than expected.
Preserve the original behavior for a trailing colon by checking *id
before parsing flags, so an empty string results in flags = 0 rather
than an error.
Victor Nogueira [Mon, 11 May 2026 18:30:58 +0000 (14:30 -0400)]
selftests/tc-testing: Add QFQ/CBS qlen underflow test
Since CBS was not calling reset for its child qdisc, there are scenarios
where it could cause an underflow on its parent's qlen/backlog. When the
parent is QFQ, a null-ptr deref could occur.
Add a test case that reproduces the underflow followed by a null-ptr
deref scenario.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Mon, 11 May 2026 18:30:57 +0000 (14:30 -0400)]
net/sched: sch_cbs: Call qdisc_reset for child qdisc
During a reset, CBS is not calling reset on its child qdisc, which
might cause qlen/backlog accounting issues. For example, if we have CBS
with a QFQ parent and a netem child with delay, we can create a scenario
where the parent's qlen underflows. QFQ, specifically, uses qlen to
check whether it should deference a pointer, so this scenario may cause
a null-ptr deref in QFQ:
Fix this by calling qdisc_reset for CBS' child qdisc.
Sashiko caught an issue which could result in a null ptr deref if
qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed
by this patch. We add an early return if the qdisc is null to address this.
This is a similar approach used by two other fixes[1][2].
The proper fix for this specific issue elucidated by sashiko is to remove
the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc
isn't attached anywhere yet at that point, calling the reset callback doesn't
make much sense (and as stated has been a source of two other bugs).
We plan on submitting this fix in a later patch.
[1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/
[2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/
Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc") Reported-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
This patch series deals with tun/tap & vhost-net which drop incoming
SKBs whenever their internal ptr_ring buffer is full. Instead, with this
patch series, the associated netdev queue is stopped - but only when a
qdisc is attached. If no qdisc is present the existing behavior is
preserved. The XDP transmit path is not affected. This patch series
touches tun/tap and vhost-net, as they share common logic and must be
updated together. Modifying only one of them would break the other.
By applying proper backpressure, this change allows the connected qdisc to
operate correctly, as reported in [1], and significantly improves
performance in real-world scenarios, as demonstrated in our paper [2]. For
example, we observed a 36% TCP throughput improvement for an OpenVPN
connection between Germany and the USA.
Synthetic pktgen benchmarks indicate a slight regression, and packet
loss is reduced to near zero. Pktgen benchmarks are provided per commit,
with the final commit showing the overall performance.
Simon Schippers [Sun, 10 May 2026 15:15:29 +0000 (17:15 +0200)]
tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once the ring reaches capacity after a produce attempt,
the netdev queue is stopped instead of dropping subsequent packets.
If no qdisc is present, the previous tail-drop behavior is preserved.
If producing an entry fails anyway due to a race, tun_net_xmit() drops
the packet. Such races are expected because LLTX is enabled and the
transmit path operates without the usual locking.
The __tun_wake_queue() function of the consumer races with the producer
for waking/stopping the netdev queue, which could result in a stalled
queue. Therefore, an smp_mb__after_atomic() is introduced that pairs
with the smp_mb() of the consumer. It follows the principle of store
buffering described in tools/memory-model/Documentation/recipes.txt:
- The producer in tun_net_xmit() first sets __QUEUE_STATE_DRV_XOFF,
followed by an smp_mb__after_atomic() (= smp_mb()), and then reads the
ring with __ptr_ring_check_produce().
- The consumer in __tun_wake_queue() first writes zero to the ring in
__ptr_ring_consume(), followed by an smp_mb(), and then reads the queue
status with netif_tx_queue_stopped().
=> Following the aforementioned principle, it is impossible for the
producer to see a full ring (and therefore not wake the queue on the
re-check) while the consumer simultaneously fails to see a stopped
queue (and therefore also does not wake it).
Benchmarks:
The benchmarks show a slight regression in raw transmission performance
when using two sending threads. Packet loss also occurs only in the
two-thread sending case; no packet loss was observed with a single
sending thread.
Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.
Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster; slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-5-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Schippers [Sun, 10 May 2026 15:15:28 +0000 (17:15 +0200)]
ptr_ring: move free-space check into separate helper
This patch moves the check for available free space for a new entry into
a separate function. Existing callers that only check for a non-zero
return value are unaffected; __ptr_ring_produce() now returns -EINVAL
for a zero-size ring and -ENOSPC when full, whereas before both cases
returned -ENOSPC. The new helper allows callers to determine in advance
whether subsequent __ptr_ring_produce() calls will succeed. This
information can, for example, be used to temporarily stop producing until
__ptr_ring_check_produce() indicates that space is available again.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-4-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Schippers [Sun, 10 May 2026 15:15:27 +0000 (17:15 +0200)]
vhost-net: wake queue of tun/tap after ptr_ring consume
Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
function validates that the file belongs to a tun/tap device and that
the tfile exists, dereferences the tun_struct under RCU, and delegates
to __tun_wake_queue().
vhost_net_buf_produce() now calls tun_wake_queue() after a successful
batched consume of the ring to allow the netdev subqueue to be woken up.
The point is to allow the queue to be stopped when it gets full, which
is required for traffic shaping - implemented by the following
"avoid ptr_ring tail-drop when a qdisc is present".
Without the corresponding queue stopping, this patch alone causes no
throughput regression for a tap+vhost-net setup sending to a qemu VM:
3.857 Mpps to 3.891 Mpps.
Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, XDP drop program active in VM, pktgen sender; Avg over
50 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-3-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Schippers [Sun, 10 May 2026 15:15:26 +0000 (17:15 +0200)]
tun/tap: add ptr_ring consume helper with netdev queue wakeup
Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
__tun_wake_queue(). The latter wakes the stopped netdev subqueue once
half of the ring capacity has been consumed, tracked via the new
cons_cnt field in tun_file. As a safety net, the queue is also woken on
the last consumed entry if it leaves the ring empty. The point is to
allow the queue to be stopped when it gets full, which is required for
traffic shaping - implemented by the following "avoid ptr_ring tail-drop
when a qdisc is present".
Some implementation details:
- tun_ring_recv() replaces ptr_ring_consume() with tun_ring_consume()
to properly wake the queue.
- __tun_detach() locks the tx_ring.consumer_lock to avoid races with
the consumer on the queue_index.
- The ptr_ring_consume() call in tun_queue_purge() is not replaced with
tun_ring_consume(). Instead, within the same tx_ring.consumer_lock
in __tun_detach(), the netdev queue is woken for the ntfile taking
it over, to avoid a possible stall. This does not matter for
tun_detach_all(), as it is called during device teardown and no tfile
takes over any queue.
- Reset cons_cnt in tun_attach() so the half-ring wake threshold is
valid for the new ring size after ptr_ring_resize().
- tun_queue_resize() wakes all queues after resizing with the proper
tx_ring.consumer_lock and resets the cons_cnt to avoid a possible
stale queue.
- The aforementioned upcoming patch explains the pairing of the smp_mb()
of __tun_wake_queue().
Without the corresponding queue stopping, this patch alone causes no
regression for a tap setup sending to a qemu VM: 1.132 Mpps
to 1.134 Mpps.
Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, pktgen sender; Avg over 50 runs @ 100,000,000 packets;
SRSO and spectre v2 mitigations disabled.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-2-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot. Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.
The safety mechanism depends on the origin of the reset. For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event. In that case no sleep is possible as a device halt is reported
at the hardirq level.
A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6ffbee
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:
The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to. Therefore there is a
very remote issue only this is a sign of.
Specifically if an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion
and then that second call is pre-empted after poking at the hardware,
but before the timer has been rearmed and owing to high system load
causing exceedingly high scheduling latency control is not handed back
before a transition to the uninitialised state has caused the timer to
be deleted even before it has been started, then fza_reset_timer() will
be called yet again and issue another reset even though by then the
adapter has already recovered.
Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean). It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection. For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).
Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.
Fixes: 61414f5ec9834 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 13 May 2026 22:00:40 +0000 (15:00 -0700)]
Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.
- UAFs and lifecycle bugs on the sub-sched attach/detach paths:
parent sub_kset freed under a racing child, list_del_rcu on an
uninitialized list head, ops->priv stomped by concurrent
attach/detach, and a UAF in the init-failure error path
- Task state-machine reorg closing concurrent enable-vs-dead races: a
task exiting during the unlocked init window could trip NULL ops
derefs or skip exit_task() cleanup
- A scx_link_sched() self-deadlock on scx_sched_lock
- isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
cpumask without RCU, and stop rejecting BPF schedulers when only
cpuset isolated partitions are active
- PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
the failing task rather than the irq_work kthread
- Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
fixes"
* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
sched_ext: Fix ops->priv clobber on concurrent attach/detach
selftests/sched_ext: Fix build error in dequeue selftest
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
sched_ext: Close sub-sched init race with post-init DEAD recheck
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
sched_ext: Drop unused scx_find_sub_sched() stub
sched_ext: Move scx_error() out of scx_link_sched()'s lock region
Linus Torvalds [Wed, 13 May 2026 21:56:31 +0000 (14:56 -0700)]
Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
passed OOM-kill tests
* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
selftests/cgroup: Fix error path leaks in test_percpu_basic
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cgroup/cpuset: Reset DL migration state on can_attach() failure
selftests/cgroup: Fix string comparison in write_test
selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cgroup/dmem: Return -ENOMEM on failed pool preallocation
cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
Linus Torvalds [Wed, 13 May 2026 21:49:13 +0000 (14:49 -0700)]
Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path
- Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
set with no owner
* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
workqueue: Release PENDING in __queue_work() drain/destroy reject path
Andrea Righi [Wed, 13 May 2026 11:24:38 +0000 (13:24 +0200)]
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.
Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:
In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.
Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.