Miri Korenblit [Tue, 12 May 2026 05:23:04 +0000 (08:23 +0300)]
wifi: iwlwifi: mld: stop supporting iwl_compressed_ba_notif version 5 and 6
The oldest core supported by devices that load the iwlmld op mode is core 101.
Core 101 has version 7 of iwl_compressed_ba_notif, so earlier versions
are no longer needed.
Miri Korenblit [Tue, 12 May 2026 05:23:01 +0000 (08:23 +0300)]
wifi: iwlwifi: support a TLV indicating num of mgmt mcast keys
The FW limits how many multicast management keys it supports.
Until now we simply assumed this limit. Now that it is changing
due to NAN, we need a clear indication from the FW so we know how many
keys we can install.
wifi: iwlwifi: remove nvm_ver for devices that don't need it
This was needed only to check the NVM for devices that had a specific
firmware image to run the initial calibrations.
Remove this field from newer devices that no longer have a specific
image for those.
Rename iwl_system_statistics_notif_oper to
iwl_system_statistics_notif_oper_v3 since v4 is on the way.
Same for iwl_stats_ntfy_per_phy, since v2 is on the way.
Johannes Berg [Tue, 12 May 2026 05:22:53 +0000 (08:22 +0300)]
wifi: iwlwifi: pcie: fix ACPI DSM check
The acpi_check_dsm() function expects a bitmap of function
IDs to check for, not a single value. Evidently, on many
platforms function 1 exists so checking for 2 succeeded,
but it's wrong; we need to check correctly for function 2.
Fix that.
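The difference between a function ID and a function bitmap can be
sketched in plain C. This is a minimal user-space model, not the driver
code: funcs_supported() and FUNC_BIT() are hypothetical stand-ins for a
bitmap-based query such as acpi_check_dsm().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit for function ID n in a "supported functions" bitmap. */
#define FUNC_BIT(n) (UINT64_C(1) << (n))

/*
 * Hypothetical stand-in for a bitmap-based query: true only if every
 * function whose bit is set in 'funcs' is also set in 'supported'.
 */
bool funcs_supported(uint64_t supported, uint64_t funcs)
{
	return (supported & funcs) == funcs;
}

/* Buggy: passes the ID 2 as a bitmap. 2 == FUNC_BIT(1), i.e. function 1. */
bool has_func2_buggy(uint64_t supported)
{
	return funcs_supported(supported, 2);
}

/* Fixed: passes the bit that actually represents function 2. */
bool has_func2_fixed(uint64_t supported)
{
	return funcs_supported(supported, FUNC_BIT(2));
}
```

With a bitmap of 0x3 (functions 0 and 1 only), the buggy variant reports
success while the fixed one does not.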
Avinash Bhatt [Mon, 11 May 2026 17:36:31 +0000 (20:36 +0300)]
wifi: iwlwifi: fix buffer overflow when firmware reports no channels
When parsing the NVM while setting the country code, if the firmware
reports 0 channels, the buffer is allocated for 0 rules but a dummy rule
is still added for cfg80211 compatibility, causing kmemdup() to read
128 bytes from a 32-byte buffer.
Allocate the regd buffer with room for one rule when the firmware
reports 0 channels.
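The sizing rule can be illustrated with a small user-space sketch. The
struct layout, the helper names and the dummy rule's placeholder value
are assumptions made for illustration; they are not the cfg80211 or
iwlwifi definitions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the regulatory structures. */
struct reg_rule {
	unsigned int start_khz, end_khz, max_bw_khz, flags;
};

struct regdomain {
	unsigned int n_rules;
	struct reg_rule rules[];	/* flexible array member */
};

/* Bytes needed for a regdomain holding n_rules rules. */
size_t regd_size(unsigned int n_rules)
{
	return sizeof(struct regdomain) +
	       (size_t)n_rules * sizeof(struct reg_rule);
}

/* Always leave room for the dummy rule added when there are no channels. */
unsigned int rules_to_alloc(unsigned int num_channels)
{
	return num_channels ? num_channels : 1;
}

struct regdomain *build_regd(unsigned int num_channels)
{
	struct regdomain *regd;

	regd = calloc(1, regd_size(rules_to_alloc(num_channels)));
	if (!regd)
		return NULL;

	/* Real code would fill one rule per valid channel; left zeroed here. */
	regd->n_rules = num_channels;

	if (!regd->n_rules) {
		/* Dummy rule (placeholder value); rules[0] is in bounds now. */
		regd->rules[0].flags = 1;
		regd->n_rules = 1;
	}

	return regd;
}
```

Because the allocation always covers at least n_rules rules, a later copy
of regd_size(regd->n_rules) bytes stays within the buffer.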
Johannes Berg [Mon, 11 May 2026 17:36:27 +0000 (20:36 +0300)]
wifi: iwlwifi: mld: don't report bad STA ID in EHT TB sniffer
The field being reported here is part of the EHT union, and not
valid in EHT TB. Don't report it there. We could probably report
the station ID we're following, but for now just don't, since it
appears nobody really cared.
Johannes Berg [Mon, 11 May 2026 17:36:26 +0000 (20:36 +0300)]
wifi: iwlwifi: mld: track TX/RX IGTKs separately
Due to FW/HW limitations and the MME being at the end of the
frame, the devices only support a single IGTK for RX. For TX
multiple aren't needed, only the latest will be used, but in
the device there are space restrictions, so we can also only
install one.
For NAN, however, we will have one for RX for each peer, and
one for ourselves to transmit with.
Separate out the tracking of IGTK: instead of being per link
make the TX ones per link and the RX ones per (link) station.
Note that we currently hardcode that the FW can only have two
(IWL_MAX_NUM_IGTKS) IGTKs, which won't be sufficient for NAN
with security, concurrently with BSS.
Johannes Berg [Mon, 11 May 2026 17:36:24 +0000 (20:36 +0300)]
wifi: iwlwifi: mld: add TLC support for NAN stations
In order to support NAN, TLC now has a station bitmap. Use this
and update TLC for the NAN stations accordingly whenever links
(and thus PHYs) change, and whenever else mac80211 might update
the rate scale information.
Johannes Berg [Mon, 11 May 2026 17:36:23 +0000 (20:36 +0300)]
wifi: iwlwifi: mld: clean up station handling in key APIs
The internal key APIs, when called with group keys where mac80211
doesn't pass the (AP) station pointer, are still sometimes called
with the AP station pointer on internal calls. This is confusing.
Clean it up and always call them with the AP STA when it exists,
even when coming in from mac80211, by looking it up immediately.
wifi: iwlwifi: mld: move iwl_mld_link_info_changed_ap_ibss to ap.c
This function is ap mode related, move it to ap.c.
Also, don't call iwl_mld_ftm_responder_clear from stop_ap() since
mac80211 does it now before stopping the AP.
We should stick to mac80211's flow to start / stop beaconing. This
allows us to stop beaconing before we remove the BIGTK.
Note that the start and stop beaconing flows are not exactly symmetric.
When we start beaconing, we just update the beacon template. We assume
that mac80211 won't update the beacon if we're not supposed to be
sending it.
Also note that we now send the beacon template after the broadcast
station was added to the firmware: the broadcast station is added in
the start_ap() flow, while the beacon template is now added in the
link_changed() flow which happens later. This is not what we did
before this patch, but this sequence is supported by the firmware as
well.
Israel Kozitz [Sun, 10 May 2026 20:48:39 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: fix NAN max channel switch time unit
The max_channel_switch_time in wiphy_nan_capa is in microseconds, but
the value was set to 4, which is only 4 microseconds instead of the
intended 4 milliseconds.
Ilan Peer [Sun, 10 May 2026 20:48:38 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: Do not declare NAN support for Extended Key ID
Do not declare support for Extended Key ID for NAN, as defined in section
7.4 in the WiFi Aware specification v4.0 (in order to support security
association upgrade).
Miri Korenblit [Sun, 10 May 2026 20:48:34 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: use host rate for NAN management frames
Frames that are sent to an NMI station are always NAN management frames.
Therefore there is no need to configure TLC for such a station.
Always use host rate for the frames going to that station.
Johannes Berg [Sun, 10 May 2026 20:48:30 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: add NAN link management
The firmware requires links for NAN which mac80211 doesn't use,
so introduce a new NAN link data structure that the driver has
for itself only, and handle the link command sending code for
NAN using this data structure. Most of the bss_conf data isn't
used for NAN anyway, so those structures aren't useful.
With that, add, activate, deactivate or remove links depending
on the local NAN schedule updates.
Johannes Berg [Sun, 10 May 2026 20:48:29 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: support NAN and NAN_DATA interfaces
Until now we maintained the NAN vif in the driver only. The fw used the
AUX MAC for sync and discovery operations.
But when we want to configure a local schedule, we need to add the MAC
first.
NAN_DATA interfaces are not added to the FW. Instead, the local
addresses of these interfaces are configured to the FW via the NAN MAC.
Add the add/remove/update operations for the NAN interface, and fill the
NAN special parameters in it.
Note that this doesn't fully implement the schedule change, but only the
addition/removal of the NAN MAC. The full schedule management
implementation will come in a later patch.
Johannes Berg [Sun, 10 May 2026 20:48:27 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: tlc: separate from link STA
While NAN stations have the deflink link STA and that even
carries their information, having link STAs mostly implies
having real links, and NAN muddies that by having stations
with deflink carrying their capabilities and links at the
NAN level, but no link stations corresponding to NAN links.
Separate out the data needed to build TLC commands into a
new struct iwl_mld_tlc_sta_capa data structure so that the
whole data usage in the TLC code is clarified and we won't
make assumptions, say about being able to look up the link
of an interface from the (NAN) link sta correctly, which
would result in a link but not with a chanctx.
Miri Korenblit [Sun, 10 May 2026 20:48:26 +0000 (23:48 +0300)]
wifi: iwlwifi: mld: set NAN phy capabilities
Copy the HT, VHT and HE capabilities from the sbands:
- The HT capabilities from the 2.4 GHz sband (there is no difference
between the bands anyway).
- The VHT capabilities from the 5 GHz sband, obviously.
- The HE capabilities from the 2.4 GHz sband and for NL80211_IFTYPE_STATION.
Fix it up to include also the needed 5 GHz bits.
For HE, there are bits that are band-dependent and iftype-dependent. For
those set to what makes most sense, and leave a comment to re-visit.
Junrui Luo [Thu, 2 Apr 2026 06:48:07 +0000 (14:48 +0800)]
wifi: iwlwifi: mld: validate sta_mask before ffs() in BA session handlers
Three BA session handlers use ffs(ba_data->sta_mask) - 1 to derive a
station ID without checking that sta_mask is non-zero. When sta_mask is
zero, ffs() returns 0 and the subtraction wraps to 0xFFFFFFFF, causing
an out-of-bounds access on fw_id_to_link_sta[].
Add WARN_ON_ONCE(!ba_data->sta_mask) guards before each ffs() call,
consistent with the existing check in iwl_mld_ampdu_rx_start().
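The wrap-around can be reproduced in user space. This is a minimal
sketch: ffs32() is a hypothetical helper that mimics the ffs() contract
(1-based index of the lowest set bit, 0 if no bit is set), and the guard
returns -1 instead of using WARN_ON_ONCE().

```c
#include <stdint.h>

/* 1-based index of the lowest set bit, or 0 if mask is zero. */
int ffs32(uint32_t mask)
{
	int pos = 1;

	if (!mask)
		return 0;

	while (!(mask & 1)) {
		mask >>= 1;
		pos++;
	}

	return pos;
}

/* Returns the station ID derived from sta_mask, or -1 for an empty mask. */
int sta_id_from_mask(uint32_t sta_mask)
{
	/* Without this guard, ffs32(0) - 1 would yield -1 (0xFFFFFFFF). */
	if (!sta_mask)
		return -1;

	return ffs32(sta_mask) - 1;
}
```

Callers are expected to treat a negative return as "no station" and skip
the array lookup entirely.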
Jay Ng [Wed, 8 Apr 2026 03:42:36 +0000 (20:42 -0700)]
wifi: iwlwifi: remove unused header inclusions
Remove header files that are included but provide no symbols,
types, or macros used by the including translation unit.
In iwl-trans.c, fw/api/tx.h defines TX command structures
(iwl_tx_cmd, iwl_tx_resp, TX_CMD_* flags) used by the PCIe TX
path, not by the transport core itself. Similarly, iwl-fh.h
defines Flow Handler register addresses and DMA-related constants
(FH_*, RFH_*, TFD_*) that are consumed by PCIe-specific code,
none of which are referenced in iwl-trans.c.
In iwl-nvm-parse.c, fw/acpi.h defines ACPI/SAR/GEO/PPAG
interfaces (iwl_acpi_*, iwl_sar_*, iwl_geo_*). No references to
any of these interfaces exist in this file.
Junjie Cao [Thu, 12 Feb 2026 12:50:34 +0000 (20:50 +0800)]
wifi: iwlwifi: mvm: fix race condition in PTP removal
iwl_mvm_ptp_remove() calls cancel_delayed_work_sync() only after
ptp_clock_unregister() and clearing ptp_data state (ptp_clock,
ptp_clock_info, last_gp2).
This creates a race where the delayed work iwl_mvm_ptp_work() can
execute between ptp_clock_unregister() and cancel_delayed_work_sync(),
observing partially cleared PTP state.
Move cancel_delayed_work_sync() before ptp_clock_unregister() to
ensure the delayed work is fully stopped before any PTP cleanup
begins.
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Junjie Cao <junjie.cao@intel.com>
Link: https://patch.msgid.link/20260212125035.1345718-1-junjie.cao@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Junjie Cao [Thu, 12 Feb 2026 12:50:35 +0000 (20:50 +0800)]
wifi: iwlwifi: mld: fix race condition in PTP removal
iwl_mld_ptp_remove() calls cancel_delayed_work_sync() only after
ptp_clock_unregister() and clearing ptp_data state (ptp_clock,
last_gp2, wrap_counter).
This creates a race where the delayed work iwl_mld_ptp_work() can
execute between ptp_clock_unregister() and cancel_delayed_work_sync(),
observing partially cleared PTP state.
Move cancel_delayed_work_sync() before ptp_clock_unregister() to
ensure the delayed work is fully stopped before any PTP cleanup
begins.
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Junjie Cao <junjie.cao@intel.com>
Link: https://patch.msgid.link/20260212125035.1345718-2-junjie.cao@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Arnd Bergmann [Thu, 7 May 2026 21:24:50 +0000 (23:24 +0200)]
p54spi: convert to devicetree
The Prism54 SPI driver hardcodes GPIO numbers and expects users to
pass them as module parameters, apparently a relic from its life as a
staging driver. This works because there is only one user, the Nokia
N8x0 tablet.
Convert this to the gpio descriptor interface and DT based probing
to improve this and simplify the code at the same time.
Arnd Bergmann [Thu, 7 May 2026 21:24:49 +0000 (23:24 +0200)]
dt-bindings: net: add st,stlc4560/p54spi binding
The SPI version of Prism54 was sold under a couple of different
names and supported by the Linux p54spi driver, but there was
never a DT binding for it.
Document the four known names of this device and the properties
that are sufficient for its use on the Nokia N8x0 tablet.
As I don't have this hardware or documentation for it, this is
purely based on existing usage in the driver.
Daniel Gabay [Fri, 15 May 2026 11:28:06 +0000 (14:28 +0300)]
wifi: mac80211: allow cipher change on NAN_DATA interfaces
ieee80211_key_link() rejects pairwise key installation when the
cipher differs from the existing PTK. Per Wi-Fi Aware version 4.0
section 7.4, the ND-TKSA between the same NDI pair shall be updated
when a new NDP requires a stronger cipher suite.
Exempt NL80211_IFTYPE_NAN_DATA from the same-cipher enforcement so
the PTK can be replaced with a different cipher.
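The exemption amounts to a small predicate. The sketch below is a
simplified model, not the mac80211 code: enum iftype and
ptk_replace_allowed() are hypothetical names, and only the cipher suite
selectors (CCMP-128 and GCMP-256) are the standard values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the nl80211 interface types. */
enum iftype {
	IFTYPE_STATION,
	IFTYPE_NAN_DATA,
};

#define CIPHER_CCMP_128	0x000FAC04u
#define CIPHER_GCMP_256	0x000FAC09u

/* May a new pairwise key replace the existing PTK? */
bool ptk_replace_allowed(enum iftype type, uint32_t old_cipher,
			 uint32_t new_cipher)
{
	/* Rekeying with the same cipher is always fine. */
	if (old_cipher == new_cipher)
		return true;

	/* Only NAN_DATA may switch ciphers (ND-TKSA upgrade). */
	return type == IFTYPE_NAN_DATA;
}
```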
Ilan Peer [Fri, 15 May 2026 11:15:16 +0000 (14:15 +0300)]
wifi: mac80211_hwsim: Do not declare NAN support for Extended Key ID
Do not declare support for Extended Key ID for NAN, as defined in
section 7.4 in the WiFi Aware specification v4.0 (in order to support
security association upgrade).
Miri Korenblit [Wed, 13 May 2026 15:26:56 +0000 (18:26 +0300)]
wifi: mac80211: don't call ieee80211_handle_reconfig_failure when not needed
If reconfiguration of NAN fails, we call
ieee80211_handle_reconfig_failure, which marks all interfaces as not in
the driver.
Then, at the error path of the reconfig, cfg80211_shutdown_all_interfaces
is called to destroy all the interfaces.
If we have any interface other than the NAN one, for example a BSS
station, then when its state (links, stations) is removed, we
won't tell the driver about this, because we will think that the
interfaces are not in the driver. Drivers might then be left with
dangling pointers to objects like stations and links (at least for
iwlwifi this is the case).
ieee80211_handle_reconfig_failure is meant to be called after we have
cleaned up the state in the driver, so there is no reason to call it for
NAN reconfiguration failure.
Fix the code to just warn in such a case, as we do in other error paths
in reconfig where it is too complicated to rewind.
Ilan Peer [Wed, 13 May 2026 14:24:22 +0000 (17:24 +0300)]
wifi: mac80211: Allow per station GTK for NAN Data interfaces
The WiFi Aware specification (v4.0) requires that NAN devices that
support security would also support per station GTK. Thus, allow
per station GTK installation to the driver on NAN Data interfaces.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
tested: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260513172418.37a8e259e611.I39bb9f3c1a65a8184124f531c18e121dc123d411@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
wifi: mac80211_hwsim: reject NAN on multi-radio wiphys
When userspace creates a new hwsim radio with both
HWSIM_ATTR_MULTI_RADIO and HWSIM_ATTR_SUPPORT_NAN_DEVICE,
hwsim_new_radio_nl() sets BIT(NL80211_IFTYPE_NAN_DATA) in
wiphy->interface_modes while configuring the wiphy with
n_radio > 1. This violates the invariant checked in
wiphy_register():
triggering a WARN reachable from userspace via genetlink.
With panic_on_warn this becomes a denial of service.
Refuse the combination at parse time with -EINVAL and an
extack message, matching the cfg80211 constraint that NAN
is not supported on multi-radio wiphys.
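The parse-time check can be sketched as follows. struct radio_params and
validate_radio_params() are hypothetical names, and the plain string
pointer only stands in for an extack message; this is not the hwsim
code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the parameters parsed from the netlink request. */
struct radio_params {
	bool multi_radio;
	bool nan_device;
};

/*
 * Reject the unsupported combination before any wiphy is set up.
 * Returns 0 on success, or -EINVAL with *errmsg pointing at a reason.
 */
int validate_radio_params(const struct radio_params *p, const char **errmsg)
{
	*errmsg = NULL;

	if (p->multi_radio && p->nan_device) {
		*errmsg = "NAN is not supported on multi-radio wiphys";
		return -EINVAL;
	}

	return 0;
}
```

Failing early with an error code keeps the invalid configuration from
ever reaching the registration path that would warn.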
Lachlan Hodges [Wed, 6 May 2026 13:19:25 +0000 (23:19 +1000)]
wifi: mac80211: don't recalc min def for S1G chan ctx
__ieee80211_recalc_chanctx_min_def() currently does not attempt
to find the min def for S1G widths, meaning the BW will never change.
However, the following call into ieee80211_chan_bw_change() will
lead to a WARN within ieee80211_chan_width_to_rx_bw(). Not only that,
this entire path is geared towards 20MHz based channels, so it doesn't
make sense anyway. For now, return early when calculating the mindef
for S1G channels.
Lachlan Hodges [Wed, 6 May 2026 13:19:24 +0000 (23:19 +1000)]
wifi: mac80211: skip NSS and BW init for S1G sta
Currently there is no S1G STA bandwidth support throughout mac80211
as existing support is all based on 20MHz widths. With the recent
STA NSS/BW handling rework, S1G associations now hit the new WARN within
ieee80211_chan_width_to_rx_bw() as the chandef is not a 20MHz based
width. For now, skip initialising link_sta->pub->bandwidth for
S1G chandefs to avoid the WARN though this should at some point be
properly implemented since there are vendors that offer differing
maximum bandwidths.
Additionally, currently all S1G hardware out there is 1SS so rather
then introducing new parsing code which wouldn't be used anyway, just
initialise the NSS related fields to 1 and skip initialising the STA
bandwidth for S1G chandefs within ieee80211_sta_init_nss_bw_capa().
Johannes Berg [Tue, 5 May 2026 13:17:31 +0000 (15:17 +0200)]
wifi: mac80211: check stations are removed before MLD change
If an interface changes to/from MLD, then all stations related
to it must have been removed first. This is just natural since
we go from having links to not (or vice versa), but not doing
so also causes crashes in debugfs since vif changing to/from
MLD removes the entire debugfs for the vif, including stations.
Delete all stations but warn in this case, since other code should
be handling it. In effect, fail fast rather than doing a double
free or use-after-free in debugfs.
- ipv6: flowlabel: enforce per-netns limit for unprivileged callers
- tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
- smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
- sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL
- batman-adv:
- reject new tp_meter sessions during teardown
- purge non-released claims
- eth:
- i40e: cleanup PTP registration on probe failure
- idpf: fix double free and use-after-free in aux device error paths
- ena: fix potential use-after-free in get_timestamp"
* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
net: phy: DP83TC811: add reading of abilities
net: tls: prevent chain-after-chain in plain text SG
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
macsec: introduce dedicated workqueue for SA crypto cleanup
net: net_failover: Fix the deadlock in slave register
MAINTAINERS: update atlantic driver maintainer
selftests/tc-testing: Add QFQ/CBS qlen underflow test
net/sched: sch_cbs: Call qdisc_reset for child qdisc
FDDI: defza: Sanitise the reset safety timer
net: ethernet: ravb: Do not check URAM suspension when WoL is active
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
net: atm: fix skb leak in sigd_send() default branch
net: ethtool: phy: avoid NULL deref when PHY driver is unbound
net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
net: shaper: reject QUEUE scope handle with missing id
...
Linus Torvalds [Thu, 14 May 2026 15:53:24 +0000 (08:53 -0700)]
Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
- Correctly log the inheritable capabilities
- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands
* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records
Linus Torvalds [Wed, 13 May 2026 18:37:18 +0000 (11:37 -0700)]
ptrace: slightly saner 'get_dumpable()' logic
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.
And almost all users do in fact use it only for the case where the task
has a mm pointer.
But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).
It's not what this flag was designed for, but it is what it is.
The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.
Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.
Sven Schuchmann [Tue, 12 May 2026 07:19:47 +0000 (09:19 +0200)]
net: phy: DP83TC811: add reading of abilities
At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.
Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:18 +0000 (10:49 -0700)]
net: tls: prevent chain-after-chain in plain text SG
Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:
This would create a chain where the wrap link points directly
to another chain link. The scatterlist API sg_next iterator
does not recursively resolve consecutive chain links,
meaning this is illegal input to crypto.
The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:
end start i
v v v
[ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap]
Skip the wrapping in this case.
TLS 1.3 can use the "wrapping slot" for its chaining if end = 0.
This avoids the chain-after-chain.
Move the wrap chaining before marking END and chaining off content
type, that feels like more logical ordering to me, but should not
matter from functional perspective.
Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:17 +0000 (10:49 -0700)]
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:
struct sk_msg_sg {
[...]
/* The extra two elements:
* 1) used for chaining the front and sections when the list becomes
* partitioned (e.g. end < start). The crypto APIs require the
* chaining;
* 2) to chain tailer SG entries after the message.
*/
struct scatterlist data[MAX_MSG_FRAGS + 2];
The current code uses MAX_SKB_FRAGS + 1 as the ring size:
instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")
Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).
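The off-by-one can be checked with simple index arithmetic. This sketch
assumes that sg_chain(prv, prv_nents, sgl) stores the link in
prv[prv_nents - 1], and that the old call passed &data[start] with
MAX + 1 - start entries, as the description above suggests. MAX_FRAGS
is set to 17 purely for illustration, and struct ring is a hypothetical
stand-in for struct sk_msg_sg.

```c
#include <stddef.h>

#define MAX_FRAGS 17	/* assumed value, for illustration only */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for the sg array: ring slots plus two extras. */
struct ring {
	int data[MAX_FRAGS + 2];
};

/* Index that receives the chain link for a span of nents from base. */
size_t chain_slot(size_t base, size_t nents)
{
	return base + nents - 1;
}

/* Old: span starts at 'start' and is sized from MAX_FRAGS + 1. */
size_t chain_slot_old(size_t start)
{
	return chain_slot(start, MAX_FRAGS - start + 1);
}

/* New: span is the whole array, so the link lands in the last entry. */
size_t chain_slot_new(void)
{
	struct ring r;

	return chain_slot(0, ARRAY_SIZE(r.data));
}
```

Under these assumptions the old form always lands one entry short of the
end of the array regardless of start, while the ARRAY_SIZE form picks
the true last entry.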
Reported-by: 钱一铭 <yimingqian591@gmail.com>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
bridge: Add selective forwarding of gratuitous neighbor announcements
The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.
This series adds a new neigh_forward_grat option that provides
independent control of gratuitous ARP and unsolicited NA forwarding.
When neigh_suppress is enabled and neigh_forward_grat is also enabled,
regular neighbor discovery is suppressed while gratuitous announcements
are forwarded.
The implementation marks gratuitous ARPs and unsolicited NAs in
BR_INPUT_SKB_CB during input processing, then checks the per-output-port
neigh_forward_grat setting during flooding. This allows gratuitous
announcements from any input port to be selectively forwarded based on
each output port's individual configuration.
Both port-level control (via IFLA_BRPORT_NEIGH_FORWARD_GRAT) and
per-VLAN control (via BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT) are
provided. The default value of OFF preserves existing behavior.
This behavior is in accordance with RFC 9161 (Section 3.6), which
recommends that VTEPs forward gratuitous ARP and unsolicited NA messages
to avoid traffic disruption during host mobility events.
The new attributes use NLA_U8, although the kernel netlink guideline
recommends NLA_U32 as the minimum integer type on the grounds that
alignment makes smaller types equivalent on the wire. For a simple
on/off attribute there is no technical advantage to u32 over u8, and
keeping u8 preserves consistency with all surrounding bridge port
attributes and avoids introducing new helpers alongside the existing
infrastructure.
Patchset overview:
Patch #1: adds uapi headers.
Patches #2-#3: support selective forwarding of gratuitous ARP.
Patches #4-#5: add netlink handling.
Patch #6: adds tests.
Please see iproute related patches in the last 3 commits of:
https://github.com/daniellerts/iproute2
====================
Danielle Ratson [Mon, 11 May 2026 06:59:36 +0000 (09:59 +0300)]
selftests: net: Add tests for neigh_forward_grat option
Add tests to validate the neigh_forward_grat bridge option for selective
forwarding of gratuitous neighbor announcements.
The tests verify per-port and per-VLAN control of gratuitous neighbor
announcement forwarding for both IPv4 (gratuitous ARP) and IPv6
(unsolicited NA):
- When neigh_suppress is enabled with neigh_forward_grat off (default),
gratuitous announcements are suppressed
- When neigh_forward_grat is enabled, gratuitous announcements are
forwarded while regular neighbor discovery remains suppressed
For IPv4, use arping to send gratuitous ARP packets. For IPv6, use
mausezahn to craft unsolicited Neighbor Advertisement packets.
For the per-port tests, the IPv4 test exercises the ip link interface,
while the IPv6 test exercises the bridge link interface.
The per-VLAN tests use the bridge interface throughout, as per-VLAN
attributes are only accessible via 'bridge vlan'.
Danielle Ratson [Mon, 11 May 2026 06:59:35 +0000 (09:59 +0300)]
bridge: Add per-VLAN netlink handling for neigh_forward_grat
Add netlink handlers for the per-VLAN neigh_forward_grat option via
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT attribute.
The per-VLAN option provides fine-grained control, allowing different
VLANs on the same port to have different gratuitous ARP/unsolicited NA
forwarding behavior.
This enables control via 'bridge' commands:
# bridge vlan set dev eth0 vid 10 neigh_suppress on
# bridge vlan set dev eth0 vid 10 neigh_forward_grat on
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-6-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Danielle Ratson [Mon, 11 May 2026 06:59:34 +0000 (09:59 +0300)]
bridge: Add port-level netlink handling for neigh_forward_grat
Add netlink handlers for the port-level neigh_forward_grat option via
IFLA_BRPORT_NEIGH_FORWARD_GRAT attribute.
The default value of OFF preserves existing behavior, i.e. gratuitous ARP
and unsolicited NA are suppressed when neigh_suppress is enabled. Users can
explicitly set it to ON to allow these packets through.
Example for enabling control via 'bridge link' command:
# bridge link set dev eth0 neigh_suppress on
# bridge link set dev eth0 neigh_forward_grat on
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-5-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Danielle Ratson [Mon, 11 May 2026 06:59:33 +0000 (09:59 +0300)]
bridge: Add selective forwarding of gratuitous neighbor announcements
The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.
Add the neigh_forward_grat option to allow selective control of gratuitous
neighbor announcements. When neigh_suppress is enabled but
neigh_forward_grat is disabled (default), gratuitous announcements are
suppressed. When neigh_forward_grat is enabled, gratuitous announcements
are forwarded while regular neighbor discovery remains suppressed.
The implementation provides per-output-port control by:
1. Adding a 'grat_arp' flag to BR_INPUT_SKB_CB to mark gratuitous ARPs and
unsolicited NAs.
2. Setting both grat_arp and proxyarp_replied flags in
br_do_proxy_suppress_arp() and br_do_suppress_nd() when gratuitous
packets are detected.
3. Checking neigh_forward_grat per output port during flooding:
- For gratuitous ARPs/NAs: suppress unless the output port has
neigh_forward_grat enabled.
- For regular ARPs/NDs: maintain existing behavior.
This allows gratuitous announcements from any input port to be selectively
forwarded based on each output port's individual neigh_forward_grat
setting, enabling gratuitous neighbor announcements to be flooded to the
VXLAN fabric.
Regular neighbor discovery (ARP requests, NS queries, solicited replies)
remains controlled by neigh_suppress and is unaffected.
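The per-output-port decision described above can be modelled roughly as
follows. This is a simplified sketch under stated assumptions: the
struct and function names are hypothetical, and proxy ARP Wi-Fi handling
and per-VLAN state are ignored.

```c
#include <stdbool.h>

/* Hypothetical per-packet marks set during input processing. */
struct pkt_marks {
	bool proxyarp_replied;	/* handled by neighbor suppression */
	bool grat_arp;		/* gratuitous ARP or unsolicited NA */
};

/* Hypothetical per-output-port settings. */
struct out_port {
	bool neigh_suppress;
	bool neigh_forward_grat;
};

/* Should this packet be flooded to the given output port? */
bool should_flood(const struct pkt_marks *m, const struct out_port *p)
{
	/* Port does not suppress, or packet was not marked: flood. */
	if (!p->neigh_suppress || !m->proxyarp_replied)
		return true;

	/* Gratuitous announcements pass only if the port opts in. */
	if (m->grat_arp)
		return p->neigh_forward_grat;

	/* Regular suppressed neighbor discovery keeps existing behavior. */
	return false;
}
```

Since the check uses the output port's flags, one gratuitous packet can
be forwarded on some ports and suppressed on others.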
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-4-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add netlink attributes for controlling gratuitous ARP and unsolicited NA
forwarding when neighbor suppression is enabled.
Add IFLA_BRPORT_NEIGH_FORWARD_GRAT for port-level control and
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT for per-VLAN control.
The new attributes provide independent control of gratuitous ARP and
unsolicited NA packets. Operators can enable forwarding for those packets
for fast mobility across VTEPs while keeping general neighbor suppression
active.
Xiang Mei [Mon, 11 May 2026 06:21:38 +0000 (23:21 -0700)]
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.
smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:
BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
Write of size 4 at addr 0000000000000068 by task exploit/144
Call Trace:
_raw_spin_lock_bh
smc_conn_create (net/smc/smc_core.c:1997)
__smc_connect (net/smc/af_smc.c:1447)
smc_connect (net/smc/af_smc.c:1720)
__sys_connect
__x64_sys_connect
do_syscall_64
Require ism_dev[i] to be non-NULL before accepting a CHID match.
Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
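A userspace model of the CHID lookup shows why the extra NULL check closes the hole. This is a simplified sketch, not the kernel code: struct dev, struct init_info, MAX_SLOTS and find_accepted_chid() are hypothetical stand-ins, and only the ism_dev[]/ism_chid[] field names come from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 4	/* hypothetical size, not the kernel's */

struct dev { int unused; };	/* stand-in for struct smcd_dev */

struct init_info {
	struct dev *ism_dev[MAX_SLOTS];	/* slot 0 reserved for a V1 device */
	uint16_t ism_chid[MAX_SLOTS];
};

/* Return the matching slot index, or -1 if no populated slot has this CHID. */
static int find_accepted_chid(const struct init_info *ini, uint16_t chid)
{
	for (int i = 0; i < MAX_SLOTS; i++) {
		/* An empty slot (NULL device, CHID 0) must never match. */
		if (ini->ism_dev[i] && ini->ism_chid[i] == chid)
			return i;
	}
	return -1;
}
```

With a zero-initialised slot 0 and a V2 device in slot 1, a peer-supplied CHID of 0 no longer selects slot 0, so the later dereference of the selected device cannot hit a NULL pointer.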
Add 64-bit counters for each impairment netem applies (delay, loss,
ECN marking, corruption, duplication, reordering) and for skb
allocation failures during enqueue. Exposed through TCA_STATS_APP
as struct tc_netem_xstats.
Counters increment when an impairment occurs, independent of later
events that may mask its on-wire effect. Added allocation_errors
(similar to sch_fq) to account for when impairment could not be
applied due to memory pressure, etc.
net/sched: netem: handle multi-segment skb in corruption
The packet corruption code only flipped bits in the linear
header portion of the skb, skipping corruption when
skb_headlen() was zero.
Linearize the whole skb if necessary before corruption.
Extends d64cb81dcbd5 ("net/sched: sch_netem: fix out-of-bounds access
in packet corruption") with a more general solution.
net/sched: netem: replace pr_info with netlink extack error messages
Use netlink extack to report errors instead of sending them
to the kernel log with pr_info(). The error message can then be seen
with tc commands, and log spam is avoided.
The current layout of struct netem_sched_data can be improved
by optimizing cache locality, compacting data types (use u8
for enum) and eliminating unused elements.
Reorganize the struct as follows:
- Cacheline 0 holds the tfifo state (t_root/t_head/t_tail/t_len),
counter, and the unconditional enqueue scalars
latency/jitter/rate/gap/loss.
- Cacheline 1 holds the remaining zero-check scalars
(duplicate/reorder/corrupt/ecn), all five crndstate correlation
structures, and loss_model.
- Cacheline 2 holds prng, delay_dist, the slot dequeue state,
slot_dist, and the inner classful qdisc pointer.
- Rate-shaping fields, q->limit (config-only; the fast path reads
sch->limit), and the CLG Markov state move to the warm tail.
- tc_netem_slot slot_config and qdisc_watchdog (only consulted on
slot reschedule and watchdog wake) move to the cold tail.
Also reorder struct clgstate to place the u8 state member after the
u32 transition probabilities. This removes the 3-byte interior hole
without changing the struct's size.
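The effect of the struct clgstate reordering can be checked with offsetof() in plain C. The two structs below are illustrative userspace copies, assuming one u8 state member and five u32 probabilities; the member names a1..a5 and the helper functions are assumptions for the example, and the numbers hold on common ABIs where uint32_t is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Before: the u8 comes first, so the compiler pads it out to the
 * 4-byte alignment of the u32 that follows (a 3-byte interior hole). */
struct clgstate_old {
	uint8_t state;
	uint32_t a1, a2, a3, a4, a5;
};

/* After: the u8 follows the u32 members; the padding moves to the tail. */
struct clgstate_new {
	uint32_t a1, a2, a3, a4, a5;
	uint8_t state;
};

/* Bytes of padding between 'state' and the first u32 in the old layout. */
static size_t old_interior_hole(void)
{
	return offsetof(struct clgstate_old, a1) - sizeof(uint8_t);
}

/* Bytes of padding between the last u32 and 'state' in the new layout. */
static size_t new_interior_hole(void)
{
	return offsetof(struct clgstate_new, state) -
	       (offsetof(struct clgstate_new, a5) + sizeof(uint32_t));
}
```

Both layouts occupy the same number of bytes because the struct's size is still rounded up to its 4-byte alignment; the gain is that the padding now sits at the end rather than in the middle of the struct.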
====================
net/mlx5e: improve RSS indirection table sizing and resizing
This series by Yael improves mlx5e RSS indirection table handling around
channel count changes and large RSS configurations.
The series:
* removes the XOR8-specific channel count limitation,
* advertises the maximum supported RSS indirection table size,
* fixes resizing of non-default RSS contexts,
* allows resizing configured default RSS contexts during channel
changes,
* and increases the default RSS spread factor from 2x to 4x to improve
traffic distribution for large channel counts.
Together, these changes make RSS table sizing more flexible and robust,
while improving load balancing behavior on large systems.
====================
Increase the RQT uniform spread factor from 2 to 4 so that each channel
gets more indirection table entries and traffic is spread more evenly.
For num_channels > 64 imbalance drops from up to ~50% to up to ~25%.
For 64 or fewer channels the 256 entry minimum already provides at least
4x coverage and the table size is unchanged by this commit.
This satisfies the minimum 4x coverage requirement validated by the
generic RSS selftest commit 9e3d4dae9832 ("selftests: drv-net: rss:
validate min RSS table size").
The 4x spread factor is best-effort and the table size is always capped by
the device's log_max_rqt_size capability.
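The sizing rule and the quoted imbalance figures can be reproduced with a small model. This is a sketch under assumptions: rqt_size(), roundup_pow2() and imbalance_pct() are hypothetical helpers, the table is assumed to be filled round-robin, and the log_max_rqt_size values used with it are made up for illustration.

```c
#include <stdint.h>

#define INDIR_MIN_RQT_SIZE 256u	/* the 256-entry minimum mentioned above */

/* Smallest power of two that is >= v (v is assumed small enough not to overflow). */
static uint32_t roundup_pow2(uint32_t v)
{
	uint32_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Table size: spread factor applied, rounded up, floored at 256, capped by the device. */
static uint32_t rqt_size(uint32_t num_channels, uint32_t factor,
			 uint32_t log_max_rqt_size)
{
	uint32_t size = roundup_pow2(num_channels * factor);
	uint32_t cap = 1u << log_max_rqt_size;

	if (size < INDIR_MIN_RQT_SIZE)
		size = INDIR_MIN_RQT_SIZE;
	return size < cap ? size : cap;
}

/*
 * Imbalance in percent between the most and least loaded channel when the
 * entries are dealt round-robin. Returns UINT32_MAX if some channel would
 * get no entry at all.
 */
static uint32_t imbalance_pct(uint32_t size, uint32_t num_channels)
{
	uint32_t lo = size / num_channels;
	uint32_t hi = lo + (size % num_channels ? 1 : 0);

	if (lo == 0)
		return UINT32_MAX;
	return (hi - lo) * 100 / lo;
}
```

For 100 channels, a factor of 2 gives a 256-entry table where channels hold 2 or 3 entries (50% imbalance), while a factor of 4 gives 512 entries with 5 or 6 per channel (20%). For 64 channels both factors yield 256 entries, matching the statement that the table size is unchanged there.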
Yael Chemla [Mon, 11 May 2026 17:27:18 +0000 (20:27 +0300)]
net/mlx5e: resize configured default RSS context table on channel change
mlx5e_ethtool_set_channels() rejected channel count changes that
required a different RQT size when the default context indirection
table was user-configured. This restriction was introduced by
commit ee3572409f74 ("net/mlx5e: RSS, Block changing channels number
when RXFH is configured").
Lift the restriction. Validate the resize upfront with
ethtool_rxfh_indir_can_resize(), then fold or unfold the table
in-place via ethtool_rxfh_indir_resize() inside state_lock, before
mlx5e_safe_switch_params(), so the preactivate callback sees the
correct table content when it programs the HW.
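One way to read "fold or unfold" is sketched below as a userspace model. It is not the ethtool core implementation: indir_can_fold() and indir_resize() are hypothetical helpers, and the sketch assumes that shrinking is allowed only when the table already repeats with the new size as its period, while growing replicates the existing pattern.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shrinking loses nothing only if the table repeats with period new_size. */
static bool indir_can_fold(const uint32_t *indir, size_t old_size,
			   size_t new_size)
{
	if (new_size == 0)
		return false;
	for (size_t i = new_size; i < old_size; i++)
		if (indir[i] != indir[i % new_size])
			return false;
	return true;
}

/*
 * Resize in place. The buffer must have room for the larger of the two
 * sizes. Returns false (table untouched) if the resize would lose data.
 */
static bool indir_resize(uint32_t *indir, size_t old_size, size_t new_size)
{
	if (old_size == 0 || new_size == 0)
		return false;
	if (new_size <= old_size)	/* fold: keep the first new_size entries */
		return indir_can_fold(indir, old_size, new_size);
	for (size_t i = old_size; i < new_size; i++)
		indir[i] = indir[i % old_size];	/* unfold: replicate the pattern */
	return true;
}
```

Validating first and resizing afterwards mirrors the two-step order described in the commit message: the check can fail without side effects, and the resize itself then cannot.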
Yael Chemla [Mon, 11 May 2026 17:27:17 +0000 (20:27 +0300)]
net/mlx5e: resize non-default RSS indirection tables on channel change
When the channel count changes and the RQT size changes with it, a
problem arises for non-default RSS contexts. The driver-side indirection
table grows actual_table_size without filling the new entries; stale
entries from a prior larger configuration may be re-exposed, causing
mlx5e_calc_indir_rqns() to WARN on an out-of-range index.
Replace mlx5e_rss_params_indir_modify_actual_size() with
mlx5e_rss_ctx_resize(), which fills new entries by replicating
the existing pattern, matching what ethtool_rxfh_ctxs_resize() does
for the same case. Also restrict the loop to non-default contexts.
Call ethtool_rxfh_ctxs_can_resize() before acquiring state_lock to
validate that all non-default contexts can be resized, and
ethtool_rxfh_ctxs_resize() after releasing it to fold or unfold their
indirection tables. Both functions acquire rss_lock internally and
cannot be called under state_lock. RTNL, held by all set_channels
callers, serialises context creation and deletion making the pre-lock
check safe.
Guard both ethtool calls on mlx5e_rx_res_rss_cnt() > 1: skip the
validation and resize when no non-default contexts exist. This
naturally covers representors and IPoIB, which share
mlx5e_ethtool_set_channels() but cannot have non-default RSS contexts.
Yael Chemla [Mon, 11 May 2026 17:27:16 +0000 (20:27 +0300)]
net/mlx5e: advertise max RSS indirection table size to ethtool
Set rxfh_indir_space to the maximum indirection table size the driver
can support: the next power of two above MLX5E_MAX_NUM_CHANNELS times
MLX5E_UNIFORM_SPREAD_RQT_FACTOR.
Without this, ethtool_rxfh_ctxs_can_resize() returns -EINVAL, blocking
non-default RSS contexts from tracking indirection table size changes
when the channel count changes.
Yael Chemla [Mon, 11 May 2026 17:27:15 +0000 (20:27 +0300)]
net/mlx5e: remove channel count limit for XOR8 RSS hash
mlx5e_ethtool_set_channels() and mlx5e_rxfh_hfunc_check() rejected
channel counts that would produce an indirection table larger than 256
entries when the XOR8 hash function was active. This check was
introduced in commit 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash
with over 128 channels").
XOR8 yields an 8-bit hash, so in practice only up to 256 entries in the
indirection table can be reached due to limited entropy. However, this
does not provide a strong justification for prohibiting larger
indirection tables. Remove the limitation.
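The entropy argument can be illustrated with a simplified model of an 8-bit XOR hash. The helpers xor8() and indir_index() are hypothetical, and masking the hash with the table size is an assumption made for the example; the way the hardware actually derives the table index is not described here.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR all key bytes together; the result always fits in 8 bits. */
static uint8_t xor8(const uint8_t *key, size_t len)
{
	uint8_t h = 0;

	for (size_t i = 0; i < len; i++)
		h ^= key[i];
	return h;
}

/* Map a hash to a table slot; table_size is assumed to be a power of two. */
static uint32_t indir_index(uint8_t hash, uint32_t table_size)
{
	return hash & (table_size - 1);
}
```

Whatever the table size, an 8-bit hash can select at most 256 distinct slots, so entries beyond index 255 of a larger table are simply never hit in this model. A larger table is therefore wasteful rather than incorrect, which is in line with the decision to drop the limit.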
====================
macsec: use rcu_work to fix crypto cleanup in softirq context
From: Jinliang Zheng <alexjlzheng@tencent.com>
crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.
This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.
Two design decisions worth noting:
1. rcu_work instead of schedule_work() + synchronize_rcu()
An alternative would be to call schedule_work() directly from
macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at
the start of the work handler to replace the grace period previously
provided by call_rcu(). However, synchronize_rcu() blocks the worker
thread for the duration of a full RCU grace period. Under high SA
churn (e.g. tearing down an interface with many SAs), each SA would
occupy a worker thread while waiting, and multiple concurrent calls
cannot share the same grace period — leading to unnecessary latency
and resource waste.
rcu_work uses call_rcu_hurry() internally, which is fully asynchronous:
the worker thread is only dispatched after the grace period has elapsed,
and multiple concurrent queue_rcu_work() calls naturally batch under the
same grace period via the RCU subsystem's existing coalescing mechanism.
2. Dedicated workqueue instead of system_wq
Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain
exactly the work items belonging to this module — by calling
destroy_workqueue() after rcu_barrier(). If system_wq were used,
flush_scheduled_work() would drain all pending work items across the
entire system, creating unnecessary coupling with unrelated subsystems
and potentially causing unexpected delays. The dedicated workqueue
provides a clean, contained teardown path.
====================
Jinliang Zheng [Mon, 11 May 2026 15:31:00 +0000 (23:31 +0800)]
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.
Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.
Jinliang Zheng [Mon, 11 May 2026 15:30:59 +0000 (23:30 +0800)]
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash.
Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
blocks waiting, and concurrent releases of multiple SAs naturally
share the same grace period.
Jinliang Zheng [Mon, 11 May 2026 15:30:58 +0000 (23:30 +0800)]
macsec: introduce dedicated workqueue for SA crypto cleanup
Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.
Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.
rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.
While at it, rename the error labels in macsec_init() from the
resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style
(err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader
kernel convention.
Faicker Mo [Mon, 11 May 2026 14:05:51 +0000 (22:05 +0800)]
net: net_failover: Fix the deadlock in slave register
register_netdevice() already takes netdev_lock_ops() before it sends the
NETDEV_REGISTER notifier, so use the non-locking functions in
net_failover_slave_register() to avoid taking the instance lock twice.
failover_existing_slave_register() now wraps its failover_slave_register()
call in the lock and unlock ops too.
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Signed-off-by: Faicker Mo <faicker.mo@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>