idpf: Replace use of system_unbound_wq with system_dfl_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
This series extends Marvell octeontx2-af support for CN20K NPC (MCAM
debuggability, allocation policy, default-rule lifetime, optional KPU
profiles from firmware files, X2/X4 MCAM keyword handling in flows and
defaults, and dynamic CN20K NPC private state), adds a devlink mechanism
for multi-value parameters, and moves devlink_nl_param_fill() temporaries
to the heap so stack usage stays reasonable once union devlink_param_value
grows (patch 3).
Patch 1 enforces a single RVU admin-function PCI device in the kernel.
On Octeon series SoCs, hardware resources such as NPC, NIX and related
blocks are global and coordinated by the AF driver; PFs and VFs request
them through AF mailbox messages. Firmware exposes only one AF PCI
function at boot, so two AF driver instances cannot both own that state.
rvu_probe() rejects a second bind with -EBUSY, logs a warning, clears the
probe gate on early allocation failures, and aligns the driver model with
hardware so reviewers and automation can rely on exactly one bound AF.
Patch 2 improves CN20K MCAM visibility in debugfs: mcam_layout marks
enabled entries, dstats reports per-entry hit deltas (baseline updated in
software after each read; hardware counters are not cleared), and mismatch
lists enabled entries without a PF mapping.
Patch 3 allocates the per-configuration-mode union devlink_param_value
buffers and struct devlink_param_gset_ctx used by devlink_nl_param_fill()
with kcalloc()/kzalloc_obj() and funnels failures through a single cleanup
path so the netlink reply path stays safe as the union grows.
Patch 4 (Saeed) introduces DEVLINK_PARAM_TYPE_U64_ARRAY and nested
DEVLINK_ATTR_PARAM_VALUE_DATA attributes so drivers and user space can
exchange bounded u64 arrays; YAML, uapi, and netlink validation are
updated.
Patch 5 adds a runtime devlink parameter srch_order to reorder CN20K
subbank search during MCAM allocation (the param uses the u64 array type
from patch 4).
Patch 6 ties default MCAM entries to NIX LF alloc/free on CN20K, adds
NIX_LF_DONT_FREE_DFT_IDXS for PF teardown paths that must not drop default
NPC indexes while the driver still owns state, and tightens nix_lf_alloc
error propagation.
Patch 7 allows loading a custom KPU profile from /lib/firmware/kpu via
module parameter kpu_profile, with cam2 / ptype_mask wiring and helpers
that share firmware-sourced vs filesystem-sourced profile layouts.
Patch 8 makes default-rule allocation, AF flow install, and PF-side RSS,
defaults, and ethtool flows respect the active CN20K MCAM keyword width
(X2 vs X4), including X4 reference-index masking and -EOPNOTSUPP when a
flow needs X4 keys on an X2-only profile.
Patch 9 replaces file-scope npc_priv and static dstats with allocation
sized from discovered bank/subbank geometry, threads npc_priv_get()
through CN20K NPC paths, and allocates dstats via devm_kzalloc for the
debugfs helper.
Patch 1 is ordered first so later patches assume a single bound AF.
Heap-backed devlink_nl_param_fill() sits immediately before the U64 array
param work so incremental builds stay stack-safe as the union grows; the
CN20K patches keep srch_order ahead of NIX LF coordination, optional KPU
profile load from firmware files, X2/X4 handling, and the npc_priv refactor
that touches the same files heavily.
====================
octeontx2-af: npc: cn20k: Allocate npc_priv and dstats dynamically.
Replace the file-scope static npc_priv with a kcalloc'd struct filled
from hardware bank/subbank geometry at init (num_banks is no longer a
const compile-time constant; drop init_done and use a non-NULL
npc_priv pointer for liveness). Thread npc_priv_get() / pointer access
through the CN20K NPC code paths, extend teardown to kfree the root
struct on failure and in npc_cn20k_deinit, and adjust MCAM section
setup to use the discovered subbank count.
Allocate MCAM debugfs dstats via devm_kzalloc instead of a static matrix,
and use the allocated backing store consistently when computing deltas
(including the counter rollover compare).
octeontx2: cn20k: Respect NPC MCAM X2/X4 profile in flows and DFT alloc
Default CN20K NPC rule allocation now keys off the active MCAM keyword
width: use X4 with a bank-masked reference index when the silicon uses
X4 keys, and X2 with the raw index otherwise (replacing the previous
always-X2 / eidx + 1 behaviour).
In the AF flow-install path, flows that need more than 256 key bits
query the NPC profile; if the platform is fixed to X2 entries, fail
with -EOPNOTSUPP instead of requesting X4. Otherwise select X4 for the
MCAM alloc.
On the PF, cache and pass the profile kw_type from npc_get_pfl_info
through otx2_mcam_pfl_info_get(), and use it when allocating MCAM
entries for RSS/defaults and when installing ethtool flows on CN20K,
including masking the reference index for X4 slot layout.
octeontx2-af: npc: Support for custom KPU profile from filesystem
Flashing updated firmware on deployed devices is cumbersome. Provide a
mechanism to load a custom KPU (Key Parse Unit) profile directly from
the filesystem at module load time.
When the rvu_af module is loaded with the kpu_profile parameter, the
specified profile is read from /lib/firmware/kpu and programmed into
the KPU registers. Add npc_kpu_profile_cam2 for the extended cam format
used by filesystem-loaded profiles and support ptype/ptype_mask in
npc_config_kpucam when profile->from_fs is set.
Usage:
1. Copy the KPU profile file to /lib/firmware/kpu.
2. Build OCTEONTX2_AF as a module.
3. Load: insmod rvu_af.ko kpu_profile=<profile_name>
octeontx2: cn20k: Coordinate default rules with NIX LF lifecycle
Add NIX_LF_DONT_FREE_DFT_IDXS so the PF can send NIX LF free during hw
reinit or teardown without the AF freeing CN20K default NPC rule indexes
while the driver still owns that state (otx2_init_hw_resources and
otx2_free_hw_resources).
On CN20K, allocate default NPC rules from NIX LF alloc before
nix_interface_init, roll back with npc_cn20k_dft_rules_free on failure,
and free from NIX LF free when the new flag is not set. Tighten
rvu_mbox_handler_nix_lf_alloc error handling: use a single rc, propagate
qmem_alloc and other errors, and set -ENOMEM only when kcalloc fails
(remove the blanket -ENOMEM at the free_mem path).
octeontx2-af: npc: cn20k: add subbank search order control
CN20K NPC MCAM is split into 32 subbanks that are searched in a
predefined order during allocation. Lower-numbered subbanks have
higher priority than higher-numbered ones.
Add a runtime "srch_order" to control the order in which
subbanks are searched during MCAM allocation.
Saeed Mahameed [Tue, 9 Jun 2026 04:04:48 +0000 (09:34 +0530)]
devlink: Implement devlink param multi attribute nested data values
Devlink param value attribute is not defined since devlink is handling
the value validating and parsing internally, this allows us to implement
multi attribute values without breaking any policies.
Devlink param multi-attribute values are considered to be dynamically
sized arrays of u64 values, by introducing a new devlink param type
DEVLINK_PARAM_TYPE_U64_ARRAY, driver and user space can set a variable
count of u64 values into the DEVLINK_ATTR_PARAM_VALUE_DATA attribute.
Implement get/set parsing and add to the internal value structure passed
to drivers.
This is useful for devices that need to configure a list of values for
a specific configuration.
example:
$ devlink dev param show pci/... name multi-value-param
name multi-value-param type driver-specific
values:
cmode permanent value: 0,1,2,3,4,5,6,7
$ devlink dev param set pci/... name multi-value-param \
value 4,5,6,7,0,1,2,3 cmode permanent
devlink: heap-allocate param fill buffers in devlink_nl_param_fill
devlink_nl_param_fill() kept two per-configuration-mode copies of
union devlink_param_value plus a struct devlink_param_gset_ctx on the
stack while building the Netlink reply. Allocate those with kcalloc()
and kzalloc_obj() instead, and route failures through a single cleanup
path so temporary buffers are always freed.
Improve MCAM visibility and field debugging for CN20K NPC.
- Extend "mcam_layout" to show enabled (+) or disabled state per entry
so status can be verified without parsing the full "mcam_entry" dump.
- Add "dstats" debugfs entry: for enabled MCAM indices, print hit deltas
since the prior read by comparing hardware counters to a per-entry
software baseline and advancing that baseline after each read (hardware
counters are not cleared).
- Add "mismatch" debugfs entry: lists MCAM entries that are enabled
but not explicitly allocated, helping diagnose allocation/field issues.
On Octeon series SoCs, the AF is an integrated device within the SoC, and
hardware resources such as NPC, NIX and related blocks are global and
coordinated by the AF driver. Physical and virtual functions request those
resources via AF mailbox messages, so two AF driver instances cannot both
own that global state; firmware exposes only one AF PCI function at boot
and any further octeontx2-af PCI probe returns -EBUSY so software matches
the single-AF model.
====================
net/stmmac: Fixes for maximum TX/RX queues to use by driver
When contributing other changes preparing functions for new XGMAC hardware
https://lore.kernel.org/netdev/20260601162537.553512-1-j.raczynski@samsung.com/
there have been reports by Sashiko AI.
All of issues are wrong DTS configuration, but kernel needs to handle it.
====================
Jakub Raczynski [Thu, 11 Jun 2026 11:33:58 +0000 (13:33 +0200)]
net/stmmac: Apply MTL_MAX queue limit if config missing
When "snps,rx-queues-to-use" or "tx-queues-to-use" config in DTS is provided
current code will apply U8_MAX value for queues_to_use if there is input of
higher value. But actual maximum number of supported queues is set via
macro MTL_MAX_RX_QUEUES and MTL_MAX_TX_QUEUES, which currently have value of 8.
This value of U8_MAX will be capped to value provided by core in DMA
capabilities (dma_conf), but it does so only if core provides it.
This is true for XGMAC (dwxgmac2) and some GMAC (dwmac4),
but not for (dwmac1000). This capping is at later stage in stmmac_hw_init(),
and during stmmac_mtl_setup() we might parse fields outside allocated memory
if queues_to_use is over defines MTL_MAX_ values,
for example following rx_queues_cfg is array of size of MTL_MAX_RX_QUEUES.
Fix this by capping value to MTL_MAX during config parsing.
Jakub Raczynski [Thu, 11 Jun 2026 11:33:57 +0000 (13:33 +0200)]
net/stmmac: Apply TBS config only to used queues
While opening stmmac driver, there is enabling of TBS (Time-Based Scheduling)
option in dma config. Currently this is executed for all possible TX queues via
MTL_MAX_TX_QUEUES macro, but actual number of queues used might differ.
While setting this is generally harmless, since memory for MTL_MAX_TX_QUEUES
is allocated, it is incorrect, because it prepares config for unused queues.
Change this to apply tbs config only to tx_queues_to_use.
Co-developed-by: Chang-Sub Lee <cs0617.lee@samsung.com> Signed-off-by: Chang-Sub Lee <cs0617.lee@samsung.com> Signed-off-by: Jakub Raczynski <j.raczynski@samsung.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260611113358.3379518-2-j.raczynski@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wayen.Yan [Thu, 11 Jun 2026 23:09:56 +0000 (07:09 +0800)]
net: airoha: Fix debugfs new-tuple display for IPv4 ROUTE entries
In airoha_ppe_debugfs_foe_show(), the second switch statement falls
through from PPE_PKT_TYPE_IPV4_HNAPT/DSLITE to PPE_PKT_TYPE_IPV4_ROUTE,
accessing hwe->ipv4.new_tuple for all three types. However, IPv4 ROUTE
(3-tuple) entries do not contain a valid new_tuple — this field is only
meaningful for NATted flows (HNAPT/DSLITE). For ROUTE entries, the
memory at the new_tuple offset holds routing information, not NAT data,
so displaying "new=" produces garbage output.
Display new_tuple only for HNAPT and DSLITE, and let IPV4_ROUTE fall
through to the default case.
Wayen.Yan [Thu, 11 Jun 2026 23:09:13 +0000 (07:09 +0800)]
net: airoha: Fix register index for Tx-fwd counter configuration
In airoha_qdma_init_qos_stats(), the Tx-fwd counter configuration
register uses the same index (i << 1) as the Tx-cpu counter, which
overwrites the Tx-cpu configuration. The Tx-fwd counter value register
correctly uses (i << 1) + 1, so the configuration register should use
the same index.
Fix the REG_CNTR_CFG index from (i << 1) to ((i << 1) + 1) so that
the Tx-fwd counter is properly configured instead of clobbering the
Tx-cpu counter config.
Li Xiasong [Thu, 11 Jun 2026 13:56:47 +0000 (21:56 +0800)]
tipc: restrict socket queue dumps in enqueue tracepoints
tipc_sk_enqueue() runs with sk->sk_lock.slock held while the socket is
owned by user context. The spinlock protects the backlog queue in this
path, but it does not serialize against the socket owner consuming or
purging sk_receive_queue.
The TIPC_DUMP_ALL tracepoints in tipc_sk_enqueue() also dump
sk_receive_queue and can therefore dereference skbs that the socket
owner has already dequeued or freed. Restrict these dumps to
TIPC_DUMP_SK_BKLGQ, which matches the queue protected by the held
spinlock.
Keep the change limited to the enqueue path, where the unsafe queue dump
is reachable while the socket is owned by user context.
Fixes: 01e661ebfbad ("tipc: add trace_events for tipc socket") Cc: stable@vger.kernel.org Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Link: https://patch.msgid.link/20260611135647.3666727-1-lixiasong1@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Thu, 11 Jun 2026 10:43:00 +0000 (12:43 +0200)]
net: airoha: better handle MIBs for GDM ports with multiple devs attached
In the context of a GDM port that can have multiple net_devices attached
(GDM3 and GDM4), the HW counters (MIBs) are global for the GDM port.
This cause duplicated stats reported to the kernel for the related
net_device.
The SoC supports a split MIB feature where each counter is tracked based
on the relevant HW channel (NBQ) to account for this scenario and
provide a way to select the related counter on accessing the MIB
registers.
Enable this feature for GDM3 and GDM4 and configure the relevant HW
channel before updating the HW stats to report correct HW counter to the
kernel for the related interface.
Move the stats struct from port to dev since HW counter are now specific
to the network device instead of the GDM port. Refactor
airoha_update_hw_stats() to take airoha_eth and airoha_gdm_port
parameters since the function operates on the entire port.
Ratheesh Kannoth [Thu, 11 Jun 2026 08:33:30 +0000 (14:03 +0530)]
octeontx2-af: fix NPC mailbox codes in mbox.h
Several NPC mailbox command IDs in the 0x601x range were assigned out of
order. Renumber and reorder the M() definitions so each opcode matches
the stable contract expected by userspace tools and applications.
Fixes: 4e527f1e5c15 ("octeontx2-af: npc: cn20k: Add new mailboxes for CN20K silicon") Cc: Suman Ghosh <sumang@marvell.com> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260611083330.1652181-1-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Frank Li [Wed, 10 Jun 2026 15:05:30 +0000 (11:05 -0400)]
dt-bindings: net: dsa: Convert lan9303.txt to yaml format
Convert lan9303.txt to yaml format to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx53-kp-hsc.dtb: /soc/bus@50000000/i2c@53fec000/switch@a: failed to match any schema with compatible: ['smsc,lan9303-i2c']
Additional changes:
- rename switch-phy to switch in example.
Ovidiu Panait [Wed, 10 Jun 2026 08:52:38 +0000 (08:52 +0000)]
net: bcmgenet: Use weighted round-robin TX DMA arbitration
Under heavy network traffic, we observed sporadic TX queue timeouts on the
Raspberry Pi 4. The timeouts can be reproduced by stress testing the TX
path with multiple concurrent iperf UDP streams:
iperf3 -c <ip> -u -b0 -P16 -t60
NETDEV WATCHDOG: CPU: 0: transmit queue 0 timed out 2044 ms
NETDEV WATCHDOG: CPU: 3: transmit queue 0 timed out 2004 ms
Investigation showed that the timeouts are caused by the priority-based
arbiter. Under heavy load the highest priority queue starves the lower
priority ones, causing timeouts. The TX strict priority arbiter is not
suitable for the default use case where all the traffic gets spread
across all the TX queues.
Therefore, to fix this, switch the TX DMA arbiter to Weighted Round-Robin,
which services all queues, so they do not stall. The weights were chosen
to follow the existing priority scheme: q0 gets the smallest weight, while
q1-4 get the bulk of the TX bandwidth.
selftests: net: add test for IPv4 devconf netlink notifications
Introduce a new test, `ipv4_devconf_notify`, to verify that the kernel
sends the appropriate netlink notifications when IPv4 devconf parameters
are modified.
The test depends on the newly introduced iproute2 command:
ipv4: handle devconf post-set actions on netlink updates
When IPv4 device configuration parameters are updated via netlink, the
kernel currently only updates the value. This bypasses several
post-modification actions that occur when these same parameters are
updated via sysctl, such as flushing the routing cache or emitting
RTM_NEWNETCONF notifications.
This patch addresses the inconsistency by calling the
devinet_conf_post_set() helper inside inet_set_link_af(). If a flush is
required, we defer it until the netlink attribute parsing loop
completes.
This ensures consistent behavior and side-effects for devconf changes,
regardless of whether they are initiated via sysctl or netlink.
The logic for handling IPv4 devconf sysctls is scattered. Notification
and cache flushes are managed in devinet_conf_proc(), while a separate
ipv4_doint_and_flush() function and DEVINET_SYSCTL_FLUSHING_ENTRY macro
is used for properties that solely require a cache flush.
This patch refactors the sysctl handling by introducing a centralized
helper, devinet_conf_post_set(). This new function evaluates the changed
attribute and handles all necessary operations like triggering netlink
notifications. It returns a boolean indicating whether a routing cache
flush is required.
Note that the boolean is necessary as this function will be re-used for
netlink IPv4 devconf handling where the cache flushing must wait until
all the attributes have been processed.
Finally, this is introducing a small change in behavior for
IPV4_DEVCONF_ROUTE_LOCALNET. As commit d0daebc3d622 ("ipv4: Add
interface option to enable routing of 127.0.0.0/8") intended, the cache
flush should only be performed when ROUTE_LOCALNET changes from 1 to 0.
Unfortunately, this was not true because while implementing it the
DEVINET_SYSCTL_FLUSHING_ENTRY was used for the attribute, making the
code related to it on devinet_conf_proc() dead.
IPV4_DEVCONF_FORWARDING is still being handled separately as it requires
more operations.
Matthieu Buffet [Thu, 11 Jun 2026 16:21:02 +0000 (18:21 +0200)]
landlock: Add UDP send+connect access control
Add support for a second fine-grained UDP access right.
LANDLOCK_ACCESS_NET_CONNECT_SEND_UDP controls the ability to set the
remote port of a socket (via connect()) and to specify an explicit
destination when sending a datagram, to override any remote peer set on
a UDP socket (e.g. in sendto() or sendmsg()). It will be useful for
applications that send datagrams, and for some servers too (those
creating per-client sockets, which want to receive traffic only from a
specific address).
Similarly as for bind(), this access control is performed when
configuring sockets, not in hot code paths.
Add detection of when autobind is about to be required, and deny the
operation if the process would not be allowed to call bind(0)
explicitly. Autobind can only be performed in udp_lib_get_port() from
code paths already controlled by LSM hooks: when connect()ing, sending a
first datagram, and in some splice() EOF edge case which, afaiu, can
only happen after a remote peer has been set. This invariant needs to be
preserved to keep bind policies actually enforced.
Matthieu Buffet [Thu, 11 Jun 2026 16:21:01 +0000 (18:21 +0200)]
landlock: Add UDP bind() access control
Add support for a first fine-grained UDP access right.
LANDLOCK_ACCESS_NET_BIND_UDP controls the ability to set the local port
of a UDP socket (via bind()). It will be useful for servers (to start
receiving datagrams), and for some clients that need to use a specific
source port (e.g. mDNS requires to use port 5353)
For obvious performance concerns, access control is only enforced when
configuring sockets, not when using them for common send/recv
operations.
Bump ABI to allow userspace to detect and use this new right.
Matthieu Buffet [Tue, 9 Jun 2026 21:15:10 +0000 (23:15 +0200)]
landlock: Fix unmarked concurrent access to socket family
Socket family is read (twice) in a context where the socket is not
locked, so another thread can setsockopt(IPV6_ADDRFORM) to write it
concurrently. Add needed READ_ONCE() annotation.
Use the proper macro to access __sk_common.skc_family like everywhere
else.
Maximilian Heyne [Fri, 29 May 2026 20:03:41 +0000 (20:03 +0000)]
selftests/landlock: Explicitly disable audit in teardowns
I'm seeing sporadic selftest failures, such as
# RUN scoped_audit.connect_to_child ...
# scoped_abstract_unix_test.c:314:connect_to_child:Expected 0 (0) == records.access (8)
# connect_to_child: Test failed
# FAIL scoped_audit.connect_to_child
not ok 19 scoped_audit.connect_to_child
This seems similar to what commit 3647a4977fb73d ("selftests/landlock:
Drain stale audit records on init") tried to fix. However, the added
drain loop is not effective. When setting the AUDIT_STATUS_PID, the
kauditd_thread is woken up starting to send messages from the hold queue
to the netlink. Depending on scheduling of this kthread not all messages
might be send via the netlink in the 1 us interval.
Therefore, instead of trying to drain the queue, let's just disable
audit when running non-audit tests or more precisely disable it after
audit-tests. This way we won't generate any new audit message that could
interfere with the other tests.
The comment saying that on process exit audit will be disabled is wrong.
The closed file descriptor just causes an auditd_reset(), not a
disablement. So future messages will be queued in the hold queue.
Cc: stable@vger.kernel.org Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Link: https://patch.msgid.link/20260529-welsh-nagoya-b4d9ca60@mheyne-amazon
[mic: Fix FD leak, update subject, call audit_cleanup() in audit_exec teardown] Signed-off-by: Mickaël Salaün <mic@digikod.net>
Bryam Vargas [Thu, 4 Jun 2026 23:17:05 +0000 (23:17 +0000)]
selftests/landlock: Test SCOPE_SIGNAL on the SIGIO/fowner pgid path
Add regression tests for the LANDLOCK_SCOPE_SIGNAL handling of the
asynchronous SIGIO delivery path (fcntl(F_SETOWN)) with a process-group
owner.
sigio_to_pgid_members covers the bypass: a sandboxed process at the head
of its process group's PGID hlist (the default after fork()) arms
F_SETOWN(-pgrp) + O_ASYNC and triggers the fan-out; the in-domain owner
must be signaled (proving the trigger fired) while the non-sandboxed
member of the group, outside the domain, must not.
sigio_to_pgid_self covers the same-process guarantee: the owner is
registered from a sandboxed non-leader thread, whose domain differs from
the thread-group leader the kernel signals for a process-group owner.
That leader belongs to the owner's own process and must still be
signaled.
Without the fix the first test sees the out-of-domain member signaled
and the second sees the owner's own leader denied.
Bryam Vargas [Thu, 4 Jun 2026 23:16:56 +0000 (23:16 +0000)]
landlock: Fix LANDLOCK_SCOPE_SIGNAL bypass on the SIGIO path
LANDLOCK_SCOPE_SIGNAL must prevent a sandboxed process from signaling
processes outside its Landlock domain. It can be bypassed through the
asynchronous SIGIO delivery path.
A sandboxed process that owns any file or socket can arm it with
fcntl(fd, F_SETOWN, -pgid), fcntl(fd, F_SETSIG, SIGKILL) and O_ASYNC, so
that an I/O event makes the kernel deliver the chosen signal to the
whole process group. As the head of its process group's task list (the
default position right after fork()) that group can also hold the
non-sandboxed process that launched it, e.g. a supervisor or a security
monitor. The sandbox can thus kill or signal the processes
LANDLOCK_SCOPE_SIGNAL is meant to protect from it.
The scope is enforced in hook_file_send_sigiotask() against the Landlock
domain recorded at F_SETOWN time, not the live domain of the sender.
control_current_fowner() decides whether to record that domain and skips
recording it when the fowner target is in the caller's thread group,
which is safe only for a single-task target (PIDTYPE_PID, PIDTYPE_TGID).
For a process group (PIDTYPE_PGID) pid_task() returns only one member;
recording is skipped whenever that member shares the caller's thread
group, and hook_file_send_sigiotask() then lets the signal fan out to
the whole group unchecked.
Record the domain for every non single-process target so the scope is
enforced against each group member at delivery time.
That recording is necessary but not sufficient on its own: the kernel
signals a process group through its members' thread-group leaders, and
the leader of the registrant's own process can carry a different
Landlock domain than the sibling thread that armed the owner.
domain_is_scoped() would then deny that leader, even though commit 18eb75f3af40 ("landlock: Always allow signals between threads of the
same process") requires same-process delivery to be allowed.
hook_task_kill() avoids this by evaluating same_thread_group() live, per
recipient; the SIGIO path instead delegates the whole decision to a
single registration-time check, which a process-group fan-out cannot
honor.
So also record the registrant's thread group next to its domain and
exempt it at delivery: hook_file_send_sigiotask() allows the signal
whenever the recipient belongs to the registrant's own process,
restoring the same-process guarantee while keeping out-of-domain group
members blocked. The direct kill() path (hook_task_kill) already
evaluates the live domain and is unaffected.
Fixes: 18eb75f3af40 ("landlock: Always allow signals between threads of the same process") Cc: stable@vger.kernel.org Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Reviewed-by: Günther Noack <gnoack3000@gmail.com> Link: https://patch.msgid.link/56bffc24f3d0d08b45a686a48e99766b0a0821fa.1780614610.git.hexlabsecurity@proton.me
[mic: Check pid_type earlier and improve comment, fix commit message,
fix comment formatting] Signed-off-by: Mickaël Salaün <mic@digikod.net>
Landlock provides best-effort sandboxing across ABI versions:
applications request the rights they need, and on older kernels the
unsupported rights are silently dropped from handled_access_* by the
documented compatibility switch. The recommended pattern for
landlock_add_rule(2) calls is to mirror this filtering at the rule
level, which wasn't explicitly described in the exemple.
Show the pattern explicitly in the filesystem and network rule examples
by masking each rule's allowed_access against the ruleset's
handled_access_* and adding the rule only when at least one bit remains
set. This makes the recommended best-effort pattern self-documenting.
Mickaël Salaün [Wed, 13 May 2026 18:03:08 +0000 (20:03 +0200)]
landlock: Account all audit data allocations to user space
Mark the kzalloc_flex() of struct landlock_details with
GFP_KERNEL_ACCOUNT so the allocation is charged to the calling task,
like the other Landlock per-domain allocations which have used
GFP_KERNEL_ACCOUNT forever.
Every property of landlock_details is caller-attributable: allocated by
landlock_restrict_self(2), owned by the caller's landlock_hierarchy,
contents are the caller's pid, uid, comm, and exe_path, lifetime bounded
by the caller's domain. While the caller may not know nor control the
size of this allocation (i.e. exe_path), this data should still be
accounted for it.
The deciding factor is whether userspace can trigger the allocation, not
whether the size of the data is known nor controlled by the caller.
This aligns with the kmemcg accounting policy established by commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg").
No new failure modes: the hierarchy and ruleset are allocated before
details and are already accounted, so landlock_restrict_self(2) already
returns -ENOMEM under memcg pressure. This change widens that existing
failure window slightly; it does not introduce a new error code.
Cc: Günther Noack <gnoack@google.com> Cc: Paul Moore <paul@paul-moore.com> Cc: stable@vger.kernel.org Fixes: 1d636984e088 ("landlock: Add AUDIT_LANDLOCK_DOMAIN and log domain status") Link: https://patch.msgid.link/20260513180309.165840-1-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>
Mickaël Salaün [Fri, 12 Jun 2026 17:27:55 +0000 (19:27 +0200)]
landlock: Set audit_net.sk for socket access checks
Set audit_net.sk in current_check_access_socket() to provide the socket
object to audit_log_lsm_data(). This makes Landlock consistent with
AppArmor, which always sets .sk for socket operations, and with
SELinux's generic socket permission checks.
The socket's local and foreign address information (laddr, lport, faddr,
fport) is logged by the shared lsm_audit.c infrastructure when the
socket has bound or connected state. Fields with zero values are
suppressed by print_ipv4_addr()/print_ipv6_addr(), so the audit output
is unchanged for the common case of bind denials on unbound sockets.
For connect denials after a prior bind, the bound local address (laddr,
lport) appears before the existing sockaddr fields (daddr, dest).
No existing fields are removed or reordered, and the new field names
(laddr, lport, faddr, fport) are standard audit fields already emitted
by other LSMs through the same lsm_audit.c code path.
Add a connect_tcp_bound audit test that binds to an allowed port and
then connects to a denied one, verifying that the denial record reports
laddr/lport from the bound socket in addition to the connect
destination.
Eric Dumazet [Mon, 8 Jun 2026 15:14:52 +0000 (15:14 +0000)]
tcp: refine tcp_sequence() for the FIN exception
Commit 0e24d17bd966 ("tcp: implement RFC 7323 window retraction
receiver requirements") removed the special FIN case that
was added in commit 1e3bb184e941 ("tcp: re-enable acceptance of
FIN packets when RWIN is 0").
If a peer sends a segment containing data and a FIN flag before
it learns about our window retraction and has a buggy TCP stack,
it might place the FIN one byte beyond what it thinks is the
right edge of the window (i.e., max_window_edge + 1).
The data portion (end_seq - th->fin) will end exactly at max_window_edge.
In this case, we will drop the packet if our receive queue is not empty,
even though the data was sent within the window we previously allowed.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Baatz <gmbnomis@gmail.com> Link: https://patch.msgid.link/20260608151452.706822-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
dpll/ice: Add generic DPLL type and full TX reference clock control for E825
NOTE: This series is intentionally submitted on net-next (not
intel-wired-lan) as early feedback of DPLL subsystem changes is
welcomed. In the past possible approaches were discussed in [1].
This series adds TX reference clock support for E825 devices and exposes
TX clock selection and synchronization status via the Linux DPLL
subsystem.
E825 hardware contains a dedicated TX clock domain with per-port source
selection behavior that is distinct from PPS handling and from board-level
EEC distribution. TX reference clock selection is device-wide, shared
across ports, and mediated by firmware as part of link bring-up. As a
result, TX clock selection intent may differ from effective hardware
configuration, and software must verify outcome after link-up.
To support this, the series extends the DPLL core and the ice driver
incrementally. The series also introduces DPLL_TYPE_GENERIC as a broad
UAPI class for DPLL instances outside PPS/EEC categories. The intent is
to keep type naming reusable and scalable across different ASIC
topologies while preserving functional discoverability via
driver/device context and pin topology.
This follows netdev discussion guidance that UAPI type naming should avoid
location-specific or vendor-specific taxonomy, because such labels do not
scale across different ASIC designs. The function of a given DPLL instance
is already discoverable from driver/device context and pin topology, and
does not require an additional narrow type identifier in UAPI.
At the same time, a separate DPLL object is still needed for E825 TX clock
control/reporting semantics. Using DPLL_TYPE_GENERIC provides a reusable
class for devices outside PPS/EEC without overfitting UAPI naming to one
topology.
The relevant discussion is in [2].
Series content
- add a new generic DPLL type for devices outside PPS/EEC classes;
- relax DPLL pin registration rules for firmware-described shared pins
and extend pin notifications with a source identifier;
- allow dynamic state control of SyncE reference pins where hardware
supports it;
- add CPI infrastructure for PHY-side TX clock control on E825C;
- introduce a TX-clock DPLL device and TX reference clock pins
(EXT_EREF0 and SYNCE) in the ice driver;
- extend the Restart Auto-Negotiation command to carry a TX reference
clock index;
- implement hardware-backed TX reference clock switching, post-link
verification, and TX synchronization reporting.
TXCLK pins report TX reference topology only. Actual synchronization
success is reported via DPLL lock status, updated after hardware
verification: external TX references report LOCKED, while the internal
ENET/TXCO source reports UNLOCKED.
This provides reliable TX reference selection and observability on E825
devices using standard DPLL interfaces, without conflating user intent
with effective hardware behavior.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:45 +0000 (20:30 +0200)]
ice: implement E825 TX ref clock control and TXC hardware sync status
Build on the previously introduced TXC DPLL framework and implement
full TX reference clock control and hardware-backed synchronization
status reporting for E825 devices.
E825 firmware may accept or override TX reference clock requests based
on device-wide routing constraints and link conditions. Because the
final selection becomes visible only after a link-up event, the driver
splits the observation into two complementary signals:
- TXCLK pin state reflects the requested TX reference clock
(pf->ptp.port.tx_clk_req). After a link-up, the value is reconciled
against the SERDES reference selector by
ice_txclk_update_and_notify(); if firmware or auto-negotiation
selected a different clock, tx_clk_req is overwritten so that pin
state converges to the actual hardware selection.
- TXC DPLL lock status reflects hardware synchronization:
* LOCKED when an external TX reference is in use
* UNLOCKED when falling back to ENET/TXCO, or when a requested
external reference has not (yet) been accepted by hardware.
Userspace observing only pin state therefore sees user intent, while
lock status is the authoritative indicator of whether the requested
clock is actually selected and synchronizing. This matches the DPLL
subsystem model where pin state describes topology and device lock
status describes signal quality.
TX reference selection topology:
- External references (SYNCE, EREF0) are represented as TXCLK pins
- The internal ENET/TXCO clock has no pin representation; when
selected, all TXCLK pins are reported DISCONNECTED
With this change, TX reference clocks on E825 devices can be reliably
selected, observed via standard DPLL interfaces, and monitored for
effective synchronization through TXC DPLL lock status.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:44 +0000 (20:30 +0200)]
ice: add Tx reference clock index handling to AN restart command
Extend the Restart Auto-Negotiation (AN) AdminQ command with a new
parameter allowing software to specify the Tx reference clock index to
be used during link restart.
This patch:
- adds REFCLK field definitions to ice_aqc_restart_an
- updates ice_aq_set_link_restart_an() to take a new refclk parameter
and properly encode it into the command
- keeps legacy behavior by passing REFCLK_NOCHANGE where appropriate
This prepares the driver for configurations requiring dynamic selection
of the Tx reference clock as part of the AN flow.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-13-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:43 +0000 (20:30 +0200)]
ice: implement CPI support for E825C
Add full CPI (Converged PHY Interface) command handling required for
E825C devices. The CPI interface allows the driver to interact with
PHY-side control logic through the LM/PHY command registers, including
enabling/disabling/selection of PHY reference clock.
This patch introduces:
- a new CPI subsystem (ice_cpi.c / ice_cpi.h) implementing the CPI
request/acknowledge state machine, including REQ/ACK protocol,
command execution, and response handling
- helper functions for reading/writing PHY registers over Sideband
Queue
- CPI command execution API (ice_cpi_exec) and a helper for enabling or
disabling Tx reference clocks (CPI 0xF1 opcode 'Config PHY clocking')
- assurance of CPI transaction serialization into the CPI core.
CPI REQ/ACK is a multi-step handshake and must be executed
atomically per PHY. Centralize the lock in ice_cpi_exec() and
use adapter-scoped per-PHY mutexes, which match the hardware sharing
model across PFs.
- addition of the non-posted write opcode (wr_np) to SBQ
- Makefile integration to build CPI support together with the PTP stack
This provides the infrastructure necessary to support PHY-side
configuration flows on E825C and is required for advanced link control
and Tx reference clock management.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-12-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:42 +0000 (20:30 +0200)]
ice: introduce TXC DPLL device and TX ref clock pin framework for E825
E825 devices provide a dedicated TX clock (TXC) domain which may be
driven by multiple reference clock sources, including external board
references and port-derived SyncE. To support future TX clock control
and observability through the Linux DPLL subsystem, introduce a
separate TXC DPLL device (of DPLL_TYPE_GENERIC) and a framework for
representing TX reference clock inputs.
This change adds a new internal DPLL pin type (TXCLK) and registers
TX reference clock pins for E825-based devices:
- EXT_EREF0: a board-level external electrical reference
- SYNCE: a port-derived SyncE reference described via firmware nodes
The TXC DPLL device is created and managed alongside the existing
PPS and EEC DPLL instances. TXCLK pins are registered directly or
deferred via a notifier when backed by fwnode-described pins.
A per-pin attribute encodes the TX reference source associated with
each TXCLK pin.
At this stage, TXCLK pin state callbacks and TXC DPLL lock status
reporting are implemented as placeholders. Pin state getters always
return DISCONNECTED, and the TXC DPLL is initialized in the UNLOCKED
state. No hardware configuration or TX reference switching is
performed yet.
This patch establishes the structural groundwork required for
hardware-backed TX reference selection, verification, and
synchronization status reporting, which will be implemented in
subsequent patches.
Also signal dpll_init from the fwnode pin init error path so any
notifier worker already blocked on it can drain, avoiding a
flush_workqueue() deadlock during teardown.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-11-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:41 +0000 (20:30 +0200)]
dpll: allow fwnode pins to attempt state change without capability bit
Pins registered with an fwnode may have .state_on_dpll_set implemented
without advertising DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE upfront.
Requiring the bit for fwnode pins ties firmware description to driver
implementation details unnecessarily.
Relax the capability check in dpll_pin_state_set() and
dpll_pin_on_pin_state_set(): when a pin has an associated fwnode, bypass
the capability gate and let the ops layer decide, returning -EOPNOTSUPP
if .state_on_dpll_set is absent. Non-fwnode pins retain the original
strict behavior.
This is used later in the series by the SyncE_Ref output pin, which
relies on the fwnode path for state control.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:40 +0000 (20:30 +0200)]
dpll: extend pin notifier with notification source ID
Extend the DPLL pin notification API to include a source identifier
indicating where the notification originates. This allows notifier
consumers to distinguish between notifications coming from
an associated DPLL instance, a parent pin, or the pin itself.
A new field, src_clock_id, is added to struct dpll_pin_notifier_info
and is passed through all pin-related notification paths. Callers of
dpll_pin_notify() are updated to provide a meaningful source identifier
based on their context:
- pin registration/unregistration uses the DPLL's clock_id,
- pin-on-pin operations use the parent pin's clock_id,
- pin changes use the pin's own clock_id.
As introduced in the commit ("dpll: allow registering FW-identified pin
with a different DPLL"), it is possible to share the same physical pin
via firmware description (fwnode) with DPLL objects from different
kernel modules. This means that a given pin can be registered multiple
times.
Driver such as ICE (E825 devices) rely on this mechanism when listening
for the event where a shared-fwnode pin appears, while avoiding reacting
to events triggered by their own registration logic.
This change only extends the notification metadata and does not alter
existing semantics for drivers that do not use the new field.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-9-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:39 +0000 (20:30 +0200)]
dpll: balance create/delete notifications in __dpll_pin_(un)register
__dpll_pin_register() emits dpll_pin_create_ntf() internally, but
__dpll_pin_unregister() left the matching delete to its callers. The
counts then diverge on dpll_pin_on_pin_register() rollback and on
dpll_pin_on_pin_unregister(), leaking stale notifications.
Emit dpll_pin_delete_ntf() inside __dpll_pin_unregister() and drop the
now-redundant call in dpll_pin_unregister().
Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions") Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-8-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:38 +0000 (20:30 +0200)]
dpll: guard sync-pair removal on full pin unregister
__dpll_pin_unregister() wiped the global sync-pair state on every
(dpll, ops, priv, cookie) tuple removed from a pin. When a pin is
registered multiple times and only one registration is being torn
down, this dropped sync-pair pairings still in use by the surviving
registrations.
Move dpll_pin_ref_sync_pair_del() inside the xa_empty(&pin->dpll_refs)
branch so it only runs when the last registration is gone, alongside
clearing the DPLL_REGISTERED mark.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:37 +0000 (20:30 +0200)]
dpll: emit per-dpll delete notifications in dpll_pin_on_pin_unregister()
dpll_pin_on_pin_register() emits a creation notification for every
parent->dpll_refs entry, but dpll_pin_on_pin_unregister() emitted only
one deletion notification outside the loop. When a pin is registered
against multiple parent dplls, userspace sees N creates but a single
delete and leaks per-dpll state.
Move dpll_pin_delete_ntf() into the loop and call it before
__dpll_pin_unregister() so the DPLL_REGISTERED mark is still set when
dpll_pin_available() is consulted.
Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-6-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:36 +0000 (20:30 +0200)]
dpll: send delete notification before unregister in on-pin rollback
The rollback path in dpll_pin_on_pin_register() called
__dpll_pin_unregister() before dpll_pin_delete_ntf(). When the
unregister dropped the pin's last DPLL reference it cleared the
DPLL_REGISTERED mark in dpll_pin_xa, so the subsequent
dpll_pin_event_send() failed dpll_pin_available() and aborted with
-ENODEV. As a result userspace was never notified of the rollback
deletion and remained out of sync with the kernel.
Send the delete notification first, matching the order used by
dpll_pin_unregister() and dpll_pin_on_pin_unregister().
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:35 +0000 (20:30 +0200)]
dpll: fix stale iteration in dpll_pin_on_pin_unregister()
Neither parent->dpll_refs nor pin->dpll_refs on its own is a correct
iteration target at unregister time:
- pin->dpll_refs includes DPLLs the child was registered against
via a different parent or directly; blind unregister WARNs on
the cookie miss in dpll_xa_ref_pin_del().
- parent->dpll_refs reflects the parent's current attachments, not
those at child-register time. Another driver may have (un)reg'd
the parent against additional DPLLs in the meantime, so we miss
registrations that exist and visit DPLLs that have none.
Walk pin->dpll_refs and use dpll_pin_registration_find() to filter
to entries whose cookie is this parent. Symmetric with
dpll_pin_on_pin_register(), correct under any subsequent change to
parent->dpll_refs.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:34 +0000 (20:30 +0200)]
dpll: allow registering FW-identified pin with a different DPLL
Relax the (module, clock_id) equality requirement when registering a
pin identified by firmware (pin->fwnode). Some platforms associate a
FW-described pin with a DPLL instance that differs from the pin's
(module, clock_id) tuple. For such pins, permit registration without
requiring the strict match. Non-FW pins still require equality.
Keep netlink pin module reporting/filtering safe for this relaxed
registration model by caching the module name in the pin object at
allocation time and using the cached string in netlink paths.
This avoids dereferencing pin->module after provider module teardown.
Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-3-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:33 +0000 (20:30 +0200)]
dpll: add generic DPLL type
Add DPLL_TYPE_GENERIC to represent DPLL devices which do not fit the
existing PPS or EEC classes.
The UAPI type is intentionally generic. During netdev discussion,
maintainers pointed out that introducing identifiers tied to a specific
placement or single design does not scale across ASICs and vendors.
The role of a DPLL is already inferable from the spawning driver,
bus device, and pin topology, without encoding additional
purpose-specific taxonomy in the type name.
Using a generic type keeps the UAPI extensible and avoids premature
naming that may become incorrect as new hardware topologies are
exposed through the DPLL subsystem.
Expose the new type through UAPI and netlink specification as "generic".
1) Replace the open-coded manual cleanup in xfrm_add_policy() error
path with xfrm_policy_destroy() for consistency with
xfrm_policy_construct().
From Deepanshu Kartikey.
2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since
u32 is excessive for traffic flow confidentiality padding.
From David Ahern.
3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that
allows migrating individual IPsec SAs independently of
their policies. The existing XFRM_MSG_MIGRATE is tightly coupled
to policy+SA migration, lacks SPI for unique SA identification,
and cannot express reqid changes or migrate Transport mode
selectors. The new interface identifies the SA via SPI and mark,
supports reqid changes, address family changes, encap removal,
and uses an atomic create+install flow under x->lock to prevent
SN/IV reuse during AEAD SA migration.
From Antony Antony.
* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
xfrm: make xfrm_dev_state_add xuo parameter const
xfrm: extract address family and selector validation helpers
xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
xfrm: move encap and xuo into struct xfrm_migrate
xfrm: add error messages to state migration
xfrm: add state synchronization after migration
xfrm: check family before comparing addresses in migrate
xfrm: split xfrm_state_migrate into create and install functions
xfrm: rename reqid in xfrm_migrate
xfrm: fix NAT-related field inheritance in SA migration
xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
xfrm: add extack to xfrm_init_state
xfrm: remove redundant assignments
xfrm: Reject excessive values for XFRMA_TFCPAD
xfrm: cleanup error path in xfrm_add_policy()
====================
Ruoyu Wang [Fri, 12 Jun 2026 03:56:13 +0000 (11:56 +0800)]
net: wwan: t7xx: check skb_clone in control TX
t7xx_port_ctrl_tx() clones each skb fragment before passing it to the
port transmit path. The clone is used immediately to set cloned->len, so
an skb_clone() failure results in a NULL pointer dereference.
Check the clone before using it. If previous fragments were already
queued, preserve the driver's existing partial-write behavior by
returning the number of bytes submitted so far.
Fixes: 36bd28c1cb0d ("wwan: core: Support slicing in port TX flow of WWAN subsystem") Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Link: https://patch.msgid.link/20260612035613.1192486-1-ruoyuw560@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
vsock: consolidate acceptq accounting into core helpers
These patches follow up on commit c05fa14db43e
("vsock/vmci: fix sk_ack_backlog leak on failed handshake")
by consolidating sk_acceptq_added() and sk_acceptq_removed() into
the core vsock helpers so transports cannot forget them.
Raf Dickson [Fri, 12 Jun 2026 04:52:16 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_removed() into vsock_remove_pending()
Callers of vsock_remove_pending() must also call sk_acceptq_removed()
to keep sk_ack_backlog consistent. Move the call into
vsock_remove_pending() itself to make it automatic and prevent future
callers from forgetting it.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-5-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:15 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
virtio and hyperv call sk_acceptq_added() immediately before
vsock_enqueue_accept(). Move the call into vsock_enqueue_accept()
itself so callers cannot forget it and the accounting is consistent.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-4-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:14 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_added() into vsock_add_pending()
Move sk_acceptq_added() into vsock_add_pending() so callers cannot
forget it. vmci is the only transport using the pending list and
is updated accordingly.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-3-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:13 +0000 (04:52 +0000)]
vsock: introduce vsock_pending_to_accept() helper
Add vsock_pending_to_accept() to move a socket directly from the
pending list to the accept queue in a single operation, avoiding
the sock_put/sock_hold dance and the sk_acceptq_removed()/
sk_acceptq_added() pair that would otherwise be needed when
calling vsock_remove_pending() followed by vsock_enqueue_accept().
Use it in vmci_transport_recv_connecting_server() where a completed
handshake transitions the socket from pending to accept queue.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-2-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:58:42 +0000 (04:58 +0000)]
vsock: use sk_acceptq_is_full() helper in all transports
Replace the open-coded backlog check with sk_acceptq_is_full().
The helper uses > instead of >=, which is the correct comparison
per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
backlog changes."), and adds READ_ONCE() for proper memory ordering.
Florian Westphal [Fri, 12 Jun 2026 09:22:09 +0000 (11:22 +0200)]
selftests: netfilter: add phony nft_offload test
... "phony", because its not testing offloads, it tests the control
plane code. Also test error unwind via fault injection framework.
For a proper test, real hardware would be required given we'd have
check if 'previously handed off to hardware' offload commands are
properly removed again on failure or rule flush.
Florian Westphal [Fri, 12 Jun 2026 09:22:08 +0000 (11:22 +0200)]
netdevsim: tc: allow to test nf_tables offload control plane code
The actual 'offload' is phony, all commands are ignored: this is only
useful to test control plane code.
Tag the existing callback to permit error injection to test rollback/abort
code in nf_tables. This is also for fuzzers - the fault injection
framework allows probabilistic error insertion.
Wayen.Yan [Fri, 12 Jun 2026 09:37:00 +0000 (17:37 +0800)]
net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()
In airoha_ppe_flush_sram_entries(), the outer "err" variable was never
updated when the inner loop variable shadowed it, causing the function
to always return 0 even when airoha_ppe_foe_commit_sram_entry() fails.
Drop the outer "err" variable and return directly on error, propagating
the error code from airoha_ppe_foe_commit_sram_entry() correctly.
Linus Torvalds [Sat, 13 Jun 2026 15:23:36 +0000 (08:23 -0700)]
Merge tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fix from Ingo Molnar:
- Fix potential debugobjects deadlock on PREEMPT_RT kernels (Waiman
Long)
* tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Don't call fill_pool() in early boot hardirq context
Linus Torvalds [Sat, 13 Jun 2026 15:14:17 +0000 (08:14 -0700)]
Merge tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"The biggest news here is that this is my last pull request as I2C
maintainer after 13.5 years. Starting with the 7.2 cycle, Andi Shyti
is taking over who helped me greatly maintaining the host drivers for
a while now. Thank you, Andi, and good luck with the subsystem. I'll
be around for help, of course.
Technically, there are two patches which might be a tad large for this
late cycle, but most of them is explaining comments, so I think they
are suitable.
- MAINTAINERS:
- hand over I2C maintainership to Andi
- minor updates
- rust: fix I2cAdapter refcount double increment
- imx: keep clock and pinctrl states consistent in runtime PM
- imx-lpi2c: fix DMA resource leaks on PIO fallback
- qcom-cci: fix NULL pointer dereference on remove
- riic: fix reset refcount leak on resume_noirq error path
- stm32f7: account for analog filter in timing computation
- tegra:
- fix suspend/resume handling in NOIRQ phase
- update Tegra410 I2C timings to match hardware specs"
* tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
dt-bindings: i2c: mux-gpio: name correct maintainer
MAINTAINERS: hand over I2C to Andi Shyti
i2c: imx-lpi2c: fix resource leaks switching to devm_dma_request_chan()
MAINTAINERS: i2c: designware: Remove inactive reviewer
i2c: tegra: Fix NOIRQ suspend/resume
i2c: tegra: Update Tegra410 I2C timing parameters
i2c: qcom-cci: Fix NULL pointer dereference in cci_remove()
i2c: stm32f7: fix timing computation ignoring i2c-analog-filter
i2c: imx: fix clock and pinctrl state inconsistency in runtime PM
i2c: riic: fix refcount leak in riic_i2c_resume_noirq()
rust: i2c: fix I2cAdapter refcounts double increment
Thomas Gleixner [Sat, 13 Jun 2026 14:24:29 +0000 (16:24 +0200)]
Merge tag 'timers-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/daniel.lezcano/linux into timers/clocksource
Pull clocksource/driver updates from Daniel Lezcano:
- Remove the sifive,fine-ctr-bits property bindings because it is a
redundant information (Nick Hu)
- Remove the TCIU8 interrupt bindings on Renesas because it should not
be described as the documentation marked reserved and fix the
conditional reset line for the RZ/{T2H,N2H} (Cosmin Tanislav)
- Extend schema condition for interrupts to cover D1 compatible
variant an add the D1 hstimer support (Michal Piekos)
- Update the ARM architected timer support to handle the ACPI GTDT v3
format and the EL2 virtual timer, enabling Linux to use the most
appropriate timer when running with VHE, while also fixing several
Device Trees to accurately reflect the underlying hardware (Marc
Zyngier)
- Cleanup and add the clocksource and the clockevent in the TI DM
timer (Markus Schneider-Pargmann)
- Add the multiple watchdogs support in the tegra186 and
tegra234. Dedicate one as a kernel watchdog (Kartik Rajput)
- Add the NXP clocksource selection for the scheduler in the Kconfig
(Enric Balletbo i Serra)
WenTao Liang [Thu, 11 Jun 2026 16:17:38 +0000 (00:17 +0800)]
posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path
In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference
via get_pid() and stores it in timer.it.cpu.pid. If the subsequent
posix_cpu_timer_set() call fails, the function returns immediately
without calling posix_cpu_timer_del() to release the pid reference,
causing a leak.
Fix it by calling posix_cpu_timer_del() before the unlock-and-return
on the error path, consistent with the other exit paths in the same
function.
Thomas Gleixner [Tue, 9 Jun 2026 15:14:45 +0000 (17:14 +0200)]
time/jiffies: Register jiffies clocksource before usage
Teddy reported that a XEN HVM has a long boot delay, which was bisected to
the recent enhancements to the negative motion detection. It turned out
that the jiffies clocksource is used in early boot before it is registered,
which leaves the max_delta_raw field at zero. That causes the read out to
be clamped to the max delta of 0, which means time is not making progress.
Cure it by ensuring that it is initialized before its first usage in
timekeeping_init().
Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Teddy Astie <teddy.astie@vates.tech> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13 Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech
The "ti,n-factor" binding and examples allow negative correction
values. Reading it as u32 makes the helper type disagree with the
documented signed value and hides real schema mismatches.
Use the signed helper so the DT access matches the s32 value stored by
the driver.
Keith Busch [Fri, 12 Jun 2026 22:32:04 +0000 (15:32 -0700)]
block: check bio split for unaligned bvec
Offsets and lengths need to be validated against the dma alignment. This
check was skipped for sufficiently a small bio with a single bvec, which
may allow an invalid request dispatched to the driver. Force the
validation for an unaligned bvec by forcing the bio split path that
handles this condition.
Fixes: 7eac33186957 ("iomap: simplify direct io validity check") Fixes: 5ff3f74e145a ("block: simplify direct io validity check") Reported-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Dumazet [Sat, 13 Jun 2026 04:26:19 +0000 (04:26 +0000)]
nbd: Reclassify sockets to avoid lockdep circular dependency
syzbot reported a possible circular locking dependency in udp_sendmsg()
where fs_reclaim can be triggered while holding sk_lock, and fs_reclaim
can eventually depend on another sk_lock (e.g., if NBD is used for swap
or writeback and NBD uses TLS/TCP which acquires sk_lock).
Since the UDP socket and the NBD TCP/TLS socket are different, this is a
false positive. Fix this by reclassifying NBD sockets to a separate lock
class when they are added to the NBD device.
This is similar to what nvme-tcp and other network block devices do.
Fixes: ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier") Reported-by: syzbot+607cdcf978b3e79da878@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a2cdafe.428ffe26.258b27.0161.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260613042619.1108126-1-edumazet@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 30 May 2026 02:03:47 +0000 (20:03 -0600)]
io_uring/net: make POLL_FIRST receive side checks consistent
io_recv() and io_recvzc() are the odd ones out, as they checks for
whether POLL_FIRST should be honored before checking if the file is a
socket. It doesn't really matter, but might as well make it consistent
across all receive and send types.
Jens Axboe [Thu, 11 Jun 2026 17:44:47 +0000 (11:44 -0600)]
io_uring: remove the per-ctx fallback task_work machinery
With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.
With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.
Jens Axboe [Thu, 11 Jun 2026 17:41:25 +0000 (11:41 -0600)]
io_uring: run the tctx task_work fallback directly
The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop - if run from the fallback
kernel thread, ts.cancel will get set, and the work terminated.
Jens Axboe [Thu, 11 Jun 2026 16:13:22 +0000 (10:13 -0600)]
io_uring: switch normal task_work to a mpscq
Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.
Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.
Jens Axboe [Wed, 10 Jun 2026 21:19:35 +0000 (15:19 -0600)]
io_uring: switch local task_work to a mpscq
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.
Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:
most of that overhead is gone, and performance is better as well.
Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.
Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.
Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.
Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.
The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.
The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.
The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to separate
the consumer and producers cacheline. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.
Jens Axboe [Fri, 12 Jun 2026 02:27:22 +0000 (20:27 -0600)]
io_uring: grab RCU read lock marking task run
Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.
Paolo Abeni [Sat, 13 Jun 2026 09:50:31 +0000 (11:50 +0200)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-06-09 (idpf, ixgbe, igc)
Przemyslaw adds needed padding to idpf PTP structures to match firmware
expectations.
Larysa bypasses XPS configuration on XDP queues for ixgbe.
Khai Wen corrects offset into packet buffer when handling for frame
preemption on igc.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: skip RX timestamp header for frame preemption verification
ixgbe: do not configure xps for XDP queues
idpf: add padding to PTP virtchnl structures
====================
Ratheesh Kannoth [Wed, 10 Jun 2026 02:23:44 +0000 (07:53 +0530)]
octeontx2-af: npc: Fix size of entry2cntr_map
KASAN prints below splat. This is caused by allocating counter for
reserved mcam entry for cpt 2nd pass entry. But mcam->entry2cntr_map
is not allocated for reserved entries.
BUG: KASAN: slab-out-of-bounds in npc_map_mcam_entry_and_cntr+0xb0/0x1a0
Write of size 2 at addr ffff0001033e7ffe by task kworker/0:1/14
Woojin Ji [Fri, 12 Jun 2026 05:26:55 +0000 (14:26 +0900)]
selftests/bpf: Add arena direct-value one-past-end reject test
BPF_MAP_TYPE_ARENA supports direct-value pseudo loads, but unlike array
maps its map value_size is zero and the valid direct-value range is the
arena mmap size, max_entries * PAGE_SIZE.
Commit 3ac1a467e376 ("bpf: Fix off-by-one boundary validation in arena
direct-value access") fixed arena_map_direct_value_addr() to reject an
offset exactly at the end of the arena mapping. Add a regression test
that loads a BPF_PSEUDO_MAP_VALUE with off == arena_size and verifies
that the verifier rejects it with the expected offset in the log.
This is intentionally kept as a userspace raw-instruction test. I tried
expressing the same BPF_PSEUDO_MAP_VALUE + off == arena_size case in
verifier_arena.c with inline assembly. The only form that produces the
desired instruction bytes uses __imm_addr(arena), but that emits
R_BPF_64_NODYLD32, which the libbpf/bpftool link step rejects. Other
register, immediate, and memory constraints either fail in the BPF
backend or lower to a normal R_BPF_64_64 load followed by an ALU add,
which does not exercise arena_map_direct_value_addr() with the boundary
offset in the second ldimm64 slot.
A legacy test_verifier fixture can express the raw instruction directly,
but it needs arena map creation, mmap, and fixup plumbing in the legacy
runner. That is more intrusive than the small prog_tests raw-instruction
test.
Use the userspace raw-instruction test, following the existing selftests
pattern used for direct map-value pseudo loads, so insns[1].imm can be
set to arena_size precisely.
Assisted-by: ChatGPT:gpt-5.5 Signed-off-by: Woojin Ji <random6.xyz@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Junyoung Jang <graypanda.inzag@gmail.com> Link: https://lore.kernel.org/r/20260612-arena-direct-value-v1-v4-1-b81b642f5277@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Gabriele Monaco [Wed, 10 Jun 2026 09:04:29 +0000 (11:04 +0200)]
rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
restores interrupts, this means preemption is enabled when interrupts
are still disabled (as part of raw_res_spin_unlock()) so this cannot
trigger an actual preemption.
This is inconsistent with other spinlock implementations
(raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
itself).
Adjust the macro to ensure interrupts are enabled before enabling
preemption, allowing to schedule at that point. Make the same
modification in the error path of raw_res_spin_lock_irqsave().
====================
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno
This series fixes the issue reported by sashiko in [1]. The issue is that,
when a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for void and boolean LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.
Fix it by skipping setting -EPERM for hooks not returning errno.
Xu Kuohai [Wed, 10 Jun 2026 20:17:24 +0000 (20:17 +0000)]
selftests/bpf: Add retval test for bool and errno LSM cgroup hooks
Add test to check the return value when a BPF program exits with 0 for
a boolean and an errno LSM hook.
For each hook, two BPF programs are attached. The first program returns
0 without calling bpf_set_retval() to exercise the return value translation
logic, while the second program reads the retval via bpf_get_retval().
Xu Kuohai [Wed, 10 Jun 2026 20:17:23 +0000 (20:17 +0000)]
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for boolean and void LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.
Fix it by skipping setting -EPERM for hooks not returning errno.
net: qrtr: fix 32-bit integer overflow in qrtr_endpoint_post()
qrtr_endpoint_post() validates an incoming packet with
if (!size || len != ALIGN(size, 4) + hdrlen)
goto err;
where size comes from the wire. On 32-bit, size_t is 32 bits and
ALIGN(size, 4) wraps to 0 for size >= 0xfffffffd, so the check
passes and skb_put_data(skb, data + hdrlen, size) writes past the
hdrlen-sized skb and oopses the kernel. 64-bit is unaffected.
This is the 32-bit residual of ad9d24c9429e2 ("net: qrtr: fix OOB
Read in qrtr_endpoint_post"), which fixed only the 64-bit case.
Reject any size that cannot fit the buffer before the ALIGN.
Fixes: ad9d24c9429e2 ("net: qrtr: fix OOB Read in qrtr_endpoint_post") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260611125455.2352279-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dragos Tatulea [Thu, 11 Jun 2026 13:52:30 +0000 (16:52 +0300)]
net/mlx5: Check max_macs devlink param value against max capability
The max_macs devlink param is checked against the FW max value only at
param register time (driver load) and inside the validate callback
(devlink param set). The stored DRIVERINIT value persists across FW
resets and devlink reloads without any further checks against the max.
If the FW link type changes from Ethernet to IB and a FW reset happens,
the MAX cap for log_max_current_uc_list will become zero, but the
previously stored max_macs value remains and is unconditionally
programmed into the HCA caps in handle_hca_cap(). FW will then return a
syndrome during SET_HCA_CAP:
mlx5_cmd_out_err:839:(pid 3831): SET_HCA_CAP(0x109) op_mod(0x0) failed,
status bad parameter(0x3), syndrome (0x537801), err(-22)
set_hca_cap:907:(pid 3831): handle_hca_cap failed
This results in a failure to register the RDMA device.
This patch skips programming log_max_current_uc_list when the MAX
capability is 0 (in case of IB).
Fixes: 8680a60fc1fc ("net/mlx5: Let user configure max_macs generic param") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260611135230.534513-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
psp: Add support for dev-assoc/disassoc
The main purpose of this feature is to associate virtual devices like
veth or netkit with a real PSP device, so we could provide PSP
functionality to the application running with virtual devices.
A typical deployment that works with this feature is as follows:
Host Namespace:
psp_dev_local ←──physically linked──→ psp_dev_peer
(PSP device)
│
│ BPF on psp_dev_local ingress: bpf_redirect_peer() to nk_guest
│
nk_host / veth_host
│
│ BPF on nk_host ingress: bpf_redirect_neigh() to psp_dev_local
│
Guest Namespace (netns):
│
nk_guest / veth_guest
★ PSP application run here
Remote Namespace (_netns):
psp_dev_peer
★ PSP server application runs here
Note:
The general requirement for this feature to work:
For PSP to work correctly, the egress device at validate_xmit_skb()
time must have psp_dev matching the association's psd. Any device
stacking or traffic redirection that changes the egress device will
cause either:
1. TX validation failure (SKB_DROP_REASON_PSP_OUTPUT) - fail-safe
2. RX policy failure after tx-assoc - packets without PSP extension
are rejected by receiver expecting encrypted traffic
Here are a few examples that this feature would not work:
- Bonding with load balancing in round-robin, XOR, 802.3ad mode across
multiple PSP devices, or mixed PSP and non-PSP devices
- Bonding with active-backup mode might work without PSP migration for
failover case.
- ipvlan/macvlan in bridge mode would not work given packets are
loopbacked locally without going through the PSP device.
====================
Wei Wang [Mon, 8 Jun 2026 23:31:18 +0000 (16:31 -0700)]
selftests/net: psp: add dev-get, no-nsid, and cleanup tests
Add the following 3 tests:
- _psp_dev_get_check_netkit_psp_assoc: verifies dev-get output in both
host and guest namespaces, checking assoc-list, by-association flag,
and nsid values
- _dev_assoc_no_nsid: tests dev-assoc and dev-disassoc without the nsid
attribute, verifying ifindex lookup in the caller's namespace
- _psp_dev_assoc_cleanup_on_netkit_del: verifies that deleting the
associated netkit interface properly cleans up the assoc-list, using
a disposable netkit pair to avoid disturbing the shared environment
Add tests that verify PSP notifications are delivered to listeners in
associated namespaces:
- _key_rotation_notify_multi_ns_netkit: triggers key rotation and
verifies the notification is received in both main and guest namespaces
- _dev_change_notify_multi_ns_netkit: triggers dev_set and verifies the
dev_change notification is received in both namespaces
Wei Wang [Mon, 8 Jun 2026 23:31:16 +0000 (16:31 -0700)]
selftests/net: psp: add dev-assoc data path test
Add _assoc_check_list() test that associates nk_guest with the PSP
device and verifies the assoc-list is correctly populated.
Add _data_basic_send_netkit_psp_assoc() which tests PSP data send
through a netkit interface associated with a PSP device. The test
associates nk_guest with the PSP device, then sends PSP-encrypted
traffic from the guest namespace.
Wei Wang [Mon, 8 Jun 2026 23:31:15 +0000 (16:31 -0700)]
selftests/net: psp: support PSP in NetDrvContEnv infrastructure
Add infrastructure to support PSP tests across network namespaces
using NetDrvContEnv with netkit pairs. This enables testing PSP device
association, where a non-PSP-capable device (e.g. netkit) in a guest
namespace is associated with a real PSP device in the host namespace,
allowing the guest to perform PSP encryption/decryption through the
host's PSP hardware.
env.py:
- nk_guest_ifindex is queried after moving the device into the guest
namespace, so tests can use it directly for dev-assoc
psp.py:
- PSP device lookup supports container environments where the PSP
device is on the physical interface, not the test interface
- Association helpers handle dev-assoc/dev-disassoc with defer-based
cleanup to prevent state leaks on test assertion failures
- main() tries NetDrvContEnv with primary_rx_redirect and falls back
to NetDrvEpEnv, so existing tests continue to work without the
container environment
Wei Wang [Mon, 8 Jun 2026 23:31:14 +0000 (16:31 -0700)]
selftests/net: rename _nk_host_ifname to nk_host_ifname
Rename _nk_host_ifname to nk_host_ifname in NetDrvContEnv to make it
a public attribute, matching the nk_guest_ifname rename. Tests that
access the host-side netkit interface name (e.g. for cleanup after
deleting the netkit pair) no longer trigger pylint protected-access
warnings.
Wei Wang [Mon, 8 Jun 2026 23:31:13 +0000 (16:31 -0700)]
selftests/net: add _find_bpf_obj() to search hw/ for BPF objects
Add _find_bpf_obj() helper to NetDrvContEnv that searches the test
directory first, then falls back to the hw/ subdirectory. This allows
tests outside drivers/net/hw/ (e.g. psp.py in drivers/net/) to find
BPF objects built in the hw/ directory.
Update _attach_bpf() and _attach_primary_rx_redirect_bpf() to use
_find_bpf_obj() for BPF object discovery.
Wei Wang [Mon, 8 Jun 2026 23:31:12 +0000 (16:31 -0700)]
selftests/net: psp: refactor test builders to use ksft_variants
Replace the manual psp_ip_ver_test_builder() and ipver_test_builder()
functions with @ksft_variants decorators for data_basic_send and
data_mss_adjust. This is a pure refactor with no behavior change.