Bryam Vargas [Thu, 4 Jun 2026 23:17:05 +0000 (23:17 +0000)]
selftests/landlock: Test SCOPE_SIGNAL on the SIGIO/fowner pgid path
Add regression tests for the LANDLOCK_SCOPE_SIGNAL handling of the
asynchronous SIGIO delivery path (fcntl(F_SETOWN)) with a process-group
owner.
sigio_to_pgid_members covers the bypass: a sandboxed process at the head
of its process group's PGID hlist (the default after fork()) arms
F_SETOWN(-pgrp) + O_ASYNC and triggers the fan-out; the in-domain owner
must be signaled (proving the trigger fired) while the non-sandboxed
member of the group, outside the domain, must not.
sigio_to_pgid_self covers the same-process guarantee: the owner is
registered from a sandboxed non-leader thread, whose domain differs from
the thread-group leader the kernel signals for a process-group owner.
That leader belongs to the owner's own process and must still be
signaled.
Without the fix the first test sees the out-of-domain member signaled
and the second sees the owner's own leader denied.
Bryam Vargas [Thu, 4 Jun 2026 23:16:56 +0000 (23:16 +0000)]
landlock: Fix LANDLOCK_SCOPE_SIGNAL bypass on the SIGIO path
LANDLOCK_SCOPE_SIGNAL must prevent a sandboxed process from signaling
processes outside its Landlock domain. It can be bypassed through the
asynchronous SIGIO delivery path.
A sandboxed process that owns any file or socket can arm it with
fcntl(fd, F_SETOWN, -pgid), fcntl(fd, F_SETSIG, SIGKILL) and O_ASYNC, so
that an I/O event makes the kernel deliver the chosen signal to the
whole process group. When that process is the head of its process
group's task list (the default position right after fork()), the group
can also hold the non-sandboxed process that launched it, e.g. a
supervisor or a security
monitor. The sandbox can thus kill or signal the processes
LANDLOCK_SCOPE_SIGNAL is meant to protect from it.
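The arming sequence described above can be sketched in user space as
follows. This is a minimal illustration, not the selftest code:
arm_async_owner() is a hypothetical helper name, and F_SETSIG is a Linux
extension that is only used when the headers expose it. A negative owner
selects the process group -owner, which is the fan-out case.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* F_SETSIG is a Linux extension */
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: arm fd so that I/O readiness raises `sig` for
 * `owner`. owner > 0 names a single process, owner < 0 names the
 * process group -owner. Returns 0 on success, -1 on error with errno
 * set by fcntl().
 */
static int arm_async_owner(int fd, pid_t owner, int sig)
{
	int flags;

	assert(fd >= 0);

	/* Step 1: choose who receives the signal. */
	if (fcntl(fd, F_SETOWN, owner) == -1)
		return -1;

#ifdef F_SETSIG
	/* Step 2: choose which signal is sent instead of the default SIGIO. */
	if (fcntl(fd, F_SETSIG, sig) == -1)
		return -1;
#else
	(void)sig; /* not available: the default SIGIO is used */
#endif

	/* Step 3: turn on asynchronous notification for this descriptor. */
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_ASYNC) == -1 ? -1 : 0;
}
```

Calling arm_async_owner(fd, -getpgrp(), SIGUSR1) would target the
caller's whole process group; a later I/O event on fd then triggers the
delivery that hook_file_send_sigiotask() has to filter.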
The scope is enforced in hook_file_send_sigiotask() against the Landlock
domain recorded at F_SETOWN time, not the live domain of the sender.
control_current_fowner() decides whether to record that domain and skips
recording it when the fowner target is in the caller's thread group,
which is safe only for a single-task target (PIDTYPE_PID, PIDTYPE_TGID).
For a process group (PIDTYPE_PGID) pid_task() returns only one member;
recording is skipped whenever that member shares the caller's thread
group, and hook_file_send_sigiotask() then lets the signal fan out to
the whole group unchecked.
Record the domain for every target that is not a single process so the
scope is enforced against each group member at delivery time.
That recording is necessary but not sufficient on its own: the kernel
signals a process group through its members' thread-group leaders, and
the leader of the registrant's own process can carry a different
Landlock domain than the sibling thread that armed the owner.
domain_is_scoped() would then deny that leader, even though commit
18eb75f3af40 ("landlock: Always allow signals between threads of the
same process") requires same-process delivery to be allowed.
hook_task_kill() avoids this by evaluating same_thread_group() live, per
recipient; the SIGIO path instead delegates the whole decision to a
single registration-time check, which a process-group fan-out cannot
honor.
So also record the registrant's thread group next to its domain and
exempt it at delivery: hook_file_send_sigiotask() allows the signal
whenever the recipient belongs to the registrant's own process,
restoring the same-process guarantee while keeping out-of-domain group
members blocked. The direct kill() path (hook_task_kill) already
evaluates the live domain and is unaffected.
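The delivery-time decision described above can be modelled in user space
roughly as follows. This is a simplified sketch, not the kernel code:
every demo_* type and function is hypothetical, a domain is reduced to a
node with a parent link, and "in scope" is taken to mean "same domain or
nested below it".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Landlock domain: a node with a parent link. */
struct demo_domain {
	const struct demo_domain *parent;
};

/* Hypothetical stand-in for a task: its thread group and its domain. */
struct demo_task {
	int tgid;
	const struct demo_domain *domain; /* NULL: not sandboxed */
};

/* What this model records when the owner is registered. */
struct demo_fown {
	const struct demo_domain *domain; /* NULL: nothing to enforce */
	int tgid;                         /* registrant's thread group */
};

/* True if `dom` is `ancestor` itself or nested somewhere below it. */
static bool demo_in_or_below(const struct demo_domain *dom,
			     const struct demo_domain *ancestor)
{
	for (; dom; dom = dom->parent)
		if (dom == ancestor)
			return true;
	return false;
}

/* Decision for one recipient of the process-group fan-out. */
static bool demo_sigio_allowed(const struct demo_fown *fown,
			       const struct demo_task *recipient)
{
	assert(fown && recipient);

	if (!fown->domain)
		return true; /* registrant was not scoped */
	if (recipient->tgid == fown->tgid)
		return true; /* registrant's own process: always allowed */
	return demo_in_or_below(recipient->domain, fown->domain);
}
```

In this model a non-sandboxed supervisor in the same process group is
denied, while the registrant's own thread-group leader is allowed even
if it carries a different domain.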
Fixes: 18eb75f3af40 ("landlock: Always allow signals between threads of the same process")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://patch.msgid.link/56bffc24f3d0d08b45a686a48e99766b0a0821fa.1780614610.git.hexlabsecurity@proton.me
[mic: Check pid_type earlier and improve comment, fix commit message,
fix comment formatting]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Landlock provides best-effort sandboxing across ABI versions:
applications request the rights they need, and on older kernels the
unsupported rights are silently dropped from handled_access_* by the
documented compatibility switch. The recommended pattern for
landlock_add_rule(2) calls is to mirror this filtering at the rule
level, which was not explicitly described in the example.
Show the pattern explicitly in the filesystem and network rule examples
by masking each rule's allowed_access against the ruleset's
handled_access_* and adding the rule only when at least one bit remains
set. This makes the recommended best-effort pattern self-documenting.
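The masking step can be sketched as below. This is an illustration
only: the DEMO_* bit values and demo_add_rule_best_effort() are
hypothetical stand-ins, and real code would use the LANDLOCK_ACCESS_*
constants and call landlock_add_rule(2) where the comment indicates.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only, not the real LANDLOCK_ACCESS_* values. */
#define DEMO_ACCESS_READ  (1ULL << 0)
#define DEMO_ACCESS_WRITE (1ULL << 1)
#define DEMO_ACCESS_NEWER (1ULL << 2) /* pretend: unknown to an older kernel */

/*
 * Mask a rule's allowed_access against the ruleset's handled access.
 * Stores the masked value in *out_access and returns 1 if the rule
 * should be added, 0 if no bit remains and the rule is skipped.
 */
static int demo_add_rule_best_effort(uint64_t allowed_access,
				     uint64_t handled_access,
				     uint64_t *out_access)
{
	uint64_t masked = allowed_access & handled_access;

	assert(out_access);
	*out_access = masked;
	if (!masked)
		return 0; /* nothing left to allow: do not add the rule */

	/* Real code would call landlock_add_rule(2) with `masked` here. */
	return 1;
}
```

With handled access limited to READ and WRITE, a rule asking for
READ | NEWER is added with READ only, and a rule asking for NEWER alone
is skipped.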
Mickaël Salaün [Wed, 13 May 2026 18:03:08 +0000 (20:03 +0200)]
landlock: Account all audit data allocations to user space
Mark the kzalloc_flex() of struct landlock_details with
GFP_KERNEL_ACCOUNT so the allocation is charged to the calling task,
like the other Landlock per-domain allocations which have used
GFP_KERNEL_ACCOUNT forever.
Every property of landlock_details is caller-attributable: allocated by
landlock_restrict_self(2), owned by the caller's landlock_hierarchy,
contents are the caller's pid, uid, comm, and exe_path, lifetime bounded
by the caller's domain. While the caller may neither know nor control the
size of this allocation (i.e. exe_path), this data should still be
accounted to the caller.
The deciding factor is whether userspace can trigger the allocation, not
whether the size of the data is known or controlled by the caller.
This aligns with the kmemcg accounting policy established by commit
5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg").
No new failure modes: the hierarchy and ruleset are allocated before
details and are already accounted, so landlock_restrict_self(2) already
returns -ENOMEM under memcg pressure. This change widens that existing
failure window slightly; it does not introduce a new error code.
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: stable@vger.kernel.org
Fixes: 1d636984e088 ("landlock: Add AUDIT_LANDLOCK_DOMAIN and log domain status")
Link: https://patch.msgid.link/20260513180309.165840-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Mickaël Salaün [Fri, 12 Jun 2026 17:27:55 +0000 (19:27 +0200)]
landlock: Set audit_net.sk for socket access checks
Set audit_net.sk in current_check_access_socket() to provide the socket
object to audit_log_lsm_data(). This makes Landlock consistent with
AppArmor, which always sets .sk for socket operations, and with
SELinux's generic socket permission checks.
The socket's local and foreign address information (laddr, lport, faddr,
fport) is logged by the shared lsm_audit.c infrastructure when the
socket has bound or connected state. Fields with zero values are
suppressed by print_ipv4_addr()/print_ipv6_addr(), so the audit output
is unchanged for the common case of bind denials on unbound sockets.
For connect denials after a prior bind, the bound local address (laddr,
lport) appears before the existing sockaddr fields (daddr, dest).
No existing fields are removed or reordered, and the new field names
(laddr, lport, faddr, fport) are standard audit fields already emitted
by other LSMs through the same lsm_audit.c code path.
Add a connect_tcp_bound audit test that binds to an allowed port and
then connects to a denied one, verifying that the denial record reports
laddr/lport from the bound socket in addition to the connect
destination.
Eric Dumazet [Mon, 8 Jun 2026 15:14:52 +0000 (15:14 +0000)]
tcp: refine tcp_sequence() for the FIN exception
Commit 0e24d17bd966 ("tcp: implement RFC 7323 window retraction
receiver requirements") removed the special FIN case that
was added in commit 1e3bb184e941 ("tcp: re-enable acceptance of
FIN packets when RWIN is 0").
If a peer sends a segment containing data and a FIN flag before
it learns about our window retraction and has a buggy TCP stack,
it might place the FIN one byte beyond what it thinks is the
right edge of the window (i.e., max_window_edge + 1).
The data portion (end_seq - th->fin) will end exactly at max_window_edge.
In this case, we will drop the packet if our receive queue is not empty,
even though the data was sent within the window we previously allowed.
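The arithmetic behind the FIN exception can be illustrated with a small
user-space model. It is a sketch under assumptions, not the kernel's
tcp_sequence(): the demo_* names are hypothetical, and the model only
shows the right-edge comparison with the FIN's sequence number excluded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is after b" on 32-bit sequence numbers. */
static bool demo_seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Right-edge test in this model: compare the end of the data portion,
 * i.e. end_seq minus one when FIN is set, against max_window_edge.
 */
static bool demo_fits_right_edge(uint32_t end_seq, bool fin,
				 uint32_t max_window_edge)
{
	uint32_t data_end = end_seq - (fin ? 1u : 0u);

	return !demo_seq_after(data_end, max_window_edge);
}
```

A segment whose end_seq is max_window_edge + 1 passes when FIN is set,
because its data ends exactly at the edge, and fails when the extra
byte is data.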
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Baatz <gmbnomis@gmail.com>
Link: https://patch.msgid.link/20260608151452.706822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
dpll/ice: Add generic DPLL type and full TX reference clock control for E825
NOTE: This series is intentionally submitted on net-next (not
intel-wired-lan) as early feedback of DPLL subsystem changes is
welcomed. In the past possible approaches were discussed in [1].
This series adds TX reference clock support for E825 devices and exposes
TX clock selection and synchronization status via the Linux DPLL
subsystem.
E825 hardware contains a dedicated TX clock domain with per-port source
selection behavior that is distinct from PPS handling and from board-level
EEC distribution. TX reference clock selection is device-wide, shared
across ports, and mediated by firmware as part of link bring-up. As a
result, TX clock selection intent may differ from effective hardware
configuration, and software must verify outcome after link-up.
To support this, the series extends the DPLL core and the ice driver
incrementally. The series also introduces DPLL_TYPE_GENERIC as a broad
UAPI class for DPLL instances outside PPS/EEC categories. The intent is
to keep type naming reusable and scalable across different ASIC
topologies while preserving functional discoverability via
driver/device context and pin topology.
This follows netdev discussion guidance that UAPI type naming should avoid
location-specific or vendor-specific taxonomy, because such labels do not
scale across different ASIC designs. The function of a given DPLL instance
is already discoverable from driver/device context and pin topology, and
does not require an additional narrow type identifier in UAPI.
At the same time, a separate DPLL object is still needed for E825 TX clock
control/reporting semantics. Using DPLL_TYPE_GENERIC provides a reusable
class for devices outside PPS/EEC without overfitting UAPI naming to one
topology.
The relevant discussion is in [2].
Series content
- add a new generic DPLL type for devices outside PPS/EEC classes;
- relax DPLL pin registration rules for firmware-described shared pins
and extend pin notifications with a source identifier;
- allow dynamic state control of SyncE reference pins where hardware
supports it;
- add CPI infrastructure for PHY-side TX clock control on E825C;
- introduce a TX-clock DPLL device and TX reference clock pins
(EXT_EREF0 and SYNCE) in the ice driver;
- extend the Restart Auto-Negotiation command to carry a TX reference
clock index;
- implement hardware-backed TX reference clock switching, post-link
verification, and TX synchronization reporting.
TXCLK pins report TX reference topology only. Actual synchronization
success is reported via DPLL lock status, updated after hardware
verification: external TX references report LOCKED, while the internal
ENET/TXCO source reports UNLOCKED.
This provides reliable TX reference selection and observability on E825
devices using standard DPLL interfaces, without conflating user intent
with effective hardware behavior.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:45 +0000 (20:30 +0200)]
ice: implement E825 TX ref clock control and TXC hardware sync status
Build on the previously introduced TXC DPLL framework and implement
full TX reference clock control and hardware-backed synchronization
status reporting for E825 devices.
E825 firmware may accept or override TX reference clock requests based
on device-wide routing constraints and link conditions. Because the
final selection becomes visible only after a link-up event, the driver
splits the observation into two complementary signals:
- TXCLK pin state reflects the requested TX reference clock
(pf->ptp.port.tx_clk_req). After a link-up, the value is reconciled
against the SERDES reference selector by
ice_txclk_update_and_notify(); if firmware or auto-negotiation
selected a different clock, tx_clk_req is overwritten so that pin
state converges to the actual hardware selection.
- TXC DPLL lock status reflects hardware synchronization:
* LOCKED when an external TX reference is in use
* UNLOCKED when falling back to ENET/TXCO, or when a requested
external reference has not (yet) been accepted by hardware.
Userspace observing only pin state therefore sees user intent, while
lock status is the authoritative indicator of whether the requested
clock is actually selected and synchronizing. This matches the DPLL
subsystem model where pin state describes topology and device lock
status describes signal quality.
TX reference selection topology:
- External references (SYNCE, EREF0) are represented as TXCLK pins
- The internal ENET/TXCO clock has no pin representation; when
selected, all TXCLK pins are reported DISCONNECTED
With this change, TX reference clocks on E825 devices can be reliably
selected, observed via standard DPLL interfaces, and monitored for
effective synchronization through TXC DPLL lock status.
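The split between pin state and lock status can be modelled as below.
This is a hedged sketch of the described behaviour, not driver code: all
demo_* names are hypothetical, and CONNECTED is assumed as the
counterpart of the DISCONNECTED state named above.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_ref { DEMO_REF_ENET, DEMO_REF_EREF0, DEMO_REF_SYNCE };
enum demo_state { DEMO_PIN_DISCONNECTED, DEMO_PIN_CONNECTED };

struct demo_txc {
	enum demo_ref tx_clk_req; /* requested reference (user intent) */
	bool locked;              /* modelled TXC DPLL lock status */
};

/* User request: intent is recorded, lock stays down until verified. */
static void demo_request(struct demo_txc *txc, enum demo_ref ref)
{
	assert(txc);
	txc->tx_clk_req = ref;
	txc->locked = false;
}

/* After link-up: converge on what hardware actually selected. */
static void demo_link_up(struct demo_txc *txc, enum demo_ref hw_selected)
{
	assert(txc);
	txc->tx_clk_req = hw_selected;                /* overwrite intent */
	txc->locked = (hw_selected != DEMO_REF_ENET); /* external => LOCKED */
}

/* State of one external reference pin; ENET/TXCO has no pin. */
static enum demo_state demo_pin_state_get(const struct demo_txc *txc,
					  enum demo_ref pin)
{
	assert(txc && pin != DEMO_REF_ENET);
	return txc->tx_clk_req == pin ? DEMO_PIN_CONNECTED
				      : DEMO_PIN_DISCONNECTED;
}
```

If SYNCE is requested but hardware falls back to ENET/TXCO at link-up,
the model ends with every TXCLK pin DISCONNECTED and the lock status
UNLOCKED.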
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:44 +0000 (20:30 +0200)]
ice: add Tx reference clock index handling to AN restart command
Extend the Restart Auto-Negotiation (AN) AdminQ command with a new
parameter allowing software to specify the Tx reference clock index to
be used during link restart.
This patch:
- adds REFCLK field definitions to ice_aqc_restart_an
- updates ice_aq_set_link_restart_an() to take a new refclk parameter
and properly encode it into the command
- keeps legacy behavior by passing REFCLK_NOCHANGE where appropriate
This prepares the driver for configurations requiring dynamic selection
of the Tx reference clock as part of the AN flow.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-13-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:43 +0000 (20:30 +0200)]
ice: implement CPI support for E825C
Add full CPI (Converged PHY Interface) command handling required for
E825C devices. The CPI interface allows the driver to interact with
PHY-side control logic through the LM/PHY command registers, including
enabling/disabling/selection of PHY reference clock.
This patch introduces:
- a new CPI subsystem (ice_cpi.c / ice_cpi.h) implementing the CPI
request/acknowledge state machine, including REQ/ACK protocol,
command execution, and response handling
- helper functions for reading/writing PHY registers over Sideband
Queue
- CPI command execution API (ice_cpi_exec) and a helper for enabling or
disabling Tx reference clocks (CPI 0xF1 opcode 'Config PHY clocking')
- CPI transaction serialization inside the CPI core.
CPI REQ/ACK is a multi-step handshake and must be executed
atomically per PHY. Centralize the lock in ice_cpi_exec() and
use adapter-scoped per-PHY mutexes, which match the hardware sharing
model across PFs.
- addition of the non-posted write opcode (wr_np) to SBQ
- Makefile integration to build CPI support together with the PTP stack
This provides the infrastructure necessary to support PHY-side
configuration flows on E825C and is required for advanced link control
and Tx reference clock management.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-12-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:42 +0000 (20:30 +0200)]
ice: introduce TXC DPLL device and TX ref clock pin framework for E825
E825 devices provide a dedicated TX clock (TXC) domain which may be
driven by multiple reference clock sources, including external board
references and port-derived SyncE. To support future TX clock control
and observability through the Linux DPLL subsystem, introduce a
separate TXC DPLL device (of DPLL_TYPE_GENERIC) and a framework for
representing TX reference clock inputs.
This change adds a new internal DPLL pin type (TXCLK) and registers
TX reference clock pins for E825-based devices:
- EXT_EREF0: a board-level external electrical reference
- SYNCE: a port-derived SyncE reference described via firmware nodes
The TXC DPLL device is created and managed alongside the existing
PPS and EEC DPLL instances. TXCLK pins are registered directly or
deferred via a notifier when backed by fwnode-described pins.
A per-pin attribute encodes the TX reference source associated with
each TXCLK pin.
At this stage, TXCLK pin state callbacks and TXC DPLL lock status
reporting are implemented as placeholders. Pin state getters always
return DISCONNECTED, and the TXC DPLL is initialized in the UNLOCKED
state. No hardware configuration or TX reference switching is
performed yet.
This patch establishes the structural groundwork required for
hardware-backed TX reference selection, verification, and
synchronization status reporting, which will be implemented in
subsequent patches.
Also signal dpll_init from the fwnode pin init error path so any
notifier worker already blocked on it can drain, avoiding a
flush_workqueue() deadlock during teardown.
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-11-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:41 +0000 (20:30 +0200)]
dpll: allow fwnode pins to attempt state change without capability bit
Pins registered with an fwnode may have .state_on_dpll_set implemented
without advertising DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE upfront.
Requiring the bit for fwnode pins ties firmware description to driver
implementation details unnecessarily.
Relax the capability check in dpll_pin_state_set() and
dpll_pin_on_pin_state_set(): when a pin has an associated fwnode, bypass
the capability gate and let the ops layer decide, returning -EOPNOTSUPP
if .state_on_dpll_set is absent. Non-fwnode pins retain the original
strict behavior.
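The relaxed gate can be sketched as a small user-space model. The
demo_* names are hypothetical, and returning -EOPNOTSUPP from the
capability gate is an assumption of this sketch; only the missing-op
case is stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE. */
#define DEMO_CAP_STATE_CAN_CHANGE (1u << 0)

struct demo_pin {
	bool has_fwnode;
	unsigned int capabilities;
	int (*state_set)(int state); /* stand-in for .state_on_dpll_set */
};

/* Trivial op used for illustration: accepts any state. */
static int demo_accept_state(int state)
{
	(void)state;
	return 0;
}

static int demo_pin_state_set(const struct demo_pin *pin, int state)
{
	assert(pin);

	/* Capability gate: enforced only for pins without an fwnode. */
	if (!pin->has_fwnode &&
	    !(pin->capabilities & DEMO_CAP_STATE_CAN_CHANGE))
		return -EOPNOTSUPP; /* assumed error code */

	/* The ops layer decides: no callback means not supported. */
	if (!pin->state_set)
		return -EOPNOTSUPP;

	return pin->state_set(state);
}
```

An fwnode pin without the capability bit reaches its callback, while a
non-fwnode pin without the bit is still rejected before the callback is
consulted.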
This is used later in the series by the SyncE_Ref output pin, which
relies on the fwnode path for state control.
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:40 +0000 (20:30 +0200)]
dpll: extend pin notifier with notification source ID
Extend the DPLL pin notification API to include a source identifier
indicating where the notification originates. This allows notifier
consumers to distinguish between notifications coming from
an associated DPLL instance, a parent pin, or the pin itself.
A new field, src_clock_id, is added to struct dpll_pin_notifier_info
and is passed through all pin-related notification paths. Callers of
dpll_pin_notify() are updated to provide a meaningful source identifier
based on their context:
- pin registration/unregistration uses the DPLL's clock_id,
- pin-on-pin operations use the parent pin's clock_id,
- pin changes use the pin's own clock_id.
As introduced in the commit ("dpll: allow registering FW-identified pin
with a different DPLL"), it is possible to share the same physical pin
via firmware description (fwnode) with DPLL objects from different
kernel modules. This means that a given pin can be registered multiple
times.
Drivers such as ICE (E825 devices) rely on this mechanism when listening
for the event where a shared-fwnode pin appears, while avoiding reacting
to events triggered by their own registration logic.
This change only extends the notification metadata and does not alter
existing semantics for drivers that do not use the new field.
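The source selection listed above, and one way a consumer might use it,
can be sketched as follows. The demo_* names are hypothetical, and the
consumer-side comparison against its own clock_id is an assumption about
how a driver could filter its own events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum demo_event {
	DEMO_PIN_REGISTER, /* also covers unregistration */
	DEMO_PIN_ON_PIN,   /* pin-on-pin operation */
	DEMO_PIN_CHANGE,   /* change on the pin itself */
};

/* Pick the source identifier for a notification, per the list above. */
static uint64_t demo_src_clock_id(enum demo_event ev, uint64_t dpll_id,
				  uint64_t parent_id, uint64_t pin_id)
{
	switch (ev) {
	case DEMO_PIN_REGISTER:
		return dpll_id;
	case DEMO_PIN_ON_PIN:
		return parent_id;
	case DEMO_PIN_CHANGE:
		return pin_id;
	}
	return pin_id; /* not reached for valid enum values */
}

/* Assumed consumer policy: skip events this driver caused itself. */
static bool demo_should_react(uint64_t src_clock_id, uint64_t own_clock_id)
{
	return src_clock_id != own_clock_id;
}
```

A listener with clock_id 0x10 would then ignore a registration event
whose source is 0x10 and react to one whose source is another DPLL.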
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-9-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:39 +0000 (20:30 +0200)]
dpll: balance create/delete notifications in __dpll_pin_(un)register
__dpll_pin_register() emits dpll_pin_create_ntf() internally, but
__dpll_pin_unregister() left the matching delete to its callers. The
counts then diverge on dpll_pin_on_pin_register() rollback and on
dpll_pin_on_pin_unregister(), leaking stale notifications.
Emit dpll_pin_delete_ntf() inside __dpll_pin_unregister() and drop the
now-redundant call in dpll_pin_unregister().
Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-8-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:38 +0000 (20:30 +0200)]
dpll: guard sync-pair removal on full pin unregister
__dpll_pin_unregister() wiped the global sync-pair state on every
(dpll, ops, priv, cookie) tuple removed from a pin. When a pin is
registered multiple times and only one registration is being torn
down, this dropped sync-pair pairings still in use by the surviving
registrations.
Move dpll_pin_ref_sync_pair_del() inside the xa_empty(&pin->dpll_refs)
branch so it only runs when the last registration is gone, alongside
clearing the DPLL_REGISTERED mark.
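The guard can be illustrated with a reduced model in which a pin only
counts its registrations. The demo_* names are hypothetical and the
xarray is replaced by a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_pin {
	int registrations; /* stand-in for the entries in pin->dpll_refs */
	bool registered;   /* stand-in for the DPLL_REGISTERED mark */
	bool sync_pairs;   /* true while sync-pair state exists */
};

/* Drop one registration; global state goes away only with the last one. */
static void demo_unregister_one(struct demo_pin *pin)
{
	assert(pin && pin->registrations > 0);

	pin->registrations--;
	if (pin->registrations == 0) {
		pin->sync_pairs = false; /* last user gone: drop pairings */
		pin->registered = false;
	}
}
```

With two registrations, the first teardown leaves the sync-pair state
untouched and only the second clears it.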
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:37 +0000 (20:30 +0200)]
dpll: emit per-dpll delete notifications in dpll_pin_on_pin_unregister()
dpll_pin_on_pin_register() emits a creation notification for every
parent->dpll_refs entry, but dpll_pin_on_pin_unregister() emitted only
one deletion notification outside the loop. When a pin is registered
against multiple parent dplls, userspace sees N creates but a single
delete and leaks per-dpll state.
Move dpll_pin_delete_ntf() into the loop and call it before
__dpll_pin_unregister() so the DPLL_REGISTERED mark is still set when
dpll_pin_available() is consulted.
Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-6-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:36 +0000 (20:30 +0200)]
dpll: send delete notification before unregister in on-pin rollback
The rollback path in dpll_pin_on_pin_register() called
__dpll_pin_unregister() before dpll_pin_delete_ntf(). When the
unregister dropped the pin's last DPLL reference it cleared the
DPLL_REGISTERED mark in dpll_pin_xa, so the subsequent
dpll_pin_event_send() failed dpll_pin_available() and aborted with
-ENODEV. As a result userspace was never notified of the rollback
deletion and remained out of sync with the kernel.
Send the delete notification first, matching the order used by
dpll_pin_unregister() and dpll_pin_on_pin_unregister().
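Why the order matters can be shown with a small model. The demo_*
names are hypothetical; the notification helper simply refuses to send
once the registered mark is gone, mirroring the -ENODEV failure
described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct demo_pin {
	int refs;         /* remaining DPLL references */
	bool registered;  /* stand-in for the DPLL_REGISTERED mark */
	int deletes_seen; /* delete notifications that reached userspace */
};

/* Send a delete notification; fails once the pin is not registered. */
static int demo_delete_ntf(struct demo_pin *pin)
{
	assert(pin);
	if (!pin->registered)
		return -ENODEV;
	pin->deletes_seen++;
	return 0;
}

/* Drop one reference; the last one clears the registered mark. */
static void demo_unregister(struct demo_pin *pin)
{
	assert(pin && pin->refs > 0);
	if (--pin->refs == 0)
		pin->registered = false;
}

/* Old rollback order: unregister first, then try to notify. */
static int demo_rollback_old(struct demo_pin *pin)
{
	demo_unregister(pin);
	return demo_delete_ntf(pin);
}

/* Fixed rollback order: notify while the mark is still set. */
static int demo_rollback_fixed(struct demo_pin *pin)
{
	int ret = demo_delete_ntf(pin);

	demo_unregister(pin);
	return ret;
}
```

For a pin holding its last reference, the old order returns -ENODEV and
userspace sees no delete, while the fixed order delivers exactly one.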
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:35 +0000 (20:30 +0200)]
dpll: fix stale iteration in dpll_pin_on_pin_unregister()
Neither parent->dpll_refs nor pin->dpll_refs on its own is a correct
iteration target at unregister time:
- pin->dpll_refs includes DPLLs the child was registered against
via a different parent or directly; blind unregister WARNs on
the cookie miss in dpll_xa_ref_pin_del().
- parent->dpll_refs reflects the parent's current attachments, not
those at child-register time. Another driver may have (un)registered
the parent against additional DPLLs in the meantime, so we miss
registrations that exist and visit DPLLs that have none.
Walk pin->dpll_refs and use dpll_pin_registration_find() to filter
to entries whose cookie is this parent. Symmetric with
dpll_pin_on_pin_register(), correct under any subsequent change to
parent->dpll_refs.
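The filtered walk can be sketched as below. This model is flattened on
purpose: each array entry stands for one registration with its cookie,
whereas the kernel keeps registrations per DPLL reference. All demo_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_ref {
	int dpll_id;
	const void *cookie; /* parent used at registration, NULL if direct */
	bool active;
};

/*
 * Walk the child's own references and remove only those registered
 * through `parent`. Returns the number of registrations removed.
 */
static size_t demo_on_pin_unregister(struct demo_ref *refs, size_t n,
				     const void *parent)
{
	size_t removed = 0;

	assert(refs || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!refs[i].active || refs[i].cookie != parent)
			continue; /* other parent or direct registration */
		refs[i].active = false;
		removed++;
	}
	return removed;
}
```

Entries registered through a different parent, or directly, are left
alone, so no unregister is attempted with a cookie that would miss.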
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:34 +0000 (20:30 +0200)]
dpll: allow registering FW-identified pin with a different DPLL
Relax the (module, clock_id) equality requirement when registering a
pin identified by firmware (pin->fwnode). Some platforms associate a
FW-described pin with a DPLL instance that differs from the pin's
(module, clock_id) tuple. For such pins, permit registration without
requiring the strict match. Non-FW pins still require equality.
Keep netlink pin module reporting/filtering safe for this relaxed
registration model by caching the module name in the pin object at
allocation time and using the cached string in netlink paths.
This avoids dereferencing pin->module after provider module teardown.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-3-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Grzegorz Nitka [Sun, 7 Jun 2026 18:30:33 +0000 (20:30 +0200)]
dpll: add generic DPLL type
Add DPLL_TYPE_GENERIC to represent DPLL devices which do not fit the
existing PPS or EEC classes.
The UAPI type is intentionally generic. During netdev discussion,
maintainers pointed out that introducing identifiers tied to a specific
placement or single design does not scale across ASICs and vendors.
The role of a DPLL is already inferable from the spawning driver,
bus device, and pin topology, without encoding additional
purpose-specific taxonomy in the type name.
Using a generic type keeps the UAPI extensible and avoids premature
naming that may become incorrect as new hardware topologies are
exposed through the DPLL subsystem.
Expose the new type through UAPI and netlink specification as "generic".
1) Replace the open-coded manual cleanup in xfrm_add_policy() error
path with xfrm_policy_destroy() for consistency with
xfrm_policy_construct().
From Deepanshu Kartikey.
2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since
u32 is excessive for traffic flow confidentiality padding.
From David Ahern.
3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that
allows migrating individual IPsec SAs independently of
their policies. The existing XFRM_MSG_MIGRATE is tightly coupled
to policy+SA migration, lacks SPI for unique SA identification,
and cannot express reqid changes or migrate Transport mode
selectors. The new interface identifies the SA via SPI and mark,
supports reqid changes, address family changes, encap removal,
and uses an atomic create+install flow under x->lock to prevent
SN/IV reuse during AEAD SA migration.
From Antony Antony.
* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
xfrm: make xfrm_dev_state_add xuo parameter const
xfrm: extract address family and selector validation helpers
xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
xfrm: move encap and xuo into struct xfrm_migrate
xfrm: add error messages to state migration
xfrm: add state synchronization after migration
xfrm: check family before comparing addresses in migrate
xfrm: split xfrm_state_migrate into create and install functions
xfrm: rename reqid in xfrm_migrate
xfrm: fix NAT-related field inheritance in SA migration
xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
xfrm: add extack to xfrm_init_state
xfrm: remove redundant assignments
xfrm: Reject excessive values for XFRMA_TFCPAD
xfrm: cleanup error path in xfrm_add_policy()
====================
Ruoyu Wang [Fri, 12 Jun 2026 03:56:13 +0000 (11:56 +0800)]
net: wwan: t7xx: check skb_clone in control TX
t7xx_port_ctrl_tx() clones each skb fragment before passing it to the
port transmit path. The clone is used immediately to set cloned->len, so
an skb_clone() failure results in a NULL pointer dereference.
Check the clone before using it. If previous fragments were already
queued, preserve the driver's existing partial-write behavior by
returning the number of bytes submitted so far.
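The check and the partial-write behavior can be modelled in user space
as follows. This is a sketch, not the driver code: demo_clone() stands
in for skb_clone(), and returning -ENOMEM when nothing was submitted is
an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for skb_clone(): yields NULL for the fragment at fail_at. */
static const void *demo_clone(const void *frag, size_t idx, size_t fail_at)
{
	return idx == fail_at ? NULL : frag;
}

/*
 * Submit nfrags fragments of the given lengths. Returns the number of
 * bytes submitted, possibly a partial count, or -ENOMEM if the very
 * first clone fails (assumed error code).
 */
static long demo_ctrl_tx(const size_t *frag_len, size_t nfrags,
			 size_t fail_at)
{
	long sent = 0;

	assert(frag_len || nfrags == 0);

	for (size_t i = 0; i < nfrags; i++) {
		const void *cloned = demo_clone(&frag_len[i], i, fail_at);

		if (!cloned) /* check before any use of the clone */
			return sent ? sent : -ENOMEM;

		sent += (long)frag_len[i];
	}
	return sent;
}
```

With fragments of 10, 20 and 30 bytes, a failure on the third clone
reports 30 bytes submitted, and a failure on the first reports an error.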
Fixes: 36bd28c1cb0d ("wwan: core: Support slicing in port TX flow of WWAN subsystem")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20260612035613.1192486-1-ruoyuw560@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
vsock: consolidate acceptq accounting into core helpers
These patches follow up on commit c05fa14db43e
("vsock/vmci: fix sk_ack_backlog leak on failed handshake")
by consolidating sk_acceptq_added() and sk_acceptq_removed() into
the core vsock helpers so transports cannot forget them.
Raf Dickson [Fri, 12 Jun 2026 04:52:16 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_removed() into vsock_remove_pending()
Callers of vsock_remove_pending() must also call sk_acceptq_removed()
to keep sk_ack_backlog consistent. Move the call into
vsock_remove_pending() itself to make it automatic and prevent future
callers from forgetting it.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-5-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:15 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
virtio and hyperv call sk_acceptq_added() immediately before
vsock_enqueue_accept(). Move the call into vsock_enqueue_accept()
itself so callers cannot forget it and the accounting is consistent.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-4-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:14 +0000 (04:52 +0000)]
vsock: fold sk_acceptq_added() into vsock_add_pending()
Move sk_acceptq_added() into vsock_add_pending() so callers cannot
forget it. vmci is the only transport using the pending list and
is updated accordingly.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-3-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:52:13 +0000 (04:52 +0000)]
vsock: introduce vsock_pending_to_accept() helper
Add vsock_pending_to_accept() to move a socket directly from the
pending list to the accept queue in a single operation, avoiding
the sock_put/sock_hold dance and the sk_acceptq_removed()/
sk_acceptq_added() pair that would otherwise be needed when
calling vsock_remove_pending() followed by vsock_enqueue_accept().
Use it in vmci_transport_recv_connecting_server() where a completed
handshake transitions the socket from pending to accept queue.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-2-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Fri, 12 Jun 2026 04:58:42 +0000 (04:58 +0000)]
vsock: use sk_acceptq_is_full() helper in all transports
Replace the open-coded backlog check with sk_acceptq_is_full().
The helper uses > instead of >=, which is the correct comparison
per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
backlog changes."), and adds READ_ONCE() for proper memory ordering.
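The effect of the comparison change can be seen in a small model. The sketch below is not kernel code: open_coded_is_full() stands for the replaced check as the message describes it (>=), acceptq_is_full() mirrors the > comparison of the helper, and admitted() is a hypothetical counter added only for illustration. READ_ONCE() has no equivalent here and is left out.

```python
def open_coded_is_full(ack_backlog: int, max_ack_backlog: int) -> bool:
    # Replaced check, as described above: full as soon as backlog == limit.
    return ack_backlog >= max_ack_backlog


def acceptq_is_full(ack_backlog: int, max_ack_backlog: int) -> bool:
    # Comparison used by the helper: full only once backlog exceeds the limit.
    return ack_backlog > max_ack_backlog


def admitted(is_full, max_ack_backlog: int) -> int:
    # Count how many connections get queued before the check reports "full".
    queued = 0
    while not is_full(queued, max_ack_backlog):
        queued += 1
    return queued
```

In this model a limit of 0 admits nothing with the >= form and one connection with the > form; for any limit the two differ by exactly one queued connection.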
Florian Westphal [Fri, 12 Jun 2026 09:22:09 +0000 (11:22 +0200)]
selftests: netfilter: add phony nft_offload test
... "phony", because it's not testing offloads, it tests the control
plane code. Also test error unwind via the fault injection framework.
For a proper test, real hardware would be required, given we'd have to
check if 'previously handed off to hardware' offload commands are
properly removed again on failure or rule flush.
Florian Westphal [Fri, 12 Jun 2026 09:22:08 +0000 (11:22 +0200)]
netdevsim: tc: allow to test nf_tables offload control plane code
The actual 'offload' is phony, all commands are ignored: this is only
useful to test control plane code.
Tag the existing callback to permit error injection to test rollback/abort
code in nf_tables. This is also for fuzzers - the fault injection
framework allows probabilistic error insertion.
Wayen.Yan [Fri, 12 Jun 2026 09:37:00 +0000 (17:37 +0800)]
net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()
In airoha_ppe_flush_sram_entries(), the outer "err" variable was never
updated when the inner loop variable shadowed it, causing the function
to always return 0 even when airoha_ppe_foe_commit_sram_entry() fails.
Drop the outer "err" variable and return directly on error, propagating
the error code from airoha_ppe_foe_commit_sram_entry() correctly.
Linus Torvalds [Sat, 13 Jun 2026 15:23:36 +0000 (08:23 -0700)]
Merge tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fix from Ingo Molnar:
- Fix potential debugobjects deadlock on PREEMPT_RT kernels (Waiman
Long)
* tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Don't call fill_pool() in early boot hardirq context
Linus Torvalds [Sat, 13 Jun 2026 15:14:17 +0000 (08:14 -0700)]
Merge tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"The biggest news here is that this is my last pull request as I2C
maintainer after 13.5 years. Starting with the 7.2 cycle, Andi Shyti
is taking over who helped me greatly maintaining the host drivers for
a while now. Thank you, Andi, and good luck with the subsystem. I'll
be around for help, of course.
Technically, there are two patches which might be a tad large for this
late in the cycle, but most of that is explanatory comments, so I think
they are suitable.
- MAINTAINERS:
- hand over I2C maintainership to Andi
- minor updates
- rust: fix I2cAdapter refcount double increment
- imx: keep clock and pinctrl states consistent in runtime PM
- imx-lpi2c: fix DMA resource leaks on PIO fallback
- qcom-cci: fix NULL pointer dereference on remove
- riic: fix reset refcount leak on resume_noirq error path
- stm32f7: account for analog filter in timing computation
- tegra:
- fix suspend/resume handling in NOIRQ phase
- update Tegra410 I2C timings to match hardware specs"
* tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
dt-bindings: i2c: mux-gpio: name correct maintainer
MAINTAINERS: hand over I2C to Andi Shyti
i2c: imx-lpi2c: fix resource leaks switching to devm_dma_request_chan()
MAINTAINERS: i2c: designware: Remove inactive reviewer
i2c: tegra: Fix NOIRQ suspend/resume
i2c: tegra: Update Tegra410 I2C timing parameters
i2c: qcom-cci: Fix NULL pointer dereference in cci_remove()
i2c: stm32f7: fix timing computation ignoring i2c-analog-filter
i2c: imx: fix clock and pinctrl state inconsistency in runtime PM
i2c: riic: fix refcount leak in riic_i2c_resume_noirq()
rust: i2c: fix I2cAdapter refcounts double increment
Thomas Gleixner [Sat, 13 Jun 2026 14:24:29 +0000 (16:24 +0200)]
Merge tag 'timers-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/daniel.lezcano/linux into timers/clocksource
Pull clocksource/driver updates from Daniel Lezcano:
- Remove the sifive,fine-ctr-bits property bindings because it is
redundant information (Nick Hu)
- Remove the TCIU8 interrupt bindings on Renesas, because the
documentation marks the interrupt as reserved and it should not be
described, and fix the conditional reset line for the RZ/{T2H,N2H}
(Cosmin Tanislav)
- Extend schema condition for interrupts to cover the D1 compatible
variant and add the D1 hstimer support (Michal Piekos)
- Update the ARM architected timer support to handle the ACPI GTDT v3
format and the EL2 virtual timer, enabling Linux to use the most
appropriate timer when running with VHE, while also fixing several
Device Trees to accurately reflect the underlying hardware (Marc
Zyngier)
- Clean up the TI DM timer and add the clocksource and the clockevent
to it (Markus Schneider-Pargmann)
- Add multiple watchdog support on the tegra186 and tegra234, and
dedicate one as a kernel watchdog (Kartik Rajput)
- Add the NXP clocksource selection for the scheduler in the Kconfig
(Enric Balletbo i Serra)
WenTao Liang [Thu, 11 Jun 2026 16:17:38 +0000 (00:17 +0800)]
posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path
In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference
via get_pid() and stores it in timer.it.cpu.pid. If the subsequent
posix_cpu_timer_set() call fails, the function returns immediately
without calling posix_cpu_timer_del() to release the pid reference,
causing a leak.
Fix it by calling posix_cpu_timer_del() before the unlock-and-return
on the error path, consistent with the other exit paths in the same
function.
Thomas Gleixner [Tue, 9 Jun 2026 15:14:45 +0000 (17:14 +0200)]
time/jiffies: Register jiffies clocksource before usage
Teddy reported that a XEN HVM has a long boot delay, which was bisected to
the recent enhancements to the negative motion detection. It turned out
that the jiffies clocksource is used in early boot before it is registered,
which leaves the max_delta_raw field at zero. That causes the read out to
be clamped to the max delta of 0, which means time is not making progress.
Cure it by ensuring that it is initialized before its first usage in
timekeeping_init().
Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Teddy Astie <teddy.astie@vates.tech>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13
Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech
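Why a zero maximum delta stalls time can be shown with a deliberately simplified model of the clamping described in the jiffies commit above. This is not the kernel's timekeeping code: read_delta() and its parameters are hypothetical names, and the clamp is assumed to be a plain minimum.

```python
def read_delta(now: int, last: int, mask: int, max_delta: int) -> int:
    # Cycles elapsed since the last update, wrapped to the counter width.
    delta = (now - last) & mask
    # Clamp to the largest delta the clocksource is allowed to report.
    return min(delta, max_delta)
```

With max_delta left at 0, as for a clocksource that was never registered, every readout collapses to 0 no matter how far the counter advanced, which matches the "time is not making progress" symptom.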
The "ti,n-factor" binding and examples allow negative correction
values. Reading it as u32 makes the helper type disagree with the
documented signed value and hides real schema mismatches.
Use the signed helper so the DT access matches the s32 value stored by
the driver.
Keith Busch [Fri, 12 Jun 2026 22:32:04 +0000 (15:32 -0700)]
block: check bio split for unaligned bvec
Offsets and lengths need to be validated against the dma alignment. This
check was skipped for a sufficiently small bio with a single bvec, which
may allow an invalid request to be dispatched to the driver. Force the
validation for an unaligned bvec by forcing the bio split path that
handles this condition.
Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
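The alignment rule from the bio split commit above can be sketched as follows. Both function names are hypothetical, and the mask convention (alignment minus one, e.g. 511 for 512-byte alignment) is an assumption made for the example, not something the commit message states.

```python
def bvec_is_dma_aligned(offset: int, length: int, dma_alignment_mask: int) -> bool:
    # Any bit set under the mask means the offset or the length is misaligned.
    return ((offset | length) & dma_alignment_mask) == 0


def must_take_split_path(bvecs, dma_alignment_mask: int) -> bool:
    # bvecs is a list of (offset, length) pairs. A single small bvec is
    # not exempt: one unaligned entry is enough to force validation.
    return any(
        not bvec_is_dma_aligned(offset, length, dma_alignment_mask)
        for offset, length in bvecs
    )
```

The point of the model is that the decision depends only on alignment, not on how small the bio is or how many bvecs it carries.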
Eric Dumazet [Sat, 13 Jun 2026 04:26:19 +0000 (04:26 +0000)]
nbd: Reclassify sockets to avoid lockdep circular dependency
syzbot reported a possible circular locking dependency in udp_sendmsg()
where fs_reclaim can be triggered while holding sk_lock, and fs_reclaim
can eventually depend on another sk_lock (e.g., if NBD is used for swap
or writeback and NBD uses TLS/TCP which acquires sk_lock).
Since the UDP socket and the NBD TCP/TLS socket are different, this is a
false positive. Fix this by reclassifying NBD sockets to a separate lock
class when they are added to the NBD device.
This is similar to what nvme-tcp and other network block devices do.
Fixes: ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier")
Reported-by: syzbot+607cdcf978b3e79da878@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2cdafe.428ffe26.258b27.0161.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260613042619.1108126-1-edumazet@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 30 May 2026 02:03:47 +0000 (20:03 -0600)]
io_uring/net: make POLL_FIRST receive side checks consistent
io_recv() and io_recvzc() are the odd ones out, as they check
whether POLL_FIRST should be honored before checking if the file is a
socket. It doesn't really matter, but might as well make it consistent
across all receive and send types.
Jens Axboe [Thu, 11 Jun 2026 17:44:47 +0000 (11:44 -0600)]
io_uring: remove the per-ctx fallback task_work machinery
With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.
With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.
Jens Axboe [Thu, 11 Jun 2026 17:41:25 +0000 (11:41 -0600)]
io_uring: run the tctx task_work fallback directly
The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop - if run from the fallback
kernel thread, ts.cancel will get set, and the work terminated.
Jens Axboe [Thu, 11 Jun 2026 16:13:22 +0000 (10:13 -0600)]
io_uring: switch normal task_work to a mpscq
Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.
Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.
Jens Axboe [Wed, 10 Jun 2026 21:19:35 +0000 (15:19 -0600)]
io_uring: switch local task_work to a mpscq
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.
Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
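The difference between the two schemes can be modelled in a few lines. This is an illustration only: a Python list stands in for the llist and collections.deque for the FIFO queue, and llist_push(), run_llist() and run_fifo() are hypothetical names, not kernel functions.

```python
from collections import deque


def llist_push(head: list, item) -> list:
    # An llist adds at the head, so the newest entry comes first (LIFO).
    return [item] + head


def run_llist(head: list, cap: int):
    # Old scheme: reverse the grabbed list (O(n)) to recover queue order,
    # run up to cap entries, and park the leftovers on a retry list.
    in_order = head[::-1]
    return in_order[:cap], in_order[cap:]


def run_fifo(queue: deque, cap: int) -> list:
    # New scheme: pop in queue order; a capped run simply stops early and
    # leaves the remainder on the queue.
    ran = []
    while queue and len(ran) < cap:
        ran.append(queue.popleft())
    return ran
```

Pushing 1, 2, 3, 4 into the list model leaves the head ordered 4, 3, 2, 1, so a capped run needs both the reversal and a second list for entry 4. In the deque model entry 4 just stays queued.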
For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:
most of that overhead is gone, and performance is better as well.
Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.
Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.
Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.
Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.
The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.
The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.
The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to keep the
consumer and producer cachelines separate. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.
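The two-step publish window and the stub handling described above can be walked through with a single-threaded model. It is a sketch under stated assumptions, not the kernel implementation: plain assignments stand in for xchg() and cmpxchg(), there are no memory barriers, and the class layout and the push_begin()/push_finish() split are hypothetical, introduced only so the window between the two steps can be observed.

```python
class Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class Mpscq:
    def __init__(self):
        self.stub = Node()
        self.tail = self.stub  # empty state: tail == stub

    def push_begin(self, node):
        # Step 1, stands in for xchg(): the node becomes the new tail.
        node.next = None
        prev, self.tail = self.tail, node
        return prev

    @staticmethod
    def push_finish(prev, node):
        # Step 2: link the node behind its predecessor.
        prev.next = node

    def push(self, node):
        prev = self.push_begin(node)
        self.push_finish(prev, node)
        return prev

    def empty(self):
        return self.tail is self.stub

    def pop(self, cursor):
        # Returns (node or None, new cursor); the caller owns the cursor.
        node = cursor
        if node is self.stub:
            node = self.stub.next
            if node is None:
                return None, cursor  # empty, or first push not linked yet
        nxt = node.next
        if nxt is not None:
            return node, nxt
        if self.tail is node:
            # Stands in for cmpxchg(tail, node, stub): put the stub back,
            # and only then hand out the final node.
            self.stub.next = None
            self.tail = self.stub
            return node, self.stub
        return None, node  # a producer is between its two steps; retry later
```

A consumer starts with the stub as its cursor. Calling push_begin() without push_finish() reproduces the window from the text: empty() reports false while pop() still returns None, so the caller has to retry later instead of treating the queue as drained.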
Jens Axboe [Fri, 12 Jun 2026 02:27:22 +0000 (20:27 -0600)]
io_uring: grab RCU read lock marking task run
Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.
Paolo Abeni [Sat, 13 Jun 2026 09:50:31 +0000 (11:50 +0200)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-06-09 (idpf, ixgbe, igc)
Przemyslaw adds needed padding to idpf PTP structures to match firmware
expectations.
Larysa bypasses XPS configuration on XDP queues for ixgbe.
Khai Wen corrects offset into packet buffer when handling for frame
preemption on igc.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: skip RX timestamp header for frame preemption verification
ixgbe: do not configure xps for XDP queues
idpf: add padding to PTP virtchnl structures
====================
Ratheesh Kannoth [Wed, 10 Jun 2026 02:23:44 +0000 (07:53 +0530)]
octeontx2-af: npc: Fix size of entry2cntr_map
KASAN prints the splat below. It is caused by allocating a counter for
the reserved mcam entry used for the cpt 2nd pass, while
mcam->entry2cntr_map is not allocated for reserved entries.
BUG: KASAN: slab-out-of-bounds in npc_map_mcam_entry_and_cntr+0xb0/0x1a0
Write of size 2 at addr ffff0001033e7ffe by task kworker/0:1/14
Woojin Ji [Fri, 12 Jun 2026 05:26:55 +0000 (14:26 +0900)]
selftests/bpf: Add arena direct-value one-past-end reject test
BPF_MAP_TYPE_ARENA supports direct-value pseudo loads, but unlike array
maps its map value_size is zero and the valid direct-value range is the
arena mmap size, max_entries * PAGE_SIZE.
Commit 3ac1a467e376 ("bpf: Fix off-by-one boundary validation in arena
direct-value access") fixed arena_map_direct_value_addr() to reject an
offset exactly at the end of the arena mapping. Add a regression test
that loads a BPF_PSEUDO_MAP_VALUE with off == arena_size and verifies
that the verifier rejects it with the expected offset in the log.
This is intentionally kept as a userspace raw-instruction test. I tried
expressing the same BPF_PSEUDO_MAP_VALUE + off == arena_size case in
verifier_arena.c with inline assembly. The only form that produces the
desired instruction bytes uses __imm_addr(arena), but that emits
R_BPF_64_NODYLD32, which the libbpf/bpftool link step rejects. Other
register, immediate, and memory constraints either fail in the BPF
backend or lower to a normal R_BPF_64_64 load followed by an ALU add,
which does not exercise arena_map_direct_value_addr() with the boundary
offset in the second ldimm64 slot.
A legacy test_verifier fixture can express the raw instruction directly,
but it needs arena map creation, mmap, and fixup plumbing in the legacy
runner. That is more intrusive than the small prog_tests raw-instruction
test.
Use the userspace raw-instruction test, following the existing selftests
pattern used for direct map-value pseudo loads, so insns[1].imm can be
set to arena_size precisely.
Assisted-by: ChatGPT:gpt-5.5
Signed-off-by: Woojin Ji <random6.xyz@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Junyoung Jang <graypanda.inzag@gmail.com>
Link: https://lore.kernel.org/r/20260612-arena-direct-value-v1-v4-1-b81b642f5277@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
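The boundary the arena test above exercises can be written down as a range check. The sketch is illustrative only: the function names are hypothetical, PAGE_SIZE is assumed to be 4096, and the off-by-one variant shows one possible shape of such a bug rather than the actual pre-fix kernel code.

```python
PAGE_SIZE = 4096  # assumed page size for illustration


def arena_size(max_entries: int) -> int:
    # Valid direct-value range: the arena mmap size.
    return max_entries * PAGE_SIZE


def off_is_valid(off: int, max_entries: int) -> bool:
    # Offsets 0 .. arena_size - 1 are inside the mapping.
    return 0 <= off < arena_size(max_entries)


def off_is_valid_off_by_one(off: int, max_entries: int) -> bool:
    # One possible off-by-one shape: the one-past-end offset slips through.
    return 0 <= off <= arena_size(max_entries)
```

The regression test corresponds to the single input where the two predicates disagree, off == arena_size.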
Gabriele Monaco [Wed, 10 Jun 2026 09:04:29 +0000 (11:04 +0200)]
rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
restores interrupts. This means preemption is enabled (as part of
raw_res_spin_unlock()) while interrupts are still disabled, so this
cannot trigger an actual preemption.
This is inconsistent with other spinlock implementations
(raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
itself).
Adjust the macro to ensure interrupts are enabled before enabling
preemption, allowing to schedule at that point. Make the same
modification in the error path of raw_res_spin_lock_irqsave().
====================
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno
This series fixes the issue reported by sashiko in [1]. The issue is that,
when a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for void and boolean LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.
Fix it by skipping setting -EPERM for hooks not returning errno.
Xu Kuohai [Wed, 10 Jun 2026 20:17:24 +0000 (20:17 +0000)]
selftests/bpf: Add retval test for bool and errno LSM cgroup hooks
Add test to check the return value when a BPF program exits with 0 for
a boolean and an errno LSM hook.
For each hook, two BPF programs are attached. The first program returns
0 without calling bpf_set_retval() to exercise the return value translation
logic, while the second program reads the retval via bpf_get_retval().
Xu Kuohai [Wed, 10 Jun 2026 20:17:23 +0000 (20:17 +0000)]
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for boolean and void LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.
Fix it by skipping setting -EPERM for hooks not returning errno.
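The return value translation and the fix can be modelled as below. This is a sketch of the behaviour the message describes, not the body of bpf_prog_run_array_cg(): hook_retval() and the hook_returns_errno flag are hypothetical, and MAX_ERRNO = 4095 is the usual kernel bound for errno values.

```python
EPERM = 1
MAX_ERRNO = 4095


def is_errno(value: int) -> bool:
    # Valid errno return values are -MAX_ERRNO .. -1.
    return -MAX_ERRNO <= value < 0


def hook_retval(prog_ret: int, retval: int, hook_returns_errno: bool) -> int:
    # A program exiting with 0 rejects the operation. Only errno-based
    # hooks get -EPERM substituted when retval is not already an errno.
    if prog_ret == 0 and hook_returns_errno and not is_errno(retval):
        return -EPERM
    return retval
```

Without the hook_returns_errno condition, a boolean hook whose program exits with 0 would come back as -EPERM, which is neither true nor false in the sense the hook expects.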
net: qrtr: fix 32-bit integer overflow in qrtr_endpoint_post()
qrtr_endpoint_post() validates an incoming packet with
    if (!size || len != ALIGN(size, 4) + hdrlen)
        goto err;
where size comes from the wire. On 32-bit, size_t is 32 bits and
ALIGN(size, 4) wraps to 0 for size >= 0xfffffffd, so the check
passes and skb_put_data(skb, data + hdrlen, size) writes past the
hdrlen-sized skb and oopses the kernel. 64-bit is unaffected.
This is the 32-bit residual of ad9d24c9429e2 ("net: qrtr: fix OOB
Read in qrtr_endpoint_post"), which fixed only the 64-bit case.
Reject any size that cannot fit the buffer before the ALIGN.
Fixes: ad9d24c9429e2 ("net: qrtr: fix OOB Read in qrtr_endpoint_post")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611125455.2352279-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
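The wrap in the qrtr check above is easy to reproduce by emulating 32-bit arithmetic with a mask. The sketch assumes ALIGN(size, 4) expands to (size + 3) & ~3; the header length of 32 used in the usage note is an illustrative value, and guarded_check_accepts() shows one possible shape of a "reject before the ALIGN" guard, not the patch itself.

```python
U32_MASK = 0xFFFFFFFF


def align4_u32(size: int) -> int:
    # ALIGN(size, 4) in 32-bit arithmetic: the +3 can wrap around.
    return ((size + 3) & U32_MASK) & ~3 & U32_MASK


def old_check_accepts(size: int, length: int, hdrlen: int) -> bool:
    # Mirrors: if (!size || len != ALIGN(size, 4) + hdrlen) goto err;
    if size == 0 or length != (align4_u32(size) + hdrlen) & U32_MASK:
        return False
    return True


def guarded_check_accepts(size: int, length: int, hdrlen: int) -> bool:
    # Assumed guard: reject any size that cannot fit in the buffer
    # before the ALIGN gets a chance to wrap.
    if length < hdrlen or size > length - hdrlen:
        return False
    return old_check_accepts(size, length, hdrlen)
```

For size 0xfffffffd and a 32-byte packet with a 32-byte header, align4_u32() yields 0, so the old check accepts the packet while the guarded variant rejects it. A well-formed packet (size 5, length 40) passes both.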
Dragos Tatulea [Thu, 11 Jun 2026 13:52:30 +0000 (16:52 +0300)]
net/mlx5: Check max_macs devlink param value against max capability
The max_macs devlink param is checked against the FW max value only at
param register time (driver load) and inside the validate callback
(devlink param set). The stored DRIVERINIT value persists across FW
resets and devlink reloads without any further checks against the max.
If the FW link type changes from Ethernet to IB and a FW reset happens,
the MAX cap for log_max_current_uc_list will become zero, but the
previously stored max_macs value remains and is unconditionally
programmed into the HCA caps in handle_hca_cap(). FW will then return a
syndrome during SET_HCA_CAP:
mlx5_cmd_out_err:839:(pid 3831): SET_HCA_CAP(0x109) op_mod(0x0) failed,
status bad parameter(0x3), syndrome (0x537801), err(-22)
set_hca_cap:907:(pid 3831): handle_hca_cap failed
This results in a failure to register the RDMA device.
This patch skips programming log_max_current_uc_list when the MAX
capability is 0 (in case of IB).
Fixes: 8680a60fc1fc ("net/mlx5: Let user configure max_macs generic param")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260611135230.534513-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
psp: Add support for dev-assoc/disassoc
The main purpose of this feature is to associate virtual devices like
veth or netkit with a real PSP device, so we could provide PSP
functionality to the application running with virtual devices.
A typical deployment that works with this feature is as follows:
Host Namespace:
psp_dev_local ←──physically linked──→ psp_dev_peer
(PSP device)
│
│ BPF on psp_dev_local ingress: bpf_redirect_peer() to nk_guest
│
nk_host / veth_host
│
│ BPF on nk_host ingress: bpf_redirect_neigh() to psp_dev_local
│
Guest Namespace (netns):
│
nk_guest / veth_guest
★ PSP application runs here
Remote Namespace (_netns):
psp_dev_peer
★ PSP server application runs here
Note:
The general requirement for this feature to work:
For PSP to work correctly, the egress device at validate_xmit_skb()
time must have psp_dev matching the association's psd. Any device
stacking or traffic redirection that changes the egress device will
cause either:
1. TX validation failure (SKB_DROP_REASON_PSP_OUTPUT) - fail-safe
2. RX policy failure after tx-assoc - packets without PSP extension
are rejected by receiver expecting encrypted traffic
Here are a few examples where this feature would not work:
- Bonding with load balancing in round-robin, XOR, 802.3ad mode across
multiple PSP devices, or mixed PSP and non-PSP devices
- Bonding with active-backup mode might work without PSP migration for
failover case.
- ipvlan/macvlan in bridge mode would not work given packets are
loopbacked locally without going through the PSP device.
====================
Wei Wang [Mon, 8 Jun 2026 23:31:18 +0000 (16:31 -0700)]
selftests/net: psp: add dev-get, no-nsid, and cleanup tests
Add the following 3 tests:
- _psp_dev_get_check_netkit_psp_assoc: verifies dev-get output in both
host and guest namespaces, checking assoc-list, by-association flag,
and nsid values
- _dev_assoc_no_nsid: tests dev-assoc and dev-disassoc without the nsid
attribute, verifying ifindex lookup in the caller's namespace
- _psp_dev_assoc_cleanup_on_netkit_del: verifies that deleting the
associated netkit interface properly cleans up the assoc-list, using
a disposable netkit pair to avoid disturbing the shared environment
Add tests that verify PSP notifications are delivered to listeners in
associated namespaces:
- _key_rotation_notify_multi_ns_netkit: triggers key rotation and
verifies the notification is received in both main and guest namespaces
- _dev_change_notify_multi_ns_netkit: triggers dev_set and verifies the
dev_change notification is received in both namespaces
Wei Wang [Mon, 8 Jun 2026 23:31:16 +0000 (16:31 -0700)]
selftests/net: psp: add dev-assoc data path test
Add _assoc_check_list() test that associates nk_guest with the PSP
device and verifies the assoc-list is correctly populated.
Add _data_basic_send_netkit_psp_assoc() which tests PSP data send
through a netkit interface associated with a PSP device. The test
associates nk_guest with the PSP device, then sends PSP-encrypted
traffic from the guest namespace.
Wei Wang [Mon, 8 Jun 2026 23:31:15 +0000 (16:31 -0700)]
selftests/net: psp: support PSP in NetDrvContEnv infrastructure
Add infrastructure to support PSP tests across network namespaces
using NetDrvContEnv with netkit pairs. This enables testing PSP device
association, where a non-PSP-capable device (e.g. netkit) in a guest
namespace is associated with a real PSP device in the host namespace,
allowing the guest to perform PSP encryption/decryption through the
host's PSP hardware.
env.py:
- nk_guest_ifindex is queried after moving the device into the guest
namespace, so tests can use it directly for dev-assoc
psp.py:
- PSP device lookup supports container environments where the PSP
device is on the physical interface, not the test interface
- Association helpers handle dev-assoc/dev-disassoc with defer-based
cleanup to prevent state leaks on test assertion failures
- main() tries NetDrvContEnv with primary_rx_redirect and falls back
to NetDrvEpEnv, so existing tests continue to work without the
container environment
Wei Wang [Mon, 8 Jun 2026 23:31:14 +0000 (16:31 -0700)]
selftests/net: rename _nk_host_ifname to nk_host_ifname
Rename _nk_host_ifname to nk_host_ifname in NetDrvContEnv to make it
a public attribute, matching the nk_guest_ifname rename. Tests that
access the host-side netkit interface name (e.g. for cleanup after
deleting the netkit pair) no longer trigger pylint protected-access
warnings.
Wei Wang [Mon, 8 Jun 2026 23:31:13 +0000 (16:31 -0700)]
selftests/net: add _find_bpf_obj() to search hw/ for BPF objects
Add _find_bpf_obj() helper to NetDrvContEnv that searches the test
directory first, then falls back to the hw/ subdirectory. This allows
tests outside drivers/net/hw/ (e.g. psp.py in drivers/net/) to find
BPF objects built in the hw/ directory.
Update _attach_bpf() and _attach_primary_rx_redirect_bpf() to use
_find_bpf_obj() for BPF object discovery.
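The lookup order described above could look roughly like this. It is a hypothetical standalone sketch, not the NetDrvContEnv method: the real helper's signature and error handling are not shown in the commit message, so the function name, arguments and the None return are assumptions.

```python
from pathlib import Path


def find_bpf_obj(test_dir, name):
    # Look in the test directory first, then fall back to its hw/ subdirectory.
    base = Path(test_dir)
    for candidate in (base / name, base / "hw" / name):
        if candidate.exists():
            return candidate
    return None  # assumed behaviour when the object is in neither place
```

Checking the test directory first means an object built next to the test always takes precedence over one in hw/.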
Wei Wang [Mon, 8 Jun 2026 23:31:12 +0000 (16:31 -0700)]
selftests/net: psp: refactor test builders to use ksft_variants
Replace the manual psp_ip_ver_test_builder() and ipver_test_builder()
functions with @ksft_variants decorators for data_basic_send and
data_mss_adjust. This is a pure refactor with no behavior change.
Wei Wang [Mon, 8 Jun 2026 23:31:10 +0000 (16:31 -0700)]
psp: add new netlink cmd for dev-assoc and dev-disassoc
The main purpose of this cmd is to be able to associate a
non-psp-capable device (e.g. veth or netkit) with a psp device.
One use case is if we create a pair of veth/netkit, and assign 1 end
inside a netns, while leaving the other end within the default netns,
with a real PSP device, e.g. netdevsim or a physical PSP-capable NIC.
With this command, we could associate the veth/netkit inside the netns
with the PSP device, so the virtual device could act as a PSP-capable
device to initiate PSP connections, and perform PSP encryption/decryption
on the real PSP device.
Wei Wang [Mon, 8 Jun 2026 23:31:09 +0000 (16:31 -0700)]
psp: add admin/non-admin version of psp_device_get_locked
Introduce 2 versions of psp_device_get_locked:
1. psp_device_get_locked_admin(): This version is used for operations
that would change the status of the psd, and is currently used for
dev-set and key-rotation.
2. psp_device_get_locked(): This is the non-admin version, which is
used for broader user issued operations including: dev-get, rx-assoc,
tx-assoc, get-stats.
The following commit implements both of the checks.
Generic XDP devmap multi redirect can leave cloned skbs sharing packet
data. When a devmap egress program mutates packet data, another
destination sharing the same data may observe that mutation.
Fix this by making cloned skbs private before running the generic devmap
egress program. The private copy is made in dev_map_generic_redirect()
so dev_map_bpf_prog_run_skb() can keep returning the XDP action directly.
Add selftest coverage for the last-destination case, where the final
destination runs on the original skb while earlier destinations use
cloned skbs. The test records the source MAC observed by an earlier
destination and checks that it is neither the sentinel value left in the
result map nor the MAC written by the final destination.
---
v5:
- Move the skb_copy() check back to dev_map_generic_redirect() to keep
dev_map_bpf_prog_run_skb() returning only the XDP action.
- Preserve mac_len after skb_copy().
- Use __be64 temporary values when updating mac_map from userspace.
- Initialize rx_mac with a sentinel in the last-destination test instead
of relying on -ENOENT for ARRAY map lookups.
- Adjust the last-destination test topology so the checked earlier
destination is not the ingress/source veth.
- Split the last-destination check into two assertions: one for store_mac_1
updating rx_mac and one for detecting last-destination rewrite leakage.
v4: https://lore.kernel.org/bpf/20260611080850.536996-1-sun.jian.kdev@gmail.com/T/#mf830f03d362f33e0941d1b0e425169698fce76e5
- Preserve mac_len after skb_copy().
- Separate errno return from XDP action output in
dev_map_bpf_prog_run_skb().
- Zero-initialize net_config in the new selftest.
v3: https://lore.kernel.org/bpf/20260611043317.512843-1-sun.jian.kdev@gmail.com/
- Split the kernel fix and selftest into separate patches.
- Move the private-copy logic into dev_map_bpf_prog_run_skb().
- Use deterministic DEVMAP_HASH keys in the last-destination selftest.
- Fix the Fixes tag.
v2: https://lore.kernel.org/bpf/08c35c70-a59e-4e0e-91db-22b5ec30b611@linux.dev/
- Move the private-copy step into dev_map_generic_redirect() so the
last-destination path is covered as well.
- Use skb_copy() instead of skb_unshare() to keep caller ownership
unchanged on allocation failure.
- Add a generic XDP last-destination selftest case.
Strengthen xdp_veth_egress to check that each destination observes the
MAC selected for its own egress ifindex, instead of only checking that
the observed MAC differs from a single magic value.
Add a generic XDP last-destination test where an earlier destination does
not have a devmap egress program while the final destination does. This
covers the case where the final destination runs on the original skb and
could otherwise rewrite packet data still shared with an earlier cloned
skb.
Use deterministic DEVMAP_HASH keys for the egress map so the intended
last destination is stable. Initialize the result map with a sentinel
value and check that store_mac_1 overwrites it before checking that the
earlier destination did not observe the MAC written by the final
destination.
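The two assertions described above can be sketched as follows. This is a minimal Python illustration, not the selftest itself: the all-zero `SENTINEL` value and the helper name `check_last_destination` are assumptions made for the example.

```python
SENTINEL = bytes(6)  # assumed sentinel: an all-zero MAC left in the result map


def check_last_destination(rx_mac: bytes, final_dst_mac: bytes) -> list:
    """Return the failed checks; an empty list means both assertions hold."""
    failures = []
    if rx_mac == SENTINEL:
        # First assertion: the earlier destination's program must have run.
        failures.append("store_mac_1 did not update rx_mac")
    elif rx_mac == final_dst_mac:
        # Second assertion: the final destination's rewrite must not leak.
        failures.append("earlier destination observed the final rewrite")
    return failures
```

Checking the sentinel first keeps a "program never ran" failure distinct from a "rewrite leaked" failure.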
Sun Jian [Fri, 12 Jun 2026 11:40:31 +0000 (19:40 +0800)]
bpf: Run generic devmap egress prog on private skb
Generic XDP devmap multi redirect uses skb_clone() for intermediate
destinations and sends the last destination with the original skb. This
can leave multiple destinations sharing the same packet data.
This becomes visible after generic devmap egress-program support was
added: a devmap egress program may mutate packet data, and another
destination sharing the same data can observe that mutation.
Native XDP broadcast redirect does not have this issue because
xdpf_clone() copies the frame data for each destination. Generic XDP
should provide the same per-destination isolation before running a
devmap egress program.
Fix this by making cloned skbs private before running the generic devmap
egress program. Use skb_copy() instead of skb_unshare() so allocation
failure does not consume the skb and the existing caller error paths keep
their ownership semantics.
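The sharing problem can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: the `Skb` class and its `clone()` and `copy()` methods are hypothetical stand-ins for struct sk_buff, skb_clone() and skb_copy(), and `rewrite_src_mac()` stands in for a devmap egress program that mutates packet data.

```python
class Skb:
    """Toy stand-in for struct sk_buff: a header pointing at packet data."""

    def __init__(self, data: bytearray):
        self.data = data

    def clone(self) -> "Skb":
        # Like skb_clone(): new header, same underlying data buffer.
        return Skb(self.data)

    def copy(self) -> "Skb":
        # Like skb_copy(): new header and a private copy of the data.
        return Skb(bytearray(self.data))


def rewrite_src_mac(skb: Skb, mac: bytes) -> None:
    """Hypothetical egress program: overwrite the Ethernet source MAC."""
    skb.data[6:12] = mac


def broadcast(orig: Skb, egress_macs, private: bool):
    """Send to each destination; the last one is sent the original skb.

    egress_macs holds one entry per destination: the MAC its egress
    program writes, or None if that destination has no egress program.
    Returns the source MAC each destination ends up observing.
    """
    sent = []
    for i, mac in enumerate(egress_macs):
        last = i == len(egress_macs) - 1
        skb = orig if last else orig.clone()
        if mac is not None:
            if private:
                skb = skb.copy()  # make the data private before mutating
            rewrite_src_mac(skb, mac)
        sent.append(skb)
    return [bytes(s.data[6:12]) for s in sent]
```

With `private=False`, the rewrite done for the last destination is visible to the earlier clone; with `private=True`, each destination sees only its own data.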
This series continues the rework of the KSZ driver initiated by two previous
series (see [1] & [2]).
The KSZ driver handles more than 20 switches split in several families.
This was previously handled through a common set of dsa_switch_ops
operations that used device-specific ksz_dev_ops callbacks. The two
previous series have split this common struct dsa_switch_ops into 5
to connect the ksz_dev_ops implementations directly to the new
dsa_switch_ops.
This series continues in the same vein and removes the dsa_switch_ops
operations that aren't used.
On top of this on-going rework I added PTP and periodic output support for
the KSZ8463 (which was my first goal). There are still more than 20 patches
left for all this, so this series will be followed by three others. If you
want to see the full picture, you can check my GitHub ([3]).
FYI, I only have a KSZ8463 so, unfortunately, I can't test other switches.
The next series is going to move out of ksz_common.c the last remaining
functions that aren't truly common to all KSZ switches. The series after
that will add PTP support for the KSZ8463 and the final one will add
periodic output support for the KSZ8463.
net: dsa: microchip: implement port_teardown only if needed
The port_teardown() operation is optional. Yet, it is implemented by all
the KSZ switches through a common function that doesn't do anything for
the switches that aren't part of the ksz9477 family.
Remove the implementation from the switches that don't need it.
Implement instead a ksz9477-specific port_teardown.
All the switches use a common mdio_register() function that uses two
ksz_dev_ops callbacks (.mdio_bus_preinit() and .create_phy_addr_map())
to handle the lan937x specific case. These two callbacks are used only
at this place in the code.
Implement a new lan937x-specific MDIO registration function that uses
these two lan937x-specific functions. The lan937x bindings don't
have any 'interrupts' property so this lan937x_mdio_register() doesn't
call ksz_irq_phy_setup().
Expose the common ksz_*_mdio_{read/write} functions so they can be used
in lan937x.c
Remove the callbacks from ksz_dev_ops.
net: dsa: microchip: implement .{get/set}_wol only if needed
All the KSZ switches use common {get/set}_wol operations while only the
ksz9477 and the ksz87xx families really support it. These operations are
optional so there is no point implementing them to return -EOPNOTSUPP.
Remove the {get/set}_wol callbacks from the switch operations for the
ksz88xx, the ksz8463 and the lan937x families.
Remove the family check from the common {get/set}_wol implementation.
Note that is_ksz9477() is only true for the KSZ9477 so this change will
also add WoL support for the other switches using the
ksz9477_switch_ops. I checked their datasheets: they implement the same
PME_WOL registers at the same addresses, so this should be fine.
Modify the ksz_wol_pre_shutdown() initial check to ensure consistency in
the WoL handling for these non-KSZ9477 switches using ksz9477_switch_ops.
net: dsa: microchip: implement .support_eee() only if needed
The .support_eee() operation is optional. Yet, it is implemented by the
KSZ switches through a common function that reports false for every chip
except for KSZ8563, KSZ9563 and KSZ9893 from the KSZ9477 family.
Remove the implementation from the switches that don't support EEE.
Also remove .set_mac_eee() for them as .set_mac_eee() is gated by the
`support_eee` presence in the core.
Implement instead a ksz9477-specific support_eee for these three supported
switches.
Note that the comment /* KSZ879x/KSZ877x/KSZ876x Errata DS80000687C Module 2 */
is removed entirely because it concerns the KSZ87xx family, which doesn't
support EEE at all.
The setup_rgmii_delay() operation is used only once, during the common
phylink MAC configuration. Only the lan937x switch implements
setup_rgmii_delay().
Remove the setup_rgmii_delay operation from ksz_dev_ops.
Implement a lan937x-specific phylink MAC configuration that does this
RGMII delay setup.
Export ksz_set_xmii since it's needed by the lan937x implementation.
net: dsa: microchip: wrap the MAC configuration checks in a function
The common .mac_config() implementation checks some conditions before
doing any register access. As this common implementation is about to be
split in the upcoming patch, these checks would lead to code
duplication.
Wrap all the checks in a need_config() function that returns true when
the driver really needs to access the switch registers to configure the
MAC.
net: dsa: microchip: implement get_phy_flags only if needed
The common ksz_get_phy_flags() is used by all the switches to implement
the optional .get_phy_flags DSA operation. It always returns 0 except
for KSZ88X3 switches where an erratum has to be handled.
Make ksz_get_phy_flags() ksz88xx-specific.
Remove the get_phy_flags implementation for the switches that don't need
it.
net: dsa: microchip: remove useless common cls_flower_{add/del} operations
All the KSZ switches share a common implementation of the
cls_flower_{add/del} operations. These common implementations return
ksz9477-specific implementations for the KSZ9477 family and -EOPNOTSUPP
for the others. -EOPNOTSUPP is already returned by the DSA core when
the operation isn't implemented.
Remove the common implementations.
Directly link the ksz9477_cls_flower_{add/del}() to the KSZ9477 callback.
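Why the common wrappers are redundant can be shown with a toy model of optional-callback dispatch. This is a hedged Python sketch, not the DSA core: `SwitchOps`, `core_cls_flower_add()` and the stub `ksz9477_cls_flower_add()` are simplified stand-ins introduced for the example.

```python
import errno


class SwitchOps:
    """Toy ops table; None means the driver leaves the operation out."""

    def __init__(self, cls_flower_add=None):
        self.cls_flower_add = cls_flower_add


def core_cls_flower_add(ops: SwitchOps, port: int) -> int:
    # Model of the core: a missing optional callback yields -EOPNOTSUPP,
    # so a driver wrapper that only returns -EOPNOTSUPP adds nothing.
    if ops.cls_flower_add is None:
        return -errno.EOPNOTSUPP
    return ops.cls_flower_add(port)


def ksz9477_cls_flower_add(port: int) -> int:
    # Stub standing in for the family-specific implementation.
    return 0


other_family_ops = SwitchOps()                     # callback left unset
ksz9477_ops = SwitchOps(ksz9477_cls_flower_add)    # linked directly
```

Leaving the callback unset for the other families produces the same error as the removed common implementation did.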
Victor Nogueira [Thu, 11 Jun 2026 20:58:49 +0000 (17:58 -0300)]
net/sched: sch_dualpi2: Add missing module alias
When a qdisc is added by name, the kernel tries to autoload its module
via request_qdisc_module(), which calls:
request_module(NET_SCH_ALIAS_PREFIX "%s", name);
i.e. it asks modprobe to resolve the "net-sch-<kind>" alias (e.g.
"net-sch-dualpi2") rather than the module's file name. Since dualpi2
was shipped without this alias, the autoload fails:
tc qdisc add dev lo root handle 1: dualpi2
Error: Specified qdisc kind is unknown.
Fix this by adding the missing alias so the qdisc is autoloaded on demand
like the others.
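The failure mode can be modeled in a few lines. This is an illustrative Python sketch only: the alias tables are made-up examples of what modprobe would know about, and this `request_qdisc_module()` merely mimics the alias lookup described above.

```python
NET_SCH_ALIAS_PREFIX = "net-sch-"


def request_qdisc_module(kind: str, aliases: dict):
    """Return the module the alias resolves to, or None if lookup fails."""
    return aliases.get(NET_SCH_ALIAS_PREFIX + kind)


# Assumed alias tables before and after the fix (illustrative contents).
aliases_before = {"net-sch-fq_codel": "sch_fq_codel"}
aliases_after = {**aliases_before, "net-sch-dualpi2": "sch_dualpi2"}
```

The lookup is keyed on the alias, never on the module file name, which is why the module must declare the alias itself.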
Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20260611205849.3287640-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 11 Jun 2026 17:21:49 +0000 (10:21 -0700)]
docs: networking: add guidance on what to push via extack
Every now and then someone tries to duplicate extack
messages to dmesg. Document our guidance against this.
Also indicate that system level faults should continue
to go to system logs. The high level thinking is to try
to distinguish between what's important to the user vs
system admin.
Vadim Fedorenko [Thu, 11 Jun 2026 19:03:33 +0000 (19:03 +0000)]
ptp: ocp: add shutdown callback
The shutdown callback was never implemented for this driver, but it's
needed because the .remove() callback is never called during the
kexec/reboot process. That leaves the HW with some interrupts enabled and
may cause spurious interrupts while booting into a new kernel with kexec.
If it happens that I2C interrupt fires during kexec, the whole I2C bus
is disabled leaving TimeCard with no devlink communication. The same
happens if timestampers were enabled, leaving the card without
timestamper interrupts until full reboot cycle.
Implement .shutdown() callback with the same function as remove
callback.
Ido Schimmel [Thu, 11 Jun 2026 15:46:05 +0000 (18:46 +0300)]
selftests: fib_tests: Add test cases for route lookup with oif
Test that both address families respect the oif parameter when a
matching multipath route is found, regardless of the presence of a
source address.
Output without "ipv6: Select best matching nexthop object in
fib6_table_lookup()" and "ipv6: Honor oif when choosing nexthop for
locally generated traffic":
IPv4 multipath oif test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with nexthop object test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with VRF test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [FAIL]
TEST: IPv6 multipath via second nexthop with source address [FAIL]
IPv6 multipath oif with nexthop object test
TEST: IPv6 multipath via first nexthop [FAIL]
TEST: IPv6 multipath via second nexthop [FAIL]
TEST: IPv6 multipath via first nexthop with source address [FAIL]
TEST: IPv6 multipath via second nexthop with source address [FAIL]
IPv6 multipath oif with VRF test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [FAIL]
TEST: IPv6 multipath via second nexthop with source address [FAIL]
Output with both patches applied:
IPv4 multipath oif test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with nexthop object test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv4 multipath oif with VRF test
TEST: IPv4 multipath via first nexthop [ OK ]
TEST: IPv4 multipath via second nexthop [ OK ]
TEST: IPv4 multipath via first nexthop with source address [ OK ]
TEST: IPv4 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [ OK ]
TEST: IPv6 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif with nexthop object test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [ OK ]
TEST: IPv6 multipath via second nexthop with source address [ OK ]
IPv6 multipath oif with VRF test
TEST: IPv6 multipath via first nexthop [ OK ]
TEST: IPv6 multipath via second nexthop [ OK ]
TEST: IPv6 multipath via first nexthop with source address [ OK ]
TEST: IPv6 multipath via second nexthop with source address [ OK ]
Ido Schimmel [Thu, 11 Jun 2026 15:46:04 +0000 (18:46 +0300)]
ipv6: Honor oif when choosing nexthop for locally generated traffic
Commit 741a11d9e410 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is
set") made the kernel honor the oif parameter when specified as part of
output route lookup:
# ip route add 2001:db8:1::/64 dev dummy1
# ip route add ::/0 dev dummy2
# ip route get 2001:db8:1::1 oif dummy2 fibmatch
default dev dummy2 metric 1024 pref medium
Due to regression reports, the behavior was partially reverted in commit
d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr
set") to only honor the oif if source address is not specified:
# ip route get 2001:db8:1::1 from 2001:db8:2::1 oif dummy2 fibmatch
2001:db8:1::/64 dev dummy1 metric 1024 pref medium
That is, when source address is specified, the kernel will choose the
most specific route even if its nexthop device does not match the
specified oif.
This creates a problem for multipath routes. After looking up a route,
when source address is not specified, the kernel will choose a nexthop
whose nexthop device matches the specified oif:
# sysctl -wq net.ipv6.conf.all.forwarding=1
# ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
# for i in {1..100}; do ip route get 2001:db8:10::${i} oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
But will disregard the oif when source address is specified despite the
fact that a matching nexthop exists:
# for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
53 dummy1
47 dummy2
This behavior differs from IPv4:
# ip address add 192.0.2.1/32 dev lo
# ip route add 198.51.100.0/24 nexthop via inet6 fe80::1 dev dummy1 nexthop via inet6 fe80::2 dev dummy2
# for i in {1..100}; do ip route get 198.51.100.${i} from 192.0.2.1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
What happens is that fib6_table_lookup() returns a route with a matching
nexthop device (assuming it exists):
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
But it is later overwritten during path selection in fib6_select_path()
which instead chooses a nexthop according to the calculated hash.
Solve this by telling fib6_select_path() to skip path selection if we
have an oif match during output route lookup (iif being
LOOPBACK_IFINDEX).
Behavior after the change:
# sysctl -wq net.ipv6.conf.all.forwarding=1
# ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
# for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
Note that enabling forwarding is only needed because we did not add
neighbor entries for the gateway addresses. When forwarding is disabled
and CONFIG_IPV6_ROUTER_PREF is not enabled in kernel config, the kernel
will treat non-existing neighbor entries as errors and perform
round-robin between the nexthops:
# sysctl -wq net.ipv6.conf.all.forwarding=0
# for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
50 dummy1
50 dummy2
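The effect of skipping path selection can be sketched with a toy model. This is a simplified Python illustration under stated assumptions, not the kernel logic: nexthops are reduced to device names, `select_nexthop()` and its `skip_on_oif_match` switch are hypothetical, and the modulo stands in for the real hash-based selection.

```python
LOOPBACK_IFINDEX = 1  # iif value used for locally generated traffic


def select_nexthop(nexthops, flow_hash, oif=None, iif=LOOPBACK_IFINDEX,
                   skip_on_oif_match=True):
    """Pick a nexthop device name for one lookup.

    skip_on_oif_match=False models the old behavior with a source address
    set; True models the behavior after the change.
    """
    # Lookup step: find a nexthop whose device matches the requested oif.
    match = next((nh for nh in nexthops if nh == oif), None)
    if skip_on_oif_match and match is not None and iif == LOOPBACK_IFINDEX:
        return match  # output lookup with an oif match: keep it
    # Otherwise fall back to hash-based path selection.
    return nexthops[flow_hash % len(nexthops)]


nhs = ["dummy1", "dummy2"]
old = {select_nexthop(nhs, h, oif="dummy2", skip_on_oif_match=False)
       for h in range(100)}
new = {select_nexthop(nhs, h, oif="dummy2") for h in range(100)}
```

In this model `old` contains both devices while `new` contains only dummy2, mirroring the before and after transcripts above.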
Ido Schimmel [Thu, 11 Jun 2026 15:46:03 +0000 (18:46 +0300)]
ipv6: Select best matching nexthop object in fib6_table_lookup()
Currently, when using multipath routes without nexthop objects,
fib6_table_lookup() selects the nexthop with the highest score. This
means that when both a source address and an oif are specified, the
nexthop that is chosen is the one that matches in terms of oif:
# sysctl -wq net.ipv6.conf.all.forwarding=1
# ip address add 2001:db8:2::1/64 dev lo
# ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
When using nexthop objects, fib6_table_lookup() selects the first
matching nexthop and not necessarily the one with the highest score:
# ip nexthop add id 1 via fe80::1 dev dummy1
# ip nexthop add id 2 via fe80::2 dev dummy2
# ip nexthop add id 3 group 1/2
# ip route add 2001:db8:20::/64 nhid 3
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
This is not very significant right now because the nexthop is later
overwritten during path selection in fib6_select_path(). However, the
next patch is going to skip path selection when we have an oif match
during output route lookup.
As a preparation for this change, align the nexthop object behavior with
the legacy one and make sure that fib6_table_lookup() always selects the
best matching nexthop. Do that by always returning 0 from
rt6_nh_find_match() in order not to terminate the loop in
nexthop_for_each_fib6_nh() and storing in arg->nh the best matching
nexthop so far.
Behavior after the change:
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy1
# perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
# perf script | grep -o dummy[0-9] | sort | uniq -c
100 dummy2
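The difference between stopping at the first match and tracking the best match can be sketched as follows. This is a hedged Python model, not the kernel code: `for_each_nh()` imitates an iterator that stops on a non-zero callback return, and the `score()` values are hypothetical.

```python
def for_each_nh(nexthops, cb, arg):
    """Call cb for each nexthop; stop at the first non-zero return."""
    for nh in nexthops:
        ret = cb(nh, arg)
        if ret:
            return ret
    return 0


def score(nh, oif):
    # Hypothetical scoring: a device match scores higher than a non-match,
    # but a non-match is still an acceptable candidate.
    return 2 if nh == oif else 1


def find_first_match(nh, arg):
    # Old behavior: any acceptable nexthop terminates the loop.
    arg["nh"] = nh
    return 1


def find_best_match(nh, arg):
    # New behavior: remember the best nexthop so far and never stop early.
    s = score(nh, arg["oif"])
    if s > arg["best"]:
        arg["best"], arg["nh"] = s, nh
    return 0


def lookup(nexthops, oif, cb):
    arg = {"oif": oif, "best": -1, "nh": None}
    for_each_nh(nexthops, cb, arg)
    return arg["nh"]
```

With the group dummy1/dummy2 and oif dummy2, the first-match callback returns dummy1 while the best-match callback returns dummy2.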
Breno Leitao [Wed, 10 Jun 2026 14:26:04 +0000 (07:26 -0700)]
netconsole: clear cached dev_name on resume-window cleanup
When process_resume_target() catches a device that was unregistered
while the target was off target_list, it calls do_netpoll_cleanup() to
release the reference but leaves the cached np.dev_name in place. The
other cleanup path, netconsole_process_cleanups_core(), already wipes
dev_name for MAC-bound targets because the name was only a cache of the
device that last carried the MAC and may no longer match.
The pattern is the same in both spots, so fold it into a small helper
netcons_release_dev() and route both call sites through it. This makes
the resume-window cleanup consistent with the notifier-driven one so a
later enable does not let netpoll_setup() pick a stale interface by name
when the user bound the target by MAC.
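The helper's intent can be sketched with a toy model. This is an assumption-laden Python illustration, not the driver code: the `Target` fields are simplified stand-ins, and the body of `netcons_release_dev()` only reflects the behavior described above (drop the device reference, and forget the cached name for MAC-bound targets).

```python
class Target:
    """Toy netconsole target with only the fields the example needs."""

    def __init__(self, dev_name: str, bound_by_mac: bool):
        self.dev_name = dev_name
        self.bound_by_mac = bound_by_mac
        self.dev_held = True  # stands in for the held device reference


def netcons_release_dev(target: Target) -> None:
    # Release the device reference (models do_netpoll_cleanup()).
    target.dev_held = False
    # For MAC-bound targets the name was only a cache of the device that
    # last carried the MAC, so drop it to avoid a stale name match later.
    if target.bound_by_mac:
        target.dev_name = ""


by_mac = Target("eth0", bound_by_mac=True)
by_name = Target("eth1", bound_by_mac=False)
netcons_release_dev(by_mac)
netcons_release_dev(by_name)
```

Routing both cleanup paths through one helper keeps the name-bound case unchanged while the MAC-bound case loses its cached name.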