git.ipfire.org Git - thirdparty/linux.git/log

idpf: Replace use of system_unbound_wq with system_dfl_wq

This patch continues the effort to refactor workqueue APIs, which began
with the changes introducing new workqueues and a new alloc_workqueue flag:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.

Before that can happen, workqueue users must be converted to the better-named
new workqueues with no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20260609213559.178657-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'octeontx2-af-npc-enhancements'

Ratheesh Kannoth says:

====================
octeontx2-af: npc: Enhancements.

This series extends Marvell octeontx2-af support for CN20K NPC (MCAM
debuggability, allocation policy, default-rule lifetime, optional KPU
profiles from firmware files, X2/X4 MCAM keyword handling in flows and
defaults, and dynamic CN20K NPC private state), adds a devlink mechanism
for multi-value parameters, and moves devlink_nl_param_fill() temporaries
to the heap so stack usage stays reasonable once union devlink_param_value
grows (patch 3).

Patch 1 enforces a single RVU admin-function PCI device in the kernel.
On Octeon series SoCs, hardware resources such as NPC, NIX and related
blocks are global and coordinated by the AF driver; PFs and VFs request
them through AF mailbox messages. Firmware exposes only one AF PCI
function at boot, so two AF driver instances cannot both own that state.
rvu_probe() rejects a second bind with -EBUSY, logs a warning, clears the
probe gate on early allocation failures, and aligns the driver model with
hardware so reviewers and automation can rely on exactly one bound AF.

Patch 2 improves CN20K MCAM visibility in debugfs: mcam_layout marks
enabled entries, dstats reports per-entry hit deltas (baseline updated in
software after each read; hardware counters are not cleared), and mismatch
lists enabled entries without a PF mapping.

Patch 3 allocates the per-configuration-mode union devlink_param_value
buffers and struct devlink_param_gset_ctx used by devlink_nl_param_fill()
with kcalloc()/kzalloc_obj() and funnels failures through a single cleanup
path so the netlink reply path stays safe as the union grows.

Patch 4 (Saeed) introduces DEVLINK_PARAM_TYPE_U64_ARRAY and nested
DEVLINK_ATTR_PARAM_VALUE_DATA attributes so drivers and user space can
exchange bounded u64 arrays; YAML, uapi, and netlink validation are
updated.

Patch 5 adds a runtime devlink parameter srch_order to reorder CN20K
subbank search during MCAM allocation (the param uses the u64 array type
from patch 4).

Patch 6 ties default MCAM entries to NIX LF alloc/free on CN20K, adds
NIX_LF_DONT_FREE_DFT_IDXS for PF teardown paths that must not drop default
NPC indexes while the driver still owns state, and tightens nix_lf_alloc
error propagation.

Patch 7 allows loading a custom KPU profile from /lib/firmware/kpu via
module parameter kpu_profile, with cam2 / ptype_mask wiring and helpers
that share firmware-sourced vs filesystem-sourced profile layouts.

Patch 8 makes default-rule allocation, AF flow install, and PF-side RSS,
defaults, and ethtool flows respect the active CN20K MCAM keyword width
(X2 vs X4), including X4 reference-index masking and -EOPNOTSUPP when a
flow needs X4 keys on an X2-only profile.

Patch 9 replaces file-scope npc_priv and static dstats with allocation
sized from discovered bank/subbank geometry, threads npc_priv_get()
through CN20K NPC paths, and allocates dstats via devm_kzalloc for the
debugfs helper.

Patch 1 is ordered first so later patches assume a single bound AF.
Heap-backed devlink_nl_param_fill() sits immediately before the U64 array
param work so incremental builds stay stack-safe as the union grows; the
CN20K patches keep srch_order ahead of NIX LF coordination, optional KPU
profile load from firmware files, X2/X4 handling, and the npc_priv refactor
that touches the same files heavily.
====================

Link: https://patch.msgid.link/20260609040453.711932-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Allocate npc_priv and dstats dynamically.

Replace the file-scope static npc_priv with a kcalloc'd struct filled
from hardware bank/subbank geometry at init (num_banks is no longer a
const compile-time constant; drop init_done and use a non-NULL
npc_priv pointer for liveness). Thread npc_priv_get() / pointer access
through the CN20K NPC code paths, extend teardown to kfree the root
struct on failure and in npc_cn20k_deinit, and adjust MCAM section
setup to use the discovered subbank count.

Allocate MCAM debugfs dstats via devm_kzalloc instead of a static matrix,
and use the allocated backing store consistently when computing deltas
(including the counter rollover compare).

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-10-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2: cn20k: Respect NPC MCAM X2/X4 profile in flows and DFT alloc

Default CN20K NPC rule allocation now keys off the active MCAM keyword
width: use X4 with a bank-masked reference index when the silicon uses
X4 keys, and X2 with the raw index otherwise (replacing the previous
always-X2 / eidx + 1 behaviour).

In the AF flow-install path, flows that need more than 256 key bits
query the NPC profile; if the platform is fixed to X2 entries, fail
with -EOPNOTSUPP instead of requesting X4. Otherwise select X4 for the
MCAM alloc.

On the PF, cache and pass the profile kw_type from npc_get_pfl_info
through otx2_mcam_pfl_info_get(), and use it when allocating MCAM
entries for RSS/defaults and when installing ethtool flows on CN20K,
including masking the reference index for X4 slot layout.
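
The AF flow-install rule described above can be sketched in plain C. This
is only an illustration: select_kw_type(), enum kw_type and X2_MAX_KEY_BITS
are hypothetical names, not the driver's identifiers, and choosing X2 for
keys of 256 bits or fewer is an assumption about the unchanged path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the driver's identifiers. */
enum kw_type { KW_X2, KW_X4 };

#define X2_MAX_KEY_BITS 256	/* threshold quoted in the commit message */

/*
 * Pick the MCAM keyword width for a flow.
 * Returns 0 and stores the width in *out, or -EOPNOTSUPP when the flow
 * needs X4 keys but the profile is fixed to X2 entries.
 */
int select_kw_type(unsigned int key_bits, bool profile_x2_only,
		   enum kw_type *out)
{
	assert(out != NULL);

	if (key_bits <= X2_MAX_KEY_BITS) {
		*out = KW_X2;	/* assumed: narrow keys stay on X2 */
		return 0;
	}
	if (profile_x2_only)
		return -EOPNOTSUPP;

	*out = KW_X4;
	return 0;
}
```

A caller would check the return value before requesting the MCAM
allocation, so an X2-only platform never asks for X4 entries.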

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-9-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: Support for custom KPU profile from filesystem

Flashing updated firmware on deployed devices is cumbersome. Provide a
mechanism to load a custom KPU (Key Parse Unit) profile directly from
the filesystem at module load time.

When the rvu_af module is loaded with the kpu_profile parameter, the
specified profile is read from /lib/firmware/kpu and programmed into
the KPU registers. Add npc_kpu_profile_cam2 for the extended cam format
used by filesystem-loaded profiles and support ptype/ptype_mask in
npc_config_kpucam when profile->from_fs is set.

Usage:
  1. Copy the KPU profile file to /lib/firmware/kpu.
  2. Build OCTEONTX2_AF as a module.
  3. Load: insmod rvu_af.ko kpu_profile=<profile_name>

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-8-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2: cn20k: Coordinate default rules with NIX LF lifecycle

Add NIX_LF_DONT_FREE_DFT_IDXS so the PF can send NIX LF free during hw
reinit or teardown without the AF freeing CN20K default NPC rule indexes
while the driver still owns that state (otx2_init_hw_resources and
otx2_free_hw_resources).

On CN20K, allocate default NPC rules from NIX LF alloc before
nix_interface_init, roll back with npc_cn20k_dft_rules_free on failure,
and free from NIX LF free when the new flag is not set. Tighten
rvu_mbox_handler_nix_lf_alloc error handling: use a single rc, propagate
qmem_alloc and other errors, and set -ENOMEM only when kcalloc fails
(remove the blanket -ENOMEM at the free_mem path).

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-7-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: add subbank search order control

CN20K NPC MCAM is split into 32 subbanks that are searched in a
predefined order during allocation. Lower-numbered subbanks have
higher priority than higher-numbered ones.

Add a runtime devlink parameter, "srch_order", to control the order in
which subbanks are searched during MCAM allocation.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-6-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Implement devlink param multi attribute nested data values

The devlink param value attribute is not defined, since devlink handles
value validation and parsing internally. This allows us to implement
multi-attribute values without breaking any policies.

Devlink param multi-attribute values are treated as dynamically sized
arrays of u64 values. With the new devlink param type
DEVLINK_PARAM_TYPE_U64_ARRAY, drivers and user space can set a variable
count of u64 values in the DEVLINK_ATTR_PARAM_VALUE_DATA attribute.

Implement get/set parsing and add to the internal value structure passed
to drivers.

This is useful for devices that need to configure a list of values for
a specific configuration.

example:
$ devlink dev param show pci/... name multi-value-param
name multi-value-param type driver-specific
values:
cmode permanent value: 0,1,2,3,4,5,6,7

$ devlink dev param set pci/... name multi-value-param \
value 4,5,6,7,0,1,2,3 cmode permanent

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-5-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: heap-allocate param fill buffers in devlink_nl_param_fill

devlink_nl_param_fill() kept two per-configuration-mode copies of
union devlink_param_value plus a struct devlink_param_gset_ctx on the
stack while building the Netlink reply. Allocate those with kcalloc()
and kzalloc_obj() instead, and route failures through a single cleanup
path so temporary buffers are always freed.
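
The allocation pattern can be sketched in userspace C, with calloc()
standing in for kcalloc()/kzalloc_obj(). The type and function names below
(union param_value, struct gset_ctx, param_fill_sketch()) are hypothetical
stand-ins, not the devlink definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types named in the commit. */
union param_value { unsigned long long vu64; char vstr[32]; };
struct gset_ctx { union param_value val; int cmode; };

/*
 * Build a reply using heap temporaries instead of stack arrays.
 * Every exit goes through the "out" label, so the buffers are freed
 * on both the success and the failure path.
 */
int param_fill_sketch(size_t n_cmodes)
{
	union param_value *vals_a = NULL, *vals_b = NULL;
	struct gset_ctx *ctx = NULL;
	int err = 0;

	assert(n_cmodes > 0);

	vals_a = calloc(n_cmodes, sizeof(*vals_a));
	vals_b = calloc(n_cmodes, sizeof(*vals_b));
	ctx = calloc(1, sizeof(*ctx));
	if (!vals_a || !vals_b || !ctx) {
		err = -ENOMEM;
		goto out;
	}

	/* Placeholder for filling the per-mode values and the reply. */
	for (size_t i = 0; i < n_cmodes; i++) {
		vals_a[i].vu64 = i;
		vals_b[i].vu64 = i;
	}
	ctx->cmode = 0;

out:
	free(ctx);	/* free(NULL) is a no-op, so partial failure is safe */
	free(vals_b);
	free(vals_a);
	return err;
}
```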

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-4-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: debugfs enhancements

Improve MCAM visibility and field debugging for CN20K NPC.

- Extend "mcam_layout" to show enabled (+) or disabled state per entry
  so status can be verified without parsing the full "mcam_entry" dump.
- Add "dstats" debugfs entry: for enabled MCAM indices, print hit deltas
  since the prior read by comparing hardware counters to a per-entry
  software baseline and advancing that baseline after each read (hardware
  counters are not cleared).
- Add "mismatch" debugfs entry: lists MCAM entries that are enabled
  but not explicitly allocated, helping diagnose allocation/field issues.
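
The "dstats" delta logic can be illustrated with a small C sketch.
dstats_delta() is a hypothetical helper, and the full 64-bit counter width
is an assumption; the driver's actual rollover handling may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the hits since the previous read and advance the software
 * baseline. The hardware counter value is only read, never cleared.
 * Unsigned subtraction gives the right delta across a single wrap,
 * assuming (hypothetically) a full 64-bit counter.
 */
uint64_t dstats_delta(uint64_t hw_count, uint64_t *baseline)
{
	uint64_t delta;

	assert(baseline != NULL);

	delta = hw_count - *baseline;
	*baseline = hw_count;
	return delta;
}
```

Reading twice with no traffic in between yields a delta of zero, because
the baseline already equals the hardware count.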

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-3-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: enforce single RVU AF probe

On Octeon series SoCs, the AF is an integrated device within the SoC, and
hardware resources such as NPC, NIX and related blocks are global and
coordinated by the AF driver. Physical and virtual functions request those
resources via AF mailbox messages, so two AF driver instances cannot both
own that global state; firmware exposes only one AF PCI function at boot
and any further octeontx2-af PCI probe returns -EBUSY so software matches
the single-AF model.
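
A minimal userspace sketch of such a probe gate, using a C11 atomic flag.
af_probe_sketch() and af_remove_sketch() are hypothetical names, and the
real rvu_probe() may implement the gate differently; clearing the gate on
an early allocation failure follows the series cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical probe gate: set while one AF instance is bound. */
static atomic_flag af_bound = ATOMIC_FLAG_INIT;

/* Returns 0 for the first bind and -EBUSY for any further bind. */
int af_probe_sketch(bool alloc_fails)
{
	if (atomic_flag_test_and_set(&af_bound))
		return -EBUSY;

	if (alloc_fails) {
		/* Early failure: reopen the gate so a retry can succeed. */
		atomic_flag_clear(&af_bound);
		return -ENOMEM;
	}
	return 0;
}

/* Only valid after a successful probe; releases the gate. */
void af_remove_sketch(void)
{
	bool was_bound = atomic_flag_test_and_set(&af_bound);

	assert(was_bound);
	(void)was_bound;	/* silences unused warnings under NDEBUG */
	atomic_flag_clear(&af_bound);
}
```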

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-2-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-fixes-for-maximum-tx-rx-queues-to-use-by-driver'

Jakub Raczynski says:

====================
net/stmmac: Fixes for maximum TX/RX queues to use by driver

While contributing other changes that prepare functions for new XGMAC hardware
https://lore.kernel.org/netdev/20260601162537.553512-1-j.raczynski@samsung.com/
several issues were reported by Sashiko AI.

All of the issues stem from wrong DTS configuration, but the kernel needs to
handle it.
====================

Link: https://patch.msgid.link/20260611113358.3379518-1-j.raczynski@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/stmmac: Apply MTL_MAX queue limit if config missing

When the "snps,rx-queues-to-use" or "tx-queues-to-use" property is provided
in DTS, the current code caps queues_to_use at U8_MAX if a higher value is
given. But the actual maximum number of supported queues is set by the
macros MTL_MAX_RX_QUEUES and MTL_MAX_TX_QUEUES, which currently have a value
of 8.

This U8_MAX value is capped to the value the core provides in its DMA
capabilities (dma_conf), but only if the core provides it.
That is the case for XGMAC (dwxgmac2) and some GMAC (dwmac4) cores,
but not for dwmac1000. This capping happens at a later stage, in
stmmac_hw_init(), and during stmmac_mtl_setup() we might parse fields outside
allocated memory if queues_to_use exceeds the MTL_MAX_ defines;
for example, rx_queues_cfg is an array of size MTL_MAX_RX_QUEUES.

Fix this by capping value to MTL_MAX during config parsing.
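
As a rough illustration of the fix, a userspace C sketch of the capping
step. cap_queues_to_use() is a hypothetical helper, not the function the
patch adds; only the limit of 8 comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define MTL_MAX_RX_QUEUES 8	/* value stated in the commit message */

/*
 * Cap a device-tree value to the capacity of the per-queue arrays, so
 * later loops over queues_to_use never index past MTL_MAX_* entries.
 */
uint8_t cap_queues_to_use(uint32_t dt_value, uint8_t mtl_max)
{
	assert(mtl_max > 0);

	return dt_value > mtl_max ? mtl_max : (uint8_t)dt_value;
}
```

With this in place, a DTS value of 300 yields 8 rather than U8_MAX, while
valid values such as 4 pass through unchanged.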

Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Jakub Raczynski <j.raczynski@samsung.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611113358.3379518-3-j.raczynski@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/stmmac: Apply TBS config only to used queues

When the stmmac driver is opened, the TBS (Time-Based Scheduling) option is
enabled in the DMA config. Currently this is done for all possible TX queues
via the MTL_MAX_TX_QUEUES macro, but the actual number of queues used might
differ. While setting this is generally harmless, since memory for
MTL_MAX_TX_QUEUES is allocated, it is incorrect, because it prepares config
for unused queues.

Change this to apply the TBS config only to tx_queues_to_use.

Co-developed-by: Chang-Sub Lee <cs0617.lee@samsung.com>
Signed-off-by: Chang-Sub Lee <cs0617.lee@samsung.com>
Signed-off-by: Jakub Raczynski <j.raczynski@samsung.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611113358.3379518-2-j.raczynski@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix debugfs new-tuple display for IPv4 ROUTE entries

In airoha_ppe_debugfs_foe_show(), the second switch statement falls
through from PPE_PKT_TYPE_IPV4_HNAPT/DSLITE to PPE_PKT_TYPE_IPV4_ROUTE,
accessing hwe->ipv4.new_tuple for all three types. However, IPv4 ROUTE
(3-tuple) entries do not contain a valid new_tuple — this field is only
meaningful for NATted flows (HNAPT/DSLITE). For ROUTE entries, the
memory at the new_tuple offset holds routing information, not NAT data,
so displaying "new=" produces garbage output.

Display new_tuple only for HNAPT and DSLITE, and let IPV4_ROUTE fall
through to the default case.

Fixes: 3fe15c640f38 ("net: airoha: Introduce PPE debugfs support")
Link: https://lore.kernel.org/6a2b40ea.4dd82583.3a5c46.e5a2@mx.google.com
Signed-off-by: Wayen.Yan <win847@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/6a2be54b.ef98c1b2.3c3224.2ed8@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix register index for Tx-fwd counter configuration

In airoha_qdma_init_qos_stats(), the Tx-fwd counter configuration
register uses the same index (i << 1) as the Tx-cpu counter, which
overwrites the Tx-cpu configuration. The Tx-fwd counter value register
correctly uses (i << 1) + 1, so the configuration register should use
the same index.

Fix the REG_CNTR_CFG index from (i << 1) to ((i << 1) + 1) so that
the Tx-fwd counter is properly configured instead of clobbering the
Tx-cpu counter config.

Fixes: 20bf7d07c956 ("net: airoha: Add sched ETS offload support")
Signed-off-by: Wayen.Yan <win847@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/6a2b40e7.4dd82583.3a5c46.e566@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: restrict socket queue dumps in enqueue tracepoints

tipc_sk_enqueue() runs with sk->sk_lock.slock held while the socket is
owned by user context. The spinlock protects the backlog queue in this
path, but it does not serialize against the socket owner consuming or
purging sk_receive_queue.

KASAN reported:

  CPU: 14 UID: 0 PID: 1050 Comm: tipc3 Not tainted 7.1.0-rc6+ #126 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
    <TASK>
    dump_stack_lvl+0x76/0xa0 lib/dump_stack.c:123
    print_report+0xce/0x5b0 mm/kasan/report.c:482
    kasan_report+0xc6/0x100 mm/kasan/report.c:597
    __asan_report_load4_noabort+0x14/0x30 mm/kasan/report_generic.c:380
    tipc_skb_dump+0x1327/0x16f0 net/tipc/trace.c:73
    tipc_list_dump+0x208/0x2e0 net/tipc/trace.c:187
    tipc_sk_dump+0xaf6/0xd60 net/tipc/socket.c:3996
    trace_event_raw_event_tipc_sk_class+0x312/0x5a0 net/tipc/trace.h:188
    tipc_sk_rcv+0xb1d/0x1d50 net/tipc/socket.c:2497
    tipc_node_xmit+0x1c3/0x1440 net/tipc/node.c:1689
    __tipc_sendmsg+0x97a/0x1440 net/tipc/socket.c:1512
    tipc_sendmsg+0x52/0x80 net/tipc/socket.c:1400
    sock_sendmsg+0x2f6/0x3e0 net/socket.c:825
    splice_to_socket+0x7f9/0x1010 fs/splice.c:884
    do_splice+0xe21/0x2330 fs/splice.c:936
    __do_splice+0x153/0x260 fs/splice.c:1431
    __x64_sys_splice+0x150/0x230 fs/splice.c:1616
    x64_sys_call+0xeb5/0x2790 arch/x86/entry/syscall_64.c:41
    do_syscall_64+0xf3/0x620 arch/x86/entry/syscall_64.c:63
    entry_SYSCALL_64_after_hwframe+0x76/0x7e arch/x86/entry/entry_64.S:130
  RIP: 0033:0x71624e8aafe2
  Code: 08 0f 85 71 3a ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66
  RSP: 002b:0000716157ffed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
  RAX: ffffffffffffffda RBX: 0000716157fff6c0 RCX: 000071624e8aafe2
  RDX: 000000000000005f RSI: 0000000000000000 RDI: 0000000000000066
  RBP: 0000716157ffed90 R08: 0000000000008000 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff00
  R13: 0000000000000021 R14: 0000000000000000 R15: 00007fff89799c40
    </TASK>

The TIPC_DUMP_ALL tracepoints in tipc_sk_enqueue() also dump
sk_receive_queue and can therefore dereference skbs that the socket
owner has already dequeued or freed. Restrict these dumps to
TIPC_DUMP_SK_BKLGQ, which matches the queue protected by the held
spinlock.

Keep the change limited to the enqueue path, where the unsafe queue dump
is reachable while the socket is owned by user context.

Fixes: 01e661ebfbad ("tipc: add trace_events for tipc socket")
Cc: stable@vger.kernel.org
Signed-off-by: Li Xiasong <lixiasong1@huawei.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260611135647.3666727-1-lixiasong1@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: better handle MIBs for GDM ports with multiple devs attached

In the context of a GDM port that can have multiple net_devices attached
(GDM3 and GDM4), the HW counters (MIBs) are global for the GDM port.
This causes duplicated stats to be reported to the kernel for the related
net_device.
The SoC supports a split MIB feature where each counter is tracked based
on the relevant HW channel (NBQ) to account for this scenario and
provide a way to select the related counter on accessing the MIB
registers.
Enable this feature for GDM3 and GDM4 and configure the relevant HW
channel before updating the HW stats to report the correct HW counters to the
kernel for the related interface.
Move the stats struct from port to dev since HW counters are now specific
to the network device instead of the GDM port. Refactor
airoha_update_hw_stats() to take airoha_eth and airoha_gdm_port
parameters since the function operates on the entire port.

Co-developed-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611-airoha-eth-multi-serdes-stats-v1-1-42442ae42064@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: fix NPC mailbox codes in mbox.h

Several NPC mailbox command IDs in the 0x601x range were assigned out of
order. Renumber and reorder the M() definitions so each opcode matches
the stable contract expected by userspace tools and applications.

Fixes: 4e527f1e5c15 ("octeontx2-af: npc: cn20k: Add new mailboxes for CN20K silicon")
Cc: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611083330.1652181-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: Convert lan9303.txt to yaml format

Convert lan9303.txt to yaml format to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx53-kp-hsc.dtb: /soc/bus@50000000/i2c@53fec000/switch@a: failed to match any schema with compatible: ['smsc,lan9303-i2c']

Additional changes:
- rename switch-phy to switch in example.

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260610150533.515914-1-Frank.Li@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: 3c509: Improve style of pnp_device_id array terminator

To match how device-id array terminators look for other device types,
drop `.id = ""` from it and let the compiler take care of zeroing the
entry.

There are no changes in the compiled drivers, only the source looks
nicer.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/a0cd057e6a24b9d355b5e4bdfcdb812cdd1e4652.1781082923.git.u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmgenet: Use weighted round-robin TX DMA arbitration

Under heavy network traffic, we observed sporadic TX queue timeouts on the
Raspberry Pi 4. The timeouts can be reproduced by stress testing the TX
path with multiple concurrent iperf UDP streams:

    iperf3 -c <ip> -u -b0 -P16 -t60
    NETDEV WATCHDOG: CPU: 0: transmit queue 0 timed out 2044 ms
    NETDEV WATCHDOG: CPU: 3: transmit queue 0 timed out 2004 ms

Investigation showed that the timeouts are caused by the priority-based
arbiter. Under heavy load the highest priority queue starves the lower
priority ones, causing timeouts. The TX strict priority arbiter is not
suitable for the default use case where all the traffic gets spread
across all the TX queues.

Therefore, to fix this, switch the TX DMA arbiter to Weighted Round-Robin,
which services all queues, so they do not stall. The weights were chosen
to follow the existing priority scheme: q0 gets the smallest weight, while
q1-4 get the bulk of the TX bandwidth.
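
To illustrate why weighted round-robin avoids starvation, here is a small
C sketch of one arbitration round in software. wrr_round() and the weights
used with it are made up for illustration; the hardware arbiter and the
driver's actual weights may differ.

```c
#include <assert.h>

#define NUM_TXQ 5	/* q0..q4, as named in the commit message */

/*
 * One weighted round-robin round: queue q may send up to weight[q]
 * packets. Updates backlog[] and served[] and returns the number of
 * packets sent in this round.
 */
unsigned int wrr_round(unsigned int backlog[NUM_TXQ],
		       const unsigned int weight[NUM_TXQ],
		       unsigned int served[NUM_TXQ])
{
	unsigned int total = 0;

	for (int q = 0; q < NUM_TXQ; q++) {
		unsigned int n;

		assert(weight[q] > 0);	/* a zero weight would starve q */
		n = backlog[q] < weight[q] ? backlog[q] : weight[q];
		backlog[q] -= n;
		served[q] += n;
		total += n;
	}
	return total;
}
```

With hypothetical weights {1, 4, 4, 4, 4} and every queue backlogged, each
round serves all five queues, so no queue waits longer than one round.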

Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Link: https://patch.msgid.link/20260610085238.56300-1-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: add test for IPv4 devconf netlink notifications

Introduce a new test, `ipv4_devconf_notify`, to verify that the kernel
sends the appropriate netlink notifications when IPv4 devconf parameters
are modified.

The test depends on the newly introduced iproute2 command:

`ip link set dev <ifname> inet`

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260609204520.4670-3-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: handle devconf post-set actions on netlink updates

When IPv4 device configuration parameters are updated via netlink, the
kernel currently only updates the value. This bypasses several
post-modification actions that occur when these same parameters are
updated via sysctl, such as flushing the routing cache or emitting
RTM_NEWNETCONF notifications.

This patch addresses the inconsistency by calling the
devinet_conf_post_set() helper inside inet_set_link_af(). If a flush is
required, we defer it until the netlink attribute parsing loop
completes.

This ensures consistent behavior and side-effects for devconf changes,
regardless of whether they are initiated via sysctl or netlink.
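
The deferred flush can be sketched in userspace C. post_set_sketch() and
set_link_af_sketch() are hypothetical stand-ins for devinet_conf_post_set()
and the attribute loop in inet_set_link_af(); the attribute IDs and the
flush counter are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute IDs; only some changes need a cache flush. */
enum devconf_attr { ATTR_NO_FLUSH, ATTR_NEEDS_FLUSH };

static unsigned int flush_count;	/* counts simulated cache flushes */

/* Stand-in for the post-set helper: reports whether a flush is needed. */
bool post_set_sketch(enum devconf_attr attr)
{
	return attr == ATTR_NEEDS_FLUSH;
}

/*
 * Stand-in for the netlink attribute loop: remember that a flush is
 * needed, but perform it once, after all attributes are processed.
 * Returns the total number of flushes performed so far.
 */
unsigned int set_link_af_sketch(const enum devconf_attr *attrs, size_t n)
{
	bool flush = false;

	assert(attrs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		flush |= post_set_sketch(attrs[i]);

	if (flush)
		flush_count++;
	return flush_count;
}
```

Two flush-requiring attributes in the same request still result in a
single flush, which is the point of deferring it.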

Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260609204520.4670-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: centralize devconf sysctl handling

The logic for handling IPv4 devconf sysctls is scattered. Notification
and cache flushes are managed in devinet_conf_proc(), while a separate
ipv4_doint_and_flush() function and DEVINET_SYSCTL_FLUSHING_ENTRY macro
are used for properties that solely require a cache flush.

This patch refactors the sysctl handling by introducing a centralized
helper, devinet_conf_post_set(). This new function evaluates the changed
attribute and handles all necessary operations like triggering netlink
notifications. It returns a boolean indicating whether a routing cache
flush is required.

Note that the boolean is necessary as this function will be re-used for
netlink IPv4 devconf handling where the cache flushing must wait until
all the attributes have been processed.

Finally, this introduces a small change in behavior for
IPV4_DEVCONF_ROUTE_LOCALNET. As commit d0daebc3d622 ("ipv4: Add
interface option to enable routing of 127.0.0.0/8") intended, the cache
flush should only be performed when ROUTE_LOCALNET changes from 1 to 0.
Unfortunately, this was not the case: the implementation used
DEVINET_SYSCTL_FLUSHING_ENTRY for the attribute, which made the related
code in devinet_conf_proc() dead.

IPV4_DEVCONF_FORWARDING is still being handled separately as it requires
more operations.

Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260609204520.4670-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

samples/landlock: Add sandboxer UDP access control

Add environment variables to control associated access rights:
- LL_UDP_BIND
- LL_UDP_CONNECT_SEND

Each one takes a list of ports separated by colons, like other list
options.
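
For illustration, a colon-separated port list such as "53:5353:8080" could
be parsed as below. parse_port_list() is a hypothetical helper written for
this example, not the sandboxer's actual parsing code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a colon-separated list of ports into out[].
 * Returns the number of ports parsed, or -EINVAL on a malformed entry,
 * an out-of-range port, or more than max entries.
 */
int parse_port_list(const char *list, uint16_t *out, size_t max)
{
	const char *p = list;
	size_t n = 0;

	assert(list != NULL && out != NULL);

	while (*p) {
		char *end;
		unsigned long port;

		errno = 0;
		port = strtoul(p, &end, 10);
		if (end == p || errno || port > UINT16_MAX || n == max)
			return -EINVAL;
		out[n++] = (uint16_t)port;

		if (*end == ':')
			end++;		/* skip the separator */
		else if (*end != '\0')
			return -EINVAL;	/* trailing garbage */
		p = end;
	}
	return (int)n;
}
```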

Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://patch.msgid.link/20260611162107.49278-6-matthieu@buffet.re
Signed-off-by: Mickaël Salaün <mic@digikod.net>

selftests/landlock: Add tests for UDP send

Add tests specific to UDP sendmsg() in the protocol_* variants to ensure
behaviour is consistent across AF_INET, AF_INET6 and AF_UNIX.

Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://patch.msgid.link/20260611162107.49278-5-matthieu@buffet.re
[mic: Fix comment formatting, rebase]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

selftests/landlock: Add tests for UDP bind/connect

Make basic changes to the existing bind() and connect() test suite to
cover UDP restriction.

Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://patch.msgid.link/20260611162107.49278-4-matthieu@buffet.re
[mic: Update audit.connect_bound, fix comment formatting]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Add UDP send+connect access control

Add support for a second fine-grained UDP access right.
LANDLOCK_ACCESS_NET_CONNECT_SEND_UDP controls the ability to set the
remote port of a socket (via connect()) and to specify an explicit
destination when sending a datagram, to override any remote peer set on
a UDP socket (e.g. in sendto() or sendmsg()). It will be useful for
applications that send datagrams, and for some servers too (those
creating per-client sockets, which want to receive traffic only from a
specific address).

As with bind(), this access control is performed when
configuring sockets, not in hot code paths.

Add detection of when autobind is about to be required, and deny the
operation if the process would not be allowed to call bind(0)
explicitly. Autobind can only be performed in udp_lib_get_port() from
code paths already controlled by LSM hooks: when connect()ing, sending a
first datagram, and in a splice() EOF edge case which, as far as I
understand, can only happen after a remote peer has been set. This invariant needs to be
preserved to keep bind policies actually enforced.

Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://patch.msgid.link/20260611162107.49278-3-matthieu@buffet.re
[mic: Add quick return for non-sandboxed tasks, fix sa_family
dereferencing, fix comment formatting]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Add UDP bind() access control

Add support for a first fine-grained UDP access right.
LANDLOCK_ACCESS_NET_BIND_UDP controls the ability to set the local port
of a UDP socket (via bind()). It will be useful for servers (to start
receiving datagrams), and for some clients that need to use a specific
source port (e.g. mDNS requires the use of port 5353).

For obvious performance reasons, access control is only enforced when
configuring sockets, not when using them for common send/recv
operations.

Bump ABI to allow userspace to detect and use this new right.

Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://patch.msgid.link/20260611162107.49278-2-matthieu@buffet.re
[mic: Fix comment formatting]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Fix unmarked concurrent access to socket family

Socket family is read (twice) in a context where the socket is not
locked, so another thread can setsockopt(IPV6_ADDRFORM) to write it
concurrently. Add needed READ_ONCE() annotation.

Use the proper macro to access __sk_common.skc_family like everywhere
else.
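
The idea behind the annotation can be shown with a userspace
approximation. READ_ONCE_SKETCH() mimics the kernel macro with a volatile
load, and struct sock_sketch and is_inet_family() are hypothetical; reading
into a local variable is an assumption about how the two reads are unified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Userspace approximation of the kernel macro: one volatile load. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

struct sock_sketch { unsigned short family; };	/* hypothetical */

/*
 * Load the family once and reuse the local copy for both checks, so a
 * concurrent writer cannot make the two comparisons see different
 * values.
 */
bool is_inet_family(const struct sock_sketch *sk)
{
	unsigned short family;

	assert(sk != NULL);

	family = READ_ONCE_SKETCH(sk->family);
	return family == AF_INET || family == AF_INET6;
}
```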

Fixes: fff69fb03dde ("landlock: Support network rules with TCP bind and connect")
Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
Link: https://patch.msgid.link/20260609211511.85630-1-matthieu@buffet.re
Link: https://patch.msgid.link/20260609211511.85630-2-matthieu@buffet.re
[mic: Squash two patches, move variable to ease backport, fix comment
formatting]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

selftests/landlock: Explicitly disable audit in teardowns

I'm seeing sporadic selftest failures, such as

  #  RUN           scoped_audit.connect_to_child ...
  # scoped_abstract_unix_test.c:314:connect_to_child:Expected 0 (0) == records.access (8)
  # connect_to_child: Test failed
  #          FAIL  scoped_audit.connect_to_child
  not ok 19 scoped_audit.connect_to_child

This seems similar to what commit 3647a4977fb73d ("selftests/landlock:
Drain stale audit records on init") tried to fix. However, the added
drain loop is not effective. When AUDIT_STATUS_PID is set, the
kauditd_thread is woken up and starts sending messages from the hold queue
to netlink. Depending on the scheduling of this kthread, not all messages
might be sent via netlink within the 1 us interval.

Therefore, instead of trying to drain the queue, let's just disable
audit when running non-audit tests or more precisely disable it after
audit-tests. This way we won't generate any new audit message that could
interfere with the other tests.

The comment saying that on process exit audit will be disabled is wrong.
The closed file descriptor just causes an auditd_reset(), not a
disablement. So future messages will be queued in the hold queue.

Cc: stable@vger.kernel.org
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Link: https://patch.msgid.link/20260529-welsh-nagoya-b4d9ca60@mheyne-amazon
[mic: Fix FD leak, update subject, call audit_cleanup() in audit_exec teardown]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

selftests/landlock: Test SCOPE_SIGNAL on the SIGIO/fowner pgid path

Add regression tests for the LANDLOCK_SCOPE_SIGNAL handling of the
asynchronous SIGIO delivery path (fcntl(F_SETOWN)) with a process-group
owner.

sigio_to_pgid_members covers the bypass: a sandboxed process at the head
of its process group's PGID hlist (the default after fork()) arms
F_SETOWN(-pgrp) + O_ASYNC and triggers the fan-out; the in-domain owner
must be signaled (proving the trigger fired) while the non-sandboxed
member of the group, outside the domain, must not.

sigio_to_pgid_self covers the same-process guarantee: the owner is
registered from a sandboxed non-leader thread, whose domain differs from
the thread-group leader the kernel signals for a process-group owner.
That leader belongs to the owner's own process and must still be
signaled.

Without the fix the first test sees the out-of-domain member signaled
and the second sees the owner's own leader denied.

Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://patch.msgid.link/43370e89f7a896a583bf33d1cd171d02630e61bf.1780614610.git.hexlabsecurity@proton.me
[mic: Fix comment]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Fix LANDLOCK_SCOPE_SIGNAL bypass on the SIGIO path

LANDLOCK_SCOPE_SIGNAL must prevent a sandboxed process from signaling
processes outside its Landlock domain.  It can be bypassed through the
asynchronous SIGIO delivery path.

A sandboxed process that owns any file or socket can arm it with
fcntl(fd, F_SETOWN, -pgid), fcntl(fd, F_SETSIG, SIGKILL) and O_ASYNC, so
that an I/O event makes the kernel deliver the chosen signal to the
whole process group.  As the head of its process group's task list (the
default position right after fork()) that group can also hold the
non-sandboxed process that launched it, e.g. a supervisor or a security
monitor.  The sandbox can thus kill or signal the processes
LANDLOCK_SCOPE_SIGNAL is meant to protect from it.

The scope is enforced in hook_file_send_sigiotask() against the Landlock
domain recorded at F_SETOWN time, not the live domain of the sender.
control_current_fowner() decides whether to record that domain and skips
recording it when the fowner target is in the caller's thread group,
which is safe only for a single-task target (PIDTYPE_PID, PIDTYPE_TGID).
For a process group (PIDTYPE_PGID) pid_task() returns only one member;
recording is skipped whenever that member shares the caller's thread
group, and hook_file_send_sigiotask() then lets the signal fan out to
the whole group unchecked.

Record the domain for every non single-process target so the scope is
enforced against each group member at delivery time.

That recording is necessary but not sufficient on its own: the kernel
signals a process group through its members' thread-group leaders, and
the leader of the registrant's own process can carry a different
Landlock domain than the sibling thread that armed the owner.
domain_is_scoped() would then deny that leader, even though commit
18eb75f3af40 ("landlock: Always allow signals between threads of the
same process") requires same-process delivery to be allowed.
hook_task_kill() avoids this by evaluating same_thread_group() live, per
recipient; the SIGIO path instead delegates the whole decision to a
single registration-time check, which a process-group fan-out cannot
honor.

So also record the registrant's thread group next to its domain and
exempt it at delivery: hook_file_send_sigiotask() allows the signal
whenever the recipient belongs to the registrant's own process,
restoring the same-process guarantee while keeping out-of-domain group
members blocked.  The direct kill() path (hook_task_kill) already
evaluates the live domain and is unaffected.
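
The delivery-time decision described above can be sketched as follows.
This is a reduced model under stated assumptions, not the kernel code:
struct ex_fowner, ex_sigio_allowed() and the recipient_in_domain flag
(standing in for the outcome of the domain check) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of what is recorded at F_SETOWN time. */
struct ex_fowner {
	bool domain_recorded;	/* false: no scope to enforce at delivery */
	int registrant_tgid;	/* thread group of the task that armed the owner */
};

/*
 * Decide delivery for one recipient of the process-group fan-out.
 * recipient_in_domain stands in for the result of the scope check
 * against the recorded domain.
 */
static bool ex_sigio_allowed(const struct ex_fowner *fown, int recipient_tgid,
			     bool recipient_in_domain)
{
	if (!fown->domain_recorded)
		return true;

	/* Same-process delivery is always allowed. */
	if (recipient_tgid == fown->registrant_tgid)
		return true;

	/* Every other group member is checked individually. */
	return recipient_in_domain;
}
```

With a recorded domain, the registrant's own leader is signaled while an
out-of-domain member of the same process group is not.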

Fixes: 18eb75f3af40 ("landlock: Always allow signals between threads of the same process")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://patch.msgid.link/56bffc24f3d0d08b45a686a48e99766b0a0821fa.1780614610.git.hexlabsecurity@proton.me
[mic: Check pid_type earlier and improve comment, fix commit message,
fix comment formatting]
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Demonstrate best-effort allowed_access filtering

Landlock provides best-effort sandboxing across ABI versions:
applications request the rights they need, and on older kernels the
unsupported rights are silently dropped from handled_access_* by the
documented compatibility switch. The recommended pattern for
landlock_add_rule(2) calls is to mirror this filtering at the rule
level, which wasn't explicitly described in the example.

Show the pattern explicitly in the filesystem and network rule examples
by masking each rule's allowed_access against the ruleset's
handled_access_* and adding the rule only when at least one bit remains
set. This makes the recommended best-effort pattern self-documenting.
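
A minimal sketch of the masking step, assuming hypothetical EX_ACCESS_*
bits in place of the real LANDLOCK_ACCESS_* constants and a hypothetical
helper ex_filter_allowed_access(); it is not the code added to the
documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical access bits for illustration only; real code would use
 * the LANDLOCK_ACCESS_FS_* or LANDLOCK_ACCESS_NET_* constants.
 */
#define EX_ACCESS_READ	(1ULL << 0)
#define EX_ACCESS_WRITE	(1ULL << 1)
#define EX_ACCESS_NEWER	(1ULL << 2)	/* pretend: unknown to an older kernel */

/*
 * Mask a rule's allowed_access against what the ruleset handles.
 * Returns true when at least one bit remains, i.e. when the rule is
 * still worth passing to landlock_add_rule(2).
 */
static bool ex_filter_allowed_access(uint64_t handled_access,
				     uint64_t *allowed_access)
{
	*allowed_access &= handled_access;
	return *allowed_access != 0;
}
```

A caller would skip the landlock_add_rule(2) call when the helper returns
false, since a rule with an empty allowed_access grants nothing.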

Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://patch.msgid.link/20260513151856.148423-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Account all audit data allocations to user space

Mark the kzalloc_flex() of struct landlock_details with
GFP_KERNEL_ACCOUNT so the allocation is charged to the calling task,
like the other Landlock per-domain allocations which have used
GFP_KERNEL_ACCOUNT forever.

Every property of landlock_details is caller-attributable: allocated by
landlock_restrict_self(2), owned by the caller's landlock_hierarchy,
contents are the caller's pid, uid, comm, and exe_path, lifetime bounded
by the caller's domain. While the caller may neither know nor control the
size of this allocation (i.e. exe_path), this data should still be
charged to it.

The deciding factor is whether userspace can trigger the allocation, not
whether the size of the data is known nor controlled by the caller.
This aligns with the kmemcg accounting policy established by commit
5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg").

No new failure modes: the hierarchy and ruleset are allocated before
details and are already accounted, so landlock_restrict_self(2) already
returns -ENOMEM under memcg pressure. This change widens that existing
failure window slightly; it does not introduce a new error code.

Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: stable@vger.kernel.org
Fixes: 1d636984e088 ("landlock: Add AUDIT_LANDLOCK_DOMAIN and log domain status")
Link: https://patch.msgid.link/20260513180309.165840-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>

landlock: Set audit_net.sk for socket access checks

Set audit_net.sk in current_check_access_socket() to provide the socket
object to audit_log_lsm_data(). This makes Landlock consistent with
AppArmor, which always sets .sk for socket operations, and with
SELinux's generic socket permission checks.

The socket's local and foreign address information (laddr, lport, faddr,
fport) is logged by the shared lsm_audit.c infrastructure when the
socket has bound or connected state. Fields with zero values are
suppressed by print_ipv4_addr()/print_ipv6_addr(), so the audit output
is unchanged for the common case of bind denials on unbound sockets.
For connect denials after a prior bind, the bound local address (laddr,
lport) appears before the existing sockaddr fields (daddr, dest).

No existing fields are removed or reordered, and the new field names
(laddr, lport, faddr, fport) are standard audit fields already emitted
by other LSMs through the same lsm_audit.c code path.

Add a connect_tcp_bound audit test that binds to an allowed port and
then connects to a denied one, verifying that the denial record reports
laddr/lport from the bound socket in addition to the connect
destination.

Cc: Günther Noack <gnoack@google.com>
Cc: Tingmao Wang <m@maowtm.org>
Cc: stable@vger.kernel.org
Fixes: 9f74411a40ce ("landlock: Log TCP bind and connect denials")
Link: https://patch.msgid.link/20260612172757.1003481-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>

tcp: refine tcp_sequence() for the FIN exception

Commit 0e24d17bd966 ("tcp: implement RFC 7323 window retraction
receiver requirements") removed the special FIN case that
was added in commit 1e3bb184e941 ("tcp: re-enable acceptance of
FIN packets when RWIN is 0").

If a peer sends a segment containing data and a FIN flag before
it learns about our window retraction and has a buggy TCP stack,
it might place the FIN one byte beyond what it thinks is the
right edge of the window (i.e., max_window_edge + 1).

The data portion (end_seq - th->fin) will end exactly at max_window_edge.
In this case, we will drop the packet if our receive queue is not empty,
even though the data was sent within the window we previously allowed.
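
The sequence arithmetic can be sketched as follows. This is an
illustration of the edge case only, not the tcp_sequence() change:
ex_seq_after() mirrors the style of the kernel's wraparound-safe
comparison, and ex_data_within_window() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" on 32-bit sequence numbers. */
static bool ex_seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Hypothetical right-edge check: FIN occupies one sequence number, so
 * subtract it before comparing the end of the data with the window edge.
 */
static bool ex_data_within_window(uint32_t end_seq, bool fin,
				  uint32_t max_window_edge)
{
	const uint32_t data_end = end_seq - (fin ? 1u : 0u);

	return !ex_seq_after(data_end, max_window_edge);
}
```

With max_window_edge at 1000, a segment whose end_seq is 1001 passes
this check only when the extra sequence number is the FIN.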

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Baatz <gmbnomis@gmail.com>
Link: https://patch.msgid.link/20260608151452.706822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dpll-ice-add-generic-dpll-type-and-full-tx-reference-clock-control-for-e825'

Grzegorz Nitka says:

====================
dpll/ice: Add generic DPLL type and full TX reference clock control for E825

NOTE: This series is intentionally submitted on net-next (not
intel-wired-lan) as early feedback of DPLL subsystem changes is
welcomed. In the past possible approaches were discussed in [1].

This series adds TX reference clock support for E825 devices and exposes
TX clock selection and synchronization status via the Linux DPLL
subsystem.

Here is the high-level connection diagram for E825 device:
  +------------------------------------------------------------------+
  |                                                                  |
  |                           +-----------------------------+        |
  |                           |                             |        |
  |                           |         MAC                 |        |
  |                           |+------------+-----+         |        |
  |                           ||RX/1588 |PHC|tspll<----\    |        |
+---+----+                    ||MUX     +---+-^---|    |    |        |
| E | RX >--------------------->              |   >--\ |    |        |
| T |    |    /---------------->              |   >-\| |    |        |
| H |----+    |               |+---------+----^---+ || |    |        |
| 1 | TX <----|----------------+TX MUX   < OCXO   | || |    |        |
|   |PLL |    |               ||         |--------| || |    |        |
+---+----+    |           /----+         <-ext_ref<-||-|----|------ext_ref
| E | RX >----/           |   ||         |--------+ || |    |        |
| T |    |                |   ||         <  SyncE | || |    |        |
| H |----+                |   |+-----------^------+ || |    |        |
| 2 | TX <----------------/   |            | /------||-/    |        |
|   |PLL |                    +------------|-|------||------+        |
+---+----+                              /--/ |      ||               |
| . | RX >---                           |    |      ||               |
| . |    |                   +----------|----|------||--+            |
| . |----+                   |        +-^-+--^+     ||  |            |
|   | TX <---                |        |EEC|PPS|     ||  |            |
|   |PLL |                   |        +-------+     ||  |            |
+---+----+                   |        |       <-CLK0/|  |            |
| E | RX >---                |        |  DPLL |      |  |            |
| T |    |                   |        |       <-CLK1-/  |            |
| H |----+                   |        |       |         |            |
| X | TX <---                |        |       <---SMA---<            |
|   |PLL |                   |        |       |         |            |
+---+----+                   |        |       <---GPS---<            |
  |                          |        |       |         |            |
  |                          |        |       <---...---<            |
  |                          |        |       |         |            |
  |                          |        +-------+         |            |
  |                          | External timing module   |            |
  |                          +--------------------------+            |
  +------------------------------------------------------------------+

E825 hardware contains a dedicated TX clock domain with per-port source
selection behavior that is distinct from PPS handling and from board-level
EEC distribution. TX reference clock selection is device-wide, shared
across ports, and mediated by firmware as part of link bring-up. As a
result, TX clock selection intent may differ from effective hardware
configuration, and software must verify outcome after link-up.

To support this, the series extends the DPLL core and the ice driver
incrementally. The series also introduces DPLL_TYPE_GENERIC as a broad
UAPI class for DPLL instances outside PPS/EEC categories. The intent is
to keep type naming reusable and scalable across different ASIC
topologies while preserving functional discoverability via
driver/device context and pin topology.

This follows netdev discussion guidance that UAPI type naming should avoid
location-specific or vendor-specific taxonomy, because such labels do not
scale across different ASIC designs. The function of a given DPLL instance
is already discoverable from driver/device context and pin topology, and
does not require an additional narrow type identifier in UAPI.

At the same time, a separate DPLL object is still needed for E825 TX clock
control/reporting semantics. Using DPLL_TYPE_GENERIC provides a reusable
class for devices outside PPS/EEC without overfitting UAPI naming to one
topology.

The relevant discussion is in [2].

Series content
- add a new generic DPLL type for devices outside PPS/EEC classes;
- relax DPLL pin registration rules for firmware-described shared pins
  and extend pin notifications with a source identifier;
- allow dynamic state control of SyncE reference pins where hardware
  supports it;
- add CPI infrastructure for PHY-side TX clock control on E825C;
- introduce a TX-clock DPLL device and TX reference clock pins
  (EXT_EREF0 and SYNCE) in the ice driver;
- extend the Restart Auto-Negotiation command to carry a TX reference
  clock index;
- implement hardware-backed TX reference clock switching, post-link
  verification, and TX synchronization reporting.

TXCLK pins report TX reference topology only. Actual synchronization
success is reported via DPLL lock status, updated after hardware
verification: external TX references report LOCKED, while the internal
ENET/TXCO source reports UNLOCKED.

This provides reliable TX reference selection and observability on E825
devices using standard DPLL interfaces, without conflating user intent
with effective hardware behavior.

[1] https://lore.kernel.org/netdev/20250905160333.715c34ac@kernel.org/
[2] https://lore.kernel.org/netdev/20260402230626.3826719-1-grzegorz.nitka@intel.com/
====================

Link: https://patch.msgid.link/20260607183045.1213735-1-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: implement E825 TX ref clock control and TXC hardware sync status

Build on the previously introduced TXC DPLL framework and implement
full TX reference clock control and hardware-backed synchronization
status reporting for E825 devices.

E825 firmware may accept or override TX reference clock requests based
on device-wide routing constraints and link conditions. Because the
final selection becomes visible only after a link-up event, the driver
splits the observation into two complementary signals:

  - TXCLK pin state reflects the requested TX reference clock
    (pf->ptp.port.tx_clk_req). After a link-up, the value is reconciled
    against the SERDES reference selector by
    ice_txclk_update_and_notify(); if firmware or auto-negotiation
    selected a different clock, tx_clk_req is overwritten so that pin
    state converges to the actual hardware selection.

  - TXC DPLL lock status reflects hardware synchronization:
      * LOCKED   when an external TX reference is in use
      * UNLOCKED when falling back to ENET/TXCO, or when a requested
        external reference has not (yet) been accepted by hardware.

Userspace observing only pin state therefore sees user intent, while
lock status is the authoritative indicator of whether the requested
clock is actually selected and synchronizing. This matches the DPLL
subsystem model where pin state describes topology and device lock
status describes signal quality.

TX reference selection topology:
  - External references (SYNCE, EREF0) are represented as TXCLK pins
  - The internal ENET/TXCO clock has no pin representation; when
    selected, all TXCLK pins are reported DISCONNECTED
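
The split between pin state and lock status can be sketched as below.
This is a reduced model under stated assumptions, not driver code: the
ex_* types and functions are hypothetical, and only the post-link-up
reconciliation described above is modeled.

```c
#include <stdbool.h>

/* Hypothetical TX reference sources; ENET is the internal clock. */
enum ex_tx_ref { EX_REF_ENET, EX_REF_SYNCE, EX_REF_EREF0 };
enum ex_lock { EX_UNLOCKED, EX_LOCKED };

struct ex_txc {
	enum ex_tx_ref tx_clk_req;	/* pin state is derived from this */
	enum ex_lock lock;		/* device-level lock status */
};

/*
 * After link-up: converge the request to what hardware selected and
 * derive lock status from that selection.
 */
static void ex_txclk_reconcile(struct ex_txc *txc, enum ex_tx_ref hw_selected)
{
	txc->tx_clk_req = hw_selected;
	txc->lock = (hw_selected == EX_REF_ENET) ? EX_UNLOCKED : EX_LOCKED;
}

/* The internal clock has no pin, so no pin is connected when it is in use. */
static bool ex_pin_connected(const struct ex_txc *txc, enum ex_tx_ref pin)
{
	return pin != EX_REF_ENET && txc->tx_clk_req == pin;
}
```

In this model a request for SYNCE that hardware overrides with the
internal clock ends with every pin disconnected and the device unlocked.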

With this change, TX reference clocks on E825 devices can be reliably
selected, observed via standard DPLL interfaces, and monitored for
effective synchronization through TXC DPLL lock status.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-14-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: add Tx reference clock index handling to AN restart command

Extend the Restart Auto-Negotiation (AN) AdminQ command with a new
parameter allowing software to specify the Tx reference clock index to
be used during link restart.

This patch:
- adds REFCLK field definitions to ice_aqc_restart_an
- updates ice_aq_set_link_restart_an() to take a new refclk parameter
  and properly encode it into the command
- keeps legacy behavior by passing REFCLK_NOCHANGE where appropriate

This prepares the driver for configurations requiring dynamic selection
of the Tx reference clock as part of the AN flow.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-13-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: implement CPI support for E825C

Add full CPI (Converged PHY Interface) command handling required for
E825C devices. The CPI interface allows the driver to interact with
PHY-side control logic through the LM/PHY command registers, including
enabling/disabling/selection of PHY reference clock.

This patch introduces:
- a new CPI subsystem (ice_cpi.c / ice_cpi.h) implementing the CPI
   request/acknowledge state machine, including REQ/ACK protocol,
   command execution, and response handling
- helper functions for reading/writing PHY registers over Sideband
   Queue
- CPI command execution API (ice_cpi_exec) and a helper for enabling or
   disabling Tx reference clocks (CPI 0xF1 opcode 'Config PHY clocking')
- serialization of CPI transactions in the CPI core. CPI REQ/ACK is a
   multi-step handshake and must be executed atomically per PHY.
   Centralize the lock in ice_cpi_exec() and use adapter-scoped per-PHY
   mutexes, which match the hardware sharing model across PFs.
- addition of the non-posted write opcode (wr_np) to SBQ
- Makefile integration to build CPI support together with the PTP stack

This provides the infrastructure necessary to support PHY-side
configuration flows on E825C and is required for advanced link control
and Tx reference clock management.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-12-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: introduce TXC DPLL device and TX ref clock pin framework for E825

E825 devices provide a dedicated TX clock (TXC) domain which may be
driven by multiple reference clock sources, including external board
references and port-derived SyncE. To support future TX clock control
and observability through the Linux DPLL subsystem, introduce a
separate TXC DPLL device (of DPLL_TYPE_GENERIC) and a framework for
representing TX reference clock inputs.

This change adds a new internal DPLL pin type (TXCLK) and registers
TX reference clock pins for E825-based devices:
- EXT_EREF0: a board-level external electrical reference
- SYNCE: a port-derived SyncE reference described via firmware nodes

The TXC DPLL device is created and managed alongside the existing
PPS and EEC DPLL instances. TXCLK pins are registered directly or
deferred via a notifier when backed by fwnode-described pins.
A per-pin attribute encodes the TX reference source associated with
each TXCLK pin.

At this stage, TXCLK pin state callbacks and TXC DPLL lock status
reporting are implemented as placeholders. Pin state getters always
return DISCONNECTED, and the TXC DPLL is initialized in the UNLOCKED
state. No hardware configuration or TX reference switching is
performed yet.

This patch establishes the structural groundwork required for
hardware-backed TX reference selection, verification, and
synchronization status reporting, which will be implemented in
subsequent patches.

Also signal dpll_init from the fwnode pin init error path so any
notifier worker already blocked on it can drain, avoiding a
flush_workqueue() deadlock during teardown.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-11-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: allow fwnode pins to attempt state change without capability bit

Pins registered with an fwnode may have .state_on_dpll_set implemented
without advertising DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE upfront.
Requiring the bit for fwnode pins ties firmware description to driver
implementation details unnecessarily.

Relax the capability check in dpll_pin_state_set() and
dpll_pin_on_pin_state_set(): when a pin has an associated fwnode, bypass
the capability gate and let the ops layer decide, returning -EOPNOTSUPP
if .state_on_dpll_set is absent. Non-fwnode pins retain the original
strict behavior.
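
The relaxed gate can be sketched as follows. This is a simplified model,
not the dpll core code: struct ex_pin, ex_pin_state_set() and
ex_accept_state() are hypothetical, and returning -EOPNOTSUPP for the
strict non-fwnode case is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced pin. */
struct ex_pin {
	bool has_fwnode;
	bool can_change_state;		/* stands in for the capability bit */
	int (*state_set)(int state);	/* stands in for .state_on_dpll_set */
};

/* Example callback that accepts any state. */
static int ex_accept_state(int state)
{
	(void)state;
	return 0;
}

static int ex_pin_state_set(const struct ex_pin *pin, int state)
{
	/* Non-fwnode pins keep the strict capability gate (assumed errno). */
	if (!pin->has_fwnode && !pin->can_change_state)
		return -EOPNOTSUPP;

	/* fwnode pins bypass the gate; the ops layer decides. */
	if (pin->state_set == NULL)
		return -EOPNOTSUPP;

	return pin->state_set(state);
}
```

An fwnode pin without the capability bit succeeds here as long as it
provides the callback, while the same pin without an fwnode is rejected
up front.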

This is used later in the series by the SyncE_Ref output pin, which
relies on the fwnode path for state control.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-10-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: extend pin notifier with notification source ID

Extend the DPLL pin notification API to include a source identifier
indicating where the notification originates. This allows notifier
consumers to distinguish between notifications coming from
an associated DPLL instance, a parent pin, or the pin itself.

A new field, src_clock_id, is added to struct dpll_pin_notifier_info
and is passed through all pin-related notification paths. Callers of
dpll_pin_notify() are updated to provide a meaningful source identifier
based on their context:
  - pin registration/unregistration uses the DPLL's clock_id,
  - pin-on-pin operations use the parent pin's clock_id,
  - pin changes use the pin's own clock_id.

As introduced in the commit ("dpll: allow registering FW-identified pin
with a different DPLL"), it is possible to share the same physical pin
via firmware description (fwnode) with DPLL objects from different
kernel modules. This means that a given pin can be registered multiple
times.

Drivers such as ICE (E825 devices) rely on this mechanism when listening
for the event where a shared-fwnode pin appears, while avoiding reacting
to events triggered by their own registration logic.

This change only extends the notification metadata and does not alter
existing semantics for drivers that do not use the new field.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-9-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: balance create/delete notifications in __dpll_pin_(un)register

__dpll_pin_register() emits dpll_pin_create_ntf() internally, but
__dpll_pin_unregister() left the matching delete to its callers. The
counts then diverge on dpll_pin_on_pin_register() rollback and on
dpll_pin_on_pin_unregister(), leaking stale notifications.

Emit dpll_pin_delete_ntf() inside __dpll_pin_unregister() and drop the
now-redundant call in dpll_pin_unregister().

Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-8-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: guard sync-pair removal on full pin unregister

__dpll_pin_unregister() wiped the global sync-pair state on every
(dpll, ops, priv, cookie) tuple removed from a pin. When a pin is
registered multiple times and only one registration is being torn
down, this dropped sync-pair pairings still in use by the surviving
registrations.

Move dpll_pin_ref_sync_pair_del() inside the xa_empty(&pin->dpll_refs)
branch so it only runs when the last registration is gone, alongside
clearing the DPLL_REGISTERED mark.
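
The guard amounts to tearing down shared state only with the last
registration. A minimal sketch, assuming a plain counter in place of the
pin->dpll_refs xarray; struct ex_pin and ex_unregister_one() are
hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical reduced pin with a counter instead of an xarray. */
struct ex_pin {
	unsigned int nr_registrations;
	bool has_sync_pairs;	/* global sync-pair state */
	bool registered;	/* stands in for the DPLL_REGISTERED mark */
};

/* Drop one registration; wipe shared state only when none remain. */
static void ex_unregister_one(struct ex_pin *pin)
{
	if (pin->nr_registrations == 0)
		return;

	pin->nr_registrations--;
	if (pin->nr_registrations == 0) {
		pin->has_sync_pairs = false;
		pin->registered = false;
	}
}
```

With two registrations, the first removal leaves the sync pairs in place
for the surviving registration; only the second clears them.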

Fixes: 58256a26bfb3 ("dpll: add reference sync get/set")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-7-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: emit per-dpll delete notifications in dpll_pin_on_pin_unregister()

dpll_pin_on_pin_register() emits a creation notification for every
parent->dpll_refs entry, but dpll_pin_on_pin_unregister() emitted only
one deletion notification outside the loop. When a pin is registered
against multiple parent dplls, userspace sees N creates but a single
delete and leaks per-dpll state.

Move dpll_pin_delete_ntf() into the loop and call it before
__dpll_pin_unregister() so the DPLL_REGISTERED mark is still set when
dpll_pin_available() is consulted.

Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-6-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: send delete notification before unregister in on-pin rollback

The rollback path in dpll_pin_on_pin_register() called
__dpll_pin_unregister() before dpll_pin_delete_ntf(). When the
unregister dropped the pin's last DPLL reference it cleared the
DPLL_REGISTERED mark in dpll_pin_xa, so the subsequent
dpll_pin_event_send() failed dpll_pin_available() and aborted with
-ENODEV. As a result userspace was never notified of the rollback
deletion and remained out of sync with the kernel.

Send the delete notification first, matching the order used by
dpll_pin_unregister() and dpll_pin_on_pin_unregister().

Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-5-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: fix stale iteration in dpll_pin_on_pin_unregister()

Neither parent->dpll_refs nor pin->dpll_refs on its own is a correct
iteration target at unregister time:

  - pin->dpll_refs includes DPLLs the child was registered against
    via a different parent or directly; blind unregister WARNs on
    the cookie miss in dpll_xa_ref_pin_del().
  - parent->dpll_refs reflects the parent's current attachments, not
    those at child-register time. Another driver may have (un)reg'd
    the parent against additional DPLLs in the meantime, so we miss
    registrations that exist and visit DPLLs that have none.

Walk pin->dpll_refs and use dpll_pin_registration_find() to filter
to entries whose cookie is this parent. Symmetric with
dpll_pin_on_pin_register(), correct under any subsequent change to
parent->dpll_refs.

Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions")
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-4-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: allow registering FW-identified pin with a different DPLL

Relax the (module, clock_id) equality requirement when registering a
pin identified by firmware (pin->fwnode). Some platforms associate a
FW-described pin with a DPLL instance that differs from the pin's
(module, clock_id) tuple. For such pins, permit registration without
requiring the strict match. Non-FW pins still require equality.

Keep netlink pin module reporting/filtering safe for this relaxed
registration model by caching the module name in the pin object at
allocation time and using the cached string in netlink paths.
This avoids dereferencing pin->module after provider module teardown.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-3-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: add generic DPLL type

Add DPLL_TYPE_GENERIC to represent DPLL devices which do not fit the
existing PPS or EEC classes.

The UAPI type is intentionally generic. During netdev discussion,
maintainers pointed out that introducing identifiers tied to a specific
placement or single design does not scale across ASICs and vendors.
The role of a DPLL is already inferable from the spawning driver,
bus device, and pin topology, without encoding additional
purpose-specific taxonomy in the type name.

Using a generic type keeps the UAPI extensible and avoids premature
naming that may become incorrect as new hardware topologies are
exposed through the DPLL subsystem.

Expose the new type through UAPI and netlink specification as "generic".

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Link: https://patch.msgid.link/20260607183045.1213735-2-grzegorz.nitka@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2026-06-12

1) Replace the open-coded manual cleanup in xfrm_add_policy() error
   path with xfrm_policy_destroy() for consistency with
   xfrm_policy_construct().
   From Deepanshu Kartikey.

2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since
   u32 is excessive for traffic flow confidentiality padding.
   From David Ahern.

3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that
   allows migrating individual IPsec SAs independently of
   their policies. The existing XFRM_MSG_MIGRATE is tightly coupled
   to policy+SA migration, lacks SPI for unique SA identification,
   and cannot express reqid changes or migrate Transport mode
   selectors. The new interface identifies the SA via SPI and mark,
   supports reqid changes, address family changes, encap removal,
   and uses an atomic create+install flow under x->lock to prevent
   SN/IV reuse during AEAD SA migration.
   From Antony Antony.

* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
  xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
  xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
  xfrm: make xfrm_dev_state_add xuo parameter const
  xfrm: extract address family and selector validation helpers
  xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
  xfrm: move encap and xuo into struct xfrm_migrate
  xfrm: add error messages to state migration
  xfrm: add state synchronization after migration
  xfrm: check family before comparing addresses in migrate
  xfrm: split xfrm_state_migrate into create and install functions
  xfrm: rename reqid in xfrm_migrate
  xfrm: fix NAT-related field inheritance in SA migration
  xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
  xfrm: add extack to xfrm_init_state
  xfrm: remove redundant assignments
  xfrm: Reject excessive values for XFRMA_TFCPAD
  xfrm: cleanup error path in xfrm_add_policy()
====================

Link: https://patch.msgid.link/20260612074725.1760473-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wwan: t7xx: check skb_clone in control TX

t7xx_port_ctrl_tx() clones each skb fragment before passing it to the
port transmit path. The clone is used immediately to set cloned->len, so
an skb_clone() failure results in a NULL pointer dereference.

Check the clone before using it. If previous fragments were already
queued, preserve the driver's existing partial-write behavior by
returning the number of bytes submitted so far.
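
The check-then-partial-return pattern described above can be sketched in
user space. This is only an illustration under stated assumptions: struct
demo_frag, demo_clone(), demo_clone_budget and demo_ctrl_tx() are
hypothetical names, not driver code, and returning -ENOMEM when the very
first clone fails is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one fragment; not a kernel type. */
struct demo_frag {
	const char *data;
	size_t len;
};

/* Hypothetical knob to simulate allocation failure; < 0 means never fail. */
static int demo_clone_budget = -1;

static char *demo_clone(const struct demo_frag *f)
{
	char *copy;

	if (demo_clone_budget == 0)
		return NULL;		/* simulated clone failure */
	if (demo_clone_budget > 0)
		demo_clone_budget--;

	copy = malloc(f->len ? f->len : 1);
	if (copy)
		memcpy(copy, f->data, f->len);
	return copy;
}

/* Returns the bytes submitted, or -ENOMEM if nothing was submitted. */
long demo_ctrl_tx(const struct demo_frag *frags, size_t nfrags)
{
	long submitted = 0;
	size_t i;

	assert(frags != NULL || nfrags == 0);
	for (i = 0; i < nfrags; i++) {
		char *cloned = demo_clone(&frags[i]);

		/* Check the clone before touching it. */
		if (!cloned)
			return submitted ? submitted : -ENOMEM;

		/* A real driver would queue the clone here. */
		submitted += (long)frags[i].len;
		free(cloned);
	}
	return submitted;
}
```

With two fragments of 4 and 2 bytes, a failure on the second clone yields a
partial result of 4, while a failure on the first yields the error code.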

Fixes: 36bd28c1cb0d ("wwan: core: Support slicing in port TX flow of WWAN subsystem")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20260612035613.1192486-1-ruoyuw560@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'vsock-consolidate-acceptq-accounting-into-core-helpers'

Raf Dickson says:

====================
vsock: consolidate acceptq accounting into core helpers

These patches follow up on commit c05fa14db43e
("vsock/vmci: fix sk_ack_backlog leak on failed handshake")
by consolidating sk_acceptq_added() and sk_acceptq_removed() into
the core vsock helpers so transports cannot forget them.

Link: https://lore.kernel.org/netdev/20260611021317.69362-1-rafdog35@gmail.com/
====================

Link: https://patch.msgid.link/20260612045216.105796-1-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: fold sk_acceptq_removed() into vsock_remove_pending()

Callers of vsock_remove_pending() must also call sk_acceptq_removed()
to keep sk_ack_backlog consistent. Move the call into
vsock_remove_pending() itself to make it automatic and prevent future
callers from forgetting it.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-5-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: fold sk_acceptq_added() into vsock_enqueue_accept()

virtio and hyperv call sk_acceptq_added() immediately before
vsock_enqueue_accept(). Move the call into vsock_enqueue_accept()
itself so callers cannot forget it and the accounting is consistent.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-4-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: fold sk_acceptq_added() into vsock_add_pending()

Move sk_acceptq_added() into vsock_add_pending() so callers cannot
forget it. vmci is the only transport using the pending list and
is updated accordingly.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-3-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: introduce vsock_pending_to_accept() helper

Add vsock_pending_to_accept() to move a socket directly from the
pending list to the accept queue in a single operation, avoiding
the sock_put/sock_hold dance and the sk_acceptq_removed()/
sk_acceptq_added() pair that would otherwise be needed when
calling vsock_remove_pending() followed by vsock_enqueue_accept().

Use it in vmci_transport_recv_connecting_server() where a completed
handshake transitions the socket from pending to accept queue.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-2-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: use sk_acceptq_is_full() helper in all transports

Replace the open-coded backlog check with sk_acceptq_is_full().
The helper uses > instead of >=, which is the correct comparison
per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
backlog changes."), and adds READ_ONCE() to annotate the lockless reads.
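
The difference between the two comparisons shows up at the boundary. A
minimal sketch, assuming a hypothetical struct demo_sock that carries only
the two counters involved (the real socket structure is not reproduced
here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the two fields involved. */
struct demo_sock {
	unsigned int ack_backlog;	/* connections queued right now */
	unsigned int max_ack_backlog;	/* limit requested via listen() */
};

/* Open-coded shape with >=: full as soon as the count reaches the limit. */
bool demo_acceptq_full_ge(const struct demo_sock *sk)
{
	assert(sk != NULL);
	return sk->ack_backlog >= sk->max_ack_backlog;
}

/* Helper-style shape with >: one more connection is still admitted. */
bool demo_acceptq_full_gt(const struct demo_sock *sk)
{
	assert(sk != NULL);
	return sk->ack_backlog > sk->max_ack_backlog;
}
```

With a backlog of 0 and nothing queued, the >= form already reports the
queue as full, so no connection could ever be accepted; the > form does not.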

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260612045842.122207-1-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_wed: debugfs: correct index in wed_amsdu_show()

WED_MON_AMSDU_ENG_CNT points to a different entry through 'base + n * offset'
addressing. Correct the WED AMSDU entry numbers used in wed_amsdu_show().

Fixes: 3f3de094e8342 ("net: ethernet: mtk_wed: debugfs: add WED 3.0 debugfs entries")
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Link: https://patch.msgid.link/20260612064501.203058-1-guanwentao@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netdevsim-add-fake-ft-cls_flower-offload'

Florian Westphal says:

====================
netdevsim: add fake FT/CLS_FLOWER offload

v2: fix up error reporting via extack
    shellcheck cleanups
    sort config toggles

1) Enable nf_tables offload control plane testing in netdevsim. Tag
   existing offload fn to allow error injection for testing rollback and abort
   logic.

2) Add nft_offload selftest to exercise the control plane and error
   unwind via fault injection.
====================

Link: https://patch.msgid.link/20260612092209.11966-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: netfilter: add phony nft_offload test

... "phony", because it's not testing offloads; it tests the control
plane code. Also test error unwind via the fault injection framework.

For a proper test, real hardware would be required, given we'd have to
check whether 'previously handed off to hardware' offload commands are
properly removed again on failure or rule flush.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260612092209.11966-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: tc: allow to test nf_tables offload control plane code

The actual 'offload' is phony, all commands are ignored: this is only
useful to test control plane code.

Tag the existing callback to permit error injection to test rollback/abort
code in nf_tables. This is also for fuzzers - the fault injection
framework allows probabilistic error insertion.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260612092209.11966-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()

In airoha_ppe_flush_sram_entries(), the outer "err" variable was never
updated when the inner loop variable shadowed it, causing the function
to always return 0 even when airoha_ppe_foe_commit_sram_entry() fails.

Drop the outer "err" variable and return directly on error, propagating
the error code from airoha_ppe_foe_commit_sram_entry() correctly.
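
The shadowing problem can be reproduced in a few lines. This is a sketch of
the general bug shape, not the driver code: demo_commit_entry() and both
flush functions are hypothetical, and whether the original loop used break
is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-entry commit that fails at index 'bad'. */
static int demo_commit_entry(int idx, int bad)
{
	return idx == bad ? -EIO : 0;
}

/* Buggy shape: the inner 'err' shadows the outer one, which stays 0. */
int demo_flush_shadowed(int n, int bad)
{
	int err = 0;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		int err = demo_commit_entry(i, bad);	/* shadows outer err */

		if (err)
			break;
	}
	return err;	/* the outer variable was never written */
}

/* Fixed shape: no outer variable, return the error directly. */
int demo_flush_fixed(int n, int bad)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		int err = demo_commit_entry(i, bad);

		if (err)
			return err;
	}
	return 0;
}
```

Compilers can flag the first form when built with -Wshadow.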

Fixes: 620d7b91aadb ("net: airoha: ppe: Flush PPE SRAM table during PPE setup")
Link: https://lore.kernel.org/netdev/6a2b40e4.4dd82583.3a5c46.e52f@mx.google.com/
Signed-off-by: Wayen.Yan <win847@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/6a2bd37a.4034e349.1b41bb.1caf@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: Update Coly Li's email address

I have switched to colyli@fygo.io as my current email address.

Signed-off-by: Coly Li <colyli@fygo.io>
Link: https://patch.msgid.link/20260613150458.682707-1-colyli@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull debugobjects fix from Ingo Molnar:

- Fix potential debugobjects deadlock on PREEMPT_RT kernels (Waiman
Long)

* tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Don't call fill_pool() in early boot hardirq context

Merge tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
"The biggest news here is that this is my last pull request as I2C
  maintainer after 13.5 years. Starting with the 7.2 cycle, Andi Shyti,
  who has helped me greatly with maintaining the host drivers for a
  while now, is taking over. Thank you, Andi, and good luck with the
  subsystem. I'll be around for help, of course.

  Technically, there are two patches which might be a tad large for this
  late in the cycle, but most of that is explanatory comments, so I think
  they are suitable.

   - MAINTAINERS:
      - hand over I2C maintainership to Andi
      - minor updates

   - rust: fix I2cAdapter refcount double increment

   - imx: keep clock and pinctrl states consistent in runtime PM

   - imx-lpi2c: fix DMA resource leaks on PIO fallback

   - qcom-cci: fix NULL pointer dereference on remove

   - riic: fix reset refcount leak on resume_noirq error path

   - stm32f7: account for analog filter in timing computation

   - tegra:
      - fix suspend/resume handling in NOIRQ phase
      - update Tegra410 I2C timings to match hardware specs"

* tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  dt-bindings: i2c: mux-gpio: name correct maintainer
  MAINTAINERS: hand over I2C to Andi Shyti
  i2c: imx-lpi2c: fix resource leaks switching to devm_dma_request_chan()
  MAINTAINERS: i2c: designware: Remove inactive reviewer
  i2c: tegra: Fix NOIRQ suspend/resume
  i2c: tegra: Update Tegra410 I2C timing parameters
  i2c: qcom-cci: Fix NULL pointer dereference in cci_remove()
  i2c: stm32f7: fix timing computation ignoring i2c-analog-filter
  i2c: imx: fix clock and pinctrl state inconsistency in runtime PM
  i2c: riic: fix refcount leak in riic_i2c_resume_noirq()
  rust: i2c: fix I2cAdapter refcounts double increment

Merge tag 'timers-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/daniel.lezcano/linux into timers/clocksource

Pull clocksource/driver updates from Daniel Lezcano:

  - Remove the sifive,fine-ctr-bits property bindings because the
    information is redundant (Nick Hu)

  - Remove the TCIU8 interrupt bindings on Renesas because the
    documentation marks the interrupt as reserved, so it should not be
    described, and fix the conditional reset line for the
    RZ/{T2H,N2H} (Cosmin Tanislav)

  - Add the StarFive JHB100 clint DT bindings compatible string (Ley
    Foon Tan)

  - Extend the schema condition for interrupts to cover the D1 compatible
    variant and add D1 hstimer support (Michal Piekos)

  - Update the ARM architected timer support to handle the ACPI GTDT v3
    format and the EL2 virtual timer, enabling Linux to use the most
    appropriate timer when running with VHE, while also fixing several
    Device Trees to accurately reflect the underlying hardware (Marc
    Zyngier)

  - Cleanup and add the clocksource and the clockevent in the TI DM
    timer (Markus Schneider-Pargmann)

  - Add support for multiple watchdogs on tegra186 and tegra234,
    dedicating one as a kernel watchdog (Kartik Rajput)

  - Add the NXP clocksource selection for the scheduler in the Kconfig
    (Enric Balletbo i Serra)

Link: https://lore.kernel.org/all/1e55e8d6-8024-4f17-8620-ab3385465d76@oss.qualcomm.com

posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path

In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference
via get_pid() and stores it in timer.it.cpu.pid. If the subsequent
posix_cpu_timer_set() call fails, the function returns immediately
without calling posix_cpu_timer_del() to release the pid reference,
causing a leak.

Fix it by calling posix_cpu_timer_del() before the unlock-and-return
on the error path, consistent with the other exit paths in the same
function.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260611161738.97043-1-vulab@iscas.ac.cn

x86/irq: Add missing 's' back to thermal event printout

The /proc/interrupts handling rework dropped an 's' in the thermal event
printout, which breaks the thermal test in the Intel LKVS suite.

Bring the important letter back.

Fixes: 2b57c69917ee ("x86/irq: Make irqstats array based")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Closes: https://lore.kernel.org/oe-lkp/202606121325.97b29701-lkp@intel.com

time/jiffies: Register jiffies clocksource before usage

Teddy reported that a XEN HVM has a long boot delay, which was bisected to
the recent enhancements to the negative motion detection. It turned out
that the jiffies clocksource is used in early boot before it is registered,
which leaves the max_delta_raw field at zero. That causes the read out to
be clamped to the max delta of 0, which means time is not making progress.

Cure it by ensuring that it is initialized before its first usage in
timekeeping_init().
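
Why a zero maximum stalls time can be shown with a toy model. This is a
sketch only: demo_clamp_delta() and demo_advance() are hypothetical helpers
that mimic "clamp the read-out to a maximum delta", not the timekeeping
code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Limit an elapsed-cycles reading to a maximum delta. */
uint64_t demo_clamp_delta(uint64_t delta, uint64_t max_delta)
{
	return delta > max_delta ? max_delta : delta;
}

/* Advance a time value by the clamped delta. */
uint64_t demo_advance(uint64_t now, uint64_t delta, uint64_t max_delta)
{
	uint64_t step = demo_clamp_delta(delta, max_delta);

	assert(now <= UINT64_MAX - step);	/* no wrap in this toy model */
	return now + step;
}
```

With an uninitialized maximum of 0, every step is clamped to 0 and the
time value never moves, however large the measured delta is.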

Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Teddy Astie <teddy.astie@vates.tech>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13
Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech

hwmon: tmp401: Read "ti,n-factor" as signed

The "ti,n-factor" binding and examples allow negative correction
values. Reading it as u32 makes the helper type disagree with the
documented signed value and hides real schema mismatches.

Use the signed helper so the DT access matches the s32 value stored by
the driver.
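
The same 32-bit cell reads very differently depending on the type used to
interpret it. A small user-space sketch, assuming a big-endian 32-bit cell
as input; demo_cell_to_u32() and demo_cell_to_s32() are hypothetical
helpers, not the kernel property accessors:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one big-endian 32-bit cell. */
uint32_t demo_cell_to_u32(const uint8_t cell[4])
{
	assert(cell != NULL);
	return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8) | (uint32_t)cell[3];
}

/* Same bits, interpreted as a two's-complement signed value. */
int32_t demo_cell_to_s32(const uint8_t cell[4])
{
	uint32_t raw = demo_cell_to_u32(cell);
	int32_t val;

	memcpy(&val, &raw, sizeof(val));	/* reinterpret without UB */
	return val;
}
```

The bit pattern is identical in both cases; only the declared type decides
whether a negative correction value is seen as negative.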

Assisted-by: Codex:gpt-5-5
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20260612215332.1889497-1-robh@kernel.org
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

io_uring/bpf-ops: add a separate maintainer entry

Add a maintainer entry for io_uring bpf struct_ops related files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/d89f3b89e77b09a18daa45476fd1a40f2ee253cd.1780930463.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: check bio split for unaligned bvec

Offsets and lengths need to be validated against the dma alignment. This
check was skipped for a sufficiently small bio with a single bvec, which
may allow an invalid request to be dispatched to the driver. Force the
validation for an unaligned bvec by forcing the bio split path that
handles this condition.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nbd: Reclassify sockets to avoid lockdep circular dependency

syzbot reported a possible circular locking dependency in udp_sendmsg()
where fs_reclaim can be triggered while holding sk_lock, and fs_reclaim
can eventually depend on another sk_lock (e.g., if NBD is used for swap
or writeback and NBD uses TLS/TCP which acquires sk_lock).

Since the UDP socket and the NBD TCP/TLS socket are different, this is a
false positive. Fix this by reclassifying NBD sockets to a separate lock
class when they are added to the NBD device.

This is similar to what nvme-tcp and other network block devices do.

Fixes: ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier")
Reported-by: syzbot+607cdcf978b3e79da878@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2cdafe.428ffe26.258b27.0161.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260613042619.1108126-1-edumazet@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: make POLL_FIRST receive side checks consistent

io_recv() and io_recvzc() are the odd ones out, as they check
whether POLL_FIRST should be honored before checking if the file is a
socket. It doesn't really matter, but might as well make it consistent
across all receive and send types.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove the per-ctx fallback task_work machinery

With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.

With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: run the tctx task_work fallback directly

The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop - if run from the fallback
kernel thread, ts.cancel will get set, and the work terminated.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: switch normal task_work to a mpscq

Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.

Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: switch local task_work to a mpscq

The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.

For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64% at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11% at ~53Gb/sec

which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50% at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32% at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.

[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue

Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.
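
The LIFO-then-reverse scheme can be modeled with a plain singly linked
list. This is a single-threaded sketch with hypothetical names
(struct demo_lnode, demo_lifo_add(), demo_reverse_order()); the kernel's
llist does the add atomically, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_lnode {
	struct demo_lnode *next;
};

/* LIFO add: the newest node becomes the head. Returns the new head. */
struct demo_lnode *demo_lifo_add(struct demo_lnode *head, struct demo_lnode *n)
{
	assert(n != NULL);
	n->next = head;
	return n;
}

/* O(n) pass that flips the list so the oldest node comes first. */
struct demo_lnode *demo_reverse_order(struct demo_lnode *head)
{
	struct demo_lnode *out = NULL;

	while (head) {
		struct demo_lnode *next = head->next;

		head->next = out;
		out = head;
		head = next;
	}
	return out;
}
```

Every run has to walk all queued nodes once before the first one can be
processed, which is the cost a FIFO queue avoids.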

Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.

Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.

The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.

The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.

The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to keep the
consumer and producer cachelines separate. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.
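
For orientation, the classic form of Vyukov's intrusive MPSC queue can be
sketched with C11 atomics. This is not the kernel implementation: all
demo_* names are hypothetical, the stub is re-inserted with a plain push
rather than the conditional cmpxchg described above, and only the
externally held cursor is borrowed from the description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
	_Atomic(struct demo_node *) next;
};

struct demo_queue {
	_Atomic(struct demo_node *) tail;	/* producers exchange here */
	struct demo_node stub;			/* tail == stub means empty */
};

/* The consumer cursor lives outside the queue, owned by the caller. */
void demo_init(struct demo_queue *q, struct demo_node **cursor)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*cursor = &q->stub;
}

/* Wait-free push: one exchange on the tail plus one linking store. */
void demo_push(struct demo_queue *q, struct demo_node *n)
{
	struct demo_node *prev;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
	/* Window: n is the tail but not yet reachable from the cursor. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer only. NULL means "empty" or "a push is in flight". */
struct demo_node *demo_pop(struct demo_queue *q, struct demo_node **cursor)
{
	struct demo_node *head = *cursor;
	struct demo_node *next;

	assert(head != NULL);
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (head == &q->stub) {		/* skip over the stub */
		if (!next)
			return NULL;
		*cursor = head = next;
		next = atomic_load_explicit(&head->next, memory_order_acquire);
	}
	if (next) {
		*cursor = next;
		return head;
	}
	/* head looks like the last node; make sure no push is mid-link. */
	if (head != atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;
	demo_push(q, &q->stub);		/* put the stub behind head */
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (!next)
		return NULL;
	*cursor = next;
	return head;
}
```

A caller embeds struct demo_node in its own work item, keeps the cursor
next to its other consumer-side state, and treats a NULL from demo_pop()
as "try again later" rather than spinning on it.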

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: grab RCU read lock marking task run

Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2026-06-09 (idpf, ixgbe, igc)

Przemyslaw adds needed padding to idpf PTP structures to match firmware
expectations.

Larysa bypasses XPS configuration on XDP queues for ixgbe.

Khai Wen corrects the offset into the packet buffer when handling frame
preemption verification on igc.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igc: skip RX timestamp header for frame preemption verification
  ixgbe: do not configure xps for XDP queues
  idpf: add padding to PTP virtchnl structures
====================

Link: https://patch.msgid.link/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

octeontx2-af: npc: Fix size of entry2cntr_map

KASAN prints the splat below. It is caused by allocating a counter for
the reserved MCAM entry used by the CPT second-pass rule, although
mcam->entry2cntr_map is not allocated for reserved entries.

BUG: KASAN: slab-out-of-bounds in npc_map_mcam_entry_and_cntr+0xb0/0x1a0
Write of size 2 at addr ffff0001033e7ffe by task kworker/0:1/14

CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 6.1.67 #1
Hardware name: Marvell CN106XX board (DT)
Workqueue: events work_for_cpu_fn
Call trace:
dump_backtrace.part.0+0xe4/0xf0
show_stack+0x18/0x30
dump_stack_lvl+0x88/0xb4
print_report+0x154/0x458
kasan_report+0xb8/0x194
__asan_store2+0x7c/0xa0
npc_map_mcam_entry_and_cntr+0xb0/0x1a0
rvu_mbox_handler_npc_mcam_write_entry+0x268/0x280
npc_install_flow+0x840/0xfe0
rvu_npc_install_cpt_pass2_entry+0x138/0x190
rvu_nix_init+0x148c/0x2880
rvu_probe+0x1800/0x30b0
local_pci_probe+0x78/0xe0
work_for_cpu_fn+0x30/0x50
process_one_work+0x4cc/0x97c
worker_thread+0x360/0x630
kthread+0x1a0/0x1b0
ret_from_fork+0x10/0x20

Fixes: 55307fcb9258 ("octeontx2-af: Add mbox messages to install and delete MCAM rules")
Cc: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260610022344.969774-1-rkannoth@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/bpf: Add arena direct-value one-past-end reject test

BPF_MAP_TYPE_ARENA supports direct-value pseudo loads, but unlike array
maps its map value_size is zero and the valid direct-value range is the
arena mmap size, max_entries * PAGE_SIZE.

Commit 3ac1a467e376 ("bpf: Fix off-by-one boundary validation in arena
direct-value access") fixed arena_map_direct_value_addr() to reject an
offset exactly at the end of the arena mapping. Add a regression test
that loads a BPF_PSEUDO_MAP_VALUE with off == arena_size and verifies
that the verifier rejects it with the expected offset in the log.
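
The boundary condition under test can be stated as a small predicate. A
sketch under assumptions: the demo_* helpers are hypothetical and only
model the range check, and the buggy shape is a guess at what an
off-by-one in such a check typically looks like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the arena mapping: max_entries * PAGE_SIZE. */
uint64_t demo_arena_size(uint32_t max_entries, uint64_t page_size)
{
	assert(page_size != 0);
	return (uint64_t)max_entries * page_size;
}

/* Off-by-one shape: an offset exactly at the end is still accepted. */
bool demo_off_ok_buggy(uint64_t size, uint64_t off)
{
	return off <= size;
}

/* Strict shape: valid offsets are 0 .. size - 1. */
bool demo_off_ok_fixed(uint64_t size, uint64_t off)
{
	return off < size;
}
```

The regression test corresponds to the off == size case, the only input on
which the two predicates disagree.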

This is intentionally kept as a userspace raw-instruction test. I tried
expressing the same BPF_PSEUDO_MAP_VALUE + off == arena_size case in
verifier_arena.c with inline assembly. The only form that produces the
desired instruction bytes uses __imm_addr(arena), but that emits
R_BPF_64_NODYLD32, which the libbpf/bpftool link step rejects. Other
register, immediate, and memory constraints either fail in the BPF
backend or lower to a normal R_BPF_64_64 load followed by an ALU add,
which does not exercise arena_map_direct_value_addr() with the boundary
offset in the second ldimm64 slot.

A legacy test_verifier fixture can express the raw instruction directly,
but it needs arena map creation, mmap, and fixup plumbing in the legacy
runner. That is more intrusive than the small prog_tests raw-instruction
test.

Use the userspace raw-instruction test, following the existing selftests
pattern used for direct map-value pseudo loads, so insns[1].imm can be
set to arena_size precisely.

Assisted-by: ChatGPT:gpt-5.5
Signed-off-by: Woojin Ji <random6.xyz@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Junyoung Jang <graypanda.inzag@gmail.com>
Link: https://lore.kernel.org/r/20260612-arena-direct-value-v1-v4-1-b81b642f5277@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule

raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
restores interrupts. This means preemption is re-enabled (as part of
raw_res_spin_unlock()) while interrupts are still disabled, so this
cannot trigger an actual preemption.
This is inconsistent with other spinlock implementations
(raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
itself).

Adjust the macro to ensure interrupts are enabled before enabling
preemption, which allows scheduling at that point. Make the same
modification in the error path of raw_res_spin_lock_irqsave().

Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")
Cc: stable@vger.kernel.org
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260610090431.32427-1-gmonaco@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-fix-setting-retval-to-eperm-for-cgroup-hooks-not-returning-errno'

Xu Kuohai says:

====================
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno

This series fixes the issue reported by sashiko in [1]. The issue is that,
when a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for void and boolean LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.

Fix it by skipping setting -EPERM for hooks not returning errno.

[1] https://lore.kernel.org/bpf/20260605144232.95A141F00893@smtp.kernel.org/
====================

Link: https://patch.msgid.link/20260610201724.733943-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add retval test for bool and errno LSM cgroup hooks

Add a test to check the return value when a BPF program exits with 0 for
a boolean and an errno LSM hook.

For each hook, two BPF programs are attached. The first program returns
0 without calling bpf_set_retval() to exercise the return value translation
logic, while the second program reads the retval via bpf_get_retval().

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260610201724.733943-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno

When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for boolean and void LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.

Fix it by skipping setting -EPERM for hooks not returning errno.
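
The behaviour described above can be modelled roughly as follows. This is a
simplified Python sketch, not the kernel implementation: run_hook() and its
returns_errno flag are hypothetical names, and the model ignores programs that
change the value through bpf_set_retval().

```python
EPERM = 1
MAX_ERRNO = 4095


def is_errno(value: int) -> bool:
    """True for values in the negative errno range [-MAX_ERRNO, -1]."""
    return -MAX_ERRNO <= value < 0


def run_hook(prog_results, initial_retval=0, returns_errno=True):
    """Fold per-program exit codes into one hook return value.

    prog_results:  exit code of each attached program (0 or 1).
    returns_errno: hypothetical flag, False for boolean and void hooks.
    """
    retval = initial_retval
    for result in prog_results:
        # Only errno-based hooks get -EPERM when a program exits with 0
        # and no valid errno has been recorded yet.
        if result == 0 and returns_errno and not is_errno(retval):
            retval = -EPERM
    return retval
```

With returns_errno=False, a program exiting with 0 leaves the return value
untouched, which is the outcome the fix aims for on boolean and void hooks.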

Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: qrtr: fix 32-bit integer overflow in qrtr_endpoint_post()

qrtr_endpoint_post() validates an incoming packet with

if (!size || len != ALIGN(size, 4) + hdrlen)
goto err;

where size comes from the wire. On 32-bit, size_t is 32 bits and
ALIGN(size, 4) wraps to 0 for size >= 0xfffffffd, so the check
passes and skb_put_data(skb, data + hdrlen, size) writes past the
hdrlen-sized skb and oopses the kernel. 64-bit is unaffected.

This is the 32-bit residual of ad9d24c9429e2 ("net: qrtr: fix OOB
Read in qrtr_endpoint_post"), which fixed only the 64-bit case.

Reject any size that cannot fit the buffer before the ALIGN.
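
The wraparound can be reproduced with a small Python model of 32-bit
unsigned arithmetic. This is a sketch only: the function names are
hypothetical, and bounded_check_accepts() shows one assumed shape of a
"bound size before ALIGN" check, not necessarily the exact code of the fix.

```python
U32 = 0xFFFFFFFF


def align4_u32(size: int) -> int:
    """ALIGN(size, 4) evaluated with 32-bit unsigned wraparound."""
    return ((size + 3) & U32) & ~3 & U32


def old_check_accepts(size: int, length: int, hdrlen: int) -> bool:
    """Model of: if (!size || len != ALIGN(size, 4) + hdrlen) goto err;"""
    if size == 0:
        return False
    return length == ((align4_u32(size) + hdrlen) & U32)


def bounded_check_accepts(size: int, length: int, hdrlen: int) -> bool:
    """Assumed fix shape: reject sizes that cannot fit before aligning."""
    if size == 0 or length < hdrlen or size > length - hdrlen:
        return False
    return length == align4_u32(size) + hdrlen
```

For example, with hdrlen = 32 and len = 32, a wire size of 0xfffffffd aligns
to 0 in the model, so the old check accepts the packet while the bounded
check rejects it.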

Fixes: ad9d24c9429e2 ("net: qrtr: fix OOB Read in qrtr_endpoint_post")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611125455.2352279-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Check max_macs devlink param value against max capability

The max_macs devlink param is checked against the FW max value only at
param register time (driver load) and inside the validate callback
(devlink param set). The stored DRIVERINIT value persists across FW
resets and devlink reloads without any further checks against the max.

If the FW link type changes from Ethernet to IB and a FW reset happens,
the MAX cap for log_max_current_uc_list will become zero, but the
previously stored max_macs value remains and is unconditionally
programmed into the HCA caps in handle_hca_cap(). FW will then return a
syndrome during SET_HCA_CAP:

mlx5_cmd_out_err:839:(pid 3831): SET_HCA_CAP(0x109) op_mod(0x0) failed,
status bad parameter(0x3), syndrome (0x537801), err(-22)
set_hca_cap:907:(pid 3831): handle_hca_cap failed

This results in a failure to register the RDMA device.

This patch skips programming log_max_current_uc_list when the MAX
capability is 0 (in case of IB).

Fixes: 8680a60fc1fc ("net/mlx5: Let user configure max_macs generic param")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260611135230.534513-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'psp-add-support-for-dev-assoc-disassoc'

Wei Wang says:

====================
psp: Add support for dev-assoc/disassoc

The main purpose of this feature is to associate virtual devices like
veth or netkit with a real PSP device, so that PSP functionality can be
provided to applications running on virtual devices.

A typical deployment that works with this feature is as follows:
     Host Namespace:
     psp_dev_local  ←──physically linked──→ psp_dev_peer
  (PSP device)
       │
       │ BPF on psp_dev_local ingress: bpf_redirect_peer() to nk_guest
       │
  nk_host / veth_host
       │
       │ BPF on nk_host ingress: bpf_redirect_neigh() to psp_dev_local
       │
      Guest Namespace (netns):
       │
  nk_guest / veth_guest
  ★ PSP application runs here

      Remote Namespace (_netns):
  psp_dev_peer
  ★ PSP server application runs here

Note:
The general requirement for this feature to work:
For PSP to work correctly, the egress device at validate_xmit_skb()
time must have psp_dev matching the association's psd. Any device
stacking or traffic redirection that changes the egress device will
cause either:
1. TX validation failure (SKB_DROP_REASON_PSP_OUTPUT) - fail-safe
2. RX policy failure after tx-assoc - packets without PSP extension
   are rejected by receiver expecting encrypted traffic

Here are a few examples where this feature would not work:
- Bonding with load balancing in round-robin, XOR, 802.3ad mode across
  multiple PSP devices, or mixed PSP and non-PSP devices
- Bonding with active-backup mode might work without PSP migration for
  failover case.
- ipvlan/macvlan in bridge mode would not work, since packets are
  looped back locally without going through the PSP device.
====================

Link: https://patch.msgid.link/20260608233118.2694144-1-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: psp: add dev-get, no-nsid, and cleanup tests

Add the following 3 tests:

- _psp_dev_get_check_netkit_psp_assoc: verifies dev-get output in both
  host and guest namespaces, checking assoc-list, by-association flag,
  and nsid values
- _dev_assoc_no_nsid: tests dev-assoc and dev-disassoc without the nsid
  attribute, verifying ifindex lookup in the caller's namespace
- _psp_dev_assoc_cleanup_on_netkit_del: verifies that deleting the
  associated netkit interface properly cleans up the assoc-list, using
  a disposable netkit pair to avoid disturbing the shared environment

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-11-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: psp: add cross-namespace notification tests

Add tests that verify PSP notifications are delivered to listeners in
associated namespaces:

- _key_rotation_notify_multi_ns_netkit: triggers key rotation and
  verifies the notification is received in both main and guest namespaces
- _dev_change_notify_multi_ns_netkit: triggers dev_set and verifies the
  dev_change notification is received in both namespaces

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-10-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: psp: add dev-assoc data path test

Add an _assoc_check_list() test that associates nk_guest with the PSP
device and verifies the assoc-list is correctly populated.

Add _data_basic_send_netkit_psp_assoc() which tests PSP data send
through a netkit interface associated with a PSP device. The test
associates nk_guest with the PSP device, then sends PSP-encrypted
traffic from the guest namespace.

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-9-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: psp: support PSP in NetDrvContEnv infrastructure

Add infrastructure to support PSP tests across network namespaces
using NetDrvContEnv with netkit pairs. This enables testing PSP device
association, where a non-PSP-capable device (e.g. netkit) in a guest
namespace is associated with a real PSP device in the host namespace,
allowing the guest to perform PSP encryption/decryption through the
host's PSP hardware.

The topology is:
  Host NS:  psp_dev_local <---> nk_host
                |                  |
                |                  | (netkit pair)
                |                  |
  Remote NS: psp_dev_peer      Guest NS: nk_guest
             (responder)             (PSP tests)

env.py:
- nk_guest_ifindex is queried after moving the device into the guest
  namespace, so tests can use it directly for dev-assoc

psp.py:
- PSP device lookup supports container environments where the PSP
  device is on the physical interface, not the test interface
- Association helpers handle dev-assoc/dev-disassoc with defer-based
  cleanup to prevent state leaks on test assertion failures
- main() tries NetDrvContEnv with primary_rx_redirect and falls back
  to NetDrvEpEnv, so existing tests continue to work without the
  container environment
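
The defer-based cleanup idea can be sketched as below. This is a minimal
illustration using contextlib.ExitStack as a stand-in; it is not the
selftest library's own defer helper, and run_assoc_test() with its log list
is hypothetical.

```python
from contextlib import ExitStack


def run_assoc_test(log, fail=False):
    """Hypothetical test body: associate, check, always disassociate."""
    with ExitStack() as cleanup:
        log.append("dev-assoc")
        # Register the undo step right after the state change, so it
        # runs even if a later assertion raises.
        cleanup.callback(log.append, "dev-disassoc")
        if fail:
            raise AssertionError("simulated test assertion failure")
        log.append("checks passed")
```

Because the undo step is registered immediately after the association, a
failing assertion still leaves the log ending in "dev-disassoc".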

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-8-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: rename _nk_host_ifname to nk_host_ifname

Rename _nk_host_ifname to nk_host_ifname in NetDrvContEnv to make it
a public attribute, matching the nk_guest_ifname rename. Tests that
access the host-side netkit interface name (e.g. for cleanup after
deleting the netkit pair) no longer trigger pylint protected-access
warnings.

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-7-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: add _find_bpf_obj() to search hw/ for BPF objects

Add _find_bpf_obj() helper to NetDrvContEnv that searches the test
directory first, then falls back to the hw/ subdirectory. This allows
tests outside drivers/net/hw/ (e.g. psp.py in drivers/net/) to find
BPF objects built in the hw/ directory.

Update _attach_bpf() and _attach_primary_rx_redirect_bpf() to use
_find_bpf_obj() for BPF object discovery.
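
The lookup order described above could look roughly like this. It is a
sketch under assumptions: find_bpf_obj() is a hypothetical free function
rather than the NetDrvContEnv method, and raising FileNotFoundError on a miss
is an illustrative choice, not confirmed behaviour of the real helper.

```python
from pathlib import Path


def find_bpf_obj(test_dir, name):
    """Return the path of a BPF object: test_dir first, then test_dir/hw."""
    base = Path(test_dir)
    for candidate in (base / name, base / "hw" / name):
        if candidate.is_file():
            return candidate
    # Assumed failure mode for the sketch.
    raise FileNotFoundError(f"BPF object {name!r} not found under {base}")
```

An object present in both locations resolves to the copy in the test
directory, since that candidate is checked first.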

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-6-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: psp: refactor test builders to use ksft_variants

Replace the manual psp_ip_ver_test_builder() and ipver_test_builder()
functions with @ksft_variants decorators for data_basic_send and
data_mss_adjust. This is a pure refactor with no behavior change.
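
The general shape of replacing a hand-written test builder with a
variants decorator can be sketched as follows. The variants() and expand()
helpers here are hypothetical stand-ins written for illustration; they are
not the ksft_variants API, and data_basic_send() is reduced to a stub.

```python
def variants(params):
    """Hypothetical decorator: record the parameter values on the test."""
    def decorator(func):
        func.variant_params = list(params)
        return func
    return decorator


def expand(func):
    """Yield (case_name, callable) pairs, one per recorded variant."""
    for param in func.variant_params:
        # Bind param as a default argument so each case keeps its own value.
        yield f"{func.__name__}.{param}", (lambda p=param: func(p))


@variants(["4", "6"])
def data_basic_send(ipver):
    """Stub test body; a real test would send traffic here."""
    return f"send over IPv{ipver}"
```

The parameter list sits next to the test it applies to, and a runner can
expand it into named cases without a separate builder function per test.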

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-5-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>