git.ipfire.org Git - thirdparty/kernel/linux.git/log

Merge branch 'net-dsa-netc-add-bridge-mode-support'

Wei Fang says:

====================
net: dsa: netc: add bridge mode support

This series adds bridge mode support to the NETC DSA switch driver,
covering both VLAN-aware and VLAN-unaware operation.

The NETC switch manages forwarding through a set of hardware tables
accessed via NTMP: the FDB table (FDBT), VLAN filter table (VFT), egress
treatment table (ETT), and egress count table (ECT). The series extends
the NTMP layer with the operations required for bridging, then builds the
DSA bridge callbacks on top.

Since all switch ports share the VFT, only one VLAN-aware bridge is
supported.

FDB aging is managed in software. A periodic delayed work sweeps the
table using the hardware activity element mechanism, with a default aging
time of 300 seconds matching the IEEE 802.1Q standard. Per-port entries
are also flushed immediately on bridge leave and link-down events.
====================

Link: https://patch.msgid.link/20260611021458.2629145-1-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: implement dynamic FDB entry ageing

The NETC switch does not age out dynamic FDB entries automatically.
Without software management, stale entries persist after topology
changes and cause incorrect forwarding.

Add a delayed work that periodically removes entries that have not been
refreshed within the specified cycles. The effective ageing time is:

ageing_time = fdbt_ageing_delay * 100

Default values are 3s interval and 100 cycles (300s total), matching
the IEEE 802.1Q default ageing time. The work starts when the first
port joins a bridge (tracked via br_cnt) and is cancelled when the
last port leaves. All FDB operations are serialized under fdbt_lock.

Implement .set_ageing_time() to allow the bridge layer to reconfigure
ageing parameters on demand.
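
As a rough illustration of the arithmetic above, the sketch below derives
a sweep interval from a requested ageing time, assuming the cycle count
stays fixed at 100. AGEING_CYCLES and ageing_delay_ms() are hypothetical
names for this example, not identifiers from the driver, and the driver
may split the ageing time differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed number of sweep cycles before an idle entry is removed. */
#define AGEING_CYCLES 100u

/* Sweep interval (ms) such that interval * AGEING_CYCLES >= ageing_ms. */
uint32_t ageing_delay_ms(uint32_t ageing_ms)
{
    assert(ageing_ms > 0);
    /* Round up so the effective ageing time is never shorter than asked. */
    return ageing_ms / AGEING_CYCLES + (ageing_ms % AGEING_CYCLES != 0);
}
```

With the default of 300 seconds this yields a 3000 ms interval, matching
the 3s figure quoted above.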

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-10-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: add bridge mode support

Wire up the port_bridge_join, port_bridge_leave and port_vlan_filtering
DSA callbacks to support both VLAN-unaware and VLAN-aware bridge modes.

For VLAN-unaware bridges, each bridge instance is assigned a dedicated
internal PVID via NETC_VLAN_UNAWARE_PVID(bridge.num), counting down
from VID 4095. A VFT entry is created for this PVID with hardware MAC
learning and flood-on-miss forwarding enabled. The CPU port is included
as a VFT member so frames can reach the host. The reserved VID range is
blocked in port_vlan_add to prevent user-space conflicts.
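
A minimal sketch of the reserved-PVID idea, under stated assumptions: the
bridge index is taken to start at 0 and map to 4095 - index, and the
reserved range is sized by a max_bridges parameter. The real
NETC_VLAN_UNAWARE_PVID() macro and the driver's range check may differ;
unaware_pvid() and vid_is_reserved() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VID_MAX 4095u

/* Hypothetical mapping: index 0 -> 4095, index 1 -> 4094, ... */
uint16_t unaware_pvid(unsigned int idx, unsigned int max_bridges)
{
    assert(max_bridges <= VID_MAX && idx < max_bridges);
    return (uint16_t)(VID_MAX - idx);
}

/* True if a user-requested VID falls inside the reserved range. */
bool vid_is_reserved(uint16_t vid, unsigned int max_bridges)
{
    return vid > VID_MAX - max_bridges && vid <= VID_MAX;
}
```

A VLAN-add handler could reject any VID for which vid_is_reserved()
returns true before touching the hardware tables.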

Only one VLAN-aware bridge is supported at a time; this constraint is
enforced in port_bridge_join and port_vlan_filtering. The per-port PVID
is tracked in software and written to the BPDVR register whenever VLAN
filtering is active.

When a port leaves the bridge, its dynamic FDB entries are flushed right
away in port_bridge_leave(), without waiting for the ageing cycle. When
a link down event occurs on a port, netc_mac_link_down() will also clear
the port's dynamic FDB entries via netc_port_remove_dynamic_entries().
Non-bridge ports have no dynamic FDB entries, so this call is always
safe. Additionally, a .port_fast_age() callback is added to flush the
dynamic FDB entries associated with a port.

Host flood rules are removed from the ingress port filter table when a
port joins a bridge to avoid bypassing FDB lookup and MAC learning.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-9-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: add VLAN filter table and egress treatment management

Implement the DSA .port_vlan_add and .port_vlan_del operations to enable
VLAN-aware bridge offloading on the NETC switch.

VLAN membership is maintained in the VLAN Filter Table (VFT). Adding the
first port to a VLAN creates a new VFT entry with hardware MAC learning
and flood-on-miss forwarding; subsequent ports update the existing
entry's membership bitmap. Removing the last port deletes the entry.
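
The add/update/delete decision described above can be modelled in user
space as follows. This is only a sketch: struct vlan_shadow, enum vft_op
and both functions are hypothetical stand-ins for the driver's shadow
state, and the return value merely names which VFT operation would be
issued.

```c
#include <assert.h>
#include <stdint.h>

enum vft_op { VFT_NONE, VFT_ADD, VFT_UPDATE, VFT_DELETE };

struct vlan_shadow {
    uint16_t vid;
    uint32_t port_bitmap;   /* bit n set: port n is a member */
};

/* Add a port; the first member creates the entry, later ones update it. */
enum vft_op vlan_port_add(struct vlan_shadow *v, unsigned int port)
{
    uint32_t old = v->port_bitmap;

    assert(port < 32);
    v->port_bitmap |= UINT32_C(1) << port;
    if (v->port_bitmap == old)
        return VFT_NONE;        /* port was already a member */
    return old ? VFT_UPDATE : VFT_ADD;
}

/* Remove a port; removing the last member deletes the entry. */
enum vft_op vlan_port_del(struct vlan_shadow *v, unsigned int port)
{
    uint32_t old = v->port_bitmap;

    assert(port < 32);
    v->port_bitmap &= ~(UINT32_C(1) << port);
    if (v->port_bitmap == old)
        return VFT_NONE;        /* port was not a member */
    return v->port_bitmap ? VFT_UPDATE : VFT_DELETE;
}
```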

Egress tagging is handled through the Egress Treatment Table (ETT). Each
VLAN is allocated a group of ETT entries, one per available port. Ports
are assigned a sequential ett_offset during initialisation, used to
address each port's entry within the group. Untagged ports configure the
ETT to strip the outer VLAN tag; tagged ports pass frames through
unmodified. Each ETT group is optionally paired with an Egress Counter
Table (ECT) group for per-port frame counting, allocated on a best-effort
basis. When the egress rule of an ETT entry changes, the counter of the
corresponding ECT entry will be recounted to track the number of frames
that match the new egress rule.

A software shadow list serialised by vft_lock tracks active VLAN state
across both port membership and egress tagging. VID 0 is used for single
port mode and is ignored by both callbacks.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-8-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add helpers to set/clear table bitmap

NTMP index tables require software to allocate and manage entry IDs.
Add two bitmap helper functions to facilitate this management:

ntmp_lookup_free_eid(): finds the first zero bit in the given bitmap,
sets it to mark the entry as in-use, and returns the corresponding entry
ID. Returns NTMP_NULL_ENTRY_ID if no free entry is available.

ntmp_clear_eid_bitmap(): clears the bit associated with the given entry
ID in the bitmap to mark the entry as free. It is a no-op if the entry
ID is NTMP_NULL_ENTRY_ID.

Both functions are exported for use by other modules, such as the NETC
switch driver which needs to manage group index bitmaps for the Egress
Treatment Table (ETT) and Egress Count Table (ECT).
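
A user-space model of the two helpers, for illustration only. The
function names drop the ntmp_ prefix to make clear they are not the
kernel implementations, NULL_ENTRY_ID is an assumed stand-in for
NTMP_NULL_ENTRY_ID, and a plain loop replaces whatever bitmap primitives
the kernel code uses.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define NULL_ENTRY_ID UINT32_MAX   /* assumed stand-in for NTMP_NULL_ENTRY_ID */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Find the first zero bit, mark it in-use, and return its index. */
uint32_t lookup_free_eid(unsigned long *bitmap, uint32_t size)
{
    assert(bitmap != NULL);
    for (uint32_t i = 0; i < size; i++) {
        unsigned long *word = &bitmap[i / BITS_PER_WORD];
        unsigned long mask = 1UL << (i % BITS_PER_WORD);

        if (!(*word & mask)) {
            *word |= mask;
            return i;
        }
    }
    return NULL_ENTRY_ID;       /* no free entry available */
}

/* Mark an entry as free again; a null entry ID is ignored. */
void clear_eid_bitmap(unsigned long *bitmap, uint32_t eid)
{
    assert(bitmap != NULL);
    if (eid == NULL_ENTRY_ID)
        return;
    bitmap[eid / BITS_PER_WORD] &= ~(1UL << (eid % BITS_PER_WORD));
}
```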

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-7-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: initialize the group bitmap of ETT and ECT

The Egress Treatment Table (ETT) and Egress Count Table (ECT) are both
index tables whose entry IDs are allocated by software. Every num_ports
entries form a group, where each entry in the group corresponds to one
port. To facilitate group allocation and management, initialize the group
index bitmaps for both tables based on hardware capabilities reported by
ETTCAPR and ECTCAPR registers.

The bitmap size per table is calculated as the total number of hardware
entries divided by the number of available ports, which gives the number
of groups available for software allocation. A set bit in the bitmap
represents a group index that has been allocated.
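
The sizing rule above, plus one plausible way to address an entry inside
a group, can be sketched as below. The group-major layout (entry ID =
group * num_ports + per-port offset) is an assumption inferred from the
description, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of allocatable groups: total entries divided by port count. */
uint32_t num_groups(uint32_t num_entries, uint32_t num_ports)
{
    assert(num_ports > 0);
    return num_entries / num_ports;   /* leftover entries stay unused */
}

/* Entry ID of one port's slot within a group (assumed layout). */
uint32_t group_entry_id(uint32_t group, uint32_t num_ports,
                        uint32_t port_offset)
{
    assert(port_offset < num_ports);
    return group * num_ports + port_offset;
}
```

For example, a table of 1024 entries shared by 5 ports would give 204
groups, with the remaining 4 entries left unallocated.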

These bitmaps will be used by subsequent patches that add VLAN support.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-6-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add "Update" operation to the egress count table

The egress count table is a static bounded index table in which
egress-related statistics are maintained. The table is implemented as a
linear array of entries accessed using an index (0, 1, 2, ..., n) that
uniquely identifies an entry within the array. Egress Counter Entry ID
(EC_EID) is used as an index to an entry in this table. The EC_EID is
specified in the egress treatment table.

Egress count table entries are always present and enabled. The table
only supports access via entry ID, which is assigned by software. The
table supports the Update, Query, and Query followed by Update
operations; currently, only the Update operation is implemented.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-5-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add interfaces to manage egress treatment table

Each entry in the egress treatment table contains the egress packet
processing actions to be applied to a grouping or scope of packets
exiting on a particular egress port of the switch. A scope of packets,
for example, could be the packets exiting a particular VLAN, matching
a particular 802.1Q bridge forwarding entry or belonging to a stream
identified at ingress. The egress treatment table is implemented as a
linear array of entries accessed using an index (0, 1, 2, ..., n) that
uniquely identifies an entry within the array.

The egress treatment table only supports access via entry ID, which is
assigned by software. It supports Add, Update, Delete and Query
operations. Note that the Query operation is not implemented yet.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-4-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add "Update" and "Delete" operations to VLAN filter table

Add two interfaces to manage entries in the VLAN filter table:

ntmp_vft_update_entry(): Update the configuration element data of the
specified VLAN filter entry based on the given VLAN ID. It uses the
exact key access method to locate the entry.

ntmp_vft_delete_entry(): Delete the VLAN filter entry corresponding to
the specified VLAN ID. It also uses the exact key access method to
identify the target entry.

In addition, introduce struct vft_req_qd to describe the request data
buffer format for Query and Delete actions of the VLAN filter table,
which contains a common request data header and a VLAN access key.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-3-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add interfaces to manage dynamic FDB entries

Add three interfaces to manage dynamic entries in the FDB table:

ntmp_fdbt_update_activity_element(): Update the activity element of all
dynamic FDB entries. For each entry, if its activity flag is not set,
which means no packet has matched this entry since the last update, the
activity counter is incremented. Otherwise, both the activity flag and
activity counter are reset. The activity counter is used to track how
long an FDB entry has been inactive, which is useful for implementing
an ageing mechanism.

ntmp_fdbt_delete_ageing_entries(): Delete all dynamic FDB entries whose
activity flag is not set and whose activity counter is greater than or
equal to the specified threshold. This is used to remove stale entries
that have been inactive for too long.

ntmp_fdbt_delete_port_dynamic_entries(): Delete all dynamic FDB entries
associated with the specified switch port. This is typically called when
a port goes down or is removed from a bridge.
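
The semantics of the three operations can be modelled on a plain array,
as in the sketch below. In the driver these are table commands issued to
hardware; struct fdb_entry, its field widths, the saturating counter and
all function names here are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fdb_entry {
    bool valid;
    bool dynamic;
    bool act_flag;     /* set when a frame has matched the entry */
    uint8_t act_cnt;   /* sweeps since the entry was last matched */
    uint8_t port;
};

/* One sweep: idle entries count up, active entries are reset. */
void update_activity(struct fdb_entry *t, size_t n)
{
    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!t[i].valid || !t[i].dynamic)
            continue;
        if (t[i].act_flag) {
            t[i].act_flag = false;
            t[i].act_cnt = 0;
        } else if (t[i].act_cnt < UINT8_MAX) {
            t[i].act_cnt++;     /* saturate instead of wrapping */
        }
    }
}

/* Remove idle dynamic entries whose counter reached the threshold. */
size_t delete_ageing(struct fdb_entry *t, size_t n, uint8_t threshold)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        if (t[i].valid && t[i].dynamic && !t[i].act_flag &&
            t[i].act_cnt >= threshold) {
            t[i].valid = false;
            removed++;
        }
    }
    return removed;
}

/* Remove every dynamic entry learned on the given port. */
size_t delete_port_dynamic(struct fdb_entry *t, size_t n, uint8_t port)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        if (t[i].valid && t[i].dynamic && t[i].port == port) {
            t[i].valid = false;
            removed++;
        }
    }
    return removed;
}
```

An ageing loop would call update_activity() once per interval and then
delete_ageing() with the configured cycle threshold.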

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-2-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net/openvswitch: add SET action test

Add test_action_set exercising OVS_ACTION_ATTR_SET with an ipv4 dst
rewrite. The test verifies the SET action in three steps: first
confirm normal forwarding, then apply set(ipv4(dst=10.0.0.99)) to
rewrite the destination to an address nobody owns and verify ping
fails, then restore normal forwarding and verify connectivity
recovers.

Signed-off-by: Minxi Hou <houminxi@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260612130503.311240-1-houminxi@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sfp-extend-smbus-support'

Jonas Jelonek says:

====================
net: sfp: extend SMBus support

Today, the SFP driver only drives I2C adapters that advertise full
I2C_FUNC_I2C, or SMBus-only adapters via single-byte transfers (with
hwmon disabled). Several SoCs ship I2C/SMBus-only controllers that
support more than just byte access -- e.g. word and I2C block -- and
have SFP cages wired to them. Today, those adapters either work
poorly or not at all.

This series teaches the SFP driver to use the larger SMBus access
modes when the adapter advertises them, and along the way starts
honoring i2c_adapter quirks on read/write length so adapters that
cap below the SFP block size are handled correctly. Patch 1 is a
small prep doing only the quirks handling; patch 2 extends the
SMBus path itself.

Capability matrix supported by patch 2:
  - BYTE only:                   single-byte access (unchanged).
  - BYTE + WORD:                 word for >=2-byte chunks, byte tail.
  - I2C_BLOCK present:           block as the universal transport.
  - WORD only (no BYTE/BLOCK):   accepted with WARN_ONCE; works for
                                 even-length transfers, odd-length
                                 transfers will error at xfer time.

Adapters with asymmetric R/W capabilities (e.g. only READ_I2C_BLOCK
without WRITE_I2C_BLOCK) remain functionally correct but use the
worse-supported direction's max for both directions, since
i2c_max_block_size is a single field. No mainline I2C driver was
seen advertising such asymmetry; per-direction sizes can be added
later if needed.
====================

Link: https://patch.msgid.link/20260614133418.2068201-1-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: extend SMBus support

Commit 7662abf4db94 ("net: phy: sfp: Add support for SMBus module access")
added SMBus access for SFP modules, but limited it to single-byte
transfers. As a side effect, hwmon is disabled (16-bit reads cannot be
guaranteed atomic) and a warning is printed.

Many SMBus-only I2C controllers in the wild support more than just
byte access, and SFP cages are often wired to such controllers
rather than to a full-featured I2C controller -- e.g. the SMBus
controllers in the Realtek longan and mango SoCs, which advertise
word access and I2C block reads. Today, they cannot drive an SFP at
all without falling back to the byte-only path.

Extend sfp_smbus_read()/sfp_smbus_write() so that, in addition to
the existing byte access, they also use SMBus word access and SMBus
I2C block access whenever the adapter advertises them. Both
directions are handled in a single read and a single write helper
that pick the largest supported transfer per chunk and fall back as
needed.

I2C-block is preferred unconditionally when available: the protocol
carries any length 1..32, so it can serve every chunk -- including
the 1- and 2-byte tails -- without help from word or byte access.
Note that this requires I2C_FUNC_SMBUS_I2C_BLOCK, which reads a
caller-specified number of bytes. This deviates from the official
SMBus Block Read (length is supplied by the slave) but is widely
supported by Linux I2C controllers/drivers.

Capability matrix this implementation supports:

  - BYTE only:                  works (unchanged behaviour); 1-byte
                                xfers, hwmon disabled.
  - BYTE + WORD:                word for >=2-byte chunks, byte for
                                trailing odd byte.
  - I2C_BLOCK present (with or
    without BYTE/WORD):         block as the universal transport for
                                every chunk.
  - WORD only (no BYTE/BLOCK):  accepted with WARN_ONCE. Even-length
                                transfers work; odd-length transfers
                                (e.g. the 3-byte cotsworks fixup
                                write) hit the BYTE branch which the
                                adapter does not implement, so the
                                xfer returns an error and the
                                operation is aborted. No mainline
                                I2C driver was found to advertise
                                WORD without BYTE; the warning lets
                                us learn about it if it ever shows
                                up.

Adapters with asymmetric R/W capabilities (e.g. only READ_I2C_BLOCK
but not WRITE_I2C_BLOCK) remain functionally correct -- the
per-iteration fallback uses the direction-specific bits -- but the
shared i2c_max_block_size is sized by the all-bits-set check, so a
transfer in the better-supported direction is not upgraded. None of
the mainline I2C bus drivers surveyed during review advertise such
asymmetry; promoting i2c_max_block_size to per-direction sizes can
be revisited if needed.
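
The per-chunk selection in the capability matrix can be sketched as
follows. This is a simplified user-space model, not the sfp.c code: the
CAP_* bits stand in for the I2C_FUNC_SMBUS_* capability flags, direction
is ignored, and max_block plays the role of i2c_max_block_size.

```c
#include <assert.h>
#include <stddef.h>

#define CAP_BYTE  (1u << 0)
#define CAP_WORD  (1u << 1)
#define CAP_BLOCK (1u << 2)
#define SMBUS_BLOCK_MAX 32u

/* Bytes to move in the next transfer, or 0 if the adapter cannot do it. */
size_t next_xfer_len(unsigned int caps, size_t remaining, size_t max_block)
{
    assert(max_block >= 1 && max_block <= SMBUS_BLOCK_MAX);
    if (remaining == 0)
        return 0;
    if (caps & CAP_BLOCK)       /* block serves any length, tails too */
        return remaining < max_block ? remaining : max_block;
    if ((caps & CAP_WORD) && remaining >= 2)
        return 2;
    if (caps & CAP_BYTE)
        return 1;
    return 0;                   /* e.g. WORD only with an odd tail */
}

/* Number of transfers needed for len bytes, or -1 on an unsupported tail. */
int count_xfers(unsigned int caps, size_t len, size_t max_block)
{
    int n = 0;

    while (len) {
        size_t step = next_xfer_len(caps, len, max_block);

        if (!step)
            return -1;
        len -= step;
        n++;
    }
    return n;
}
```

In this model a 3-byte transfer on a WORD-only adapter fails after the
first word, which mirrors the error case described in the matrix.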

Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260614133418.2068201-3-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: apply I2C adapter quirks to limit block size

The SFP driver assumes all I2C adapters support reading and writing the
pre-defined block size SFP_EEPROM_BLOCK_SIZE of 16 bytes. This constant
was probably chosen based on good guesses and known limitations of a
range of I2C adapters and SFP modules.

However, I2C adapters may support even less and usually declare this
via I2C quirks. Theoretically, such an adapter may provide full
functionality but only support a read and write length of e.g. 8 bytes.
Currently, the SFP driver doesn't account for that.

Add handling for I2C quirks in SFP I2C configuration taking the fields
max_read_len and max_write_len in struct i2c_adapter_quirks into account
to further limit the maximum block size if needed.
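
A minimal sketch of the clamping, assuming a quirk value of 0 means "no
limit" and ignoring any per-message overhead such as the register
address byte. struct adapter_quirks is a cut-down stand-in for struct
i2c_adapter_quirks, and max_block_size() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EEPROM_BLOCK_SIZE 16u   /* mirrors SFP_EEPROM_BLOCK_SIZE */

/* Only the two fields this example needs. */
struct adapter_quirks {
    uint16_t max_read_len;      /* 0: no limit (assumed convention) */
    uint16_t max_write_len;     /* 0: no limit (assumed convention) */
};

/* Largest block usable in both directions; q may be NULL (no quirks). */
size_t max_block_size(const struct adapter_quirks *q)
{
    size_t size = EEPROM_BLOCK_SIZE;

    if (!q)
        return size;
    if (q->max_read_len && q->max_read_len < size)
        size = q->max_read_len;
    if (q->max_write_len && q->max_write_len < size)
        size = q->max_write_len;
    assert(size >= 1);
    return size;
}
```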

Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260614133418.2068201-2-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next.
More specifically, this contains conncount rework to address AI related
reports, assorted Netfilter updates and two small incremental updates on
IPVS:

1) Replace old obsolete workqueues (system_wq, system_unbound_wq)
   in IPVS, from Marco Crivellari.

2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables.
   In recent years, reporters say that the use of WARN_ON{_ONCE}
   in conjunction with panic_on_warn=1 results in DoS. Let's replace
   it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test
   infrastructure and fuzzers, while also providing context to AI
   agents. From Fernando F. Mancera.

Five patches from Florian Westphal to address AI reports in the conncount
infrastructure:

3) Fix missing rcu read lock section when calling
   __ovs_ct_limit_get_zone_limit().

4) Add a dedicated lock per rbtree; this increases memory
   usage but should improve scalability.

5) Add a helper function to find the rbtree node, no functional
   changes are intended.

6) Add sequence counter to detect concurrent tree modifications
   and retry lookups.

7) Add locks to GC conncount walk and address other nitpicks.

Then, several assorted updates:

8) Defensive tree-wide addition of NULL checks for ct extensions.

9) Bail out if flowtable bypass cannot be fully set up from the
   flow offload expression, instead of lazily building a likely
   incomplete one.

10) Fix documentation for the new conn_max sysctl toggle in IPVS.

11) Add nf_dev_xmit_recursion*() helpers and use them, to address
    recent AI reports.

* tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them
  ipvs: fix doc syntax for conn_max sysctl
  netfilter: flowtable: bail out if forward path cannot be discovered
  netfilter: conntrack: check NULL when retrieving ct extension
  netfilter: nf_conncount: gc and rcu fixes
  netfilter: nf_conncount: add sequence counter to detect tree modifications
  netfilter: nf_conncount: split count_tree_node rbtree walk into helper
  netfilter: nf_conncount: use per nf_conncount_data spinlocks
  netfilter: nf_conncount: callers must hold rcu read lock
  netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths
  ipvs: Replace use of system_unbound_wq with system_dfl_long_wq
====================

Link: https://patch.msgid.link/20260614114605.474783-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mailmap: add entry for Jesse Brandeburg

My Intel email address is no longer used, redirect it to my kernel.org
address.

Signed-off-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20260612224727.141614-1-jbrandeb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netdev-expose-page-pool-order-via-netlink'

Dragos Tatulea says:

====================
netdev: expose page pool order via netlink

This small series exposes io_uring's high order page configuration
via the page_pool netlink interface and updates the appropriate
selftest to check this value.
====================

Link: https://patch.msgid.link/20260612211709.1456966-2-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

io_uring/zcrx: selftests: verify rx_buf_len for large chunks

Check the newly added rx_buf_len page_pool field for io_uring
in the existing large-chunks test after the receiver is up.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260612211709.1456966-4-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdev: expose io_uring rx_page_order order via netlink

This adds observability for the io_uring zcrx rx-buf-len configuration.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260612211709.1456966-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-vsock-improve-vng-version-and-quirk-handling'

Bobby Eshleman says:

====================
selftests/vsock: improve vng version and quirk handling

As vng has continued updating, there have been two things in our
selftests that have been affected. One is that newer versions always
emit the vng version warning, and two is that we have a workaround that
is not needed in newer versions.

This series just updates the version handling to allow all newer
versions without warning and version-gates the workaround to only those
versions that don't have the commit that fixed the root cause.

Additionally, we add a function for comparing major.minor versions which
is used in both patches.
====================

Link: https://patch.msgid.link/20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/vsock: skip vng setsid workaround on >= 1.41

virtme-ng 1.41 ships the upstream fix for the SIGTTOU hang
(https://github.com/arighi/virtme-ng/pull/453), so the setsid wrapper in
vng_dry_run() is no longer needed there. Gate the workaround on the vng
version: setsid is used for vng < 1.41, and vng is invoked directly on
>= 1.41.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612-vsock-test-update-v1-2-7d7eeed3ac8f@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/vsock: accept vng 1.33 or >= 1.36

The current vng version check uses a discrete allowlist of "1.33",
"1.36", and "1.37", which forces a script update on every new release
even though all post-1.36 releases work.

Replace the discrete list with: "1.33", or any version >= 1.36. 1.34
and 1.35 are skipped because they were not tested. Add a version_lt()
helper that compares MAJOR.MINOR numerically, so the check reads as a
straightforward version comparison.
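
The selftest helper is a shell function; the comparison logic it
describes can be modelled in C as below. This is an illustrative sketch
only: parse_version() and vng_version_ok() are hypothetical names, and
anything after MAJOR.MINOR in the version string is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Parse "MAJOR.MINOR" from the start of s; false if malformed. */
bool parse_version(const char *s, unsigned int *major, unsigned int *minor)
{
    assert(s && major && minor);
    return sscanf(s, "%u.%u", major, minor) == 2;
}

/* True if a < b, comparing MAJOR and then MINOR as integers. */
bool version_lt(const char *a, const char *b)
{
    unsigned int am, an, bm, bn;

    if (!parse_version(a, &am, &an) || !parse_version(b, &bm, &bn))
        return false;
    return am < bm || (am == bm && an < bn);
}

/* Accept 1.33, or any version >= 1.36. */
bool vng_version_ok(const char *v)
{
    unsigned int major, minor;

    if (!parse_version(v, &major, &minor))
        return false;
    if (major == 1 && minor == 33)
        return true;
    return !version_lt(v, "1.36");
}
```

Comparing the fields as integers matters because a plain string
comparison would order "1.9" after "1.41".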

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612-vsock-test-update-v1-1-7d7eeed3ac8f@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: ipv6: clamp default adverting MSS to avoid GSO_BY_FRAGS (0xFFFF)

When MTU is large, ip6_default_advmss() can return IPV6_MAXPLEN (65535).
This is interpreted by TCP as mss_clamp, allowing the MSS to reach 65535.

However, 0xFFFF is also used as a magic value GSO_BY_FRAGS in the kernel.
If a TCP packet with gso_size=0xFFFF is passed to skb_segment(), it will
be mistakenly treated as GSO_BY_FRAGS, leading to a NULL pointer
dereference because local TCP packets do not use frag_list.

Fix this by returning min(IPV6_MAXPLEN, GSO_BY_FRAGS - 1) (65534) from
ip6_default_advmss() when MTU is large.

Also update the stale comment in ip6_default_advmss() which suggested
that IPV6_MAXPLEN is returned to mean "any MSS".
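
The effect of the clamp can be shown with a simplified model. The sketch
below is not the kernel function: default_advmss() is a hypothetical
name, it omits the minimum-MSS handling, and it assumes a 40-byte IPv6
header and a 20-byte TCP header without options.

```c
#include <assert.h>
#include <stdint.h>

#define IPV6_MAXPLEN 65535u
#define GSO_BY_FRAGS 0xFFFFu

/* Simplified advertised MSS for a given MTU, never equal to GSO_BY_FRAGS. */
uint32_t default_advmss(uint32_t mtu)
{
    const uint32_t hdrs = 40 + 20;  /* IPv6 + TCP headers, no options */
    const uint32_t cap = IPV6_MAXPLEN < GSO_BY_FRAGS - 1 ?
                         IPV6_MAXPLEN : GSO_BY_FRAGS - 1;
    uint32_t mss;

    assert(mtu > hdrs);
    mss = mtu - hdrs;
    return mss > cap ? cap : mss;   /* cap is 65534 */
}
```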

Fixes: 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes")
Reported-by: syzbot+ebdb22d461c904fc3cb2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2c3193.8812e0fc.3c3fa4.0001.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260612162517.83394-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: fix UAF in tipc_l2_send_msg()

Syzbot reported a slab-use-after-free in ipvlan_hard_header() when
called from tipc_l2_send_msg().

The root cause is that tipc_disable_l2_media() calls synchronize_net()
while b->media_ptr is still valid. This allows concurrent RCU readers
to obtain the device pointer after synchronize_net() has finished.
The pointer is cleared later in bearer_disable(), but without any
subsequent synchronization, allowing the device to be freed while
still in use by readers.

Fix this by clearing b->media_ptr in tipc_disable_l2_media() before
calling synchronize_net().

This is safe to do now because the call order in bearer_disable()
was reversed in 0d051bf93c06 ("tipc: make bearer packet filtering generic")
to call tipc_node_delete_links() (which needs the pointer) before
disable_media().

Fixes: 282b3a056225 ("tipc: send out RESET immediately when link goes down")
https://lore.kernel.org/netdev/6a2c1007.428ffe26.258b27.015d.GAE@google.com/T/#u
Reported-by: syzbot+64ec81389cbad56a8c35@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260612135949.4010482-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'octeontx2-quiesce-stale-mailbox-irq-state-before-request_irq'

Runyu Xiao says:

====================
octeontx2: quiesce stale mailbox IRQ state before request_irq()

Both OTX2 mailbox registration paths currently install their IRQ
handlers before clearing stale local mailbox interrupt state, even
though the code comments already say that the clear is needed first to
avoid spurious interrupts.

This issue was found by our static analysis tool and manually audited on
Linux v6.18.21. Directed QEMU no-device validation further showed that
the real PF and VF mailbox handlers are already reachable in that
pre-clear window and can touch the same mailbox and workqueue carrier
before local quiesce has completed.

This series keeps the change minimal:

- clear stale mailbox interrupt state before request_irq()
- keep interrupt enabling after the handler is installed

That closes the early-IRQ window without introducing a new
enable-before-handler window.

Patch 1 fixes the PF mailbox registration path.
Patch 2 fixes the VF mailbox registration path.

Build-tested by compiling otx2_pf.o and otx2_vf.o.

No OTX2 hardware was available for end-to-end runtime testing.
====================

Link: https://patch.msgid.link/20260611160014.3202224-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-vf: clear stale mailbox IRQ state before request_irq()

otx2vf_register_mbox_intr() currently installs the VF mailbox IRQ
handler before clearing stale mailbox interrupt state. The code then says
that local interrupt bits should be cleared first to avoid spurious
interrupts, but that clear still happens only after request_irq() has
already made the handler reachable.

A running system can reach this during VF mailbox interrupt registration
while stale or latched RVU_VF_INT state is still present. If delivery
happens in the request_irq()-to-clear window,
otx2vf_vfaf_mbox_intr_handler() can run before local quiesce and touch
the same vf->mbox and vf->mbox_wq carrier that probe and teardown later
reuse or destroy.

Move the stale mailbox interrupt clear ahead of request_irq(), but keep
interrupt enabling after the handler is installed. This closes the
pre-clear early-IRQ window without creating a new enable-before-handler
window.

Fixes: 3184fb5ba96e ("octeontx2-vf: Virtual function driver support")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260611160014.3202224-3-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: clear stale mailbox IRQ state before request_irq()

otx2_register_mbox_intr() currently installs the PF mailbox IRQ handler
before clearing stale mailbox interrupt state. The function itself then
comments that the local interrupt bits must be cleared first to avoid
spurious interrupts, but that clear happens only after request_irq() has
already exposed the handler to irq delivery.

A running system can reach this during PF mailbox interrupt registration
while stale or latched RVU_PF_INT state is still present. If delivery
happens in the request_irq()-to-clear window,
otx2_pfaf_mbox_intr_handler() can run before local quiesce and touch
the same pf->mbox and pf->mbox_wq carrier that probe and teardown later
reuse or destroy.

Move the stale mailbox interrupt clear ahead of request_irq(), but keep
interrupt enabling after the handler is installed. This closes the
pre-clear early-IRQ window without creating a new enable-before-handler
window.

Fixes: 5a6d7c9daef3 ("octeontx2-pf: Mailbox communication with AF")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260611160014.3202224-2-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: sfp: detect presence via I2C when no MOD_DEF0 GPIO

An SFP cage (compatible "sff,sfp") whose MOD_DEF0 signal is not wired to a
GPIO currently falls back to sff_gpio_get_state(), which unconditionally
reports the module as present. An empty cage therefore fails its probe and
is parked in SFP_MOD_ERROR forever; because SFP_F_PRESENT never deasserts
there is no REMOVE event to recover the state machine, so a module inserted
after boot is never detected, and empty cages spam -EIO at boot.

This affects boards that do not route any cage presence signal to a
software-readable input. On the NicGiga S100-0800S-M (RTL9303, 8x SFP+) the
cage I2C bus is the switch's SMBus master; TX_DISABLE is driven via a
PCA9534 I/O expander, but no MOD_ABS/MOD_DEF0 line reaches a readable GPIO
(the RTL9303 gpio0 lines read stuck-low, the single PCA9534 is fully
consumed by TX_DISABLE, and there is no RTL8231). The Horaco ZX-SW82TS-L2P
(RTL9302D, 2x SFP+) is independently affected in the same way.

For such an SFP cage, derive presence from a throttled single-byte I2C read
of the module EEPROM instead: a successful read asserts SFP_F_PRESENT,
R_PROBE_ABSENT consecutive failures clear it (to ride out a transient error
on a live module). The existing poll then emits SFP_E_INSERT / SFP_E_REMOVE
normally, giving working hot-plug and silencing the boot-time -EIO spam on
empty cages. Presence is re-probed every T_PROBE_PRESENT, so insertion is
detected within that interval and removal within
T_PROBE_PRESENT * R_PROBE_ABSENT.
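
The debounce described above can be sketched in standalone C as follows.
This is an illustration only, not the driver code: the struct, the function
name and the value 3 for R_PROBE_ABSENT are assumptions, since the commit
message names the constant but does not state its value.

```c
#include <stdbool.h>

/* Assumed value for illustration; the commit message does not state it. */
#define R_PROBE_ABSENT 3

/* Hypothetical per-cage state for I2C-derived presence. */
struct sketch_presence {
	bool present;            /* plays the role of SFP_F_PRESENT */
	unsigned int failures;   /* consecutive failed EEPROM reads */
};

/*
 * Feed one probe result (read_ok is true when the single-byte read
 * succeeded) and return the resulting presence state.
 */
bool sketch_presence_update(struct sketch_presence *st, bool read_ok)
{
	if (read_ok) {
		/* Any successful read asserts presence immediately. */
		st->failures = 0;
		st->present = true;
	} else if (st->present && ++st->failures >= R_PROBE_ABSENT) {
		/* Only a run of failures clears presence. */
		st->present = false;
		st->failures = 0;
	}
	return st->present;
}
```

With these assumptions a single failed read on a live module leaves the
state untouched, while the third consecutive failure reports removal.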

A soldered-down module (compatible "sff,sff") has no presence signal and is
genuinely always present, so it continues to use sff_gpio_get_state(); the
new path is gated on the cage type advertising SFP_F_PRESENT.

Signed-off-by: Greg Patrick <gregspatrick@hotmail.com>
Tested-by: Manuel Stocker <mensi@mensi.ch>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260611175341.2223184-1-gregspatrick@hotmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'devlink-warn-on-resource-id-collision-with-parent_top'

David Yang says:

====================
devlink: Warn on resource ID collision with PARENT_TOP

Filter out the ambiguous case of

enum {
    MY_RESOURCE_ID_A,  /* == DEVLINK_RESOURCE_ID_PARENT_TOP ! */
    MY_RESOURCE_ID_B,
    ...
};

register(..., MY_RESOURCE_ID_A, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_B, MY_RESOURCE_ID_A, ...);
====================

Link: https://patch.msgid.link/20260611070856.889700-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Warn on resource ID collision with PARENT_TOP

ID 0 serves as the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP to mark
top-level resources. While it is technically possible to use 0 as a real
resource ID, a user might be tempted to write:

enum {
    MY_RESOURCE_ID_A,  /* == DEVLINK_RESOURCE_ID_PARENT_TOP ! */
    MY_RESOURCE_ID_B,
    MY_RESOURCE_ID_C,
    MY_RESOURCE_ID_D,
    ...
};

register(..., MY_RESOURCE_ID_C, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_D, MY_RESOURCE_ID_C, ...);
/* D is a child of C */

register(..., MY_RESOURCE_ID_A, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_B, MY_RESOURCE_ID_A, ...);
/* Is B intentionally top-level, or is it actually a child of A? */

Add a WARN_ON() to catch this and prevent confusion.
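
A minimal userspace sketch of such a check, assuming it simply compares the
ID being registered against the sentinel. The function name and the stderr
message are hypothetical stand-ins for the kernel's WARN_ON(); only the
sentinel value 0 comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors DEVLINK_RESOURCE_ID_PARENT_TOP, which the message gives as 0. */
#define SKETCH_RESOURCE_ID_PARENT_TOP 0ULL

/*
 * Hypothetical stand-in for the registration-time check: returns true and
 * prints a warning when a real resource ID equals the top-level sentinel.
 */
bool sketch_resource_id_is_ambiguous(uint64_t resource_id)
{
	if (resource_id == SKETCH_RESOURCE_ID_PARENT_TOP) {
		fprintf(stderr, "resource ID %llu collides with PARENT_TOP\n",
			(unsigned long long)resource_id);
		return true;
	}
	return false;
}
```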

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-6-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: Avoid devlink resource IDs collision with PARENT_TOP

The devlink resource ID for ATU collides with the sentinel
DEVLINK_RESOURCE_ID_PARENT_TOP (0). As a result, ATU_bin_* are in fact
registered as top-level siblings, not as children of ATU.

Whether intentional or unintentional, clarify it by keeping the real
resource IDs starting at 1. Unfortunately ATU_bin_* are already
registered at top-level, so keep their parent as PARENT_TOP.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-5-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: avoid devlink resource IDs collision with PARENT_TOP

This might not cause real problems, but the hellcreek devlink resource
ID collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid
it by keeping the real resource IDs starting at 1.
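
The pattern amounts to pinning the first enumerator to 1. A sketch with
hypothetical resource names (not the hellcreek ones), plus a compile-time
guard that is an addition for illustration:

```c
/* Mirrors DEVLINK_RESOURCE_ID_PARENT_TOP, which is 0. */
#define SKETCH_RESOURCE_ID_PARENT_TOP 0

/* Hypothetical driver resource IDs: the first real ID is pinned to 1. */
enum sketch_resource_id {
	SKETCH_RESOURCE_ID_VLAN_TABLE = 1,
	SKETCH_RESOURCE_ID_FDB_TABLE,
};

/* Compile-time guard: no real ID may equal the sentinel. */
_Static_assert(SKETCH_RESOURCE_ID_VLAN_TABLE != SKETCH_RESOURCE_ID_PARENT_TOP,
	       "resource IDs must not collide with PARENT_TOP");
```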

Signed-off-by: David Yang <mmyangfl@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/20260611070856.889700-4-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: b53: avoid devlink resource IDs collision with PARENT_TOP

This might not cause real problems, but the b53 devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: dsa_loop: avoid devlink resource IDs collision with PARENT_TOP

This might not cause real problems, but the dsa_loop devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

pmdomain: core: fix unused variable warning with !PM_GENERIC_DOMAINS_OF

The genpd provider bus is really only used when
CONFIG_PM_GENERIC_DOMAINS_OF is enabled, and since the recent deferred
initialisation of domain parent devices, the root device pointer is
otherwise unused.

Fix the unused variable warning by moving the definition of the root device
pointer inside the corresponding ifdef.

Fixes: 92b69eff8012 ("pmdomain: core: fix early domain registration")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606111746.kAxaAbwg-lkp@intel.com/
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Ulf Hansson <ulfh@kernel.org>

dt-bindings: interrupt-controller: ti,irq-crossbar: Convert to DT schema

Convert TI irq-crossbar binding from text format to DT schema.

As part of the conversion, the following changes are made:
- Add '#interrupt-cells' as a required property, which was missing in the
text binding
- As irq-crossbar is an interrupt controller, move the binding from
bindings/arm/omap to bindings/interrupt-controller

Signed-off-by: Bhargav Joshi <j.bhargav.u@gmail.com>
Link: https://patch.msgid.link/20260612-crossbar-v3-1-266747bc2e86@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

Merge branch 'ipv4-fib-remove-rtnl-in-fib_net_exit_batch'

Kuniyuki Iwashima says:

====================
ipv4: fib: Remove RTNL in fib_net_exit_batch().

Currently, we flush all IPv4 routes at ->exit_batch() during
netns dismantle, which requires an extra RTNL.

IPv4 routes are not added from the fast path unlike IPv6, so
we can flush routes before default_device_exit_batch().

However, there is implicit ordering between ip_fib_net_exit()
and default_device_exit_batch().

This series detangles it and moves ip_fib_net_exit() to
->exit_rtnl() to save the RTNL dance.

The same change for IPv6 will need more work.
====================

Link: https://patch.msgid.link/20260612063225.455191-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Convert fib_net_exit_batch() to ->exit_rtnl().

Currently, IPv4 routes are flushed in ->exit_batch() after
all devices are unregistered.

Unlike IPv6, IPv4 routes are not added from the fast path,
so we can flush routes before default_device_exit_batch().

Let's call ip_fib_net_exit() from ->exit_rtnl() to save
one RTNL locking dance.

ip_fib_net_exit() must use list_del_rcu() for fib_table
for the fast path on dying dev.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Avoid calling fib_trie_table() in fib_new_table() for dying net.

We will call ip_fib_net_exit() from ->exit_rtnl().

All fib_table will be destroyed before devices are unregistered.

During device unregistration, inetdev_destroy() could call
fib_del_ifaddr(), which calls fib_magic(RTM_DELROUTE).

fib_magic() calls fib_new_table(), but we do not want to create
a new table after ip_fib_net_exit() destroys all tables.

As a prep, let's add check_net() before fib_trie_table() in
fib_new_table().
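
The guard can be pictured with the following userspace sketch. All types and
names here are hypothetical; the "alive" flag merely plays the role of the
check_net() test and a single slot stands in for the table hash.

```c
#include <stdbool.h>
#include <stdlib.h>

struct sketch_table {
	unsigned int id;
};

struct sketch_net {
	bool alive;                  /* false once namespace teardown began */
	struct sketch_table *table;  /* single table slot, for brevity */
};

/* Return the existing table, or create one only while the netns is alive. */
struct sketch_table *sketch_new_table(struct sketch_net *net, unsigned int id)
{
	if (net->table)
		return net->table;

	/* Do not allocate a fresh table for a dying namespace. */
	if (!net->alive)
		return NULL;

	net->table = calloc(1, sizeof(*net->table));
	if (net->table)
		net->table->id = id;
	return net->table;
}
```

In this sketch a late RTM_DELROUTE-style caller on a dying namespace gets
NULL back instead of a table that nothing would free afterwards.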

fib_trie_table() is also called from fib_trie_unmerge(), but
fib_get_table() fails first in fib_unmerge(), so the same
problem does not occur there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Free net->ipv4.{fib_table_hash,notifier_ops} without RTNL.

We will call ip_fib_net_exit() from ->exit_rtnl().

However, some paths will still access net->ipv4.fib_table_hash
after ->exit_rtnl().

For example, fib_flush() is called from fib_disable_ip() for
NETDEV_UNREGISTER.

Let's move kfree(net->ipv4.fib_table_hash) and fib4_notifier_exit()
from ip_fib_net_exit() to its caller.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Call fib_proc_exit() and nl_fib_lookup_exit() at ->pre_exit().

We will call ip_fib_net_exit() from ->exit_rtnl().

Since the exit callbacks are called in the following order,

  1. ->pre_exit()
  ~~~ synchronize_rcu() ~~~
  2. ->exit_rtnl()   : ip_fib_net_exit()
  3. ->exit()        : fib_proc_exit() / nl_fib_lookup_exit()
  4. ->exit_batch()  : fib4_semantics_exit()

the reverse order of fib_net_init() would get messed up.

Let's move fib_proc_exit() and nl_fib_lookup_exit() to ->pre_exit().

This is fine because procfs/netlink access from userspace cannot
occur at this point and synchronize_rcu() is not needed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Flush all fib_info in fib_table_flush() during netns dismantle.

Even when fib_table_flush() is called with flush_all true, it does
not flush all fib_info due to this condition:

  !(fi->fib_flags & RTNH_F_DEAD) && !fib_props[fa->fa_type].error

This creates an implicit ordering between default_device_exit_batch()
and fib_net_exit_batch().

fib_table_flush(flush_all=true) must be called after all devices
are NETDEV_UNREGISTERed, which is after nexthop_flush_dev() marks
RTNH_F_DEAD.

This would cause memory leak if the order were reversed.

fib_table_flush() does not skip non-dead error routes when flush_all
is true:

  !flush_all &&
  !(fi->fib_flags & RTNH_F_DEAD) && fib_props[fa->fa_type].error

Let's merge the two conditions not to skip all non-dead fib_info
during netns dismantle.
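
As a sanity check of the logic, the two quoted conditions and one possible
merged form can be modelled as plain boolean functions. The merged form
below is an assumption derived from the fragments in this message, not a
copy of the patch, and the !fi and table id checks are left out.

```c
#include <stdbool.h>

/* Skip decision built from the two conditions quoted in the message. */
bool sketch_skip_before(bool flush_all, bool dead, bool error)
{
	if (!dead && !error)               /* first quoted condition */
		return true;
	if (!flush_all && !dead && error)  /* second quoted condition */
		return true;
	return false;
}

/* Assumed merged form: with flush_all set, nothing is skipped. */
bool sketch_skip_merged(bool flush_all, bool dead, bool error)
{
	(void)error;  /* the error type no longer affects the decision */
	return !flush_all && !dead;
}
```

Under this model the two functions agree whenever flush_all is false and
differ only for live, non-error entries when flush_all is true, which is
the case that previously depended on devices being unregistered first.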

Note that we could further apply !flush_all to the basic table
id check and the rtmsg_fib() call in the loop.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: replace kcalloc with struct_size

One fewer allocation for the priv struct.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/20260608045640.5172-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-2-2'

Tariq Toukan says:

====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2

This is part 2. Find part 1 here:
https://lore.kernel.org/all/20260531113954.395443-1-tariqt@nvidia.com/

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

  - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
    behavior, used by bonding, FW LAG commands, v2p_map)
  - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
    (used by MPESW shared FDB across all devices)
  - specific group_id: iterate only devices in that SD group (used by
    per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
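
The filtering rule can be sketched in standalone C as below. The negative
filter encodings, the struct layout and the helper names are hypothetical;
the series does not state the real values, and the driver iterates an
xarray rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encodings; any other value is treated as an SD group_id. */
#define SKETCH_LAG_FILTER_PORTS (-1)
#define SKETCH_LAG_FILTER_ALL   (-2)

struct sketch_lag_func {
	bool is_port;  /* stands in for the XA_MARK_PORT mark */
	int group_id;  /* 0 means not part of any SD group */
};

/* Decide whether one entry is visited under the given filter. */
bool sketch_lag_match(const struct sketch_lag_func *f, int filter)
{
	if (filter == SKETCH_LAG_FILTER_ALL)
		return true;
	if (filter == SKETCH_LAG_FILTER_PORTS)
		return f->is_port;
	return f->group_id == filter;
}

/* Count the entries a filtered iteration would visit. */
size_t sketch_lag_count(const struct sketch_lag_func *funcs, size_t n,
			int filter)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (sketch_lag_match(&funcs[i], filter))
			count++;
	return count;
}
```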

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF, e.g. SD primary devices.

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all the devices in SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then MPESW shared FDB is
   created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
   restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's group_id.
   The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). The software LAG demux is implemented via
steering rules that use a new destination, VHCA_RX.

Patches

E-Switch preparation (patch 1):
  - Skip uplink IB rep load for SD secondary devices

Devcom support (patches 2-3):
  - Expose locked variant of send_event
  - Add DEVCOM_CANT_FAIL for non-rollback events

SD core hardening (patches 4-6):
  - Make primary/secondary role determination more robust
  - Add L2 table silent mode query support
  - Expand vport metadata for SD secondary devices

SD switchdev transition (patches 7-8):
  - Support switchdev mode transition with shared FDB
  - Notify SD on eswitch disable

LAG integration (patches 9-12):
  - Store demux resources per master lag_func
  - Disable both regular and SD LAG on lag_disable_change
  - Introduce software vport LAG implementation
  - Add MPESW over SD LAG support

Deferred init (patches 13-14):
  - Tie rep load/unload to SD LAG state
  - Defer vport metadata init until SD is ready

Enablement (patch 15):
  - Enable SD over ECPF and allow switchdev transition

v2: https://lore.kernel.org/20260608135547.482825-1-tariqt@nvidia.com
v1: https://lore.kernel.org/20260604114455.434711-1-tariqt@nvidia.com
====================

Link: https://patch.msgid.link/20260612113904.537595-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, enable SD over ECPF and allow switchdev transition

Remove the restriction blocking SD on embedded CPU PFs (ECPF), enabling
SD functionality on BlueField DPUs. Remove the blocker preventing SD
devices from transitioning to switchdev mode.

The infrastructure added in earlier patches properly handles this case.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-16-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, defer vport metadata init until SD is ready

Allow SD devices to transition to switchdev before the SD group is
fully up. Metadata allocation requires the SD group to be ready, so
defer it from esw_offloads_enable() until SD shared-FDB activation.

Add mlx5_esw_offloads_init_deferred_metadata() which allocates per-vport
metadata and refreshes the ingress ACLs that were previously programmed
with metadata=0. The helper is idempotent and can be called multiple
times.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, Tie rep load/unload to SD LAG state

On an SD device, vport representors are not functional until the SD
group is combined and shared FDB is active. Skip the initial load and
the reload paths in that window; reps are loaded as part of the SD LAG
activation flow once it becomes active.

In addition, explicitly unload representors when SD LAG is destroyed.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, add MPESW over SD LAG support

Enable MPESW LAG creation over SD LAG members, forming a composite LAG
hierarchy. This allows bonding multiple SD groups together under a
single MPESW configuration with shared FDB.

When enabling composite MPESW, the individual SD LAG shared FDB
configurations are temporarily torn down and recreated when the
composite LAG is disabled.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, introduce software vport LAG implementation

SD LAG is a virtual LAG without hardware LAG support, so it cannot use
the firmware vport LAG commands. Implement a software-based vport LAG
using egress ACL bounce rules.

Add esw_set_slave_egress_rule() to create an egress ACL rule on the
slave's manager vport that bounces traffic to the master's manager
vport. This achieves the same traffic steering as hardware vport LAG.

Redirect mlx5_cmd_create_vport_lag() and mlx5_cmd_destroy_vport_lag()
to the software implementation when operating in SD LAG mode.
In addition, adjust lag_demux creation to check SD LAG mode as well.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change

Extend mlx5_lag_disable_change() to properly disable both regular LAG
and SD LAG when requested. Each LAG type uses its own devcom component
for locking.

Use mlx5_sd_get_devcom() helper to retrieve the SD devcom component,
needed for proper locking when disabling SD LAG.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, store demux resources per master lag_func

The lag demux resources (flow table, flow group, and rules xarray)
are stored on the shared ldev. With Socket Direct, multiple SD groups
each create their own demux FT/FG during their master's IB device
initialization. Since they all write to the same ldev fields, the
second group's init overwrites the first group's pointers, leaking
the first group's FT/FG.

During teardown, the cleanup uses the overwritten pointers, destroying
the wrong group's resources and leaving leaked flow tables in the LAG
namespace. These leaked tables can interfere with subsequently created
demux tables.

Move the demux resources from the shared ldev to per-master lag_func
instances. Each master device now owns its own independent demux
state. The rule_add and rule_del helpers look up the appropriate
master's lag_func via the existing filter/group infrastructure.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, notify SD on eswitch disable

When eswitch is disabled, notify the SD layer so it can clean up
SD-specific resources such as the TX flow table root configuration
on secondary devices.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, support switchdev mode transition with shared FDB

When the eswitch transitions, propagate the change to SD: secondaries
get their TX flow table root reconfigured for the new mode, and when
all group devices move to switchdev, the per-group shared FDB is
activated.

Shared FDB activation is best-effort - failure does not block the
eswitch transition; the next transition retries.

Note: the existing mlx5_get_sd() guard that blocks switchdev for SD
devices is intentionally retained. It will be removed once all
supporting patches are in place.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, expand vport metadata for SD secondary devices

In Socket Direct configurations the primary and secondary PFs share the
same native_port_num. The eswitch vport metadata encodes pf_num in its
upper bits to distinguish vports across PFs. Without SD-awareness, both
PFs generate identical metadata, causing FDB rules to steer traffic to
the wrong representor.

Add mlx5_sd_pf_num_get() which remaps the pf_num for SD devices.
Use it so each PF in an SD group produces unique vport metadata.
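
To illustrate why a shared pf_num collides and how a remap removes the
collision, here is a sketch under stated assumptions. The 16-bit split of
the metadata word and the remap formula are invented for illustration; the
commit message does not describe the real layout or what
mlx5_sd_pf_num_get() computes.

```c
#include <stdint.h>

/* Hypothetical split: pf_num above the low 16 bits, vport below. */
#define SKETCH_VPORT_BITS 16

/* Pack pf_num into the upper bits and the vport number into the lower. */
uint32_t sketch_vport_metadata(uint32_t pf_num, uint32_t vport)
{
	uint32_t vport_mask = (1u << SKETCH_VPORT_BITS) - 1;

	return (pf_num << SKETCH_VPORT_BITS) | (vport & vport_mask);
}

/*
 * Hypothetical remap: offset the native pf_num by the device's index in
 * its SD group, so primary (index 0) and secondaries get distinct values.
 */
uint32_t sketch_sd_pf_num(uint32_t native_pf_num, uint32_t sd_index,
			  uint32_t num_native_pfs)
{
	return native_pf_num + sd_index * num_native_pfs;
}
```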

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, add L2 table silent mode query support

Add mlx5_fs_cmd_query_l2table_silent() to query the current silent mode
state from firmware. This allows detecting if firmware has already put
secondary devices into silent mode.

During SD group registration, query the silent mode of each device. If
a device is already in silent mode (set by firmware), record this in
the fw_silents_secondaries flag and use it to help determine the
primary/secondary roles.

When fw_silents_secondaries is set, skip the driver-initiated silent
mode set/unset operations since firmware manages this state. This
handles configurations where firmware persistently silences secondary
devices.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, make primary/secondary role determination more robust

Refactor SD group registration to use devcom event-driven role
determination to ensure SD is marked as ready only after roles are fully
assigned and the group state is consistent, making outside accessors,
which will be added in downstream patches, safe to use without races.

The devcom events:
- SD_PRIMARY_SET event: each device compares bus numbers with peers
to determine which should be primary
- SD_SECONDARIES_SET event: secondaries register themselves with the
elected primary device

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events

Some devcom events are not expected to fail. Rather than attempting
a rollback that may not be meaningful, allow callers to pass
DEVCOM_CANT_FAIL as the rollback_event to indicate that the event
handler should not fail. If it does, emit a warning and stop
propagating to further peers, but skip the rollback path.
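
The dispatch rule described here can be sketched as follows. This is a
simplified userspace model: the peer struct, the event numbers, the handler
and the sentinel value are all hypothetical, and locking is omitted.

```c
#include <stdio.h>

#define SKETCH_CANT_FAIL   (-1)  /* hypothetical rollback_event sentinel */
#define SKETCH_EVENT_APPLY 1     /* hypothetical forward event */
#define SKETCH_EVENT_UNDO  2     /* hypothetical rollback event */

struct sketch_peer {
	int fail;     /* non-zero: the forward event fails on this peer */
	int applied;  /* net number of forward events applied */
};

/* Example handler: apply or undo one event on a peer. */
int sketch_peer_handler(int event, struct sketch_peer *peer)
{
	if (event == SKETCH_EVENT_UNDO) {
		peer->applied--;
		return 0;
	}
	if (peer->fail)
		return -1;
	peer->applied++;
	return 0;
}

/* Dispatch an event to peers in order, stopping at the first failure. */
int sketch_send_event(int event, int rollback_event,
		      struct sketch_peer *peers, int n)
{
	int i, err;

	for (i = 0; i < n; i++) {
		err = sketch_peer_handler(event, &peers[i]);
		if (!err)
			continue;
		if (rollback_event == SKETCH_CANT_FAIL) {
			/* Warn and stop; earlier peers keep their state. */
			fprintf(stderr, "event %d failed: %d\n", event, err);
			return err;
		}
		/* Undo the event on the peers already notified. */
		while (i--)
			sketch_peer_handler(rollback_event, &peers[i]);
		return err;
	}
	return 0;
}
```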

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: devcom, expose locked variant of send_event

Factor mlx5_devcom_send_event() into two functions:
- mlx5_devcom_locked_send_event(): performs the dispatch (and
rollback) with comp->sem already held by the caller.
- mlx5_devcom_send_event(): unchanged wrapper that takes comp->sem,
calls the locked variant, and releases it.

This lets callers bracket multiple event broadcasts under a single
held write lock, eliminating the gap between consecutive dispatches
where peer state could change.

Will be used by a downstream patch.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, skip uplink IB rep load for SD secondary devices

SD secondary devices share the primary's uplink and do not have
their own uplink representor. When reloading IB reps on secondary
devices, skip the uplink and only load VF/SF vport IB reps.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'asoc-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Updates for v7.2

There's been quite a lot of framework improvements this time around,
though mainly cleanups and robustness rather than user visible features.
The same pattern is seen with a lot of the driver work that's going on,
there are new features but a huge proportion of this is bug fixing and
cleanup work.  We also have a good selection of new device support.

- Improvements to SDCA jack handling from Charles Keepax.
- Use of device links to make suspend handling more robust from Richard
   Fitzgerald.
- Use of a new helper to factor out a common pattern in SoundWire
   enumeration from Charles Keepax.
- Slimming down of the component from Kuninori Morimoto.
- Simplification of format auto selection from Kuninori Morimoto.
- Lots of conversions to guard() from Bui Duc Phuc.
- Addition of a simple-amplifier driver supporting more featureful GPIO
   controller amplifiers than the previous basic driver from Herve
   Codina.
- Support for AMD ACP 7.x, Cirrus Logic CS42448/CS42888, Everest Semi
   ES9356, Mediatek MT2701 and MT8196, Renesas RZ/G3E, Spacemit K3,
   Texas Instruments TAC5xx2 and TAS67524.

s390/idle: Remove idle time and count sysfs files

Remove the s390 specific idle_time_us and idle_count per cpu sysfs
files. They do not provide any additional value. The risk that there
are existing applications which rely on these architecture specific
files should be very low.

However if it turns out such applications exist, this can be easily
reverted.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/idle: Provide arch specific kcpustat_field_idle()/kcpustat_field_iowait()

The former s390 specific arch_cpu_idle_time() implementation was
removed, since its implementation was racy and reported idle time
could go backwards [1].

However this removal was not necessary, since independently of the s390
architecture specific races there exists the iowait counter update race,
which can also lead to reported idle time going backwards [2].

With Frederic Weisbecker's recent cpu idle time accounting refactoring
kernel_cpustat got a sequence counter. Use this to implement s390 specific
variants of kcpustat_field_idle() and kcpustat_field_iowait(). This is
logically a revert of [1] and moves cpu idle time accounting back into s390
architecture code, which is also more precise than the dyntick idle time
accounting by nohz/scheduler.

For comparing cross cpu time stamps it is necessary to use the stcke
instead of the stckf instruction in irq entry path. Furthermore this
open-codes a sequence lock in assembler and C code, which is required to
update the irq entry time stamp to the per cpu idle_data structure in a
race free manner.
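
For readers unfamiliar with the pattern, a generic sequence-counter update
and read can be sketched in portable C11. This is not the s390 assembler or
C code from the patch; the struct, field and function names are
hypothetical, and a single writer per structure is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sketch_idle_data {
	atomic_uint seq;  /* even: stable, odd: update in progress */
	_Atomic uint64_t idle_time;
	_Atomic uint64_t idle_count;
};

void sketch_idle_init(struct sketch_idle_data *d)
{
	atomic_init(&d->seq, 0);
	atomic_init(&d->idle_time, 0);
	atomic_init(&d->idle_count, 0);
}

/* Writer side: bump seq to odd, update the data, bump seq back to even. */
void sketch_idle_account(struct sketch_idle_data *d, uint64_t delta)
{
	unsigned int s = atomic_load_explicit(&d->seq, memory_order_relaxed);
	uint64_t t = atomic_load_explicit(&d->idle_time, memory_order_relaxed);
	uint64_t c = atomic_load_explicit(&d->idle_count, memory_order_relaxed);

	atomic_store_explicit(&d->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&d->idle_time, t + delta, memory_order_relaxed);
	atomic_store_explicit(&d->idle_count, c + 1, memory_order_relaxed);
	atomic_store_explicit(&d->seq, s + 2, memory_order_release);
}

/* Reader side: retry until one even, unchanged sequence value is seen. */
void sketch_idle_read(struct sketch_idle_data *d, uint64_t *time,
		      uint64_t *count)
{
	unsigned int s0, s1;

	do {
		s0 = atomic_load_explicit(&d->seq, memory_order_acquire);
		*time = atomic_load_explicit(&d->idle_time,
					     memory_order_relaxed);
		*count = atomic_load_explicit(&d->idle_count,
					      memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&d->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);
}
```

A reader on another CPU therefore never returns a time and count pair taken
from two different updates; it simply retries while an update is in flight.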

[1] commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time() and corresponding code")
[2] commit ead70b752373 ("timers/nohz: Add a comment about broken iowait counter update race")

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/irq/idle: Use stcke instead of stckf for time stamps

The upcoming cpu idle time accounting rework involves comparing and
subtracting cross cpu time stamps. Time stamps created with the stckf
instruction are monotonic with respect to the local cpu. For cross cpu
monotonic time stamps the slightly slower stcke instruction has to
be used [1].

Convert the idle time accounting relevant usages of stckf to stcke.

[1] Principles of Operation - Setting and Inspecting the Clock

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

s390/timex: Move union tod_clock type to separate header

Move union tod_clock type to separate header file. This is preparation
for upcoming changes in order to avoid header dependency problems.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

dt-bindings: vendor-prefixes: add Gira

Add vendor prefix for Gira Giersiepen GmbH & Co. KG
Link: https://www.gira.de/
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Alexander Dahl <ada@thorsis.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260610213047.500701-1-l.stach@pengutronix.de
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

ALSA: usb-audio: Add iface reset and delay quirk for XIBERIA K03S

Setting up the interface fails on this card when suspending/resuming.
Adding a reset and delay quirk eliminates this problem.

usb 1-1: New USB device found, idVendor=36f9, idProduct=c009
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 1-1: Product: XIBERIA K03S
usb 1-1: Manufacturer: Actions
usb 1-1: usb_probe_device

Signed-off-by: Lianqin Hu <hulianqin@vivo.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/TYUPR06MB621706287FE30F4D8EE4618BD2E62@TYUPR06MB6217.apcprd06.prod.outlook.com

Merge tag 'kvm-riscv-7.2-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 7.2

- Batch G-stage TLB flushes for GPA range based page table updates
- Convert HGEI line management to fully per-HART
- Fix missing CSR dirty marking when FWFT state updated via ONE_REG
- Fix stale FWFT feature exposure to Guest/VM
- Speed up dirty logging write faults using MMU rwlock and atomic
PTE updates using cmpxchg() for permission-only changes
- Use flexible array for APLIC IRQ state
- Use kvm_slot_dirty_track_enabled() for logging enable check on
a memslot
- Avoid skipping valid pages in kvm_riscv_gstage_wp_range()
- Avoid skipping valid pages in kvm_riscv_gstage_unmap_range()
- Use endian-specific __lelong for NACL shared memory

Merge tag 'loongarch-kvm-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v7.2

1. Enable FPU with max VM supported FPU type.
2. Some enhancements about interrupt injection.
3. Some bug fixes and other small changes.

Merge tag 'kvm-s390-next-7.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: New features for 7.2

New features for 7.2 for KVM/s390:
* KVM_PRE_FAULT_MEMORY support
* Support for 2G hugepages
* Support for the ASTFLEIE 2 facility
* kvm_arch_set_irq_inatomic Fast Inject
* Fix potential leak of uninitialized bytes

pinctrl: Export pinctrl_get_group_selector()

The recently added UltraRISC DP1000 is using this symbol, and in
a reasonable way as well, so export it.

Acked-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/linux-gpio/20260613164847.GA3152104@ax162/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606130210.ytVPxHlm-lkp@intel.com/
Fixes: cb7037924836 ("pinctrl: ultrarisc: Add UltraRISC DP1000 pinctrl driver")
Signed-off-by: Linus Walleij <linusw@kernel.org>

HID: hidpp: fix potential UAF in hidpp_connect_event()

If input_register_device() fails, we call input_free_device(), but keep
a stale pointer to the old device in hidpp->input, which could potentially
lead to UAF. Fix that by resetting it to NULL before returning from
hidpp_connect_event().
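
The pattern can be sketched in userspace C as follows. This is only an
illustration, not the driver code: struct fake_input, struct fake_hidpp,
fake_register() and fake_connect() are hypothetical stand-ins for the
input device, the hidpp device, input_register_device() and
hidpp_connect_event().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct input_dev and struct hidpp_device. */
struct fake_input {
	int registered;
};

struct fake_hidpp {
	struct fake_input *input;	/* cached pointer, like hidpp->input */
};

/* Stand-in for input_register_device(); fails when asked to. */
int fake_register(struct fake_input *in, int should_fail)
{
	if (should_fail)
		return -1;
	in->registered = 1;
	return 0;
}

/*
 * Returns 0 on success. On failure the device is freed and the cached
 * pointer is cleared, so no later user can reach the freed object.
 */
int fake_connect(struct fake_hidpp *hidpp, int should_fail)
{
	struct fake_input *in;

	assert(hidpp != NULL);
	in = calloc(1, sizeof(*in));
	if (!in)
		return -1;
	hidpp->input = in;

	if (fake_register(in, should_fail)) {
		free(in);
		hidpp->input = NULL;	/* the reset this fix adds */
		return -1;
	}
	return 0;
}
```

Without the reset, a later path that tests the cached pointer for
non-NULL would go on to use freed memory.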

Reported-by: zdi-disclosures@trendmicro.com
Signed-off-by: Jiri Kosina <jkosina@suse.com>

fuse-uring: clear ent->fuse_req in commit_fetch error path

fuse_uring_commit_fetch() error path called fuse_request_end(req) without
clearing ent->fuse_req when fuse_ring_ent_set_commit() fails. The
still-pending fuse_uring_send_in_task() task-work later dereferences the
dangling pointer through fuse_uring_prepare_send(), causing a
use-after-free.

End the request with fuse_uring_req_end(), which handles all conditions
already.

Annotation by Bernd: The UAF should already be fixed by other means,
and it actually has to be avoided that way.
Just checking for ent->fuse_req == NULL in fuse_uring_send_in_task()
would be prone to race conditions: if malicious userspace
committed a request that had passed the NULL check but was still
in the middle of the args copy, it would still trigger a use-after-free.
Setting ent->fuse_req = NULL in fuse_uring_commit_fetch() still
makes sense, though.

Reported-by: Shuvam Pandey <shuvampandey1@gmail.com>
Reported-by: Berkant Koc <me@berkoc.com>
Signed-off-by: Zhenghang Xiao <kipreyyy@gmail.com>
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

keys: keyctl_pkey: replace BUG with return -EOPNOTSUPP

Replace two BUG() calls in keyctl_pkey_params_get_2() and
keyctl_pkey_e_d_s() default cases with -EOPNOTSUPP, matching
the error style already used in these functions.
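
A minimal userspace sketch of the resulting shape, assuming a
hypothetical operation enum and dispatcher (fake_pkey_op and
fake_pkey_dispatch() are illustrative names, not the keyctl code):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical operation codes; the real keyctl values differ. */
enum fake_pkey_op {
	FAKE_OP_ENCRYPT,
	FAKE_OP_DECRYPT,
	FAKE_OP_SIGN,
};

/* Unknown operations report an error instead of crashing the caller. */
int fake_pkey_dispatch(int op)
{
	switch (op) {
	case FAKE_OP_ENCRYPT:
	case FAKE_OP_DECRYPT:
	case FAKE_OP_SIGN:
		return 0;
	default:
		return -EOPNOTSUPP;	/* previously a BUG() */
	}
}
```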

Signed-off-by: Mohammed EL Kadiri <med08elkadiri@gmail.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

keys: request_key: replace BUG with return -EINVAL

Replace BUG() in construct_get_dest_keyring() default case
with return -EINVAL to handle the unimplemented group keyring
destination gracefully.

Signed-off-by: Mohammed EL Kadiri <med08elkadiri@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20260613130408.13709-2-med08elkadiri@gmail.com
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

keys: Pin request_key_auth payload in instantiate paths

A: request_key()       B: KEYCTL_INSTANTIATE_IOV
================       =========================

create auth key
store rka in auth key
wait for helper
                       get auth key
                       load rka from auth key
                       copy user payload
                       sleep on #PF

helper completed
detach and free rka
destroy auth key
                       wake up
                       use rka->target_key
                       **USE-AFTER-FREE**

Give request_key_auth payloads a refcount.  Take a payload reference while
authkey->sem stabilizes the payload and revocation state.  Hold that
reference across the instantiate and reject paths.  Drop the auth key
owning reference from revoke and destroy.
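
The pinning idea can be sketched in userspace with C11 atomics. This is
an assumption-laden illustration: struct fake_rka and the fake_rka_*()
helpers are hypothetical, and the kernel patch may well use a different
refcount primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct request_key_auth. */
struct fake_rka {
	atomic_int usage;	/* reference count */
	int target_key;		/* placeholder for rka->target_key */
};

struct fake_rka *fake_rka_alloc(int target_key)
{
	struct fake_rka *rka = malloc(sizeof(*rka));

	if (!rka)
		return NULL;
	atomic_init(&rka->usage, 1);	/* owning reference of the auth key */
	rka->target_key = target_key;
	return rka;
}

/* Pin the payload; the caller must hold it stable while doing so. */
struct fake_rka *fake_rka_get(struct fake_rka *rka)
{
	atomic_fetch_add(&rka->usage, 1);
	return rka;
}

/* Drop one reference; returns 1 if this call freed the payload. */
int fake_rka_put(struct fake_rka *rka)
{
	int old = atomic_fetch_sub(&rka->usage, 1);

	assert(old > 0);
	if (old == 1) {
		free(rka);
		return 1;
	}
	return 0;
}
```

In terms of the scenario above, B takes a reference before it can sleep,
so A dropping the owning reference on destroy no longer frees the
payload underneath B.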

[jarkko: Replaced the first two paragraphs of text with an actual
concurrency scenario.]
Cc: stable@vger.kernel.org # v5.10+
Fixes: b5f545c880a2 ("[PATCH] keys: Permit running process to instantiate keys")
Reported-by: Shaomin Chen <eeesssooo020@gmail.com>
Closes: https://lore.kernel.org/r/20260519144403.436694-1-eeesssooo020@gmail.com
Signed-off-by: Shaomin Chen <eeesssooo020@gmail.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

keys: prevent slab cache merging for key_jar

Add SLAB_NO_MERGE to key_jar to prevent the allocator from merging it
with other similarly-sized caches. This hardens struct key isolation by
ensuring dedicated slab pages.

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Mohammed EL Kadiri <med08elkadiri@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20260610065052.9120-1-med08elkadiri@gmail.com
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

keys: Replace strcpy(derived_buf, "AUTH_KEY") with strscpy(..., HASH_SIZE)

derived_buf is guaranteed to be HASH_SIZE - and it is more than enough.
The strscpy() degenerates into a memcpy() (as did the strcpy()).
Do the same for the associated "ENC_KEY" copy.

Removes a possibly unbounded strcpy().

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20260606202633.5018-9-david.laight.linux@gmail.com
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

KEYS: Use acquire when reading state in keyring search

The negative-key race fix added release/acquire ordering for key use.

Publish payload before state; read state before payload.

keyring_search_iterator() still uses READ_ONCE() before match callbacks.
An asymmetric match callback calls asymmetric_key_ids(), which reads
key->payload.data[asym_key_ids].

Use key_read_state() there to complete that ordering.
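
The ordering rule can be sketched with C11 atomics in userspace. The
names below (struct fake_key and the fake_key_*() helpers) are
hypothetical; the release store and acquire load stand in for the
kernel's key state helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct key: a payload plus a state word. */
struct fake_key {
	const char *payload;
	atomic_int state;	/* 0 = uninstantiated, 1 = positive */
};

void fake_key_init(struct fake_key *key)
{
	assert(key != NULL);
	key->payload = NULL;
	atomic_init(&key->state, 0);
}

/* Writer: publish the payload first, then the state (release). */
void fake_key_publish(struct fake_key *key, const char *payload)
{
	key->payload = payload;
	atomic_store_explicit(&key->state, 1, memory_order_release);
}

/* Reader: read the state first (acquire), only then the payload. */
const char *fake_key_read(struct fake_key *key)
{
	if (atomic_load_explicit(&key->state, memory_order_acquire) != 1)
		return NULL;
	return key->payload;
}
```

A plain or READ_ONCE()-style load of the state gives no such guarantee:
the payload read may be satisfied before the state read.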

Fixes: 363b02dab09b ("KEYS: Fix race between updating and finding a negative key")
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20260529033406.20673-1-hanguidong02@gmail.com
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

keys/trusted_keys: mark 'migratable' as __ro_after_init

The 'migratable' variable is initialized only during the init phase
in the 'init_trusted' function and never changed. So, mark it as
__ro_after_init.

Signed-off-by: Len Bao <len.bao@gmx.us>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20260516152249.41851-1-len.bao@gmx.us
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

keys: use kmalloc_flex in user_preparse

Use kmalloc_flex() when allocating a new struct user_key_payload in
user_preparse() to replace the open-coded size arithmetic and to keep
the size type-safe.
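
For illustration, the open-coded pattern such a helper replaces looks
roughly like this in userspace C. kmalloc_flex() itself is not
reproduced here; struct fake_payload and fake_payload_alloc() are
hypothetical, and the bound check is written out explicitly.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical payload with a trailing flexible array member. */
struct fake_payload {
	unsigned short datalen;
	char data[];
};

/* Allocate header plus datalen bytes, refusing lengths that do not fit. */
struct fake_payload *fake_payload_alloc(const void *src, size_t datalen)
{
	struct fake_payload *p;

	if (datalen > USHRT_MAX)
		return NULL;
	p = malloc(sizeof(*p) + datalen);	/* open-coded size arithmetic */
	if (!p)
		return NULL;
	p->datalen = (unsigned short)datalen;
	if (datalen)
		memcpy(p->data, src, datalen);
	return p;
}
```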

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20260504093058.49720-3-thorsten.blum@linux.dev
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

KEYS: trusted: Debugging as a feature

TPM_DEBUG, and other similar flags, are a non-standard way to specify a
feature in the Linux kernel. Introduce CONFIG_TRUSTED_KEYS_DEBUG for trusted
keys, and use it to replace these ad-hoc feature flags.

Given that trusted keys debug dumps can contain sensitive data, harden the
feature as follows:

1. In the Kconfig description, stipulate that pr_debug() statements must be
used.
2. Use pr_debug() statements in TPM 1.x driver to print the protocol dump.
3. Require trusted.debug=1 on the kernel command line (default: 0) to
activate dumps at runtime, even when CONFIG_TRUSTED_KEYS_DEBUG=y.

Traces, when actually needed, can be easily enabled by providing
trusted.dyndbg='+p' and trusted.debug=1 in the kernel command-line.

Reported-by: Nayna Jain <nayna@linux.ibm.com>
Closes: https://lore.kernel.org/all/7f8b8478-5cd8-4d97-bfd0-341fd5cf10f9@linux.ibm.com/
Reviewed-by: Nayna Jain <nayna@linux.ibm.com>
Tested-by: Srish Srinivasan <ssrish@linux.ibm.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

KEYS: encrypted: Remove unnecessary selection of CRYPTO_RNG

encrypted-keys uses the regular Linux RNG (get_random_bytes()), not the
duplicative crypto_rng one. So it does not need to select CRYPTO_RNG.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

KEYS: fix overflow in keyctl_pkey_params_get_2()

The length for the internal output buffer is calculated incorrectly, which
can result in an overflow when too small a buffer is provided.

Fix the bug by allocating internal output with the size of the maximum
length of the cryptographic primitive instead of caller provided size.

Link: https://lore.kernel.org/keyrings/20260531024914.3712130-1-jarkko@kernel.org/
Cc: stable@vger.kernel.org # v4.20+
Fixes: 00d60fd3b932 ("KEYS: Provide keyctls to drive the new key type ops for asymmetric keys [ver #2]")
Reported-by: Alessandro Groppo <ale.grpp@gmail.com>
Tested-by: Alessandro Groppo <ale.grpp@gmail.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

KVM: s390: Introducing kvm_arch_set_irq_inatomic fast inject

s390 needs a fast path for irq injection, and along those lines we
introduce kvm_arch_set_irq_inatomic. Instead of placing all interrupts on
the global work queue as it does today, this patch provides a fast path for
irq injection.

The inatomic fast path cannot lose control since it is running with
interrupts disabled. This required changing the following behavior of the
slow path. First, the adapter_indicators page needs to be mapped
since it is accessed with interrupts disabled, so we added map/unmap
functions. Second, access to resources shared between the fast and slow
paths needed to be changed from mutexes and semaphores to spinlocks.
Finally, the memory allocation on the slow path utilizes GFP_KERNEL_ACCOUNT
but we had to implement the fast path with GFP_ATOMIC allocation. Each of
these enhancements was required to prevent blocking on the fast inject
path.

Fencing of Fast Inject in Secure Execution environments is enabled in the
patch series by not mapping adapter indicator pages. In Secure Execution
environments the path of execution available before this patch is followed.

Statistical counters have been added to enable analysis of irq injection on
the fast path and slow path including io_390_inatomic, io_flic_inject_airq,
io_set_adapter_int and io_390_inatomic_no_inject. The no inject counter
captures adapter masked, coalesced and suppressed interrupts.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Douglas Freimuth <freimuth@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260604192755.203143-4-freimuth@linux.ibm.com>

KVM: s390: Enable adapter_indicators_set to use mapped pages

The s390 adapter_indicators_set function can now be optimized to use
long-term mapped pages when available so that work can be
processed on a fast path when interrupts are disabled.
If adapter indicator pages are not mapped then local mapping is
done on a slow path as it is prior to this patch. For example, Secure
Execution environments will take the local mapping path as it does prior to
this patch.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Douglas Freimuth <freimuth@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260604192755.203143-3-freimuth@linux.ibm.com>

KVM: s390: Add map/unmap ioctl and clean mappings post-guest

s390 needs map/unmap ioctls, which map the adapter set
indicator pages, so the pages can be accessed when interrupts are
disabled. The mappings are cleaned up when the guest is removed.
pin_user_pages_remote() is used for both the ioctl and
the pin-on-demand logic in adapter_indicators_set().

Map/Unmap ioctls are fenced in order to avoid the longterm pinning
in Secure Execution environments. In Secure Execution
environments the path of execution available before this patch is followed.

Statistical counters to count map/unmap functions for adapter indicator
pages are added. The counters can be used to analyze
map/unmap functions in non-Secure Execution environments and similarly
can be used to analyze Secure Execution environments where the counters
will not be incremented as the adapter indicator pages are not mapped.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Douglas Freimuth <freimuth@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260604192755.203143-2-freimuth@linux.ibm.com>

fuse-uring: use named constants for io-uring iovec indices

Replace magic indices 0 and 1 for the iovec array with named constants
FUSE_URING_IOV_HEADERS and FUSE_URING_IOV_PAYLOAD. This makes the usages
self-documenting and prepares for buffer ring support which will also
reference these iovec slots by index.
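
A small sketch of the idea, assuming headers occupy slot 0 and the
payload slot 1. struct fake_iovec, FAKE_IOV_SEGS and the helper
functions are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Named slots instead of bare 0 and 1 (slot order is an assumption). */
enum {
	FUSE_URING_IOV_HEADERS = 0,
	FUSE_URING_IOV_PAYLOAD = 1,
	FAKE_IOV_SEGS		/* hypothetical slot count */
};

/* Hypothetical stand-in for struct iovec. */
struct fake_iovec {
	void *base;
	size_t len;
};

void fake_iov_fill(struct fake_iovec *iov, void *hdr, size_t hdr_len,
		   void *payload, size_t payload_len)
{
	assert(iov != NULL);
	iov[FUSE_URING_IOV_HEADERS].base = hdr;
	iov[FUSE_URING_IOV_HEADERS].len = hdr_len;
	iov[FUSE_URING_IOV_PAYLOAD].base = payload;
	iov[FUSE_URING_IOV_PAYLOAD].len = payload_len;
}

size_t fake_payload_len(const struct fake_iovec *iov)
{
	return iov[FUSE_URING_IOV_PAYLOAD].len;
}
```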

Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: refactor setting up copy state for payload copying

Add a new helper function setup_fuse_copy_state() to contain the logic
for setting up the copy state for payload copying.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: use enum types for header copying

Use enum types to identify which part of the header needs to be copied.
This improves the interface and will simplify both kernel-space and
user-space header addresses copying when buffer rings are added.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: refactor io-uring header copying from ring

Move header copying from ring logic into a new copy_header_from_ring()
function. This makes the copy_from_user() logic more clear and
centralizes error handling / rate-limited logging.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: refactor io-uring header copying to ring

Move header copying to ring logic into a new copy_header_to_ring()
function. This makes the copy_to_user() logic more clear and centralizes
error handling / rate-limited logging.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: separate next request fetching from sending logic

Simplify the logic for fetching + sending off the next request.

This gets rid of fuse_uring_send_next_to_ring() which contained
duplicated logic from fuse_uring_send(). This decouples request fetching
from the send operation, which makes the control flow clearer and
reduces unnecessary parameter passing.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: invalidate readdir cache on epoch bump

FUSE_NOTIFY_INC_EPOCH invalidates dentries, but does not invalidate cached
readdir results. A process with cwd inside a FUSE mount can therefore
observe stale readdir(".") output after an epoch bump.

Fix this by recording epoch in the readdir cache and checking it on reuse.

Minimal reproducer:

- mount a tiny FUSE fs with an empty root directory
- on opendir, enable fi->cache_readdir and fi->keep_cache
- chdir into the mount and call readdir(".") to populate readdir cache
- make the FUSE server report one file in the root directory
- send only FUSE_NOTIFY_INC_EPOCH
- call readdir(".") again; before this change it stays stale, after this
change it sees the new file

Fixes: 2396356a945b ("fuse: add more control over cache invalidation behaviour")
Signed-off-by: Jun Wu <quark@meta.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

virtio-fs: avoid double-free on failed queue setup

virtio_fs_setup_vqs() allocates fs->vqs and fs->mq_map before calling
virtio_find_vqs(). If virtio_find_vqs() fails, the error path frees both
pointers and returns an error to virtio_fs_probe().

virtio_fs_probe() then drops the last kobject reference, and
virtio_fs_ktype_release() frees fs->vqs and fs->mq_map again. This leaves
dangling pointers in struct virtio_fs and can trigger a double-free during
probe failure cleanup.

Set fs->vqs and fs->mq_map to NULL immediately after kfree() in the
virtio_fs_setup_vqs() error path so that the later kobject release sees an
uninitialized state and kfree(NULL) becomes harmless.
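
The same pattern in a userspace sketch, with free() standing in for
kfree(). struct fake_virtio_fs, fake_setup_vqs() and fake_release() are
hypothetical stand-ins for the driver structures and functions named
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two allocations in struct virtio_fs. */
struct fake_virtio_fs {
	int *vqs;
	int *mq_map;
};

/* Setup path: on failure, free both arrays and clear the pointers. */
int fake_setup_vqs(struct fake_virtio_fs *fs, int find_vqs_fails)
{
	assert(fs != NULL);
	fs->vqs = calloc(4, sizeof(*fs->vqs));
	fs->mq_map = calloc(4, sizeof(*fs->mq_map));
	if (!fs->vqs || !fs->mq_map || find_vqs_fails) {
		free(fs->vqs);
		free(fs->mq_map);
		fs->vqs = NULL;		/* later release sees NULL */
		fs->mq_map = NULL;
		return -1;
	}
	return 0;
}

/* Release path: runs unconditionally; free(NULL) is a no-op. */
void fake_release(struct fake_virtio_fs *fs)
{
	free(fs->vqs);
	free(fs->mq_map);
	fs->vqs = NULL;
	fs->mq_map = NULL;
}
```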

This can be reproduced when a broken virtio-fs device advertises more
request queues than the transport actually provides. In that case
virtio_find_vqs() fails while setting up the extra queue, and the probe
path reaches the double-free cleanup sequence.

Signed-off-by: Yung-Tse Cheng <mes900903@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: invalidate page cache after DIO and async DIO writes

This fix does page cache invalidation after DIO and async DIO writes for
both O_DIRECT and FOPEN_DIRECT_IO cases.

Commit b359af8275a9 ("fuse: Invalidate the page cache after FOPEN_DIRECT_IO
write") fixed xfstests generic/209 for DIO writes in the FOPEN_DIRECT_IO
path. DIO writes without FOPEN_DIRECT_IO are already handled by
generic_file_direct_write().
However, async DIO writes (xfstests generic/451) remain unhandled.

After this fix:
- Async write with FUSE_ASYNC_DIO:
    invalidate in fuse_aio_invalidate_worker()

- Otherwise (Sync or async write without FUSE_ASYNC_DIO):
    - With FOPEN_DIRECT_IO:
        invalidate in fuse_direct_write_iter()
    - Without FOPEN_DIRECT_IO:
        invalidate in generic_file_direct_write()
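
The decision tree above can be written down as a small pure function.
This is only a model of the list, not fuse code: the enum and
fake_pick_inval_site() are hypothetical, and the three inputs are
reduced to plain flags.

```c
#include <assert.h>

/* Where the page cache invalidation happens for a DIO write. */
enum fake_inval_site {
	FAKE_AIO_INVALIDATE_WORKER,	/* fuse_aio_invalidate_worker() */
	FAKE_DIRECT_WRITE_ITER,		/* fuse_direct_write_iter() */
	FAKE_GENERIC_DIRECT_WRITE,	/* generic_file_direct_write() */
};

enum fake_inval_site fake_pick_inval_site(int is_async, int fuse_async_dio,
					  int fopen_direct_io)
{
	/* Async write with FUSE_ASYNC_DIO: deferred to the worker. */
	if (is_async && fuse_async_dio)
		return FAKE_AIO_INVALIDATE_WORKER;
	/* Sync write, or async write without FUSE_ASYNC_DIO. */
	if (fopen_direct_io)
		return FAKE_DIRECT_WRITE_ITER;
	return FAKE_GENERIC_DIRECT_WRITE;
}
```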

Workqueue is required for async write invalidation to prevent deadlock:
calling it directly in the I/O end routine (which is in fuse worker thread
context) can block on a folio lock held by a buffered I/O thread waiting
for the same fuse worker thread.

Co-developed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Cheng Ding <cding@ddn.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: set ff->flock only on success

If FUSE_SETLK fails (e.g., due to EWOULDBLOCK), we shall not set
FUSE_RELEASE_FLOCK_UNLOCK in fuse_file_release().

Reported-by: Li Yichao <liyichao.1@bytedance.com>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: clean up interrupt reading

Clean up interrupt reading logic. Remove passing the pointer to the fuse
request as an arg and make the header initializations more readable.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove stray newline in fuse_dev_do_read()

Remove stray newline that shouldn't be there.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: use READ_ONCE in fuse_chan_num_background()

fuse_chan_num_background() is called without holding fch->bg_lock (for
example from fuse_writepages() to compare against fc->congestion_threshold),
while fch->num_background is updated under bg_lock in dev.c and dev_uring.c.
This is the same locked-write/lockless-read pattern already used for
max_background in fuse_chan_max_background().

Use READ_ONCE() on the read side so that:

- The compiler does not cache or coalesce loads of a value that may change
concurrently on another CPU.
- KCSAN does not report an unexpected race.

Signed-off-by: Li Wang <liwang@kylinos.cn>
Fixes: 670d21c6e17f ("fuse: remove reliance on bdi congestion")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: dax: Move long delayed work on system_dfl_long_wq

Currently the code enqueues work items using {queue|mod}_delayed_work(),
using system_long_wq. This workqueue should be used when long works are
expected and it is a per-cpu workqueue.

The function(s) end up calling __queue_delayed_work(), which sets a global
timer that could fire anywhere, enqueuing the work where the timer fired.

Unbound works could benefit from scheduler task placement, to optimize
performance and power consumption. Long work shouldn't stick to a single
CPU.

Recently, a new unbound workqueue specific for long running work has
been added:

c116737e972e ("workqueue: Add system_dfl_long_wq for long unbound works")

Since the workqueue work doesn't rely on per-cpu variables, there is no
obvious reason that justifies the use of a per-cpu workqueue. So replace
system_long_wq with system_dfl_long_wq so that the work may benefit from
scheduler task placement.

Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>