Sukhdeep Singh [Wed, 10 Jun 2026 11:54:41 +0000 (17:24 +0530)]
net: atlantic: add AQC113 filter data structures, firmware query and register dump
Add filter infrastructure for AQC113 hardware:
- Define L3 (IPv4/IPv6), L4 (TCP/UDP/SCTP), and combined L3L4 filter
structures with serialized usage counter for filter sharing.
- Define tag policy structure for ethertype filter management.
- Add RPF L3/L4 command bit definitions for filter programming.
- Add filter count constants for L3L4, L3V4, L4, VLAN, and ethertype.
- Extend hw_atl2_priv with filter arrays, base indices, and counts
discovered from firmware.
Query filter capabilities from firmware shared memory at init time
to discover available L2/L3/L4/VLAN/ethertype filter resources and
ART (Action Resolver Table) configuration.
Add hardware register dump utility for AQC113 debug support.
Sukhdeep Singh [Wed, 10 Jun 2026 11:54:39 +0000 (17:24 +0530)]
net: atlantic: decouple aq_set_data_fl3l4() from driver internals
Refactor aq_set_data_fl3l4() to take an ethtool_rx_flow_spec pointer and
an explicit HW register location instead of driver-internal structures
(aq_nic_s, aq_rx_filter). This makes the function reusable for PTP
filter setup which constructs flow specs independently.
Key changes:
- Add aq_is_ipv6_flow_type() helper to derive IPv6 status from the
flow_type field, replacing the dependency on rx_fltrs->fl3l4.is_ipv6
shared state.
- Change aq_set_data_fl3l4() signature to accept (fsp, data, location,
add) and export it via aq_filters.h.
- Update aq_add_del_fl3l4() to compute the HW register location and
pass it explicitly.
Sukhdeep Singh [Wed, 10 Jun 2026 11:54:38 +0000 (17:24 +0530)]
net: atlantic: move active_ipv4/ipv6 bitmap updates after HW write
Move active_ipv4/active_ipv6 bitmap updates from aq_set_data_fl3l4()
into aq_add_del_fl3l4() after the hardware write succeeds. The bitmaps
track which filter slots are actively programmed in hardware and must
only be updated once the HW write is confirmed.
The bitmap updates in aq_nic_reserve_filter() and aq_nic_release_filter()
are intentionally retained: they guard the aq_check_approve_fl3l4()
IPv4/IPv6 mixing validation for callers such as the AQC113 PTP path that
program filters directly via hw_atl2_new_fl3l4_configure() without going
through aq_add_del_fl3l4().
Sukhdeep Singh [Wed, 10 Jun 2026 11:54:37 +0000 (17:24 +0530)]
net: atlantic: correct L3L4 filter flow_type masking and IPv6 handling
Correct three issues in aq_set_data_fl3l4() required for the AQC113
PTP filter path introduced later in this series:
1. Mask FLOW_EXT from flow_type before the protocol switch statement.
Flow types with FLOW_EXT set (e.g. TCP_V4_FLOW | FLOW_EXT) fall
through to the default case and skip protocol comparison flags.
2. Extend the L3 address comparison check to cover all four IPv6
words. The original code only checked ip_src[0]/ip_dst[0] and
required !is_ipv6, so CMP_SRC_ADDR_L3/CMP_DEST_ADDR_L3 were never
set for IPv6 filters.
3. Use explicit flow type checks for port extraction instead of
negating IP_USER_FLOW/IPV6_USER_FLOW. The old check did not mask
FLOW_EXT, so IP_USER_FLOW | FLOW_EXT would incorrectly attempt
port extraction. Use the actual flow type to pick the correct
union member directly.
====================
net: dsa: netc: add bridge mode support
This series adds bridge mode support to the NETC DSA switch driver,
covering both VLAN-aware and VLAN-unaware operation.
The NETC switch manages forwarding through a set of hardware tables
accessed via NTMP: the FDB table (FDBT), VLAN filter table (VFT), egress
treatment table (ETT), and egress count table (ECT). The series extends
the NTMP layer with the operations required for bridging, then builds the
DSA bridge callbacks on top.
Since all switch ports share the VFT, so only one VLAN-aware bridge is
supported.
FDB aging is managed in software. A periodic delayed work sweeps the
table using the hardware activity element mechanism, with a default aging
time of 300 seconds matching the IEEE 802.1Q standard. Per-port entries
are also flushed immediately on bridge leave and link-down events.
====================
The NETC switch does not age out dynamic FDB entries automatically.
Without software management, stale entries persist after topology
changes and cause incorrect forwarding.
Add a delayed work that periodically removes entries that have not been
refreshed within the specified cycles. The effective ageing time is:
ageing_time = fdbt_ageing_delay * 100
Default values are 3s interval and 100 cycles (300s total), matching
the IEEE 802.1Q default ageing time. The work starts when the first
port joins a bridge (tracked via br_cnt) and is cancelled when the
last port leaves. All FDB operations are serialized under fdbt_lock.
Implement .set_ageing_time() to allow the bridge layer to reconfigure
ageing parameters on demand.
Wei Fang [Thu, 11 Jun 2026 02:14:57 +0000 (10:14 +0800)]
net: dsa: netc: add bridge mode support
Wire up the port_bridge_join, port_bridge_leave and port_vlan_filtering
DSA callbacks to support both VLAN-unaware and VLAN-aware bridge modes.
For VLAN-unaware bridges, each bridge instance is assigned a dedicated
internal PVID via NETC_VLAN_UNAWARE_PVID(bridge.num), counting down
from VID 4095. A VFT entry is created for this PVID with hardware MAC
learning and flood-on-miss forwarding enabled. The CPU port is included
as a VFT member so frames can reach the host. The reserved VID range is
blocked in port_vlan_add to prevent user-space conflicts.
Only one VLAN-aware bridge is supported at a time; this constraint is
enforced in port_bridge_join and port_vlan_filtering. The per-port PVID
is tracked in software and written to the BPDVR register whenever VLAN
filtering is active.
When a port leaves the bridge, its dynamic FDB entries are flushed right
away in port_bridge_leave(), without waiting for the ageing cycle. When
a link down event occurs on a port, netc_mac_link_down() will also clear
the port's dynamic FDB entries via netc_port_remove_dynamic_entries().
Non-bridge ports have no dynamic FDB entries, so this call is always
safe. Additionally, .port_fast_age() callback is added to flush the
dynamic FDB entries associated to a port.
Host flood rules are removed from the ingress port filter table when a
port joins a bridge to avoid bypassing FDB lookup and MAC learning.
Implement the DSA .port_vlan_add and .port_vlan_del operations to enable
VLAN-aware bridge offloading on the NETC switch.
VLAN membership is maintained in the VLAN Filter Table (VFT). Adding the
first port to a VLAN creates a new VFT entry with hardware MAC learning
and flood-on-miss forwarding; subsequent ports update the existing
entry's membership bitmap. Removing the last port deletes the entry.
Egress tagging is handled through the Egress Treatment Table (ETT). Each
VLAN is allocated a group of ETT entries, one per available port. Ports
are assigned a sequential ett_offset during initialisation, used to
address each port's entry within the group. Untagged ports configure the
ETT to strip the outer VLAN tag; tagged ports pass frames through
unmodified. Each ETT group is optionally paired with an Egress Counter
Table (ECT) group for per-port frame counting, allocated on a best-effort
basis. When the egress rule of an ETT entry changes, the counter of the
corresponding ECT entry will be recounted to track the number of frames
that match the new egress rule.
A software shadow list serialised by vft_lock tracks active VLAN state
across both port membership and egress tagging. VID 0 is used for single
port mode and is ignored by both callbacks.
Wei Fang [Thu, 11 Jun 2026 02:14:55 +0000 (10:14 +0800)]
net: enetc: add helpers to set/clear table bitmap
NTMP index tables require software to allocate and manage entry IDs.
Add two bitmap helper functions to facilitate this management:
ntmp_lookup_free_eid(): finds the first zero bit in the given bitmap,
sets it to mark the entry as in-use, and returns the corresponding entry
ID. Returns NTMP_NULL_ENTRY_ID if no free entry is available.
ntmp_clear_eid_bitmap(): clears the bit associated with the given entry
ID in the bitmap to mark the entry as free. It is a no-op if the entry
ID is NTMP_NULL_ENTRY_ID.
Both functions are exported for use by other modules, such as the NETC
switch driver which needs to manage group index bitmaps for the Egress
Treatment Table (ETT) and Egress Count Table (ECT).
Wei Fang [Thu, 11 Jun 2026 02:14:54 +0000 (10:14 +0800)]
net: dsa: netc: initialize the group bitmap of ETT and ECT
The Egress Treatment Table (ETT) and Egress Count Table (ECT) are both
index tables whose entry IDs are allocated by software. Every num_ports
entries form a group, where each entry in the group corresponds to one
port. To facilitate group allocation and management, initialize the group
index bitmaps for both tables based on hardware capabilities reported by
ETTCAPR and ECTCAPR registers.
The bitmap size per table is calculated as the total number of hardware
entries divided by the number of available ports, which gives the number
of groups available for software allocation. A set bit in the bitmap
represents a group index that has been allocated.
These bitmaps will be used by subsequent patches that add VLAN support.
Wei Fang [Thu, 11 Jun 2026 02:14:53 +0000 (10:14 +0800)]
net: enetc: add "Update" operation to the egress count table
The egress count table is a static bounded index table, egress related
statistics are maintained in this table. The table is implemented as a
linear array of entries accessed using an index (0, 1, 2, ..., n) that
uniquely identifies an entry within the array. Egress Counter Entry ID
(EC_EID) is used as an index to an entry in this table. The EC_EID is
specified in the egress treatment table.
Egress count table entries are always present and enabled. The table
only supports access via entry ID, which is assigned by the software.
And it supports Update, Query and Query followed by Update operations.
Currently, only Update operation is supported.
Wei Fang [Thu, 11 Jun 2026 02:14:52 +0000 (10:14 +0800)]
net: enetc: add interfaces to manage egress treatment table
Each entry in the egress treatment table contains the egress packet
processing actions to be applied to a grouping or scope of packets
exiting on a particular egress port of the switch. A scope of packets,
for example, could be the packets exiting a particular VLAN, matching
a particular 802.1Q bridge forwarding entry or belonging to a stream
identified at ingress. The egress treatment table is implemented as a
linear array of entries accessed using an index (0,1, 2, ..., n) that
uniquely identifies an entry within the array.
The egress treatment table only supports access vid entry ID, which is
assigned by the software. It supports Add, Update, Delete and Query
operations. Note that only Query operation is not supported yet.
Wei Fang [Thu, 11 Jun 2026 02:14:51 +0000 (10:14 +0800)]
net: enetc: add "Update" and "Delete" operations to VLAN filter table
Add two interfaces to manage entries in the VLAN filter table:
ntmp_vft_update_entry(): Update the configuration element data of the
specified VLAN filter entry based on the given VLAN ID. It uses the
exact key access method to locate the entry.
ntmp_vft_delete_entry(): Delete the VLAN filter entry corresponding to
the specified VLAN ID. It also uses the exact key access method to
identify the target entry.
In addition, introduce struct vft_req_qd to describe the request data
buffer format for Query and Delete actions of the VLAN filter table,
which contains a common request data header and a VLAN access key.
Wei Fang [Thu, 11 Jun 2026 02:14:50 +0000 (10:14 +0800)]
net: enetc: add interfaces to manage dynamic FDB entries
Add three interfaces to manage dynamic entries in the FDB table:
ntmp_fdbt_update_activity_element(): Update the activity element of all
dynamic FDB entries. For each entry, if its activity flag is not set,
which means no packet has matched this entry since the last update, the
activity counter is incremented. Otherwise, both the activity flag and
activity counter are reset. The activity counter is used to track how
long an FDB entry has been inactive, which is useful for implementing
an ageing mechanism.
ntmp_fdbt_delete_ageing_entries(): Delete all dynamic FDB entries whose
activity flag is not set and whose activity counter is greater than or
equal to the specified threshold. This is used to remove stale entries
that have been inactive for too long.
ntmp_fdbt_delete_port_dynamic_entries(): Delete all dynamic FDB entries
associated with the specified switch port. This is typically called when
a port goes down or is removed from a bridge.
Minxi Hou [Fri, 12 Jun 2026 13:05:03 +0000 (21:05 +0800)]
selftests/net/openvswitch: add SET action test
Add test_action_set exercising OVS_ACTION_ATTR_SET with an ipv4 dst
rewrite. The test verifies the SET action in three steps: first
confirm normal forwarding, then apply set(ipv4(dst=10.0.0.99)) to
rewrite the destination to an address nobody owns and verify ping
fails, then restore normal forwarding and verify connectivity
recovers.
Jakub Kicinski [Mon, 15 Jun 2026 21:14:50 +0000 (14:14 -0700)]
Merge branch 'net-sfp-extend-smbus-support'
Jonas Jelonek says:
====================
net: sfp: extend SMBus support
Today, the SFP driver only drives I2C adapters that advertise full
I2C_FUNC_I2C, or SMBus-only adapters via single-byte transfers (with
hwmon disabled). Several SoCs ship I2C/SMBus-only controllers that
support more than just byte access -- e.g. word and I2C block -- and
have SFP cages wired to them. Today, those adapters either work
poorly or not at all.
This series teaches the SFP driver to use the larger SMBus access
modes when the adapter advertises them, and along the way starts
honoring i2c_adapter quirks on read/write length so adapters that
cap below the SFP block size are handled correctly. Patch 1 is a
small prep doing only the quirks handling; patch 2 extends the
SMBus path itself.
Capability matrix supported by patch 2:
- BYTE only: single-byte access (unchanged).
- BYTE + WORD: word for >=2-byte chunks, byte tail.
- I2C_BLOCK present: block as the universal transport.
- WORD only (no BYTE/BLOCK): accepted with WARN_ONCE; works for
even-length transfers, odd-length
transfers will error at xfer time.
Adapters with asymmetric R/W capabilities (e.g. only READ_I2C_BLOCK
without WRITE_I2C_BLOCK) remain functionally correct but use the
worse-supported direction's max for both directions, since
i2c_max_block_size is a single field. No mainline I2C driver was
seen advertising such asymmetry; per-direction sizes can be added
later if needed.
====================
Jonas Jelonek [Sun, 14 Jun 2026 13:34:18 +0000 (13:34 +0000)]
net: sfp: extend SMBus support
Commit 7662abf4db94 ("net: phy: sfp: Add support for SMBus module access")
added SMBus access for SFP modules, but limited it to single-byte
transfers. As a side effect, hwmon is disabled (16-bit reads cannot be
guaranteed atomic) and a warning is printed.
Many SMBus-only I2C controllers in the wild support more than just
byte access, and SFP cages are often wired to such controllers
rather than to a full-featured I2C controller -- e.g. the SMBus
controllers in the Realtek longan and mango SoCs, which advertise
word access and I2C block reads. Today, they cannot drive an SFP at
all without falling back to the byte-only path.
Extend sfp_smbus_read()/sfp_smbus_write() so that, in addition to
the existing byte access, they also use SMBus word access and SMBus
I2C block access whenever the adapter advertises them. Both
directions are handled in a single read and a single write helper
that pick the largest supported transfer per chunk and fall back as
needed.
I2C-block is preferred unconditionally when available: the protocol
carries any length 1..32, so it can serve every chunk -- including
the 1- and 2-byte tails -- without help from word or byte access.
Note that this requires I2C_FUNC_SMBUS_I2C_BLOCK, which reads a
caller-specified number of bytes. This deviates from the official
SMBus Block Read (length is supplied by the slave) but is widely
supported by Linux I2C controllers/drivers.
Capability matrix this implementation supports:
- BYTE only: works (unchanged behaviour); 1-byte
xfers, hwmon disabled.
- BYTE + WORD: word for >=2-byte chunks, byte for
trailing odd byte.
- I2C_BLOCK present (with or
without BYTE/WORD): block as the universal transport for
every chunk.
- WORD only (no BYTE/BLOCK): accepted with WARN_ONCE. Even-length
transfers work; odd-length transfers
(e.g. the 3-byte cotsworks fixup
write) hit the BYTE branch which the
adapter does not implement, so the
xfer returns an error and the
operation is aborted. No mainline
I2C driver was found to advertise
WORD without BYTE; the warning lets
us learn about it if it ever shows
up.
Adapters with asymmetric R/W capabilities (e.g. only READ_I2C_BLOCK
but not WRITE_I2C_BLOCK) remain functionally correct -- the
per-iteration fallback uses the direction-specific bits -- but the
shared i2c_max_block_size is sized by the all-bits-set check, so a
transfer in the better-supported direction is not upgraded. None of
the mainline I2C bus drivers surveyed during review advertise such
asymmetry; promoting i2c_max_block_size to per-direction sizes can
be revisited if needed.
Jonas Jelonek [Sun, 14 Jun 2026 13:34:17 +0000 (13:34 +0000)]
net: sfp: apply I2C adapter quirks to limit block size
The SFP driver assumes all I2C adapters support reading and writing the
pre-defined block size SFP_EEPROM_BLOCK_SIZE of 16 bytes. This constant
was probably chosen based on good guesses and known limitations of a
range of I2C adapters and SFP modules.
However, I2C adapters may even support less and usually need to specify
this via I2C quirks. Theoretically, such an adapter may provide full
functionality but only support a read and write length of e.g. 8 bytes.
Currently, the SFP driver doesn't account for that.
Add handling for I2C quirks in SFP I2C configuration taking the fields
max_read_len and max_write_len in struct i2c_adapter_quirks into account
to further limit the maximum block size if needed.
Jakub Kicinski [Mon, 15 Jun 2026 21:09:56 +0000 (14:09 -0700)]
Merge tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next.
More specifically, this contains conncount rework to address AI related
reports, assorted Netfiter updates and two small incremental updates on
IPVS:
1) Replace old obsolete workqueues (system_wq, system_unbound_wq)
in IPVS, from Marco Crivellari.
2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables.
In the recent years, reporters say that the use of WARN_ON{_ONCE}
in conjunction with panic_on_warn=1 results in DoS. Let's replace
it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test
infrastructure and fuzzers, while also providing context to AI
agents. From Fernando F. Mancera.
Five patches from Florian Westphal to address AI reports in the conncount
infrastructures:
3) Fix missing rcu read lock section when calling
__ovs_ct_limit_get_zone_limit().
4) Add a dedicate lock per rbtree tree, this increases memory
usage but it should improve scalability.
5) Add a helper function to find the rbtree node, no functional
changes are intented.
6) Add sequence counter to detect concurrent tree modifications
and retry lookups.
7) Add locks to GC conncount walk and address other nitpicks.
Then, several assorted updates:
8) Defensive Tree-wide addition of NULL checks for ct extensions.
9) Bail out if flowtable bypass cannot be fully set up from the
flow offload expression, instead of lazy building a likely
incomplete one.
10) Fix documentation for the new conn_max sysctl toggle in IPVS.
11) Add nf_dev_xmit_recursion*() helpers and use them, to address
recent AI reports.
* tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them
ipvs: fix doc syntax for conn_max sysctl
netfilter: flowtable: bail out if forward path cannot be discovered
netfilter: conntrack: check NULL when retrieving ct extension
netfilter: nf_conncount: gc and rcu fixes
netfilter: nf_conncount: add sequence counter to detect tree modifications
netfilter: nf_conncount: split count_tree_node rbtree walk into helper
netfilter: nf_conncount: use per nf_conncount_data spinlocks
netfilter: nf_conncount: callers must hold rcu read lock
netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths
ipvs: Replace use of system_unbound_wq with system_dfl_long_wq
====================
====================
netdev: expose page pool order via netlink
This small series exposes io_uring's high order page configuration
via the page_pool netlink interface and updates the appropriate
selftest to check this value.
====================
====================
selftests/vsock: improve vng version and quirk handling
As vng has continued updating, there have been two things in our
selftests that have been affected. One is that newer versions always
emit the vng version warning, and two is that we have a workaround that
is not needed in newer versions.
This series just updates the version handling to allow all newer
versions without warning and version-gates the workaround to only those
versions that don't have the commit that fixed the root cause.
Additionally, we add function for comparing major.minor versions which
is used in both patches.
-===================
Bobby Eshleman [Fri, 12 Jun 2026 19:08:42 +0000 (12:08 -0700)]
selftests/vsock: skip vng setsid workaround on >= 1.41
virtme-ng 1.41 ships the upstream fix for the SIGTTOU hang
(https://github.com/arighi/virtme-ng/pull/453), so the setsid wrapper in
vng_dry_run() is no longer needed there. Gate the workaround on the vng
version: setsid is used for vng < 1.41, and vng is invoked directly on
>= 1.41.
Bobby Eshleman [Fri, 12 Jun 2026 19:08:41 +0000 (12:08 -0700)]
selftests/vsock: accept vng 1.33 or >= 1.36
The current vng version check uses a discrete allowlist of "1.33",
"1.36", and "1.37", which forces a script update on every new release
even though all post-1.36 releases work.
Replace the discrete list with: "1.33", or any version >= 1.36. 1.34
and 1.35 are skipped because they were not tested. Add a version_lt()
helper that compares MAJOR.MINOR numerically, so the check reads as a
straightforward version comparison.
Eric Dumazet [Fri, 12 Jun 2026 16:25:17 +0000 (16:25 +0000)]
tcp: ipv6: clamp default adverting MSS to avoid GSO_BY_FRAGS (0xFFFF)
When MTU is large, ip6_default_advmss() can return IPV6_MAXPLEN (65535).
This is interpreted by TCP as mss_clamp, allowing the MSS to reach 65535.
However, 0xFFFF is also used as a magic value GSO_BY_FRAGS in the kernel.
If a TCP packet with gso_size=0xFFFF is passed to skb_segment(), it will
be mistakenly treated as GSO_BY_FRAGS, leading to a NULL pointer
dereference because local TCP packets do not use frag_list.
Fix this by returning min(IPV6_MAXPLEN, GSO_BY_FRAGS - 1) (65534) from
ip6_default_advmss() when MTU is large.
Also update the stale comment in ip6_default_advmss() which suggested
that IPV6_MAXPLEN is returned to mean "any MSS".
Fixes: 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes") Reported-by: syzbot+ebdb22d461c904fc3cb2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a2c3193.8812e0fc.3c3fa4.0001.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612162517.83394-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Fri, 12 Jun 2026 13:59:49 +0000 (13:59 +0000)]
tipc: fix UAF in tipc_l2_send_msg()
Syzbot reported a slab-use-after-free in ipvlan_hard_header() when
called from tipc_l2_send_msg().
The root cause is that tipc_disable_l2_media() calls synchronize_net()
while b->media_ptr is still valid. This allows concurrent RCU readers
to obtain the device pointer after synchronize_net() has finished.
The pointer is cleared later in bearer_disable(), but without any
subsequent synchronization, allowing the device to be freed while
still in use by readers.
Fix this by clearing b->media_ptr in tipc_disable_l2_media() before
calling synchronize_net().
This is safe to do now because the call order in bearer_disable()
was reversed in 0d051bf93c06 ("tipc: make bearer packet filtering generic")
to call tipc_node_delete_links() (which needs the pointer) before
disable_media().
Fixes: 282b3a056225 ("tipc: send out RESET immediately when link goes down")
https: //lore.kernel.org/netdev/6a2c1007.428ffe26.258b27.015d.GAE@google.com/T/#u Reported-by: syzbot+64ec81389cbad56a8c35@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jon Maloy <jmaloy@redhat.com> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Link: https://patch.msgid.link/20260612135949.4010482-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
octeontx2: quiesce stale mailbox IRQ state before request_irq()
Both OTX2 mailbox registration paths currently install their IRQ
handlers before clearing stale local mailbox interrupt state, even
though the code comments already say that the clear is needed first to
avoid spurious interrupts.
This issue was found by our static analysis tool and manually audited on
Linux v6.18.21. Directed QEMU no-device validation further showed that
the real PF and VF mailbox handlers are already reachable in that
pre-clear window and can touch the same mailbox and workqueue carrier
before local quiesce has completed.
This series keeps the change minimal:
- clear stale mailbox interrupt state before request_irq()
- keep interrupt enabling after the handler is installed
That closes the early-IRQ window without introducing a new
enable-before-handler window.
Patch 1 fixes the PF mailbox registration path.
Patch 2 fixes the VF mailbox registration path.
Build-tested by compiling otx2_pf.o and otx2_vf.o.
No OTX2 hardware was available for end-to-end runtime testing.
====================
Runyu Xiao [Thu, 11 Jun 2026 16:00:14 +0000 (00:00 +0800)]
octeontx2-vf: clear stale mailbox IRQ state before request_irq()
otx2vf_register_mbox_intr() currently installs the VF mailbox IRQ
handler before clearing stale mailbox interrupt state. The code then says
that local interrupt bits should be cleared first to avoid spurious
interrupts, but that clear still happens only after request_irq() has
already made the handler reachable.
A running system can reach this during VF mailbox interrupt registration
while stale or latched RVU_VF_INT state is still present. If delivery
happens in the request_irq()-to-clear window,
otx2vf_vfaf_mbox_intr_handler() can run before local quiesce and touch
the same vf->mbox and vf->mbox_wq carrier that probe and teardown later
reuse or destroy.
Move the stale mailbox interrupt clear ahead of request_irq(), but keep
interrupt enabling after the handler is installed. This closes the
pre-clear early-IRQ window without creating a new enable-before-handler
window.
Fixes: 3184fb5ba96e ("octeontx2-vf: Virtual function driver support") Cc: stable@vger.kernel.org Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/20260611160014.3202224-3-runyu.xiao@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Runyu Xiao [Thu, 11 Jun 2026 16:00:13 +0000 (00:00 +0800)]
octeontx2-pf: clear stale mailbox IRQ state before request_irq()
otx2_register_mbox_intr() currently installs the PF mailbox IRQ handler
before clearing stale mailbox interrupt state. The function itself then
comments that the local interrupt bits must be cleared first to avoid
spurious interrupts, but that clear happens only after request_irq() has
already exposed the handler to irq delivery.
A running system can reach this during PF mailbox interrupt registration
while stale or latched RVU_PF_INT state is still present. If delivery
happens in the request_irq()-to-clear window,
otx2_pfaf_mbox_intr_handler() can run before local quiesce and touch
the same pf->mbox and pf->mbox_wq carrier that probe and teardown later
reuse or destroy.
Move the stale mailbox interrupt clear ahead of request_irq(), but keep
interrupt enabling after the handler is installed. This closes the
pre-clear early-IRQ window without creating a new enable-before-handler
window.
Fixes: 5a6d7c9daef3 ("octeontx2-pf: Mailbox communication with AF") Cc: stable@vger.kernel.org Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/20260611160014.3202224-2-runyu.xiao@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Greg Patrick [Thu, 11 Jun 2026 17:53:41 +0000 (17:53 +0000)]
net: phy: sfp: detect presence via I2C when no MOD_DEF0 GPIO
An SFP cage (compatible "sff,sfp") whose MOD_DEF0 signal is not wired to a
GPIO currently falls back to sff_gpio_get_state(), which unconditionally
reports the module as present. An empty cage therefore fails its probe and
is parked in SFP_MOD_ERROR forever; because SFP_F_PRESENT never deasserts
there is no REMOVE event to recover the state machine, so a module inserted
after boot is never detected, and empty cages spam -EIO at boot.
This affects boards that route none of the cage presence signal to a
software-readable input. On the NicGiga S100-0800S-M (RTL9303, 8x SFP+) the
cage I2C bus is the switch's SMBus master; TX_DISABLE is driven via a
PCA9534 I/O expander, but no MOD_ABS/MOD_DEF0 line reaches a readable GPIO
(the RTL9303 gpio0 lines read stuck-low, the single PCA9534 is fully
consumed by TX_DISABLE, and there is no RTL8231). The Horaco ZX-SW82TS-L2P
(RTL9302D, 2x SFP+) is independently affected in the same way.
For such an SFP cage, derive presence from a throttled single-byte I2C read
of the module EEPROM instead: a successful read asserts SFP_F_PRESENT,
R_PROBE_ABSENT consecutive failures clear it (to ride out a transient error
on a live module). The existing poll then emits SFP_E_INSERT / SFP_E_REMOVE
normally, giving working hot-plug and silencing the boot-time -EIO spam on
empty cages. Presence is re-probed every T_PROBE_PRESENT, so insertion is
detected within that interval and removal within
T_PROBE_PRESENT * R_PROBE_ABSENT.
A soldered-down module (compatible "sff,sff") has no presence signal and is
genuinely always present, so it continues to use sff_gpio_get_state(); the
new path is gated on the cage type advertising SFP_F_PRESENT.
Signed-off-by: Greg Patrick <gregspatrick@hotmail.com> Tested-by: Manuel Stocker <mensi@mensi.ch> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260611175341.2223184-1-gregspatrick@hotmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Yang [Thu, 11 Jun 2026 07:08:51 +0000 (15:08 +0800)]
devlink: Warn on resource ID collision with PARENT_TOP
ID 0 serves as the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP to mark
top-level resources. While it is technically possible to use 0 as a real
resource ID, a user might be tempted to write:
register(..., MY_RESOURCE_ID_C, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_D, MY_RESOURCE_ID_C, ...);
/* D is a child of C */
register(..., MY_RESOURCE_ID_A, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_B, MY_RESOURCE_ID_A, ...);
/* Is B intentionally top-level, or is it actually a child of A? */
Add a WARN_ON() to catch this and prevent confusion.
David Yang [Thu, 11 Jun 2026 07:08:50 +0000 (15:08 +0800)]
net: dsa: mv88e6xxx: Avoid devlink resource IDs collision with PARENT_TOP
The devlink resource ID for ATU collides with the sentinel
DEVLINK_RESOURCE_ID_PARENT_TOP (0). As a result, ATU_bin_* are
registered as in fact registered as top-level siblings, not as children
of ATU.
Whether intentional or unintentional, clarify it by keeping the real
resource IDs starting at 1. Unfortunately ATU_bin_* are already
registered at top-level, so keep their parent to PARENT_TOP.
David Yang [Thu, 11 Jun 2026 07:08:49 +0000 (15:08 +0800)]
net: dsa: hellcreek: avoid devlink resource IDs collision with PARENT_TOP
This might not cause real problems, but the hellcreek devlink resource
ID collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid
it by keeping the real resource IDs starting at 1.
David Yang [Thu, 11 Jun 2026 07:08:48 +0000 (15:08 +0800)]
net: dsa: b53: avoid devlink resource IDs collision with PARENT_TOP
This might not cause real problems, but the b53 devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.
David Yang [Thu, 11 Jun 2026 07:08:47 +0000 (15:08 +0800)]
net: dsa: dsa_loop: avoid devlink resource IDs collision with PARENT_TOP
This might not cause real problems, but the dsa_loop devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.
Johan Hovold [Thu, 11 Jun 2026 12:03:45 +0000 (14:03 +0200)]
pmdomain: core: fix unused variable warning with !PM_GENERIC_DOMAINS_OF
The genpd provider bus is really only used when
CONFIG_PM_GENERIC_DOMAINS_OF is enabled, and since the recent deferred
initialisation of domain parent devices, the root device pointer is
otherwise unused.
Fix the unused variable warning by moving the definition of the root device
pointer inside the corresponding ifdef.
Fixes: 92b69eff8012 ("pmdomain: core: fix early domain registration") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202606111746.kAxaAbwg-lkp@intel.com/ Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Ulf Hansson <ulfh@kernel.org>
Bhargav Joshi [Thu, 11 Jun 2026 21:12:29 +0000 (02:42 +0530)]
dt-bindings: interrupt-controller: ti,irq-crossbar: Convert to DT schema
Convert TI irq-crossbar binding from text format to DT schema.
As part of conversion following changes are made:
- Add '#interrupt-cells' as a required property which was missing in
text binding
- As irq-crossbar is interrupt-controller. Move binding from
bindings/arm/omap to bindings/interrupt-controller
====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2
This is part 2. Find part 1 here:
https://lore.kernel.org/all/20260531113954.395443-1-tariqt@nvidia.com/
This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).
Design
Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:
- MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
behavior, used by bonding, FW LAG commands, v2p_map)
- MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
(used by MPESW shared FDB across all devices)
- specific group_id: iterate only devices in that SD group (used by
per-group SD shared FDB operations)
Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
Lifecycle and ownership
The SD LAG lifecycle is tied to the SD group, not to bonding events:
1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
(priv.lag) for each LAG-capable PF. e.g.: SD primary devices
2. During mlx5_sd_init(), after the SD group is fully formed (primary
and secondaries paired), sd_lag_init() registers the secondary
devices into the primary's existing priv.lag by calling
mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
also gets its group_id set. No separate LAG instance is created.
3. After all the devices in SD group transition to switchdev,
mlx5_lag_shared_fdb_create() is invoked with the group_id to create
a software-only shared FDB scoped to that SD group. This sets
sd_fdb_active on all lag_func entries in the group. No FW LAG
commands are issued since SD devices share the same physical port.
4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
per-group SD shared FDB is torn down first, then MPESW shared FDB is
created spanning all devices (ports + SD secondaries) using
MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
restored.
5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
removes secondaries from priv.lag and clears the primary's group_id.
The LAG structure itself is not destroyed.
The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.
SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). And the software LAG demux is implemented via
steering rules that utilize new destination, VHCA_RX.
Devcom support (patches 2-3):
- Expose locked variant of send_event
- Add DEVCOM_CANT_FAIL for non-rollback events
SD core hardening (patches 4-6):
- Make primary/secondary role determination more robust
- Add L2 table silent mode query support
- Expand vport metadata for SD secondary devices
SD switchdev transition (patches 7-8):
- Support switchdev mode transition with shared FDB
- Notify SD on eswitch disable
LAG integration (patches 9-12):
- Store demux resources per master lag_func
- Disable both regular and SD LAG on lag_disable_change
- Introduce software vport LAG implementation
- Add MPESW over SD LAG support
Deferred init (patches 13-14):
- Tie rep load/unload to SD LAG state
- Defer vport metadata init until SD is ready
Enablement (patch 15):
- Enable SD over ECPF and allow switchdev transition
Shay Drory [Fri, 12 Jun 2026 11:39:04 +0000 (14:39 +0300)]
net/mlx5: SD, enable SD over ECPF and allow switchdev transition
Remove the restriction blocking SD on embedded CPU PFs (ECPF), enabling
SD functionality on BlueField DPUs. Remove the blocker preventing SD
devices from transitioning to switchdev mode.
The infrastructure added in earlier patches properly handles this case.
Shay Drory [Fri, 12 Jun 2026 11:39:03 +0000 (14:39 +0300)]
net/mlx5: SD, defer vport metadata init until SD is ready
Allow SD devices to transition to switchdev before the SD group is
fully up. Metadata allocation requires the SD group to be ready, so
defer it from esw_offloads_enable() until SD shared-FDB activation.
Add mlx5_esw_offloads_init_deferred_metadata() which allocates per-vport
metadata and refreshes the ingress ACLs that were previously programmed
with metadata=0. The helper is idempotent and can be called multiple
times.
Shay Drory [Fri, 12 Jun 2026 11:39:02 +0000 (14:39 +0300)]
net/mlx5: E-Switch, Tie rep load/unload to SD LAG state
On an SD device, vport representors are not functional until the SD
group is combined and shared FDB is active. Skip the initial load and
the reload paths in that window; reps are loaded as part of the SD LAG
activation flow once it becomes active.
In addition, explicitly unload representors when SD LAG is destroyed.
Shay Drory [Fri, 12 Jun 2026 11:39:01 +0000 (14:39 +0300)]
net/mlx5: LAG, add MPESW over SD LAG support
Enable MPESW LAG creation over SD LAG members, forming a composite LAG
hierarchy. This allows bonding multiple SD groups together under a
single MPESW configuration with shared FDB.
When enabling composite MPESW, the individual SD LAG shared FDB
configurations are temporarily torn down and recreated when the
composite LAG is disabled.
Shay Drory [Fri, 12 Jun 2026 11:39:00 +0000 (14:39 +0300)]
net/mlx5: LAG, introduce software vport LAG implementation
SD LAG is a virtual LAG without hardware LAG support, so it cannot use
the firmware vport LAG commands. Implement a software-based vport LAG
using egress ACL bounce rules.
Add esw_set_slave_egress_rule() to create an egress ACL rule on the
slave's manager vport that bounces traffic to the master's manager
vport. This achieves the same traffic steering as hardware vport LAG.
Redirect mlx5_cmd_create_vport_lag() and mlx5_cmd_destroy_vport_lag()
to the software implementation when operating in SD LAG mode.
In addition, adjust lag_demux creation to check SD LAG mode as well.
Shay Drory [Fri, 12 Jun 2026 11:38:59 +0000 (14:38 +0300)]
net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change
Extend mlx5_lag_disable_change() to properly disable both regular LAG
and SD LAG when requested. Each LAG type uses its own devcom component
for locking.
Use mlx5_sd_get_devcom() helper to retrieve the SD devcom component,
needed for proper locking when disabling SD LAG.
Shay Drory [Fri, 12 Jun 2026 11:38:58 +0000 (14:38 +0300)]
net/mlx5: LAG, store demux resources per master lag_func
The lag demux resources (flow table, flow group, and rules xarray)
are stored on the shared ldev. With Socket Direct, multiple SD groups
each create their own demux FT/FG during their master's IB device
initialization. Since they all write to the same ldev fields, the
second group's init overwrites the first group's pointers, leaking
the first group's FT/FG.
During teardown, the cleanup uses the overwritten pointers, destroying
the wrong group's resources and leaving leaked flow tables in the LAG
namespace. These leaked tables can interfere with subsequently created
demux tables.
Move the demux resources from the shared ldev to per-master lag_func
instances. Each master device now owns its own independent demux
state. The rule_add and rule_del helpers look up the appropriate
master's lag_func via the existing filter/group infrastructure.
Shay Drory [Fri, 12 Jun 2026 11:38:57 +0000 (14:38 +0300)]
net/mlx5: E-Switch, notify SD on eswitch disable
When eswitch is disabled, notify the SD layer so it can clean up
SD-specific resources such as the TX flow table root configuration
on secondary devices.
Shay Drory [Fri, 12 Jun 2026 11:38:56 +0000 (14:38 +0300)]
net/mlx5: SD, support switchdev mode transition with shared FDB
When the eswitch transitions, propagate the change to SD: secondaries
get their TX flow table root reconfigured for the new mode, and when
all group devices move to switchdev, the per-group shared FDB is
activated.
Shared FDB activation is best-effort - failure does not block the
eswitch transition; the next transition retries.
Note: the existing mlx5_get_sd() guard that blocks switchdev for SD
devices is intentionally retained. It will be removed once all
supporting patches are in place.
Shay Drory [Fri, 12 Jun 2026 11:38:55 +0000 (14:38 +0300)]
net/mlx5: SD, expend vport metadata for SD secondary devices
In Socket Direct configurations the primary and secondary PFs share the
same native_port_num. The eswitch vport metadata encodes pf_num in its
upper bits to distinguish vports across PFs. Without SD-awareness, both
PFs generate identical metadata, causing FDB rules to steer traffic to
the wrong representor.
Add mlx5_sd_pf_num_get() which remaps the pf_num for SD devices.
Use it so each PF in an SD group produces unique vport metadata.
Shay Drory [Fri, 12 Jun 2026 11:38:54 +0000 (14:38 +0300)]
net/mlx5: SD, add L2 table silent mode query support
Add mlx5_fs_cmd_query_l2table_silent() to query the current silent mode
state from firmware. This allows detecting if firmware has already put
secondary devices into silent mode.
During SD group registration, query the silent mode of each device. If
a device is already in silent mode (set by firmware), record this in
the fw_silents_secondaries flag and use it to help determine the
primary/secondary roles.
When fw_silents_secondaries is set, skip the driver-initiated silent
mode set/unset operations since firmware manages this state. This
handles configurations where firmware persistently silences secondary
devices.
Shay Drory [Fri, 12 Jun 2026 11:38:53 +0000 (14:38 +0300)]
net/mlx5: SD, make primary/secondary role determination more robust
Refactor SD group registration to use devcom event-driven role
determination to ensure SD is marked as ready only after roles are fully
assigned and the group state is consistent, making outside accessors,
which will be added in downstream patches, safe to use without races.
The devcom events:
- SD_PRIMARY_SET event: each device compares bus numbers with peers
to determine which should be primary
- SD_SECONDARIES_SET event: secondaries register themselves with the
elected primary device
Shay Drory [Fri, 12 Jun 2026 11:38:52 +0000 (14:38 +0300)]
net/mlx5: devcom, add DEVCOM_CANT_FAIL for non-rollback events
Some devcom events are not expected to fail. Rather than attempting
a rollback that may not be meaningful, allow callers to pass
DEVCOM_CANT_FAIL as the rollback_event to indicate that the event
handler should not fail. If it does, emit a warning and stop
propagating to further peers, but skip the rollback path.
Shay Drory [Fri, 12 Jun 2026 11:38:51 +0000 (14:38 +0300)]
net/mlx5: devcom, expose locked variant of send_event
Factor mlx5_devcom_send_event() into two functions:
- mlx5_devcom_locked_send_event(): performs the dispatch (and
rollback) with comp->sem already held by the caller.
- mlx5_devcom_send_event(): unchanged wrapper that takes comp->sem,
calls the locked variant, and releases it.
This lets callers bracket multiple event broadcasts under a single
held write lock, eliminating the gap between consecutive dispatches
where peer state could change.
SD secondary devices share the primary's uplink and do not have
their own uplink representor. When reloading IB reps on secondary
devices, skip the uplink and only load VF/SF vport IB reps.
Takashi Iwai [Mon, 15 Jun 2026 18:19:22 +0000 (20:19 +0200)]
Merge tag 'asoc-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Updates for v7.2
There's been quite a lot of framework improvements this time around,
though mainly cleanups and robustness rather than user visible features.
The same pattern is seen with a lot of the driver work that's going on,
there are new features but a huge proportion of this is bug fixing and
cleanup work. We also have a good selectio of new device support.
- Improvements to SDCA jack handling from Charles Keepax.
- Use of device links to make suspend handling more robust from Richard
Fitzgerald.
- Use of a new helper to factor out a common pattern in SoundWire
enmeration from Charles Keepax.
- Slimming down of the component from Kuninori Morimoto.
- Simplification of format auto selection from Kuninori Morimoto.
- Lots of conversions to guard() from Bui Duc Phuc.
- Addition of a simple-amplifier driver supporting more featureful GPIO
controller amplifiers than the previous basic driver from Herve
Codina.
- Support for AMD ACP 7.x, Cirrus Logic CS42448/CS42888, Everest Semi
ES9356, Mediatek MT2701 and MT8196, Renesas RZ/G3E, Spacemit K3,
Texas Instruments TAC5xx2 and TAS67524.
Lianqin Hu [Mon, 15 Jun 2026 12:05:01 +0000 (12:05 +0000)]
ALSA: usb-audio: Add iface reset and delay quirk for XIBERIA K03S
Setting up the interface when suspended/resumeing fail on this card.
Adding a reset and delay quirk will eliminate this problem.
usb 1-1: New USB device found, idVendor=36f9, idProduct=c009
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 1-1: Product: XIBERIA K03S
usb 1-1: Manufacturer: Actions
usb 1-1: usb_probe_device
Jiri Kosina [Fri, 12 Jun 2026 15:48:22 +0000 (17:48 +0200)]
HID: hidpp: fix potential UAF in hidpp_connect_event()
If input_register_device() fails, we call input_free_device(), but keep
stale pointer to the old device in hidpp->input, which could potentially
lead to UAF. Fix that by resetting it to NULL before returning from
hidpp_connect_event().
Zhenghang Xiao [Mon, 15 Jun 2026 10:25:56 +0000 (12:25 +0200)]
fuse-uring: clear ent->fuse_req in commit_fetch error path
fuse_uring_commit_fetch() error path called fuse_request_end(req) without
clearing ent->fuse_req when fuse_ring_ent_set_commit() fails. The
still-pending fuse_uring_send_in_task() task-work later dereferences the
dangling pointer through fuse_uring_prepare_send(), causing a
use-after-free.
End the request with fuse_uring_req_end(), which handles all conditions
already.
Annotation/edition by Bernd: The UAF should be fixed by other means already
and actually has to be avoided that way.
Just checking for ent->fuse_req == NULL in fuse_uring_send_in_task()
would be prone to race conditions, because if malicious userspace
would commit requests that have passed the NULL check, but are
in doing args copy, it would still trigger a use-after-free.
Setting ent->fuse_req = NULL in fuse_uring_commit_fetch() still
makes sense, though.
Joanne Koong [Fri, 12 Jun 2026 21:05:09 +0000 (14:05 -0700)]
fuse-uring: use named constants for io-uring iovec indices
Replace magic indices 0 and 1 for the iovec array with named constants
FUSE_URING_IOV_HEADERS and FUSE_URING_IOV_PAYLOAD. This makes the usages
self-documenting and prepares for buffer ring support which will also
reference these iovec slots by index.
Reviewed-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Joanne Koong [Fri, 12 Jun 2026 21:05:07 +0000 (14:05 -0700)]
fuse-uring: use enum types for header copying
Use enum types to identify which part of the header needs to be copied.
This improves the interface and will simplify both kernel-space and
user-space header addresses copying when buffer rings are added.
Reviewed-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Joanne Koong [Fri, 12 Jun 2026 21:05:06 +0000 (14:05 -0700)]
fuse-uring: refactor io-uring header copying from ring
Move header copying from ring logic into a new copy_header_from_ring()
function. This makes the copy_from_user() logic more clear and
centralizes error handling / rate-limited logging.
Reviewed-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Joanne Koong [Fri, 12 Jun 2026 21:05:05 +0000 (14:05 -0700)]
fuse-uring: refactor io-uring header copying to ring
Move header copying to ring logic into a new copy_header_to_ring()
function. This makes the copy_to_user() logic more clear and centralizes
error handling / rate-limited logging.
Reviewed-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Joanne Koong [Fri, 12 Jun 2026 21:05:04 +0000 (14:05 -0700)]
fuse-uring: separate next request fetching from sending logic
Simplify the logic for fetching + sending off the next request.
This gets rid of fuse_uring_send_next_to_ring() which contained
duplicated logic from fuse_uring_send(). This decouples request fetching
from the send operation, which makes the control flow clearer and
reduces unnecessary parameter passing.
Reviewed-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Jun Wu [Fri, 15 May 2026 00:14:12 +0000 (17:14 -0700)]
fuse: invalidate readdir cache on epoch bump
FUSE_NOTIFY_INC_EPOCH invalidates dentries, but does not invalidate cached
readdir results. A process with cwd inside a FUSE mount can therefore
observe stale readdir(".") output after an epoch bump.
Fix this by recording epoch in the readdir cache and checking it on reuse.
Minimal reproducer:
- mount a tiny FUSE fs with an empty root directory
- on opendir, enable fi->cache_readdir and fi->keep_cache
- chdir into the mount and call readdir(".") to populate readdir cache
- make the FUSE server report one file in the root directory
- send only FUSE_NOTIFY_INC_EPOCH
- call readdir(".") again; before this change it stays stale, after this
change it sees the new file
Fixes: 2396356a945b ("fuse: add more control over cache invalidation behaviour") Signed-off-by: Jun Wu <quark@meta.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
virtio-fs: avoid double-free on failed queue setup
virtio_fs_setup_vqs() allocates fs->vqs and fs->mq_map before calling
virtio_find_vqs(). If virtio_find_vqs() fails, the error path frees both
pointers and returns an error to virtio_fs_probe().
virtio_fs_probe() then drops the last kobject reference, and
virtio_fs_ktype_release() frees fs->vqs and fs->mq_map again. This leaves
dangling pointers in struct virtio_fs and can trigger a double-free during
probe failure cleanup.
Set fs->vqs and fs->mq_map to NULL immediately after kfree() in the
virtio_fs_setup_vqs() error path so that the later kobject release sees an
uninitialized state and kfree(NULL) becomes harmless.
This can be reproduced when a broken virtio-fs device advertises more
request queues than the transport actually provides. In that case
virtio_find_vqs() fails while setting up the extra queue, and the probe
path reaches the double-free cleanup sequence.
fuse: invalidate page cache after DIO and async DIO writes
This fixe does page cache invalidation after DIO and async DIO writes for
both O_DIRECT and FOPEN_DIRECT_IO cases.
Commit b359af8275a9 ("fuse: Invalidate the page cache after FOPEN_DIRECT_IO
write") fixed xfstests generic/209 for DIO writes in the FOPEN_DIRECT_IO
path. DIO writes without FOPEN_DIRECT_IO are already handled by
generic_file_direct_write().
However, async DIO writes (xfstests generic/451) remain unhandled.
After this fix:
- Async write with FUSE_ASYNC_DIO:
invalidate in fuse_aio_invalidate_worker()
- Otherwise (Sync or async write without FUSE_ASYNC_DIO):
- With FOPEN_DIRECT_IO:
invalidate in fuse_direct_write_iter()
- Without FOPEN_DIRECT_IO:
invalidate in generic_file_direct_write()
Workqueue is required for async write invalidation to prevent deadlock:
calling it directly in the I/O end routine (which is in fuse worker thread
context) can block on a folio lock held by a buffered I/O thread waiting
for the same fuse worker thread.
Li Wang [Fri, 22 May 2026 08:38:05 +0000 (16:38 +0800)]
fuse: use READ_ONCE in fuse_chan_num_background()
fuse_chan_num_background() is called without holding fch->bg_lock (for
example from fuse_writepages() to compare against fc->congestion_threshold),
while fch->num_background is updated under bg_lock in dev.c and dev_uring.c.
This is the same locked-write/lockless-read pattern already used for
max_background in fuse_chan_max_background().
Use READ_ONCE() on the read side so that:
- The compiler does not cache or coalesce loads of a value that may change
concurrently on another CPU.
- Prevent KCSAN from reporting an unexpected race.
Signed-off-by: Li Wang <liwang@kylinos.cn> Fixes: 670d21c6e17f ("fuse: remove reliance on bdi congestion") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse: dax: Move long delayed work on system_dfl_long_wq
Currently the code enqueue work items using {queue|mod}_delayed_work(),
using system_long_wq. This workqueue should be used when long works are
expected and it is a per-cpu workqueue.
The function(s) end up calling __queue_delayed_work(), which set a global
timer that could fire anywhere, enqueuing the work where the timer fired.
Unbound works could benefit from scheduler task placement, to optimize
performance and power consumption. Long work shouldn't stick to a single
CPU.
Recently, a new unbound workqueue specific for long running work has
been added:
c116737e972e ("workqueue: Add system_dfl_long_wq for long unbound works")
Since the workqueue work doesn't rely on per-cpu variables, there is no
obvious reason that justify the use of a per-cpu workqueue. So change
system_long_wq with system_dfl_long_wq so that the work may benefit from
scheduler task placement.
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Amir Goldstein [Fri, 5 Jun 2026 09:26:47 +0000 (11:26 +0200)]
fuse: add fuse_request_sent tracepoint
This new tracepoint complements fuse_request_send (enqueue) and
fuse_request_end (completion). It fires after the request has been
successfully copied to the daemon's buffer, just before the daemon
can start to process it.
fuse_request_sent does not fire if the copy of the request fails.
It also does not fire for NOTIFY_REPLY, which fires the _end tracepoint
at the end of copy.
This is needed for tools tracking the in-flight state of user initiated
fuse requests.
Tim Bird [Thu, 4 Jun 2026 19:55:08 +0000 (13:55 -0600)]
fuse: Add SPDX ID lines to some files
Some fuse source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove old
license references from the headers.
Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_get_user_pages() allocates the temporary pages[] array used by
iov_iter_extract_pages() with the open-coded kzalloc(n * sizeof(*p),
...) form. max_pages is derived from the inbound iov_iter and is not
bounded at compile time, so the multiplication can overflow on
sufficiently large iter counts; the resulting too-small allocation
would then be written past by iov_iter_extract_pages().
Switch to kcalloc(), which carries the same zero-on-allocation
semantics and adds the standard size_mul overflow check. No
functional change for non-overflow inputs.
Signed-off-by: William Theesfeld <william@theesfeld.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
GuoHan Zhao [Sun, 10 May 2026 14:54:37 +0000 (22:54 +0800)]
fuse: use current creds for backing files
FUSE backing files only need a stable snapshot of the current credentials
for later backing-file I/O. prepare_creds() allocates a mutable copy and
can fail, but this code never modifies or commits the result.
Use get_current_cred() instead and store it as a const pointer. This
matches the rest of the backing-file helpers and avoids an unnecessary
allocation and failure path.
Signed-off-by: GuoHan Zhao <zhaoguohan@kylinos.cn> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse: remove redundant buffer size checks for interrupt and forget requests
In fuse_dev_do_read(), there is already logic that ensures the buffer is
a minimum of at least FUSE_MIN_READ_BUFFER (8k) bytes.
This makes the buffer size checks for interrupt and forget requests
redundant as sizeof(struct fuse_in_header) + sizeof(struct
fuse_interrupt_in) and sizeof(struct fuse_in_header) + sizeof(struct
fuse_forget_in) are both less than FUSE_MIN_READ_BUFFER.
Konrad Sztyber [Tue, 14 Apr 2026 08:27:23 +0000 (10:27 +0200)]
fuse: reduce attributes invalidated on directory change
When the contents of a directory is modified, some of its attributes may
also change, so they need to be invalidated. But this isn't the case
for every attribute. For instance, unlinking or creating a file doesn't
change the uid/gid of its parent directory.
This can cause unnecessary FUSE_GETATTRs to be sent to user-space. For
example, fuse_permission() checks if mode, uid, and gid are valid and
will issue a FUSE_GETATTR if they're not, which results in an extra
FUSE_GETATTR request for every FUSE_UNLINK when removing files in the
same directory.
Signed-off-by: Konrad Sztyber <ksztyber@nvidia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Li Wang [Fri, 10 Apr 2026 02:34:33 +0000 (10:34 +0800)]
fuse: drop redundant err assignment in fuse_create_open()
In fuse_create_open(), err is initialized to -ENOMEM immediately before
the fuse_alloc_forget() NULL check. If forget allocation fails,
it branches to out_err with that value. If it succeeds, it falls through
without modifying err, so err is still -ENOMEM at the point where
fuse_file_alloc() is called. The second err = -ENOMEM before
fuse_file_alloc() therefore is redundant.
Signed-off-by: Li Wang <liwang@kylinos.cn> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Randy Dunlap [Tue, 7 Apr 2026 00:50:40 +0000 (17:50 -0700)]
fuse: fuse_i.h: clean up kernel-doc comments
Convert many comments to kernel-doc format to eliminate around 20
kernel-doc warnings like these:
Warning: fs/fuse/fuse_i.h:374 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* A Fuse connection.
Warning: fs/fuse/fuse_i.h:817 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Get a filled in inode
Warning: fs/fuse/fuse_i.h:859 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Send RELEASE or RELEASEDIR request
and more like this.
Also add struct member and function parameter descriptions to avoid
these warnings:
Warning: fs/fuse/fuse_i.h:1071 struct member 'epoch_work' not described in 'fuse_conn'
Warning: fs/fuse/fuse_i.h:1071 struct member 'rcu' not described in 'fuse_conn'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'fc' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'nodeid' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'offset' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'len' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'fc' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'parent_nodeid' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'child_nodeid' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'name' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'flags' not described in 'fuse_reverse_inval_entry'
Convert struct fuse_file, struct fuse_submount_lookup, struct fuse_inode,
and struct fuse_conn to kernel-doc.
Convert these to plain comments:
Warning: fs/fuse/fuse_i.h:1423 expecting prototype for File(). Prototype was for fuse_reverse_inval_inode() instead
Warning: fs/fuse/fuse_i.h:1436 expecting prototype for File(). Prototype was for fuse_reverse_inval_entry() instead
Change some "/**" to "/*" since they are not kernel-doc comments.
The changes above fix most kernel-doc warnings in this file but
these warnings are not fixed and still remain:
Warning: fs/fuse/fuse_i.h:1428 No description found for return value of 'fuse_fill_super_common'
Warning: fs/fuse/fuse_i.h:1455 No description found for return value of 'fuse_ctl_add_conn'
Binary build output is the same before and after these changes.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Randy Dunlap [Tue, 7 Apr 2026 00:50:39 +0000 (17:50 -0700)]
fuse: fuse_dev_i.h: clean up kernel-doc warnings
Change some "/**" to "/*" since they are not kernel-doc comments:
Warning: fs/fuse/fuse_dev_i.h:25 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Request flags
Warning: fs/fuse/fuse_dev_i.h:58 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* A request to the client
Warning: fs/fuse/fuse_dev_i.h:117 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Input queue callbacks
Warning: fs/fuse/fuse_dev_i.h:289 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Fuse device instance
and more like this.
Convert enum fuse_req_flag to kernel-doc format.
Convert struct fuse_req, struct fuse_iqueue_ops, and struct fuse_dev
to kernel-doc format.
These warnings remain:
Warning: fs/fuse/fuse_dev_i.h:115 struct member 'ring_entry' not described in 'fuse_req'
Warning: fs/fuse/fuse_dev_i.h:115 struct member 'ring_queue' not described in 'fuse_req'
Binary build output is the same before and after these changes.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Randy Dunlap [Tue, 7 Apr 2026 00:50:38 +0000 (17:50 -0700)]
fuse-uring: drop kernel-doc notation for a comment
Use regular C comment syntax for a non-kernel-doc comment to avoid
a kernel-doc warning:
Warning: fs/fuse/dev_uring_i.h:104 This comment starts with '/**', but
isn't a kernel-doc comment.
* Describes if uring is for communication and holds alls the data needed
Binary build output is the same before and after this change.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse: alloc pqueue before installing fch in fuse_dev
Prior to this patchset, fuse_dev (containing fuse_pqueue) was allocated on
mount. But now fuse_dev is allocated when opening /dev/fuse, even though
the queues are not needed at that time.
Delay allocation of the pqueue (4k worth of list_head) just before mounting
or cloning a device.
Various distributions (e.g. Debian/Fedora) configure /dev/fuse as world
writable, so the pqueue allocation should be deferred to a privileged
operation (mount) to prevent unprivileged userspace from consuming pinned
kernel memory.
[Li Wang: fix kernel NULL pointer dereference in fuse_uring_add_to_pq()]
[Fix race in fuse_dev_release()]