git.ipfire.org Git - thirdparty/kernel/linux.git/log
3 weeks ago rust: ptr: remove implicit index projection syntax
Gary Guo [Tue, 2 Jun 2026 14:17:57 +0000 (15:17 +0100)] 
rust: ptr: remove implicit index projection syntax

All users have been converted to use keyworded index projection syntax to
explicitly state their intention when doing index projection.

Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Gary Guo <gary@garyguo.net>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-6-6989470f5440@garyguo.net
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks ago gpu: nova-core: convert to keyworded projection syntax
Gary Guo [Tue, 2 Jun 2026 14:17:56 +0000 (15:17 +0100)] 
gpu: nova-core: convert to keyworded projection syntax

Use "build" to denote that the index bounds checking here is performed at
build time.

Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Gary Guo <gary@garyguo.net>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-5-6989470f5440@garyguo.net
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks ago rust: dma: update to keyworded index projection syntax
Gary Guo [Tue, 2 Jun 2026 14:17:55 +0000 (15:17 +0100)] 
rust: dma: update to keyworded index projection syntax

Demonstrate the preferred syntax of index projection in DMA documentation
and examples. A few `[i]?` cases are converted to demonstrate the new
variant.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Gary Guo <gary@garyguo.net>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-4-6989470f5440@garyguo.net
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks ago rust: ptr: add panicking index projection variant
Gary Guo [Tue, 2 Jun 2026 14:17:54 +0000 (15:17 +0100)] 
rust: ptr: add panicking index projection variant

There have been a few cases where the programmer knows that the indices are
in bounds but the compiler cannot deduce that. This is also
compiler-version-dependent, so using build indexing here can be
problematic. On the other hand, it is also not ideal to use the fallible
variant, as it adds an error handling path that is never hit.

Add a new panicking index projection for this scenario. Like all panicking
operations, this should be used carefully only in cases where the user
knows the index is going to be in bounds, and panicking would indicate
something is catastrophically wrong.

To signify this, require users to explicitly denote the type of index being
used. The existing two types of index projections also gain the keyworded
version, which will be the recommended way going forward.

The keyworded syntax also paves the way for adding more flavors in the
future, e.g. `unsafe` index projection. However, unless the code is
extremely performance sensitive and bounds checking cannot be tolerated,
the panicking variant is safer and should be preferred, so the `unsafe`
variant is left for the future, when demand arises.
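
As a rough analogy only (this is plain C, not the kernel's Rust projection
API), the three index flavors discussed above can be sketched as follows.
All names here (try_index, panic_index, BUILD_INDEX, BUF_LEN) are
hypothetical and chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BUF_LEN 4

/* Fallible flavor: out-of-range indices yield NULL, so every caller
 * carries an error path even when it can never be taken. */
static int *try_index(int *buf, size_t len, size_t i)
{
    return i < len ? &buf[i] : NULL;
}

/* Panicking flavor: the caller asserts the index is in bounds; a
 * violation means something is badly wrong, so abort. */
static int *panic_index(int *buf, size_t len, size_t i)
{
    if (i >= len)
        abort();
    return &buf[i];
}

/* Build-time flavor: only accepts a constant index and refuses to
 * compile when it is out of range. The sizeof(...) * 0 term exists
 * solely to host the static assertion inside an expression. */
#define BUILD_INDEX(buf, i)                                              \
    (&(buf)[sizeof(struct {                                              \
                _Static_assert((i) < BUF_LEN, "index out of bounds");    \
                char c;                                                  \
            }) * 0 + (i)])
```

The build-time flavor costs nothing at run time but only works when the
compiler can prove the bound, which is the limitation that motivates the
panicking flavor.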

Signed-off-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-3-6989470f5440@garyguo.net
[ Fixed broken intra-doc link. Added a few extra intra-doc links. Reworded
  some docs slightly. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks ago rust: ptr: use `match` instead of `unwrap_or_else` for `build_index`
Gary Guo [Tue, 2 Jun 2026 14:17:53 +0000 (15:17 +0100)] 
rust: ptr: use `match` instead of `unwrap_or_else` for `build_index`

Use `match` to avoid potential inlining issues of the `unwrap_or_else`
function.

Suggested-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/rust-for-linux/aeCKlut-88SbNsyW@google.com/
Signed-off-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-2-6989470f5440@garyguo.net
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks ago rust: ptr: rename `ProjectIndex::index` to `build_index`
Gary Guo [Tue, 2 Jun 2026 14:17:52 +0000 (15:17 +0100)] 
rust: ptr: rename `ProjectIndex::index` to `build_index`

The corresponding `SliceIndex` trait in Rust uses `index` to mean the
panicking variant, which is also being added to `ProjectIndex`. Hence
rename our custom `build_error!` index variant to `build_index`.

Suggested-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://lore.kernel.org/rust-for-linux/DI5LLN2V3XCS.34H4CG99N4MPA@nvidia.com
Signed-off-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-1-6989470f5440@garyguo.net
[ Reworded docs slightly. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks ago ALSA: seq: dummy: fix UMP event stack overread
Kyle Zeng [Fri, 5 Jun 2026 08:02:04 +0000 (01:02 -0700)] 
ALSA: seq: dummy: fix UMP event stack overread

The dummy sequencer port forwards events by copying an incoming
struct snd_seq_event into a stack temporary, rewriting source and
destination, and dispatching the temporary to subscribers. That legacy
event storage is smaller than struct snd_seq_ump_event.

When a UMP event reaches the dummy client, the copy leaves the UMP flag
set but only provides legacy-sized stack storage. The subscriber
delivery path then uses snd_seq_event_packet_size() and copies a
UMP-sized packet from that stack object, reading past the end of the
temporary.

Use the existing union __snd_seq_event storage and copy the packet size
reported for the incoming event before rewriting the common routing
fields. This preserves the full UMP packet for UMP events while keeping
legacy event handling unchanged.
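
A minimal userspace sketch of the idea, assuming simplified stand-in
types: struct legacy_event, struct ump_event, union event_storage,
EV_FLAG_UMP and both helpers below are hypothetical and are not the ALSA
definitions. The temporary is sized for the largest event, and the copy
length follows the incoming event's own packet size:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EV_FLAG_UMP 0x01    /* hypothetical "this is a UMP event" bit */

struct legacy_event {       /* stand-in for struct snd_seq_event */
    unsigned char flags;
    unsigned char source;
    unsigned char dest;
    unsigned char data[12];
};

struct ump_event {          /* stand-in for struct snd_seq_ump_event */
    unsigned char flags;
    unsigned char source;
    unsigned char dest;
    unsigned char ump[16];  /* larger payload than the legacy event */
};

union event_storage {       /* stand-in for union __snd_seq_event */
    struct legacy_event legacy;
    struct ump_event ump;
};

/* Packet size depends on the flag stored in the first byte. */
static size_t event_packet_size(const void *ev)
{
    const unsigned char *p = ev;

    return (p[0] & EV_FLAG_UMP) ? sizeof(struct ump_event)
                                : sizeof(struct legacy_event);
}

/* Copy the whole packet first, then rewrite the common routing fields. */
static void forward_event(const void *in, union event_storage *tmp,
                          unsigned char src, unsigned char dst)
{
    memcpy(tmp, in, event_packet_size(in));
    tmp->legacy.source = src;
    tmp->legacy.dest = dst;
}
```

Because the union is at least as large as either event type, a later
reader that trusts the UMP flag never reads past the temporary.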

Fixes: 32cb23a0f911 ("ALSA: seq: dummy: Allow UMP conversion")
Signed-off-by: Kyle Zeng <kylebot@openai.com>
Link: https://patch.msgid.link/20260605080204.32045-1-kylebot@openai.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
3 weeks ago Merge patch series "proc: protect ptrace_may_access() with exec_update_lock"
Christian Brauner [Fri, 22 May 2026 11:49:13 +0000 (13:49 +0200)] 
Merge patch series "proc: protect ptrace_may_access() with exec_update_lock"

Jann Horn <jannh@google.com> says:

My understanding is that procfs is effectively maintained by the VFS
maintainers (though scripts/get_maintainer.pl claims that there are
no maintainers for procfs because the VFS entry only claims files
directly in fs/, and the procfs entry has no maintainers listed on
it).

In procfs, most uses of ptrace_may_access() should use
exec_update_lock to avoid TOCTOU issues with concurrent privileged
execve() (like setuid binary execution).

This series doesn't fix all the remaining issues in procfs, but it fixes
the easy cases for now; I will probably follow up with fixes for the
gnarlier cases later unless someone else wants to do that.

I have checked that procfs files still work with these changes and that
CONFIG_PROVE_LOCKING=y doesn't generate any warnings.

(checkpatch complains about missing argument names in
proc_op::proc_get_link, but that was already the case before my patch.)

* patches from https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-0-5c3d20e0ac33@google.com:
  proc: protect ptrace_may_access() with exec_update_lock (FD links)
  proc: protect ptrace_may_access() with exec_update_lock (part 1)

Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-0-5c3d20e0ac33@google.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
3 weeks ago proc: protect ptrace_may_access() with exec_update_lock (FD links)
Jann Horn [Mon, 18 May 2026 16:35:16 +0000 (18:35 +0200)] 
proc: protect ptrace_may_access() with exec_update_lock (FD links)

proc_pid_get_link() and proc_pid_readlink() currently look up the task from
the pid once, then do the ptrace access check on that task, then look up
the task from the pid a second time to do the actual access.
That's racy in several ways.

To fix it, pass the task to the ->proc_get_link() handler, and instead of
proc_fd_access_allowed(), introduce a new helper call_proc_get_link() that
looks up and locks the task, does the access check, and calls
->proc_get_link().

Fixes: 778c1144771f ("[PATCH] proc: Use sane permission checks on the /proc/<pid>/fd/ symlinks")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-2-5c3d20e0ac33@google.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
3 weeks ago proc: protect ptrace_may_access() with exec_update_lock (part 1)
Jann Horn [Mon, 18 May 2026 16:35:15 +0000 (18:35 +0200)] 
proc: protect ptrace_may_access() with exec_update_lock (part 1)

Fix the easy cases where procfs currently calls ptrace_may_access() without
exec_update_lock protection, where the fix is to simply add the extra lock
or use mm_access():

 - do_task_stat(): grab exec_update_lock
 - proc_pid_wchan(): grab exec_update_lock
 - proc_map_files_lookup(): use mm_access() instead of get_task_mm()
 - proc_map_files_readdir(): use mm_access() instead of get_task_mm()
 - proc_ns_get_link(): grab exec_update_lock
 - proc_ns_readlink(): grab exec_update_lock

Fixes: f83ce3e6b02d ("proc: avoid information leaks to non-privileged processes")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-1-5c3d20e0ac33@google.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
3 weeks ago wifi: nl80211: Increase ie_len size to prevent truncated IEs in new peer notifications
Thiyagarajan Pandiyan [Fri, 5 Jun 2026 05:43:07 +0000 (11:13 +0530)] 
wifi: nl80211: Increase ie_len size to prevent truncated IEs in new peer notifications

Currently, ie_len in cfg80211_notify_new_peer_candidate is defined as a
1-byte field, capping the maximum IE list size at 255 bytes. When a
large beacon is received, the IE list is truncated, passing incomplete
data to wpa_supplicant. This causes the supplicant to fail parsing the IEs.

Increase the size of ie_len to allow the full length of the IE list to
be forwarded properly.
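
To illustrate the failure mode, here is a small sketch of what happens when
a length above 255 is passed through a 1-byte field. The function names are
hypothetical, and size_t as the widened type is an assumption; the commit
text does not say which type was chosen:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow field: the conversion to uint8_t keeps only the low 8 bits. */
static size_t forwarded_len_u8(size_t ie_len)
{
    uint8_t stored = (uint8_t)ie_len;

    return stored;
}

/* Wide field: the full length survives. */
static size_t forwarded_len_wide(size_t ie_len)
{
    size_t stored = ie_len;

    return stored;
}
```

With a 300-byte IE list, the narrow variant reports only 44 bytes
(300 modulo 256), so the consumer sees an IE list cut off in the middle.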

Signed-off-by: Thiyagarajan Pandiyan <thiyagarajan@aerlync.com>
Link: https://patch.msgid.link/20260605054307.427874-1-thiyagarajan@aerlync.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
3 weeks ago wifi: mac80211: fold tid_ampdu_rx allocations into a flexible array
Rosen Penev [Fri, 5 Jun 2026 00:56:27 +0000 (17:56 -0700)] 
wifi: mac80211: fold tid_ampdu_rx allocations into a flexible array

Convert the separately-allocated reorder_buf pointer to a C99 flexible
array member at the end of struct tid_ampdu_rx, with both the
sk_buff_head and the jiffies timestamp in each array element. This
collapses three allocations into one and removes the corresponding
kfree() pairs from the error and free paths.
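
The pattern can be sketched in userspace C as below. The types and the
allocator are hypothetical stand-ins (struct rx_session for struct
tid_ampdu_rx, calloc() where kernel code would use its own allocation
helpers):

```c
#include <assert.h>
#include <stdlib.h>

struct reorder_slot {               /* one element per reorder position */
    void *frames;                   /* stand-in for struct sk_buff_head */
    unsigned long time;             /* stand-in for the jiffies timestamp */
};

struct rx_session {                 /* stand-in for struct tid_ampdu_rx */
    unsigned int buf_size;
    struct reorder_slot reorder[];  /* C99 flexible array member, must be last */
};

/* One zeroed allocation covers the header and all buf_size slots. */
static struct rx_session *rx_session_alloc(unsigned int buf_size)
{
    struct rx_session *s;

    s = calloc(1, sizeof(*s) + (size_t)buf_size * sizeof(s->reorder[0]));
    if (!s)
        return NULL;
    s->buf_size = buf_size;
    return s;
}
```

A single free() then releases everything, which is why the separate
cleanup calls in the error and free paths become unnecessary.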

Assisted-by: opencode:big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260605005627.317194-1-rosenp@gmail.com
[fix kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
3 weeks ago wifi: mac80211: Fix -Wc23-extensions in hwmp_route_info_get()
Nathan Chancellor [Thu, 4 Jun 2026 22:40:41 +0000 (15:40 -0700)] 
wifi: mac80211: Fix -Wc23-extensions in hwmp_route_info_get()

When building with a version of clang that supports
'-fms-anonymous-structs' (which will be used by the kernel instead of
the wider '-fms-extensions'), there are a couple of warnings after some
recent mesh_hwmp.c changes:

  net/mac80211/mesh_hwmp.c:373:3: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
    373 |                 struct ieee80211_mesh_hwmp_preq_top *preq_elem_top =
        |                 ^
  net/mac80211/mesh_hwmp.c:390:3: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
    390 |                 struct ieee80211_mesh_hwmp_prep_top *prep_elem_top =
        |                 ^
  2 errors generated.

Enclose the switch case blocks in braces to clear up the warning.
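
A minimal sketch of the pattern (the enum, the function and the returned
values are hypothetical): a declaration directly after a case label is
only valid from C23 on, while wrapping the case body in braces makes the
label precede a compound statement, which every C version accepts:

```c
#include <assert.h>

enum frame_kind { FRAME_PREQ, FRAME_PREP };

static int frame_code(enum frame_kind kind)
{
    switch (kind) {
    case FRAME_PREQ: {          /* braces: label is followed by a block */
        int code = 1;           /* declaration now sits inside the block */

        return code;
    }
    case FRAME_PREP: {
        int code = 2;

        return code;
    }
    }
    return -1;                  /* unknown kind */
}
```

The braces also give each case its own scope, so both cases can reuse
the same local variable name.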

Fixes: a91c65cb99d1 ("wifi: mac80211: Use struct instead of macro for PREP frame")
Fixes: 4ac20bd40b7d ("wifi: mac80211: Use struct instead of macro for PREQ frame")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260604-mac80211-mesh_hwmp-fix-c23-extensions-v1-1-25a64d6ce541@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
3 weeks ago media: v4l2-fwnode: Fix subdev owner overwritten in v4l2_async_register_subdev_sensor()
Mirela Rabulea [Fri, 22 May 2026 14:31:20 +0000 (17:31 +0300)] 
media: v4l2-fwnode: Fix subdev owner overwritten in v4l2_async_register_subdev_sensor()

The v4l2 helper v4l2_async_register_subdev_sensor() calls
v4l2_async_register_subdev(), which is a macro that expands to
__v4l2_async_register_subdev(sd, THIS_MODULE). Since the macro is expanded
inside v4l2-fwnode.c, THIS_MODULE resolves to the v4l2-fwnode module
rather than the sensor driver module that originally set sd->owner. When
v4l2-fwnode is built-in, THIS_MODULE evaluates to NULL, which then
overwrites the sensor driver's owner with NULL.

As a result, the sensor module's reference count is never incremented
during async registration, so the module can be removed while the
subdevice is still in use by a notifier (e.g., a CSI-2 receiver
bridge driver).

Fix this by renaming v4l2_async_register_subdev_sensor() to
__v4l2_async_register_subdev_sensor() with an added explicit module
argument and introducing a wrapper macro:
    #define v4l2_async_register_subdev_sensor(sd) \
        __v4l2_async_register_subdev_sensor(sd, THIS_MODULE)

This ensures the sensor driver module is properly referenced even when
the sensor driver does not init the owner field before calling
v4l2_async_register_subdev_sensor() and prevents premature module removal.
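
The underlying mechanism is that a macro is evaluated where it is
expanded. The userspace sketch below uses __func__ as a stand-in for
THIS_MODULE, since both depend on the expansion site; every name in it is
hypothetical and none of it is V4L2 code:

```c
#include <assert.h>
#include <string.h>

struct subdev {
    const char *owner;
};

static int register_subdev_owner(struct subdev *sd, const char *owner)
{
    sd->owner = owner;
    return 0;
}

/* The macro captures the identity of whoever expands it. */
#define register_subdev(sd) register_subdev_owner((sd), __func__)

/* Buggy helper: a real function, so the macro expands inside the helper
 * and records the helper, not the driver that called it. */
static int register_sensor_buggy(struct subdev *sd)
{
    return register_subdev(sd);
}

/* Fixed helper: takes the owner explicitly, and a wrapper macro captures
 * it at the driver's call site. */
static int register_sensor_owner(struct subdev *sd, const char *owner)
{
    return register_subdev_owner(sd, owner);
}
#define register_sensor(sd) register_sensor_owner((sd), __func__)

static const char *driver_probe_buggy(void)
{
    struct subdev sd = { 0 };

    register_sensor_buggy(&sd);
    return sd.owner;
}

static const char *driver_probe_fixed(void)
{
    struct subdev sd = { 0 };

    register_sensor(&sd);
    return sd.owner;
}
```

In the buggy variant the recorded owner is the helper; in the fixed
variant it is the calling driver function, mirroring the THIS_MODULE fix.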

Fixes: aef69d54755d ("media: v4l: fwnode: Add a convenience function for registering sensors")
Cc: stable@vger.kernel.org
Suggested-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/linux-media/20240315073125.275501-2-sakari.ailus@linux.intel.com/
Signed-off-by: Mirela Rabulea <mirela.rabulea@nxp.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
3 weeks ago batman-adv: fix kernel-doc typos and grammar errors
Sven Eckelmann [Thu, 4 Jun 2026 19:35:52 +0000 (21:35 +0200)] 
batman-adv: fix kernel-doc typos and grammar errors

Various minor errors have accumulated over time in batman-adv's kernel-doc
comments. Get rid of many of them before they are copied (again) to new
functions.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: fix batadv_v_ogm_packet_recv error handling kernel-doc
Sven Eckelmann [Thu, 4 Jun 2026 20:00:46 +0000 (22:00 +0200)] 
batman-adv: fix batadv_v_ogm_packet_recv error handling kernel-doc

All receive handlers in batman-adv consume the skbuff regardless of
the result of the handler. The "(without freeing the skb) on failure" remark
is therefore no longer correct for the current implementation.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: uapi: keep kernel-doc in struct member order
Sven Eckelmann [Thu, 4 Jun 2026 19:58:31 +0000 (21:58 +0200)] 
batman-adv: uapi: keep kernel-doc in struct member order

The order of the members of struct batadv_coded_packet and struct
batadv_unicast_tvlv_packet didn't match the order in their kernel-doc. All
other structures keep the two in the same order, so do the same for these
two.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: bla: update stale kernel-doc
Sven Eckelmann [Thu, 4 Jun 2026 19:35:04 +0000 (21:35 +0200)] 
batman-adv: bla: update stale kernel-doc

The bridge-loop-avoidance code was changed recently to avoid inconsistent
state and race condition problems. The kernel-doc added in these commits
(and related code) has various minor deficiencies, which are now resolved.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: tp_meter: update stale kernel-doc after refactoring
Sven Eckelmann [Thu, 4 Jun 2026 09:09:04 +0000 (11:09 +0200)] 
batman-adv: tp_meter: update stale kernel-doc after refactoring

The tp_meter codebase was recently refactored:

* throughput meter sender and receiver variables were split into
  two different structures
* the congestion control variables were extracted in a separate structure

But the kernel-doc was not updated everywhere to reflect these changes.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: correct batadv_wifi_* kernel-doc
Sven Eckelmann [Tue, 2 Jun 2026 15:39:16 +0000 (17:39 +0200)] 
batman-adv: correct batadv_wifi_* kernel-doc

The original kernel documentation for the batadv_wifi_* functions contained
copy+paste errors. Correct them to make the documentation easier to
understand.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: document cleanup of batadv_wifi_net_devices entries
Sven Eckelmann [Wed, 3 Jun 2026 08:47:53 +0000 (10:47 +0200)] 
batman-adv: document cleanup of batadv_wifi_net_devices entries

It is not obvious how the entries from the batadv_wifi_net_devices
rhashtable get removed before the actual rhashtable is destroyed.
Document the idea behind the process and which steps are involved.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: use GFP_KERNEL allocations for the wifi detection cache
Sven Eckelmann [Tue, 2 Jun 2026 15:46:10 +0000 (17:46 +0200)] 
batman-adv: use GFP_KERNEL allocations for the wifi detection cache

batadv_wifi_net_device_insert() is called with RTNL held (see its
ASSERT_RTNL()), but not inside a spinlock or another context which prevents
"might_sleep" functions. To relax the requirements for the allocator, use
GFP_KERNEL.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: drop duplicated wifi_flags assignments
Sven Eckelmann [Tue, 2 Jun 2026 15:41:15 +0000 (17:41 +0200)] 
batman-adv: drop duplicated wifi_flags assignments

During the initialization of the batadv_wifi_net_device_state, it is enough
to write the wifi_flags once before the batadv_wifi_net_device_state is
added to the batadv_wifi_net_devices rhashtable.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: convert cancellation of work items to disable helper
Sven Eckelmann [Tue, 26 May 2026 07:09:32 +0000 (09:09 +0200)] 
batman-adv: convert cancellation of work items to disable helper

With commit 86898fa6b8cd ("workqueue: Implement disable/enable for
(delayed) work items"), work queues gained the ability to permanently
disallow re-queuing of work items. This is particularly important during
object teardown, where a work item must not be re-armed after shutdown
begins.

Convert all cancel_work_sync() and cancel_delayed_work_sync() call sites to
their disable_* equivalents to clarify the intent to prevent re-arming
after teardown.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago batman-adv: tp_meter: initialize last_recv_time during init
Sven Eckelmann [Thu, 4 Jun 2026 08:58:51 +0000 (10:58 +0200)] 
batman-adv: tp_meter: initialize last_recv_time during init

The last_recv_time is the most important indicator for a receiver session
to figure out whether a session timed out or not. But this information was
only initialized after the session was added to the tp_receiver_list and
after the timer was started.

In the worst case, the timer (function) could have tried to access this
information before the actual initialization was reached. Like the rest of
the variables of the tp_meter receiver session, this field has to be filled
out before any other (parallel running) context has the chance to access it.

Cc: stable@kernel.org
Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
3 weeks ago simple_lookup(): use d_splice_alias() for ->lookup() return value
Al Viro [Sat, 9 May 2026 16:29:41 +0000 (12:29 -0400)] 
simple_lookup(): use d_splice_alias() for ->lookup() return value

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago ecryptfs: use d_splice_alias() for ->lookup() return value
Al Viro [Sat, 9 May 2026 16:28:48 +0000 (12:28 -0400)] 
ecryptfs: use d_splice_alias() for ->lookup() return value

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago configfs_lookup(): switch to d_splice_alias()
Al Viro [Fri, 8 May 2026 21:58:35 +0000 (17:58 -0400)] 
configfs_lookup(): switch to d_splice_alias()

more idiomatic

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago tracefs: use d_splice_alias() in ->lookup() instances
Al Viro [Tue, 27 Jan 2026 20:19:06 +0000 (15:19 -0500)] 
tracefs: use d_splice_alias() in ->lookup() instances

d_add() is not wrong there (inodes are freshly allocated), but
d_splice_alias() is more idiomatic.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago make cursors NORCU
Al Viro [Tue, 5 May 2026 04:20:19 +0000 (00:20 -0400)] 
make cursors NORCU

All it requires is making sure that d_walk() will skip *all*
CURSOR dentries, even if somebody passes it one as an argument.

Cursors are negative and unhashed all along, never get added to
LRU or to shrink lists and no RCU references via ->d_sib are
possible for those - dentry_unlist() makes sure that no killed
dentry has ->d_sib.next left pointing to a cursor.

Seeing that a cursor is allocated every time we open a directory
on autofs, debugfs, devpts, etc., avoiding an RCU delay when such
opened files get closed is attractive...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago nfs: get rid of fake root dentries
Al Viro [Wed, 15 Apr 2026 23:29:53 +0000 (19:29 -0400)] 
nfs: get rid of fake root dentries

... just grab the reference to the (real) root we are about to return
for the first mount of this superblock and be done with that.

Once upon a time dentry tree eviction at fs shutdown used to break
if ->s_root had been spliced on top of something; that hadn't been
the case for years now, and these fake root dentries violate a bunch
of invariants.  Let's get rid of them...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago wind ->s_roots via ->d_sib instead of ->d_hash
Al Viro [Sat, 18 Apr 2026 22:39:03 +0000 (18:39 -0400)] 
wind ->s_roots via ->d_sib instead of ->d_hash

shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads' shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of the shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.

That relies upon dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list.  Unfortunately, the secondary roots are linked
via ->d_hash, rather than ->d_sib and they become removed from
that list before their inode references are dropped.

If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and gets to that point in dentry_kill()
when shrink_dcache_for_umount() is looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.

Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
	lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.

AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use d_add_waiter() machinery to wait for eviction in progress.

Changes:
* secondary roots look the same as ->s_root from d_unhashed()
  and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib"
  instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock
  spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
  non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating
  a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and
  removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and
  removes it from the list; no need to play the games for d_walk() sake,
  since the latter is not going to look for the next sibling of those
  anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to
  the list of secondary roots and if it runs into one that is in the middle
  of eviction waits for that to finish.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago shrink_dentry_tree(): unify the calls of shrink_dentry_list()
Al Viro [Thu, 16 Apr 2026 15:50:41 +0000 (11:50 -0400)] 
shrink_dentry_tree(): unify the calls of shrink_dentry_list()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago shrinking rcu_read_lock() scope in d_alloc_parallel()
Al Viro [Thu, 23 Apr 2026 18:29:18 +0000 (19:29 +0100)] 
shrinking rcu_read_lock() scope in d_alloc_parallel()

The current use of rcu_read_lock() in d_alloc_parallel()
is fairly opaque - the single large scope serves two purposes.

We start with lookup in normal hash, and there rcu_read_lock()
scope puts __d_lookup_rcu() and subsequent lockref_get_not_dead() into
the same RCU read-side critical area.

If no match is found, we proceed to lock the hash chain of
in-lookup hash and scan that for a match.  If we find a match, we want
to grab it and wait for lookup in progress to finish.  Since the bitlock
we use for these hash chains has to nest inside ->d_lock, we need to
unlock the chain first and use lockref_get_not_dead() on the match.
That has to be done without breaking the RCU read-side critical area,
and we use the same rcu_read_lock() scope to bridge over.

The thing is, after having grabbed the reference (and it is
very unlikely to fail) we proceed to grab ->d_lock - d_wait_lookup()
and __d_lookup_unhash()/__d_wake_in_lookup_waiters() are using that for
serialization. That makes lockref_get_not_dead() pointless - trying
to avoid grabbing ->d_lock for refcount increment, only to grab it
anyway immediately after that. If we grab ->d_lock first and replace
lockref_get_not_dead() with direct check for sign and increment if
non-negative, we can move rcu_read_unlock() to immediately after grabbing
->d_lock.  Moreover, we don't need the RCU read-side critical area to
be contiguous all the way back to the earlier __d_lookup_rcu() - we can
just as well terminate the earlier one ASAP and call rcu_read_lock() again
only after having found a match (if any) in the in-lookup hash chain.

That makes the entire thing easier to follow and the purpose
of those rcu_read_lock() calls easier to describe - the first scope is
for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges
over from the bitlock scope to the ->d_lock scope on the match found in
in-lookup hash.
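
The "direct check for sign and increment if non-negative" step can be
sketched as below. This is a simplified userspace model: struct counted
and get_ref_locked() are hypothetical, a negative count stands for a dead
object, and the caller is assumed to already hold the lock that protects
the count (->d_lock in the text above):

```c
#include <assert.h>
#include <stdbool.h>

struct counted {
    int count;      /* negative means the object is already dead */
};

/* Caller must hold the lock protecting obj->count. With that lock held,
 * a plain test-and-increment is enough; no lockless retry loop is needed. */
static bool get_ref_locked(struct counted *obj)
{
    if (obj->count < 0)
        return false;   /* dead: do not resurrect it */
    obj->count++;
    return true;
}
```

Since the lock is needed right afterwards anyway, taking it first makes
the lockless reference grab redundant.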

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago d_walk(): shrink rcu_read_lock() scope
Al Viro [Tue, 21 Apr 2026 19:52:13 +0000 (15:52 -0400)] 
d_walk(): shrink rcu_read_lock() scope

we only need it to bridge over from ->d_lock scope of child to ->d_lock
scope of parent; dropping ->d_lock at rename_retry doesn't need to be
in rcu_read_lock() scope.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago document dentry_kill()
Al Viro [Sat, 11 Apr 2026 20:17:19 +0000 (16:17 -0400)] 
document dentry_kill()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
Al Viro [Sat, 11 Apr 2026 08:17:02 +0000 (04:17 -0400)] 
adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()

Pull dropping ->d_lock on lock_for_kill() failure into lock_for_kill() itself.
That reduces dentry_kill() to
	if (!lock_for_kill(dentry))
		return NULL;
	return __dentry_kill(dentry);
at which point it's easier to move that if (...) into the beginning of __dentry_kill()
itself and rename it into dentry_kill().

Document the new calling conventions of lock_for_kill().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago Document rcu_read_lock() use in select_collect2()
Al Viro [Sat, 11 Apr 2026 08:01:28 +0000 (04:01 -0400)] 
Document rcu_read_lock() use in select_collect2()

If select_collect2() finds something that is neither busy nor can
be moved to shrink list, it needs to return that to caller's caller
(shrink_dcache_tree()) ASAP and do so without grabbing references (among
other things, it might be already dying, in which case refcount can't be
incremented).  We are called inside a ->d_lock scope, but that scope is
going to be terminated as soon as we return to caller (d_walk()); ->d_lock
will be retaken by shrink_dcache_tree(), but we need to bridge between
these scopes, turning them into contiguous RCU read-side critical area.

We do that with rcu_read_lock() scope - it spans from unbalanced
rcu_read_lock() in select_collect2() to unbalanced rcu_read_unlock()
in shrink_dcache_tree().  That works, but it really needs to be documented;
it's rather unidiomatic and it had caused quite a bit of confusion - some
of it in form of patches "fixing" the damn thing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago Shift rcu_read_{,un}lock() inside fast_dput()
Al Viro [Sat, 11 Apr 2026 07:56:42 +0000 (03:56 -0400)] 
Shift rcu_read_{,un}lock() inside fast_dput()

Shrink rcu_read_lock() scopes surrounding fast_dput() calls.
Both callers are immediately preceded and followed by
rcu_read_lock()/rcu_read_unlock() resp.  Shrink that down
into fast_dput() itself; in case when fast_dput() ends up
grabbing ->d_lock, we can pull rcu_read_unlock() up to
right after spin_lock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks ago simplify safety for lock_for_kill() slowpath
Al Viro [Sat, 11 Apr 2026 07:24:28 +0000 (03:24 -0400)] 
simplify safety for lock_for_kill() slowpath

rcu_read_lock() scopes in dentry eviction machinery are too wide
and badly structured; we end up with too many of those, quite
a few essentially identical.  Worse, quite a few of the functions
involved are not neutral wrt that, making them harder to reason about.

rcu_read_lock() scope is not the only thing establishing an
RCU read-side critical area - spin_lock scope does the same and
they can be mixed - the sequence
	rcu_read_lock()
	...
	spin_lock()
	...
	rcu_read_unlock()
	...
	rcu_read_lock()
	...
	spin_unlock()
	...
	rcu_read_unlock()
is an unbroken RCU read-side critical area.

That observation allows us to simplify things.  First of all,
lock_for_kill() relies upon being in an unbroken RCU read-side
critical area.  It's always called with ->d_lock held, and normally
returns without having ever dropped that spinlock.  We would not
need rcu_read_lock() at all, if not for the slow path - if trylock
of inode->i_lock fails, we need to drop and retake ->d_lock.

Having all calls of lock_for_kill() inside an rcu_read_lock() scope
takes care of that, but to show that lock_for_kill() slow path is safe,
we need to demonstrate such rcu_read_lock() scope for any call chain
leading to lock_for_kill().  Which is not fun, seeing that there are
10 such scopes, with 5 distinct beginnings between them.

Case 1: opens in dput(), proceeds through fast_dput() grabbing ->d_lock,
returning false into dput() and there a call of finish_dput() which calls
dentry_kill(), which calls lock_for_kill(); ends in dentry_kill(), either
right after lock_for_kill() success or right after dropping ->d_lock
on lock_for_kill() failure.  ->d_lock is held continuously all the way
into lock_for_kill().

Case 2: opens in finish_dput(), where we proceed to the same call of
dentry_kill() as in case 1.  ->d_lock is held since before the
beginning of the scope and all the way into lock_for_kill().

Case 3: opens in select_collect2(), proceeds through the return to
d_walk() and to shrink_dcache_tree() where we grab ->d_lock and
proceed to call shrink_kill(), which calls dentry_kill(), then as
in the previous scopes.

Case 4: opens in shrink_dentry_list(), followed by call of shrink_kill(),
then same as in case 3.  ->d_lock is held since before the beginning
of the scope and all the way into lock_for_kill().

Case 5: opens in shrink_kill(), where it's immediately followed by
call of dentry_kill(), then same as in the previous scopes.  ->d_lock
is held since before the beginning of the scope all the way into
lock_for_kill().

Note that in cases 2, 4 and 5 the slow path of lock_for_kill() is the
only part of rcu_read_lock() scope that is not covered by spinlock
scopes.  In case 1 we have the area in fast_dput() as well and in
case 3 - the return path from select_collect2() and chunk in shrink_dcache_tree()
up to grabbing ->d_lock.

Seeing that the reasons we need rcu_read_lock() in these additional
areas are completely unrelated to lock_for_kill() slow path, the things
get much more straightforward with
* explicit rcu_read_lock() scope surrounding the area in slow path
of lock_for_kill() where ->d_lock is not held
* shrink_dentry_list() dropping rcu_read_lock() as soon as it has
grabbed ->d_lock.
* dput() dropping rcu_read_lock() just before calling finish_dput().
* rcu_read_lock() calls in finish_dput(), shrink_kill() and
shrink_dentry_list() are removed, along with rcu_read_unlock() calls in
dentry_kill().

RCU read-side critical areas are unchanged by that, safety of lock_for_kill()
slow path is trivial to verify and a bunch of rcu_read_lock() scopes are either
gone or have become easier to describe.

Update the comments on locking conventions and memory safety considerations,
including the NORCU case.

Incidentally, all calls of fast_dput() are immediately preceded by rcu_read_lock()
and followed by rcu_read_unlock() now, which will allow us to simplify those in
the next step...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agofold lock_for_kill() and __dentry_kill() into common helper
Al Viro [Sat, 11 Apr 2026 07:14:19 +0000 (03:14 -0400)] 
fold lock_for_kill() and __dentry_kill() into common helper

There are two callers of lock_for_kill() and both are followed
by the same sequence of actions:
* in case of failure, drop ->d_lock, do rcu_read_unlock() and
go away
* in case of success, do rcu_read_unlock() followed by
passing dentry to __dentry_kill(); if the latter returns NULL, go away.

All calls of __dentry_kill() are paired with lock_for_kill() now;
let's turn that sequence into a new helper (dentry_kill()) and switch
to using it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agofold lock_for_kill() into shrink_kill()
Al Viro [Sat, 11 Apr 2026 07:06:39 +0000 (03:06 -0400)] 
fold lock_for_kill() into shrink_kill()

Both callers have the exact same shape.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agoshrink_dentry_list(): start with removing from shrink list
Al Viro [Sat, 11 Apr 2026 05:52:53 +0000 (01:52 -0400)] 
shrink_dentry_list(): start with removing from shrink list

Currently we leave dentry on the list until we are done with
lock_for_kill().  That guarantees that it won't have been
even scheduled for removal until we remove it from the list
and drop ->d_lock.  We grab ->d_lock and rcu_read_lock()
and call lock_for_kill().  There are four possible cases:
1) lock_for_kill() has succeeded; dentry and its inode
(if any) are locked, dentry refcount is zero and we can
remove it from shrink list and feed it to shrink_kill().
2) lock_for_kill() fails since dentry has become busy.
Nothing to do, rcu_read_unlock(), remove from shrink list,
drop ->d_lock and move on.
3) lock_for_kill() fails since dentry is currently
being killed - already entered __dentry_kill(), but hasn't
reached dentry_unlist() yet.  Nothing to do, we should just
do rcu_read_unlock(), remove from shrink list so that
whoever's executing __dentry_kill() would free it once they
are done, drop ->d_lock and move on - same actions as in
case (2).
4) lock_for_kill() fails since dentry has been killed
(reached dentry_unlist(), DCACHE_DENTRY_KILLED set in ->d_flags).
In that case whoever had been killing it had already seen it
on our shrink list and skipped freeing it.  At that point it's
just a passive chunk of memory; rcu_read_unlock(), remove from
the list, drop ->d_lock and use dentry_free() to schedule
freeing.

While that works, there's a simpler way to do it:
* grab ->d_lock
* remove dentry from our shrink list
* if DCACHE_DENTRY_KILLED is already set, drop ->d_lock,
call dentry_free() and move on.
* otherwise grab rcu_read_lock() and call lock_for_kill()
* if lock_for_kill() succeeds, feed dentry
to shrink_kill(), otherwise drop the locks and move on.

The end result is equivalent to the old variant.  The only difference
arises if at the time we grab ->d_lock dentry had refcount 0 and
lock_for_kill() had failed spin_trylock() and had to drop and regain
->d_lock.  Otherwise nobody can observe at which point within the
unbroken ->d_lock scope dentry had been removed from the shrink list -
all accesses to ->d_lru are under ->d_lock.

If ->d_lock had been dropped and regained, it is possible for another
thread to feed that dentry to __dentry_kill(); if it doesn't get to
dentry_unlist() before we regain ->d_lock, behaviour is still identical -
it's case (3) and by the time __dentry_kill() would've gotten around
to checking if the victim is on shrink list, it would've been already
removed from ours.

If __dentry_kill() from another thread *does* get to dentry_unlist(),
in the old variant we would have __dentry_kill() leave calling
dentry_free() to us and in the new one __dentry_kill() would've called
dentry_free() itself.  Since we are under rcu_read_lock(), we are
guaranteed that actual freeing won't happen until we get around to
rcu_read_unlock().  IOW, the new variant is still safe wrt UAF, if
not for the same reason as the old one, and overall result is the same;
the only difference is which thread ends up scheduling the actual
freeing of dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agod_prune_aliases(): make sure to skip NORCU aliases
Al Viro [Mon, 4 May 2026 06:49:20 +0000 (02:49 -0400)] 
d_prune_aliases(): make sure to skip NORCU aliases

Either they are busy (in which case they won't be moved to shrink
list anyway) or they have a zero refcount, in which case we really
shouldn't mess with them - whoever had dropped the refcount to
zero is on the way to evicting and freeing them.

That way we are guaranteed that only the thread that has dropped
refcount of NORCU dentry to zero might call lock_for_kill() and
__dentry_kill() for those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agokill d_dispose_if_unused()
Al Viro [Mon, 13 Apr 2026 03:39:16 +0000 (23:39 -0400)] 
kill d_dispose_if_unused()

Rename to_shrink_list() into __move_to_shrink_list(), document and
export it.  Switch d_dispose_if_unused() users to that and kill
d_dispose_if_unused() itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agomake to_shrink_list() return whether it has moved dentry to list
Al Viro [Sun, 12 Apr 2026 18:35:38 +0000 (14:35 -0400)] 
make to_shrink_list() return whether it has moved dentry to list

... and make it check the refcount for being zero in addition to
dentry not being on a shrink list already.  Simplifies the callers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agoselect_collect(): ignore dentries on shrink lists if they have positive refcounts
Al Viro [Sun, 12 Apr 2026 18:17:52 +0000 (14:17 -0400)] 
select_collect(): ignore dentries on shrink lists if they have positive refcounts

If all dentries we find have positive refcounts and some happen
to be on shrink lists, there's no point trying to steal them in the
select_collect2() phase - we won't be able to evict any of them.  Busy on
shrink lists is still busy...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agofind_acceptable_alias(): skip NORCU aliases with zero refcount
Al Viro [Mon, 4 May 2026 04:32:43 +0000 (00:32 -0400)] 
find_acceptable_alias(): skip NORCU aliases with zero refcount

similar to d_find_any_alias() situation

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agofix a race between d_find_any_alias() and final dput() of NORCU dentries
Al Viro [Mon, 4 May 2026 03:00:09 +0000 (23:00 -0400)] 
fix a race between d_find_any_alias() and final dput() of NORCU dentries

Refcount of a NORCU dentry must not be incremented after having dropped
to zero.  Otherwise we might end up with the following race:
CPU1: in fast_dput(d), rcu_read_lock();
CPU1: decrements refcount of d to 0
CPU1: notice that it's unhashed
CPU2: grab a reference to d
CPU2: dput(d), freeing d
CPU1: ... looks like we need to evict d, let's grab ->d_lock, recheck
      the refcount, etc.
and that spin_lock(&d->d_lock) ends up a UAF, despite still being in
an RCU read-side critical area started back when the refcount had been
positive.  If not for DCACHE_NORCU in d->d_flags freeing would've been
RCU-delayed, so we'd have grabbed ->d_lock, noticed the negative value
stored into refcount by __dentry_kill(), dropped the locks and that would
be it.  For NORCU dentries freeing is _not_ delayed, though.

Most of the non-counting references are excluded for NORCU dentries -
they are not allowed to be hashed, they never get placed on LRU, they
never get placed into anyone's list of children and while dput_to_list()
might put them into a shrink list, nobody bumps refcount of something
that had been reached that way.

However, inode's list of aliases can be a problem - it does not contribute
to dentry refcount (for obvious reasons) and we *do* have places that
grab references to something found on that list - that's precisely what
d_find_alias() is.  In case of d_find_alias() we are safe - it skips
unhashed aliases, so all NORCU ones are ignored there.  d_find_any_alias()
is *not* limited to hashed ones, though, and while it's usually called
for directories (which never get NORCU dentries), there are callers that
use it to get something for non-directories with no hashed aliases.

Having d_find_any_alias() hit a NORCU dentry is not impossible - it can
be easily arranged if you have CAP_DAC_READ_SEARCH (memfd_create() + mmap()
+ name_to_handle_at() for /proc/self/map_files/<...> + munmap() +
open_by_handle_at() will do that, and adding a second memfd_create() for
mount_fd makes it possible to do that without having memfd pinned).
The race window is narrow, and it's probably not feasible on bare hardware,
but...

It's not hard to fix, fortunately:
* separate __d_find_dir_alias() (== current __d_find_any_alias()) to
be used for directory inodes.
* provide dget_alias_ilocked() that would return false for NORCU
dentries with zero refcount and return true incrementing refcount otherwise
* make __d_find_any_alias() go over the list of aliases, using
dget_alias_ilocked() and returning the alias it succeeds on (normally the
first one).  Any NORCU alias with zero refcount is going to be evicted by
the thread that had dropped the final reference; this makes __d_find_any_alias()
pretend it had lost the race with eviction.
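
The refcount rule described for dget_alias_ilocked() can be sketched in
userspace as follows.  This is a simplified illustration, not the patch
itself: the toy_* names and TOY_NORCU are hypothetical stand-ins, and the
locking that the real helper relies on (->i_lock and ->d_lock) is left
out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NORCU 0x1u		/* stand-in for DCACHE_NORCU */

struct toy_dentry {
	unsigned int flags;
	int refcount;
};

/*
 * Refuse a NORCU dentry whose refcount has already dropped to zero;
 * otherwise take a reference and report success.
 */
static bool toy_dget_alias(struct toy_dentry *d)
{
	if ((d->flags & TOY_NORCU) && d->refcount == 0)
		return false;
	d->refcount++;
	return true;
}

/* Walk the aliases and return the first one we could grab, or NULL. */
static struct toy_dentry *toy_find_any_alias(struct toy_dentry *aliases,
					     size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (toy_dget_alias(&aliases[i]))
			return &aliases[i];
	return NULL;
}
```

A NORCU alias with zero refcount is skipped without being touched, so the
caller behaves as if it had lost the race with eviction.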

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agoalloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories
Al Viro [Mon, 27 Apr 2026 18:19:28 +0000 (14:19 -0400)] 
alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories

A lot of places rely upon directories never having NORCU dentries;
currently that property holds, but the proof is not straightforward
and rather brittle.

It's better to have that verified in the sole caller of d_alloc_pseudo(),
so that any future bugs in that direction were caught early.

That way we can be sure that
* current directory of any process is not NORCU
* root directory of any process is not NORCU
* starting point of any LOOKUP_RCU pathwalk is not NORCU
* dget_parent() can rely upon ->d_parent not being NORCU
* d_walk() and is_subdir() can rely upon the same
* alloc_file_pseudo() won't create multiple aliases for a directory
without having to go through a convoluted audit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agoVFS: use wait_var_event for waiting in d_alloc_parallel()
NeilBrown [Thu, 30 Apr 2026 19:42:43 +0000 (15:42 -0400)] 
VFS: use wait_var_event for waiting in d_alloc_parallel()

Parallel lookup starts with a call of d_alloc_parallel().  That primitive
either returns a matching hashed dentry or allocates a new one in the
in-lookup state and returns it to the caller.  Once the caller is done
with lookup, it indicates so either by call of d_{splice_alias,add}()
or by call of d_lookup_done(); at that point dentry leaves the in-lookup
state.

If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for
that dentry to leave the in-lookup state, one way or another.  Currently
that is done by supplying wait_queue_head to d_alloc_parallel().  If d_alloc_parallel()
creates a new in-lookup dentry, the address of that wait_queue_head is stored
in ->d_wait of new dentry and stays there while it's in the in-lookup state;
subsequent d_alloc_parallel() will wait on the queue found in the matching
in-lookup dentry.  Transition out of in-lookup state wakes waiters on that
queue (if any).

That works, but the calling conventions are inconvenient - the caller must
supply wait_queue_head and make sure that it survives at least until the new
in-lookup dentry leaves the in-lookup state.  That amounts to boilerplate
in the d_alloc_parallel() callers that are followed by a call of d_lookup_done()
in the same function; in cases like nfs asynchronous unlink it gets worse than
that.

This patch changes d_alloc_parallel() to use wake_up_var_locked() to
wake up waiters, and wait_var_event_spinlock() to wait.  dentry->d_lock
is used for synchronisation as it is already held at the relevant
times.

That eliminates the need of caller-supplied wait_queue_head, simplifying
the calling conventions.  Better yet, we only need one bit of information
stored in dentry itself: whether there are any waiters to be woken up,
and that can be easily stored in ->d_flags; ->d_wait goes away.

The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var
machinery the queues are shared with all kinds of stuff and there's
no way to tell if any of the waiters have anything to do with our dentry;
most of the time none of them will be relevant, so we need to avoid the
pointless wakeups.
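
The role of the waiters bit can be illustrated with a single-threaded toy
model.  It is a sketch under loud assumptions: the toy_* names and flag
values are hypothetical, there is no real waiting or locking, and the
"wakeup" is reduced to a boolean result.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAR_LOOKUP		0x1u	/* stand-in for DCACHE_PAR_LOOKUP */
#define TOY_LOOKUP_WAITERS	0x2u	/* stand-in for DCACHE_LOOKUP_WAITERS */

struct toy_lookup_dentry {
	unsigned int flags;
};

/* A would-be waiter records its presence before going to sleep. */
static void toy_note_waiter(struct toy_lookup_dentry *d)
{
	d->flags |= TOY_LOOKUP_WAITERS;
}

/*
 * Leave the in-lookup state.  Returns true if a wakeup had to be issued,
 * i.e. only when somebody had recorded themselves as waiting.
 */
static bool toy_end_lookup(struct toy_lookup_dentry *d)
{
	d->flags &= ~TOY_PAR_LOOKUP;
	if (!(d->flags & TOY_LOOKUP_WAITERS))
		return false;		/* nobody waits: skip the shared queue */
	d->flags &= ~TOY_LOOKUP_WAITERS;
	return true;			/* caller would wake the waiters here */
}
```

With no recorded waiter the end of lookup never touches the shared wait
queue, which is the pointless wakeup the bit is meant to avoid.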

Another benefit of the new scheme comes from the fact that wakeups
have to be done outside of write-side critical areas of ->i_dir_seq;
with the old scheme we need to carry the value picked from ->d_wait from
__d_lookup_unhash() to the place where we actually wake the waiters up.
Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get
to doing wakeups - that's done within the same ->d_lock scope, so we
are fine; new bit is accessed only under ->d_lock and it's seen only
on dentries with DCACHE_PAR_LOOKUP in ->d_flags.

__d_lookup_unhash() no longer needs to re-init ->d_lru.  That was
previously shared (in a union) with ->d_wait but ->d_wait is now gone
so it no longer corrupts ->d_lru.

Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3 weeks agoaccel/ethosu: fix OOB write in ethosu_gem_cmdstream_copy_and_validate()
Muhammad Bilal [Sat, 23 May 2026 19:08:43 +0000 (19:08 +0000)] 
accel/ethosu: fix OOB write in ethosu_gem_cmdstream_copy_and_validate()

The command stream parsing loop increments the index variable a second
time when a 64-bit command word is encountered (bit 14 set), but does
not re-check the loop bound before writing the second word:

    for (i = 0; i < size / 4; i++) {
        bocmds[i] = cmds[0];
        if (cmds[0] & 0x4000) {
            i++;
            bocmds[i] = cmds[1];   /* unchecked */
        }
    }

The buffer bocmds is backed by a DMA allocation of exactly size bytes
from drm_gem_dma_create(ddev, size), giving valid indices [0, size/4-1].

When i == size/4 - 1 on entry to an iteration and bit 14 of cmds[0] is
set, bocmds[size/4-1] is written in bounds, i is then incremented to
size/4, and bocmds[size/4] writes four bytes past the end of the
allocation.

Userspace controls both the buffer contents and the size argument via
the ioctl, making this a userspace-triggerable heap out-of-bounds write.

Fix by checking the incremented index against the buffer bound before
the second write and returning -EINVAL if the buffer is too small to
contain the extended command.
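
The shape of the fix can be sketched in plain userspace C.  This is an
illustration only: toy_copy_cmdstream() is a hypothetical name, the source
is an ordinary array rather than a user buffer, and none of the driver's
command validation is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy 32-bit command words from src to dst.  A word with bit 14 set is
 * the first half of a 64-bit command, so a second word follows it.
 * Returns 0 on success, -EINVAL if that second word would lie past the
 * end of the buffer.
 */
static int toy_copy_cmdstream(uint32_t *dst, const uint32_t *src,
			      size_t size_bytes)
{
	size_t n = size_bytes / 4;	/* valid indices are [0, n - 1] */

	for (size_t i = 0; i < n; i++) {
		uint32_t cmd = src[i];

		dst[i] = cmd;
		if (cmd & 0x4000) {
			i++;
			if (i >= n)	/* re-check the bound before writing */
				return -EINVAL;
			dst[i] = src[i];
		}
	}
	return 0;
}
```

A stream whose last word has bit 14 set is rejected instead of writing
four bytes past the end of dst.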

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523190843.33977-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
3 weeks agoMerge tag 'batadv-next-pullrequest-20260603' of https://git.open-mesh.org/batadv
Jakub Kicinski [Fri, 5 Jun 2026 02:14:35 +0000 (19:14 -0700)] 
Merge tag 'batadv-next-pullrequest-20260603' of https://git.open-mesh.org/batadv

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches, all by
Sven Eckelmann:

- tp_meter: fix various minor issues (8 patches)

- tp_meter: split generic session type in sender and receiver type

- tp_meter: consolidate locking for congestion control (2 patches)

- bla: annotate lasttime access with READ/WRITE_ONCE

- elp: prevent transmission interval underflow

- tt: sync local and global tvlv preparation return values

- tt: directly retrieve wifi flags of net_device

* tag 'batadv-next-pullrequest-20260603' of https://git.open-mesh.org/batadv:
  batman-adv: tt: directly retrieve wifi flags of net_device
  batman-adv: tt: sync local and global tvlv preparation return values
  batman-adv: prevent ELP transmission interval underflow
  batman-adv: bla: annotate lasttime access with READ/WRITE_ONCE
  batman-adv: tp_meter: consolidate congestion control variables
  batman-adv: tp_meter: use locking for all congestion control variables
  batman-adv: tp_meter: split vars into sender and receiver types
  batman-adv: tp_meter: add only finished tp_vars to lists
  batman-adv: tp_meter: handle seqno wrap-around for fast recovery detection
  batman-adv: tp_meter: fix fast recovery precondition
  batman-adv: tp_meter: avoid divide-by-zero for dec_cwnd
  batman-adv: tp_meter: avoid window underflow
  batman-adv: tp_meter: initialize dec_cwnd explicitly
  batman-adv: tp_meter: initialize dup_acks explicitly
  batman-adv: tp_meter: keep unacked list in ascending ordered
====================

Link: https://patch.msgid.link/20260603072527.174487-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: mv643xx: fix OF node refcount
Bartosz Golaszewski [Tue, 2 Jun 2026 07:34:14 +0000 (09:34 +0200)] 
net: mv643xx: fix OF node refcount

Platform devices created with platform_device_alloc() call
platform_device_release() when the last reference to the device's
kobject is dropped. This function calls of_node_put() unconditionally.
This works fine for devices created with platform_device_register_full()
but users of the split approach (platform_device_alloc() +
platform_device_add()) must bump the reference of the of_node they
assign manually. Add the missing call to of_node_get().
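
The imbalance can be shown with a toy refcount model.  It is a sketch
with hypothetical toy_* names and a plain integer counter; it does not
use the real platform device or OF APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_node { int refcount; };
struct toy_pdev { struct toy_node *of_node; };

static void toy_node_get(struct toy_node *n) { n->refcount++; }
static void toy_node_put(struct toy_node *n) { n->refcount--; }

/* Assign a node to the device, optionally taking a reference first. */
static void toy_pdev_set_node(struct toy_pdev *p, struct toy_node *n,
			      bool take_ref)
{
	if (take_ref)
		toy_node_get(n);
	p->of_node = n;
}

/*
 * Release path: drops a reference on the assigned node whether or not
 * the assigner ever took one.
 */
static void toy_pdev_release(struct toy_pdev *p)
{
	if (p->of_node)
		toy_node_put(p->of_node);
	p->of_node = NULL;
}
```

Starting from a node whose single reference belongs to someone else,
assigning without the get leaves the count at zero after release, while
assigning with the get leaves the original reference intact.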

Cc: stable@vger.kernel.org
Fixes: 76723bca2802 ("net: mv643xx_eth: add DT parsing support")
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260602073414.22500-1-bartosz.golaszewski@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/dns_resolver: use kasprintf + kmemdup_nul to simplify dns_query
Thorsten Blum [Tue, 2 Jun 2026 07:13:41 +0000 (09:13 +0200)] 
net/dns_resolver: use kasprintf + kmemdup_nul to simplify dns_query

Use kasprintf() for descriptions with a query type and kmemdup_nul()
otherwise to simplify dns_query().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260602071343.962830-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoocteontx2-af: kpu: Default profile updates
Kiran Kumar K [Tue, 2 Jun 2026 04:05:25 +0000 (09:35 +0530)] 
octeontx2-af: kpu: Default profile updates

Add support for parsing the following:
1. fabric path header
2. tpids 0x88a8, 0x9100 and 0x9200 parsing for
   first pass and second pass packets
3. parse stacked VLANs
4. RoCEv2 header with UDP destination port 4791
5. single SBTAG parsing

Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
Link: https://patch.msgid.link/20260602040535.3975769-1-nshettyj@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'selftests-rds-roce-support-follow-ups'
Jakub Kicinski [Fri, 5 Jun 2026 01:33:53 +0000 (18:33 -0700)] 
Merge branch 'selftests-rds-roce-support-follow-ups'

Allison Henderson says:

====================
selftests: rds: ROCE support follow ups

This is a follow up series to the "Add ROCE support to rds selftests"
series.  The first patch renames run.sh to rds_run.sh, which provides
a self-describing name that appears on the netdev CI dashboard.

The second patch addresses a sashiko complaint that I thought was
worth circling back for.  In the patch "pin RDS sockets to their
intended transport," sockets are pinned to the specific transport they
are meant to test.  By default, socket transports are implicitly
selected based on the network topology, but it is possible that they
can fall back to other transports if the underlying connection could
not be established.  So the patch pins them to the intended transport
to avoid false positives.

The third patch "support RDS built as loadable modules," lifts the
CONFIG_MODULES=n requirement, and updates the check_*conf_enabled()
to allow modules set to "=m" and further load the backing modules for
any component set as such.  config.sh is updated to match.

The fourth patch converts the rdma-prerequisite checks to return XFAIL
rather than SKIP, since the RDMA datapath is not run in netdev CI.

Questions, comments and feedback appreciated!
====================

Link: https://patch.msgid.link/20260602050657.26389-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests: rds: report missing RDMA prereqs as XFAIL
Allison Henderson [Tue, 2 Jun 2026 05:06:57 +0000 (22:06 -0700)] 
selftests: rds: report missing RDMA prereqs as XFAIL

Make the RDMA test return XFAIL rather than skip when RXE is not
available, since the RDMA datapath is not run in netdev CI.

Change the three RDMA-prerequisite checks in check_rdma_conf() and
check_rdma_conf_enabled() to exit with KSFT_XFAIL (2) and tag their
messages [XFAIL] instead of [SKIP].

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-5-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests: rds: support RDS built as loadable modules
Allison Henderson [Tue, 2 Jun 2026 05:06:56 +0000 (22:06 -0700)] 
selftests: rds: support RDS built as loadable modules

Commit 92cc6708f4a2 ("selftests: rds: config: disable modules") set
CONFIG_MODULES=n since run.sh required this kconfig. But disabling
modules also forces every =m option to =n rather than =y, which can
silently drop unrelated features.

This patch removes CONFIG_MODULES=n from the rds selftest config and
updates the check_*conf_enabled() routines to accept a config as
either built-in (=y) or modular (=m). A new probe_module() function
is added to load the backing module when a component is set to be
modular (=m).  config.sh no longer forces CONFIG_MODULES=n, so a user
who follows the SKIP message to run config.sh does not silently end
up with modules disabled again.

rds.ko itself is auto-loaded on socket creation, and rds_rdma.ko is
auto-loaded when SO_RDS_TRANSPORT is set with RDS_TRANS_IB, but the
TCP transport (rds_tcp.ko) is not auto-loaded on the bind path, so
the backing modules are loaded explicitly here.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests: rds: pin RDS sockets to their intended transport
Allison Henderson [Tue, 2 Jun 2026 05:06:55 +0000 (22:06 -0700)] 
selftests: rds: pin RDS sockets to their intended transport

The RDS selftests create AF_RDS sockets but never select a transport,
so the transport is chosen implicitly based on network topology when
the socket is bound.  If underlying connection establishment fails, RDS
can fall back to another transport (e.g. loopback) and the test still
passes, silently bypassing the intended datapath it is meant to
exercise.

Set SO_RDS_TRANSPORT to the proper RDS_TRANS_IB or RDS_TRANS_TCP on the
sockets before they are bound, so the test fails loudly if the intended transport is
unavailable rather than passing on a different path.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests: rds: Rename run.sh to rds_run.sh
Allison Henderson [Tue, 2 Jun 2026 05:06:54 +0000 (22:06 -0700)] 
selftests: rds: Rename run.sh to rds_run.sh

This patch renames run.sh to rds_run.sh. This gives the test a
self-describing name that appears in the netdev CI dashboard.

Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: use READ_ONCE() in ipv6_flowlabel_get()
Runyu Xiao [Tue, 2 Jun 2026 00:25:06 +0000 (08:25 +0800)] 
ipv6: use READ_ONCE() in ipv6_flowlabel_get()

ipv6_flowlabel_get() reads flowlabel_consistency and
flowlabel_state_ranges locklessly.

Use READ_ONCE() for these sysctl accesses.
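
The pattern can be sketched in userspace with a simplified stand-in for
the kernel macro.  TOY_READ_ONCE() below is an assumption for
illustration (a volatile-qualified load using the GNU __typeof__
extension), not the kernel's definition, and the struct and function are
hypothetical; only the two field names come from the text above.

```c
#include <assert.h>

/* Simplified stand-in: force exactly one load of x through a volatile lvalue. */
#define TOY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct toy_sysctl {
	int flowlabel_consistency;
	int flowlabel_state_ranges;
};

/*
 * Snapshot each value once and make every later decision from the local
 * copy, so a concurrent writer cannot make two uses disagree.
 * Returns bit 0 for consistency, bit 1 for state_ranges.
 */
static int toy_flowlabel_checks(struct toy_sysctl *s)
{
	int consistency = TOY_READ_ONCE(s->flowlabel_consistency);
	int state_ranges = TOY_READ_ONCE(s->flowlabel_state_ranges);

	return (consistency ? 1 : 0) | (state_ranges ? 2 : 0);
}
```

The point of the local snapshot is that the compiler may not re-read or
tear the value between the test and the use.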

Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602002506.1519901-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: use READ_ONCE() for bindv6only default in inet6_create()
Runyu Xiao [Tue, 2 Jun 2026 00:24:14 +0000 (08:24 +0800)] 
ipv6: use READ_ONCE() for bindv6only default in inet6_create()

inet6_create() reads net->ipv6.sysctl.bindv6only locklessly.

Use READ_ONCE() for this sysctl access.

Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602002414.1504106-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoinet: frags: remove redundant assignment in inet_frag_reasm_prepare()
yuan.gao [Wed, 3 Jun 2026 06:53:23 +0000 (14:53 +0800)] 
inet: frags: remove redundant assignment in inet_frag_reasm_prepare()

The assignment is redundant because skb_clone() already copies skb->cb.

Remove the unnecessary code.

Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
Link: https://patch.msgid.link/20260603065323.2736839-1-yuan.gao@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: airoha: Report extack error to the user if airoha_tc_htb_modify_queue() fails
Lorenzo Bianconi [Wed, 3 Jun 2026 10:30:01 +0000 (12:30 +0200)] 
net: airoha: Report extack error to the user if airoha_tc_htb_modify_queue() fails

Report an extack error message in airoha_tc_htb_modify_queue routine if
airoha_qdma_set_tx_rate_limit() fails.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260603-airoha_tc_htb_modify_queue-err-message-v1-1-33ec3ab997d9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: net: dsa: remove obsolete dsa.txt
Akash Sukhavasi [Wed, 3 Jun 2026 20:42:20 +0000 (15:42 -0500)] 
dt-bindings: net: dsa: remove obsolete dsa.txt

dsa.txt has been a redirect to dsa.yaml since commit bce58590d1bd
("dt-bindings: net: dsa: Add DSA yaml binding") introduced the .yaml
schema. The .yaml has the same filename in the same directory, making
this redirect unnecessary for discoverability.

Two files still reference dsa.txt, forcing readers through an extra
hop to reach the .yaml. The stub has not been touched since August
2020. Update references in lan9303.txt and
Documentation/networking/dsa/dsa.rst to point directly to dsa.yaml
and remove the stub.

Signed-off-by: Akash Sukhavasi <akash.sukhavasi@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260603-b4-remove-redirect-stubs-v2-3-c8c19876ab64@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: net: remove obsolete mdio.txt
Akash Sukhavasi [Wed, 3 Jun 2026 20:42:18 +0000 (15:42 -0500)] 
dt-bindings: net: remove obsolete mdio.txt

mdio.txt has been a single-line redirect to mdio.yaml since
commit 62d77ff7ecbf ("dt-bindings: net: Add a YAML schemas for the
generic MDIO options"), which introduced the .yaml schema and reduced
the .txt to a stub in the same change. The .yaml has the same filename
in the same directory, making this redirect unnecessary for
discoverability.

No files in the tree reference mdio.txt and it has not been touched
since June 2019. Remove the obsolete stub.

Signed-off-by: Akash Sukhavasi <akash.sukhavasi@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260603-b4-remove-redirect-stubs-v2-1-c8c19876ab64@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agortnetlink: use dev_isalive() in rtnl_getlink()
Eric Dumazet [Wed, 3 Jun 2026 18:08:31 +0000 (18:08 +0000)] 
rtnetlink: use dev_isalive() in rtnl_getlink()

rtnl_getlink() uses an RCU lookup to get the netdevice pointer.

When/If rtnl_lock() is used, we should check if the netdevice is not
being dismantled before potentially performing illegal actions.

Move dev_isalive() out of net/core/net-sysfs.c and make it available
in net/core/dev.h.

Return -ENODEV if rtnl_getlink() finds a device which is currently
being dismantled and RTNL is requested.

Fixes: e896e5c0734b ("rtnetlink: do not acquire RTNL in rtnl_getlink() with RTEXT_FILTER_NAME_ONLY")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260603180831.1024716-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: dsa: realtek: Use %pM format specifier for MAC addresses
Andy Shevchenko [Wed, 3 Jun 2026 11:20:11 +0000 (13:20 +0200)] 
net: dsa: realtek: Use %pM format specifier for MAC addresses

Convert to %pM instead of using custom code.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/20260603112011.230890-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agobridge: mcast: Synchronously shutdown port multicast timers
Ido Schimmel [Wed, 3 Jun 2026 10:35:22 +0000 (13:35 +0300)] 
bridge: mcast: Synchronously shutdown port multicast timers

Currently, while four timers are set up during port multicast context
initialization, only two are synchronously deleted when the context is
de-initialized, just before being deleted.

This is fine because the structure containing the multicast context
(either a bridge port or a VLAN) is only deleted after an RCU grace
period and it will not pass as long as the timers are executing. These
timers are also not supposed to do any work at this stage. They acquire
the bridge multicast lock, see that the multicast context was disabled
and exit.

Make the code more explicit and symmetric and synchronously shut down all
four timers when the multicast context is de-initialized. Use
timer_shutdown_sync() to guarantee that the timers will not be re-armed
given that the containing structure is being deleted.

Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260603103522.622411-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'rndis_host-add-le310x1-id-and-enable-low-power-handling'
Jakub Kicinski [Fri, 5 Jun 2026 01:11:00 +0000 (18:11 -0700)] 
Merge branch 'rndis_host-add-le310x1-id-and-enable-low-power-handling'

Shaoxu Liu says:

====================
rndis_host: add LE310X1 ID and enable low-power handling

This series adds RNDIS support for Telit Cinterion LE310X1 and then enables
USB power management for that specific ID.
====================

Link: https://patch.msgid.link/tencent_29CB862D5756CBCBAFD2EE436EBAC98A7E05@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agorndis_host: enable power management for Telit LE310X1
Shaoxu Liu [Tue, 2 Jun 2026 09:05:28 +0000 (17:05 +0800)] 
rndis_host: enable power management for Telit LE310X1

Enable autosuspend support for Telit Cinterion LE310X1 RNDIS interface
by selecting a driver_info variant with manage_power callback.

This keeps power management scoped to the new Telit ID only, and avoids
changing behavior for all existing RNDIS devices.

Signed-off-by: Shaoxu Liu <shaoxul@foxmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/tencent_B7686B84CD4B76D76BB912FA6367FAC2CA05@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agorndis_host: add Telit LE310X1 RNDIS USB ID
Shaoxu Liu [Tue, 2 Jun 2026 09:05:27 +0000 (17:05 +0800)] 
rndis_host: add Telit LE310X1 RNDIS USB ID

Add a device match entry for Telit Cinterion LE310X1 RNDIS interface
(VID:PID 1bc7:7030).

This is a functional no-op and keeps using the generic rndis_info for now.
Power-management behavior is handled in a follow-up patch.

Signed-off-by: Shaoxu Liu <shaoxul@foxmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/tencent_F1AF1F5AD39C56485BD16C6DB2415E5B9508@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoinet: frags: fix use-after-free caused by the fqdir_pre_exit() flush
Hyunwoo Kim [Tue, 2 Jun 2026 10:21:05 +0000 (19:21 +0900)] 
inet: frags: fix use-after-free caused by the fqdir_pre_exit() flush

On netns teardown, fqdir_pre_exit() walks the fqdir rhashtable and
flushes every fragment queue that is not yet complete using
inet_frag_queue_flush(). That helper frees all the skbs queued on the
fragment queue but does not set INET_FRAG_COMPLETE, and leaves
q->fragments_tail and q->last_run_head pointing at the freed skbs.
The queue itself stays in the rhashtable.

fqdir_pre_exit() first lowers high_thresh to 0 to stop new queue lookups,
but it cannot stop a fragment that already obtained the queue through
inet_frag_find() earlier and stalled just before taking the queue lock.
Once that fragment resumes after the flush and takes the queue lock,
it passes the INET_FRAG_COMPLETE check and then dereferences the freed
fragments_tail. inet_frag_queue_insert() reads FRAG_CB() and ->len of
that pointer and, on the append path, writes ->next_frag, causing a
slab use-after-free. IPv6, nf_conntrack_reasm6 and 6lowpan reassembly
share the same flush path and are affected as well.

Reset rb_fragments, fragments_tail and last_run_head in
inet_frag_queue_flush() so a flushed queue no longer points at the
freed skbs. A fragment that resumes after the flush and takes the
queue lock then finds an empty queue and starts a new run instead of
dereferencing the freed fragments_tail. ip_frag_reinit() already
performed this reset after its own flush, so drop the now duplicate
code there.
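
A minimal userspace sketch of the reset described above. The types and helpers (struct frag, struct frag_queue, queue_flush(), queue_append()) are hypothetical stand-ins, not the kernel's definitions; they only model why a flushed queue must not keep pointers to freed entries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an skb and a fragment queue. */
struct frag {
	struct frag *next;
	unsigned int len;
};

struct frag_queue {
	struct frag *head;      /* models rb_fragments */
	struct frag *tail;      /* models fragments_tail */
	struct frag *last_run;  /* models last_run_head */
};

/* Free every queued fragment and leave the queue truly empty. */
static void queue_flush(struct frag_queue *q)
{
	struct frag *f = q->head;

	while (f) {
		struct frag *next = f->next;

		free(f);
		f = next;
	}
	/* Without these resets, tail and last_run would dangle. */
	q->head = NULL;
	q->tail = NULL;
	q->last_run = NULL;
}

/* Append a fragment; an empty queue starts a new run. */
static int queue_append(struct frag_queue *q, unsigned int len)
{
	struct frag *f = calloc(1, sizeof(*f));

	if (!f)
		return -1;
	f->len = len;
	if (q->tail)
		q->tail->next = f;  /* a stale tail would make this a use-after-free */
	else
		q->head = q->last_run = f;
	q->tail = f;
	return 0;
}
```

In this model, a late queue_append() after queue_flush() sees a NULL tail and starts a fresh run instead of writing through a freed pointer.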

Cc: stable@vger.kernel.org
Fixes: 006a5035b495 ("inet: frags: flush pending skbs in fqdir_pre_exit()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Link: https://patch.msgid.link/ah6ukYq5G98LshdA@v4bel
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agobonding: annotate data-races in sysfs and procfs
Eric Dumazet [Tue, 2 Jun 2026 15:27:48 +0000 (15:27 +0000)] 
bonding: annotate data-races in sysfs and procfs

bonding sysfs and procfs read parameters locklessly,
while drivers/net/bonding/bond_options.c can write over them.

Add missing READ_ONCE()/WRITE_ONCE() annotations.

This came as a prereq to avoid RTNL in bond_fill_info().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260602152748.2564393-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoi2c: busses: make K1 driver default for SpacemiT platforms
Iker Pedrosa [Tue, 26 May 2026 14:36:57 +0000 (16:36 +0200)] 
i2c: busses: make K1 driver default for SpacemiT platforms

Enable I2C_K1 by default when ARCH_SPACEMIT is configured to ensure SD
card functionality works out-of-the-box.

SpacemiT K1 boards use I2C-controlled PMICs (like the P1 chip) to
provide SD card power supplies. Without the I2C_K1 driver enabled,
regulators cannot be controlled and SD card detection/operation fails.

Suggested-by: Margherita Milani <margherita.milani@amarulasolutions.com>
Suggested-by: Yixun Lan <dlan@kernel.org>
Signed-off-by: Iker Pedrosa <ikerpedrosam@gmail.com>
Reviewed-by: Yixun Lan <dlan@kernel.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260526-orangepi-sd-card-i2c-v1-1-b92268bfd467@gmail.com
3 weeks agoMerge tag 'drm-rust-next-2026-06-04' of https://gitlab.freedesktop.org/drm/rust/kerne...
Dave Airlie [Thu, 4 Jun 2026 23:09:38 +0000 (09:09 +1000)] 
Merge tag 'drm-rust-next-2026-06-04' of https://gitlab.freedesktop.org/drm/rust/kernel into drm-next

DRM Rust changes for v7.2-rc1

- Driver Core (shared via signed tag dd-lifetimes-7.2-rc1):

  - Introduce Higher-Ranked Lifetime Types (HRT) for Rust device
    drivers, allowing driver structs to hold device resources like
    pci::Bar and IoMem directly with a lifetime tied to the binding
    scope, removing the need for Devres indirection and ARef<Device>.

  - Replace drvdata() with scoped registration data on the auxiliary
    bus, using the new ForLt trait to thread lifetimes through
    registrations. Remove drvdata() and driver_type.

- DRM:

  - Add GPUVM immediate mode abstraction for Rust GPU drivers:
    - In immediate mode, GPU virtual address space state is updated
      during job execution (in the DMA fence signalling critical path),
      keeping the GPUVM and the GPU's address space always in sync.

    - Provide GpuVm, GpuVa, and GpuVmBo types for managing address
      spaces, virtual mappings, and GEM object backing respectively.

    - Provide split-merge map/unmap operations that handle partial
      overlaps with existing mappings.

    - drm_exec integration for dma_resv locking and GEM object
      validation based on the external/evicted object lists is not
      yet covered and is planned as follow-up work.

  - Introduce DeviceContext type state for drm::Device, allowing
    drivers to restrict operations to contexts where the device is
    guaranteed to be registered (or not yet registered) with userspace.

  - Add FEAT_RENDER flag to the Driver trait for render node support.

- Nova:

  - Hopper/Blackwell enablement:
    - Add GPU identification and architecture-based HAL selection for
      Hopper (GH100) and Blackwell (GB100, GB202).

    - Implement the FSP (Foundation Security Processor) boot path used by
      Hopper and Blackwell, including FSP falcon engine support, EMEM
      operations, MCTP/NVDM message infrastructure, and FSP Chain of
      Trust boot with GSP lockdown release.

    - Add support for 32-bit firmware images and auto-detection of
      firmware image format.

    - Add architecture-specific framebuffer, sysmem flush, PCI config
      mirror, DMA mask, and WPR/non-WPR heap sizing.

  - GSP boot and unload:
    - Refactor the GSP boot process into a chipset-specific HAL,
      keeping the SEC2 and FSP boot paths separated cleanly.

    - Implement proper driver unload: send UNLOADING_GUEST_DRIVER
      command, run Booter Unloader and FWSEC-SB upon unbinding, and run
      the unload bundle on Gsp::boot() failure. This removes the need
      for a manual GPU reset between driver unbind and re-probe.

  - GA100 support:
    - Add support for the GA100 GPU, including IFR header detection and
      skipping, correct fwsignature selection, conditional FRTS boot,
      and documentation of the IFR header layout.

  - VBIOS hardening and refactoring:
    - Harden VBIOS parsing with checked arithmetic, bounds-checked
      accesses, and FromBytes-based structure reads throughout the FWSEC
      and Falcon data paths. Simplify the overall VBIOS module
      structure.

  - HRT adoption:
    - Use lifetime-parameterized pci::Bar directly, replacing the
      Arc<Devres<Bar0>> indirection. Replace ARef<Device> with &'bound
      Device in SysmemFlush and the GSP sequencer. Separate the driver
      type from driver data.

  - Misc:
    - Rename module names to kebab-case (nova-drm, nova-core).

    - Require little-endian in Kconfig, making the existing assumption
      explicit.

- Tyr:

  - Define comprehensive typed register blocks for GPU_CONTROL,
    JOB_CONTROL, MMU_CONTROL (including per-address-space registers),
    and DOORBELL_BLOCK using the kernel register!() macro. This replaces
    manual bit manipulation with typed register and field accessors.

  - Add shmem-backed GEM objects and set DMA mask based on GPU physical
    address width.

  - Adopt HRT: separate driver type from driver data, and use IoMem
    directly instead of Devres for register access during probe.

  - Move clock cleanup into a Drop implementation.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: "Danilo Krummrich" <dakr@kernel.org>
Link: https://patch.msgid.link/DJ0IF39U9ETK.PCCUO7ZEQ4S0@kernel.org
3 weeks agoi2c: Use named initializers for arrays of i2c_device_data
Uwe Kleine-König (The Capable Hub) [Mon, 18 May 2026 16:45:09 +0000 (18:45 +0200)] 
i2c: Use named initializers for arrays of i2c_device_data

While being less compact, named initializers make it easier to see which
members of the structs are assigned which value without having to look up
the declaration of the struct. They are also more robust against changes
to the struct definition.

The mentioned robustness is relevant for a planned change to struct
i2c_device_id that replaces .driver_data by an anonymous union.

While touching all these arrays, unify usage of whitespace in the list
terminator.

This patch doesn't modify the compiled arrays, only their representation
in source form benefits. The former was confirmed with x86 and arm64
builds.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20260518164510.805502-2-u.kleine-koenig@baylibre.com
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
3 weeks agocxl/pci: Convert PCIBIOS errors to errno on DVSEC config accesses
Dave Jiang [Thu, 4 Jun 2026 18:01:54 +0000 (11:01 -0700)] 
cxl/pci: Convert PCIBIOS errors to errno on DVSEC config accesses

PCI config space accessors return PCIBIOS_* status codes on failure, which
are positive integers. Several DVSEC accesses in the CXL core propagated
these raw values to callers that test for failure by checking for values
less than 0. Those callers thus silently misinterpret the return value as
success.

Convert the positive error values to negative errno values so the checks
are correct on error paths.

While the chances of a config access failure are low, fix for correctness
and to avoid confusion in the future when more DVSEC accesses are added.
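
The mismatch can be sketched in userspace as follows. The PCIBIOS_* values are redefined locally so the example builds outside the kernel, and status_to_errno() and caller_sees_failure() are hypothetical helpers; the mapping is loosely modelled on the kernel's pcibios_err_to_errno() and is not the code from this patch.

```c
#include <assert.h>
#include <errno.h>

/* Local copies of the positive PCIBIOS_* status codes, for illustration. */
#define PCIBIOS_SUCCESSFUL          0x00
#define PCIBIOS_DEVICE_NOT_FOUND    0x86
#define PCIBIOS_BAD_REGISTER_NUMBER 0x87
#define PCIBIOS_SET_FAILED          0x88

/* Hypothetical helper: turn a positive status into a negative errno. */
static int status_to_errno(int status)
{
	if (status <= 0)
		return status;  /* success, or already a negative errno */

	switch (status) {
	case PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case PCIBIOS_BAD_REGISTER_NUMBER:
		return -EFAULT;
	case PCIBIOS_SET_FAILED:
		return -EIO;
	default:
		return -ERANGE;
	}
}

/* A caller that only treats negative values as failure. */
static int caller_sees_failure(int rc)
{
	return rc < 0;
}
```

A raw status such as PCIBIOS_DEVICE_NOT_FOUND passes the `rc < 0` test unnoticed, while the converted value is caught.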

Fixes: 14d788740774 ("cxl/mem: Consolidate CXL DVSEC Range enumeration in the core")
Fixes: ce17ad0d5498 ("cxl: Wait Memory_Info_Valid before access memory related info")
Reviewed-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Assisted-by: Claude:claude-opus-4-8
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260604180154.1925149-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
3 weeks agoaccel/ethosu: reject DMA commands with uninitialized length
Muhammad Bilal [Sun, 24 May 2026 13:03:19 +0000 (13:03 +0000)] 
accel/ethosu: reject DMA commands with uninitialized length

cmd_state_init() initializes the command state with memset(0xff),
leaving dma->len at U64_MAX to signal missing setup. The only setter
is NPU_SET_DMA0_LEN; if userspace omits this command and issues
NPU_OP_DMA_START, dma->len remains U64_MAX.

In dma_length(), a positive stride added to U64_MAX wraps to a small
value. With size0 == 1, check_mul_overflow() does not trigger and
dma_length() returns 0 instead of U64_MAX. The caller's U64_MAX check
then passes, region_size[] stays 0, and the bounds check in
ethosu_job.c is bypassed, allowing hardware to execute DMA with stale
physical addresses.

Fix by checking for U64_MAX at the start of dma_length() before any
arithmetic, consistent with the sentinel value used throughout the
driver to detect uninitialized fields.
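
A minimal sketch of why the sentinel has to be tested before any arithmetic. LEN_UNSET, length_unchecked() and length_checked() are hypothetical names, and the formula is reduced to a single stride and size for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define LEN_UNSET UINT64_MAX  /* models the U64_MAX "not set" sentinel */

/* Buggy shape: arithmetic first, so the sentinel wraps to a small value. */
static uint64_t length_unchecked(uint64_t len, int64_t stride, uint64_t size0)
{
	return (len + (uint64_t)stride) * size0;
}

/* Fixed shape: reject the sentinel before touching it. */
static uint64_t length_checked(uint64_t len, int64_t stride, uint64_t size0)
{
	if (len == LEN_UNSET)
		return LEN_UNSET;
	return (len + (uint64_t)stride) * size0;
}
```

With a stride of 1 and size0 of 1, the unchecked variant turns the sentinel into 0, which a later "is it U64_MAX?" test cannot recognise.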

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260524130319.12747-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
3 weeks agoaccel/ethosu: fix arithmetic issues in dma_length()
Muhammad Bilal [Sun, 24 May 2026 10:37:10 +0000 (10:37 +0000)] 
accel/ethosu: fix arithmetic issues in dma_length()

dma_length() derives DMA region usage from command stream values and
updates region_size[]:

    len = ((len + stride[0]) * size0 + stride[1]) * size1
    region_size[region] = max(..., len + dma->offset)

Several arithmetic issues can corrupt the derived region size:

- signed stride values may underflow when added to len
- intermediate multiplications may overflow
- len + dma->offset may overflow during region_size updates
- dma_length() error returns were not validated by the caller

region_size[] is later used by ethosu_job.c to validate command stream
accesses against GEM buffer sizes. Arithmetic wraparound can therefore
under-report region usage and bypass the bounds validation.

Fix by validating signed additions, using overflow helpers for
multiplications and offset updates, and propagating dma_length()
failures to the caller.
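
The checked form of the formula above might look like the sketch below. region_end() is a hypothetical helper, and it uses the GCC/Clang overflow builtins directly rather than the kernel's check_*_overflow() wrappers; it is an illustration of the approach, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: computes ((len + s0) * n0 + s1) * n1 + offset and
 * returns false if any intermediate step does not fit in a uint64_t.
 */
static bool region_end(uint64_t len, int64_t s0, uint64_t n0,
		       int64_t s1, uint64_t n1, uint64_t offset,
		       uint64_t *out)
{
	uint64_t v;

	/* Mixed signed/unsigned add: catches underflow as well as overflow. */
	if (__builtin_add_overflow(len, s0, &v))
		return false;
	if (__builtin_mul_overflow(v, n0, &v))
		return false;
	if (__builtin_add_overflow(v, s1, &v))
		return false;
	if (__builtin_mul_overflow(v, n1, &v))
		return false;
	if (__builtin_add_overflow(v, offset, &v))
		return false;

	*out = v;
	return true;
}
```

Returning a separate success flag keeps the failure from being mistaken for a valid length, which is the propagation issue listed last above.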

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260524103710.47397-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
3 weeks agoaccel/ethosu: fix wrong weight index in NPU_SET_SCALE1_LENGTH on U85
Muhammad Bilal [Sat, 23 May 2026 21:07:53 +0000 (21:07 +0000)] 
accel/ethosu: fix wrong weight index in NPU_SET_SCALE1_LENGTH on U85

On non-U65 hardware (e.g. U85), opcode 0x4093 is NPU_SET_WEIGHT2_LENGTH.
The BASE handler for the same opcode correctly assigns to
st.weight[2].base, but the LENGTH handler mistakenly assigns cmds[1]
to st.weight[1].length instead of st.weight[2].length.

This leaves weight[2].length at its initialised sentinel value of
0xffffffff and corrupts weight[1].length with the user-supplied value,
breaking the software bounds-check state for both weight buffers on U85.

Fix the index to match the BASE handler.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523210840.92039-3-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
3 weeks agoaccel/ethosu: reject NPU_OP_RESIZE commands from userspace
Muhammad Bilal [Sat, 23 May 2026 21:07:52 +0000 (21:07 +0000)] 
accel/ethosu: reject NPU_OP_RESIZE commands from userspace

NPU_OP_RESIZE is a U85-only command that the driver does not yet
implement. The existing WARN_ON(1) placeholder fires unconditionally
whenever userspace submits this command via DRM_IOCTL_ETHOSU_GEM_CREATE,
causing unbounded kernel log spam.

If panic_on_warn is set the kernel panics, giving any unprivileged user
with access to the DRM device a trivial denial-of-service primitive.

Replace the WARN_ON(1) with an explicit -EINVAL return so the ioctl
rejects the command before it reaches hardware.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523210840.92039-2-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
3 weeks agocxl/pci: Fix the incorrect check of pci_read_config_word() return
Dave Jiang [Thu, 4 Jun 2026 18:01:53 +0000 (11:01 -0700)] 
cxl/pci: Fix the incorrect check of pci_read_config_word() return

pci_read_config_word() returns a PCIBIOS_* status code on error, which is
a positive value. The check should be for a non-zero value to indicate an
error. Fix cxl_set_mem_enable() to check for a non-zero return value
instead of a negative value.

While fixing this, also convert the error to negative errno value when
returning on error path.

Fixes: 34e37b4c432c ("cxl/port: Enable HDM Capability after validating DVSEC Ranges")
Reviewed-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Assisted-by: Claude:claude-opus-4-8
Link: https://patch.msgid.link/20260604180154.1925149-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
3 weeks agotools: ynl: try to avoid the very slow YAML loader
Jakub Kicinski [Wed, 3 Jun 2026 21:08:10 +0000 (14:08 -0700)] 
tools: ynl: try to avoid the very slow YAML loader

Turns out Python YAML defaults to a pure Python loader for YAML
files which is a lot slower than the C loader (using libyaml).
Try to use the C one whenever possible.

The avg time to run:
  $ tools/net/ynl/pyynl/cli.py --family tc --no-schema
drops from 300+ ms to 115 ms with this change (40 samples).

We could drop the load time further to 85 ms if we "compiled"
the specs to JSON. Slightly tricky parts are that we don't
currently install the specs at all on make install, so it's
unclear where to put the conversion. Also JSON has questionable
support for comments and we need an SPDX line.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260603210810.2636193-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaccel/ethosu: fix IFM region index out-of-bounds in command stream parser
Muhammad Bilal [Sat, 23 May 2026 19:51:59 +0000 (19:51 +0000)] 
accel/ethosu: fix IFM region index out-of-bounds in command stream parser

NPU_SET_IFM_REGION extracts the region index with param & 0x7f, giving
a maximum value of 127. However region_size[] and output_region[] in
struct ethosu_validated_cmdstream_info are both sized to
NPU_BASEP_REGION_MAX (8), giving valid indices [0..7].

Every other region assignment in the same switch uses param & 0x7:
  NPU_SET_OFM_REGION:  st.ofm.region  = param & 0x7;
  NPU_SET_IFM2_REGION: st.ifm2.region = param & 0x7;
  NPU_SET_WEIGHT_REGION: st.weight[0].region = param & 0x7;
  NPU_SET_SCALE_REGION:  st.scale[0].region  = param & 0x7;

The 0x7f mask on IFM is inconsistent and appears to be a typo.

feat_matrix_length() and calc_sizes() use the region index directly
as an array subscript into the kzalloc'd info struct:
  info->region_size[fm->region] = max(...);

A userspace caller supplying NPU_SET_IFM_REGION with param > 7 causes
a write up to 127*8 = 1016 bytes past the start of region_size[],
corrupting adjacent kernel heap data.

Fix by applying the same & 0x7 mask used by all other region
assignments.
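
The effect of the two masks can be checked with a short sketch. REGION_MAX models NPU_BASEP_REGION_MAX from the text above; region_index() and region_index_wide() are hypothetical helpers, not the driver's functions.

```c
#include <assert.h>
#include <stdint.h>

#define REGION_MAX 8  /* models NPU_BASEP_REGION_MAX */

/* A 3-bit mask keeps the index inside [0, 7]. */
static unsigned int region_index(uint16_t param)
{
	return param & 0x7;
}

/* The inconsistent 7-bit mask, kept only to show the range it allows. */
static unsigned int region_index_wide(uint16_t param)
{
	return param & 0x7f;
}
```

Assuming 8-byte array elements, the largest index the wide mask allows corresponds to the 1016-byte offset quoted above.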

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523195159.55801-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
3 weeks agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 4 Jun 2026 22:26:27 +0000 (15:26 -0700)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-7.1-rc7).

Silent conflicts:

net/wireless/nl80211.c
  cb9959ab5f99 ("wifi: cfg80211: enforce HE/EHT cap/oper consistency")
  a384ae969902 ("wifi: cfg80211: move AP HT/VHT/... operation to beacon info")
https://lore.kernel.org/aiGJDaHV4UlCexIQ@sirena.org.uk

Conflicts:

drivers/net/wireless/intel/iwlwifi/mld/ap.c
  a342c99cb70d ("wifi: iwlwifi: mld: honor BSS_CHANGED_BEACON_ENABLED")
  9bf1b409afc7 ("wifi: iwlwifi: mld: send tx power constraints before link activation")
https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk

drivers/net/wireless/intel/iwlwifi/pcie/drv.c
  093305d801fa ("wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used")
  e2323929a68a ("wifi: iwlwifi: pcie: add debug print for resume flow if powered off")
https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk

Adjacent changes:

drivers/net/ethernet/airoha/airoha_eth.c
  b38cae85d1c4 ("net: airoha: Fix use-after-free in metadata dst teardown")
  ec6c391bcca7 ("net: airoha: Introduce airoha_gdm_dev struct")

drivers/net/ethernet/microchip/lan743x_main.c
  8173d22b211f ("net: lan743x: permit VLAN-tagged packets up to configured MTU")
  e3c6508a46f5 ("net: lan743x: avoid netdev-based logging before netdev registration")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoperf sched: Fix thread reference leaks in timehist_get_thread()
Arnaldo Carvalho de Melo [Thu, 4 Jun 2026 21:18:05 +0000 (18:18 -0300)] 
perf sched: Fix thread reference leaks in timehist_get_thread()

timehist_get_thread() acquires a thread reference via
machine__findnew_thread() and an idle thread reference via
get_idle_thread() (which calls thread__get()).  Two error paths in
the idle_hist block return NULL without releasing these references:

 - When get_idle_thread() fails, the thread reference leaks.
 - When thread__priv(idle) returns NULL, both idle and thread leak.

Additionally, the idle thread reference acquired on the success path
is never released, leaking a reference on every sample when
--idle-hist is active.

Add thread__put() calls on both error paths and release the idle
reference after use on the success path.
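
The balanced get/put pattern can be sketched as below. struct obj, obj_get(), obj_put() and lookup() are hypothetical stand-ins for the perf thread API and only model the reference counting, under the assumption that the caller keeps the returned reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a perf thread. */
struct obj {
	int refs;
	void *priv;
};

static struct obj *obj_get(struct obj *o)
{
	if (o)
		o->refs++;
	return o;
}

static void obj_put(struct obj *o)
{
	if (o)
		o->refs--;
}

/*
 * Returns a referenced 'thread', or NULL with no references left behind.
 * 'idle' may be NULL to model a failed idle-thread lookup.
 */
static struct obj *lookup(struct obj *thread, struct obj *idle)
{
	struct obj *t = obj_get(thread);
	struct obj *i = obj_get(idle);

	if (!i) {
		obj_put(t);  /* error path 1: drop the thread reference */
		return NULL;
	}
	if (!i->priv) {
		obj_put(i);  /* error path 2: drop both references */
		obj_put(t);
		return NULL;
	}
	/* ... i->priv would be used here ... */
	obj_put(i);          /* success path: idle is no longer needed */
	return t;
}
```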

Fixes: 5d8f17fb5822 ("perf sched timehist: Add -I/--idle-hist option")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 weeks agoperf tools: Add bounds check to cpu__get_node()
Arnaldo Carvalho de Melo [Thu, 4 Jun 2026 21:14:23 +0000 (18:14 -0300)] 
perf tools: Add bounds check to cpu__get_node()

cpu__get_node() accesses cpunode_map[cpu.cpu] without checking against
max_cpu_num, the allocation size of cpunode_map.  Callers such as
builtin-kmem.c:evsel__process_alloc_event() pass sample->cpu from
perf.data events, which may exceed the host's CPU count when analyzing
cross-machine recordings.

Add a bounds check against max_cpu_num before indexing, returning -1
for out-of-range values.  This is a central fix that protects all
callers.
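
A minimal sketch of such a central bounds check. MAX_CPUS, node_map and get_node() are hypothetical names with made-up contents; the real map is sized at runtime.

```c
#include <assert.h>

#define MAX_CPUS 4  /* hypothetical size of the map below */

static const int node_map[MAX_CPUS] = { 0, 0, 1, 1 };

/* Return the node for 'cpu', or -1 when the index is out of range. */
static int get_node(int cpu)
{
	if (cpu < 0 || cpu >= MAX_CPUS)
		return -1;
	return node_map[cpu];
}
```

Placing the check in the accessor means a CPU number read from an untrusted recording cannot index past the map, whichever caller supplies it.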

Fixes: 86895b480a2f ("perf stat: Add --per-node agregation support")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 weeks agoperf tools: Guard remaining test_bit calls from OOB sample CPU
Arnaldo Carvalho de Melo [Thu, 4 Jun 2026 21:11:41 +0000 (18:11 -0300)] 
perf tools: Guard remaining test_bit calls from OOB sample CPU

auxtrace.c:filter_cpu() and builtin-script.c:filter_cpu() call
test_bit(cpu, cpu_bitmap) where cpu_bitmap is declared with
MAX_NR_CPUS bits.  When the CPU value from a perf.data event is
corrupt or absent (e.g. negative or >= MAX_NR_CPUS), test_bit reads
out of bounds.

Add bounds checks before test_bit(): >= 0 for the int16_t cpu.cpu in
auxtrace (which also covers the -1 sentinel), and < MAX_NR_CPUS for
both sites.  Matches the pattern applied in the previous series for
builtin-annotate.c, builtin-diff.c, builtin-report.c, and
builtin-sched.c.

Fixes: 644e0840ad46 ("perf auxtrace: Add CPU filter support")
Fixes: 5d67be97f890 ("perf report/annotate/script: Add option to specify a CPU range")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 weeks agoocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent
Ian Bridges [Mon, 1 Jun 2026 18:44:33 +0000 (13:44 -0500)] 
ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent

[BUG]
Unlinking a refcounted file whose refcount tree has leaf blocks
triggers a fortify panic due to an out-of-bounds write.

[CAUSE]
When the last leaf block is removed from a refcount tree,
ocfs2_remove_refcount_extent() converts the root back to leaf mode
with a bulk memset on &rb->rf_records. rf_records sits in an anonymous
union with rf_list. rf_list.l_tree_depth aliases rf_records.rl_count,
and is 0 for a single-level tree. With rl_count equal to 0, the memset
writes past the 16-byte declared size of rf_records, which the fortify
checker catches.

[FIX]
Replace the bulk memset on &rb->rf_records with a correctly-bounded
memset on rl_recs[] alone, after setting rl_count to the correct value.
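
The ordering in the fix can be illustrated with a simplified layout. struct rec, struct rec_list and the two helpers are hypothetical and do not match the on-disk ocfs2 structures; the sketch only shows setting the count first and then clearing exactly that many records.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct rec {
	uint64_t start;
	uint32_t clusters;
	uint32_t refcount;
};

/* Hypothetical record list with a counted flexible array. */
struct rec_list {
	uint16_t count;  /* number of valid entries in recs[] */
	uint16_t used;
	struct rec recs[];
};

/* Set the count first, then clear exactly 'count' records and nothing else. */
static void rec_list_reset(struct rec_list *rl, uint16_t count)
{
	rl->count = count;
	rl->used = 0;
	memset(rl->recs, 0, (size_t)count * sizeof(rl->recs[0]));
}

/* Allocate a list, fill it with junk, then reset it to a clean state. */
static struct rec_list *rec_list_alloc(uint16_t count)
{
	size_t bytes = sizeof(struct rec_list) + (size_t)count * sizeof(struct rec);
	struct rec_list *rl = malloc(bytes);

	if (rl) {
		memset(rl, 0xff, bytes);
		rec_list_reset(rl, count);
	}
	return rl;
}
```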

Link: https://lore.kernel.org/ah3TESOsEO9j_JLU@dev
Fixes: 2f26f58df041 ("ocfs2: annotate flexible array members with __counted_by_le()")
Signed-off-by: Ian Bridges <icb@fastmail.org>
Reported-by: syzbot+3ef989aae096b30f1663@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3ef989aae096b30f1663
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks agoocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release()
Joseph Qi [Mon, 1 Jun 2026 12:16:18 +0000 (20:16 +0800)] 
ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release()

Move atomic_inc(&ocfs2_control_opened) and the handshake state update
inside ocfs2_control_lock to close a race window where
ocfs2_control_release() can observe ocfs2_control_opened dropping to zero
(resetting ocfs2_control_this_node and running_proto) while
ocfs2_control_install_private() is about to bump the counter and mark the
connection valid.

Link: https://lore.kernel.org/20260601121618.1263346-1-joseph.qi@linux.alibaba.com
Fixes: 3cfd4ab6b6b4 ("ocfs2: Add the local node id to the handshake.")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Ginger <ginger.jzllee@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks agoocfs2/dlm: require a ref for locking_state debugfs open
Zhang Cen [Sun, 31 May 2026 04:47:14 +0000 (12:47 +0800)] 
ocfs2/dlm: require a ref for locking_state debugfs open

debug_lockres_open() copies inode->i_private into struct debug_lockres and
debug_lockres_release() later drops that pointer with dlm_put().  That
only works if open successfully pins the struct dlm_ctxt.

Today open calls dlm_grab(dlm) but ignores its return value.  Once the
last domain unregister has removed the context from dlm_domains,
dlm_grab() returns NULL, yet open still stores the raw pointer and returns
success.  The later release path is outside the debugfs removal barrier,
so it can call dlm_put() after dlm_free_ctxt_mem() has freed the context.
KASAN reports this as a slab-use-after-free in dlm_put() called from
debug_lockres_release().

Fail the open when dlm_grab() cannot acquire the reference and unwind the
seq_file private state before returning.  That keeps locking_state from
handing out a file descriptor whose release path does not own the
dlm_ctxt.
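
The open/release contract can be modelled in a few lines. struct ctx, ctx_grab(), struct handle and the open/release helpers are hypothetical, and the -ENODEV error code is chosen only for illustration; the point is that open stores a pointer only when the grab succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context: 'registered' models membership in dlm_domains. */
struct ctx {
	int refs;
	bool registered;
};

/* Take a reference only while the context is still registered. */
static struct ctx *ctx_grab(struct ctx *c)
{
	if (!c || !c->registered)
		return NULL;
	c->refs++;
	return c;
}

struct handle {
	struct ctx *owned;  /* non-NULL only when a reference is held */
};

/* Fail the open instead of storing a pointer that is not pinned. */
static int handle_open(struct handle *h, struct ctx *c)
{
	struct ctx *pinned = ctx_grab(c);

	if (!pinned)
		return -ENODEV;
	h->owned = pinned;
	return 0;
}

/* Release drops only a reference that open actually acquired. */
static void handle_release(struct handle *h)
{
	if (h->owned) {
		h->owned->refs--;
		h->owned = NULL;
	}
}
```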

The buggy scenario involves two paths, with each column showing the order
within that path:

locking_state debugfs open:          last domain unregister:
1. debug_lockres_open() reads        1. dlm_unregister_domain() calls
   inode->i_private.                    dlm_complete_dlm_shutdown().
2. debug_lockres_open() calls        2. shutdown removes the dlm_ctxt from
   dlm_grab(dlm) and gets NULL.         dlm_domains.
3. open still stores the raw dlm     3. final teardown reaches
   pointer in dl->dl_ctxt and           dlm_free_ctxt_mem() and frees it.
   returns success.
4. debug_lockres_release() later
   calls dlm_put(dl->dl_ctxt).

Validation reproduced this kernel report:
KASAN slab-use-after-free in dlm_put+0x82/0x200
RIP: 0033:0x7f4d349bc9e0
The buggy address belongs to the object at ffff888103a3c000 which belongs
to the cache kmalloc-2k of size 2048
The buggy address is located 816 bytes inside of freed 2048-byte region
[ffff888103a3c000, ffff888103a3c800)
Write of size 4
Call trace:
  dump_stack_lvl+0x66/0xa0 (?:?)
  print_report+0xd0/0x630 (?:?)
  dlm_put+0x82/0x200 (?:?)
  srso_alias_return_thunk+0x5/0xfbef5 (?:?)
  __virt_addr_valid+0x188/0x2f0 (?:?)
  kasan_report+0xe4/0x120 (?:?)
  kasan_check_range+0x105/0x1b0 (?:?)
  debug_lockres_release+0x53/0x80 (fs/ocfs2/dlm/dlmdebug.c:587)
  dlm_put+0x9/0x200 (?:?)
  debug_lockres_release+0x5c/0x80 (fs/ocfs2/dlm/dlmdebug.c:587)
  full_proxy_release+0x67/0x90 (?:?)
  __fput+0x1df/0x4b0 (?:?)
  do_raw_spin_lock+0x10f/0x1b0 (?:?)
  fput_close_sync+0xd2/0x170 (?:?)
  __x64_sys_close+0x55/0x90 (?:?)
  do_syscall_64+0x10c/0x640 (arch/x86/entry/syscall_64.c:87)
  irqentry_exit+0xac/0x6e0 (?:?)
  entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?)
Freed by task stack:
  kasan_save_stack+0x33/0x60 (?:?)
  kasan_save_track+0x14/0x30 (?:?)
  kasan_save_free_info+0x3b/0x60 (?:?)
  __kasan_slab_free+0x5f/0x80 (?:?)
  kfree+0x30f/0x580 (?:?)
  dlm_put+0x1ce/0x200 (?:?)
  dlm_unregister_domain+0xf6/0xb30 (?:?)
  o2cb_cluster_disconnect+0x6b/0x90 (?:?)
  ocfs2_cluster_disconnect+0x41/0x70 (?:?)
  ocfs2_dlm_shutdown+0x1c4/0x220 (?:?)
  ocfs2_dismount_volume+0x38a/0x550 (?:?)
  generic_shutdown_super+0xc3/0x220 (?:?)
  kill_block_super+0x29/0x60 (?:?)
  deactivate_locked_super+0x66/0xe0 (?:?)
  cleanup_mnt+0x13d/0x210 (?:?)
  task_work_run+0xfa/0x170 (?:?)
  exit_to_user_mode_loop+0xd6/0x430 (?:?)
  do_syscall_64+0x3cb/0x640 (arch/x86/entry/syscall_64.c:87)
  entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?)

Link: https://lore.kernel.org/20260531044714.1640172-1-rollkingzzc@gmail.com
Fixes: 4e3d24ed1a12 ("ocfs2/dlm: Dumps the lockres' into a debugfs file")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  ocfs2: reject FITRIM ranges shorter than a cluster
Zhang Cen [Thu, 28 May 2026 15:12:47 +0000 (23:12 +0800)] 
ocfs2: reject FITRIM ranges shorter than a cluster

ocfs2_trim_mainbm() trims the global bitmap in cluster units, but its
too-short range validation only checks sb->s_blocksize.

On filesystems with a cluster size larger than the block size, a FITRIM
range that is at least one block but shorter than one cluster is accepted
and shifted down to len == 0.  The later start + len - 1 and len -= ...
arithmetic then underflows and can drive trimming past the requested
range.

Reject ranges shorter than s_clustersize instead.  That preserves the
existing -EINVAL behavior for requests that cannot discard even one
allocation unit and keeps zero-cluster trims out of the group walk.
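
The arithmetic can be illustrated with a small userspace sketch. The helper names, the 64-bit widths and the use of -1 in place of -EINVAL are assumptions made for illustration; they are not taken from ocfs2_trim_mainbm().

```c
#include <stdint.h>

/* Cluster-sized check: -1 (standing in for -EINVAL) if the range is too short. */
static int trim_range_ok(uint64_t len_bytes, uint32_t clustersize)
{
	return len_bytes < clustersize ? -1 : 0;
}

/* Byte length converted to whole clusters; short ranges shift down to 0. */
static uint64_t bytes_to_clusters(uint64_t len_bytes, unsigned int clustersize_bits)
{
	return len_bytes >> clustersize_bits;
}

/*
 * Last cluster of the range. With len == 0 this yields start - 1,
 * which wraps to UINT64_MAX when start == 0.
 */
static uint64_t last_cluster(uint64_t start, uint64_t len)
{
	return start + len - 1;
}
```

With 4 KiB blocks and 1 MiB clusters, an 8 KiB request passes a block-sized check, converts to zero clusters, and produces a wrapped last cluster. The cluster-sized check rejects it before any of that arithmetic runs.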

Link: https://lore.kernel.org/20260528151247.361854-1-rollkingzzc@gmail.com
Fixes: aa89762c5480 ("ocfs2: return EINVAL if the given range to discard is less than block size")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  ocfs2: validate fast symlink target during inode read
Zhang Cen [Thu, 28 May 2026 15:12:30 +0000 (23:12 +0800)] 
ocfs2: validate fast symlink target during inode read

ocfs2_validate_inode_block() already rejects several inconsistent
self-contained dinodes before they are exposed to the rest of the
filesystem.  Fast symlinks need the same treatment.

A zero-cluster symlink is treated as a fast symlink and later read through
page_get_link() and ocfs2_fast_symlink_read_folio().  That path uses
strnlen() on the inline payload and then copies len + 1 bytes into the
folio.  If a corrupt dinode stores an i_size that does not fit the inline
area or omits the terminating NUL at i_size, that copy reads past the end
of the inode block buffer.

Reject zero-cluster symlink dinodes whose i_size exceeds the inline
fast-symlink capacity or whose inline payload is not NUL-terminated
exactly at i_size when the inode block is validated.  This keeps malformed
fast symlinks from reaching the read path.
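
A minimal sketch of such a check, assuming a hypothetical helper fast_symlink_ok() whose capacity argument counts the bytes available for the inline payload including its terminating NUL. The real validation lives in ocfs2_validate_inode_block() and may be structured differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * inline_area: start of the inline symlink payload inside the inode block
 * capacity:    bytes available for the payload, including the NUL (assumed)
 * i_size:      symlink length recorded in the dinode
 */
static bool fast_symlink_ok(const char *inline_area, size_t capacity,
			    uint64_t i_size)
{
	const char *nul;

	/* The target plus its terminating NUL must fit in the inline area. */
	if (i_size >= capacity)
		return false;

	/* The first NUL in the inline area must sit exactly at i_size. */
	nul = memchr(inline_area, '\0', capacity);
	return nul && (uint64_t)(nul - inline_area) == i_size;
}
```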

Validation reproduced this kernel report:
KASAN use-after-free in ocfs2_fast_symlink_read_folio+0x12c/0x1f0
RIP: 0033:0x7f5c6d859aa7
Read of size 3905
Call trace:
  dump_stack_lvl+0x66/0xa0 (?:?)
  print_report+0xce/0x630 (?:?)
  ocfs2_fast_symlink_read_folio+0x12c/0x1f0 (fs/ocfs2/inode.c:?)
  srso_alias_return_thunk+0x5/0xfbef5 (?:?)
  __virt_addr_valid+0x19f/0x330 (?:?)
  kasan_report+0xe0/0x110 (?:?)
  kasan_check_range+0x105/0x1b0 (?:?)
  __asan_memcpy+0x23/0x60 (?:?)
  filemap_read_folio+0x27/0xe0 (?:?)
  filemap_read_folio+0x35/0xe0 (?:?)
  do_read_cache_folio+0x138/0x230 (?:?)
  __page_get_link+0x26/0x110 (?:?)
  page_get_link+0x2e/0x70 (?:?)
  vfs_readlink+0x15e/0x250 (?:?)
  touch_atime+0x4d/0x370 (?:?)
  do_readlinkat+0x186/0x200 (?:?)
  do_user_addr_fault+0x65a/0x890 (?:?)
  __x64_sys_readlink+0x46/0x60 (?:?)
  do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
  entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?)

Link: https://lore.kernel.org/20260528151230.361127-1-rollkingzzc@gmail.com
Fixes: ea022dfb3c2a ("ocfs: simplify symlink handling")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Gui-Dong Han <2045gemini@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  ocfs2: add journal NULL check in ocfs2_checkpoint_inode()
Joseph Qi [Sun, 31 May 2026 13:16:45 +0000 (21:16 +0800)] 
ocfs2: add journal NULL check in ocfs2_checkpoint_inode()

During unmount, ocfs2_journal_shutdown() frees the journal and sets
osb->journal to NULL. Later, when VFS evicts remaining cached inodes,
ocfs2_evict_inode() -> ocfs2_clear_inode() -> ocfs2_checkpoint_inode()
-> ocfs2_ci_fully_checkpointed() dereferences osb->journal, causing a
NULL pointer dereference.

Fix this by adding a NULL check for osb->journal in
ocfs2_checkpoint_inode(). If the journal is NULL, it has already been
fully flushed and destroyed during shutdown, so there is nothing to
checkpoint.
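
The shape of the guard can be sketched in userspace as follows. struct journal, struct super and the return value of checkpoint_inode() are hypothetical stand-ins chosen for illustration, not the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical: number of transactions not yet checkpointed. */
struct journal {
	int pending;
};

/* Hypothetical stand-in for ocfs2_super. */
struct super {
	struct journal *journal; /* NULL once shutdown has freed it */
};

/* Returns the number of transactions flushed; 0 when there is nothing to do. */
static int checkpoint_inode(struct super *osb)
{
	int flushed;

	/* Journal already flushed and destroyed: nothing to checkpoint. */
	if (!osb->journal)
		return 0;

	flushed = osb->journal->pending;
	osb->journal->pending = 0;
	return flushed;
}
```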

Link: https://lore.kernel.org/20260531131645.3650299-1-joseph.qi@linux.alibaba.com
Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Fixes: da5e7c87827e ("ocfs2: cleanup journal init and shutdown")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  ocfs2: fix buffer head management in ocfs2_read_blocks()
Dmitry Antipov [Fri, 29 May 2026 09:41:28 +0000 (12:41 +0300)] 
ocfs2: fix buffer head management in ocfs2_read_blocks()

In ocfs2_read_blocks(), the caller shouldn't assume that the buffer head
returned by 'sb_getblk()' is exclusively owned, and therefore that
'put_bh()' always drops b_count from 1 to 0.  If another holder exists,
the buffer head stays referenced and is likely to be returned unchanged by
the next call to 'sb_getblk()', that is, with the BH_Uptodate bit set even
though it previously failed validation.  That allows the buffer head to be
inserted into the OCFS2 metadata cache and passed to upper layers.  To
avoid such a scenario, BH_Uptodate should be cleared immediately after the
'validate()' callback has detected a data inconsistency.
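
The effect can be modelled in userspace. In this sketch struct buf, validate_positive() and read_and_validate() are hypothetical; the uptodate flag stands in for BH_Uptodate and b_count for the buffer head reference count. It only illustrates why the flag has to be cleared when another holder keeps the buffer alive.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a buffer head. */
struct buf {
	int b_count;   /* reference count; other holders may exist */
	bool uptodate; /* models BH_Uptodate */
	int data;      /* models the block contents */
};

/* Hypothetical validator: contents must be positive. */
static int validate_positive(const struct buf *b)
{
	return b->data > 0 ? 0 : -1;
}

static int read_and_validate(struct buf *b, int (*validate)(const struct buf *))
{
	int status;

	b->b_count++;               /* models sb_getblk() taking a reference */
	if (!b->uptodate)
		b->uptodate = true; /* models a completed read from disk */

	status = validate(b);
	if (status)
		b->uptodate = false; /* do not leave bad data marked valid */

	b->b_count--;               /* models put_bh(); may not reach zero */
	return status;
}
```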

Link: https://lore.kernel.org/20260529094128.494293-1-dmantipov@yandex.ru
Fixes: cf76c78595ca ("ocfs2: don't put and assigning null to bh allocated outside")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+caacd220635a9cc3bac9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=caacd220635a9cc3bac9
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  raid6: use kmalloc() in raid6_select_algo()
Mike Rapoport (Microsoft) [Thu, 28 May 2026 09:53:01 +0000 (12:53 +0300)] 
raid6: use kmalloc() in raid6_select_algo()

raid6_select_algo() allocates 8 pages for a buffer that is used
as a scratch area for selecting the best algorithm.

This buffer can be allocated with kmalloc(), as nothing about it requires
going directly to the page allocator.

kmalloc() provides a better API than the ancient __get_free_pages().
kmalloc() does not require ugly casts and kfree() does not need to know the
size of the freed object.

There is no performance difference because kmalloc() redirects allocations
of such size to the page allocator.

Replace __get_free_pages() call with kmalloc().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Link: https://lore.kernel.org/20260528-lib-v4-2-4e3ad1277279@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@kernel.org>
Cc: Li Nan <linan122@huawei.com>
Cc: Song Liu <song@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  xor: use kmalloc() in calibrate_xor_blocks()
Mike Rapoport (Microsoft) [Thu, 28 May 2026 09:53:00 +0000 (12:53 +0300)] 
xor: use kmalloc() in calibrate_xor_blocks()

Patch series "lib/raid: replace __get_free_pages() call with kmalloc()", v4.

This patch (of 2):

The xor benchmark allocates 4 pages for a scratch buffer that is used
purely as a CPU-only XOR working area.

This buffer can be allocated with kmalloc(), as nothing about it requires
going directly to the page allocator.

kmalloc() provides a better API than the ancient __get_free_pages().
kmalloc() does not require ugly casts and kfree() does not need to know the
size of the freed object.

There is no performance difference because kmalloc() redirects allocations
of such size to the page allocator.

Replace __get_free_pages() call with kmalloc().

Link: https://lore.kernel.org/20260528-lib-v4-0-4e3ad1277279@kernel.org
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Link: https://lore.kernel.org/20260528-lib-v4-1-4e3ad1277279@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Li Nan <linan122@huawei.com>
Cc: Song Liu <song@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 weeks ago  ocfs2: reject oversized group bitmap descriptors
Zhang Cen [Sun, 24 May 2026 11:12:48 +0000 (19:12 +0800)] 
ocfs2: reject oversized group bitmap descriptors

ocfs2_validate_gd_parent() only bounds bg_bits against the parent
allocator's chain geometry.  A malicious descriptor can still claim a
bg_size/bg_bits pair that exceeds the bitmap bytes that physically fit in
the group descriptor block, so later bitmap scans and bit updates can run
past bg_bitmap.

Add a physical-cap check based on ocfs2_group_bitmap_size() for the parent
allocator type and reject descriptors whose bg_size or bg_bits exceed that
capacity.  Keep the existing chain geometry check so both the on-disk
bitmap layout and the allocator metadata must agree before the descriptor
is used.
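
A minimal sketch of the physical-cap check, assuming a hypothetical helper gd_fits_block() whose max_bitmap_bytes argument stands in for the value ocfs2_group_bitmap_size() would return for the parent allocator type. The existing chain geometry check is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * bg_size:          bitmap size in bytes claimed by the descriptor
 * bg_bits:          number of bits claimed by the descriptor
 * max_bitmap_bytes: bitmap bytes that physically fit in the block (assumed input)
 */
static bool gd_fits_block(uint16_t bg_size, uint16_t bg_bits,
			  uint16_t max_bitmap_bytes)
{
	/* The claimed byte size must fit in the descriptor block. */
	if (bg_size > max_bitmap_bytes)
		return false;

	/* The claimed bit count must fit in the same physical bitmap. */
	if (bg_bits > 8u * max_bitmap_bytes)
		return false;

	return true;
}
```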

Validation reproduced this kernel report:
KASAN use-after-free in _find_next_bit+0x7f/0xc0
Read of size 8
Call trace:
  dump_stack_lvl+0x66/0xa0 (?:?)
  print_report+0xd0/0x630 (?:?)
  _find_next_bit+0x7f/0xc0 (?:?)
  srso_alias_return_thunk+0x5/0xfbef5 (?:?)
  __virt_addr_valid+0x188/0x2f0 (?:?)
  kasan_report+0xe4/0x120 (?:?)
  ocfs2_find_max_contig_free_bits+0x35/0x70 (fs/ocfs2/suballoc.c:1375)
  ocfs2_block_group_set_bits+0x472/0x4b0 (fs/ocfs2/suballoc.c:1457)
  ocfs2_cluster_group_search+0x16b/0x440 (fs/ocfs2/suballoc.c:86)
  ocfs2_bg_discontig_fix_result+0x1ef/0x230 (fs/ocfs2/suballoc.c:1786)
  ocfs2_search_chain+0x8f8/0x10a0 (fs/ocfs2/suballoc.c:1886)
  get_page_from_freelist+0x70e/0x2370 (?:?)
  lock_release+0xc6/0x290 (?:?)
  do_raw_spin_unlock+0x9a/0x100 (?:?)
  kasan_unpoison+0x27/0x60 (?:?)
  __bfs+0x147/0x240 (?:?)
  get_page_from_freelist+0x83d/0x2370 (?:?)
  ocfs2_claim_suballoc_bits+0x38c/0xe70 (fs/ocfs2/suballoc.c:96)
  sched_domains_numa_masks_clear+0x70/0xd0 (?:?)
  check_irq_usage+0xe8/0xb70 (?:?)
  __ocfs2_claim_clusters+0x18d/0x4c0 (fs/ocfs2/suballoc.c:2497)
  check_path+0x24/0x50 (?:?)
  rcu_is_watching+0x20/0x50 (?:?)
  check_prev_add+0xfd/0xd00 (?:?)
  ocfs2_add_clusters_in_btree+0x17d/0x810 (fs/ocfs2/suballoc.c:?)
  __folio_batch_add_and_move+0x1f5/0x3d0 (?:?)
  ocfs2_add_inode_data+0xd9/0x120 (fs/ocfs2/suballoc.c:?)
  filemap_add_folio+0x105/0x1f0 (?:?)
  ocfs2_write_begin_nolock+0x29f7/0x2f80 (fs/ocfs2/suballoc.c:3043)
  ocfs2_read_inode_block+0xb5/0x110 (fs/ocfs2/suballoc.c:?)
  down_write+0xf5/0x180 (?:?)
  ocfs2_write_begin+0x180/0x240 (fs/ocfs2/suballoc.c:?)
  __mark_inode_dirty+0x758/0x9a0 (?:?)
  inode_to_bdi+0x41/0x90 (?:?)
  balance_dirty_pages_ratelimited_flags+0xf8/0x1d0 (?:?)
  generic_perform_write+0x252/0x440 (?:?)
  mnt_put_write_access_file+0x16/0x70 (?:?)
  file_update_time_flags+0xe4/0x200 (?:?)
  ocfs2_file_write_iter+0x80a/0x1320 (fs/ocfs2/suballoc.c:?)
  lock_acquire+0x184/0x2f0 (?:?)
  ksys_write+0xd2/0x170 (?:?)
  apparmor_file_permission+0xf5/0x310 (?:?)
  read_zero+0x8d/0x140 (?:?)
  lock_is_held_type+0x8f/0x100 (?:?)

Link: https://lore.kernel.org/20260524111248.1429884-1-rollkingzzc@gmail.com
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>