Al Viro [Wed, 25 Jun 2025 19:02:11 +0000 (15:02 -0400)]
invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED()
All places where we call set_mnt_shared() are guaranteed to have
non-zero ->mnt_group_id - either by explicit test, or by having
done successful invent_group_ids() covering the same mount since
we'd grabbed namespace_sem.
The opposite combination (non-zero ->mnt_group_id and !IS_MNT_SHARED())
*is* possible - it means that we have allocated group id, but didn't
get around to set_mnt_shared() yet; such state is transient -
by the time we do namespace_unlock(), we must either do set_mnt_shared()
or unroll the group id allocations by cleanup_group_ids().
Al Viro [Wed, 25 Jun 2025 18:48:50 +0000 (14:48 -0400)]
get rid of CL_SHARE_TO_SLAVE
the only difference between it and CL_SLAVE is in this predicate
in clone_mnt():
if ((flag & CL_SLAVE) ||
((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
However, in case of CL_SHARED_TO_SLAVE we have not allocated any
mount group ids since the time we'd grabbed namespace_sem, so
IS_MNT_SHARED() is equivalent to non-zero ->mnt_group_id. And
in case of CL_SLAVE old has come either from the original tree,
which had ->mnt_group_id allocated for all nodes or from result
of sequence of CL_MAKE_SHARED or CL_MAKE_SHARED|CL_SLAVE copies,
ultimately going back to the original tree. In both cases we are
guaranteed that old->mnt_group_id will be non-zero.
In other words, the predicate is always equal to
(flags & (CL_SLAVE | CL_SHARED_TO_SLAVE)) && old->mnt_group_id
and with that replacement CL_SLAVE and CL_SHARED_TO_SLAVE have exact
same behaviour.
Al Viro [Wed, 18 Jun 2025 22:23:41 +0000 (18:23 -0400)]
take freeing of emptied mnt_namespace to namespace_unlock()
Freeing of a namespace must be delayed until after we'd dealt with mount
notifications (in namespace_unlock()). The reasons are not immediately
obvious (they are buried in ->prev_ns handling in mnt_notify()), and
having that free_mnt_ns() explicitly called after namespace_unlock()
is asking for trouble - it does feel like they should be OK to free
as soon as they've been emptied.
Make the things more explicit by setting 'emptied_ns' under namespace_sem
and having namespace_unlock() free the sucker as soon as it's safe to free.
Al Viro [Wed, 18 Jun 2025 01:35:22 +0000 (21:35 -0400)]
copy_tree(): don't link the mounts via mnt_list
The only place that really needs to be adjusted is commit_tree() -
there we need to iterate through the copy and we might as well
use next_mnt() for that. However, in case when our tree has been
slid under something already mounted (propagation to a mountpoint
that already has something mounted on it or a 'beneath' move_mount)
we need to take care not to walk into the overmounting tree.
Al Viro [Wed, 25 Jun 2025 03:36:43 +0000 (23:36 -0400)]
turn do_make_slave() into transfer_propagation()
Lift calculation of replacement propagation source, removal from
peer group and assignment of ->mnt_master from do_make_slave() into
change_mnt_propagation() itself. What remains is switching of
what used to get propagation *through* mnt to alternative source.
Rename to transfer_propagation(), passing it the replacement source
as the second argument. Have it return void, while we are at it.
Al Viro [Tue, 24 Jun 2025 22:12:56 +0000 (18:12 -0400)]
do_make_slave(): choose new master sanely
When mount changes propagation type so that it doesn't propagate
events any more (MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE), we need
to make sure that event propagation between other mounts is
unaffected.
We need to make sure that events from peers and master of that mount
(if any) still reach everything that used to be on its ->mnt_slave_list.
If mount has neither peers nor master, we simply need to dissolve
its ->mnt_slave_list and clear ->mnt_master of everything in there.
If mount has peers, we transfer everything in ->mnt_slave_list of
this mount into that of some of those peers (and adjust ->mnt_master
accordingly).
If mount has a master but no peers, we transfer everything in
->mnt_slave_list of this mount into that of its master (adjusting
->mnt_master, etc.).
There are two problems with the current implementation:
* there's a long-obsolete logics in choosing the peer -
once upon a time it made sense to prefer the peer that had the
same ->mnt_root as our mount, but that had been pointless since
2014 ("smarter propagate_mnt()")
* the most common caller of that thing is umount_tree()
taking the mounts out of propagation graph. In that case it's
possible to have ->mnt_slave_list contents moved many times,
since the replacement master is likely to be taken out by the
same umount_tree(), etc.
Take the choice of replacement master into a separate function
(propagation_source()) and teach it to skip the candidates that
are going to be taken out.
Al Viro [Sat, 28 Jun 2025 03:27:48 +0000 (23:27 -0400)]
propagate_mnt(): get rid of last_dest
Its only use is choosing the type of copy - CL_MAKE_SHARED if there
already is a copy in that peer group, CL_SLAVE or CL_SLAVE | CL_MAKE_SHARED
otherwise.
But that's easy to keep track of - just set type in the beginning of group
and reset to CL_MAKE_SHARED after the first created secondary in it...
Al Viro [Sat, 28 Jun 2025 03:09:47 +0000 (23:09 -0400)]
propagate_one(): separate the "what should be the master for this copy" part
When we create the first copy for a peer group, it becomes a slave of
one of the existing copies; take that logics into a separate helper -
find_master(parent, last_copy, original).
Al Viro [Sat, 21 Jun 2025 21:41:40 +0000 (17:41 -0400)]
propagate_one(): get rid of dest_master
propagate_mnt() takes the subtree we are about to attach and creates
its copies, setting the propagation between those. Each copy is cloned
either from the original or from one of the already created copies.
The tricky part is choosing the right copy to serve as a master when we
are starting a new peer group.
The algorithm for doing that selection puts temporary marks on the masters
of mountpoints that already got a copy created for them; since the initial
peer group might have no master at all, we need to special-case that when
looking for the mark. Currently we do that by memorizing the master of
original peer group. It works, but we get yet another piece of data to
pass from propagate_mnt() to propagate_one().
Alternative is to mark the master of original peer group if not NULL,
turning the check into "master is NULL or marked". Less data to pass
around and memory safety is more obvious that way...
Al Viro [Sat, 21 Jun 2025 22:06:19 +0000 (18:06 -0400)]
mount: separate the flags accessed only under namespace_sem
Several flags are updated and checked only under namespace_sem; we are
already making use of that when we are checking them without mount_lock,
but we have to hold mount_lock for all updates, which makes things
clumsier than they have to be.
Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE
into a separate field (->mnt_t_flags), renaming them to T_SHARED,
etc. to avoid confusion. All accesses must be under namespace_sem.
That changes locking requirements for mnt_change_propagation() and
set_mnt_shared() - only namespace_sem is needed now. The same goes
for SET_MNT_MARKED et.al.
There might be more flags moved from ->mnt_flags to that field;
this is just the initial set.
Al Viro [Mon, 28 Apr 2025 02:53:13 +0000 (22:53 -0400)]
don't have mounts pin their parents
Simplify the rules for mount refcounts. Current rules include:
* being a namespace root => +1
* being someone's child => +1
* being someone's child => +1 to parent's refcount, unless you've
already been through umount_tree().
The last part is not needed at all. It makes for more places where need
to decrement refcounts and it creates an asymmetry between the situations
for something that has never been a part of a namespace and something that
left one, both for no good reason.
If mount's refcount has additions from its children, we know that
* it's either someone's child itself (and will remain so
until umount_tree(), at which point contributions from children
will disappear), or
* or is the root of namespace (and will remain such until
it either becomes someone's child in another namespace or goes through
umount_tree()), or
* it is the root of some tree copy, and is currently pinned
by the caller of copy_tree() (and remains such until it either gets
into namespace, or goes to umount_tree()).
In all cases we already have contribution(s) to refcount that will last
as long as the contribution from children remains. In other words, the
lifetime is not affected by refcount contributions from children.
It might be useful for "is it busy" checks, but those are actually
no harder to express without it.
NB: propagate_mnt_busy() part is an equivalent transformation, ugly as it
is; the current logics is actually wrong and may give false negatives,
but fixing that is for a separate patch (probably earlier in the queue).
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Apr 2025 00:21:23 +0000 (20:21 -0400)]
get rid of mountpoint->m_count
struct mountpoint has an odd kinda-sorta refcount in it. It's always
either equal to or one above the number of mounts attached to that
mountpoint.
"One above" happens when a function takes a temporary reference to
mountpoint. Things get simpler if we express that as inserting
a local object into ->m_list and removing it to drop the reference.
New calling conventions:
1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint()
take an extra struct pinned_mountpoint * argument and returns 0/-E...
(or true/false in case of lookup_mountpoint()) instead of returning
struct mountpoint pointers. In case of success, the struct mountpoint *
we used to get can be found as pinned_mountpoint.mp
2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes
an address of struct pinned_mountpoint - the same that had been passed to
lock_mount()/do_lock_mount().
3) put_mountpoint() for a temporary reference (paired with get_mountpoint()
or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes
the address of pinned_mountpoint we passed to matching {get,lookup}_mountpoint().
4) all instances of pinned_mountpoint are local variables; they always live on
stack. {} is used for initializer, after successful {get,lookup}_mountpoint()
we must make sure to call unpin_mountpoint() before leaving the scope and
after successful {do_,}lock_mount() we must make sure to call unlock_mount()
before leaving the scope.
5) all manipulations of ->m_count are gone, along with ->m_count itself.
struct mountpoint lives while its ->m_list is non-empty.
Al Viro [Fri, 25 Apr 2025 21:24:10 +0000 (17:24 -0400)]
combine __put_mountpoint() with unhash_mnt()
A call of unhash_mnt() is immediately followed by passing its return
value to __put_mountpoint(); the shrink list given to __put_mountpoint()
will be ex_mountpoints when called from umount_mnt() and list when called
from mntput_no_expire().
Replace with __umount_mnt(mount, shrink_list), moving the call of
__put_mountpoint() into it (and returning nothing), adjust the
callers.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 25 Apr 2025 20:53:01 +0000 (16:53 -0400)]
pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint()
attach new_mnt *before* detaching root_mnt; that way we don't need to keep hold
on the mountpoint and one more pair of unhash_mnt()/put_mountpoint() gets
folded together into umount_mnt().
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 2 May 2025 00:40:57 +0000 (20:40 -0400)]
take ->mnt_expire handling under mount_lock [read_seqlock_excl]
Doesn't take much massage, and we no longer need to make sure that
by the time of final mntput() the victim has been removed from the
list. Makes life safer for ->d_automount() instances...
Rules:
* all ->mnt_expire accesses are under mount_lock.
* insertion into the list is done by mnt_set_expiry(), and
caller (->d_automount() instance) must hold a reference to mount
in question. It shouldn't be done more than once for a mount.
* if a mount on an expiry list is not yet mounted, it will
be ignored by anything that walks that list.
* if the final mntput() finds its victim still on an expiry
list (in which case it must've never been mounted - umount_tree()
would've taken it out), it will remove the victim from the list.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 1 May 2025 23:59:30 +0000 (19:59 -0400)]
attach_recursive_mnt(): remove from expiry list on move
... rather than doing that in do_move_mount(). That's the main
obstacle to moving the protection of ->mnt_expire from namespace_sem
to mount_lock (spinlock-only), which would simplify several failure
exits.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 25 Apr 2025 16:55:39 +0000 (12:55 -0400)]
do_move_mount(): take dropping the old mountpoint into attach_recursive_mnt()
... and fold it with unhash_mnt() there - there's no need to retain a reference
to old_mp beyond that point, since by then all mountpoints we were going to add
are either explicitly pinned by get_mountpoint() or have stuff already added
to them.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Apr 2025 02:40:48 +0000 (22:40 -0400)]
attach_recursive_mnt(): unify the mnt_change_mountpoint() logics
The logics used for tucking under existing mount differs for original
and copies; copies do a mount hash lookup to see if mountpoint to be is
already overmounted, while the original is told explicitly.
But the same logics that is used for copies works for the original,
at which point the only place where we get very close to eliminating
the need of passing 'beneath' flag to attach_recursive_mnt().
Al Viro [Sat, 26 Apr 2025 02:34:33 +0000 (22:34 -0400)]
make commit_tree() usable in same-namespace move case
Once attach_recursive_mnt() has created all copies of original subtree,
it needs to put them in place(s).
Steps needed for those are slightly different:
1) in 'move' case, original copy doesn't need any rbtree
manipulations (everything's already in the same namespace where it will
be), but it needs to be detached from the current location
2) in 'attach' case, original may be in anon namespace; if it is,
all those mounts need to removed from their current namespace before
insertion into the target one
3) additional copies have a couple of extra twists - in case
of cross-userns propagation we need to lock everything other the root of
subtree and in case when we end up inserting under an existing mount,
that mount needs to be found (for original copy we have it explicitly
passed by the caller).
Quite a bit of that can be unified; as the first step, make commit_tree()
helper (inserting mounts into namespace, hashing the root of subtree
and marking the namespace as updated) usable in all cases; (2) and (3)
are already using it and for (1) we only need to make the insertion of
mounts into namespace conditional.
Al Viro [Thu, 15 May 2025 00:50:06 +0000 (20:50 -0400)]
Rewrite of propagate_umount()
The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).
I tried to prove that it's the only bug there; I'm still not sure
whether it is. If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation. Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.
I hoped to do transformation as a massage series, but that turns out
to be too convoluted. So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).
As far I can tell, it is provably correct and provably linear by the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...
Another nice thing that fell out of that is that ->mnt_umounting is no longer
needed.
Compared to the first version:
* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
* trim_ancestors() only clears that flag, leaving the suckers on list
* trim_one() and handle_locked() take the stuff with flag cleared off
the list. That allows to iterate with list_for_each_entry_safe() when calling
trim_one() - it removes at most one element from the list now.
* no globals - I didn't bother with any kind of context, not worth it.
* Notes updated accordingly; I have not touch the terms yet.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 3 May 2025 01:32:01 +0000 (21:32 -0400)]
sanitize handling of long-term internal mounts
Original rationale for those had been the reduced cost of mntput()
for the stuff that is mounted somewhere. Mount refcount increments and
decrements are frequent; what's worse, they tend to concentrate on the
same instances and cacheline pingpong is quite noticable.
As the result, mount refcounts are per-cpu; that allows a very cheap
increment. Plain decrement would be just as easy, but decrement-and-test
is anything but (we need to add the components up, with exclusion against
possible increment-from-zero, etc.).
Fortunately, there is a very common case where we can tell that decrement
won't be the final one - if the thing we are dropping is currently
mounted somewhere. We have an RCU delay between the removal from mount
tree and dropping the reference that used to pin it there, so we can
just take rcu_read_lock() and check if the victim is mounted somewhere.
If it is, we can go ahead and decrement without and further checks -
the reference we are dropping is not the last one. If it isn't, we
get all the fun with locking, carefully adding up components, etc.,
but the majority of refcount decrements end up taking the fast path.
There is a major exception, though - pipes and sockets. Those live
on the internal filesystems that are not going to be mounted anywhere.
They are not going to be _un_mounted, of course, so having to take the
slow path every time a pipe or socket gets closed is really obnoxious.
Solution had been to mark them as long-lived ones - essentially faking
"they are mounted somewhere" indicator.
With minor modification that works even for ones that do eventually get
dropped - all it takes is making sure we have an RCU delay between
clearing the "mounted somewhere" indicator and dropping the reference.
There are some additional twists (if you want to drop a dozen of such
internal mounts, you'd be better off with clearing the indicator on
all of them, doing an RCU delay once, then dropping the references),
but in the basic form it had been
* use kern_mount() if you want your internal mount to be
a long-term one.
* use kern_unmount() to undo that.
Unfortunately, the things did rot a bit during the mount API reshuffling.
In several cases we have lost the "fake the indicator" part; kern_unmount()
on the unmount side remained (it doesn't warn if you use it on a mount
without the indicator), but all benefits regaring mntput() cost had been
lost.
To get rid of that bitrot, let's add a new helper that would work
with fs_context-based API: fc_mount_longterm(). It's a counterpart
of fc_mount() that does, on success, mark its result as long-term.
It must be paired with kern_unmount() or equivalents.
Converted:
1) mqueue (it used to use kern_mount_data() and the umount side
is still as it used to be)
2) hugetlbfs (used to use kern_mount_data(), internal mount is
never unmounted in this one)
3) i915 gemfs (used to be kern_mount() + manual remount to set
options, still uses kern_unmount() on umount side)
4) v3d gemfs (copied from i915)
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 26 Apr 2025 23:17:28 +0000 (19:17 -0400)]
do_umount(): simplify the "is it still mounted" checks
Calls of do_umount() are always preceded by can_umount(), where we'd
done a racy check for mount belonging to our namespace; if it wasn't,
can_unmount() would've failed with -EINVAL and we wouldn't have
reached do_umount() at all.
That check needs to be redone once we have acquired namespace_sem
and in do_umount() we do that. However, that's done in a very odd
way; we check that mount is still in rbtree of _some_ namespace or
its mnt_list is not empty. It is equivalent to check_mnt(mnt) -
we know that earlier mnt was mounted in our namespace; if it has
stayed there, it's going to remain in rbtree of our namespace.
OTOH, if it ever had been removed from out namespace, it would be
removed from rbtree and it never would've re-added to a namespace
afterwards. As for ->mnt_list, for something that had been mounted
in a namespace we'll never observe non-empty ->mnt_list while holding
namespace_sem - it does temporarily become non-empty during
umount_tree(), but that doesn't outlast the call of umount_tree(),
let alone dropping namespace_sem.
Things get much easier to follow if we replace that with (equivalent)
check_mnt(mnt) there. What's more, currently we treat a failure of
that test as "quietly do nothing"; we might as well pretend that we'd
lost the race and fail on that the same way can_umount() would have.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 7 May 2025 18:05:50 +0000 (14:05 -0400)]
clone_mnt(): simplify the propagation-related logics
The underlying rules are simple:
* MNT_SHARED should be set iff ->mnt_group_id of new mount ends up
non-zero.
* mounts should be on the same ->mnt_share cyclic list iff they have
the same non-zero ->mnt_group_id value.
* CL_PRIVATE is mutually exclusive with MNT_SHARED, MNT_SLAVE,
MNT_SHARED_TO_SLAVE and MNT_EXPIRE; the whole point of that thing is to
get a clone of old mount that would *not* be on any namespace-related
lists.
The above allows to make the logics more straightforward; what's more,
it makes the proof that invariants are maintained much simpler.
The variant in mainline is safe (aside of a very narrow race with
unsafe modification of mnt_flags right after we had the mount exposed
in superblock's ->s_mounts; theoretically it can race with ro remount
of the original, but it's not easy to hit), but proof of its correctness
is really unpleasant.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 6 May 2025 22:48:05 +0000 (18:48 -0400)]
don't set MNT_LOCKED on parentless mounts
Originally MNT_LOCKED meant only one thing - "don't let this mount to
be peeled off its parent, we don't want to have its mountpoint exposed".
Accordingly, it had only been set on mounts that *do* have a parent.
Later it got overloaded with another use - setting it on the absolute
root had given free protection against umount(2) of absolute root
(was possible to trigger, oopsed). Not a bad trick, but it ended
up costing more than it bought us. Unfortunately, the cost included
both hard-to-reason-about logics and a subtle race between
mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in
the end of __do_loopback() could race with sb_prepare_remount_readonly()
setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should
be). The race wouldn't be much of a problem (there are other ways to
deal with it), but the subtlety is.
Turns out that nobody except umount(2) had ever made use of having
MNT_LOCKED set on absolute root. So let's give up on that trick,
clever as it had been, add an explicit check in do_umount() and
return to using MNT_LOCKED only for mounts that have a parent.
It means that
* clone_mnt() no longer copies MNT_LOCKED
* copy_tree() sets it on submounts if their counterparts had
been marked such, and does that right next to attach_mnt() in there,
in the same mount_lock scope.
* __do_loopback() no longer needs to strip MNT_LOCKED off the
root of subtree it's about to return; no store, no race.
* init_mount_tree() doesn't bother setting MNT_LOCKED on absolute
root.
* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root;
accordingly, its caller (loop in attach_recursive_mnt()) does not need to
bother stripping that MNT_LOCKED on root. Note that lock_mnt_tree() setting
MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt()
(from commit_tree()) that makes them reachable.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 18 Jun 2025 01:10:02 +0000 (21:10 -0400)]
__attach_mnt(): lose the second argument
It's always ->mnt_parent of the first one. What the function does is
making a mount (with already set parent and mountpoint) visible - in
mount hash and in the parent's list of children.
IOW, it takes the existing rootwards linkage and sets the matching
crownwards linkage.
Al Viro [Sun, 1 Jun 2025 04:34:32 +0000 (00:34 -0400)]
copy_tree(): don't set ->mnt_mountpoint on the root of copy
It never made any sense - neither when copy_tree() had been introduced
(2.4.11-pre5), nor at any point afterwards. Mountpoint is meaningless
without parent mount and the root of copied tree has no parent until we get
around to attaching it somewhere. At that time we'll have mountpoint set;
before that we have no idea which dentry will be used as mountpoint.
IOW, copy_tree() should just leave the default value.
Al Viro [Sat, 21 Jun 2025 02:46:55 +0000 (22:46 -0400)]
prevent mount hash conflicts
Currently it's still possible to run into a pathological situation when
two hashed mounts share both parent and mountpoint. That does not work
well, for obvious reasons.
We are not far from getting rid of that; the only remaining gap is
attach_recursive_mnt() not being careful enough when sliding a tree
under existing mount (for propagated copies or in 'beneath' case for
the original one).
To deal with that cleanly we need to be able to find overmounts
(i.e. mounts on top of parent's root); we could do hash lookups or scan
the list of children but either would be costly. Since one of the results
we get from that will be prevention of multiple parallel overmounts, let's
just bite the bullet and store a (non-counting) reference to overmount
in struct mount.
With that done, closing the hole in attach_recursive_mnt() becomes easy
- we just need to follow the chain of overmounts before we change the
mountpoint of the mount we are sliding things under.
Al Viro [Sat, 26 Apr 2025 02:18:12 +0000 (22:18 -0400)]
get rid of mnt_set_mountpoint_beneath()
mnt_set_mountpoint_beneath() consists of attaching new mount side-by-side
with the one we want to mount beneath (by mnt_set_mountpoint()), followed
by mnt_change_mountpoint() shifting the the top mount onto the new one
(by mnt_change_mountpoint()).
Both callers of mnt_set_mountpoint_beneath (both in attach_recursive_mnt())
have the same form - in 'beneath' case we call mnt_set_mountpoint_beneath(),
otherwise - mnt_set_mountpoint().
The thing is, expressing that as unconditional mnt_set_mountpoint(),
followed, in 'beneath' case, by mnt_change_mountpoint() is just as easy.
And these mnt_change_mountpoint() callers are similar to the ones we
do when it comes to attaching propagated copies, which will allow more
cleanups in the next commits.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 25 Apr 2025 16:40:28 +0000 (12:40 -0400)]
attach_mnt(): expand in attach_recursive_mnt(), then lose the flag argument
simpler that way - all but one caller pass false as 'beneath' argument,
and that one caller is actually happier with the call expanded - the
logics with choice of mountpoint is identical for 'moving' and 'attaching'
cases, and now that is no longer hidden.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sun, 29 Jun 2025 16:25:55 +0000 (09:25 -0700)]
Merge tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fix from Greg KH:
"Here is a single staging driver fix for 6.16-rc4. It resolves a build
error in the rtl8723bs driver for some versions of clang on arm64 when
checking the frame size with -Wframe-larger-than.
It has been in linux-next for a while now with no reported issues"
* tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: rtl8723bs: Avoid memset() in aes_cipher() and aes_decipher()
Linus Torvalds [Sun, 29 Jun 2025 16:21:27 +0000 (09:21 -0700)]
Merge tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are five small serial and tty and vt fixes for 6.16-rc4. Included
in here are:
- kerneldoc fixes for recent vt changes
- imx serial driver fix
- of_node sysfs fix for a regression
- vt missing notification fix
- 8250 dt bindings fix
All of these have been in linux-next for a while with no reported issues"
* tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
dt-bindings: serial: 8250: Make clocks and clock-frequency exclusive
serial: imx: Restore original RXTL for console to fix data loss
serial: core: restore of_node information in sysfs
vt: fix kernel-doc warnings in ucs_get_fallback()
vt: add missing notification when switching back to text mode
Linus Torvalds [Sun, 29 Jun 2025 15:43:54 +0000 (08:43 -0700)]
Merge tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:
- Consider secondary address mask registers in amd64_edac in order to
get the correct total memory size of the system
* tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Fix size calculation for Non-Power-of-Two DIMMs
Linus Torvalds [Sun, 29 Jun 2025 15:28:24 +0000 (08:28 -0700)]
Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure DR6 and DR7 are initialized to their architectural values
and not accidentally cleared, leading to misconfigurations
* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/traps: Initialize DR7 by writing its architectural reset value
x86/traps: Initialize DR6 by writing its architectural reset value
Linus Torvalds [Sun, 29 Jun 2025 15:16:02 +0000 (08:16 -0700)]
Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Make sure an AUX perf event is really disabled when it overruns
* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix pending disable flow when the AUX ring buffer overruns
Linus Torvalds [Sat, 28 Jun 2025 22:23:17 +0000 (15:23 -0700)]
Merge tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
- imx: fix SMBus protocol compliance during block read
- omap: fix error handling path in probe
- robotfuzz, tiny-usb: prevent zero-length reads
- x86, designware, amdisp: fix build error when modules are disabled
(agreed to go in via i2c)
- scx200_acb: fix build error because of missing HAS_IOPORT
* tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: scx200_acb: depends on HAS_IOPORT
i2c: omap: Fix an error handling path in omap_i2c_probe()
platform/x86: Use i2c adapter name to fix build errors
i2c: amd-isp: Initialize unique adapter name
i2c: designware: Initialize adapter name only when not set
i2c: tiny-usb: disable zero-length read messages
i2c: robotfuzz-osif: disable zero-length read messages
i2c: imx: fix emulated smbus block read
Linus Torvalds [Sat, 28 Jun 2025 18:39:24 +0000 (11:39 -0700)]
Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix possible UAF on error path in filter_free_subsystem_filters()
When freeing a subsystem filter, the filter for the subsystem is
passed in to be freed and all the events within the subsystem will
have their filter freed too. In order to free without waiting for RCU
synchronization, list items are allocated to hold what is going to be
freed to free it via a call_rcu(). If the allocation of these items
fails, it will call the synchronization directly and free after that
(causing a bit of delay for the user).
The subsystem filter is first added to this list and then the filters
for all the events under the subsystem. The bug is if one of the
allocations of the list items for the event filters fail to allocate,
it jumps to the "free_now" label which will free the subsystem
filter, then all the items on the allocated list, and then the event
filters that were not added to the list yet. But because the
subsystem filter was added first, it gets freed twice.
The solution is to add the subsystem filter after the events, and
then if any of the allocations fail it will not try to free any of
them twice
* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix filter logic error
Linus Torvalds [Sat, 28 Jun 2025 18:35:11 +0000 (11:35 -0700)]
Merge tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
- replace __ASSEMBLY__ with __ASSEMBLER__ in headers like others
- fix build warnings about export.h
- reserve the EFI memory map region for kdump
- handle __init vs inline mismatches
- fix some KVM bugs
* tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Disable updating of "num_cpu" and "feature"
LoongArch: KVM: Check validity of "num_cpu" from user space
LoongArch: KVM: Check interrupt route from physical CPU
LoongArch: KVM: Fix interrupt route update with EIOINTC
LoongArch: KVM: Add address alignment check for IOCSR emulation
LoongArch: KVM: Avoid overflow with array index
LoongArch: Handle KCOV __init vs inline mismatches
LoongArch: Reserve the EFI memory map region
LoongArch: Fix build warnings about export.h
LoongArch: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers
- Fix for regression in handling native Windows symlinks
- Three smbdirect fixes:
- oops in RDMA response processing
- smbdirect memcpy issue
- fix smbdirect regression with large writes (smbdirect test cases
now all passing)
- Fix for "FAILED_TO_PARSE" warning in trace-cmd report output
* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
cifs: Fix the smbd_response slab to allow usercopy
smb: client: fix potential deadlock when reconnecting channels
smb: client: remove \t from TP_printk statements
smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
smb: client: fix regression with native SMB symlinks
Linus Torvalds [Sat, 28 Jun 2025 03:34:10 +0000 (20:34 -0700)]
Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"16 hotfixes.
6 are cc:stable and the remainder address post-6.15 issues or aren't
considered necessary for -stable kernels. 5 are for MM"
* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add Lorenzo as THP co-maintainer
mailmap: update Duje Mihanović's email address
selftests/mm: fix validate_addr() helper
crashdump: add CONFIG_KEYS dependency
mailmap: correct name for a historical account of Zijun Hu
mailmap: add entries for Zijun Hu
fuse: fix runtime warning on truncate_folio_batch_exceptionals()
scripts/gdb: fix dentry_name() lookup
mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
mm/hugetlb: remove unnecessary holding of hugetlb_lock
MAINTAINERS: add missing files to mm page alloc section
MAINTAINERS: add tree entry to mm init block
mm: add OOM killer maintainer structure
fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
Linus Torvalds [Sat, 28 Jun 2025 03:22:18 +0000 (20:22 -0700)]
Merge tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V Fixes for 5.16-rc4
- .rodata is no longer linkd into PT_DYNAMIC.
It was not supposed to be there in the first place and resulted in
invalid (but unused) entries. This manifests as at least warnings in
llvm-readelf
- A fix for runtime constants with all-0 upper 32-bits. This should
only manifest on MMU=n kernels
- A fix for context save/restore on systems using the T-Head vector
extensions
- A fix for a conflicting "+r"/"r" register constraint in the VDSO
getrandom syscall wrapper, which is undefined behavior in clang
- A fix for a missing register clobber in the RVV raid6 implementation.
This manifests as a NULL pointer reference on some compilers, but
could trigger in other ways
- Misaligned accesses from userspace at faulting addresses are now
handled correctly
- A fix for an incorrect optimization that allowed access_ok() to mark
invalid addresses as accessible, which can result in userspace
triggering BUG()s
- A few fixes for build warnings, and an update to Drew's email address
* tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: export boot_cpu_hartid
Revert "riscv: Define TASK_SIZE_MAX for __access_ok()"
riscv: Fix sparse warning in vendor_extensions/sifive.c
Revert "riscv: misaligned: fix sleeping function called during misaligned access handling"
MAINTAINERS: Update Drew Fustini's email address
RISC-V: uaccess: Wrap the get_user_8 uaccess macro
raid6: riscv: Fix NULL pointer dereference caused by a missing clobber
RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
riscv: vector: Fix context save/restore with xtheadvector
riscv: fix runtime constant support for nommu kernels
riscv: vdso: Exclude .rodata from the PT_DYNAMIC segment
Linus Torvalds [Sat, 28 Jun 2025 02:38:36 +0000 (19:38 -0700)]
Merge tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Regular weekly drm updates, nothing out of the ordinary, amdgpu, xe,
i915 and a few misc bits. Seems about right for this time in the
release cycle.
core:
- fix drm_writeback_connector_cleanup function signature
- use correct HDMI audio bridge in drm_connector_hdmi_audio_init
bridge:
- SN65DSI86: fix HPD
amdgpu:
- Cleaner shader support for additional GFX9 GPUs
- MES firmware compatibility fixes
- Discovery error reporting fixes
- SDMA6/7 userq fixes
- Backlight fix
- EDID sanity check
i915:
- Fix for SNPS PHY HDMI for 1080p@120Hz
- Correct DP AUX DPCD probe address
- Followup build fix for GCOV and AutoFDO enabled config
xe:
- Missing error check
- Fix xe_hwmon_power_max_write
- Move flushes
- Explicitly exit CT safe mode on unwind
- Process deferred GGTT node removals on device unwind"
* tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe: Process deferred GGTT node removals on device unwind
drm/xe/guc: Explicitly exit CT safe mode on unwind
drm/xe: move DPT l2 flush to a more sensible place
drm/xe: Move DSB l2 flush to a more sensible place
drm/bridge: ti-sn65dsi86: Add HPD for DisplayPort connector type
drm/i915: fix build error some more
drm/xe/hwmon: Fix xe_hwmon_power_max_write
drm/xe/display: Add check for alloc_ordered_workqueue()
drm/amd/display: Add sanity checks for drm_edid_raw()
drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value
drm/amdgpu/sdma7: add ucode version checks for userq support
drm/amdgpu/sdma6: add ucode version checks for userq support
drm/amd: Adjust output for discovery error handling
drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
drm/bridge-connector: Fix bridge in drm_connector_hdmi_audio_init()
drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
drm/i915/snps_hdmi_pll: Fix 64-bit divisor truncation by using div64_u64
drm: writeback: Fix drm_writeback_connector_cleanup signature
Linus Torvalds [Sat, 28 Jun 2025 00:58:32 +0000 (17:58 -0700)]
Merge tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) fixes from Dave Jiang:
"These fixes address a few issues in the CXL subsystem, including
dealing with some bugs in the CXL EDAC and RAS drivers:
- Fix return value of cxlctl_validate_set_features()
- Fix min_scrub_cycle of a region miscaculation and add additional
documentation
- Fix potential memory leak issues for CXL EDAC
- Fix CPER handler device confusion for CXL RAS
- Fix using wrong repair type to check DRAM event record"
* tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/edac: Fix using wrong repair type to check dram event record
cxl/ras: Fix CPER handler device confusion
cxl/edac: Fix potential memory leak issues
cxl/Documentation: Add more description about min/max scrub cycle
cxl/edac: Fix the min_scrub_cycle of a region miscalculation
cxl: fix return value in cxlctl_validate_set_features()
Linus Torvalds [Sat, 28 Jun 2025 00:32:30 +0000 (17:32 -0700)]
Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library fix from Eric Biggers:
"Fix a regression where the purgatory code sometimes fails to build"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: sha256: Mark sha256_choose_blocks as __always_inline
Dave Airlie [Fri, 27 Jun 2025 20:53:00 +0000 (06:53 +1000)]
Merge tag 'drm-misc-fixes-2025-06-26' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc4:
- Fix function signature of drm_writeback_connector_cleanup.
- Use correct HDMI audio bridge in drm_connector_hdmi_audio_init.
- Make HPD work on SN65DSI86.
If the processing of the tr->events loop fails, the filter that has been
added to filter_head will be released twice in free_filter_list(&head->rcu)
and __free_filter(filter).
After adding the filter of tr->events, add the filter to the filter_head
process to avoid triggering uaf.
Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization") Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894 Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Fri, 27 Jun 2025 19:08:36 +0000 (12:08 -0700)]
Merge tag 'acpi-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Revert a commit that attempted to fix a memory leak in an error code
path and introduced a different issue (Zhe Qiao)"
* tag 'acpi-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "PCI/ACPI: Fix allocated memory release on error in pci_acpi_scan_root()"
Linus Torvalds [Fri, 27 Jun 2025 16:02:33 +0000 (09:02 -0700)]
Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fixes for ublk:
- fix C++ narrowing warnings in the uapi header
- update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
- fix for the ublk ->queue_rqs() implementation, limiting a batch
to just the specific task AND ring
- ublk_get_data() error handling fix
- sanity check more arguments in ublk_ctrl_add_dev()
- selftest addition
- NVMe pull request via Christoph:
- reset delayed remove_work after reconnect
- fix atomic write size validation
- Fix for a warning introduced in bdev_count_inflight_rw() in this
merge window
* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
block: fix false warning in bdev_count_inflight_rw()
ublk: sanity check add_dev input for underflow
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
ublk: setup ublk_io correctly in case of ublk_get_data() failure
ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
ublk: fix narrowing warnings in UAPI header
selftests: ublk: don't take same backing file for more than one ublk devices
ublk: build batch from IOs in same io_ring_ctx and io task
Linus Torvalds [Fri, 27 Jun 2025 15:55:57 +0000 (08:55 -0700)]
Merge tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Two tweaks for a recent fix: fixing a memory leak if multiple iovecs
were initially mapped but only the first was used and hence turned
into a UBUF rathan than an IOVEC iterator, and catching a case where
a retry would be done even if the previous segment wasn't full
- Small series fixing an issue making the vm unhappy if debugging is
turned on, hitting a VM_BUG_ON_PAGE()
- Fix a resource leak in io_import_dmabuf() in the error handling case,
which is a regression in this merge window
- Mark fallocate as needing to be write serialized, as is already done
for truncate and buffered writes
* tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux:
io_uring/kbuf: flag partial buffer mappings
io_uring/net: mark iov as dynamically allocated even for single segments
io_uring: fix resource leak in io_import_dmabuf()
io_uring: don't assume uaddr alignment in io_vec_fill_bvec
io_uring/rsrc: don't rely on user vaddr alignment
io_uring/rsrc: fix folio unpinning
io_uring: make fallocate be hashed work
Linus Torvalds [Fri, 27 Jun 2025 15:26:25 +0000 (08:26 -0700)]
Merge tag 's390-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Fix incorrectly dropped dereferencing of the stack nth entry
introduced with a previous KASAN false positive fix
- Use a proper memdup_array_user() helper to prevent overflow in a
protected key size calculation
* tag 's390-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/ptrace: Fix pointer dereferencing in regs_get_kernel_stack_nth()
s390/pkey: Prevent overflow in size calculation for memdup_user()
Johannes Berg [Fri, 6 Jun 2025 07:56:52 +0000 (09:56 +0200)]
i2c: scx200_acb: depends on HAS_IOPORT
It already depends on X86_32, but that's also set for ARCH=um.
Recent changes made UML no longer have IO port access since
it's not needed, but this driver uses it. Build it only for
HAS_IOPORT. This is pretty much the same as depending on X86,
but on the off-chance that HAS_IOPORT will ever be optional
on x86 HAS_IOPORT is the real prerequisite.
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Bibo Mao [Fri, 27 Jun 2025 10:27:44 +0000 (18:27 +0800)]
LoongArch: KVM: Disable updating of "num_cpu" and "feature"
Property "num_cpu" and "feature" are read-only once eiointc is created,
which are set with KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL attr group before
device creation.
Attr group KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS is to update register
and software state for migration and reset usage, property "num_cpu" and
"feature" can not be update again if it is created already.
Here discard write operation with property "num_cpu" and "feature" in
attr group KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL.
Cc: stable@vger.kernel.org Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Fri, 27 Jun 2025 10:27:44 +0000 (18:27 +0800)]
LoongArch: KVM: Check validity of "num_cpu" from user space
The maximum supported cpu number is EIOINTC_ROUTE_MAX_VCPUS about
irqchip EIOINTC, here add validation about cpu number to avoid array
pointer overflow.
Cc: stable@vger.kernel.org Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Fri, 27 Jun 2025 10:27:44 +0000 (18:27 +0800)]
LoongArch: KVM: Check interrupt route from physical CPU
With EIOINTC interrupt controller, physical CPU ID is set for irq route.
However the function kvm_get_vcpu() is used to get destination vCPU when
delivering irq. With API kvm_get_vcpu(), the logical CPU ID is used.
With API kvm_get_vcpu_by_cpuid(), vCPU ID can be searched from physical
CPU ID.
Cc: stable@vger.kernel.org Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Fri, 27 Jun 2025 10:27:44 +0000 (18:27 +0800)]
LoongArch: KVM: Fix interrupt route update with EIOINTC
With function eiointc_update_sw_coremap(), there is forced assignment
like val = *(u64 *)pvalue. Parameter pvalue may be pointer to char type
or others, there is problem with forced assignment with u64 type.
Here the detailed value is passed rather address pointer.
Cc: stable@vger.kernel.org Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Fri, 27 Jun 2025 10:27:44 +0000 (18:27 +0800)]
LoongArch: KVM: Add address alignment check for IOCSR emulation
IOCSR instruction supports 1/2/4/8 bytes access, the address should be
naturally aligned with its access size. Here address alignment check is
added in the EIOINTC kernel emulation.
Cc: stable@vger.kernel.org Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Linus Torvalds [Fri, 27 Jun 2025 02:49:12 +0000 (19:49 -0700)]
Merge tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- Lots of small check/repair fixes, primarily in subvol loop and
directory structure loop (when involving snapshots).
- Fix a few 6.16 regressions: rare UAF in the foreground allocator path
when taking a transaction restart from the transaction bump
allocator, and some small fallout from the change to log the error
being corrected in the journal when repairing errors, also some
fallout from the btree node read error logging improvements.
(Alan, Bharadwaj)
- New option: journal_rewind
This lets the entire filesystem be reset to an earlier point in time.
Note that this is only a disaster recovery tool, and right now there
are major caveats to using it (discards should be disabled, in
particular), but it successfully restored the filesystem of one of
the users who was bit by the subvolume deletion bug and didn't have
backups. I'll likely be making some changes to the discard path in
the future to make this a reliable recovery tool.
- Some new btree iterator tracepoints, for tracking down some
livelock-ish behaviour we've been seeing in the main data write path.
* tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs: (51 commits)
bcachefs: Plumb correct ip to trans_relock_fail tracepoint
bcachefs: Ensure we rewind to run recovery passes
bcachefs: Ensure btree node scan runs before checking for scanned nodes
bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
bcachefs: Use wait_on_allocator() when allocating journal
bcachefs: Check for bad write buffer key when moving from journal
bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
bcachefs: Fix range in bch2_lookup_indirect_extent() error path
bcachefs: fix spurious error_throw
bcachefs: Add missing bch2_err_class() to fileattr_set()
bcachefs: Add missing key type checks to check_snapshot_exists()
bcachefs: Don't log fsck err in the journal if doing repair elsewhere
bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
bcachefs: Fix missing newlines before ero
bcachefs: fix spurious error in read_btree_roots()
bcachefs: fsck: Fix oops in key_visible_in_snapshot()
bcachefs: fsck: fix unhandled restart in topology repair
bcachefs: fsck: Fix check_directory_structure when no check_dirents
bcachefs: Fix restart handling in btree_node_scrub_work()
...
Linus Torvalds [Thu, 26 Jun 2025 19:26:39 +0000 (12:26 -0700)]
Merge tag 'devicetree-fixes-for-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Convert altr,uart-1.0 and altr,juart-1.0 to DT schema. These were
applied for nios2, but never sent upstream.
- Fix extra '/' in fsl,ls1028a-reset '$id' path
- Fix warnings in ti,sn65dsi83 schema due to unnecessary $ref.
* tag 'devicetree-fixes-for-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: serial: Convert altr,uart-1.0 to DT schema
dt-bindings: serial: Convert altr,juart-1.0 to DT schema
dt-bindings: soc: fsl,ls1028a-reset: Drop extra "/" in $id
dt-bindings: drm/bridge: ti-sn65dsi83: drop $ref to fix lvds-vod* warnings
Jens Axboe [Thu, 26 Jun 2025 18:17:48 +0000 (12:17 -0600)]
io_uring/kbuf: flag partial buffer mappings
A previous commit aborted mapping more for a non-incremental ring for
bundle peeking, but depending on where in the process this peeking
happened, it would not necessarily prevent a retry by the user. That can
create gaps in the received/read data.
Add struct buf_sel_arg->partial_map, which can pass this information
back. The networking side can then map that to internal state and use it
to gate retry as well.
Since this necessitates a new flag, change io_sr_msg->retry to a
retry_flags member, and store both the retry and partial map condition
in there.
Cc: stable@vger.kernel.org Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Howells [Wed, 2 Apr 2025 19:27:26 +0000 (20:27 +0100)]
cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
When performing a file read from RDMA, smbd_recv() prints an "Invalid msg
type 4" error and fails the I/O. This is due to the switch-statement there
not handling the ITER_FOLIOQ handed down from netfslib.
Fix this by collapsing smbd_recv_buf() and smbd_recv_page() into
smbd_recv() and just using copy_to_iter() instead of memcpy(). This
future-proofs the function too, in case more ITER_* types are added.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tom Talpey <tom@talpey.com>
cc: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
David Howells [Wed, 25 Jun 2025 13:15:04 +0000 (14:15 +0100)]
cifs: Fix the smbd_response slab to allow usercopy
The handling of received data in the smbdirect client code involves using
copy_to_iter() to copy data from the smbd_reponse struct's packet trailer
to a folioq buffer provided by netfslib that encapsulates a chunk of
pagecache.
If, however, CONFIG_HARDENED_USERCOPY=y, this will result in the checks
then performed in copy_to_iter() oopsing with something like the following:
CIFS: Attempting to mount //172.31.9.1/test
CIFS: VFS: RDMA transport established
usercopy: Kernel memory exposure attempt detected from SLUB object 'smbd_response_0000000091e24ea1' (offset 81, size 63)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
...
RIP: 0010:usercopy_abort+0x6c/0x80
...
Call Trace:
<TASK>
__check_heap_object+0xe3/0x120
__check_object_size+0x4dc/0x6d0
smbd_recv+0x77f/0xfe0 [cifs]
cifs_readv_from_socket+0x276/0x8f0 [cifs]
cifs_read_from_socket+0xcd/0x120 [cifs]
cifs_demultiplex_thread+0x7e9/0x2d50 [cifs]
kthread+0x396/0x830
ret_from_fork+0x2b8/0x3b0
ret_from_fork_asm+0x1a/0x30
The problem is that the smbd_response slab's packet field isn't marked as
being permitted for usercopy.
Fix this by passing parameters to kmem_slab_create() to indicate that
copy_to_iter() is permitted from the packet region of the smbd_response
slab objects, less the header space.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Reported-by: Stefan Metzmacher <metze@samba.org> Link: https://lore.kernel.org/r/acb7f612-df26-4e2a-a35d-7cd040f513e1@samba.org/ Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Stefan Metzmacher <metze@samba.org> Tested-by: Stefan Metzmacher <metze@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
Paulo Alcantara [Wed, 25 Jun 2025 15:22:38 +0000 (12:22 -0300)]
smb: client: fix potential deadlock when reconnecting channels
Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order
and prevent the following deadlock from happening
======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc3-build2+ #1301 Tainted: G S W
------------------------------------------------------
cifsd/6055 is trying to acquire lock: ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200
but task is already holding lock: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
3 locks held by cifsd/6055:
#0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200
#1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200
#2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200
Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: d7d7a66aacd6 ("cifs: avoid use of global locks for high contention data") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
* tag 'nvme-6.16-2025-06-26' of git://git.infradead.org/nvme:
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
Michal Wajdeczko [Thu, 12 Jun 2025 22:09:36 +0000 (00:09 +0200)]
drm/xe: Process deferred GGTT node removals on device unwind
While we are indirectly draining our dedicated workqueue ggtt->wq
that we use to complete asynchronous removal of some GGTT nodes,
this happends as part of the managed-drm unwinding (ggtt_fini_early),
which could be later then manage-device unwinding, where we could
already unmap our MMIO/GMS mapping (mmio_fini).
This was recently observed during unsuccessful VF initialization:
Michal Wajdeczko [Thu, 12 Jun 2025 22:09:37 +0000 (00:09 +0200)]
drm/xe/guc: Explicitly exit CT safe mode on unwind
During driver probe we might be briefly using CT safe mode, which
is based on a delayed work, but usually we are able to stop this
once we have IRQ fully operational. However, if we abort the probe
quite early then during unwind we might try to destroy the workqueue
while there is still a pending delayed work that attempts to restart
itself which triggers a WARN.
This was recently observed during unsuccessful VF initialization:
drm/xe: Move DSB l2 flush to a more sensible place
Flushing l2 is only needed after all data has been written.
Fixes: 01570b446939 ("drm/xe/bmg: implement Wa_16023588340") Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://lore.kernel.org/r/20250606104546.1996818-3-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 0dd2dd0182bc444a62652e89d08c7f0e4fde15ba) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Kees Cook [Thu, 26 Jun 2025 12:07:18 +0000 (20:07 +0800)]
LoongArch: Handle KCOV __init vs inline mismatches
When the KCOV is enabled all functions get instrumented, unless
the __no_sanitize_coverage attribute is used. To prepare for
__no_sanitize_coverage being applied to __init functions, we have to
handle differences in how GCC's inline optimizations get resolved.
For LoongArch this exposed several places where __init annotations
were missing but ended up being "accidentally correct". So fix these
cases.