git.ipfire.org Git - thirdparty/kernel/linux.git/log

Merge branches 'work.path' and 'work.mount' into work.f_path

kernel/acct.c: saner struct file treatment

Instead of switching ->f_path.mnt of an opened file to internal
clone, get a struct path with ->mnt set to internal clone of that
->f_path.mnt, then dentry_open() that to get the file with right ->f_path.mnt
from the very beginning.

The only subtle part here is that on failure exits we need to
close the file with __fput_sync() and make sure we do that *before*
dropping the original mount.

With that done, only fs/{file_table,open,namei}.c ever store
anything to file->f_path and only prior to file->f_mode & FMODE_OPENED
becoming true. Analysis of mount write count handling also becomes
less brittle and convoluted...

[AV: folded a fix for a bug spotted by Jan Kara - we do need a full-blown
open of the original file, not just user_path_at() or we end up skipping
permission checks]

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

constify {__,}mnt_is_readonly()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

WRITE_HOLD machinery: no need for to bump mount_lock seqcount

... neither for insertion into the list of instances, nor for
mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding
to be nice to RT during a busy-wait loop - all of that only needs
the spinlock side of mount_lock.

IOW, it's mount_locked_reader, not mount_writer.

Clarify the comment re locking rules for mnt_unhold_writers() - it's
not just that mount_lock needs to be held when calling that, it must
have been held all along since the matching mnt_hold_writers().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

struct mount: relocate MNT_WRITE_HOLD bit

... from ->mnt_flags to LSB of ->mnt_pprev_for_sb.

This is safe - we always set and clear it within the same mount_lock
scope, so we won't interfere with list operations - traversals are
always forward, so they don't even look at ->mnt_prev_for_sb and
both insertions and removals are in mount_lock scopes of their own,
so that bit will be clear in *all* mount instances during those.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

preparations to taking MNT_WRITE_HOLD out of ->mnt_flags

We have an unpleasant wart in accessibility rules for struct mount.  There
are per-superblock lists of mounts, used by sb_prepare_remount_readonly()
to check if any of those is currently claimed for write access and to
block further attempts to get write access on those until we are done.

As soon as it is attached to a filesystem, mount becomes reachable
via that list.  Only sb_prepare_remount_readonly() traverses it and
it only accesses a few members of struct mount.  Unfortunately,
->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set
and then cleared.  It is done under mount_lock, so from the locking
rules POV everything's fine.

However, it has easily overlooked implications - once mount has been
attached to a filesystem, it has to be treated as globally visible.
In particular, initializing ->mnt_flags *must* be done either prior
to that point or under mount_lock.  All other members are still
private at that point.

Life gets simpler if we move that bit (and that's *all* that can get
touched by access via this list) out of ->mnt_flags.  It's not even
hard to do - currently the list is implemented as list_head one,
anchored in super_block->s_mounts and linked via mount->mnt_instance.

As the first step, switch it to hlist-like open-coded structure -
address of the first mount in the set is stored in ->s_mounts
and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb -
the former either NULL or pointing to the next mount in set, the
latter - address of either ->s_mounts or ->mnt_next_for_sb in the
previous element of the set.

In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as
replacement for MNT_WRITE_HOLD.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

setup_mnt(): primitive for connecting a mount to filesystem

Take the identical logics in vfs_create_mount() and clone_mnt() into
a new helper that takes an empty struct mount and attaches it to
given dentry (sub)tree.

Should be called once in the lifetime of every mount, prior to making
it visible in any data structures.

After that point ->mnt_root and ->mnt_sb never change; ->mnt_root
is a counting reference to dentry and ->mnt_sb - an active reference
to superblock.

Mount remains associated with that dentry tree all the way until
the call of cleanup_mnt(), when the refcount eventually drops
to zero.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

simplify the callers of mnt_unhold_writers()

The logics in cleanup on failure in mount_setattr_prepare() is simplified
by having the mnt_hold_writers() failure followed by advancing m to the
next node in the tree before leaving the loop.

And since all calls are preceded by the same check that flag has been set
and the function is inlined, let's just shift the check into it.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

copy_mnt_ns(): use guards

* mntput() of rootmnt and pwdmnt done via __free(mntput)
* mnt_ns_tree_add() can be done within namespace_excl scope.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure

Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for
an open-coded analogue free_mnt_ns() there - yes, we do avoid one call_rcu()
use per failing call of clone() or unshare(), if they fail due to OOM in that
particular spot, but it's not really worth bothering.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge branch 'no-rebase-mnt_ns_tree_remove' into work.mount

mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list

Actual removal is done under the lock, but for checking if need to bother
the lockless RB_EMPTY_NODE() is safe - either that namespace had never
been added to mnt_ns_tree, in which case the the node will stay empty, or
whoever had allocated it has called mnt_ns_tree_add() and it has already
run to completion. After that point RB_EMPTY_NODE() will become false and
will remain false, no matter what we do with other nodes in the tree.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

open_detached_copy(): separate creation of namespace into helper

... and convert the helper to use of a guard(namespace_excl)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

open_detached_copy(): don't bother with mount_lock_hash()

we are holding namespace_sem and a reference to root of tree;
iterating through that tree does not need mount_lock. Neither
does the insertion into the rbtree of new namespace or incrementing
the mount count of that namespace.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

path_has_submounts(): use guard(mount_locked_reader)

Needed there since the callback passed to d_walk() (path_check_mount())
is using __path_is_mountpoint(), which uses __lookup_mnt().

Has to be taken in the caller - d_walk() might take rename_lock spinlock
component and that nests inside mount_lock.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()

Comments regarding "shadow mounts" were stale - no such thing anymore.
Document the locking requirements for __lookup_mnt().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ecryptfs: get rid of pointless mount references in ecryptfs dentries

->lower_path.mnt has the same value for all dentries on given ecryptfs
instance and if somebody goes for mountpoint-crossing variant where that
would not be true, we can deal with that when it happens (and _not_
with duplicating these reference into each dentry).

As it is, we are better off just sticking a reference into ecryptfs-private
part of superblock and keeping it pinned until ->kill_sb().

That way we can stick a reference to underlying dentry right into ->d_fsdata
of ecryptfs one, getting rid of indirection through struct ecryptfs_dentry_info,
along with the entire struct ecryptfs_dentry_info machinery.

[kudos to Dan Carpenter for spotting a bug in ecryptfs_get_tree() part]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

umount_tree(): take all victims out of propagation graph at once

For each removed mount we need to calculate where the slaves will end up.
To avoid duplicating that work, do it for all mounts to be removed
at once, taking the mounts themselves out of propagation graph as
we go, then do all transfers; the duplicate work on finding destinations
is avoided since if we run into a mount that already had destination found,
we don't need to trace the rest of the way. That's guaranteed
O(removed mounts) for finding destinations and removing from propagation
graph and O(surviving mounts that have master removed) for transfers.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_mount(): use __free(path_put)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_move_mount_old(): use __free(path_put)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

constify can_move_mount_beneath() arguments

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

path_umount(): constify struct path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

may_copy_tree(), __do_loopback(): constify struct path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

path_mount(): constify struct path argument

now it finally can be done.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_new_mount{,_fc}(): constify struct path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

mnt_warn_timestamp_expiry(): constify struct path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

collect_paths(): constify the return value

callers have no business modifying the paths they get

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

drop_collected_paths(): constify arguments

... and use that to constify the pointers in callers

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_set_group(): constify path arguments

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_mount_setattr(): constify path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

constify check_mnt()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_lock_mount(): don't modify path.

Currently do_lock_mount() has the target path switched to whatever
might be overmounting it.  We _do_ want to have the parent
mount/mountpoint chosen on top of the overmounting pile; however,
the way it's done has unpleasant races - if umount propagation
removes the overmount while we'd been trying to set the environment
up, we might end up failing if our target path strays into that overmount
just before the overmount gets kicked out.

Users of do_lock_mount() do not need the target path changed - they
have all information in res->{parent,mp}; only one place (in
do_move_mount()) currently uses the resulting path->mnt, and that value
is trivial to reconstruct by the original value of path->mnt + chosen
parent mount.

Let's keep the target path unchanged; it avoids a bunch of subtle races
and it's not hard to do:
do
as mount_locked_reader
find the prospective parent mount/mountpoint dentry
grab references if it's not the original target
lock the prospective mountpoint dentry
take namespace_sem exclusive
if prospective parent/mountpoint would be different now
err = -EAGAIN
else if location has been unmounted
err = -ENOENT
else if mountpoint dentry is not allowed to be mounted on
err = -ENOENT
else if beneath and the top of the pile was the absolute root
err = -EINVAL
else
try to get struct mountpoint (by dentry), set
err to 0 on success and -ENO{MEM,ENT} on failure
if err != 0
res->parent = ERR_PTR(err)
drop locks
else
res->parent = prospective parent
drop temporary references
while err == -EAGAIN

A somewhat subtle part is that dropping temporary references is allowed.
Neither mounts nor dentries should be evicted by a thread that holds
namespace_sem.  On success we are dropping those references under
namespace_sem, so we need to be sure that these are not the last
references remaining.  However, on success we'd already verified (under
namespace_sem) that original target is still mounted and that mount
and dentry we are about to drop are still reachable from it via the
mount tree.  That guarantees that we are not about to drop the last
remaining references.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

new helper: topmost_overmount()

Returns the final (topmost) mount in the chain of overmounts
starting at given mount. Same locking rules as for any mount
tree traversal - either the spinlock side of mount_lock, or
rcu + sample the seqcount side of mount_lock before the call
and recheck afterwards.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

don't bother passing new_path->dentry to can_move_mount_beneath()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry

That kills the last place where callers of lock_mount(path, &mp)
used path->dentry.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint

parent and mountpoint always come from the same struct pinned_mountpoint
now.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path

Both callers pass it a mountpoint reference picked from pinned_mountpoint
and path it corresponds to.

First of all, path->dentry is equal to mp.mp->m_dentry. Furthermore, path->mnt
is &mp.parent->mnt, making struct path contents redundant.

Pass it the address of that pinned_mountpoint instead; what's more, if we
teach it to treat ERR_PTR(error) in ->parent as "bail out with that error"
we can simplify the callers even more - do_add_mount() will do the right
thing even when called after lock_mount() failure.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_move_mount(): use the parent mount returned by do_lock_mount()

After successful do_lock_mount() call, mp.parent is set to either
real_mount(path->mnt) (for !beneath case) or to ->mnt_parent of that
(for beneath). p is set to real_mount(path->mnt) and after
several uses it's made equal to mp.parent. All uses prior to that
care only about p->mnt_ns and since p->mnt_ns == parent->mnt_ns,
we might as well use mp.parent all along.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

change calling conventions for lock_mount() et.al.

1) pinned_mountpoint gets a new member - struct mount *parent.
Set only if we locked the sucker; ERR_PTR() - on failed attempt.

2) do_lock_mount() et.al. return void and set ->parent to
* on success with !beneath - mount corresponding to path->mnt
* on success with beneath - the parent of mount corresponding
to path->mnt
* in case of error - ERR_PTR(-E...).
IOW, we get the mount we will be actually mounting upon or ERR_PTR().

3) we can't use CLASS, since the pinned_mountpoint is placed on
hlist during initialization, so we define local macros:
LOCK_MOUNT(mp, path)
LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath)
LOCK_MOUNT_EXACT(mp, path)
All of them declare and initialize struct pinned_mountpoint mp,
with unlock_mount done via __cleanup().

Users converted.

[
lock_mount() is unused now; removed.
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

configfs:get_target() - release path as soon as we grab configfs_item reference

... and get rid of path argument - it turns into a local variable in get_target()

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

apparmor/af_unix: constify struct path * arguments

unix_sk(sock)->path should never be modified, least of all by LSM...

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl_is_real_file: constify realpath argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl_sync_file(): constify path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl_lower_dir(): constify path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl_get_verity_digest(): constify path argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl_validate_verity(): constify {meta,data}path arguments

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl_ensure_verity_loaded(): constify datapath argument

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ksmbd_vfs_set_init_posix_acl(): constify path argument

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ksmbd_vfs_inherit_posix_acl(): constify path argument

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ksmbd_vfs_kern_path_unlock(): constify path argument

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path *

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

check_export(): constify path argument

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

export_operations->open(): constify path argument

for the method and its sole instance...

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

rqst_exp_get_by_name(): constify path argument

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

nfs: constify path argument of __vfs_getattr()

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bpf...d_path(): constify path argument

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

done_path_create(): constify path argument

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

filename_lookup(): constify root argument

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

constify path argument of vfs_statx_path()

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

backing_file_user_path(): constify struct path *

Callers never use the resulting pointer to modify the struct path it
points to (nor should they).

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

finish_automount(): use __free() to deal with dropping mnt on failure

same story as with do_new_mount_fc().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_new_mount_fc(): use __free() to deal with dropping mnt on failure

do_add_mount() consumes vfsmount on success; just follow it with
conditional retain_and_null_ptr() on success and we can switch
to __free() for mnt and be done with that - unlock_mount() is
in the very end.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

finish_automount(): take the lock_mount() analogue into a helper

finish_automount() can't use lock_mount() - it treats finding something
already mounted as "quitely drop our mount and return 0", not as
"mount on top of whatever mounted there". It's been open-coded;
let's take it into a helper similar to lock_mount(). "something's
already mounted" => -EBUSY, finish_automount() needs to distinguish
it from the normal case and it can't happen in other failure cases.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

pivot_root(2): use __free() to deal with struct path in it

preparations for making unlock_mount() a __cleanup();
can't have path_put() inside mount_lock scope.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_loopback(): use __free(path_put) to deal with old_path

preparations for making unlock_mount() a __cleanup();
can't have path_put() inside mount_lock scope.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

finish_automount(): simplify the ELOOP check

It's enough to check that dentries match; if path->dentry is equal to
m->mnt_root, superblocks will match as well.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

move_mount(2): take sanity checks in 'beneath' case into do_lock_mount()

We want to mount beneath the given location. For that operation to
make sense, location must be the root of some mount that has something
under it. Currently we let it proceed if those requirements are not met,
with rather meaningless results, and have that bogosity caught further
down the road; let's fail early instead - do_lock_mount() doesn't make
sense unless those conditions hold, and checking them there makes
things simpler.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_move_mount(): deal with the checks on old_path early

1) checking that location we want to move does point to root of some mount
can be done before anything else; that property is not going to change
and having it already verified simplifies the analysis.

2) checking the type agreement between what we are trying to move and what
we are trying to move it onto also belongs in the very beginning -
do_lock_mount() might end up switching new_path to something that overmounts
the original location, but... the same type agreement applies to overmounts,
so we could just as well check against the original location.

3) since we know that old_path->dentry is the root of old_path->mnt, there's
no point bothering with path_is_overmounted() in can_move_mount_beneath();
it's simply a check for the mount we are trying to move having non-NULL
->overmount. And with that, we can switch can_move_mount_beneath() to
taking old instead of old_path, leaving no uses of old_path past the original
checks.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_move_mount(): trim local variables

Both 'parent' and 'ns' are used at most once, no point precalculating those...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

switch do_new_mount_fc() to fc_mount()

Prior to the call of do_new_mount_fc() the caller has just done successful
vfs_get_tree(). Then do_new_mount_fc() does several checks on resulting
superblock, and either does fc_drop_locked() and returns an error or
proceeds to unlock the superblock and call vfs_create_mount().

The thing is, there's no reason to delay that unlock + vfs_create_mount() -
the tests do not rely upon the state of ->s_umount and
fc_drop_locked()
put_fs_context()
is equivalent to
unlock ->s_umount
put_fs_context()

Doing vfs_create_mount() before the checks allows us to move vfs_get_tree()
from caller to do_new_mount_fc() and collapse it with vfs_create_mount()
into an fc_mount() call.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

current_chrooted(): use guards

here a use of __free(path_put) for dropping fs_root is enough to
make guard(mount_locked_reader) fit...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

current_chrooted(): don't bother with follow_down_one()

All we need here is to follow ->overmount on root mount of namespace...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

path_is_under(): use guards

... and document that locking requirements for is_path_reachable().
There is one questionable caller in do_listmount() where we are not
holding mount_lock *and* might not have the first argument mounted.
However, in that case it will immediately return true without having
to look at the ancestors. Might be cleaner to move the check into
non-LSTM_ROOT case which it really belongs in - there the check is
not always true and is_mounted() is guaranteed.

Document the locking environments for is_path_reachable() callers:
get_peer_under_root()
get_dominating_id()
do_statmount()
do_listmount()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

mnt_set_expiry(): use guards

The reason why it needs only mount_locked_reader is that there's no lockless
accesses of expiry lists.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

has_locked_children(): use guards

... and document the locking requirements of __has_locked_children()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

check_for_nsfs_mounts(): no need to take locks

Currently we are taking mount_writer; what that function needs is
either mount_locked_reader (we are not changing anything, we just
want to iterate through the subtree) or namespace_shared and
a reference held by caller on the root of subtree - that's also
enough to stabilize the topology.

The thing is, all callers are already holding at least namespace_shared
as well as a reference to the root of subtree.

Let's make the callers provide locking warranties - don't mess with
mount_lock in check_for_nsfs_mounts() itself and document the locking
requirements.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

mnt_already_visible(): use guards

clean fit; namespace_shared due to iterating through ns->mounts.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

put_mnt_ns(): use guards

clean fit; guards can't be weaker due to umount_tree() call.
Setting emptied_ns requires namespace_excl, but not anything
mount_lock-related.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

mark_mounts_for_expiry(): use guards

Clean fit; guards can't be weaker due to umount_tree() calls.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_set_group(): use guards

clean fit; namespace_excl to modify propagation graph

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_change_type(): use guards

clean fit; namespace_excl to modify propagation graph

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

__is_local_mountpoint(): use guards

clean fit; namespace_shared due to iterating through ns->mounts.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

__detach_mounts(): use guards

Clean fit for guards use; guards can't be weaker due to umount_tree() calls.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/namespace.c: allow to drop vfsmount references via __free(mntput)

Note that just as path_put, it should never be done in scope of
namespace_sem, be it shared or exclusive.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

introduced guards for mount_lock

mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash()
mount_locked_reader: read_seqlock_excl; these tend to be open-coded.

No bulk conversions, please - if nothing else, quite a few places take
use mount_writer form when mount_locked_reader is sufficent. It needs
to be dealt with carefully.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/namespace.c: fix the namespace_sem guard mess

If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD.
That way we
* do not need to feed it a bogus argument
* do not get gcc trying to compare an address of static in
file variable with -4097 - and, if we are unlucky, trying to keep
it in a register, with spills and all such.

The same problems apply to grabbing namespace_sem shared.

Rename it to namespace_excl, add namespace_shared, convert the existing users:

    guard(namespace_lock, &namespace_sem) => guard(namespace_excl)()
    guard(rwsem_read, &namespace_sem) => guard(namespace_shared)()
    scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl)
    scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Linux 6.17-rc4

Merge tag 'x86_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

- Convert the SSB mitigation to the attack vector controls which got
   forgotten at the time

- Prevent the CPUID topology hierarchy detection on AMD from
   overwriting the correct initial APIC ID

- Fix the case of a machine shipping without microcode in the BIOS, in
   the AMD microcode loader

- Correct the Pentium 4 model range which has a constant TSC

* tag 'x86_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Add attack vector controls for SSB
  x86/cpu/topology: Use initial APIC ID from XTOPOLOGY leaf on AMD/HYGON
  x86/microcode/AMD: Handle the case of no BIOS microcode
  x86/cpu/intel: Fix the constant_tsc model check for Pentium 4

Merge tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

- Fix a stall on the CPU offline path due to mis-counting a deadline
   server task twice as part of the runqueue's running tasks count

- Fix a realtime tasks starvation case where failure to enqueue a timer
   whose expiration time is already in the past would cause repeated
   attempts to re-enqueue a deadline server task which leads to starving
   the former, realtime one

- Prevent a delayed deadline server task stop from breaking the
   per-runqueue bandwidth tracking

- Have a function checking whether the deadline server task has
   stopped, return the correct value

* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Don't count nr_running for dl_server proxy tasks
  sched/deadline: Fix RT task potential starvation when expiry time passed
  sched/deadline: Always stop dl-server before changing parameters
  sched/deadline: Fix dl_server_stopped()

Merge tag 'irq_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

- Remove unnecessary and noisy WARN_ONs in gic-v5's init path

- Avoid a kmemleak false positive for the gic-v5's L2 IST table entries

- Fix a retval check in mvebu-gicp's probe function

- Fix a wrong conversion to guards in atmel-aic[5] irqchip

* tag 'irq_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v5: Remove undue WARN_ON()s in the IRS affinity parsing
  irqchip/gic-v5: Fix kmemleak L2 IST table entries false positives
  irqchip/mvebu-gicp: Fix an IS_ERR() vs NULL check in probe()
  irqchip/atmel-aic[5]: Fix incorrect lock guard conversion

Merge tag 'hardening-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

- ARM: stacktrace: include asm/sections.h in asm/stacktrace.h (Arnd
   Bergmann)

- ubsan: Fix incorrect hand-side used in handle (Junhui Pei)

- hardening: Require clang 20.1.0 for __counted_by (Nathan Chancellor)

* tag 'hardening-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  hardening: Require clang 20.1.0 for __counted_by
  ARM: stacktrace: include asm/sections.h in asm/stacktrace.h
  ubsan: Fix incorrect hand-side used in handle

Merge tag 'gpio-fixes-for-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

- fix an off-by-one bug in interrupt handling in gpio-timberdale

- update MAINTAINERS

* tag 'gpio-fixes-for-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Change Altera-PIO driver maintainer
gpio: timberdale: fix off-by-one in IRQ type boundary check

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

- CFI failure due to kpti_ng_pgd_alloc() signature mismatch

- Underallocation bug in the SVE ptrace kselftest

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature

kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace

In fp-trace when allocating a buffer to write SVE register data we open
code the addition of the header size to the VL depeendent register data
size, which lead to an underallocation bug when we cut'n'pasted the code
for FPSIMD format writes. Use the SVE_PT_SIZE() macro that the kernel
UAPI provides for this.

Fixes: b84d2b27954f ("kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250812-arm64-fp-trace-macro-v1-1-317cfff986a5@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature

Seen during KPTI initialization:

  CFI failure at create_kpti_ng_temp_pgd+0x124/0xce8 (target: kpti_ng_pgd_alloc+0x0/0x14; expected type: 0xd61b88b6)

The call site is alloc_init_pud() at arch/arm64/mm/mmu.c:

  pud_phys = pgtable_alloc(TABLE_PUD);

alloc_init_pud() has the prototype:

  static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
                             phys_addr_t phys, pgprot_t prot,
                             phys_addr_t (*pgtable_alloc)(enum pgtable_type),
                             int flags)

where the pgtable_alloc() prototype is declared.

The target (kpti_ng_pgd_alloc) is used in arch/arm64/kernel/cpufeature.c:

  create_kpti_ng_temp_pgd(kpti_ng_temp_pgd, __pa(alloc), KPTI_NG_TEMP_VA,
                          PAGE_SIZE, PAGE_KERNEL, kpti_ng_pgd_alloc, 0);

which is an alias for __create_pgd_mapping_locked() with prototype:

  extern __alias(__create_pgd_mapping_locked)
  void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys,
                               unsigned long virt,
                               phys_addr_t size, pgprot_t prot,
                               phys_addr_t (*pgtable_alloc)(enum pgtable_type),
                               int flags);

__create_pgd_mapping_locked() passes the function pointer down:

  __create_pgd_mapping_locked() -> alloc_init_p4d() -> alloc_init_pud()

But the target function (kpti_ng_pgd_alloc) has the wrong signature:

  static phys_addr_t __init kpti_ng_pgd_alloc(int shift);

The "int" should be "enum pgtable_type".

To make "enum pgtable_type" available to cpufeature.c, move
enum pgtable_type definition from arch/arm64/mm/mmu.c to
arch/arm64/include/asm/mmu.h.

Adjust kpti_ng_pgd_alloc to use "enum pgtable_type" instead of "int".
The function behavior remains identical (parameter is unused).

Fixes: c64f46ee1377 ("arm64: mm: use enum to identify pgtable level instead of *_SHIFT")
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250829190721.it.373-kees@kernel.org
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Correctly handle 'invariant' system registers for protected VMs

   - Improved handling of VNCR data aborts, including external aborts

   - Fixes for handling of FEAT_RAS for NV guests, providing a sane
     fault context during SEA injection and preventing the use of
     RASv1p1 fault injection hardware

   - Ensure that page table destruction when a VM is destroyed gives an
     opportunity to reschedule

   - Large fix to KVM's infrastructure for managing guest context loaded
     on the CPU, addressing issues where the output of AT emulation
     doesn't get reflected to the guest

   - Fix AT S12 emulation to actually perform stage-2 translation when
     necessary

   - Avoid attempting vLPI irqbypass when GICv4 has been explicitly
     disabled for a VM

   - Minor KVM + selftest fixes

  RISC-V:

   - Fix pte settings within kvm_riscv_gstage_ioremap()

   - Fix comments in kvm_riscv_check_vcpu_requests()

   - Fix stack overrun when setting vlenb via ONE_REG

  x86:

   - Use array_index_nospec() to sanitize the target vCPU ID when
     handling PV IPIs and yields as the ID is guest-controlled.

   - Drop a superfluous cpumask_empty() check when reclaiming SEV
     memory, as the common case, by far, is that at least one CPU will
     have entered the VM, and wbnoinvd_on_cpus_mask() will naturally
     handle the rare case where the set of have_run_cpus is empty.

  Selftests (not KVM):

   - Rename the is_signed_type() macro in kselftest_harness.h to
     is_signed_var() to fix a collision with linux/overflow.h. The
     collision generates compiler warnings due to the two macros having
     different meaning"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
  KVM: arm64: nv: Fix ATS12 handling of single-stage translation
  KVM: arm64: Remove __vcpu_{read,write}_sys_reg_{from,to}_cpu()
  KVM: arm64: Fix vcpu_{read,write}_sys_reg() accessors
  KVM: arm64: Simplify sysreg access on exception delivery
  KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state
  RISC-V: KVM: fix stack overrun when loading vlenb
  RISC-V: KVM: Correct kvm_riscv_check_vcpu_requests() comment
  RISC-V: KVM: Fix pte settings within kvm_riscv_gstage_ioremap()
  KVM: arm64: selftests: Sync ID_AA64MMFR3_EL1 in set_id_regs
  KVM: arm64: Get rid of ARM64_FEATURE_MASK()
  KVM: arm64: Make ID_AA64PFR1_EL1.RAS_frac writable
  KVM: arm64: Make ID_AA64PFR0_EL1.RAS writable
  KVM: arm64: Ignore HCR_EL2.FIEN set by L1 guest's EL2
  KVM: arm64: Handle RASv1p1 registers
  arm64: Add capability denoting FEAT_RASv1p1
  KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
  KVM: arm64: Split kvm_pgtable_stage2_destroy()
  selftests: harness: Rename is_signed_type() to avoid collision with overflow.h
  KVM: SEV: don't check have_run_cpus in sev_writeback_caches()
  KVM: arm64: Correctly populate FAR_EL2 on nested SEA injection
  ...

hardening: Require clang 20.1.0 for __counted_by

After an innocuous change in -next that modified a structure that
contains __counted_by, clang-19 start crashing when building certain
files in drivers/gpu/drm/xe. When assertions are enabled, the more
descriptive failure is:

clang: clang/lib/AST/RecordLayoutBuilder.cpp:3335: const ASTRecordLayout &clang::ASTContext::getASTRecordLayout(const RecordDecl *) const: Assertion `D && "Cannot get layout of forward declarations!"' failed.

According to a reverse bisect, a tangential change to the LLVM IR
generation phase of clang during the LLVM 20 development cycle [1]
resolves this problem. Bump the version of clang that enables
CONFIG_CC_HAS_COUNTED_BY to 20.1.0 to ensure that this issue cannot be
hit.

Link: https://github.com/llvm/llvm-project/commit/160fb1121cdf703c3ef5e61fb26c5659eb581489
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20250807-fix-counted_by-clang-19-v1-1-902c86c1d515@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>