git.ipfire.org Git - thirdparty/linux.git/log

batman-adv: tp_meter: initialize last_recv_time during init

The last_recv_time is the most important indicator for a receiver session
to figure out whether a session timed out or not. But this information was
only initialized after the session was added to the tp_receiver_list and
after the timer was started.

In the worst case, the timer (function) could have tried to access this
information before the actual initialization was reached. Like rest of the
variables of the tp_meter receiver session, this field has to be filled out
before any other (parallel running) context has the chance to access it.

Cc: stable@kernel.org
Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation")
Signed-off-by: Sven Eckelmann <sven@narfation.org>

simple_lookup(): use d_splice_alias() for ->lookup() return value

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ecryptfs: use d_splice_alias() for ->lookup() return value

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

configfs_lookup(): switch to d_splice_alias()

more idiomatic

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

tracefs: use d_splice_alias() in ->lookup() instances

d_add() is not wrong there (inodes are freshly allocated), but
d_splice_alias() is more idiomatic.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

make cursors NORCU

All it requires is making sure that d_walk() will skip *all*
CURSOR dentries, even if somebody passes it one as an argument.

Cursors are negative and unhashed all along, never get added to
LRU or to shrink lists and no RCU references via ->d_sib are
possible for those - dentry_unlist() makes sure that no killed
dentry has ->d_sib.next left pointing to a cursor.

Seeing that a cursor is allocated every time we open a directory
on autofs, debugfs, devpts, etc., avoiding an RCU delay when such
opened files get closed is attractive...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

nfs: get rid of fake root dentries

... just grab the reference to the (real) root we are about to return
for the first mount of this superblock and be done with that.

Once upon a time dentry tree eviction at fs shutdown used to break
if ->s_root had been spliced on top of something; that hadn't been
the case for years now, and these fake root dentries violate a bunch
of invariants. Let's get rid of them...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

wind ->s_roots via ->d_sib instead of ->d_hash

shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.

That relies upon dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list. Unfortunately, the secondary roots are linked
via ->d_hash, rather than ->d_sib and they become removed from
that list before their inode references are dropped.

If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and gets to that point in dentry_kill()
when shrink_dcache_for_umount() is looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.

Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.

AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use d_add_waiter() machinery to wait for eviction in progress.

Changes:
* secondary roots look the same as ->s_root from d_unhashed()
and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib"
instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock
spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating
a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and
removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and
removes it from the list; no need to play the games for d_walk() sake,
since the latter is not going to look for the next sibling of those
anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to
the list of secondary roots and if it runs into one that is in the middle
of eviction waits for that to finish.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

shrink_dentry_tree(): unify the calls of shrink_dentry_list()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

shrinking rcu_read_lock() scope in d_alloc_parallel()

The current use of rcu_read_lock() uses in d_alloc_parallel()
is fairly opaque - the single large scope serves two purposes.

We start with lookup in normal hash, and there rcu_read_lock()
scope puts __d_lookup_rcu() and subsequent lockref_get_not_dead() into
the same RCU read-side critical area.

If no match is found, we proceed to lock the hash chain of
in-lookup hash and scan that for a match.  If we find a match, we want
to grab it and wait for lookup in progress to finish.  Since the bitlock
we use for these hash chains has to nest inside ->d_lock, we need to
unlock the chain first and use lockref_get_not_dead() on the match.
That has to be done without breaking the RCU read-side critical area,
and we use the same rcu_read_lock() scope to bridge over.

The thing is, after having grabbed the reference (and it is
very unlikely to fail) we proceed to grab ->d_lock - d_wait_lookup()
and __d_lookup_unhash()/__d_wake_in_lookup_waiters() are using that for
serialization. That makes lockref_get_not_dead() pointless - trying
to avoid grabbing ->d_lock for refcount increment, only to grab it
anyway immediately after that. If we grab ->d_lock first and replace
lockref_get_not_dead() with direct check for sign and increment if
non-negative we can move rcu_read_unlock() to immediately after grabbing
->d_lock.  Moreover, we don't need the RCU read-side critical area to
be contiguous since before earlier __d_lookup_rcu() - we can just as
well terminate the earlier one ASAP and call rcu_read_lock() again only
after having found a match (if any) in the in-lookup hash chain.

That makes the entire thing easier to follow and the purpose
of those rcu_read_lock() calls easier to describe - the first scope is
for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges
over from the bitlock scope to the ->d_lock scope on the match found in
in-lookup hash.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d_walk(): shrink rcu_read_lock() scope

we only need it to bridge over from ->d_lock scope of child to ->d_lock
scope of parent; dropping ->d_lock at rename_retry doesn't need to be
in rcu_read_lock() scope.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

document dentry_kill()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()

Pull dropping ->d_lock on lock_for_kill() failure into lock_for_kill() itself.
That reduces dentry_kill() to
if (!lock_for_kill(dentry))
return NULL;
return __dentry_kill(dentry);
at which point it's easier to move that if (...) into the beginning of __dentry_kill()
itself and rename it into dentry_kill().

Document the new calling conventions of lock_for_kill().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Document rcu_read_lock() use in select_collect2()

If select_collect2() finds something that is neither busy nor can
be moved to shrink list, it needs to return that to caller's caller
(shrink_dcache_tree()) ASAP and do so without grabbing references (among
other things, it might be already dying, in which case refcount can't be
incremented). We are called inside a ->d_lock scope, but that scope is
going to be terminated as soon as we return to caller (d_walk()); ->d_lock
will be retaken by shrink_dcache_tree(), but we need to bridge between
these scopes, turning them into contiguous RCU read-side critical area.

We do that with rcu_read_lock() scope - it spans from unbalanced
rcu_read_lock() in select_collect2() to unbalanced rcu_read_unlock()
in shrink_dcache_tree(). That works, but it really needs to be documented;
it's rather unidiomatic and it had caused quite a bit of confusion - some
of it in form of patches "fixing" the damn thing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Shift rcu_read_{,un}lock() inside fast_dput()

Shrink rcu_read_lock() scopes surrounding fast_dput() calls.
Both callers are immediately preceded and followed by
rcu_read_lock()/rcu_read_unlock() resp. Shrink that down
into fast_dput() itself; in case when fast_dput() ends up
grabbing ->d_lock, we can pull rcu_read_unlock() up to
right after spin_lock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

simplify safety for lock_for_kill() slowpath

rcu_read_lock() scopes in dentry eviction machinery are too wide
and badly structured; we end up with too many of those, quite
a few essentially identical.  Worse, quite a few of the function
involved are not neutral wrt that, making them harder to reason about.

rcu_read_lock() scope is not the only thing establishing an
RCU read-side critical area - spin_lock scope does the same and
they can be mixed - the sequence
rcu_read_lock()
...
spin_lock()
...
rcu_read_unlock()
...
rcu_read_lock()
...
spun_unlock()
...
rcu_read_unlock()
is an unbroken RCU read-side critical area.

Use of that observation allows to simplify things.  First of all,
lock_for_kill() relies upon being in an unbroken RCU read-side
critical area.  It's always called with ->d_lock held, and normally
returns without having ever dropped that spinlock.  We would not
need rcu_read_lock() at all, if not for the slow path - if trylock
of inode->i_lock fails, we need to drop and retake ->d_lock.

Having all calls of lock_for_kill() inside an rcu_read_lock() scope
takes care of that, but to show that lock_for_kill() slow path is safe,
we need to demonstrate such rcu_read_lock() scope for any call chain
leading to lock_for_kill().  Which is not fun, seeing that there are
10 such scopes, with 5 distinct beginnings between them.

Case 1: opens in dput() proceeds through fast_dput() grabbing ->d_lock,
returning false into dput() and there a call of finish_dput() which calls
dentry_kill(), which calls lock_for_kill(); ends in dentry_kill(), either
right after lock_for_kill() success or right after dropping ->d_lock
on lock_for_kill() failure.  ->d_lock is held continuously all the way
into lock_for_kill().

Case 2: opens in dentry_kill(), where we proceed to the same call of
dentry_kill() as in case 1.  ->d_lock is held since before the
beginning of the scope and all the way into lock_for_kill().

Case 3: opens in select_collect2(), proceeds through the return to
d_walk() and to shrink_dcache_tree() where we grab ->d_lock and
proceed to call shrink_kill(), which calls dentry_kill(), then as
in the previous scopes.

Case 4: opens in shrink_dentry_list(), followed by call of shrink_kill(),
then same as in case 3.  ->d_lock is held since before the beginning
of the scope and all the way into lock_for_kill().

Case 5: opens in shrink_kill(), where it's immediately followed by
call of dentry_kill(), then same as in the previous scopes.  ->d_lock
is held since before the beginning of the scope all the way into
lock_for_kill().

Note that in cases 2, 4 and 5 the slow path of lock_for_kill() is the
only part of rcu_read_lock() scope that is not covered by spinlock
scopes.  In case 1 we have the area in fast_dput() as well and in
case 3 - the return path from select_collect2() and chunk in shrink_dcache_tree()
up to grabbing ->d_lock.

Seeing that the reasons we need rcu_read_lock() in these additional
areas are completely unrelated to lock_for_kill() slow path, the things
get much more straightforward with
* explicit rcu_read_lock() scope surrounding the area in slow path
of lock_for_kill() where ->d_lock is not held
* shrink_dentry_list() dropping rcu_read_lock() as soon as it has
grabbed ->d_lock.
* dput() dropping rcu_read_lock() just before calling finish_dput().
* rcu_read_lock() calls in finish_dput(), shrink_kill() and
shrink_dentry_list() are removed, along with rcu_read_unlock() calls in
dentry_kill().

RCU read-side critical areas are unchanged by that, safety of lock_for_kill()
slow path is trivial to verify and a bunch of rcu_read_lock() scopes either
gone or become easier to describe.

Update the comments on locking conventions and memory safety considerations,
including the NORCU case.

Incidentally, all calls of fast_dput() are immediately preceded by rcu_read_lock()
and followed by rcu_read_unlock() now, which will allow to simplify those on
the next step...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fold lock_for_kill() and __dentry_kill() into common helper

There are two callers of lock_for_kill() and both are followed
by the same sequence of actions:
* in case of failure, drop ->d_lock, do rcu_read_unlock() and
go away
* in case of success, do rcu_read_unlock() followed by
passing dentry to __dentry_kill(); if the latter returns NULL, go away.

All calls of __dentry_kill() are paired with lock_for_kill() now;
let's turn that sequence into a new helper (dentry_kill()) and switch
to using it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fold lock_for_kill() into shrink_kill()

Both callers have exact same shape.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

shrink_dentry_list(): start with removing from shrink list

Currently we leave dentry on the list until we are done with
lock_for_kill().  That guarantees that it won't have been
even scheduled for removal until we remove it from the list
and drop ->d_lock.  We grab ->d_lock and rcu_read_lock()
and call lock_for_kill().  There are four possible cases:
1) lock_for_kill() has succeeded; dentry and its inode
(if any) are locked, dentry refcount is zero and we can
remove it from shrink list and feed it to shrink_kill().
2) lock_for_kill() fails since dentry has become busy.
Nothing to do, rcu_read_unlock(), remove from shrink list,
drop ->d_lock and move on.
3) lock_for_kill() fails since dentry is currently
being killed - already entered __dentry_kill(), but hasn't
reached dentry_unlist() yet.  Nothing to do, we should just
do rcu_read_unlock(), remove from shrink list so that
whoever's executing __dentry_kill() would free it once they
are done, drop ->d_lock and move on - same actions as in
case (2).
4) lock_for_kill() fails since dentry has been killed
(reached dentry_unlist(), DCACHE_DENTRY_KILLED set in ->d_flags).
In that case whoever had been killing it had already seen it
on our shrink list and skipped freeing it.  At that point it's
just a passive chunk of memory; rcu_read_unlock(), remove from
the list, drop ->d_lock and use dentry_free() to schedule
freeing.

While that works, there's a simpler way to do it:
* grab ->d_lock
* remove dentry from our shrink list
* if DCACHE_DENTRY_KILLED is already set, drop ->d_lock,
call dentry_free() and move on.
* otherwise grab rcu_read_lock() and call lock_for_free()
* if lock_for_kill() succeeds, feed dentry
to shrink_kill(), otherwise drop the locks and move on.

The end result is equivalent to the old variant.  The only difference
arises if at the time we grab ->d_lock dentry had refcount 0 and
lock_for_kill() had failed spin_trylock() and had to drop and regain
->d_lock.  Otherwise nobody can observe at which point within the
unbroken ->d_lock scope dentry had been removed from the shrink list -
all accesses to ->d_lru are under ->d_lock.

If ->d_lock had been dropped and regained, it is possible for another
thread to feed that dentry to __dentry_kill(); if it doesn't get to
dentry_unlist() before we regain ->d_lock, behaviour is still identical -
it's case (3) and by the time __dentry_kill() would've gotten around
to checking if the victim is on shrink list, it would've been already
removed from ours.

If __dentry_kill() from another thread *does* get to dentry_unlist(),
in the old variant we would have __dentry_kill() leave calling
dentry_free() to us and in the new one __dentry_kill() would've called
dentry_free() itself.  Since we are under rcu_read_lock(), we are
guaranteed that actual freeing won't happen until we get around to
rcu_read_unlock().  IOW, the new variant is still safe wrt UAF, if
not for the same reason as the old one, and overall result is the same;
the only difference is which threads ends up scheduling the actual
freeing of dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d_prune_aliases(): make sure to skip NORCU aliases

Either they are busy (in which case they won't be moved to shrink
list anyway) or they have a zero refcount, in which case we really
shouldn't mess with them - whoever had dropped the refcount to
zero is on the way to evicting and freeing them.

That way we are guaranteed that only the thread that has dropped
refcount of NORCU dentry to zero might call lock_for_kill() and
__dentry_kill() for those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill d_dispose_if_unused()

Rename to_shrink_list() into __move_to_shrink_list(), document and
export it. Switch d_dispose_if_unused() users to that and kill
d_dispose_if_unused() itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

make to_shrink_list() return whether it has moved dentry to list

... and make it check the refcount for being zero in addition to
dentry not being on a shrink list already. Simplifies the callers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

select_collect(): ignore dentries on shrink lists if they have positive refcounts

If all dentries we find have positive refcounts and some happen
to be on shrink lists, there's no point trying to steal them in the
select_collect2() phase - we won't be able to evict any of them. Busy on
shrink lists is still busy...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

find_acceptable_alias(): skip NORCU aliases with zero refcount

similar to d_find_any_alias() situation

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fix a race between d_find_any_alias() and final dput() of NORCU dentries

Refcount of a NORCU dentry must not be incremented after having dropped
to zero.  Otherwise we might end up with the following race:
CPU1: in fast_dput(d), rcu_read_lock();
CPU1: decrements refcount of d to 0
CPU1: notice that it's unhashed
CPU2: grab a reference to d
CPU2: dput(d), freeing d
CPU1: ... looks like we need to evict d, let's grab ->d_lock, recheck
      the refcount, etc.
and that spin_lock(&d->d_lock) ends up a UAF, despite still being in
an RCU read-side critical area started back when the refcount had been
positive.  If not for DCACHE_NORCU in d->d_flags freeing would've been
RCU-delayed, so we'd have grabbed ->d_lock, noticed the negative value
stored into refcount by __dentry_kill(), dropped the locks and that would
be it.  For NORCU dentries freeing is _not_ delayed, though.

Most of the non-counting references are excluded for NORCU dentries -
they are not allowed to be hashed, they never get placed on LRU, they
never get placed into anyone's list of children and while dput_to_list()
might put them into a shrink list, nobody bumps refcount of something
that had been reached that way.

However, inode's list of aliases can be a problem - it does not contribute
to dentry refcount (for obvious reasons) and we *do* have places that
grab references to something found on that list - that's precisely what
d_find_alias() is.  In case of d_find_alias() we are safe - it skips
unhashed aliases, so all NORCU ones are ignored there.  d_find_any_alias()
is *not* limited to hashed ones, though, and while it's usually called
for directories (which never get NORCU dentries), there are callers that
use it to get something for non-directories with no hashed aliases.

Having d_find_any_alias() hit a NORCU dentry is not impossible - it can
be easily arranged if you have CAP_DAC_READ_SEARCH (memfd_create() + mmap()
+ name_to_handle_at() for /proc/self/map_files/<...> + munmap() +
open_by_handle_at() will do that, and adding a second memfd_create() for
mount_fd makes it possible to do that without having memfd pinned).
The race window is narrow, and it's probably not feasible on bare hardware,
but...

It's not hard to fix, fortunately:
* separate __d_find_dir_alias() (== current __d_find_any_alias()) to
be used for directory inodes.
* provide dget_alias_ilocked() that would return false for NORCU
dentries with zero refcount and return true incrementing refcount otherwise
* make __d_find_any_alias() go over the list of aliases, using
dget_alias_ilocked() and returning the alias it succeeds on (normally the
first one).  Any NORCU alias with zero refcount is going to be evicted by
the thread that had dropped the final reference; this makes __d_find_any_alias()
pretend it had lost the race with eviction.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories

A lot of places relies upon directories never having NORCU dentries;
currently that property holds, but the proof is not straightforward
and rather brittle.

It's better to have that verified in the sole caller of d_alloc_pseudo(),
so that any future bugs in that direction were caught early.

That way we can be sure that
* current directory of any process is not NORCU
* root directory of any process is not NORCU
* starting point of any LOOKUP_RCU pathwalk is not NORCU
* dget_parent() can rely upon ->d_parent not being NORCU
* d_walk() and is_subdir() can rely upon the same
* alloc_file_pseudo() won't create multiple aliases for a directory
without having to go through a convoluted audit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

VFS: use wait_var_event for waiting in d_alloc_parallel()

Parallel lookup starts with a call of d_alloc_parallel().  That primitive
either returns a matching hashed dentry or allocates a new one in the
in-lookup state and returns it to the caller.  Once the caller is done
with lookup, it indicates so either by call of d_{splice_alias,add}()
or by call of d_done_lookup(); at that point dentry leaves the in-lookup
state.

If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for
that dentry to leave the in-lookup state, one way or another.  Currently
by supplying wait_queue_head to d_alloc_parallel().  If d_alloc_parallel()
creates a new in-lookup dentry, the address of that wait_queue_head is stored
in ->d_wait of new dentry and stays there while it's in the in-lookup;
subsequent d_alloc_parallel() will wait on the queue found in the matching
in-lookup dentry.  Transition out of in-lookup state wakes waiters on that
queue (if any).

That works, but the calling conventions are inconvenient - the caller must
supply wait_queue_head and make sure that it survives at least until the new
in-lookup dentry leaves the in-lookup state.  That amounts to boilerplate
in the d_alloc_parallel() callers that are followed by a call of d_lookup_done()
in the same function; in cases like nfs asynchronous unlink it gets worse than
that.

This patch changes d_alloc_parallel() to use wake_up_var_locked() to
wake up waiters, and wait_var_event_spinlock() to wait.  dentry->d_lock
is used for synchronisation as it is already held and the relevant
times.

That eliminates the need of caller-supplied wait_queue_head, simplifying
the calling conventions.  Better yet, we only need one bit of information
stored in dentry itself: whether there are any waiters to be woken up,
and that can be easily stored in ->d_flags; ->d_wait goes away.

The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var
machinery the queues are shared with all kinds of stuff and there's
no way tell if any of the waiters have anything to do with our dentry;
most of the time none of them will be relevant, so we need to avoid the
pointless wakeups.

Another benefit of the new scheme comes from the fact that wakeups
have to be done outside of write-side critical areas of ->i_dir_seq;
with the old scheme we need to carry the value picked from ->d_wait from
__d_lookup_unhash() to the place where we actually wake the waiters up.
Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get
to doing wakeups - that's done within the same ->d_lock scope, so we
are fine; new bit is accessed only under ->d_lock and it's seen only
on dentries with DCACHE_PAR_LOOKUP in ->d_flags.

__d_lookup_unhash() no longer needs to re-init ->d_lru.  That was
previously shared (in a union) with ->d_wait but ->d_wait is now gone
so it no longer corrupts ->d_lru.

Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

accel/ethosu: fix OOB write in ethosu_gem_cmdstream_copy_and_validate()

The command stream parsing loop increments the index variable a second
time when a 64-bit command word is encountered (bit 14 set), but does
not re-check the loop bound before writing the second word:

    for (i = 0; i < size / 4; i++) {
        bocmds[i] = cmds[0];
        if (cmd & 0x4000) {
            i++;
            bocmds[i] = cmds[1];   /* unchecked */
        }
    }

The buffer bocmds is backed by a DMA allocation of exactly size bytes
from drm_gem_dma_create(ddev, size), giving valid indices [0, size/4-1].

When i == size/4 - 1 on entry to an iteration and bit 14 of cmds[0] is
set, bocmds[size/4-1] is written in bounds, i is then incremented to
size/4, and bocmds[size/4] writes four bytes past the end of the
allocation.

Userspace controls both the buffer contents and the size argument via
the ioctl, making this a userspace-triggerable heap out-of-bounds write.

Fix by checking the incremented index against the buffer bound before
the second write and returning -EINVAL if the buffer is too small to
contain the extended command.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523190843.33977-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

Merge tag 'batadv-next-pullrequest-20260603' of https://git.open-mesh.org/batadv

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches, all by
Sven Eckelmann:

- tp_meter: fix various minor issues (8 patches)

- tp_meter: split generic session type in sender and receiver type

- tp_meter: consolidate locking for congestion control (2 patches)

- bla: annotate lasttime access with READ/WRITE_ONCE

- elp: prevent transmission interval underflow

- tt: sync local and global tvlv preparation return values

- tt: directly retrieve wifi flags of net_device

* tag 'batadv-next-pullrequest-20260603' of https://git.open-mesh.org/batadv:
  batman-adv: tt: directly retrieve wifi flags of net_device
  batman-adv: tt: sync local and global tvlv preparation return values
  batman-adv: prevent ELP transmission interval underflow
  batman-adv: bla: annotate lasttime access with READ/WRITE_ONCE
  batman-adv: tp_meter: consolidate congestion control variables
  batman-adv: tp_meter: use locking for all congestion control variables
  batman-adv: tp_meter: split vars into sender and receiver types
  batman-adv: tp_meter: add only finished tp_vars to lists
  batman-adv: tp_meter: handle seqno wrap-around for fast recovery detection
  batman-adv: tp_meter: fix fast recovery precondition
  batman-adv: tp_meter: avoid divide-by-zero for dec_cwnd
  batman-adv: tp_meter: avoid window underflow
  batman-adv: tp_meter: initialize dec_cwnd explicitly
  batman-adv: tp_meter: initialize dup_acks explicitly
  batman-adv: tp_meter: keep unacked list in ascending ordered
====================

Link: https://patch.msgid.link/20260603072527.174487-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mv643xx: fix OF node refcount

Platform devices created with platform_device_alloc() call
platform_device_release() when the last reference to the device's
kobject is dropped. This function calls of_node_put() unconditionally.
This works fine for devices created with platform_device_register_full()
but users of the split approach (platform_device_alloc() +
platform_device_add()) must bump the reference of the of_node they
assign manually. Add the missing call to of_node_get().

Cc: stable@vger.kernel.org
Fixes: 76723bca2802 ("net: mv643xx_eth: add DT parsing support")
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260602073414.22500-1-bartosz.golaszewski@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/dns_resolver: use kasprintf + kmemdup_nul to simplify dns_query

Use kasprintf() for descriptions with a query type and kmemdup_nul()
otherwise to simplify dns_query().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260602071343.962830-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: kpu: Default profile updates

Add support for parsing the following:
1. fabric path header
2. tpids 0x88a8, 0x9100 and 0x9200 parsing for
first pass and second pass packets
3. parse stacked VLANs
4. RoCEv2 header with UDP destination port 4791
5. single SBTAG parsing

Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
Link: https://patch.msgid.link/20260602040535.3975769-1-nshettyj@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-rds-roce-support-follow-ups'

Allison Henderson says:

====================
selftests: rds: ROCE support follow ups

This is a follow up series to the "Add ROCE support to rds selftests"
series.  The first patch renames run.sh to rds_run.sh, which provides
a self-describing name that appears on the netdev CI dashboard.

The second patch addresses a sashiko complaint that I thought was
worth circling back for.  In the patch "pin RDS sockets to their
intended transport," sockets are pinned to the specific transport they
are meant to test.  By default, socket transports are implicitly
selected based on the network topology, but it is possible that they
can fail back to other transports if the underlying connection could
not be established.  So the patch pins them to the intended transport
to avoid false positives.

The third patch "support RDS built as loadable modules," lifts the
CONFIG_MODULES=n requirement, and updates the check_*conf_enabled()
to allow modules set to "=m" and further load the backing modules for
any component set as such.  config.sh is updated to match.

The fourth patch converts the rdma-prerequisite checks to return XFAIL
rather than SKIP, since the RDMA datapath is not run in netdev CI.

Questions, comments and feedback appreciated!
====================

Link: https://patch.msgid.link/20260602050657.26389-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: report missing RDMA prereqs as XFAIL

Make the RDMA test return XFAIL rather than skip when RXE is not
available, since the RDMA datapath is not run in netdev CI.

Change the three RDMA-prerequisite checks in check_rdma_conf() and
check_rdma_conf_enabled() to exit with KSFT_XFAIL (2) and tag their
messages [XFAIL] instead of [SKIP].

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-5-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: support RDS built as loadable modules

Commit 92cc6708f4a2 ("selftests: rds: config: disable modules") set
CONFIG_MODULES=n since run.sh required this kconfig. But disabling
modules also forces every =m option to =n rather than =y, which can
silently drop unrelated features.

This patch removes CONFIG_MODULES=n from the rds selftest config and
updates the check_*conf_enabled() routines to accept a config as
either built-in (=y) or modular (=m). A new probe_module() function
is added to load the backing module when a component is set to be
modular (=m). config.sh no longer forces CONFIG_MODULES=n, so a user
who follows the SKIP message to run config.sh does not silently end
up with modules disabled again.

rds.ko itself is auto-loaded on socket creation, and rds_rdma.ko is
auto-loaded when SO_RDS_TRANSPORT is set with RDS_TRANS_IB, but the
TCP transport (rds_tcp.ko) is not auto-loaded on the bind path, so
the backing modules are loaded explicitly here.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: pin RDS sockets to their intended transport

The RDS selftests create AF_RDS sockets but never selects a transport,
so the transport is chosen implicitly based on network topology when
the socket is bound. If underlying connection establishment fails, RDS
can fall back to another transport (e.g. loopback) and the test still
passes, silently bypassing the intended datapath it is meant to
exercise.

Set SO_RDS_TRANSPORT to the proper RDS_TRANS_IB or RDS_TRANS_TCP before
they are bound, so the test fails loudly if the intended transport is
unavailable rather than passing on a different path.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Rename run.sh to rds_run.sh

This patch renames run.sh to rds_run.sh. This gives the test a
self-describing name that appears in the netdev CI dashboard.

Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260602050657.26389-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: use READ_ONCE() in ipv6_flowlabel_get()

ipv6_flowlabel_get() reads flowlabel_consistency and
flowlabel_state_ranges locklessly.

Use READ_ONCE() for these sysctl accesses.

Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602002506.1519901-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: use READ_ONCE() for bindv6only default in inet6_create()

inet6_create() reads net->ipv6.sysctl.bindv6only locklessly.

Use READ_ONCE() for this sysctl access.

Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602002414.1504106-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: frags: remove redundant assignment in inet_frag_reasm_prepare()

The assignment is redundant because skb_clone() already copies skb->cb.

Remove the unnecessary code.

Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
Link: https://patch.msgid.link/20260603065323.2736839-1-yuan.gao@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Report extack error to the user if airoha_tc_htb_modify_queue() fails

Report an extack error message in airoha_tc_htb_modify_queue routine if
airoha_qdma_set_tx_rate_limit() fails.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260603-airoha_tc_htb_modify_queue-err-message-v1-1-33ec3ab997d9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: remove obsolete dsa.txt

dsa.txt has been a redirect to dsa.yaml since commit bce58590d1bd
("dt-bindings: net: dsa: Add DSA yaml binding") introduced the .yaml
schema. The .yaml has the same filename in the same directory, making
this redirect unnecessary for discoverability.

Two files still reference dsa.txt, forcing readers through an extra
hop to reach the .yaml. The stub has not been touched since August
2020. Update references in lan9303.txt and
Documentation/networking/dsa/dsa.rst to point directly to dsa.yaml
and remove the stub.

Signed-off-by: Akash Sukhavasi <akash.sukhavasi@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260603-b4-remove-redirect-stubs-v2-3-c8c19876ab64@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: remove obsolete mdio.txt

mdio.txt has been a single-line redirect to mdio.yaml since
commit 62d77ff7ecbf ("dt-bindings: net: Add a YAML schemas for the
generic MDIO options"), which introduced the .yaml schema and reduced
the .txt to a stub in the same change. The .yaml has the same filename
in the same directory, making this redirect unnecessary for
discoverability.

No files in the tree reference mdio.txt and it has not been touched
since June 2019. Remove the obsolete stub.

Signed-off-by: Akash Sukhavasi <akash.sukhavasi@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260603-b4-remove-redirect-stubs-v2-1-c8c19876ab64@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: use dev_isalive() in rtnl_getlink()

rtnl_getlink() uses an RCU lookup to get the netdevice pointer.

When/If rtnl_lock() is used, we should check if the netdevice is not
being dismantled before potentially perform illegal actions.

Move dev_isalive() out of net/core/net-sysfs.c and make it available
in net/core/dev.h.

Return -ENODEV if rtnl_getlink() finds a device which is currently
being dismantled and RTNL is requested.

Fixes: e896e5c0734b ("rtnetlink: do not acquire RTNL in rtnl_getlink() with RTEXT_FILTER_NAME_ONLY")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260603180831.1024716-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: realtek: Use %pM format specifier for MAC addresses

Convert to %pM instead of using custom code.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/20260603112011.230890-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: mcast: Synchronously shutdown port multicast timers

Currently, while four timers are set up during port multicast context
initialization, only two are synchronously deleted when the context is
de-initialized, just before being deleted.

This is fine because the structure containing the multicast context
(either a bridge port or a VLAN) is only deleted after an RCU grace
period and it will not pass as long as the timers are executing. These
timers are also not supposed to do any work at this stage. They acquire
the bridge multicast lock, see that the multicast context was disabled
and exit.

Make the code more explicit and symmetric and synchronously shutdown all
four timers when the multicast context is de-initialized. Use
timer_shutdown_sync() to guarantee that the timers will not be re-armed
given that the containing structure is being deleted.

Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260603103522.622411-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rndis_host-add-le310x1-id-and-enable-low-power-handling'

Shaoxu Liu says:

====================
rndis_host: add LE310X1 ID and enable low-power handling

This series adds RNDIS support for Telit Cinterion LE310X1 and then enables
USB power management for that specific ID.
====================

Link: https://patch.msgid.link/tencent_29CB862D5756CBCBAFD2EE436EBAC98A7E05@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rndis_host: enable power management for Telit LE310X1

Enable autosuspend support for Telit Cinterion LE310X1 RNDIS interface
by selecting a driver_info variant with manage_power callback.

This keeps power management scoped to the new Telit ID only, and avoids
changing behavior for all existing RNDIS devices.

Signed-off-by: Shaoxu Liu <shaoxul@foxmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/tencent_B7686B84CD4B76D76BB912FA6367FAC2CA05@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rndis_host: add Telit LE310X1 RNDIS USB ID

Add a device match entry for Telit Cinterion LE310X1 RNDIS interface
(VID:PID 1bc7:7030).

This is a functional no-op and keeps using the generic rndis_info for now.
Power-management behavior is handled in a follow-up patch.

Signed-off-by: Shaoxu Liu <shaoxul@foxmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/tencent_F1AF1F5AD39C56485BD16C6DB2415E5B9508@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: frags: fix use-after-free caused by the fqdir_pre_exit() flush

On netns teardown, fqdir_pre_exit() walks the fqdir rhashtable and
flushes every fragment queue that is not yet complete using
inet_frag_queue_flush(). That helper frees all the skbs queued on the
fragment queue but does not set INET_FRAG_COMPLETE, and leaves
q->fragments_tail and q->last_run_head pointing at the freed skbs.
The queue itself stays in the rhashtable.

fqdir_pre_exit() first lowers high_thresh to 0 to stop new queue lookups,
but it cannot stop a fragment that already obtained the queue through
inet_frag_find() earlier and stalled just before taking the queue lock.
Once that fragment resumes after the flush and takes the queue lock,
it passes the INET_FRAG_COMPLETE check and then dereferences the freed
fragments_tail. inet_frag_queue_insert() reads FRAG_CB() and ->len of
that pointer and, on the append path, writes ->next_frag, causing a
slab use-after-free. IPv6, nf_conntrack_reasm6 and 6lowpan reassembly
share the same flush path and are affected as well.

Reset rb_fragments, fragments_tail and last_run_head in
inet_frag_queue_flush() so a flushed queue no longer points at the
freed skbs. A fragment that resumes after the flush and takes the
queue lock then finds an empty queue and starts a new run instead of
dereferencing the freed fragments_tail. ip_frag_reinit() already
performed this reset after its own flush, so drop the now duplicate
code there.

Cc: stable@vger.kernel.org
Fixes: 006a5035b495 ("inet: frags: flush pending skbs in fqdir_pre_exit()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Link: https://patch.msgid.link/ah6ukYq5G98LshdA@v4bel
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: annotate data-races in sysfs and procfs

bonding sysfs and procfs read parameters locklessly,
while drivers/net/bonding/bond_options.c can write over them.

Add missing READ_ONCE()/WRITE_ONCE() annotations.

This came as a prereq to avoid RTNL in bond_fill_info().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260602152748.2564393-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

i2c: busses: make K1 driver default for SpacemiT platforms

Enable I2C_K1 by default when ARCH_SPACEMIT is configured to ensure SD
card functionality works out-of-the-box.

SpacemiT K1 boards use I2C-controlled PMICs (like the P1 chip) to
provide SD card power supplies. Without the I2C_K1 driver enabled,
regulators cannot be controlled and SD card detection/operation fails.

Suggested-by: Margherita Milani <margherita.milani@amarulasolutions.com>
Suggested-by: Yixun Lan <dlan@kernel.org>
Signed-off-by: Iker Pedrosa <ikerpedrosam@gmail.com>
Reviewed-by: Yixun Lan <dlan@kernel.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260526-orangepi-sd-card-i2c-v1-1-b92268bfd467@gmail.com

Merge tag 'drm-rust-next-2026-06-04' of https://gitlab.freedesktop.org/drm/rust/kernel into drm-next

DRM Rust changes for v7.2-rc1

- Driver Core (shared via signed tag dd-lifetimes-7.2-rc1):

  - Introduce Higher-Ranked Lifetime Types (HRT) for Rust device
    drivers, allowing driver structs to hold device resources like
    pci::Bar and IoMem directly with a lifetime tied to the binding
    scope, removing the need for Devres indirection and ARef<Device>.

  - Replace drvdata() with scoped registration data on the auxiliary
    bus, using the new ForLt trait to thread lifetimes through
    registrations. Remove drvdata() and driver_type.

- DRM:

  - Add GPUVM immediate mode abstraction for Rust GPU drivers:
    - In immediate mode, GPU virtual address space state is updated
      during job execution (in the DMA fence signalling critical path),
      keeping the GPUVM and the GPU's address space always in sync.

    - Provide GpuVm, GpuVa, and GpuVmBo types for managing address
      spaces, virtual mappings, and GEM object backing respectively.

    - Provide split-merge map/unmap operations that handle partial
      overlaps with existing mappings.

    - drm_exec integration for dma_resv locking and GEM object
      validation based on the external/evicted object lists are not
      yet covered and planned as follow-up work.

  - Introduce DeviceContext type state for drm::Device, allowing
    drivers to restrict operations to contexts where the device is
    guaranteed to be registered (or not yet registered) with userspace.

  - Add FEAT_RENDER flag to the Driver trait for render node support.

- Nova:

  - Hopper/Blackwell enablement:
    - Add GPU identification and architecture-based HAL selection for
      Hopper (GH100) and Blackwell (GB100, GB202).

    - Implement the FSP (Foundation Security Processor) boot path used by
      Hopper and Blackwell, including FSP falcon engine support, EMEM
      operations, MCTP/NVDM message infrastructure, and FSP Chain of
      Trust boot with GSP lockdown release.

    - Add support for 32-bit firmware images and auto-detection of
      firmware image format.

    - Add architecture-specific framebuffer, sysmem flush, PCI config
      mirror, DMA mask, and WPR/non-WPR heap sizing.

  - GSP boot and unload:
    - Refactor the GSP boot process into a chipset-specific HAL,
      keeping the SEC2 and FSP boot paths separated cleanly.

    - Implement proper driver unload: send UNLOADING_GUEST_DRIVER
      command, run Booter Unloader and FWSEC-SB upon unbinding, and run
      the unload bundle on Gsp::boot() failure. This removes the need
      for a manual GPU reset between driver unbind and re-probe.

  - GA100 support:
    - Add support for the GA100 GPU, including IFR header detection and
      skipping, correct fwsignature selection, conditional FRTS boot,
      and documentation of the IFR header layout.

  - VBIOS hardening and refactoring:
    - Harden VBIOS parsing with checked arithmetic, bounds-checked
      accesses, and FromBytes-based structure reads throughout the FWSEC
      and Falcon data paths. Simplify the overall VBIOS module
      structure.

  - HRT adoption:
    - Use lifetime-parameterized pci::Bar directly, replacing the
      Arc<Devres<Bar0>> indirection. Replace ARef<Device> with &'bound
      Device in SysmemFlush and the GSP sequencer. Separate the driver
      type from driver data.

  - Misc:
    - Rename module names to kebab-case (nova-drm, nova-core).

    - Require little-endian in Kconfig, making the existing assumption
      explicit.

- Tyr:

  - Define comprehensive typed register blocks for GPU_CONTROL,
    JOB_CONTROL, MMU_CONTROL (including per-address-space registers),
    and DOORBELL_BLOCK using the kernel register!() macro. This replaces
    manual bit manipulation with typed register and field accessors.

  - Add shmem-backed GEM objects and set DMA mask based on GPU physical
    address width.

  - Adopt HRT: separate driver type from driver data, and use IoMem
    directly instead of Devres for register access during probe.

  - Move clock cleanup into a Drop implementation.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: "Danilo Krummrich" <dakr@kernel.org>
Link: https://patch.msgid.link/DJ0IF39U9ETK.PCCUO7ZEQ4S0@kernel.org

i2c: Use named initializers for arrays of i2c_device_data

While being less compact, using named initializers allows to more easily
see which members of the structs are assigned which value without having
to lookup the declaration of the struct. And it's also more robust
against changes to the struct definition.

The mentioned robustness is relevant for a planned change to struct
i2c_device_id that replaces .driver_data by an anonymous union.

While touching all these arrays, unify usage of whitespace in the list
terminator.

This patch doesn't modify the compiled arrays, only their representation
in source form benefits. The former was confirmed with x86 and arm64
builds.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20260518164510.805502-2-u.kleine-koenig@baylibre.com
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>

cxl/pci: Convert PCIBIOS errors to errno on DVSEC config accesses

PCI config space accessors return positive PCIBIOS_* status codes on
failure that are positive integers. Several DVSEC accesses in the CXL
core propagated these raw values to callers that test for failure against
less than 0. Thus silently misinterpret the return value as success.

Convert the positive error values to negative errno values so the checks
are correct on error paths.

While the chances of a config access failure are low, fix for correctness
and to avoid confusion in the future when more DVSEC accesses are added.

Fixes: 14d788740774 ("cxl/mem: Consolidate CXL DVSEC Range enumeration in the core")
Fixes: ce17ad0d5498 ("cxl: Wait Memory_Info_Valid before access memory related info")
Reviewed-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Assisted-by: Claude:claude-opus-4-8
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260604180154.1925149-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

accel/ethosu: reject DMA commands with uninitialized length

cmd_state_init() initializes the command state with memset(0xff),
leaving dma->len at U64_MAX to signal missing setup. The only setter
is NPU_SET_DMA0_LEN; if userspace omits this command and issues
NPU_OP_DMA_START, dma->len remains U64_MAX.

In dma_length(), a positive stride added to U64_MAX wraps to a small
value. With size0 == 1, check_mul_overflow() does not trigger and
dma_length() returns 0 instead of U64_MAX. The caller's U64_MAX check
then passes, region_size[] stays 0, and the bounds check in
ethosu_job.c is bypassed, allowing hardware to execute DMA with stale
physical addresses.

Fix by checking for U64_MAX at the start of dma_length() before any
arithmetic, consistent with the sentinel value used throughout the
driver to detect uninitialized fields.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260524130319.12747-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

accel/ethosu: fix arithmetic issues in dma_length()

dma_length() derives DMA region usage from command stream values and
updates region_size[]:

len = ((len + stride[0]) * size0 + stride[1]) * size1
region_size[region] = max(..., len + dma->offset)

Several arithmetic issues can corrupt the derived region size:

- signed stride values may underflow when added to len
- intermediate multiplications may overflow
- len + dma->offset may overflow during region_size updates
- dma_length() error returns were not validated by the caller

region_size[] is later used by ethosu_job.c to validate command stream
accesses against GEM buffer sizes. Arithmetic wraparound can therefore
under-report region usage and bypass the bounds validation.

Fix by validating signed additions, using overflow helpers for
multiplications and offset updates, and propagating dma_length()
failures to the caller.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260524103710.47397-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

accel/ethosu: fix wrong weight index in NPU_SET_SCALE1_LENGTH on U85

On non-U65 hardware (e.g. U85), opcode 0x4093 is NPU_SET_WEIGHT2_LENGTH.
The BASE handler for the same opcode correctly assigns to
st.weight[2].base, but the LENGTH handler mistakenly assigns cmds[1]
to st.weight[1].length instead of st.weight[2].length.

This leaves weight[2].length at its initialised sentinel value of
0xffffffff and corrupts weight[1].length with the user-supplied value,
breaking the software bounds-check state for both weight buffers on U85.

Fix the index to match the BASE handler.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523210840.92039-3-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

accel/ethosu: reject NPU_OP_RESIZE commands from userspace

NPU_OP_RESIZE is a U85-only command that the driver does not yet
implement. The existing WARN_ON(1) placeholder fires unconditionally
whenever userspace submits this command via DRM_IOCTL_ETHOSU_GEM_CREATE,
causing unbounded kernel log spam.

If panic_on_warn is set the kernel panics, giving any unprivileged user
with access to the DRM device a trivial denial-of-service primitive.

Replace the WARN_ON(1) with an explicit -EINVAL return so the ioctl
rejects the command before it reaches hardware.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523210840.92039-2-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

cxl/pci: Fix the incorrect check of pci_read_config_word() return

pci_read_config_word() returns PCIBIOS_* status on error which are
positive values. The check should be for non-zero values to indicate
error. Fix cxl_set_mem_enable() to check for non-zero return value
instead of negative value.

While fixing this, also convert the error to negative errno value when
returning on error path.

Fixes: 34e37b4c432c ("cxl/port: Enable HDM Capability after validating DVSEC Ranges")
Reviewed-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Assisted-by: Claude:claude-opus-4-8
Link: https://patch.msgid.link/20260604180154.1925149-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

tools: ynl: try to avoid the very slow YAML loader

Turns out Python YAML defaults to a pure Python loader for YAML
files which is a lot slower than the C loader (using libyaml).
Try to use the C one whenever possible.

The avg time to run:
$ tools/net/ynl/pyynl/cli.py --family tc --no-schema
drops from 300+ ms to 115 ms with this change (40 samples).

We could drop the load time further to 85 ms if we "compiled"
the specs to JSON. Slightly tricky parts are that we don't
currently install the specs at all on make install, so it's
unclear where to put the conversion. Also JSON has questionable
support for comments and we need an SPDX line.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260603210810.2636193-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

accel/ethosu: fix IFM region index out-of-bounds in command stream parser

NPU_SET_IFM_REGION extracts the region index with param & 0x7f, giving
a maximum value of 127. However region_size[] and output_region[] in
struct ethosu_validated_cmdstream_info are both sized to
NPU_BASEP_REGION_MAX (8), giving valid indices [0..7].

Every other region assignment in the same switch uses param & 0x7:
  NPU_SET_OFM_REGION:  st.ofm.region  = param & 0x7;
  NPU_SET_IFM2_REGION: st.ifm2.region = param & 0x7;
  NPU_SET_WEIGHT_REGION: st.weight[0].region = param & 0x7;
  NPU_SET_SCALE_REGION:  st.scale[0].region  = param & 0x7;

The 0x7f mask on IFM is inconsistent and appears to be a typo.

feat_matrix_length() and calc_sizes() use the region index directly
as an array subscript into the kzalloc'd info struct:
  info->region_size[fm->region] = max(...);

A userspace caller supplying NPU_SET_IFM_REGION with param > 7 causes
a write up to 127*8 = 1016 bytes past the start of region_size[],
corrupting adjacent kernel heap data.

Fix by applying the same & 0x7 mask used by all other region
assignments.

Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Link: https://patch.msgid.link/20260523195159.55801-1-meatuni001@gmail.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-7.1-rc7).

Silent conflicts:

net/wireless/nl80211.c
  cb9959ab5f99 ("wifi: cfg80211: enforce HE/EHT cap/oper consistency")
  a384ae969902 ("wifi: cfg80211: move AP HT/VHT/... operation to beacon info")
https://lore.kernel.org/aiGJDaHV4UlCexIQ@sirena.org.uk

Conflicts:

drivers/net/wireless/intel/iwlwifi/mld/ap.c
  a342c99cb70d ("wifi: iwlwifi: mld: honor BSS_CHANGED_BEACON_ENABLED")
  9bf1b409afc7 ("wifi: iwlwifi: mld: send tx power constraints before link activation")
https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk

drivers/net/wireless/intel/iwlwifi/pcie/drv.c
  093305d801fa ("wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used")
  e2323929a68a ("wifi: iwlwifi: pcie: add debug print for resume flow if powered off")
https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk

Adjacent changes:

drivers/net/ethernet/airoha/airoha_eth.c
  b38cae85d1c4 ("net: airoha: Fix use-after-free in metadata dst teardown")
  ec6c391bcca7 ("net: airoha: Introduce airoha_gdm_dev struct")

drivers/net/ethernet/microchip/lan743x_main.c
  8173d22b211f ("net: lan743x: permit VLAN-tagged packets up to configured MTU")
  e3c6508a46f5 ("net: lan743x: avoid netdev-based logging before netdev registration")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: mm: clarify that user_reserve_kbytes has no effect when overcommit_memory is set to 0 or 1

Looking at __vm_enough_memory() in mm/util.c, user_reserve_kbytes has no
effect when overcommit_memory is set to 0 or 1. The documentation for
overcommit_memory already references user_reserve_kbytes when the flag
is set to 2.

Let's go ahead and add a clarification to user_reserve_kbytes in vm.rst
that it has no effect when overcommit_memory is set to 0 or 1.

Link: https://lore.kernel.org/20260528-mm-clarify-docs-v1-1-aa88e83b4bfd@redhat.com
Signed-off-by: Brian Masney <bmasney@redhat.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

MAINTAINERS: add vm.rst to memory management core

The vm.rst file is currently not listed in the MAINTAINERS file, so let's
go ahead and add to the MM core subsystem so that the maintainers are CCed
when changes to the documentation are proposed.

Link: https://lore.kernel.org/20260528-mm-vm-rst-maintainers-file-v1-1-306631c0a610@redhat.com
Signed-off-by: Brian Masney <bmasney@redhat.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/migrate: find_mm_struct: fix race between security checks and suid exec

The target task can execute a setuid binary between ptrace_may_access()
and get_task_mm(). Protect this critical section with exec_update_lock.

I don't think cpuset_mems_allowed(task) should be called under
exec_update_lock, but this patch just tries to add the minimal fix.

Perhaps we can later add a common helper which can be used by
find_mm_struct() and kernel_migrate_pages().

Link: https://lore.kernel.org/ahWxQ3JxdR5ff2qf@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: document the folio refcount a little better

Expand the documentation of folio_ref_count() to talk about expected,
temporary and spurious refcounts as well as the concept of freezing.

Link: https://lore.kernel.org/20260526200032.353868-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove mentions of PageWriteback

Update two comments to refer to writeback in general instead of the
specific flag. Convert the large comment in memory.c to be entirely
folio-based.

Link: https://lore.kernel.org/20260526195650.353196-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

zram: clear trailing bytes of compressed writeback pages

Patch series "zram: writeback fixes", v2.

Brian (privately) reported a "leak" of writeback bitmap in certain cases,
so that backing device can store less pages; and a theoretical data leak
in the trailing bytes of compressed writeback pages.  Both issues are low
risk.

This patch (of 2):

When compressed writeback is available writtenback pages contain "garbage"
in PAGE_SIZE - obj_size trailing bytes.  That "garbage" is, basically,
whatever data that page held before we got it for writeback.  To get
advantage of it an attacker needs to be able to read from active backing
swap device, which is already catastrophic.  Still, just in case, zero out
those trailing bytes before writeback to a backing device so that we only
store swap-ed out data there.

Link: https://lore.kernel.org/20260526022754.2377730-1-senozhatsky@chromium.org
Link: https://lore.kernel.org/20260526022754.2377730-3-senozhatsky@chromium.org
Fixes: d38fab605c66 ("zram: introduce compressed data writeback")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

zram: do not leak blk idx at the end of writeback

zram_writeback_slots() loop can terminate with valid reserved backing
device blk_idx. The problem is that cleanup code doesn't release that
reserved blk_idx before zram_writeback_slots() returns, which leads to
blk_idx leak (it becomes permanently busy and can not be used for actual
writeback.) This does not lead to any system instabilities, it only means
that we can writeback less pages. The scenario is hard to hit in practice
as it requires writeabck to race with modification (slot-free or
overwrite) of the final post-processing slot.

Release reserved but unused blk_idx before returning from
zram_writeback_slots().

Link: https://lore.kernel.org/20260526022754.2377730-2-senozhatsky@chromium.org
Fixes: f405066a1f0db ("zram: introduce writeback bio batching")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: multi objcg charge support

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA node
so that reparenting LRU folios can take per-node lru locks.  As a side
effect, the per-CPU obj_stock_pcp -- which caches exactly one cached_objcg
-- thrashes on workloads where threads of the same memcg run on different
NUMA nodes.  The kernel test robot reported a 67.7% regression on
stress-ng.switch.ops_per_sec from this pattern.

Mirror the multi-slot pattern already used by memcg_stock_pcp: turn
nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all slots
on consume/refill/account, prefer empty slots when inserting, and evict a
slot round-robin only when full.  With multiple slots a CPU can hold the
per-node objcg variants of one memcg plus a few siblings without ever
forcing a drain.

A single int8_t index records which slot the cached slab stats belong to;
the stats are flushed on slot or pgdat change.  With NR_OBJ_STOCK = 5 the
layout (verified with pahole) is:

  offset 0  : lock(1) + index(1) + node_id(2) + slab stats(4) = 8B
  offset 8  : nr_bytes[5]                                     = 10B
  offset 18 : padding                                         = 6B
  offset 24 : cached[5]                                       = 40B
  offset 64 : (line 2) work_struct + flags (cold)

so consume_obj_stock, refill_obj_stock and the slab account path each
touch exactly one 64-byte cache line on non-debug 64-bit builds.

Link: https://lore.kernel.org/20260526033931.1760588-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: int16_t for cached slab stats

Currently struct obj_stock_pcp stores cached slab stats in 'int' which is
4 bytes per counter on 64-bit machines.  Switch them to int16_t to shrink
the cached metadata.

The existing PAGE_SIZE flush in __account_obj_stock() bounds *bytes at
PAGE_SIZE on 4KiB and 16KiB page archs, well within int16_t.  On 64KiB
pages PAGE_SIZE is well above S16_MAX so that flush never fires, and a
sufficiently long run of accumulations would overflow the cache.  Add an
explicit S16_MAX guard before each add: when the next add would push
abs(*bytes) past S16_MAX, fold the cached value into @nr and flush
directly via mod_objcg_mlstate() before the accumulation.

Link: https://lore.kernel.org/20260526033931.1760588-4-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: uint16_t for nr_bytes in obj_stock_pcp

Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int' which
is 4 bytes on 64-bit machines.  Switch the field to uint16_t to shrink the
per-CPU cache.

The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and _256KB
(see HAVE_PAGE_SIZE_* in arch/Kconfig).  After the PAGE_SIZE-aligned flush
in __refill_obj_stock(), the sub-page remainder fits in uint16_t up
through 64KiB pages where PAGE_SIZE - 1 == U16_MAX, but on 256KiB pages
PAGE_SIZE - 1 == 0x3FFFF exceeds U16_MAX.  The accumulator also needs to
stay within uint16_t between page-aligned flushes on 64KiB pages where
PAGE_SIZE itself is U16_MAX + 1.

Accumulate the new total in an 'unsigned int' local, then on PAGE_SHIFT <=
16 flush whenever the accumulator would hit U16_MAX; together with the
existing allow_uncharge flush at PAGE_SIZE this keeps the uint16_t safe.

On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and powerpc
44x, both 32-bit), uint16_t cannot represent the sub-page remainder.
Define obj_stock_bytes_t as 'unsigned int' on those archs so nr_bytes can
hold the full remainder and the normal page-boundary flush in
__refill_obj_stock() and the page extraction in drain_obj_stock() both
work correctly.

The single-cache-line layout target only applies to PAGE_SHIFT <= 16;
those archs are 32-bit embedded and not the optimization target.

Link: https://lore.kernel.org/20260526033931.1760588-3-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: store node_id instead of pglist_data pointer

Patch series "memcg: shrink obj_stock_pcp and cache multiple objcgs", v3.

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA node
so that reparenting LRU folios can take per-node lru locks.  As a side
effect, the per-CPU obj_stock_pcp -- which caches a single cached_objcg
pointer -- thrashes on workloads where threads of the same memcg run on
different NUMA nodes.  The kernel test robot reported a 67.7% regression
on stress-ng.switch.ops_per_sec from this pattern.

Commit d0211878ce06 ("memcg: cache obj_stock by memcg, not by objcg
pointer") landed as a temporary fix by treating sibling per-node objcgs as
equivalent for the cache lookup, intended to be reverted once per-node
kmem accounting is introduced.  This series takes a more general approach:
cache multiple objcgs per CPU using the multi-slot pattern memcg_stock_pcp
already uses, so the per-node objcg variants of one memcg can all coexist
in the stock without ever forcing a drain.  The temporary fix can then be
reverted.

To avoid increasing the per-CPU cache footprint, the first three patches
shrink the existing single-slot obj_stock_pcp fields.  The final patch
converts cached_objcg and nr_bytes into NR_OBJ_STOCK=5 slot arrays and
reorders the struct so the entire consume/refill/account hot path fits
within a single 64-byte cache line on non-debug 64-bit builds (verified
with pahole).

This patch (of 4):

The struct obj_stock_pcp stores a pointer to pglist_data for the slab
stats cached on the cpu.  On 64-bit machines, this costs 8 bytes.  The
pointer is not strictly required: NODE_DATA() can recover it from the node
id.  Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE as the
"no stats cached" sentinel.

At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is
plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes.

Link: https://lore.kernel.org/20260526033931.1760588-1-shakeel.butt@linux.dev
Link: https://lore.kernel.org/20260526033931.1760588-2-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/memfd: remove unused variable 'sig' in fuse_test

  fuse_test.c: In function 'sealing_thread_fn':
  fuse_test.c:165:13: warning: unused variable 'sig' [-Wunused-variable]
    165 |         int sig, r;
        |             ^~~

Remove unused 'sig' to fix -Wunused-variable warning.

Link: https://lore.kernel.org/20260524193732.48853-3-eva.kurchatova@virtuozzo.com
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/memfd: fix -Wmaybe-uninitialized warning in memfd_test

Patch series "selftests/memfd: fix compilation warnings".

This patchset fixes warnings about unused but initialized variables, and
unused dummy buffer passed to pwrite() syscall in the tests.

This patch (of 2):

  memfd_test.c: In function 'mfd_fail_grow_write.part.0':
  memfd_test.c:685:13: warning: '<unknown>' may be used uninitialized
  [-Wmaybe-uninitialized]
    685 |         l = pwrite(fd, buf, mfd_def_size * 8, 0);
        |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pwrite() is declared with attribute 'access (read_only, 2, 3)', so GCC
knows it reads from the buffer.  malloc() returns uninitialized memory,
hence the warning.  Use calloc() to zero-initialize the buffer.  The
actual contents don't matter here since the test verifies that pwrite()
fails on a sealed memfd.

Link: https://lore.kernel.org/20260524193732.48853-1-eva.kurchatova@virtuozzo.com
Link: https://lore.kernel.org/20260524193732.48853-2-eva.kurchatova@virtuozzo.com
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: shmem: refactor thpsize_shmem_enabled_show() with helper arrays

Replace the hardcoded if/else chain of test_bit() calls and string
literals in thpsize_shmem_enabled_show() with a loop over
huge_shmem_orders_by_mode[] and huge_shmem_enabled_mode_strings[] arrays.

This makes thpsize_shmem_enabled_show() consistent with
thpsize_shmem_enabled_store() and eliminates duplicated mode name strings.

Link: https://lore.kernel.org/20260525102700.68707-3-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: shmem: refactor thpsize_shmem_enabled_store() with sysfs_match_string()

Patch series "refactors thpsize_shmem_enabled_store() and
thpsize_shmem_enabled_show()", v4.

This patch (of 2):

Inspired by commit 82d9ff648c6c ("mm: huge_memory: refactor
anon_enabled_store() with set_anon_enabled_mode()"), refactor
thpsize_shmem_enabled_store() using sysfs_match_string(). This eliminates
the duplicated spin_lock/unlock(), set/clear_bit(), calls across all
branches, reducing code duplication.

Behavioral change:
Call start_stop_khugepaged() only when the mode actually changes.
If unchanged, call set_recommended_min_free_kbytes() to preserve
legacy watermark behavior. This avoids unnecessary khugepaged restarts.

Tested with selftests ./run_kselftest.sh -t mm:ksft_thp.sh,
all test cases passed.

Link: https://lore.kernel.org/20260525102700.68707-1-ranxiaokai627@163.com
Link: https://lore.kernel.org/20260525102700.68707-2-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: make mmap_miss accounting symmetric for VM_SEQ_READ

do_sync_mmap_readahead() skips both the mmap_miss increment and the
MMAP_LOTSAMISS check for VM_SEQ_READ mappings, since sequential access is
non-speculative and should always read ahead.  The two decrement sites in
do_async_mmap_readahead() and filemap_map_pages() do not mirror this skip,
so concurrent faults on a VM_SEQ_READ mapping can still drive
ra->mmap_miss down to zero through the decrement paths even though nothing
in the sync path ever increments it.  The counter itself is per-file
(file->f_ra.mmap_miss), so it can be moved by any VMA mapping the file,
not just the one currently faulting.

Skip the decrement for VM_SEQ_READ in both decrement sites so the counter
only moves for mappings that also participate in the increment side.  No
functional change for VM_SEQ_READ users, since the increment-side gate
already prevents the counter from being consulted on their behalf, but it
stops a VM_SEQ_READ mapping from biasing the counter for other mappings of
the same file.

Link: https://lore.kernel.org/20260525145751.2671248-1-usama.arif@linux.dev
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Closes: https://lore.kernel.org/all/8edc8cd0-f65c-4456-9b3f-362e744c9a96@linux.dev/
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/proc: add /proc/pid/smaps tearing tests

Add tearing tests for /proc/pid/smaps file. New tests reuse the same
logic as with maps file but skipping all the data except for the VMA
addresses, which are the only part relevant for the tearing tests. Skip
PROCMAP_QUERY parts of the tests because smaps does not implement that
ioctl.

Link: https://lore.kernel.org/20260426062718.1238437-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <liam@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/proc: ensure the test is performed at the right page boundary

When running tearing tests we need to ensure the pages we use include VMAs
that were mapped by the child process for this test. Currently we always
use the first two pages, checking VMAs at their boundaries and this works,
however once we add tests for /proc/pid/smaps, the first two pages might
not contain the VMAs that child modifies. Locate the page that contains
the first VMA mapped by the child and use that and the next page for the
test.

Link: https://lore.kernel.org/20260426062718.1238437-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <liam@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/proc/task_mmu: read proc/pid/{smaps|numa_maps} under per-vma lock

Patch series "use vma locks for proc/pid/{smaps|numa_maps} reads", v2.

Use per-vma locks when reading /proc/pid/smaps and /proc/pid/numa_maps
similar to /proc/pid/maps to reduce contention on central mmap_lock.  One
major difference between maps and smaps/numa_maps reading is that the
latter executes page table walk which can't be done under RCU due to a
possibility of sleeping.  Therefore we drop RCU read lock before this walk
while keeping the VMA locked.  After the walk we retake RCU read lock,
reset VMA iterator and proceed with the next VMA.

The last two patches extend /proc/pid/maps test to cover /proc/pid/smaps
reading during concurrent address space modification.

This patch (of 3):

proc/pid/{smaps|numa_maps} can be read using the combination of RCU and
VMA read locks, similar to proc/pid/maps.  RCU is required to safely
traverse the VMA tree and VMA lock stabilizes the VMA being processed and
the pagetable walk.

Link: https://lore.kernel.org/20260426062718.1238437-1-surenb@google.com
Link: https://lore.kernel.org/20260426062718.1238437-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <liam@infradead.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmscan: unify writeback reclaim statistic and throttling

Currently MGLRU and non-MGLRU handle the reclaim statistic and writeback
handling very differently, especially throttling.  Basically MGLRU just
ignored the throttling part.

Let's just unify this part, use a helper to deduplicate the code so both
setups will share the same behavior.

Test using following reproducer using bash:

  echo "Setup a slow device using dm delay"
  dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
  LOOP=$(losetup --show -f /var/tmp/backing)
  mkfs.ext4 -q $LOOP
  echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
      dmsetup create slow_dev
  mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow

  echo "Start writeback pressure"
  sync && echo 3 > /proc/sys/vm/drop_caches
  mkdir /sys/fs/cgroup/test_wb
  echo 128M > /sys/fs/cgroup/test_wb/memory.max
  (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
      dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)

  echo "Clean up"
  echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
  dmsetup resume slow_dev
  umount -l /mnt/slow && sync
  dmsetup remove slow_dev

Before this commit, `dd` will get OOM killed immediately if MGLRU is
enabled.  Classic LRU is fine.

After this commit, throttling is now effective and no more spin on LRU or
premature OOM.  Stress test on other workloads also looks good.

Global throttling is not here yet, we will fix that separately later.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-15-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Tested-by: Leno Hou <lenohou@gmail.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmscan: remove sc->unqueued_dirty

No one is using it now, just remove it.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-14-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmscan: remove sc->file_taken

No one is using it now, just remove it.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-13-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: remove no longer used reclaim argument for folio protection

Now dirty reclaim folios are handled after isolation, not before, since
dirty reactivation must take the folio off LRU first, and that helps to
unify the dirty handling logic.

So this argument is no longer needed. Just remove it.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-12-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: simplify and improve dirty writeback handling

Right now the flusher wakeup mechanism for MGLRU is less responsive and
unlikely to trigger compared to classical LRU.  The classical LRU wakes
the flusher if one batch of folios passed to shrink_folio_list is
unevictable due to under writeback.  MGLRU instead check and handle this
after the whole reclaim loop is done.

We previously even saw OOM problems due to passive flusher, which were
fixed but still not perfect [1].

We have just unified the dirty folio counting and activation routine, now
just move the dirty flush into the loop right after shrink_folio_list.
This improves the performance a lot for workloads involving heavy
writeback and prepares for throttling too.

Test with YCSB workloadb showed a major performance improvement:

Before this series:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
workingset_refault_file 34522071

After this commit:
Throughput(ops/sec): 80857.08510208207
AverageLatency(us): 386.653262968934
pgpgin 112233121
workingset_refault_file 19516246

The performance is a lot better with significantly lower refault.  We also
observed similar or higher performance gain for other real-world
workloads.

We were concerned that the dirty flush could cause more wear for SSD: that
should not be the problem here, since the wakeup condition is when the
dirty folios have been pushed to the tail of LRU, which indicates that
memory pressure is so high that writeback is blocking the workload
already.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-11-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: use the common routine for dirty/writeback reactivation

Currently MGLRU will move the dirty writeback folios to the second oldest
gen instead of reactivate them like the classical LRU.  This might help to
reduce the LRU contention as it skipped the isolation.  But as a result we
will see these folios at the LRU tail more frequently leading to
inefficient reclaim.

Besides, the dirty / writeback check after isolation in shrink_folio_list
is more accurate and covers more cases.  So instead, just drop the special
handling for dirty writeback, use the common routine and re-activate it
like the classical LRU.

This should in theory improve the scan efficiency.  These folios will be
rotated back to LRU tail once writeback is done so there is no risk of
hotness inversion.  And now each reclaim loop will have a higher success
rate.  This also prepares for unifying the writeback and throttling
mechanism with classical LRU, we keep these folios far from tail so
detecting the tail batch will have a similar pattern with classical LRU.

The micro optimization that avoids LRU contention by skipping the
isolation is gone, which should be fine.  Compared to IO and writeback
cost, the isolation overhead is trivial.

And using the common routine also keeps the folio's referenced bits (tier
bits), which could improve metrics in the long term.  Also no more need to
clean reclaim bit as the common routine will make use of it.

Note the common routine updates a few throttling and writeback counters,
which are not used, and never have been for the MGLRU case.  We will start
making use of these in later commits.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-10-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: remove redundant swap constrained check upon isolation

Remove the swap-constrained early reject check upon isolation.  This check
is a micro optimization when swap IO is not allowed, so folios are
rejected early.  But it is redundant and overly broad since
shrink_folio_list() already handles all these cases with proper
granularity.

Notably, this check wrongly rejected lazyfree folios, and it doesn't cover
all rejection cases.  shrink_folio_list() uses may_enter_fs(), which
distinguishes non-SWP_FS_OPS devices from filesystem-backed swap and does
all the checks after folio is locked, so flags like swap cache are stable.

This check also covers dirty file folios, which are not a problem now
since sort_folio() already bumps dirty file folios to the next generation,
but causes trouble for unifying dirty folio writeback handling.

And there should be no performance impact from removing it.  We may have
lost a micro optimization, but unblocked lazyfree reclaim for NOIO
contexts, which is not a common case in the first place.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-9-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: don't abort scan immediately right after aging

Right now, if eviction triggers aging, the reclaimer will abort.  This is
not the optimal strategy for several reasons.

Aborting the reclaim early wastes a reclaim cycle when under pressure, and
for concurrent reclaim, if the LRU is under aging, all concurrent
reclaimers might fail.  And if the age has just finished, new cold folios
exposed by the aging are not reclaimed until the next reclaim iteration.

What's more, the current aging trigger is quite lenient, having 3 gens
with a reclaim priority lower than default will trigger aging, and blocks
reclaiming from one memcg.  This wastes reclaim retry cycles easily.  And
in the worst case, if the reclaim is making slower progress and all
following attempts fail due to being blocked by aging, it triggers
unexpected early OOM.

And if a lruvec requires aging, it doesn't mean it's hot.  Instead, the
lruvec could be idle for quite a while, and hence it might contain lots of
cold folios to be reclaimed.

While it's helpful to rotate memcg LRU after aging for global reclaim, as
global reclaim fairness is coupled with the rotation in shrink_many, memcg
fairness is instead handled by cgroup iteration in shrink_node_memcgs.
So, for memcg level pressure, this abort is not the key part for keeping
the fairness.  And in most cases, there is no need to age, and fairness
must be achieved by upper-level reclaim control.

So instead, just keep the scanning going unless one whole batch of folios
failed to be isolated or enough folios have been scanned, which is
triggered by evict_folios returning 0.  And only abort for global reclaim
after one batch, so when there are fewer memcgs, progress is still made,
and the fairness mechanism described above still works fine.

And in most cases, the one more batch attempt for global reclaim might
just be enough to satisfy what the reclaimer needs, hence improving global
reclaim performance by reducing reclaim retry cycles.

Rotation is still there after the reclaim is done, which still follows the
comment in mmzone.h.  And fairness still looking good.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-8-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: use a smaller batch for reclaim

With a fixed number to reclaim calculated at the beginning, making each
following step smaller should reduce the lock contention and avoid
over-aggressive reclaim of folios, as it will abort earlier when the
number of folios to be reclaimed is reached.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-7-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: avoid reclaim type fall back when isolation makes no progress

While isolation makes no progress in scan_folios(), we quickly fall back
to the other type in isolate_folios(). This is incorrect, as the current
type may still have sufficient folios. Falling back can undermine the
positive_ctrl_err() result from get_type_to_scan(), which is derived from
swappiness.

So just continue scanning this type for another round.

Worth noting if the cold generations are all reclaimed, scan will no
longer make any progress either, which may undermine the swappiness again.
This is not a new issue and hence better be fixed later [1].

Link: https://lore.kernel.org/linux-mm/CAGsJ_4zjdOYEtuO6gNjABm7NDxW0skzBFNRNee-k2D6VwsYEQA@mail.gmail.com/
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-6-02fabb92dc43@tencent.com
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: scan and count the exact number of folios

Make the scan helpers return the exact number of folios being scanned or
isolated.  Since the reclaim loop now has a natural scan budget that
controls the scan progress, returning the scan number and consuming the
budget makes the scan more accurate and easier to follow.

The number of scanned folios for each iteration is always larger than 0,
unless the reclaim must stop for a forced aging, so there is no more need
for any special handling when there is no progress made:

- `return isolated || !remaining ?  scanned : 0` in scan_folios: both
  the function and the call now just return the exact scan count, combined
  with the scan budget introduced in the previous commit to avoid livelock
  or under scan.

- `scanned += try_to_inc_min_seq` in evict_folios: adding a bool as a
  scan count was kind of confusing and no longer needed, as scan number
  should never be zero as long as there are still evictable gens.  We may
  encounter a empty old gen that returns 0 scan count, to avoid that, do a
  try_to_inc_min_seq before toisolation which have slight to none overhead
  in most cases.

- `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios: the
  per-type get_nr_gens == MIN_NR_GENS check in scan_folios naturally
  returns 0 when only two gens remain and breaks the loop.

Also change try_to_inc_min_seq to return void, as its return value is no
longer used by any caller.  Call it before isolate_folios to flush any
empty gens left by external folio freeing, and again after isolate_folios
when scanning moved or protected folios may have emptied the oldest gen.

The scan still stops if only two gens are left, as the scan number will be
zero.  This matches the previous behavior.  This forced gen protection may
be removed or softened later to improve reclaim further.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-5-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: restructure the reclaim loop

The current loop will calculate the scan number on each iteration.  The
number of folios to scan is based on the LRU length, with some unclear
behaviors, e.g, the scan number is only shifted by reclaim priority when
aging is not needed or when at the default priority, and it couples the
number calculation with aging and rotation.

Adjust, simplify it, and decouple aging and rotation.  Just calculate the
scan number for once at the beginning of the reclaim, always respect the
reclaim priority, and make the aging and rotation more explicit.

This slightly changes how aging and offline memcg reclaim works:
Previously, aging was skipped at DEF_PRIORITY even when eviction was no
longer possible, so the reclaimer wasted an iteration until the priority
escalated.  Now aging runs immediately whenever it is needed to make
progress; the DEF_PRIORITY skip only applies when eviction is still
viable.  This may avoid wasted iterations that over-reclaim slab and break
reclaim balance in multi-cgroup setups.

Similar for offline memcg.  Previously, offline memcg wouldn't be aged
unless it didn't have any evictable folios.  Now, we might age it if it
has only 3 generations, which should be fine.  On one hand, offline memcg
might still hold long-term folios, and in fact, a long-existing offline
memcg must be pinned by some long-term folios like shmem.  These folios
might be used by other memcg, so aging them as ordinary memcg seems
correct.  Besides, aging enables further reclaim of an offlined memcg,
which will certainly happen if we keep shrinking it.  And offline memcg
might soon be no longer an issue with reparenting.

Overall, the memcg LRU rotation, as described in mmzone.h, remains the
same.

Note that because the scan budget is now pinned at loop entry, tiny lruvec
might skip this reclaim pass, also skipping aging, which could be
beneficial as aging is not helpful since it will still be un-reclaimable
after aging.  Reclaim will go on as usual once priority escalates.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-4-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: relocate the LRU scan batch limit to callers

Same as active / inactive LRU, MGLRU isolates and scans folios in batches.
The batch split is done hidden deep in the helper, which makes the code
harder to follow. The helper's arguments are also confusing since callers
usually request more folios than the batch size, so the helper almost
never processes the full requested amount.

Move the batch splitting into the top loop to make it cleaner, there
should be no behavior change.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-3-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: rename variables related to aging and rotation

The current variable name isn't helpful. Make the variable names more
meaningful.

Only naming change, no behavior change.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-2-02fabb92dc43@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mglru: consolidate common code for retrieving evictable size

Patch series "mm/mglru: improve reclaim loop and dirty folio", v7.

This series cleans up and slightly improves MGLRU's reclaim loop and dirty
writeback handling.  As a result, we can see an up to ~30% increase in
some workloads like MongoDB with YCSB and a huge decrease in file refault,
no swap involved.  Other common benchmarks have no regression, and LOC is
reduced, with less unexpected OOM, too.

Some of the problems were found in our production environment, and others
were mostly exposed while stress testing during the development of the
LSM/MM/BPF topic on improving MGLRU [1].  This series cleans up the code
base and fixes several performance issues, preparing for further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are
somehow related to each other.  The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.

This series slightly cleans up and improves these issues using a scan
budget by calculating the number of folios to scan at the beginning of the
loop, and decouples aging from the reclaim calculation helpers.  Then,
move the dirty flush logic inside the reclaim loop so it can kick in more
effectively.  These issues are somehow related, and this series handles
them and improves MGLRU reclaim in many ways.

Test results: All tests are done on a 48c96t NUMA machine with 2 nodes and
a 128G memory machine using NVME as storage.  Classical (non-MGLRU) LRU
numbers are included as "MGLRU disabled" for each benchmark below; see [8]
and [9] for the longer write-up.

MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% read and 5% update to generate mixed read and
dirty writeback.  MongoDB is set up in a 10G cgroup using Docker, and the
WiredTiger cache size is set to 4.5G, using NVME as storage.  This is
close to the case we observed regressing in our production environment:
mixed read and writeback pressure, so it is a practical case for
evaluation.

Not using SWAP.  The intent is to isolate the file LRU writeback path.
Enabling SWAP would just add noise from anonymous reclaim.

MGLRU Before:
Throughput(ops/sec): 60653.502655
workingset_refault_file 12904916
pgpgin 165366622
pgpgout 5219588

MGLRU After:
Throughput(ops/sec): 82384.354760 (+35.8%, higher is better)
workingset_refault_file 7128285   (-44.7%, lower is better)
pgpgin 113170693                  (-31.5%, lower is better)
pgpgout 5639724

MGLRU Disabled:
Throughput(ops/sec): 93713.640901
workingset_refault_file 15013443
pgpgin 85365614
pgpgout 5866508

We can see a significant performance improvement after this series.  The
test is done on NVME and the performance gap would be even larger for slow
devices, such as HDD or network storage.  We observed over 100% gain for
some workloads with slow IO.

Note, classical LRU is still faster for this benchmark, MGLRU may catch up
later with further work [7].

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
workers.  Many memcgs each applying roughly equal pressure exercises the
LRU's ability to detect/protect each tenant's working set and to balance
reclamation fairly between tenants, which makes this a meaningful test for
the reclaim mechanism.

Fairness is reported via Jain's fairness index (1.0 means all tenants get
exactly equal allocation, lower is worse).  Under equal pressure, all
memcgs should make roughly equal forward progress.  See [8] for the longer
rationale and per-memcg breakdown.

MGLRU before:
Total requests:           81898
Per-worker mean:         1279.7
Per-worker 95% CI (mean):       [  1259.0,   1300.4]
Jain's fairness index: 0.995893  (1.0 = perfectly fair)
Latency:
      Bucket     Count      Pct    Cumul
      [0,1)s     28392   34.67%   34.67%
      [1,2)s      8022    9.80%   44.46%
      [2,4)s      6130    7.48%   51.95%
      [4,8)s     39354   48.05%  100.00%

MGLRU after:
Total requests:           82901
Per-worker mean:         1295.3
Per-worker 95% CI (mean):       [  1265.3,   1325.4]
Jain's fairness index: 0.991607  (1.0 = perfectly fair)
Latency:
      Bucket     Count      Pct    Cumul
      [0,1)s     28128   33.93%   33.93%
      [1,2)s      8756   10.56%   44.49%
      [2,4)s      7028    8.48%   52.97%
      [4,8)s     38989   47.03%  100.00%

MGLRU disabled:
Total requests:           62399
Per-worker mean:          975.0
Per-worker 95% CI (mean):       [   941.9,   1008.1]
Jain's fairness index: 0.982156  (1.0 = perfectly fair)
Latency:
      Bucket     Count      Pct    Cumul
      [0,1)s     20051   32.13%   32.13%
      [1,2)s      2255    3.61%   35.75%
      [2,4)s      6149    9.85%   45.60%
      [4,8)s     33927   54.37%   99.97%
     [8,16)s        17    0.03%  100.00%

Reclaim is still fair and effective, total requests number seems slightly
better.

OOM issue with aging and throttling
===================================
For the throttling OOM issue, it can be easily reproduced using dd and
cgroup limit as demonstrated and fixed by a later patch in this series.

The aging OOM is a bit tricky, a specific reproducer can be used to
simulate what we encountered in production environment [4]: Spawns
multiple workers that keep reading the given file using mmap, and pauses
for 120ms after one file read batch.  It also spawns another set of
workers that keep allocating and freeing a given size of anonymous memory.
The total memory size exceeds the memory limit (eg.  14G anon + 8G file,
which is 22G vs a 16G memcg limit).

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  OOM with following info after about ~10-20 iterations:
    [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
    [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [   62.640823] Memory cgroup stats for /demo:
    [   62.641017] anon 10604879872
    [   62.641941] file 6574858240

  OOM occurs despite there being still evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.

Worth noting there is another OOM related issue reported in V1 of this
series, which is tested and looking OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=48 --time=600 run

A 24G InnoDB buffer pool inside a 2G memcg with ZRAM as swap forces
aggressive eviction of cached database anon pages, which exercises the
LRU's hot page detection and the eviction path under swap pressure.  The
workload is practical, and the pressure is higher than what we usually see
in production but it is intended to expose the extreme case.

MGLRU before:   17313.688333 tps
MGLRU after:    17286.195000 tps
MGLRU disabled: 16245.330000 tps

Seems only noise level changes, no regression.

FIO:
====
Testing with the following command, where /mnt/ramdisk is a 64G EXT4
ramdisk, each test file is 3G, in a 10G memcg, 6 test run each:

fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
  --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
  --rw=randread --norandommap --time_based \
  --ramp_time=1m --runtime=5m --group_reporting

Random buffered mmap read on a ramdisk strips out storage variance and
stresses purely the LRU's ability to evict and recycle the page cache
under heavy random read pressure.

MGLRU before:      9033.91 MB/s
MGLRU after:       9065.72 MB/s
MGLRU disabled:    8254.54 MB/s

Also seem only noise level changes and no regression or slightly better.

Build kernel:
=============
Build kernel test using ZRAM as swap, kernel source on tmpfs, in a memcg
with memory.max=3G, using make -j96 and defconfig, measuring system time,
6 test run each.  Building the kernel is a classical mixed anon + file
workload (lots of small file reads/writes plus parallel anon allocations
from cc/ld) and is representative of many real compilation jobs.

MGLRU before:     2823.13s
MGLRU after:      2801.26s
MGLRU disabled:   5023.50s

Also seem only noise level changes, no regression or very slightly better.

Android:
========
Xinyu reported a performance gain on Android, too, with this series.  The
test consisted of cold-starting multiple applications sequentially under
moderate system load [6]; this is a real Android user-visible scenario,
dominated by the LRU's ability to keep the right working set resident and
re-fault launch-critical pages quickly.

Before:
Launch Time Summary (all apps, all runs)
  Mean 868.0ms
  P50 888.0ms
  P90 1274.2ms
  P95 1399.0ms

After:
Launch Time Summary (all apps, all runs)
  Mean 850.5ms (-2.07%)
  P50 861.5ms  (-3.04%)
  P90 1179.0ms (-8.05%)
  P95 1228.0ms (-12.2%)

This patch (of 15):

Merge commonly used code for counting evictable folios in a lruvec.

No behavior change.

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-0-02fabb92dc43@tencent.com
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-1-02fabb92dc43@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/
Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/
Link: https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@mail.gmail.com/
Link: https://lore.kernel.org/linux-mm/CAMgjq7D+4QmiWe73OPFuH0s+ZKCUJoo+MfcWOdJcV+VO-T2Wmg@mail.gmail.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

userfaultfd: make functions that are not used outside uffd static

After merging fs/userfaultfd.c into mm/userfaultfd.c, several functions
that were previously shared between the two files are now only used within
mm/userfaultfd.c.

Make them static and remove their declarations from
include/linux/userfaultfd_k.h.

Link: https://lore.kernel.org/20260523173759.3964908-3-rppt@kernel.org
Assisted-by: Copilot:claude-opus-4-6
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c

Patch series "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c",
v3.

These patches merge fs/userfaultfd.c into mm/userfaultfd.c and make
functions used only inside mm/userfaultfd.c static.

This patch (of 2):

Historically userfaultfd implementation has been split between
fs/userfaultfd.c and mm/userfaultfd.c.

The mm/ part implemented memory management operations, while the fs/ part
implemented file descriptor handling and called into the mm/ part for the
actual memory management work.

This separation is quite artificial and fs/userfaultfd.c does not seem to
belong to fs/ because it's only a user if vfs APIs and like for other
users, for example, memfd and secretmem, the file descriptor handling
could live in mm/ as well.

"Append" fs/userfaultfd.c to mm/userfaultfd and update fs/Makefile and
MAINTAINERS accordingly.

No intended functional changes.

Link: https://lore.kernel.org/20260523173759.3964908-1-rppt@kernel.org
Link: https://lore.kernel.org/20260523173759.3964908-2-rppt@kernel.org
Assisted-by: Copilot:claude-opus-4-6
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: remove VM_BUG_ON()s from pindex helpers

Vlastimil pointed out that the VM_BUG_ON()s have fallen out of favour, so
remove them.

Link: https://lore.kernel.org/20260526-page_alloc-unmapped-prep-v2-1-412f4d486115@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Link: https://lore.kernel.org/all/4074a816-9e75-45a6-8141-25459bcc106b@kernel.org/
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>