git.ipfire.org Git - thirdparty/kernel/linux.git/log

fuse-uring: refactor io-uring header copying to ring

Move header copying to ring logic into a new copy_header_to_ring()
function. This makes the copy_to_user() logic more clear and centralizes
error handling / rate-limited logging.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: separate next request fetching from sending logic

Simplify the logic for fetching + sending off the next request.

This gets rid of fuse_uring_send_next_to_ring() which contained
duplicated logic from fuse_uring_send(). This decouples request fetching
from the send operation, which makes the control flow clearer and
reduces unnecessary parameter passing.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: invalidate readdir cache on epoch bump

FUSE_NOTIFY_INC_EPOCH invalidates dentries, but does not invalidate cached
readdir results. A process with cwd inside a FUSE mount can therefore
observe stale readdir(".") output after an epoch bump.

Fix this by recording epoch in the readdir cache and checking it on reuse.

Minimal reproducer:

- mount a tiny FUSE fs with an empty root directory
- on opendir, enable fi->cache_readdir and fi->keep_cache
- chdir into the mount and call readdir(".") to populate readdir cache
- make the FUSE server report one file in the root directory
- send only FUSE_NOTIFY_INC_EPOCH
- call readdir(".") again; before this change it stays stale, after this
change it sees the new file
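
The fix described above can be sketched in userspace C as follows. This
is only an illustration under stated assumptions: struct conn_sketch,
struct rdc_sketch and both helpers are hypothetical names, not the
kernel's actual structures or functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the connection; only the epoch matters here. */
struct conn_sketch {
	int epoch;	/* bumped by an INC_EPOCH-style notification */
};

/* Hypothetical stand-in for a per-directory readdir cache. */
struct rdc_sketch {
	bool filled;	/* cache holds readdir results */
	int epoch;	/* connection epoch at the time the cache was filled */
};

/* Record the current epoch alongside the cached results. */
void rdc_fill_sketch(struct rdc_sketch *rdc, const struct conn_sketch *fc)
{
	assert(rdc && fc);
	rdc->filled = true;
	rdc->epoch = fc->epoch;
}

/* The cache may be reused only if no epoch bump happened since the fill. */
bool rdc_usable_sketch(const struct rdc_sketch *rdc,
		       const struct conn_sketch *fc)
{
	return rdc->filled && rdc->epoch == fc->epoch;
}
```

With this shape, a reader that finds the recorded epoch stale simply
refills the cache from the server instead of serving old entries.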

Fixes: 2396356a945b ("fuse: add more control over cache invalidation behaviour")
Signed-off-by: Jun Wu <quark@meta.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

virtio-fs: avoid double-free on failed queue setup

virtio_fs_setup_vqs() allocates fs->vqs and fs->mq_map before calling
virtio_find_vqs(). If virtio_find_vqs() fails, the error path frees both
pointers and returns an error to virtio_fs_probe().

virtio_fs_probe() then drops the last kobject reference, and
virtio_fs_ktype_release() frees fs->vqs and fs->mq_map again. This leaves
dangling pointers in struct virtio_fs and can trigger a double-free during
probe failure cleanup.

Set fs->vqs and fs->mq_map to NULL immediately after kfree() in the
virtio_fs_setup_vqs() error path so that the later kobject release sees an
uninitialized state and kfree(NULL) becomes harmless.

This can be reproduced when a broken virtio-fs device advertises more
request queues than the transport actually provides. In that case
virtio_find_vqs() fails while setting up the extra queue, and the probe
path reaches the double-free cleanup sequence.
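
A minimal userspace sketch of the pattern, assuming free() in place of
kfree() and a hypothetical struct fs_sketch in place of struct
virtio_fs. It is not the driver code, only the clear-after-free idea:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two allocations owned by the fs object. */
struct fs_sketch {
	int *vqs;
	int *mq_map;
};

/* Setup that can fail after allocating, as when queue discovery fails. */
int setup_vqs_sketch(struct fs_sketch *fs, int fail)
{
	assert(fs);
	fs->vqs = calloc(4, sizeof(*fs->vqs));
	fs->mq_map = calloc(4, sizeof(*fs->mq_map));
	if (!fs->vqs || !fs->mq_map || fail) {
		free(fs->vqs);
		free(fs->mq_map);
		/* Clear the pointers so a later release cannot free them again. */
		fs->vqs = NULL;
		fs->mq_map = NULL;
		return -1;
	}
	return 0;
}

/* Release path that always frees; free(NULL) is a no-op. */
void release_sketch(struct fs_sketch *fs)
{
	free(fs->vqs);
	free(fs->mq_map);
	fs->vqs = NULL;
	fs->mq_map = NULL;
}
```

Calling release_sketch() after a failed setup_vqs_sketch() is then safe,
because both pointers are already NULL.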

Signed-off-by: Yung-Tse Cheng <mes900903@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: invalidate page cache after DIO and async DIO writes

This fix performs page cache invalidation after DIO and async DIO writes for
both O_DIRECT and FOPEN_DIRECT_IO cases.

Commit b359af8275a9 ("fuse: Invalidate the page cache after FOPEN_DIRECT_IO
write") fixed xfstests generic/209 for DIO writes in the FOPEN_DIRECT_IO
path. DIO writes without FOPEN_DIRECT_IO are already handled by
generic_file_direct_write().
However, async DIO writes (xfstests generic/451) remain unhandled.

After this fix:
- Async write with FUSE_ASYNC_DIO:
    invalidate in fuse_aio_invalidate_worker()

- Otherwise (Sync or async write without FUSE_ASYNC_DIO):
    - With FOPEN_DIRECT_IO:
        invalidate in fuse_direct_write_iter()
    - Without FOPEN_DIRECT_IO:
        invalidate in generic_file_direct_write()

A workqueue is required for async write invalidation to prevent deadlock:
calling it directly in the I/O end routine (which is in fuse worker thread
context) can block on a folio lock held by a buffered I/O thread waiting
for the same fuse worker thread.

Co-developed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Cheng Ding <cding@ddn.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: set ff->flock only on success

If FUSE_SETLK fails (e.g., due to EWOULDBLOCK), we shall not set
FUSE_RELEASE_FLOCK_UNLOCK in fuse_file_release().

Reported-by: Li Yichao <liyichao.1@bytedance.com>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: clean up interrupt reading

Clean up interrupt reading logic. Remove passing the pointer to the fuse
request as an arg and make the header initializations more readable.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove stray newline in fuse_dev_do_read()

Remove stray newline that shouldn't be there.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: use READ_ONCE in fuse_chan_num_background()

fuse_chan_num_background() is called without holding fch->bg_lock (for
example from fuse_writepages() to compare against fc->congestion_threshold),
while fch->num_background is updated under bg_lock in dev.c and dev_uring.c.
This is the same locked-write/lockless-read pattern already used for
max_background in fuse_chan_max_background().

Use READ_ONCE() on the read side so that:

- The compiler does not cache or coalesce loads of a value that may change
concurrently on another CPU.
- KCSAN does not report an unexpected race.
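
The read-side pattern can be approximated in userspace as below. This is
a simplified sketch: READ_ONCE_SKETCH is a reduced stand-in for the
kernel macro (it relies on the GCC/Clang __typeof__ extension), and
struct chan_sketch is a hypothetical placeholder for the real channel
structure.

```c
#include <assert.h>

/*
 * Simplified approximation of READ_ONCE(): the volatile access obliges
 * the compiler to perform the load as written, so it may not cache,
 * merge, or repeat it.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the channel structure. */
struct chan_sketch {
	unsigned int num_background;	/* written under a lock elsewhere */
};

/* Lockless reader: one marked load per call. */
unsigned int chan_num_background_sketch(struct chan_sketch *fch)
{
	assert(fch);
	return READ_ONCE_SKETCH(fch->num_background);
}
```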

Signed-off-by: Li Wang <liwang@kylinos.cn>
Fixes: 670d21c6e17f ("fuse: remove reliance on bdi congestion")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: dax: Move long delayed work on system_dfl_long_wq

Currently the code enqueues work items using {queue|mod}_delayed_work() on
system_long_wq. This workqueue is meant for work items that are expected to
run long, and it is a per-cpu workqueue.

These functions end up calling __queue_delayed_work(), which sets a global
timer that could fire on any CPU, enqueuing the work where the timer fired.

Unbound works could benefit from scheduler task placement, to optimize
performance and power consumption. Long work shouldn't stick to a single
CPU.

Recently, a new unbound workqueue specific for long running work has
been added:

c116737e972e ("workqueue: Add system_dfl_long_wq for long unbound works")

Since the work doesn't rely on per-cpu variables, there is no obvious
reason that justifies the use of a per-cpu workqueue. So replace
system_long_wq with system_dfl_long_wq so that the work may benefit from
scheduler task placement.

Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: add fuse_request_sent tracepoint

This new tracepoint complements fuse_request_send (enqueue) and
fuse_request_end (completion). It fires after the request has been
successfully copied to the daemon's buffer, just before the daemon
can start to process it.

fuse_request_sent does not fire if the copy of the request fails.
It also does not fire for NOTIFY_REPLY, which fires the _end tracepoint
at the end of copy.

This is needed for tools tracking the in-flight state of user initiated
fuse requests.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: Add SPDX ID lines to some files

Some fuse source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove old
license references from the headers.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: use QSTR() instead of QSTR_INIT() in fuse_get_dentry

Drop the hard-coded length argument and use the simpler QSTR(). Inline
the code and drop the local variable.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: convert page array allocation to kcalloc()

fuse_get_user_pages() allocates the temporary pages[] array used by
iov_iter_extract_pages() with the open-coded kzalloc(n * sizeof(*p),
...) form. max_pages is derived from the inbound iov_iter and is not
bounded at compile time, so the multiplication can overflow on
sufficiently large iter counts; the resulting too-small allocation
would then be written past by iov_iter_extract_pages().

Switch to kcalloc(), which carries the same zero-on-allocation
semantics and adds the standard size_mul overflow check. No
functional change for non-overflow inputs.
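
The kind of check an overflow-safe array allocator performs can be
sketched in userspace as follows. alloc_array_sketch() and
alloc_demo_sketch() are hypothetical helpers built on calloc(), not the
kernel allocator itself:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Overflow-checked, zeroing array allocation. Returns NULL instead of a
 * too-small buffer when n * size does not fit in size_t.
 */
void *alloc_array_sketch(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the product would wrap around */
	return calloc(n, size);
}

/* Usage: an absurd element count is refused, a small one succeeds. */
void alloc_demo_sketch(void)
{
	void **pages;

	assert(alloc_array_sketch(SIZE_MAX / 2, 16) == NULL);

	pages = alloc_array_sketch(8, sizeof(*pages));
	assert(pages != NULL);
	free(pages);
}
```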

Signed-off-by: William Theesfeld <william@theesfeld.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: use current creds for backing files

FUSE backing files only need a stable snapshot of the current credentials
for later backing-file I/O. prepare_creds() allocates a mutable copy and
can fail, but this code never modifies or commits the result.

Use get_current_cred() instead and store it as a const pointer. This
matches the rest of the backing-file helpers and avoids an unnecessary
allocation and failure path.

Signed-off-by: GuoHan Zhao <zhaoguohan@kylinos.cn>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: expand MAINTAINERS with subsystem info, update mailing list

- Bernd and Joanne are maintainers for fuse-uring

- Amir is maintainer for passthrough

- mailing list is now officially <fuse-devel@lists.linux.dev>

- change status of fuse-core to be "Supported"

Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove redundant buffer size checks for interrupt and forget requests

In fuse_dev_do_read(), there is already logic that ensures the buffer is
at least FUSE_MIN_READ_BUFFER (8k) bytes.

This makes the buffer size checks for interrupt and forget requests
redundant as sizeof(struct fuse_in_header) + sizeof(struct
fuse_interrupt_in) and sizeof(struct fuse_in_header) + sizeof(struct
fuse_forget_in) are both less than FUSE_MIN_READ_BUFFER.

We can get rid of these checks.
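
The size argument can be expressed as a compile-time check. The sketch
below uses local mirror structs whose layouts are assumptions of this
example (the authoritative definitions live in the FUSE uapi header),
and every *_sketch name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_READ_BUFFER_SKETCH 8192	/* the 8k minimum named above */

/* Assumed layout of the request header; verify against the uapi header. */
struct in_header_sketch {
	uint32_t len;
	uint32_t opcode;
	uint64_t unique;
	uint64_t nodeid;
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
	uint16_t total_extlen;
	uint16_t padding;
};

struct interrupt_in_sketch { uint64_t unique; };
struct forget_in_sketch { uint64_t nlookup; };

/* Both fixed-size requests are far smaller than the minimum buffer. */
static_assert(sizeof(struct in_header_sketch) +
	      sizeof(struct interrupt_in_sketch) < MIN_READ_BUFFER_SKETCH,
	      "interrupt request fits in the minimum read buffer");
static_assert(sizeof(struct in_header_sketch) +
	      sizeof(struct forget_in_sketch) < MIN_READ_BUFFER_SKETCH,
	      "forget request fits in the minimum read buffer");

size_t interrupt_req_size_sketch(void)
{
	return sizeof(struct in_header_sketch) +
	       sizeof(struct interrupt_in_sketch);
}

size_t forget_req_size_sketch(void)
{
	return sizeof(struct in_header_sketch) +
	       sizeof(struct forget_in_sketch);
}
```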

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: drop redundant check in fuse_sync_bucket_alloc()

kzalloc_obj with __GFP_NOFAIL is documented to never return failure,
and checking for NULL is redundant (__GFP_NOFAIL in gfp_types.h).

Signed-off-by: Li Wang <liwang@kylinos.cn>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: reduce attributes invalidated on directory change

When the contents of a directory are modified, some of its attributes may
also change, so they need to be invalidated.  But this isn't the case
for every attribute.  For instance, unlinking or creating a file doesn't
change the uid/gid of its parent directory.

This can cause unnecessary FUSE_GETATTRs to be sent to user-space.  For
example, fuse_permission() checks if mode, uid, and gid are valid and
will issue a FUSE_GETATTR if they're not, which results in an extra
FUSE_GETATTR request for every FUSE_UNLINK when removing files in the
same directory.

Signed-off-by: Konrad Sztyber <ksztyber@nvidia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: drop redundant err assignment in fuse_create_open()

In fuse_create_open(), err is initialized to -ENOMEM immediately before
the fuse_alloc_forget() NULL check. If forget allocation fails,
it branches to out_err with that value. If it succeeds, it falls through
without modifying err, so err is still -ENOMEM at the point where
fuse_file_alloc() is called. The second err = -ENOMEM before
fuse_file_alloc() therefore is redundant.

Signed-off-by: Li Wang <liwang@kylinos.cn>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: fuse_i.h: clean up kernel-doc comments

Convert many comments to kernel-doc format to eliminate around 20
kernel-doc warnings like these:

Warning: fs/fuse/fuse_i.h:374 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* A Fuse connection.
Warning: fs/fuse/fuse_i.h:817 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Get a filled in inode
Warning: fs/fuse/fuse_i.h:859 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Send RELEASE or RELEASEDIR request
and more like this.

Also add struct member and function parameter descriptions to avoid
these warnings:

Warning: fs/fuse/fuse_i.h:1071 struct member 'epoch_work' not described in 'fuse_conn'
Warning: fs/fuse/fuse_i.h:1071 struct member 'rcu' not described in 'fuse_conn'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'fc' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'nodeid' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'offset' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1423 function parameter 'len' not described in 'fuse_reverse_inval_inode'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'fc' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'parent_nodeid' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'child_nodeid' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'name' not described in 'fuse_reverse_inval_entry'
Warning: fs/fuse/fuse_i.h:1436 function parameter 'flags' not described in 'fuse_reverse_inval_entry'

Convert struct fuse_file, struct fuse_submount_lookup, struct fuse_inode,
and struct fuse_conn to kernel-doc.

Convert these to plain comments:
Warning: fs/fuse/fuse_i.h:1423 expecting prototype for File(). Prototype was for fuse_reverse_inval_inode() instead
Warning: fs/fuse/fuse_i.h:1436 expecting prototype for File(). Prototype was for fuse_reverse_inval_entry() instead

Change some "/**" to "/*" since they are not kernel-doc comments.

The changes above fix most kernel-doc warnings in this file but
these warnings are not fixed and still remain:
Warning: fs/fuse/fuse_i.h:1428 No description found for return value of 'fuse_fill_super_common'
Warning: fs/fuse/fuse_i.h:1455 No description found for return value of 'fuse_ctl_add_conn'

Binary build output is the same before and after these changes.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: fuse_dev_i.h: clean up kernel-doc warnings

Change some "/**" to "/*" since they are not kernel-doc comments:

Warning: fs/fuse/fuse_dev_i.h:25 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Request flags
Warning: fs/fuse/fuse_dev_i.h:58 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* A request to the client
Warning: fs/fuse/fuse_dev_i.h:117 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Input queue callbacks
Warning: fs/fuse/fuse_dev_i.h:289 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Fuse device instance
and more like this.

Convert enum fuse_req_flag to kernel-doc format.
Convert struct fuse_req, struct fuse_iqueue_ops, and struct fuse_dev
to kernel-doc format.

These warnings remain:
Warning: fs/fuse/fuse_dev_i.h:115 struct member 'ring_entry' not described in 'fuse_req'
Warning: fs/fuse/fuse_dev_i.h:115 struct member 'ring_queue' not described in 'fuse_req'

Binary build output is the same before and after these changes.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: drop kernel-doc notation for a comment

Use regular C comment syntax for a non-kernel-doc comment to avoid
a kernel-doc warning:

Warning: fs/fuse/dev_uring_i.h:104 This comment starts with '/**', but
isn't a kernel-doc comment.
* Describes if uring is for communication and holds alls the data needed

Binary build output is the same before and after this change.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: simplify fuse_dev_ioctl_clone()

Don't need to check if the new device file is already initialized, since
fuse_dev_install_with_pq() will do that anyway.

Make fuse_dev_install_with_pq() return a boolean value indicating success so
that fuse_dev_ioctl_clone() can return an error in case of failure.

Move aborting the connection (setting fc->connected to zero) to
fuse_dev_install(), because it is not needed when the clone ioctl fails.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: alloc pqueue before installing fch in fuse_dev

Prior to this patchset, fuse_dev (containing fuse_pqueue) was allocated on
mount. But now fuse_dev is allocated when opening /dev/fuse, even though
the queues are not needed at that time.

Delay allocation of the pqueue (4k worth of list_head) just before mounting
or cloning a device.

Various distributions (e.g. Debian/Fedora) configure /dev/fuse as world
writable, so the pqueue allocation should be deferred to a privileged
operation (mount) to prevent unprivileged userspace from consuming pinned
kernel memory.

[Li Wang: fix kernel NULL pointer dereference in fuse_uring_add_to_pq()]
[Fix race in fuse_dev_release()]

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove #include "fuse_i.h" from dev.c and dev_uring.c

Move a couple of function declarations from fuse_i.h to dev.h and
fuse_dev_i.h.

Add fuse_conn_get_id() helper that retrieves the connection ID (s_dev) from
fuse_conn.

With the exception of cuse.c, virtio_fs.c and trace.c, source files now
either include fuse_i.h or fuse_dev_i.h/dev_uring_i.h but not both.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: change ring->fc to ring->chan

Store pointer to struct fuse_chan instead of struct fuse_conn in fuse_ring.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove fuse_mutex protection from fuse_dev_ioctl_sync_init()

In normal use ioctl(FUSE_DEV_IOC_SYNC_INIT) comes before the mount() or
fsconfig() syscalls; they are executed strictly serially.

If ioctl and mount are performed in parallel, the behavior is
nondeterministic. Removing the mutex does not change this.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: set params in fuse_chan_set_initialized()

Set minor, max_write and max_pages in the fuse_chan. These match the same
fields in fuse_conn but are needed in both layers.

[Dongyang Jin: Pointers should use NULL instead of explicit '0']

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: create notify.c

Move FUSE_NOTIFY_* handling into a separate source file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: create poll.c

Move f_op->poll related functions to the new source file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: change fud->fc to fud->chan

Store pointer to struct fuse_chan instead of struct fuse_conn in fuse_dev.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: split out filesystem part of request sending

Create a new source file: req.c and add the request sending entry
functions:

  __fuse_simple_request()
  fuse_simple_background()
  fuse_simple_notify_reply()

Introduce transport layer sending functions that are called by the
respective fs layer function:

  fuse_chan_send()
  fuse_chan_send_bg()
  fuse_chan_send_notify_reply()

Move calculation of request header fields uid, gid and pid from
fuse_get_req() and fuse_force_creds() to a new helper: fuse_fill_creds().

These fields are now passed to the transport layer via struct fuse_args.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: change req->fm to req->chan

Store a struct fuse_chan pointer in fuse_req instead of a struct fuse_mount
pointer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove fm arg of args->end callback

Only used by FUSE_INIT and CUSE_INIT, these can store the relevant pointer
in their structs derived from fuse_args.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: split off fuse_args and related definitions into a separate header

This is going to be used by both layers (transport and filesystem)

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: abort related layering cleanup

- rename fuse_abort_conn() to fuse_chan_abort(), pass fuse_chan pointer
instead of fuse_conn

- pass an abort_with_err argument that tells fuse_dev_(read|write) to
return with ECONNABORTED instead of ENODEV

- move fc->aborted to fch->abort_with_err

- rename fuse_wait_aborted() to fuse_chan_wait_aborted()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove #include "fuse_i.h" from "req_timeout.c"

Just need to move fuse_abort_conn().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: remove #include "fuse_i.h" from "dev_uring_i.h"

Start getting rid of fs layer stuff from transport layer files.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move fuse_dev_waitq to dev.c

Move wake_up_all(&fuse_dev_waitq) into fuse_dev_install() where it
logically belongs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move forget related struct and helpers

Move:
- struct fuse_forget_link to fuse_dev_i.h
- fuse_alloc_forget() to dev.c/dev.h

Rename:
- fuse_queue_forget -> fuse_chan_queue_forget

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: don't access transport layer structs directly from the fs layer

Add helpers (get and set functions mainly) that cleanly separate the
layers.

Remove #include "fuse_dev_i.h" from:

- inode.c
- file.c
- control.c

Remove #include "dev_uring_i.h" from inode.c.

[Li Wang: drop redundant initializer in process_init_limits()]

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move struct fuse_req and related to fuse_dev_i.h

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move request timeout to fuse_chan

Move:

- timeout

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: add back pointer from fuse_chan to fuse_conn

Will be needed by callbacks from the transport layer to the fs layer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: split off fch->lock from fc->lock

And document which members they protect.

end_polls() is called with both held; the outer fch->lock is probably
unnecessary, but doesn't hurt for now.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move interrupt related members to fuse_chan

Move:

- no_interrupt

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move io_uring related members to fuse_chan

Move:

- io_uring
- ring

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move request blocking related members to fuse_chan

Move:

- initialized
- blocked
- blocked_waitq
- connected
- num_waiting

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move background queuing related members to fuse_chan

Move:

- max_background
- num_background
- active_background
- bg_queue
- bg_lock

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move 'devices' member from fuse_conn to fuse_chan

This belongs in the transport layer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move fuse_dev and fuse_pqueue to dev.c

Move function definitions to dev.c, struct definitions to fuse_dev_i.h.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move fuse_iqueue to fuse_chan

Move the 'fiq' member from fuse_conn to fuse_chan.

Move iqueue related structure definitions and function declarations from
"fuse_i.h" to "fuse_dev_i.h".

Add a fuse_dev_chan_new() helper, that returns a fuse_chan initialized with
the fuse_dev_fiq_ops.

Add a fuse_chan_release() function, that calls fiq->ops->release().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: add struct fuse_chan

The goal is to separate transport layer stuff out from struct fuse_conn,
leaving just the filesystem related members.

Add a new object referenced from fuse_conn. This patch just implements the
allocation and freeing of this object.

Following patches will move transport related members from fuse_conn to
fuse_chan.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: move request timeout code to a new source file

This marks the first step in cleanly separating the transport layer from
the filesystem layer.

Add "dev.h", which will contain the interface definition for the transport
layer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: fix io-uring background queue dispatch on request completion

When a background request completes via the io_uring path, the
background queue gets flushed to dispatch pending background requests,
but this is done before the connection-level background counters
(fc->num_background, fc->active_background) are properly accounted,
which may reduce effective queue depth to one.

The connection-level counters are decremented in fuse_request_end(), but
flush_bg_queue() flushes the /dev/fuse path queue (fc->bg_queue), not
the io_uring per-queue bg one, which means pending uring background
requests on the queue are never dispatched in this path.

Fix this by accounting the connection-level background counters first
before flushing the queue's background queue. Since
fuse_request_bg_finish() clears FR_BACKGROUND, fuse_request_end() will
skip the background cleanup branch entirely, which avoids any
double-decrements; it will call the wake_up(&req->waitq) branch but this
is effectively a no-op as background requests have no waiters on
req->waitq.

Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Fixes: 857b0263f30e ("fuse: Allow to queue bg requests through io-uring")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: fix device node leak in cuse_process_init_reply()

If device_add() succeeds during CUSE initialization but a subsequent
step (cdev_alloc() or cdev_add()) fails, the error path calls
put_device() without first calling device_del().  This leaks the
devtmpfs entry created by device_add(), leaving a stale /dev/<name>
node that persists until reboot.

Since the cuse_conn is never linked into cuse_conntbl on the failure
path, cuse_channel_release() sees cc->dev == NULL and skips
device_unregister(), so no other code path cleans up the node.

This has several consequences:

- The device name is permanently poisoned: any subsequent attempt to
   create a CUSE device with the same name hits the stale sysfs entry,
   device_add() fails, and the new device is aborted.

- The collision manifests as ENODEV returned to userspace with no
   dmesg diagnostic, making it very difficult to debug.

- The failure is self-perpetuating: once a name is leaked, all future
   attempts with that name fail identically.

Fix this by introducing an err_dev label that calls device_del() to
undo device_add() before falling through to err_unlock.  The existing
err_unlock path from a device_add() failure correctly skips device_del()
since the device was never added.
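
The label ordering described above can be sketched with plain counters.
Everything here is hypothetical (struct dev_sketch and the *_sketch
helpers only model "added" and "referenced" state); it illustrates the
goto-unwinding shape, not the CUSE code:

```c
#include <assert.h>

/* Hypothetical model of device registration state. */
struct dev_sketch {
	int added;	/* 1 while the device is registered */
	int refs;	/* reference count */
};

int add_sketch(struct dev_sketch *d, int fail)
{
	if (fail)
		return -1;
	d->added = 1;
	return 0;
}

void del_sketch(struct dev_sketch *d) { d->added = 0; }
void put_sketch(struct dev_sketch *d) { d->refs--; }

/* fail_step: 0 = success, 1 = registration fails, 2 = a later step fails */
int init_sketch(struct dev_sketch *d, int fail_step)
{
	int rc;

	assert(d);
	d->refs = 1;

	rc = add_sketch(d, fail_step == 1);
	if (rc)
		goto err_unlock;	/* nothing was added, skip the del */

	rc = (fail_step == 2) ? -1 : 0;
	if (rc)
		goto err_dev;		/* added, so it must be removed */

	return 0;

err_dev:
	del_sketch(d);	/* undo the add before dropping the reference */
err_unlock:
	put_sketch(d);
	return rc;
}
```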

Testing instructions can be found at the lore link below.

Link: https://lore.kernel.org/all/20260408-wip-cuse-leak-fix-v1-0-1c028d575e97@redhat.com/
Signed-off-by: Alberto Ruiz <aruiz@redhat.com>
Fixes: 151060ac1314 ("CUSE: implement CUSE - Character device in Userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: do not use start_removing_noperm()

Revert the fuse part of commit c9ba789dad15 ("VFS: introduce
start_creating_noperm() and start_removing_noperm()").

Commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and
start_removing_noperm()") caused a regression in FUSE_NOTIFY_INVAL_ENTRY,
which failed to invalidate negative dentries.

This manifests in the filesystem returning -ENOENT for operations on an
existing file.

Fixing it properly while still keeping the start_removing* infrastructure
would add much additional complexity.

Instead revert to the original simple implementation.

The start_removing* infrastructure is needed in VFS to abstract the
filesystem locking. However filesystem code can still safely use the raw
locking primitives without affecting other filesystems.

This is part two of the revert.

Reported-by: Артем Лабазов <123321artyom@gmail.com>
Closes: https://lore.kernel.org/all/CAFbF8N7++zopZuEcsKRxBV_sgOGCbzCY0hOyMw1SiGAtuzGhyQ@mail.gmail.com/
Fixes: c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()")
Cc: stable@vger.kernel.org # 6.19
Cc: NeilBrown <neilb@ownmail.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Revert "fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()"

This reverts commit cab012375122304a6343c1ed09404e5143b9dc01.

Commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and
start_removing_noperm()") caused a regression in FUSE_NOTIFY_INVAL_ENTRY,
which failed to invalidate negative dentries.

This manifests in the filesystem returning -ENOENT for operations on an
existing file.

Fixing it properly while still keeping the start_removing* infrastructure
would add much additional complexity.

Instead revert to the original simple implementation.

The start_removing* infrastructure is needed in VFS to abstract the
filesystem locking. However filesystem code can still safely use the raw
locking primitives without affecting other filesystems.

This is part one of the revert.

Reported-by: Артем Лабазов <123321artyom@gmail.com>
Closes: https://lore.kernel.org/all/CAFbF8N7++zopZuEcsKRxBV_sgOGCbzCY0hOyMw1SiGAtuzGhyQ@mail.gmail.com/
Fixes: c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()")
Cc: stable@vger.kernel.org # 6.19
Cc: NeilBrown <neilb@ownmail.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: avoid 32-bit prune notification count wrap

FUSE_NOTIFY_PRUNE validates the nodeid payload length with:

    size - sizeof(outarg) != outarg.count * sizeof(u64)

On 32-bit kernels, size_t is also 32 bits, so the daemon-controlled
count multiplication can wrap.  A prune notification with count
0x20000000 and no nodeid payload passes the check, enters the copy
loop, and asks the device copy path to read nodeids that are not
present in the userspace write buffer.  In QEMU this reaches the
fuse_copy_fill() BUG_ON(!err) path.

Validate the payload length with array_size() instead.  That accepts
exactly the same valid messages, but avoids wrapping arithmetic before
the copy loop consumes the count.
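
The wrap can be reproduced on any host by modelling a 32-bit size_t
explicitly. The sketch below is an illustration only: size32_t and all
helper names are hypothetical, and array_size32_sketch() merely imitates
the saturate-on-overflow behaviour of an array_size()-style helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model a 32-bit size_t so the wrap is visible on 64-bit hosts too. */
typedef uint32_t size32_t;

/* Wrapping check: count * 8 is reduced modulo 2^32 before the compare. */
bool len_ok_wrapping_sketch(size32_t payload_len, size32_t count)
{
	return payload_len == (size32_t)(count * (size32_t)sizeof(uint64_t));
}

/* Saturating multiply: overflow yields the maximum value, never a wrap. */
size32_t array_size32_sketch(size32_t a, size32_t b)
{
	if (b != 0 && a > UINT32_MAX / b)
		return UINT32_MAX;
	return a * b;
}

/* Checked variant: an overflowing count cannot match a real payload. */
bool len_ok_checked_sketch(size32_t payload_len, size32_t count)
{
	return payload_len ==
	       array_size32_sketch(count, (size32_t)sizeof(uint64_t));
}

/* Usage: count 0x20000000 with an empty payload, as in the report. */
void prune_len_demo_sketch(void)
{
	assert(len_ok_wrapping_sketch(0, 0x20000000u));	/* wrongly accepted */
	assert(!len_ok_checked_sketch(0, 0x20000000u));	/* rejected */
	assert(len_ok_checked_sketch(16, 2));		/* valid message */
}
```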

Assisted-by: Codex:gpt-5.5-cyber-preview
Fixes: 3f29d59e92a9 ("fuse: add prune notification")
Cc: stable@vger.kernel.org
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: remove request-less entries from ent_w_req_queue to fix NULL deref

If a copy into the userspace ring buffer fails, a request will be
terminated and fuse_uring_req_end() will set ent->fuse_req to NULL but
it will leave the entry on ent_w_req_queue in FRRS_FUSE_REQ state. This
can lead to a NULL deref if the request expiration logic scans
ent_w_req_queue in the window before the entry is moved off it.

Fix this by taking the entry off ent_w_req_queue and changing its state
from FRRS_FUSE_REQ to FRRS_INVALID before terminating the request.

Fixes: 4fea593e625c ("fuse: optimize over-io-uring request expiration check")
Cc: stable@kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: clear intr_entry in fuse_resend and fuse_remove_pending_req

When fuse_resend() moves a request from fpq->processing back to
fiq->pending, it sets FR_PENDING and clears FR_SENT but does not
remove the request's intr_entry from fiq->interrupts.  If the
request had FR_INTERRUPTED set from a prior signal, intr_entry
remains dangling on fiq->interrupts.  When the requesting task
then receives a fatal signal, fuse_remove_pending_req() sees
FR_PENDING=1, removes the request from fiq->pending and frees it
via the refcount path, also without cleaning intr_entry.  The
stale intr_entry causes use-after-free when fuse_read_interrupt()
iterates fiq->interrupts:
  - list_del_init(&req->intr_entry) -> UAF write on freed slab
  - req->in.h.unique -> UAF read, data leaked to userspace

Remove intr_entry from fiq->interrupts in fuse_resend() for
interrupted requests before they are placed back on fiq->pending.

Add a WARN_ON if the intr_entry is not empty on request destruction.

Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests")
Cc: stable@vger.kernel.org # 6.9
Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: make a fuse_req on SQE commit only findable after memcpy

Bad userspace might try to trick us by sending commit SQEs with the
request unique / commit-id of requests that have not even been sent to
the fuse-server yet (io_uring_cmd_done() not called).

fuse_uring_commit_fetch() ends the fuse request when the ring entry
has a wrong state, but that could have caused a use-after-free
with the memcpy operations in fuse_uring_send_in_task().
In order to avoid such races the call of fuse_uring_add_to_pq()
is moved after the copy operations and just before completing
the io-uring request - malicious userspace cannot find the request
anymore until all preparation work in fuse-client/kernel is completed.

This also moves fuse_uring_add_to_pq() a bit up in the code to
avoid a forward declaration. The move is done here rather than in a
separate preparation commit, to make it easier to backport to older
kernels.

Reported-by: xlabai <xlabai@tencent.com>
Reported-by: Berkant Koc <me@berkoc.com>
Fixes: c090c8abae4b6b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@kernel.org # 6.14
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: Avoid queue->stopped races and set/read that value under lock

There are several readers of queue->stopped that check the value
under lock, but fuse_uring_commit_fetch() did not and actually
the value was not set under the lock in fuse_uring_abort_end_requests()
either. Especially in fuse_uring_commit_fetch it is important
to check under a lock, because due to races 'struct fuse_req'
might be freed with fuse_request_end, but another thread/cpu
might already do teardown work.

Cc: stable@kernel.org # 6.14
Fixes: 4a9bfb9b6850fec ("fuse: {io-uring} Handle teardown of ring entries")
Reported-by: Berkant Koc <me@berkoc.com>
Reported-by: xlabai <xlabai@tencent.com>
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: Avoid use-after-free in fuse_uring_async_stop_queues

fuse_uring_async_stop_queues() might run when the last reference
on ring->queue_refs was already dropped.

In order to avoid an early destruction a reference on struct fuse_conn
is now taken before starting fuse_uring_async_stop_queues() and that
reference is only released when that delayed work queue terminates.

Fixes: 4a9bfb9b6850 ("fuse: {io-uring} Handle teardown of ring entries")
Cc: stable@kernel.org # 6.14
Reported-by: Berkant Koc <me@berkoc.com>
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: end fuse_req on io-uring cancel task work

When io_uring delivers task work with tw.cancel set (PF_EXITING,
PF_KTHREAD fallback, or percpu_ref_is_dying on the ring context),
fuse_uring_send_in_task() takes the cancel branch, assigns
-ECANCELED, and falls through to fuse_uring_send(). That path only
flips the entry to FRRS_USERSPACE and completes the io_uring cmd;
it never discharges the ring entry's owning reference to the
fuse_req that fuse_uring_add_req_to_ring_ent() handed it at
dispatch time.

    fuse_uring_send_in_task()
      tw.cancel == true
        err = -ECANCELED
      fuse_uring_send(ent, cmd, err, issue_flags)
        ent->state = FRRS_USERSPACE
        list_move(&ent->list, &queue->ent_in_userspace)
        ent->cmd = NULL
        io_uring_cmd_done(-ECANCELED)
        /* ent->fuse_req still set, req still hashed */

The fuse_req stays linked on fpq->processing[hash] and
fuse_request_end() is never invoked. The originating syscall
thread blocks in D-state in request_wait_answer() until
fuse_abort_conn() runs, which can be the entire connection
lifetime. For FR_BACKGROUND requests fc->num_background is never
decremented either, so repeated cancels inflate the counter until
max_background is hit and all later background ops stall. tw.cancel does
not imply a connection abort (e.g. a single io_uring worker thread exits
while the fuse connection stays up), so this cannot be left for
fuse_abort_conn() to clean up.

Ending the req but still routing the entry through fuse_uring_send()
is not enough: that leaves a req-less entry on ent_in_userspace, and
ent_list_request_expired() dereferences ent->fuse_req unconditionally
on the head of that list, which would then NULL-deref.

Fix the cancel branch to release the entry directly. Remove it from the
queue, complete the io_uring cmd, end the fuse_req, free the entry, and
drop its queue_refs (waking the teardown waiter if it was the last).

Fixes: c2c9af9a0b13 ("fuse: Allow to queue fg requests through io-uring")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: fix moving cancelled entry to ent_in_userspace list

fuse_uring_cancel() moves entries that are available (these have no reqs
attached) to the ent_in_userspace list. ent_list_request_expired()
checks the first entry on ent_in_userspace and dereferences
ent->fuse_req unconditionally, which will crash on a cancelled entry
that was moved to this list.

Fix this by freeing the entry and dropping queue_refs directly in
fuse_uring_cancel(). This is safe because cancel is the cancel handler
itself - after io_uring_cmd_done(), no more cancels will be dispatched
for this command, and teardown serializes with cancel via queue->lock.

Since cancel now decrements queue_refs, fuse_uring_abort() must no
longer gate fuse_uring_abort_end_requests() on queue_refs > 0, as
cancelled entries may have already dropped queue_refs while requests are
still queued. Remove the gate so abort always flushes requests and stops
queues.

Reported-by: Heechan Kang <gganji11@naver.com>
Tested-by: Heechan Kang <gganji11@naver.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Fixes: 4fea593e625c ("fuse: optimize over-io-uring request expiration check")
Cc: stable@vger.kernel.org
Suggested-by: Jian Huang Li <ali@ddn.com>
Suggested-by: Horst Birthelmer <horst@birthelmer.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: check connection abort during ring creation

Check fch->connected under fch->lock in fuse_uring_create() before
attaching a new ring. Without this, a race between fuse_uring_create()
and fuse_chan_abort() can result in the ring, queue, and fpq.processing
table being created after fuse_uring_abort() has already run, leading
to unnecessary allocation and teardown. These are eventually cleaned up
by fuse_uring_destruct() but will linger until the process exits, even
with the connection aborted.

Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: fix race between registration and connection abortion

This fixes this race:
- thread a: io_uring_enter -> register sqe ->
  fuse_uring_create_ring_ent -> allocate ent but doesn't grab queue_ref
  yet
- thread b: fuse_conn_destroy() -> fuse_chan_abort() ->
  fuse_uring_abort() is a no-op due to queue ref being 0
- thread a: grabs the queue_ref, queue_ref is now 1, rest of
  fuse_uring_do_register() logic executes
- thread b: fuse_chan_abort() returns, fuse_chan_wait_aborted() now runs
  and calls
  "wait_event(ring->stop_waitq, atomic_read(&ring->queue_refs) == 0);"
The abort/unmount thread will hang indefinitely in unkillable state as
nothing will decrement queue_refs or wake stop_waitq, and the ring,
queue, and ent are leaked.

Fix this by checking fch->connected under fch->lock after the created
ent has grabbed a ref count on the queue. This ensures that in the
scenario above, it is guaranteed that we either release the queue ref
and wake up stop_waitq (in case fuse_chan_wait_aborted() is already
waiting) in fuse_uring_do_register() when we detect !fch->connected, or
if the connection is aborted after the check, it is guaranteed that the
async teardown worker will be running in the background cleaning up ents
and decrementing the ent's ref on the queue, which will unblock the
eventual queue and ring teardown.
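
The ordering described above (take the queue ref first, then check the
connection state) can be sketched with a single-threaded model. This is
not the kernel implementation: struct model_chan, model_register() and
the -ENOTCONN return value are hypothetical, locking is only indicated
by a comment, and the wakeups counter stands in for waking
ring->stop_waitq.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_chan {
    bool connected;  /* stands in for fch->connected */
    int queue_refs;  /* stands in for ring->queue_refs */
    int wakeups;     /* counts simulated stop_waitq wakeups */
};

/* Returns 0 if the entry stays registered, -ENOTCONN if rolled back. */
static int model_register(struct model_chan *ch)
{
    assert(ch != NULL);

    ch->queue_refs++; /* the new entry takes its queue ref first */

    /* In the kernel this check would run under fch->lock. */
    if (!ch->connected) {
        /* Abort already ran: drop the ref and wake the waiter. */
        if (--ch->queue_refs == 0)
            ch->wakeups++;
        return -ENOTCONN;
    }
    return 0;
}
```

If the abort has already happened, the registering side undoes its own
reference, so a waiter on queue_refs == 0 is not left blocked. If the
abort happens later, the entry is visible to the teardown path because
its reference was taken before the check.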

Fixes: 24fe962c86f5 ("fuse: {io-uring} Handle SQEs - register commands")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: fix data races on ring->ready

On weakly-ordered architectures, the store to fiq->ops can be
reordered past the store to ring->ready, allowing a CPU that sees
ring->ready == true via fuse_uring_ready() to dispatch requests
through a stale fiq->ops pointer. Upgrade the store to
smp_store_release() and the load in fuse_uring_ready() to
smp_load_acquire() so that the preceding WRITE_ONCE(fiq->ops, ...)
is visible to any CPU that observes ring->ready == true.

Additionally, fuse_uring_do_register() publishes ring->ready with
WRITE_ONCE() but the fast-path check reads it with a plain load.
This is a marked-vs-unmarked access that KCSAN will flag. Wrap it in
READ_ONCE() to mark it without adding unnecessary ordering.

Also wrap the fc->ring load in fuse_uring_ready() in READ_ONCE() to
prevent the compiler from reloading it between the NULL check and the
dereference.
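
The release/acquire pairing can be illustrated with C11 atomics. This is
a userspace analogy under stated assumptions, not the kernel code:
struct model_ring, model_publish() and model_get_ops() are hypothetical,
and memory_order_release / memory_order_acquire stand in for
smp_store_release() / smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ops {
    int id;
};

struct model_ring {
    const struct model_ops *_Atomic ops; /* stands in for fiq->ops */
    atomic_bool ready;                   /* stands in for ring->ready */
};

static void model_ring_init(struct model_ring *ring)
{
    atomic_init(&ring->ops, NULL);
    atomic_init(&ring->ready, false);
}

static void model_publish(struct model_ring *ring,
                          const struct model_ops *ops)
{
    assert(ops != NULL);
    atomic_store_explicit(&ring->ops, ops, memory_order_relaxed);
    /* Release: the ops store above cannot move after this store. */
    atomic_store_explicit(&ring->ready, true, memory_order_release);
}

static const struct model_ops *model_get_ops(struct model_ring *ring)
{
    /* Acquire: pairs with the release store in model_publish(). */
    if (!atomic_load_explicit(&ring->ready, memory_order_acquire))
        return NULL;
    return atomic_load_explicit(&ring->ops, memory_order_relaxed);
}
```

A reader that observes ready == true through the acquire load is
guaranteed to also observe the ops pointer stored before the release
store.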

Fixes: c2c9af9a0b13 ("fuse: Allow to queue fg requests through io-uring")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse-uring: fix EFAULT clobber in fuse_uring_commit

copy_from_user() returns the number of bytes not copied as an unsigned
residual on failure (1..sizeof(struct fuse_out_header)). fuse_uring_commit
stores that residual in ssize_t err, sets req->out.h.error to -EFAULT,
then jumps to out: with err still holding the positive residual.

    err = copy_from_user(&req->out.h, &ent->headers->in_out,
                         sizeof(req->out.h));
    if (err) {
        req->out.h.error = -EFAULT;
        goto out;          /* err is the positive residual */
    }
    ...
    out:
        fuse_uring_req_end(ent, req, err);

fuse_uring_req_end() then runs

    if (error)
        req->out.h.error = error;

which overwrites the just-assigned -EFAULT with the positive residual.
FUSE callers such as fuse_simple_request() test err < 0 to detect
failure, so the positive value is interpreted as success and the
caller proceeds with an uninitialised or partial req->out.args.

Fix by assigning err = -EFAULT in the failure branch before jumping
to out, so fuse_uring_req_end() receives a negative errno and sets
req->out.h.error to -EFAULT.
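
The clobber can be modelled in a few lines of userspace C. This is a
simplified sketch, not the kernel code: struct model_req,
model_req_end() and model_commit() are hypothetical, and the residual
is passed in directly instead of coming from copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_req {
    int out_error; /* stands in for req->out.h.error */
};

/* Mirrors the quoted fuse_uring_req_end() fragment: non-zero error wins. */
static void model_req_end(struct model_req *req, long error)
{
    assert(req != NULL);
    if (error)
        req->out_error = (int)error;
}

/* residual: bytes a failed copy left uncopied (0 means success). */
static int model_commit(struct model_req *req, unsigned long residual,
                        int fixed)
{
    long err = (long)residual;

    req->out_error = 0;
    if (err) {
        req->out_error = -EFAULT;
        if (fixed)
            err = -EFAULT; /* the fix: pass a negative errno onward */
    }
    model_req_end(req, err);
    return req->out_error;
}
```

Without the fix, a 16-byte residual ends up as error value 16, which a
caller testing err < 0 treats as success. With the fix the caller sees
-EFAULT.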

Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

fuse: back uncached readdir buffers with pages

Commit dabb90391028 ("fuse: increase readdir buffer size") changed
fuse_readdir_uncached() to size its temporary buffer from ctx->count.
This is useful for overlayfs and other in-kernel callers that use
INT_MAX to indicate an unlimited directory read.

The larger buffer is currently supplied as a kvec output argument. For
virtiofs, kvec arguments are copied through req->argbuf, which is
allocated with kmalloc(..., GFP_ATOMIC). A large uncached readdir buffer
can therefore require a multi-megabyte contiguous atomic allocation
before the request is queued.

Avoid the large bounce-buffer allocation by backing uncached readdir
output with pages and setting out_pages. Transports such as virtiofs can
then pass the pages as scatter-gather entries instead of copying the
output through argbuf.

Map the pages with vm_map_ram() only while parsing the returned dirents.
The existing parser can then continue to use a linear kernel mapping.

[SzM: separate allocation of pages into a helper function]

Fixes: dabb90391028 ("fuse: increase readdir buffer size")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

virtiofs: fix UAF on submount umount

iput() called from fuse_release_end() can Oops if the super block has
already been destroyed. Normally this is prevented by waiting for
num_waiting to go down to zero before commencing with super block shutdown.

This only works, however, for the last submount instance, as the wait
counter is per connection, not per superblock.

Revert to using synchronous release requests for the auto_submounts case,
which is virtiofs only at this time.

Reported-by: Aurélien Bombo <abombo@microsoft.com>
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: Greg Kurz <gkurz@redhat.com>
Closes: https://github.com/kata-containers/kata-containers/issues/12589
Fixes: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers")
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kurz <gkurz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Merge tag 'memory-controller-drv-7.2-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into soc/drivers

Memory controller drivers for v7.2, part two

A few improvements for Tegra Memory Controller drivers, including one
fix for UBSAN report for an older commit.

* tag 'memory-controller-drv-7.2-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl:
  memory: tegra234: drop dead NULL check in tegra234_mc_icc_aggregate()
  memory: tegra264: drop redundant tegra264_mc_icc_aggregate()
  memory: tegra186-emc: stop borrowing MC aggregate hook for EMC

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Revert "firmware: zynqmp: Add dynamic CSU register discovery and sysfs interface"

This reverts commit 47d7bca76dd4f36ba0525d761f247c76ec9e4b17, which was
merged by accident.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Revert "Documentation: ABI: add sysfs interface for ZynqMP CSU registers"

This reverts commit 8ebebccf1579f6ce92bde3ddbb13df12c080f647, which was
merged by accident.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

mm/slab: introduce kmalloc_flags()

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags and use it in places that perform
these potentially recursive kmalloc allocations (of sheaves or obj_ext
arrays).

Add this function, named kmalloc_flags(). Right now it's only useful for
these nested allocations, so it doesn't need to optimize build-time
constant sizes like kmalloc() or kmalloc_buckets.

Since we need it to support both normal and non-spinning
kmalloc_nolock() context through the SLAB_ALLOC_NOLOCK flag, split out
most of the special _kmalloc_nolock_noprof() implementation to
__kmalloc_nolock_noprof() that takes a slab_alloc_context, and make
_kmalloc_nolock_noprof() a simple tail calling wrapper with the proper
context.

kmalloc_flags() can thus determine whether to call
__kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the
given alloc_flags.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-14-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

ALSA: hda/realtek: Add quirk for Lenovo Xiaoxin 14 GT

The Lenovo Xiaoxin 14 GT (Chinese market model, AMD Ryzen AI 9 365)
produces constant electrical hissing and crackling noise from both
internal speakers and 3.5mm headphone jack during audio playback.
Audio works correctly on Windows.

The PCI SSID 17aa:3912 is not present in the quirk list. The device
shares the same AMD platform and ALC287 codec as neighboring Lenovo
14" AMD models (17aa:3911, 17aa:390d), so apply the same fixup.

Note: the fixup selection is based on similarity with neighboring
models and has not been verified by testing a compiled kernel.
Guidance from maintainers on the correct fixup is welcome.

Signed-off-by: Viktor Menshin <ripeeerr@gmail.com>
Link: https://patch.msgid.link/20260615092515.1082-1-ripeeerr@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()

The two flags are added internally, so there's no point in warning if
they are passed by the caller as well; allow them. This will allow
simplifying obj_ext allocation under kmalloc_nolock().

Also it's not necessary to have the extra alloc_gfp variable for adding
the two flags. The original gfp_flags parameter is not used anywhere
except for the warning. So remove alloc_gfp and directly modify and use
gfp_flags everywhere.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-13-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: pass slab_alloc_context to __do_kmalloc_node()

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags and use it in places that perform
these potentially recursive kmalloc allocations (of sheaves or obj_ext
arrays).

As a preparatory step, make __do_kmalloc_node() take a pointer to
slab_alloc_context. This replaces the 'size' and 'caller' parameters and
includes alloc_flags which we'll make use of.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-12-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags

The last user of gfpflags_allow_spinning() in slab is
alloc_from_pcs_bulk(), which is only called from
kmem_cache_alloc_bulk().

It turns out that gfpflags_allow_spinning() is not necessary, because
kmem_cache_alloc_bulk() is only expected to be called from context that
does allow spinning, so simply replace it with 'true'. This means we can
also drop the gfp parameter from alloc_from_pcs_bulk().

With that, we can remove the "@flags must allow spinning" part of the
kernel doc, as there is no more connection to the gfp flags in the slab
implementation.

Also remove a comment in alloc_slab_obj_exts() because there should be
no more false positives possible due to gfp_allowed_mask during early
boot.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-11-7190909db118@kernel.org
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: replace slab_alloc_node() parameters with slab_alloc_context

The function takes all the parameters that exist as fields in
slab_alloc_context, except alloc_flags. Replace them with a single
pointer.

This moves slab_alloc_context initialization to a number of callers,
which is more verbose, but arguably also more clear than a long list of
parameters, and most do not use the 'lru' field.

This will also allow kmalloc_nolock() to call slab_alloc_node() and
reduce the special open-coding it currently has.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-10-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: pass alloc_flags through slab_post_alloc_hook() chain

Convert the whole following call stack to pass either slab_alloc_context
(thus including alloc_flags) or just alloc_flags as necessary:

slab_post_alloc_hook()
  alloc_tagging_slab_alloc_hook()
    __alloc_tagging_slab_alloc_hook()
      prepare_slab_obj_exts_hook()
        alloc_slab_obj_exts()
  memcg_slab_post_alloc_hook()
    __memcg_slab_post_alloc_hook()
      alloc_slab_obj_exts()

Converting all these at once avoids unnecessary churn and is mostly
mechanical.

This ultimately allows deciding whether spinning is allowed using
alloc_flags in alloc_slab_obj_exts(), as well as slab_post_alloc_hook().
Aside from alloc_from_pcs_bulk() (to be handled next) there is nothing
else in slab itself relying on gfpflags_allow_spinning() which can
be false even if not called from kmalloc_nolock().

A followup change will also use the alloc_flags availability in the call
stack above to remove the __GFP_NO_OBJ_EXT flag.

For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab"
parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.

To further reduce the number of parameters of slab_post_alloc_hook(),
also make 'struct list_lru *lru' (which is NULL for most callers) a new
field of slab_alloc_context.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-9-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: pass alloc_flags to new slab allocation

Add the alloc_flags parameter to allocate_slab() and new_slab()
so it can be used to determine if spinning is allowed, independently
from gfp flags.

refill_objects() passes SLAB_ALLOC_DEFAULT because it can only be
reached from contexts that allow spinning.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-8-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: add alloc_flags to slab_alloc_context

Add alloc_flags as a new field to the slab_alloc_context helper struct,
so we can pass it to more functions in the slab implementation without
adding another function parameter.

Start checking them via alloc_flags_allow_spinning() in
alloc_single_from_new_slab() (where we can drop the allow_spin
parameter), ___slab_alloc(), get_from_partial_node() and
get_from_any_partial(). This further reduces false-positive
spinning-not-allowed from allocations that are not kmalloc_nolock() but
lack __GFP_RECLAIM flags.

_kmalloc_nolock_noprof() initializes ac.alloc_flags using its flags that
are SLAB_ALLOC_NOLOCK. slab_alloc_node() and __kmem_cache_alloc_bulk()
are not reachable from kmalloc_nolock() and all their callers expect
spinning to be allowed, so they can use SLAB_ALLOC_DEFAULT. This is
temporary as the scope of slab_alloc_context will further move to the
callers, making the alloc_flags usage more obvious.

Also change how trynode_flags are constructed in ___slab_alloc() to
achieve the same "do not upgrade to GFP_NOWAIT" by using masking instead
of checking allow_spin. We need to do that because we now determine
allow_spin from alloc_flags, and would otherwise start to upgrade e.g.
kmalloc() allocations without __GFP_KSWAPD_RECLAIM (that however do
allow spinning) to GFP_NOWAIT, thus including __GFP_KSWAPD_RECLAIM.

During the masking, also keep the existing __GFP_NOMEMALLOC (pointed out
by Sashiko) and __GFP_ACCOUNT. Previously the hardcoded GFP_NOWAIT would
eliminate them, but it's not a big problem that would need a separate
fix.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-6-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: replace struct partial_context with slab_alloc_context

Refactor get_from_partial_node(), get_from_any_partial(),
get_from_partial() and ___slab_alloc().

Remove struct partial_context, which used to be more substantial but
shrank as part of the sheaves conversion. Instead pass gfp_flags and
pointer to the new slab_alloc_context, which together is a superset of
partial_context, and alloc_flags are about to be added to
slab_alloc_context as well.

No functional change intended.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-7-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: introduce alloc_flags and SLAB_ALLOC_NOLOCK

Similarly to the page allocators, introduce slab-allocator specific
alloc flags that internally control allocation behavior in addition to
gfp_flags, without occupying the limited gfp flags space.

Introduce the first flag SLAB_ALLOC_NOLOCK that behaves similarly to
page allocator's ALLOC_TRYLOCK and will be used to reimplement
kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
e.g. in early boot with a restricted gfp_allowed_mask.

Also introduce alloc_flags_allow_spinning() to replace the usage of
gfpflags_allow_spinning().

Start using alloc_flags and the new check first in alloc_from_pcs() and
__pcs_replace_empty_main(). This means some slab allocations that were
falsely treated as kmalloc_nolock() due to their gfp flags will now have
higher chances of success, and this will further increase with followup
changes.

Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
reach it from a slab allocation that's not _nolock() and yet lacks
__GFP_KSWAPD_RECLAIM for other reasons.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-5-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

mm/slab: introduce slab_alloc_context

Similarly to page allocator's struct alloc_context, introduce a helper
struct to hold a part of the allocation arguments. This will allow
reducing the number of parameters in many functions of the
implementation, and extend them easily if needed.

For now, make it hold the caller address and the originally requested
allocation size.

Convert alloc_single_from_new_slab(), __slab_alloc_node() and
___slab_alloc(). No functional change intended.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-4-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

exfat: bound uniname advance in exfat_find_dir_entry()

In exfat_find_dir_entry(), each TYPE_EXTEND (file name) entry advances the
output pointer by a fixed amount while the loop guard only tracks the
accumulated name length:

    if (++order == 2)
        uniname = p_uniname->name;
    else
        uniname += EXFAT_FILE_NAME_LEN;
    len = exfat_extract_uni_name(ep, entry_uniname);
    name_len += len;
    unichar = *(uniname+len);
    *(uniname+len) = 0x0;

uniname grows by EXFAT_FILE_NAME_LEN (15) per name entry, but name_len
grows only by the actual extracted length, which is shorter when a name
fragment contains an early NUL. The only guard is
`name_len >= MAX_NAME_LENGTH`, so a crafted directory with many short
name fragments lets uniname run far past the
p_uniname->name[MAX_NAME_LENGTH + 3] buffer while name_len stays small,
causing an out-of-bounds read and write at *(uniname+len).

The sibling extractor exfat_get_uniname_from_ext_entry() already stops
on a short fragment (the lockstep `len != EXFAT_FILE_NAME_LEN` guard
added in commit d42334578eba ("exfat: check if filename entries exceeds
max filename length")); exfat_find_dir_entry() never got the
equivalent. Track the per-entry write offset as a count and reject a
fragment once the offset, or the offset plus the extracted length, would
exceed MAX_NAME_LENGTH, before forming the output pointer.
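
The offset-based bound can be sketched as follows. This is an
illustrative model rather than the patch itself: model_put_fragment() is
a hypothetical helper, MODEL_MAX_NAME assumes MAX_NAME_LENGTH is 255,
and the caller is assumed to supply a buffer of MODEL_MAX_NAME + 3
elements, as in p_uniname->name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_FRAG_LEN 15  /* EXFAT_FILE_NAME_LEN per the text above */
#define MODEL_MAX_NAME 255 /* assumed value of MAX_NAME_LENGTH */

/*
 * Copy name fragment number frag_idx (0-based) of length len into out.
 * Returns 0 on success, -1 if the write would pass MODEL_MAX_NAME.
 */
static int model_put_fragment(unsigned short *out, size_t frag_idx,
                              const unsigned short *frag, size_t len)
{
    size_t off = frag_idx * MODEL_FRAG_LEN;

    assert(len <= MODEL_FRAG_LEN);

    /* Check the offset before forming the pointer out + off. */
    if (off > MODEL_MAX_NAME || off + len > MODEL_MAX_NAME)
        return -1;

    memcpy(out + off, frag, len * sizeof(*out));
    out[off + len] = 0; /* terminator stays within the buffer */
    return 0;
}
```

Because the bound is applied to the write offset and not to the
accumulated name length, many short fragments can no longer walk the
output position past the end of the buffer.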

Fixes: ca06197382bd ("exfat: add directory operations")
Cc: stable@vger.kernel.org
Suggested-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add swap_activate support

Commit 07d67f3e9083 ("exfat: add iomap buffered I/O support")
converted exfat buffered I/O to iomap, but did not add a
.swap_activate handler to the address_space_operations.

swapon(2) on an exfat swapfile then fails with EINVAL, which causes
LTP swap tests to fail.

Add exfat_iomap_swap_activate() and hook it into exfat_aops so exfat
uses iomap_swapfile_activate() for swapfile activation.

Fixes: 614f71ca1bdf ("exfat: add iomap buffered I/O support")
Closes: https://lore.kernel.org/all/20260603110212.3020276-1-japo@linux.ibm.com/
Signed-off-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: preserve benign secondary entries during rename and move

Commit 8258ef28001a ("exfat: handle unreconized benign secondary
entries") added cluster freeing for benign secondary entries inside
exfat_remove_entries().  However, exfat_remove_entries() is also called
from the rename and move paths (exfat_rename_file and exfat_move_file),
where the old entry set is being relocated rather than deleted.  This
causes benign secondary entries such as vendor extension entries to be
silently destroyed on rename or cross-directory move, violating the
exFAT spec requirement (section 8.2) that implementations preserve
unrecognized benign secondary entries.

Fix this by adding a free_benign parameter to exfat_remove_entries()
so callers can suppress cluster freeing during relocation, and
extending exfat_init_ext_entry() to copy trailing benign secondary
entries from the old entry set into the new one internally.  Also
clean up the error paths to delete newly allocated entries on failure.

Fixes: 8258ef28001a ("exfat: handle unreconized benign secondary entries")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/CAG7tbBV--waov7XVu2FHQEc6paR92dufS=em9DW5Kzsrpu3iQg@mail.gmail.com/
Signed-off-by: Rochan Avlur <rochan.avlur@gmail.com>
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: serialize truncate against in-flight DIO

exfat_setattr() did not call inode_dio_wait() before performing a size
change, leaving a window where a concurrent in-flight DIO write could be
operating on clusters that the truncate is about to free.

Add inode_dio_wait() before the truncate_setsize()/exfat_truncate()
sequence so that any in-flight DIO completes before cluster freeing
begins.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add support for SEEK_HOLE and SEEK_DATA in llseek

Add exfat_file_llseek(), which implements these whence values via
the iomap layer (iomap_seek_hole() and iomap_seek_data()) using the
existing exfat_read_iomap_ops.

Unlike many other modern filesystems, exFAT does not support sparse files
with unallocated clusters (holes). In exFAT, clusters are always fully
allocated once they are written or preallocated. In addition, exFAT
maintains a separate "Valid Data Length" (valid_size) that is distinct
from the logical file size. This affects how holes are reported during
seeking. The iomap seek functions rely on the mapping type returned by
exfat_iomap_begin() to report SEEK_HOLE and SEEK_DATA positions:

- Ranges with offset >= ei->valid_size are mapped as IOMAP_UNWRITTEN.
- Ranges with offset < ei->valid_size are mapped as IOMAP_MAPPED.
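
As a rough illustration of how this mapping translates into seek
results, here is a userspace model. It is only a sketch under stated
assumptions: the toy_* names are hypothetical, valid_size is assumed to
be <= size, and the range at or beyond valid_size is treated as
unwritten (following the prose description above) and always reported
as a hole. The real iomap code also inspects the page cache for
unwritten ranges, which this model ignores.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-ins for IOMAP_MAPPED / IOMAP_UNWRITTEN in this model. */
enum toy_extent_type { TOY_MAPPED, TOY_UNWRITTEN };

struct toy_inode {
	int64_t size;       /* logical file size */
	int64_t valid_size; /* valid data length, assumed <= size */
};

enum toy_extent_type toy_extent_type(const struct toy_inode *ino, int64_t off)
{
	return off >= ino->valid_size ? TOY_UNWRITTEN : TOY_MAPPED;
}

/* Model of SEEK_DATA: first data offset at or after `off`. */
int64_t toy_seek_data(const struct toy_inode *ino, int64_t off)
{
	if (off < 0 || off >= ino->size)
		return -ENXIO;
	if (toy_extent_type(ino, off) == TOY_MAPPED)
		return off;
	return -ENXIO; /* only unwritten range remains up to EOF */
}

/* Model of SEEK_HOLE: first hole offset at or after `off`. */
int64_t toy_seek_hole(const struct toy_inode *ino, int64_t off)
{
	if (off < 0 || off >= ino->size)
		return -ENXIO;
	if (toy_extent_type(ino, off) == TOY_UNWRITTEN)
		return off;
	/* Data runs up to valid_size; EOF acts as the implicit hole. */
	return ino->valid_size < ino->size ? ino->valid_size : ino->size;
}
```

In this model a 16 KiB file with 4 KiB of valid data reports its first
hole at offset 4096, and a file whose valid_size equals its size reports
only the implicit hole at EOF.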

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add iomap direct I/O support

Add iomap-based direct I/O support to the exfat filesystem. This replaces
the previous exfat_direct_IO() implementation, which used
blockdev_direct_IO(), with the iomap_dio_rw() interface.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add iomap buffered I/O support

Add full buffered I/O support using the iomap framework to the exfat
filesystem. This replaces the old exfat_get_block(),
exfat_write_begin(), exfat_write_end(), and exfat_block_truncate_page()
with their iomap equivalents. Buffered writes now use
iomap_file_buffered_write(), reads use iomap_bio_read_folio() and
iomap_bio_readahead(), and writeback is handled through iomap_writepages().

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: fix implicit declaration of brelse()

exfat_cluster_walk() calls brelse(bh) without including the header that
declares the function, causing the following build error:

fs/exfat/exfat_fs.h:542:9: error: implicit declaration of function ‘brelse’ [-Werror=implicit-function-declaration]

Fix this by adding the missing buffer_head.h in exfat_fs.h.

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add data_start_bytes and exfat_cluster_to_phys_bytes() helper

This caches the data area start offset in bytes (data_start_bytes)
and introduces a helper function exfat_cluster_to_phys_bytes() to compute
the physical byte position of a given cluster.
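
For illustration, the computation such a helper performs can be sketched
in userspace as follows. The struct layout, the cluster_size_bits field
and the toy_* names are assumptions for this example, not the actual
exfat definitions. The only format fact relied on is that the first
cluster of the exFAT cluster heap has index 2.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RESERVED_CLUSTERS 2u /* cluster heap indices start at 2 */

/* Hypothetical subset of the superblock info used by this sketch. */
struct toy_sb_info {
	uint64_t data_start_bytes;      /* cached byte offset of the data area */
	unsigned int cluster_size_bits; /* log2 of the cluster size in bytes */
};

/* Physical byte position of cluster `clu` on the device. */
uint64_t toy_cluster_to_phys_bytes(const struct toy_sb_info *sbi, uint32_t clu)
{
	assert(clu >= TOY_RESERVED_CLUSTERS);
	/* Widen before shifting so large volumes do not overflow 32 bits. */
	return sbi->data_start_bytes +
	       ((uint64_t)(clu - TOY_RESERVED_CLUSTERS) << sbi->cluster_size_bits);
}
```

Caching the start offset in bytes means callers that work in byte
units, such as an iomap mapping routine, do not have to convert from
sectors on every lookup.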

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add support for multi-cluster allocation

Currently exfat_map_cluster() allocates and returns only one cluster
at a time even when more clusters are needed. This causes multiple
FAT walks and repeated allocation calls during large sequential writes
or when using iomap for writes. Change exfat_map_cluster() and
exfat_alloc_cluster() so that they can allocate multiple contiguous
clusters in one call.
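
The idea of handing back a contiguous run in one call, rather than one
cluster per call, can be sketched with a toy allocator. This is not the
exfat allocator: toy_alloc_contig() and its boolean "used" array are
hypothetical stand-ins for the allocation bitmap, chosen only to show
the shape of the interface.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Claim up to `want` contiguous free slots beginning at `start`.
 * Returns how many slots were claimed, which may be fewer than
 * `want` if a used slot or the end of the array is reached first.
 */
size_t toy_alloc_contig(bool *used, size_t nslots, size_t start, size_t want)
{
	size_t n = 0;

	while (n < want && start + n < nslots && !used[start + n]) {
		used[start + n] = true;
		n++;
	}
	return n;
}
```

A caller that needs four clusters makes one call and learns how long
the contiguous run actually is, instead of looping four times.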

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add exfat_file_open()

Add exfat_file_open() to handle the file open operation for exFAT.
This is a preparatory step before introducing iomap-based direct I/O
support.

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

exfat: add balloc parameter to exfat_map_cluster() for iomap support

In preparation for supporting the iomap infrastructure, we need to know
whether a new cluster was allocated or not in exfat_map_cluster().

Add an optional 'bool *balloc' output parameter. When a new cluster is
allocated, *balloc is set to true. Pass NULL from exfat_get_block() to
preserve the existing behavior.
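
The optional output parameter pattern looks roughly like this in a
userspace model. The toy_file structure and toy_map_cluster() are
hypothetical, and clearing *balloc to false at entry is a choice made
for this sketch. The message above only states that it is set to true
when a cluster is allocated.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SLOTS 8

/* Hypothetical file model: 0 means "no cluster mapped at this index". */
struct toy_file {
	unsigned int clusters[TOY_NR_SLOTS];
	unsigned int next_free; /* next cluster number to hand out */
};

/*
 * Map index `idx` to a cluster, allocating one if needed.
 * `balloc` may be NULL when the caller does not care whether a new
 * cluster was allocated.
 */
int toy_map_cluster(struct toy_file *f, unsigned int idx,
		    unsigned int *clu, bool *balloc)
{
	if (idx >= TOY_NR_SLOTS)
		return -1;
	if (balloc)
		*balloc = false;
	if (f->clusters[idx] == 0) {
		f->clusters[idx] = f->next_free++;
		if (balloc)
			*balloc = true;
	}
	*clu = f->clusters[idx];
	return 0;
}
```

Existing callers pass NULL and behave as before, while a new caller can
pass a bool to learn whether the mapping is freshly allocated.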

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>