Dmitry Antipov [Wed, 21 Feb 2024 11:22:05 +0000 (14:22 +0300)]
eventpoll: prefer kfree_rcu() in __ep_remove()
In '__ep_remove()', prefer 'kfree_rcu()' over 'call_rcu()' with
dummy 'epi_rcu_free()' callback. This follows commit d0089603fa7a
("fs: prefer kfree_rcu() in fasync_remove_entry()") and should not
be backported to stable as well.
Merge series 'Use Maple Trees for simple_offset utilities' of https://lore.kernel.org/r/170820083431.6328.16233178852085891453.stgit@91.116.238.104.host.secureserver.net
Pull simple offset series from Chuck Lever
In an effort to address slab fragmentation issues reported a few
months ago, I've replaced the use of xarrays for the directory
offset map in "simple" file systems (including tmpfs).
Thanks to Liam Howlett for helping me get this working with Maple
Trees.
* series 'Use Maple Trees for simple_offset utilities' of https://lore.kernel.org/r/170820083431.6328.16233178852085891453.stgit@91.116.238.104.host.secureserver.net: (6 commits)
libfs: Convert simple directory offsets to use a Maple Tree
test_maple_tree: testing the cyclic allocation
maple_tree: Add mtree_alloc_cyclic()
libfs: Add simple_offset_empty()
libfs: Define a minimum directory offset
libfs: Re-arrange locking in offset_iterate_dir()
Signed-off-by: Christian Brauner <brauner@kernel.org>
Chuck Lever [Sat, 17 Feb 2024 20:24:16 +0000 (15:24 -0500)]
libfs: Convert simple directory offsets to use a Maple Tree
Test robot reports:
> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>
> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
Feng Tang further clarifies that:
> ... the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
Willy's analysis is that, over time, the test workload causes
xa_alloc_cyclic() to fragment the underlying SLAB cache.
This patch replaces the offset_ctx's xarray with a Maple Tree in the
hope that Maple Tree's dense node mode will handle this scenario
more scalably.
In addition, we can widen the simple directory offset maximum to
signed long (as loff_t is also signed).
Suggested-by: Matthew Wilcox <willy@infradead.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
Chuck Lever [Sat, 17 Feb 2024 20:23:40 +0000 (15:23 -0500)]
libfs: Re-arrange locking in offset_iterate_dir()
Liam and Matthew say that once the RCU read lock is released,
xa_state is not safe to re-use for the next xas_find() call. But the
RCU read lock must be released on each loop iteration so that
dput(), which might_sleep(), can be called safely.
Thus we are forced to walk the offset tree with fresh state for each
directory entry. xa_find() can do this for us, though it might be a
little less efficient than maintaining xa_state locally.
We believe that in the current code base, inode->i_rwsem provides
protection for the xa_state maintained in
offset_iterate_dir(). However, there is no guarantee that will
continue to be the case in the future.
Since offset_iterate_dir() doesn't build xa_state locally any more,
there's no longer a strong need for offset_find_next(). Clean up by
rolling these two helpers together.
Arnd Bergmann [Fri, 16 Feb 2024 20:23:34 +0000 (21:23 +0100)]
fs/select: rework stack allocation hack for clang
A while ago, we changed the way that select() and poll() preallocate
a temporary buffer just under the size of the static warning limit of
1024 bytes, as clang was frequently going slightly above that limit.
The warnings have recently returned and I took another look. As it turns
out, clang is not actually inherently worse at reserving stack space,
it just happens to inline do_select() into core_sys_select(), while gcc
never inlines it.
Annotate do_select() to never be inlined and in turn remove the special
case for the allocation size. This should give the same behavior for
both clang and gcc all the time and once more avoids those warnings.
Fixes: ad312f95d41c ("fs/select: avoid clang stack usage warning") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20240216202352.2492798-1-arnd@kernel.org Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
Dmitry Antipov [Fri, 9 Feb 2024 12:52:19 +0000 (15:52 +0300)]
fs: prefer kfree_rcu() in fasync_remove_entry()
In 'fasync_remove_entry()', prefer 'kfree_rcu()' over 'call_rcu()' with dummy
'fasync_free_rcu()' callback. This is mostly intended in attempt to fix weird
https://syzkaller.appspot.com/bug?id=6a64ad907e361e49e92d1c4c114128a1bda2ed7f,
where kmemleak may consider 'fa' as unreferenced during RCU grace period. See
https://lore.kernel.org/stable/20230930174657.800551-1-joel@joelfernandes.org
as well. Comments are highly appreciated.
Ever since ae65a5211d90 ("mm/slab: document kfree() as allowed for
kmem_cache_alloc() objects") kfree() can be used for both kmalloc() and
kmem_cache_alloc() so this is no safe.
Taylor Jackson [Thu, 8 Feb 2024 03:02:54 +0000 (03:02 +0000)]
fs/mnt_idmapping.c: Return -EINVAL when no map is written
Currently, it is possible to create an idmapped mount using a user
namespace without any mappings. However, this yields an idmapped
mount that doesn't actually map the ids. With the following change,
it will no longer be possible to create an idmapped mount when using
a user namespace with no mappings, and will instead return EINVAL,
an “invalid argument” error code.
fd = eventfd(0, 0);
size = write(fd, &str, strlen(str));
printf("eventfd: test writing a string, size=%ld\n", size);
size = read(fd, &value, sizeof(value));
printf("eventfd: test reading as uint64, size=%ld, valus=0x%lX\n",
size, value);
close(fd);
And its output is:
eventfd: test writing a string, size=8
eventfd: test reading as uint64, size=8, valus=0x6F77206F6C6C6568
By checking whether count is equal to sizeof(ucnt), such errors
could be detected. It also follows the requirements of the manual.
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Link: https://lore.kernel.org/r/tencent_10AAA44731FFFA493F9F5501521F07DD4D0A@qq.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
Amir Goldstein [Fri, 2 Feb 2024 11:01:32 +0000 (13:01 +0200)]
fs: remove the inode argument to ->d_real() method
The only remaining user of ->d_real() method is d_real_inode(), which
passed NULL inode argument to get the real data dentry.
There are no longer any users that call ->d_real() with a non-NULL
inode argument for getting a detry from a specific underlying layer.
Remove the inode argument of the method and replace it with an integer
'type' argument, to allow callers to request the real metadata dentry
instead of the real data dentry.
All the current users of d_real_inode() (e.g. uprobe) continue to get
the real data inode. Caller that need to get the real metadata inode
(e.g. IMA/EVM) can use d_inode(d_real(dentry, D_REAL_METADATA)).
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20240202110132.1584111-3-amir73il@gmail.com Tested-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <brauner@kernel.org>
Amir Goldstein [Fri, 2 Feb 2024 11:01:31 +0000 (13:01 +0200)]
fs: make file_dentry() a simple accessor
file_dentry() is a relic from the days that overlayfs was using files with
a "fake" path, meaning, f_path on overlayfs and f_inode on underlying fs.
In those days, file_dentry() was needed to get the underlying fs dentry
that matches f_inode.
Files with "fake" path should not exist nowadays, so make file_dentry() a
simple accessor and use an assertion to make sure that file_dentry() was
not papering over filesystem bugs.
Huang Xiaojia [Tue, 6 Feb 2024 01:43:53 +0000 (09:43 +0800)]
epoll: Remove ep_scan_ready_list() in comments
Since commit 443f1a042233 ("lift the calls of ep_send_events_proc()
into the callers"), ep_scan_ready_list() has been removed.
But there are still several in comments. All of them should
be replaced with other caller functions.
JonasZhou [Fri, 2 Feb 2024 08:33:04 +0000 (16:33 +0800)]
fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.
In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.
While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.
For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.
Kunwu Chan [Wed, 31 Jan 2024 07:09:41 +0000 (15:09 +0800)]
fs: Use KMEM_CACHE instead of kmem_cache_create
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation")
introduces a new macro.
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Kees Cook [Mon, 29 Jan 2024 18:40:15 +0000 (10:40 -0800)]
select: Avoid wrap-around instrumentation in do_sys_poll()
The mix of int, unsigned int, and unsigned long used by struct
poll_list::len, todo, len, and j meant that the signed overflow
sanitizer got worried it needed to instrument several places where
arithmetic happens between these variables. Since all of the variables
are always positive and bounded by unsigned int, use a single type in
all places. Additionally expand the zero-test into an explicit range
check before updating "todo".
This keeps sanitizer instrumentation[1] out of a UACCESS path:
vmlinux.o: warning: objtool: do_sys_poll+0x285: call to __ubsan_handle_sub_overflow() with UACCESS enabled
Kees Cook [Mon, 29 Jan 2024 18:37:29 +0000 (10:37 -0800)]
iov_iter: Avoid wrap-around instrumentation in copy_compat_iovec_from_user()
The loop counter "i" in copy_compat_iovec_from_user() is an int, but
because the nr_segs argument is unsigned long, the signed overflow
sanitizer got worried "i" could wrap around. Instead of making "i" an
unsigned long (which may enlarge the type size), switch both nr_segs
and i to u32. There is no truncation with nr_segs since it is never
larger than UIO_MAXIOV anyway. This keeps sanitizer instrumentation[1]
out of a UACCESS path:
vmlinux.o: warning: objtool: copy_compat_iovec_from_user+0xa9: call to __ubsan_handle_add_overflow() with UACCESS enabled
sysv: don't call sb_bread() with pointers_lock held
syzbot is reporting sleep in atomic context in SysV filesystem [1], for
sb_bread() is called with rw_spinlock held.
A "write_lock(&pointers_lock) => read_lock(&pointers_lock) deadlock" bug
and a "sb_bread() with write_lock(&pointers_lock)" bug were introduced by
"Replace BKL for chain locking with sysvfs-private rwlock" in Linux 2.5.12.
Then, "[PATCH] err1-40: sysvfs locking fix" in Linux 2.6.8 fixed the
former bug by moving pointers_lock lock to the callers, but instead
introduced a "sb_bread() with read_lock(&pointers_lock)" bug (which made
this problem easier to hit).
Al Viro suggested that why not to do like get_branch()/get_block()/
find_shared() in Minix filesystem does. And doing like that is almost a
revert of "[PATCH] err1-40: sysvfs locking fix" except that get_branch()
from with find_shared() is called without write_lock(&pointers_lock).
Kent Overstreet [Sat, 27 Jan 2024 02:01:05 +0000 (21:01 -0500)]
fs/pipe: Convert to lockdep_cmp_fn
*_lock_nested() is fundamentally broken; lockdep needs to check lock
ordering, but we cannot device a total ordering on an unbounded number
of elements with only a few subclasses.
the replacement is to define lock ordering with a proper comparison
function.
fs/pipe.c was already doing everything correctly otherwise, nothing
much changes here.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240127020111.487218-2-kent.overstreet@linux.dev Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
Baokun Li [Wed, 24 Jan 2024 14:28:57 +0000 (22:28 +0800)]
asm-generic: remove extra type checking in acquire/release for non-SMP case
If CONFIG_SMP is not enabled, the smp_load_acquire/smp_store_release is
implemented as READ_ONCE/READ_ONCE and barrier() and type checking.
READ_ONCE/READ_ONCE already checks the pointer type, and then checks it
more stringently outside, but the non-SMP case simply isn't relevant, so
remove the extra compiletime_assert_atomic_type() to avoid compilation
errors.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401230837.TXro0PHi-lkp@intel.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240124142857.4146716-4-libaokun1@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>
Baokun Li [Wed, 24 Jan 2024 14:28:56 +0000 (22:28 +0800)]
Revert "mm/filemap: avoid buffered read/write race to read inconsistent data"
This reverts commit e2c27b803bb6 ("mm/filemap: avoid buffered read/write
race to read inconsistent data"). After making the i_size_read/write
helpers be smp_load_acquire/store_release(), it is already guaranteed that
changes to page contents are visible before we see increased inode size,
so the extra smp_rmb() in filemap_read() can be removed.
Baokun Li [Wed, 24 Jan 2024 14:28:55 +0000 (22:28 +0800)]
fs: make the i_size_read/write helpers be smp_load_acquire/store_release()
In [Link] Linus mentions that acquire/release makes it clear which
_particular_ memory accesses are the ordered ones, and it's unlikely
to make any performance difference, so it's much better to pair up
the release->acquire ordering than have a "wmb->rmb" ordering.
Calling smp_store_release() in i_size_write() ensures that the data
in the page and the PG_uptodate bit are updated before the isize is
updated, and calling smp_load_acquire() in i_size_read ensures that
it will not read a newer isize than the data in the page. Therefore,
this avoids buffered read-write inconsistencies caused by Load-Load
reordering.
Rewrite the alignment checking iterators for iovec and bvec to be easier
to read, and also significantly more compact in terms of generated code.
This saves 270 bytes of text on x86-64 for me (with clang-18) and 224
bytes on arm64 (with gcc-13).
In profiles, also saves a bit of time as well for the same workload:
Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/544b31f7-6d4b-42f5-a544-1420501f081f@kernel.dk Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
v2: do the other half of the iterators too, as suggested by Keith.
This further saves some text.
Merge tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux
Merge exportfs fixes from Chuck Lever:
* tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux:
fs: Create a generic is_dot_dotdot() utility
exportfs: fix the fallback implementation of the get_name export operation
Wen Yang [Mon, 15 Jan 2024 15:27:00 +0000 (23:27 +0800)]
eventfd: move 'eventfd-count' printing out of spinlock
When printing eventfd->count, interrupts will be disabled and a spinlock
will be obtained, competing with eventfd_write(). By moving the
"eventfd-count" print out of the spinlock and merging multiple
seq_printf() into one, it could improve a bit, just like timerfd_show().
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Link: https://lore.kernel.org/r/tencent_B0B3D2BD9861FD009E03AB18A81783322709@qq.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dylan Yudaken <dylany@fb.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-kernel@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
Chuck Lever [Sun, 31 Dec 2023 00:46:00 +0000 (19:46 -0500)]
fs: Create a generic is_dot_dotdot() utility
De-duplicate the same functionality in several places by hoisting
the is_dot_dotdot() utility function into linux/fs.h.
Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Trond Myklebust [Thu, 28 Dec 2023 20:15:10 +0000 (15:15 -0500)]
exportfs: fix the fallback implementation of the get_name export operation
The fallback implementation for the get_name export operation uses
readdir() to try to match the inode number to a filename. That filename
is then used together with lookup_one() to produce a dentry.
A problem arises when we match the '.' or '..' entries, since that
causes lookup_one() to fail. This has sometimes been seen to occur for
filesystems that violate POSIX requirements around uniqueness of inode
numbers, something that is common for snapshot directories.
This patch just ensures that we skip '.' and '..' rather than allowing a
match.
Bytes 18-19 of 20 are uninitialized
Memory access of size 20 starts at ffff888128a46380
Data copied to user address 0000000020000240"
Per Chuck Lever's suggestion, use kzalloc() instead of kmalloc() to
solve the problem.
Fixes: 990d6c2d7aee ("vfs: Add name to file handle conversion support") Suggested-by: Chuck Lever III <chuck.lever@oracle.com> Reported-and-tested-by: <syzbot+09b349b3066c2e0b1e96@syzkaller.appspotmail.com> Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Link: https://lore.kernel.org/r/20240119153906.4367-1-n.zhandarovich@fintech.ru Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
Rich Felker [Mon, 31 Aug 2020 15:32:08 +0000 (11:32 -0400)]
vfs: add RWF_NOAPPEND flag for pwritev2
The pwrite function, originally defined by POSIX (thus the "p"), is
defined to ignore O_APPEND and write at the offset passed as its
argument. However, historically Linux honored O_APPEND if set and
ignored the offset. This cannot be changed due to stability policy,
but is documented in the man page as a bug.
Now that there's a pwritev2 syscall providing a superset of the pwrite
functionality that has a flags argument, the conforming behavior can
be offered to userspace via a new flag. Since pwritev2 checks flag
validity (in kiocb_set_rw_flags) and reports unknown ones with
EOPNOTSUPP, callers will not get wrong behavior on old kernels that
don't support the new flag; the error is reported and the caller can
decide how to handle it.
Hu.Yadi [Thu, 11 Jan 2024 11:32:29 +0000 (19:32 +0800)]
selftests/move_mount_set_group:Make tests build with old libc
Replace SYS_<syscall> with __NR_<syscall>. Using the __NR_<syscall>
notation, provided by UAPI, is useful to build tests on systems without
the SYS_<syscall> definitions.
Replace SYS_move_mount with __NR_move_mount
Similar changes: commit 87129ef13603 ("selftests/landlock: Make tests build with old libc")
Acked-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Hu.Yadi <hu.yadi@h3c.com> Link: https://lore.kernel.org/r/20240111113229.10820-1-hu.yadi@h3c.com Reviewed-by: Berlin <berlin@h3c.com> Suggested-by: Jiao <jiaoxupo@h3c.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
Hu Yadi [Fri, 12 Jan 2024 07:40:59 +0000 (15:40 +0800)]
selftests/filesystems:fix build error in overlayfs
One build issue comes up due to both mount.h included dev_in_maps.c
In file included from dev_in_maps.c:10:
/usr/include/sys/mount.h:35:3: error: expected identifier before numeric constant
35 | MS_RDONLY = 1, /* Mount read-only. */
| ^~~~~~~~~
In file included from dev_in_maps.c:13:
Remove one of them to solve conflict, another error comes up:
dev_in_maps.c:170:6: error: implicit declaration of function ‘mount’ [-Werror=implicit-function-declaration]
170 | if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) == -1) {
| ^~~~~
cc1: all warnings being treated as errors
and then , add sys_mount definition to solve it
After both above, dev_in_maps.c can be built correctly on my mache(gcc 10.2,glibc-2.32,kernel-5.10)
The root cause is that, one thread is doing page migration, and we will
use the target page's ->mapping field to save 'anon_vma' pointer between
page unmap and page move, and now the target page is locked and refcount
is 1.
Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers
that the refcount of this target page is 1, preventing the offline operation,
thus proceeding to dump the page. However, page_mapping() of the target
page may return an incorrect file mapping to crash the system in dump_mapping(),
since the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.
The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate:
fix getting incorrect page mapping during page migration"). In addition,
Matthew suggested we should also improve dump_mapping()'s robustness to
resilient against the kernel crash [1].
With checking the 'dentry.parent' and 'dentry.d_name.name' used by
dentry_name(), I can see dump_mapping() will output the invalid dentry
instead of crashing the system when this issue is reproduced again.
Wen Yang [Wed, 10 Jan 2024 15:47:40 +0000 (23:47 +0800)]
eventfd: add a BUILD_BUG_ON() to ensure consistency between EFD_SEMAPHORE and the uapi
introduce a BUILD_BUG_ON to check that the EFD_SEMAPHORE is equal to its
definition in the uapi file, just like EFD_CLOEXEC and EFD_NONBLOCK.
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Link: https://lore.kernel.org/r/tencent_0BAA2DEAF9208D49987457E6583F9BE79507@qq.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-kernel@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
If initrd_start cpio extraction fails, CONFIG_BLK_DEV_RAM triggers
fallback to initrd.image handling via populate_initrd_image().
The populate_initrd_image() call follows successful extraction of any
built-in cpio archive at __initramfs_start, but currently performs
built-in archive extraction a second time.
Prior to commit b2a74d5f9d446 ("initramfs: remove clean_rootfs"),
the second built-in initramfs unpack call was used to repopulate entries
removed by clean_rootfs(), but it's no longer necessary now the contents
of the previous extraction are retained.
Linus Torvalds [Sun, 21 Jan 2024 22:01:12 +0000 (14:01 -0800)]
Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"Some fixes, Some refactoring, some minor features:
- Assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on
multithreaded workloads: fstests (in addition to ktest tests) are
now checking slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes"
* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
bcachefs: Improve inode_to_text()
bcachefs: logged_ops_format.h
bcachefs: reflink_format.h
bcachefs; extents_format.h
bcachefs: ec_format.h
bcachefs: subvolume_format.h
bcachefs: snapshot_format.h
bcachefs: alloc_background_format.h
bcachefs: xattr_format.h
bcachefs: dirent_format.h
bcachefs: inode_format.h
bcachefs; quota_format.h
bcachefs: sb-counters_format.h
bcachefs: counters.c -> sb-counters.c
bcachefs: comment bch_subvolume
bcachefs: bch_snapshot::btime
bcachefs: add missing __GFP_NOWARN
bcachefs: opts->compression can now also be applied in the background
bcachefs: Prep work for variable size btree node buffers
bcachefs: grab s_umount only if snapshotting
...
Linus Torvalds [Sun, 21 Jan 2024 19:14:40 +0000 (11:14 -0800)]
Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers"
* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
clocksource/drivers/ep93xx: Fix error handling during probe
clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
clocksource/timer-riscv: Add riscv_clock_shutdown callback
dt-bindings: timer: Add StarFive JH8100 clint
dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
Kent Overstreet [Tue, 16 Jan 2024 21:20:21 +0000 (16:20 -0500)]
bcachefs: opts->compression can now also be applied in the background
The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 16 Jan 2024 18:29:59 +0000 (13:29 -0500)]
bcachefs: Prep work for variable size btree node buffers
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analagous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
and unlock it at the end of the function. There is a comment
"why do we need this lock?" about the lock coming from
commit 42d237320e98 ("bcachefs: Snapshot creation, deletion")
The reason is that __bch2_ioctl_subvolume_create() calls
sync_inodes_sb() which enforce locked s_umount to writeback all dirty
nodes before doing snapshot works.
Fix it by read locking s_umount for snapshotting only and unlocking
s_umount after sync_inodes_sb().
Signed-off-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Colin Ian King [Tue, 16 Jan 2024 11:07:23 +0000 (11:07 +0000)]
bcachefs: remove redundant variable tmp
The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.
Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 16 Jan 2024 01:37:23 +0000 (20:37 -0500)]
bcachefs: Fix excess transaction restarts in __bchfs_fallocate()
drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 15 Jan 2024 22:59:51 +0000 (17:59 -0500)]
bcachefs: Better journal tracepoints
Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 15 Jan 2024 22:56:22 +0000 (17:56 -0500)]
bcachefs: Avoid flushing the journal in the discard path
When issuing discards, we may need to flush the journal if there's too
many buckets that can't be discarded until a journal flush.
But the heuristic was bad; we should be comparing the number of buckets
that need to flushes against the number of free buckets, not the number
of buckets we saw.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 15 Jan 2024 19:15:26 +0000 (14:15 -0500)]
bcachefs: bch2_kthread_io_clock_wait() no longer sleeps until full amount
Drop t he loop in bch2_kthread_io_clock_wait(): this allows the code
that uses it to be woken up for other reasons, and fixes a bug where
rebalance wouldn't wake up when a scan was requested.
This raises the possibility of spurious wakeups, but callers should
always be able to handle that reasonably well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 11 Jan 2024 04:47:04 +0000 (23:47 -0500)]
bcachefs: Reduce would_deadlock restarts
We don't have to take locks in any particular ordering - we'll make
forward progress just fine - but if we try to stick to an ordering, it
can help to avoid excessive would_deadlock transaction restarts.
This tweaks the reflink path to take extents btree locks in the right
order.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 11 Jan 2024 04:08:30 +0000 (23:08 -0500)]
bcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT
Previously, we added logging in the write path to ensure that any
unexpected errors getting reported to userspace have a log message; but
BCH_WRITE_ALLOC_NOWAIT is a special case, it's used for promotes where
errors are expected and not reported out to userspace - so we need to
silence those.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- retry (reconnect) improvement including new retrans mount parm, and
handling of two additional return codes that need to be retried on
- two minor cleanup patches and another to remove duplicate query
info code
- two documentation cleanup, and one reviewer email correction"
* tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update iface_last_update on each query-and-update
cifs: handle servers that still advertise multichannel after disabling
cifs: new mount option called retrans
cifs: reschedule periodic query for server interfaces
smb: client: don't clobber ->i_rdev from cached reparse points
smb: client: get rid of smb311_posix_query_path_info()
smb: client: parse owner/group when creating reparse points
smb: client: fix parsing of SMB3.1.1 POSIX create context
cifs: update known bugs mentioned in kernel docs for cifs
cifs: new nt status codes from MS-SMB2
cifs: pick channel for tcon and tdis
cifs: open_cached_dir should not rely on primary channel
smb3: minor documentation updates
Update MAINTAINERS email address
cifs: minor comment cleanup
smb3: show beginning time for per share stats
cifs: remove redundant variable tcon_exist
Linus Torvalds [Sat, 20 Jan 2024 23:03:25 +0000 (15:03 -0800)]
Merge tag 'dmaengine-fix-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine updates from Vinod Koul:
"New support:
- Loongson LS2X APB DMA controller
- sf-pdma: mpfs-pdma support
- Qualcomm X1E80100 GPI dma controller support
Updates:
- Xilinx XDMA updates to support interleaved DMA transfers
- TI PSIL threads for AM62P and J722S and cfg register regions
description
- axi-dmac Improving the cyclic DMA transfers
- Tegra Support dma-channel-mask property
- Remaining platform remove callback returning void conversions
Driver fixes for:
- Xilinx xdma driver operator precedence and initialization fix
- Excess kernel-doc warning fix in imx-sdma xilinx xdma drivers
- format-overflow warning fix for rz-dmac, sh usb dmac drivers
- 'output may be truncated' fix for shdma, fsl-qdma and dw-edma
drivers"
* tag 'dmaengine-fix-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (58 commits)
dmaengine: dw-edma: increase size of 'name' in debugfs code
dmaengine: fsl-qdma: increase size of 'irq_name'
dmaengine: shdma: increase size of 'dev_id'
dmaengine: xilinx: xdma: Fix kernel-doc warnings
dmaengine: usb-dmac: Avoid format-overflow warning
dmaengine: sh: rz-dmac: Avoid format-overflow warning
dmaengine: imx-sdma: fix Excess kernel-doc warnings
dmaengine: xilinx: xdma: Fix initialization location of desc in xdma_channel_isr()
dmaengine: xilinx: xdma: Fix operator precedence in xdma_prep_interleaved_dma()
dmaengine: xilinx: xdma: statify xdma_prep_interleaved_dma
dmaengine: xilinx: xdma: Workaround truncation compilation error
dmaengine: pl330: issue_pending waits until WFP state
dmaengine: xilinx: xdma: Implement interleaved DMA transfers
dmaengine: xilinx: xdma: Prepare the introduction of interleaved DMA transfers
dmaengine: xilinx: xdma: Add transfer error reporting
dmaengine: xilinx: xdma: Add error checking in xdma_channel_isr()
dmaengine: xilinx: xdma: Rework xdma_terminate_all()
dmaengine: xilinx: xdma: Ease dma_pool alignment requirements
dmaengine: xilinx: xdma: Add necessary macro definitions
dmaengine: xilinx: xdma: Get rid of unused code
...