Darrick J. Wong [Wed, 21 May 2025 22:40:23 +0000 (15:40 -0700)]
fuse2fs: decode fuse_main error codes
Translate the fuse_main return values into actual mount(8) style error
codes instead of returning 0 all the time, and print something to the
original stderr if something went wrong so that the user will know what
to do next.
Darrick J. Wong [Wed, 21 May 2025 22:37:00 +0000 (15:37 -0700)]
fuse2fs: flip parameter order in __translate_error
Flip the parameter order in __translate_error so that it matches
translate_error. I wasted too much time debugging a memory corruption
that happened because I converted translate_error to __translate_error
when developing the next patch and the compiler didn't warn me about
mismatched types.
Darrick J. Wong [Wed, 21 May 2025 22:36:44 +0000 (15:36 -0700)]
fuse2fs: fix error return handling in op_truncate
Fix a couple of bugs with the errcode/ret handling in op_truncate.
First, we need to return ESTALE for a zero inumber because there is no
inode zero in an ext* filesystem. Second, we need to return negative
errno for failures to libfuse, not raw errcode_t.
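The two rules above can be sketched as follows. This is a minimal illustration only: `check_inumber` and `to_fuse_error` are hypothetical names, not functions from fuse2fs, and the real code converts `errcode_t` values through its own translation helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: ext* filesystems have no inode 0, so a zero
 * inode number is reported as a stale handle.  libfuse expects
 * negative errno values from operation callbacks. */
static int check_inumber(uint32_t ino)
{
	return ino == 0 ? -ESTALE : 0;
}

/* Hypothetical helper: turn a positive errno-style code into the
 * negative value that libfuse expects; 0 stays 0. */
static int to_fuse_error(int positive_errno)
{
	assert(positive_errno >= 0);
	return -positive_errno;
}
```

A handler would call `check_inumber()` first and return its result unchanged if it is nonzero.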
Darrick J. Wong [Wed, 21 May 2025 22:36:13 +0000 (15:36 -0700)]
fuse2fs: compact all the boolean flags in struct fuse2fs
Compact all the booleans into u8 fields. I'd go further and turn them
into bitfields but that breaks the fuse argument parsing macros, which
compute the offset of the structure fields, and gcc won't let us do that
to bit fields. Still, 136 -> 112 bytes isn't bad.
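To illustrate why byte-sized fields still work with offset-based option parsing, here is a small sketch. The struct layouts and helper names are hypothetical and much smaller than the real struct fuse2fs; the point is only that offsetof() needs a member with a byte address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout: one int per boolean flag. */
struct opts_wide {
	void *fs;
	int ro;
	int debug;
	int no_default_opts;
};

/* Hypothetical "after" layout: one byte per boolean flag. */
struct opts_compact {
	void *fs;
	uint8_t ro;
	uint8_t debug;
	uint8_t no_default_opts;
};

/* An option table records a field offset like this.  It works for u8
 * members, but offsetof() on a bit-field member is rejected by the
 * compiler because a bit-field has no byte address. */
static size_t debug_flag_offset(void)
{
	return offsetof(struct opts_compact, debug);
}

/* Store a parsed boolean through the recorded offset. */
static void set_flag_at(struct opts_compact *o, size_t off, uint8_t val)
{
	assert(o != NULL && off < sizeof(*o));
	((uint8_t *)o)[off] = val;
}
```

On common ABIs `sizeof(struct opts_compact)` is smaller than `sizeof(struct opts_wide)`, which is the same effect as the size reduction quoted above, on a smaller scale.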
Darrick J. Wong [Wed, 21 May 2025 22:35:41 +0000 (15:35 -0700)]
fuse2fs: clean up error messages
Instead of horridly line-wrapping multi-line messages that are printed
during mounting, let's just expand them to be one source code line per
printed line. This will make it a lot easier for someone who sees
these errors to grep the source code to find out where they came from.
Darrick J. Wong [Wed, 21 May 2025 22:35:26 +0000 (15:35 -0700)]
libext2fs: fix livelock in the unix io manager
generic/441 found a livelock in the unix IO manager. Let's say that
write_primary_superblock decides to call io_channel_set_blksize in the
process of writing the primary super.
unix_set_blksize then takes the cache and bounce mutexes, and calls
flush_cached_blocks. If there are dirty blocks in the cache, they will
be written with raw_write_blk. Unfortunately, that function tries to
take the bounce mutex, which we already hold. At that point, we
livelock fuse2fs.
Darrick J. Wong [Wed, 21 May 2025 22:35:10 +0000 (15:35 -0700)]
libext2fs: fix unix io manager invalidation
flush_cached_blocks does not invalidate clean blocks from the block
cache. From reading all the call sites, it looks like they all actually
want the cache to be empty on successful return, so adjust the
implementation to do this.
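The intended behavior (write back dirty blocks, then leave the cache empty) can be sketched with a toy cache. The structure, field names, and callback are illustrative assumptions and do not mirror the real unix_io.c data structures.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 4

/* Toy cache entry; field names are illustrative only. */
struct cache_entry {
	unsigned long long block;
	int in_use;
	int dirty;
};

/* Write back dirty entries, then drop every entry (clean ones too) so
 * that the cache is empty on successful return.  write_fn stands in
 * for the real block write and returns 0 on success. */
static int flush_and_invalidate(struct cache_entry *cache, size_t n,
				int (*write_fn)(unsigned long long block))
{
	size_t i;

	assert(cache != NULL && write_fn != NULL);
	for (i = 0; i < n; i++) {
		if (!cache[i].in_use)
			continue;
		if (cache[i].dirty) {
			int err = write_fn(cache[i].block);

			if (err)
				return err;	/* keep the entry for a retry */
			cache[i].dirty = 0;
		}
		cache[i].in_use = 0;	/* invalidate clean and flushed blocks */
	}
	return 0;
}

/* Trivial stand-in writer that only counts calls. */
static int writes_done;

static int count_write(unsigned long long block)
{
	(void)block;
	writes_done++;
	return 0;
}
```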
Theodore Ts'o [Fri, 23 May 2025 03:53:40 +0000 (23:53 -0400)]
fuse2fs: fix portability issues when compiling on MacOS
Fix a number of portability issues which resulted in fuse2fs failing
to build on MacOS.
*) MacOS doesn't have the timespec fields in struct stat; we have
an autoconf test to check for this, so use it.
*) The portable way to print off_t values is to use
printf("%jd", (intmax_t) d); The cast is necessary to avoid
type mismatch warnings.
*) Define FUSE_DARWIN_ENABLE_EXTENSIONS=0 to avoid using random
structs such as struct fuse_darwin_attr and struct fuse_darwin_fill_dir_t
in the fuse operation function prototypes.
With these fixes, fuse2fs successfully compiles and works with
MacFuse on macOS Sequoia.
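The off_t printing idiom from the list above, as a small self-contained sketch; `format_offset` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Format an off_t portably: off_t has no printf length modifier of
 * its own, so cast it to intmax_t and print it with %jd. */
static int format_offset(char *buf, size_t len, off_t off)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%jd", (intmax_t) off);
}
```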
Theodore Ts'o [Tue, 20 May 2025 02:07:55 +0000 (22:07 -0400)]
libext2fs: add the EXT2FS_LINK_APPEND flag to ext2fs_link()
Add a flag which only tries to add the new directory entry to the last
block in the directory. This helps mke2fs -d avoid suffering
from an O(n**2) performance bottleneck when adding a large number of
files to a directory.
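A toy model shows where the quadratic cost comes from. It assumes a directory that only ever grows, with a fixed number of entries per block; the function name and the model itself are illustrative and do not call into libext2fs.

```c
#include <assert.h>

/* Count directory blocks examined while adding n entries, with each
 * block holding per_block entries.  With append_only == 0 every
 * insert scans from the first block; otherwise only the last block
 * is examined. */
static unsigned long long blocks_examined(unsigned n, unsigned per_block,
					  int append_only)
{
	unsigned long long examined = 0;
	unsigned nblocks = 0, used_in_last = 0, i;

	assert(per_block > 0);
	for (i = 0; i < n; i++) {
		if (nblocks > 0)
			examined += append_only ? 1 : nblocks;
		if (nblocks == 0 || used_in_last == per_block) {
			nblocks++;	/* no room: grow the directory */
			used_in_last = 0;
		}
		used_in_last++;
	}
	return examined;
}
```

In this model, adding 8 entries at 2 entries per block examines 16 blocks with a full scan and 7 blocks when only the last block is checked; the gap widens quadratically as the entry count grows.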
Theodore Ts'o [Sun, 4 May 2025 18:07:14 +0000 (14:07 -0400)]
mke2fs: don't set the raid stripe for non-rotational devices by default
The ext4 block allocator is not at all efficient when it is asked to
enforce RAID alignment. It is especially bad for flash-based devices,
or when the file system is highly fragmented. For non-rotational
devices, it's fine to set the stride parameter (which controls
spreading the allocation bitmaps across the RAID component devices,
which always makes sense); but for the stripe parameter (which asks the
ext4 block allocator to try _very_ hard to find RAID stripe aligned
devices) it's probably not a good idea.
Add new mke2fs.conf parameters with the defaults:
[defaults]
set_raid_stride = always
set_raid_stripe = disk
Even for RAID arrays based on HDD's, we can still have problems for
highly fragmented file systems. This will need to be solved in the
kernel, probably by having some kind of wall clock or CPU time
limitation for each block allocation or adding some kind of
optimization which is faster than using our current buddy bitmap
implementation, especially if the stripe size is not a multiple of a
power of two. But for SSD's, it's much less likely to make sense even
if we have an optimized block allocator, because if you've paid $$$
for a flash-based RAID array, the cost/benefit tradeoffs of doing less
optimized stripe RMW cycles versus the block allocator time and CPU
overhead is harder to justify without a lot of optimization effort.
If and when we can improve the ext4 kernel implementation (and it gets
rolled out to users using LTS kernels), we can change the defaults.
And of course, system administrators can always change
/etc/mke2fs.conf settings.
Theodore Ts'o [Wed, 21 May 2025 14:39:53 +0000 (10:39 -0400)]
libext2fs: rename fls() to find_last_bit_set()
In unix_io.c, rename fls() to fix a portability problem on MacOS, since
some systems may already define it in a system header file. The name
"find_last_bit_set" is a bit more self-descriptive anyway.
Darrick J. Wong [Thu, 24 Apr 2025 21:46:33 +0000 (14:46 -0700)]
fuse2fs: allow use of direct io for disk access
Allow users to ask for O_DIRECT for disk accesses so that block device
writes won't be throttled. This should improve latency, but will put
a lot more pressure on the disk cache.
Darrick J. Wong [Thu, 24 Apr 2025 21:45:28 +0000 (14:45 -0700)]
fuse2fs: delegate access control decisions to the kernel
In "kernel" mode (aka allow_others + default_permissions), the kernel
enforces all the access control for us. Therefore, we don't need to do
any checking of our own. Create a purpose-built helper to detect this
situation and turn off all the access controlling.
Darrick J. Wong [Thu, 24 Apr 2025 21:45:12 +0000 (14:45 -0700)]
fuse2fs: refactor sysadmin predicate
Refactor the code that decides if an access is being made by the
superuser into a helper, which we'll use to fix more permissions
problems in the next patch.
Darrick J. Wong [Thu, 24 Apr 2025 21:44:25 +0000 (14:44 -0700)]
fuse2fs: disable renameat2
Apparently fuse munged rename and renameat2 together into the same
upcall, so we actually have to filter out nonzero flags because
otherwise we do a regular rename for a RENAME_EXCHANGE/WHITEOUT, which
is not what the user asked for.
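A flag filter of this kind can be sketched as below. The helper name is hypothetical, and returning -EINVAL for any nonzero flag is an assumption about the error code, not a quote from fuse2fs.

```c
#include <errno.h>

/* Hypothetical gate for the top of a rename handler.  libfuse 3
 * passes the renameat2() flags to the rename callback, so a handler
 * that only implements plain rename must reject everything else. */
static int check_rename_flags(unsigned int flags)
{
	if (flags != 0)
		return -EINVAL;	/* e.g. RENAME_EXCHANGE, RENAME_WHITEOUT */
	return 0;
}
```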
Darrick J. Wong [Thu, 24 Apr 2025 21:43:39 +0000 (14:43 -0700)]
fuse2fs: support changing newer iflags
Redefine FUSE2FS_MODIFIABLE_IFLAGS so that userspace can modify any
flags that the kernel can, except for the ones that fuse2fs lacks the
ability to change.
Darrick J. Wong [Thu, 24 Apr 2025 21:42:52 +0000 (14:42 -0700)]
fuse2fs: clamp timestamps that are being written to disk
Clamp the timestamps that we write to disk to the minimum and maximum
values permitted given the ondisk format. This fixes y2038 support, as
tested by generic/402.
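Clamping itself is simple; the sketch below takes the limits as parameters because they depend on the on-disk inode format. The helper name and the example 32-bit limits are assumptions for illustration, not the constants used by fuse2fs.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a seconds value into [min, max] before it is stored on disk. */
static int64_t clamp_seconds(int64_t t, int64_t min, int64_t max)
{
	assert(min <= max);
	if (t < min)
		return min;
	if (t > max)
		return max;
	return t;
}

/* Limits of a plain signed 32-bit seconds field, used here only as
 * an example of an on-disk range. */
#define SECS_32BIT_MIN ((int64_t) INT32_MIN)
#define SECS_32BIT_MAX ((int64_t) INT32_MAX)
```

With these example limits, a timestamp past January 2038 is stored as the largest representable value instead of wrapping around to a date in the past.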
Darrick J. Wong [Thu, 24 Apr 2025 21:42:20 +0000 (14:42 -0700)]
fuse2fs: add an easy option for emulating kernel access behaviors
By default, fuse doesn't allow processes with a uid/gid that don't match
those of the server process to access the fuse mount, it doesn't allow
suid files or devices, and it relies on the fuse server to perform
permissions checking. This is a secure default for very untrusted
filesystems, but it's possible that we might actually want to allow
general access to an ext4 filesystem as part of containerizing the ext4
metadata parsing. In other words, we want the kernel access control
behavior.
Add a "kernel" mount option that moves most of the access permissions
interpretation back into the kernel, and logs mount/unmount/error
messages to dmesg. Right now this is mostly useful for fstests, so we
leave it off by default.
Darrick J. Wong [Thu, 24 Apr 2025 21:41:18 +0000 (14:41 -0700)]
fuse2fs: return -EOPNOTSUPP when we don't recognize a fallocate mode
If we don't recognize a set bit in the mode parameter to fallocate,
return EOPNOTSUPP to communicate that we don't support that mode instead
of EINVAL. This avoids unnecessary failures in generic/521 such as:
generic/521 - output mismatch (see /var/tmp/fstests/generic/521.out.bad)
--- tests/generic/521.out 2025-01-30 10:00:16.898276477 -0800
+++ /var/tmp/fstests/generic/521.out.bad 2025-04-03 14:46:20.019822396 -0700
@@ -1,2 +1,9 @@
QA output created by 521
+zero range: 0x407ca to 0x52885
+do_zero_range: fallocate: Invalid argument
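The mode check described above amounts to a mask test. In this sketch the MODE_* names and the supported set are hypothetical; real code would use the FALLOC_FL_* constants from the platform headers and whatever set of modes the server actually implements.

```c
#include <errno.h>

/* Hypothetical mode bits, for illustration only. */
#define MODE_KEEP_SIZE	0x01
#define MODE_PUNCH_HOLE	0x02
#define MODE_ZERO_RANGE	0x10

/* Assumed set of modes that this toy server implements. */
#define MODE_SUPPORTED	(MODE_KEEP_SIZE | MODE_PUNCH_HOLE)

/* Reject any mode bit outside the supported set with -EOPNOTSUPP, so
 * that callers can tell "not implemented here" from "bad argument". */
static int check_fallocate_mode(int mode)
{
	if (mode & ~MODE_SUPPORTED)
		return -EOPNOTSUPP;
	return 0;
}
```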
Darrick J. Wong [Thu, 24 Apr 2025 21:40:47 +0000 (14:40 -0700)]
fuse2fs: make "ro" behavior consistent with the kernel
Make the behavior of the "ro" mount option consistent with the kernel:
User programs cannot change the files in the filesystem, but the driver
itself is allowed to update the filesystem metadata. This means that ro
mounts can recover the journal.
Darrick J. Wong [Thu, 24 Apr 2025 21:40:31 +0000 (14:40 -0700)]
fuse2fs: set fuse subtype via argv[0] if possible
If argv[0] ends in "ext[0-9]", set the fuse subtype string to this
value. This enables us to place fuse2fs at some place in the filesystem
like /sbin/mount.ext2 and have /proc/mounts report the filesystem type
as "fuse.ext2". This is fairly boring, but it'll make it easier to test
things in fstests.
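The argv[0] suffix test can be sketched as follows. `subtype_from_argv0` is a hypothetical helper name; how fuse2fs passes the resulting string on to libfuse is not shown here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the trailing "extN" (N a single digit) inside
 * argv0, or NULL if argv0 does not end that way. */
static const char *subtype_from_argv0(const char *argv0)
{
	size_t len;
	const char *tail;

	assert(argv0 != NULL);
	len = strlen(argv0);
	if (len < 4)
		return NULL;
	tail = argv0 + len - 4;
	if (strncmp(tail, "ext", 3) != 0 ||
	    !isdigit((unsigned char) tail[3]))
		return NULL;
	return tail;
}
```

For /sbin/mount.ext2 this yields "ext2", while a plain fuse2fs invocation yields NULL and keeps the default subtype.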
Andreas Dilger [Sat, 25 Jan 2025 00:42:19 +0000 (17:42 -0700)]
journal: increase revoke block hash size
Increase the size of the revoke block hash table to scale with the
size of the journal, so that we don't get long hash chains if there
are a large number of revoke blocks in the journal to replay.
The new hash size will default to 1/16 of the blocks in the journal.
This is about 1 byte per block in the hash table, but there are two
allocated. The total amount of memory allocated for revoke blocks
depends much more on how many are in the journal, and not on the size
of the hash table. The system is regularly using this much memory
for the journal blocks, so the hash table size is not a big factor.
Consolidate duplicate code between recover_ext3_journal() and
ext2fs_open_ext3_journal() in debugfs.c to avoid duplicating logic.
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Change-Id: Ibadf2a28c2f42fa92601f9da39a6ff73a43ebbe5
Reviewed-on: https://review.whamcloud.com/52386
Link: https://lore.kernel.org/r/20250125004220.44607-2-adilger@whamcloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Andreas Dilger [Sat, 25 Jan 2025 00:42:18 +0000 (17:42 -0700)]
misc: deduplicate log2/log10 functions
Remove duplicate log2() and log10() functions and replace them with
functions ext2fs_log2_u{32,64}() and ext2fs_log10_u{32,64}().
The int_log10() functions in progress.c and mke2fs.c were not like
the others, since they did not divide by the base before increment,
effectively rounding up instead of down. Compensate by adding one
to the returned ext2fs_log10_u32() value at the callers.
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Change-Id: Ifc86efe7e5f0243eb914c6d24319cc7dee3ebbe5
Reviewed-on: https://review.whamcloud.com/52385
Link: https://lore.kernel.org/r/20250125004220.44607-1-adilger@whamcloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
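The rounding difference described in the log2/log10 commit can be seen in a small sketch. Both function names are hypothetical stand-ins: the first divides by the base before counting (rounding down), the second counts first, as the old int_log10() variants are described as doing.

```c
#include <stdint.h>

/* Divide before incrementing: floor(log10(arg)) for arg >= 1. */
static int log10_floor_u32(uint32_t arg)
{
	int l = 0;

	for (arg /= 10; arg > 0; arg /= 10)
		l++;
	return l;
}

/* Increment before dividing: the number of decimal digits in arg. */
static int int_log10_old(uint32_t arg)
{
	int l = 0;

	for (; arg > 0; arg /= 10)
		l++;
	return l;
}
```

For any nonzero argument the second function returns exactly one more than the first, which is why callers can compensate by adding one.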
Brian Foster [Thu, 23 Jan 2025 13:52:11 +0000 (08:52 -0500)]
debugfs: byteswap dirsearch dirent buf on big endian systems
fstests test ext4/048 fails on big endian systems due to broken
debugfs dirsearch functionality. On an s390x system and 4k block
size, the dirsearch command seems to hang indefinitely. On the same
system with a 1k block size, the command fails to locate an existing
entry and causes the test to fail due to unexpected results.
The cause of the dirsearch failure is lack of byte swapping of the
on-disk (little endian) dirent buffer before attempting to iterate
entries in the given block. This leads to garbage record and name
length values, for example. To resolve this problem, byte swap the
directory buffer on big endian systems.
Gwendal Grignou [Fri, 3 Jan 2025 23:50:42 +0000 (15:50 -0800)]
tune2fs: do not update quota when not needed
Enabling quota is expensive: All inodes in the filesystem are scanned.
Only do it when the requested quota configuration does not match the
existing configuration.
Test:
Add a tiny patch to print out when the core of the function
handle_quota_options() is triggered.
Issue commands:
truncate -s 1G unused ; mkfs.ext4 unused
| commands                                                | trigger | comments
+---------------------------------------------------------+---------+---------
| tune2fs -Qusrquota,grpquota -Qprjquota -O quota unused  | Y       | Quota not set at formatting.
| tune2fs -Qusrquota,grpquota -Qprjquota -O quota unused  | N       | Already set just above
| tune2fs -Qusrquota,grpquota -Q^prjquota -O quota unused | Y       | Disabling a quota
| tune2fs -Qusrquota,grpquota -Q^prjquota -O quota unused | N       | No change from previous line.
| tune2fs -Qusrquota,grpquota -O quota unused             | N       | No change from previous line.
| tune2fs -Qusrquota,^grpquota -O quota unused            | Y       | Disabling a quota
| tune2fs -Qusrquota -O quota unused                      | N       | No change from previous line.
| tune2fs -O ^quota unused                                | Y       | Remove quota
| tune2fs -O quota unused                                 | Y       | Re-enable quota, default values (-Qusrquota,grpquota) used.
| tune2fs -O quota -Qusrquota unused                      | N       | Already set just above
Theodore Ts'o [Wed, 21 May 2025 02:53:41 +0000 (22:53 -0400)]
libext2fs: teach ext2fs_extent_set_bmap() to update extents more optimally
When programs like resize2fs or e2fsck relocate all of the blocks in
an extent one at a time, ext2fs_extent_set_bmap() works by
initially adding a new extent and then moving the mapping from the old
extent to the new extent. For example:
t=1 EXTENTS: (0-2) 1152-1154
t=2 EXTENTS: (0) 1136, (1-2) 1153-1154
t=3 EXTENTS: (0-1) 1136-1137, (2) 1154
Unfortunately, previously, when the last block is updated, the
resulting extent tree will have two extents instead of one, like this:
t=4 EXTENTS: (0-1) 1136-1137, (2) 1138
With this commit, the resulting extent tree will be more optimally
represented with a single extent:
t=4 EXTENTS: (0-2) 1136-1138
The optimization in this commit solves the problem reported at:
https://github.com/tytso/e2fsprogs/issues/146
In that case, the file had a very large, complex (fragmented) extent
tree, and resize2fs needed to relocate all of its blocks as part of an
off-line shrink. The lack of the optimization led to an extent block
overflowing, resulting in the old extent (the one which originally
mapped logical block 2507128 to physical block 389065080) and the new
extent landing in two different leaf blocks:
Theodore Ts'o [Sat, 22 Mar 2025 20:18:26 +0000 (16:18 -0400)]
test: fix expect files which changed after EA bugfix
The logic bug which was fixed in commit 92b6e93936d7 ("e2fsck: fix
logic bug when there are no references...") resulted in some silent
fixes that were never logged, and in some cases, corruption that was
not cleaned up. Fix the tests so that they pass as expected.
Fixes: 92b6e93936d7 ("e2fsck: fix logic bug when there are no references...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Theodore Ts'o [Thu, 6 Mar 2025 03:41:43 +0000 (22:41 -0500)]
Don't compile util/symlinks on Windows
This helper program is only needed when building elf shared libraries,
and for some reason Windows is failing to compile POSIX's struct stat.
Windows doesn't support symlinks anyway.
Theodore Ts'o [Mon, 24 Feb 2025 05:00:42 +0000 (00:00 -0500)]
misc: fix missing variable names in function prototype
If libarchive support is not available or has been disabled, the
function __populate_fs_from_tar() just prints an error message, and
doesn't use any of the function parameters. Newer versions of gcc
won't complain about the missing parameter names, since newer C
standards allow this, but it breaks on older versions of gcc.
Theodore Ts'o [Sat, 22 Feb 2025 04:54:02 +0000 (23:54 -0500)]
Fix parallel "make -j install"
If running a parallel install with --enable-elf-shlibs, multiple makes
in different library directories can collide while trying to build
util/symlinks. Fix this by building util/symlinks and
util/install-symlink in the top-level Makefile, along with the other
dependencies, before recursing into other directories.
Theodore Ts'o [Wed, 1 Jan 2025 06:09:43 +0000 (01:09 -0500)]
po: update the binary gmo files
Also relax the msgfmt checking to avoid using --check-format, since
e2fsck's problem string's %-interpolation allows the ordering of
block, inode, etc. numbers to be moved around in translations. (And
the recent update to the Spanish po file takes advantage of this
feature.)