Theodore Ts'o [Mon, 26 May 2025 14:09:59 +0000 (10:09 -0400)]
libe2p: avoid potential integer overflow in iterate_on_dir()
Overflows won't happen if the OS's implementation of pathconf()
returns reasonable values, but we can make it a bit more hardened
against malicious implementations.
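A minimal sketch of this kind of hardening, not the actual libe2p code: the helper name `dirent_buf_size()` and the cap `MAX_SANE_NAME_LEN` are assumptions made for illustration. The idea is to treat the `pathconf()` result as untrusted and reject values that are negative or implausibly large before adding the size of the `struct dirent` header.

```c
#include <dirent.h>
#include <stddef.h>

/* Hypothetical upper bound on a believable file name length. */
#define MAX_SANE_NAME_LEN 65536L

/*
 * Compute the buffer size for one directory entry from the value that
 * pathconf(dir, _PC_NAME_MAX) returned.  Returns 0 if the value is
 * negative or larger than the cap, so the addition below cannot wrap.
 */
static size_t dirent_buf_size(long name_max)
{
	if (name_max < 0 || name_max > MAX_SANE_NAME_LEN)
		return 0;
	return offsetof(struct dirent, d_name) + (size_t) name_max + 1;
}
```

A caller would treat a return value of 0 as an error instead of passing it to malloc().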
Theodore Ts'o [Mon, 26 May 2025 02:20:36 +0000 (22:20 -0400)]
e2fsck: fix e2fsck -E unshare_blocks when there are no shared blocks
If there are no shared blocks in an ext4 file system, e2fsck -E
unshare_blocks will not actually clear the shared_blocks feature flag
since e2fsck_pass1_dupblocks() is never called. Fix this by adding a
check in e2fsck_pass1() to clear the shared blocks flag.
Theodore Ts'o [Sun, 25 May 2025 16:51:36 +0000 (12:51 -0400)]
mke2fs: propagate some chattr flags into the fs image when using mke2fs -d
When copying files from a source directory, propagate chattr flags
such as the immutable, append-only, nodump, etc. into the files in the
destination file system. Flags in directory inodes are also propagated.
Theodore Ts'o [Sun, 25 May 2025 21:38:49 +0000 (17:38 -0400)]
libext2fs: fix ext2fs_link() for EXT2FS_LINK_APPEND and non-regular files
Fix the incorrect flag being passed to ext2fs_process_dir_block().
This bug was masked because EXT2_FT_REG_FILE has the same code point
as DIRENT_FLAG_INCLUDE_EMPTY which was the flag that was needed and
mke2fs -d was only using ext2fs_link() for regular files.
Fixes: 53aa6c54224f ("libext2fs: add the EXT2FS_LINK_APPEND flag ...)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
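To illustrate how a shared code point can mask this kind of bug, here is a small sketch. The three constants are reproduced from memory of the e2fsprogs headers and should be checked against them; `wants_empty_entries()` is a hypothetical stand-in for a function that expects DIRENT_FLAG_* bits.

```c
/* Values believed to match the e2fsprogs headers. */
#define EXT2_FT_REG_FILE		1	/* directory entry file type */
#define EXT2_FT_DIR			2	/* directory entry file type */
#define DIRENT_FLAG_INCLUDE_EMPTY	1	/* directory iterator flag */

/* Hypothetical callee that interprets its argument as DIRENT_FLAG_* bits. */
static int wants_empty_entries(int flags)
{
	return (flags & DIRENT_FLAG_INCLUDE_EMPTY) != 0;
}
```

Passing EXT2_FT_REG_FILE by mistake happens to set the right bit, while EXT2_FT_DIR does not, so the error only shows up once non-regular files are linked.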
Theodore Ts'o [Sun, 25 May 2025 04:39:13 +0000 (00:39 -0400)]
Add support for a new flag (EXT2FS_LINK_EXPAND) for ext2fs_link()
Many callers of ext2fs_link() check for EXT2_ET_DIR_NO_SPACE and if so,
calls ext2fs_expand_dir() and then retries the ext2fs_link(). We can
simplify a lot of code by adding support for a flag which does the
retry into the ext2fs_link() function.
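The pattern can be sketched with a toy directory model. Everything below (`toy_dir`, `toy_link()`, `LINK_EXPAND`, `ERR_NO_SPACE`) is a hypothetical stand-in for the real libext2fs types and constants, and only shows the expand-and-retry step moving inside the link call.

```c
#include <stddef.h>

#define LINK_EXPAND	0x1	/* hypothetical stand-in for EXT2FS_LINK_EXPAND */
#define ERR_NO_SPACE	1	/* hypothetical stand-in for EXT2_ET_DIR_NO_SPACE */

struct toy_dir {
	size_t used;		/* entries in use */
	size_t capacity;	/* entries that fit in the allocated blocks */
};

/* Try to add one entry; fails if the directory is full. */
static int toy_link_once(struct toy_dir *dir)
{
	if (dir->used >= dir->capacity)
		return ERR_NO_SPACE;
	dir->used++;
	return 0;
}

/* Grow the directory by one (toy) block of four entries. */
static int toy_expand_dir(struct toy_dir *dir)
{
	dir->capacity += 4;
	return 0;
}

/* With LINK_EXPAND set, the expand-and-retry happens inside the call. */
static int toy_link(struct toy_dir *dir, int flags)
{
	int err = toy_link_once(dir);

	if (err != ERR_NO_SPACE || !(flags & LINK_EXPAND))
		return err;
	err = toy_expand_dir(dir);
	if (err)
		return err;
	return toy_link_once(dir);
}
```

Callers that used to open-code the retry can then pass the flag and drop their own error check.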
Similar to 64-bit support, fs-verity support requires extents, so don't
allow creating a filesystem that has -O verity unless it also supports
extents.
When creating a filesystem with `mke2fs -O verity` and populating
content via `-d`, check if that content is fs-verity enabled, and if it
is, copy the fs-verity metadata from the host-native filesystem into the
created filesystem.
When writing data to an inode (with mke2fs -d) we need to do the typical
loop to handle partial writes to make sure all of the data gets written.
Move that code to its own function. This function also takes an offset
parameter, which makes it feel a bit like pwrite() (except that it does
modify the file offset).
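A rough sketch of such a loop, using POSIX write() on a file descriptor as a stand-in for whatever call actually writes into the inode. The name `write_all_at()` and its return convention are assumptions, not the function added by this change.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write all of buf starting at the given offset, looping over partial
 * writes.  Unlike pwrite(), this moves the file offset.
 * Returns 0 on success or a negative errno value.
 */
static int write_all_at(int fd, const char *buf, size_t len, off_t offset)
{
	if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
		return -errno;
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -errno;
		}
		if (n == 0)
			return -EIO;		/* no progress, give up */
		buf += n;
		len -= (size_t) n;
	}
	return 0;
}
```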
Right now we jump to the end as soon as we've found a method that works.
This is a reasonable approach because it's the last operation in the
function, but soon it won't be. Switch to a logically-equivalent
alternative approach: keep trying until we find the approach that works,
dropping the `goto out`. Now we can add code after this.
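One way to picture the control flow, as a sketch only: `try_methods()` and the two sample methods are hypothetical, and the real code may well use a chain of `if` statements instead of a loop. The point is that control falls out of the bottom once a method succeeds, so more code can follow.

```c
#include <stddef.h>

typedef int (*method_fn)(void);

/* Sample methods for illustration: one that fails, one that works. */
static int method_fails(void) { return -1; }
static int method_works(void) { return 0; }

/* Try each method in order; stop at the first one that returns 0. */
static int try_methods(const method_fn *methods, size_t count, int *which)
{
	int err = -1;	/* nothing has worked yet */
	size_t i;

	for (i = 0; i < count && err != 0; i++) {
		err = methods[i]();
		if (err == 0)
			*which = (int) i;
	}
	/* No early jump to the end, so later steps can be added here. */
	return err;
}
```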
Darrick J. Wong [Wed, 21 May 2025 22:42:30 +0000 (15:42 -0700)]
fuse2fs: fix group membership checking in op_chmod
In the decade or so since I touched fuse2fs, libfuse3 has grown the
ability to read the group ids of a process making a chmod request. So
now we actually /can/ determine if a file's gid is in the group list
of the process that initiated a fuse request. Let's implement that too.
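The membership test itself is simple once the group list is available. A minimal sketch, assuming the caller has already obtained the requesting process's supplementary groups; `gid_in_list()` is a hypothetical helper name.

```c
#include <stddef.h>
#include <sys/types.h>

/* Return 1 if gid appears in the caller's group list, else 0. */
static int gid_in_list(gid_t gid, const gid_t *groups, size_t ngroups)
{
	size_t i;

	for (i = 0; i < ngroups; i++)
		if (groups[i] == gid)
			return 1;
	return 0;
}
```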
Darrick J. Wong [Wed, 21 May 2025 22:41:43 +0000 (15:41 -0700)]
fuse2fs: fix post-EOF preallocation clearing on truncation
generic/092 shows that truncating a file to its current size does not
clean out post-eof preallocations like the kernel does. Adopt the
kernel's behavior for consistency.
Darrick J. Wong [Wed, 21 May 2025 22:41:26 +0000 (15:41 -0700)]
fuse2fs: fix removing ea inodes when freeing a file
If the filesystem has ea_inode set, then each file that has xattrs might
have stored an xattr value in a separate inode. These inodes also need
to be freed, so create a library function to do that, and call it from
the fuse2fs unlink method. Seen by ext4/026.
Darrick J. Wong [Wed, 21 May 2025 22:41:11 +0000 (15:41 -0700)]
fuse2fs: fix return value handling
For the xattr functions, don't obliterate the return value of the file
system operation with an error code coming from ext2fs_xattrs_close
failing. Granted, it doesn't ever fail (right now!) so this is mostly
just preening.
Also fix the obsolete op_truncate not to squash error returns.
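The rule being applied can be written as a one-line helper. This is a sketch with a hypothetical name, not the code in fuse2fs: the result of the operation wins, and the cleanup error is only reported if the operation itself succeeded.

```c
/*
 * Keep the operation's result; report the close error only when the
 * operation itself returned success.
 */
static int combine_results(int op_ret, int close_ret)
{
	return op_ret ? op_ret : close_ret;
}
```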
Darrick J. Wong [Wed, 21 May 2025 22:40:23 +0000 (15:40 -0700)]
fuse2fs: decode fuse_main error codes
Translate the fuse_main return values into actual mount(8) style error
codes instead of returning 0 all the time, and print something to the
original stderr if something went wrong so that the user will know what
to do next.
Darrick J. Wong [Wed, 21 May 2025 22:37:00 +0000 (15:37 -0700)]
fuse2fs: flip parameter order in __translate_error
Flip the parameter order in __translate_error so that it matches
translate_error. I wasted too much time debugging a memory corruption
that happened because I converted translate_error to __translate_error
when developing the next patch and the compiler didn't warn me about
mismatched types.
Darrick J. Wong [Wed, 21 May 2025 22:36:44 +0000 (15:36 -0700)]
fuse2fs: fix error return handling in op_truncate
Fix a couple of bugs with the errcode/ret handling in op_truncate.
First, we need to return ESTALE for a zero inumber because there is no
inode zero in an ext* filesystem. Second, we need to return negative
errno for failures to libfuse, not raw errcode_t.
Darrick J. Wong [Wed, 21 May 2025 22:36:13 +0000 (15:36 -0700)]
fuse2fs: compact all the boolean flags in struct fuse2fs
Compact all the booleans into u8 fields. I'd go further and turn them
into bitfields but that breaks the fuse argument parsing macros, which
compute the offset of the structure fields, and gcc won't let us do that
to bit fields. Still, 136 -> 112 bytes isn't bad.
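A sketch of why byte-sized fields still work with offset-based option tables while bit-fields do not. The struct, field names, and table below are hypothetical and are not the fuse2fs or libfuse definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option struct; field names are illustrative only. */
struct toy_opts {
	uint8_t ro;
	uint8_t debug;
	uint8_t no_default_opts;
	/*
	 * "unsigned int ro:1;" would be smaller still, but offsetof()
	 * cannot be applied to a bit-field, so a table like the one
	 * below could not refer to it.
	 */
};

/* Option-table entry in the style of offset-based argument parsers. */
struct toy_opt_entry {
	const char *name;
	size_t offset;
};

static const struct toy_opt_entry toy_table[] = {
	{ "ro",    offsetof(struct toy_opts, ro) },
	{ "debug", offsetof(struct toy_opts, debug) },
};

/* Set the flag that a table entry points at. */
static void toy_set(struct toy_opts *o, const struct toy_opt_entry *e)
{
	*((uint8_t *) o + e->offset) = 1;
}
```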
Darrick J. Wong [Wed, 21 May 2025 22:35:41 +0000 (15:35 -0700)]
fuse2fs: clean up error messages
Instead of horridly line-wrapping multi-line messages that are printed
during mounting, let's just expand them to be one source code line per
printed line. This will make it a lot easier for someone who sees
these errors to grep the source code to find out where they came from.
Darrick J. Wong [Wed, 21 May 2025 22:35:26 +0000 (15:35 -0700)]
libext2fs: fix livelock in the unix io manager
generic/441 found a livelock in the unix IO manager. Let's say that
write_primary_superblock decides to call io_channel_set_blksize in the
process of writing the primary super.
unix_set_blksize then takes the cache and bounce mutexes, and calls
flush_cached_blocks. If there are dirty blocks in the cache, they will
be written with raw_write_blk. Unfortunately, that function tries to
take the bounce mutex, which we already hold. At that point, we
livelock fuse2fs.
Darrick J. Wong [Wed, 21 May 2025 22:35:10 +0000 (15:35 -0700)]
libext2fs: fix unix io manager invalidation
flush_cached_blocks does not invalidate clean blocks from the block
cache. From reading all the call sites, it looks like they all actually
want the cache to be empty on successful return, so adjust the
implementation to do this.
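The intended behavior can be sketched on a toy cache. The struct and function below are hypothetical and much simpler than the unix IO manager's cache; they only show that clean entries are dropped along with the dirty ones once those have been written.

```c
#include <stddef.h>

struct toy_cache_entry {
	int in_use;	/* entry holds a cached block */
	int dirty;	/* block must be written before it is dropped */
};

/*
 * Write back dirty blocks, then drop every entry, clean or dirty.
 * Returns the number of blocks written.
 */
static int toy_flush_and_invalidate(struct toy_cache_entry *c, size_t n)
{
	int written = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!c[i].in_use)
			continue;
		if (c[i].dirty) {
			/* ... write the block to disk here ... */
			c[i].dirty = 0;
			written++;
		}
		c[i].in_use = 0;	/* invalidate even if it was clean */
	}
	return written;
}
```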
Theodore Ts'o [Fri, 23 May 2025 03:53:40 +0000 (23:53 -0400)]
fuse2fs: fix portability issues when compiling on MacOS
Fix a number of portability issues which resulted in fuse2fs failing
to build on MacOS.
*) MacOS doesn't have the timespec fields in struct stat; we have
an autoconf test to check for this, so use it.
*) The portable way to print off_t values is to use
printf("%jd", (intmax_t) d); The cast is necessary to avoid
type mismatch warnings.
*) Define FUSE_DARWIN_ENABLE_EXTENSIONS=0 to avoid using random
structs such as struct fuse_darwin_attr and struct fuse_darwin_fill_dir_t
in the fuse operation function prototypes.
With these fixes, fuse2fs successfully compiles and works with
MacFuse on macOS Sequoia.
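The off_t formatting idiom from the list above, as a small self-contained sketch. `format_offset()` is a hypothetical wrapper; the cast to intmax_t paired with "%jd" is the portable part.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Format an off_t portably: cast to intmax_t and print with %jd. */
static int format_offset(char *buf, size_t len, off_t off)
{
	return snprintf(buf, len, "%jd", (intmax_t) off);
}
```

This works whether off_t is 32 or 64 bits wide on the target platform.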
Theodore Ts'o [Tue, 20 May 2025 02:07:55 +0000 (22:07 -0400)]
libext2fs: add the EXT2FS_LINK_APPEND flag to ext2fs_link()
Add a flag which only tries to add the new directory entry to the last
block in the directory. This helps mke2fs -d avoid suffering
from an O(n**2) performance bottleneck when adding a large number of
files to a directory.
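A toy model of the cost difference, under stated assumptions: `ENTRIES_PER_BLOCK` is made up, and the model simply counts how many directory blocks are looked at per insert. It is not a measurement of libext2fs.

```c
#include <stddef.h>

#define ENTRIES_PER_BLOCK 4	/* hypothetical; real blocks hold many more */

/*
 * Count how many directory blocks are examined while adding nfiles
 * entries.  Without append mode every insert scans from the first
 * block; with append mode only the last block is examined.
 */
static unsigned long blocks_examined(size_t nfiles, int append)
{
	unsigned long visits = 0;
	size_t used = 0;	/* entries already in the directory */
	size_t i;

	for (i = 0; i < nfiles; i++) {
		size_t nblocks = used / ENTRIES_PER_BLOCK + 1;

		visits += append ? 1 : nblocks;
		used++;
	}
	return visits;
}
```

In this model the scan-from-start count grows quadratically with the number of files, while the append count grows linearly.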
Theodore Ts'o [Sun, 4 May 2025 18:07:14 +0000 (14:07 -0400)]
mke2fs: don't set the raid stripe for non-rotational devices by default
The ext4 block allocator is not at all efficient when it is asked to
enforce RAID alignment. It is especially bad for flash-based devices,
or when the file system is highly fragmented. For non-rotational
devices, it's fine to set the stride parameter (which controls
spreading the allocation bitmaps across the RAID component devices,
which always makes sense); but for the stripe parameter (which asks the
ext4 block allocator to try _very_ hard to find RAID stripe aligned
devices) it's probably not a good idea.
Add new mke2fs.conf parameters with the defaults:
	[defaults]
		set_raid_stride = always
		set_raid_stripe = disk
Even for RAID arrays based on HDD's, we can still have problems for
highly fragmented file systems. This will need to be solved in the
kernel, probably by having some kind of wall clock or CPU time
limitation for each block allocation or adding some kind of
optimization which is faster than using our current buddy bitmap
implementation, especially if the stripe size is not a multiple of a
power of two. But for SSD's, it's much less likely to make sense even
if we have an optimized block allocator, because if you've paid $$$
for a flash-based RAID array, the cost/benefit tradeoffs of doing less
optimized stripe RMW cycles versus the block allocator time and CPU
overhead is harder to justify without a lot of optimization effort.
If and when we can improve the ext4 kernel implementation (and it gets
rolled out to users using LTS kernels), we can change the defaults.
And of course, system administrators can always change
/etc/mke2fs.conf settings.
Theodore Ts'o [Wed, 21 May 2025 14:39:53 +0000 (10:39 -0400)]
libext2fs: rename fls() to find_last_bit_set()
In unix_io.c, rename fls() to fix a portability problem on MacOS, since
some systems may already define it in a system header file. The name
"find_last_bit_set" is a bit more self-descriptive anyway.
Darrick J. Wong [Thu, 24 Apr 2025 21:46:33 +0000 (14:46 -0700)]
fuse2fs: allow use of direct io for disk access
Allow users to ask for O_DIRECT for disk accesses so that block device
writes won't be throttled. This should improve latency, but will put
a lot more pressure on the disk cache.
Darrick J. Wong [Thu, 24 Apr 2025 21:45:28 +0000 (14:45 -0700)]
fuse2fs: delegate access control decisions to the kernel
In "kernel" mode (aka allow_other + default_permissions), the kernel
enforces all the access control for us. Therefore, we don't need to do
any checking of our own. Create a purpose-built helper to detect this
situation and turn off all the access controlling.
Darrick J. Wong [Thu, 24 Apr 2025 21:45:12 +0000 (14:45 -0700)]
fuse2fs: refactor sysadmin predicate
Refactor the code that decides if an access is being made by the
superuser into a helper, which we'll use to fix more permissions
problems in the next patch.
Darrick J. Wong [Thu, 24 Apr 2025 21:44:25 +0000 (14:44 -0700)]
fuse2fs: disable renameat2
Apparently fuse munged rename and renameat2 together into the same
upcall, so we actually have to filter out nonzero flags because
otherwise we do a regular rename for a RENAME_EXCHANGE/WHITEOUT, which
is not what the user asked for.
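A minimal sketch of the filtering, assuming a rename handler that receives the renameat2() flags as a third argument. `toy_rename()` and the choice of -EINVAL are assumptions, not the fuse2fs code.

```c
#include <errno.h>

/* Hypothetical rename handler; flags carries the renameat2() flags. */
static int toy_rename(const char *from, const char *to, unsigned int flags)
{
	(void) from;
	(void) to;
	if (flags != 0)
		return -EINVAL;	/* exchange/whiteout are not implemented */
	/* ... perform the plain rename here ... */
	return 0;
}
```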
Darrick J. Wong [Thu, 24 Apr 2025 21:43:39 +0000 (14:43 -0700)]
fuse2fs: support changing newer iflags
Redefine FUSE2FS_MODIFIABLE_IFLAGS so that userspace can modify any
flags that the kernel can, except for the ones that fuse2fs lacks the
ability to change.
Darrick J. Wong [Thu, 24 Apr 2025 21:42:52 +0000 (14:42 -0700)]
fuse2fs: clamp timestamps that are being written to disk
Clamp the timestamps that we write to disk to the minimum and maximum
values permitted given the ondisk format. This fixes y2038 support, as
tested by generic/402.
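A sketch of the clamping, with assumed limits: a signed 32-bit seconds field for small inodes, and the same field extended by two epoch bits for large inodes. The macro names and `clamp_timestamp()` are hypothetical, and the limits should be checked against the ext4 on-disk format documentation.

```c
#include <stdint.h>

/* Assumed limits: 32-bit signed seconds, optionally plus 2 epoch bits. */
#define TOY_TS_MIN	((int64_t) INT32_MIN)
#define TOY_TS_MAX	(((int64_t) 1 << 34) - 1 + INT32_MIN)

/* Clamp a seconds value to what the on-disk field can represent. */
static int64_t clamp_timestamp(int64_t sec, int large_inode)
{
	int64_t max = large_inode ? TOY_TS_MAX : (int64_t) INT32_MAX;

	if (sec < TOY_TS_MIN)
		return TOY_TS_MIN;
	if (sec > max)
		return max;
	return sec;
}
```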
Darrick J. Wong [Thu, 24 Apr 2025 21:42:20 +0000 (14:42 -0700)]
fuse2fs: add an easy option for emulating kernel access behaviors
By default, fuse doesn't allow processes with a uid/gid that don't match
those of the server process to access the fuse mount, it doesn't allow
suid files or devices, and it relies on the fuse server to perform
permissions checking. This is a secure default for very untrusted
filesystems, but it's possible that we might actually want to allow
general access to an ext4 filesystem as part of containerizing the ext4
metadata parsing. In other words, we want the kernel access control
behavior.
Add a "kernel" mount option that moves most of the access permissions
interpretation back into the kernel, and logs mount/unmount/error
messages to dmesg. Right now this is mostly useful for fstests, so we
leave it off by default.