Darrick J. Wong [Mon, 13 Oct 2025 23:34:24 +0000 (16:34 -0700)]
xfs_scrub_fail: reduce security lockdowns to avoid postfix problems
Iustin Pop reports that the xfs_scrub_fail service fails to email
problem reports on Debian when postfix is installed. This is apparently
due to several factors:
1. postfix's sendmail wrapper calling postdrop directly,
2. postdrop requiring the ability to write to the postdrop group,
3. lockdown preventing the xfs_scrub_fail@ service from having postdrop
in the supplemental group list or from running setgid programs
Item (3) could be solved by adding the whole service to the postdrop
group via SupplementalGroups=, but that will fail if postfix is not
installed and hence there is no postdrop group.
It could also be solved by forcing msmtp to be installed, bind mounting
msmtp into the service container, and injecting a config file that
instructs msmtp to connect to port 25, but that in turn isn't compatible
with systems not configured to allow an smtp server to listen on ::1.
So we'll go with the less restrictive approach that e2scrub_fail@ does,
which is to say that we just turn off all the sandboxing. :( :(
Reported-by: iustin@debian.org Cc: linux-xfs@vger.kernel.org # v6.10.0 Fixes: 9042fcc08eed6a ("xfs_scrub_fail: tighten up the security on the background systemd service") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
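The remapping idea described above can be sketched in userspace C. This is a minimal illustration only: remap_disk_error is a hypothetical name, not the function the patch adds, and the sketch assumes the caller already knows the error came from the I/O layer rather than from an attribute lookup.

```c
#include <errno.h>

/*
 * Hypothetical helper (not the real xfs function): a medium error that
 * the block layer reports as -ENODATA is converted to -EIO so that
 * xattr code never mistakes it for "attribute not found".
 */
static int remap_disk_error(int io_error)
{
	if (io_error == -ENODATA)
		return -EIO;
	return io_error;
}
```

Applying the mapping at the point where the buffer read returns keeps ENODATA/ENOATTR reserved for its one meaning in the xattr code.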
Replace the deprecated strncpy() with memtostr_pad(). This also avoids
the need for separate zeroing using memset(). Mark the sb_fname buffer
with __nonstring, as its size is XFSLABEL_MAX and so it is not
guaranteed to carry a terminating NUL.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
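As a rough userspace analogue of what a memtostr_pad()-style copy does, the sketch below copies a fixed-size, possibly unterminated label into a NUL-terminated, zero-padded buffer. LABEL_MAX and label_to_str_pad are illustrative names chosen here; LABEL_MAX is assumed to mirror XFSLABEL_MAX, and the kernel helper itself is not reproduced.

```c
#include <stddef.h>
#include <string.h>

#define LABEL_MAX 12	/* assumption: mirrors XFSLABEL_MAX */

/*
 * Copy a fixed-size label that may lack a terminating NUL into dst,
 * then zero-fill the rest of dst.  dst always ends up NUL-terminated,
 * so no separate memset() of dst is needed beforehand.
 */
static void label_to_str_pad(char *dst, size_t dst_size,
			     const char src[LABEL_MAX])
{
	size_t n = 0;

	if (dst_size == 0)
		return;
	while (n < LABEL_MAX && n < dst_size - 1 && src[n] != '\0')
		n++;
	memcpy(dst, src, n);
	memset(dst + n, 0, dst_size - n);
}
```

A destination of LABEL_MAX + 1 bytes is enough to hold a label that uses every byte of the source field.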
Split up the XFS_IS_CORRUPT statement so that it immediately shows
if the reference counter overflowed or underflowed.
I ran into this quite a bit when developing the zoned allocator, and had
to reapply the patch for some work recently. We might as well just apply
it upstream given that freeing a group is far removed from
performance-critical code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
xfs_trans_alloc_empty can't return errors, so return the allocated
transaction directly instead of an output double pointer argument.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Use cmp_int() to yield the result of a three-way-comparison instead of
performing subtractions with extra casts. Thus also rename the function
to make its name clearer in purpose.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
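To show why a three-way comparison is safer than a subtraction, here is a small sketch. The cmp_int macro follows the usual comparison-based definition; cmp_by_subtraction and cmp_three_way are hypothetical helpers written for this illustration and are not the code the patch removes or adds.

```c
#include <stdint.h>

/* Three-way comparison: yields -1, 0 or 1 without any subtraction. */
#define cmp_int(l, r)	(((l) > (r)) - ((l) < (r)))

/*
 * Subtraction style, for contrast only.  Narrowing the 64-bit
 * difference to int can drop the high bits, so two different keys may
 * compare as equal or with the wrong sign.
 */
static int cmp_by_subtraction(uint64_t a, uint64_t b)
{
	return (int)(int64_t)(a - b);
}

/* Comparison style: the result depends only on the ordering of a and b. */
static int cmp_three_way(uint64_t a, uint64_t b)
{
	return cmp_int(a, b);
}
```

With a = 1ULL << 32 and b = 0 the difference has no bits set in its low 32 bits, which is the kind of input where the subtraction form goes wrong while cmp_three_way still reports a > b.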
Perhaps that's just my silly imagination but 'diff' doesn't look good for
the name of a variable to hold a result of a three-way-comparison
(-1, 0, 1) which is what ->cmp_key_with_cur() does. It implies to contain
an actual difference between the two integer variables but that's not true
anymore after recent refactoring.
Declaring it as int64_t is also misleading now. Plain integer type is
more than enough.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_key_with_cur routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_two_keys routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
One may think that diff_two_keys routines are used to compute the actual
difference between the arguments but they return a result of a
three-way-comparison of the passed operands. So it looks more appropriate
to denote them as cmp_two_keys.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Darrick J. Wong [Sat, 11 Oct 2025 18:34:04 +0000 (11:34 -0700)]
mkfs: fix copy-paste error in calculate_rtgroup_geometry
Fix this copy-paste error -- we should calculate the rt volume
concurrency either if the user gave us an explicit option, or if they
didn't but the rt volume is an SSD.
Cc: linux-xfs@vger.kernel.org # v6.13.0 Fixes: 34738ff0ee80de ("mkfs: allow sizing realtime allocation groups for concurrency") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 19 Sep 2025 16:14:00 +0000 (09:14 -0700)]
xfs_scrub: fix strerror_r usage yet again
In commit 75faf2bc907584, someone tried to fix scrub to use the POSIX
version of strerror_r so that the build would work with musl.
Unfortunately, neither the author nor myself remembered that GNU libc
imposes its own version any time _GNU_SOURCE is defined, which
builddefs.in always does. Regrettably, the POSIX and GNU versions have
different return types and the GNU version can return any random
pointer, so now this code is broken on glibc.
"Fix" this standards body own goal by casting the return value to
intptr_t and employing some gross heuristics to guess at the location of
the actual error string.
Fixes: 75faf2bc907584 ("xfs_scrub: Use POSIX-conformant strerror_r") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: A. Wilcox <AWilcox@Wilcox-Tech.com>
Darrick J. Wong [Tue, 23 Sep 2025 17:10:27 +0000 (10:10 -0700)]
libfrog: pass mode to xfrog_file_setattr
xfs/633 crashes because rdump_fileattrs_path passes a NULL struct stat
pointer and the fallback code then dereferences it to get the file mode.
Instead, let's just pass the stat mode directly to it, because that's
the only piece of information that it needs.
Fixes: 128ac4dadbd633 ("xfs_db: use file_setattr to copy attributes on special files with rdump") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Darrick J. Wong [Tue, 23 Sep 2025 17:08:57 +0000 (10:08 -0700)]
mkfs: fix libxfs_iget return value sign inversion
libxfs functions return negative errno, so utilities must invert the
return values from such functions. Caught by xfs/437.
Fixes: 8a4ea72724930c ("proto: add ability to populate a filesystem from a directory") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
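The sign convention above can be illustrated with a tiny sketch. fake_libxfs_lookup is a hypothetical stand-in for a libxfs call such as libxfs_iget; only the negative-errno convention is taken from the commit message.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a libxfs call: returns 0 on success or a
 * negative errno on failure.
 */
static int fake_libxfs_lookup(int exists)
{
	return exists ? 0 : -ENOENT;
}

/*
 * Utility-side wrapper: flip the sign so the result can be handed to
 * strerror() or compared against positive errno constants.
 */
static int lookup_positive_errno(int exists)
{
	return -fake_libxfs_lookup(exists);
}
```

Forgetting the negation means strerror() receives a negative number and prints an unknown-error message instead of the real cause.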
A. Wilcox [Sat, 6 Sep 2025 08:12:07 +0000 (03:12 -0500)]
xfs_scrub: Use POSIX-conformant strerror_r
When building xfsprogs with musl libc, strerror_r returns int as
specified in POSIX. This differs from the glibc extension that returns
char*. Successful calls will return 0, which will be dereferenced as a
NULL pointer by (v)fprintf.
Signed-off-by: A. Wilcox <AWilcox@Wilcox-Tech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_db: use file_setattr to copy attributes on special files with rdump
rdump just skipped file attributes on special files as copying wasn't
possible. Let's use new file_getattr/file_setattr syscalls to copy
attributes even for special files.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
xfs_quota: utilize file_setattr to set prjid on special files
Utilize new file_getattr/file_setattr syscalls to set project ID on
special files. Previously, special files were skipped due to the lack of
a way to call the FS_IOC_SETFSXATTR ioctl on them. The quota accounting was
therefore missing these inodes (special files created before project
setup). The ones created after project initialization did inherit the
projid flag from the parent.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
libfrog: Define STATX__RESERVED if not provided by the system
This define is not provided by musl libc. Use the fallback that is
already provided if statx and its types (tested on STATX_TYPE) are
not defined in the general case.
This fixes one cause for failing to compile against musl libc.
Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Petr Vaněk <arkamar@gentoo.org>
configure: Base NEED_INTERNAL_STATX on libc headers first
At compile time the libc headers are preferred, and linux/stat.h is
only included if the libc headers don't provide a definition for statx
and its types (tested on STATX_TYPE). The configure test should be
based on the same logic.
This fixes one cause for failing to compile against musl libc.
Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Petr Vaněk <arkamar@gentoo.org>
Zhang Yi [Wed, 13 Aug 2025 02:42:50 +0000 (10:42 +0800)]
xfs_io: add FALLOC_FL_WRITE_ZEROES support
The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
fallocate(2). Add FALLOC_FL_WRITE_ZEROES support to the xfs_io tool by
introducing a new 'fwzero' command.
Christian Kujau [Tue, 26 Aug 2025 16:06:26 +0000 (18:06 +0200)]
xfsprogs: fix utcnow deprecation warning in xfs_scrub_all.py
Running xfs_scrub_all under Python 3.13.5 prints the following warning:
----------------------------------------------
$ /usr/sbin/xfs_scrub_all --auto-media-scan-stamp \
/var/lib/xfsprogs/xfs_scrub_all_media.stamp \
--auto-media-scan-interval 1d
/usr/sbin/xfs_scrub_all:489: DeprecationWarning:
datetime.datetime.utcnow() is deprecated and scheduled for removal in a
future version. Use timezone-aware objects to represent datetimes in UTC:
datetime.datetime.now(datetime.UTC).
dt = datetime.utcnow()
Automatically enabling file data scrub.
----------------------------------------------
Python documentation for context:
https://docs.python.org/3/library/datetime.html#datetime.datetime.utcnow
Fix this by using datetime.now() instead.
NB: Debian/13 ships Python 3.13.5 and has a xfs_scrub_all.timer active,
I'd assume that many systems will have that warning now in their logs :-)
Signed-off-by: Christian Kujau <lists@nerdbynature.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Carlos Maiolino [Tue, 26 Aug 2025 12:23:12 +0000 (14:23 +0200)]
Improve information about logbsize valid values
Valid values for logbsize depend on whether log_sunit is set
on the filesystem and on whether logbsize is set manually.
When manually set, logbsize must be one of the specified values:
32k to 256k inclusive, in power-of-two increments. The specified
value must also be a multiple of log_sunit.
The default configuration for v2 logs uses a relaxed restriction,
setting logbsize to log_sunit regardless of whether it is one of the
valid values, which implicitly ignores the power-of-two restriction.
Instead of changing valid possible values for logbsize, increasing the
testing matrix and allowing users to use some dubious configuration,
just update the man page to describe this difference in behavior between
manually setting logbsize and leaving it to defaults.
This was originally found by a user attempting to manually set
logbsize to the same value picked by the default configuration, only
to receive an error message as a result.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
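The rules for a manually specified logbsize can be written down as a small predicate. This is a sketch based only on the description above: the function name is hypothetical, the 32k and 256k bounds are taken from the commit text, and the real mount-time validation may differ in detail.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOGBSIZE_MIN	(32u * 1024)	/* bounds as stated in the text */
#define LOGBSIZE_MAX	(256u * 1024)

/*
 * True if a manually specified logbsize satisfies the rules described
 * above.  log_sunit == 0 means no log stripe unit is set.
 */
static bool logbsize_ok_when_manual(uint32_t logbsize, uint32_t log_sunit)
{
	if (logbsize < LOGBSIZE_MIN || logbsize > LOGBSIZE_MAX)
		return false;
	if (logbsize & (logbsize - 1))
		return false;		/* not a power of two */
	if (log_sunit && logbsize % log_sunit)
		return false;		/* not a multiple of log_sunit */
	return true;
}
```

A filesystem with a 96k log stripe unit shows the asymmetry: the default path would pick logbsize = 96k, yet passing 96k by hand fails this check because 96k is not a power of two.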
Luca Di Maio [Wed, 30 Jul 2025 16:12:22 +0000 (18:12 +0200)]
proto: add ability to populate a filesystem from a directory
This patch implements the functionality to populate a newly created XFS
filesystem directly from an existing directory structure.
It reuses the existing protofile logic, branching if the input is a
directory.
The population process steps are as follows:
- create the root inode before populating content
- recursively process nested directories
- handle regular files, directories, symlinks, char devices, block
devices, sockets, fifos
- preserve attributes (ownership, permissions)
- preserve mtime timestamps from source files to maintain file history
- use current time for atime/ctime/crtime
- possible to specify atime=1 to preserve atime timestamps from
source files
- preserve extended attributes and fsxattrs for all file types
- preserve hardlinks
At the moment, the implementation for the hardlink tracking is very
simple, as it involves a linear search.
From my local testing using larger source directories
(1.3mln inodes, ~400k hardlinks) the difference was actually
just a few seconds (given that most of the time is doing i/o).
We might want to revisit that in the future if this becomes a
bottleneck.
This functionality makes it easier to create populated filesystems
without having to mount them, which is particularly useful for
reproducible builds.
Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
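A linear-search hardlink tracker of the kind mentioned above could be sketched as follows. All names here are hypothetical and the layout is an assumption, not the structure used in proto.c; the sketch also assumes that 0 is never a valid destination inode number, so it can double as "not seen yet".

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One entry per source inode that has already been copied. */
struct hardlink_entry {
	uint64_t	src_ino;	/* inode number in the source tree */
	uint64_t	dst_ino;	/* inode created in the new filesystem */
};

struct hardlink_table {
	struct hardlink_entry	*entries;
	size_t			count;
	size_t			capacity;
};

/* Linear search: returns the destination inode, or 0 if not seen yet. */
static uint64_t hardlink_lookup(const struct hardlink_table *t,
				uint64_t src_ino)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->entries[i].src_ino == src_ino)
			return t->entries[i].dst_ino;
	return 0;
}

/* Record a mapping; returns 0 on success, -1 if memory runs out. */
static int hardlink_remember(struct hardlink_table *t, uint64_t src_ino,
			     uint64_t dst_ino)
{
	if (t->count == t->capacity) {
		size_t ncap = t->capacity ? t->capacity * 2 : 16;
		struct hardlink_entry *n;

		n = realloc(t->entries, ncap * sizeof(*n));
		if (!n)
			return -1;
		t->entries = n;
		t->capacity = ncap;
	}
	t->entries[t->count].src_ino = src_ino;
	t->entries[t->count].dst_ino = dst_ino;
	t->count++;
	return 0;
}
```

Each lookup costs O(n) in the number of tracked inodes, which is the part that a hash table could replace if it ever shows up as a bottleneck.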
xfs_log_recover.h is in fs/xfs/libxfs/ in the kernel tree, and thus the
libxfs-apply tool tries to apply changes to it in libxfs/ and fails
because the header is in include.
Move it to libxfs to make libxfs-apply work properly and to keep our
house in order.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
John Garry [Wed, 30 Jul 2025 10:13:20 +0000 (10:13 +0000)]
mkfs: require reflink for max_atomic_write option
For max_atomic_write option to be set, it means that the user wants to
support atomic writes up to that size.
However, to support this we must have reflink, so enforce that this is
available.
Signed-off-by: John Garry <john.g.garry@oracle.com> Suggested-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Darrick J. Wong [Tue, 29 Jul 2025 20:14:14 +0000 (13:14 -0700)]
misc: fix reversed calloc arguments
gcc 14 complains about reversed arguments to calloc:
namei.c: In function ‘path_parse’:
namei.c:51:32: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
51 | dirpath = calloc(sizeof(*dirpath), 1);
| ^
namei.c:51:32: note: earlier argument should specify number of elements, later size of each element
Fix all of these.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
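For reference, the argument order that the warning asks for is shown below. struct dirpath_example and alloc_dirpath_example are hypothetical stand-ins for the real type and caller in namei.c.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the structure allocated in namei.c. */
struct dirpath_example {
	int	depth;
	char	**path;
};

/* calloc takes the element count first and the element size second. */
static struct dirpath_example *alloc_dirpath_example(void)
{
	return calloc(1, sizeof(struct dirpath_example));
}
```

Both orders request the same number of bytes here, so the change is about matching the documented interface and silencing the warning rather than altering behavior.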
Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard. None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.
So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs. But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to be unnecessarily allocated, but
can also lead to memory leaks as the xg_busy_extents pointer in the
xfs_group structure is overlaid with the pointer for the linked list
of to-be-reset zones.
Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.
Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Signed-off-by: Christoph Hellwig <hch@lst.de>
[cem: Fix type and add stable tag] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:
Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. Writes
are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.
If this happens at the same time as memory pressure is occurring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.
Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.
This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.
Then other operations on that inode hang because the task was killed
whilst holding inode locks. e.g:
Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.
Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.
Fix this by replacing the asserts with a corruption detection check
and a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
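The shape of the fix, trading an assert for an error return, can be sketched in userspace as below. The structure, field names and function name are hypothetical; the only detail borrowed from the kernel is that EFSCORRUPTED is an alias for EUCLEAN.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EFSCORRUPTED
#define EFSCORRUPTED	EUCLEAN	/* the kernel aliases it the same way */
#endif

/* Hypothetical subset of the counters held both on disk and in memory. */
struct ag_counts {
	uint32_t	freeblks;
	uint32_t	longest;
};

/*
 * Compare freshly read on-disk counters with the cached in-memory
 * ones.  Instead of asserting, report corruption so that the caller
 * can shut the filesystem down cleanly.
 */
static int check_agf_matches_perag(const struct ag_counts *ondisk,
				   const struct ag_counts *incore)
{
	if (ondisk->freeblks != incore->freeblks ||
	    ondisk->longest != incore->longest)
		return -EFSCORRUPTED;
	return 0;
}
```

Returning an error lets the task release its locks on the way out, which is what avoids the hang described above.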
Darrick J. Wong [Tue, 1 Jul 2025 17:45:15 +0000 (10:45 -0700)]
xfs_scrub: remove EXPERIMENTAL warnings
The kernel code for online fsck has been stable for a year, and there
haven't been any major changes to the program in quite some time, so
let's drop the experimental warnings.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:14 +0000 (10:45 -0700)]
mkfs: allow users to configure the desired maximum atomic write size
Allow callers of mkfs.xfs to specify a desired maximum atomic write
size. This value will cause the log size to be adjusted to support
software atomic writes, and the AG size to be aligned to support
hardware atomic writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:14 +0000 (10:45 -0700)]
mkfs: don't complain about overly large auto-detected log stripe units
If mkfs declines to apply what it thinks is an overly large data device
stripe unit to the log device, it should only log a message about that
if the lsunit parameter was actually supplied by the caller. It should
not do that when the lsunit was autodetected from the block devices.
The cli parameters are zero-initialized in main and always have been.
Cc: <linux-xfs@vger.kernel.org> # v4.15.0 Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:13 +0000 (10:45 -0700)]
xfs_db: create an untorn_max subcommand
Create a debugger command to compute either the logres needed to
perform an untorn cow write completion for a given number of blocks; or
the number of blocks that can be completed given a log reservation.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write. If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.
The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size). We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.
The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has for the past few commits. Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com>
Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.
The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.
In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
Function xfs_atomic_write_logitems() is added to find the limit on the
number of log items which can fit in a single transaction.
Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.
[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
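The two power-of-two computations mentioned above can be sketched as follows. The helper names and the way they are combined in atomic_write_max_blocks are assumptions made for illustration; the kernel derives its limits from transaction reservations that are not modelled here.

```c
#include <stdint.h>

/* Largest power of two that divides x (0 if x == 0). */
static uint32_t max_pow2_factor(uint32_t x)
{
	return x & (~x + 1);	/* isolate the lowest set bit */
}

/* Largest power of two less than or equal to x (0 if x == 0). */
static uint32_t rounddown_pow2(uint32_t x)
{
	uint32_t p = 1;

	if (x == 0)
		return 0;
	while (p <= x / 2)
		p *= 2;
	return p;
}

/*
 * Hypothetical combination: round the log-item limit down to a power
 * of two, then cap it by the power-of-two factor of the AG size.
 */
static uint32_t atomic_write_max_blocks(uint32_t agsize,
					uint32_t logitem_limit)
{
	uint32_t align = max_pow2_factor(agsize);
	uint32_t limit = rounddown_pow2(logitem_limit);

	return limit < align ? limit : align;
}
```

For an AG of 48 blocks the largest power-of-two factor is 16, so a 16-block allocation can always start on a 16-block boundary within the AG even if the log could handle more.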
Darrick J. Wong [Tue, 1 Jul 2025 17:45:12 +0000 (10:45 -0700)]
libxfs: add helpers to compute log item overhead
Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.
For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.
Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extent unmaps, which is quite
small.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
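The figure of 16 extents follows from simple arithmetic if one assumes a 4096-byte filesystem block, which the message does not state explicitly. A hypothetical helper makes the worst case (every block in its own extent) explicit:

```c
#include <stdint.h>

/*
 * Worst-case number of extents touched by a write: one per filesystem
 * block, rounding up for a partial trailing block.
 */
static uint32_t worst_case_extents(uint32_t write_bytes,
				   uint32_t block_bytes)
{
	return (write_bytes + block_bytes - 1) / block_bytes;
}
```

With write_bytes = 65536 and block_bytes = 4096 this gives 65536 / 4096 = 16.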
Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.
This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com>
The way mdrestore works is not very amenable to zoned devices. The code
that checks the device size tries to write to the highest offset, which
doesn't match the write pointer of a clean zone device. And while that
is relatively easily fixable, the metadata for each RTG records the
highest written offset, and the mount code compares that to the hardware
write pointer, which will mismatch. This could be fixed by using write
zeroes to pad the RTG until the expected write pointer, but this turns
the quick metadata operation that mdrestore is supposed to be into
something that could take hours on HDD.
So instead error out when someone tries to mdrestore onto a zoned device
to clearly document that this won't work. Doing a mdrestore into a file
still works perfectly fine, and we might look into a new mdrestore option
to restore into a set of files suitable for the zoned loop device driver
to make mdrestore fully usable for debugging.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Luca Di Maio [Wed, 16 Apr 2025 21:20:04 +0000 (23:20 +0200)]
xfs_protofile: fix permission octet when suid/guid is set
When encountering suid or sgid files, we already set the `u` or `g`
property in the prototype file.
Given that proto.c only supports three numbers for permissions, we
need to remove the redundant information from the permission, else
it was incorrectly parsed.
Co-authored-by: Luca Di Maio <luca.dimaio1@gmail.com> Co-authored-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
[aalbersh: 68 chars limit and removed patch review revisions]
Handle the synthetic fmr_device values, and deal with the fact that
ctx->fsinfo.fs_rt is allowed to be non-NULL for internal RT devices as
it is the same as the data device in this case.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_mkfs: default to rtinherit=1 for zoned file systems
Zoned file systems are intended to use sequential write required zones
(or areas treated as such) for the main data store, and usually use the
data device only for metadata that requires random writes.
rtinherit=1 is the way to achieve that, so enable it by default, but
still allow the user to override it if needed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[aalbersh remove accidental "inherit" word in commit desc.]
xfs_mkfs: support creating file system with zoned RT devices
To create file systems with a zoned RT device, query the hardware
zone information to align the RT groups to it, and create an internal
RT device if the device has conventional and sequential write required
zones.
Default to using all sequential write required zones for the RT device
if there are sequential write required zones.
Default to 256 and 1% conventional when -r zoned is specified without
further option and there are no sequential write required zones. This
mimics a SMR HDD and works well with tests.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_repair: validate rt groups vs reported hardware zones
Run a report zones ioctl, and verify the rt group state vs the
reported hardware zone state. Note that there is no way to actually
fix up any discrepancies here, as that would be rather scary without
having transactions.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Not really much to do here. Mostly ignore the validation and
regeneration of the bitmap and summary inodes. Eventually this
could grow a bit of validation of the hardware zone state.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Zoned devices can have gaps between the usable capacity of a zone and
the end of the zone in the LBA/daddr address space. In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us. In this case the sparse FSB/RTB address space maps 1:1
to the device address space.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Enable the zoned RT device directory feature. With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks. This perfectly maps to zoned devices, but can also be
used on conventional block devices.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs. To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.
The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
RT groups on a zoned file system need to be completely empty before their
space can be reused. This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.
Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied. To find a zone to empty, a simple set of 10 buckets
based on the amount of space used in each zone is used. To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.
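The bucket scheme described above can be sketched roughly as follows. This
is a minimal illustration, not the kernel implementation: gc_bucket() and
gc_pick_victim() are hypothetical names, zones are reduced to a plain array
of used block counts, and all zones are assumed to share one capacity.

```c
#include <assert.h>
#include <stdint.h>

#define GC_NR_BUCKETS 10 /* bucket count taken from the commit message */

/*
 * Hypothetical helper: map a zone's used block count to one of
 * GC_NR_BUCKETS buckets. Bucket 0 holds the emptiest zones.
 */
unsigned int gc_bucket(uint64_t used, uint64_t capacity)
{
	assert(capacity > 0 && used <= capacity);
	/* A completely full zone would index one past the end; clamp it. */
	if (used == capacity)
		return GC_NR_BUCKETS - 1;
	return (unsigned int)(used * GC_NR_BUCKETS / capacity);
}

/*
 * Hypothetical victim picker: return the index of the first zone in the
 * lowest populated bucket, or -1 if no zone holds data. Zones with
 * used == 0 are skipped because they only need a reset, not a copy.
 */
int gc_pick_victim(const uint64_t *used, int nr_zones, uint64_t capacity)
{
	int best = -1;
	unsigned int best_bucket = GC_NR_BUCKETS;

	for (int i = 0; i < nr_zones; i++) {
		unsigned int b;

		if (used[i] == 0)
			continue;
		b = gc_bucket(used[i], capacity);
		if (b < best_bucket) {
			best_bucket = b;
			best = i;
		}
	}
	return best;
}
```

Bucketing trades precision for speed: the picker does not need the single
emptiest zone, only one that is cheap enough to move.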
To automatically defragment files the rmap records are sorted by inode
and logical offset. This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O. Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it changed.
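A rough sketch of that completion-time check, under simplifying assumptions:
struct gc_mapping and both functions are hypothetical, and a mapping is
reduced to three integers. The point is only the compare-then-update shape
that replaces holding the iolock across the I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one file mapping. */
struct gc_mapping {
	uint64_t file_offset;	/* logical offset in the file, in blocks */
	uint64_t start_block;	/* physical start block */
	uint64_t len;		/* length in blocks */
};

/* Compare the mapping recorded when GC started with the current one. */
bool gc_mapping_unchanged(const struct gc_mapping *recorded,
			  const struct gc_mapping *current)
{
	return recorded->file_offset == current->file_offset &&
	       recorded->start_block == current->start_block &&
	       recorded->len == current->len;
}

/*
 * Hypothetical completion step: point the file at the relocated copy only
 * if nobody changed the mapping while the GC I/O was in flight.
 * Returns true if the mapping was updated.
 */
bool gc_end_io(struct gc_mapping *current, const struct gc_mapping *recorded,
	       uint64_t new_start_block)
{
	assert(current && recorded);
	if (!gc_mapping_unchanged(recorded, current))
		return false;	/* stale copy: leave the newer mapping alone */
	current->start_block = new_start_block;
	return true;
}
```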
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers. This means waiting for garbage collection with the iolock can
deadlock.
To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection. The new helpers try to take these available blocks, and
if there aren't enough available it wakes and waits for GC. This is
done using a list of on-stack reservations to ensure fairness.
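The reserve-before-locking idea can be illustrated with a single-threaded
sketch. struct avail_counter, try_reserve() and give_back() are hypothetical
names; the wake-and-wait path and the on-stack reservation list that
provides fairness are deliberately not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the counter of blocks writable without GC. */
struct avail_counter {
	int64_t available;
};

/*
 * Try to take nr_blocks from the counter. On failure nothing is taken;
 * a real caller would wake GC and wait at this point, before it holds
 * the iolock.
 */
bool try_reserve(struct avail_counter *c, int64_t nr_blocks)
{
	assert(nr_blocks >= 0);
	if (c->available < nr_blocks)
		return false;
	c->available -= nr_blocks;
	return true;
}

/* Return blocks: an unused reservation or space reclaimed by GC. */
void give_back(struct avail_counter *c, int64_t nr_blocks)
{
	assert(nr_blocks >= 0);
	c->available += nr_blocks;
}
```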
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block, and is only recorded on I/O
completion. The actual allocation algorithm is very simple and just
involves picking a good zone, preferably the one used for the last write
to the inode. As the number of zones that can be written at the same time
is usually limited by the hardware, selecting a zone is done as late as
possible, from the iomap dio and buffered writeback bio submission
helpers just before submitting the bio.
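A minimal sketch of such a zone choice, assuming a small array of open
zones. struct open_zone and pick_zone() are hypothetical, and the fallback
policy (first open zone with room) is an assumption made for illustration;
the commit only states that the zone of the last write is preferred.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical open zone descriptor. */
struct open_zone {
	uint64_t write_pointer;	/* next block to be written */
	uint64_t capacity;	/* usable blocks in the zone */
};

uint64_t zone_space_left(const struct open_zone *z)
{
	assert(z->write_pointer <= z->capacity);
	return z->capacity - z->write_pointer;
}

/*
 * Pick a zone for a write of len blocks. Prefer last_zone, the zone used
 * by the inode's previous write (-1 if there is none); otherwise fall
 * back to the first open zone with enough room. Returns the zone index,
 * or -1 if no open zone can take the write.
 */
int pick_zone(const struct open_zone *zones, int nr_open, int last_zone,
	      uint64_t len)
{
	if (last_zone >= 0 && last_zone < nr_open &&
	    zone_space_left(&zones[last_zone]) >= len)
		return last_zone;
	for (int i = 0; i < nr_open; i++)
		if (zone_space_left(&zones[i]) >= len)
			return i;
	return -1;
}
```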
Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available. A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup. Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.
Allocations are only recorded at I/O completion time. The scheme used
for that is very similar to the reflink COW end I/O path.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Add support to validate and parse reported hardware zone state.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information. Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow creating an RT subvolume on the same device as the main data
device. This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data). There are a few minor but important changes, which are indicated
by a new incompat flag:
1) there are no bitmap and summary inodes, thus the
/rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
sb_rbmblocks superblock field must be cleared to zero.
2) there is a new superblock field that specifies the start of an
internal RT section. This allows supporting SMR HDDs that have random
writable space at the beginning which is used for the XFS data device
(which really is the metadata device for this configuration), directly
followed by a RT device on the same block device. While something
similar could be achieved using dm-linear, just having a single device
directly consumed by XFS makes handling the file systems a lot easier.
3) Another superblock field that tracks the amount of reserved space (or
overprovisioning) that is never used for user capacity, but allows GC
to run more smoothly.
4) an overlay of the cowextsize field for the rtrmap inode so that we
can persistently track the total amount of rtblocks currently used in
a RT group. There is no data structure other than the rmap that
tracks used space in an RT group, and this counter is used to decide
when a RT group has been entirely emptied, and to select one that
is relatively empty if garbage collection needs to be performed.
While this counter could be tracked entirely in memory and rebuilt
from the rmap at mount time, that would lead to very long mount times
with the large number of RT groups implied by the number of hardware
zones especially on SMR hard drives with 256MB zone sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a helper to find the last offset mapped in the rtrmap. This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs. Move it to xfs_iomap.c
next to its two callers.
Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen. Reserve at most 1/4th of the data device blocks, which is
still a heuristic.
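The cap amounts to taking the smaller of the worst-case estimate and a
quarter of the data device. A sketch, with capped_reservation() as a
hypothetical name and both inputs given in blocks:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: limit a worst-case reservation to 1/4 of the device. */
uint64_t capped_reservation(uint64_t worst_case, uint64_t data_blocks)
{
	uint64_t cap = data_blocks / 4;
	uint64_t resv = worst_case < cap ? worst_case : cap;

	/* The result never exceeds a quarter of the data device. */
	assert(resv <= cap);
	return resv;
}
```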
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently each metabtree inode has its own space reservation to ensure
it can be expanded to the maximum size, mirroring what is done for the
AG-based btrees. But unlike the AG-based btrees the metabtree inodes
aren't restricted to allocate from a single AG but can use free space
from the entire file system. And unlike AG-based btrees where the
required reservation shrinks with the available free space due to this,
the metabtree reservations for the rtrmap and rtreflink trees are not
bound in any way by the data device free space as they track RT extent
allocations. This is not very efficient as it requires a large number
of blocks to be set aside that can't be used at all by other btrees.
Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space, which now also removes the
overloading of the i_nblocks field for metabtree inodes, which would
create problems if metabtree inodes ever had a big enough xattr fork
to require xattr blocks outside the inode.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.
Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters. This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code. Both these additions will be needed for the zoned
allocator.
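The shape of such an abstraction might look like the sketch below. All
names are hypothetical, and plain integers stand in for the percpu
counters; the commit itself only says that an enum designates the counter
and that wrappers hide the underlying storage.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter identifiers. */
enum free_counter {
	FC_BLOCKS,	/* free data device blocks */
	FC_RTEXTENTS,	/* free RT extents */
	FC_NR,
};

struct mount_counters {
	int64_t counters[FC_NR];	/* stand-ins for percpu counters */
};

int64_t freecounter_read(const struct mount_counters *m,
			 enum free_counter which)
{
	assert(which < FC_NR);
	return m->counters[which];
}

void freecounter_add(struct mount_counters *m, enum free_counter which,
		     uint64_t delta)
{
	assert(which < FC_NR);
	m->counters[which] += (int64_t)delta;
}

/* Returns 0 on success, -1 (standing in for ENOSPC) if too little is free. */
int freecounter_dec(struct mount_counters *m, enum free_counter which,
		    uint64_t delta)
{
	assert(which < FC_NR);
	if (m->counters[which] < (int64_t)delta)
		return -1;
	m->counters[which] -= (int64_t)delta;
	return 0;
}
```

Adding another counter then only means adding an enum value before FC_NR,
which is the property the commit message relies on.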
Also switch the flooring of the frextents counter to 0 in statfs for the
rtinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Bill O'Donnell [Tue, 15 Apr 2025 18:48:49 +0000 (13:48 -0500)]
xfs_repair: phase6: scan longform entries before header check
In longform_dir2_entry_check, if check_dir3_header() fails for v5
metadata, we immediately go to out_fix: and try to rebuild the
directory via longform_dir2_rebuild. But because we haven't yet
called longform_dir2_entry_check_data, the *hashtab used to rebuild
the directory is empty, which results in all existing entries
getting moved to lost+found, and an empty rebuilt directory. On top
of that, the empty directory is now short form, so its nlinks come
out wrong and this requires another repair run to fix.
Scan the entries before checking the header, so that we have a
decent chance of properly rebuilding the dir if the header is
corrupt, rather than orphaning all the entries and moving them to
lost+found.
Suggested-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
[aalbersh updated changelog as suggested by Eric Sandeen] Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Eric Sandeen [Tue, 15 Apr 2025 18:09:23 +0000 (13:09 -0500)]
xfs_repair: Bump link count if longform_dir2_rebuild yields shortform dir
If longform_dir2_rebuild() has so few entries in *hashtab that it results
in a short form directory, bump the link count manually as shortform
directories have no explicit "." entry.
Without this, repair will end with e.g.:
resetting inode 131 nlinks from 2 to 1
in this case, because it thinks this directory inode only has 1 link
discovered, and then a 2nd repair will fix it:
resetting inode 131 nlinks from 1 to 2
because shortform_dir2_entry_check() explicitly adds the extra ref when
the (newly-created) shortform directory is checked:
/*
* no '.' entry in shortform dirs, just bump up ref count by 1
* '..' was already (or will be) accounted for and checked when
* the directory is reached or will be taken care of when the
* directory is moved to orphanage.
*/
add_inode_ref(current_irec, current_ino_offset);
Avoid this by adding the extra ref if we convert from longform to
shortform.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[aalbersh drop SoB with user.mail as name] Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
mkfs: fix the issue of maxpct set to 0 not taking effect
If a filesystem has the sb_imax_pct field set to zero, there is no limit to the number of inode blocks in the filesystem.
However, when using mkfs.xfs and specifying maxpct = 0, the result is not as expected.
[root@fs ~]# mkfs.xfs -f -i maxpct=0 xfs.img
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=0 swidth=0 blks
The reason is that the condition will never succeed when specifying maxpct = 0. As a result, the default algorithm was applied.
cfg->imaxpct = cli->imaxpct;
if (cfg->imaxpct)
return;
The result with patch:
[root@fs ~]# mkfs.xfs -f -i maxpct=0 xfs.img
data = bsize=4096 blocks=262144, imaxpct=0
= sunit=0 swidth=0 blks
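The failure mode can be reproduced in isolation. The sketch below contrasts
the truthiness test with one possible remedy, a sentinel for "option not
given". IMAXPCT_UNSET, default_imaxpct() and both resolve functions are
hypothetical; this is not necessarily how the patch itself is written.

```c
#include <assert.h>

#define IMAXPCT_UNSET (-1)	/* hypothetical sentinel: option not given */

/* Stand-in for the default heuristic; 25 matches the output above. */
int default_imaxpct(void)
{
	return 25;
}

/* Buggy shape: an explicit 0 is indistinguishable from "not specified". */
int resolve_imaxpct_buggy(int cli_imaxpct)
{
	if (cli_imaxpct)
		return cli_imaxpct;
	return default_imaxpct();
}

/* Sentinel-based shape: any value >= 0 is an explicit user choice. */
int resolve_imaxpct_fixed(int cli_imaxpct)
{
	assert(cli_imaxpct >= IMAXPCT_UNSET);
	if (cli_imaxpct >= 0)
		return cli_imaxpct;
	return default_imaxpct();
}
```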
Darrick J. Wong [Thu, 24 Apr 2025 21:53:55 +0000 (14:53 -0700)]
mkfs: fix blkid probe API violations causing weird output
The blkid_do_fullprobe function in libblkid 2.38.1 will try to read the
last 512 bytes off the end of a block device. If the block device has a
2k LBA size, that read will fail. blkid_do_fullprobe passes the -EIO
back to the caller (mkfs) even though the API documentation says it
only returns 1, 0, or -1.
Change the "cannot detect existing fs" logic to look for any negative
number. Otherwise, you get unhelpful output like this:
$ mkfs.xfs -l size=32m -b size=4096 /dev/loop3
mkfs.xfs: Use the -f option to force overwrite.
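The difference between trusting the documented return values and treating
every negative number as an error can be sketched as follows. The enum and
both functions are hypothetical and do not call libblkid; rc stands for
whatever blkid_do_fullprobe handed back.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum probe_result { PROBE_FOUND, PROBE_NOTHING, PROBE_ERROR };

/* Strict check that only trusts the documented -1; it misses -EIO. */
bool probe_is_error_strict(int rc)
{
	return rc == -1;
}

/* Defensive classification: any negative value counts as an error. */
enum probe_result classify_probe(int rc)
{
	if (rc < 0)
		return PROBE_ERROR;
	if (rc == 0)
		return PROBE_FOUND;
	return PROBE_NOTHING;
}

/* Usage: an out-of-contract -EIO is still reported as a probe failure. */
void probe_example(void)
{
	assert(!probe_is_error_strict(-EIO));
	assert(classify_probe(-EIO) == PROBE_ERROR);
	assert(classify_probe(-1) == PROBE_ERROR);
}
```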
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 24 Apr 2025 21:53:23 +0000 (14:53 -0700)]
xfs_io: redefine what statx -m all does
As of kernel commit 581701b7efd60b ("uapi: deprecate STATX_ALL"),
STATX_ALL is deprecated and has been withdrawn from the kernel codebase.
The symbol still exists for userspace to avoid compilation breakage, but
we're all supposed to stop using it.
Therefore, redefine statx -m all to set all the bits except for the
reserved bit since it's pretty silly that "all" doesn't actually get you
all the fields.
Update the STATX_ALL definition in io/statx.h so people stop using it.
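As a sketch of what "all the bits except for the reserved bit" means for a
32-bit statx mask: EXAMPLE_STATX_RESERVED mirrors the value of
STATX__RESERVED from the kernel UAPI header, and statx_all_mask() is a
hypothetical helper, not the code in io/statx.h.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as STATX__RESERVED in <linux/stat.h>. */
#define EXAMPLE_STATX_RESERVED 0x80000000U

/* Hypothetical helper: the mask that "-m all" would request. */
uint32_t statx_all_mask(void)
{
	uint32_t mask = ~(uint32_t)0 & ~EXAMPLE_STATX_RESERVED;

	/* The reserved bit must never be part of a request mask. */
	assert((mask & EXAMPLE_STATX_RESERVED) == 0);
	return mask;
}
```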
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Thu, 24 Apr 2025 21:53:08 +0000 (14:53 -0700)]
xfs_io: catch statx fields up to 6.15
Add all the new statx fields and flags that have accumulated for the
past couple of years so they all print now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>