The code has been stubbed out since the initial creation of the
xfsprogs repository. Open code the single-line printf in the
data fork caller (attr forks can't contain directories) and remove
the dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Darrick J. Wong [Fri, 21 Nov 2025 16:39:37 +0000 (08:39 -0800)]
xfs_scrub: fix null pointer crash in scrub_render_ino_descr
Starting in Debian 13's libc6, passing a NULL format string to vsnprintf
causes the program to segfault. Prior to this, the null format string
would be ignored. Because @format is optional, let's explicitly steer
around the vsnprintf if there is no format string. Also tidy whitespace
in the comment.
Found by generic/45[34] on Debian 13.
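The guard described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: render_descr() and its parameters are hypothetical names, and the only point being shown is that vsnprintf() is never reached when the optional format string is NULL.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "prefix" into buf, then append an optional
 * printf-style suffix.  A NULL format is legal and appends nothing.
 * Returns the length of the rendered string, or a negative value on error.
 */
static int render_descr(char *buf, size_t buflen, const char *prefix,
                        const char *format, ...)
{
    va_list args;
    int pos, ret;

    pos = snprintf(buf, buflen, "%s", prefix);
    if (pos < 0 || (size_t)pos >= buflen)
        return pos;

    /* Steer around vsnprintf entirely when there is no format string. */
    if (!format)
        return pos;

    va_start(args, format);
    ret = vsnprintf(buf + pos, buflen - pos, format, args);
    va_end(args);
    return ret < 0 ? ret : pos + ret;
}
```

A caller that has nothing extra to say passes NULL as the format and gets back just the prefix.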
Cc: linux-xfs@vger.kernel.org # v6.10.0 Fixes: 9a8b09762f9a52 ("xfs_scrub: use parent pointers when possible to report file operations") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Carlos Maiolino [Thu, 13 Nov 2025 13:57:11 +0000 (14:57 +0100)]
metadump: catch used extent array overflow
A user reported a SIGSEGV when attempting to create a metadump image of
a filesystem.
The reason is that we fail to catch a possible overflow in the
used extents array in process_exinode(), which may happen if the extent
count is corrupted.
This leads process_bmbt_reclist() to attempt to index into the array
using the bogus extent count with:
convert_extent(&rp[numrecs - 1], &o, &s, &c, &f);
Fix this by extending the used counter to uint64_t and
checking for the overflow possibility.
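A minimal sketch of that kind of check, under stated assumptions: extents_fit_in_fork() and its parameters are hypothetical and are not the names used in metadump. It keeps the count in 64 bits and compares by division, so a corrupted extent count cannot wrap a narrower byte counter down to a small, plausible-looking value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: do `nex` extent records of `rec_size` bytes fit in
 * a fork of `fork_size` bytes?  The comparison is done by division in
 * 64 bits, so the multiplication that could overflow never happens.
 */
static bool extents_fit_in_fork(uint64_t nex, size_t rec_size, size_t fork_size)
{
    if (rec_size == 0)
        return false;
    return nex <= (uint64_t)fork_size / rec_size;
}
```

For comparison, multiplying 0x10000001 records by 16 bytes and storing the result in a 32-bit counter yields 16, which would pass a naive size check.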
Reported-by: hubert . <hubjin657@outlook.com> Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Carlos Maiolino [Thu, 13 Nov 2025 13:46:13 +0000 (14:46 +0100)]
mkfs: fix zone capacity check for sequential zones
Sequential zones can have a different, smaller capacity than
conventional zones.
Currently mkfs assumes both sequential and conventional zones will have
the same capacity, sets the zone_info to the capacity of the first
zone found, and uses that value to validate the capacity of all the
remaining zones.
Because a conventional zone can't have a capacity different from its
size, the first zone always has the largest possible capacity, so mkfs
will fail to validate any subsequent sequential zone whose capacity is
smaller than that of the conventional zones.
What we should do instead is set the zone info capacity according to
the settings of the first zone found of the respective type and validate
the capacity based on that, instead of assuming all zones will have the
same capacity.
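The per-type validation described above might look roughly like this. The struct zone and enum zone_type definitions are hypothetical stand-ins, not the structures mkfs uses; the sketch only shows the idea of remembering the first capacity seen for each zone type.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zone description; not the structure mkfs actually uses. */
enum zone_type { ZONE_CONVENTIONAL, ZONE_SEQUENTIAL, ZONE_NR_TYPES };

struct zone {
    enum zone_type type;
    uint64_t capacity;
};

/*
 * Validate that every zone has the same capacity as the first zone of
 * the same type.  Capacities are allowed to differ between types.
 */
static bool zones_capacity_ok(const struct zone *zones, size_t nr)
{
    uint64_t expected[ZONE_NR_TYPES] = { 0 };
    bool seen[ZONE_NR_TYPES] = { false };

    for (size_t i = 0; i < nr; i++) {
        enum zone_type t = zones[i].type;

        if (!seen[t]) {
            seen[t] = true;
            expected[t] = zones[i].capacity;
        } else if (zones[i].capacity != expected[t]) {
            return false;
        }
    }
    return true;
}
```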
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Luca Di Maio [Sat, 8 Nov 2025 14:39:53 +0000 (15:39 +0100)]
libxfs: support reproducible filesystems using deterministic time/seed
Add support for reproducible filesystem creation through two environment
variables that enable deterministic behavior when building XFS filesystems.
SOURCE_DATE_EPOCH support:
When SOURCE_DATE_EPOCH is set, use its value for all filesystem timestamps
instead of the current time. This follows the reproducible builds
specification (https://reproducible-builds.org/specs/source-date-epoch/)
and ensures consistent inode timestamps across builds.
DETERMINISTIC_SEED support:
When DETERMINISTIC_SEED=1 is set, return a fixed seed value (0x53454544 =
"SEED") from get_random_u32() instead of reading from /dev/urandom.
get_random_u32() seems to be used mostly to set the inode generation number,
so a fixed value should not create collision issues at mkfs time.
The implementation introduces two helper functions to minimize changes
to existing code:
- current_fixed_time(): Parses and caches SOURCE_DATE_EPOCH on first
call. Returns fixed timestamp when set, falls back to gettimeofday() on
parse errors or when unset.
- get_deterministic_seed(): Checks for DETERMINISTIC_SEED=1 environment
variable on first call, and returns a fixed seed value (0x53454544).
Falls back to getrandom() when unset.
- Both helpers use one-time initialization to avoid repeated getenv() calls.
- Both quickly exit and noop if environment is not set or has invalid
variables, falling back to original behaviour.
This enables distributions and build systems to create bit-for-bit
identical XFS filesystems when needed for verification and debugging.
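The SOURCE_DATE_EPOCH handling described above can be approximated as below. This is a sketch, not the libxfs code: parse_epoch() and fixed_time_now() are hypothetical names, and time() stands in for the gettimeofday() fallback to keep the example short. It illustrates the strtoll conversion, the whole-string check, and a separate initialization flag so that negative epochs remain valid.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical parser: returns true and stores the epoch in *out when
 * `str` is a complete, in-range decimal integer (negative allowed).
 */
static bool parse_epoch(const char *str, int64_t *out)
{
    char *end;
    long long val;

    if (!str || *str == '\0')
        return false;
    errno = 0;
    val = strtoll(str, &end, 10);
    /* Reject out-of-range values and trailing garbage. */
    if (errno != 0 || *end != '\0')
        return false;
    *out = val;
    return true;
}

/*
 * Sketch of a cached lookup.  A dedicated flag records that the
 * environment was already examined, so getenv() runs only once and a
 * zero or negative epoch is not mistaken for "unset".
 */
static int64_t fixed_time_now(void)
{
    static bool initialized;
    static bool have_epoch;
    static int64_t epoch;

    if (!initialized) {
        const char *env = getenv("SOURCE_DATE_EPOCH");

        have_epoch = parse_epoch(env, &epoch);
        if (env && !have_epoch)
            fprintf(stderr, "ignoring invalid SOURCE_DATE_EPOCH\n");
        initialized = true;
    }
    return have_epoch ? epoch : (int64_t)time(NULL);
}
```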
v1 -> v2:
- simplify deterministic seed by returning a fixed value instead
of using Middle Square Weyl Sequence PRNG
- fix timestamp type time_t -> time64_t
- fix timestamp initialization flag to allow negative epochs
- fix timestamp conversion type using strtoll
- fix timestamp conversion check to be sure the whole string was parsed
- print warning message when SOURCE_DATE_EPOCH is invalid
Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Right now 5 places in the kernel and one in xfsprogs need to be updated
for each new error tag. Add a bit of macro magic so that only the
error tag definition and a single table, which reside next to each
other, need to be updated.
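One common way to get this effect in C is an X-macro, sketched below. Whether the patch uses exactly this shape is an assumption, and the tag names, printable names and default values here are invented for the illustration.

```c
#include <stddef.h>

/*
 * Hypothetical tag table: each entry is (symbol, printable name,
 * default value).  Adding a tag means adding one line here.
 */
#define DEMO_ERRTAG_LIST(X)        \
    X(DEMO_TAG_ALPHA, alpha, 0)    \
    X(DEMO_TAG_BETA,  beta,  100)  \
    X(DEMO_TAG_GAMMA, gamma, 50)

/* Expansion 1: the enum of tag numbers. */
#define X_ENUM(sym, name, dflt) sym,
enum demo_errtag { DEMO_ERRTAG_LIST(X_ENUM) DEMO_TAG_MAX };
#undef X_ENUM

/* Expansion 2: printable names, indexed by tag. */
#define X_NAME(sym, name, dflt) [sym] = #name,
static const char *const demo_errtag_names[DEMO_TAG_MAX] = {
    DEMO_ERRTAG_LIST(X_NAME)
};
#undef X_NAME

/* Expansion 3: default values, indexed by tag. */
#define X_DFLT(sym, name, dflt) [sym] = dflt,
static const unsigned int demo_errtag_defaults[DEMO_TAG_MAX] = {
    DEMO_ERRTAG_LIST(X_DFLT)
};
#undef X_DFLT
```

Every consumer of the list is generated from the same table, so the enum, the names and the defaults cannot drift out of sync.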
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Chandan Babu R [Tue, 4 Nov 2025 09:14:37 +0000 (14:44 +0530)]
repair/prefetch.c: Create one workqueue with multiple workers
When xfs_repair is executed with a non-zero value for ag_stride,
do_inode_prefetch() creates multiple workqueues, each of them having just
one worker thread.
Since commit 12838bda12e669 ("libfrog: fix overly sleep workqueues"), a
workqueue can process multiple work items concurrently. Hence, this commit
replaces the above logic with just one workqueue having multiple workers.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Chandan Babu R [Tue, 4 Nov 2025 09:14:36 +0000 (14:44 +0530)]
libfrog: Prevent unnecessary waking of worker thread when using bounded workqueues
When woken up, a worker picks a work item from the workqueue and, if the
workqueue is bounded and at least one more work item remains to be
processed, wakes up another worker.
The commit 12838bda12e669 ("libfrog: fix overly sleep workqueues") prevented
single-threaded processing of work items by waking up sleeping workers when a
work item is added to the workqueue. Hence the earlier described mechanism of
waking workers is no longer required.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Fixes: 8a4ea72724930c ("proto: add ability to populate a filesystem from a directory")
Zone reset is a mandatory part of creating a file system on a zoned
device, unlike discard, which can be skipped. It is also implemented
a bit differently, so just split the handling. This also means that we
can now support the -K option to skip discard on the data section for
zoned file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 13 Oct 2025 23:34:24 +0000 (16:34 -0700)]
xfs_scrub_fail: reduce security lockdowns to avoid postfix problems
Iustin Pop reports that the xfs_scrub_fail service fails to email
problem reports on Debian when postfix is installed. This is apparently
due to several factors:
1. postfix's sendmail wrapper calling postdrop directly,
2. postdrop requiring the ability to write to the postdrop group,
3. lockdown preventing the xfs_scrub_fail@ service from having postdrop in
the supplemental group list or the ability to run setgid programs
Item (3) could be solved by adding the whole service to the postdrop
group via SupplementalGroups=, but that will fail if postfix is not
installed and hence there is no postdrop group.
It could also be solved by forcing msmtp to be installed, bind mounting
msmtp into the service container, and injecting a config file that
instructs msmtp to connect to port 25, but that in turn isn't compatible
with systems not configured to allow an smtp server to listen on ::1.
So we'll go with the less restrictive approach that e2scrub_fail@ does,
which is to say that we just turn off all the sandboxing. :( :(
Reported-by: iustin@debian.org Cc: linux-xfs@vger.kernel.org # v6.10.0 Fixes: 9042fcc08eed6a ("xfs_scrub_fail: tighten up the security on the background systemd service") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Replace the deprecated strncpy() with memtostr_pad(). This also avoids
the need for separate zeroing using memset(). Mark the sb_fname buffer with
__nonstring, as its size is XFSLABEL_MAX and so sb_fname has no
terminating NUL.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Split up the XFS_IS_CORRUPT statement so that it immediately shows
if the reference counter overflowed or underflowed.
I ran into this quite a bit when developing the zoned allocator, and had
to reapply the patch for some work recently. We might as well just apply
it upstream given that freeing a group is far removed from performance
critical code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
xfs_trans_alloc_empty can't return errors, so return the allocated
transaction directly instead of an output double pointer argument.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Use cmp_int() to yield the result of a three-way-comparison instead of
performing subtractions with extra casts. Thus also rename the function
to make its name clearer in purpose.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Perhaps that's just my silly imagination but 'diff' doesn't look good for
the name of a variable to hold a result of a three-way-comparison
(-1, 0, 1) which is what ->cmp_key_with_cur() does. The name implies that it
contains an actual difference between the two integer variables but that's not true
anymore after recent refactoring.
Declaring it as int64_t is also misleading now. Plain integer type is
more than enough.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_key_with_cur routines from int64_t to int and make the
interface a bit clearer.
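To illustrate why a three-way comparison is safer than an open-coded subtraction, here is a small sketch. The cmp_int() definition mirrors the usual comparison idiom; cmp_by_subtraction() and cmp_three_way() are hypothetical helpers written only for this example.

```c
#include <stdint.h>

/*
 * Three-way comparison: yields -1, 0 or 1.  Written as a macro so it
 * works for any pair of operands of the same arithmetic type.
 */
#define cmp_int(l, r) (((l) > (r)) - ((l) < (r)))

/*
 * Subtraction-based comparison.  The 64-bit difference is narrowed to
 * int, so the high bits are lost (the conversion is implementation
 * defined; common compilers keep the low 32 bits).
 */
static int cmp_by_subtraction(uint64_t a, uint64_t b)
{
    return (int)(a - b);
}

/* The same comparison expressed with cmp_int(): no casts, no overflow. */
static int cmp_three_way(uint64_t a, uint64_t b)
{
    return cmp_int(a, b);
}
```

With a = 1ULL << 32 and b = 0, the subtraction variant typically reports the keys as equal, while cmp_three_way() returns 1.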
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_two_keys routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
One may think that diff_two_keys routines are used to compute the actual
difference between the arguments but they return a result of a
three-way-comparison of the passed operands. So it looks more appropriate
to denote them as cmp_two_keys.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Darrick J. Wong [Sat, 11 Oct 2025 18:34:04 +0000 (11:34 -0700)]
mkfs: fix copy-paste error in calculate_rtgroup_geometry
Fix this copy-paste error -- we should calculate the rt volume
concurrency either if the user gave us an explicit option, or if they
didn't but the rt volume is an SSD.
Cc: linux-xfs@vger.kernel.org # v6.13.0 Fixes: 34738ff0ee80de ("mkfs: allow sizing realtime allocation groups for concurrency") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 19 Sep 2025 16:14:00 +0000 (09:14 -0700)]
xfs_scrub: fix strerror_r usage yet again
In commit 75faf2bc907584, someone tried to fix scrub to use the POSIX
version of strerror_r so that the build would work with musl.
Unfortunately, neither the author nor myself remembered that GNU libc
imposes its own version any time _GNU_SOURCE is defined, which
builddefs.in always does. Regrettably, the POSIX and GNU versions have
different return types and the GNU version can return any random
pointer, so now this code is broken on glibc.
"Fix" this standards body own goal by casting the return value to
intptr_t and employing some gross heuristics to guess at the location of
the actual error string.
Fixes: 75faf2bc907584 ("xfs_scrub: Use POSIX-conformant strerror_r") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: A. Wilcox <AWilcox@Wilcox-Tech.com>
Darrick J. Wong [Tue, 23 Sep 2025 17:10:27 +0000 (10:10 -0700)]
libfrog: pass mode to xfrog_file_setattr
xfs/633 crashes because rdump_fileattrs_path passes a NULL struct stat pointer
and then the fallback code dereferences it to get the file mode.
Instead, let's just pass the stat mode directly to it, because that's
the only piece of information that it needs.
Fixes: 128ac4dadbd633 ("xfs_db: use file_setattr to copy attributes on special files with rdump") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Darrick J. Wong [Tue, 23 Sep 2025 17:08:57 +0000 (10:08 -0700)]
mkfs: fix libxfs_iget return value sign inversion
libxfs functions return negative errno, so utilities must invert the
return values from such functions. Caught by xfs/437.
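The convention can be shown with a toy example. Both lib_lookup() and util_lookup() are hypothetical; the only point is the sign flip at the boundary between library-style code and a utility that reports errors with strerror().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical library call following the negative-errno convention. */
static int lib_lookup(int key)
{
    return key == 0 ? -ENOENT : 0;
}

/* Utility-side wrapper: invert the sign so callers see a positive errno. */
static int util_lookup(int key)
{
    int error = -lib_lookup(key);

    if (error)
        fprintf(stderr, "lookup failed: %s\n", strerror(error));
    return error;
}
```

Forgetting the negation would pass a negative number to strerror(), which yields an "Unknown error" message instead of the real cause.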
Fixes: 8a4ea72724930c ("proto: add ability to populate a filesystem from a directory") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
A. Wilcox [Sat, 6 Sep 2025 08:12:07 +0000 (03:12 -0500)]
xfs_scrub: Use POSIX-conformant strerror_r
When building xfsprogs with musl libc, strerror_r returns int as
specified in POSIX. This differs from the glibc extension that returns
char*. Successful calls will return 0, which will be dereferenced as a
NULL pointer by (v)fprintf.
Signed-off-by: A. Wilcox <AWilcox@Wilcox-Tech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_db: use file_setattr to copy attributes on special files with rdump
rdump just skipped file attributes on special files as copying wasn't
possible. Let's use new file_getattr/file_setattr syscalls to copy
attributes even for special files.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
xfs_quota: utilize file_setattr to set prjid on special files
Utilize the new file_getattr/file_setattr syscalls to set the project ID on
special files. Previously, special files were skipped due to the lack of a
way to call the FS_IOC_SETFSXATTR ioctl on them. The quota accounting was
therefore missing these inodes (special files created before project
setup). The ones created after project initialization did inherit the
projid flag from the parent.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
libfrog: Define STATX__RESERVED if not provided by the system
This define is not provided by musl libc. Use the fallback that is
already provided if statx and its types (tested on STATX_TYPE) are
not defined in the general case.
This fixes one cause for failing to compile against musl libc.
Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Petr Vaněk <arkamar@gentoo.org>
configure: Base NEED_INTERNAL_STATX on libc headers first
At compile time the libc headers are preferred, and linux/stat.h is
only included if the libc headers don't provide a definition for statx
and its types (tested on STATX_TYPE). The configure test should be
based on the same logic.
This fixes one cause for failing to compile against musl libc.
Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Petr Vaněk <arkamar@gentoo.org>
Zhang Yi [Wed, 13 Aug 2025 02:42:50 +0000 (10:42 +0800)]
xfs_io: add FALLOC_FL_WRITE_ZEROES support
The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
fallocate(2). Add FALLOC_FL_WRITE_ZEROES support to the
fallocate utility by introducing a new 'fwzero' command in the xfs_io
tool.
Christian Kujau [Tue, 26 Aug 2025 16:06:26 +0000 (18:06 +0200)]
xfsprogs: fix utcnow deprecation warning in xfs_scrub_all.py
Running xfs_scrub_all under Python 3.13.5 prints the following warning:
----------------------------------------------
$ /usr/sbin/xfs_scrub_all --auto-media-scan-stamp \
/var/lib/xfsprogs/xfs_scrub_all_media.stamp \
--auto-media-scan-interval 1d
/usr/sbin/xfs_scrub_all:489: DeprecationWarning:
datetime.datetime.utcnow() is deprecated and scheduled for removal in a
future version. Use timezone-aware objects to represent datetimes in UTC:
datetime.datetime.now(datetime.UTC).
dt = datetime.utcnow()
Automatically enabling file data scrub.
----------------------------------------------
Python documentation for context:
https://docs.python.org/3/library/datetime.html#datetime.datetime.utcnow
Fix this by using datetime.now() instead.
NB: Debian/13 ships Python 3.13.5 and has a xfs_scrub_all.timer active,
I'd assume that many systems will have that warning now in their logs :-)
Signed-off-by: Christian Kujau <lists@nerdbynature.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Carlos Maiolino [Tue, 26 Aug 2025 12:23:12 +0000 (14:23 +0200)]
Improve information about logbsize valid values
Valid values for logbsize depend on whether log_sunit is set
on the filesystem and on whether logbsize is manually set.
When manually set, logbsize must be one of the specified values -
32k to 256k inclusive in power-of-two increments. The specified
value must also be a multiple of log_sunit.
The default configuration for v2 logs uses a relaxed restriction,
setting logbsize to log_sunit regardless of whether it is one of the valid
values - thereby implicitly ignoring the power-of-two restriction.
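The rules for a manually supplied value, as stated above, can be written down as a small predicate. This is a sketch of the rules in this message only; manual_logbsize_valid() and the two limit macros are hypothetical names and are not taken from the kernel or from mkfs.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOGBSIZE_MIN (32u * 1024)
#define LOGBSIZE_MAX (256u * 1024)

/*
 * Check a manually supplied logbsize (in bytes): it must be a power of
 * two between 32k and 256k, and a multiple of log_sunit when one is set
 * (log_sunit == 0 means no log stripe unit).
 */
static bool manual_logbsize_valid(uint32_t logbsize, uint32_t log_sunit)
{
    bool pow2 = logbsize != 0 && (logbsize & (logbsize - 1)) == 0;

    if (!pow2 || logbsize < LOGBSIZE_MIN || logbsize > LOGBSIZE_MAX)
        return false;
    if (log_sunit != 0 && logbsize % log_sunit != 0)
        return false;
    return true;
}
```

A 96k log stripe unit shows the asymmetry: the default configuration would pick logbsize = 96k, yet the same value fails this check when supplied by hand because it is not a power of two.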
Instead of changing the valid values for logbsize, which would increase the
testing matrix and allow users to use some dubious configurations,
just update the man page to describe this difference in behavior between
manually setting logbsize and leaving it at the default.
This was originally found by a user who attempted to manually set
logbsize to the same value picked by the default configuration, only
to receive an error message as a result.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Luca Di Maio [Wed, 30 Jul 2025 16:12:22 +0000 (18:12 +0200)]
proto: add ability to populate a filesystem from a directory
This patch implements the functionality to populate a newly created XFS
filesystem directly from an existing directory structure.
It reuses the existing protofile logic and branches if the input is a
directory.
The population process steps are as follows:
- create the root inode before populating content
- recursively process nested directories
- handle regular files, directories, symlinks, char devices, block
devices, sockets, fifos
- preserve attributes (ownership, permissions)
- preserve mtime timestamps from source files to maintain file history
- use current time for atime/ctime/crtime
- possible to specify atime=1 to preserve atime timestamps from
source files
- preserve extended attributes and fsxattrs for all file types
- preserve hardlinks
At the moment, the implementation for the hardlink tracking is very
simple, as it involves a linear search.
In my local testing using larger source directories
(1.3 million inodes, ~400k hardlinks), the difference was actually
just a few seconds (given that most of the time is spent doing I/O).
We might want to revisit that in the future if this becomes a
bottleneck.
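A linear-search tracker of the kind described above could look like this. The structure and function names are hypothetical and not taken from the patch; the sketch only shows the data structure and why each lookup costs O(n).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mapping from a source inode number to the new inode. */
struct hardlink_entry {
    uint64_t src_ino;
    uint64_t dst_ino;
};

struct hardlink_table {
    struct hardlink_entry *entries;
    size_t count;
    size_t capacity;
};

/* Linear search: O(n) per lookup, acceptable while I/O dominates. */
static bool hardlink_lookup(const struct hardlink_table *t, uint64_t src_ino,
                            uint64_t *dst_ino)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].src_ino == src_ino) {
            *dst_ino = t->entries[i].dst_ino;
            return true;
        }
    }
    return false;
}

/* Append a mapping, doubling the array when full.  Returns 0 or -1. */
static int hardlink_add(struct hardlink_table *t, uint64_t src_ino,
                        uint64_t dst_ino)
{
    if (t->count == t->capacity) {
        size_t ncap = t->capacity ? t->capacity * 2 : 16;
        struct hardlink_entry *n = realloc(t->entries, ncap * sizeof(*n));

        if (!n)
            return -1;
        t->entries = n;
        t->capacity = ncap;
    }
    t->entries[t->count].src_ino = src_ino;
    t->entries[t->count].dst_ino = dst_ino;
    t->count++;
    return 0;
}
```

If this ever did become a bottleneck, the lookup could be swapped for a hash table without changing the callers.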
This functionality makes it easier to create populated filesystems
without having to mount them; it's particularly useful for
reproducible builds.
Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_log_recover.h is in fs/xfs/libxfs/ in the kernel tree, and thus the
libxfs-apply tool tries to apply changes to it in libxfs/ and fails
because the header is in include.
Move it to libxfs to make libxfs-apply work properly and to keep our
house in order.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
John Garry [Wed, 30 Jul 2025 10:13:20 +0000 (10:13 +0000)]
mkfs: require reflink for max_atomic_write option
Setting the max_atomic_write option means that the user wants to
support atomic writes up to that size.
However, to support this we must have reflink, so enforce that this is
available.
Signed-off-by: John Garry <john.g.garry@oracle.com> Suggested-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Darrick J. Wong [Tue, 29 Jul 2025 20:14:14 +0000 (13:14 -0700)]
misc: fix reversed calloc arguments
gcc 14 complains about reversed arguments to calloc:
namei.c: In function ‘path_parse’:
namei.c:51:32: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
51 | dirpath = calloc(sizeof(*dirpath), 1);
| ^
namei.c:51:32: note: earlier argument should specify number of elements, later size of each element
Fix all of these.
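For reference, the corrected argument order looks like this. struct dirpath below is a hypothetical stand-in for the real structure, and dirpath_alloc() is an illustrative helper rather than code from the patch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the structure allocated in path_parse(). */
struct dirpath {
    unsigned int depth;
    char **path;
};

static struct dirpath *dirpath_alloc(void)
{
    /* calloc(nmemb, size): element count first, element size second. */
    return calloc(1, sizeof(struct dirpath));
}
```

Both orders allocate the same number of zeroed bytes; the fix is about matching the documented parameter order so the compiler warning goes away.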
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard. None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.
So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs. But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to be unnecessarily allocated, but can
also lead to memory leaks as the xg_busy_extents pointer in the
xfs_group structure is overlaid with the pointer for the linked list
of to-be-reset zones.
Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.
Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Signed-off-by: Christoph Hellwig <hch@lst.de>
[cem: Fix type and add stable tag] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:
Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. Writes
are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.
If this happens at the same time as memory pressure is occurring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.
Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.
This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.
Then other operations on that inode hang because the task was killed
whilst holding inode locks. e.g:
Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.
Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.
Fix this by replacing the asserts with a corruption detection check
and a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:15 +0000 (10:45 -0700)]
xfs_scrub: remove EXPERIMENTAL warnings
The kernel code for online fsck has been stable for a year, and there
haven't been any major changes to the program in quite some time, so
let's drop the experimental warnings.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:14 +0000 (10:45 -0700)]
mkfs: allow users to configure the desired maximum atomic write size
Allow callers of mkfs.xfs to specify a desired maximum atomic write
size. This value will cause the log size to be adjusted to support
software atomic writes, and the AG size to be aligned to support
hardware atomic writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:14 +0000 (10:45 -0700)]
mkfs: don't complain about overly large auto-detected log stripe units
If mkfs declines to apply what it thinks is an overly large data device
stripe unit to the log device, it should only log a message about that
if the lsunit parameter was actually supplied by the caller. It should
not do that when the lsunit was autodetected from the block devices.
The cli parameters are zero-initialized in main and always have been.
Cc: <linux-xfs@vger.kernel.org> # v4.15.0 Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:13 +0000 (10:45 -0700)]
xfs_db: create an untorn_max subcommand
Create a debugger command to compute either the logres needed to
perform an untorn cow write completion for a given number of blocks; or
the number of blocks that can be completed given a log reservation.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write. If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.
The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size). We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.
The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has for the past few commits. Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com>
Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.
The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.
In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
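The "greatest power-of-two factor" of a number is its lowest set bit, which can be isolated with a single bitwise operation. The helper below is a hypothetical illustration and not the function used in the patch.

```c
#include <stdint.h>

/*
 * Largest power of two that divides n, i.e. the lowest set bit of n.
 * Returns 0 for n == 0.
 */
static uint32_t greatest_pow2_factor(uint32_t n)
{
    /* ~n + 1 is the two's complement negation of n. */
    return n & (~n + 1u);
}
```

For example, an agsize of 1000 blocks has a greatest power-of-two factor of 8, so atomic writes larger than 8 blocks could not be guaranteed to stay aligned within every AG.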
Function xfs_atomic_write_logitems() is added to find the limit on the number
of log items which can fit in a single transaction.
Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.
[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Tue, 1 Jul 2025 17:45:12 +0000 (10:45 -0700)]
libxfs: add helpers to compute log item overhead
Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.
For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.
Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extent unmaps, which is quite
small.
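
The figure of 16 follows if every filesystem block of the write is its
own extent and the block size is 4KB; the block size is an assumption
here, since the message does not state it. A hypothetical helper makes
the arithmetic explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Worst case: every filesystem block of the write is its own extent. */
static inline uint64_t worst_case_extents(uint64_t write_bytes,
					  uint64_t block_bytes)
{
	assert(block_bytes != 0);
	return (write_bytes + block_bytes - 1) / block_bytes;
}
```

With write_bytes = 64 * 1024 and block_bytes = 4096 this yields 16.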
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.
This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).
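
What aligning to the extent size hint means can be sketched as below.
This is an illustration under assumptions, not the allocator's code:
struct extent and align_to_extsz() are hypothetical names, and the
sketch simply widens a range so that both ends land on multiples of the
hint.

```c
#include <assert.h>
#include <stdint.h>

struct extent {
	uint64_t start;		/* first block */
	uint64_t len;		/* number of blocks */
};

/* Widen [start, start + len) so both ends are multiples of extsz. */
static inline struct extent align_to_extsz(struct extent e,
					   uint64_t extsz)
{
	uint64_t end;

	assert(extsz != 0);
	end = e.start + e.len;
	e.start -= e.start % extsz;			/* round start down */
	end = (end + extsz - 1) / extsz * extsz;	/* round end up */
	e.len = end - e.start;
	return e;
}
```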
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com>
The way mdrestore works is not very amenable to zoned devices. The code
that checks the device size tries to write to the highest offset, which
doesn't match the write pointer of a clean zoned device. And while that
is relatively easily fixable, the metadata for each RTG records the
highest written offset, and the mount code compares that to the hardware
write pointer, which will mismatch. This could be fixed by using write
zeroes to pad the RTG until the expected write pointer, but this turns
the quick metadata operation that mdrestore is supposed to be into
something that could take hours on HDD.
So instead error out when someone tries to mdrestore onto a zoned device
to clearly document that this won't work. Doing a mdrestore into a file
still works perfectly fine, and we might look into a new mdrestore option
to restore into a set of files suitable for the zoned loop device driver
to make mdrestore fully usable for debugging.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Luca Di Maio [Wed, 16 Apr 2025 21:20:04 +0000 (23:20 +0200)]
xfs_protofile: fix permission octet when suid/guid is set
When encountering suid or sgid files, we already set the `u` or `g`
property in the prototype file.
Given that proto.c only supports three digits for permissions, we
need to remove the redundant information from the permission, otherwise
it is parsed incorrectly.
Co-authored-by: Luca Di Maio <luca.dimaio1@gmail.com> Co-authored-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
[aalbersh: 68 chars limit and removed patch review revisions]
Handle the synthetic fmr_device values, and deal with the fact that
ctx->fsinfo.fs_rt is allowed to be non-NULL for internal RT devices as
it is the same as the data device in this case.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_mkfs: default to rtinherit=1 for zoned file systems
Zoned file systems are intended to use sequential write required zones
(or areas treated as such) for the main data store, and usually use the
data device only for metadata that requires random writes.
rtinherit=1 is the way to achieve that, so enable it by default, but
still allow the user to override it if needed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[aalbersh remove accidental "inherit" word in commit desc.]
xfs_mkfs: support creating file system with zoned RT devices
To create file systems with a zoned RT device, query the hardware
zone information to align the RT groups to it, and create an internal
RT device if the device has conventional and sequential write required
zones.
Default to use all sequential write required zones for the RT device if
there are sequential write required zones.
Default to 256 and 1% conventional when -r zoned is specified without
further option and there are no sequential write required zones. This
mimics a SMR HDD and works well with tests.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_repair: validate rt groups vs reported hardware zones
Run a report zones ioctl, and verify the rt group state vs the
reported hardware zone state. Note that there is no way to actually
fix up any discrepancies here, as that would be rather scary without
having transactions.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Not really much to do here. Mostly ignore the validation and
regeneration of the bitmap and summary inodes. Eventually this
could grow a bit of validation of the hardware zone state.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Zoned devices can have gaps between the usable capacity of a zone and the
end of the zone in the LBA/daddr address space. In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us. In this case the sparse FSB/RTB address space maps 1:1
to the device address space.
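
A sparse, power-of-two aligned address space of this kind can be
sketched as below. The structure and helper names are hypothetical and
the numbers in the usage note are made up; the point is only that the
group number occupies the high bits and that block offsets past the
usable capacity fall into the gap.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout: each group spans 2^shift blocks of address
 * space, of which only the first `capacity` blocks are usable.
 */
struct rtgroup_geom {
	unsigned int shift;	/* log2 of the address space per group */
	uint64_t capacity;	/* usable blocks per group */
};

/* Combine a group number and an in-group block into one address. */
static inline uint64_t rgbno_to_rtb(const struct rtgroup_geom *g,
				    uint32_t rgno, uint64_t rgbno)
{
	return ((uint64_t)rgno << g->shift) | rgbno;
}

/* True if the address is inside a group's usable capacity. */
static inline bool rtb_is_usable(const struct rtgroup_geom *g,
				 uint64_t rtb)
{
	uint64_t rgbno = rtb & ((UINT64_C(1) << g->shift) - 1);

	return rgbno < g->capacity;
}
```

With shift = 16 and capacity = 50000, block 100 of group 2 sits at
address 131172, while offsets 50000 through 65535 of every group are
never used.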
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Enable the zoned RT device directory feature. With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks. This perfectly maps to zoned devices, but can also be
used on conventional block devices.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs. To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.
The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
RT groups on a zoned file system need to be completely empty before their
space can be reused. This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.
Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied. To find the least used zone, a simple set of 10
buckets based on the amount of space used in the zone is used. To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.
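
The bucket scheme can be pictured roughly as follows. This is a sketch
under assumptions rather than the actual implementation: the helper
names are hypothetical, and how zones are tracked within each bucket is
reduced to a simple count.

```c
#include <assert.h>
#include <stdint.h>

#define NR_BUCKETS 10

/* Map a zone's used block count to a bucket (0 = least used). */
static inline unsigned int used_bucket(uint64_t used, uint64_t capacity)
{
	unsigned int b;

	assert(capacity != 0 && used <= capacity);
	b = (unsigned int)(used * NR_BUCKETS / capacity);
	/* a completely full zone lands in the last bucket */
	return b < NR_BUCKETS ? b : NR_BUCKETS - 1;
}

/*
 * counts[i] is the number of zones currently in bucket i. Return the
 * lowest non-empty bucket, or -1 if there is no candidate at all.
 */
static inline int pick_victim_bucket(const unsigned int counts[NR_BUCKETS])
{
	int i;

	for (i = 0; i < NR_BUCKETS; i++)
		if (counts[i] > 0)
			return i;
	return -1;
}
```

Picking from the lowest non-empty bucket approximates "least used"
without having to keep every zone sorted by its exact usage.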
To automatically defragment files the rmap records are sorted by inode
and logical offset. This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
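
The sort order can be illustrated with a plain qsort() comparator. The
record layout and the names rmap_rec, rmap_cmp() and sort_for_gc() are
hypothetical and much simpler than a real rmap record.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified reverse-mapping record. */
struct rmap_rec {
	uint64_t ino;		/* owning inode */
	uint64_t offset;	/* logical offset in the file */
	uint64_t startblock;	/* current location in the zone */
};

/* Order by inode first, then by logical offset within the inode. */
static int rmap_cmp(const void *a, const void *b)
{
	const struct rmap_rec *ra = a;
	const struct rmap_rec *rb = b;

	if (ra->ino != rb->ino)
		return ra->ino < rb->ino ? -1 : 1;
	if (ra->offset != rb->offset)
		return ra->offset < rb->offset ? -1 : 1;
	return 0;
}

static inline void sort_for_gc(struct rmap_rec *recs, size_t n)
{
	qsort(recs, n, sizeof(*recs), rmap_cmp);
}
```

Writing the records back in this order places each file's blocks next
to each other in the destination zone.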
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O. Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it changed.
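
The completion-time check amounts to comparing the current mapping with
the one recorded when GC started. The sketch below uses hypothetical
names (struct mapping, gc_finish()) and leaves out all locking.

```c
#include <stdbool.h>
#include <stdint.h>

struct mapping {
	uint64_t offset;	/* logical offset in the file */
	uint64_t startblock;	/* physical location */
	uint64_t len;		/* length in blocks */
};

static inline bool mapping_equal(const struct mapping *a,
				 const struct mapping *b)
{
	return a->offset == b->offset &&
	       a->startblock == b->startblock &&
	       a->len == b->len;
}

/*
 * Point *cur at the relocated data only if it still matches what was
 * recorded at the start of the GC cycle. Returns false if the file
 * changed in the meantime, in which case the mapping is left alone.
 */
static inline bool gc_finish(struct mapping *cur,
			     const struct mapping *recorded,
			     uint64_t new_startblock)
{
	if (!mapping_equal(cur, recorded))
		return false;
	cur->startblock = new_startblock;
	return true;
}
```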
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers. This means waiting for garbage collection with the iolock can
deadlock.
To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection. The new helpers try to take these available blocks, and
if there aren't enough available they wake and wait for GC. This is
done using a list of on-stack reservations to ensure fairness.
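
The fairness aspect can be modelled in a single-threaded sketch. The
names below are hypothetical, a fixed-size array stands in for the list
of on-stack reservations, and waking GC and sleeping are reduced to
comments; none of this is the kernel's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 8

struct waiter {
	uint64_t want;		/* blocks requested */
	bool granted;
};

struct space {
	uint64_t available;	/* blocks writable without GC */
	struct waiter *queue[MAX_WAITERS];	/* FIFO of waiters */
	size_t nr_waiters;
};

/* Take blocks right away only if nobody is queued ahead of us. */
static inline bool reserve(struct space *s, struct waiter *w)
{
	w->granted = false;
	if (s->nr_waiters == 0 && s->available >= w->want) {
		s->available -= w->want;
		w->granted = true;
		return true;
	}
	/* real code would wake GC and sleep; a full queue just refuses */
	if (s->nr_waiters < MAX_WAITERS)
		s->queue[s->nr_waiters++] = w;
	return false;
}

/* Called when GC frees blocks: grant strictly in arrival order. */
static inline void gc_freed(struct space *s, uint64_t freed)
{
	size_t done = 0, i;

	s->available += freed;
	while (done < s->nr_waiters &&
	       s->available >= s->queue[done]->want) {
		s->available -= s->queue[done]->want;
		s->queue[done]->granted = true;
		done++;
	}
	for (i = done; i < s->nr_waiters; i++)
		s->queue[i - done] = s->queue[i];
	s->nr_waiters -= done;
}
```

Because a small request queues behind a large one instead of jumping
ahead, a large reservation cannot be starved by a stream of small ones.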
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>