git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

logprint: re-indent printing helpers

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

logprint: remove xlog_print_dir2_sf

The code has been stubbed out since the initial creation of the
xfsprogs repository. Open code the single-line printf in the
data fork caller (attr forks can't contain directories) and remove
the dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

include: remove struct xfs_qoff_logitem

Not used anywhere, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub: fix null pointer crash in scrub_render_ino_descr

Starting in Debian 13's libc6, passing a NULL format string to vsnprintf
causes the program to segfault.  Prior to this, the null format string
would be ignored.  Because @format is optional, let's explicitly steer
around the vsnprintf if there is no format string.  Also tidy whitespace
in the comment.

Found by generic/45[34] on Debian 13.
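
A minimal sketch of the guard described above. render_descr() and its
parameters are hypothetical stand-ins, not the real signature of
scrub_render_ino_descr(); the point is only that a NULL format never
reaches vsnprintf():

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "prefix" into buf, then append an optional
 * printf-style suffix. A NULL format means "no suffix" and is never passed
 * to vsnprintf(). Returns the length of the rendered string, or a negative
 * value on an output error.
 */
static int
render_descr(char *buf, size_t buflen, const char *prefix,
	     const char *format, ...)
{
	va_list	args;
	int	pos;
	int	ret;

	pos = snprintf(buf, buflen, "%s", prefix);
	if (pos < 0 || (size_t)pos >= buflen)
		return pos;

	/* The format is optional: steer around vsnprintf() entirely. */
	if (!format)
		return pos;

	va_start(args, format);
	ret = vsnprintf(buf + pos, buflen - pos, format, args);
	va_end(args);
	return ret < 0 ? ret : pos + ret;
}
```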

Cc: linux-xfs@vger.kernel.org # v6.10.0
Fixes: 9a8b09762f9a52 ("xfs_scrub: use parent pointers when possible to report file operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

metadump: catch used extent array overflow

A user reported a SIGSEGV when attempting to create a metadump image of
a filesystem.
The reason is that we fail to catch a possible overflow in the
used extents array in process_exinode(), which may happen if the extent
count is corrupted.
This leads process_bmbt_reclist() to attempt to index into the array
using the bogus extent count with:

convert_extent(&rp[numrecs - 1], &o, &s, &c, &f);

Fix this by extending the used counter to uint64_t and
checking for the overflow possibility.
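
A rough sketch of that kind of check, not the actual metadump code. The
struct, the last_extent() helper and the fork_bytes parameter are all
hypothetical; the idea is to compute the space the records would need in
64 bits and compare it against the space available before indexing:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent record. */
struct extent_rec {
	uint64_t	startoff;
	uint64_t	blockcount;
};

/*
 * Return the last record, or NULL if the recorded extent count cannot fit
 * in the space provided. "used" is 64 bits wide so that the multiplication
 * cannot wrap before the comparison.
 */
static const struct extent_rec *
last_extent(const struct extent_rec *rp, uint32_t nextents, size_t fork_bytes)
{
	uint64_t	used = (uint64_t)nextents * sizeof(struct extent_rec);

	if (nextents == 0 || used > fork_bytes)
		return NULL;
	return &rp[nextents - 1];
}
```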

Reported-by: hubert . <hubjin657@outlook.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: fix zone capacity check for sequential zones

Sequential zones can have a different, smaller capacity than
conventional zones.

Currently mkfs assumes both sequential and conventional zones will have
the same capacity: it sets the zone_info to the capacity of the first
zone found and uses that value to validate the capacity of all the
remaining zones.

Because a conventional zone can't have a capacity different from its
size, the first zone always has the largest possible capacity, so mkfs
will fail to validate any subsequent sequential zone whose capacity is
smaller than that of the conventional zones.

What we should do instead is set the zone info capacity according to
the settings of the first zone found of the respective type and validate
the capacity based on that instead of assuming all zones will have the
same capacity.
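
The per-type validation can be sketched roughly as follows. The enum,
struct and function names are made up for illustration and do not match
the mkfs code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum zone_type { ZONE_CONVENTIONAL, ZONE_SEQUENTIAL, ZONE_NR_TYPES };

/* Hypothetical zone descriptor. */
struct zone {
	enum zone_type	type;
	uint64_t	capacity;
};

/*
 * The first zone seen of each type sets the expected capacity for that
 * type; every later zone of the same type must match it.
 */
static bool
zone_capacities_valid(const struct zone *zones, size_t nr)
{
	uint64_t	expected[ZONE_NR_TYPES] = { 0 };
	bool		seen[ZONE_NR_TYPES] = { false };
	size_t		i;

	for (i = 0; i < nr; i++) {
		enum zone_type	t = zones[i].type;

		if (!seen[t]) {
			seen[t] = true;
			expected[t] = zones[i].capacity;
		} else if (zones[i].capacity != expected[t]) {
			return false;
		}
	}
	return true;
}
```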

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>

libxfs: support reproducible filesystems using deterministic time/seed

Add support for reproducible filesystem creation through two environment
variables that enable deterministic behavior when building XFS filesystems.

SOURCE_DATE_EPOCH support:
When SOURCE_DATE_EPOCH is set, use its value for all filesystem timestamps
instead of the current time. This follows the reproducible builds
specification (https://reproducible-builds.org/specs/source-date-epoch/)
and ensures consistent inode timestamps across builds.

DETERMINISTIC_SEED support:
When DETERMINISTIC_SEED=1 is set, return a fixed seed value (0x53454544 =
"SEED") from get_random_u32() instead of reading from /dev/urandom.

get_random_u32() seems to be used mostly to set the inode generation number;
keeping it fixed should not create collision issues at mkfs time.

The implementation introduces two helper functions to minimize changes
to existing code:

- current_fixed_time(): Parses and caches SOURCE_DATE_EPOCH on first
  call. Returns fixed timestamp when set, falls back to gettimeofday() on
  parse errors or when unset.
- get_deterministic_seed(): Checks for DETERMINISTIC_SEED=1 environment
  variable on first call, and returns a fixed seed value (0x53454544).
  Falls back to getrandom() when unset.

Both helpers use one-time initialization to avoid repeated getenv() calls.
Both exit quickly as a no-op if the environment variable is unset or
invalid, falling back to the original behaviour.

Example usage:
  SOURCE_DATE_EPOCH=1234567890 \
  DETERMINISTIC_SEED=1 \
  mkfs.xfs \
    -m uuid=$EXAMPLE_UUID \
    -p file=./rootfs \
    disk1.img

This enables distributions and build systems to create bit-for-bit
identical XFS filesystems when needed for verification and debugging.

v1 -> v2:
- simplify deterministic seed by returning a fixed value instead
  of using Middle Square Weyl Sequence PRNG
- fix timestamp type time_t -> time64_t
- fix timestamp initialization flag to allow negative epochs
- fix timestamp conversion type using strtoll
- fix timestamp conversion check to be sure the whole string was parsed
- print warning message when SOURCE_DATE_EPOCH is invalid
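
A sketch of the parsing rules listed above (strtoll, whole-string check,
negative epochs allowed, one-time lookup). parse_epoch() and fixed_time()
are hypothetical names and are not the helpers added by this patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a SOURCE_DATE_EPOCH-style string. Succeeds only if strtoll()
 * consumes the whole string without a range error.
 */
static bool
parse_epoch(const char *str, int64_t *epoch)
{
	char		*end;
	long long	val;

	if (!str || *str == '\0')
		return false;

	errno = 0;
	val = strtoll(str, &end, 10);
	if (errno != 0 || *end != '\0')
		return false;

	*epoch = val;
	return true;
}

/*
 * Cache the lookup so getenv() runs only once. A separate state flag is
 * used instead of a sentinel timestamp, so negative epochs remain valid.
 */
static bool
fixed_time(int64_t *epoch)
{
	static enum { UNKNOWN, UNSET, SET } state = UNKNOWN;
	static int64_t	cached;

	if (state == UNKNOWN)
		state = parse_epoch(getenv("SOURCE_DATE_EPOCH"), &cached) ?
				SET : UNSET;
	if (state == SET)
		*epoch = cached;
	return state == SET;
}
```

A caller would fall back to the current time whenever fixed_time()
returns false.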

Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Fix alloc/free of cache item

xfs_extfree_item_cache is allocated and freed twice. Remove the
obsolete alloc/free.

Signed-off-by: Torsten Rupp <torsten.rupp@gmx.net>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs_io: use the XFS_ERRTAG macro to generate injection targets

Use the new magic macro table provided by libxfs to autogenerate
the list of valid error injection targets.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: centralize error tag definitions

From: Christoph Hellwig <hch@lst.de>

Source kernel commit: 71fa062196ae3abab790c91f1bdf09dcdc6fb1fe

Right now 5 places in the kernel and one in xfsprogs need to be updated
for each new error tag. Add a bit of macro magic so that only the
error tag definition and a single table, which reside next to each
other, need to be updated.
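
For readers unfamiliar with the technique, a generic X-macro sketch of
the "one table, many expansions" idea follows. The DEMO_* names, strings
and values are invented for illustration; the shape of the real
XFS_ERRTAG table may well differ:

```c
#include <stddef.h>
#include <string.h>

/* One row per tag: X(enum name, user-visible string, default value). */
#define DEMO_ERRTAGS(X) \
	X(DEMO_ERRTAG_NOERROR,	"noerror",	0) \
	X(DEMO_ERRTAG_IFLUSH,	"iflush",	100) \
	X(DEMO_ERRTAG_BMAP,	"bmap",		50)

/* Expansion 1: the tag numbers. */
#define X_ENUM(tag, name, dflt)	tag,
enum demo_errtag { DEMO_ERRTAGS(X_ENUM) DEMO_ERRTAG_MAX };
#undef X_ENUM

/* Expansion 2: the name table, indexed by tag number. */
#define X_NAME(tag, name, dflt)	[tag] = name,
static const char *const demo_errtag_names[] = { DEMO_ERRTAGS(X_NAME) };
#undef X_NAME

/* Expansion 3: the default values, indexed by tag number. */
#define X_DFLT(tag, name, dflt)	[tag] = dflt,
static const unsigned int demo_errtag_defaults[] = { DEMO_ERRTAGS(X_DFLT) };
#undef X_DFLT

/* Look up a tag by name, as a command-line tool might; -1 if unknown. */
static int
demo_errtag_lookup(const char *name)
{
	int	i;

	for (i = 0; i < DEMO_ERRTAG_MAX; i++)
		if (strcmp(demo_errtag_names[i], name) == 0)
			return i;
	return -1;
}
```

Adding a tag then means adding one row to the table; the enum, the name
table and the defaults all follow automatically.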

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

repair/prefetch.c: Create one workqueue with multiple workers

When xfs_repair is executed with a non-zero value for ag_stride,
do_inode_prefetch() creates multiple workqueues, each of them having just
one worker thread.

Since commit 12838bda12e669 ("libfrog: fix overly sleep workqueues"), a
workqueue can process multiple work items concurrently. Hence, this commit
replaces the above logic with just one workqueue having multiple workers.

Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libfrog: Prevent unnecessary waking of worker thread when using bounded workqueues

When woken up, a worker will pick a work item from the workqueue and wake up
another worker when the current workqueue is a bounded workqueue and
at least one more work item remains to be processed.

The commit 12838bda12e669 ("libfrog: fix overly sleep workqueues") prevented
single-threaded processing of work items by waking up sleeping workers when a
work item is added to the workqueue. Hence the earlier described mechanism of
waking workers is no longer required.

Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

proto: fix file descriptor leak

Fix the leak of pathfd introduced in commit 8a4ea72724930cfe262ccda03028264e1a81b145.

Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Fixes: 8a4ea72724930c ("proto: add ability to populate a filesystem from a directory")

mkfs: split zone reset from discard

Zone reset is a mandatory part of creating a file system on a zoned
device, unlike discard, which can be skipped. It is also implemented
a bit differently, so just split the handling. This also means that we
can now support the -K option to skip discard on the data section for
zoned file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

mkfs: remove duplicate struct libxfs_init arguments

The libxfs_init structure instance is pointed to by cli_params, so use
that where it already exists instead of passing an additional argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libxfs: cleanup get_topology

Add a libxfs_ prefix to the name, clear the structure in the helper
instead of in the callers, and use a bool to pass a boolean argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

mkfs: move clearing LIBXFS_DIRECT into check_device_type

Keep it close to the block device vs regular file logic and remove
the checks for each device in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

mkfs: improve the error message in adjust_nr_zones

Print the zone counts to help the user to understand the problem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

mkfs: improve the error message from check_device_type

Tell the user what device is missing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_copy: improve the error message when mkfs is in progress

Indicate the correct reason for the failure instead of the same
message as for the generic error condition just above.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfsprogs: Release v6.17.0

Update all the necessary files for a v6.17.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub_fail: reduce security lockdowns to avoid postfix problems

Iustin Pop reports that the xfs_scrub_fail service fails to email
problem reports on Debian when postfix is installed. This is apparently
due to several factors:

1. postfix's sendmail wrapper calling postdrop directly,
2. postdrop requiring the ability to write to the postdrop group,
3. lockdown preventing the xfs_scrub_fail@ service from having postdrop in
the supplemental group list or the ability to run setgid programs

Item (3) could be solved by adding the whole service to the postdrop
group via SupplementalGroups=, but that will fail if postfix is not
installed and hence there is no postdrop group.

It could also be solved by forcing msmtp to be installed, bind mounting
msmtp into the service container, and injecting a config file that
instructs msmtp to connect to port 25, but that in turn isn't compatible
with systems not configured to allow an smtp server to listen on ::1.

So we'll go with the less restrictive approach that e2scrub_fail@ does,
which is to say that we just turn off all the sandboxing. :( :(

Reported-by: iustin@debian.org
Cc: linux-xfs@vger.kernel.org # v6.10.0
Fixes: 9042fcc08eed6a ("xfs_scrub_fail: tighten up the security on the background systemd service")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: do not propagate ENODATA disk errors into xattr code

Source kernel commit: ae668cd567a6a7622bc813ee0bb61c42bed61ba7

ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.

However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.

At worst, we may oops in xfs_attr_leaf_get() when we do:

	error = xfs_attr_leaf_hasname(args, &bp);
	if (error == -ENOATTR) {
		xfs_trans_brelse(args->trans, bp);
		return error;
	}

because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.

As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.

However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.

(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
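
The remapping idea can be sketched as below. remap_disk_error() is a
hypothetical helper; where exactly the real patch performs the remap is
not shown here:

```c
#include <errno.h>

/*
 * Filter an error coming back from a disk read before it reaches code that
 * treats ENODATA as "attribute not found". Errors use the negative errno
 * convention.
 */
static int
remap_disk_error(int error)
{
	/* A medium error must not masquerade as a missing attribute. */
	if (error == -ENODATA)
		return -EIO;
	return error;
}
```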

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: don't use a xfs_log_iovec for ri_buf in log recovery

Source kernel commit: ded74fddcaf685a9440c5612f7831d0c4c1473ca

ri_buf just holds a pointer/len pair and is not a log iovec used for
writing to the log. Switch to use a kvec instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

fs/xfs: replace strncpy with memtostr_pad()

Source kernel commit: f4a3f01e8e451fb3cb444a95a59964f4bc746902

Replace the deprecated strncpy() with memtostr_pad(). This also avoids
the need for separate zeroing using memset(). Mark the sb_fname buffer with
__nonstring, as its size is XFSLABEL_MAX and so there is no terminating NUL
for sb_fname.

Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: improve the xg_active_ref check in xfs_group_free

Source kernel commit: 59655147ec34fb72cc090ca4ee688ece05ffac56

Split up the XFS_IS_CORRUPT statement so that it immediately shows
if the reference counter overflowed or underflowed.

I ran into this quite a bit when developing the zoned allocator, and had
to reapply the patch for some work recently. We might as well just apply
it upstream given that freeing a group is far removed from performance
critical code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: return the allocated transaction from xfs_trans_alloc_empty

Source kernel commit: d8e1ea43e5a314bc01ec059ce93396639dcf9112

xfs_trans_alloc_empty can't return errors, so return the allocated
transaction directly instead of an output double pointer argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refactor xfs_btree_diff_two_ptrs() to take advantage of cmp_int()

Source kernel commit: ce6cce46aff79423f47680ee65e8f12191a50605

Use cmp_int() to yield the result of a three-way-comparison instead of
performing subtractions with extra casts. Thus also rename the function
to make its name clearer in purpose.
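
To illustrate why a three-way comparison is safer than a subtraction,
here is a small standalone sketch. cmp_int() is defined locally and is
assumed to behave like the kernel helper; the two functions are
hypothetical and do not reproduce any XFS routine:

```c
#include <stdint.h>

/* Three-way comparison yielding -1, 0 or 1. */
#define cmp_int(l, r)	(((l) > (r)) - ((l) < (r)))

/*
 * Subtraction with a cast: unsigned wraparound can produce the wrong
 * sign when the operands are far apart.
 */
static int64_t
key_diff_subtract(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b);
}

/* Comparison based: the result depends only on the ordering. */
static int
key_cmp(uint64_t a, uint64_t b)
{
	return cmp_int(a, b);
}
```

With a = 0 and b = UINT64_MAX, the subtraction wraps to 1 and so claims
a > b, whereas key_cmp() returns -1.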

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: use a proper variable name and type for storing a comparison result

Source kernel commit: 2717eb35185581988799bb0d5179409978f36a90

Perhaps that's just my silly imagination, but 'diff' doesn't look good as
the name of a variable holding the result of a three-way-comparison
(-1, 0, 1), which is what ->cmp_key_with_cur() returns. The name implies
an actual difference between the two integer variables, but that's no
longer true after the recent refactoring.

Declaring it as int64_t is also misleading now. Plain integer type is
more than enough.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refactor cmp_key_with_cur routines to take advantage of cmp_int()

Source kernel commit: 734b871d6cf7d4f815bb1eff8c808289079701c2

The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.

Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_key_with_cur routines from int64_t to int and make the
interface a bit clearer.

Found by Linux Verification Center (linuxtesting.org).

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refactor cmp_two_keys routines to take advantage of cmp_int()

Source kernel commit: 3b583adf55c649d5ba37bcd1ca87644b0bc10b86

The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.

Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_two_keys routines from int64_t to int and make the
interface a bit clearer.

Found by Linux Verification Center (linuxtesting.org).

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: rename key_diff routines

Source kernel commit: 82b63ee160016096436aa026a27c8d85d40f3fb1

key_diff routines compare a key value with a cursor value. Make the naming
a bit more self-descriptive.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: rename diff_two_keys routines

Source kernel commit: edce172444b4f489715a3df2e5d50893e74ce3da

One may think that diff_two_keys routines are used to compute the actual
difference between the arguments but they return a result of a
three-way-comparison of the passed operands. So it looks more appropriate
to denote them as cmp_two_keys.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

mkfs: fix copy-paste error in calculate_rtgroup_geometry

Fix this copy-paste error -- we should calculate the rt volume
concurrency either if the user gave us an explicit option, or if they
didn't but the rt volume is an SSD.

Cc: linux-xfs@vger.kernel.org # v6.13.0
Fixes: 34738ff0ee80de ("mkfs: allow sizing realtime allocation groups for concurrency")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: fix strerror_r usage yet again

In commit 75faf2bc907584, someone tried to fix scrub to use the POSIX
version of strerror_r so that the build would work with musl.
Unfortunately, neither the author nor myself remembered that GNU libc
imposes its own version any time _GNU_SOURCE is defined, which
builddefs.in always does. Regrettably, the POSIX and GNU versions have
different return types and the GNU version can return any random
pointer, so now this code is broken on glibc.

"Fix" this standards body own goal by casting the return value to
intptr_t and employing some gross heuristics to guess at the location of
the actual error string.

Fixes: 75faf2bc907584 ("xfs_scrub: Use POSIX-conformant strerror_r")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: A. Wilcox <AWilcox@Wilcox-Tech.com>

libfrog: pass mode to xfrog_file_setattr

xfs/633 crashes because rdump_fileattrs_path passes a NULL struct stat pointer
and then the fallback code dereferences it to get the file mode.
Instead, let's just pass the stat mode directly to it, because that's
the only piece of information that it needs.

Fixes: 128ac4dadbd633 ("xfs_db: use file_setattr to copy attributes on special files with rdump")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

mkfs: fix libxfs_iget return value sign inversion

libxfs functions return negative errno, so utilities must invert the
return values from such functions. Caught by xfs/437.
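
The convention can be sketched as follows. Both functions are
hypothetical stand-ins; lib_get_inode() merely imitates a library call
that returns 0 or a negative errno:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical library call: returns 0 or a negative errno. */
static int
lib_get_inode(unsigned long long ino)
{
	return ino == 0 ? -EINVAL : 0;
}

/* Utility-side caller: flip the sign so strerror() gets a positive errno. */
static int
tool_get_inode(unsigned long long ino)
{
	int	error = -lib_get_inode(ino);

	if (error)
		fprintf(stderr, "inode %llu: %s\n", ino, strerror(error));
	return error;
}
```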

Fixes: 8a4ea72724930c ("proto: add ability to populate a filesystem from a directory")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub: Use POSIX-conformant strerror_r

When building xfsprogs with musl libc, strerror_r returns int as
specified in POSIX. This differs from the glibc extension that returns
char*. Successful calls will return 0, which will be dereferenced as a
NULL pointer by (v)fprintf.

Signed-off-by: A. Wilcox <AWilcox@Wilcox-Tech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_db: use file_setattr to copy attributes on special files with rdump

rdump just skipped file attributes on special files as copying wasn't
possible. Let's use new file_getattr/file_setattr syscalls to copy
attributes even for special files.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_io: make ls/chattr work with special files

With the new file_getattr/file_setattr syscalls we can now list/change file
attributes on special files instead of ignoring them.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_quota: utilize file_setattr to set prjid on special files

Utilize new file_getattr/file_setattr syscalls to set project ID on
special files. Previously, special files were skipped due to the lack of a
way to call the FS_IOC_SETFSXATTR ioctl on them. The quota accounting was
therefore missing these inodes (special files created before project
setup). The ones created after project initialization did inherit the
projid flag from the parent.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libfrog: add wrappers for file_getattr/file_setattr syscalls

Add wrappers for new file_getattr/file_setattr inode syscalls which will
be used by xfs_quota and xfs_io.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

libfrog: Define STATX__RESERVED if not provided by the system

This define is not provided by musl libc. Use the fallback that is
already provided if statx and its types (tested on STATX_TYPE) are
not defined in the general case.

This fixes one cause for failing to compile against musl libc.

Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vaněk <arkamar@gentoo.org>

configure: Base NEED_INTERNAL_STATX on libc headers first

At compile time the libc headers are preferred, and linux/stat.h is
only included if the libc headers don't provide a definition for statx
and its types (tested on STATX_TYPE). The configure test should be
based on the same logic.

This fixes one cause for failing to compile against musl libc.

Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vaněk <arkamar@gentoo.org>

xfs_io: add FALLOC_FL_WRITE_ZEROES support

The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
fallocate(2). Add FALLOC_FL_WRITE_ZEROES support to the
fallocate utility by introducing a new 'fwzero' command in the xfs_io
tool.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfsprogs: fix utcnow deprecation warning in xfs_scrub_all.py

Running xfs_scrub_all under Python 3.13.5 prints the following warning:

----------------------------------------------
$ /usr/sbin/xfs_scrub_all --auto-media-scan-stamp \
   /var/lib/xfsprogs/xfs_scrub_all_media.stamp \
   --auto-media-scan-interval 1d
/usr/sbin/xfs_scrub_all:489: DeprecationWarning:
datetime.datetime.utcnow() is deprecated and scheduled for removal in a
future version. Use timezone-aware objects to represent datetimes in UTC:
datetime.datetime.now(datetime.UTC).
  dt = datetime.utcnow()
Automatically enabling file data scrub.
----------------------------------------------

Python documentation for context:
https://docs.python.org/3/library/datetime.html#datetime.datetime.utcnow

Fix this by using datetime.now() instead.

NB: Debian/13 ships Python 3.13.5 and has a xfs_scrub_all.timer active,
I'd assume that many systems will have that warning now in their logs :-)

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

Improve information about logbsize valid values

Valid values for logbsize depend on whether log_sunit is set
on the filesystem and on whether logbsize is manually set.

When manually set, logbsize must be one of the specified values:
32k to 256k inclusive, in power-of-two increments. The specified
value must also be a multiple of log_sunit.

The default configuration for v2 logs uses a relaxed restriction,
setting logbsize to log_sunit regardless of whether it is one of the valid
values, which also implicitly ignores the power-of-two restriction.

Instead of changing valid possible values for logbsize, increasing the
testing matrix and allowing users to use some dubious configuration,
just update the man page to describe this difference in behavior between
manually setting logbsize and leaving it to the defaults.

This was originally found by a user attempting to manually set
logbsize to the same value picked by the default configuration, only
to receive an error message as a result.
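
The rule for manually set values, as described above, can be sketched as
a small check. logbsize_manual_ok() is a hypothetical helper, both
arguments are assumed to be in bytes, and a log_sunit of 0 is assumed to
mean "not set":

```c
#include <stdbool.h>
#include <stdint.h>

#define LOGBSIZE_MIN	(32u * 1024)
#define LOGBSIZE_MAX	(256u * 1024)

static bool
logbsize_manual_ok(uint32_t logbsize, uint32_t log_sunit)
{
	/* Must lie within [32k, 256k]... */
	if (logbsize < LOGBSIZE_MIN || logbsize > LOGBSIZE_MAX)
		return false;
	/* ...and be a power of two. */
	if (logbsize & (logbsize - 1))
		return false;

	/* Must be a multiple of log_sunit when one is set. */
	if (log_sunit && logbsize % log_sunit)
		return false;
	return true;
}
```

For example, with a log_sunit of 96k the default configuration would pick
96k, yet logbsize_manual_ok(96 * 1024, 96 * 1024) is false because 96k is
not a power of two.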

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

proto: add ability to populate a filesystem from a directory

This patch implements the functionality to populate a newly created XFS
filesystem directly from an existing directory structure.

It reuses the existing protofile logic, branching if the input is a
directory.

The population process steps are as follows:
  - create the root inode before populating content
  - recursively process nested directories
  - handle regular files, directories, symlinks, char devices, block
    devices, sockets, fifos
  - preserve attributes (ownership, permissions)
  - preserve mtime timestamps from source files to maintain file history
    - use current time for atime/ctime/crtime
    - possible to specify atime=1 to preserve atime timestamps from
      source files
  - preserve extended attributes and fsxattrs for all file types
  - preserve hardlinks

At the moment, the implementation for the hardlink tracking is very
simple, as it involves a linear search.
In my local testing using larger source directories
(1.3 million inodes, ~400k hardlinks) the difference was actually
just a few seconds (given that most of the time is spent doing I/O).
We might want to revisit that in the future if this becomes a
bottleneck.

This functionality makes it easier to create populated filesystems
without having to mount them; it's particularly useful for
reproducible builds.

Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfsprogs: Release v6.16.0

Update all the necessary files for a v6.16.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

Document current limitation of shrinking fs

The current implementation in the kernel doesn't allow shrinking more than
one AG.

Signed-off-by: Xavier Claude <contact@xavierclaude.be>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

libxfs: update xfs_log_recover.h to kernel version as of Linux 6.16

None of this is used in userland, but it will make automatically
applying kernel changes much easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

move xfs_log_recover.h to libxfs/

xfs_log_recover.h is in fs/xfs/libxfs/ in the kernel tree, and thus the
libxfs-apply tool tries to apply changes to it in libxfs/ and fails
because the header is in include.

Move it to libxfs to make libxfs-apply work properly and to keep our
house in order.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

mkfs: require reflink for max_atomic_write option

For max_atomic_write option to be set, it means that the user wants to
support atomic writes up to that size.

However, to support this we must have reflink, so enforce that this is
available.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

misc: fix reversed calloc arguments

gcc 14 complains about reversed arguments to calloc:

namei.c: In function ‘path_parse’:
namei.c:51:32: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
51 | dirpath = calloc(sizeof(*dirpath), 1);
| ^
namei.c:51:32: note: earlier argument should specify number of elements, later size of each element

Fix all of these.
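
For reference, the corrected argument order looks like this. The struct
and path_alloc() are illustrative only and are not taken from namei.c:

```c
#include <stdlib.h>

/* Hypothetical structure standing in for the real one. */
struct dirpath {
	unsigned int	depth;
	char		**path;
};

static struct dirpath *
path_alloc(void)
{
	/* calloc(nmemb, size): element count first, element size second. */
	return calloc(1, sizeof(struct dirpath));
}
```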

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: don't allocate the xfs_extent_busy structure for zoned RTGs

Source kernel commit: 5948705adbf1a7afcecfe9a13ff39221ef61e16b

Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard. None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.

So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs. But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to be unnecessarily allocated, but can
also lead to memory leaks as the xg_busy_extents pointer in the
xfs_group structure is overlaid with the pointer for the linked list
of to be reset zones.

Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.

Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Christoph Hellwig <hch@lst.de>
[cem: Fix type and add stable tag]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: catch stale AGF/AGI metadata

Source kernel commit: db6a2274162de615ff74b927d38942fe3134d298

There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:

XFS: Assertion failed: pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks), file: fs/xfs/libxfs/xfs_alloc.c, line: 3440
.....
Call Trace:
<TASK>
xfs_alloc_read_agf+0x2ad/0x3a0
xfs_alloc_fix_freelist+0x280/0x720
xfs_alloc_vextent_prepare_ag+0x42/0x120
xfs_alloc_vextent_iterate_ags+0x67/0x260
xfs_alloc_vextent_start_ag+0xe4/0x1c0
xfs_bmapi_allocate+0x6fe/0xc90
xfs_bmapi_convert_delalloc+0x338/0x560
xfs_map_blocks+0x354/0x580
iomap_writepages+0x52b/0xa70
xfs_vm_writepages+0xd7/0x100
do_writepages+0xe1/0x2c0
__writeback_single_inode+0x44/0x340
writeback_sb_inodes+0x2d0/0x570
__writeback_inodes_wb+0x9c/0xf0
wb_writeback+0x139/0x2d0
wb_workfn+0x23e/0x4c0
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30

I've seen the AGI variant from scrub running on the filesystem
after unmount failed due to systemd interference:

XFS: Assertion failed: pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || xfs_is_shutdown(pag->pag_mount), file: fs/xfs/libxfs/xfs_ialloc.c, line: 2804
.....
Call Trace:
<TASK>
xfs_ialloc_read_agi+0xee/0x150
xchk_perag_drain_and_lock+0x7d/0x240
xchk_ag_init+0x34/0x90
xchk_inode_xref+0x7b/0x220
xchk_inode+0x14d/0x180
xfs_scrub_metadata+0x2e2/0x510
xfs_ioc_scrub_metadata+0x62/0xb0
xfs_file_ioctl+0x446/0xbf0
__se_sys_ioctl+0x6f/0xc0
__x64_sys_ioctl+0x1d/0x30
x64_sys_call+0x1879/0x2ee0
do_syscall_64+0x68/0x130
? exc_page_fault+0x62/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. Writes
are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.

If this happens at the same time as memory pressure is occurring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.

Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.

This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.

Then other operations on that inode hang because the task was killed
whilst holding inode locks. e.g:

Workqueue: xfs-conv/dm-12 xfs_end_io
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_preempt_disabled+0x15/0x30
rwsem_down_write_slowpath+0x31a/0x5f0
down_write+0x43/0x60
xfs_ilock+0x1a8/0x210
xfs_trans_alloc_inode+0x9c/0x240
xfs_iomap_write_unwritten+0xe3/0x300
xfs_end_ioend+0x90/0x130
xfs_end_io+0xce/0x100
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>

and it's all downhill from there.

Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.

Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.

Fix this by replacing the asserts with a corruption detection check
and a shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_scrub: remove EXPERIMENTAL warnings

The kernel code for online fsck has been stable for a year, and there
haven't been any major changes to the program in quite some time, so
let's drop the experimental warnings.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: allow users to configure the desired maximum atomic write size

Allow callers of mkfs.xfs to specify a desired maximum atomic write
size. This value will cause the log size to be adjusted to support
software atomic writes, and the AG size to be aligned to support
hardware atomic writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

mkfs: try to align AG size based on atomic write capabilities

Try to align the AG size to the maximum hardware atomic write unit so
that we can give users maximum flexibility in choosing an RWF_ATOMIC
write size.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

mkfs: autodetect log stripe unit for external log devices

If we're using an external log device and the caller doesn't give us a
lsunit, use the block device geometry (if present) to set it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

mkfs: don't complain about overly large auto-detected log stripe units

If mkfs declines to apply what it thinks is an overly large data device
stripe unit to the log device, it should only log a message about that
if the lsunit parameter was actually supplied by the caller. It should
not do that when the lsunit was autodetected from the block devices.

The cli parameters are zero-initialized in main and always have been.

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs_io: dump new atomic_write_unit_max_opt statx field

Dump the new atomic writes statx field that's being submitted for 6.16.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs_db: create an untorn_max subcommand

Create a debugger command to compute either the log reservation needed
to perform an untorn CoW write completion for a given number of blocks,
or the number of blocks that can be completed given a log reservation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

libfrog: move statx.h from io/ to libfrog/

Move this header file so we can use the declaration and wrappers
in other parts of xfsprogs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs: allow sysadmins to specify a maximum atomic write limit at mount time

Source kernel commit: 4528b9052731f14c1a9be16b98e33c9401e6d1bc

Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.

The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size).  We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.

The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has been for the past few commits.  Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs: add xfs_calc_atomic_write_unit_max()

Source kernel commit: 0c438dcc31504bf4f50b20dc52f8f5ca7fab53e2

Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.

The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
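
For illustration, the greatest power-of-two factor of a block count can
be found by isolating its lowest set bit.  The function name below is
made up for this sketch and is not the helper used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the largest power of two that divides agsize evenly.  The
 * lowest set bit of a nonzero value is exactly that factor; it is
 * isolated by ANDing the value with its two's complement negation.
 */
uint32_t
largest_pow2_factor(uint32_t agsize)
{
	assert(agsize != 0);
	return agsize & (~agsize + 1);
}
```

An agsize of 96 blocks yields 32, so atomic writes on such a file system
would be capped at 32 blocks by this rule, while an agsize that is
already a power of two is returned unchanged.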

Function xfs_atomic_write_logitems() is added to find the limit on the
number of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.

[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>

libxfs: add helpers to compute log item overhead

Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: commit CoW-based atomic writes atomically

Source kernel commit: b1e09178b73adf10dc87fba9aee7787a7ad26874

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extent unmaps, which is quite
small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>

xfs: allow block allocator to take an alignment hint

Source kernel commit: 6baf4cc47a741024d37e6149d5d035d3fc9ed1fe

Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.

This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>

xfs: add helpers to compute transaction reservation for finishing intent items

Source kernel commit: 805f89881252a9aee30799b8a395deec79c13414

In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>

xfsprogs: Release v6.15.0

Update all the necessary files for a v6.15.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_mdrestore: don't allow restoring onto zoned block devices

The way mdrestore works is not very amenable to zoned devices.  The code
that checks the device size tries to write to the highest offset, which
doesn't match the write pointer of a clean zone device.  And while that
is relatively easily fixable, the metadata for each RTG records the
highest written offset, and the mount code compares that to the hardware
write pointer, which will mismatch.  This could be fixed by using write
zeroes to pad the RTG until the expected write pointer, but this turns
the quick metadata operation that mdrestore is supposed to be into
something that could take hours on HDD.

So instead error out when someone tries to mdrestore onto a zoned device
to clearly document that this won't work.  Doing a mdrestore into a file
still works perfectly fine, and we might look into a new mdrestore option
to restore into a set of files suitable for the zoned loop device driver
to make mdrestore fully usable for debugging.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

man: adjust description of the statx manpage

Amend the manpage description of how the lack of statx -m options works.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs: remove the flags argument to xfs_buf_get_uncached

Source kernel commit: b3f8f2903b8cd48b0746bf05a40b85ae4b684034

No caller passes flags to xfs_buf_get_uncached, which makes sense
given that the flags apply to behavior not used for uncached buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: kill XBF_UNMAPPED

Source kernel commit: a2f790b28512c22f7cf4f420a99e1b008e7098fe

Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_protofile: fix permission octet when suid/sgid is set

When encountering suid or sgid files, we already set the `u` or `g`
property in the prototype file.
Given that proto.c only supports three octal digits for permissions, we
need to remove the redundant information from the permission, otherwise
it is parsed incorrectly.
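
A rough C sketch of the idea follows; xfs_protofile itself is a separate
script, and format_proto_mode() is a hypothetical helper that only
handles regular files.  The setuid and setgid bits are reported through
the `u` and `g` characters and masked out of the octal digits:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Format a protofile mode string for a regular file: a '-' type
 * character, 'u' or '-' for setuid, 'g' or '-' for setgid, and exactly
 * three octal permission digits.  Masking with 0777 drops the
 * setuid/setgid bits that are already expressed by the letters.
 */
void
format_proto_mode(mode_t mode, char *buf, size_t len)
{
	assert(len >= 7);	/* six characters plus the terminating NUL */
	snprintf(buf, len, "-%c%c%03o",
		 (mode & S_ISUID) ? 'u' : '-',
		 (mode & S_ISGID) ? 'g' : '-',
		 (unsigned int)(mode & 0777));
}
```

Without the mask, a mode of 04755 would be printed with four octal
digits, which a three-digit parser cannot read correctly.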

Co-authored-by: Luca Di Maio <luca.dimaio1@gmail.com>
Co-authored-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
[aalbersh: 68 chars limit and removed patch review revisions]

xfs_repair: fix libxfs abstraction mess

Do some xfs -> libxfs callsite conversions to shut up xfs/437.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_growfs: support internal RT devices

Allow RT growfs when rtstart is set in the geometry, and adjust the
queried size for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mdrestore: support internal RT devices

Calculate the size properly for internal RT devices and skip restoring
to the external one for this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_scrub: support internal RT device

Handle the synthetic fmr_device values, and deal with the fact that
ctx->fsinfo.fs_rt is allowed to be non-NULL for internal RT devices as
it is the same as the data device in this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_spaceman: handle internal RT devices

Handle the synthetic fmr_device values for fsmap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: handle internal RT devices in fsmap output

Deal with the synthetic fmr_device values and the rt device offset when
calculating RG numbers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: don't re-query fs_path information in fsmap_f

Reuse the information stashed in "file" instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: correctly report RGs with internal rt dev in bmap output

Apply the proper offset. Somehow this made gcc complain about
possible overflowing abuf, so increase the size for that as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

man: document XFS_FSOP_GEOM_FLAGS_ZONED

Document the new zoned feature flag and the two new fields added
with it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: document the new zoned options in the man page

Add documentation for the zoned file system specific options.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: reflink conflicts with zoned file systems for now

Don't allow reflink on zoned file systems until garbage collection learns
how to deal with shared extents and doesn't blindly unshare them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: default to rtinherit=1 for zoned file systems

Zoned file systems are intended to use sequential write required zones
(or areas treated as such) for the main data store, and usually use the
data device only for metadata that requires random writes.

rtinherit=1 is the way to achieve that, so enable it by default, but
still allow the user to override it if needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[aalbersh remove accidental "inherit" word in commit desc.]

xfs_mkfs: calculate zone overprovisioning when specifying size

When size is specified for zoned file systems, calculate the required
over provisioning to back the requested capacity.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: support creating file system with zoned RT devices

To create file systems with a zoned RT device, query the hardware
zone information to align the RT groups to it, and create an internal
RT device if the device has conventional and sequential write required
zones.

Default to use all sequential write required zones for the RT device if
there are sequential write required zones.

Default to 256 and 1% conventional when -r zoned is specified without
further option and there are no sequential write required zones. This
mimics a SMR HDD and works well with tests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: factor out a validate_rtgroup_geometry helper

Factor out the rtgroup geometry checks so that they can be easily reused
for the upcoming zoned RT allocator support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: validate rt groups vs reported hardware zones

Run a report zones ioctl, and verify the rt group state vs the
reported hardware zone state. Note that there is no way to actually
fix up any discrepancies here, as that would be rather scary without
having transactions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: fix the RT device check in process_dinode_int

Don't look at the variable for the rtname command line option, but
the actual file system geometry.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: support repairing zoned file systems

Not really much to do here. Mostly ignore the validation and
regeneration of the bitmap and summary inodes. Eventually this
could grow a bit of validation of the hardware zone state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libfrog: report the zoned geometry

The rtdev_name helper is based on example code posted by Darrick Wong.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: support zone gaps

Source kernel commit: 97c69ba1c08d5a2bb3cacecae685b63e20e4d485

Zoned devices can have gaps between the end of the usable capacity of a
zone and the end of the zone in the LBA/daddr address space. In other
words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us. In this case the sparse FSB/RTB address space maps 1:1
to the device address space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: enable the zoned RT device feature

Source kernel commit: be458049ffe32b5885c5c35b12997fd40c2986c4

Enable the zoned RT device directory feature. With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks. This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: enable fsmap reporting for internal RT devices

Source kernel commit: e50ec7fac81aa271f20ae09868f772ff43a240b0

File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs. To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: implement zoned garbage collection

Source kernel commit: 080d01c41d44f0993f2c235a6bfdb681f0a66be6

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To find zones to empty, a simple set of 10 buckets
based on the amount of space used in the zone is used.  To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.
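
One way to picture the bucketing is the sketch below.  The names and the
exact mapping from used space to bucket are assumptions made for this
example; the kernel code may divide the range differently.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_USED_BUCKETS	10u	/* assumed: ten equal-width buckets */

/*
 * Map a zone's used block count to a bucket index in 0..9.  Lower
 * buckets hold emptier zones, which are cheaper to garbage collect
 * because less live data has to be moved out of them.
 */
unsigned int
toy_zone_bucket(uint64_t used_blocks, uint64_t zone_blocks)
{
	unsigned int	bucket;

	assert(zone_blocks > 0 && used_blocks <= zone_blocks);
	bucket = (unsigned int)(used_blocks * TOY_USED_BUCKETS / zone_blocks);

	/* A completely full zone would land in bucket 10; clamp it. */
	return bucket < TOY_USED_BUCKETS ? bucket : TOY_USED_BUCKETS - 1;
}
```

A GC victim search under this model would scan the buckets from index 0
upwards and pick a zone from the first non-empty bucket.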

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it changed.
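
The ordering described above can be sketched with qsort().  The
toy_rmap structure is a simplified stand-in invented for this example,
not the kernel's rmap record.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for a reverse mapping record. */
struct toy_rmap {
	uint64_t	owner;		/* inode number owning the extent */
	uint64_t	offset;		/* logical offset within the file */
	uint64_t	startblock;	/* current physical location */
};

/* Order by owning inode first, then by logical file offset. */
static int
toy_rmap_cmp(const void *a, const void *b)
{
	const struct toy_rmap	*ra = a;
	const struct toy_rmap	*rb = b;

	if (ra->owner != rb->owner)
		return ra->owner < rb->owner ? -1 : 1;
	if (ra->offset != rb->offset)
		return ra->offset < rb->offset ? -1 : 1;
	return 0;
}

/* Sort records so each file's extents are relocated in file order. */
void
toy_sort_rmaps(struct toy_rmap *recs, size_t nr)
{
	assert(recs != NULL || nr == 0);
	qsort(recs, nr, sizeof(*recs), toy_rmap_cmp);
}
```

Writing the sorted records out sequentially places each file's blocks
next to each other in the target zone, which is where the
defragmentation effect comes from.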

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add support for zoned space reservations

Source kernel commit: 0bb2193056b5969e4148fc0909e89a5362da873e

For zoned file systems, garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock can
deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available it wakes and waits for GC.  This is
done using a list of on-stack reservations to ensure fairness.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>