git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfsprogs: Release v6.17.0

Update all the necessary files for a v6.17.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub_fail: reduce security lockdowns to avoid postfix problems

Iustin Pop reports that the xfs_scrub_fail service fails to email
problem reports on Debian when postfix is installed. This is apparently
due to several factors:

1. postfix's sendmail wrapper calling postdrop directly,
2. postdrop requiring the ability to write to the postdrop group,
3. lockdown preventing the xfs_scrub_fail@ service to have postdrop in
the supplemental group list or the ability to run setgid programs

Item (3) could be solved by adding the whole service to the postdrop
group via SupplementalGroups=, but that will fail if postfix is not
installed and hence there is no postdrop group.

It could also be solved by forcing msmtp to be installed, bind mounting
msmtp into the service container, and injecting a config file that
instructs msmtp to connect to port 25, but that in turn isn't compatible
with systems not configured to allow an smtp server to listen on ::1.

So we'll go with the less restrictive approach that e2scrub_fail@ does,
which is to say that we just turn off all the sandboxing. :( :(

Reported-by: iustin@debian.org
Cc: linux-xfs@vger.kernel.org # v6.10.0
Fixes: 9042fcc08eed6a ("xfs_scrub_fail: tighten up the security on the background systemd service")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: do not propagate ENODATA disk errors into xattr code

Source kernel commit: ae668cd567a6a7622bc813ee0bb61c42bed61ba7

ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.

However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.

At worst, we may oops in xfs_attr_leaf_get() when we do:

error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}

because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.

As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.

However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.

(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: don't use a xfs_log_iovec for ri_buf in log recovery

Source kernel commit: ded74fddcaf685a9440c5612f7831d0c4c1473ca

ri_buf just holds a pointer/len pair and is not a log iovec used for
writing to the log. Switch to use a kvec instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

fs/xfs: replace strncpy with memtostr_pad()

Source kernel commit: f4a3f01e8e451fb3cb444a95a59964f4bc746902

Replace the deprecated strncpy() with memtostr_pad(). This also avoids
the need for separate zeroing using memset(). Mark sb_fname buffer with
__nonstring as its size is XFSLABEL_MAX and so no terminating NULL for
sb_fname.

Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: improve the xg_active_ref check in xfs_group_free

Source kernel commit: 59655147ec34fb72cc090ca4ee688ece05ffac56

Split up the XFS_IS_CORRUPT statement so that it immediately shows
if the reference counter overflowed or underflowed.

I ran into this quite a bit when developing the zoned allocator, and had
to reapply the patch for some work recently. We might as well just apply
it upstream given that freeing group is far removed from performance
critical code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: return the allocated transaction from xfs_trans_alloc_empty

Source kernel commit: d8e1ea43e5a314bc01ec059ce93396639dcf9112

xfs_trans_alloc_empty can't return errors, so return the allocated
transaction directly instead of an output double pointer argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refactor xfs_btree_diff_two_ptrs() to take advantage of cmp_int()

Source kernel commit: ce6cce46aff79423f47680ee65e8f12191a50605

Use cmp_int() to yield the result of a three-way-comparison instead of
performing subtractions with extra casts. Thus also rename the function
to make its name clearer in purpose.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: use a proper variable name and type for storing a comparison result

Source kernel commit: 2717eb35185581988799bb0d5179409978f36a90

Perhaps that's just my silly imagination but 'diff' doesn't look good for
the name of a variable to hold a result of a three-way-comparison
(-1, 0, 1) which is what ->cmp_key_with_cur() does. It implies to contain
an actual difference between the two integer variables but that's not true
anymore after recent refactoring.

Declaring it as int64_t is also misleading now. Plain integer type is
more than enough.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refactor cmp_key_with_cur routines to take advantage of cmp_int()

Source kernel commit: 734b871d6cf7d4f815bb1eff8c808289079701c2

The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.

Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_key_with_cur routines from int64_t to int and make the
interface a bit clearer.

Found by Linux Verification Center (linuxtesting.org).

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refactor cmp_two_keys routines to take advantage of cmp_int()

Source kernel commit: 3b583adf55c649d5ba37bcd1ca87644b0bc10b86

The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.

Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_two_keys routines from int64_t to int and make the
interface a bit clearer.

Found by Linux Verification Center (linuxtesting.org).

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: rename key_diff routines

Source kernel commit: 82b63ee160016096436aa026a27c8d85d40f3fb1

key_diff routines compare a key value with a cursor value. Make the naming
to be a bit more self-descriptive.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: rename diff_two_keys routines

Source kernel commit: edce172444b4f489715a3df2e5d50893e74ce3da

One may think that diff_two_keys routines are used to compute the actual
difference between the arguments but they return a result of a
three-way-comparison of the passed operands. So it looks more appropriate
to denote them as cmp_two_keys.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

mkfs: fix copy-paste error in calculate_rtgroup_geometry

Fix this copy-paste error -- we should calculate the rt volume
concurrency either if the user gave us an explicit option, or if they
didn't but the rt volume is an SSD.

Cc: linux-xfs@vger.kernel.org # v6.13.0
Fixes: 34738ff0ee80de ("mkfs: allow sizing realtime allocation groups for concurrency")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: fix strerror_r usage yet again

In commit 75faf2bc907584, someone tried to fix scrub to use the POSIX
version of strerror_r so that the build would work with musl.
Unfortunately, neither the author nor myself remembered that GNU libc
imposes its own version any time _GNU_SOURCE is defined, which
builddefs.in always does. Regrettably, the POSIX and GNU versions have
different return types and the GNU version can return any random
pointer, so now this code is broken on glibc.

"Fix" this standards body own goal by casting the return value to
intptr_t and employing some gross heuristics to guess at the location of
the actual error string.

Fixes: 75faf2bc907584 ("xfs_scrub: Use POSIX-conformant strerror_r")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: A. Wilcox <AWilcox@Wilcox-Tech.com>

libfrog: pass mode to xfrog_file_setattr

xfs/633 crashes rdump_fileattrs_path passes a NULL struct stat pointer
and then the fallback code dereferences it to get the file mode.
Instead, let's just pass the stat mode directly to it, because that's
the only piece of information that it needs.

Fixes: 128ac4dadbd633 ("xfs_db: use file_setattr to copy attributes on special files with rdump")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

mkfs: fix libxfs_iget return value sign inversion

libxfs functions return negative errno, so utilities must invert the
return values from such functions. Caught by xfs/437.

Fixes: 8a4ea72724930c ("proto: add ability to populate a filesystem from a directory")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_scrub: Use POSIX-conformant strerror_r

When building xfsprogs with musl libc, strerror_r returns int as
specified in POSIX. This differs from the glibc extension that returns
char*. Successful calls will return 0, which will be dereferenced as a
NULL pointer by (v)fprintf.

Signed-off-by: A. Wilcox <AWilcox@Wilcox-Tech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_db: use file_setattr to copy attributes on special files with rdump

rdump just skipped file attributes on special files as copying wasn't
possible. Let's use new file_getattr/file_setattr syscalls to copy
attributes even for special files.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_io: make ls/chattr work with special files

With new file_getattr/file_setattr syscalls we can now list/change file
attributes on special files instead for ignoring them.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_quota: utilize file_setattr to set prjid on special files

Utilize new file_getattr/file_setattr syscalls to set project ID on
special files. Previously, special files were skipped due to lack of the
way to call FS_IOC_SETFSXATTR ioctl on them. The quota accounting was
therefore missing these inodes (special files created before project
setup). The ones created after project initialization did inherit the
projid flag from the parent.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libfrog: add wrappers for file_getattr/file_setattr syscalls

Add wrappers for new file_getattr/file_setattr inode syscalls which will
be used by xfs_quota and xfs_io.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

libfrog: Define STATX__RESERVED if not provided by the system

This define is not provided by musl libc. Use the fallback that is
already provided if statx and its types (tested on STATX_TYPE) are
not defined in the general case.

This fixes one cause for failing to compile against musl libc.

Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vaněk <arkamar@gentoo.org>

configure: Base NEED_INTERNAL_STATX on libc headers first

At compile time the libc headers are preferred, and linux/stat.h is
only included if the libc headers don't provide a definition for statx
and its types (tested on STATX_TYPE). The configure test should be
based on the same logic.

This fixes one cause for failing to compile against musl libc.

Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vaněk <arkamar@gentoo.org>

xfs_io: add FALLOC_FL_WRITE_ZEROES support

The Linux kernel (since version 6.17) supports FALLOC_FL_WRITE_ZEROES in
fallocate(2). Add support for FALLOC_FL_WRITE_ZEROES support to the
fallocate utility by introducing a new 'fwzero' command in the xfs_io
tool.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=278c7d9b5e0c
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfsprogs: fix utcnow deprecation warning in xfs_scrub_all.py

Running xfs_scrub_all under Python 3.13.5 prints the following warning:

----------------------------------------------
$ /usr/sbin/xfs_scrub_all --auto-media-scan-stamp \
   /var/lib/xfsprogs/xfs_scrub_all_media.stamp \
   --auto-media-scan-interval 1d
/usr/sbin/xfs_scrub_all:489: DeprecationWarning:
datetime.datetime.utcnow() is deprecated and scheduled for removal in a
future version. Use timezone-aware objects to represent datetimes in UTC:
datetime.datetime.now(datetime.UTC).
  dt = datetime.utcnow()
Automatically enabling file data scrub.
----------------------------------------------

Python documentation for context:
https://docs.python.org/3/library/datetime.html#datetime.datetime.utcnow

Fix this by using datetime.now() instead.

NB: Debian/13 ships Python 3.13.5 and has a xfs_scrub_all.timer active,
I'd assume that many systems will have that warning now in their logs :-)

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

Improve information about logbsize valid values

Valid values for logbsize depends on whether log_sunit is set
on the filesystem or not and if logbsize is manually set or not.

When manually set, logbsize must be one of the speficied values -
32k to 256k inclusive in power-of-to increments. And, the specified
value must also be a multiple of log_sunit.

The default configuration for v2 logs uses a relaxed restriction,
setting logbsize to log_sunit, independent if it is one of the valid
values or not - also implicitly ignoring the power of two restriction.

Instead of changing valid possible values for logbsize, increasing the
testing matrix and allowing users to use some dubious configuration,
just update the man page to describe this difference in behavior when
manually setting logbsize or leave it to defaults.

This has originally been found by an user attempting to manually set
logbsize to the same value picked by the default configuration just so
to receive an error message as result.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

proto: add ability to populate a filesystem from a directory

This patch implements the functionality to populate a newly created XFS
filesystem directly from an existing directory structure.

It resuses existing protofile logic, it branches if input is a
directory.

The population process steps are as follows:
  - create the root inode before populating content
  - recursively process nested directories
  - handle regular files, directories, symlinks, char devices, block
    devices, sockets, fifos
  - preserve attributes (ownership, permissions)
  - preserve mtime timestamps from source files to maintain file history
    - use current time for atime/ctime/crtime
    - possible to specify atime=1 to preserve atime timestamps from
      source files
  - preserve extended attributes and fsxattrs for all file types
  - preserve hardlinks

At the moment, the implementation for the hardlink tracking is very
simple, as it involves a linear search.
from my local testing using larger source directories
(1.3mln inodes, ~400k hardlinks) the difference was actually
just a few seconds (given that most of the time is doing i/o).
We might want to revisit that in the future if this becomes a
bottleneck.

This functionality makes it easier to create populated filesystems
without having to mount them, it's particularly useful for
reproducible builds.

Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfsprogs: Release v6.16.0

Update all the necessary files for a v6.16.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

Document current limitation of shrinking fs

Current implementation in the kernel doesn't allow to shrink more that
one AG

Signed-off-by: Xavier Claude <contact@xavierclaude.be>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

libxfs: update xfs_log_recover.h to kernel version as of Linux 6.16

None of this is used in userland, but it will make automatically
applying kernel changes much easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

move xfs_log_recover.h to libxfs/

xfs_log_recover.h is in fs/xfs/libxfs/ in the kernel tree, and thus the
libxfs-apply tool tries to apply changes to it in libxfs/ and fails
because the header is in include.

Move it to libxfs to make libxfs-apply work properly and to keep our
house in order.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

mkfs: require reflink for max_atomic_write option

For max_atomic_write option to be set, it means that the user wants to
support atomic writes up to that size.

However, to support this we must have reflink, so enforce that this is
available.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

misc: fix reversed calloc arguments

gcc 14 complains about reversed arguments to calloc:

namei.c: In function ‘path_parse’:
namei.c:51:32: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
51 | dirpath = calloc(sizeof(*dirpath), 1);
| ^
namei.c:51:32: note: earlier argument should specify number of elements, later size of each element

Fix all of these.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: don't allocate the xfs_extent_busy structure for zoned RTGs

Source kernel commit: 5948705adbf1a7afcecfe9a13ff39221ef61e16b

Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard. None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.

So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs. But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to unnecessarily allocated, but can
also lead to memory leaks as the xg_busy_extents pointer in the
xfs_group structure is overlayed with the pointer for the linked list
of to be reset zones.

Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.

Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Christoph Hellwig <hch@lst.de>
[cem: Fix type and add stable tag]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: catch stale AGF/AGF metadata

Source kernel commit: db6a2274162de615ff74b927d38942fe3134d298

There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:

XFS: Assertion failed: pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks), file: fs/xfs/libxfs/xfs_alloc.c, line: 3440
.....
Call Trace:
<TASK>
xfs_alloc_read_agf+0x2ad/0x3a0
xfs_alloc_fix_freelist+0x280/0x720
xfs_alloc_vextent_prepare_ag+0x42/0x120
xfs_alloc_vextent_iterate_ags+0x67/0x260
xfs_alloc_vextent_start_ag+0xe4/0x1c0
xfs_bmapi_allocate+0x6fe/0xc90
xfs_bmapi_convert_delalloc+0x338/0x560
xfs_map_blocks+0x354/0x580
iomap_writepages+0x52b/0xa70
xfs_vm_writepages+0xd7/0x100
do_writepages+0xe1/0x2c0
__writeback_single_inode+0x44/0x340
writeback_sb_inodes+0x2d0/0x570
__writeback_inodes_wb+0x9c/0xf0
wb_writeback+0x139/0x2d0
wb_workfn+0x23e/0x4c0
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30

I've seen the AGI variant from scrub running on the filesysetm
after unmount failed due to systemd interference:

XFS: Assertion failed: pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || xfs_is_shutdown(pag->pag_mount), file: fs/xfs/libxfs/xfs_ialloc.c, line: 2804
.....
Call Trace:
<TASK>
xfs_ialloc_read_agi+0xee/0x150
xchk_perag_drain_and_lock+0x7d/0x240
xchk_ag_init+0x34/0x90
xchk_inode_xref+0x7b/0x220
xchk_inode+0x14d/0x180
xfs_scrub_metadata+0x2e2/0x510
xfs_ioc_scrub_metadata+0x62/0xb0
xfs_file_ioctl+0x446/0xbf0
__se_sys_ioctl+0x6f/0xc0
__x64_sys_ioctl+0x1d/0x30
x64_sys_call+0x1879/0x2ee0
do_syscall_64+0x68/0x130
? exc_page_fault+0x62/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. Writes
are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.

If this happens at the same time as memory pressure is occuring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.

Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.

This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.

Then other operations on that inode hang because the task was killed
whilst holding inode locks. e.g:

Workqueue: xfs-conv/dm-12 xfs_end_io
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_preempt_disabled+0x15/0x30
rwsem_down_write_slowpath+0x31a/0x5f0
down_write+0x43/0x60
xfs_ilock+0x1a8/0x210
xfs_trans_alloc_inode+0x9c/0x240
xfs_iomap_write_unwritten+0xe3/0x300
xfs_end_ioend+0x90/0x130
xfs_end_io+0xce/0x100
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>

and it's all down hill from there.

Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.

Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.

Fix this by replacing the asserts with a corruption detection check
and a shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_scrub: remove EXPERIMENTAL warnings

The kernel code for online fsck has been stable for a year, and there
haven't been any major changes to the program in quite some time, so
let's drop the experimental warnings.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: allow users to configure the desired maximum atomic write size

Allow callers of mkfs.xfs to specify a desired maximum atomic write
size. This value will cause the log size to be adjusted to support
software atomic writes, and the AG size to be aligned to support
hardware atomic writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

mkfs: try to align AG size based on atomic write capabilities

Try to align the AG size to the maximum hardware atomic write unit so
that we can give users maximum flexibility in choosing an RWF_ATOMIC
write size.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

mkfs: autodetect log stripe unit for external log devices

If we're using an external log device and the caller doesn't give us a
lsunit, use the block device geometry (if present) to set it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

mkfs: don't complain about overly large auto-detected log stripe units

If mkfs declines to apply what it thinks is an overly large data device
stripe unit to the log device, it should only log a message about that
if the lsunit parameter was actually supplied by the caller. It should
not do that when the lsunit was autodetected from the block devices.

The cli parameters are zero-initialized in main and always have been.

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs_io: dump new atomic_write_unit_max_opt statx field

Dump the new atomic writes statx field that's being submitted for 6.16.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs_db: create an untorn_max subcommand

Create a debugger command to compute the either the logres needed to
perform an untorn cow write completion for a given number of blocks; or
the number of blocks that can be completed given a log reservation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

libfrog: move statx.h from io/ to libfrog/

Move this header file so we can use the declaration and wrappers
in other parts of xfsprogs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs: allow sysadmins to specify a maximum atomic write limit at mount time

Source kernel commit: 4528b9052731f14c1a9be16b98e33c9401e6d1bc

Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.

The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size).  We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.

The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has for the past few commits.  Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs: add xfs_calc_atomic_write_unit_max()

Source kernel commit: 0c438dcc31504bf4f50b20dc52f8f5ca7fab53e2

Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.

The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

Function xfs_atomic_write_logitems() is added to find the limit the number
of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.

[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>

libxfs: add helpers to compute log item overhead

Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs: commit CoW-based atomic writes atomically

Source kernel commit: b1e09178b73adf10dc87fba9aee7787a7ad26874

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the amount of log intent items which can be
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extents unmaps, which is quite
small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>

xfs: allow block allocator to take an alignment hint

Source kernel commit: 6baf4cc47a741024d37e6149d5d035d3fc9ed1fe

Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.

This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>

xfs: add helpers to compute transaction reservation for finishing intent items

Source kernel commit: 805f89881252a9aee30799b8a395deec79c13414

In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>

xfsprogs: Release v6.15.0

Update all the necessary files for a v6.15.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_mdrestore: don't allow restoring onto zoned block devices

The way mdrestore works is not very amendable to zone devices.  The code
that checks the device size tries to write to the highest offset, which
doesn't match the write pointer of a clean zone device.  And while that
is relatively easily fixable, the metadata for each RTG records the
highest written offset, and the mount code compares that to the hardware
write pointer, which will mismatch.  This could be fixed by using write
zeroes to pad the RTG until the expected write pointer, but this turns
the quick metadata operation that mdrestore is supposed to be into
something that could take hours on HDD.

So instead error out when someone tries to mdrestore onto a zoned device
to clearly document that this won't work.  Doing a mdrestore into a file
still works perfectly fine, and we might look into a new mdrestore option
to restore into a set of files suitable for the zoned loop device driver
to make mdrestore fully usable for debugging.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

man: adjust description of the statx manpage

Amend the manpage description of how the lack of statx -m options work.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs: remove the flags argument to xfs_buf_get_uncached

Source kernel commit: b3f8f2903b8cd48b0746bf05a40b85ae4b684034

No callers passes flags to xfs_buf_get_uncached, which makes sense
given that the flags apply to behavior not used for uncached buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: kill XBF_UNMAPPED

Source kernel commit: a2f790b28512c22f7cf4f420a99e1b008e7098fe

Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_protofile: fix permission octet when suid/guid is set

When encountering suid or sgid files, we already set the `u` or `g`
property in the prototype file.
Given that proto.c only supports three numbers for permissions, we
need to remove the redundant information from the permission, else
it was incorrectly parsed.

Co-authored-by: Luca Di Maio <luca.dimaio1@gmail.com>
Co-authored-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Luca Di Maio <luca.dimaio1@gmail.com>
[aalbersh: 68 chars limit and removed patch review revisions]

xfs_repair: fix libxfs abstraction mess

Do some xfs -> libxfs callsite conversions to shut up xfs/437.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_growfs: support internal RT devices

Allow RT growfs when rtstart is set in the geomety, and adjust the
queried size for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mdrestore: support internal RT devices

Calculate the size properly for internal RT devices and skip restoring
to the external one for this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_scrub: support internal RT device

Handle the synthetic fmr_device values, and deal with the fact that
ctx->fsinfo.fs_rt is allowed to be non-NULL for internal RT devices as
it is the same as the data device in this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_spaceman: handle internal RT devices

Handle the synthetic fmr_device values for fsmap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: handle internal RT devices in fsmap output

Deal with the synthetic fmr_device values and the rt device offset when
calculating RG numbers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: don't re-query fs_path information in fsmap_f

Reuse the information stashed in "file" instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: correctly report RGs with internal rt dev in bmap output

Apply the proper offset. Somehow this made gcc complain about
possible overflowing abuf, so increase the size for that as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

man: document XFS_FSOP_GEOM_FLAGS_ZONED

Document the new zoned feature flag and the two new fields added
with it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: document the new zoned options in the man page

Add documentation for the zoned file system specific options.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: reflink conflicts with zoned file systems for now

Don't allow reflink on zoned file system until garbage collections learns
how to deal with shared extents and doesn't blindly unshare them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: default to rtinherit=1 for zoned file systems

Zone file systems are intended to use sequential write required zones
(or areas treated as such) for the main data store. And usually use the
data device only for metadata that requires random writes.

rtinherit=1 is the way to achieve that, so enabled it by default, but
still allow the user to override it if needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[aalbersh remove accidental "inherit" word in commit desc.]

xfs_mkfs: calculate zone overprovisioning when specifying size

When size is specified for zoned file systems, calculate the required
over provisioning to back the requested capacity.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: support creating file system with zoned RT devices

To create file systems with a zoned RT device, query the hardware
zone information to align the RT groups to it, and create an internal
RT device if the device has conventional and sequential write required
zones.

Default to use all sequential write required zoned for the RT device if
there are sequential write required zones.

Default to 256 and 1% conventional when -r zoned is specified without
further option and there are no sequential write required zones. This
mimics a SMR HDD and works well with tests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_mkfs: factor out a validate_rtgroup_geometry helper

Factor out the rtgroup geometry checks so that they can be easily reused
for the upcoming zoned RT allocator support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: validate rt groups vs reported hardware zones

Run a report zones ioctl, and verify the rt group state vs the
reported hardware zone state. Note that there is no way to actually
fix up any discrepancies here, as that would be rather scary without
having transactions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: fix the RT device check in process_dinode_int

Don't look at the variable for the rtname command line option, but
the actual file system geometry.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: support repairing zoned file systems

Note really much to do here. Mostly ignore the validation and
regeneration of the bitmap and summary inodes. Eventually this
could grow a bit of validation of the hardware zone state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libfrog: report the zoned geometry

The rtdev_name helper is based on example code posted by Darrick Wong.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: support zone gaps

Source kernel commit: 97c69ba1c08d5a2bb3cacecae685b63e20e4d485

Zoned devices can have gaps beyond the usable capacity of a zone and the
end in the LBA/daddr address space. In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us. In this case the sparse FSB/RTB address space maps 1:1
to the device address space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: enable the zoned RT device feature

Source kernel commit: be458049ffe32b5885c5c35b12997fd40c2986c4

Enable the zoned RT device directory feature. With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks. This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: enable fsmap reporting for internal RT devices

Source kernel commit: e50ec7fac81aa271f20ae09868f772ff43a240b0

File system with internal RT devices are a bit odd in that we need
to report AGs and RGs. To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: implement zoned garbage collection

Source kernel commit: 080d01c41d44f0993f2c235a6bfdb681f0a66be6

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To find empty zone a simple set of 10 buckets
based on the amount of space used in the zone is used.  To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it change.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add support for zoned space reservations

Source kernel commit: 0bb2193056b5969e4148fc0909e89a5362da873e

For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock can
deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available it wakes and waits for GC.  This is
done using a list of on-stack reservations to ensure fairness.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add the zoned space allocator

Source kernel commit: 4e4d52075577707f8393e3fc74c1ef79ca1d3ce6

For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.

Because the actual allocation algorithm is very simple and just involves
picking a good zone - preferably the one used for the last write to the
inode.  As the number of zones that can written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible from the iomap dio and buffered writeback bio submissions
helpers just before submitting the bio.

Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup.  Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme used
for that is very similar to the reflink COW end I/O path.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: parse and validate hardware zone information

Source kernel commit: 720c2d58348329ce57bfa7ecef93ee0c9bf4b405

Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: disable sb_frextents for zoned file systems

Source kernel commit: 1d319ac6fe1bd6364c5fc6e285ac47b117aed117

Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information. Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: export zoned geometry via XFS_FSOP_GEOM

Source kernel commit: 1fd8159e7ca41203798b6f65efaf1724eb318cd4

Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: allow internal RT devices for zoned mode

Source kernel commit: bdc03eb5f98f6f1ae4bd5e020d1582a23efb7799

Allow creating an RT subvolume on the same device as the main data
device. This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: define the zoned on-disk format

Source kernel commit: 2167eaabe2fadde24cb8f1dafbec64da1d2ed2f5

Zone file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, thus the
/rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
sb_rbmblocks superblock field must be cleared to zero.

2) there is a new superblock field that specifies the start of an
internal RT section.  This allows supporting SMR HDDs that have random
writable space at the beginning which is used for the XFS data device
(which really is the metadata device for this configuration), directly
followed by a RT device on the same block device.  While something
similar could be achieved using dm-linear just having a single device
directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
overprovisioning) that is never used for user capacity, but allows GC
to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that we
can persistently track the total amount of rtblocks currently used in
a RT group.  There is no data structure other than the rmap that
tracks used space in an RT group, and this counter is used to decide
when a RT group has been entirely emptied, and to select one that
is relatively empty if garbage collection needs to be performed.
While this counter could be tracked entirely in memory and rebuilt
from the rmap at mount time, that would lead to very long mount times
with the large number of RT groups implied by the number of hardware
zones especially on SMR hard drives with 256MB zone sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add a xfs_rtrmap_highest_rgbno helper

Source kernel commit: aacde95a37160b1462e46e0fd0cc7fd70e3bf1cc

Add a helper to find the last offset mapped in the rtrmap. This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

Source kernel commit: f42c652434de5e26e02798bf6a0c2a4a8627196b

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c

Source kernel commit: 7c879c8275c0505c551f0fc6c152299c8d11f756

Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.c. Move it to xfs_iomap.c
toward the two callers.

Note that there rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add a rtg_blocks helper

Source kernel commit: 012482b3308a49a84c2a7df08218dd4ad081e1da

Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: reduce metafile reservations

Source kernel commit: 272e20bb24dc895375ccc18a82596a7259b5a652

There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen. Reserve at most 1/4th of the data device blocks, which is
still a heuristic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: make metabtree reservations global

Source kernel commit: 1df8d75030b787a9fae270b59b93eef809dd2011

Currently each metabtree inode has it's own space reservation to ensure
it can be expanded to the maximum size, mirroring what is done for the
AG-based btrees.  But unlike the AG-based btrees the metabtree inodes
aren't restricted to allocate from a single AG but can use free space
form the entire file system.  And unlike AG-based btrees where the
required reservation shrinks with the available free space due to this,
the metabtree reservations for the rtrmap and rtfreflink trees are not
bound in any way by the data device free space as they track RT extent
allocations.  This is not very efficient as it requires a large number
of blocks to be set aside that can't be used at all by other btrees.

Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space, which now also removes the
overloading of the i_nblocks field for metabtree inodes, which would
create problems if metabtree inodes ever had a big enough xattr fork
to require xattr blocks outside the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: generalize the freespace and reserved blocks handling

Source kernel commit: 712bae96631852c1a1822ee4f57a08ccd843358b

xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.

Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters. This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code. Both these additions will be needed for the zoned
allocator.

Also switch the flooring of the frextents counter to 0 in statfs for the
rthinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs_repair: phase6: scan longform entries before header check

In longform_dir2_entry_check, if check_dir3_header() fails for v5
metadata, we immediately go to out_fix: and try to rebuild the
directory via longform_dir2_rebuild. But because we haven't yet
called longform_dir2_entry_check_data, the *hashtab used to rebuild
the directory is empty, which results in all existing entries
getting moved to lost+found, and an empty rebuilt directory. On top
of that, the empty directory is now short form, so its nlinks come
out wrong and this requires another repair run to fix.

Scan the entries before checking the header, so that we have a
decent chance of properly rebuilding the dir if the header is
corrupt, rather than orphaning all the entries and moving them to
lost+found.

Suggested-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
[aalbersh updated changelog as suggested by Eric Sandeen]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_repair: Bump link count if longform_dir2_rebuild yields shortform dir

If longform_dir2_rebuild() has so few entries in *hashtab that it results
in a short form directory, bump the link count manually as shortform
directories have no explicit "." entry.

Without this, repair will end with i.e.:

resetting inode 131 nlinks from 2 to 1

in this case, because it thinks this directory inode only has 1 link
discovered, and then a 2nd repair will fix it:

resetting inode 131 nlinks from 1 to 2

because shortform_dir2_entry_check() explicitly adds the extra ref when
the (newly-created)shortform directory is checked:

        /*
         * no '.' entry in shortform dirs, just bump up ref count by 1
         * '..' was already (or will be) accounted for and checked when
         * the directory is reached or will be taken care of when the
         * directory is moved to orphanage.
         */
        add_inode_ref(current_irec, current_ino_offset);

Avoid this by adding the extra ref if we convert from longform to
shortform.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[aalbersh drop SoB with user.mail as name]
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

mkfs: fix the issue of maxpct set to 0 not taking effect

If a filesystem has the sb_imax_pct field set to zero, there is no limit to the number of inode blocks in the filesystem.

However, when using mkfs.xfs and specifying maxpct = 0, the result is not as expected.
[root@fs ~]# mkfs.xfs -f -i maxpct=0 xfs.img
data     =               bsize=4096   blocks=262144, imaxpct=25
         =               sunit=0      swidth=0 blks

The reason is that the condition will never succeed when specifying maxpct = 0. As a result, the default algorithm was applied.
cfg->imaxpct = cli->imaxpct;
if (cfg->imaxpct)
return;

The result with patch:
[root@fs ~]# mkfs.xfs -f -i maxpct=0 xfs.img
data     =               bsize=4096   blocks=262144, imaxpct=0
         =               sunit=0      swidth=0 blks

[root@fs ~]# mkfs.xfs -f xfs.img
data     =               bsize=4096   blocks=262144, imaxpct=25
         =               sunit=0      swidth=0 blks

Cc: linux-xfs@vger.kernel.org # v4.15.0
Fixes: d7240c965389e1 ("mkfs: rework imaxpct calculation")
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: liuh <liuhuan01@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

mkfs: fix blkid probe API violations causing weird output

The blkid_do_fullprobe function in libblkid 2.38.1 will try to read the
last 512 bytes off the end of a block device.  If the block device has a
2k LBA size, that read will fail.  blkid_do_fullprobe passes the -EIO
back to the caller (mkfs) even though the API documentation says it
only returns 1, 0, or -1.

Change the "cannot detect existing fs" logic to look for any negative
number.  Otherwise, you get unhelpful output like this:

$ mkfs.xfs -l size=32m -b size=4096 /dev/loop3
mkfs.xfs: Use the -f option to force overwrite.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: make statx mask parsing more generally useful

Enhance the statx -m parsing to be more useful:

Add words for all the new STATX_* field flags added in the previous
patch.

Allow "+" and "-" prefixes to add or remove flags from the mask.

Allow multiple arguments to be specified as a comma separated list.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: redefine what statx -m all does

As of kernel commit 581701b7efd60b ("uapi: deprecate STATX_ALL"),
STATX_ALL is deprecated and has been withdrawn from the kernel codebase.
The symbol still exists for userspace to avoid compilation breakage, but
we're all suppose to stop using it.

Therefore, redefine statx -m all to set all the bits except for the
reserved bit since it's pretty silly that "all" doesn't actually get you
all the fields.

Update the STATX_ALL definition in io/statx.h so people stop using it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>

xfs_io: catch statx fields up to 6.15

Add all the new statx fields and flags that have accumulated for the
past couple of years so they all print now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>