git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfs_scrub: don't retry unsupported optimizations

If the kernel says it doesn't support optimizing a data structure, we
should mark it done and move on. This is much better than requeuing the
repair, in which case it will likely keep failing. Eventually these
requeued repairs end up in the single-threaded last resort at the end of
phase 4, which makes things /very/ slow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: handle spurious wakeups in scan_fs_tree

Coverity reminded me that the pthread_cond_wait can wake up and return
without the predicate variable (sft.nr_dirs > 0) actually changing.
Therefore, one has to retest the condition after each wakeup.

Coverity-id: 1554280
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: support passing the FORCE_REBUILD flag to online repair

Add CLI options to the scrubv and repair commands so that the user can
pass FORCE_REBUILD to force the kernel to rebuild metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: extract control number parsing routines

Break out the parts of parse_args that extract control numbers from the
CLI arguments, so that the function isn't as long. This isn't all that
exciting now, but the scrub vectorization speedups will introduce a new
ioctl. For the new command that comes with that, we'll want the control
number parsing helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: collapse trivial helpers

Simply the call chain by having parse_args set the scrub ioctl
parameters in the caller's object. The parse_args callers can then
invoke the ioctl directly, eliminating one function and one indirect
call.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: set exitcode = 1 on parsing errors in scrub/repair command

Set exitcode to 1 if there is an error parsing the CLI arguments to the
scrub or repair commands, like we do most other places in xfs_io.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_mdrestore: refactor progress printing and sb fixup code

Now that we've fixed the dissimilarities between the two progress
printing callsites, refactor them into helpers. Do the same for the
duplicate code that clears sb_inprogress from the primary superblock
after the copy succeeds.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_mdrestore: fix missed progress reporting

Currently, the progress reporting only triggers when the number of bytes
read is exactly a multiple of a megabyte. This isn't always guaranteed,
since AG headers can be 512 bytes in size. Fix the algorithm by
recording the number of megabytes we've reported as being read, and emit
a new report any time the bytes_read count, once converted to megabytes,
doesn't match.

Fix the v2 code to emit one final status message in case the last
extent restored is more than a megabyte.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

xfs_mdrestore: EXTERNALLOG is a compat value, not incompat

Fix this check to look at the correct header field.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

xfs_mdrestore: emit newlines for fatal errors

Spit out a newline after a fatal error message.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

xfs_mdrestore: fix uninitialized variables in mdrestore main

Coverity complained about the "is fd a file?" flags being uninitialized.
Clean this up.

Coverity-id: 1554270
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

xfs_metadump.8: update for external log device options

Update the documentation to reflect that we can metadump external log
device contents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

xfs_db: report the device associated with each io cursor

When db is reporting on an io cursor, have it print out the device
that the cursor is pointing to.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_copy: actually do directio writes to block devices

Not sure why block device targets don't get O_DIRECT in !buffered mode,
but it's misleading when the copy completes instantly only to stall
forever due to fsync-on-close. Adjust the "write last sector" code to
allocate a properly aligned buffer.

In removing the onstack buffer for EOD writes, this also corrects the
buffer being larger than necessary -- the old code declared an array of
32768 pointers, whereas all we really need is an aligned 32768-byte
buffer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_copy: distinguish short writes to EOD from runtime errors

Detect short writes to the end of the destination device and report
them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: don't UAF a requeued EFI

In the kernel, commit 8ebbf262d4684 ("xfs: don't block in busy flushing
when freeing extents") changed the allocator behavior such that AGFL
fixing can return -EAGAIN in response to detection of a deadlock with
the transaction busy extent list. If this happens, we're supposed to
requeue the EFI so that we can roll the transaction and try the item
again.

If a requeue happens, we should not free the xefi pointer in
xfs_extent_free_finish_item or else the retry will walk off a dangling
pointer. There is no extent busy list in userspace so this should
never happen, but let's fix the logic bomb anyway.

We should have ported kernel commit 0853b5de42b47 ("xfs: allow extent
free intents to be retried") to userspace, but neither Carlos nor I
noticed this fine detail. :(

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

libfrog: move 64-bit division wrappers to libfrog

We want to keep the rtgroup unit conversion functions as static inlines,
so share the div64 functions via libfrog instead of libxfs_priv.h.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>

libxfs: split out a libxfs_dev structure from struct libxfs_init

Most of the content of libxfs_init is members duplicated for each of the
data, log and RT devices. Split those members into a separate
libxfs_dev structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: stash away the device fd in struct xfs_buftarg

Cache the open file descriptor for each device in the buftarg
structure and remove the now unused dev_map infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_repair: remove various libxfs_device_to_fd calls

A few places in xfs_repair call libxfs_device_to_fd to get the data
device fd from the data device dev_t stored in the libxfs_init
structure. Just use the file descriptor stored right there directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: pass the device fd to discard_blocks

No need to do a dev_t to fd lookup when the caller already has the fd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: return the opened fd from libxfs_device_open

So that the caller can stash it away without having to call
xfs_device_to_fd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: mark libxfs_device_{open,close} static

libxfs_device_open and libxfs_device_close are only used in init.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: remove dead size < 0 checks in libxfs_init

libxfs_init initializes the device size to 0 at the start of the function
and libxfs_open_device never sets the size to a negativ value. Remove
these checks as they are dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libfrog: make platform_set_blocksize exit on fatal failure

platform_set_blocksize has a fatal argument that is currently only
used to change the printed message. Make it actually fatal similar to
other libfrog platform helpers to simplify the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: remove the setblksize == 1 case in libxfs_device_open

All callers of libxfs_init always pass an actual sector size or zero in
the setblksize member. Remove the unreachable setblksize == 1 case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: making passing flags to libxfs_init less confusing

The libxfs_xinit stucture has four different ways to pass flags to
libxfs_init:

- the isreadonly argument despite it's name contains various LIBXFS_
   flags that go beyond just the readonly flag
- the isdirect flag contains a single LIBXFS_ flag from the same name
- the usebuflock is an integer used as bool
- the bcache_flags member is used to pass flags directly to cache_init()
   for the buffer cache

While there is good arguments for keeping the last one separate, all the
others are rather confusing.  Consolidate them into a single flags member
using flags in the LIBXFS_* namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: merge the file vs device cases in libxfs_init

The only special handling for an XFS device on a regular file is that
we skip the checks in check_open. Simplify perform those conditionally
instead of duplicating the entire sequence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: pass a struct libxfs_init to libxfs_alloc_buftarg

Pass a libxfs_init structure to libxfs_alloc_buftarg instead of three
separate dev_t values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: pass a struct libxfs_init to libxfs_mount

Pass a libxfs_init structure to libxfs_mount instead of three separate
dev_t values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: rename struct libxfs_xinit to libxfs_init

Make the struct name more usual, and remove the libxfs_init_t typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxlog: remove the global libxfs_xinit x structure

There is no need to export a libxfs_xinit with the somewhat unsuitable
name x from libxlog. Move it into the tools linking against libxlog
that actually need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxlog: don't require a libxfs_xinit structure for xlog_init

xlog_init currently requires a libxfs_args structure to be passed in,
and then clobbers various log-related arguments to it. There is no
good reason for that as all the required information can be calculated
without it.

Remove the x argument to xlog_init and xlog_is_dirty and the now unused
logBBstart member in struct libxfs_xinit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxlog: add a helper to initialize a xlog without clobbering the x structure

xfsprogs has three copies of a code sequence to initialize an xlog
structure from a libxfs_init structure. Factor the code into a helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxlog: remove the verbose argument to xlog_is_dirty

No caller passes a non-zero verbose argument to xlog_is_dirty.
Remove the argument the code keyed off by it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_logprint: move all code to set up the fake xlog into logstat()

Isolate the code that sets up the fake xlog into the logstat() helper to
prepare for upcoming changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: remove the volname concept

IRIX has the concept of a volume that has data/log/rt subvolumes (that's
where the subvolume name in Linux comes from), but in the current
Linux-only xfsprogs version trying to pretend we do anything with that
it is just utterly confusing. The volname is basically just a very
obsfucated second way to pass the data device name, so get rid of it
in the libxfs and progs internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs/frog: remove latform_find{raw,block}path

Stop pretending we try to distinguish between the legacy Unix raw and
block devices nodes. Linux as the only currently support platform never
had them, but other modern Unix variants like FreeBSD also got rid of
this distinction years ago.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: remove the dead {d,log,rt}path variables in libxfs_init

These variables are only initialized, and then unlink is called if they
were changed from the initial value, which can't happen. Remove the
variables and the conditional unlink calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: remove the unused icache_flags member from struct libxfs_xinit

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_io/encrypt: support specifying crypto data unit size

Add an '-s' option to the 'set_encpolicy' command of xfs_io to allow
exercising the log2_data_unit_size field that is being added to struct
fscrypt_policy_v2 (kernel patch:
https://lore.kernel.org/linux-fscrypt/20230925055451.59499-6-ebiggers@kernel.org).

The xfs_io support is needed for xfstests
(https://lore.kernel.org/fstests/20231013061403.138425-1-ebiggers@kernel.org),
which currently relies on xfs_io to access the encryption ioctls.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Add support for passing log device as an argument

metadump v2 format allows dumping metadata from external log devices. This
commit allows passing the device file to which log data must be restored from
the corresponding metadump file.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Define mdrestore ops for v2 format

This commit adds functionality to restore metadump stored in v2 format.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Extract target device size verification into a function

A future commit will need to perform the device size verification on an
external log device. In preparation for this, this commit extracts the
relevant portions into a new function. No functional changes have been
introduced.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Introduce mdrestore v1 operations

In order to indicate the version of metadump files that they can work with,
this commit renames read_header(), show_info() and restore() functions to
read_header_v1(), show_info_v1() and restore_v1() respectively.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Replace metadump header pointer argument with a union pointer

We will need two variants of read_header(), show_info() and restore() helper
functions to support two versions of metadump formats. To this end, A future
commit will introduce a vector of function pointers to work with the two
metadump formats. To have a common function signature for the function
pointers, this commit replaces the first argument of the previously listed
function pointers from "struct xfs_metablock *" with "union
mdrestore_headers *".

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Add open_device(), read_header() and show_info() functions

This commit moves functionality associated with opening the target device,
reading metadump header information and printing information about the
metadump into their respective functions. There are no functional changes made
by this commit.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Detect metadump v1 magic before reading the header

In order to support both v1 and v2 versions of metadump, mdrestore will have
to detect the format in which the metadump file has been stored on the disk
and then read the ondisk structures accordingly. In a step in that direction,
this commit splits the work of reading the metadump header from disk into two
parts,
1. Read the first 4 bytes containing the metadump magic code.
2. Read the remaining part of the header.

A future commit will take appropriate action based on the value of the magic
code.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Define and use struct mdrestore

This commit collects all state tracking variables in a new "struct mdrestore"
structure. This is done to collect all the global variables in one place
rather than having them spread across the file. A new structure member of type
"struct mdrestore_ops *" will be added by a future commit to support the two
versions of metadump.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Declare boolean variables with bool type

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_db: Add support to read from external log device

This commit introduces a new function set_log_cur() allowing xfs_db to read
from an external log device. This is required by a future commit which will
add the ability to dump metadata from external log devices.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Define metadump ops for v2 format

This commit adds functionality to dump metadata from an XFS filesystem in
newly introduced v2 format.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Define metadump v2 ondisk format structures and macros

The corresponding metadump file's disk layout is as shown below,

     |------------------------------|
     | struct xfs_metadump_header   |
     |------------------------------|
     | struct xfs_meta_extent 0     |
     | Extent 0's data              |
     | struct xfs_meta_extent 1     |
     | Extent 1's data              |
     | ...                          |
     | struct xfs_meta_extent (n-1) |
     | Extent (n-1)'s data          |
     |------------------------------|

The "struct xfs_metadump_header" is followed by alternating series of "struct
xfs_meta_extent" and the extent itself.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Rename XFS_MD_MAGIC to XFS_MD_MAGIC_V1

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Introduce metadump v1 operations

This commit moves functionality associated with writing metadump to disk into
a new function. It also renames metadump initialization, write and release
functions to reflect the fact that they work with v1 metadump files.

The metadump initialization, write and release functions are now invoked via
metadump_ops->init(), metadump_ops->write() and metadump_ops->release()
respectively.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Introduce struct metadump_ops

We will need two sets of functions to implement two versions of metadump. This
commit adds the definition for 'struct metadump_ops' to hold pointers to
version specific metadump functions.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Postpone invocation of init_metadump()

The metadump v2 initialization function (introduced in a later commit) writes
the header structure into the metadump file. This will require the program to
open the metadump file before the initialization function has been invoked.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Add initialization and release functions

Move metadump initialization and release functionality into corresponding
functions. There are no functional changes made in this commit.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Define and use struct metadump

This commit collects all state tracking variables in a new "struct metadump"
structure. This is done to collect all the global variables in one place
rather than having them spread across the file. A new structure member of type
"struct metadump_ops *" will be added by a future commit to support the two
versions of metadump.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Declare boolean variables with bool type

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mdrestore: Fix logic used to check if target device is large enough

The device size verification code should be writing XFS_MAX_SECTORSIZE bytes
to the end of the device rather than "sizeof(char *) * XFS_MAX_SECTORSIZE"
bytes.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

metadump: Use boolean values true/false instead of 1/0

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

repair: fix the call to search_rt_dup_extent in scan_bmapbt

search_rt_dup_extent expects an RT extent number and not a fsbno.
Convert the units before the call. Without this we are unlikely
to ever found a legit duplicate extent on the RT subvolume because
the search will always be off the end.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_quota: fix missing mount point warning

When user have mounted an XFS volume, and defined project in
/etc/projects file that points to a directory on a different volume,
then:
`xfs_quota -xc "report -a" $path_to_mounted_volume'

complains with:
"xfs_quota: cannot find mount point for path \
`directory_from_projects': Invalid argument"

unlike `xfs_quota -xc "report -a"' which works as expected and no
warning is printed.

This is happening because in the 1st call we pass to xfs_quota command
the $path_to_mounted_volume argument which says to xfs_quota not to
look for all mounted volumes on the system, but use only those passed
to the command and ignore all others (This behavior is intended as an
optimization for systems with huge number of mounted volumes). After
that, while projects are initialized, the project's directories on
other volumes are obviously not in searched subset of volumes and
warning is printed.

I propose to fix this behavior by conditioning the printing of warning
only if all mounted volumes are searched.

Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

db: fix unsigned char related warnings

Clean up the code in hash.c to use the normal char type for all
high-level code, only casting to uint8_t when calling into low-level
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Polish translation update for xfsprogs 6.5.0.

Signed-off-by: Jakub Bogusz <qboosh@pld-linux.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: adjust the incore perag block_count when shrinking

Source kernel commit: 6868b8505c807ad9397d78cc4e07cb1cb3582152

If we reduce the number of blocks in an AG, we must update the incore
geometry values as well.

Fixes: 0800169e3e2c9 ("xfs: Pre-calculate per-AG agbno geometry")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Revert "xfs: switch to multigrain timestamps"

Source kernel commit: f798accd5987dc2280e0ba9055edf1124af46a5f

This reverts commit e44df2664746aed8b6dd5245eb711a0ce33c5cf5.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix log recovery when unknown rocompat bits are set

Source kernel commit: 74ad4693b6473950e971b3dc525b5ee7570e05d0

Log recovery has always run on read only mounts, even where the primary
superblock advertises unknown rocompat bits. Due to a misunderstanding
between Eric and Darrick back in 2018, we accidentally changed the
superblock write verifier to shutdown the fs over that exact scenario.
As a result, the log cleaning that occurs at the end of the mounting
process fails if there are unknown rocompat bits set.

As we now allow writing of the superblock if there are unknown rocompat
bits set on a RO mount, we no longer want to turn off RO state to allow
log recovery to succeed on a RO mount. Hence we also remove all the
(now unnecessary) RO state toggling from the log recovery path.

Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier"
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: switch to multigrain timestamps

Source kernel commit: e44df2664746aed8b6dd5245eb711a0ce33c5cf5

Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-11-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: allow userspace to rebuild metadata structures

Source kernel commit: 5c83df2e54b6af870e3e02ccd2a8ecd54e36668c

Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken. We will use
this to move metadata structures out of the way during a free space
defragmentation operation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: convert to ctime accessor functions

Source kernel commit: a0a415e34b57368acd262e1172720252c028b936

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-80-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfsprogs: Release v6.5.0

Update all the necessary files for a 6.5.0 release.

Signed-off-by: Carlos Maiolino <cem@kernel.org>

libfrog: drop build host crc32 selftest

CRC selftests running on a build host were useful long time ago, when
CRC support was added to the on-disk support. Now it's purpose is
replaced by fstests. Also mkfs.xfs and xfs_repair have their own
selftests.

On top of that, it adds a dependency on liburcu on the build host for
no reason - liburcu is not used by the crc32 selftest.

Signed-off-by: Krzesimir Nowak <knowak@microsoft.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: fix atomic64_t detection on x86 32-bit architectures

xfsprogs during compilation tries to detect if liburcu supports atomic
64-bit ops on the platform it is being compiled on, and if not it falls
back to using pthread mutex locks.

The detection logic for that fallback relies on _uatomic_link_error()
which is a link-time trick used by liburcu that will cause compilation
errors on archs that lack the required support. That only works for the
generic liburcu code though, and it is not implemented for the
x86-specific code.

In practice this means that when xfsprogs is compiled on 32-bit x86
archs will successfully link to liburcu for atomic ops, but liburcu does
not support atomic64_t on those archs. It indicates this during runtime
by generating an illegal instruction that aborts execution, and thus
causes various xfsprogs utils to be segfaulting.

Fix this by requiring that unsigned longs are at least 64 bits in size,
which /usually/ means that 64-bit atomic counters are supported. We
can't simply execute the liburcu atomic64_t detection code during
configure instead of only relying on the linker error because that
doesn't work for cross-compiled packages.

Fixes: 7448af588a2e ("libxfs: fix atomic64_t poorly for 32-bit architectures")
Reported-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_repair: set aformat and anextents correctly when clearing the attr fork

Ever since commit b42db0860e130 ("xfs: enhance dinode verifier"), we've
required that inodes with zero di_forkoff must also have di_aformat ==
EXTENTS and di_naextents == 0.  clear_dinode_attr actually does this,
but then both callers inexplicably set di_format = LOCAL.  That in turn
causes a verifier failure the next time the xattrs of that file are
read by the kernel.  Get rid of the bogus field write.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_scrub: actually return errno from check_xattr_ns_names

Actually return the error code when extended attribute checks fail.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: use XFS_IGET_CREATE when creating new files

Use this flag to check that newly allocated inodes are, in fact,
unallocated. This matches the kernel, and prevents userspace programs
from making latent corruptions worse by unintentionally crosslinking
files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libfrog: don't fail on XFS_FSOP_GEOM_FLAGS_NREXT64 in xfrog_bulkstat_single5

This flag is perfectly acceptable for bulkstatting a single file;
there's no reason not to allow it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libfrog: fix overly sleep workqueues

I discovered the following bad behavior in the workqueue code when I
noticed that xfs_scrub was running single-threaded despite having 4
virtual CPUs allocated to the VM.  I observed this sequence:

Thread 1 WQ1 WQ2...N
workqueue_create
<start up>
pthread_cond_wait
<start up>
pthread_cond_wait
workqueue_add
next_item == NULL
pthread_cond_signal

workqueue_add
next_item != NULL
<do not pthread_cond_signal>

<receives wakeup>
<run first item>

workqueue_add
next_item != NULL
<do not pthread_cond_signal>

<run second item>
<run third item>
pthread_cond_wait

workqueue_terminate
pthread_cond_broadcast
<receives wakeup>
<nothing to do, exits>
<wakes up again>
<nothing to do, exits>

Notice how threads WQ2...N are completely idle while WQ1 ends up doing
all the work!  That wasn't the point of a worker pool!  Observe that
thread 1 manages to queue two work items before WQ1 pulls the first item
off the queue.  When thread 1 queues the third item, it sees that
next_item is not NULL, so it doesn't wake a worker.  If thread 1 queues
all the N work that it has before WQ1 empties the queue, then none of
the other thread get woken up.

Fix this by maintaining a count of the number of active threads, and
using that to wake either the sole idle thread, or all the threads if
there are many that are idle.  This dramatically improves startup
behavior of the workqueue and eliminates the collapse case.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_db: use directio for device access

XFS and tools (mkfs, copy, repair) don't generally rely on the block
device page cache, preferring instead to use directio. For whatever
reason, the debugger was never made to do this, but let's do that now.

This should eliminate the weird fstests failures resulting from
udev/blkid pinning a cache page while the unmounting filesystem writes
to the superblock such that xfs_db finds the stale pagecache instead of
the post-unmount superblock.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

libxfs: make platform_set_blocksize optional with directio

If we're accessing the block device with directio (and hence bypassing
the page cache), then don't fail on BLKBSZSET not working. We don't
care what happens to the pagecache bufferheads.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mkfs: add a config file for 6.6 LTS kernels

Enable 64-bit extent counts and reverse mapping for 6.6.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mkfs: enable reverse mapping by default

Now that online fsck is feature complete, there's actually a compelling
story for having the reverse mappings enabled. Turn it on by default.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

mkfs: enable large extent counts by default

Format filesystems with the large extent counter feature turned on.
We shall now support 64-bit extent counts for the data fork and 32-bit
extent counts for the attr fork.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_db: create unlinked inodes

Create an expert-mode debugger command to create unlinked inodes.
This will hopefully aid in simulation of leaked unlinked inode handling
in the kernel and elsewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_db: dump unlinked buckets

Create a new command to dump the resource usage of files in the unlinked
buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: convert flex-array declarations in xfs attr shortform objects

Source kernel commit: f6250e205691a58c81be041b1809a2e706852641

As of 6.5-rc1, UBSAN trips over the ondisk extended attribute shortform
definitions using an array length of 1 to pretend to be a flex array.
Kernel compilers have to support unbounded array declarations, so let's
correct this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: convert flex-array declarations in xfs attr leaf blocks

Source kernel commit: a49bbce58ea90b14d4cb1d00681023a8606955f2

As of 6.5-rc1, UBSAN trips over the ondisk extended attribute leaf block
definitions using an array length of 1 to pretend to be a flex array.
Kernel compilers have to support unbounded array declarations, so let's
correct this.

================================================================================
UBSAN: array-index-out-of-bounds in fs/xfs/libxfs/xfs_attr_leaf.c:2535:24
index 2 is out of range for type '__u8 [1]'
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x50
__ubsan_handle_out_of_bounds+0x9c/0xd0
xfs_attr3_leaf_getvalue+0x2ce/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_leaf_get+0x148/0x1c0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_get_ilocked+0xae/0x110 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_get+0xee/0x150 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_xattr_get+0x7d/0xc0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
__vfs_getxattr+0xa3/0x100
vfs_getxattr+0x87/0x1d0
do_getxattr+0x17a/0x220
getxattr+0x89/0xf0

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: convert flex-array declarations in struct xfs_attrlist*

Source kernel commit: 371baf5c9750a258fee21d0cb8c8d683bb057429

As of 6.5-rc1, UBSAN trips over the attrlist ioctl definitions using an
array length of 1 to pretend to be a flex array. Kernel compilers have
to support unbounded array declarations, so let's correct this. This
may cause friction with userspace header declarations, but suck is life.

================================================================================
UBSAN: array-index-out-of-bounds in fs/xfs/xfs_ioctl.c:345:18
index 1 is out of range for type '__s32 [1]'
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x50
__ubsan_handle_out_of_bounds+0x9c/0xd0
xfs_ioc_attr_put_listent+0x413/0x420 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_list_ilocked+0x170/0x850 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_list+0xb7/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_ioc_attr_list+0x13b/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attrlist_by_handle+0xab/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_file_ioctl+0x1ff/0x15e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
vfs_ioctl+0x1f/0x60

The kernel and xfsprogs code that uses these structures will not have
problems, but the long tail of external user programs might.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: AGI length should be bounds checked

Source kernel commit: 2d7d1e7ea321b0b2810eb00183e21332ee9c4b6f

Similar to the recent patch strengthening the AGF agf_length
verification, the AGI verifier does not check that the AGI length field
is within known good bounds. This isn't currently checked by runtime
kernel code, yet we assume in many places that it is correct and verify
other metadata against it.

Add length verification to the AGI verifier. Just like the AGF length
checking, the length of the AGI must be equal to the size of the AG
specified in the superblock, unless it is the last AG in the filesystem.
In that case, it must be less than or equal to sb->sb_agblocks and
greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs
operation will allow to exist.

There's only one place in the filesystem that actually uses agi_length,
but let's not leave it vulnerable to the same weird nonsense that
generates syzbot bugs, eh?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix xfs_btree_query_range callers to initialize btree rec fully

Source kernel commit: 75dc0345312221971903b2e28279b7e24b7dbb1b

Use struct initializers to ensure that the xfs_btree_irecs passed into
the query_range function are completely initialized. No functional
changes, just closing some sloppy hygiene.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix bounds check in xfs_defer_agfl_block()

Source kernel commit: 2bed0d82c2f78b91a0a9a5a73da57ee883a0c070

Need to happen before we allocate and then leak the xefi. Found by
coverity via an xfsprogs libxfs scan.

[djwong: This also fixes the type of the @agbno argument.]

Fixes: 7dfee17b13e5 ("xfs: validate block number being freed before adding to xefi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: AGF length has never been bounds checked

Source kernel commit: edd8276dd70279c29d412d99b99c2c0cac1b2cdd

The AGF verifier does not check that the AGF length field is within
known good bounds. This has never been checked by runtime kernel
code (i.e. the lack of verification goes back to 1993) yet we assume
in many places that it is correct and verify other metdata against
it.

Add length verification to the AGF verifier. The length of the AGF
must be equal to the size of the AG specified in the superblock,
unless it is the last AG in the filesystem. In that case, it must be
less than or equal to sb->sb_agblocks and greater than
XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
allow to exist.

This requires a bit of rework of the verifier function. We want to
verify metadata before we use it to verify other metadata. Hence
we need to verify the AGF sequence numbers before using them to
verify the length of the AGF. Then we can verify the AGF length
before we verify AGFL fields. Then we can verifier other fields that
are bounds limited by the AGF length.

And, finally, by calculating agf_length only once into a local
variable, we can collapse repeated "if (xfs_has_foo() &&"
conditionaly checks into single checks. This makes the code much
easier to follow as all the checks for a given feature are obviously
in the same place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: journal geometry is not properly bounds checked

Source kernel commit: f1e1765aad7de7a8b8102044fc6a44684bc36180

If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely,
either, so we shouldn't allow those geometries to mount, either.

If the log is too small, we risk having transaction reservations
overruning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts of the log is too small for V5 filesystems, but we must allow
v4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.

The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This mean that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry is violated, and then fail to
check the hard fail geometries that should fail the mount.

Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.

However, for the minimum log size checks, we can only do that once
we've setup up the log and calculated all the iclog sizes and
roundoffs. Hence this needs to remain in the log mount code after
the log has been initialised. It is also the only case where we
should allow a v4 filesystem to continue running, so leave that
handling in place, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: don't block in busy flushing when freeing extents

Source kernel commit: 8ebbf262d4684e035af5e7aa2a71cab636673a9b

If the current transaction holds a busy extent and we are trying to
allocate a new extent to fix up the free list, we can deadlock if
the AG is entirely empty except for the busy extent held by the
transaction.

This can occur at runtime processing an XEFI with multiple extents
in this path:

__schedule+0x22f at ffffffff81f75e8f
schedule+0x46 at ffffffff81f76366
xfs_extent_busy_flush+0x69 at ffffffff81477d99
xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
__xfs_free_extent+0x99 at ffffffff81419499
xfs_trans_free_extent+0x3e at ffffffff814a6fee
xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
xfs_defer_finish+0x11 at ffffffff814417e1
xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
xfs_inactive_truncate+0xb9 at ffffffff8148bb89
xfs_inactive+0x227 at ffffffff8148c4f7
xfs_fs_destroy_inode+0xb8 at ffffffff81496898
destroy_inode+0x3b at ffffffff8127d2ab
do_unlinkat+0x1d1 at ffffffff81270df1
do_syscall_64+0x40 at ffffffff81f6b5f0
entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c

This can also happen in log recovery when processing an EFI
with multiple extents through this path:

context_switch() kernel/sched/core.c:3881
__schedule() kernel/sched/core.c:5111
schedule() kernel/sched/core.c:5186
xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
xfs_log_mount_finish() fs/xfs/xfs_log.c:764
xfs_mountfs() fs/xfs/xfs_mount.c:978
xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
mount_bdev() fs/super.c:1417
xfs_fs_mount() fs/xfs/xfs_super.c:1985
legacy_get_tree() fs/fs_context.c:647
vfs_get_tree() fs/super.c:1547
do_new_mount() fs/namespace.c:2843
do_mount() fs/namespace.c:3163
ksys_mount() fs/namespace.c:3372
__do_sys_mount() fs/namespace.c:3386
__se_sys_mount() fs/namespace.c:3383
__x64_sys_mount() fs/namespace.c:3383
do_syscall_64() arch/x86/entry/common.c:296
entry_SYSCALL_64() arch/x86/entry/entry_64.S:180

To avoid this deadlock, we should not block in
xfs_extent_busy_flush() if we hold a busy extent in the current
transaction.

Now that the EFI processing code can handle requeuing a partially
completed EFI, we can detect this situation in
xfs_extent_busy_flush() and return -EAGAIN rather than going to
sleep forever. The -EAGAIN get propagated back out to the
xfs_trans_free_extent() context, where the EFD is populated and the
transaction is rolled, thereby moving the busy extents into the CIL.

At this point, we can retry the extent free operation again with a
clean transaction. If we hit the same "all free extents are busy"
situation when trying to fix up the free list, we can safely call
xfs_extent_busy_flush() and wait for the busy extents to resolve
and wake us. At this point, the allocation search can make progress
again and we can fix up the free list.

This deadlock was first reported by Chandan in mid-2021, but I
couldn't make myself understood during review, and didn't have time
to fix it myself.

It was reported again in March 2023, and again I have found myself
unable to explain the complexities of the solution needed during
review.

As such, I don't have hours more time to waste trying to get the
fix written the way it needs to be written, so I'm just doing it
myself. This patchset is largely based on Wengang Wang's last patch,
but with all the unnecessary stuff removed, split up into multiple
patches and cleaned up somewhat.

Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass alloc flags through to xfs_extent_busy_flush()

Source kernel commit: 6a2a9d776c4ae24a797e25eed2b9f7f33f756296

To avoid blocking in xfs_extent_busy_flush() when freeing extents
and the only busy extents are held by the current transaction, we
need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
into xfs_extent_busy_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: use deferred frees for btree block freeing

Source kernel commit: b742d7b4f0e03df25c2a772adcded35044b625ca

Btrees that aren't freespace management trees use the normal extent
allocation and freeing routines for their blocks. Hence when a btree
block is freed, a direct call to xfs_free_extent() is made and the
extent is immediately freed. This puts the entire free space
management btrees under this path, so we are stacking btrees on
btrees in the call stack. The inobt, finobt and refcount btrees
all do this.

However, the bmap btree does not do this - it calls
xfs_free_extent_later() to defer the extent free operation via an
XEFI and hence it gets processed in deferred operation processing
during the commit of the primary transaction (i.e. via intent
chaining).

We need to change xfs_free_extent() to behave in a non-blocking
manner so that we can avoid deadlocks with busy extents near ENOSPC
in transactions that free multiple extents. Inserting or removing a
record from a btree can cause a multi-level tree merge operation and
that will free multiple blocks from the btree in a single
transaction. i.e. we can call xfs_free_extent() multiple times, and
hence the btree manipulation transaction is vulnerable to this busy
extent deadlock vector.

To fix this, convert all the remaining callers of xfs_free_extent()
to use xfs_free_extent_later() to queue XEFIs and hence defer
processing of the extent frees to a context that can be safely
restarted if a deadlock condition is detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove redundant initializations of pointers drop_leaf and save_leaf

Source kernel commit: 347eb95b27eb97bebdc3ea7de23558216f4e2c90

Pointers drop_leaf and save_leaf are initialized with values that are never
read, they are being re-assigned later on just before they are used. Remove
the redundant early initializations and keep the later assignments at the
point where they are used. Cleans up two clang scan build warnings:

fs/xfs/libxfs/xfs_attr_leaf.c:2288:29: warning: Value stored to 'drop_leaf'
during its initialization is never read [deadcode.DeadStores]
fs/xfs/libxfs/xfs_attr_leaf.c:2289:29: warning: Value stored to 'save_leaf'
during its initialization is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix ag count overflow during growfs

Source kernel commit: c3b880acadc95d6e019eae5d669e072afda24f1b

I found a corruption during growfs:

XFS (loop0): Internal error agbno >= mp->m_sb.sb_agblocks at line 3661 of
file fs/xfs/libxfs/xfs_alloc.c.  Caller __xfs_free_extent+0x28e/0x3c0
CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
Call Trace:
<TASK>
dump_stack_lvl+0x50/0x70
xfs_corruption_error+0x134/0x150
__xfs_free_extent+0x2c1/0x3c0
xfs_ag_extend_space+0x291/0x3e0
xfs_growfs_data+0xd72/0xe90
xfs_file_ioctl+0x5f9/0x14a0
__x64_sys_ioctl+0x13e/0x1c0
do_syscall_64+0x39/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
XFS (loop0): Corruption detected. Unmount and run xfs_repair
XFS (loop0): Internal error xfs_trans_cancel at line 1097 of file
fs/xfs/xfs_trans.c.  Caller xfs_growfs_data+0x691/0xe90
CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
Call Trace:
<TASK>
dump_stack_lvl+0x50/0x70
xfs_error_report+0x93/0xc0
xfs_trans_cancel+0x2c0/0x350
xfs_growfs_data+0x691/0xe90
xfs_file_ioctl+0x5f9/0x14a0
__x64_sys_ioctl+0x13e/0x1c0
do_syscall_64+0x39/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2d86706577

The bug can be reproduced with the following sequence:

# truncate -s  1073741824 xfs_test.img
# mkfs.xfs -f -b size=1024 -d agcount=4 xfs_test.img
# truncate -s 2305843009213693952  xfs_test.img
# mount -o loop xfs_test.img /mnt/test
# xfs_growfs -D  1125899907891200  /mnt/test

The root cause is that during growfs, user space passed in a large value
of newblcoks to xfs_growfs_data_private(), due to current sb_agblocks is
too small, new AG count will exceed UINT_MAX. Because of AG number type
is unsigned int and it would overflow, that caused nagcount much smaller
than the actual value. During AG extent space, delta blocks in
xfs_resizefs_init_new_ags() will much larger than the actual value due to
incorrect nagcount, even exceed UINT_MAX. This will cause corruption and
be detected in __xfs_free_extent. Fix it by growing the filesystem to up
to the maximally allowed AGs and not return EINVAL when new AG count
overflow.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>