Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: compute refcount data for the realtime groups
At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: find and mark the rtrefcountbt inode
Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: use realtime refcount btree data to check block types
Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: allow CoW staging extents in the realtime rmap records
Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports reflink. As far as
reporting leftover staging extents, we'll report them when we scan the
rt refcount btree, in a future patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
libxfs: apply rt extent alignment constraints to CoW extsize hint
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
libxfs: add a realtime flag to the refcount update log redo items
Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt. We'll
wire up the actual refcount code later.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:02 +0000 (10:22 -0800)]
xfs_repair: reserve per-AG space while rebuilding rt metadata
Realtime metadata btrees can consume quite a bit of space on a full
filesystem. Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata. This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: check for global free space concerns with default btree slack levels
It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full. If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space. The
solution to this is to pack the new blocks completely full if we fear
running out of space.
Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.
Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: always check realtime file mappings against incore info
Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3). As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.
Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state. The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork. We already
do this for regular data files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: find and mark the rtrmapbt inodes
Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:00 +0000 (10:22 -0800)]
xfs_repair: use realtime rmap btree data to check block types
Use the realtime rmap btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:22:00 +0000 (10:22 -0800)]
xfs_repair: flag suspect long-format btree blocks
Pass a "suspect" counter through scan_lbtree just like we do for
short-format btree blocks, and increment its value when we encounter
blocks with bad CRCs or outright corruption. This makes it so that
repair actually catches bmbt blocks with bad crcs or other verifier
errors.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:21:58 +0000 (10:21 -0800)]
xfs_db: don't abort when bmapping on a non-extents/bmbt fork
We're going to introduce new fork formats, so let's fix the problem that
xfs_db's bmap command aborts when the fork format isn't one of the
existing ones.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:21:57 +0000 (10:21 -0800)]
libxfs: add a realtime flag to the rmap update log redo items
Extend the rmap update (RUI) log items with a new realtime flag that
indicates that the updates apply against the realtime rmapbt. We'll
wire up the actual rmap code later.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Replacing kmalloc() + memcpy() with kmemdump() doesn't change semantics.
Original code works without fault, so this is not a bug fix but proposed improvement.
Link: https://lwn.net/Articles/198928/ Fixes: 94a69db2367ef ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS") Fixes: 384f3ced07efd ("[XFS] Return case-insensitive match for dentry cache") Fixes: 2451337dd0439 ("xfs: global error sign conversion") Cc: Carlos Maiolino <cem@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Chandan Babu R <chandanbabu@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: linux-xfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Mirsad Todorovac <mtodorovac69@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
They will eventually be needed to be const for zoned growfs, but even
now having such simpler helpers as const as possible is a good thing.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add code to scrub realtime refcount btrees. Similar to the refcount
btree checking code for the data device, we walk the rmap btree for each
refcount record to confirm that the reference counts are correct.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files. This apparently was done to disable
delayed allocation for realtime files.
However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files. In this case, the
logic will incorrectly send the write through the delalloc write path.
Fix this by adjusting the logic slightly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with metafile type and on-disk
interpretation functions.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Add a metadir path to select the realtime refcount btree inode and load
it at mount time. The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt. We'll
wire up the actual refcount code later.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions. Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.
Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert it all at once.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are neither constrained
to the free space nor any particular AG.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Add the ondisk structure definitions for realtime refcount btrees. The
realtime refcount btree will be rooted from a hidden inode so it needs
to have a separate btree block magic and pointer format.
Next, add everything needed to read, write and manipulate refcount btree
blocks. This prepares the way for connecting the btree operations
implementation, though the changes to actually root the rtrefcount btree
in an inode come later.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Repair the realtime rmap btree while mounted. Similar to the regular
rmap btree repair code, we walk the data fork mappings of every realtime
file in the filesystem to collect reverse-mapping records in an xfarray.
Then we sort the xfarray, and use the btree bulk loader to create a new
rtrmap btree ondisk. Finally, we swap the btree roots, and reap the old
blocks in the usual way.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Connect the map and unmap reverse-mapping operations to the realtime
rmapbt via the deferred operation callbacks. This enables us to
perform rmap operations against the correct btree.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Plumb in the pieces we need to embed the root of the realtime rmap btree
in an inode's data fork, complete with new metafile type and on-disk
interpretation functions.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Add a metadir path to select the realtime rmap btree inode and load
it at mount time. The rtrmapbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new fork format type for metadata btrees. This fork type
requires that the inode is in the metadata directory tree, and only
applies to the data fork. The actual type of the metadata btree itself
is determined by the di_metatype field.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a helper function to turn a metadata file type code into a
printable string, and use this to complain about lockdep problems with
rtgroup inodes. We'll use this more in the next patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare the high-level rmap functions to deal with the new realtime
rmapbt and its slightly different conventions. Provide the ability
to talk to either rmapbt or rtrmapbt formats from the same high
level code.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Implement the generic btree operations needed to manipulate rtrmap
btree blocks. This is different from the regular rmapbt in that we
allocate space from the filesystem at large, and are neither
constrained to the free space nor any particular AG.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrmapbt to add the record and a second
split in the regular rmapbt to record the rtrmapbt split.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Add the ondisk structure definitions for realtime rmap btrees. The
realtime rmap btree will be rooted from a hidden inode so it needs to
have a separate btree block magic and pointer format.
Next, add everything needed to read, write and manipulate rmap btree
blocks. This prepares the way for connecting the btree operations
implementation, though embedding the rtrmap btree root in the inode
comes later in the series.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new space reservation scheme so that btree metadata for the
realtime volume can reserve space in the data device to avoid space
underruns.
Back when we were testing the rmap and refcount btrees for the data
device, people observed occasional shutdowns when xfs_btree_split was
called for either of those two btrees. This happened when certain
operations (mostly writeback ioends) created new rmap or refcount
records, which would expand the size of the btree. If there were no
free blocks available the allocation would fail and the split would shut
down the filesystem.
I considered pre-reserving blocks for btree expansion at the time of a
write() call, but there wasn't any good way to attach the reservations
to an inode and keep them there all the way to ioend processing. Unlike
delalloc reservations which have that indlen mechanism, there's no way
to do that for mapped extents; and indlen blocks are given back during
the delalloc -> unwritten transition.
The solution was to reserve sufficient blocks for rmap/refcount btree
expansion at mount time. This is what the XFS_AG_RESV_* flags provide;
any expansion of those two btrees can come from the pre-reserved space.
This patch brings that pre-reservation ability to inode-rooted btrees so
that the rt rmap and refcount btrees can also save room for future
expansion.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Add the necessary flags and code so that we can support storing leaf
records in the inode root block of a btree. This hasn't been necessary
before, but the realtime rmapbt will need to be able to do this.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Simplify the calling conventions by allowing callers to pass a fsbno
(xfs_fsblock_t) directly into these functions, since we're just going to
set it in a struct anyway.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Files participating in the metadata directory tree are not accounted to
the quota subsystem. Therefore, the i_[ugp]dquot pointers in struct
xfs_inode are never used and should always be NULL.
In the next patch we want to add a u64 count of fs blocks reserved for
metadata btree expansion, but we don't want every inode in the fs to pay
the memory price for this feature. The intent is to union those three
pointers with the u64 counter, but for that to work we must guard
against all access to the dquot pointers for metadata files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Create some simple helpers to reduce the amount of typing whenever we
access rtgroup inodes. Conversion was done with this spatch and some
minor reformatting:
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node child into the root block
to a separate function. Remove some unnecessary conditionals and clean
up a few function calls in the new function. Note that this change
reorders the ->free_block call with respect to the change in bc_nlevels
to make it easier to support inode root leaf blocks in the next patch.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node root into a child block
to a separate function. Note that the new function explicitly computes
the keys of the new child block and stores that in the root block; while
the bmap btree could rely on leaving the key alone, realtime rmap needs
to set the new high key.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Hoist out the code that migrates broot pointers during a resize
operation to avoid code duplication and streamline the caller. Also
use the correct bmbt pointer type for the sizeof operation.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Change the calling signature of xfs_iroot_realloc to take the ifork and
the new number of records in the btree block, not a diff against the
current number. This will make the callsites easier to understand.
Note that this function is misnamed because it is very specific to the
single type of inode-rooted btree supported. This will be addressed in
a subsequent patch.
Return the new btree root to reduce the amount of code clutter.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Hoist the code that allocates, frees, and reallocates if_broot into a
single xfs_iroot_krealloc function. Eventually we're going to push
xfs_iroot_realloc into the btree ops structure to handle multiple
inode-rooted btrees, but first let's separate out the bits that should
stay in xfs_inode_fork.c.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:21:44 +0000 (10:21 -0800)]
xfs_scrub: try harder to fill the bulkstat array with bulkstat()
Sometimes, the last bulkstat record returned by the first xfrog_bulkstat
call in bulkstat_for_inumbers will contain an inumber less than the
highest allocated inode mentioned in the inumbers record. This happens
either because the inodes have been freed, or because the the kernel
encountered a corrupt inode during bulkstat and stopped filling up the
array.
In both cases, we can call bulkstat again to try to fill up the rest of
the array. If there are newly allocated inodes, they'll be returned; if
we've truly hit the end of the filesystem, the kernel will return zero
records; and if the first allocated inode is indeed corrupt, the kernel
will return EFSCORRUPTED.
As an optimization to avoid the single-step code, call bulkstat with an
increasing ino parameter until the bulkstat array is full or the kernel
tells us there are no bulkstat records to return. This speeds things
up a bit in cases where the allocmask is all ones and only the second
inode is corrupt.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 24 Feb 2025 18:21:44 +0000 (10:21 -0800)]
xfs_scrub: ignore freed inodes when single-stepping during phase 3
For inodes that inumbers told us were allocated but weren't loaded by
the bulkstat call, we fall back to loading bulkstat data one inode at a
time to try to find the inodes that are too corrupt to load.
However, there are a couple of outcomes of the single bulkstat call that
clearly indicate that the inode is free, not corrupt. In this case, the
phase 3 inode scan will try to scrub the inode, only to be told ENOENT
because it doesn't exist.
As an optimization here, don't increment ocount, just move on to the
next inode in the mask.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>