Darrick J. Wong [Wed, 10 May 2023 13:15:15 +0000 (15:15 +0200)]
mkfs: warning about misaligned AGs and RAID stripes is not an error
I've noticed a fair number of fstests failures when we create a scratch
device on a RAID device and the test specifies an explicit AG count or
AG size:
--- /tmp/fstests/tests/xfs/042.out 2022-09-01 15:09:11.484679979 -0700
+++ /var/tmp/fstests/xfs/042.out.bad 2023-04-25 19:59:04.040000000 -0700
@@ -1,5 +1,8 @@
QA output created by 042
-Make a 96 megabyte filesystem on SCRATCH_DEV and mount... done
+Make a 96 megabyte filesystem on SCRATCH_DEV and mount... Warning: AG size is a multiple of stripe width. This can cause performance
+problems by aligning all AGs on the same disk. To avoid this, run mkfs with
+an AG size that is one stripe unit smaller or larger, for example 8160.
+done
Emitting this warning on stderr is silly -- nothing has failed, and we
aren't going to abort the format either. Send the warning to stdout.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Thu, 27 Apr 2023 19:02:33 +0000 (12:02 -0700)]
xfs_db: fix broken logic in error path
smatch complains proceeding into the if body if leaf is a null pointer:
check.c:3614 process_leaf_node_dir_v2_int() warn: variable dereferenced before check 'leaf' (see line 3518)
However, the logic here is misleading and broken -- what we're trying to
do is switch between the v4 and v5 variants of the directory check.
We're using @leaf3 being a null pointer (or not) to determine v4 vs. v5,
so the "!" part of the comparison is correct, but the variable used
(leaf) is not.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
At some point in between sending this patch to the list and merging it
into for-next, the tracepoints got all mixed up because I've
over-reliant on automated tools not sucking. The end result is that the
tracepoints are all wrong, so fix them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Prior to commit 7ac2ff8bb371, when we loaded the incore perag structure
with information from the AGF header, we would set or clear the
pagf_agfl_reset field based on whether or not the AGFL list was
misaligned within the block. IOWs, it's an incore state bit that's
supposed to cache something in the ondisk metadata. Therefore, the code
still needs to support clearing the incore bit if (somehow) the AGFL
were to correct itself.
It turns out that xfs_repair does exactly this -- phase 4 loads the AGF
to scan the rmapbt for corrupt records, which can set NEEDS_AGFL_RESET.
The scan unsets AGF_INIT but doesn't unset NEEDS_AGFL_RESET. Phase 5
totally rewrites the AGFL and fixes the alignment problem, didn't clear
NEEDS_AGFL_RESET historically, and reloads the perag state to fix the
freelist. This results in the AGFL being reset based on stale data,
which then causes the new AGFL blocks to be leaked. A subsequent
xfs_repair -n then complains about the leaks.
One could argue that phase 5 ought to clear this bit directly when it
reloads the perag AGF data after rewriting the AGFL, but libxfs used to
handle this for us, so it should go back to doing that.
Found by fuzzing flfirst = ones in xfs/352.
Fixes: 7ac2ff8bb371 ("xfs: perags need atomic operational state") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There are now five separate space allocator interfaces exposed to the
rest of XFS for five different strategies to find space. Add
tracepoints for each of them so that I can tell from a trace dump
exactly which ones got called and what happened underneath them. Add a
sixth so it's more obvious if an allocation actually happened.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Callers of xfs_alloc_vextent_iterate_ags that pass in the TRYLOCK flag
want us to perform a non-blocking scan of the AGs for free space. There
are no ordering constraints for non-blocking AGF lock acquisition, so
the scan can freely start over at AG 0 even when minimum_agno > 0.
This manifests fairly reliably on xfs/294 on 6.3-rc2 with the parent
pointer patchset applied and the realtime volume enabled. I observed
the following sequence as part of an xfs_dir_createname call:
0. Fragment the free space, then allocate nearly all the free space in
all AGs except AG 0.
1. Create a directory in AG 2 and let it grow for a while.
2. Try to allocate 2 blocks to expand the dirent part of a directory.
The space will be allocated out of AG 0, but the allocation will not
be contiguous. This (I think) activates the LOWMODE allocator.
3. The bmapi call decides to convert from extents to bmbt format and
tries to allocate 1 block. This allocation request calls
xfs_alloc_vextent_start_ag with the inode number, which starts the
scan at AG 2. We ignore AG 0 (with all its free space) and instead
scrape AG 2 and 3 for more space. We find one block, but this now
kicks t_highest_agno to 3.
4. The createname call decides it needs to split the dabtree. It tries
to allocate even more space with xfs_alloc_vextent_start_ag, but now
we're constrained to AG 3, and we don't find the space. The
createname returns ENOSPC and the filesystem shuts down.
This change fixes the problem by making the trylock scan wrap around to
AG 0 if it doesn't like the AGs that it finds. Since the current
transaction itself holds AGF 0, the trylock of AGF 0 will succeed, and
we take space from the AG that has plenty.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
In porting his development branch to 6.3-rc1, yours truly has
repeatedly screwed up the args->pag being fed to the xfs_alloc_vextent*
functions. Add some debugging assertions to test the preconditions
required of the callers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Prior to the removal of xfs_ialloc_next_ag, we would increment the agi
rotor and return the *old* value. atomic_inc_return returns the new
value, which causes mkfs to allocate the root directory in AG 1. Put
back the old behavior (at least for mkfs) by subtracting 1 here.
Fixes: 20a5eab49d35 ("xfs: convert xfs_ialloc_next_ag() to an atomic") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that the filestreams AG selection tracks active perags, we need
to return an active perag to the core allocator code. This is
because the file allocation the filestreams code will run are AG
specific allocations and so need to pin the AG until the allocations
complete.
We cannot rely on the filestreams item reference to do this - the
filestreams association can be torn down at any time, hence we
need to have a separate reference for the allocation process to pin
the AG after it has been selected.
This means there is some perag juggling in allocation failure
fallback paths as they will do all AG scans in the case the AG
specific allocation fails. Hence we need to track the perag
reference that the filestream allocator returned to make sure we
don't leak it on repeated allocation failure.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_bmap_btalloc_filestreams() calls two filestreams functions to
select the AG to allocate from. Both those functions end up in
the same selection function that iterates all AGs multiple times.
Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4
times just to select the initial AG to allocate in.
Move the AG selection to fs/xfs/xfs_filestreams.c as a single
interface so that the inefficient AG interation is contained
entirely within the filestreams code. This will allow the
implementation to be simplified and made more efficient in future
patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The code in xfs_bmap_longest_free_extent() is open coded in
xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and
call it from the filestreams code instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
It is only set if reading the AGF gets a EAGAIN error. Just return
the EAGAIN error and handle that error in the callers.
This means we can remove the not_init parameter from
xfs_bmap_select_minlen(), too, because the use of not_init there is
pessimistic. If we can't read the agf, it won't increase blen.
The only time we actually care whether we checked all the AGFs for
contiguous free space is when the best length is less than the
minimum allocation length. If not_init is set, then we ignore blen
and set the minimum alloc length to the absolute minimum, not the
best length we know already is present.
However, if blen is less than the minimum we're going to ignore it
anyway, regardless of whether we scanned all the AGFs or not. Hence
not_init can go away, because we only use if blen is good from
the scanned AGs otherwise we ignore it altogether and use minlen.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There's many if (filestreams) {} else {} branches in this function.
Split it out into a filestreams specific function so that we can
then work directly on cleaning up the filestreams code without
impacting the rest of the allocation algorithms.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that the AG iteration code in the core allocation code has been
cleaned up, we can easily convert it to use a for_each_perag..()
variant to use active references and skip AGs that it can't get
active references on.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
All of the allocation functions now extract the minimum allowed AG
from the transaction and then use it in some way. The allocation
functions that are restricted to a single AG all check if the
AG requested can be allocated from and return an error if so. These
all set args->agno appropriately.
All the allocation functions that iterate AGs use it to calculate
the scan start AG. args->agno is not set until the iterator starts
walking AGs.
Hence we can easily set up a conditional check against the minimum
AG allowed in xfs_alloc_vextent_check_args() based on whether
args->agno contains NULLAGNUMBER or not and move all the repeated
setup code to xfs_alloc_vextent_check_args(), further simplifying
the allocation functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We don't need the multiplexing xfs_alloc_ag_vextent() provided
anymore - we can just call the exact/near/size variants directly.
This allows us to remove args->type completely and stop using
args->fsbno as an input to the allocator algorithms.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Move it from xfs_alloc_ag_vextent() so we can get rid of that layer.
Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish()
to indicate that it's function is finishing off the allocation that
we've run now that it contains much more functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that we have wrapper functions for each type of allocation we
can ask for, we can start unravelling xfs_alloc_ag_vextent(). That
is essentially just a prepare stage, the allocation multiplexer
and a post-allocation accounting step is the allocation proceeded.
The current xfs_alloc_vextent*() wrappers all have a prepare stage,
the allocation operation and a post-allocation accounting step.
We can consolidate this by moving the AG alloc prep code into the
wrapper functions, the accounting code in the wrapper accounting
functions, and cut out the multiplexer layer entirely.
This patch consolidates the AG preparation stage.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Two of the callers to xfs_alloc_vextent_this_ag() actually want
exact block number allocation, not anywhere-in-ag allocation. Split
this out from _this_ag() as a first class citizen so no external
extent allocation code needs to care about args->type anymore.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO
allocations. We can replace that function with a new
xfs_alloc_vextent_near_bno() function that does this explicitly.
We also multiplex NEAR_BNO allocations through
xfs_alloc_vextent_this_ag via args->type. Replace all of these with
direct calls to xfs_alloc_vextent_near_bno(), too.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_start_bno(). Callers no long need to specify
XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and
removed.
While doing this, also pass the allocation target fsb as a parameter
rather than encoding it in args->fsbno.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_first_ag(). This gets rid of
XFS_ALLOCTYPE_FIRST_AG as the type used within
xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we
can remove the setting of args->type from all the callers of
_first_ag() and remove the alloctype.
While doing this, pass the allocation target fsb as a parameter
rather than encoding it in args->fsbno. This starts the process
of making args->fsbno an output only variable rather than
input/output.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There are several different contexts xfs_bmap_btalloc() handles, and
large chunks of the code execute independent allocation contexts.
Try to untangle this mess a bit.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
callers, too, so that callers with active references don't need
to do new lookups just for an allocation in a context that already
has a perag reference.
The only remaining caller that does single AG allocation through
xfs_alloc_vextent() is xfs_bmap_btalloc() with
XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
it can be converted cleanly.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There's a bit of a recursive conundrum around
xfs_alloc_ag_vextent(). We can't first call xfs_alloc_ag_vextent()
without preparing the AGFL for the allocation, and preparing the
AGFL calls xfs_alloc_ag_vextent() to prepare the AGFL for the
allocation. This "double allocation" requirement is not really clear
from the current xfs_alloc_fix_freelist() calls that are sprinkled
through the allocation code.
It's not helped that xfs_alloc_ag_vextent() can actually allocate
from the AGFL itself, but there's special code to prevent AGFL prep
allocations from allocating from the free list it's trying to prep.
The naming is also not consistent: args->wasfromfl is true when we
allocated _from_ the free list, but the indication that we are
allocating _for_ the free list is via checking that (args->resv ==
XFS_AG_RESV_AGFL).
So, lets make this "allocation required for allocation" situation
clear by moving it all inside xfs_alloc_ag_vextent(). The freelist
allocation is a specific XFS_ALLOCTYPE_THIS_AG allocation, which
translated directly to xfs_alloc_ag_vextent_size() allocation.
This enables us to replace __xfs_alloc_vextent_this_ag() with a call
to xfs_alloc_ag_vextent(), and we drive the freelist fixing further
into the per-ag allocation algorithm.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The core of the per-ag iteration is effectively doing a "this ag"
allocation on one AG at a time. Use the same code to implement the
core "this ag" allocation in both xfs_alloc_vextent_this_ag()
and xfs_alloc_vextent_iterate_ags().
This means we only call xfs_alloc_ag_vextent() from one place so we
can easily collapse the call stack in future patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
It's a multiplexing mess that can be greatly simplified, and really
needs to be simplified to allow active per-ag references to
propagate from initial AG selection code the the bmapi code.
This splits the code out into separate a parameter checking
function, an iterator function, and allocation completion functions
and then implements the individual policies using these functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
In several places we iterate every AG from a specific start agno and
wrap back to the first AG when we reach the end of the filesystem to
continue searching. We don't have a primitive for this iteration
yet, so add one for conversion of these algorithms to per-ag based
iteration.
The filestream AG select code is a mess, and this initially makes it
worse. The per-ag selection needs to be driven completely into the
filestream code to clean this up and it will be done in a future
patch that makes the filestream allocator use active per-ag
references correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We currently don't have any flags or operational state in the
xfs_perag except for the pagf_init and pagi_init flags. And the
agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
flags, too.
For controlling per-ag operations, we are going to need some atomic
state flags. Hence add an opstate field similar to what we already
have in the mount and log, and convert all these state flags across
to atomic bit operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
This is currently a spinlock lock protected rotor which can be
implemented with a single atomic operation. Change it to be more
efficient and get rid of the m_agirotor_lock. Noticed while
converting the inode allocation AG selection loop to active perag
references.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Lots of code in the inobt infrastructure is passed both xfs_mount
and perags. We only need perags for the per-ag inode allocation
code, so reduce the duplication by passing only the perags as the
primary object.
This ends up reducing the code size by a bit:
text data bss dec hex filename
orig 1138878 323979 548 1463405 16546d (TOTALS)
patched 1138709 323979 548 1463236 1653c4 (TOTALS)
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Convert the inode allocation routines to use active perag references
or references held by callers rather than grab their own. Also drive
the perag further inwards to replace xfs_mounts when doing
operations on a specific AG.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Callers have referenced perags but they don't pass it into
xfs_imap() so it takes it's own reference. Fix that so we can change
inode allocation over to using active references.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
So that they all output the same information in the traces to make
debugging refcount issues easier.
This means that all the lookup/drop functions no longer need to use
the full memory barrier atomic operations (atomic*_return()) so
will have less overhead when tracing is off. The set/clear tag
tracepoints no longer abuse the reference count to pass the tag -
the tag being cleared is obvious from the _RET_IP_ that is recorded
in the trace point.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We need to be able to dynamically remove instantiated AGs from
memory safely, either for shrinking the filesystem or paging AG
state in and out of memory (e.g. supporting millions of AGs). This
means we need to be able to safely exclude operations from accessing
perags while dynamic removal is in progress.
To do this, introduce the concept of active and passive references.
Active references are required for high level operations that make
use of an AG for a given operation (e.g. allocation) and pin the
perag in memory for the duration of the operation that is operating
on the perag (e.g. transaction scope). This means we can fail to get
an active reference to an AG, hence callers of the new active
reference API must be able to handle lookup failure gracefully.
Passive references are used in low level code, where we might need
to access the perag structure for the purposes of completing high
level operations. For example, buffers need to use passive
references because:
- we need to be able to do metadata IO during operations like grow
and shrink transactions where high level active references to the
AG have already been blocked
- buffers need to pin the perag until they are reclaimed from
memory, something that high level code has no direct control over.
- unused cached buffers should not prevent a shrink from being
started.
Hence we have active references that will form exclusion barriers
for operations to be performed on an AG, and passive references that
will prevent reclaim of the perag until all objects with passive
references have been reclaimed themselves.
This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API
for active AG reference functionality. We also need to convert the
for_each_perag*() iterators to use active references, which will
start the process of converting high level code over to using active
references. Conversion of non-iterator based code to active
references will be done in followup patches.
Note that the implementation using reference counting is really just
a development vehicle for the API to ensure we don't have any leaks
in the callers. Once we need to remove perag structures from memory
dyanmically, we will need a much more robust per-ag state transition
mechanism for preventing new references from being taken while we
wait for existing references to drain before removal from memory can
occur....
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The tp->t_firstblock field is now raelly tracking the highest AG we
have locked, not the block number of the highest allocation we've
made. It's purpose is to prevent AGF locking deadlocks, so rename it
to "highest AG" and simplify the implementation to just track the
agno rather than a fsbno.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that xfs_alloc_vextent() does all the AGF deadlock prevention
filtering for multiple allocations in a single transaction, we no
longer need the allocation setup code to care about what AGs we
might already have locked.
Hence we can remove all the "nullfb" conditional logic in places
like xfs_bmap_btalloc() and instead have them focus simply on
setting up locality constraints. If the allocation fails due to
AGF lock filtering in xfs_alloc_vextent, then we just fall back as
we normally do to more relaxed allocation constraints.
As a result, any allocation that allows AG scanning (i.e. not
confined to a single AG) and does not force a worst case full
filesystem scan will now be able to attempt allocation from AGs
lower than that defined by tp->t_firstblock. This is because
xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
low space algorithms to at least -try- to get space from AGs lower
than the one that we have currently locked and allocated from. This
is a significant improvement in the low space allocation algorithm.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When we enter xfs_bmbt_alloc_block() without having first allocated
a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
are doing something like unwritten extent conversion, the transaction
block reservation is used as the minleft value.
This works for operations like unwritten extent conversion, but it
assumes that the block reservation is only for a BMBT split. THis is
not always true, and sometimes results in larger than necessary
minleft values being set. We only actually need enough space for a
btree split, something we already handle correctly in
xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
calculate the number of blocks a BMBT split on this inode is going to
require, not use the transaction block reservation that contains the
maximum number of blocks this transaction may consume in it...
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When an XFS filesystem has free inodes in chunks already allocated
on disk, it will still allocate new inode chunks if the target AG
has no free inodes in it. Normally, this is a good idea as it
preserves locality of all the inodes in a given directory.
However, at ENOSPC this can lead to using the last few remaining
free filesystem blocks to allocate a new chunk when there are many,
many free inodes that could be allocated without consuming free
space. This results in speeding up the consumption of the last few
blocks and inode create operations then returning ENOSPC when there
free inodes available because we don't have enough block left in the
filesystem for directory creation reservations to proceed.
Hence when we are near ENOSPC, we should be attempting to preserve
the remaining blocks for directory block allocation rather than
using them for unnecessary inode chunk creation.
This particular behaviour is exposed by xfs/294, when it drives to
ENOSPC on empty file creation whilst there are still thousands of
free inodes available for allocation in other AGs in the filesystem.
Hence, when we are within 1% of ENOSPC, change the inode allocation
behaviour to prefer to use existing free inodes over allocating new
inode chunks, even though it results is poorer locality of the data
set. It is more important for the allocations to be space efficient
near ENOSPC than to have optimal locality for performance, so lets
modify the inode AG selection code to reflect that fact.
This allows generic/294 to not only pass with this allocator rework
patchset, but to increase the number of post-ENOSPC empty inode
allocations to from ~600 to ~9080 before we hit ENOSPC on the
directory create transaction reservation.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.
Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.
The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt. e.g. the failed attempt skipped several AGs
before failing, so we have locks an AG higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.
The deadlock exists even if allocation succeeds - we can do a
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.
xfs_alloc_vextent() can also check there if the requested
allocation falls within the allow range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to to non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.
This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.
A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:
1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.
2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.
3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run
4. The split work from the unwritten extent conversion must run
first.
5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.
6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.
At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....
This is a highly improbably series of events, but there it is.
There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.
This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.
The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Change the name of all pointers to xfs_extent_item structures to "xefi"
to make the name consistent and because the current selections ("new"
and "free") mean other things in C.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Eric Sandeen [Thu, 20 Apr 2023 14:01:01 +0000 (09:01 -0500)]
xfsprogs: nrext64 option should be in [inode] section of mkfs conf files
nrext64 is an inode (-i) section option, not a metadata (-m) section option.
Reported-by: Marcos Mello <marcosfrm@gmail.com> Fixes: 69e7272213 ("mkfs: Add option to create filesystem with large extent counters") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Thu, 6 Apr 2023 00:08:02 +0000 (17:08 -0700)]
misc: test the dir/attr hash before formatting or repairing fs
Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in
ext4 against generic/454. The cause of this test failure was the
unfortunate combination of setting an xattr name containing UTF8 encoded
emoji, an xattr hash function that accepted a char pointer with no
explicit signedness, signed type extension of those chars to an int, and
the 6.2 build tools maintainers deciding to mandate -funsigned-char
across the board. As a result, the ondisk extended attribute structure
written out by 6.1 and 6.2 were not the same.
This discrepancy, in fact, had been noticeable if a filesystem with such
an xattr were moved between any two architectures that don't employ the
same signedness of a raw "char" declaration. The only reason anyone
noticed is that x86 gcc defaults to signed, and no such -funsigned-char
update was made to e2fsprogs, so e2fsck immediately started reporting
data corruption.
After a day and a half of discussing how to handle this use case (xattrs
with bit 7 set anywhere in the name) without breaking existing users,
Linus merged his own patch and didn't tell the mailing list. None of
the developers noticed until AUTOSEL made an announcement.
In the end, this problem could have been detected much earlier if there
had been any useful tests of hash function(s) in use inside ext4 to make
sure that they always produce the same outputs given the same inputs.
The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's
vulnerable to this problem. However, let's avoid all this drama by
adding our own self test to check that the da hash produces the same
outputs for a static pile of inputs on various platforms. This
corresponds to the similar patch for the kernel.
Lately I've been stress-testing extreme-sized rmap btrees by using the
(new) xfs_db bmap_inflate command to clone bmbt mappings billions of
times and then using xfs_repair to build new rmap and refcount btrees.
This of course is /much/ faster than actually FICLONEing a file billions
of times.
Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
sufficiently large for the test scenario. For a 1TB filesystem (~67
million AG blocks, 4 AGs) the btheight command reports:
The AG is sufficiently large to build this rmap btree. Unfortunately,
m_rmap_maxlevels is 5. Augmenting the loop in the space->height
function to report height, node blocks, and blocks remaining produces
this:
The goal of this function is to compute the maximum height btree that
can be stored in the given number of ondisk fsblocks. Starting with the
top level of the tree, each iteration through the loop adds the fanout
factor of the next level down until we run out of blocks. IOWs, maximum
height is achieved by using the smallest fanout factor that can apply
to that level.
However, the loop setup is not correct. Top level btree blocks are
allowed to contain fewer than minrecs items, so the computation is
incorrect because the first time through the loop it should be using a
fanout factor of 2. With this corrected, the above becomes:
Fixes: 9ec691205e7d ("xfs: compute the maximum height of the rmap btree when reflink enabled") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
While investigating test failures in xfs/17[1-3] in alwayscow mode, I
noticed through code inspection that xfs_bmap_alloc_userdata isn't
setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW
fork. COW staging extents should be flagged as USERDATA, since user
data are persisted to these blocks before being remapped into a file.
This mis-classification has a few impacts on the behavior of the system.
First, the filestreams allocator is supposed to keep allocating from a
chosen AG until it runs out of space in that AG. However, it only does
that for USERDATA allocations, which means that COW allocations aren't
tied to the filestreams AG. Fortunately, few people use filestreams, so
nobody's noticed.
A more serious problem is that xfs_alloc_ag_vextent_small looks for a
buffer to invalidate *if* the USERDATA flag is set and the AG is so full
that the allocation had to come from the AGFL because the cntbt is
empty. The consequences of not invalidating the buffer are severe --
if the AIL incorrectly checkpoints a buffer that is now being used to
store user data, that action will clobber the user's written data.
Fix filestreams and yet another data corruption vector by flagging COW
allocations as USERDATA.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_btree_check_block contains debugging knobs. With XFS_DEBUG setting up,
turn on the debugging knob can trigger the assert of xfs_btree_islastblock,
test script as follows:
while true
do
mount $disk $mountpoint
fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null
echo 1 > /sys/fs/xfs/sda/errortag/btree_chk_sblk
sleep 10
umount $mountpoint
done
Kick off fsstress and only *then* turn on the debugging knob. If it
happens that the knob gets turned on after the cntbt lookup succeeds
but before the call to xfs_btree_islastblock, then we *can* end up in
the situation where a previously checked btree block suddenly starts
returning EFSCORRUPTED from xfs_btree_check_block. Kaboom.
Darrick give a very detailed explanation as follows:
Looking back at commit 27d9ee577dcce, I think the point of all this was
to make sure that the cursor has actually performed a lookup, and that
the btree block at whatever level we're asking about is ok.
If the caller hasn't ever done a lookup, the bc_levels array will be
empty, so cur->bc_levels[level].bp pointer will be NULL. The call to
xfs_btree_get_block will crash anyway, so the "ASSERT(block);" part is
pointless.
If the caller did a lookup but the lookup failed due to block
corruption, the corresponding cur->bc_levels[level].bp pointer will also
be NULL, and we'll still crash. The "ASSERT(xfs_btree_check_block);"
logic is also unnecessary.
If the cursor level points to an inode root, the block buffer will be
incore, so it had better always be consistent.
If the caller ignores a failed lookup after a successful one and calls
this function, the cursor state is garbage and the assert wouldn't have
tripped anyway. So get rid of the assert.
Fixes: 27d9ee577dcc ("xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock") Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount
metadata corruptions after being run. Specifically, xfs_repair noticed
single-block refcount records that could be combined but had not been.
The root cause of this is improper MAXREFCOUNT edge case handling in
xfs_refcount_merge_extents. When we're trying to find candidates for a
refcount btree record merge, we compute the refcount attribute of the
merged record, but we fail to account for the fact that once a record
hits rc_refcount == MAXREFCOUNT, it is pinned that way forever. Hence
the computed refcount is wrong, and we fail to merge the extents.
Fix this by adjusting the merge predicates to compute the adjusted
refcount correctly.
Fixes: 3172725814f9 ("xfs: adjust refcount of an extent of blocks in refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Hoist these multiline conditionals into separate static inline helpers
to improve readability and set the stage for corruption fixes that will
be introduced in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a new error injection knob so that we can arbitrarily slow down
pagecache writes to test for race conditions and aberrant reclaim
behavior if the writeback mechanisms are slow to issue writeback. This
will enable functional testing for the ifork sequence counters
introduced in commit 304a68b9c63b ("xfs: use iomap_valid method to
detect stale cached iomaps") that fixes write racing with reclaim
writeback.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a new error injection knob so that we can arbitrarily slow down
writeback to test for race conditions and aberrant reclaim behavior if
the writeback mechanisms are slow to issue writeback. This will enable
functional testing for the ifork sequence counters introduced in commit 745b3f76d1c8 ("xfs: maintain a sequence count for inode fork
manipulations").
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.
The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, followed by invalidates
the cached data, then punching out the delalloc extent from under
that data.
On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.
IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur. Because the moment we actually read data first, the
readahead code allocates multipage folios and suddenly the
invalidate code goes back to zeroing subfolio ranges without
changing dirty state.
Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t to this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.
And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.
This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.
As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc extent are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:
xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:
or the ->iomap_valid() introduction commit for exact details of the
corruption vector.
The validity cookie we store in the iomap is based on the type of
iomap we return. It is expected that the iomap->flags we set in
xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
returned to us in the iomap passed via the .iomap_valid() callback.
This ensures that the validity cookie is always checking the correct
inode fork sequence numbers to detect potential changes that affect
the extent cached by the iomap.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
If an xattr structure contains enough names with the same hash value to
fill multiple xattr leaf blocks with names all hashing to the same
value, then the dabtree nodes will contain consecutive entries with the
same hash value.
This causes false corruption reports in xfs_repair because it's not
expecting such a huge same-hashing structure. Fix that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 1 Mar 2023 16:05:45 +0000 (08:05 -0800)]
mkfs: substitute slashes with spaces in protofiles
A user requested the ability to specify directory entry names in a
protofile that have spaces in them. The protofile format itself does
not allow spaces (yay 1973-era protofiles!) but it does allow slashes.
Slashes aren't allowed in directory entry names, so we'll permit this
one gross hack. After this, the protofile:
Requested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 1 Mar 2023 16:05:39 +0000 (08:05 -0800)]
mkfs: use suboption processing for -p
Use suboption processing for -p so that we can add a few behavioral
variants to protofiles in the next patch. As a side effect of this
change, one can now provide the path to a protofile in the config
file:
[proto]
file=/tmp/protofile
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 1 Mar 2023 16:05:34 +0000 (08:05 -0800)]
mkfs: check dirent names when reading protofile
The protofile parser in mkfs does not check directory entry names when
populating the filesystem. The libxfs directory code doesn't check them
either, since they depend on the Linux VFS to sanitize incoming names.
If someone puts a slash in the first (name) column in the protofile,
this results in a successful format and xfs_repair -n immediately
complains.
Screen the names that are being read from the protofile.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Arjun Shankar [Wed, 8 Feb 2023 14:34:16 +0000 (15:34 +0100)]
Remove several implicit function declarations
During configure, several ioctl checks omit the corresponding include
and a pwritev2 check uses the wrong feature test macro.
This commit fixes the same.
Signed-off-by: Arjun Shankar <arjun@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_db: make flist_find_ftyp() to check for field existance on disk
flist_find_ftyp() searches for the field of the requested type. The
first found field/path is returned. However, this doesn't work when
there are multiple fields of the same type. For example, attr3 type
have a few CRC fields. Leaf block (xfs_attr_leaf_hdr ->
xfs_da3_blkinfo) and remote value block (xfs_attr3_rmt_hdr) both
have CRC but goes under attr3 type. This causes 'crc' command to be
unable to find CRC field when we are at remote attribute block as it
tries to use leaf block CRC path:
$ # CRC of the remote value block
$ xfs_db -r -x /dev/sda5 -c 'inode 132' -c 'ablock 1' -c 'crc'
field info not found
parsing error
Solve this by making flist_find_ftyp() to also check that field in
question have non-zero count (exist at the current block).
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Thu, 16 Feb 2023 21:53:10 +0000 (13:53 -0800)]
xfs_io: fix bmap command not detecting realtime files with xattrs
Fix the bmap command so that it will detect a realtime file if any of
the other file flags (e.g. xattrs) are set. Observed via xfs/556.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
doesn't work on an ext4 filesystem. The above command is supposed to
issue a GETFSMAP query against the "data" device. Although the manpage
doesn't claim support for ext4, it turns out that this you get this
trace data:
xfs_io-4144 [002] 210.965642: ext4_getfsmap_low_key: dev
7:0 keydev 163:2567 block 0 len 0 owner 0 flags 0x0
xfs_io-4144 [002] 210.965645: ext4_getfsmap_high_key: dev
7:0 keydev 32:5277:0 block 0 len 0 owner -1 flags 0xffffffff
Notice the random garbage in the keydev field -- this happens because
openfile (in xfs_io) doesn't initialize *fs_path if the caller doesn't
supply a geometry structure or the opened file isn't on an XFS
filesystem. IOWs, we feed random heap garbage to the kernel, and the
kernel rejects the call unnecessarily.
Fix this to set the fspath information even for foreign filesystems.
Reported-by: tytso@mit.edu Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Thu, 16 Feb 2023 21:52:58 +0000 (13:52 -0800)]
xfs_scrub: fix broken realtime free blocks unit conversions
r_blocks is in units of fs blocks, but freertx is in units of realtime
extents. Add the missing conversion factor so we don't end up with
bogus things like this:
Pretend that sda and sdb are both 100T volumes.
# mkfs.xfs -f /dev/sda -b -r rtdev=/dev/sdb,extsize=2m
# mount /dev/sda /mnt -o rtdev=/dev/sdb
# xfs_scrub -dTvn /mnt
<snip>
Phase 7: Check summary counters.
3.5TiB data used; 99.8TiB realtime data used; 55 inodes used.
2.0GiB data found; 50.0MiB realtime data found; 55 inodes found.
55 inodes counted; 0 inodes checked.
We just created the filesystem, the realtime volume should be empty.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Thu, 19 Jan 2023 23:39:06 +0000 (10:39 +1100)]
progs: just use libtoolize
We no longer support xfsprogs on random platforms other than Linux,
so drop the complexity in detecting the libtoolize binary on MacOS
from the main makefile.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Thu, 19 Jan 2023 23:39:05 +0000 (10:39 +1100)]
progs: autoconf fails during debian package builds
For some reason, a current debian testing build system will fail to
build debian packages because the build environment is not correctly
detecting that libtoolize needs the "-i" parameter to copy in the
files needed by autoconf.
My build scripts run "make -j 16 realclean; make -j 16 deb", and the
second step is failing immediately with:
If I run 'make realclean; make deb' from the command line, the
package build runs to completion. I have not been able to work out
why the initial build fails, but then succeeds after a 'make
realclean' has been run, and I don't feel like spending hours
running down this rabbit hole.
This conditional "-i" flag detection was added back in *2009* when
default libtoolize behaviour was changed to not copy the config
files into the build area, and the "-i" flag was added to provide
that behaviour. It is detecting that the "-i" flag is needed that is
now failing, but it is most definitely still needed.
Rather than ispending lots of time trying to understand this and
then making the detection more complex, just use the "-i" flag
unconditionally and require any userspace that this now breaks on to
upgrade their 15+ year old version of libtoolize something a little
more modern.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Catherine Hoang [Thu, 5 Jan 2023 00:36:13 +0000 (16:36 -0800)]
xfs_admin: get UUID of mounted filesystem
Adapt this tool to call xfs_io to retrieve the UUID of a mounted filesystem.
This is a precursor to enabling xfs_admin to set the UUID of a mounted
filesystem.
Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfsprogs: scrub: fix warnings/errors due to missing include
Gentoo is currently trying to rebuild the world with clang-16, uncovering exciting
new errors in many packages since several warnings have been turned into errors,
among them missing prototypes, as documented at:
https://discourse.llvm.org/t/clang-16-notice-of-potentially-breaking-changes/65562
xfsprogs came up, with details at https://bugs.gentoo.org/875050.
The problem was easy to find: a missing include for the u_init/u_cleanup
prototypes. The error:
Building scrub
[CC] unicrash.o
unicrash.c:746:2: error: call to undeclared function 'u_init'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
u_init(&uerr);
^
unicrash.c:746:2: note: did you mean 'u_digit'?
/usr/include/unicode/uchar.h:4073:1: note: 'u_digit' declared here
u_digit(UChar32 ch, int8_t radix);
^
unicrash.c:754:2: error: call to undeclared function 'u_cleanup'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
u_cleanup();
^
2 errors generated.
The complaint is valid and the fix is easy enough: just add the missing include.
Signed-off-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 21 Dec 2022 00:53:34 +0000 (16:53 -0800)]
xfs_db: fix dir3 block magic check
Fix this broken check, which (amazingly) went unnoticed until I cranked
up the warning level /and/ built the system for s390x.
Fixes: e96864ff4d4 ("xfs_db: enable blockget for v5 filesystems") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Srikanth C S [Tue, 13 Dec 2022 17:15:43 +0000 (22:45 +0530)]
fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
After a recent data center crash, we had to recover root filesystems
on several thousands of VMs via a boot time fsck. Since these
machines are remotely manageable, support can inject the kernel
command line with 'fsck.mode=force fsck.repair=yes' to kick off
xfs_repair if the machine won't come up or if they suspect there
might be deeper issues with latent errors in the fs metadata, which
is what they did to try to get everyone running ASAP while
anticipating any future problems. But, fsck.xfs does not address the
journal replay in case of a crash.
fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
possible that when the machine crashes, the fs is in inconsistent
state with the journal log not yet replayed. This can drop the machine
into the rescue shell because xfs_fsck.sh does not know how to clean the
log. Since the administrator told us to force repairs, address the
deficiency by cleaning the log and rerunning xfs_repair.
Run xfs_repair -e when fsck.mode=force and repair=auto or yes.
Replay the logs only if fsck.mode=force and fsck.repair=yes. For
other option -fa and -f drop to the rescue shell if repair detects
any corruptions.
Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Tue, 13 Dec 2022 19:39:48 +0000 (11:39 -0800)]
xfs_db: create separate struct and field definitions for finobts
Create separate field_t definitions for the free inode btree because db
needs to know that the interior block pointers point to finobt blocks,
not inobt blocks. This is critical now because the buffer ops contain
magic numbers, the ->verify_struct routines use the magics listed in the
buffer ops, and the xfs_db iocursor calls the verifier functions.
Without this patch, xfs_db emits bizarre output like this:
Darrick J. Wong [Wed, 23 Nov 2022 17:09:39 +0000 (09:09 -0800)]
xfs_{db,repair}: fix XFS_REFC_COW_START usage
This is really a bit field stashed in the upper bit of the rc_startblock
field, so change its usage patterns to use masking instead of integer
addition and subtraction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:33 +0000 (09:09 -0800)]
xfs_repair: retain superblock buffer to avoid write hook deadlock
Every now and then I experience the following deadlock in xfs_repair
when I'm running the offline repair fuzz tests:
#0 futex_wait (private=0, expected=2, futex_word=0x55555566df70) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x55555566df70, private=0) at ./nptl/lowlevellock.c:49
#2 lll_mutex_lock_optimized (mutex=0x55555566df70) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=mutex@entry=0x55555566df70) at ./nptl/pthread_mutex_lock.c:93
#4 cache_shake (cache=cache@entry=0x55555566de60, priority=priority@entry=2, purge=purge@entry=false) at cache.c:231
#5 cache_node_get (cache=cache@entry=0x55555566de60, key=key@entry=0x7fffe55e01b0, nodep=nodep@entry=0x7fffe55e0168) at cache.c:452
#6 __cache_lookup (key=key@entry=0x7fffe55e01b0, flags=0, bpp=bpp@entry=0x7fffe55e0228) at rdwr.c:405
#7 libxfs_getbuf_flags (btp=0x55555566de00, blkno=0, len=<optimized out>, flags=<optimized out>, bpp=0x7fffe55e0228) at rdwr.c:457
#8 libxfs_buf_read_map (btp=0x55555566de00, map=map@entry=0x7fffe55e0280, nmaps=nmaps@entry=1, flags=flags@entry=0, bpp=bpp@entry=0x7fffe55e0278, ops=0x5555556233e0 <xfs_sb_buf_ops>)
at rdwr.c:704
#9 libxfs_buf_read (ops=<optimized out>, bpp=0x7fffe55e0278, flags=0, numblks=<optimized out>, blkno=0, target=<optimized out>)
at /storage/home/djwong/cdev/work/xfsprogs/build-x86_64/libxfs/libxfs_io.h:195
#10 libxfs_getsb (mp=mp@entry=0x7fffffffd690) at rdwr.c:162
#11 force_needsrepair (mp=0x7fffffffd690) at xfs_repair.c:924
#12 repair_capture_writeback (bp=<optimized out>) at xfs_repair.c:1000
#13 libxfs_bwrite (bp=0x7fffe011e530) at rdwr.c:869
#14 cache_shake (cache=cache@entry=0x55555566de60, priority=priority@entry=2, purge=purge@entry=false) at cache.c:240
#15 cache_node_get (cache=cache@entry=0x55555566de60, key=key@entry=0x7fffe55e0470, nodep=nodep@entry=0x7fffe55e0428) at cache.c:452
#16 __cache_lookup (key=key@entry=0x7fffe55e0470, flags=1, bpp=bpp@entry=0x7fffe55e0538) at rdwr.c:405
#17 libxfs_getbuf_flags (btp=0x55555566de00, blkno=12736, len=<optimized out>, flags=<optimized out>, bpp=0x7fffe55e0538) at rdwr.c:457
#18 __libxfs_buf_get_map (btp=<optimized out>, map=map@entry=0x7fffe55e05b0, nmaps=<optimized out>, flags=flags@entry=1, bpp=bpp@entry=0x7fffe55e0538) at rdwr.c:501
#19 libxfs_buf_get_map (btp=<optimized out>, map=map@entry=0x7fffe55e05b0, nmaps=<optimized out>, flags=flags@entry=1, bpp=bpp@entry=0x7fffe55e0538) at rdwr.c:525
#20 pf_queue_io (args=args@entry=0x5555556722c0, map=map@entry=0x7fffe55e05b0, nmaps=<optimized out>, flag=flag@entry=11) at prefetch.c:124
#21 pf_read_bmbt_reclist (args=0x5555556722c0, rp=<optimized out>, numrecs=78) at prefetch.c:220
#22 pf_scan_lbtree (dbno=dbno@entry=1211, level=level@entry=1, isadir=isadir@entry=1, args=args@entry=0x5555556722c0, func=0x55555557f240 <pf_scanfunc_bmap>) at prefetch.c:298
#23 pf_read_btinode (isadir=1, dino=<optimized out>, args=0x5555556722c0) at prefetch.c:385
#24 pf_read_inode_dirs (args=args@entry=0x5555556722c0, bp=bp@entry=0x7fffdc023790) at prefetch.c:459
#25 pf_read_inode_dirs (bp=<optimized out>, args=0x5555556722c0) at prefetch.c:411
#26 pf_batch_read (args=args@entry=0x5555556722c0, which=which@entry=PF_PRIMARY, buf=buf@entry=0x7fffd001d000) at prefetch.c:609
#27 pf_io_worker (param=0x5555556722c0) at prefetch.c:673
#28 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#29 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>From this stack trace, we see that xfs_repair's prefetch module is
getting some xfs_buf objects ahead of initiating a read (#19). The
buffer cache has hit its limit, so it calls cache_shake (#14) to free
some unused xfs_bufs. The buffer it finds is a dirty buffer, so it
calls libxfs_bwrite to flush it out to disk, which in turn invokes the
buffer write hook that xfs_repair set up in 3b7667cb to mark the ondisk
filesystem's superblock as NEEDSREPAIR until repair actually completes.
Unfortunately, the NEEDSREPAIR handler itself needs to grab the
superblock buffer, so it makes another call into the buffer cache (#9),
which sees that the cache is full and tries to shake it(#4). Hence we
deadlock on cm_mutex because shaking is not reentrant.
Fix this by retaining a reference to the superblock buffer when possible
so that the writeback hook doesn't have to access the buffer cache to
set NEEDSREPAIR.
Fixes: 3b7667cb ("xfs_repair: set NEEDSREPAIR the first time we write to a filesystem") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:28 +0000 (09:09 -0800)]
xfs_repair: don't crash on unknown inode parents in dry run mode
Fuzz testing of directory block headers exposed a debug assertion vector
in xfs_repair. In normal (aka fixit) mode, if a single-block directory
has a totally trashed block, repair will zap the entire directory.
Phase 4 ignores any dirents pointing to the zapped directory, phase 6
ignores the freed directory, and everything is good.
However, in dry run mode, we don't actually free the inode. Phase 4
still ignores any dirents pointing to the zapped directory, but phase 6
thinks the inode is still live and tries to walk it. xfs_repair doesn't
know of any parents for the zapped directory and so trips the assertion.
The assertion is critical for fixit mode because we need all the parent
information to ensure consistency of the directory tree. In dry run
mode we don't care, because we only have to print inconsistencies and
return 1. Worse yet, (our) customers file bugs when xfs_repair crashes
during a -n scan, so this will generate support calls.
Make everyone's life easier by downgrading the assertion to a warning if
we're running in dry run mode.
Found by fuzzing bhdr.hdr.bno = zeroes in xfs/471.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:22 +0000 (09:09 -0800)]
xfs_db: fix printing of reverse mapping record blockcounts
FLDT_EXTLEN is the correct type for a 32-bit block count within an AG;
FLDT_REXTLEN is the type for a 21-bit file mapping block count. This
code should have been using the first type, not the second.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Notice that we passed in octal-zero, 'h', '5', 'o', but the fs label is
set to octal-5, 'o' because of the incorrect loop logic. -Wlogical-op
found this one.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:11 +0000 (09:09 -0800)]
misc: add missing includes
Add missing #include directives so that the compiler can typecheck
functions against their declarations. IOWs, -Wmissing-declarations
found some things.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
This corruption will shutdown the file system and the file system will
no longer be mountable. The following script can reproduce the problem,
but it may take a long time.
With lazysbcount is enabled, There is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
m_ifree, this will make the m_ifree greater than m_icount. For example,
consider the following sequence and ifreedelta is postive:
After this, incorrect inode count (sb_ifree > sb_icount) will be writen to
the log. In the subsequent writing of sb, incorrect inode count (sb_ifree >
sb_icount) will fail to pass the boundary check in xfs_validate_sb_write()
that cause the file system shutdown.
When lazysbcount is enabled, we don't need to guarantee that Lazy sb
counters are completely correct, but we do need to guarantee that sb_ifree
<= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
must be satisfied any time that there /cannot/ be other threads allocating
or freeing inode chunks. If the constraint is violated under these
circumstances, sb_i{count,free} (the ondisk superblock inode counters)
maybe incorrect and need to be marked sick at unmount, the count will
be rebuilt on the next mount.
Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
a bit flag, even though it's *only* a bit flag. Rename the variable to
reflect its nature and update the cast target since we're not supposed
to be comparing it to xfs_agblock_t now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We're supposed to initialize the list head of an object before adding it
to another list. Fix that, and stop using the kmem_{alloc,free} calls
from the Irix days.
Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
As we've seen, refcount records use the upper bit of the rc_startblock
field to ensure that all the refcount records are at the right side of
the refcount btree. This works because an AG is never allowed to have
more than (1U << 31) blocks in it. If we ever encounter a filesystem
claiming to have that many blocks, we absolutely do not want reflink
touching it at all.
However, this test at the start of xfs_refcount_recover_cow_leftovers is
slightly incorrect -- it /should/ be checking that agblocks isn't larger
than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
constant is never large enough to conflict with that CoW flag.
Note that the V5 superblock verifier has not historically rejected
filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this
ended up in the COW recovery routine.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.
The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>