Darrick J. Wong [Mon, 15 Apr 2024 23:07:46 +0000 (16:07 -0700)]
xfs_scrub: don't fail while reporting media scan errors
If we can't open a file to report that it has media errors, just log
that fact and move on. In this case we want to keep going with phase 6
so we report as many errors as possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:46 +0000 (16:07 -0700)]
xfs_scrub: fix threadcount estimates for phase 6
If a filesystem has a realtime device or an external log device, the
media scan can start up a separate readverify controller (and workqueue)
to handle that. Each of those controllers can call progress_add, so we
need to bump up nr_threads so that the progress reports controller knows
to make its ptvar big enough to handle all these threads.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:45 +0000 (16:07 -0700)]
xfs_db: improve number extraction in getbitval
For some reason, getbitval insists upon collecting a u64 from a pointer
bit by bit if it's not aligned to a 16-byte boundary. If not, then it
resorts to scraping bits individually. I don't know of any platform
where we require 16-byte alignment for a 8-byte access, or why we'd care
now that we have things like get_unaligned_beXX.
Rework this function to detect either naturally aligned accesses and use
the regular beXX_to_cpu functions; or byte-aligned accesses and use the
get_unaligned_beXX functions. Only fall back to the bit scraping
algorithm for the really weird cases.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:45 +0000 (16:07 -0700)]
xfs_repair: double-check with shortform attr verifiers
Call the shortform attr structure verifier as the last thing we do in
process_shortform_attr to make sure that we don't leave any latent
errors for the kernel to stumble over.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:45 +0000 (16:07 -0700)]
xfs_repair: bulk load records into new btree blocks
Amortize the cost of indirect calls further by loading a batch of
records into a new btree block instead of one record per ->get_record
call. On a rmap btree with 3.9 million records, this reduces the
runtime of xfs_btree_bload by 3% for xfsprogs. For the upcoming online
repair functionality, this will reduce runtime by 6% when spectre
mitigations are enabled in the kernel.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:45 +0000 (16:07 -0700)]
xfs_repair: adjust btree bulkloading slack computations to match online repair
Adjust the lowspace threshold in the new btree block slack computation
code to match online repair, which uses a straight 10% instead of magic
shifting to approximate that without division. Repairs aren't that
frequent in the kernel; and userspace can always do u64 division.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
I mistakenly turned off CONFIG_XFS_RT in the Kconfig file for arm64
variant of the djwong-wtf git branch. Unfortunately, it took me a good
hour to figure out that RT wasn't built because this is what got printed
to dmesg:
XFS (sda2): Not built with CONFIG_XFS_RT
XFS (sda2): RT mount failed
The root cause of these problems is the conditional compilation of the
new functions xfs_validate_rtextents and xfs_compute_rextslog that I
introduced in the two commits listed below. The !RT versions of these
functions return false and 0, respectively, which causes primary
superblock validation to fail, which explains the first message.
Move the two functions to other parts of libxfs that are not
conditionally defined by CONFIG_XFS_RT and remove the broken stubs so
that validation works again.
Fixes: e14293803f4e ("xfs: don't allow overly small or large realtime volumes") Fixes: a6a38f309afc ("xfs: make rextslog computation consistent with mkfs") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
In XFS_DAS_NODE_REMOVE_ATTR case, xfs_attr_mode_remove_attr() sets
filter to XFS_ATTR_INCOMPLETE. The filter is then reset in
xfs_attr_complete_op() if XFS_DA_OP_REPLACE operation is performed.
The filter is not reset though if XFS just removes the attribute
(args->value == NULL) with xfs_attr_defer_remove(). attr code goes
to XFS_DAS_DONE state.
Fix this by always resetting XFS_ATTR_INCOMPLETE filter. The replace
operation already resets this filter in anyway and others are
completed at this step hence don't need it.
Fixes: fdaf1bb3cafc ("xfs: ATTR_REPLACE algorithm with LARP enabled needs rework") Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
We're only allocating from the realtime device if the inode is marked
for realtime and we're /not/ allocating into the attr fork.
Fixes: 58643460546d ("xfs: also use xfs_bmap_btalloc_accounting for RT allocations") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Instead of tracing the address of the recovery handler, use the name
in the defer op, similar to other defer ops related tracepoints.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
dfp will be freed by ->recover_work and thus the tracepoint in case
of an error can lead to a use after free.
Store the defer ops in a local variable to avoid that.
Fixes: 7f2f7531e0d4 ("xfs: store an ops pointer in struct xfs_defer_pending") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Since commit deed9512872d ("xfs: Check for -ENOATTR or -EEXIST"), the
high-level attr code does a lookup for any attr we're trying to set,
and does the checks to handle the create vs replace cases, which thus
never hit the low-level attr code.
Turn the checks in xfs_attr_shortform_addname as they must never trip.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
sparse complains about struct xfs_attr_shortform because it embeds a
structure with a variable sized array in a variable sized array.
Given that xfs_attr_shortform is not a very useful structure, and the
dir2 equivalent has been removed a long time ago, remove it as well.
Provide a xfs_attr_sf_firstentry helper that returns the first
xfs_attr_sf_entry behind a xfs_attr_sf_hdr to replace the structure
dereference.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_attr_shortform_getvalue duplicates the logic in xfs_attr_sf_findname.
Use the helper instead.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_attr_shortform_lookup is only used by xfs_attr_shortform_addname,
which is much better served by calling xfs_attr_sf_findname. Switch
it over and remove xfs_attr_shortform_lookup.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_attr_sf_findname has the simple job of finding a xfs_attr_sf_entry in
the attr fork, but the convoluted calling convention obfuscates that.
Return the found entry as the return value instead of an pointer
argument, as the -ENOATTR/-EEXIST can be trivally derived from that, and
remove the basep argument, as it is equivalent of the offset of sfe in
the data for if an sfe was found, or an offset of totsize if not was
found. To simplify the totsize computation add a xfs_attr_sf_endptr
helper that returns the imaginative xfs_attr_sf_entry at the end of
the current attrs.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
trace_xfs_attr_sf_lookup is currently only called by
xfs_attr_shortform_lookup, which despit it's name is a simple helper for
xfs_attr_shortform_addname, which has it's own tracing. Move the
callsite to xfs_attr_shortform_getvalue, which is the closest thing to
a high level lookup we have for the Linux xattr API.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Many of the xfs_idata_realloc callers need to set a local pointer to the
just reallocated if_data memory. Return the pointer to simplify them a
bit and use the opportunity to re-use krealloc for freeing if_data if the
size hits 0.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
The xfs_ifork structure currently has a union of the if_root void pointer
and the if_data char pointer. In either case it is an opaque pointer
that depends on the fork format. Replace the union with a single if_data
void pointer as that is what almost all callers want. Only the symlink
NULL termination code in xfs_init_local_fork actually needs a new local
variable now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_format.h has a bunch odd wrappers for helper functions and mount
structure access using RT* prefixes. Replace them with their open coded
versions (for those that weren't entirely unused) and remove the wrappers.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Inline the logic of xfs_rtmodify_summary_int into xfs_rtmodify_summary
and xfs_rtget_summary instead of having a somewhat awkward helper to
share a little bit of code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_rtmodify_summary_int is only used inside xfs_rtbitmap.c and to
implement xfs_rtget_summary. Move xfs_rtget_summary to xfs_rtbitmap.c
as the exported API and mark xfs_rtmodify_summary_int static.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Add a return value to xfs_bmap_adjacent to indicate if it did change
ap->blkno or not.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Just return -ENOSPC instead of returning 0 and setting the return rt
extent number to NULLRTEXTNO. This is turn removes all users of
NULLRTEXTNO, so remove that as well.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Make xfs_bmap_btalloc_accounting more generic by handling the RT quota
reservations and then also use it from xfs_bmap_rtalloc instead of
open coding the accounting logic there.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_bmap_btalloc_accounting only uses the len field from args, but that
has just been propagated to ap->length field by the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
During growfs, if new ag in memory has been initialized, however
sb_agcount has not been updated, if an error occurs at this time it
will cause perag leaks as follows, these new AGs will not been freed
during umount , because of these new AGs are not visible(that is
included in mp->m_sb.sb_agcount).
Factor out xfs_free_unused_perag_range() from xfs_initialize_perag(),
used for freeing unused perag within a specified range in error handling,
included in the error path of the growfs failure.
Fixes: 1c1c6ebcf528 ("xfs: Replace per-ag array with a radix tree") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Take mp->m_perag_lock for deletions from the perag radix tree in
xfs_initialize_perag to prevent racing with tagging operations.
Lookups are fine - they are RCU protected so already deal with the
tree changing shape underneath the lookup - but tagging operations
require the tree to be stable while the tags are propagated back up
to the root.
Right now there's nothing stopping radix tree tagging from operating
while a growfs operation is progress and adding/removing new entries
into the radix tree.
Hence we can have traversals that require a stable tree occurring at
the same time we are removing unused entries from the radix tree which
causes the shape of the tree to change.
Likely this hasn't caused a problem in the past because we are only
doing append addition and removal so the active AG part of the tree
is not changing shape, but that doesn't mean it is safe. Just making
the radix tree modifications serialise against each other is obviously
correct.
Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Upon a closer inspection of the quota record scrubber, I noticed that
dqiterate wasn't actually walking all possible dquots for the mapped
blocks in the quota file. This is due to xfs_qm_dqget_next skipping all
XFS_IS_DQUOT_UNINITIALIZED dquots.
For a fsck program, we really want to look at all the dquots, even if
all counters and limits in the dquot record are zero. Rewrite the
implementation to do this, as well as switching to an iterator paradigm
to reduce the number of indirect calls.
This enables removal of the old broken dqiterate code from xfs_dquot.c.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
In a few patches, we'll add some online repair code that tries to
massage the ondisk inode record just enough to get it to pass the inode
verifiers so that we can continue with more file repairs. Part of that
massaging can include zapping the ondisk forks to clear errors. After
that point, the bmap fork repair functions will rebuild the zapped
forks.
Christoph asked for stronger protections against online repair zapping a
fork to get the inode to load vs. other threads trying to access the
partially repaired file. Do this by adding a special "[DA]FORK_ZAPPED"
inode health flag whenever repair zaps a fork, and sprinkling checks for
that flag into the various file operations for things that don't like
handling an unexpected zero-extents fork.
In practice xfs_scrub will scrub and fix the forks almost immediately
after zapping them, so the window is very small. However, if a crash or
unmount should occur, we can still detect these zapped inode forks by
looking for a zero-extents fork when data was expected.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Code in the next patch will assign the return value of XFS_DFORK_*PTR
macros to a struct pointer. gcc complains about casting char* strings
to struct pointers, so let's fix the macro's cast to void* to shut up
the warnings.
While we're at it, fix one of the scrub tests that uses PTR to use BOFF
instead for a simpler integer comparison, since other linters whine
about char* and void* comparisons.
Can't satisfy all these dman bots.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Use the rmapbt to find inode chunks, query the chunks to compute hole
and free masks, and with that information rebuild the inobt and finobt.
Refer to the case study in
Documentation/filesystems/xfs-online-fsck-design.rst for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Rebuild the free space btrees from the gaps in the rmap btree. Refer to
the case study in Documentation/filesystems/xfs-online-fsck-design.rst
for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Constrain the number of dirty buffers that are locked by the btree
staging code at any given time by establishing a threshold at which we
put them all on the delwri queue and push them to disk. This limits
memory consumption while writing out new btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
When we're performing a bulk load of a btree, move the code that
actually stores the btree record in the new btree block out of the
generic code and into the individual ->get_record implementations.
This is preparation for being able to store multiple records with a
single indirect call.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
When constructing a new btree, xfs_btree_bload_node needs to read the
btree blocks for level N to compute the keyptrs for the blocks that will
be loaded into level N+1. The level N blocks must be formatted at that
point.
A subsequent patch will change the btree bulkloader to write new btree
blocks in 256K chunks to moderate memory consumption if the new btree is
very large. As a consequence of that, it's possible that the buffers
for lower level blocks might have been reclaimed by the time the node
builder comes back to the block.
Therefore, change xfs_btree_bload_node to read the lower level blocks
to handle the reclaimed buffer case. As a side effect, the read will
increase the LRU refs, which will bias towards keeping new btree buffers
in memory after the new btree commits.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
finished formatting a new btree block and wants to queue it to be
written to disk. Once the new btree root has been committed, the blocks
(and hence the buffers) will be accessible to the rest of the
filesystem. Mark each new buffer as DONE when adding it to the delwri
list so that the next btree traversal can skip reloading the contents
from disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state. Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:
AIL: Repair0:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
delwri_submit # oops
Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that cause data integrity
problems with the repair:
AIL: Repair0: Repair1:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
find free space X
get buffer
rewrite buffer
delwri_queue:
set DELWRI_Q
already on a list, do not add
BAD: committed tree root before all blocks written
delwri_submit # too late now
I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list. If a buffer is
logged and committed, the buffer can end up on that AIL buffer list. If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list. The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.
If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block. Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.
When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.
This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
required for data consistency.
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.
To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list. By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Pass a pointer to the xfs_defer_op_type structure to xfs_defer_add and
remove the indirection through the xfs_defer_ops_type enum and a global
table of all possible operations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_defer_start_recovery is only called from xlog_recover_intent_item,
and the callers of that all have the actual xfs_defer_ops_type operation
vector at hand. Pass that directly instead of looking it up from the
defer_op_types table.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
The dfp_type field in struct xfs_defer_pending is only used to either
look up the operations associated with the pending word or in trace
points. Replace it with a direct pointer to the operations vector,
and store a pretty name in the vector for tracing.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Consolidate the xfs_attr_defer_* helpers into a single xfs_attr_defer_add
one that picks the right dela_state based on the passed in operation.
Also move to a single trace point as the actual operation is visible
through the flags in the delta_state passed to the trace point.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Move xfs_ondisk.h to libxfs so that we can do the struct sanity checks
in userspace libxfs as well. This should allow us to retire the
somewhat fragile xfs/122 test on xfstests.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_da3_swap_lastblock() copy the last block content to the dead block,
but do not update the metadata in it. We need update some metadata
for some kinds of type block, such as dir3 leafn block records its
blkno, we shall update it to the dead block blkno. Otherwise,
before write the xfs_buf to disk, the verify_write() will fail in
blk_hdr->blkno != xfs_buf->b_bn, then xfs will be shutdown.
In the case of returning -ENOSPC, ensure logflagsp is initialized by 0.
Otherwise the caller __xfs_bunmapi will set uninitialized illegal
tmp_logflags value into xfs log, which might cause unpredictable error
in the log recovery procedure.
Also, remove the flags variable and set the *logflagsp directly, so that
the code should be more robust in the long run.
Fixes: 1b24b633aafe ("xfs: move some more code into xfs_bmap_del_extent_real") Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Introduce the concept of a defer ops barrier to separate consecutively
queued pending work items of the same type. With a barrier in place,
the two work items will be tracked separately, and receive separate log
intent items. The goal here is to prevent reaping of old metadata
blocks from creating unnecessarily huge EFIs that could then run the
risk of overflowing the scrub transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Remove these unused fields since nobody uses them. They should have
been removed years ago in a different cleanup series from Christoph
Hellwig.
Fixes: daf83964a3681 ("xfs: move the per-fork nextents fields into struct xfs_ifork") Fixes: f7e67b20ecbbc ("xfs: move the fork format fields into struct xfs_ifork") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
As mentioned in the previous commit, online repair wants to allocate
space to write out a new metadata structure, and it also wants to hedge
against system crashes during repairs by logging (and later cancelling)
EFIs to free the space if we crash before committing the new data
structure.
Therefore, create a trio of functions to schedule automatic reaping of
freshly allocated unwritten space. xfs_alloc_schedule_autoreap creates
a paused EFI representing the space we just allocated. Once the
allocations are made and the autoreaps scheduled, we can start writing
to disk.
If the writes succeed, xfs_alloc_cancel_autoreap marks the EFI work
items as stale and unpauses the pending deferred work item. Assuming
that's done in the same transaction that commits the new structure into
the filesystem, we guarantee that either the new object is fully
visible, or that all the space gets reclaimed.
If the writes succeed but only part of an extent was used, repair must
call the same _cancel_autoreap function to kill the first EFI and then
log a new EFI to free the unused space. The first EFI is already
For full extents that aren't used, xfs_alloc_commit_autoreap will
unpause the EFI, which results in the space being freed during the next
_defer_finish cycle.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
xfs_free_extent_later is a trivial helper, so remove it to reduce the
amount of thinking required to understand the deferred freeing
interface. This will make it easier to introduce automatic reaping of
speculative allocations in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Traditionally, all pending deferred work attached to a transaction is
finished when one of the xfs_defer_finish* functions is called.
However, online repair wants to be able to allocate space for a new data
structure, format a new metadata structure into the allocated space, and
As a hedge against system crashes during repairs, we also want to log
some EFI items for the allocated space speculatively, and cancel them if
we elect to commit the new data structure.
Therefore, introduce the idea of pausing a pending deferred work item.
Log intent items are still created for paused items and relogged as
necessary. However, paused items are pushed onto a side list before we
start calling ->finish_item, and the whole list is reattach to the
transaction afterwards. New work items are never attached to paused
pending items.
Modify xfs_defer_cancel to clean up pending deferred work items holding
a log intent item but not a log intent done item, since that is now
possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
When someone tries to add a deferred work item to xfs_defer_add, it will
try to attach the work item to the most recently added xfs_defer_pending
object attached to the transaction. However, it doesn't check if the
pending object has a log intent item attached to it. This is incorrect
behavior because we cannot add more work to an object that has already
been committed to the ondisk log.
Therefore, change the behavior not to append to pending items with a non
null dfp_intent. In practice this has not been an issue because the
only way xfs_defer_add gets called after log intent items have been
the @dop_pending isolation in xfs_defer_finish_noroll protects the
pending items that have already been logged.
However, the next patch will add the ability to pause a deferred extent
free object during online btree rebuilding, and any new extfree work
items need to have their own pending event.
While we're at it, hoist the predicate to its own static inline function
for readability.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Extended attribute updates use the deferred work machinery to manage
state across a chain of smaller transactions. All previous deferred
work users have employed log intent items and log done items to manage
restarting of interrupted operations, which means that ->create_intent
sets dfp_intent to a log intent item and ->create_done uses that item to
create a log intent done item.
However, xattrs have used the INCOMPLETE flag to deal with the lack of
recovery support for an interrupted transaction chain. Log items are
optional if the xattr update caller didn't set XFS_DA_OP_LOGGED to
require a restartable sequence.
In other words, ->create_intent can return NULL to say that there's no
log intent item. If that's the case, no log intent done item should be
created. Clean up xfs_defer_create_done not to do this, so that the
->create_done functions don't have to check for non-null dfp_intent
themselves.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Don't allow realtime volumes that are less than one rt extent long.
This has been broken across 4 LTS kernels with nobody noticing, so let's
just disable it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
It's quite reasonable that some customer somewhere will want to
configure a realtime volume with more than 2^32 extents. If they try to
do this, the highbit32() call will truncate the upper bits of the
xfs_rtbxlen_t and produce the wrong value for rextslog. This in turn
causes the rsumlevels to be wrong, which results in a realtime summary
file that is the wrong length. Fix that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
There's a weird discrepancy in xfsprogs dating back to the creation of
the Linux port -- if there are zero rt extents, mkfs will set
sb_rextents and sb_rextslog both to zero:
However, that's not the check that xfs_repair uses for nonzero rtblocks:
if (sb->sb_rextslog !=
libxfs_highbit32((unsigned int)sb->sb_rextents))
The difference here is that xfs_highbit32 returns -1 if its argument is
zero. Unfortunately, this means that in the weird corner case of a
realtime volume shorter than 1 rt extent, xfs_repair will immediately
flag a freshly formatted filesystem as corrupt. Because mkfs has been
writing ondisk artifacts like this for decades, we have to accept that
as "correct". TBH, zero rextslog for zero rtextents makes more sense to
me anyway.
Regrettably, the superblock verifier checks created in commit copied
xfs_repair even though mkfs has been writing out such filesystems for
ages. Fix the superblock verifier to accept what mkfs spits out; the
userspace version of this patch will have to fix xfs_repair as well.
Note that the new helper leaves the zeroday bug where the upper 32 bits
of sb_rextents is ripped off and fed to highbit32. This leads to a
seriously undersized rt summary file, which immediately breaks mkfs:
The next patch will drop support for rt volumes with fewer than 1 or
more than 2^32-1 rt extents, since they've clearly been broken forever.
Fixes: f8e566c0f5e1f ("xfs: validate the realtime geometry in xfs_validate_sb_common") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
The only log items that need relogging are the ones created for deferred
work operations, and the only part of the code base that relogs log
items is the deferred work machinery. Move the function pointers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Now that we have a helper to handle creating a log intent done item and
updating all the necessary state flags, use it to reduce boilerplate in
the ->iop_relog implementations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Each log intent item's ->finish_item call chain inevitably includes some
code to set the dirty flag of the transaction. If there's an associated
log intent done item, it also sets the item's dirty flag and the
transaction's INTENT_DONE flag. This is repeated throughout the
codebase.
Reduce the LOC by moving all that to xfs_defer_finish_one.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Finish off the series by moving the intent item recovery function
pointer to the xfs_defer_op_type struct, since this is really a deferred
work function now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Get rid of the open-coded calls to xfs_defer_finish_one. This also
means that the recovery transaction takes care of cleaning up the dfp,
and we have solved (I hope) all the ownership issues in recovery.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
One thing I never quite got around to doing is porting the log intent
item recovery code to reconstruct the deferred pending work state. As a
result, each intent item open codes xfs_defer_finish_one in its recovery
method, because that's what the EFI code did before xfs_defer.c even
existed.
This is a gross thing to have left unfixed -- if an EFI cannot proceed
due to busy extents, we end up creating separate new EFIs for each
unfinished work item, which is a change in behavior from what runtime
would have done.
Worse yet, Long Li pointed out that there's a UAF in the recovery code.
The ->commit_pass2 function adds the intent item to the AIL and drops
the refcount. The one remaining refcount is now owned by the recovery
mechanism (aka the log intent items in the AIL) with the intent of
giving the refcount to the intent done item in the ->iop_recover
function.
However, if something fails later in recovery, xlog_recover_finish will
walk the recovered intent items in the AIL and release them. If the CIL
hasn't been pushed before that point (which is possible since we don't
force the log until later) then the intent done release will try to free
its associated intent, which has already been freed.
This patch starts to address this mess by having the ->commit_pass2
functions recreate the xfs_defer_pending state. The next few patches
will fix the recovery functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:30 +0000 (16:07 -0700)]
xfs_{db,repair}: use m_blockwsize instead of sb_blocksize for rt blocks
In preparation to add block headers to rt bitmap and summary blocks,
convert all the relevant calculations in the userspace tools to use the
per-block word count instead of the raw blocksize. This is key to
adding this support outside of libxfs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:30 +0000 (16:07 -0700)]
xfs_{db,repair}: use accessor functions for summary info words
Port xfs_db and xfs_repair to use get and set functions for rtsummary
words so that we can redefine the ondisk format with a specific
endianness. Note that this requires the definition of a distinct type
for ondisk summary info words so that the compiler can perform proper
typechecking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:30 +0000 (16:07 -0700)]
xfs_{db,repair}: use accessor functions for bitmap words
Port xfs_db and xfs_repair to use get and set functions for rtbitmap
words so that we can redefine the ondisk format with a specific
endianness. Note that this requires the definition of a distinct type
for ondisk rtbitmap words so that the compiler can perform proper
typechecking as we go back and forth.
In the upcoming rtgroups feature, we're going to fix the problem that
rtwords are written in host endian order, which means we'll need the
distinct rtword/rtword_raw types.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:29 +0000 (16:07 -0700)]
xfs_{db,repair}: convert open-coded xfs_rtword_t pointer accesses to helper
There are a bunch of places in xfs_db and xfs_repair where we use
open-coded logic to find a pointer to an xfs_rtword_t within a rt bitmap
buffer. Convert all that to helper functions for better type safety.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:28 +0000 (16:07 -0700)]
xfs_repair: fix confusing rt space units in the duplicate detection code
Christoph Hellwig stumbled over the crosslinked file data detection code
in xfs_repair. While trying to make sense of his fixpatch, I realized
that the variable names and unit types are very misleading.
The rt dup tree builder inserts records in units of realtime extents.
One query of the rt dup tree passes in a realtime extent number, but one
of them does not. Confusingly, all the variable names have "block" even
though they really mean "extent". This makes a real difference for
rextsize > 1 filesystems, though given the lack of complaints I'm
guessing there aren't many users.
Clean up this whole mess by fixing the variable names of the duplicates
tree and the state array to reflect the units that are stored in the
data structure, and fix the buggy query code. Later on in this patchset
we'll fix the variable types too.
This seems to have been broken since before the start of the git repo.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Notice that mkfs thinks it should format a 32768-fsblock external log,
but the log device itself is 32767 fsblocks. Hence the write goes off
the end of the device and we get ENOSPC.
I tracked this behavior down to align_log_size in mkfs, which first
tries to round the log size up to a stripe boundary, then tries to round
it down. Unfortunately, in the case of an external log we call the
function with XFS_MAX_LOG_BLOCKS without accounting for the possibility
that the log device might be smaller.
Correct the callsite and clean up the open-coded rounding.
Fixes: 8d1bff2be336 ("mkfs: reduce internal log size when log stripe units are in play") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:28 +0000 (16:07 -0700)]
libxfs: fix incorrect porting to 6.7
Userspace libxfs is supposed to match the kernel libxfs except for the
preprocessor include directives. Fix a few discrepancies that came up
for whatever reason.
To fix the build errors resulting from CONFIG_XFS_RT not being defined,
add it to libxfs.h and alter the Makefile to track xfs_rtbitmap.h.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Darrick J. Wong [Mon, 15 Apr 2024 23:07:27 +0000 (16:07 -0700)]
debian: fix package configuration after removing platform_defs.h.in
In commit 0fa9dcb61b4f, we made platform_defs.h a static header file
instead of generating it from platform_defs.h.in. Unfortunately, it
turns out that the debian packaging rules use "make
include/platform_defs.h" to run configure with the build options
set via LOCAL_CONFIGURE_OPTIONS.
Since platform_defs.h is no longer generated, the make command in
debian/rules does nothing, which means that the binaries don't get built
the way the packaging scripts specify. This breaks multiarch for
libhandle.so, as well as libeditline and libblkid support for
xfs_db/io/spaceman.
Fix this by correcting debian/rules to make include/builddefs, which
will start ./configure with the desired options.
Fixes: 0fa9dcb61b4f ("include: stop generating platform_defs.h") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
fls should never be provided by system headers. It seems like on MacOS
it did, but as we're not supporting MacOS anymore there is no need to
check for it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>