This provides users the ability to look up a reverse mapping from a bmbt
record -- start with the physical block; then if there are multiple
records for the same block, move on to the owner; then the inode fork
type; and so on to the file offset.
However, the key comparison functions incorrectly remove the
fork/btree/unwritten information that's encoded in the on-disk offset.
This means that lookup comparisons are only done with:
(physical block, owner, offset)
This means that queries can return incorrect results. On consistent
filesystems this hasn't been an issue because blocks are never shared
between forks or with bmbt blocks; and are never unwritten. However,
this bug means that online repair cannot always detect corruption in the
key information in internal rmapbt nodes.
Found by fuzzing keys[1].attrfork = ones on xfs/371.
Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Pass the same oldext argument (which contains the existing rmapping's
unwritten state) to xfs_rmap_lookup_le_range at the start of
xfs_rmap_convert_shared. At this point in the code, flags is zero,
which means that we perform lookups using the wrong key.
Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Make sure that we actually initialize xefi_discard when we're scheduling
a deferred free of an AGFL block. This was (eventually) found by the
UBSAN while I was banging on realtime rmap problems, but it exists in
the upstream codebase. While we're at it, rearrange the structure to
reduce the struct size from 64 to 56 bytes.
Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
in the realtime bitmap is always one less than the number of rt extents,
which means that the key clamp at the start of the function is wrong.
The 4th argument to xfs_rtfind_forw is the highest rt extent that we
want to probe, which means that passing 1 less than the high key is
wrong. Finally, drop the rem variable that controls the loop because we
can compare the iteration point (rtstart) against the high key directly.
The sordid history of this function is that the original commit (fb3c3)
incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
to xfs_rtfind_forw. This was wrong because the "high key" is supposed
to be the largest key for which the caller wants result rows, not the
key for the first row that could possibly be outside the range that the
caller wants to see.
A subsequent attempt (8ad56) to strengthen the parameter checking added
incorrect clamping of the parameters to the number of rt blocks in the
system (despite the bitmap functions all taking units of rt extents) to
avoid querying ranges past the end of rt bitmap file but failed to fix
the incorrect _rtfind_forw parameter. The original _rtfind_forw
parameter error then survived the conversion of the startblock and
blockcount fields to rt extents (a0e5c), and the most recent off-by-one
fix (a3a37) thought it was patching a problem when the end of the rt
volume is not in use, but none of these fixes actually solved the
original problem that the author was confused about the "limit" argument
to xfs_rtfind_forw.
Sadly, all four of these patches were written by this author and even
his own usage of this function and rt testing were inadequate to get
this fixed quickly.
Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap") Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks") Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units") Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Now that we have the ability to ask the log how far the tail needs to be
pushed to maintain its free space targets, augment the decision to relog
an intent item so that we only do it if the log has hit the 75% full
threshold. There's no point in relogging an intent into the same
checkpoint, and there's no need to relog if there's plenty of free space
in the log.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
There's a subtle design flaw in the deferred log item code that can lead
to pinning the log tail. Taking up the defer ops chain examples from
the previous commit, we can get trapped in sequences like this:
Caller hands us a transaction t0 with D0-D3 attached. The defer ops
chain will look like the following if the transaction rolls succeed:
In transaction 9, we finish d9 and try to roll to t10 while holding onto
an intent item for D3 that we logged in t0.
The previous commit changed the order in which we place new defer ops in
the defer ops processing chain to reduce the maximum chain length. Now
make xfs_defer_finish_noroll capable of relogging the entire chain
periodically so that we can always move the log tail forward. Most
chains will never get relogged, except for operations that generate very
long chains (large extents containing many blocks with different sharing
levels) or are on filesystems with small logs and a lot of ongoing
metadata updates.
Callers are now required to ensure that the transaction reservation is
large enough to handle logging done items and new intent items for the
maximum possible chain length. Most callers are careful to keep the
chain lengths low, so the overhead should be minimal.
The decision to relog an intent item is made based on whether the intent
was logged in a previous checkpoint, since there's no point in relogging
an intent into the same checkpoint.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The defer ops code has been finishing items in the wrong order -- if a
top level defer op creates items A and B, and finishing item A creates
more defer ops A1 and A2, we'll put the new items on the end of the
chain and process them in the order A B A1 A2. This is kind of weird,
since it's convenient for programmers to be able to think of A and B as
an ordered sequence where all the sub-tasks for A must finish before we
move on to B, e.g. A A1 A2 D.
Right now, our log intent items are not so complex that this matters,
but this will become important for the atomic extent swapping patchset.
In order to maintain correct reference counting of extents, we have to
unmap and remap extents in that order, and we want to complete that work
before moving on to the next range that the user wants to swap. This
patch fixes defer ops to satsify that requirement.
The primary symptom of the incorrect order was noticed in an early
performance analysis of the atomic extent swap code. An astonishingly
large number of deferred work items accumulated when userspace requested
an atomic update of two very fragmented files. The cause of this was
traced to the same ordering bug in the inner loop of
xfs_defer_finish_noroll.
If the ->finish_item method of a deferred operation queues new deferred
operations, those new deferred ops are appended to the tail of the
pending work list. To illustrate, say that a caller creates a
transaction t0 with four deferred operations D0-D3. The first thing
defer ops does is roll the transaction to t1, leaving us with:
t1: D0(t0), D1(t0), D2(t0), D3(t0)
Let's say that finishing each of D0-D3 will create two new deferred ops.
After finish D0 and roll, we'll have the following chain:
t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
d4 and d5 were logged to t1. Notice that while we're about to start
work on D1, we haven't actually completed all the work implied by D0
being finished. So far we've been careful (or lucky) to structure the
dfops callers such that D1 doesn't depend on d4 or d5 being finished,
but this is a potential logic bomb.
There's a second problem lurking. Let's see what happens as we finish
D1-D3:
When we try to roll to transaction #12, we're holding defer op d11,
which we logged way back in t4. This means that the tail of the log is
pinned at t4. If the log is very small or there are a lot of other
threads updating metadata, this means that we might have wrapped the log
and cannot get roll to t11 because there isn't enough space left before
we'd run into t4.
Let's shift back to the original failure. I mentioned before that I
discovered this flaw while developing the atomic file update code. In
that scenario, we have a defer op (D0) that finds a range of file blocks
to remap, creates a handful of new defer ops to do that, and then asks
to be continued with however much work remains.
So, D0 is the original swapext deferred op. The first thing defer ops
does is rolls to t1:
t1: D0(t0)
We try to finish D0, logging d1 and d2 in the process, but can't get all
the work done. We log a done item and a new intent item for the work
that D0 still has to do, and roll to t2:
t2: D0'(t1), d1(t1), d2(t1)
We roll and try to finish D0', but still can't get all the work done, so
we log a done item and a new intent item for it, requeue D0 a second
time, and roll to t3:
t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
If it takes 48 more rolls to complete D0, then we'll finally dispense
with D0 in t50:
t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
We then try to roll again to get a chain like this:
Notice that in rolling to transaction #51, we're holding on to a log
intent item for d1 that was logged in transaction #1. This means that
the tail of the log is pinned at t1. If the log is very small or there
are a lot of other threads updating metadata, this means that we might
have wrapped the log and cannot roll to t51 because there isn't enough
space left before we'd run into t1. This is of course problem #2 again.
But notice the third problem with this scenario: we have 102 defer ops
tied to this transaction! Each of these items are backed by pinned
kernel memory, which means that we risk OOM if the chains get too long.
Yikes. Problem #1 is a subtle logic bomb that could hit someone in the
future; problem #2 applies (rarely) to the current upstream, and problem
#3 applies to work under development.
This is not how incremental deferred operations were supposed to work.
The dfops design of logging in the same transaction an intent-done item
and a new intent item for the work remaining was to make it so that we
only have to juggle enough deferred work items to finish that one small
piece of work. Deferred log item recovery will find that first
unfinished work item and restart it, no matter how many other intent
items might follow it in the log. Therefore, it's ok to put the new
intents at the start of the dfops chain.
This actually sucks more for pinning the log tail (we try to roll to t10
while holding an intent item that was logged in t1) but we've solved
problem #1. We've also reduced the maximum chain length from:
sum(all the new items) + nr_original_items
to:
max(new items that each original item creates) + nr_original_items
This solves problem #3 by sharply reducing the number of defer ops that
can be attached to a transaction at any given time. The change makes
the problem of log tail pinning worse, but is improvement we need to
solve problem #2. Actually solving #2, however, is left to the next
patch.
Note that a subsequent analysis of some hard-to-trigger reflink and COW
livelocks on extremely fragmented filesystems (or systems running a lot
of IO threads) showed the same symptoms -- uncomfortably large numbers
of incore deferred work items and occasional stalls in the transaction
grant code while waiting for log reservations. I think this patch and
the next one will also solve these problems.
As originally written, the code used list_splice_tail_init instead of
list_splice_init, so change that, and leave a short comment explaining
our actions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
In xfs_bui_item_recover, there exists a use-after-free bug with regards
to the inode that is involved in the bmap replay operation. If the
mapping operation does not complete, we call xfs_bmap_unmap_extent to
create a deferred op to finish the unmapping work, and we retain a
pointer to the incore inode.
Unfortunately, the very next thing we do is commit the transaction and
drop the inode. If reclaim tears down the inode before we try to finish
the defer ops, we dereference garbage and blow up. Therefore, create a
way to join inodes to the defer ops freezer so that we can maintain the
xfs_inode reference until we're done with the inode.
Note: This imposes the requirement that there be enough memory to keep
every incore inode in memory throughout recovery.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the transaction reservation type
from the old transaction so that when we continue the dfops chain, we
still use the same reservation parameters.
Doing this means that the log item recovery functions get to determine
the transaction reservation instead of abusing tr_itruncate in yet
another part of xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the remaining block reservations so
that when we continue the dfops chain, we can reserve the same number of
blocks to use. We capture the reservations for both data and realtime
volumes.
This adds the requirement that every log intent item recovery function
must be careful to reserve enough blocks to handle both itself and all
defer ops that it can queue. On the other hand, this enables us to do
away with the handwaving block estimation nonsense that was going on in
xlog_finish_defer_ops.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items. As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items. Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.
For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items. After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.
This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.
Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent. Finally, refactor the loop
that replays those chains to do so using one transaction per chain.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Remove this one-line helper since the assert is trivially true in one
call site and the rest obscures a bitmask operation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
During code review, I noticed that the rmap code uses the (slower)
shared mappings rmap functions for any extent of a reflinked file, even
if those extents are for the attr fork, which doesn't support sharing.
We can speed up rmap a tiny bit by optimizing out this case.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Cleanup the typedef usage, the unnecessary parentheses, the unnecessary
backslash and use the open-coded round_up call in
xfs_attr_leaf_entsize_{remote,local}.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We already check whether the crc feature is enabled before calling
xfs_attr3_rmt_verify(), so remove the redundant feature check in that
function.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Fix the comments to help people understand the code.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
[darrick: fix the indenting problems too] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We have already defined the project ID type prid_t, so maybe should
use it here.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
During a code inspection, I found a serious bug in the log intent item
recovery code when an intent item cannot complete all the work and
decides to requeue itself to get that done. When this happens, the
item recovery creates a new incore deferred op representing the
remaining work and attaches it to the transaction that it allocated. At
the end of _item_recover, it moves the entire chain of deferred ops to
the dummy parent_tp that xlog_recover_process_intents passed to it, but
fail to log a new intent item for the remaining work before committing
the transaction for the single unit of work.
xlog_finish_defer_ops logs those new intent items once recovery has
finished dealing with the intent items that it recovered, but this isn't
sufficient. If the log is forced to disk after a recovered log item
decides to requeue itself and the system goes down before we call
xlog_finish_defer_ops, the second log recovery will never see the new
intent item and therefore has no idea that there was more work to do.
It will finish recovery leaving the filesystem in a corrupted state.
The same logic applies to /any/ deferred ops added during intent item
recovery, not just the one handling the remaining work.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent
to be unmapped from the given file fork without the extent being freed.
We do this for non-rt files, but we forgot to do this for realtime
files. So far this isn't a big deal since nobody makes a bunmapi call
to a rt file with the REMAP flag set, but don't leave a logic bomb.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_attr_sf_totsize() requires access to xfs_inode structure, so, once
xfs_attr_shortform_addname() is its only user, move it to xfs_attr.c
instead of playing with more #includes.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
nameval is a variable-size array, so, define it as it, and remove all
the -1 magic number subtractions
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Enable the bigtime feature for quota timers. We decrease the accuracy
of the timers to ~4s in exchange for being able to set timers up to the
bigtime maximum.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Redesign the ondisk inode timestamps to be a simple unsigned 64-bit
counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the
32-bit unix time epoch). This enables us to handle dates up to 2486,
which solves the y2038 problem.
sandeen: update xfs_flags2diflags2() as well, to match
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Redefine xfs_ictimestamp_t as a uint64_t typedef in preparation for the
bigtime functionality. Preserve the legacy structure format so that we
can let the compiler take care of the masking and shifting.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Redefine xfs_timestamp_t as a __be64 typedef in preparation for the
bigtime functionality. Preserve the legacy structure format so that we
can let the compiler take care of masking and shifting.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Move this function to xfs_inode_item_recover.c since there's only one
caller of it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Refactor quota timestamp encoding and decoding into helper functions so
that we can add extra behavior in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Refactor the code that sets the default quota grace period into a helper
function so that we can override the ondisk behavior later.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Define explicit limits on the range of quota grace period expiration
timeouts and refactor the code that modifies the timeouts into helpers
that clamp the values appropriately. Note that we'll refactor the
default grace period timer separately.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Formally define the inode timestamp ranges that existing filesystems
support, and switch the vfs timetamp ranges to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Add the necessary bits to the online repair code to support logging the
inode btree counters when rebuilding the btrees, and to support fixing
the counters when rebuilding the AGI.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Now that we have reliable finobt block counts, use them to speed up the
per-AG block reservation calculations at mount time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Add a btree block usage counters for both inode btrees to the AGI header
so that we don't have to walk the entire finobt at mount time to create
the per-AG reservations.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Remove the mp argument as this function is only called in transaction
context, and open code xfs_getsb given that the function already accesses
the buffer pointer in the mount point directly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The log recovery I/O completion handler does not substancially differ from
the normal one except for the fact that it:
a) never retries failed writes
b) can have log items that aren't on the AIL
c) never has inode/dquot log items attached and thus don't need to
handle them
Add conditionals for (a) and (b) to the ioend code, while (c) doesn't
need special handling anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Move the buffer retry state machine logic to xfs_buf.c and call it once
from xfs_ioend instead of duplicating it three times for the three kinds
of buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Remove kmem_realloc() function and convert its users to use MM API
directly (krealloc())
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 10 Nov 2020 19:52:31 +0000 (14:52 -0500)]
libxfs: refactor NSEC_PER_SEC
Clean up all the open-coded and duplicate definitions of time unit
conversion factors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 10 Nov 2020 17:05:32 +0000 (12:05 -0500)]
libxfs: create a real struct timespec64
Create a real struct timespec64 that supports 64-bit seconds counts.
The C library struct timespec doesn't support this on 32-bit
architectures and we cannot lose the upper bits in the incore inode.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Gao Xiang [Mon, 12 Oct 2020 19:40:02 +0000 (15:40 -0400)]
xfsprogs: allow i18n to xfs printk
In preparation to a common stripe validation helper,
allow i18n to xfs_{notice,warn,err,alert} so that
xfsprogs can share code with kernel.
Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Ian Kent [Mon, 12 Oct 2020 19:40:01 +0000 (15:40 -0400)]
xfsprogs: ignore autofs mount table entries
Some of the xfsprogs utilities read the mount table via. getmntent(3).
The mount table may contain (almost always these days since /etc/mtab is
symlinked to /proc/self/mounts) autofs mount entries. During processing
of the mount table entries statfs(2) can be called on mount point paths
which will trigger an automount if those entries are direct or offset
autofs mount triggers (indirect autofs mounts aren't affected).
This can be a problem when there are a lot of autofs direct or offset
mounts because real mounts will be triggered when statfs(2) is called.
This can be particularly bad if the triggered mounts are NFS mounts and
the server is unavailable leading to lengthy boot times or worse.
Simply ignoring autofs mount entries during getmentent(3) traversals
avoids the statfs() call that triggers these mounts. If there are
automounted mounts (real mounts) at the time of reading the mount table
these will still be seen in the list so they will be included if that
actually matters to the reader.
Recent glibc getmntent(3) can ignore autofs mounts but that requires the
autofs user to configure autofs to use the "ignore" pseudo mount option
for autofs mounts. But this isn't yet the autofs default (to prevent
unexpected side effects) so that can't be used.
The autofs direct and offset automount triggers are pseudo file system
mounts and are more or less useless in terms on file system information
so excluding them doesn't sacrifice useful file system information
either.
Consequently excluding autofs mounts shouldn't have any adverse side
effects.
Changes since v1:
- drop hunk from fsr/xfs_fsr.c.
Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Mon, 12 Oct 2020 19:40:01 +0000 (15:40 -0400)]
xfsprogs: fix ioctl_xfs_geometry manpage naming
Somehow "fsop_/FSOP_" snuck into this manpage's name, filename, and
ioctl name. It's not XFS_IOC_FSOP_GEOMETRY, it's XFS_IOC_FSGEOMETRY
so change all references, including the man page name, filename, and
references from xfsctl(3).
(the structure and flags do have the fsop_ string, which certainly
makes this a bit confusing)
Fixes: b427c816847e ("man: create a separate GEOMETRY ioctl manpage") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 12 Oct 2020 19:40:01 +0000 (15:40 -0400)]
xfs_repair: coordinate parallel updates to the rt bitmap
Actually take the rt lock before updating the bitmap from multiple
threads. This fixes an infrequent corruption problem when running
generic/013 and rtinherit=1 is set on the root dir.
Fixes: 2556c98bd9e6 ("Perform true sequential bulk read prefetching in xfs_repair Merge of master-melb:xfs-cmds:29147a by kenmcd.") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 12 Oct 2020 19:39:52 +0000 (15:39 -0400)]
xfs_scrub: don't use statvfs to collect filesystem summary counts
The function scrub_scan_estimate_blocks naïvely uses the statvfs counts
to estimate the size and free blocks on the data volume. Unfortunately,
it fails to account for the fact that statvfs can return the size and
free counts for the realtime volume if the root directory has the
rtinherit flag set, which leads to phase 7 reporting totally absurd
quantities.
Eric pointed out a further problem with statvfs, which is that the file
counts are clamped to the current user's project quota inode limits.
Therefore, we must not use statvfs for querying the filesystem summary
counts.
The XFS_IOC_FSCOUNTS ioctl returns all the data we need, so use that
instead.
Fixes: 604dd3345f35 ("xfs_scrub: filesystem counter collection functions") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 12 Oct 2020 15:59:19 +0000 (11:59 -0400)]
libhandle: fix potential unterminated string problem
gcc 10.2 complains about the strncpy call here, since it's possible that
the source string is so long that the fspath inside the fdhash structure
will end up without a null terminator. Work around strncpy braindamage
yet again by forcing the string to be terminated properly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 12 Oct 2020 15:59:19 +0000 (11:59 -0400)]
libfrog: fix a potential null pointer dereference
Apparently, gcc 10.2 thinks that it's possible for either of the calloc
arguments to be zero here, in which case it will return NULL with a zero
errno. I suppose it's possible to do that via integer overflow in the
macro, though I find it unlikely unless someone passes in a yuuuge value.
Nevertheless, just shut up the warning by hardcoding the error number
so I can move on to nastier bugs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 16:45:13 +0000 (12:45 -0400)]
libxfs: disallow filesystems with reverse mapping and reflink and realtime
Neither the kernel nor the code in xfsprogs support filesystems that
have (either reverse mapping btrees or reflink) enabled and a realtime
volume configured. The kernel rejects such combinations and mkfs
refuses to format such a config, but xfsprogs doesn't check and can do
Bad Things, so port those checks before someone shreds their filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Wed, 30 Sep 2020 15:10:09 +0000 (11:10 -0400)]
mkfs.xfs: remove comment about needed future work
Remove comment about the need to sync this function with the
kernel; that was mostly taken care of with:
7b7548052 ("mkfs: use libxfs to write out new AGs")
There's maybe a little more samey-samey that we could do here,
but it's not egregiously cut & pasted as it was before.
Fixes: 7b7548052d12 ("mkfs: use libxfs to write out new AGs") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
xfs_repair: don't flag RTINHERIT files when no rt volume
Don't flag directories with the RTINHERIT flag set when the filesystem
doesn't have a realtime volume configured. The kernel has let us set
RTINHERIT without a rt volume for ages, so it's not an invalid state.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
mkfs: don't allow creation of realtime files from a proto file
If someone runs mkfs with rtinherit=1, a realtime volume configured, and
a protofile that creates a regular file in the filesystem, mkfs will
error out with "Function not implemented" because userspace doesn't know
how to allocate extents from the rt bitmap. Catch this specific case
and hand back a somewhat nicer explanation of what happened.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
libxfs: don't propagate RTINHERIT -> REALTIME when there is no rtdev
When creating a file inside a directory that has RTINHERIT set, only
propagate the REALTIME flag to the file if the filesystem actually has a
realtime volume configured. Otherwise, we end up writing inodes that
trip the verifiers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
mkfs: set required parts of the realtime geometry before computing log geometry
The minimum log size depends on the transaction reservation sizes, which
in turn depend on the realtime device geometry. Therefore, we need to
set up some of the rt geometry before we can compute the real minimum
log size.
This fixes a problem where mkfs, given a small data device and a
realtime volume, formats a filesystem with a log that is too small to
pass the mount time log size checks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
mkfs: fix reflink/rmap logic w.r.t. realtime devices and crc=0 support
mkfs has some logic to deal with situations where reflink or rmapbt are
turned on and the administrator has configured a realtime device or a V4
filesystem; such configurations are not allowed.
The logic ought to disable reflink and/or rmapbt if they're enabled due
to being the defaults, and it ought to complain and abort if they're
enabled because the admin explicitly turned them on.
Unfortunately, the logic here doesn't do that and makes no sense at all
since usage() exits the program. Fix it to follow what everything else
does.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
mkfs.xfs: tweak wording of external log device size complaint
If the external log device is too small to satisfy minimum requirements,
mkfs will complain about "external log device 512 too small...". That
doesn't make any sense, so add a few missing words to clarify what we're
talking about:
"external log device size 512 blocks too small..."
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:15 +0000 (10:59 -0400)]
man: install all manpages that redirect to another manpage
Some of the ioctl manpages do not contain any information other than a
pointer to a different manpage. These aren't picked up by the install
scripts, so fix them so that they do.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 30 Sep 2020 14:59:05 +0000 (10:59 -0400)]
xfs_repair: throw away totally bad clusters
If the filesystem supports sparse inodes, we detect that an entire
cluster buffer has no detectable inodes at all, and we can easily mark
that part of the inode chunk sparse, just drop the cluster buffer and
forget about it. This makes repair less likely to go to great lengths
to try to save something that's totally unsalvageable.
This manifested in recs[2].free=zeroes in xfs/364, wherein the root
directory claimed to own block X and the inobt also claimed that X was
inodes; repair tried to create rmaps for both owners, and then the whole
mess blew up because the rmap code aborts on those kinds of anomalies.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 28 Sep 2020 21:35:37 +0000 (17:35 -0400)]
xfs_repair: fix handling of data blocks colliding with existing metadata
Prior to commit a406779bc8d8, any blocks in a data fork extent that
collided with existing blocks would cause the entire data fork extent to
be rejected. Unfortunately, the patch to add data block sharing support
suppressed checking for any collision, including metadata. What we
really wanted to do here during a check_dups==1 scan is to is check for
specific collisions and without updating the block mapping data.
So, move the check_dups test after the for-switch construction. This
re-enables detecting collisions between data fork blocks and a
previously scanned chunk of metadata, and improves the specificity of
the error message that results.
This was found by fuzzing recs[2].free=zeroes in xfs/364, though this
patch alone does not solve all the problems that scenario presents.
Fixes: a406779bc8d8 ("xfs_repair: handle multiple owners of data blocks") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 28 Sep 2020 21:35:37 +0000 (17:35 -0400)]
xfs_repair: junk corrupt xattr root blocks
If reading the root block of an extended attribute structure fails due
to a corruption error, we should junk the block since we know it's bad.
There's no point in moving on to the (rather insufficient) checks in the
attr code.
Found by fuzzing hdr.freemap[1].base = ones in xfs/400.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 28 Sep 2020 21:35:37 +0000 (17:35 -0400)]
xfs_repair: fix error in process_sf_dir2_fixi8
The goal of process_sf_dir2_fixi8 is to convert an i8 shortform
directory into a (shorter) i4 shortform directory. It achieves this by
duplicating the old sf directory contents (as oldsfp), zeroing i8count
in the caller's directory buffer (i.e. newsfp/sfp), and reinitializing
the new directory with the old directory's entries.
Unfortunately, it copies the parent pointer from sfp (the buffer we've
already started changing), not oldsfp. This leads to directory
corruption since at that point we zeroed i8count, which means that we
save only the upper four bytes from the parent pointer entry.
This was found by fuzzing u3.sfdir3.hdr.i8count = ones in xfs/384.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Mon, 28 Sep 2020 21:35:37 +0000 (17:35 -0400)]
xfs_repair: don't crash on partially sparse inode clusters
While running xfs/364 to fuzz the middle bit of recs[2].holemask, I
observed a crash in xfs_repair stemming from the fact that each sparse
bit accounts for 4 inodes, but inode cluster buffers can map to more
than four inodes.
When the first inode in an inode cluster is marked sparse,
process_inode_chunk won't try to load the inode cluster buffer.
Unfortunately, if the holemask indicates that there are inodes present
anywhere in the rest of the cluster buffer, repair will try to check the
corresponding cluster buffer, even if we didn't load it. This leads to
a null pointer dereference, which crashes repair.
Avoid the null pointer dereference by marking the inode sparse and
moving on to the next inode.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This is because align_ag_geometry() finds that the size of the last
(only) AG is too small, and attempts to trim it off. Obviously 0
AGs is invalid, and we hit the ASSERT.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The realtime flag only applies to the data fork, so don't use the
realtime block number checks on the attr fork of a realtime file.
Fixes: 30b0984d9117 ("xfs: refactor bmap record validation") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Don't leak kernel memory contents into the shortform attr fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The boundary test for the fixed-offset parts of xfs_attr_sf_entry in
xfs_attr_shortform_verify is off by one, because the variable array
at the end is defined as nameval[1] not nameval[].
Hence we need to subtract 1 from the calculation.
This can be shown by:
# touch file
# setfattr -n root.a file
and verifications will fail when it's written to disk.
This only matters for a last attribute which has a single-byte name
and no value, otherwise the combination of namelen & valuelen will
push endp further out and this test won't fail.
Fixes: 1e1bbd8e7ee06 ("xfs: create structure verifier function for shortform xattrs") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The inode chunk allocation transaction reserves inobt_maxlevels-1
blocks to accommodate a full split of the inode btree. A full split
requires an allocation for every existing level and a new root
block, which means inobt_maxlevels is the worst case block
requirement for a transaction that inserts to the inobt. This can
lead to a transaction block reservation overrun when tmpfile
creation allocates an inode chunk and expands the inobt to its
maximum depth. This problem has been observed in conjunction with
overlayfs, which makes frequent use of tmpfiles internally.
The existing reservation code goes back as far as the Linux git repo
history (v2.6.12). It was likely never observed as a problem because
the traditional file/directory creation transactions also include
worst case block reservation for directory modifications, which most
likely is able to make up for a single block deficiency in the inode
allocation portion of the calculation. tmpfile support is relatively
more recent (v3.15), less heavily used, and only includes the inode
allocation block reservation as tmpfiles aren't linked into the
directory tree on creation.
Fix up the inode alloc block reservation macro and a couple of the
block allocator minleft parameters that enforce an allocation to
leave enough free blocks in the AG for a full inobt split.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Delete repeated words in fs/xfs/.
{we, that, the, a, to, fork}
Change "it it" to "it is" in one location.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Lift -ENOSPC handler from xfs_attr_leaf_addname. This will help to
reorganize transitions between the attr forms later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Invert the rename logic in xfs_attr_node_addname to simplify the
delayed attr logic later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Invert the rename logic in xfs_attr_leaf_addname to simplify the
delayed attr logic later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch adds another new helper function
xfs_attr_node_removename_rmt. This will also help modularize
xfs_attr_node_removename when we add delay ready attributes later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch adds a new helper function xfs_attr_node_removename_setup.
This will help modularize xfs_attr_node_removename when we add delay
ready attributes later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
[darrick: fix unused variable complaints by 0day robot] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch adds two new helper functions xfs_attr_store_rmt_blk and
xfs_attr_restore_rmt_blk. These two helpers assist to remove redundant
code associated with storing and retrieving remote blocks during the
attr set operations.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch helps to simplify xfs_attr_node_removename by modularizing
the code around the transactions into helper functions. This will make
the function easier to follow when we introduce delayed attributes.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
In this patch, we hoist code from xfs_attr_set_args into two new helpers
xfs_attr_is_shortform and xfs_attr_set_shortform. These two will help
to simplify xfs_attr_set_args when we get into delayed attrs later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
A transaction roll is not necessary immediately after setting the
INCOMPLETE flag when removing a node xattr entry with remote value
blocks. The remote block invalidation that immediately follows setting
the flag is an in-core only change. The next step after that is to start
unmapping the remote blocks from the attr fork, but the xattr remove
transaction reservation includes reservation for full tree splits of the
dabtree and bmap tree. The remote block unmap code will roll the
transaction as extents are unmapped and freed.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Some calls to xfs_trans_roll_inode and xfs_defer_finish routines are not
needed. If they are the last operations executed in these functions, and
no further changes are made, then higher level routines will roll or
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch adds a new helper function xfs_attr_node_shrink used to
shrink an attr name into an inode if it is small enough. This helps to
modularize the greater calling function xfs_attr_node_removename.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch pulls xfs_attr_rmtval_invalidate out of
xfs_attr_rmtval_remove and into the calling functions. Eventually
__xfs_attr_rmtval_remove will replace xfs_attr_rmtval_remove when we
introduce delayed attributes. These functions are exepcted to return
-EAGAIN when they need a new transaction. Because the invalidate does
not need a new transaction, we need to separate it from the rest of the
function that does. This will enable __xfs_attr_rmtval_remove to
smoothly replace xfs_attr_rmtval_remove later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Refactor xfs_attr_rmtval_remove to add helper function
__xfs_attr_rmtval_remove. We will use this later when we introduce
delayed attributes. This function will eventually replace
xfs_attr_rmtval_remove
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
New delayed allocation routines cannot be handling transactions so
pull them out into the calling functions
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Because new delayed attribute routines cannot roll transactions, we
carve off the parts of xfs_attr_rmtval_remove that we can use. This
will help to reduce repetitive code later when we introduce delayed
attributes.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
New delayed allocation routines cannot be handling transactions so
pull them up into the calling functions
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
To help pre-simplify xfs_attr_set_args, we need to hoist transaction
handling up, while modularizing the adjacent code down into helpers. In
this patch, hoist the commit in xfs_attr_try_sf_addname up into the
calling function, and also pull the attr list creation down.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Split out new helper function xfs_attr_leaf_try_add from
xfs_attr_leaf_addname. Because new delayed attribute routines cannot
roll transactions, we split off the parts of xfs_attr_leaf_addname that
we can use, and move the commit into the calling function.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Since delayed operations cannot roll transactions, pull up the
transaction handling into the calling function
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Break xfs_attr_rmtval_set into two helper functions
xfs_attr_rmt_find_hole and xfs_attr_rmtval_set_value.
xfs_attr_rmtval_set rolls the transaction between the helpers, but
delayed operations cannot. We will use the helpers later when
constructing new delayed attribute routines.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Delayed operations cannot return error codes. So we must check for
these conditions first before starting set or remove operations
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This patch adds a new functions to check for the existence of an
attribute. Subroutines are also added to handle the cases of leaf
blocks, nodes or shortform. Common code that appears in existing attr
add and remove functions have been factored out to help reduce the
appearance of duplicated code. We will need these routines later for
delayed attributes since delayed operations cannot return error codes.
Signed-off-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a leak-on-error bug reported by Dan Carpenter]
[darrick: fix unused variable warning reported by 0day] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reported-by: dan.carpenter@oracle.com Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Every call to xfs_da_state_alloc() also requires setting up state->args
and state->mp
Change xfs_da_state_alloc() to receive an xfs_da_args_t as argument and
return a xfs_da_state_t with both args and mp already set.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: reduce struct typedef usage] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>