git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfs: Merge xfs_delattr_context into xfs_attr_item

Source kernel commit: d68c51e9a4095b57f06bf5dd15ab8fae6dab5d8b

This is a clean up patch that merges xfs_delattr_context into
xfs_attr_item. Now that the refactoring is complete and the delayed
operation infrastructure is in place, we can combine these to eliminate
the extra struct

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add larp debug option

Source kernel commit: 535e2f75c4e377e6ccc9d4396695b516d118f8f0

This patch adds a debug option to enable log attribute replay. Eventually
this can be removed when delayed attrs becomes permanent.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add log attribute error tag

Source kernel commit: abd61ca3c333506ffa4ee73b78659ab57e7efcf7

This patch adds an error tag that we can use to test log attribute
recovery and replay

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Remove unused xfs_attr_*_args

Source kernel commit: 73159fc27c6944ebe55e6652d6a1981d7cb3eb4a

Remove xfs_attr_set_args, xfs_attr_remove_args, and xfs_attr_trans_roll.
These high level loops are now driven by the delayed operations code,
and can be removed.

Additionally collapse in the leaf_bp parameter of xfs_attr_set_iter
since we only have one caller that passes dac->leaf_bp

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred

Source kernel commit: f3f36c893f260275eb9229cdc3dabb4c79650591

These routines set up and queue a new deferred attribute operations.
These functions are meant to be called by any routine needing to
initiate a deferred attribute operation as opposed to the existing
inline operations. New helper function xfs_attr_item_init also added.

Finally enable delayed attributes in xfs_attr_set and xfs_attr_remove.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Skip flip flags for delayed attrs

Source kernel commit: f38dc503d366b589d98d5676a5b279d10b47bcb9

This is a clean up patch that skips the flip flag logic for delayed attr
renames. Since the log replay keeps the inode locked, we do not need to
worry about race windows with attr lookups. So we can skip over
flipping the flag and the extra transaction roll for it

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Implement attr logging and replay

Source kernel commit: 1d08e11d04d293cb7006d1c8641be1fdd8a8e397

This patch adds the needed routines to create, log and recover logged
extended attribute intents.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Set up infrastructure for log attribute replay

Source kernel commit: fd920008784ead369e79c2be2f8d9cc736e306ca

Currently attributes are modified directly across one or more
transactions. But they are not logged or replayed in the event of an
error. The goal of log attr replay is to enable logging and replaying
of attribute operations using the existing delayed operations
infrastructure.  This will later enable the attributes to become part of
larger multi part operations that also must first be recorded to the
log.  This is mostly of interest in the scheme of parent pointers which
would need to maintain an attribute containing parent inode information
any time an inode is moved, created, or removed.  Parent pointers would
then be of interest to any feature that would need to quickly derive an
inode path from the mount point. Online scrub, nfs lookups and fs grow
or shrink operations are all features that could take advantage of this.

This patch adds two new log item types for setting or removing
attributes as deferred operations.  The xfs_attri_log_item will log an
intent to set or remove an attribute.  The corresponding
xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
freed once the transaction is done.  Both log items use a generic
xfs_attr_log_format structure that contains the attribute name, value,
flags, inode, and an op_flag that indicates if the operations is a set
or remove.

[dchinner: added extra little bits needed for intent whiteouts]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Return from xfs_attr_set_iter if there are no more rmtblks to process

Source kernel commit: 9a39cdabc172ef2de3f21a34e73cdc1d02338d79

During an attr rename operation, blocks are saved for later removal
as rmtblkno2. The rmtblkno is used in the case of needing to alloc
more blocks if not enough were available.  However, in the case
that no further blocks need to be added or removed, we can return as soon
as xfs_attr_node_addname completes, rather than rolling the transaction
with an -EAGAIN return.  This extra loop does not hurt anything right
now, but it will be a problem later when we get into log items because
we end up with an empty log transaction.  So, add a simple check to
cut out the unneeded iteration.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Fix double unlock in defer capture code

Source kernel commit: 7b3ec2b20e44f579c022ad62243aa18c04c6addc

The new deferred attr patch set uncovered a double unlock in the
recent port of the defer ops capture and continue code.  During log
recovery, we're allowed to hold buffers to a transaction that's being
used to replay an intent item.  When we capture the resources as part
of scheduling a continuation of an intent chain, we call xfs_buf_hold
to retain our reference to the buffer beyond the transaction commit,
but we do /not/ call xfs_trans_bhold to maintain the buffer lock.
This means that xfs_defer_ops_continue needs to relock the buffers
before xfs_defer_restore_resources joins then tothe new transaction.

Additionally, the buffers should not be passed back via the dres
structure since they need to remain locked unlike the inodes.  So
simply set dr_bufs to zero after populating the dres structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: validate v5 feature fields

Source kernel commit: f0f5f658065a5af09126ec892e4c383540a1c77f

We don't check that the v4 feature flags taht v5 requires to be set
are actually set anywhere. Do this check when we see that the
filesystem is a v5 filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: set XFS_FEAT_NLINK correctly

Source kernel commit: dd0d2f9755191690541b09e6385d0f8cd8bc9d8f

While xfs_has_nlink() is not used in kernel, it is used in userspace
(e.g. by xfs_db) so we need to set the XFS_FEAT_NLINK flag correctly
in xfs_sb_version_to_features().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: validate inode fork size against fork format

Source kernel commit: 1eb70f54c445fcbb25817841e774adb3d912f3e8

xfs_repair catches fork size/format mismatches, but the in-kernel
verifier doesn't, leading to null pointer failures when attempting
to perform operations on the fork. This can occur in the
xfs_dir_is_empty() where the in-memory fork format does not match
the size and so the fork data pointer is accessed incorrectly.

Note: this causes new failures in xfs/348 which is testing mode vs
ftype mismatches. We now detect a regular file that has been changed
to a directory or symlink mode as being corrupt because the data
fork is for a symlink or directory should be in local form when
there are only 3 bytes of data in the data fork. Hence the inode
verify for the regular file now fires w/ -EFSCORRUPTED because
the inode fork format does not match the format the corrupted mode
says it should be in.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: detect self referencing btree sibling pointers

Source kernel commit: dc04db2aa7c9307e740d6d0e173085301c173b1a

To catch the obvious graph cycle problem and hence potential endless
looping.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: tag transactions that contain intent done items

Source kernel commit: bb7b1c9c5dd3d24db3f296e365570fd50c8ca80c

Intent whiteouts will require extra work to be done during
transaction commit if the transaction contains an intent done item.

To determine if a transaction contains an intent done item, we want
to avoid having to walk all the items in the transaction to check if
they are intent done items. Hence when we add an intent done item to
a transaction, tag the transaction to indicate that it contains such
an item.

We don't tag the transaction when the defer ops is relogging an
intent to move it forward in the log. Whiteouts will never apply to
these cases, so we don't need to bother looking for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: don't commit the first deferred transaction without intents

Source kernel commit: 5ddd658ea878f8dbae5ec33dba6cfdabb5056916

If the first operation in a string of defer ops has no intents,
then there is no reason to commit it before running the first call
to xfs_defer_finish_one(). This allows the defer ops to be used
effectively for non-intent based operations without requiring an
unnecessary extra transaction commit when first called.

This fixes a regression in per-attribute modification transaction
count when delayed attributes are not being used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: hide log iovec alignment constraints

Source kernel commit: b2c28035cea290edbcec697504e5b7a4b1e023e7

Callers currently have to round out the size of buffers to match the
aligment constraints of log iovecs and xlog_write(). They should not
need to know this detail, so introduce a new function to calculate
the iovec length (for use in ->iop_size implementations). Also
modify xlog_finish_iovec() to round up the length to the correct
alignment so the callers don't need to do this, either.

Convert the only user - inode forks - of this alignment rounding to
use the new interface.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: zero inode fork buffer at allocation

Source kernel commit: cb512c921639613ce03f87e62c5e93ed9fe8c84d

When we first allocate or resize an inline inode fork, we round up
the allocation to 4 byte alingment to make journal alignment
constraints. We don't clear the unused bytes, so we can copy up to
three uninitialised bytes into the journal. Zero those bytes so we
only ever copy zeros into the journal.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename xfs_*alloc*_log_count to _block_count

Source kernel commit: 6ed7e509d2304519f4f6741670f512a55e9e80fe

These functions return the maximum number of blocks that could be logged
in a particular transaction. "log count" is confusing since there's a
separate concept of a log (operation) count in the reservation code, so
let's change it to "block count" to be less confusing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: reduce transaction reservations with reflink

Source kernel commit: b037c4eed2df4568a7702cd512d26625962f95b9

Before to the introduction of deferred refcount operations, reflink
would try to cram refcount btree updates into the same transaction as an
allocation or a free event. Mainline XFS has never actually done that,
but we never refactored the transaction reservations to reflect that we
now do all refcount updates in separate transactions. Fix this to
reduce the transaction reservation size even farther, so that between
this patch and the previous one, we reduce the tr_write and tr_itruncate
sizes by 66%.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: reduce the absurdly large log operation count

Source kernel commit: 4ecf9e7c69edcb8f5b98df471dd026419b881d2b

Back in the early days of reflink and rmap development I set the
transaction reservation sizes to be overly generous for rmap+reflink
filesystems, and a little under-generous for rmap-only filesystems.

Since we don't need *eight* transaction rolls to handle three new log
intent items, decrease the logcounts to what we actually need, and amend
the shadow reservation computation function to reflect what we used to
do so that the minimum log size doesn't change.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: report "max_resp" used for min log size computation

Source kernel commit: 918247ce541995dba05391cf14d6061cf0844866

Move the tracepoint that computes the size of the transaction used to
compute the minimum log size into xfs_log_get_max_trans_res so that we
only have to compute this stuff once.

Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db
can call it to report the results of the userspace computation of the
same value to diagnose mkfs/kernel misinteractions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: create shadow transaction reservations for computing minimum log size

Source kernel commit: 52d8ea4f2406c14d632a0e7f816bbb18d8c3e9ed

Every time someone changes the transaction reservation sizes, they
introduce potential compatibility problems if the changes affect the
minimum log size that we validate at mount time. If the minimum log
size gets larger (which should be avoided because doing so presents a
serious risk of log livelock), filesystems created with old mkfs will
not mount on a newer kernel; if the minimum size shrinks, filesystems
created with newer mkfs will not mount on older kernels.

Therefore, enable the creation of a shadow log reservation structure
where we can "undo" the effects of tweaks when computing minimum log
sizes. These shadow reservations should never be used in practice, but
they insulate us from perturbations in minimum log size.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: stop artificially limiting the length of bunmap calls

Source kernel commit: 4ed6435cc369cce722966983f6e07b872562276f

In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
data forks of shared files to avoid two failure scenarios: one where the
extent being unmapped is so sparsely shared that we exceed the
transaction reservation with the sheer number of refcount btree updates
and EFI intent items; and the other where we attach so many deferred
updates to the transaction that we pin the log tail and later the log
head meets the tail, causing the log to livelock.

We avoid triggering the first problem by tracking the number of ops in
the refcount btree cursor and forcing a requeue of the refcount intent
item any time we think that we might be close to overflowing. This has
been baked into XFS since before the original e1a4 patch.

A recent patchset fixed the second problem by changing the deferred ops
code to finish all the work items created by each round of trying to
complete a refcount intent item, which eliminates the long chains of
deferred items (27dad); and causing long-running transactions to relog
their intent log items when space in the log gets low (74f4d).

Because this clamp affects /any/ unmapping request regardless of the
sharing factors of the component blocks, it degrades the performance of
all large unmapping requests -- whereas with an unshared file we can
unmap millions of blocks in one go, shared files are limited to
unmapping a few thousand blocks at a time, which causes the upper level
code to spin in a bunmapi loop even if it wasn't needed.

This also eliminates one more place where log recovery behavior can
differ from online behavior, because bunmapi operations no longer need
to requeue. The fstest generic/447 was created to test the old fix, and
it still passes with this applied.

Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: count EFIs when deciding to ask for a continuation of a refcount update

Source kernel commit: c47260d4ea2ac11ce607d6ac1e0ca5528f42f482

A long time ago, I added to XFS the ability to use deferred reference
count operations as part of a transaction chain.  This enabled us to
avoid blowing out the transaction reservation when the blocks in a
physical extent all had different reference counts because we could ask
the deferred operation manager for a continuation, which would get us a
clean transaction.

The refcount code asks for a continuation when the number of refcount
record updates reaches the point where we think that the transaction has
logged enough full btree blocks due to refcount (and free space) btree
shape changes and refcount record updates that we're in danger of
overflowing the transaction.

We did not previously count the EFIs logged to the refcount update
transaction because the clamps on the length of a bunmap operation were
sufficient to avoid overflowing the transaction reservation even in the
worst case situation where every other block of the unmapped extent is
shared.

Unfortunately, the restrictions on bunmap length avoid failure in the
worst case by imposing a maximum unmap length of ~3000 blocks, even for
non-pathological cases.  This seriously limits performance when freeing
large extents.

Therefore, track EFIs with the same counter as refcount record updates,
and use that information as input into when we should ask for a
continuation.  This enables the next patch to drop the clumsy bunmap
limitation.

Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: speed up write operations by using non-overlapped lookups when possible

Source kernel commit: 1edf8056131aca6fe7f98873da8297e6fa279d8c

Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations. This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing. As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.

However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first when we're trying to find the left
neighbor rmap record for a given file mapping, which makes unwritten
extent conversion and remap operations run faster if data block sharing
is minimal in this part of the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: speed up rmap lookups by using non-overlapped lookups when possible

Source kernel commit: 75d893d19c8e1b4bf4a9acd613fe5e7a80b58974

Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations. This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing. As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.

However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first, which will make scrub run much
faster.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: simplify xfs_rmap_lookup_le call sites

Source kernel commit: 5b7ca8b313621907d80460bfcc1fa876d2a38488

Most callers of xfs_rmap_lookup_le will retrieve the btree record
immediately if the lookup succeeds. The overlapped version of this
function (xfs_rmap_lookup_le_range) will return the record if the lookup
succeeds, so make the regular version do it too. Get rid of the useless
len argument, since it's not part of the lookup key.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert quota options flags to unsigned.

Source kernel commit: b9f3082eee5a77d5000742859532ba4ff584354f

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert dquot flags to unsigned.

Source kernel commit: 1005dd019c88f556f85cb3632df4d2c702ae95cd

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert da btree operations flags to unsigned.

Source kernel commit: 3402d931575f1fb0c6863eaad6595f55e6389eda

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert btree buffer log flags to unsigned.

Source kernel commit: 722db70fb2f03ef9ff21cd5194e9f592701e1be6

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

We also pass the fields to log to xfs_btree_offsets() as a uint32_t
all cases now. I have no idea why we made that parameter a int64_t
in the first place, but while we are fixing this up change it to
a uint32_t field, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert AGI log flags to unsigned.

Source kernel commit: 0d1b97696696871dc42dfc59d527a0b68b1a1209

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert AGF log flags to unsigned.

Source kernel commit: f53dde11b405e7c655997513822c90ac9761efdb

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert bmapi flags to unsigned.

Source kernel commit: e7d410ac336856cdae934e14b9c2c749ca5a32ea

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert bmap extent type flags to unsigned.

Source kernel commit: 0e5b8e45229bc2680f4b10505da338f1ca15a6d2

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert scrub type flags to unsigned.

Source kernel commit: 79539c7c761ac6d00abd50d5f1e4390f5dc9af18

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

This touches xfs_fs.h so affects the user API, but the user API
fields are also unsigned so the flags should really be unsigned,
too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert attr type flags to unsigned.

Source kernel commit: a4d98629c93fdb312641dfc336a9bda56358ef72

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: log tickets don't need log client id

Source kernel commit: c7610dceed39d978ef1ee0f2ab5a3c8d2d54d120

We currently set the log ticket client ID when we reserve a
transaction. This client ID is only ever written to the log by
a CIL checkpoint or unmount records, and so anything using a high
level transaction allocated through xfs_trans_alloc() does not need
a log ticket client ID to be set.

For the CIL checkpoint, the client ID written to the journal is
always XFS_TRANSACTION, and for the unmount record it is always
XFS_LOG, and nothing else writes to the log. All of these operations
tell xlog_write() exactly what they need to write to the log (the
optype) and build their own opheaders for start, commit and unmount
records. Hence we no longer need to set the client id in either the
log ticket or the xfs_trans.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to the list of supported flags

Source kernel commit: 973ac0eb3a7dfedecd385bd2b48b12e62a0492f2

This commit enables XFS module to work with fs instances having 64-bit
per-inode extent counters by adding XFS_SB_FEAT_INCOMPAT_NREXT64 flag to the
list of supported incompat feature flags.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters

Source kernel commit: c3c4ecb529c5a1f0590cffb70649d407ee79b8a8

The following changes are made to enable userspace to obtain 64-bit extent
counters,
1. Carve out a new 64-bit field xfs_bulkstat->bs_extents64 from
xfs_bulkstat->bs_pad[] to hold 64-bit extent counter.
2. Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to indicate that
it is capable of receiving 64-bit extent counters.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Conditionally upgrade existing inodes to use large extent counters

Source kernel commit: 4f86bb4b66c999ad9ddcfd49fec93992eeba2715

This commit enables upgrading existing inodes to use large extent counters
provided that underlying filesystem's superblock has large extent counter
feature enabled.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Directory's data fork extent counter can never overflow

Source kernel commit: 83a21c18441f75aec64548692b52d34582b98a6a

The maximum file size that can be represented by the data fork extent counter
in the worst case occurs when all extents are 1 block in length and each block
is 1KB in size.

With XFS_MAX_EXTCNT_DATA_FORK_SMALL representing maximum extent count and with
1KB sized blocks, a file can reach upto,
(2^31) * 1KB = 2TB

This is much larger than the theoretical maximum size of a directory
i.e. XFS_DIR2_SPACE_SIZE * 3 = ~96GB.

Since a directory's inode can never overflow its data fork extent counter,
this commit removes all the overflow checks associated with
it. xfs_dinode_verify() now performs a rough check to verify if a diretory's
data fork is larger than 96GB.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use a separate frextents counter for rt extent reservations

Source kernel commit: 2229276c5283264b8c2241c1ed972bbb136cab22

As mentioned in the previous commit, the kernel misuses sb_frextents in
the incore mount to reflect both incore reservations made by running
transactions as well as the actual count of free rt extents on disk.
This results in the superblock being written to the log with an
underestimate of the number of rt extents that are marked free in the
rtbitmap.

Teaching XFS to recompute frextents after log recovery avoids
operational problems in the current mount, but it doesn't solve the
problem of us writing undercounted frextents which are then recovered by
an older kernel that doesn't have that fix.

Create an incore percpu counter to mirror the ondisk frextents. This
new counter will track transaction reservations and the only time we
will touch the incore super counter (i.e the one that gets logged) is
when those transactions commit updates to the rt bitmap. This is in
contrast to the lazysbcount counters (e.g. fdblocks), where we know that
log recovery will always fix any incorrect counter that we log.
As a bonus, we only take m_sb_lock at transaction commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: pass explicit mount pointer to rtalloc query functions

Source kernel commit: f34061f554feba68e12b7a73008c350d2a9afd0c

Pass an explicit xfs_mount pointer to the rtalloc query functions so
that they can support transactionless queries.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce per-inode 64-bit extent counters

Source kernel commit: 52a4a14842ef940e5bab1c949e5adc8f027327dc

This commit introduces new fields in the on-disk inode format to support
64-bit data fork extent counters and 32-bit attribute fork extent
counters. The new fields will be used only when an inode has
XFS_DIFLAG2_NREXT64 flag set. Otherwise we continue to use the regular 32-bit
data fork extent counters and 16-bit attribute fork extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce macros to represent new maximum extent counts for data/attr forks

Source kernel commit: df9ad5cc7a524048ea7ff983d6feeb6d8c47a761

This commit defines new macros to represent maximum extent counts allowed by
filesystems which have support for large per-inode extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Use uint64_t to count maximum blocks that can be used by BMBT

Source kernel commit: 0c35e7ba18508e9344a1f27b412924bc8b34eba8

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers

Source kernel commit: 9b7d16e34bbebc0398b1dd4f2d64ae6793fdc5ea

This commit adds the new per-inode flag XFS_DIFLAG2_NREXT64 to indicate that
an inode supports 64-bit extent counters. This flag is also enabled by default
on newly created inodes when the corresponding filesystem has large extent
counter feature bit (i.e. XFS_FEAT_NREXT64) set.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce XFS_FSOP_GEOM_FLAGS_NREXT64

Source kernel commit: 7c05aa9d9d2014937c8dacbd514bca2592b11f48

XFS_FSOP_GEOM_FLAGS_NREXT64 indicates that the current filesystem instance
supports 64-bit per-inode extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce XFS_SB_FEAT_INCOMPAT_NREXT64 and associated per-fs feature bit

Source kernel commit: 919819f5e18097e6e888764c30625b1288d416c5

XFS_SB_FEAT_INCOMPAT_NREXT64 incompat feature bit will be set on filesystems
which support large per-inode extent counters. This commit defines the new
incompat feature bit and the corresponding per-fs feature bit (along with
inline functions to work on it).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively

Source kernel commit: 755c38ffe1a5937d8fa03419018f49f3a23fa9a7

A future commit will introduce a 64-bit on-disk data extent counter and a
32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
of these quantities.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Use basic types to define xfs_log_dinode's di_nextents and di_anextents

Source kernel commit: 1e7384f93db57c2135a9fa176e27b1c72ad860e3

A future commit will increase the width of xfs_extnum_t in order to facilitate
larger per-inode extent counters. Hence this patch now uses basic types to
define xfs_log_dinode->[di_nextents|dianextents].

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce xfs_dfork_nextents() helper

Source kernel commit: dd95a6ce31d6441dfd5fd3aa5d7208b0fc61782f

This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
counter fields will add more logic to this helper.

This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
with calls to xfs_dfork_nextents().

No functional changes have been made.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Use xfs_extnum_t instead of basic data types

Source kernel commit: bb1d50494cbdd9c5991ddc7feeeb14982872b2a8

xfs_extnum_t is the type to use to declare variables which have values
obtained from xfs_dinode->di_[a]nextents. This commit replaces basic
types (e.g. uint32_t) with xfs_extnum_t for such variables.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce xfs_iext_max_nextents() helper

Source kernel commit: 9feb8f19665c8ba051c6a81aa7897149e7748e1e

xfs_iext_max_nextents() returns the maximum number of extents possible for one
of data, cow or attribute fork. This helper will be extended further in a
future commit when maximum extent counts associated with data/attribute forks
are increased.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Define max extent length based on on-disk format definition

Source kernel commit: 95f0b95e2b686ceaa3f465e9fa079f22e0fe7665

The maximum extent length depends on maximum block count that can be stored in
a BMBT record. Hence this commit defines MAXEXTLEN based on
BMBT_BLOCKCOUNT_BITLEN.

While at it, the commit also renames MAXEXTLEN to XFS_MAX_BMBT_EXTLEN.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Move extent count limits to xfs_format.h

Source kernel commit: 3b0d9fd369ea48419ccb578e0bafa4c54df63ba6

Maximum values associated with extent counters i.e. Maximum extent length,
Maximum data extents and Maximum xattr extents are dictated by the on-disk
format. Hence move these definitions over to xfs_format.h.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.18.0

Update all the necessary files for a 5.18.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: more autoconf modernisation

Fix a few autoconf things that were added after the submission of the
autoconf modernization patch. This was performed by running:

$ autoupdate configure.ac m4/*.m4

And manually putting the version back to 2.69.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.18.0-rc1

Update all the necessary files for a 5.18.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: Fix memory leak

'value' is allocated by strdup() in getstr(). It
needs to be freed as we do not keep any permanent
reference to it.

Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: autoconf modernisation

Because apparently AC_TRY_COMPILE and AC_TRY_LINK has been
deprecated and made obsolete.

.....
configure.ac:164: warning: The macro `AC_TRY_COMPILE' is obsolete.
configure.ac:164: You should run autoupdate.
./lib/autoconf/general.m4:2847: AC_TRY_COMPILE is expanded from...
m4/package_libcdev.m4:68: AC_HAVE_GETMNTENT is expanded from...
configure.ac:164: the top level
configure.ac:165: warning: The macro `AC_TRY_LINK' is obsolete.
configure.ac:165: You should run autoupdate.
./lib/autoconf/general.m4:2920: AC_TRY_LINK is expanded from...
m4/package_libcdev.m4:84: AC_HAVE_FALLOCATE is expanded from...
configure.ac:165: the top level
.....

But "autoupdate" does nothing to fix this, so I have to manually do
these conversions:

- AC_TRY_COMPILE -> AC_COMPILE_IFELSE
- AC_TRY_LINK -> AC_LINK_IFELSE

because I have nothing better to do than fix currently working
code.

Also, running autoupdate forces the minimum pre-req to be autoconf
2.71 because it replaces other stuff...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: use AC_PREREQ of 2.69 vs 2.71 to avoid bleeding edge]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_io: add a quiet option to bulkstat

This is purely for driving the kernel bulkstat operations as hard
as userspace can drive them - we don't care about the actual output,
just want to drive maximum IO rates through the inode cache.

Bulkstat at 3.4 million inodes a second via xfs_io currently burns
about 30% of CPU time just formatting and outputting the stat
information to stdout and dumping it to /dev/null.

wall time rate IOPS bandwidth
unpatched 17.823s 3.4M/s 70k 1.9GB/s
with -q 15.682 6.1M/s 150k 3.5GB/s

The disks are at about 30% of max bandwidth and only at 70kiops, so
this CPU can be used to drive the kernel and IO subsystem harder.

Wall time doesn't really go down on this specific test because the
increase in inode cache turn-over (about 10GB/s of cached metadata
(in-core inodes and buffers) is being cycled through memory on a
machine with 16GB of RAM) and that hammers memory reclaim into a
utter mess that often takes seconds for it to recover from...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

metadump: be careful zeroing corrupt inode forks

When a corrupt inode fork is encountered, we can zero beyond the end
of the inode if the fork pointers are sufficiently trashed. We
should not trust the fork pointers when corruption is detected and
skip the zeroing in this case. We want metadump to capture the
corruption and so skipping the zeroing will give us the best chance
of preserving the corruption in a meaningful state for diagnosis.

Reported-by: Sean Caron <scaron@umich.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

metadump: handle corruption errors without aborting

Sean Caron reported that a metadump terminated after givin gthis
warning:

xfs_metadump: inode 2216156864 has unexpected extents

Metadump is supposed to ignore corruptions and continue dumping the
filesystem as best it can. Whilst it warns about many situations
where it can't fully dump structures, it should stop processing that
structure and continue with the next one until the entire filesystem
has been processed.

Unfortunately, some warning conditions also return an "abort" error
status, causing metadump to abort if that condition is hit. Most of
these abort conditions should really be "continue on next object"
conditions so that the we attempt to dump the rest of the
filesystem.

Fix the returns for warnings that incorrectly cause aborts
such that the only abort conditions are read errors when
"stop-on-read-error" semantics are specified. Also make the return
values consistently mean abort/continue rather than returning -errno
to mean "stop because read error" and then trying to infer what
the error means in callers without the context it occurred in.

Reported-and-tested-by: Sean Caron <scaron@umich.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: take BB cluster offset into account when using 'type' cmd

Changing the interpretation type of data under the cursor moves the
cursor to the beginning of BB cluster. When cursor is set to an
inode the cursor is offset in BB buffer. However, this offset is not
considered when type of the data is changed - the cursor points to
the beginning of BB buffer. For example:

$ xfs_db -c "inode 131" -c "daddr" -c "type text" \
-c "daddr" /dev/sdb1
current daddr is 131
current daddr is 128

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode

If we decide to restart an inode chunk walk because one of the inodes is
stale, make sure that we don't walk an inode that's been scanned before.
This ensure we always make forward progress.

Found by observing backwards inode scan progress while running xfs/285.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: add comment above forward-progress test]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: balance inode chunk scan across CPUs

Use the bounded workqueue functionality to spread the inode chunk scan
load across the CPUs more evenly. First, we create per-AG workers to
walk each AG's inode btree to create inode batch work items for each
inobt record. These items are added to a (second) bounded workqueue
that invokes BULKSTAT and invokes the caller's function on each bulkstat
record.

By splitting the work items into batches of 64 inodes instead of one
thread per AG, we keep the level of parallelism at a reasonably high
level almost all the way to the end of the inode scan if the inodes are
not evenly divided across AGs or if a few files have far more extent
records than average.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: prepare phase3 for per-inogrp worker threads

In the next patch, we're going to rewrite scrub_scan_all_inodes to
schedule per-inogrp workqueue items that will run the iterator function.
In other words, the worker threads in phase 3 wil soon cease to be
per-AG threads.

To prepare for this, we must modify phase 3 so that any writes to shared
state are protected by the appropriate per-AG locks. As far as I can
tell, the only updates to shared state are the per-AG action lists, so
create some per-AG locks for phase 3 and create locked wrappers for the
action_list_* functions if we find things to repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: widen action list length variables

On a 32-bit system it's possible for there to be so many items in the
repair list that we overflow a size_t. Widen this to unsigned long
long.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: in phase 3, use the opened file descriptor for repair calls

While profiling the performance of xfs_scrub, I noticed that phase3 only
employs the scrub-by-handle interface for repairs. The kernel has had
the ability to skip the untrusted iget lookup if the fd matches the
handle data since the beginning, and using it reduces the repair runtime
by 5% on the author's system. Normally, we shouldn't be running that
many repairs or optimizations, but we did this for scrub, so we should
do the same for repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: make phase 4 go straight to fstrim if nothing to fix

If there's nothing to repair in phase 4, there's no need to hold up the
FITRIM call to do the summary count scan that prepares us to repair
filesystem metadata. Rearrange this a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: fix unfixable_errors test logic thinko]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: don't try any file repairs during phase 3 if AG metadata bad

Currently, phase 3 tries to repair file metadata even after phase 2
tells us that there are problems with the AG metadata. While this
generally won't cause too many problems since the repair code will bail
out on any obvious corruptions it finds, this isn't totally foolproof.
If the filesystem space metadata are not in good shape, we want to queue
the file repairs to run /after/ the space metadata repairs in phase 4.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: fall back to scrub-by-handle if opening handles fails

Back when I originally wrote xfs_scrub, I decided that phase 3 (the file
scrubber) should try to open all regular files by handle to pin the file
during the scrub.  At the time, I decided that an ESTALE return value
should cause the entire inode group (aka an inobt record) to be
rescanned for thoroughness reasons.

Over the past four years, I've realized that checking the return value
here isn't necessary.  For most runtime errors, we already fall back to
scrubbing with the file handle, at a fairly small performance cost.

For ESTALE, if the file has been freed and reallocated, its metadata has
been rewritten completely since bulkstat, so it's not necessary to check
it for latent disk errors.  If the file was freed, we can simply move on
to the next file.  If the filesystem is corrupt enough that
open-by-handle fails (this also results in ESTALE), we actually /want/
to fall back to scrubbing that file by handle, not rescanning the entire
inode group.

Therefore, remove the ESTALE check and leave a comment detailing these
findings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: in phase 3, use the opened file descriptor for scrub calls

While profiling the performance of xfs_scrub, I noticed that phase3 only
employs the scrub-by-handle interface. The kernel has had the ability
to skip the untrusted iget lookup if the fd matches the handle data
since the beginning, and using it reduces the phase 3 runtime by 5% on
the author's system.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: collapse trivial file scrub helpers

Remove all these trivial file scrub helper functions since they make
tracing code paths difficult and will become annoying in the patches
that follow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: check the ftype of dot and dotdot directory entries

The long-format directory block checking code skips the filetype check
for the '.' and '..' entries, even though they're part of the ondisk
format. This leads to repair failing to catch subtle corruption at the
start of a directory.

Found by fuzzing bu[0].filetype = zeroes in xfs/386.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: improve error reporting when checking rmap and refcount btrees

When we're checking the rmap and refcount btrees, push error reporting
down to the exact locations of failures so that the user sees both a
message specific to the failure that occurred and what repair was doing
at the time. This also removes quite a bit of return code handling,
since all the errors were fatal anyway.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: update subject and change to djwong's updated commit log]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: detect v5 featureset mismatches in secondary supers

Make sure we detect and correct mismatches between the V5 features
described in the primary and the secondary superblocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: add comment about XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: don't trample the gid set in the protofile

Catherine's recent changes to xfs/019 exposed a bug in how libxfs
handles setgid bits. mkfs reads the desired gid in from the protofile,
but if the parent directory is setgid, it will override the user's
setting and (re)set the child's gid to the parent's gid. Overriding
user settings is (probably) not the desired mode of operation, so create
a flag to struct cred to force the gid in the protofile.

It looks like this has been broken since ~2005.

Cc: Catherine Hoang <catherine.hoang@oracle.com>
Fixes: 9f064b7e ("Provide mkfs options to easily exercise all inheritable attributes, esp. the extsize allocator hint. Merge of master-melb:xfs-cmds:24370a by kenmcd.")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: round log size down if rounding log start up causes overflow

If rounding the log start up to the next stripe unit would cause the log
to overrun the end of the AG, round the log size down by a stripe unit.
We already ensured that logblocks was small enough to fit inside the AG,
so the minor adjustment should suffice.

This can be reproduced with:
mkfs.xfs -dsu=44k,sw=1,size=300m,file,name=fsfile -m rmapbt=0
and:
mkfs.xfs -dsu=48k,sw=1,size=512m,file,name=fsfile -m rmapbt=0

Reported-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: improve log extent validation

Use the standard libxfs fsblock verifiers to check the start and end of
the internal log. The current code does not catch the case of a
(segmented) fsblock that is beyond agf_blocks but not so large to change
the agno part of the segmented fsblock.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: don't let internal logs bump the root dir inode chunk to AG 1

Currently, we don't let an internal log consume every last block in an
AG.  According to the comment, we're doing this to avoid tripping AGF
verifiers if freeblks==0, but on a modern filesystem this isn't
sufficient to avoid problems because we need to have enough space in the
AG to allocate an aligned root inode chunk, if it should be the case
that the log also ends up in AG 0:

$ truncate -s 6366g /tmp/a ; mkfs.xfs -f /tmp/a -d agcount=3200 -l agnum=0
meta-data=/tmp/a                 isize=512    agcount=3200, agsize=521503 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=1668808704, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521492, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: root inode created in AG 1, not AG 0

Therefore, modify the maximum internal log size calculation to constrain
the maximum internal log size so that the aligned inode chunk allocation
will always succeed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: reduce internal log size when log stripe units are in play

Currently, one can feed mkfs a combination of options like this:

$ truncate -s 6366g /tmp/a ; mkfs.xfs -f /tmp/a -d agcount=3200 -d su=256k,sw=4
meta-data=/tmp/a                 isize=512    agcount=3200, agsize=521536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=1668808704, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521536, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Metadata corruption detected at 0x55e88052c6b6, xfs_agf block 0x1/0x200
libxfs_bwrite: write verifier failed on xfs_agf bno 0x1/0x1
mkfs.xfs: writing AG headers failed, err=117

The format fails because the internal log size sizing algorithm
specifies a log size of 521492 blocks to avoid taking all the space in
the AG, but align_log_size sees the stripe unit and rounds that up to
the next stripe unit, which is 521536 blocks.

Fix this problem by rounding the log size down if rounding up would
result in a log that consumes more space in the AG than we allow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: fix missing validation of -l size against maximum internal log size

If a sysadmin specifies a log size explicitly, we don't actually check
that against the maximum internal log size that we compute for the
default log size computation. We're going to add more validation soon,
so refactor the max internal log blocks into a common variable and
add a check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix sizing of the incore rt space usage map calculation

If someone creates a realtime volume exactly *one* extent in length, the
sizing calculation for the incore rt space usage bitmap will be zero
because the integer division here rounds down. Use howmany() to round
up. Note that there can't be that many single-extent rt volumes since
repair will corrupt them into zero-extent rt volumes, and we haven't
gotten any reports.

Found by running xfs/530 after fixing xfs_repair to check the rt bitmap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: report absolute maxlevels for each btree type

Augment the xfs_db btheight command so that the debugger can display the
absolute maximum btree height for each btree type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: support computing btheight for all cursor types

Add the special magic btree type value 'all' to the btheight command so
that we can display information about all known btree types at once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: warn about suspicious btree levels in AG headers

Warn about suspicious AG btree levels in the AGF and AGI.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: warn about suspicious finobt trees when metadumping

We warn about suspicious roots and btree heights before metadumping the
inode btree, so do the same for the free inode btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: note the removal of XFS_IOC_FSSETDM in the documentation

Update the documentation to note the removal of this ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: fix a complaint about a printf buffer overrun

gcc 11 warns that stack_f doesn't allocate a sufficiently large buffer
to hold the printf output. I don't think the io cursor stack is really
going to grow to 4 billion levels deep, but let's fix this anyway.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: move to mallinfo2 when available

Starting with glibc 2.35, the mallinfo library call has finally been
upgraded to return 64-bit memory usage quantities. Migrate to the new
call, since it also warns about mallinfo being deprecated.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: support multiarch for libhandle

For nearly a decade now, Debian and derivatives have supported the
"multiarch" layout, where shared libraries are installed to
/lib/<gcc triple>/ instead of /lib. This enables a single rootfs to
support binaries from multiple architectures (e.g. i386 inside an amd64
system). We should follow this, since libhandle is useful.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: bump compat level to 11

Increase the compat level to 11 since we're now getting warnings about
level 9 being obsolescent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: refactor common options

Don't respecify identical configure options; move them into the
configure_options variable.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.18.0-rc0

Update all the necessary files for a 5.18.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mm/fs: delete PF_SWAPWRITE

Source kernel commit: b698f0a1773f7df73f2bb4bfe0e597ea1bb3881f

PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").

Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE. But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.

Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Jan Kara <jack@suse.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: document the XFS_ALLOC_AGFL_RESERVE constant

Source kernel commit: 93defd5a15dd74791532692cc59be3a1aaa045fe

Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full. This has lead to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>