Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations. This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing. As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.
However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first, which will make scrub run much
faster.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Most callers of xfs_rmap_lookup_le will retrieve the btree record
immediately if the lookup succeeds. The overlapped version of this
function (xfs_rmap_lookup_le_range) will return the record if the lookup
succeeds, so make the regular version do it too. Get rid of the useless
len argument, since it's not part of the lookup key.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
We also pass the fields to log to xfs_btree_offsets() as a uint32_t
all cases now. I have no idea why we made that parameter a int64_t
in the first place, but while we are fixing this up change it to
a uint32_t field, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
This touches xfs_fs.h so affects the user API, but the user API
fields are also unsigned so the flags should really be unsigned,
too.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We currently set the log ticket client ID when we reserve a
transaction. This client ID is only ever written to the log by
a CIL checkpoint or unmount records, and so anything using a high
level transaction allocated through xfs_trans_alloc() does not need
a log ticket client ID to be set.
For the CIL checkpoint, the client ID written to the journal is
always XFS_TRANSACTION, and for the unmount record it is always
XFS_LOG, and nothing else writes to the log. All of these operations
tell xlog_write() exactly what they need to write to the log (the
optype) and build their own opheaders for start, commit and unmount
records. Hence we no longer need to set the client id in either the
log ticket or the xfs_trans.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This commit enables XFS module to work with fs instances having 64-bit
per-inode extent counters by adding XFS_SB_FEAT_INCOMPAT_NREXT64 flag to the
list of supported incompat feature flags.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The following changes are made to enable userspace to obtain 64-bit extent
counters,
1. Carve out a new 64-bit field xfs_bulkstat->bs_extents64 from
xfs_bulkstat->bs_pad[] to hold 64-bit extent counter.
2. Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to indicate that
it is capable of receiving 64-bit extent counters.
Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This commit enables upgrading existing inodes to use large extent counters
provided that underlying filesystem's superblock has large extent counter
feature enabled.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The maximum file size that can be represented by the data fork extent counter
in the worst case occurs when all extents are 1 block in length and each block
is 1KB in size.
With XFS_MAX_EXTCNT_DATA_FORK_SMALL representing maximum extent count and with
1KB sized blocks, a file can reach upto,
(2^31) * 1KB = 2TB
This is much larger than the theoretical maximum size of a directory
i.e. XFS_DIR2_SPACE_SIZE * 3 = ~96GB.
Since a directory's inode can never overflow its data fork extent counter,
this commit removes all the overflow checks associated with
it. xfs_dinode_verify() now performs a rough check to verify if a diretory's
data fork is larger than 96GB.
Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
As mentioned in the previous commit, the kernel misuses sb_frextents in
the incore mount to reflect both incore reservations made by running
transactions as well as the actual count of free rt extents on disk.
This results in the superblock being written to the log with an
underestimate of the number of rt extents that are marked free in the
rtbitmap.
Teaching XFS to recompute frextents after log recovery avoids
operational problems in the current mount, but it doesn't solve the
problem of us writing undercounted frextents which are then recovered by
an older kernel that doesn't have that fix.
Create an incore percpu counter to mirror the ondisk frextents. This
new counter will track transaction reservations and the only time we
will touch the incore super counter (i.e the one that gets logged) is
when those transactions commit updates to the rt bitmap. This is in
contrast to the lazysbcount counters (e.g. fdblocks), where we know that
log recovery will always fix any incorrect counter that we log.
As a bonus, we only take m_sb_lock at transaction commit time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Pass an explicit xfs_mount pointer to the rtalloc query functions so
that they can support transactionless queries.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This commit introduces new fields in the on-disk inode format to support
64-bit data fork extent counters and 32-bit attribute fork extent
counters. The new fields will be used only when an inode has
XFS_DIFLAG2_NREXT64 flag set. Otherwise we continue to use the regular 32-bit
data fork extent counters and 16-bit attribute fork extent counters.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This commit defines new macros to represent maximum extent counts allowed by
filesystems which have support for large per-inode extent counters.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This commit adds the new per-inode flag XFS_DIFLAG2_NREXT64 to indicate that
an inode supports 64-bit extent counters. This flag is also enabled by default
on newly created inodes when the corresponding filesystem has large extent
counter feature bit (i.e. XFS_FEAT_NREXT64) set.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
XFS_FSOP_GEOM_FLAGS_NREXT64 indicates that the current filesystem instance
supports 64-bit per-inode extent counters.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
XFS_SB_FEAT_INCOMPAT_NREXT64 incompat feature bit will be set on filesystems
which support large per-inode extent counters. This commit defines the new
incompat feature bit and the corresponding per-fs feature bit (along with
inline functions to work on it).
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
A future commit will introduce a 64-bit on-disk data extent counter and a
32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
of these quantities.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
A future commit will increase the width of xfs_extnum_t in order to facilitate
larger per-inode extent counters. Hence this patch now uses basic types to
define xfs_log_dinode->[di_nextents|dianextents].
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
counter fields will add more logic to this helper.
This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
with calls to xfs_dfork_nextents().
No functional changes have been made.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_extnum_t is the type to use to declare variables which have values
obtained from xfs_dinode->di_[a]nextents. This commit replaces basic
types (e.g. uint32_t) with xfs_extnum_t for such variables.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_iext_max_nextents() returns the maximum number of extents possible for one
of data, cow or attribute fork. This helper will be extended further in a
future commit when maximum extent counts associated with data/attribute forks
are increased.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The maximum extent length depends on maximum block count that can be stored in
a BMBT record. Hence this commit defines MAXEXTLEN based on
BMBT_BLOCKCOUNT_BITLEN.
While at it, the commit also renames MAXEXTLEN to XFS_MAX_BMBT_EXTLEN.
Suggested-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Maximum values associated with extent counters i.e. Maximum extent length,
Maximum data extents and Maximum xattr extents are dictated by the on-disk
format. Hence move these definitions over to xfs_format.h.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Pavel Reichl [Fri, 27 May 2022 20:36:21 +0000 (16:36 -0400)]
mkfs: Fix memory leak
'value' is allocated by strdup() in getstr(). It
needs to be freed as we do not keep any permanent
reference to it.
Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Fri, 27 May 2022 20:36:21 +0000 (16:36 -0400)]
xfsprogs: autoconf modernisation
Because apparently AC_TRY_COMPILE and AC_TRY_LINK has been
deprecated and made obsolete.
.....
configure.ac:164: warning: The macro `AC_TRY_COMPILE' is obsolete.
configure.ac:164: You should run autoupdate.
./lib/autoconf/general.m4:2847: AC_TRY_COMPILE is expanded from...
m4/package_libcdev.m4:68: AC_HAVE_GETMNTENT is expanded from...
configure.ac:164: the top level
configure.ac:165: warning: The macro `AC_TRY_LINK' is obsolete.
configure.ac:165: You should run autoupdate.
./lib/autoconf/general.m4:2920: AC_TRY_LINK is expanded from...
m4/package_libcdev.m4:84: AC_HAVE_FALLOCATE is expanded from...
configure.ac:165: the top level
.....
But "autoupdate" does nothing to fix this, so I have to manually do
these conversions:
because I have nothing better to do than fix currently working
code.
Also, running autoupdate forces the minimum pre-req to be autoconf
2.71 because it replaces other stuff...
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: use AC_PREREQ of 2.69 vs 2.71 to avoid bleeding edge] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Fri, 27 May 2022 20:36:14 +0000 (16:36 -0400)]
xfs_io: add a quiet option to bulkstat
This is purely for driving the kernel bulkstat operations as hard
as userspace can drive them - we don't care about the actual output,
just want to drive maximum IO rates through the inode cache.
Bulkstat at 3.4 million inodes a second via xfs_io currently burns
about 30% of CPU time just formatting and outputting the stat
information to stdout and dumping it to /dev/null.
wall time rate IOPS bandwidth
unpatched 17.823s 3.4M/s 70k 1.9GB/s
with -q 15.682 6.1M/s 150k 3.5GB/s
The disks are at about 30% of max bandwidth and only at 70kiops, so
this CPU can be used to drive the kernel and IO subsystem harder.
Wall time doesn't really go down on this specific test because the
increase in inode cache turn-over (about 10GB/s of cached metadata
(in-core inodes and buffers) is being cycled through memory on a
machine with 16GB of RAM) and that hammers memory reclaim into a
utter mess that often takes seconds for it to recover from...
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Fri, 27 May 2022 20:36:14 +0000 (16:36 -0400)]
metadump: be careful zeroing corrupt inode forks
When a corrupt inode fork is encountered, we can zero beyond the end
of the inode if the fork pointers are sufficiently trashed. We
should not trust the fork pointers when corruption is detected and
skip the zeroing in this case. We want metadump to capture the
corruption and so skipping the zeroing will give us the best chance
of preserving the corruption in a meaningful state for diagnosis.
Reported-by: Sean Caron <scaron@umich.edu> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Dave Chinner [Fri, 27 May 2022 20:36:14 +0000 (16:36 -0400)]
metadump: handle corruption errors without aborting
Sean Caron reported that a metadump terminated after givin gthis
warning:
xfs_metadump: inode 2216156864 has unexpected extents
Metadump is supposed to ignore corruptions and continue dumping the
filesystem as best it can. Whilst it warns about many situations
where it can't fully dump structures, it should stop processing that
structure and continue with the next one until the entire filesystem
has been processed.
Unfortunately, some warning conditions also return an "abort" error
status, causing metadump to abort if that condition is hit. Most of
these abort conditions should really be "continue on next object"
conditions so that the we attempt to dump the rest of the
filesystem.
Fix the returns for warnings that incorrectly cause aborts
such that the only abort conditions are read errors when
"stop-on-read-error" semantics are specified. Also make the return
values consistently mean abort/continue rather than returning -errno
to mean "stop because read error" and then trying to infer what
the error means in callers without the context it occurred in.
Reported-and-tested-by: Sean Caron <scaron@umich.edu> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_db: take BB cluster offset into account when using 'type' cmd
Changing the interpretation type of data under the cursor moves the
cursor to the beginning of BB cluster. When cursor is set to an
inode the cursor is offset in BB buffer. However, this offset is not
considered when type of the data is changed - the cursor points to
the beginning of BB buffer. For example:
$ xfs_db -c "inode 131" -c "daddr" -c "type text" \
-c "daddr" /dev/sdb1
current daddr is 131
current daddr is 128
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:53:22 +0000 (22:53 -0400)]
xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode
If we decide to restart an inode chunk walk because one of the inodes is
stale, make sure that we don't walk an inode that's been scanned before.
This ensure we always make forward progress.
Found by observing backwards inode scan progress while running xfs/285.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: add comment above forward-progress test] Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: balance inode chunk scan across CPUs
Use the bounded workqueue functionality to spread the inode chunk scan
load across the CPUs more evenly. First, we create per-AG workers to
walk each AG's inode btree to create inode batch work items for each
inobt record. These items are added to a (second) bounded workqueue
that invokes BULKSTAT and invokes the caller's function on each bulkstat
record.
By splitting the work items into batches of 64 inodes instead of one
thread per AG, we keep the level of parallelism at a reasonably high
level almost all the way to the end of the inode scan if the inodes are
not evenly divided across AGs or if a few files have far more extent
records than average.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: prepare phase3 for per-inogrp worker threads
In the next patch, we're going to rewrite scrub_scan_all_inodes to
schedule per-inogrp workqueue items that will run the iterator function.
In other words, the worker threads in phase 3 wil soon cease to be
per-AG threads.
To prepare for this, we must modify phase 3 so that any writes to shared
state are protected by the appropriate per-AG locks. As far as I can
tell, the only updates to shared state are the per-AG action lists, so
create some per-AG locks for phase 3 and create locked wrappers for the
action_list_* functions if we find things to repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: in phase 3, use the opened file descriptor for repair calls
While profiling the performance of xfs_scrub, I noticed that phase3 only
employs the scrub-by-handle interface for repairs. The kernel has had
the ability to skip the untrusted iget lookup if the fd matches the
handle data since the beginning, and using it reduces the repair runtime
by 5% on the author's system. Normally, we shouldn't be running that
many repairs or optimizations, but we did this for scrub, so we should
do the same for repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: make phase 4 go straight to fstrim if nothing to fix
If there's nothing to repair in phase 4, there's no need to hold up the
FITRIM call to do the summary count scan that prepares us to repair
filesystem metadata. Rearrange this a bit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: fix unfixable_errors test logic thinko] Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: don't try any file repairs during phase 3 if AG metadata bad
Currently, phase 3 tries to repair file metadata even after phase 2
tells us that there are problems with the AG metadata. While this
generally won't cause too many problems since the repair code will bail
out on any obvious corruptions it finds, this isn't totally foolproof.
If the filesystem space metadata are not in good shape, we want to queue
the file repairs to run /after/ the space metadata repairs in phase 4.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: fall back to scrub-by-handle if opening handles fails
Back when I originally wrote xfs_scrub, I decided that phase 3 (the file
scrubber) should try to open all regular files by handle to pin the file
during the scrub. At the time, I decided that an ESTALE return value
should cause the entire inode group (aka an inobt record) to be
rescanned for thoroughness reasons.
Over the past four years, I've realized that checking the return value
here isn't necessary. For most runtime errors, we already fall back to
scrubbing with the file handle, at a fairly small performance cost.
For ESTALE, if the file has been freed and reallocated, its metadata has
been rewritten completely since bulkstat, so it's not necessary to check
it for latent disk errors. If the file was freed, we can simply move on
to the next file. If the filesystem is corrupt enough that
open-by-handle fails (this also results in ESTALE), we actually /want/
to fall back to scrubbing that file by handle, not rescanning the entire
inode group.
Therefore, remove the ESTALE check and leave a comment detailing these
findings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_scrub: in phase 3, use the opened file descriptor for scrub calls
While profiling the performance of xfs_scrub, I noticed that phase3 only
employs the scrub-by-handle interface. The kernel has had the ability
to skip the untrusted iget lookup if the fd matches the handle data
since the beginning, and using it reduces the phase 3 runtime by 5% on
the author's system.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_repair: check the ftype of dot and dotdot directory entries
The long-format directory block checking code skips the filetype check
for the '.' and '..' entries, even though they're part of the ondisk
format. This leads to repair failing to catch subtle corruption at the
start of a directory.
Found by fuzzing bu[0].filetype = zeroes in xfs/386.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_repair: improve error reporting when checking rmap and refcount btrees
When we're checking the rmap and refcount btrees, push error reporting
down to the exact locations of failures so that the user sees both a
message specific to the failure that occurred and what repair was doing
at the time. This also removes quite a bit of return code handling,
since all the errors were fatal anyway.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: update subject and change to djwong's updated commit log] Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
xfs_repair: detect v5 featureset mismatches in secondary supers
Make sure we detect and correct mismatches between the V5 features
described in the primary and the secondary superblocks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: add comment about XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR] Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
mkfs: don't trample the gid set in the protofile
Catherine's recent changes to xfs/019 exposed a bug in how libxfs
handles setgid bits. mkfs reads the desired gid in from the protofile,
but if the parent directory is setgid, it will override the user's
setting and (re)set the child's gid to the parent's gid. Overriding
user settings is (probably) not the desired mode of operation, so create
a flag to struct cred to force the gid in the protofile.
It looks like this has been broken since ~2005.
Cc: Catherine Hoang <catherine.hoang@oracle.com> Fixes: 9f064b7e ("Provide mkfs options to easily exercise all inheritable attributes, esp. the extsize allocator hint. Merge of master-melb:xfs-cmds:24370a by kenmcd.") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
mkfs: round log size down if rounding log start up causes overflow
If rounding the log start up to the next stripe unit would cause the log
to overrun the end of the AG, round the log size down by a stripe unit.
We already ensured that logblocks was small enough to fit inside the AG,
so the minor adjustment should suffice.
This can be reproduced with:
mkfs.xfs -dsu=44k,sw=1,size=300m,file,name=fsfile -m rmapbt=0
and:
mkfs.xfs -dsu=48k,sw=1,size=512m,file,name=fsfile -m rmapbt=0
Reported-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
mkfs: improve log extent validation
Use the standard libxfs fsblock verifiers to check the start and end of
the internal log. The current code does not catch the case of a
(segmented) fsblock that is beyond agf_blocks but not so large to change
the agno part of the segmented fsblock.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:13 +0000 (22:48 -0400)]
mkfs: don't let internal logs bump the root dir inode chunk to AG 1
Currently, we don't let an internal log consume every last block in an
AG. According to the comment, we're doing this to avoid tripping AGF
verifiers if freeblks==0, but on a modern filesystem this isn't
sufficient to avoid problems because we need to have enough space in the
AG to allocate an aligned root inode chunk, if it should be the case
that the log also ends up in AG 0:
Therefore, modify the maximum internal log size calculation to constrain
the maximum internal log size so that the aligned inode chunk allocation
will always succeed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The format fails because the internal log size sizing algorithm
specifies a log size of 521492 blocks to avoid taking all the space in
the AG, but align_log_size sees the stripe unit and rounds that up to
the next stripe unit, which is 521536 blocks.
Fix this problem by rounding the log size down if rounding up would
result in a log that consumes more space in the AG than we allow.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:12 +0000 (22:48 -0400)]
mkfs: fix missing validation of -l size against maximum internal log size
If a sysadmin specifies a log size explicitly, we don't actually check
that against the maximum internal log size that we compute for the
default log size computation. We're going to add more validation soon,
so refactor the max internal log blocks into a common variable and
add a check.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:12 +0000 (22:48 -0400)]
xfs_repair: fix sizing of the incore rt space usage map calculation
If someone creates a realtime volume exactly *one* extent in length, the
sizing calculation for the incore rt space usage bitmap will be zero
because the integer division here rounds down. Use howmany() to round
up. Note that there can't be that many single-extent rt volumes since
repair will corrupt them into zero-extent rt volumes, and we haven't
gotten any reports.
Found by running xfs/530 after fixing xfs_repair to check the rt bitmap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:12 +0000 (22:48 -0400)]
xfs_db: report absolute maxlevels for each btree type
Augment the xfs_db btheight command so that the debugger can display the
absolute maximum btree height for each btree type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:12 +0000 (22:48 -0400)]
xfs_db: support computing btheight for all cursor types
Add the special magic btree type value 'all' to the btheight command so
that we can display information about all known btree types at once.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:12 +0000 (22:48 -0400)]
xfs_repair: warn about suspicious btree levels in AG headers
Warn about suspicious AG btree levels in the AGF and AGI.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:12 +0000 (22:48 -0400)]
xfs_db: warn about suspicious finobt trees when metadumping
We warn about suspicious roots and btree heights before metadumping the
inode btree, so do the same for the free inode btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:07 +0000 (22:48 -0400)]
xfs_db: fix a complaint about a printf buffer overrun
gcc 11 warns that stack_f doesn't allocate a sufficiently large buffer
to hold the printf output. I don't think the io cursor stack is really
going to grow to 4 billion levels deep, but let's fix this anyway.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:07 +0000 (22:48 -0400)]
xfs_scrub: move to mallinfo2 when available
Starting with glibc 2.35, the mallinfo library call has finally been
upgraded to return 64-bit memory usage quantities. Migrate to the new
call, since it also warns about mallinfo being deprecated.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 18 May 2022 02:48:07 +0000 (22:48 -0400)]
debian: support multiarch for libhandle
For nearly a decade now, Debian and derivatives have supported the
"multiarch" layout, where shared libraries are installed to
/lib/<gcc triple>/ instead of /lib. This enables a single rootfs to
support binaries from multiple architectures (e.g. i386 inside an amd64
system). We should follow this, since libhandle is useful.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").
Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE. But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.
Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: NeilBrown <neilb@suse.de> Cc: Jan Kara <jack@suse.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full. This has lead to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The symbol xfs_name_dotdot is a global variable that the xfs codebase
uses here and there to look up directory dotdot entries. Currently it's
a non-const variable, which means that it's a mutable global variable.
So far nobody's abused this to cause problems, but let's use the
compiler to enforce that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Various directory functions do not modify their @name parameter,
so mark it const to make that clear. This will enable us to mark
the global xfs_name_dotdot variable as const to prevent mischief.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Now that we've made these ioctls defunct, move them from xfs_fs.h to
xfs_ioctl.c, which effectively removes them from the publicly supported
ioctl interfaces for XFS.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Remove the definitions for these ioctls, since the functionality (and,
weirdly, the 32-bit compat ioctl definitions) were removed from the
kernel in November 2019.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
To prepare for looking at the IOMAP_DAX flag in xfs_bmbt_to_iomap pass in
the input mapping flags to xfs_bmbt_to_iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-24-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Eric Sandeen [Wed, 4 May 2022 18:49:14 +0000 (14:49 -0400)]
libxfs: remove kernel stubs from xfs_shared.h
The kernel stubs added to xfs_shared.h don't belong there, and are
mostly unnecessary with the #ifdef __KERNEL__ bits added to the
xfs_ag.[ch] files. Move the one remaining needed stub in libxfs_priv.h.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
When testing xfstests xfs/126 on lastest upstream kernel, it will hang on some machine.
Adding a getxattr operation after xattr corrupted, I can reproduce it 100%.
When getxattr calls xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in
xfs_attr_node_hasname because we have use blocktrash to random it in xfs/126. So it
free state in internal and xfs_attr_node_get doesn't do xfs_buf_trans release job.
Then subsequent removexattr will hang because of it.
This bug was introduced by kernel commit 07120f1abdff ("xfs: Add xfs_has_attr and subroutines").
It adds xfs_attr_node_hasname helper and said caller will be responsible for freeing the state
in this case. But xfs_attr_node_hasname will free state itself instead of caller if
xfs_da3_node_lookup_int fails.
Fix this bug by moving the step of free state into caller.
Also, use "goto error/out" instead of returning error directly in xfs_attr_node_addname_find_attr and
xfs_attr_node_removename_setup function because we should free state ourselves.
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The xfs_perag structure and initialization is unused in userspace,
so #ifdef it out with __KERNEL__ to facilitate the xfsprogs sync
and build.
Signed-off-by: Eric Sandeen <esandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid
opencoding it.
Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Yang Guang <yang.guang5@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The owner info parameter is always NULL, so get rid of the parameter.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
We only use EFIs to free metadata blocks -- not regular data/attr fork
extents. Remove all the fields that we never use, for a net reduction
of 16 bytes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_bmap_add_free isn't a block mapping function; it schedules deferred
freeing operations for a later point in a compound transaction chain.
While it's primarily used by bunmapi, its use has expanded beyond that.
Move it to xfs_alloc.c and rename the function since it's now general
freeing functionality. Bring the slab cache bits in line with the
way we handle the other intent items.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Create slab caches for the high-level structures that coordinate
deferred intent items, since they're used fairly heavily.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Rearrange these structs to reduce the amount of unused padding bytes.
This saves eight bytes for each of the three structs changed here, which
means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even
powers of two.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Remove these typedefs by referencing kmem_cache directly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree. For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type. This is a setup for
the next patch, which makes every btree type have its own cursor cache.
The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Thu, 28 Apr 2022 19:39:03 +0000 (15:39 -0400)]
xfs_repair: stop using XFS_BTREE_MAXLEVELS
Use the precomputed per-btree-type max height values.
[sandeen: note that >= changes to > here; The maximal value is
fine, but with the precomputed value specific to this filesystem,
our new limit is the actual acceptable max, vs. XFS_BTREE_MAXLEVELS
which was an absolute design max and was larger than most filesystems
could create.]
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.
These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values. Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.
Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>