git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfsprogs: Release v5.18.0

Update all the necessary files for a 5.18.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: more autoconf modernisation

Fix a few autoconf things that were added after the submission of the
autoconf modernization patch. This was performed by running:

$ autoupdate configure.ac m4/*.m4

And manually putting the version back to 2.69.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.18.0-rc1

Update all the necessary files for a 5.18.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: Fix memory leak

'value' is allocated by strdup() in getstr(). It
needs to be freed as we do not keep any permanent
reference to it.

Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
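
The ownership rule behind the leak fix above can be sketched in C as follows. This is only an illustration: `parse_blocks_opt()` is a hypothetical helper, not the actual mkfs code, and it assumes the duplicated string is needed only long enough to parse a number from it.

```c
#define _POSIX_C_SOURCE 200809L	/* for strdup() under strict ISO modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not the real getstr()/caller pair: duplicate the
 * option string, parse it, then release the copy because nothing keeps
 * a pointer to it afterwards. Returns -1 if the copy cannot be made.
 */
long parse_blocks_opt(const char *arg)
{
	char *value;
	long blocks;

	assert(arg != NULL);
	value = strdup(arg);		/* heap copy, owned by this function */
	if (!value)
		return -1;
	blocks = strtol(value, NULL, 10);
	free(value);			/* without this, every call leaks the copy */
	return blocks;
}
```

Calling `parse_blocks_opt("4096")` yields 4096 and leaves no allocation behind; dropping the `free()` would leak one copy per call.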

xfsprogs: autoconf modernisation

Because apparently AC_TRY_COMPILE and AC_TRY_LINK have been
deprecated and made obsolete.

.....
configure.ac:164: warning: The macro `AC_TRY_COMPILE' is obsolete.
configure.ac:164: You should run autoupdate.
./lib/autoconf/general.m4:2847: AC_TRY_COMPILE is expanded from...
m4/package_libcdev.m4:68: AC_HAVE_GETMNTENT is expanded from...
configure.ac:164: the top level
configure.ac:165: warning: The macro `AC_TRY_LINK' is obsolete.
configure.ac:165: You should run autoupdate.
./lib/autoconf/general.m4:2920: AC_TRY_LINK is expanded from...
m4/package_libcdev.m4:84: AC_HAVE_FALLOCATE is expanded from...
configure.ac:165: the top level
.....

But "autoupdate" does nothing to fix this, so I have to manually do
these conversions:

- AC_TRY_COMPILE -> AC_COMPILE_IFELSE
- AC_TRY_LINK -> AC_LINK_IFELSE

because I have nothing better to do than fix currently working
code.

Also, running autoupdate forces the minimum pre-req to be autoconf
2.71 because it replaces other stuff...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: use AC_PREREQ of 2.69 vs 2.71 to avoid bleeding edge]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_io: add a quiet option to bulkstat

This is purely for driving the kernel bulkstat operations as hard
as userspace can drive them - we don't care about the actual output,
just want to drive maximum IO rates through the inode cache.

Bulkstat at 3.4 million inodes a second via xfs_io currently burns
about 30% of CPU time just formatting and outputting the stat
information to stdout and dumping it to /dev/null.

             wall time   rate     IOPS   bandwidth
unpatched    17.823s     3.4M/s   70k    1.9GB/s
with -q      15.682s     6.1M/s   150k   3.5GB/s

The disks are at about 30% of max bandwidth and only at 70kiops, so
this CPU can be used to drive the kernel and IO subsystem harder.

Wall time doesn't really go down on this specific test because of the
increase in inode cache turn-over: about 10GB/s of cached metadata
(in-core inodes and buffers) is being cycled through memory on a
machine with 16GB of RAM, and that hammers memory reclaim into an
utter mess that often takes seconds to recover from...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

metadump: be careful zeroing corrupt inode forks

When a corrupt inode fork is encountered, we can zero beyond the end
of the inode if the fork pointers are sufficiently trashed. We
should not trust the fork pointers when corruption is detected and
skip the zeroing in this case. We want metadump to capture the
corruption and so skipping the zeroing will give us the best chance
of preserving the corruption in a meaningful state for diagnosis.

Reported-by: Sean Caron <scaron@umich.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
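
A minimal C sketch of the "don't trust the fork pointers" idea described above. The function name, parameters, and the flat byte-buffer model of an inode are assumptions made for illustration; the real metadump code works on on-disk inode structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero the unused tail of a fork area inside an inode buffer, but only
 * when the (untrusted) offsets stay within the inode. Returns true if
 * zeroing was performed, false if it was skipped so that the corrupt
 * contents are preserved as-is. All names here are illustrative.
 */
bool zero_fork_tail(unsigned char *inode, size_t inode_size,
		    size_t used_end, size_t fork_end)
{
	assert(inode != NULL);

	/* Trashed pointers: leave the data alone rather than overrun. */
	if (fork_end > inode_size || used_end > fork_end)
		return false;

	memset(inode + used_end, 0, fork_end - used_end);
	return true;
}
```

Skipping the `memset()` in the untrusted case both avoids writing past the end of the buffer and keeps the corrupt bytes available for later diagnosis.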

metadump: handle corruption errors without aborting

Sean Caron reported that a metadump terminated after giving this
warning:

xfs_metadump: inode 2216156864 has unexpected extents

Metadump is supposed to ignore corruptions and continue dumping the
filesystem as best it can. Whilst it warns about many situations
where it can't fully dump structures, it should stop processing that
structure and continue with the next one until the entire filesystem
has been processed.

Unfortunately, some warning conditions also return an "abort" error
status, causing metadump to abort if that condition is hit. Most of
these abort conditions should really be "continue on next object"
conditions so that we attempt to dump the rest of the
filesystem.

Fix the returns for warnings that incorrectly cause aborts
such that the only abort conditions are read errors when
"stop-on-read-error" semantics are specified. Also make the return
values consistently mean abort/continue rather than returning -errno
to mean "stop because read error" and then trying to infer what
the error means in callers without the context it occurred in.

Reported-and-tested-by: Sean Caron <scaron@umich.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
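
The abort/continue convention described above might look roughly like the following C sketch. The status categories and function names are hypothetical; the point is only that corruption warnings continue with the next object, while a read error aborts solely when stop-on-read-error behaviour was requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What happened while dumping one object (illustrative categories). */
enum obj_status { OBJ_OK, OBJ_CORRUPT, OBJ_READ_ERROR };

/*
 * Returns true to continue with the next object, false to abort the dump.
 * Corruption only produces a warning; a read error aborts only when the
 * caller asked for stop-on-read-error behaviour.
 */
bool keep_dumping(enum obj_status status, bool stop_on_read_error)
{
	if (status == OBJ_READ_ERROR && stop_on_read_error)
		return false;
	return true;
}

/* Walk all objects; returns the index at which the walk ended. */
size_t dump_all(const enum obj_status *objs, size_t nr,
		bool stop_on_read_error)
{
	size_t i;

	assert(objs != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (!keep_dumping(objs[i], stop_on_read_error))
			break;
	}
	return i;
}
```

Because the return value means only "continue" or "abort", callers do not need to interpret an errno without knowing the context in which it occurred.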

xfs_db: take BB cluster offset into account when using 'type' cmd

Changing the interpretation type of the data under the cursor moves the
cursor to the beginning of the BB cluster. When the cursor is set to an
inode, the cursor is at an offset within the BB buffer. However, this
offset is not considered when the type of the data is changed - the
cursor points to the beginning of the BB buffer. For example:

$ xfs_db -c "inode 131" -c "daddr" -c "type text" \
-c "daddr" /dev/sdb1
current daddr is 131
current daddr is 128

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode

If we decide to restart an inode chunk walk because one of the inodes is
stale, make sure that we don't walk an inode that's been scanned before.
This ensures we always make forward progress.

Found by observing backwards inode scan progress while running xfs/285.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: add comment above forward-progress test]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
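
One way to picture the forward-progress rule above is as a mask operation on an inode chunk. The following C sketch assumes a 64-inode chunk described by a bitmask and a "next inode to scan" cursor; these names and the representation are illustrative and not taken from xfs_scrub.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the first inode number of a chunk, the bitmask of allocated
 * inodes in that chunk (bit i = inode first + i), and the next inode
 * number that still needs scanning, return the mask with every
 * already-scanned inode cleared. Illustrative only.
 */
uint64_t unscanned_mask(uint64_t first, uint64_t alloc_mask, uint64_t next)
{
	uint64_t skip;

	assert(next >= first);
	skip = next - first;
	if (skip >= 64)
		return 0;		/* whole chunk already scanned */
	return alloc_mask & (~0ULL << skip);
}
```

When a chunk walk is restarted, applying such a mask means inodes below the cursor are never visited a second time.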

xfs_scrub: balance inode chunk scan across CPUs

Use the bounded workqueue functionality to spread the inode chunk scan
load across the CPUs more evenly. First, we create per-AG workers to
walk each AG's inode btree to create inode batch work items for each
inobt record. These items are added to a (second) bounded workqueue
that invokes BULKSTAT and invokes the caller's function on each bulkstat
record.

By splitting the work items into batches of 64 inodes instead of one
thread per AG, we keep the level of parallelism at a reasonably high
level almost all the way to the end of the inode scan if the inodes are
not evenly divided across AGs or if a few files have far more extent
records than average.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: prepare phase3 for per-inogrp worker threads

In the next patch, we're going to rewrite scrub_scan_all_inodes to
schedule per-inogrp workqueue items that will run the iterator function.
In other words, the worker threads in phase 3 will soon cease to be
per-AG threads.

To prepare for this, we must modify phase 3 so that any writes to shared
state are protected by the appropriate per-AG locks. As far as I can
tell, the only updates to shared state are the per-AG action lists, so
create some per-AG locks for phase 3 and create locked wrappers for the
action_list_* functions if we find things to repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: widen action list length variables

On a 32-bit system it's possible for there to be so many items in the
repair list that we overflow a size_t. Widen this to unsigned long
long.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
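
The overflow described above can be simulated on any machine by using `uint32_t` as a stand-in for a 32-bit `size_t`. This C sketch assumes the count is accumulated from several per-list lengths, which is an illustrative simplification rather than a description of the actual xfs_scrub code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total as a 32-bit counter would see it (stand-in for 32-bit size_t). */
uint32_t total_items_narrow(const uint32_t *lens, size_t nr)
{
	uint32_t total = 0;
	size_t i;

	assert(lens != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		total += lens[i];	/* wraps modulo 2^32 */
	return total;
}

/* Same sum in unsigned long long, which is at least 64 bits wide. */
unsigned long long total_items_wide(const uint32_t *lens, size_t nr)
{
	unsigned long long total = 0;
	size_t i;

	assert(lens != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		total += lens[i];
	return total;
}
```

With lengths of 0xFFFFFFFF and 2, the narrow total wraps around to 1, while the wide total holds the true value of 4294967297.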

xfs_scrub: in phase 3, use the opened file descriptor for repair calls

While profiling the performance of xfs_scrub, I noticed that phase3 only
employs the scrub-by-handle interface for repairs. The kernel has had
the ability to skip the untrusted iget lookup if the fd matches the
handle data since the beginning, and using it reduces the repair runtime
by 5% on the author's system. Normally, we shouldn't be running that
many repairs or optimizations, but we did this for scrub, so we should
do the same for repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: make phase 4 go straight to fstrim if nothing to fix

If there's nothing to repair in phase 4, there's no need to hold up the
FITRIM call to do the summary count scan that prepares us to repair
filesystem metadata. Rearrange this a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: fix unfixable_errors test logic thinko]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: don't try any file repairs during phase 3 if AG metadata bad

Currently, phase 3 tries to repair file metadata even after phase 2
tells us that there are problems with the AG metadata. While this
generally won't cause too many problems since the repair code will bail
out on any obvious corruptions it finds, this isn't totally foolproof.
If the filesystem space metadata are not in good shape, we want to queue
the file repairs to run /after/ the space metadata repairs in phase 4.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: fall back to scrub-by-handle if opening handles fails

Back when I originally wrote xfs_scrub, I decided that phase 3 (the file
scrubber) should try to open all regular files by handle to pin the file
during the scrub.  At the time, I decided that an ESTALE return value
should cause the entire inode group (aka an inobt record) to be
rescanned for thoroughness reasons.

Over the past four years, I've realized that checking the return value
here isn't necessary.  For most runtime errors, we already fall back to
scrubbing with the file handle, at a fairly small performance cost.

For ESTALE, if the file has been freed and reallocated, its metadata has
been rewritten completely since bulkstat, so it's not necessary to check
it for latent disk errors.  If the file was freed, we can simply move on
to the next file.  If the filesystem is corrupt enough that
open-by-handle fails (this also results in ESTALE), we actually /want/
to fall back to scrubbing that file by handle, not rescanning the entire
inode group.

Therefore, remove the ESTALE check and leave a comment detailing these
findings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: in phase 3, use the opened file descriptor for scrub calls

While profiling the performance of xfs_scrub, I noticed that phase3 only
employs the scrub-by-handle interface. The kernel has had the ability
to skip the untrusted iget lookup if the fd matches the handle data
since the beginning, and using it reduces the phase 3 runtime by 5% on
the author's system.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: collapse trivial file scrub helpers

Remove all these trivial file scrub helper functions since they make
tracing code paths difficult and will become annoying in the patches
that follow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: check the ftype of dot and dotdot directory entries

The long-format directory block checking code skips the filetype check
for the '.' and '..' entries, even though they're part of the ondisk
format. This leads to repair failing to catch subtle corruption at the
start of a directory.

Found by fuzzing bu[0].filetype = zeroes in xfs/386.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: improve error reporting when checking rmap and refcount btrees

When we're checking the rmap and refcount btrees, push error reporting
down to the exact locations of failures so that the user sees both a
message specific to the failure that occurred and what repair was doing
at the time. This also removes quite a bit of return code handling,
since all the errors were fatal anyway.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: update subject and change to djwong's updated commit log]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: detect v5 featureset mismatches in secondary supers

Make sure we detect and correct mismatches between the V5 features
described in the primary and the secondary superblocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: add comment about XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: don't trample the gid set in the protofile

Catherine's recent changes to xfs/019 exposed a bug in how libxfs
handles setgid bits. mkfs reads the desired gid in from the protofile,
but if the parent directory is setgid, it will override the user's
setting and (re)set the child's gid to the parent's gid. Overriding
user settings is (probably) not the desired mode of operation, so create
a flag to struct cred to force the gid in the protofile.

It looks like this has been broken since ~2005.

Cc: Catherine Hoang <catherine.hoang@oracle.com>
Fixes: 9f064b7e ("Provide mkfs options to easily exercise all inheritable attributes, esp. the extsize allocator hint. Merge of master-melb:xfs-cmds:24370a by kenmcd.")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: round log size down if rounding log start up causes overflow

If rounding the log start up to the next stripe unit would cause the log
to overrun the end of the AG, round the log size down by a stripe unit.
We already ensured that logblocks was small enough to fit inside the AG,
so the minor adjustment should suffice.

This can be reproduced with:
mkfs.xfs -dsu=44k,sw=1,size=300m,file,name=fsfile -m rmapbt=0
and:
mkfs.xfs -dsu=48k,sw=1,size=512m,file,name=fsfile -m rmapbt=0

Reported-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: improve log extent validation

Use the standard libxfs fsblock verifiers to check the start and end of
the internal log. The current code does not catch the case of a
(segmented) fsblock that is beyond agf_blocks but not so large to change
the agno part of the segmented fsblock.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: don't let internal logs bump the root dir inode chunk to AG 1

Currently, we don't let an internal log consume every last block in an
AG.  According to the comment, we're doing this to avoid tripping AGF
verifiers if freeblks==0, but on a modern filesystem this isn't
sufficient to avoid problems because we need to have enough space in the
AG to allocate an aligned root inode chunk, if it should be the case
that the log also ends up in AG 0:

$ truncate -s 6366g /tmp/a ; mkfs.xfs -f /tmp/a -d agcount=3200 -l agnum=0
meta-data=/tmp/a                 isize=512    agcount=3200, agsize=521503 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=1668808704, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521492, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: root inode created in AG 1, not AG 0

Therefore, modify the maximum internal log size calculation to constrain
the maximum internal log size so that the aligned inode chunk allocation
will always succeed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: reduce internal log size when log stripe units are in play

Currently, one can feed mkfs a combination of options like this:

$ truncate -s 6366g /tmp/a ; mkfs.xfs -f /tmp/a -d agcount=3200 -d su=256k,sw=4
meta-data=/tmp/a                 isize=512    agcount=3200, agsize=521536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=1668808704, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521536, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Metadata corruption detected at 0x55e88052c6b6, xfs_agf block 0x1/0x200
libxfs_bwrite: write verifier failed on xfs_agf bno 0x1/0x1
mkfs.xfs: writing AG headers failed, err=117

The format fails because the internal log size sizing algorithm
specifies a log size of 521492 blocks to avoid taking all the space in
the AG, but align_log_size sees the stripe unit and rounds that up to
the next stripe unit, which is 521536 blocks.

Fix this problem by rounding the log size down if rounding up would
result in a log that consumes more space in the AG than we allow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
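
The rounding rule above can be checked with the numbers from this commit message. The following C sketch is an illustration, not the actual align_log_size code; it assumes the maximum permitted log size equals the 521492 blocks chosen by the sizing algorithm.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Align a log size (in blocks) to the stripe unit: round up when the
 * result still fits under max_blocks, otherwise round down.
 */
uint64_t align_log_blocks(uint64_t logblocks, uint64_t sunit,
			  uint64_t max_blocks)
{
	uint64_t up, down;

	assert(sunit > 0);
	down = (logblocks / sunit) * sunit;
	up = (logblocks % sunit) ? down + sunit : down;
	return up <= max_blocks ? up : down;
}
```

For 521492 blocks and a stripe unit of 64 blocks, rounding up gives 521536, which exceeds the assumed limit, so the sketch rounds down to 521472 instead.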

mkfs: fix missing validation of -l size against maximum internal log size

If a sysadmin specifies a log size explicitly, we don't actually check
that against the maximum internal log size that we compute for the
default log size computation. We're going to add more validation soon,
so refactor the max internal log blocks into a common variable and
add a check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix sizing of the incore rt space usage map calculation

If someone creates a realtime volume exactly *one* extent in length, the
sizing calculation for the incore rt space usage bitmap will be zero
because the integer division here rounds down. Use howmany() to round
up. Note that there can't be that many single-extent rt volumes since
repair will corrupt them into zero-extent rt volumes, and we haven't
gotten any reports.

Found by running xfs/530 after fixing xfs_repair to check the rt bitmap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
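
The rounding problem above is easy to reproduce in isolation. In this C sketch, `HOWMANY` is a local stand-in for the howmany() macro, and the 64-bit word size passed in the example is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for howmany(): divide, rounding up. */
#define HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

/* Number of bitmap words using plain integer division (rounds down). */
uint64_t words_truncating(uint64_t rtextents, uint64_t bits_per_word)
{
	assert(bits_per_word > 0);
	return rtextents / bits_per_word;
}

/* Number of bitmap words when rounding up, so a partial word counts. */
uint64_t words_rounded_up(uint64_t rtextents, uint64_t bits_per_word)
{
	assert(bits_per_word > 0);
	return HOWMANY(rtextents, bits_per_word);
}
```

For a single extent and 64 bits per word, the truncating version yields 0 words, while the rounded-up version yields 1.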

xfs_db: report absolute maxlevels for each btree type

Augment the xfs_db btheight command so that the debugger can display the
absolute maximum btree height for each btree type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: support computing btheight for all cursor types

Add the special magic btree type value 'all' to the btheight command so
that we can display information about all known btree types at once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: warn about suspicious btree levels in AG headers

Warn about suspicious AG btree levels in the AGF and AGI.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: warn about suspicious finobt trees when metadumping

We warn about suspicious roots and btree heights before metadumping the
inode btree, so do the same for the free inode btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: note the removal of XFS_IOC_FSSETDM in the documentation

Update the documentation to note the removal of this ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: fix a complaint about a printf buffer overrun

gcc 11 warns that stack_f doesn't allocate a sufficiently large buffer
to hold the printf output. I don't think the io cursor stack is really
going to grow to 4 billion levels deep, but let's fix this anyway.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
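
A general way to size and check such a buffer is sketched below in C. The `"%d:"` format and the function name are illustrative assumptions; the commit message does not show the format string used by stack_f.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Enough room for the decimal digits of any int, a sign, and the
 * terminating NUL: each byte contributes at most 3 decimal digits
 * (since 2^8 < 10^3).
 */
#define INT_STR_MAX	(sizeof(int) * 3 + 2)

/*
 * Format a depth tag such as "12:" (format is illustrative). Returns
 * the number of characters written, or -1 if the buffer was too small.
 */
int format_depth_tag(char *buf, size_t len, int depth)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%d:", depth);
	/* snprintf reports the length it needed; detect truncation. */
	return (n >= 0 && (size_t)n < len) ? n : -1;
}
```

A buffer of `INT_STR_MAX + 1` bytes is large enough for any int followed by the colon, and the return value check catches callers that pass something smaller.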

xfs_scrub: move to mallinfo2 when available

Starting with glibc 2.35, the mallinfo library call has finally been
upgraded to return 64-bit memory usage quantities. Migrate to the new
call, since it also warns about mallinfo being deprecated.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: support multiarch for libhandle

For nearly a decade now, Debian and derivatives have supported the
"multiarch" layout, where shared libraries are installed to
/lib/<gcc triple>/ instead of /lib. This enables a single rootfs to
support binaries from multiple architectures (e.g. i386 inside an amd64
system). We should follow this, since libhandle is useful.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: bump compat level to 11

Increase the compat level to 11 since we're now getting warnings about
level 9 being obsolescent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: refactor common options

Don't respecify identical configure options; move them into the
configure_options variable.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.18.0-rc0

Update all the necessary files for a 5.18.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mm/fs: delete PF_SWAPWRITE

Source kernel commit: b698f0a1773f7df73f2bb4bfe0e597ea1bb3881f

PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").

Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE. But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.

Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Jan Kara <jack@suse.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: document the XFS_ALLOC_AGFL_RESERVE constant

Source kernel commit: 93defd5a15dd74791532692cc59be3a1aaa045fe

Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full. This has led to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: constify xfs_name_dotdot

Source kernel commit: 744e6c8ada5d612353a42ce8cd8323dd2364a70d

The symbol xfs_name_dotdot is a global variable that the xfs codebase
uses here and there to look up directory dotdot entries. Currently it's
a non-const variable, which means that it's a mutable global variable.
So far nobody's abused this to cause problems, but let's use the
compiler to enforce that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
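
The effect of constifying a global can be sketched in C as follows. `name_sketch` is a simplified stand-in and not the real xfs_name structure; the example only shows that read-only users taking a pointer-to-const work unchanged with a const global.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a directory entry name; not the real struct. */
struct name_sketch {
	const char	*name;
	size_t		len;
};

/* A const global: any accidental write becomes a compile-time error. */
const struct name_sketch name_dotdot = { "..", 2 };

/* Read-only users take a pointer-to-const, so the const global fits. */
int name_equals(const struct name_sketch *a, const struct name_sketch *b)
{
	assert(a != NULL && b != NULL);
	return a->len == b->len && memcmp(a->name, b->name, a->len) == 0;
}
```

An assignment such as `name_dotdot.len = 3;` would now be rejected by the compiler instead of silently changing shared state.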

xfs: constify the name argument to various directory functions

Source kernel commit: 996b2329b20a89963fa577d495cf057dd7bf129c

Various directory functions do not modify their @name parameter,
so mark it const to make that clear. This will enable us to mark
the global xfs_name_dotdot variable as const to prevent mischief.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the XFS_IOC_{ALLOC,FREE}SP* definitions

Source kernel commit: b3bb9413e717b44e4aea833d07f14e90fb91cf97

Now that we've made these ioctls defunct, move them from xfs_fs.h to
xfs_ioctl.c, which effectively removes them from the publicly supported
ioctl interfaces for XFS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the XFS_IOC_FSSETDM definitions

Source kernel commit: 9dec0368b9640c09ef5af48214e097245e57a204

Remove the definitions for these ioctls, since the functionality (and,
weirdly, the 32-bit compat ioctl definitions) were removed from the
kernel in November 2019.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: pass the mapping flags to xfs_bmbt_to_iomap

Source kernel commit: 740fd671e04f8a977018eb9cfe440b4817850f0d

To prepare for looking at the IOMAP_DAX flag in xfs_bmbt_to_iomap pass in
the input mapping flags to xfs_bmbt_to_iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-24-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.16.0

Update all the necessary files for a 5.16.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: remove kernel stubs from xfs_shared.h

The kernel stubs added to xfs_shared.h don't belong there, and are
mostly unnecessary with the #ifdef __KERNEL__ bits added to the
xfs_ag.[ch] files. Move the one remaining needed stub into libxfs_priv.h.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: Generate .gitcensus instead of .census (Closes: #999743)

Fix the Debian build outside a git tree (e.g., Debian archive builds) by
creating an empty .gitcensus instead of .census file on config.

Signed-off-by: Bastian Germann <bage@debian.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.16.0-rc0

Update all the necessary files for a 5.16.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Fix the free logic of state in xfs_attr_node_hasname

Source kernel commit: a1de97fe296c52eafc6590a3506f4bbd44ecb19a

When testing xfstests xfs/126 on the latest upstream kernel, it hangs on some machines.
After adding a getxattr operation after the xattr is corrupted, I can reproduce it 100% of the time.

The deadlock is shown below:
[  983.923403] task:setfattr        state:D stack:    0 pid:17639 ppid: 14687 flags:0x00000080
[  983.923405] Call Trace:
[  983.923410]  __schedule+0x2c4/0x700
[  983.923412]  schedule+0x37/0xa0
[  983.923414]  schedule_timeout+0x274/0x300
[  983.923416]  __down+0x9b/0xf0
[  983.923451]  ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[  983.923453]  down+0x3b/0x50
[  983.923471]  xfs_buf_lock+0x33/0xf0 [xfs]
[  983.923490]  xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[  983.923508]  xfs_buf_get_map+0x4c/0x320 [xfs]
[  983.923525]  xfs_buf_read_map+0x53/0x310 [xfs]
[  983.923541]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
[  983.923560]  xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
[  983.923575]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
[  983.923590]  xfs_da_read_buf+0xcf/0x120 [xfs]
[  983.923606]  xfs_da3_node_read+0x1f/0x40 [xfs]
[  983.923621]  xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
[  983.923624]  ? kmem_cache_alloc+0x12e/0x270
[  983.923637]  xfs_attr_node_hasname+0x6e/0xa0 [xfs]
[  983.923651]  xfs_has_attr+0x6e/0xd0 [xfs]
[  983.923664]  xfs_attr_set+0x273/0x320 [xfs]
[  983.923683]  xfs_xattr_set+0x87/0xd0 [xfs]
[  983.923686]  __vfs_removexattr+0x4d/0x60
[  983.923688]  __vfs_removexattr_locked+0xac/0x130
[  983.923689]  vfs_removexattr+0x4e/0xf0
[  983.923690]  removexattr+0x4d/0x80
[  983.923693]  ? __check_object_size+0xa8/0x16b
[  983.923695]  ? strncpy_from_user+0x47/0x1a0
[  983.923696]  ? getname_flags+0x6a/0x1e0
[  983.923697]  ? _cond_resched+0x15/0x30
[  983.923699]  ? __sb_start_write+0x1e/0x70
[  983.923700]  ? mnt_want_write+0x28/0x50
[  983.923701]  path_removexattr+0x9b/0xb0
[  983.923702]  __x64_sys_removexattr+0x17/0x20
[  983.923704]  do_syscall_64+0x5b/0x1a0
[  983.923705]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  983.923707] RIP: 0033:0x7f080f10ee1b

When getxattr calls the xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in
xfs_attr_node_hasname because we have used blocktrash to randomize it in xfs/126. So it
frees the state internally, and xfs_attr_node_get doesn't do the xfs_buf_trans release job.

Then subsequent removexattr will hang because of it.

This bug was introduced by kernel commit 07120f1abdff ("xfs: Add xfs_has_attr and subroutines").
It added the xfs_attr_node_hasname helper and said the caller would be responsible for freeing the state
in this case. But xfs_attr_node_hasname frees the state itself, instead of the caller, if
xfs_da3_node_lookup_int fails.

Fix this bug by moving the step of free state into caller.

Also, use "goto error/out" instead of returning error directly in xfs_attr_node_addname_find_attr and
xfs_attr_node_removename_setup function because we should free state ourselves.
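
The ownership rule described above can be sketched in plain userspace C.
This is a minimal illustration, not the XFS code: lookup_state,
node_hasname and node_get are hypothetical stand-ins, and -EIO stands in
for EFSCORRUPTED. The helper never frees the state, and the caller
releases it on every exit path through a single label.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the da-btree lookup state; not the XFS type. */
struct lookup_state {
	int held_buffers;	/* resources that must be released exactly once */
};

static int live_states;	/* counts states that have not been freed yet */

static struct lookup_state *state_alloc(void)
{
	struct lookup_state *st = calloc(1, sizeof(*st));

	if (st)
		live_states++;
	return st;
}

static void state_free(struct lookup_state *st)
{
	if (!st)
		return;
	live_states--;
	free(st);
}

/*
 * Helper: allocates the state and hands it to the caller through *statep
 * on success and on lookup failure alike. It never frees the state itself.
 */
static int node_hasname(int simulate_corruption, struct lookup_state **statep)
{
	struct lookup_state *st = state_alloc();

	if (!st)
		return -ENOMEM;
	*statep = st;
	if (simulate_corruption)
		return -EIO;	/* stand-in for EFSCORRUPTED; caller owns st */
	st->held_buffers = 1;
	return 0;
}

/* Caller: every exit path goes through the same cleanup label. */
static int node_get(int simulate_corruption)
{
	struct lookup_state *st = NULL;
	int error;

	error = node_hasname(simulate_corruption, &st);
	if (error)
		goto out_release;
	/* ... a real caller would use the state here ... */
out_release:
	state_free(st);
	return error;
}
```

Calling node_get(1) returns the error and still leaves live_states at
zero, which is the property the fix restores.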

Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: #ifdef out perag code for userspace

Source kernel commit: 29f11fce211c7fcf32713457c031e71785fb6088

The xfs_perag structure and initialization is unused in userspace,
so #ifdef it out with __KERNEL__ to facilitate the xfsprogs sync
and build.

Signed-off-by: Eric Sandeen <esandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use swap() to make dabtree code cleaner

Source kernel commit: 5b068aadf62da006891383f6b23e47bc3ad49995

Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid
opencoding it.
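
As a rough userspace illustration of what is being replaced, the sketch
below defines a local swap() stand-in using the GNU C __typeof__
extension. The macro body and the order_pair() helper are assumptions
made for this example, not the kernel's definition.

```c
#include <assert.h>

/*
 * Local stand-in for the kernel's swap() helper, using the GNU C
 * __typeof__ extension (supported by GCC and Clang).
 */
#define swap(a, b) \
	do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Hypothetical helper: make sure *lo <= *hi, swapping if needed. */
static void order_pair(int *lo, int *hi)
{
	if (*lo > *hi)
		swap(*lo, *hi);
}
```

The open-coded form would need its own temporary variable and three
assignments at every call site.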

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove unused parameter from refcount code

Source kernel commit: c04c51c524697cd68d668d595f8ebc381ffe426b

The owner info parameter is always NULL, so get rid of the parameter.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: reduce the size of struct xfs_extent_free_item

Source kernel commit: b3b5ff412ab04afd99173bb12d3cc146ee478ae7

We only use EFIs to free metadata blocks -- not regular data/attr fork
extents. Remove all the fields that we never use, for a net reduction
of 16 bytes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename xfs_bmap_add_free to xfs_free_extent_later

Source kernel commit: c201d9ca5392b20f04882848a071025b0e194c17

xfs_bmap_add_free isn't a block mapping function; it schedules deferred
freeing operations for a later point in a compound transaction chain.
While it's primarily used by bunmapi, its use has expanded beyond that.
Move it to xfs_alloc.c and rename the function since it's now general
freeing functionality. Bring the slab cache bits in line with the
way we handle the other intent items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: create slab caches for frequently-used deferred items

Source kernel commit: f3c799c22c661e181c71a0d9914fc923023f65fb

Create slab caches for the high-level structures that coordinate
deferred intent items, since they're used fairly heavily.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: compact deferred intent item structures

Source kernel commit: 9e253954acf53227f33d307f5ac5ff94c1ca5880

Rearrange these structs to reduce the amount of unused padding bytes.
This saves eight bytes for each of the three structs changed here, which
means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even
powers of two.
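
The effect of reordering fields can be shown with two made-up layouts.
These are hypothetical structs, not the rmap/bmap/refcount intent items.
On a typical x86-64 ABI the first is 32 bytes and the second 24, but the
exact sizes depend on the platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts; these are not the XFS intent item structures. */
struct item_padded {
	uint8_t  type;		/* 1 byte, then padding before the 8-byte field */
	uint64_t startblock;
	uint8_t  flags;		/* 1 byte, then padding before the next field */
	uint64_t blockcount;
};

struct item_compact {
	uint64_t startblock;	/* widest fields first */
	uint64_t blockcount;
	uint8_t  type;		/* the small fields now sit next to each other */
	uint8_t  flags;
};
```

Tools such as pahole report the same information for real kernel
structures.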

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename _zone variables to _cache

Source kernel commit: 182696fb021fc196e5cbe641565ca40fcf0f885a

Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove kmem_zone typedef

Source kernel commit: e7720afad068a6729d9cd3aaa08212f2f5a7ceff

Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use separate btree cursor cache for each btree type

Source kernel commit: 9fa47bdcd33b117599e9ee3f2e315cb47939ac2d

Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree. For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: compute absolute maximum nlevels for each btree type

Source kernel commit: 0ed5f7356daee74244b02e100b3cc043e886e686

Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type. This is a setup for
the next patch, which makes every btree type have its own cursor cache.

The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: kill XFS_BTREE_MAXLEVELS

Source kernel commit: bc8883eb775dd18d8b84733d8b3a3955b72d103a

Nobody uses this symbol anymore, so kill it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: stop using XFS_BTREE_MAXLEVELS

Use the precomputed per-btree-type max height values.

[sandeen: note that >= changes to > here; The maximal value is
fine, but with the precomputed value specific to this filesystem,
our new limit is the actual acceptable max, vs. XFS_BTREE_MAXLEVELS
which was an absolute design max and was larger than most filesystems
could create.]

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: stop using XFS_BTREE_MAXLEVELS

Use the precomputed per-btree-type max height values.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: compute the maximum height of the rmap btree when reflink enabled

Source kernel commit: 9ec691205e7d4a11190519df6561a168ae6af3a4

Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: clean up xfs_btree_{calc_size,compute_maxlevels}

Source kernel commit: 1b236ad7ba800bc3e9994881a8a453eb8bf5ca0f

During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.

These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values. Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.

Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.
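
A minimal sketch of the two calculations, assuming a limits[] array
that holds the minimum records per leaf block and the minimum
keys/pointers per node block. The function names and signatures here are
illustrative and are not claimed to match libxfs.

```c
#include <assert.h>
#include <stdint.h>

/* Round-up division on 64-bit values, so counts above 2^32 are safe. */
static uint64_t howmany_64(uint64_t x, uint64_t y)
{
	return x / y + (x % y != 0);
}

/*
 * Worst-case height of a btree holding `records` leaf records.
 * limits[0] is the minimum records per leaf block and limits[1] the
 * minimum keys/pointers per node block (both hypothetical inputs here).
 */
static unsigned int compute_maxlevels(const unsigned int limits[2],
				      uint64_t records)
{
	uint64_t level_blocks;
	unsigned int height = 1;

	assert(limits[0] > 0 && limits[1] > 1);
	level_blocks = howmany_64(records, limits[0]);
	while (level_blocks > 1) {
		level_blocks = howmany_64(level_blocks, limits[1]);
		height++;
	}
	return height;
}

/* Total blocks in that worst-case tree: same walk, different result. */
static uint64_t calc_size(const unsigned int limits[2], uint64_t records)
{
	uint64_t level_blocks;
	uint64_t blocks;

	assert(limits[0] > 0 && limits[1] > 1);
	level_blocks = howmany_64(records, limits[0]);
	blocks = level_blocks;
	while (level_blocks > 1) {
		level_blocks = howmany_64(level_blocks, limits[1]);
		blocks += level_blocks;
	}
	return blocks;
}
```

With limits of {4, 4} and 64 records, the walk sees 16 leaf blocks, then
4 node blocks, then 1 root block: a height of 3 and 21 blocks in total.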

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: compute maximum AG btree height for critical reservation calculation

Source kernel commit: b74e15d720d0764345934ebb599a99a077c52533

Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low. This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold. This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename m_ag_maxlevels to m_alloc_maxlevels

Source kernel commit: 7cb3efb4cfdd4f3eb1f36b0ce39254b848ff2371

Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount. Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees. This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: dynamically allocate cursors based on maxlevels

Source kernel commit: c940a0c54a2e9333478f1d87ed40006a04fcec7e

To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees. Switch the maxlevels computation to
use the precomputed values in the superblock, and create cursors that
can handle a certain height. For now, we retain the btree cursor cache
that can handle up to 9-level btrees, though a subsequent patch
introduces separate caches for each btree type, where each cache's
objects will be exactly tall enough to handle the specific btree type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: encode the max btree height in the cursor

Source kernel commit: c0643f6fdd6d3c448142ed1492a9a6b6505f9afb

Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: refactor btree cursor allocation function

Source kernel commit: 56370ea6e5fe3e3d6e1ca2da58f95fb0d5e1779f

Refactor btree allocation to a common helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rearrange xfs_btree_cur fields for better packing

Source kernel commit: 69724d920e7c30ca4421af615c499e92cfcc550b

Reduce the size of the btree cursor structure some more by rearranging
fields to eliminate unused space. While we're at it, fix the ragged
indentation and a spelling error.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: prepare xfs_btree_cur for dynamic cursor heights

Source kernel commit: 6ca444cfd663545e9e1c19ad2695836ffafad0a6

Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a VLA. Files with huge data forks
(and in the future, the realtime rmap btree) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create per-btree type cursor caches to conserve memory
for the more common case.

Note that a subsequent patch actually introduces dynamic cursor heights.
This one merely rearranges the structure to prepare for that.
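
The layout idea can be sketched with a C99 flexible array member (what
the message above calls a VLA). All names below are hypothetical and the
fields are reduced to the minimum needed to show how the allocation size
follows the maximum height.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-level state; field names are illustrative only. */
struct cur_level {
	void     *bp;	/* buffer for this level */
	uint16_t  ptr;	/* record/key index within the block */
};

/* Cursor with a flexible array member sized at allocation time. */
struct cursor {
	uint8_t          nlevels;	/* current height of the tree */
	uint8_t          maxlevels;	/* capacity of levels[] */
	struct cur_level levels[];
};

/* Bytes needed for a cursor that can walk a tree of the given height. */
static size_t cursor_sizeof(unsigned int maxlevels)
{
	return offsetof(struct cursor, levels) +
	       maxlevels * sizeof(struct cur_level);
}

/* Allocate a zeroed cursor; the caller releases it with free(). */
static struct cursor *cursor_alloc(unsigned int maxlevels)
{
	struct cursor *cur;

	assert(maxlevels > 0 && maxlevels <= UINT8_MAX);
	cur = calloc(1, cursor_sizeof(maxlevels));
	if (cur)
		cur->maxlevels = (uint8_t)maxlevels;
	return cur;
}
```

A cache per btree type would then use cursor_sizeof() with that type's
maximum height as its object size.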

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: reduce the size of nr_ops for refcount btree cursors

Source kernel commit: efb79ea31067ae3dd0f348eb06e6b9a5e9907078

We're never going to run more than 4 billion btree operations on a
refcount cursor, so shrink the field to an unsigned int to reduce the
structure size. Fix whitespace alignment too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove xfs_btree_cur.bc_blocklog

Source kernel commit: cc411740472d958b718b9c6a7791ba00d88f7cef

This field isn't used by anyone, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix perag reference leak on iteration race with growfs

Source kernel commit: 892a666fafa19ab04b5e948f6c92f98f1dafb489

The for_each_perag*() set of macros are hacky in that some (i.e.
those based on sb_agcount) rely on the assumption that perag
iteration terminates naturally with a NULL perag at the specified
end_agno. Others allow for the final AG to have a valid perag and
require the calling function to clean up any potential leftover
xfs_perag reference on termination of the loop.

Aside from providing a subtly inconsistent interface, the former
variant is racy with growfs because growfs can create discoverable
post-eofs perags before the final superblock update that completes
the grow operation and increases sb_agcount. This leads to the
following assert failure (reproduced by xfs/104) in the perag free
path during unmount:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/libxfs/xfs_ag.c, line: 195

This occurs because one of the many for_each_perag() loops in the
code that is expected to terminate with a NULL pag (and thus has no
post-loop xfs_perag_put() check) raced with a growfs and found a
non-NULL post-EOFS perag, but terminated naturally based on the
end_agno check without releasing the post-EOFS perag.

Rework the iteration logic to lift the agno check from the main for
loop conditional to the iteration helper function. The for loop now
purely terminates on a NULL pag and xfs_perag_next() avoids taking a
reference to any perag beyond end_agno in the first place.
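
The reworked loop shape can be modelled in userspace as follows. This is
a sketch under simplifying assumptions: a fixed table replaces the perag
radix tree, there is no locking, and the names only loosely mirror the
kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_AGS 8

/* Hypothetical refcounted per-AG object; not the real xfs_perag. */
struct perag {
	unsigned int agno;
	int ref;
	int present;	/* 1 if this slot has been instantiated */
};

static struct perag ag_table[MAX_AGS];

static struct perag *perag_get(unsigned int agno)
{
	if (agno >= MAX_AGS || !ag_table[agno].present)
		return NULL;
	ag_table[agno].ref++;
	return &ag_table[agno];
}

static void perag_put(struct perag *pag)
{
	pag->ref--;
}

/*
 * Drop the current reference and only take the next one if it is still
 * at or below end_agno. Past the end we return NULL without ever
 * touching the refcount of a post-EOFS perag.
 */
static struct perag *perag_next(struct perag *pag, unsigned int *agno,
				unsigned int end_agno)
{
	*agno = pag->agno + 1;
	perag_put(pag);
	if (*agno > end_agno)
		return NULL;
	return perag_get(*agno);
}

/* The loop condition is purely "pag != NULL". */
#define for_each_perag_range(agno, end_agno, pag) \
	for ((pag) = perag_get((agno)); (pag) != NULL; \
	     (pag) = perag_next((pag), &(agno), (end_agno)))

/* Example walk: count the AGs visited in [0, end_agno]. */
static unsigned int count_ags(unsigned int end_agno)
{
	struct perag *pag;
	unsigned int agno = 0, seen = 0;

	for_each_perag_range(agno, end_agno, pag)
		seen++;
	return seen;
}
```

If three perags exist but end_agno is 1, the walk visits two of them and
leaves every reference count at zero, including the one past the end.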

Fixes: f250eedcf762 ("xfs: make for_each_perag... a first class citizen")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: terminate perag iteration reliably on agcount

Source kernel commit: 8ed004eb9d07a5d6114db3e97a166707c186262d

The for_each_perag_from() iteration macro relies on sb_agcount to
process every perag currently within EOFS from a given starting
point. It's perfectly valid to have perag structures beyond
sb_agcount, however, such as if a growfs is in progress. If a perag
loop happens to race with growfs in this manner, it will actually
attempt to process the post-EOFS perag where ->pag_agno ==
sb_agcount. This is reproduced by xfs/104 and manifests as the
following assert failure in superblock write verifier context:

XFS: Assertion failed: agno < mp->m_sb.sb_agcount, file: fs/xfs/libxfs/xfs_types.c, line: 22

Update the corresponding macro to only process perags that are
within the current sb_agcount.

Fixes: 58d43a7e3263 ("xfs: pass perags around in fsmap data dev functions")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename the next_agno perag iteration variable

Source kernel commit: f1788b5e5ee25bedf00bb4d25f82b93820d61189

Rename the next_agno variable to be consistent across the several
iteration macros and shorten line length.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fold perag loop iteration logic into helper function

Source kernel commit: bf2307b195135ed9c95eebb38920d8bd41843092

Fold the loop iteration logic into a helper in preparation for
further fixups. No functional change in this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the xfs_dqblk_t typedef

Source kernel commit: 11a83f4c393040dc3a6a368c6399785dbfae7602

Remove the few leftover instances of the xfs_dqblk_t typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the xfs_dsb_t typedef

Source kernel commit: ed67ebfd7c4061b4b505ac42eb00e08dd09f4d38

Remove the few leftover instances of the xfs_dsb_t typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the xfs_dinode_t typedef

Source kernel commit: de38db7239c4bd2f37ebfcb8a5f22b4e8e657737

Remove the few leftover instances of the xfs_dinode_t typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: check that bc_nlevels never overflows

Source kernel commit: 4c175af2ccd3e0d618b2af941e656fabc453c4af

Warn if we ever bump nlevels higher than the allowed maximum cursor
height.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove xfs_btree_cur_t typedef

Source kernel commit: ae127f087dc22b6e37edc870079abf0721a6aed0

Get rid of this old typedef before we start changing other things.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix maxlevels comparisons in the btree staging code

Source kernel commit: 78e8ec83a404d63dcc86b251f42e4ee8aff27465

The btree geometry computation function has an off-by-one error in that
it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
This can result in repairs failing unnecessarily on very fragmented
filesystems. Subsequent patches to remove MAXLEVELS usage in favor of
the per-btree type computations will make this a much more likely
occurrence.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: port the defer ops capture and continue to resource capture

Source kernel commit: 512edfac85d243ed6a5a5f42f513ebb7c2d32863

When log recovery tries to recover a transaction that had log intent
items attached to it, it has to save certain parts of the transaction
state (reservation, dfops chain, inodes with no automatic unlock) so
that it can finish single-stepping the recovered transactions before
finishing the chains.

This is done with the xfs_defer_ops_capture and xfs_defer_ops_continue
functions. Right now they open-code this functionality, so let's port
this to the formalized resource capture structure that we introduced in
the previous patch. This enables us to hold up to two inodes and two
buffers during log recovery, the same way we do for regular runtime.

With this patch applied, we'll be ready to support atomic extent swap
which holds two inodes; and logged xattrs which holds one inode and one
xattr leaf buffer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: formalize the process of holding onto resources across a defer roll

Source kernel commit: c5db9f937b2971c78d6c6bbaa61a6450efa8b845

Transaction users are allowed to flag up to two buffers and two inodes
for ownership preservation across a deferred transaction roll. Hoist
the variables and code responsible for this out of xfs_defer_trans_roll
so that we can use it for the defer capture mechanism.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use kmem_cache_free() for kmem_cache objects

Source kernel commit: c30a0cbd07ecc0eec7b3cd568f7b1c7bb7913f93

For kmalloc() allocations, SLOB prepends a 4-byte header to each block
and stores the size of the allocated block in that header.
Blocks allocated with kmem_cache_alloc() do not have that header.

SLOB explodes when you allocate memory with kmem_cache_alloc() and then
try to free it with kfree() instead of kmem_cache_free().
SLOB will assume that there is a header when there is none, read
garbage into the size variable, and corrupt the adjacent objects, which
eventually leads to a hang or panic.

Let's make XFS work with SLOB by using the proper free function.
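
The mismatch can be pictured with a toy userspace allocator. This is not
SLOB's implementation: the toy_* names are invented for this sketch, and
the header is widened to sizeof(max_align_t) so the payload stays
aligned, whereas the message above describes a 4-byte header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header size rounded up so the payload stays suitably aligned. */
#define TOY_HDR sizeof(max_align_t)

/* kmalloc-like: store the request size in a header before the payload. */
static void *toy_kmalloc(size_t size)
{
	unsigned char *raw = malloc(TOY_HDR + size);

	if (!raw)
		return NULL;
	*(size_t *)raw = size;
	return raw + TOY_HDR;
}

/* Only valid for toy_kmalloc() pointers: it trusts the header. */
static size_t toy_ksize(const void *p)
{
	return *(const size_t *)((const unsigned char *)p - TOY_HDR);
}

static void toy_kfree(void *p)
{
	if (p)
		free((unsigned char *)p - TOY_HDR);
}

/* Cache-like: fixed-size objects with no header at all. */
struct toy_cache {
	size_t objsize;
};

static void *toy_cache_alloc(const struct toy_cache *c)
{
	return malloc(c->objsize);
}

/* The matching free; toy_kfree() here would read a header that is absent. */
static void toy_cache_free(const struct toy_cache *c, void *obj)
{
	(void)c;
	free(obj);
}
```

Passing a toy_cache_alloc() pointer to toy_kfree() would step back over
memory that was never part of the allocation, which is the class of bug
this commit removes.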

Fixes: 9749fee83f38 ("xfs: enable the xfs_defer mechanism to process extents to free")
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix AG header btree level comparisons

It's not an error if repair encounters a btree with the maximal
height, so don't print warnings. Also, we don't allow zero-height
btrees.
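
Expressed as a predicate, the intended check looks roughly like this.
btree_levels_ok() is a hypothetical helper written for illustration; it
is not a function in xfs_repair.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A btree height is valid when it is at least 1 and no taller than the
 * maximum computed for this filesystem. The maximum itself is allowed.
 */
static bool btree_levels_ok(unsigned int nlevels, unsigned int maxlevels)
{
	return nlevels >= 1 && nlevels <= maxlevels;
}
```

The earlier off-by-one corresponds to writing "nlevels < maxlevels" in
the second half of the test.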

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: fix metadump level comparisons

It's not an error if metadump encounters a btree with the maximal
height, so don't print warnings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.15.0

Update all the necessary files for a 5.15.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: increase the minimum log size to 64MB when possible

(Commit log from Darrick J. Wong):

Recently, the upstream maintainers have been taking a lot of heat on
account of writer threads encountering high latency when asking for log
grant space when the log is small.  The reported use case is a heavily
threaded indexing product logging trace information to a filesystem
ranging in size between 20 and 250GB.  The meetings that result from the
complaints about latency and stall warnings in dmesg both from this use
case and also a large well known cloud product are now consuming 25% of
the maintainer's weekly time and have been for months.

For small filesystems, the log is small by default because we have
defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
this is even worse, because big filesystems generate big metadata.
However, the log size is still insufficient even if it is formatted at
the larger size.

On a 220GB filesystem, the 99.95% latencies observed with a 200-writer
file synchronous append workload running on a 44-AG filesystem (with 44
CPUs) spread across 4 hard disks showed:

          99.5%
Log(MB)   Latency(ms)   BW (MB/s)   xlog_grant_head_wait
10        520           243         1875
20        220           308         540
40        140           360         6
80        92            363         0
160       86            364         0

For 4 NVME, the results were:

10        201           409         898
20        177           488         144
40        122           550         0
80        120           549         0
160       121           545         0

This shows pretty clearly that we could reduce the amount of time that
threads spend waiting on the XFS log by increasing the log size to at
least 40MB regardless of size.  We then repeated the benchmark with a
cloud system and an old machine to see if there were any ill effects on
less stable hardware.

For cloudy iscsi block storage, the results were:

10        390           176         2584
20        173           186         357
40        37            187         0
80        40            183         0
160       37            183         0

A decade-old machine w/ 24 CPUs and a giant spinning disk RAID6 array
produced this:

10        55            5.4         0
20        40            5.9         0
40        62            5.7         0
80        66            5.7         0
160       25            5.4         0

From the first three scenarios, it is clear that there are gains to be
had by sizing the log somewhere between 40 and 80MB -- the long tail
latency drops quite a bit, and programs are no longer blocking on the
log's transaction space grant heads.  Split the difference and set the
log size floor to 64MB.

This patch/behavior was originally proposed by Darrick Wong, rewritten
to avoid extra heuristics and dependencies on other pending changes.
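
A simplified model of the resulting default, assuming the 1:2048 ratio
mentioned above and reading the threshold note in the trailers as "apply
the 64MB floor to filesystems of at least 300MB". default_log_bytes() is
a hypothetical function; the real mkfs code applies further limits (AG
size, maximum log size, alignment) that are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/*
 * Simplified model only: real mkfs applies more constraints (AG size,
 * stripe alignment, maximum log size) that are ignored here.
 */
static uint64_t default_log_bytes(uint64_t fs_bytes)
{
	uint64_t log_bytes = fs_bytes / 2048;	/* the 1:2048 default ratio */

	/* Apply the 64MB floor once the filesystem is at least 300MB. */
	if (fs_bytes >= 300 * MiB && log_bytes < 64 * MiB)
		log_bytes = 64 * MiB;
	return log_bytes;
}
```

Under this model a 100GB filesystem moves from a 50MB log to 64MB, while
a 220GB filesystem keeps the 110MB that the ratio already gives it.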

Inspired-by: Darrick J. Wong <djwong@kernel.org>
Commit-log-stolen-from: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: move "at least 64mb" threshold to 300mb filesystem]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: retry scrub (and repair) of items that are ok except for XFAIL

Sometimes a metadata object will pass all of the obvious scrubber
checks, but we won't be able to cross-reference the object's records
with other metadata objects (e.g. a file data fork and a free space
btree both claim ownership of an extent). When this happens during the
checking phase, we should queue the object for a repair, which means
that phase 4 will keep re-evaluating the object as repairs proceed.
Eventually, the hope is that we'll fix the filesystem and everything
will scrub cleanly; if not, we recommend running xfs_repair as a second
attempt to fix the inconsistency.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: fix xfrog_scrub_metadata error reporting

Commit de5d20ec converted xfrog_scrub_metadata to return negative error
codes directly, but forgot to fix up the str_errno calls to use
str_liberror. This doesn't result in incorrect error reporting
currently, but (a) the calls in the switch statement are inconsistent,
and (b) this will matter in future patches where we can call library
functions in between xfrog_scrub_metadata and str_liberror.
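
The convention at issue can be sketched as follows. scrub_item() and
describe_liberror() are hypothetical names for this example and do not
reproduce the xfs_scrub helpers or their signatures.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical library-style call: returns 0 on success or a negative
 * errno value on failure, and does not touch the global errno.
 */
static int scrub_item(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/*
 * Describe a library-style return code. The code is negated before it
 * is handed to strerror(), instead of consulting errno, which another
 * call may have overwritten in the meantime.
 */
static const char *describe_liberror(int ret)
{
	return ret < 0 ? strerror(-ret) : "success";
}
```

Reporting from the returned value keeps the message correct even if
other library calls run between the failure and the report.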

Fixes: de5d20ec ("libfrog: convert scrub.c functions to negative error codes")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.15.0-rc1

Update all the necessary files for a 5.15.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_quota: fix up dump and report documentation

Documentation for these commands was a bit of a mess.

1) The help args were respecified in the _help() functions, overwriting
the strings which had been set up in the _init functions as all
other commands do. Worse, in the report case, they differed.

2) The -L/-U dump options were not present in either short help string.

3) The -L/-U dump options were not documented in the xfs_quota manpage.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: don't guess about failure reason in phase6

There are many error messages in phase 6 which say
"filesystem may be out of space," when in reality the failure could
have been corruption or some other issue. Rather than guessing, and
emitting a confusing and possibly-wrong message, use the existing
res_failed() for any xfs_trans_alloc failures, and simply print the
error number in the other cases.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_quota: don't exit on fs_table_insert_project_path failure

If "project -p" fails in fs_table_insert_project_path, it
calls exit() today which is quite unfriendly. Return an error
and return to the command prompt as expected.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: move fprintf to caller per request]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>