Srikanth C S [Tue, 13 Dec 2022 17:15:43 +0000 (22:45 +0530)]
fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
After a recent data center crash, we had to recover root filesystems
on several thousands of VMs via a boot time fsck. Since these
machines are remotely manageable, support can inject the kernel
command line with 'fsck.mode=force fsck.repair=yes' to kick off
xfs_repair if the machine won't come up or if they suspect there
might be deeper issues with latent errors in the fs metadata, which
is what they did to try to get everyone running ASAP while
anticipating any future problems. But, fsck.xfs does not address the
journal replay in case of a crash.
fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
possible that when the machine crashes, the fs is in an inconsistent
state with the journal log not yet replayed. This can drop the machine
into the rescue shell because xfs_fsck.sh does not know how to clean the
log. Since the administrator told us to force repairs, address the
deficiency by cleaning the log and rerunning xfs_repair.
Run xfs_repair -e when fsck.mode=force and fsck.repair=auto or yes.
Replay the log only if fsck.mode=force and fsck.repair=yes. For the
other options, -fa and -f, drop to the rescue shell if repair detects
any corruption.
Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
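The policy described in this commit message can be summarized as a small decision function. The sketch below is a Python model, not the shell code of the real script; the function name, its arguments, and the step strings are made up for illustration, and it assumes that a dirty log is what makes the first xfs_repair run fail.

```python
def boot_fsck_steps(mode, repair, log_is_dirty):
    """Model which steps a forced boot-time fsck would take.

    Hypothetical helper: names and step strings are illustrative only.
    """
    steps = []
    # Nothing is forced unless fsck.mode=force with fsck.repair=auto or yes.
    if mode != "force" or repair not in ("auto", "yes"):
        return steps
    steps.append("xfs_repair -e")
    if log_is_dirty:
        if repair == "yes":
            # Mount and unmount so the kernel replays the log, then retry.
            steps += ["mount", "umount", "xfs_repair -e"]
        else:
            # Without fsck.repair=yes the log is left untouched.
            steps.append("rescue shell")
    return steps
```

Only the combination of force mode and repair=yes ever reaches the mount/umount cycle in this model, which mirrors the "administrator told us to force repairs" reasoning above.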
Darrick J. Wong [Tue, 13 Dec 2022 19:39:48 +0000 (11:39 -0800)]
xfs_db: create separate struct and field definitions for finobts
Create separate field_t definitions for the free inode btree because db
needs to know that the interior block pointers point to finobt blocks,
not inobt blocks. This is critical now because the buffer ops contain
magic numbers, the ->verify_struct routines use the magics listed in the
buffer ops, and the xfs_db iocursor calls the verifier functions.
Without this patch, xfs_db emits bizarre output like this:
Darrick J. Wong [Wed, 23 Nov 2022 17:09:39 +0000 (09:09 -0800)]
xfs_{db,repair}: fix XFS_REFC_COW_START usage
This is really a bit field stashed in the upper bit of the rc_startblock
field, so change its usage patterns to use masking instead of integer
addition and subtraction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
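To illustrate why masking is the safer pattern for a flag bit, here is a minimal Python sketch. It assumes the flag is bit 31 of a 32-bit rc_startblock, as the later commit messages in this log describe; the helper names are hypothetical and do not exist in xfsprogs.

```python
XFS_REFC_COW_START = 1 << 31  # flag stored in the upper bit of rc_startblock

def encode_startblock(agbno, is_cow):
    """Pack an AG block number and the CoW flag into one 32-bit value."""
    assert 0 <= agbno < XFS_REFC_COW_START
    return (agbno | XFS_REFC_COW_START) if is_cow else agbno

def decode_startblock(raw):
    """Split a raw 32-bit value into (agbno, is_cow) using masks only."""
    agbno = raw & (XFS_REFC_COW_START - 1)
    is_cow = bool(raw & XFS_REFC_COW_START)
    return agbno, is_cow
```

Decoding by subtraction would produce a negative (or wrapped) block number for any record that does not carry the flag, whereas the mask gives a sane answer for both kinds of record.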
Darrick J. Wong [Wed, 23 Nov 2022 17:09:33 +0000 (09:09 -0800)]
xfs_repair: retain superblock buffer to avoid write hook deadlock
Every now and then I experience the following deadlock in xfs_repair
when I'm running the offline repair fuzz tests:
#0 futex_wait (private=0, expected=2, futex_word=0x55555566df70) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x55555566df70, private=0) at ./nptl/lowlevellock.c:49
#2 lll_mutex_lock_optimized (mutex=0x55555566df70) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=mutex@entry=0x55555566df70) at ./nptl/pthread_mutex_lock.c:93
#4 cache_shake (cache=cache@entry=0x55555566de60, priority=priority@entry=2, purge=purge@entry=false) at cache.c:231
#5 cache_node_get (cache=cache@entry=0x55555566de60, key=key@entry=0x7fffe55e01b0, nodep=nodep@entry=0x7fffe55e0168) at cache.c:452
#6 __cache_lookup (key=key@entry=0x7fffe55e01b0, flags=0, bpp=bpp@entry=0x7fffe55e0228) at rdwr.c:405
#7 libxfs_getbuf_flags (btp=0x55555566de00, blkno=0, len=<optimized out>, flags=<optimized out>, bpp=0x7fffe55e0228) at rdwr.c:457
#8 libxfs_buf_read_map (btp=0x55555566de00, map=map@entry=0x7fffe55e0280, nmaps=nmaps@entry=1, flags=flags@entry=0, bpp=bpp@entry=0x7fffe55e0278, ops=0x5555556233e0 <xfs_sb_buf_ops>)
at rdwr.c:704
#9 libxfs_buf_read (ops=<optimized out>, bpp=0x7fffe55e0278, flags=0, numblks=<optimized out>, blkno=0, target=<optimized out>)
at /storage/home/djwong/cdev/work/xfsprogs/build-x86_64/libxfs/libxfs_io.h:195
#10 libxfs_getsb (mp=mp@entry=0x7fffffffd690) at rdwr.c:162
#11 force_needsrepair (mp=0x7fffffffd690) at xfs_repair.c:924
#12 repair_capture_writeback (bp=<optimized out>) at xfs_repair.c:1000
#13 libxfs_bwrite (bp=0x7fffe011e530) at rdwr.c:869
#14 cache_shake (cache=cache@entry=0x55555566de60, priority=priority@entry=2, purge=purge@entry=false) at cache.c:240
#15 cache_node_get (cache=cache@entry=0x55555566de60, key=key@entry=0x7fffe55e0470, nodep=nodep@entry=0x7fffe55e0428) at cache.c:452
#16 __cache_lookup (key=key@entry=0x7fffe55e0470, flags=1, bpp=bpp@entry=0x7fffe55e0538) at rdwr.c:405
#17 libxfs_getbuf_flags (btp=0x55555566de00, blkno=12736, len=<optimized out>, flags=<optimized out>, bpp=0x7fffe55e0538) at rdwr.c:457
#18 __libxfs_buf_get_map (btp=<optimized out>, map=map@entry=0x7fffe55e05b0, nmaps=<optimized out>, flags=flags@entry=1, bpp=bpp@entry=0x7fffe55e0538) at rdwr.c:501
#19 libxfs_buf_get_map (btp=<optimized out>, map=map@entry=0x7fffe55e05b0, nmaps=<optimized out>, flags=flags@entry=1, bpp=bpp@entry=0x7fffe55e0538) at rdwr.c:525
#20 pf_queue_io (args=args@entry=0x5555556722c0, map=map@entry=0x7fffe55e05b0, nmaps=<optimized out>, flag=flag@entry=11) at prefetch.c:124
#21 pf_read_bmbt_reclist (args=0x5555556722c0, rp=<optimized out>, numrecs=78) at prefetch.c:220
#22 pf_scan_lbtree (dbno=dbno@entry=1211, level=level@entry=1, isadir=isadir@entry=1, args=args@entry=0x5555556722c0, func=0x55555557f240 <pf_scanfunc_bmap>) at prefetch.c:298
#23 pf_read_btinode (isadir=1, dino=<optimized out>, args=0x5555556722c0) at prefetch.c:385
#24 pf_read_inode_dirs (args=args@entry=0x5555556722c0, bp=bp@entry=0x7fffdc023790) at prefetch.c:459
#25 pf_read_inode_dirs (bp=<optimized out>, args=0x5555556722c0) at prefetch.c:411
#26 pf_batch_read (args=args@entry=0x5555556722c0, which=which@entry=PF_PRIMARY, buf=buf@entry=0x7fffd001d000) at prefetch.c:609
#27 pf_io_worker (param=0x5555556722c0) at prefetch.c:673
#28 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#29 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
From this stack trace, we see that xfs_repair's prefetch module is
getting some xfs_buf objects ahead of initiating a read (#19). The
buffer cache has hit its limit, so it calls cache_shake (#14) to free
some unused xfs_bufs. The buffer it finds is a dirty buffer, so it
calls libxfs_bwrite to flush it out to disk, which in turn invokes the
buffer write hook that xfs_repair set up in 3b7667cb to mark the ondisk
filesystem's superblock as NEEDSREPAIR until repair actually completes.
Unfortunately, the NEEDSREPAIR handler itself needs to grab the
superblock buffer, so it makes another call into the buffer cache (#9),
which sees that the cache is full and tries to shake it (#4). Hence we
deadlock on cm_mutex because shaking is not reentrant.
Fix this by retaining a reference to the superblock buffer when possible
so that the writeback hook doesn't have to access the buffer cache to
set NEEDSREPAIR.
Fixes: 3b7667cb ("xfs_repair: set NEEDSREPAIR the first time we write to a filesystem") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:28 +0000 (09:09 -0800)]
xfs_repair: don't crash on unknown inode parents in dry run mode
Fuzz testing of directory block headers exposed a debug assertion vector
in xfs_repair. In normal (aka fixit) mode, if a single-block directory
has a totally trashed block, repair will zap the entire directory.
Phase 4 ignores any dirents pointing to the zapped directory, phase 6
ignores the freed directory, and everything is good.
However, in dry run mode, we don't actually free the inode. Phase 4
still ignores any dirents pointing to the zapped directory, but phase 6
thinks the inode is still live and tries to walk it. xfs_repair doesn't
know of any parents for the zapped directory and so trips the assertion.
The assertion is critical for fixit mode because we need all the parent
information to ensure consistency of the directory tree. In dry run
mode we don't care, because we only have to print inconsistencies and
return 1. Worse yet, (our) customers file bugs when xfs_repair crashes
during a -n scan, so this will generate support calls.
Make everyone's life easier by downgrading the assertion to a warning if
we're running in dry run mode.
Found by fuzzing bhdr.hdr.bno = zeroes in xfs/471.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:22 +0000 (09:09 -0800)]
xfs_db: fix printing of reverse mapping record blockcounts
FLDT_EXTLEN is the correct type for a 32-bit block count within an AG;
FLDT_REXTLEN is the type for a 21-bit file mapping block count. This
code should have been using the first type, not the second.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Notice that we passed in octal-zero, 'h', '5', 'o', but the fs label is
set to octal-5, 'o' because of the incorrect loop logic. -Wlogical-op
found this one.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Nov 2022 17:09:11 +0000 (09:09 -0800)]
misc: add missing includes
Add missing #include directives so that the compiler can typecheck
functions against their declarations. IOWs, -Wmissing-declarations
found some things.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
This corruption will shut down the file system, and the file system will
no longer be mountable. The following script can reproduce the problem,
but it may take a long time.
When lazysbcount is enabled, there is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies
m_ifree at the same time, the value read for m_ifree can end up greater
than m_icount. For example, consider the following sequence, where
ifreedelta is positive:
After this, an incorrect inode count (sb_ifree > sb_icount) will be written
to the log. In the subsequent write of the sb, the incorrect inode count
(sb_ifree > sb_icount) will fail the boundary check in
xfs_validate_sb_write(), which causes the file system to shut down.
When lazysbcount is enabled, we don't need to guarantee that the lazy sb
counters are completely correct, but we do need to guarantee that sb_ifree
<= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
must be satisfied any time that there /cannot/ be other threads allocating
or freeing inode chunks. If the constraint is violated under these
circumstances, sb_i{count,free} (the ondisk superblock inode counters)
may be incorrect and need to be marked sick at unmount; the counts will
be rebuilt on the next mount.
Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
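The invariant argued for above (sb_ifree must never exceed sb_icount, even when the two counters are sampled without a lock) can be enforced by clamping at the point where the values are copied. The following Python sketch only illustrates that idea; the function name is hypothetical and the commit text shown here does not include the actual patch.

```python
def snapshot_inode_counters(m_icount, m_ifree):
    """Return (sb_icount, sb_ifree) with the ifree <= icount invariant kept.

    The two inputs stand for racy, unlocked reads of the in-memory counters.
    """
    sb_icount = m_icount
    # A concurrent update may make the sampled m_ifree too large; clamp it.
    sb_ifree = min(m_ifree, sb_icount)
    return sb_icount, sb_ifree
```

The clamped value may be slightly stale, which is acceptable because lazy counters are recomputed later, but it can no longer trip the bounds check.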
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
a bit flag, even though it's *only* a bit flag. Rename the variable to
reflect its nature and update the cast target since we're not supposed
to be comparing it to xfs_agblock_t now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We're supposed to initialize the list head of an object before adding it
to another list. Fix that, and stop using the kmem_{alloc,free} calls
from the Irix days.
Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
As we've seen, refcount records use the upper bit of the rc_startblock
field to ensure that all the CoW staging records are at the right side of
the refcount btree. This works because an AG is never allowed to have
more than (1U << 31) blocks in it. If we ever encounter a filesystem
claiming to have that many blocks, we absolutely do not want reflink
touching it at all.
However, this test at the start of xfs_refcount_recover_cow_leftovers is
slightly incorrect -- it /should/ be checking that agblocks isn't larger
than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
constant is never large enough to conflict with that CoW flag.
Note that the V5 superblock verifier has not historically rejected
filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this
ended up in the COW recovery routine.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
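The two checks called for in this message can be sketched as follows. The numeric value used for XFS_MAX_CRC_AG_BLOCKS is an assumption made for illustration, and the function name is hypothetical.

```python
XFS_REFC_COW_START = 1 << 31   # CoW flag: upper bit of rc_startblock
MAX_CRC_AG_BLOCKS = 1 << 30    # assumed stand-in for XFS_MAX_CRC_AG_BLOCKS

# Analogue of a build-time check: the limit must never reach the flag bit.
assert MAX_CRC_AG_BLOCKS < XFS_REFC_COW_START

def cow_recovery_allowed(agblocks):
    """Refuse CoW leftover recovery when the AG claims too many blocks."""
    return agblocks <= MAX_CRC_AG_BLOCKS
```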
Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.
The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that we have an explicit enum for shared and CoW staging extents, we
can get rid of the old FIND_RCEXT flags. Omit a couple of conversions
that disappear in the next patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Create a helper function to ensure that CoW staging extent records have
a single refcount and that shared extent records have more than 1
refcount. We'll put this to more use in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
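A minimal sketch of such a helper, written in Python for illustration. The enum and function names are hypothetical; only the rule itself (one refcount for CoW staging records, more than one for shared records) comes from the message above.

```python
from enum import Enum

class RefcDomain(Enum):
    SHARED = 0
    COW = 1

def refcount_is_valid(domain, refcount):
    """Check the refcount against the rule for its record domain."""
    if domain is RefcDomain.COW:
        return refcount == 1   # CoW staging extents have a single owner
    return refcount >= 2       # shared extents need more than one owner
```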
Now that we've broken out the startblock and shared/cow domain in the
incore refcount extent record structure, update the tracepoints to
report the domain.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Just prior to committing the reflink code into upstream, the xfs
maintainer at the time requested that I find a way to shard the refcount
records into two domains -- one for records tracking shared extents, and
a second for tracking CoW staging extents. The idea here was to
minimize mount time CoW reclamation by pushing all the CoW records to
the right edge of the keyspace, and it was accomplished by setting the
upper bit in rc_startblock. We don't allow AGs to have more than 2^31
blocks, so the bit was free.
Unfortunately, this was a very late addition to the codebase, so most of
the refcount record processing code still treats rc_startblock as a u32
and pays no attention to whether or not the upper bit (the cow flag) is
set. This weakness is theoretically exploitable, since we're not
fully validating the incoming metadata records.
Fuzzing demonstrates practical exploits of this weakness. If the cow
flag of a node block key record is corrupted, a lookup operation can go
to the wrong record block and start returning records from the wrong
cow/shared domain. This causes the math to go all wrong (since cow
domain is still implicit in the upper bit of rc_startblock) and we can
crash the kernel by tricking xfs into jumping into a nonexistent AG and
tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
To fix this, start tracking the domain as an explicit part of struct
xfs_refcount_irec, adjust all refcount functions to check the domain
of a returned record, and alter the function definitions to accept them
where necessary.
Found by fuzzing keys[2].cowflag = add in xfs/464.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
If we're in the middle of a deferred refcount operation and decide to
roll the transaction to avoid overflowing the transaction space, we need
to check the new agbno/aglen parameters that we're about to record in
the new intent. Specifically, we need to check that the new extent is
completely within the filesystem, and that continuation does not put us
into a different AG.
If the keys of a node block are wrong, the lookup to resume an
xfs_refcount_adjust_extents operation can put us into the wrong record
block. If this happens, we might not find that we run out of aglen at
an exact record boundary, which will cause the loop control to do the
wrong thing.
The previous patch should take care of that problem, but let's add this
extra sanity check to stop corruption problems sooner than later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Create a predicate function to verify that a given agbno/blockcount pair
fit entirely within a single allocation group and don't suffer
mathematical overflows. Refactor the existing open-coded logic; we're
going to add more calls to this function in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
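A predicate of this kind might look like the Python sketch below. It emulates 32-bit wraparound explicitly, since Python integers do not overflow, and it simplifies the bounds to "less than agblocks"; the function name and that simplification are assumptions, not the kernel's definition.

```python
U32_MASK = 0xFFFFFFFF

def extent_fits_in_ag(agblocks, agbno, length):
    """True if [agbno, agbno + length) lies inside one AG of agblocks blocks."""
    end = (agbno + length) & U32_MASK  # emulate unsigned 32-bit addition
    if end <= agbno:
        # Either length is zero or the addition wrapped around.
        return False
    return agbno < agblocks and (end - 1) < agblocks
```

Checking `end <= agbno` first catches both the empty extent and the overflow case before any bounds comparison can be fooled by a wrapped value.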
Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen
such that the end of the range would not be in the middle of a refcount
record. If this is no longer the case, something is seriously wrong
with the btree. Bail out with a corruption error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Refactor all the open-coded sizeof logic for EFI/EFD log item and log
format structures into common helper functions whose names reflect the
struct names.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy. Since we're already fixing problems with BUI item copying, we
should fix everything else.
An extra difficulty here is that the ef[id]_extents arrays are declared
as single-element arrays. This is not the convention for flex arrays in
the modern kernel, and it causes all manner of problems with static
checking tools, since they often cannot tell the difference between a
single element array and a flex array.
So for starters, change those array[1] declarations to array[]
declarations to signal that they are proper flex arrays and adjust all
the "size-1" expressions to fit the new declaration style.
Next, refactor the xfs_efi_copy_format function to handle the copying of
the head and the flex array members separately. While we're at it, fix
a minor validation deficiency in the recovery function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
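The "size-1" adjustment mentioned above is plain arithmetic, sketched here in Python. The byte sizes are placeholders chosen for illustration, not the real structure sizes, and the function names are hypothetical.

```python
HEAD_SIZE = 16     # assumed size of the fixed header fields
EXTENT_SIZE = 16   # assumed size of one extent entry

def size_with_one_element_array(nextents):
    """Old style: sizeof(struct) already includes one array element."""
    struct_size = HEAD_SIZE + EXTENT_SIZE
    return struct_size + (nextents - 1) * EXTENT_SIZE

def size_with_flex_array(nextents):
    """New style: sizeof(struct) covers only the header."""
    return HEAD_SIZE + nextents * EXTENT_SIZE
```

Both expressions yield the same total, so the conversion changes how the size is spelled rather than what gets allocated or copied.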
xfs_rename can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip
and wip. So we need to increase the inode reservation to match.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
For leaf dirs, in most cases there should be as many bestfree slots
as there are dir data blocks that can fit under i_size (except for [1]).
The root cause is that we don't examine the number of bestfree slots.
When the number of slots is less than the number of dir data blocks and
we need to allocate a new dir data block and update the bestfree array,
we use the dir block number as the index into the bestfree array without
checking the leaf buf boundary, which may cause a UAF or other memory
access problem. This issue can also be triggered with test case xfs/473
from fstests.
Following Dave Chinner's and Darrick's suggestion, add a buffer verifier
to detect this abnormal situation in time.
Simplify the testcase for fstest xfs/554 [1]
[1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/
Reviewed-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
    value = int(literal, 16)
elif literal[0] in '123456789':
    value = int(literal, 10)
if value is None:
    print("I don't know how to handle %s" % (literal))
    cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
    print("Skipping 0x%x for cleanup elsewhere" % (value))
    cocci.include_match(False)
elif value & (value + 1) != 0:
    print("Skipping 0x%x because it's not a power of two minus one" % (value))
    cocci.include_match(False)
elif literal.startswith('0x'):
    coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
    coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
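The script above rewrites a mask only when the literal is a power of two minus one, because only then does masking a random value cover exactly the range [0, mask + 1). The Python sketch below restates that test outside of coccinelle; the function names are hypothetical.

```python
def is_pow2_minus_one(value):
    """Same test as the script: all low bits set and nothing above them."""
    return (value & (value + 1)) == 0

def bound_for_mask(mask):
    """Exclusive upper bound equivalent to 'random & mask'."""
    if not is_pow2_minus_one(mask):
        raise ValueError("mask is not a power of two minus one")
    return mask + 1
```

For a mask such as 0x3f, every masked value is below 64 and every value below 64 is reachable, so a bounded random call with limit 64 is an equivalent replacement. A mask such as 0x3e leaves gaps and is skipped.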
xfs_dir2_isleaf is used to see if the directory is a single-leaf
form directory instead, as commented right above the function.
Besides getting rid of the broken comment, rearrange the logic by
converting everything over to standard formatting and conventions at
the same time, to make it easier to understand and self-documenting.
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Take a look at the for-loop in xfs_da_grow_inode_int:
======
for(){
    nmap = min(XFS_BMAP_MAX_NMAP, count);
    ...
    error = xfs_bmapi_write(...,&mapp[mapi], &nmap);//(..., $1, $2)
    ...
    mapi += nmap;
}
=====
where $1 stands for the start address of the array,
while $2 is used to indicate the size of the array.
The array $1 will advance by $nmap in each iteration after
the allocation of extents.
But the size $2 still remains unchanged, which is determined by
min(XFS_BMAP_MAX_NMAP, count).
It seems that the code forgot to trim the size of the mapp array after
each iteration, so change it.
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
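The problem and its fix can be modeled in a few lines of Python. This is a simplified sketch in which every mapping covers exactly one block, so the number of free slots equals count - mapi; the value 4 for XFS_BMAP_MAX_NMAP, the function name, and the alloc callback are assumptions made for illustration.

```python
XFS_BMAP_MAX_NMAP = 4  # assumed value for illustration

def request_sizes(count, alloc):
    """Return the nmap passed on each iteration with the size trimmed.

    alloc(nmap) is a hypothetical stand-in for the allocator and returns
    how many mappings it filled (between 1 and nmap).
    """
    mapi = 0
    sizes = []
    while mapi < count:
        # Trimmed: never ask for more slots than remain after mapi.
        nmap = min(XFS_BMAP_MAX_NMAP, count - mapi)
        sizes.append(nmap)
        mapi += alloc(nmap)
    return sizes
```

With count = 6 the untrimmed loop would ask for 4 mappings at offset 4 and could write past the end of a 6-entry array, while the trimmed loop asks for 4 and then 2.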
Return the value xfs_dir_cilookup_result() directly instead of storing it
in another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The "%Ld" specifier, which is meant to represent long long,
does not conform to the C language standard, and it is also
easily mistaken for "%ld", which represents long.
So replace "%Ld" with "%lld".
Do the same with "%Lu".
Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_quota: apply -L/-U range limits in uid/gid/pid loops
If the kernel doesn't support XFS_GETNEXTQUOTA, the report/dump
command will fall back to iterating over all known uid/gid/pid.
However, it currently doesn't take the -L/-U range limits into account
(all entities with non-zero quotas are output). Apply those limits
in the fallback case.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_quota: utilize XFS_GETNEXTQUOTA for ranged calls in report/dump
The implementation based on an XFS_GETQUOTA call for each ID in the
range specified with -L/-U is quite slow for wider ranges.
If the kernel supports XFS_GETNEXTQUOTA, report_*_mount/dump_any_file
will use that to obtain the quota list for the mount. XFS_GETNEXTQUOTA
returns the quota of the requested ID or, if that ID has none, of the
next ID with a non-empty quota.
Otherwise, XFS_GETQUOTA will be used for each user/group/project ID
known from the password/group/project database.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
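The benefit of a "get next" call for ranged reports is that the loop jumps from one populated ID to the next instead of probing every ID in the range. The Python sketch below simulates that iteration over an in-memory set; the function names are hypothetical and no real quota call is made.

```python
import bisect

def report_range(ids_with_quota, lower, upper):
    """List IDs with quota in [lower, upper] using get-next style lookups."""
    ids = sorted(ids_with_quota)

    def get_next(wanted):
        # Simulated lookup: first ID >= wanted that has a quota, or None.
        i = bisect.bisect_left(ids, wanted)
        return ids[i] if i < len(ids) else None

    found = []
    cursor = lower
    while True:
        nxt = get_next(cursor)
        if nxt is None or nxt > upper:
            break
        found.append(nxt)
        cursor = nxt + 1  # resume just past the ID that was returned
    return found
```

The number of lookups is proportional to the number of populated IDs in the range, not to the width of the range.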
xfs_quota: separate get_dquot() and report_mount()
Separate quota info acquisition from outputting. This allows upper
functions to filter obtained info (e.g. within specific ID range).
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Separate quota info acquisition from outputting it to file. This
allows upper functions to filter obtained info (e.g. within specific
ID range).
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_quota: separate quota info acquisition into get_dquot()
Both report_mount() and dump_file() have identical code to get quota
information. This could be used for further separation of the
functions.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The macro definitions of 'EOFF' and 'E3OFF' are the same, so it makes no
difference which one is used to seek the field offset in
'dir3_sf_entry_flds'. But it seems the intent of defining the 'E3OFF'
macro is for it to be used in 'dir3_sf_entry_flds', and the 'E3OFF' macro
is not used anywhere in the 'xfsprogs-dev' source:
/* command begin */
$ grep -r E3OFF /path/to/xfsprogs-dev/git/repository/
./db/dir2sf.c:#define E3OFF(f) bitize(offsetof(xfs_dir2_sf_entry_t, f))
$
/* command end */
The above command shows that 'E3OFF' is only defined but never
used, which is odd, so there is reason to suspect that using 'EOFF'
rather than 'E3OFF' in 'dir3_sf_entry_flds' is a typo. This patch fixes
it; there is no logical change in this commit at all.
Signed-off-by: Xiaole He <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for the transaction that
removes a directory entry reads:
/* fs/xfs/libxfs/xfs_trans_resv.c begin */
/*
* For removing a directory entry we can modify:
* the parent directory inode: inode size
* the removed inode: inode size
...
xfs_calc_remove_reservation(
struct xfs_mount *mp)
{
return XFS_DQUOT_LOGRES(mp) +
xfs_calc_iunlink_add_reservation(mp) +
max((xfs_calc_inode_res(mp, 1) +
...
/* fs/xfs/libxfs/xfs_trans_resv.c end */
The comment says that 2 inode sizes of space are to be reserved, but the
actual code for the inode reservation only counts 1 inode size in
'xfs_calc_inode_res(mp, 1)', rather than 2.
Signed-off-by: hexiaole <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: remove redundant code citations] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Slark Xiao <slark_xiao@163.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
This newly-added assertion checks that there aren't any incore data
structures hanging off the incore fork when we're trying to reset its
contents. From the call trace, it is evident that iget was trying to
construct an incore inode from the ondisk inode, but the attr fork
verifier failed and we were trying to undo all the memory allocations
that we had done earlier.
The three assertions in xfs_ifork_zap_attr check that the caller has
already called xfs_idestroy_fork, which clearly has not been done here.
As the zap function then zeroes the pointers, we've effectively leaked
the memory.
The shortest change would have been to insert an extra call to
xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
call into _zap_attr, since all other callsites call _idestroy_fork
immediately prior to calling _zap_attr. IOWs, it eliminates one way to
fail.
Note: This change only applies cleanly to 2ed5b09b3e8f, since we just
reworked the attr fork lifetime. However, I think this memory leak has
existed since 0f45a1b20cd8, since the chain xfs_iformat_attr_fork ->
xfs_iformat_local -> xfs_init_local_fork will allocate
ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
hanging off i_afp. The solution for older kernels I think is to add the
missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.
Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.
Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
Probably-Fixes: 0f45a1b20cd8 ("xfs: improve local fork verification") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
These NULL checks are no longer needed after commit 2ed5b09b3e8f ("xfs:
make inode attribute forks a permanent part of struct xfs_inode").
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The 'ctime', 'mtime', and 'atime' fields of an inode have the type
'xfs_timestamp_t', which is a 64-bit type:
/* fs/xfs/libxfs/xfs_format.h begin */
typedef __be64 xfs_timestamp_t;
/* fs/xfs/libxfs/xfs_format.h end */
When the 'bigtime' feature is disabled, this 64-bit type is split
into two 32-bit parts: one part encodes the seconds since
1970-01-01 00:00:00 UTC, and the other part encodes the nanoseconds
beyond those seconds. Together the two parts form the type
'xfs_legacy_timestamp', and the min and max time values of this type are
defined as the macros 'XFS_LEGACY_TIME_MIN' and 'XFS_LEGACY_TIME_MAX':
/* fs/xfs/libxfs/xfs_format.h begin */
struct xfs_legacy_timestamp {
__be32 t_sec; /* timestamp seconds */
__be32 t_nsec; /* timestamp nanoseconds */
};
/* fs/xfs/libxfs/xfs_format.h end */
/* include/linux/limits.h begin */
/* include/linux/limits.h end */
'XFS_LEGACY_TIME_MIN' is the min time value of
'xfs_legacy_timestamp', that is, -(2^31) seconds relative to
1970-01-01 00:00:00 UTC. It can be converted to a human-friendly time
value with the 'date' command:
/* command begin */
[root@~]# date --utc -d '@0' +'%Y-%m-%d %H:%M:%S'
1970-01-01 00:00:00
[root@~]# date --utc -d "@`echo '-(2^31)'|bc`" +'%Y-%m-%d %H:%M:%S'
1901-12-13 20:45:52
[root@~]#
/* command end */
When the 'bigtime' feature is enabled, this 64-bit type becomes a 64-bit
nanoseconds counter whose start time (the time at which the counter is
0) is the min time value of 'xfs_legacy_timestamp'. We have already
calculated that min time value, 1901-12-13 20:45:52 UTC, but the comment
describing the start time for inodes with the 'bigtime' feature enabled
says the value is 1901-12-31 20:45:52 UTC:
/* fs/xfs/libxfs/xfs_format.h begin */
/*
* XFS Timestamps
* ==============
* When the bigtime feature is enabled, ondisk inode timestamps become an
* unsigned 64-bit nanoseconds counter. This means that the bigtime inode
* timestamp epoch is the start of the classic timestamp range, which is
* Dec 31 20:45:52 UTC 1901. ...
...
*/
/* fs/xfs/libxfs/xfs_format.h end */
That is a typo; this patch corrects it from 'Dec 31' to
'Dec 13'.
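The epoch arithmetic can be checked with a small sketch. The helper
names and the TOY_ macros below are assumptions made for illustration,
not the kernel's own definitions; the only facts taken from the text are
that the counter is in nanoseconds and that counter value 0 sits 2^31
seconds before the Unix epoch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Seconds between the bigtime epoch and the Unix epoch: 2^31. */
#define TOY_BIGTIME_EPOCH_OFFSET	(INT64_C(1) << 31)
#define TOY_NSEC_PER_SEC		UINT64_C(1000000000)

/* Convert an ondisk bigtime nanosecond counter to seconds since 1970. */
static int64_t toy_bigtime_to_unix_sec(uint64_t ondisk_ns)
{
	return (int64_t)(ondisk_ns / TOY_NSEC_PER_SEC) -
	       TOY_BIGTIME_EPOCH_OFFSET;
}

/* Format Unix seconds as "YYYY-MM-DD HH:MM:SS" in UTC; 0 on success. */
static int toy_format_utc(int64_t unix_sec, char *buf, size_t len)
{
	time_t		t = (time_t)unix_sec;
	struct tm	*tm = gmtime(&t);

	if (!tm || strftime(buf, len, "%Y-%m-%d %H:%M:%S", tm) == 0)
		return -1;
	return 0;
}
```

Feeding a counter value of 0 through both helpers should reproduce the
'date' output quoted earlier, i.e. Dec 13 rather than Dec 31.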
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Xiaole He <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Now that we have forwards traversal via the incore inode in place, we
need to add back pointers to the incore inode to entirely replace
the back reference cache. We use the same lookup semantics and
constraints as for the forwards pointer lookups during unlinks, and
so we can look up any inode in the unlinked list directly and update
the list pointers, forwards or backwards, at any time.
The only wrinkle in converting the unlinked list manipulations to
use in-core previous pointers is that log recovery doesn't have the
incore inode state built up so it can't just read in an inode and
release it to finish off the unlink. Hence we need to modify the
traversal in recovery to read one inode ahead before we
release the inode at the head of the list. This populates the
next->prev relationship sufficient to be able to replay the unlinked
list and hence greatly simplify the runtime code.
This recovery algorithm also requires that we actually remove inodes
from the unlinked list one at a time as background inode
inactivation will result in unlinked list removal racing with the
building of the in-memory unlinked list state. We could serialise
this by holding the AGI buffer lock when constructing the in memory
state, but all that does is lockstep background processing with list
building. It is much simpler to flush the inodegc immediately after
releasing the inode so that it is unlinked immediately and there are
no races present at all.
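The read-one-ahead traversal can be sketched on a toy list. Array
indexes stand in for inode numbers here, and the structure and function
names are hypothetical; the sketch only shows how looking at the next
inode before releasing the current one fills in each next->prev link.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LIST_END	((unsigned int)-1)	/* end-of-list marker */

/* Hypothetical incore inode; array indexes stand in for inode numbers. */
struct toy_unlinked_inode {
	unsigned int	next_unlinked;	/* forward pointer from the ondisk inode */
	unsigned int	prev_unlinked;	/* incore-only back pointer */
};

/*
 * Walk one unlinked bucket from its head. Before "releasing" the current
 * inode, look at the next inode and point its back link at the current
 * one. Returns the number of inodes visited.
 */
static size_t toy_rebuild_prev(struct toy_unlinked_inode *inodes,
			       unsigned int head)
{
	unsigned int	cur = head;
	size_t		seen = 0;

	if (head != TOY_LIST_END)
		inodes[head].prev_unlinked = TOY_LIST_END;

	while (cur != TOY_LIST_END) {
		unsigned int	next = inodes[cur].next_unlinked;

		if (next != TOY_LIST_END)
			inodes[next].prev_unlinked = cur;
		/* the current inode would be released here */
		cur = next;
		seen++;
	}
	return seen;
}
```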
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Having direct access to the i_next_unlinked pointer in unlinked
inodes greatly simplifies the processing of inodes on the unlinked
list. We no longer need to look up the inode buffer just to find the
next inode in the list if the xfs_inode is in memory. These
improvements will be realised over upcoming patches as other
dependencies on the inode buffer for unlinked list processing are
removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
attribute fork but i_forkoff is zero. This eliminates the ambiguity
between i_forkoff and i_af.if_present, which should make it easier to
understand the lifetime of attr forks.
While we're at it, remove the if_present checks around calls to
xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
forks that have already been torn down.
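As a rough illustration of the lookup rule, assuming a toy inode with an
embedded attr fork (the toy_ names are hypothetical and much simpler
than the real structures): a zero fork offset is the single source of
truth, and the accessor hands back NULL in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified fork and inode. */
struct toy_xfork {
	int	nextents;
};

struct toy_xinode {
	unsigned char		forkoff;	/* zero means "no attr fork" */
	struct toy_xfork	data_fork;
	struct toy_xfork	attr_fork;	/* embedded, always allocated */
};

enum toy_whichfork { TOY_DATA_FORK, TOY_ATTR_FORK };

/* Return the requested fork, or NULL for an attr fork that is not there. */
static struct toy_xfork *toy_ifork_ptr(struct toy_xinode *ip,
				       enum toy_whichfork which)
{
	switch (which) {
	case TOY_DATA_FORK:
		return &ip->data_fork;
	case TOY_ATTR_FORK:
		return ip->forkoff ? &ip->attr_fork : NULL;
	}
	return NULL;
}
```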
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
==================================================================
BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
Memory state around the buggy address:
 ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
 ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
>ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                            ^
 ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
 ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
==================================================================
The root cause of this bug is the unlocked access to xfs_inode.i_afp
from the getxattr code paths while trying to determine which ILOCK mode
to use to stabilize the xattr data. Unfortunately, the VFS does not
acquire i_rwsem when vfs_getxattr (or listxattr) calls into the
filesystem, which means that getxattr can race with a removexattr that's
tearing down the attr fork and crash:
Regrettably, the VFS is much more lax about i_rwsem and getxattr than
is immediately obvious -- not only does it not guarantee that we hold
i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
The getxattr system call won't acquire the lock before calling XFS, but
the file capabilities code calls getxattr with and without i_rwsem held
to determine if the "security.capabilities" xattr is set on the file.
Fixing the VFS locking requires a treewide investigation into every code
path that could touch an xattr and what i_rwsem state it expects or sets
up. That could take years or even prove impossible; fortunately, we
can fix this UAF problem inside XFS.
An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
ensure that i_forkoff is always zeroed before i_afp is set to null and
changed the read paths to use smp_rmb before accessing i_forkoff and
i_afp, which avoided these UAF problems. However, the patch author was
too busy dealing with other problems in the meantime, and by the time he
came back to this issue, the situation had changed a bit.
On a modern system with selinux, each inode will always have at least
one xattr for the selinux label, so it doesn't make much sense to keep
incurring the extra pointer dereference. Furthermore, Allison's
upcoming parent pointer patchset will also cause nearly every inode in
the filesystem to have extended attributes. Therefore, make the inode
attribute fork structure part of struct xfs_inode, at a cost of 40 more
bytes.
This patch adds a clunky if_present field where necessary to maintain
the existing logic of xattr fork null pointer testing in the existing
codebase. The next patch switches the logic over to XFS_IFORK_Q and it
all goes away.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We're about to make this logic do a bit more, so convert the macro to a
static inline function for better typechecking and fewer shouty macros.
No functional changes here.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Andrey Strachuk <strochuk@ispras.ru> Fixes: 4d0cdd2bb8f0 ("xfs: clean up xfs_attr_node_hasname") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We check if an ag contains the log in many places, so make this
a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and
renaming it xfs_ag_contains_log(). Then convert all the places that
check if the AG contains the log to use this helper.
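A sketch of what such a helper might look like, using a hypothetical
toy_geom structure instead of the real mount and superblock. It assumes
the usual XFS block number encoding, in which the AG number occupies the
bits above agblklog, and that a log start of zero means there is no
internal log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slice of the filesystem geometry. */
struct toy_geom {
	uint64_t	logstart;	/* first fs block of the internal log, 0 if none */
	uint8_t		agblklog;	/* log2 of the AG size in blocks, rounded up */
};

/* Does this AG hold the internal log? */
static inline bool toy_ag_contains_log(const struct toy_geom *g, uint32_t agno)
{
	return g->logstart > 0 &&
	       agno == (uint32_t)(g->logstart >> g->agblklog);
}
```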
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Many of the places that call xfs_ag_block_count() have a perag
available. These places can just read pag->block_count directly
instead of calculating the AG block count from first principles.
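For illustration, "first principles" here means deriving the length
from the superblock: every AG is full sized except the last one, which
gets whatever is left of the data device. The sketch below uses
hypothetical toy_ structures and caches that result once per AG.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the superblock geometry. */
struct toy_sb {
	uint64_t	dblocks;	/* total data blocks in the filesystem */
	uint32_t	agblocks;	/* blocks in a full-sized AG */
	uint32_t	agcount;	/* number of AGs */
};

/* Hypothetical per-AG structure caching the computed length. */
struct toy_pag_len {
	uint32_t	agno;
	uint32_t	block_count;
};

/* The long way: every AG is full sized except possibly the last one. */
static uint32_t toy_ag_block_count(const struct toy_sb *sb, uint32_t agno)
{
	assert(agno < sb->agcount);
	if (agno < sb->agcount - 1)
		return sb->agblocks;
	return (uint32_t)(sb->dblocks - (uint64_t)agno * sb->agblocks);
}

/* Compute once at setup; later callers just read pag->block_count. */
static void toy_pag_len_init(struct toy_pag_len *pag, const struct toy_sb *sb,
			     uint32_t agno)
{
	pag->agno = agno;
	pag->block_count = toy_ag_block_count(sb, agno);
}
```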
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There is a lot of overhead in functions like xfs_verify_agino() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.
In the case of xfs_verify_agino(), we now always have a perag
context handy, so we can store the minimum and maximum agino values
in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.
xfs_verify_agino_or_null() gets the same perag treatment.
xfs_agino_range() is moved to xfs_ag.c as it's not really a type
function, and its use is largely restricted as the first and last
aginos can be grabbed straight from the perag in most cases.
Note that we leave the original xfs_verify_agino in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agino() to indicate it takes both an agno
and an agino to differentiate it from the new function.
$ size --totals fs/xfs/built-in.a
text data bss dec hex filename
before 1482185 329588 572 1812345 1ba779 (TOTALS)
after 1481937 329588 572 1812097 1ba681 (TOTALS)
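A sketch of the cached-bounds check, with hypothetical toy_ names: once
the minimum and maximum agino live in the per-AG structure, verification
is two comparisons that the compiler can inline. TOY_NULLAGINO is an
assumed all-ones "no inode" marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NULLAGINO	UINT32_MAX	/* assumed "no inode" marker */

/* Hypothetical per-AG structure holding precomputed inode bounds. */
struct toy_pag_ino {
	uint32_t	agino_min;	/* first valid inode number in the AG */
	uint32_t	agino_max;	/* last valid inode number in the AG */
};

/* Range check against the cached bounds; no geometry math per call. */
static inline bool toy_verify_agino(const struct toy_pag_ino *pag,
				    uint32_t agino)
{
	return agino >= pag->agino_min && agino <= pag->agino_max;
}

/* Same check, but the null marker is also acceptable. */
static inline bool toy_verify_agino_or_null(const struct toy_pag_ino *pag,
					    uint32_t agino)
{
	return agino == TOY_NULLAGINO || toy_verify_agino(pag, agino);
}
```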
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There is a lot of overhead in functions like xfs_verify_agbno() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.
In the case of xfs_verify_agbno(), we now always have a perag
context handy, so we can store the AG length and the minimum valid
block in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.
Move xfs_ag_block_count() to xfs_ag.c because it's really a
per-ag function and not an XFS type function. We need a little
bit of rework that is specific to xfs_initialise_perag() to allow
growfs to calculate the new perag sizes before we've updated the
primary superblock during the grow (chicken/egg situation).
Note that we leave the original xfs_verify_agbno in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
and an agbno to differentiate it from the new function.
Future commits will make similar changes for other per-ag geometry
validation functions.
Further:
$ size --totals fs/xfs/built-in.a
text data bss dec hex filename
before 1483006 329588 572 1813166 1baaae (TOTALS)
after 1482185 329588 572 1812345 1ba779 (TOTALS)
This rework reduces the binary size by ~820 bytes, indicating
that much less work is being done to bounds check the agbno values
against per-ag geometry information.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We have the perag in most places we call xfs_alloc_read_agfl, so
pass the perag instead of a mount/agno pair.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
It's available in all callers, so pass it in so that the perag can
be passed further down the stack.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
It's available in all callers, so pass it in so that the perag can
be passed further down the stack.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We have the perag in most places we call xfs_read_agf, so pass the
perag instead of a mount/agno pair.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We have the perag in most places we call xfs_read_agi, so pass the
perag instead of a mount/agno pair.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_alloc_read_agf() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.
Whilst modifying the xfs_reflink_find_shared() function definition,
declare it static and remove the extern declaration as it is an
internal function only these days.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced
by passing a NULL agfbp to xfs_alloc_read_agf().
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_ialloc_read_agi() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
This is just a basic wrapper around xfs_ialloc_read_agi(), which can
be entirely handled by xfs_ialloc_read_agi() by passing a NULL
agibpp....
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Because the perag must exist for these operations, look it up as
part of the common shrink operations and pass it instead of the
mount/agno pair.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Fri, 12 Aug 2022 18:39:25 +0000 (13:39 -0500)]
xfs_repair: fix printf format specifiers on 32-bit platforms
armv7l builds spit out the following warnings:
In file included from ../include/platform_defs.h:44,
from ../include/libxfs.h:13,
from bmap.c:7:
bmap.c: In function ‘blkmap_alloc’:
bmap.c:41:11: error: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘xfs_extnum_t’ {aka ‘long long unsigned int’} [-Werror=format=]
41 | _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bmap.c:41:9: note: in expansion of macro ‘_’
41 | _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
| ^
bmap.c:41:58: note: format string is defined here
41 | _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
| ~^
| |
| int
| %lld
In file included from ../include/platform_defs.h:44,
from ../include/libxfs.h:13,
from bmap.c:7:
bmap.c:54:35: error: format ‘%zu’ expects argument of type ‘size_t’, but argument 2 has type ‘xfs_extnum_t’ {aka ‘long long unsigned int’} [-Werror=format=]
54 | do_warn(_("malloc failed in blkmap_alloc (%zu bytes)\n"),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bmap.c:54:33: note: in expansion of macro ‘_’
54 | do_warn(_("malloc failed in blkmap_alloc (%zu bytes)\n"),
| ^
bmap.c:54:69: note: format string is defined here
54 | do_warn(_("malloc failed in blkmap_alloc (%zu bytes)\n"),
| ~~^
| |
| unsigned int
| %llu
Fix these.
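One portable way to print such a value is to cast it to unsigned long
long and use %llu, which is what the compiler hints at above. The
sketch below assumes a 64-bit toy_extnum_t as a stand-in for
xfs_extnum_t and uses a shortened, illustrative message string; it is
not the code from bmap.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t toy_extnum_t;	/* stand-in for a 64-bit extent count type */

/*
 * Format an extent count the same way on 32-bit and 64-bit platforms by
 * casting to a type that exactly matches the %llu conversion.
 */
static int toy_format_extents(char *buf, size_t len, toy_extnum_t nex)
{
	return snprintf(buf, len,
			"Number of extents requested (%llu) overflows 32 bits.",
			(unsigned long long)nex);
}
```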
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Chandan Babu R [Fri, 5 Aug 2022 02:54:27 +0000 (21:54 -0500)]
xfs_repair: Add support for upgrading to large extent counters
This commit adds support to xfs_repair to allow upgrading an existing
filesystem to support per-inode large extent counters.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 5 Aug 2022 02:54:25 +0000 (21:54 -0500)]
xfs_repair: check filesystem geometry before allowing upgrades
Currently, the two V5 feature upgrades permitted by xfs_repair do not
affect filesystem space usage, so we haven't needed to verify the
geometry.
However, this will change once we start to allow the sysadmin to add the
large extent count feature to existing filesystems. Add all the
infrastructure we need to ensure that the log will still be large
enough, and the root inode will still be where we expect it to be after
the upgrade.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
[david: Recompute transaction reservation values; Exit with error if upgrade fails] Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 5 Aug 2022 02:28:23 +0000 (21:28 -0500)]
mkfs: complain about impossible log size constraints
xfs/042 trips over an impossible fs geometry when nrext64 is enabled.
The minimum log size calculation comes out to 4287 blocks, but the mkfs
parameters specify an AG size of 4096 blocks. This eventually causes
mkfs to complain that the autoselected log size doesn't meet the minimum
size, but we could be a little more explicit in pointing out that the
two size constraints make for an impossible geometry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 5 Aug 2022 02:27:01 +0000 (21:27 -0500)]
mkfs: stop allowing tiny filesystems
Refuse to format filesystems that are "too small", because these
configurations are known to have performance and redundancy problems
that are not present on the volume sizes that XFS is best at handling.
Specifically, this means that we won't allow logs smaller than 64MB, we
won't allow single-AG filesystems, and we won't allow volumes smaller
than 300MB. There are two exceptions: the first is an undocumented CLI
option that can be used for crafting debug filesystems.
The second exception applies when fstests is detected, because there are a
lot of fstests that use tiny filesystems to perform targeted regression
and functional testing in a controlled environment. Fixing the ~40 or
so tests to run more slowly with larger filesystems isn't worth the risk
of breaking the tests.
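The three limits can be summarised in a short sketch. The structure,
the enum, and the single allow_tiny switch (standing in for both
exceptions) are assumptions for illustration and do not reflect how
mkfs actually wires this up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MIN_LOG_BYTES	(UINT64_C(64) << 20)	/* 64MB */
#define TOY_MIN_FS_BYTES	(UINT64_C(300) << 20)	/* 300MB */

enum toy_tiny_reason {
	TOY_SIZE_OK,
	TOY_FS_TOO_SMALL,
	TOY_LOG_TOO_SMALL,
	TOY_SINGLE_AG,
};

/* Hypothetical view of the parameters mkfs has settled on. */
struct toy_mkfs_cfg {
	uint64_t	fs_bytes;
	uint64_t	log_bytes;
	uint32_t	agcount;
	bool		allow_tiny;	/* debug option or test harness detected */
};

/* Report the first limit that the configuration violates, if any. */
static enum toy_tiny_reason toy_check_tiny(const struct toy_mkfs_cfg *cfg)
{
	if (cfg->allow_tiny)
		return TOY_SIZE_OK;
	if (cfg->fs_bytes < TOY_MIN_FS_BYTES)
		return TOY_FS_TOO_SMALL;
	if (cfg->log_bytes < TOY_MIN_LOG_BYTES)
		return TOY_LOG_TOO_SMALL;
	if (cfg->agcount < 2)
		return TOY_SINGLE_AG;
	return TOY_SIZE_OK;
}
```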
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 5 Aug 2022 02:27:00 +0000 (21:27 -0500)]
mkfs: ignore data blockdev stripe geometry for small filesystems
As part of the process of removing support for tiny filesystems (defined
in the next patch to be anything under 300MB or 64M log size), we are
trying to eliminate all the edge case regressions for small filesystems
that the maintainer can find.
Eric pointed out that the use case of formatting a 510M filesystem on a RAID device
regresses once we start enforcing the 64M log size limit:
# modprobe scsi_debug opt_blks=256 opt_xferlen_exp=6 dev_size_mb=510
# mkfs.xfs /dev/sdg
Log size must be at least 64MB.
<hapless user reads manpage, adjusts log size>
# mkfs.xfs -l size=64m /dev/sdg
internal log size 16384 too large, must be less than 16301
Because the device reports a stripe geometry, mkfs tries to create 8 AGs
(instead of the usual 4) which are then very nearly 64M in size. The
log itself cannot consume the entire AG, so its size is rounded down to
allow the creation of AG headers and btrees, and then the log size is
rounded down again to match the stripe unit.
This results in a log that is less than 64MB in size, causing the format
to fail.
There's not much point in formatting tiny AGs on a small filesystem,
even if it is on a RAID. Doubling the AG count from 4 to 8 doubles the
metadata overhead, conflicts with our attempts to boost the log size,
and on 2022-era storage hardware gains us very little extra performance
since we're not limited by storage access times.
Therefore, disable automatic detection of stripe unit and width if the
data device is less than 1GB. We would like to format with 128M AGs to
avoid constraining the size of the internal log, and since RAIDs smaller
than 8GB are formatted with 8 AGs by default, 128*8=1G was chosen as the
cutoff.
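The cutoff arithmetic, as a sketch with hypothetical names: 8 AGs of
128M each come to exactly 1G, and stripe autodetection is only honoured
at or above that size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TARGET_AG_BYTES	(UINT64_C(128) << 20)	/* 128M per AG */
#define TOY_RAID_AG_COUNT	UINT64_C(8)		/* default AG count on small RAIDs */
#define TOY_STRIPE_CUTOFF_BYTES	(TOY_TARGET_AG_BYTES * TOY_RAID_AG_COUNT)

/* Should mkfs pay attention to the stripe geometry the device reports? */
static bool toy_autodetect_stripe(uint64_t data_dev_bytes)
{
	return data_dev_bytes >= TOY_STRIPE_CUTOFF_BYTES;
}
```

With these numbers, the 510M scsi_debug device from the example above
falls below the cutoff, so its stripe geometry would be ignored.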
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Fri, 5 Aug 2022 02:26:43 +0000 (21:26 -0500)]
libxfs: stop overriding MAP_SYNC in publicly exported header files
Florian Fainelli most recently reported that xfsprogs doesn't build with
musl on mips:
"MIPS platforms building with recent kernel headers and the musl-libc
toolchain will expose the following build failure:
mmap.c: In function 'mmap_f':
mmap.c:196:12: error: 'MAP_SYNC' undeclared (first use in this function); did you mean 'MS_SYNC'?
196 | flags = MAP_SYNC | MAP_SHARED_VALIDATE;
| ^~~~~~~~
| MS_SYNC
mmap.c:196:12: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [../include/buildrules:81: mmap.o] Error 1"
At first glance, the build failure here is caused by the fact that:
1. The configure script doesn't detect MAP_SYNC support
2. The build system doesn't set HAVE_MAP_SYNC
3. io/mmap.c includes input.h -> projects.h -> xfs.h and later sys/mman.h
4. include/linux.h #define's MAP_SYNC to 0 if HAVE_MAP_SYNC is not set
5. musl's sys/mman.h #undef MAP_SYNC on platforms that don't support it
6. io/mmap.c tries to use MAP_SYNC, not realizing that libc undefined it
Normally, xfs_io only exports functionality that is defined by the libc
and/or kernel headers on the build system. We often make exceptions for
new functionality so that we have a way to test them before the header
file packages catch up, hence this '#ifndef HAVE_FOO #define FOO'
paradigm.
MAP_SYNC is a gross and horribly broken example of this. These support
crutches are supposed to be *private* to xfsprogs for benefit of early
testing, but they were instead added to include/linux.h, which we
provide to user programs in the xfslibs-dev package. IOWs, we've been
#defining MAP_SYNC to zero for unsuspecting programs.
Worse yet, gcc 11.3 doesn't even warn about overriding a #define to 0:
#include <stdio.h>
#include <sys/mman.h>
#ifdef STUPID
# include <xfs/xfs.h>
#endif
int main(int argc, char *argv[]) {
printf("MAP_SYNC 0x%x\n", MAP_SYNC);
}
Four years have gone by since the introduction of MAP_SYNC, so let's get
rid of the override code entirely -- any platform that supports MAP_SYNC
has had plenty of chances to ensure their header files have the right
bits. While we're at it, fix AC_HAVE_MAP_SYNC to look for MAP_SYNC in
the same header file that the one user (io/mmap.c) uses -- sys/mman.h.
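Without an override, a caller has to cope with MAP_SYNC being absent.
The following is only a sketch of that pattern with a hypothetical
helper name; it is not the code in io/mmap.c. It relies solely on
whatever sys/mman.h provides and reports failure when either flag is
missing.

```c
#include <assert.h>
#include <sys/mman.h>

/*
 * Store the mmap flags for a synchronous shared mapping and return 0 if
 * the libc headers define both flags; otherwise store 0 and return -1.
 */
static int toy_map_sync_flags(int *flags)
{
#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
	*flags = MAP_SYNC | MAP_SHARED_VALIDATE;
	return 0;
#else
	*flags = 0;
	return -1;
#endif
}
```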
Annoyingly, I had to test this by hand because the sole fstest that
exercises MAP_SYNC (generic/470) requires dm-logwrites and dm-thinp,
neither of which support fsdax on current kernels.
Reported-by: info@mobile-stream.com Reported-by: Fabrice Fontaine <fontaine.fabrice@gmail.com> Reported-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
I traced this down to getsubopt walking off the end of the dopts.subopts
array. The manpage says you're supposed to terminate the suboptions
string array with a NULL entry, but the structure definition uses
MAX_SUBOPTS/D_MAX_OPTS directly, which means there is no terminator.
Explicitly terminate each suboption array with a NULL entry after
making room for it.
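To illustrate why the terminator matters, here is a sketch with a
hypothetical two-entry option table. The lookup loop is a simplified
stand-in for the way a getsubopt-style scan walks the token array: it
stops only at a NULL entry, so the array needs one slot beyond the last
real option.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_D_AGCOUNT, TOY_D_SIZE, TOY_D_MAX_OPTS };

/* One extra slot beyond the real options so the last entry is NULL. */
static const char *const toy_dopts[TOY_D_MAX_OPTS + 1] = {
	[TOY_D_AGCOUNT]		= "agcount",
	[TOY_D_SIZE]		= "size",
	[TOY_D_MAX_OPTS]	= NULL,		/* explicit terminator */
};

/* Return the index of the named token, or -1 if it is not in the table. */
static int toy_find_subopt(const char *const *tokens, const char *name)
{
	for (int i = 0; tokens[i] != NULL; i++) {
		if (strcmp(tokens[i], name) == 0)
			return i;
	}
	return -1;
}
```

If the array were sized to exactly TOY_D_MAX_OPTS, a lookup for an
unknown name would read past its end, which is the failure described
above.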
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: explicitly add NULL terminators & clarify comment] Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
xfs_repair: Search for conflicts in inode_tree_ptrs[] when processing uncertain inodes
When processing an uncertain inode chunk record, if we lose 2 blocks worth of
inodes or 25% of the chunk, xfs_repair decides to ignore the chunk. Otherwise,
xfs_repair adds a new chunk record to inode_tree_ptrs[agno], marking each
inode as either free or used. However, before adding the new chunk record,
xfs_repair has to check for the existence of a conflicting record.
The existing code incorrectly checks for the conflicting record in
inode_uncertain_tree_ptrs[agno]. This check will succeed since the inode chunk
record being processed was originally obtained from
inode_uncertain_tree_ptrs[agno].
This commit fixes the bug by changing xfs_repair to search
inode_tree_ptrs[agno] for conflicts.
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Thu, 14 Jul 2022 01:58:25 +0000 (20:58 -0500)]
xfs_repair: ignore empty xattr leaf blocks
As detailed in the commit:
5e572d1a xfs: empty xattr leaf header blocks are not corruption
empty xattr leaf blocks can be the benign byproduct of the system
going down during the multi-step process of adding a large xattr
to a file that has no xattrs. If we find one at attr fork offset 0,
we should clear it, but this isn't a corruption.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: hexiaole <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 13 Jul 2022 22:20:56 +0000 (17:20 -0500)]
xfs_repair: check the rt summary against observations
Teach xfs_repair to check the ondisk realtime summary file against its
own observations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 13 Jul 2022 22:20:56 +0000 (17:20 -0500)]
xfs_repair: check the rt bitmap against observations
Teach xfs_repair to check the ondisk realtime bitmap against its own
observations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Wed, 13 Jul 2022 22:20:56 +0000 (17:20 -0500)]
xfs_repair: check free rt extent count
Check the superblock's free rt extent count against what we observed.
This increases the runtime and memory usage, but we can now report
undercounting frextents as a result of a logging bug in the kernel.
Note that repair has always fixed the undercount, but it no longer does
that silently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
The bigtime and inobtcount features are enabled by default by commit
1c08f0ae28b34d97b0a89c8483ef3c743914e85e ("mkfs: enable inobtcount and
bigtime by default"). This patch updates the manpage of mkfs to mention
this change.
Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 12 Jul 2022 18:30:33 +0000 (13:30 -0500)]
mkfs: always use new_diflags2 to initialize new inodes
The new_diflags2 field that's set in the inode geometry represents
features that we want enabled for /all/ newly created inodes.
Unfortunately, mkfs doesn't do that because xfs_flags2diflags2 doesn't
read new_diflags2. Change the new_diflags2 logic to match the kernel.
Without this fix, the root directory gets created without the
DIFLAG2_NREXT64 iflag set, but files created by a protofile /do/ have it
turned on.
This wasn't an issue with DIFLAG2_BIGTIME because xfs_trans_log_inode
quietly turns that on whenever possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 12 Jul 2022 18:28:33 +0000 (13:28 -0500)]
mkfs: preserve DIFLAG2_NREXT64 when setting other inode attributes
Preserve the state of the NREXT64 inode flag when we're changing the
other flags2 fields. This is only vital for the kernel version of this
function, but we should keep these in sync.
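The preserve-then-rebuild pattern described above might look roughly like this. It is a sketch under stated assumptions: the SKETCH_ constants and update_diflags2 are hypothetical stand-ins, not the real function or flag definitions.

```c
#include <stdint.h>

/* Illustrative flag values only; not copied from the XFS headers. */
#define SKETCH_DIFLAG2_DAX		(1ULL << 0)
#define SKETCH_DIFLAG2_COWEXTSIZE	(1ULL << 2)
#define SKETCH_DIFLAG2_NREXT64		(1ULL << 4)

/*
 * Hypothetical sketch: rebuild the user-settable flags2 bits from the
 * caller's request, but carry the NREXT64 bit over from the old value
 * so that changing other attributes never clears it.
 */
static uint64_t
update_diflags2(uint64_t old_flags2, int want_dax, int want_cowextsize)
{
	/* Start from the bit that must survive the update. */
	uint64_t flags2 = old_flags2 & SKETCH_DIFLAG2_NREXT64;

	if (want_dax)
		flags2 |= SKETCH_DIFLAG2_DAX;
	if (want_cowextsize)
		flags2 |= SKETCH_DIFLAG2_COWEXTSIZE;
	return flags2;
}
```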
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 12 Jul 2022 18:25:33 +0000 (13:25 -0500)]
xfs_copy: don't use cached buffer reads until after libxfs_mount
I accidentally tried to xfs_copy an ext4 filesystem, but instead of
rejecting the filesystem, the program crashed. I figured out
that zeroing the superblock was enough to trigger this:
The exact crash happens in this line from libxfs_getbuf_flags, which is
called from the main() routine of xfs_copy:
	if (btp == btp->bt_mount->m_ddev_targp) {
		(*bpp)->b_pag = xfs_perag_get(btp->bt_mount,
				xfs_daddr_to_agno(btp->bt_mount, blkno));
The problem here is that the uncached read filled the incore superblock
with zeroes, which means mbuf.sb_agblocks is zero. This causes a
division by zero in xfs_daddr_to_agno, thereby crashing the program.
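To see why a zeroed sb_agblocks is fatal, consider a simplified stand-in for the block-to-AG conversion. The function sketch_fsb_to_agno is hypothetical and the zero check is there only to make the hazard visible; it is not how this patch fixes the problem.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the block -> AG number
 * conversion: the AG number is the filesystem block number divided by
 * the AG size in blocks. Returns -1 instead of dividing when agblocks
 * is zero, which is what a zeroed superblock would supply.
 */
static int64_t
sketch_fsb_to_agno(uint64_t fsbno, uint32_t agblocks)
{
	if (agblocks == 0)
		return -1;	/* no valid geometry to divide by */
	return (int64_t)(fsbno / agblocks);
}
```

Without a check of this kind, the division itself traps, which matches the crash described above.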
In commit f8b581d6, we made it so that xfs_buf structures contain a
passive reference to the associated perag structure. That commit
assumes that no program would try a cached buffer read until the buffer
cache is fully set up, which is true throughout xfsprogs... except for
the beginning of xfs_copy. For whatever reason, it attempts an uncached
read of the superblock to figure out the real superblock size, then
performs a *cached* read with the proper buffer length and verifier.
The cached read crashes the program.
Fix the problem by changing the (second) cached read into an uncached read.
Fixes: f8b581d6 ("libxfs: actually make buffers track the per-ag structures") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 12 Jul 2022 18:23:33 +0000 (13:23 -0500)]
xfs_repair: don't flag log_incompat inconsistencies as corruptions
While testing xfs/233 and xfs/127 with LARP mode enabled, I noticed
errors such as the following:
xfs_growfs --BlockSize=4096 --Blocks=8192
data blocks changed from 8192 to 2579968
meta-data=/dev/sdf isize=512 agcount=630, agsize=4096 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1
data = bsize=4096 blocks=2579968, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=3075, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
_check_xfs_filesystem: filesystem on /dev/sdf is inconsistent (r)
*** xfs_repair -n output ***
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- 23:03:47: zeroing log - 3075 of 3075 blocks done
- scan filesystem freespace and inode maps...
would fix log incompat feature mismatch in AG 30 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 8 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 12 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 24 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 18 super, 0x0 != 0x1
<snip>
0x1 corresponds to XFS_SB_FEAT_INCOMPAT_LOG_XATTRS, which is the feature
bit used to indicate that the log contains extended attribute log intent
items. This is a mechanism to prevent older kernels from trying to
recover log items that they won't know how to recover.
I thought about this a little bit more, and realized that log_incompat
feature bits are set on the primary sb prior to writing certain types
of log records, and cleared once the log has written the committed
changes back to the filesystem. If the secondary superblocks are
updated at any point during that interval (due to things like growfs or
setting labels), the log_incompat field will now be set on the secondary
supers.
Due to the ephemeral nature of the current log_incompat feature bits,
a discrepancy between the primary and secondary supers is not a
corruption. If we're in dry run mode, we should log the discrepancy,
but that's not a reason to end with EXIT_FAILURE.
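The distinction between noting a discrepancy and treating it as corruption can be sketched as follows. All names here (struct scan_state, note_log_incompat, scan_exit_code) are hypothetical and do not correspond to the xfs_repair source; the message text is modelled on the output quoted above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for a dry-run scan. */
struct scan_state {
	int	benign_mismatches;	/* logged, but not fatal */
	int	corruptions;		/* these decide the exit code */
};

/*
 * Log a log_incompat difference between the primary and a secondary
 * superblock. The corruption counter is deliberately left alone.
 */
static void
note_log_incompat(struct scan_state *st, uint32_t primary,
		uint32_t secondary, unsigned int agno)
{
	if (primary == secondary)
		return;
	printf("would fix log incompat feature mismatch in AG %u super, 0x%x != 0x%x\n",
		agno, (unsigned int)primary, (unsigned int)secondary);
	st->benign_mismatches++;
}

/* Only real corruption turns the run into a failure. */
static int
scan_exit_code(const struct scan_state *st)
{
	return st->corruptions ? EXIT_FAILURE : EXIT_SUCCESS;
}
```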
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Darrick J. Wong [Tue, 12 Jul 2022 18:22:33 +0000 (13:22 -0500)]
xfs_repair: always rewrite secondary supers when needsrepair is set
Dave Chinner complained about xfs_scrub failures coming from xfs/158.
That test induces xfs_repair to fail while upgrading a filesystem to
have the inobtcount feature, and then restarts xfs_repair to finish the
upgrade. When the second xfs_repair run starts, it will find that the
primary super has NEEDSREPAIR set, along with whatever new feature
we were trying to add to the filesystem.
From there, repair completes the upgrade in much the same manner as the
first repair run would have, with one big exception -- it forgets to set
features_changed to trigger rewriting of the secondary supers at the end
of repair. This results in discrepancies between the supers:
# XFS_REPAIR_FAIL_AFTER_PHASE=2 xfs_repair -c inobtcount=1 /dev/sdf
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Adding inode btree counts to filesystem.
Killed
# xfs_repair /dev/sdf
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 1
- agno = 2
- agno = 0
- agno = 3
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
# xfs_db -c 'sb 0' -c 'print' -c 'sb 1' -c 'print' /dev/sdf | \
egrep '(features_ro_compat|features_incompat)'
features_ro_compat = 0xd
features_incompat = 0xb
features_ro_compat = 0x5
features_incompat = 0xb
Curiously, re-running xfs_repair will not trigger any warnings about the
featureset mismatch between the primary and secondary supers. xfs_scrub
immediately notices, which is what causes xfs/158 to fail.
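The differing bit in the xfs_db output above can be isolated with an XOR of the two features_ro_compat values. In this sketch the SKETCH_ constants are assumed to mirror the ro_compat feature bit positions of the XFS on-disk format, and ro_compat_diff is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed to mirror the XFS ro_compat feature bit positions. */
#define SKETCH_RO_COMPAT_FINOBT		(1U << 0)
#define SKETCH_RO_COMPAT_RMAPBT		(1U << 1)
#define SKETCH_RO_COMPAT_REFLINK	(1U << 2)
#define SKETCH_RO_COMPAT_INOBTCNT	(1U << 3)

/* Return the feature bits that differ between two superblocks. */
static uint32_t
ro_compat_diff(uint32_t primary, uint32_t secondary)
{
	return primary ^ secondary;
}
```

Under that assumption, ro_compat_diff(0xd, 0x5) leaves only the inobtcount bit, which is consistent with the secondary super never having received the upgraded feature set.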
This discrepancy doesn't happen when the upgrade completes successfully
in a single repair run, so we need to teach repair to rewrite the
secondaries at the end of repair any time needsrepair was set.
Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>