xfs: precalculate default inode attribute offset

Source kernel commit: b2941046ea85d2cd94b485831bf03402f34f4060

Default attr fork offset is based on inode size, so is a fixed
geometry parameter of the inode. Move it to the xfs_ino_geometry
structure and stop calculating it on every call to
xfs_default_attroffset().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: default attr fork size does not handle device inodes

Source kernel commit: 683ec9ba887d096a6cbd9a5778be9400efe6468c

Device inodes have a non-default data fork size of 8 bytes
as checked/enforced by xfs_repair. xfs_default_attroffset() doesn't
handle this, so let's do a minor refactor so it does.
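
A minimal sketch of the shape of that refactor, assuming a heavily
simplified model. The toy_ names and the two-value format enum are
hypothetical stand-ins, not the kernel's definitions; only the 8-byte
device data fork size is taken from the description above.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the inode data fork formats. */
enum toy_fork_format {
	TOY_FMT_DEV,		/* device inode */
	TOY_FMT_OTHER		/* everything else */
};

/* Data fork size that xfs_repair checks/enforces for device inodes. */
#define TOY_DEV_DATA_FORK_SIZE	8

/*
 * Return the default attr fork offset in bytes. geometry_default is the
 * value derived from the inode size, used for all non-device inodes.
 */
static uint32_t
toy_default_attroffset(enum toy_fork_format fmt, uint32_t geometry_default)
{
	if (fmt == TOY_FMT_DEV)
		return TOY_DEV_DATA_FORK_SIZE;
	return geometry_default;
}
```

The point of the sketch is only that the device case is special-cased
before falling back to the inode-size-derived default.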

Fixes: e6a688c33238 ("xfs: initialise attr fork on inode create")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Use struct xfs_bmdr_block instead of struct xfs_btree_block to calculate root node size

Source kernel commit: b6785e279d53ca5c4fa6be1146e85000870d73ef

The incore data fork of an inode stores the bmap btree root node as 'struct
xfs_btree_block'. However, the ondisk version of the inode stores the bmap
btree root node as a 'struct xfs_bmdr_block'.

xfs_bmap_add_attrfork_btree() checks if the btree root node fits inside the
data fork of the inode. However, it incorrectly uses 'struct xfs_btree_block'
to compute the size of the bmap btree root node. Since size of 'struct
xfs_btree_block' is larger than that of 'struct xfs_bmdr_block',
xfs_bmap_add_attrfork_btree() could end up unnecessarily demoting the current
root node as the child of newly allocated root node.

This commit optimizes space usage by modifying xfs_bmap_add_attrfork_btree()
to use 'struct xfs_bmdr_block' to check if the bmap btree root node fits
inside the data fork of the inode.
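
As a rough illustration of why the header type matters for the fit
check, here is a sketch with made-up structures. The toy_ layouts and
the byte counts used with them are assumptions for illustration and do
not reproduce the real on-disk or incore definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk root header: just level and record count. */
struct toy_bmdr_block {
	uint16_t level;
	uint16_t numrecs;
};

/* Hypothetical incore block header: also carries sibling pointers. */
struct toy_btree_block {
	uint32_t magic;
	uint16_t level;
	uint16_t numrecs;
	uint64_t leftsib;
	uint64_t rightsib;
};

/* Does a root with payload_bytes of keys/pointers fit in the data fork? */
static bool
toy_root_fits(size_t header_bytes, size_t payload_bytes, size_t fork_bytes)
{
	return header_bytes + payload_bytes <= fork_bytes;
}
```

With a payload that nearly fills the fork, the check passes when sized
with the smaller header and fails when sized with the larger one, which
is the unnecessary demotion described above.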

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: deprecate BMV_IF_NO_DMAPI_READ flag

Source kernel commit: fcb62c28031eeeb626392e6a338a90dedbdecf1c

Use of the flag has had no effect since kernel commit 288699fecaff
("xfs: drop dmapi hooks"), which removed all dmapi related code, so
deprecate it.

Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_crtime field to struct xfs_inode

Source kernel commit: e98d5e882b3ccb0f7f38d4e893fe60c1dd7934db

Move the crtime field from struct xfs_icdinode into struct xfs_inode and
remove the now entirely unused struct xfs_icdinode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_flags2 field to struct xfs_inode

Source kernel commit: 3e09ab8fdc4d4c9d0afee7a63a3b39e5ade3c863

In preparation of removing the historic icinode struct, move the flags2
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_flags field to struct xfs_inode

Source kernel commit: db07349da2f564742c0f23528691991e641e315e

In preparation of removing the historic icinode struct, move the flags
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_forkoff field to struct xfs_inode

Source kernel commit: 7821ea302dca72469c558e382d6e4ae09232b7a7

In preparation of removing the historic icinode struct, move the
forkoff field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use a union for i_cowextsize and i_flushiter

Source kernel commit: ee7b83fd365e32beaa405d60b8c42f42ec5f42c2

The i_cowextsize field is only used for v3 inodes, and the i_flushiter
field is only used for v1/v2 inodes. Use a union to pack the inode a
little better after adding a few missing guards around their usage.
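
A sketch of the idea, assuming a trimmed-down inode. The toy_inode
structure, its field widths, and the accessor names are hypothetical;
the anonymous union requires C11 or a compiler extension.

```c
#include <stdint.h>

/* Hypothetical, heavily trimmed inode: only the fields discussed here. */
struct toy_inode {
	uint8_t		version;	/* 1, 2 or 3 */
	union {
		uint32_t	cowextsize;	/* v3 inodes only */
		uint16_t	flushiter;	/* v1/v2 inodes only */
	};
};

/* Guarded accessor: only v3 inodes carry a CoW extent size hint. */
static uint32_t
toy_get_cowextsize(const struct toy_inode *ip)
{
	return ip->version >= 3 ? ip->cowextsize : 0;
}

/* Guarded accessor: only v1/v2 inodes carry a flush iteration counter. */
static uint16_t
toy_get_flushiter(const struct toy_inode *ip)
{
	return ip->version < 3 ? ip->flushiter : 0;
}
```

The guards are what make the union safe: each member is only read for
the inode versions that actually use it.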

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_flushiter field to struct xfs_inode

Source kernel commit: 965e0a1ad273ba61a8040220ef8ec09c9d065875

In preparation of removing the historic icinode struct, move the
flushiter field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_cowextsize field to struct xfs_inode

Source kernel commit: b33ce57d3e61020328582ce6d7dbae1d694ac496

In preparation of removing the historic icinode struct, move the
cowextsize field into the containing xfs_inode structure. Also
switch to use the xfs_extlen_t instead of a uint32_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_extsize field to struct xfs_inode

Source kernel commit: 031474c28a3a9a2772a715d1ec9770f9068ea5a4

In preparation of removing the historic icinode struct, move the extsize
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_nblocks field to struct xfs_inode

Source kernel commit: 6e73a545f91e128d8dd7da1769dca200225f5d82

In preparation of removing the historic icinode struct, move the nblocks
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_size field to struct xfs_inode

Source kernel commit: 13d2c10b05d8e67cb9b4c2d1d4a09a906148a72e

In preparation of removing the historic icinode struct, move the on-disk
size field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move the di_projid field to struct xfs_inode

Source kernel commit: ceaf603c7024d3c021803a3e90e893feda8d76e2

In preparation of removing the historic icinode struct, move the projid
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the di_dmevmask and di_dmstate fields from struct xfs_icdinode

Source kernel commit: 9b3beb028ff5bed99473021d1a7de8747665ac32

The legacy DMAPI fields were never set by upstream Linux XFS, and have no
way to be read using the kernel APIs. So instead of bloating the in-core
inode for them just copy them from the on-disk inode into the log when
logging the inode. The only caveat is that we need to make sure to zero
the fields for newly read or deleted inodes, which is solved using a new
flag in the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove the unused xfs_icdinode_has_bigtime helper

Source kernel commit: 55f773380e922d3b975a7acb24331c76611cce30

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: split xfs_imap_to_bp

Source kernel commit: af9dcddef662e1437a63c2decb6e1e2efb7d81ea

Split looking up the dinode from xfs_imap_to_bp, which can be
significantly simplified as a result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Initialize xfs_alloc_arg->total correctly when allocating minlen extents

Source kernel commit: 6e8bd39d7227747f13c891bc5a5fea00373d4bb9

xfs/538 can cause the following call trace to be printed when executing on a
multi-block directory configuration,

WARNING: CPU: 1 PID: 2578 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
Call Trace:
? xfs_buf_rele+0x4f/0x450
xfs_bmap_add_extent_hole_real+0x747/0x960
xfs_bmapi_allocate+0x39a/0x440
xfs_bmapi_write+0x507/0x9e0
xfs_da_grow_inode_int+0x1cd/0x330
? up+0x12/0x60
xfs_dir2_grow_inode+0x62/0x110
? xfs_trans_log_inode+0x234/0x2d0
xfs_dir2_sf_to_block+0x103/0x940
? xfs_dir2_sf_check+0x8c/0x210
? xfs_da_compname+0x19/0x30
? xfs_dir2_sf_lookup+0xd0/0x3d0
xfs_dir2_sf_addname+0x10d/0x910
xfs_dir_createname+0x1ad/0x210
xfs_create+0x404/0x620
xfs_generic_create+0x24c/0x320
path_openat+0xda6/0x1030
do_filp_open+0x88/0x130
? kmem_cache_alloc+0x50/0x210
? __cond_resched+0x16/0x40
? kmem_cache_alloc+0x50/0x210
do_sys_openat2+0x97/0x150
__x64_sys_creat+0x49/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae

This occurs because xfs_bmap_exact_minlen_extent_alloc() initializes
xfs_alloc_arg->total to xfs_bmalloca->minlen. In the context of
xfs_bmap_exact_minlen_extent_alloc(), xfs_bmalloca->minlen has a value of 1
and hence the space allocator could choose an AG which has less than
xfs_bmalloca->total number of free blocks available. As the transaction
proceeds, one of the future space allocation requests could fail due to
non-availability of free blocks in the AG that was originally chosen.

This commit fixes the bug by assigning xfs_alloc_arg->total to the value of
xfs_bmalloca->total.
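
The effect can be pictured with a toy model of AG selection. This is
only a sketch: toy_pick_ag() and its first-fit policy are hypothetical
and far simpler than the real allocator.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy allocator: return the index of the first AG whose free block count
 * covers `total`, or -1 if none does.
 */
static int
toy_pick_ag(const uint32_t *ag_free, size_t ag_count, uint32_t total)
{
	for (size_t i = 0; i < ag_count; i++) {
		if (ag_free[i] >= total)
			return (int)i;
	}
	return -1;
}
```

If the caller passes a total of 1 (the minlen value) while the
transaction will really need more blocks, the toy picks a nearly full
AG; passing the true total steers it to an AG that can satisfy the
later allocations.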

Fixes: 301519674699 ("xfs: Introduce error injection to allocate only minlen size extents for files")
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: add error injection for per-AG resv failure

Source kernel commit: 2b92faed551173f065ee2a8cf087dc76cf40303b

Per-AG resv failure after fixing up freespace is hard to test in an
effective way, so directly add an error injection path to observe
that such an error handling path works as expected.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: introduce xfs_ag_shrink_space()

Source kernel commit: 46141dc891f7d28cc5cac473ad1a54a312c021c1

This patch introduces a helper to shrink unused space in the last AG
by fixing up the freespace btree.

Also make sure that the per-AG reservation works under the new AG
size. If such per-AG reservation or extent allocation fails, roll
the transaction so the new transaction could cancel without any side
effects.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: reduce debug overhead of dir leaf/node checks

Source kernel commit: 1fea323ff00526dcc04fbb4ee6e7d04e4e2ab0e1

On debug kernels, we call xfs_dir3_leaf_check_int() multiple times
on every directory modification. The robust hash ordering checks it
does on every entry in the leaf on every call result in a massive
CPU overhead which slows down debug kernels by a large amount.

We use xfs_dir3_leaf_check_int() for the verifiers as well, so we
can't just gut the function to reduce overhead. What we can do,
however, is reduce the work it does when it is called from the
debug interfaces, just leaving the high level checks in place and
leaving the robust validation to the verifiers. This means the debug
checks will catch gross errors, but subtle bugs might not be caught
until a verifier is run.

It is easy enough to restore the existing debug behaviour if the
developer needs it (just change a call parameter in the debug code),
but otherwise the overhead makes testing large directory block sizes
on debug kernels very slow.

Profile at an unlink rate of ~80k files/s on a 64k block size
filesystem before the patch:

40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
10.98%  [kernel]  [k] __xfs_dir3_data_check
8.10%  [kernel]  [k] xfs_verify_dir_ino
4.42%  [kernel]  [k] memcpy
2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
1.52%  [kernel]  [k] do_raw_spin_lock

Profile after, at an unlink rate of ~125k files/s (+50% improvement)
has largely dropped the leaf verification debug overhead out of the
profile.

16.53%  [kernel]  [k] __xfs_dir3_data_check
12.53%  [kernel]  [k] xfs_verify_dir_ino
7.97%  [kernel]  [k] memcpy
3.36%  [kernel]  [k] xfs_dir2_data_get_ftype
2.86%  [kernel]  [k] __pv_queued_spin_lock_slowpath

Create shows a similar change in profile and a +25% improvement in
performance.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: No need for inode number error injection in __xfs_dir3_data_check

Source kernel commit: 39d3c0b5968b5421922e2fc939b6d6158df8ac1c

We call xfs_dir_ino_validate() for every dir entry in a directory
when doing validity checking of the directory. It calls
xfs_verify_dir_ino() then emits a corruption report if bad or does
error injection if good. It is extremely costly:

43.27%  [kernel]  [k] xfs_dir3_leaf_check_int
10.28%  [kernel]  [k] __xfs_dir3_data_check
6.61%  [kernel]  [k] xfs_verify_dir_ino
4.16%  [kernel]  [k] xfs_errortag_test
4.00%  [kernel]  [k] memcpy
3.48%  [kernel]  [k] xfs_dir_ino_validate

7% of the cpu usage in this directory traversal workload is
xfs_dir_ino_validate() doing absolutely nothing.

We don't need error injection to simulate bad inode numbers in the
directory structure because we can do that by fuzzing the structure
on disk.

And we don't need a corruption report, because the
__xfs_dir3_data_check() will emit one if the inode number is bad.

So just call xfs_verify_dir_ino() directly here, and get rid of all
this unnecessary overhead:

40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
10.98%  [kernel]  [k] __xfs_dir3_data_check
8.10%  [kernel]  [k] xfs_verify_dir_ino
4.42%  [kernel]  [k] memcpy
2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
1.52%  [kernel]  [k] do_raw_spin_lock

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: type verification is expensive

Source kernel commit: ec08c14ba28ce073b3f63c8edbee0f3c38e1b6a1

From a concurrent rm -rf workload:

41.04%  [kernel]  [k] xfs_dir3_leaf_check_int
9.85%  [kernel]  [k] __xfs_dir3_data_check
5.60%  [kernel]  [k] xfs_verify_ino
5.32%  [kernel]  [k] xfs_agino_range
4.21%  [kernel]  [k] memcpy
3.06%  [kernel]  [k] xfs_errortag_test
2.57%  [kernel]  [k] xfs_dir_ino_validate
1.66%  [kernel]  [k] xfs_dir2_data_get_ftype
1.17%  [kernel]  [k] do_raw_spin_lock
1.11%  [kernel]  [k] xfs_verify_dir_ino
0.84%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
0.83%  [kernel]  [k] xfs_buf_find
0.64%  [kernel]  [k] xfs_log_commit_cil

There's an awful lot of overhead in just range checking inode
numbers in that, but each inode number check is not a lot of code.
In total, a bit over 14.5% of the CPU time is spent validating
inode numbers.

The problem is that they are deeply nested global scope functions so the
overhead here is all in function call marshalling.

text    data     bss     dec     hex filename
2077       0       0    2077     81d fs/xfs/libxfs/xfs_types.o.orig
2197       0       0    2197     895 fs/xfs/libxfs/xfs_types.o

There's a small increase in binary size by inlining all the local
nested calls in the verifier functions, but the same workload now
profiles as:

40.69%  [kernel]  [k] xfs_dir3_leaf_check_int
10.52%  [kernel]  [k] __xfs_dir3_data_check
6.68%  [kernel]  [k] xfs_verify_dir_ino
4.22%  [kernel]  [k] xfs_errortag_test
4.15%  [kernel]  [k] memcpy
3.53%  [kernel]  [k] xfs_dir_ino_validate
1.87%  [kernel]  [k] xfs_dir2_data_get_ftype
1.37%  [kernel]  [k] do_raw_spin_lock
0.98%  [kernel]  [k] xfs_buf_find
0.94%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
0.73%  [kernel]  [k] xfs_log_commit_cil

Now we only spend just over 10% of the time validating inode numbers
for the same workload. Hence a few "inline" keywords are good enough
to reduce the validation overhead by 30%...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: initialise attr fork on inode create

Source kernel commit: e6a688c3323840f3e388ba28fd2db86675b79917

When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.

This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.

This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:

18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
3.57%  [kernel]                [k] do_raw_spin_lock
2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
2.48%  [kernel]                [k] memcpy
2.34%  [kernel]                [k] xfs_log_commit_cil

The typical profile resulting from running fsmark on a selinux enabled
filesystem adds this overhead to the create path:

- 15.30% xfs_init_security
- 15.23% security_inode_init_security
- 13.05% xfs_initxattrs
- 12.94% xfs_attr_set
- 6.75% xfs_bmap_add_attrfork
- 5.51% xfs_trans_commit
- 5.48% __xfs_trans_commit
- 5.35% xfs_log_commit_cil
- 3.86% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.70% xfs_trans_alloc
0.52% xfs_trans_reserve
- 5.41% xfs_attr_set_args
- 5.39% xfs_attr_set_shortform.constprop.0
- 4.46% xfs_trans_commit
- 4.46% __xfs_trans_commit
- 4.33% xfs_log_commit_cil
- 2.74% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
0.60% xfs_inode_item_format
0.90% xfs_attr_try_sf_addname
- 1.99% selinux_inode_init_security
- 1.02% security_sid_to_context_force
- 1.00% security_sid_to_context_core
- 0.92% sidtab_entry_to_string
- 0.90% sidtab_sid2str_get
0.59% sidtab_sid2str_put.part.0
- 0.82% selinux_determine_inode_label
- 0.77% security_transition_sid
0.70% security_compute_sid.part.0

And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.

So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:

- 8.99% xfs_init_security
- 9.00% security_inode_init_security
- 6.43% xfs_initxattrs
- 6.37% xfs_attr_set
- 5.45% xfs_attr_set_args
- 5.42% xfs_attr_set_shortform.constprop.0
- 4.51% xfs_trans_commit
- 4.54% __xfs_trans_commit
- 4.59% xfs_log_commit_cil
- 2.67% _raw_spin_lock
- 3.28% do_raw_spin_lock
3.08% __pv_queued_spin_lock_slowpath
0.66% xfs_inode_item_format
- 0.90% xfs_attr_try_sf_addname
- 0.60% xfs_trans_alloc
- 2.35% selinux_inode_init_security
- 1.25% security_sid_to_context_force
- 1.21% security_sid_to_context_core
- 1.19% sidtab_entry_to_string
- 1.20% sidtab_sid2str_get
- 0.86% sidtab_sid2str_put.part.0
- 0.62% _raw_spin_lock_irqsave
- 0.77% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.84% selinux_determine_inode_label
- 0.83% security_transition_sid
0.86% security_compute_sid.part.0

Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix compilation error when CONFIG_SECURITY=n]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: prevent metadata files from being inactivated

Source kernel commit: 383e32b0d0db464dc53052a97bf7f9ee3a1937cc

Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic.  This means that they should
never be automatically subjected to post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.

In other words, xfs_inactive() should leave these files alone.  Add the
necessary predicate functions to make this happen.  This adds a second
layer of prevention for the kind of fs corruption that was fixed by
commit f4c32e87de7d; for these files, we should make all those metadata
updates explicit.

Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h.

Followup-to: f4c32e87de7d ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: validate ag btree levels using the precomputed values

Source kernel commit: 973975b72a36ee86c8c59057f06fcde03478ff4f

Use the AG btree height limits that we precomputed into the xfs_mount to
validate the AG headers instead of using XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: refactor resetting incore dinode fields to zero

Refactor the repair code that resets inode fields when we want to
recreate the rt bitmap, rt summary, and root directories. This will
become important as we deconstruct xfs_dinode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.12.0

Update all the necessary files for a 5.12.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.12.0-rc1

Update all the necessary files for a 5.12.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: copy crtime correctly now that it's timespec64

The incore i_mtime and di_crtime are both timespec64 now, which means
that tv_sec is a 64-bit value. Don't cast that to int32_t when we're
creating an inode, because we'll end up truncating the creation time
incorrectly, should an xfsprogs of this vintage make it to 2039. :P
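
A small sketch of the range problem, assuming nothing beyond standard C
integer types. The helper name is hypothetical and is not part of
libxfs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Would a 64-bit seconds value survive a round trip through int32_t? */
static bool
toy_fits_in_int32(int64_t tv_sec)
{
	return tv_sec >= INT32_MIN && tv_sec <= INT32_MAX;
}
```

Any tv_sec for which this returns false would be mangled by an
(int32_t) cast, which is why the value should be copied at full width.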

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: reject cowextsize after making final decision about reflink support

There's a snippet of code that rejects cowextsize option if reflink is
disabled. This really ought to be /after/ the last place where we can
turn off reflink. Fix it so that people don't see stuff like this:

$ mkfs.xfs -r rtdev=b.img a.img -f -d cowextsize=16
illegal CoW extent size hint 16, must be less than 9600.

(reflink isn't supported when realtime is enabled)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: fix an uninitialized variable issue

Coverity reported an uninitialized variable issue: it seems the
following for-loop can be exited early with isblock == 1 while bp is
still uninitialized.

To handle that case, initialize bp to NULL in advance.

Fixes: 1f7c7553489c ("repair: don't duplicate names in phase 6")
Fixes-coverity-id: 1476291
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_growfs: support shrinking unused space

This allows shrink operations to be passed into the kernel. Currently,
only shrinking unused space in the tail AG works.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: remove BMV_IF_NO_DMAPI_READ flag

Use of the flag has had no effect since kernel commit 288699fecaff
("xfs: drop dmapi hooks"), which removed all dmapi related code, so
remove it from bmap.

Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: don't default to the physical sector size if > XFS_MAX_SECTORSIZE

Hi,

In testing on ppc64, I ran into the following error when making a file
system:

# ./mkfs.xfs -b size=65536 -f /dev/ram0
illegal sector size 65536

Which is odd, because I didn't specify a sector size! The problem is
that validate_sectorsize defaults to the physical sector size, but in
this case, the physical sector size is bigger than XFS_MAX_SECTORSIZE.

# cat /sys/block/ram0/queue/physical_block_size
65536

Fall back to the default (logical sector size) if the physical sector
size is greater than XFS_MAX_SECTORSIZE.
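
The fallback rule can be sketched as below. This is a simplified
illustration, not the code in validate_sectorsize: the function name is
hypothetical and the maximum is passed in as a parameter standing in
for XFS_MAX_SECTORSIZE.

```c
#include <stdint.h>

/*
 * Pick the default sector size: prefer the physical size, but fall back
 * to the logical size when the physical size exceeds the supported maximum.
 */
static uint32_t
toy_default_sectorsize(uint32_t logical, uint32_t physical, uint32_t max)
{
	if (physical > max)
		return logical;
	return physical;
}
```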

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: shorten subject, rebase]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_logprint: Fix buffer overflow printing quotaoff

xlog_recover_print_quotaoff() was using a static buffer to aggregate
quota option strings to be printed at the end. The buffer size was
miscalculated and when printing all 3 flags, a buffer overflow occurs
crashing xfs_logprint, like:

QOFF: cnt:1 total:1 a:0x560530ff3bb0 len:160
*** buffer overflow detected ***: terminated
Aborted (core dumped)

Fix this by removing the static buffer and using printf() directly to
print each flag. Also add a trailing space before each flag, so they
are a bit more readable on the output.

Reported-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: scale duplicate name checking in phase 6.

phase 6 on large directories is cpu bound on duplicate name checking
due to the algorithm having effectively O(n^2) scalability. Hence
when the duplicate name hash table size is far smaller than the
number of directory entries, we end up with long hash chains that
are searched linearly on every new entry that is found in the
directory to do duplicate detection.

The in-memory hash table size is limited to 64k entries. Hence when
we have millions of entries in a directory, duplicate entry lookups
on the hash table have substantial overhead. Scale this table out to
larger sizes so that we keep the chain lengths short and hence the
O(n^2) scalability impact is limited because N is always small.

For a 10M entry directory consuming 400MB of directory data, the
hash table now sizes at 6.4 million entries instead of ~64k - it is
~100x larger. While the hash table now consumes ~50MB of RAM, the
xfs_repair footprint barely changes as it's already consuming
~9GB of RAM at this point in time. IOWs, the incremental memory
usage change is noise, but the directory checking time:

Unpatched:

  97.11%  xfs_repair          [.] dir_hash_add
   0.38%  xfs_repair          [.] longform_dir2_entry_check_data
   0.34%  libc-2.31.so        [.] __libc_calloc
   0.32%  xfs_repair          [.] avl_ino_start

Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds

Patched:

  46.74%  xfs_repair          [.] radix_tree_lookup
  32.13%  xfs_repair          [.] dir_hash_see_all
   7.70%  xfs_repair          [.] radix_tree_tag_get
   3.92%  xfs_repair          [.] dir_hash_add
   3.52%  xfs_repair          [.] radix_tree_tag_clear
   2.43%  xfs_repair          [.] crc32c_le

Phase 6:        10/22 13:11:01  10/22 13:11:18  17 seconds

has been reduced by an order of magnitude.
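
For a back-of-the-envelope feel for the chain lengths involved, here is
a sketch that assumes entries spread evenly over the buckets. The helper
is hypothetical and is not how xfs_repair sizes its table.

```c
#include <stdint.h>

/* Average hash chain length (rounded down) for evenly spread entries. */
static uint64_t
toy_avg_chain_len(uint64_t nr_entries, uint64_t nr_buckets)
{
	return nr_buckets ? nr_entries / nr_buckets : 0;
}
```

Under that assumption, and taking 64k as 65536 buckets, 10M entries
average roughly 150 entries per chain, while 6.4 million buckets bring
the average down to between one and two.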

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: convert the dir byaddr hash to a radix tree

Phase 6 uses a hash table to track the data segment addresses of the
entries it has seen in a directory. This is indexed by the offset
into the data segment for the dirent, and is used to check if the
entry exists, is a duplicate or has a bad hash value. The lookup
operations involve walking long hash chains on large directories and
they are done for every entry in the directory. This means certain
operations have O(n^2) scalability (or worse!) and hence hurt on
very large directories.

It is also used to determine if the directory has unseen entries,
which involves a full hash traversal that is very expensive on large
directories. Hence the directory checking for unseen ends up being
roughly an O(n^2 + n) algorithm.

Switch the byaddr indexing to a radix tree. While a radix tree will
burn more memory than the linked list, it gives us O(log n) lookup
operations instead of O(n) on large directories, and use of tags
gives us O(1) determination of whether all entries have been seen or
not. This brings the "entry seen" algorithm scalability back to
O(n log n) and so is a major improvement for processing large
directories.

Given a filesystem with 10M empty files in a single directory, we
see:

5.6.0:

  97.56%  xfs_repair              [.] dir_hash_add.lto_priv.0
   0.38%  xfs_repair              [.] avl_ino_start.lto_priv.0
   0.37%  libc-2.31.so            [.] malloc
   0.34%  xfs_repair              [.] longform_dir2_entry_check_data.lto_priv.0

Phase 6:        10/22 12:07:13  10/22 12:10:51  3 minutes, 38 seconds

Patched:

  97.11%  xfs_repair          [.] dir_hash_add
   0.38%  xfs_repair          [.] longform_dir2_entry_check_data
   0.34%  libc-2.31.so        [.] __libc_calloc
   0.32%  xfs_repair          [.] avl_ino_start

Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds

So there's some improvement, but we are clearly still CPU bound due
to the O(n^2) scalability of the duplicate name checking algorithm.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: don't duplicate names in phase 6

The name hash in phase 6 is constructed by using names that point
directly into the directory buffers. Hence before the buffers can be
released, the constructed name hash has to duplicate all those names
into memory it owns via dir_hash_dup_names().

Given that the structure that holds the name is dynamically
allocated, it makes no sense to store a pointer to the name in
dir_hash_add() and then later have to dynamically allocate the name.

Extend the name hash allocation to contain space for the name
itself, and copy the name into the name hash structure in
dir_hash_add(). This allows us to get rid of dir_hash_dup_names(),
and the directory checking code no longer needs to hold all the
directory buffers in memory until the entire directory walk is
complete and the names duplicated.
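
A minimal sketch of this allocation pattern, assuming a simplified
entry layout: the structure ends in a C99 flexible array member, so one
malloc() covers both the entry and its name. The name_ent structure and
name_ent_alloc() are hypothetical stand-ins, not the actual xfs_repair
definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified name-hash entry; not the real xfs_repair struct. */
struct name_ent {
	struct name_ent *next;     /* hash chain */
	uint32_t         hashval;
	size_t           namelen;
	unsigned char    name[];   /* name bytes stored inline */
};

/*
 * Allocate the entry and the space for its name in one call, and copy the
 * name out of the caller's buffer so that the buffer can be released
 * immediately afterwards.
 */
static struct name_ent *name_ent_alloc(const unsigned char *name,
				       size_t namelen, uint32_t hashval)
{
	struct name_ent *p;

	assert(name != NULL);
	p = malloc(sizeof(*p) + namelen);
	if (!p)
		return NULL;
	p->next = NULL;
	p->hashval = hashval;
	p->namelen = namelen;
	memcpy(p->name, name, namelen);
	return p;
}
```

Because the entry owns its copy of the name from the moment it is
created, a single free() releases both, and no later pass over the hash
is needed to duplicate names.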

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: parallelise phase 6

A recent metadump provided to us caused repair to take hours in
phase 6. It wasn't IO bound - it was fully CPU bound the entire time.
The only way to speed it up is to make phase 6 run multiple
concurrent processing threads.

The obvious way to do this is to spread the concurrency across AGs,
like the other phases, and while this works it is not optimal. When
a processing thread hits a really large directory, it essentially
sits CPU bound until that directory is processed. If an AG has lots
of large directories, we end up with a really long single threaded
tail that limits concurrency.

Hence we also need to have concurrency /within/ the AG. This is
relatively easy, as the inode chunk records allow for a simple
concurrency mechanism within an AG. We can simply feed each chunk
record to a workqueue, and we get concurrency within the AG for
free. However, this allows prefetch to run way ahead of processing
and this blows out the buffer cache size and can cause OOM.

However, we can use the new workqueue depth limiting to limit the
number of inode chunks queued, and this then backs up the inode
prefetching to its maximum queue depth. Throttling the prefetch
consumer in this way prevents the prefetch code from queueing an
entire AG's inode chunks on the workqueue and blowing out memory.

This takes phase 6 from taking many, many hours down to:

Phase 6: 10/30 21:12:58 10/30 21:40:48 27 minutes, 50 seconds

And burning 20-30 CPUs that entire time on my test rig.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: protect inode chunk tree records with a mutex

Phase 6 accesses inode chunk records mostly in an isolated manner.
However, when it finds a corruption in a directory or there are
multiple hardlinks to an inode, there can be concurrent access
to the inode chunk record to update state.

Hence the inode record itself needs a mutex. This protects all state
changes within the inode chunk record, as well as inode link counts
and chunk references. That allows us to process multiple chunks at
once, providing concurrency within an AG as well as across AGs.

The inode chunk tree itself is not modified in the directory
scanning and rebuilding part of phase 6 which we are making
concurrent, hence we do not need to worry about locking for AVL tree
lookups to find the inode chunk records themselves. Therefore
internal locking is all we need here.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

repair: Protect bad inode list with mutex

To enable phase 6 parallelisation, we need to protect the bad inode
list from concurrent modification and/or access. Wrap it with a
mutex and clean up the nasty typedefs.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

workqueue: bound maximum queue depth

Existing users of workqueues have bound maximum queue depths in
their external algorithms (e.g. prefetch counts). For parallelising
work that doesn't have an external bound, allow workqueues to
throttle incoming requests at a maximum bound. Bounded workqueues
also need to distribute work over all worker threads themselves as
there is no external bounding or worker function throttling
provided.

Existing callers are not throttled and retain direct control of
worker threads; only users of the new create interface will be
throttled and have their concurrency managed.
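
The throttling behaviour can be sketched with a mutex and two condition
variables. This is only an illustration of a depth-bounded queue built
on POSIX threads, not the libfrog workqueue API: bounded_queue, bq_init,
bq_add, bq_take and bq_worker are all hypothetical names, and the work
items are plain integers.

```c
#include <assert.h>
#include <pthread.h>

#define BQ_MAX_DEPTH 64

struct bounded_queue {
	pthread_mutex_t lock;
	pthread_cond_t  not_full;   /* producers wait here at max depth */
	pthread_cond_t  not_empty;  /* workers wait here when idle */
	int             items[BQ_MAX_DEPTH];
	unsigned        head, count, depth;
};

static void bq_init(struct bounded_queue *q, unsigned depth)
{
	assert(depth > 0 && depth <= BQ_MAX_DEPTH);
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->not_full, NULL);
	pthread_cond_init(&q->not_empty, NULL);
	q->head = q->count = 0;
	q->depth = depth;
}

/* Blocks the caller while the queue already holds `depth` items. */
static void bq_add(struct bounded_queue *q, int item)
{
	pthread_mutex_lock(&q->lock);
	while (q->count == q->depth)
		pthread_cond_wait(&q->not_full, &q->lock);
	q->items[(q->head + q->count) % q->depth] = item;
	q->count++;
	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

/* Blocks the worker while the queue is empty. */
static int bq_take(struct bounded_queue *q)
{
	int item;

	pthread_mutex_lock(&q->lock);
	while (q->count == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);
	item = q->items[q->head];
	q->head = (q->head + 1) % q->depth;
	q->count--;
	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
	return item;
}

struct bq_worker_arg {
	struct bounded_queue *q;
	int                   nitems;  /* how many items to consume */
	long                  sum;     /* result: sum of consumed items */
};

/* Example worker: consume a fixed number of items and add them up. */
static void *bq_worker(void *arg)
{
	struct bq_worker_arg *w = arg;
	int i;

	for (i = 0; i < w->nitems; i++)
		w->sum += bq_take(w->q);
	return NULL;
}
```

With a depth of 2, a producer that calls bq_add() in a loop can never
run more than two items ahead of the worker, which is the kind of
back-pressure the bounded interface is meant to provide.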

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

Add dax mount option to man xfs(5)

Details are already in kernel's documentation, but make dax mount option
information accessible through xfs(5) manpage.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: include <signal.h> for platform_crash

Needed for kill(2) prototype and SIGKILL definition.
Fixes build on musl 1.1.24.

Signed-off-by: Leah Neukirchen <leah@vuxu.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_admin: pick up log arguments correctly

In commit ab9d8d69, we added support to xfs_admin to pass an external
log to xfs_db and xfs_repair. Unfortunately, we didn't do this
correctly -- by appending the log arguments to DB_OPTS, we now guarantee
an invocation of xfs_db when we don't have any work for it to do.

Brian Foster noticed that this results in xfs/764 hanging fstests
because xfs_db (when not compiled with libeditline) will wait for input
on stdin. I didn't notice because my build includes libeditline and my
test runner script does silly things with pipes such that xfs_db would
exit immediately.

Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: ab9d8d69 ("xfs_admin: support adding features to V5 filesystems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libfrog: report inobtcount in geometry

Report the inode btree counter feature in fs feature reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.12.0-rc0

Update all the necessary files for a 5.12.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use current->journal_info for detecting transaction recursion

Source kernel commit: 756b1c343333a5aefcc26b0409f3fd16f72281bf

Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS specific internal checks using
current->journal_info instead.

[djwong: This change also realigns the lifetime of NOFS flag changes to
match the incore transaction, instead of the inconsistent scheme we have
now.]

Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: consider shutdown in bmapbt cursor delete assert

Source kernel commit: 1cd738b13ae9b29e03d6149f0246c61f76e81fcf

The assert in xfs_btree_del_cursor() checks that the bmapbt block
allocation field has been handled correctly before the cursor is
freed. This field is used for accurate calculation of indirect block
reservation requirements (for delayed allocations), for example.
generic/019 reproduces a scenario where this assert fails because
the filesystem has shutdown while in the middle of a bmbt record
insertion. This occurs after a bmbt block has been allocated via the
cursor but before the higher level bmap function (i.e.
xfs_bmap_add_extent_hole_real()) completes and resets the field.

Update the assert to accommodate the transient state if the
filesystem has shutdown. While here, clean up the indentation and
comments in the function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: expose inobtcount in xfs geometry

Source kernel commit: bc41fa5321f93ecbabec177f888451cfc17ad66d

As xfs now supports the inode btree block counters feature, expose
this feature flag in the xfs geometry so that userspace can check
whether inobtcount is enabled.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: allow reservation of rtblocks with xfs_trans_alloc_inode

Source kernel commit: 3de4eb106fcc97f086b78bd17a0c3529691e8259

Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode
wrapper function, then convert a few more callsites.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: refactor common transaction/inode/quota allocation idiom

Source kernel commit: 3a1af6c317d0a55524f39079183be107be4c1f39

Create a new helper xfs_trans_alloc_inode that allocates a transaction,
locks and joins an inode to it, and then reserves the appropriate amount
of quota against that transaction. Then replace all the open-coded
idioms with a single call to this helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: reserve data and rt quota at the same time

Source kernel commit: 02b7ee4eb613240d2bb3f6a67723f94ceda19eb6

Modify xfs_trans_reserve_quota_nblks so that we can reserve data and
realtime blocks from the dquot at the same time. This change has the
theoretical side effect that for allocations to realtime files we will
reserve from the dquot both the number of rtblocks being allocated and
the number of bmbt blocks that might be needed to add the mapping.
However, since the mount code disables quota if it finds a realtime
device, this should not result in any behavior changes.

Now that we've moved the inode creation callers away from using the
_nblks function, we can repurpose the (now unused) ninos argument for
realtime blocks, so make that change. This also replaces the flags
argument with a boolean parameter to force the reservation since we
don't need to distinguish between data and rt quota reservations any
more, and the only flag being passed in was FORCE_RES.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: create convenience wrappers for incore quota block reservations

Source kernel commit: 8554650003b8a66f3dd357692ab73101d088d938

Create a couple of convenience wrappers for creating and deleting quota
block reservations against future changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: clean up quota reservation callsites

Source kernel commit: 4abe21ad67a7b9dc6844f55e91a6e3ef81879d42

Convert a few xfs_trans_*reserve* callsites that are open-coding other
convenience functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Fix 'set but not used' warning in xfs_bmap_compute_alignments()

Source kernel commit: 560ab6c0d12ebccabb83638abe23a7875b946f9a

With both CONFIG_XFS_DEBUG and CONFIG_XFS_WARN disabled, the only reference to
local variable "error" in xfs_bmap_compute_alignments() gets eliminated during
pre-processing stage of the compilation process. This causes the compiler to
generate a "set but not used" warning.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce error injection to allocate only minlen size extents for files

Source kernel commit: 301519674699aa9b80a15b2b2165e08532b176e6

This commit adds XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag which
helps userspace test programs to get xfs_bmap_btalloc() to always
allocate minlen sized extents.

This is required for test programs which need a guarantee that minlen
extents allocated for a file do not get merged with their existing
neighbours in the inode's BMBT. "Inode fork extent overflow check" for
Directories, Xattrs and extension of realtime inodes need this since the
file offset at which the extents are being allocated cannot be
explicitly controlled from userspace.

One way to use this error tag is to:
1. Consume all of the free space by sequentially writing to a file.
2. Punch alternate blocks of the file. This causes CNTBT to contain
   a sufficient number of one block sized extent records.
3. Inject XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag.

After step 3, xfs_bmap_btalloc() will issue space allocation
requests for minlen sized extents only.

ENOSPC error code is returned to userspace when there aren't any "one
block sized" extents left in any of the AGs.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Process allocated extent in a separate function

Source kernel commit: 07c72e556299a7fea448912b1330b9ebfd418662

This commit moves over the code in xfs_bmap_btalloc() which is
responsible for processing an allocated extent to a new function. Apart
from xfs_bmap_btalloc(), the new function will be invoked by another
function introduced in a future commit.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Compute bmap extent alignments in a separate function

Source kernel commit: 0961fddfdd3f8ccd6302af2e7718abbaf18c9fff

This commit moves over the code which computes stripe alignment and
extent size hint alignment into a separate function. Apart from
xfs_bmap_btalloc(), the new function will be used by another function
introduced in a future commit.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Remove duplicate assert statement in xfs_bmap_btalloc()

Source kernel commit: aff4db57d510082f11194ca915d8101463c92d46

The check for verifying if the allocated extent is from an AG whose
index is greater than or equal to that of tp->t_firstblock is already
done a couple of statements earlier in the same function. Hence this
commit removes the duplicate assert statement.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Introduce error injection to reduce maximum inode fork extent count

Source kernel commit: f9fa87169d2bc1bf55ab42bb6085114378c53b86

This commit adds XFS_ERRTAG_REDUCE_MAX_IEXTENTS error tag which enables
userspace programs to test "Inode fork extent count overflow detection"
by reducing maximum possible inode fork extent count to 10.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when swapping extents

Source kernel commit: bcc561f21f115437a010307420fc43d91be91c66

Removing an initial range of source/donor file's extent and adding a new
extent (from donor/source file) in its place will cause extent count to
increase by 1.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when moving extent from cow to data fork

Source kernel commit: 5f1d5bbfb2e674052a9fe542f53678978af20770

Moving an extent to data fork can cause a sub-interval of an existing
extent to be unmapped. This will increase extent count by 1. Mapping in
the new extent can increase the extent count by 1 again i.e.
| Old extent | New extent | Old extent |
Hence number of extents increases by 2.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when writing to unwritten extent

Source kernel commit: c442f3086d5a108b7ff086c8ade1923a8f389db5

A write to a sub-interval of an existing unwritten extent causes
the original extent to be split into 3 extents
i.e. | Unwritten | Real | Unwritten |
Hence extent count can increase by 2.
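
The arithmetic can be checked with a small helper. This is a sketch
only: extents_after_write() is a hypothetical function, and it ignores
any merging of the written range with neighbouring extents.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of extents that replace one unwritten extent [start, start + len)
 * after the sub-range [woff, woff + wlen) inside it has been written.
 */
static int extents_after_write(uint64_t start, uint64_t len,
			       uint64_t woff, uint64_t wlen)
{
	int n = 1;  /* the newly written ("real") extent */

	assert(wlen > 0 && woff >= start && woff + wlen <= start + len);
	if (woff > start)
		n++;  /* unwritten remainder on the left */
	if (woff + wlen < start + len)
		n++;  /* unwritten remainder on the right */
	return n;
}
```

Writing strictly inside the extent yields 3 extents where there was 1,
which is the worst-case increase of 2 that the overflow check has to
allow for. A write that touches one end of the extent adds only 1, and
a write covering the whole extent adds none.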

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when adding/removing xattrs

Source kernel commit: 3a19bb147c72d2e9b77137bf5130b9cfb50a5eef

Adding/removing an xattr can cause XFS_DA_NODE_MAXDEPTH extents to be
added. One extra extent for dabtree in case a local attr is large enough
to cause a double split. It can also cause extent count to increase
proportional to the size of a remote xattr's value.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when renaming dir entries

Source kernel commit: 02092a2f034fdeabab524ae39c2de86ba9ffa15a

A rename operation is essentially a directory entry remove operation
from the perspective of parent directory (i.e. src_dp) of rename's
source. Hence the only place where we check for extent count overflow
for src_dp is in xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real()
returns -ENOSPC when it detects a possible extent count overflow and in
response, the higher layers of directory handling code do the following:
1. Data/Free blocks: XFS lets these blocks linger until a future remove
operation removes them.
2. Dabtree blocks: XFS swaps the blocks with the last block in the Leaf
space and unmaps the last block.

For target_dp, there are two cases depending on whether the destination
directory entry exists or not.

When destination directory entry does not exist (i.e. target_ip ==
NULL), extent count overflow check is performed only when transaction
has a non-zero sized space reservation associated with it. With a
zero-sized space reservation, XFS allows a rename operation to continue
only when the directory has sufficient free space in its data/leaf/free
space blocks to hold the new entry.

When destination directory entry exists (i.e. target_ip != NULL), all
we need to do is change the inode number associated with the already
existing entry. Hence there is no need to perform an extent count
overflow check.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when removing dir entries

Source kernel commit: 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd

Directory entry removal must always succeed; hence XFS does the
following during a low disk space scenario:
1. Data/Free blocks linger until a future remove operation.
2. Dabtree blocks would be swapped with the last block in the leaf space
and then the new last block will be unmapped.

This facility is reused during a low inode extent count scenario, i.e.
this commit causes xfs_bmap_del_extent_real() to return -ENOSPC so
that the above mentioned behaviour is exercised, causing no change to
the directory's extent count.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when adding dir entries

Source kernel commit: f5d92749191402c50e32ac83dd9da3b910f5680f

Directory entry addition can cause the following:
1. Data block can be added/removed.
   A new extent can cause extent count to increase by 1.
2. Free disk block can be added/removed.
   Same behaviour as described above for Data block.
3. Dabtree blocks.
   XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these
   can be new extents. Hence extent count can increase by
   XFS_DA_NODE_MAXDEPTH.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when punching a hole

Source kernel commit: 85ef08b5a667615bc7be5058259753dc42a7adcd

The extent mapping the file offset at which a hole has to be
inserted will be split into two extents causing extent count to
increase by 1.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for extent overflow when trivially adding a new extent

Source kernel commit: 727e1acd297cae15449607d6e2ee39c71216cf1a

When adding a new data extent (without modifying an inode's existing
extents) the extent count increases only by 1. This commit checks for
extent count overflow in such cases.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add helper for checking per-inode extent count overflow

Source kernel commit: b9b7e1dc56c5ca8d6fc37c410b054e9f26737d2e

XFS does not check for possible overflow of per-inode extent counter
fields when adding extents to either data or attr fork.

For example:
1. Insert 5 million xattrs (each having a value size of 255 bytes) and
then delete 50% of them in an alternating manner.

2. On a 4k block sized XFS filesystem instance, the above causes 98511
extents to be created in the attr fork of the inode.

xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131

3. The incore inode fork extent counter is a signed 32-bit
quantity. However the on-disk extent counter is an unsigned 16-bit
quantity and hence cannot hold 98511 extents.

4. The following incorrect value is stored in the attr extent counter,
# xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
core.naextents = -32561

This commit adds a new helper function (i.e.
xfs_iext_count_may_overflow()) to check for overflow of the per-inode
data and xattr extent counters. Future patches will use this function to
make sure that an FS operation won't cause the extent counter to
overflow.
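
A simplified sketch of such a check follows. It is not the kernel
helper itself: the function and enum names are hypothetical, the
-EFBIG return value is an assumption, and the limits assume that the
data fork counter tops out at the signed 32-bit maximum and the attr
fork counter at the signed 16-bit maximum (consistent with the negative
value printed by xfs_db above). The second function reproduces that
wrapped value with plain integer arithmetic.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum fork_type { DATA_FORK, ATTR_FORK };  /* hypothetical names */

/* Assumed limits for this sketch, see the text above. */
#define MAX_DATA_EXTENTS ((uint64_t)INT32_MAX)
#define MAX_ATTR_EXTENTS ((uint64_t)INT16_MAX)

/*
 * Return 0 if nr_to_add more extents fit in the fork's counter, or
 * -EFBIG (an assumed error code) if the counter could overflow.
 */
static int iext_count_may_overflow(uint64_t nextents, enum fork_type fork,
				   uint32_t nr_to_add)
{
	uint64_t max;

	assert(fork == DATA_FORK || fork == ATTR_FORK);
	max = (fork == ATTR_FORK) ? MAX_ATTR_EXTENTS : MAX_DATA_EXTENTS;
	if (nextents > max || nr_to_add > max - nextents)
		return -EFBIG;
	return 0;
}

/* Value read back when a count is stored in 16 bits and treated as signed. */
static int32_t as_signed_16bit(uint32_t count)
{
	uint32_t low = count & 0xffff;

	return low >= 0x8000 ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

Here as_signed_16bit(98511) evaluates to -32561, matching the
core.naextents value shown above, while a check of this kind would
have rejected the operation before the counter wrapped.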

Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix an ABBA deadlock in xfs_rename

Source kernel commit: 6da1b4b1ab36d80a3994fd4811c8381de10af604

When overlayfs is running on top of xfs and the user unlinks a file in
the overlay, overlayfs will create a whiteout inode and ask xfs to
"rename" the whiteout file atop the one being unlinked. If the file
being unlinked loses its one nlink, we then have to put the inode on the
unlinked list.

This requires us to grab the AGI buffer of the whiteout inode to take it
off the unlinked list (which is where whiteouts are created) and to grab
the AGI buffer of the file being deleted. If the whiteout was created
in a higher numbered AG than the file being deleted, we'll lock the AGIs
in the wrong order and deadlock.

Therefore, grab all the AGI locks we think we'll need ahead of time, and
in order of increasing AG number per the locking rules.
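
The ordering rule itself can be sketched as follows. This is an
illustration of "collect the AG numbers first, then lock in ascending
order", not the code from the fix: agi_lock_order() is a hypothetical
helper that only computes the order, and the caller is assumed to take
the actual AGI buffer locks in the order it returns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_agno(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the AG numbers of all inodes involved in the operation into
 * ascending order and drop duplicates, in place. Returns the number of
 * distinct AGs, i.e. the number of AGI locks to take.
 */
static size_t agi_lock_order(uint32_t *agnos, size_t n)
{
	size_t i, out = 0;

	assert(agnos != NULL || n == 0);
	if (n == 0)
		return 0;
	qsort(agnos, n, sizeof(*agnos), cmp_agno);
	for (i = 0; i < n; i++)
		if (out == 0 || agnos[out - 1] != agnos[i])
			agnos[out++] = agnos[i];
	return out;
}
```

If every thread locks in the resulting order, two threads that need the
same pair of AGIs always contend on the lower-numbered one first, so
neither can end up holding one lock while waiting for the other.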

Reported-by: wenli xie <wlxie7296@gmail.com>
Fixes: 93597ae8dac0 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.11.0

Update all the necessary files for a 5.11.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: Build-depend on libinih-dev with udeb package

The first libinih Debian package version with udeb binary package is 53-1.
Debian bug #981662 documents the need for it:
xfsprogs-udeb depends on libinih1, not libinih1-udeb

Link: https://bugs.debian.org/981662
Signed-off-by: Bastian Germann <bastiangermann@fishpost.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

debian: Regenerate config.guess using debhelper

This is a change introduced in 5.10.0-2ubuntu2 with the changelog:

> xfsprogs upstream has regressed config.guess, so use
> dh_update_autotools_config.

The 5.10.0 tarball has a config.guess that breaks builds on RISC-V:
...
UNAME_MACHINE = riscv64
UNAME_RELEASE = 5.0.0+
UNAME_SYSTEM = Linux
UNAME_VERSION = #2 SMP Sat Mar 9 22:34:53 UTC 2019
configure: error: cannot guess build type; you must specify one
make[1]: *** [Makefile:131: include/builddefs] Error 1
...

Reported-by: Steve Langasek <steve.langasek@ubuntu.com>
Signed-off-by: Bastian Germann <bastiangermann@fishpost.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

man: document attr2, ikeep option deprecation in xfs.5

Since kernel v5.10, the (no)attr2 and (no)ikeep mount options are deprecated:

c23c393eaab5d xfs: remove deprecated mount options

Document this fact in the xfs(5) manpage.

Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_admin: don't hide the xfs_repair output when upgrading

Currently, xfs_admin suppresses the output from xfs_repair when it tries
to upgrade a filesystem, and prints a rather unhelpful message if the
upgrade fails.

Neither of these behaviors is useful -- repair can fail for reasons
outside of the filesystem being mounted, and if it does, the admin will
never know what actually happened.

Worse yet, if repair finds corruptions on disk, the upgrade script
silently throws all that away, which means that nobody will ever be able
to report what happened if an upgrade trashes a filesystem.

Therefore, allow the console to capture all of repair's stdout/stderr
reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_admin: don't add '=1' when building repair command line for -O

Geert Hendrickx reported an inconsistency between the xfs_admin manpage
and its behavior -- the documentation says that users must provide the
status explicitly, but the script injects '=1' anyway. While this seems
to work with the glibc getsubopt, it's a bit ugly and isn't consistent
with the docs.

So, get rid of those extra two bytes.

Reported-by: Geert Hendrickx <geert@hendrickx.be>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.11.0-rc1

Update all the necessary files for a 5.11.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

man: document XFS_XFLAG_APPEND behavior for directories

For directories, the APPEND flag means that files cannot be unlinked
from the directory.  Files can be linked in or created, just not
unlinked.  Document this behavior, since it's been in the VFS for years
though not explicitly mentioned.  This patch is in preparation for
trying to hoist the fsgetxattr ioctl documentation to the man-pages
project.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: make use of xfs_validate_stripe_geometry()

Check stripe numbers in calc_stripe_factors() by using
xfs_validate_stripe_geometry().

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: enable bigtime upgrade via repair

Upgrade existing V5 filesystems to support large timestamps, up to the
year 2486.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: enable inobtcount upgrade via repair

Use xfs_repair to add the inode btree counter feature to a filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_admin: support adding features to V5 filesystems

Teach the xfs_admin script how to add features to V5 filesystems.
Technically speaking we could add lazycount to the list, but that option
is only useful for the V4 format which is deprecated.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: allow upgrades on v5 filesystems

Add some helper functions so that we can allow users to upgrade V5
filesystems in a sane manner. This just lands the boilerplate; the
actual feature validation and whatnot will land in the next patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[sandeen: change subject slightly]
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

man: mark all deprecated V4 format options

Update the manual pages for the most popular tools to note which options
are only useful with the V4 XFS format, and that the V4 format is
deprecated and will be removed no later than September 2030.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: tweak wording and formatting a little]
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: add post-phase error injection points

Create an error injection point so that we can simulate repair failing
after a certain phase.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: factor phase transitions into a helper

Create a helper function to centralize all the stuff we do at the end of
a repair phase (which for now is limited to reporting progress). The
next patch will add more interesting things to this helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: simulate system failure after a certain number of writes

Add an error injection knob so that we can simulate system failure after
a certain number of disk writes. This knob is being added so that we
can check repair's behavior after an arbitrary number of writes.

Set LIBXFS_DEBUG_WRITE_CRASH={ddev,logdev,rtdev}=nn in the environment
to make libxfs SIGKILL itself after nn writes to the data, log, or rt
devices. Note that this only applies to xfs_buf writes and zero_range.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
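
To make the knob's format concrete, here is a small sketch that splits a
single "device=count" value of the kind described above and checks that
the device key is one of ddev, logdev, or rtdev. The function name
parse_write_crash is hypothetical, and this does not mirror the C parser
inside libxfs; it only illustrates the shape of the setting.

```shell
# Hypothetical validator for one LIBXFS_DEBUG_WRITE_CRASH-style value such
# as "ddev=20". This is an illustration only, not the libxfs parser.
parse_write_crash() {
	spec=$1
	key=${spec%%=*}     # text before the first '='
	count=${spec#*=}    # text after the first '='
	case "$key" in
	ddev|logdev|rtdev) ;;
	*) echo "unknown device key: $key" >&2; return 1 ;;
	esac
	case "$count" in
	''|*[!0-9]*) echo "bad write count: $count" >&2; return 1 ;;
	esac
	echo "$key $count"
}

parse_write_crash "ddev=20"   # prints: ddev 20

# Intended use, with /dev/sdX as a placeholder device (not run here):
#   LIBXFS_DEBUG_WRITE_CRASH=ddev=20 xfs_repair /dev/sdX
```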

xfs_repair: set NEEDSREPAIR the first time we write to a filesystem

Add a hook to the buffer cache so that xfs_repair can intercept the
first write to a V5 filesystem to set the NEEDSREPAIR flag. In the
event that xfs_repair dirties the filesystem and goes down, this ensures
that the sysadmin will have to re-start repair before mounting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: clear the needsrepair flag

Clear the needsrepair flag, since it's used to prevent mounting of an
inconsistent filesystem. We only do this if we make it to the end of
repair with a zero error code, and all the rebuilt indices and
corrected metadata are persisted correctly.

Note that we cannot combine clearing needsrepair with clearing the quota
checked flags because we need to clear the quota flags even if
reformatting the log fails, whereas we can't clear needsrepair if the
log reformat fails.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: clear quota CHKD flags on the incore superblock too

At the end of a repair run, xfs_repair clears the superblock's quota
checked flags if it found mistakes in the quota accounting to force a
quotacheck at the next mount. This is currently the last time repair
modifies the primary superblock, so it is sufficient to update the
ondisk buffer and not the incore mount structure.

However, we're about to introduce code to clear the needsrepair feature
at the very end of repair, after all metadata blocks have been written
to disk and all disk caches have been flushed. Since the convention
everywhere else
in xfs is to update the incore superblock, call libxfs_sb_to_disk to
translate that into the ondisk buffer, and then write the buffer to
disk, switch the quota CHKD code to use this mechanism too.

(Get rid of dsb too, since the incore super should be in sync with the
ondisk super.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix unmount error message to have a newline

Add a newline so that this is consistent with the other error messages.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: don't allow label/uuid setting if the needsrepair flag is set

The NEEDSREPAIR flag can be set on filesystems where we /know/ that
there's something wrong with the metadata and want to force the sysadmin
to run xfs_repair before the next mount. The goal here is to prevent
non-repair changes to a filesystem when we are confident of its
instability. Normally we wouldn't bother with such safety checks for
the debugger, but the label and uuid functions can be called from
xfs_admin, so we should prevent these administrative tasks until the
filesystem can be repaired.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: report the needsrepair flag in check and version commands

Teach the version and check commands to report the presence of the
NEEDSREPAIR flag.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_admin: support filesystems with realtime devices

Add a -r option to xfs_admin so that we can pass the name of the
realtime device to xfs_repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
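
A minimal sketch of how a wrapper script might forward a realtime device
to xfs_repair, which accepts it via its own -r option. The function
build_repair_cmd and the device paths are assumptions for illustration
and are not taken from xfs_admin; the function only prints the command
it would run.

```shell
# Hypothetical wrapper logic: accept -r <rtdev> and forward it to xfs_repair.
build_repair_cmd() {
	rtdev=""
	OPTIND=1                      # reset so the function can be called again
	while getopts "r:" opt; do
		case "$opt" in
		r) rtdev=$OPTARG ;;
		*) return 1 ;;
		esac
	done
	shift $((OPTIND - 1))         # $1 is now the data device
	if [ -n "$rtdev" ]; then
		echo "xfs_repair -r $rtdev $1"
	else
		echo "xfs_repair $1"
	fi
}

build_repair_cmd -r /dev/sdY1 /dev/sdX1   # prints: xfs_repair -r /dev/sdY1 /dev/sdX1
```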

xfs_admin: clean up string quoting

Clean up the string quoting in this script so that we don't trip over
users feeding us arguments like "/dev/sd ha ha ha lol".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
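
The quoting problem mentioned above can be reproduced with a few lines of
shell. The helper count_args is made up for this example; it reports how
many arguments it received, which shows how an unquoted expansion splits
a device name containing spaces.

```shell
# Hypothetical helper that prints the number of arguments it was given.
count_args() { echo "$#"; }

dev="/dev/sd ha ha ha lol"

unquoted=$(count_args $dev)     # word splitting turns this into 5 arguments
quoted=$(count_args "$dev")     # quoting keeps it as 1 argument

echo "unquoted=$unquoted quoted=$quoted"   # prints: unquoted=5 quoted=1
```

Quoting every expansion of a user-supplied argument keeps the device
name intact when it is handed to the underlying tools.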

xfs_scrub: fix weirdness in directory name check code

Remove the redundant second check of fd and ISDIR in check_inode_names,
and rework the comment to describe why we can't run phase 5 if we found
other corruptions in the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>