git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

treewide: use get_random_u32() when possible

Source kernel commit: a251c17aa558d8e3128a528af5cf8b9d7caae4fd

The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

treewide: use prandom_u32_max() when possible, part 1

Source kernel commit: 81895a65ec63ee1daec3255dc1a06675d2fbe915

Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

-       RAND = get_random_u32();
... when != RAND
-       RAND %= (E);
+       RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

-       (FUNC()@p & (LITERAL))
+       prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

{
-       T VAR;
-       VAR = (E);
-       return VAR;
+       return E;
}

@drop_var@
type T;
identifier VAR;
@@

{
-       T VAR;
... when != VAR
}

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx

Source kernel commit: c098576f5f63bc0ee2424bba50892514a71d54e8

xfs_dir2_isleaf is used to see if the directory is a single-leaf
form directory instead, as commented right above the function.

Besides getting rid of the broken comment, we rearrange the logic by
converting everything over to standard formatting and conventions,
at the same time, to make it easier to understand and self documenting.

Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: trim the mapp array accordingly in xfs_da_grow_inode_int

Source kernel commit: 44159659df8ca381b84261e11058b2176fa03ba0

Take a look at the for-loop in xfs_da_grow_inode_int:
======
for(){
nmap = min(XFS_BMAP_MAX_NMAP, count);
...
error = xfs_bmapi_write(...,&mapp[mapi], &nmap);//(..., $1, $2)
...
mapi += nmap;
}
=====
where $1 stands for the start address of the array,
while $2 is used to indicate the size of the array.

The array $1 will advance by $nmap in each iteration after
the allocation of extents.
But the size $2 still remains unchanged, which is determined by
min(XFS_BMAP_MAX_NMAP, count).

It seems that it has forgotten to trim the mapp array after each
iteration, so change it.

Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Remove the unneeded result variable

Source kernel commit: abda5271f8ec6e9a84ae8129ddc59226c89def7a

Return the value xfs_dir_cilookup_result() directly instead of storing it
in another redundant variable.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: clean up "%Ld/%Lu" which doesn't meet C standard

Source kernel commit: 78b0f58bdfef45aa9f3c7fbbd9b4d41abad6d85f

The "%Ld" specifier, which represents long long unsigned,
doesn't meet C language standard, and even more,
it makes people easily mistake with "%ld", which represent
long unsigned. So replace "%Ld" with "lld".

Do the same with "%Lu".

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfsprogs: Release v6.0.0

Update all the necessary files for a 6.0.0 release.

Signed-off-by: Carlos Maiolino <cem@kernel.org>

Polish translation update for xfsprogs 5.19.0.

Signed-off-by: Jakub Bogusz <qboosh@pld-linux.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_quota: apply -L/-U range limits in uid/gid/pid loops

In case kernel doesn't support XFS_GETNEXTQUOTA the report/dump
command will fallback to iterating over all known uid/gid/pid.
However, currently it won't take -L/-U range limits into account
(all entities with non-zero qoutas will be outputted). This applies
those limits for fallback case.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_quota: utilize XFS_GETNEXTQUOTA for ranged calls in report/dump

The implementation based on XFS_GETQUOTA call for each ID in range,
specified with -L/-U, is quite slow for wider ranges.

If kernel supports XFS_GETNEXTQUOTA, report_*_mount/dump_any_file
will use that to obtain quota list for the mount. XFS_GETNEXTQUOTA
returns quota of the requested ID and next ID with non-empty quota.

Otherwise, XFS_GETQUOTA will be used for each user/group/project ID
known from password/group/project database.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_quota: separate get_dquot() and report_mount()

Separate quota info acquisition from outputting. This allows upper
functions to filter obtained info (e.g. within specific ID range).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_quota: separate get_dquot() and dump_file()

Separate quota info acquisition from outputting it to file. This
allows upper functions to filter obtained info (e.g. within specific
ID range).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_quota: separate quota info acquisition into get_dquot()

Both report_mount() and dump_file() have identical code to get quota
information. This could be used for further separation of the
functions.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs_db: use preferable macro to seek offset for local dir3 entry fields

In 'xfsprogs-dev' source:

/* db/dir2sf.c begin */
#define        EOFF(f) bitize(offsetof(xfs_dir2_sf_entry_t, f))
const field_t   dir2_sf_entry_flds[] = {
        { "namelen", FLDT_UINT8D, OI(EOFF(namelen)), C1, 0, TYP_NONE },
...
#define        E3OFF(f)        bitize(offsetof(xfs_dir2_sf_entry_t, f))
const field_t   dir3_sf_entry_flds[] = {
        { "namelen", FLDT_UINT8D, OI(EOFF(namelen)), C1, 0, TYP_NONE },
...
/* db/dir2sf.c end */

The macro definitions of 'EOFF' and 'E3OFF' are same, so no matter to
use either to seek field offset in 'dir3_sf_entry_flds'.
But it seems the intent of defining 'E3OFF' macro is to be used in
'dir3_sf_entry_flds', and 'E3OFF' macro has not been used at any place
of the 'xfsprogs-dev' source:

/* command begin */
$ grep -r E3OFF /path/to/xfsprogs-dev/git/repository/
./db/dir2sf.c:#define   E3OFF(f)        bitize(offsetof(xfs_dir2_sf_entry_t, f))
$
/* command end */

Above command shows the 'E3OFF' is only been defined but nerver been
used, that is weird, so there has reason to suspect using 'EOFF'
rather than 'E3OFF' in 'dir3_sf_entry_flds' is a typo, this patch fix
it, there has no logical change in this commit at all.

Signed-off-by: Xiaole He <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfsprogs: Release v6.0.0-rc0

Update all the necessary files for a 6.0.0-rc0 release.

Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix inode reservation space for removing transaction

Source kernel commit: 031d166f968efba6e4f091ff75d0bb5206bb3918

In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a
directory entry writes:

/* fs/xfs/libxfs/xfs_trans_resv.c begin */
/*
* For removing a directory entry we can modify:
*    the parent directory inode: inode size
*    the removed inode: inode size
...
xfs_calc_remove_reservation(
struct xfs_mount        *mp)
{
return XFS_DQUOT_LOGRES(mp) +
xfs_calc_iunlink_add_reservation(mp) +
max((xfs_calc_inode_res(mp, 1) +
...
/* fs/xfs/libxfs/xfs_trans_resv.c end */

There has 2 inode size of space to be reserverd, but the actual code
for inode reservation space writes.

There only count for 1 inode size to be reserved in
'xfs_calc_inode_res(mp, 1)', rather than 2.

Signed-off-by: hexiaole <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: remove redundant code citations]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Fix typo 'the the' in comment

Source kernel commit: 4869b6e84a23076eca813d5739d6717976aeae66

Replace 'the the' with 'the' in the comment.

Signed-off-by: Slark Xiao <slark_xiao@163.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: don't leak memory when attr fork loading fails

Source kernel commit: c78c2d0903183a41beb90c56a923e30f90fa91b9

I observed the following evidence of a memory leak while running xfs/399
from the xfs fsck test suite (edited for brevity):

XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
------------[ cut here ]------------
WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_ifork_zap_attr+0x7c/0xb0
xfs_iformat_attr_fork+0x86/0x110
xfs_inode_from_disk+0x41d/0x480
xfs_iget+0x389/0xd70
xfs_bulkstat_one_int+0x5b/0x540
xfs_bulkstat_iwalk+0x1e/0x30
xfs_iwalk_ag_recs+0xd1/0x160
xfs_iwalk_run_callbacks+0xb9/0x180
xfs_iwalk_ag+0x1d8/0x2e0
xfs_iwalk+0x141/0x220
xfs_bulkstat+0x105/0x180
xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
xfs_file_ioctl+0xa5f/0xef0
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

This newly-added assertion checks that there aren't any incore data
structures hanging off the incore fork when we're trying to reset its
contents.  From the call trace, it is evident that iget was trying to
construct an incore inode from the ondisk inode, but the attr fork
verifier failed and we were trying to undo all the memory allocations
that we had done earlier.

The three assertions in xfs_ifork_zap_attr check that the caller has
already called xfs_idestroy_fork, which clearly has not been done here.
As the zap function then zeroes the pointers, we've effectively leaked
the memory.

The shortest change would have been to insert an extra call to
xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
call into _zap_attr, since all other callsites call _idestroy_fork
immediately prior to calling _zap_attr.  IOWs, it eliminates one way to
fail.

Note: This change only applies cleanly to 2ed5b09b3e8f, since we just
reworked the attr fork lifetime.  However, I think this memory leak has
existed since 0f45a1b20cd8, since the chain xfs_iformat_attr_fork ->
xfs_iformat_local -> xfs_init_local_fork will allocate
ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
hanging off i_afp.  The solution for older kernels I think is to add the
missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.

Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.

Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
Probably-Fixes: 0f45a1b20cd8 ("xfs: improve local fork verification")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: delete unnecessary NULL checks

Source kernel commit: 3f52e016af600982989b5dee958d313c52483c92

These NULL check are no long needed after commit 2ed5b09b3e8f ("xfs:
make inode attribute forks a permanent part of struct xfs_inode").

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix comment for start time value of inode with bigtime enabled

Source kernel commit: fdbae121b4369fe49eb5f8efbd23604ab4c50116

The 'ctime', 'mtime', and 'atime' for inode is the type of
'xfs_timestamp_t', which is a 64-bit type:

/* fs/xfs/libxfs/xfs_format.h begin */
typedef __be64 xfs_timestamp_t;
/* fs/xfs/libxfs/xfs_format.h end */

When the 'bigtime' feature is disabled, this 64-bit type is splitted
into two parts of 32-bit, one part is encoded for seconds since
1970-01-01 00:00:00 UTC, the other part is encoded for nanoseconds
above the seconds, this two parts are the type of
'xfs_legacy_timestamp' and the min and max time value of this type are
defined as macros 'XFS_LEGACY_TIME_MIN' and 'XFS_LEGACY_TIME_MAX':

/* fs/xfs/libxfs/xfs_format.h begin */
struct xfs_legacy_timestamp {
__be32          t_sec;          /* timestamp seconds */
__be32          t_nsec;         /* timestamp nanoseconds */
};
/* fs/xfs/libxfs/xfs_format.h end */
/* include/linux/limits.h begin */
/* include/linux/limits.h end */

'XFS_LEGACY_TIME_MIN' is the min time value of the
'xfs_legacy_timestamp', that is -(2^31) seconds relative to the
1970-01-01 00:00:00 UTC, it can be converted to human-friendly time
value by 'date' command:

/* command begin */
[root@~]# date --utc -d '@0' +'%Y-%m-%d %H:%M:%S'
1970-01-01 00:00:00
[root@~]# date --utc -d "@`echo '-(2^31)'|bc`" +'%Y-%m-%d %H:%M:%S'
1901-12-13 20:45:52
[root@~]#
/* command end */

When 'bigtime' feature is enabled, this 64-bit type becomes a 64-bit
nanoseconds counter, with the start time value is the min time value of
'xfs_legacy_timestamp'(start time means the value of 64-bit nanoseconds
counter is 0). We have already caculated the min time value of
'xfs_legacy_timestamp', that is 1901-12-13 20:45:52 UTC, but the comment
for the start time value of inode with 'bigtime' feature enabled writes
the value is 1901-12-31 20:45:52 UTC:

/* fs/xfs/libxfs/xfs_format.h begin */
/*
* XFS Timestamps
* ==============
* When the bigtime feature is enabled, ondisk inode timestamps become an
* unsigned 64-bit nanoseconds counter.  This means that the bigtime inode
* timestamp epoch is the start of the classic timestamp range, which is
* Dec 31 20:45:52 UTC 1901. ...
...
*/
/* fs/xfs/libxfs/xfs_format.h end */

That is a typo, and this patch corrects the typo, from 'Dec 31' to
'Dec 13'.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Xiaole He <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: double link the unlinked inode list

Source kernel commit: 2fd26cc07e9f8050e29bf314cbf1bcb64dbe088c

Now we have forwards traversal via the incore inode in place, we now
need to add back pointers to the incore inode to entirely replace
the back reference cache. We use the same lookup semantics and
constraints as for the forwards pointer lookups during unlinks, and
so we can look up any inode in the unlinked list directly and update
the list pointers, forwards or backwards, at any time.

The only wrinkle in converting the unlinked list manipulations to
use in-core previous pointers is that log recovery doesn't have the
incore inode state built up so it can't just read in an inode and
release it to finish off the unlink. Hence we need to modify the
traversal in recovery to read one inode ahead before we
release the inode at the head of the list. This populates the
next->prev relationship sufficient to be able to replay the unlinked
list and hence greatly simplify the runtime code.

This recovery algorithm also requires that we actually remove inodes
from the unlinked list one at a time as background inode
inactivation will result in unlinked list removal racing with the
building of the in-memory unlinked list state. We could serialise
this by holding the AGI buffer lock when constructing the in memory
state, but all that does is lockstep background processing with list
building. It is much simpler to flush the inodegc immediately after
releasing the inode so that it is unlinked immediately and there is
no races present at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: track the iunlink list pointer in the xfs_inode

Source kernel commit: 4fcc94d653270fcc7800dbaf3b11f78cb462b293

Having direct access to the i_next_unlinked pointer in unlinked
inodes greatly simplifies the processing of inodes on the unlinked
list. We no longer need to look up the inode buffer just to find
next inode in the list if the xfs_inode is in memory. These
improvements will be realised over upcoming patches as other
dependencies on the inode buffer for unlinked list processing are
removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: replace inode fork size macros with functions

Source kernel commit: c01147d929899f02a0a8b15e406d12784768ca72

Replace the shouty macros here with typechecked helper functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: replace XFS_IFORK_Q with a proper predicate function

Source kernel commit: 932b42c66cb5d0ca9800b128415b4ad6b1952b3e

Replace this shouty macro with a real C function that has a more
descriptive name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: use XFS_IFORK_Q to determine the presence of an xattr fork

Source kernel commit: e45d7cb2356e6b59fe64da28324025cc6fcd3fbd

Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
attribute fork but i_forkoff is zero. This eliminates the ambiguity
between i_forkoff and i_af.if_present, which should make it easier to
understand the lifetime of attr forks.

While we're at it, remove the if_present checks around calls to
xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
forks that have already been torn down.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: make inode attribute forks a permanent part of struct xfs_inode

Source kernel commit: 2ed5b09b3e8fc274ae8fecd6ab7c5106a364bed1

Syzkaller reported a UAF bug a while back:

==================================================================
BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958

CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
5.15.0-0.30.3-20220406_1406 #3
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
__vfs_getxattr+0xdf/0x13d fs/xattr.c:399
cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
do_truncate+0xc3/0x1e0 fs/open.c:56
handle_truncate fs/namei.c:3084 [inline]
do_open fs/namei.c:3432 [inline]
path_openat+0x30ab/0x396d fs/namei.c:3561
do_filp_open+0x1c4/0x290 fs/namei.c:3588
do_sys_openat2+0x60d/0x98c fs/open.c:1212
do_sys_open+0xcf/0x13c fs/open.c:1228
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
RIP: 0033:0x7f7ef4bb753d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
</TASK>

Allocated by task 2953:
kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:254 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3213 [inline]
slab_alloc mm/slub.c:3221 [inline]
kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
kmem_cache_zalloc include/linux/slab.h:711 [inline]
xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
__vfs_setxattr+0x11b/0x177 fs/xattr.c:180
__vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
__vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
vfs_setxattr+0x154/0x33d fs/xattr.c:301
setxattr+0x216/0x29f fs/xattr.c:575
__do_sys_fsetxattr fs/xattr.c:632 [inline]
__se_sys_fsetxattr fs/xattr.c:621 [inline]
__x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0

Freed by task 2949:
kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:230 [inline]
slab_free_hook mm/slub.c:1700 [inline]
slab_free_freelist_hook mm/slub.c:1726 [inline]
slab_free mm/slub.c:3492 [inline]
kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
__vfs_removexattr+0x106/0x16a fs/xattr.c:468
cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
security_inode_killpriv+0x54/0xa1 security/security.c:1414
setattr_prepare+0x1a6/0x897 fs/attr.c:146
xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
notify_change+0xae5/0x10a1 fs/attr.c:410
do_truncate+0x134/0x1e0 fs/open.c:64
handle_truncate fs/namei.c:3084 [inline]
do_open fs/namei.c:3432 [inline]
path_openat+0x30ab/0x396d fs/namei.c:3561
do_filp_open+0x1c4/0x290 fs/namei.c:3588
do_sys_openat2+0x60d/0x98c fs/open.c:1212
do_sys_open+0xcf/0x13c fs/open.c:1228
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0

The buggy address belongs to the object at ffff88802cec9188
which belongs to the cache xfs_ifork of size 40
The buggy address is located 20 bytes inside of
40-byte region [ffff88802cec9188, ffff88802cec91b0)
The buggy address belongs to the page:
page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x2cec9
flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
>ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
^
ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
==================================================================

The root cause of this bug is the unlocked access to xfs_inode.i_afp
from the getxattr code paths while trying to determine which ILOCK mode
to use to stabilize the xattr data.  Unfortunately, the VFS does not
acquire i_rwsem when vfs_getxattr (or listxattr) call into the
filesystem, which means that getxattr can race with a removexattr that's
tearing down the attr fork and crash:

xfs_attr_set:                          xfs_attr_get:
xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:

xfs_idestroy_fork(ip->i_afp);
kmem_cache_free(xfs_ifork_cache, ip->i_afp);

if (ip->i_afp &&

ip->i_afp = NULL;

xfs_need_iread_extents(ip->i_afp))
<KABOOM>

ip->i_forkoff = 0;

Regrettably, the VFS is much more lax about i_rwsem and getxattr than
is immediately obvious -- not only does it not guarantee that we hold
i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
The getxattr system call won't acquire the lock before calling XFS, but
the file capabilities code calls getxattr with and without i_rwsem held
to determine if the "security.capabilities" xattr is set on the file.

Fixing the VFS locking requires a treewide investigation into every code
path that could touch an xattr and what i_rwsem state it expects or sets
up.  That could take years or even prove impossible; fortunately, we
can fix this UAF problem inside XFS.

An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
ensure that i_forkoff is always zeroed before i_afp is set to null and
changed the read paths to use smp_rmb before accessing i_forkoff and
i_afp, which avoided these UAF problems.  However, the patch author was
too busy dealing with other problems in the meantime, and by the time he
came back to this issue, the situation had changed a bit.

On a modern system with selinux, each inode will always have at least
one xattr for the selinux label, so it doesn't make much sense to keep
incurring the extra pointer dereference.  Furthermore, Allison's
upcoming parent pointer patchset will also cause nearly every inode in
the filesystem to have extended attributes.  Therefore, make the inode
attribute fork structure part of struct xfs_inode, at a cost of 40 more
bytes.

This patch adds a clunky if_present field where necessary to maintain
the existing logic of xattr fork null pointer testing in the existing
codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
all goes away.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: convert XFS_IFORK_PTR to a static inline helper

Source kernel commit: 732436ef916b4f338d672ea56accfdb11e8d0732

We're about to make this logic do a bit more, so convert the macro to a
static inline function for better typechecking and fewer shouty macros.
No functional changes here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: removed useless condition in function xfs_attr_node_get

Source kernel commit: 0f38063d7a38015a47ca1488406bf21e0effe80e

At line 1561, variable "state" is being compared
with NULL every loop iteration.

-------------------------------------------------------------------
1561    for (i = 0; state != NULL && i < state->path.active; i++) {
1562            xfs_trans_brelse(args->trans, state->path.blk[i].bp);
1563            state->path.blk[i].bp = NULL;
1564    }
-------------------------------------------------------------------

However, it cannot be NULL.

----------------------------------------
1546    state = xfs_da_state_alloc(args);
----------------------------------------

xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is
called with __GFP_NOFAIL flag and, therefore, it cannot return NULL.

--------------------------------------------------------------------------
struct xfs_da_state *
xfs_da_state_alloc(
struct xfs_da_args      *args)
{
struct xfs_da_state     *state;

state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL);
state->args = args;
state->mp = args->dp->i_mount;
return state;
}
--------------------------------------------------------------------------

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Andrey Strachuk <strochuk@ispras.ru>
Fixes: 4d0cdd2bb8f0 ("xfs: clean up xfs_attr_node_hasname")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: rework xfs_buf_incore() API

Source kernel commit: 85c73bf726e41be276bcad3325d9a8aef10be289

Make it consistent with the other buffer APIs to return a error and
the buffer is placed in a parameter.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: make is_log_ag() a first class helper

Source kernel commit: 36029dee382a20cf515494376ce9f0d5949944eb

We check if an ag contains the log in many places, so make this
a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and
renaming it xfs_ag_contains_log(). The convert all the places that
check if the AG contains the log to use this helper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: replace xfs_ag_block_count() with perag accesses

Source kernel commit: 3829c9a10fc7da40194ec9860df8c557c2b86ed8

Many of the places that call xfs_ag_block_count() have a perag
available. These places can just read pag->block_count directly
instead of calculating the AG block count from first principles.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Pre-calculate per-AG agino geometry

Source kernel commit: 2d6ca8321c354e1cb6f6b1963c4f7bd053d2e272

There is a lot of overhead in functions like xfs_verify_agino() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.

In the case of xfs_verify_agino(), we now always have a perag
context handy, so we can store the minimum and maximum agino values
in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.

xfs_verify_agino_or_null() gets the same perag treatment.

xfs_agino_range() is moved to xfs_ag.c as it's not really a type
function, and it's use is largely restricted as the first and last
aginos can be grabbed straight from the perag in most cases.

Note that we leave the original xfs_verify_agino in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agino() to indicate it takes both an agno
and an agino to differentiate it from new function.

$ size --totals fs/xfs/built-in.a
text    data     bss     dec     hex filename
before  1482185  329588     572 1812345  1ba779 (TOTALS)
after   1481937  329588     572 1812097  1ba681 (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Pre-calculate per-AG agbno geometry

Source kernel commit: 0800169e3e2c97a033e8b7f3d1e6c689e0d71a19

There is a lot of overhead in functions like xfs_verify_agbno() that
repeatedly calculate the geometry limits of an AG. These can be
pre-calculated as they are static and the verification context has
a per-ag context it can quickly reference.

In the case of xfs_verify_agbno(), we now always have a perag
context handy, so we can store the AG length and the minimum valid
block in the AG in the perag. This means we don't have to calculate
it on every call and it can be inlined in callers if we move it
to xfs_ag.h.

Move xfs_ag_block_count() to xfs_ag.c because it's really a
per-ag function and not an XFS type function. We need a little
bit of rework that is specific to xfs_initialise_perag() to allow
growfs to calculate the new perag sizes before we've updated the
primary superblock during the grow (chicken/egg situation).

Note that we leave the original xfs_verify_agbno in place in
xfs_types.c as a static function as other callers in that file do
not have per-ag contexts so still need to go the long way. It's been
renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
and an agbno to differentiate it from new function.

Future commits will make similar changes for other per-ag geometry
validation functions.

Further:

$ size --totals fs/xfs/built-in.a
text    data     bss     dec     hex filename
before  1483006  329588     572 1813166  1baaae (TOTALS)
after   1482185  329588     572 1812345  1ba779 (TOTALS)

This rework reduces the binary size by ~820 bytes, indicating
that much less work is being done to bounds check the agbno values
against on per-ag geometry information.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_alloc_read_agfl

Source kernel commit: cec7bb7d58fa0e644f8cec46b081bf5427c1a0f8

We have the perag in most places we call xfs_alloc_read_agfl, so
pass the perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_alloc_put_freelist

Source kernel commit: 8c392eb27f7a98e403658d066e387c7b1c604f2b

It's available in all callers, so pass it in so that the perag can
be passed further down the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_alloc_get_freelist

Source kernel commit: 49f0d84ec1db5bd46dcf3796fc792fce74ff25a3

It's available in all callers, so pass it in so that the perag can
be passed further down the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_read_agf

Source kernel commit: fa044ae70c64343b07277256952d22a0dc05b319

We have the perag in most places we call xfs_read_agf, so pass the
perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_read_agi

Source kernel commit: 61021deb1faa5b2b913bf0ad76e2769276160b04

We have the perag in most palces we call xfs_read_agi, so pass the
perag instead of a mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_alloc_read_agf()

Source kernel commit: 08d3e84feeb8cb8e20d54f659446b98fe17913aa

xfs_alloc_read_agf() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Whilst modifying the xfs_reflink_find_shared() function definition,
declare it static and remove the extern declaration as it is an
internal function only these days.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: kill xfs_alloc_pagf_init()

Source kernel commit: 76b47e528e3a27a3bf3b3f9153aad9435e03be8c

Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced
by passing a NULL agfbp to xfs_alloc_read_agf().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: pass perag to xfs_ialloc_read_agi()

Source kernel commit: 99b13c7f0bd35dd3cf2cacb61beb4557dc2b6f9b

xfs_ialloc_read_agi() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: kill xfs_ialloc_pagi_init()

Source kernel commit: a95fee40e3d433d8fabff7c02e75f7c2c2e54400

This is just a basic wrapper around xfs_ialloc_read_agi(), which can
be entirely handled by xfs_ialloc_read_agi() by passing a NULL
agibpp....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: make last AG grow/shrink perag centric

Source kernel commit: c6aee2481419b638a5257adbd3ffd33b11c59fa8

Because the perag must exist for these operations, look it up as
part of the common shrink operations and pass it instead of the
mount/agno pair.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfsprogs: Release v5.19.0

Update all the necessary files for a 5.19.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix printf format specifiers on 32-bit platforms

armv7l builds spit out the following warnings:

In file included from ../include/platform_defs.h:44,
                 from ../include/libxfs.h:13,
                 from bmap.c:7:
bmap.c: In function ‘blkmap_alloc’:
bmap.c:41:11: error: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘xfs_extnum_t’ {aka ‘long long unsigned int’} [-Werror=format=]
   41 |         _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
      |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bmap.c:41:9: note: in expansion of macro ‘_’
   41 |         _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
      |         ^
bmap.c:41:58: note: format string is defined here
   41 |         _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
      |                                                         ~^
      |                                                          |
      |                                                          int
      |                                                         %lld
In file included from ../include/platform_defs.h:44,
                 from ../include/libxfs.h:13,
                 from bmap.c:7:
bmap.c:54:35: error: format ‘%zu’ expects argument of type ‘size_t’, but argument 2 has type ‘xfs_extnum_t’ {aka ‘long long unsigned int’} [-Werror=format=]
   54 |                         do_warn(_("malloc failed in blkmap_alloc (%zu bytes)\n"),
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bmap.c:54:33: note: in expansion of macro ‘_’
   54 |                         do_warn(_("malloc failed in blkmap_alloc (%zu bytes)\n"),
      |                                 ^
bmap.c:54:69: note: format string is defined here
   54 |                         do_warn(_("malloc failed in blkmap_alloc (%zu bytes)\n"),
      |                                                                   ~~^
      |                                                                     |
      |                                                                     unsigned int
      |                                                                   %llu

Fix these.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.19.0-rc1

Update all the necessary files for a 5.19.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: Add support for upgrading to large extent counters

This commit adds support to xfs_repair to allow upgrading an existing
filesystem to support per-inode large extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: check filesystem geometry before allowing upgrades

Currently, the two V5 feature upgrades permitted by xfs_repair do not
affect filesystem space usage, so we haven't needed to verify the
geometry.

However, this will change once we start to allow the sysadmin to add the
large extent count feature to existing filesystems. Add all the
infrastructure we need to ensure that the log will still be large
enough, and the root inode will still be where we expect it to be after
the upgrade.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
[david: Recompute transaction reservation values; Exit with error if upgrade fails]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: complain about impossible log size constraints

xfs/042 trips over an impossible fs geometry when nrext64 is enabled.
The minimum log size calculation comes out to 4287 blocks, but the mkfs
parameters specify an AG size of 4096 blocks. This eventually causes
mkfs to complain that the autoselected log size doesn't meet the minimum
size, but we could be a little more explicit in pointing out that the
two size constraints make for an impossible geometry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: stop allowing tiny filesystems

Refuse to format a filesystem that are "too small", because these
configurations are known to have performance and redundancy problems
that are not present on the volume sizes that XFS is best at handling.

Specifically, this means that we won't allow logs smaller than 64MB, we
won't allow single-AG filesystems, and we won't allow volumes smaller
than 300MB. There are two exceptions: the first is an undocumented CLI
option that can be used for crafting debug filesystems.

The second exception is that if fstests is detected, because there are a
lot of fstests that use tiny filesystems to perform targeted regression
and functional testing in a controlled environment. Fixing the ~40 or
so tests to run more slowly with larger filesystems isn't worth the risk
of breaking the tests.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: ignore data blockdev stripe geometry for small filesystems

As part of the process of removing support for tiny filesystems (defined
in the next patch to be anything under 300MB or 64M log size), we are
trying to eliminate all the edge case regressions for small filesystems
that the maintainer can find.

Eric pointed out that the use case of formatting a 510M on a RAID device
regresses once we start enforcing the 64M log size limit:

# modprobe scsi_debug opt_blks=256 opt_xferlen_exp=6 dev_size_mb=510
# mkfs.xfs /dev/sdg
Log size must be at least 64MB.

<hapless user reads manpage, adjusts log size>

# mkfs.xfs -l size=64m /dev/sdg
internal log size 16384 too large, must be less than 16301

Because the device reports a stripe geometry, mkfs tries to create 8 AGs
(instead of the usual 4) which are then very nearly 64M in size.  The
log itself cannot consume the entire AG, so its size is decreased, so
its size is rounded down to allow the creation of AG headers and btrees,
and then the log size is rounded down again to match the stripe unit.
This results in a log that is less than 64MB in size, causing the format
to fail.

There's not much point in formatting tiny AGs on a small filesystem,
even if it is on a RAID.  Doubling the AG count from 4 to 8 doubles the
metadata overhead, conflicts with our attempts to boost the log size,
and on 2022-era storage hardware gains us very little extra performance
since we're not limited by storage access times.

Therefore, disable automatic detection of stripe unit and width if the
data device is less than 1GB.  We would like to format with 128M AGs to
avoid constraining the size of the internal log, and since RAIDs smaller
than 8GB are formatted with 8 AGs by default, 128*8=1G was chosen as the
cutoff.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: stop overriding MAP_SYNC in publicly exported header files

Florian Fainelli most recently reported that xfsprogs doesn't build with
musl on mips:

"MIPS platforms building with recent kernel headers and the musl-libc
toolchain will expose the following build failure:

mmap.c: In function 'mmap_f':
mmap.c:196:12: error: 'MAP_SYNC' undeclared (first use in this function); did you mean 'MS_SYNC'?
  196 |    flags = MAP_SYNC | MAP_SHARED_VALIDATE;
      |            ^~~~~~~~
      |            MS_SYNC
mmap.c:196:12: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [../include/buildrules:81: mmap.o] Error 1"

At first glance, the build failure here is caused by the fact that:

1. The configure script doesn't detect MAP_SYNC support
2. The build system doesn't set HAVE_MAP_SYNC
2. io/mmap.c includes input.h -> projects.h -> xfs.h and later sys/mman.h
3. include/linux.h #define's MAP_SYNC to 0 if HAVE_MAP_SYNC is not set
4. musl's sys/mman.h #undef MAP_SYNC on platforms that don't support it
5. io/mmap.c tries to use MAP_SYNC, not realizing that libc undefined it

Normally, xfs_io only exports functionality that is defined by the libc
and/or kernel headers on the build system.  We often make exceptions for
new functionality so that we have a way to test them before the header
file packages catch up, hence this '#ifndef HAVE_FOO #define FOO'
paradigm.

MAP_SYNC is a gross and horribly broken example of this.  These support
crutches are supposed to be *private* to xfsprogs for benefit of early
testing, but they were instead added to include/linux.h, which we
provide to user programs in the xfslibs-dev package.  IOWs, we've been
#defining MAP_SYNC to zero for unsuspecting programs.

Worst yet, gcc 11.3 doesn't even warn about overriding a #define to 0:

#include <stdio.h>
#include <sys/mman.h>
#ifdef STUPID
# include <xfs/xfs.h>
#endif

int main(int argc, char *argv[]) {
printf("MAP_SYNC 0x%x\n", MAP_SYNC);
}

$ gcc -o a a.c -Wall
$ ./a
MAP_SYNC 0x80000
$ gcc -DSTUPID -o a a.c -Wall
$ ./a
MAP_SYNC 0x0

Four years have gone by since the introduction of MAP_SYNC, so let's get
rid of the override code entirely -- any platform that supports MAP_SYNC
has had plenty of chances to ensure their header files have the right
bits.  While we're at it, fix AC_HAVE_MAP_SYNC to look for MAP_SYNC in
the same header file that the one user (io/mmap.c) uses -- sys/mman.h.

Annoyingly, I had to test this by hand because the sole fstest that
exercises MAP_SYNC (generic/470) requires dm-logwrites and dm-thinp,
neither of which support fsdax on current kernels.

Reported-by: info@mobile-stream.com
Reported-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: terminate getsubopt arrays properly

Having not drank any (or maybe too much) coffee this morning, I typed:

$ mkfs.xfs -d agcount=3 -d nrext64=0
Segmentation fault

I traced this down to getsubopt walking off the end of the dopts.subopts
array. The manpage says you're supposed to terminate the suboptions
string array with a NULL entry, but the structure definition uses
MAX_SUBOPTS/D_MAX_OPTS directly, which means there is no terminator.

Explicitly terminate each suboption array with a NULL entry after
making room for it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[sandeen: explicitly add NULL terminators & clarify comment]
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: Search for conflicts in inode_tree_ptrs[] when processing uncertain inodes

When processing an uncertain inode chunk record, if we lose 2 blocks worth of
inodes or 25% of the chunk, xfs_repair decides to ignore the chunk. Otherwise,
xfs_repair adds a new chunk record to inode_tree_ptrs[agno], marking each
inode as either free or used. However, before adding the new chunk record,
xfs_repair has to check for the existance of a conflicting record.

The existing code incorrectly checks for the conflicting record in
inode_uncertain_tree_ptrs[agno]. This check will succeed since the inode chunk
record being processed was originally obtained from
inode_uncertain_tree_ptrs[agno].

This commit fixes the bug by changing xfs_repair to search
inode_tree_ptrs[agno] for conflicts.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: ignore empty xattr leaf blocks

As detailed in the commit:

5e572d1a xfs: empty xattr leaf header blocks are not corruption

empty xattr leaf blocks can be the benign byproduct of the system
going down during the multi-step process of adding a large xattr
to a file that has no xattrs. If we find one at attr fork offset 0,
we should clear it, but this isn't a corruption.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: correct nlink printf specifier from hd to PRIu32

1. Description
libxfs/xfs_log_format.h declare 'di_nlink' as unsigned 32-bit integer:

typedef struct xfs_icdinode {
        ...
        __uint32_t      di_nlink;       /* number of links to file */
        ...
} xfs_icdinode_t;

But logprint/log_misc.c use '%hd' to print 'di_nlink':

void
xlog_print_trans_inode_core(xfs_icdinode_t *ip)
{
    ...
    printf(_("nlink %hd uid %d gid %d\n"),
           ip->di_nlink, ip->di_uid, ip->di_gid);
    ...
}

'%hd' can be 16-bit on many architectures, on these architectures, the 'printf' only print the low 16-bit of 'di_nlink'.

2. Reproducer
2.1. Commands
[root@localhost ~]# cd
[root@localhost ~]# xfs_mkfile 128m 128m.xfs
[root@localhost ~]# mkfs.xfs 128m.xfs
[root@localhost ~]# mount 128m.xfs /mnt/
[root@localhost ~]# cd /mnt/
[root@localhost mnt]# seq 1 65534|xargs mkdir -p
[root@localhost mnt]# cd
[root@localhost ~]# umount /mnt/
[root@localhost ~]# xfs_logprint 128m.xfs|grep nlink|tail -1

2.2. Expect result
nlink 65536

2.3. Actual result
nlink 0

Signed-off-by: hexiaole <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: check the rt summary against observations

Teach xfs_repair to check the ondisk realtime summary file against its
own observations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: check the rt bitmap against observations

Teach xfs_repair to check the ondisk realtime bitmap against its own
observations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: check free rt extent count

Check the superblock's free rt extent count against what we observed.
This increases the runtime and memory usage, but we can now report
undercounting frextents as a result of a logging bug in the kernel.
Note that repair has always fixed the undercount, but it no longer does
that silently.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: update manpage of bigtime and inobtcount

The bigtime and inobtcount feature is enabled by default by
1c08f0ae28b34d97b0a89c8483ef3c743914e85e (mkfs: enable inobtcount and
bigtime by default). This patch updates the manpage of mkfs to mention
this change.

Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: always use new_diflags2 to initialize new inodes

The new_diflags2 field that's set in the inode geometry represent
features that we want enabled for /all/ newly created inodes.
Unfortunately, mkfs doesn't do that because xfs_flags2diflags2 doesn't
read new_diflags2. Change the new_diflags2 logic to match the kernel.

Without this fix, the root directory gets created without the
DIFLAG2_NREXT64 iflag set, but files created by a protofile /do/ have it
turned on.

This wasn't an issue with DIFLAG2_BIGTIME because xfs_trans_log_inode
quietly turns that on whenever possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: document the large extent count switch in the --help screen

We should document this feature in the --help screen for mkfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: preserve DIFLAG2_NREXT64 when setting other inode attributes

Preserve the state of the NREXT64 inode flag when we're changing the
other flags2 fields. This is only vital for the kernel version of this
function, but we should keep these in sync.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: detect and fix padding fields that changed with nrext64

Detect incorrectly set padding fields when large extent counts are
enabled or disabled on v3 inodes.

Found by fuzzing v3.flags2 = zeroes with xfs/374 and an nrext64=1
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: clear DIFLAG2_NREXT64 when filesystem doesn't support nrext64

Clear the nrext64 inode flag if the filesystem doesn't have the nrext64
feature enabled in the superblock.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_copy: don't use cached buffer reads until after libxfs_mount

I accidentally tried to xfs_copy an ext4 filesystem, but instead of
rejecting the filesystem, the program instead crashed.  I figured out
that zeroing the superblock was enough to trigger this:

# dd if=/dev/zero of=/dev/sda bs=1024k count=1
# xfs_copy  /dev/sda /dev/sdb
Floating point exception

The exact crash happens in this line from libxfs_getbuf_flags, which is
called from the main() routine of xfs_copy:

if (btp == btp->bt_mount->m_ddev_targp) {
(*bpp)->b_pag = xfs_perag_get(btp->bt_mount,
xfs_daddr_to_agno(btp->bt_mount, blkno));

The problem here is that the uncached read filled the incore superblock
with zeroes, which means mbuf.sb_agblocks is zero.  This causes a
division by zero in xfs_daddr_to_agno, thereby crashing the program.

In commit f8b581d6, we made it so that xfs_buf structures contain a
passive reference to the associated perag structure.  That commit
assumes that no program would try a cached buffer read until the buffer
cache is fully set up, which is true throughout xfsprogs... except for
the beginning of xfs_copy.  For whatever reason, it attempts an uncached
read of the superblock to figure out the real superblock size, then
performs a *cached* read with the proper buffer length and verifier.
The cached read crashes the program.

Fix the problem by changing the (second) cached read into an uncached read.

Fixes: f8b581d6 ("libxfs: actually make buffers track the per-ag structures")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: identify the minlogsize transaction reservation

Right now, we don't make it easy to spot the transaction reservation
used to compute the minimum log size in userspace:

# xfs_db -c logres /dev/sda
type 0 logres 168184 logcount 5 flags 0x4
...
type 25 logres 760 logcount 0 flags 0x0
type -1 logres 547200 logcount 8 flags 0x4

Type "-1" doesn't communicate the purpose at all, it just looks like a
math error. Help out the user a bit by printing more information:

minlogsize logres 547200 logcount 8

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: don't flag log_incompat inconsistencies as corruptions

While testing xfs/233 and xfs/127 with LARP mode enabled, I noticed
errors such as the following:

xfs_growfs --BlockSize=4096 --Blocks=8192
data blocks changed from 8192 to 2579968
meta-data=/dev/sdf               isize=512    agcount=630, agsize=4096 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=2579968, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3075, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
_check_xfs_filesystem: filesystem on /dev/sdf is inconsistent (r)
*** xfs_repair -n output ***
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 23:03:47: zeroing log - 3075 of 3075 blocks done
        - scan filesystem freespace and inode maps...
would fix log incompat feature mismatch in AG 30 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 8 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 12 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 24 super, 0x0 != 0x1
would fix log incompat feature mismatch in AG 18 super, 0x0 != 0x1
<snip>

0x1 corresponds to XFS_SB_FEAT_INCOMPAT_LOG_XATTRS, which is the feature
bit used to indicate that the log contains extended attribute log intent
items.  This is a mechanism to prevent older kernels from trying to
recover log items that they won't know how to recover.

I thought about this a little bit more, and realized that log_incompat
features bits are set on the primary sb prior to writing certain types
of log records, and cleared once the log has written the committed
changes back to the filesystem.  If the secondary superblocks are
updated at any point during that interval (due to things like growfs or
setting labels), the log_incompat field will now be set on the secondary
supers.

Due to the ephemeral nature of the current log_incompat feature bits,
a discrepancy between the primary and secondary supers is not a
corruption.  If we're in dry run mode, we should log the discrepancy,
but that's not a reason to end with EXIT_FAILURE.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: always rewrite secondary supers when needsrepair is set

Dave Chinner complained about xfs_scrub failures coming from xfs/158.
That test induces xfs_repair to fail while upgrading a filesystem to
have the inobtcount feature, and then restarts xfs_repair to finish the
upgrade.  When the second xfs_repair run starts, it will find that the
primary super has NEEDSREPAIR set, along with whatever new feature that
we were trying to add to the filesystem.

From there, repair completes the upgrade in much the same manner as the
first repair run would have, with one big exception -- it forgets to set
features_changed to trigger rewriting of the secondary supers at the end
of repair.  This results in discrepancies between the supers:

# XFS_REPAIR_FAIL_AFTER_PHASE=2 xfs_repair -c inobtcount=1 /dev/sdf
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Adding inode btree counts to filesystem.
Killed
# xfs_repair /dev/sdf
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
bad inobt block count 0, saw 1
bad finobt block count 0, saw 1
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 2
        - agno = 0
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
# xfs_db -c 'sb 0' -c 'print' -c 'sb 1' -c 'print' /dev/sdf | \
egrep '(features_ro_compat|features_incompat)'
features_ro_compat = 0xd
features_incompat = 0xb
features_ro_compat = 0x5
features_incompat = 0xb

Curiously, re-running xfs_repair will not trigger any warnings about the
featureset mismatch between the primary and secondary supers.  xfs_scrub
immediately notices, which is what causes xfs/158 to fail.

This discrepancy doesn't happen when the upgrade completes successfully
in a single repair run, so we need to teach repair to rewrite the
secondaries at the end of repair any time needsrepair was set.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: remove xfs_globals.larp

This dummy debugging knob isn't necessary anymore, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_logprint: fix formatting specifiers

Fix a missing %u -> %PRIu32 conversion, and add the required '%' in the
format specifiers because PRIu{32,64} do not include it on their own.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

misc: fix unsigned integer comparison complaints

gcc 11.2 complains about certain variables now that xfs_extnum_t is an
unsigned 64-bit integer:

dinode.c: In function ‘process_exinode’:
dinode.c:960:21: error: comparison of unsigned expression in ‘< 0’ is always false [-Werror=type-limits]
960 | if (numrecs < 0)

Since we actually have a function that will tell us the maximum
supported extent count for an ondisk dinode structure, use a direct
comparison instead of tricky integer math to detect overflows. A more
exhaustive audit is probably necessary.

IOWS, shut up, gcc...

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.19.0-rc0.1

Update all the necessary files for a 5.19.0-rc0.1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: don't hold xattr leaf buffers across transaction rolls

Source kernel commit: e53bcffad0326c1ef4b4baec4262b5343e420c44

Now that we've established (again!) that empty xattr leaf buffers are
ok, we no longer need to bhold them to transactions when we're creating
new leaf blocks. Get rid of the entire mechanism, which should simplify
the xattr code quite a bit.

The original justification for using bhold here was to prevent the AIL
from trying to write the empty leaf block into the fs during the brief
time that we release the buffer lock. The reason for /that/ was to
prevent recovery from tripping over the empty ondisk block.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: empty xattr leaf header blocks are not corruption

Source kernel commit: 7be3bd8856fba99f8b25b9c223250e42292c312e

TLDR: Revert commit 51e6104fdb95 ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify") because it was wrong.

Every now and then we get a corruption report from the kernel or
xfs_repair about empty leaf blocks in the extended attribute structure.
We've long thought that these shouldn't be possible, but prior to 5.18
one would shake loose in the recoveryloop fstests about once a month.

A new addition to the xattr leaf block verifier in 5.19-rc1 makes this
happen every 7 minutes on my testing cloud.  I added a ton of logging to
detect any time we set the header count on an xattr leaf block to zero.
This produced the following dmesg output on generic/388:

XFS (sda4): ino 0x21fcbaf leaf 0x129bf78 hdcount==0!
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
xfs_attr3_leaf_create+0x187/0x230
xfs_attr_shortform_to_leaf+0xd1/0x2f0
xfs_attr_set_iter+0x73e/0xa90
xfs_xattri_finish_update+0x45/0x80
xfs_attr_finish_item+0x1b/0xd0
xfs_defer_finish_noroll+0x19c/0x770
__xfs_trans_commit+0x153/0x3e0
xfs_attr_set+0x36b/0x740
xfs_xattr_set+0x89/0xd0
__vfs_setxattr+0x67/0x80
__vfs_setxattr_noperm+0x6e/0x120
vfs_setxattr+0x97/0x180
setxattr+0x88/0xa0
path_setxattr+0xc3/0xe0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

So now we know that someone is creating empty xattr leaf blocks as part
of converting a sf xattr structure into a leaf xattr structure.  The
conversion routine logs any existing sf attributes in the same
transaction that creates the leaf block, so we know this is a setxattr
to a file that has no attributes at all.

Next, g/388 calls the shutdown ioctl and cycles the mount to trigger log
recovery.  I also augmented buffer item recovery to call ->verify_struct
on any attr leaf blocks and complain if it finds a failure:

XFS (sda4): Unmounting Filesystem
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Starting recovery (logdev: internal)
XFS (sda4): xattr leaf daddr 0x129bf78 hdrcount == 0!
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
xfs_attr3_leaf_verify+0x3b8/0x420
xlog_recover_buf_commit_pass2+0x60a/0x6c0
xlog_recover_items_pass2+0x4e/0xc0
xlog_recover_commit_trans+0x33c/0x350
xlog_recovery_process_trans+0xa5/0xe0
xlog_recover_process_data+0x8d/0x140
xlog_do_recovery_pass+0x19b/0x720
xlog_do_log_recovery+0x62/0xc0
xlog_do_recover+0x33/0x1d0
xlog_recover+0xda/0x190
xfs_log_mount+0x14c/0x360
xfs_mountfs+0x517/0xa60
xfs_fs_fill_super+0x6bc/0x950
get_tree_bdev+0x175/0x280
vfs_get_tree+0x1a/0x80
path_mount+0x6f5/0xaa0
__x64_sys_mount+0x103/0x140
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fc61e241eae

And a moment later, the _delwri_submit of the recovered buffers trips
the same verifier and recovery fails:

XFS (sda4): Metadata corruption detected at xfs_attr3_leaf_verify+0x393/0x420 [xfs], xfs_attr3_leaf block 0x129bf78
XFS (sda4): Unmount and run xfs_repair
XFS (sda4): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
00000010: 00 00 00 00 01 29 bf 78 00 00 00 00 00 00 00 00  .....).x........
00000020: a5 1b d0 02 b2 9a 49 df 8e 9c fb 8d f8 31 3e 9d  ......I......1>.
00000030: 00 00 00 00 02 1f cb af 00 00 00 00 10 00 00 00  ................
00000040: 00 50 0f b0 00 00 00 00 00 00 00 00 00 00 00 00  .P..............
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (sda4): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x37f/0x3b0 [xfs] (fs/xfs/xfs_buf.c:1518).  Shutting down filesystem.
XFS (sda4): Please unmount the filesystem and rectify the problem(s)
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed

I think I see what's going on here -- setxattr is racing with something
that shuts down the filesystem:

Thread 1                                Thread 2
--------                                --------
xfs_attr_sf_addname
xfs_attr_shortform_to_leaf
<create empty leaf>
xfs_trans_bhold(leaf)
xattri_dela_state = XFS_DAS_LEAF_ADD
<roll transaction>
<flush log>
<shut down filesystem>
xfs_trans_bhold_release(leaf)
<discover fs is dead, bail>

Thread 3
--------
<cycle mount, start recovery>
xlog_recover_buf_commit_pass2
xlog_recover_do_reg_buffer
<replay empty leaf buffer from recovered buf item>
xfs_buf_delwri_queue(leaf)
xfs_buf_delwri_submit
_xfs_buf_ioapply(leaf)
xfs_attr3_leaf_write_verify
<trip over empty leaf buffer>
<fail recovery>

As you can see, the bhold keeps the leaf buffer locked and thus prevents
the *AIL* from tripping over the ichdr.count==0 check in the write
verifier.  Unfortunately, it doesn't prevent the log from getting
flushed to disk, which sets up log recovery to fail.

So.  It's clear that the kernel has always had the ability to persist
attr leaf blocks with ichdr.count==0, which means that it's part of the
ondisk format now.

Unfortunately, this check has been added and removed multiple times
throughout history.  It first appeared in[1] kernel 3.10 as part of the
early V5 format patches.  The check was later discovered to break log
recovery and hence disabled[2] during log recovery in kernel 4.10.
Simultaneously, the check was added[3] to xfs_repair 4.9.0 to try to
weed out the empty leaf blocks.  This was still not correct because log
recovery would recover an empty attr leaf block successfully only for
regular xattr operations to trip over the empty block during of the
block during regular operation.  Therefore, the check was removed
entirely[4] in kernel 5.7 but removal of the xfs_repair check was
forgotten.  The continued complaints from xfs_repair lead to us
mistakenly re-adding[5] the verifier check for kernel 5.19.  Remove it
once again.

[1] 517c22207b04 ("xfs: add CRCs to attr leaf blocks")
[2] 2e1d23370e75 ("xfs: ignore leaf attr ichdr.count in verifier
during log replay")
[3] f7140161 ("xfs_repair: junk leaf attribute if count == 0")
[4] f28cef9e4dac ("xfs: don't fail verifier on empty attr3 leaf
block")
[5] 51e6104fdb95 ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify")

Looking at the rest of the xattr code, it seems that files with empty
leaf blocks behave as expected -- listxattr reports no attributes;
getxattr on any xattr returns nothing as expected; removexattr does
nothing; and setxattr can add attributes just fine.

Original-bug: 517c22207b04 ("xfs: add CRCs to attr leaf blocks")
Still-not-fixed-by: 2e1d23370e75 ("xfs: ignore leaf attr ichdr.count in verifier during log replay")
Removed-in: f28cef9e4dac ("xfs: don't fail verifier on empty attr3 leaf block")
Fixes: 51e6104fdb95 ("xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix variable state usage

Source kernel commit: 10930b254d5be1cb4350fb7a456ccd5ea7e3cbd9

The variable @args is fed to a tracepoint, and that's the only place
it's used.  This is fine for the kernel, but for userspace, tracepoints
are #define'd out of existence, which results in this warning on gcc
11.2:

xfs_attr.c: In function ‘xfs_attr_node_try_addname’:
xfs_attr.c:1440:42: warning: unused variable ‘args’ [-Wunused-variable]
1440 |         struct xfs_da_args              *args = attr->xattri_da_args;
|                                          ^~~~

Clean this up.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix TOCTOU race involving the new logged xattrs control knob

Source kernel commit: f4288f01820e2d57722d21874c1fda661003c9b9

I found a race involving the larp control knob, aka the debugging knob
that lets developers enable logging of extended attribute updates:

Thread 1                        Thread 2

echo 0 > /sys/fs/xfs/debug/larp
setxattr(REPLACE)
xfs_has_larp (returns false)
xfs_attr_set

echo 1 > /sys/fs/xfs/debug/larp

xfs_attr_defer_replace
xfs_attr_init_replace_state
xfs_has_larp (returns true)
xfs_attr_init_remove_state

<oops, wrong DAS state!>

This isn't a particularly severe problem right now because xattr logging
is only enabled when CONFIG_XFS_DEBUG=y, and developers *should* know
what they're doing.

However, the eventual intent is that callers should be able to ask for
the assistance of the log in persisting xattr updates.  This capability
might not be required for /all/ callers, which means that dynamic
control must work correctly.  Once an xattr update has decided whether
or not to use logged xattrs, it needs to stay in that mode until the end
of the operation regardless of what subsequent parallel operations might
do.

Therefore, it is an error to continue sampling xfs_globals.larp once
xfs_attr_change has made a decision about larp, and it was not correct
for me to have told Allison that ->create_intent functions can sample
the global log incompat feature bitfield to decide to elide a log item.

Instead, create a new op flag for the xfs_da_args structure, and convert
all other callers of xfs_has_larp and xfs_sb_version_haslogxattrs within
the attr update state machine to look for the operations flag.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.19.0-rc0

Update all the necessary files for a 5.19.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_logprint: Add log item printing for ATTRI and ATTRD

This patch implements a new set of log printing functions to print the
ATTRI and ATTRD items and vectors in the log.  These will be used during
log dump and log recover operations.

Though most attributes are strings, the attribute operations accept
any binary payload, so we should not assume them printable.  This was
done intentionally in preparation for parent pointers.  Until parent
pointers get here, attributes have no discernible format.  So the print
routines are just a simple print or hex dump for now.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: Add option to create filesystem with large extent counters

Enabling nrext64 option on mkfs.xfs command line extends the maximum values of
inode data and attr fork extent counters to 2^48 - 1 and 2^32 - 1
respectively. This also sets the XFS_SB_FEAT_INCOMPAT_NREXT64 incompat flag
on the superblock preventing older kernels from mounting such a filesystem.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_info: Report NREXT64 feature status

This commit adds support to libfrog to obtain information about the
availability of NREXT64 feature in the underlying filesystem.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Invoke bulkstat ioctl with XFS_BULK_IREQ_NREXT64 flag

This commit adds support to libfrog to enable reporting 64-bit extent counters
to its users. In order to do so, bulkstat ioctl is now invoked with the newly
introduced XFS_BULK_IREQ_NREXT64 flag if the underlying filesystem's geometry
supports 64-bit extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move xfs_attr_use_log_assist usage out of libxfs

Source kernel commit: efc2efeba169ff37bbd425631985064365c614e0

The LARP patchset added an awkward coupling point between libxfs and
what would be libxlog, if the XFS log were actually its own library.
Move the code that sets up logged xattr updates out of libxfs and into
xfs_xattr.c so that libxfs no longer has to know about xlog_* functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: avoid unnecessary runtime sibling pointer endian conversions

Source kernel commit: 5672225e8f2a872a22b0cecedba7a6644af1fb84

Commit dc04db2aa7c9 has caused a small aim7 regression, showing a
small increase in CPU usage in __xfs_btree_check_sblock() as a
result of the extra checking.

This is likely due to the endian conversion of the sibling poitners
being unconditional instead of relying on the compiler to endian
convert the NULL pointer at compile time and avoiding the runtime
conversion for this common case.

Rework the checks so that endian conversion of the sibling pointers
is only done if they are not null as the original code did.

.... and these need to be "inline" because the compiler completely
fails to inline them automatically like it should be doing.

$ size fs/xfs/libxfs/xfs_btree.o*
text    data     bss     dec     hex filename
51874     240       0   52114    cb92 fs/xfs/libxfs/xfs_btree.o.orig
51562     240       0   51802    ca5a fs/xfs/libxfs/xfs_btree.o.inline

Just when you think the tools have advanced sufficiently we don't
have to care about stuff like this anymore, along comes a reminder
that *our tools still suck*.

Fixes: dc04db2aa7c9 ("xfs: detect self referencing btree sibling pointers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: convert buf_cancel_table allocation to kmalloc_array

Source kernel commit: 910bbdf2f4d7df46781bc9b723048f5ebed3d0d7

While we're messing around with how recovery allocates and frees the
buffer cancellation table, convert the allocation to use kmalloc_array
instead of the old kmem_alloc APIs, and make it handle a null return,
even though that's not likely.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: move xfs_attr_use_log_assist out of xfs_log.c

Source kernel commit: d9c61ccb3b09d8f892cccbf662ce0c870f8e4ade

The LARP patchset added an awkward coupling point between libxfs and
what would be libxlog, if the XFS log were actually its own library.
Move the code that enables logged xattr updates out of "lib"xlog and into
xfs_xattr.c so that it no longer has to know about xlog_* functions.

While we're at it, give xfs_xattr.c its own header file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: refactor buffer cancellation table allocation

Source kernel commit: 2723234923b3294dbcf6019c288c87465e927ed4

Move the code that allocates and frees the buffer cancellation tables
used by log recovery into the file that actually uses the tables. This
is a precursor to some cleanups and a memory leak fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: don't leak btree cursor when insrec fails after a split

Source kernel commit: a54f78def73d847cb060b18c4e4a3d1d26c9ca6d

The recent patch to improve btree cycle checking caused a regression
when I rebased the in-memory btree branch atop the 5.19 for-next branch,
because in-memory short-pointer btrees do not have AG numbers.  This
produced the following complaint from kmemleak:

unreferenced object 0xffff88803d47dde8 (size 264):
comm "xfs_io", pid 4889, jiffies 4294906764 (age 24.072s)
hex dump (first 32 bytes):
90 4d 0b 0f 80 88 ff ff 00 a0 bd 05 80 88 ff ff  .M..............
e0 44 3a a0 ff ff ff ff 00 df 08 06 80 88 ff ff  .D:.............
backtrace:
[<ffffffffa0388059>] xfbtree_dup_cursor+0x49/0xc0 [xfs]
[<ffffffffa029887b>] xfs_btree_dup_cursor+0x3b/0x200 [xfs]
[<ffffffffa029af5d>] __xfs_btree_split+0x6ad/0x820 [xfs]
[<ffffffffa029b130>] xfs_btree_split+0x60/0x110 [xfs]
[<ffffffffa029f6da>] xfs_btree_make_block_unfull+0x19a/0x1f0 [xfs]
[<ffffffffa029fada>] xfs_btree_insrec+0x3aa/0x810 [xfs]
[<ffffffffa029fff3>] xfs_btree_insert+0xb3/0x240 [xfs]
[<ffffffffa02cb729>] xfs_rmap_insert+0x99/0x200 [xfs]
[<ffffffffa02cf142>] xfs_rmap_map_shared+0x192/0x5f0 [xfs]
[<ffffffffa02cf60b>] xfs_rmap_map_raw+0x6b/0x90 [xfs]
[<ffffffffa0384a85>] xrep_rmap_stash+0xd5/0x1d0 [xfs]
[<ffffffffa0384dc0>] xrep_rmap_visit_bmbt+0xa0/0xf0 [xfs]
[<ffffffffa0384fb6>] xrep_rmap_scan_iext+0x56/0xa0 [xfs]
[<ffffffffa03850d8>] xrep_rmap_scan_ifork+0xd8/0x160 [xfs]
[<ffffffffa0385195>] xrep_rmap_scan_inode+0x35/0x80 [xfs]
[<ffffffffa03852ee>] xrep_rmap_find_rmaps+0x10e/0x270 [xfs]

I noticed that xfs_btree_insrec has a bunch of debug code that return
out of the function immediately, without freeing the "new" btree cursor
that can be returned when _make_block_unfull calls xfs_btree_split.  Fix
the error return in this function to free the btree cursor.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: assert in xfs_btree_del_cursor should take into account error

Source kernel commit: 56486f307100e8fc66efa2ebd8a71941fa10bf6f

xfs/538 on a 1kB block filesystem failed with this assert:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_ino.allocated == 0 || xfs_is_shutdown(cur->bc_mp), file: fs/xfs/libxfs/xfs_btree.c, line: 448

The problem was that an allocation failed unexpectedly in
xfs_bmbt_alloc_block() after roughly 150,000 minlen allocation error
injections, resulting in an EFSCORRUPTED error being returned to
xfs_bmapi_write(). The error occurred on extent-to-btree format
conversion allocating the new root block:

RIP: 0010:xfs_bmbt_alloc_block+0x177/0x210
Call Trace:
<TASK>
xfs_btree_new_iroot+0xdf/0x520
xfs_btree_make_block_unfull+0x10d/0x1c0
xfs_btree_insrec+0x364/0x790
xfs_btree_insert+0xaa/0x210
xfs_bmap_add_extent_hole_real+0x1fe/0x9a0
xfs_bmapi_allocate+0x34c/0x420
xfs_bmapi_write+0x53c/0x9c0
xfs_alloc_file_space+0xee/0x320
xfs_file_fallocate+0x36b/0x450
vfs_fallocate+0x148/0x340
__x64_sys_fallocate+0x3c/0x70
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa

Why the allocation failed at this point is unknown, but is likely
that we ran the transaction out of reserved space and filesystem out
of space with bmbt blocks because of all the minlen allocations
being done causing worst case fragmentation of a large allocation.

Regardless of the cause, we've then called xfs_bmapi_finish() which
calls xfs_btree_del_cursor(cur, error) to tear down the cursor.

So we have a failed operation, error != 0, cur->bc_ino.allocated > 0
and the filesystem is still up. The assert fails to take into
account that allocation can fail with an error and the transaction
teardown will shut the filesystem down if necessary. i.e. the
assert needs to check "|| error != 0" as well, because at this point
shutdown is pending because the current transaction is dirty....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: don't assert fail on perag references on teardown

Source kernel commit: 5b55cbc2d72632e874e50d2e36bce608e55aaaea

Not fatal, the assert is there to catch developer attention. I'm
seeing this occasionally during recoveryloop testing after a
shutdown, and I don't want this to stop an overnight recoveryloop
run as it is currently doing.

Convert the ASSERT to a XFS_IS_CORRUPT() check so it will dump a
corruption report into the log and cause a test failure that way,
but it won't stop the machine dead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: share xattr name and value buffers when logging xattr updates

Source kernel commit: 4183e4f27f402d712bccab30588a6fe7575963c0

While running xfs/297 and generic/642, I noticed a crash in
xfs_attri_item_relog when it tries to copy the attr name to the new
xattri log item.  I think what happened here was that we called
->iop_commit on the old attri item (which nulls out the pointers) as
part of a log force at the same time that a chained attr operation was
ongoing.  The system was busy enough that at some later point, the defer
ops operation decided it was necessary to relog the attri log item, but
as we've detached the name buffer from the old attri log item, we can't
copy it to the new one, and kaboom.

I think there's a broader refcounting problem with LARP mode -- the
setxattr code can return to userspace before the CIL actually formats
and commits the log item, which results in a UAF bug.  Therefore, the
xattr log item needs to be able to retain a reference to the name and
value buffers until the log items have completely cleared the log.
Furthermore, each time we create an intent log item, we allocate new
memory and (re)copy the contents; sharing here would be very useful.

Solve the UAF and the unnecessary memory allocations by having the log
code create a single refcounted buffer to contain the name and value
contents.  This buffer can be passed from old to new during a relog
operation, and the logging code can (optionally) attach it to the
xfs_attr_item for reuse when LARP mode is enabled.

This also fixes a problem where the xfs_attri_log_item objects weren't
being freed back to the same cache where they came from.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: do not use logged xattr updates on V4 filesystems

Source kernel commit: 22a68ba724232ba675166c307ddef3749ae4c37c

V4 superblocks do not contain the log_incompat feature bit, which means
that we cannot protect xattr log items against kernels that are too old
to know how to recover them. Turn off the log items for such
filesystems and adjust the "delayed" name to reflect what it's really
controlling.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix typo in comment

Source kernel commit: 41bc61c02a5a3aa4f8ba5f9e9c54668468925c72

Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename struct xfs_attr_item to xfs_attr_intent

Source kernel commit: e3c5de22026fda01674c400b97f96857dfb58aab

Everywhere else in XFS, structures that capture the state of an ongoing
deferred work item all have names that end with "_intent". The new
extended attribute deferred work items are not named as such, so fix it
to follow the naming convention used elsewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: clean up state variable usage in xfs_attr_node_remove_attr

Source kernel commit: 3768f6985700106e90eca87d814dc08e609a9969

The state variable is now a local variable pointing to a heap
allocation, so we don't need to zero-initialize it, nor do we need the
conditional to decide if we should free it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: put attr[id] log item cache init with the others

Source kernel commit: 4136e38af728eddcab2e51aecde28e94d0782b9b

Initialize and destroy the xattr log item caches in the same places that
we do all the other log item caches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove struct xfs_attr_item.xattri_flags

Source kernel commit: 500a512c60d132dbffc0233ba22ad067756c0ccd

Nobody uses this field, so get rid of it and the unused flag definition.
Rearrange the structure layout to reduce its size from 104 to 96 bytes.
This gets us from 39 to 42 objects per page.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: use a separate slab cache for deferred xattr work state

Source kernel commit: e2c78949b641d34cb4051b32c2fa5e03a4bfef35

Create a separate slab cache for struct xfs_attr_item objects, since we
can pack the (104-byte) intent items more tightly than we can with the
general slab cache objects. On x86, this means 39 intents per memory
page instead of 32.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: put the xattr intent item op flags in their own namespace

Source kernel commit: b53d212b4b5c29ab16edb7eca2b0e05927b7aa34

The flags that are stored in the extended attr intent log item really
should have a separate namespace from the rest of the XFS_ATTR_* flags.
Give them one to make it a little more obvious that they're intent item
flags.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: clean up xfs_attr_node_hasname

Source kernel commit: 4d0cdd2bb8f0bf1a30d361f14bafc428794551e0

The calling conventions of this function are a mess -- callers /can/
provide a pointer to a pointer to a state structure, but it's not
required, and as evidenced by the last two patches, the callers that do
weren't be careful enough about how to deal with an existing da state.

Push the allocation and freeing responsibilty to the callers, which
means that callers from the xattr node state machine steps now have the
visibility to allocate or free the da state structure as they please.
As a bonus, the node remove/add paths for larp-mode replaces can reset
the da state structure instead of freeing and immediately reallocating
it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>