git.ipfire.org Git - thirdparty/xfsprogs-dev.git/log

xfsprogs: Release v5.9.0

Update all the necessary files for a 5.9.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.9.0-rc1

Update all the necessary files for a 5.9.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: allow i18n to xfs printk

In preparation to a common stripe validation helper,
allow i18n to xfs_{notice,warn,err,alert} so that
xfsprogs can share code with kernel.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: ignore autofs mount table entries

Some of the xfsprogs utilities read the mount table via. getmntent(3).

The mount table may contain (almost always these days since /etc/mtab is
symlinked to /proc/self/mounts) autofs mount entries. During processing
of the mount table entries statfs(2) can be called on mount point paths
which will trigger an automount if those entries are direct or offset
autofs mount triggers (indirect autofs mounts aren't affected).

This can be a problem when there are a lot of autofs direct or offset
mounts because real mounts will be triggered when statfs(2) is called.
This can be particularly bad if the triggered mounts are NFS mounts and
the server is unavailable leading to lengthy boot times or worse.

Simply ignoring autofs mount entries during getmentent(3) traversals
avoids the statfs() call that triggers these mounts. If there are
automounted mounts (real mounts) at the time of reading the mount table
these will still be seen in the list so they will be included if that
actually matters to the reader.

Recent glibc getmntent(3) can ignore autofs mounts but that requires the
autofs user to configure autofs to use the "ignore" pseudo mount option
for autofs mounts. But this isn't yet the autofs default (to prevent
unexpected side effects) so that can't be used.

The autofs direct and offset automount triggers are pseudo file system
mounts and are more or less useless in terms on file system information
so excluding them doesn't sacrifice useful file system information
either.

Consequently excluding autofs mounts shouldn't have any adverse side
effects.

Changes since v1:
- drop hunk from fsr/xfs_fsr.c.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: fix ioctl_xfs_geometry manpage naming

Somehow "fsop_/FSOP_" snuck into this manpage's name, filename, and
ioctl name. It's not XFS_IOC_FSOP_GEOMETRY, it's XFS_IOC_FSGEOMETRY
so change all references, including the man page name, filename, and
references from xfsctl(3).

(the structure and flags do have the fsop_ string, which certainly
makes this a bit confusing)

Fixes: b427c816847e ("man: create a separate GEOMETRY ioctl manpage")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: coordinate parallel updates to the rt bitmap

Actually take the rt lock before updating the bitmap from multiple
threads. This fixes an infrequent corruption problem when running
generic/013 and rtinherit=1 is set on the root dir.

Fixes: 2556c98bd9e6 ("Perform true sequential bulk read prefetching in xfs_repair Merge of master-melb:xfs-cmds:29147a by kenmcd.")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_scrub: don't use statvfs to collect filesystem summary counts

The function scrub_scan_estimate_blocks naïvely uses the statvfs counts
to estimate the size and free blocks on the data volume. Unfortunately,
it fails to account for the fact that statvfs can return the size and
free counts for the realtime volume if the root directory has the
rtinherit flag set, which leads to phase 7 reporting totally absurd
quantities.

Eric pointed out a further problem with statvfs, which is that the file
counts are clamped to the current user's project quota inode limits.
Therefore, we must not use statvfs for querying the filesystem summary
counts.

The XFS_IOC_FSCOUNTS ioctl returns all the data we need, so use that
instead.

Fixes: 604dd3345f35 ("xfs_scrub: filesystem counter collection functions")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libhandle: fix potential unterminated string problem

gcc 10.2 complains about the strncpy call here, since it's possible that
the source string is so long that the fspath inside the fdhash structure
will end up without a null terminator. Work around strncpy braindamage
yet again by forcing the string to be terminated properly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libfrog: fix a potential null pointer dereference

Apparently, gcc 10.2 thinks that it's possible for either of the calloc
arguments to be zero here, in which case it will return NULL with a zero
errno. I suppose it's possible to do that via integer overflow in the
macro, though I find it unlikely unless someone passes in a yuuuge value.

Nevertheless, just shut up the warning by hardcoding the error number
so I can move on to nastier bugs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: disallow filesystems with reverse mapping and reflink and realtime

Neither the kernel nor the code in xfsprogs support filesystems that
have (either reverse mapping btrees or reflink) enabled and a realtime
volume configured. The kernel rejects such combinations and mkfs
refuses to format such a config, but xfsprogs doesn't check and can do
Bad Things, so port those checks before someone shreds their filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: remove a couple of unused function parameters

initialise_mount does not use mkfs_params, and initialise_ag_headers
does not use the xfs_sb param, remove them.

Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs.xfs: remove comment about needed future work

Remove comment about the need to sync this function with the
kernel; that was mostly taken care of with:

7b7548052 ("mkfs: use libxfs to write out new AGs")

There's maybe a little more samey-samey that we could do here,
but it's not egregiously cut & pasted as it was before.

Fixes: 7b7548052d12 ("mkfs: use libxfs to write out new AGs")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: don't flag RTINHERIT files when no rt volume

Don't flag directories with the RTINHERIT flag set when the filesystem
doesn't have a realtime volume configured. The kernel has let us set
RTINHERIT without a rt volume for ages, so it's not an invalid state.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: don't allow creation of realtime files from a proto file

If someone runs mkfs with rtinherit=1, a realtime volume configured, and
a protofile that creates a regular file in the filesystem, mkfs will
error out with "Function not implemented" because userspace doesn't know
how to allocate extents from the rt bitmap. Catch this specific case
and hand back a somewhat nicer explanation of what happened.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: don't propagate RTINHERIT -> REALTIME when there is no rtdev

When creating a file inside a directory that has RTINHERIT set, only
propagate the REALTIME flag to the file if the filesystem actually has a
realtime volume configured. Otherwise, we end up writing inodes that
trip the verifiers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: refactor inode flags propagation code

Hoist the code that propagates di_flags from a parent to a new child
into a separate function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: set required parts of the realtime geometry before computing log geometry

The minimum log size depends on the transaction reservation sizes, which
in turn depend on the realtime device geometry. Therefore, we need to
set up some of the rt geometry before we can compute the real minimum
log size.

This fixes a problem where mkfs, given a small data device and a
realtime volume, formats a filesystem with a log that is too small to
pass the mount time log size checks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: fix reflink/rmap logic w.r.t. realtime devices and crc=0 support

mkfs has some logic to deal with situations where reflink or rmapbt are
turned on and the administrator has configured a realtime device or a V4
filesystem; such configurations are not allowed.

The logic ought to disable reflink and/or rmapbt if they're enabled due
to being the defaults, and it ought to complain and abort if they're
enabled because the admin explicitly turned them on.

Unfortunately, the logic here doesn't do that and makes no sense at all
since usage() exits the program. Fix it to follow what everything else
does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs.xfs: tweak wording of external log device size complaint

If the external log device is too small to satisfy minimum requirements,
mkfs will complain about "external log device 512 too small...". That
doesn't make any sense, so add a few missing words to clarify what we're
talking about:

"external log device size 512 blocks too small..."

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

man: install all manpages that redirect to another manpage

Some of the ioctl manpages do not contain any information other than a
pointer to a different manpage. These aren't picked up by the install
scripts, so fix them so that they do.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: use libxfs_verify_rtbno to verify rt extents

Use the existing realtime block validation function to check the first
and last block of an extent in a realtime file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: throw away totally bad clusters

If the filesystem supports sparse inodes, we detect that an entire
cluster buffer has no detectable inodes at all, and we can easily mark
that part of the inode chunk sparse, just drop the cluster buffer and
forget about it. This makes repair less likely to go to great lengths
to try to save something that's totally unsalvageable.

This manifested in recs[2].free=zeroes in xfs/364, wherein the root
directory claimed to own block X and the inobt also claimed that X was
inodes; repair tried to create rmaps for both owners, and then the whole
mess blew up because the rmap code aborts on those kinds of anomalies.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix handling of data blocks colliding with existing metadata

Prior to commit a406779bc8d8, any blocks in a data fork extent that
collided with existing blocks would cause the entire data fork extent to
be rejected.  Unfortunately, the patch to add data block sharing support
suppressed checking for any collision, including metadata.  What we
really wanted to do here during a check_dups==1 scan is to is check for
specific collisions and without updating the block mapping data.

So, move the check_dups test after the for-switch construction.  This
re-enables detecting collisions between data fork blocks and a
previously scanned chunk of metadata, and improves the specificity of
the error message that results.

This was found by fuzzing recs[2].free=zeroes in xfs/364, though this
patch alone does not solve all the problems that scenario presents.

Fixes: a406779bc8d8 ("xfs_repair: handle multiple owners of data blocks")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: complain about unwritten extents when they're not appropriate

We don't allow unwritten extents in the attr fork, and we don't allow
them in the data fork except for regular files. Check that this is the
case.

Found by manually fuzzing the extentflag field of an attr fork to one.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: junk corrupt xattr root blocks

If reading the root block of an extended attribute structure fails due
to a corruption error, we should junk the block since we know it's bad.
There's no point in moving on to the (rather insufficient) checks in the
attr code.

Found by fuzzing hdr.freemap[1].base = ones in xfs/400.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: fix error in process_sf_dir2_fixi8

The goal of process_sf_dir2_fixi8 is to convert an i8 shortform
directory into a (shorter) i4 shortform directory. It achieves this by
duplicating the old sf directory contents (as oldsfp), zeroing i8count
in the caller's directory buffer (i.e. newsfp/sfp), and reinitializing
the new directory with the old directory's entries.

Unfortunately, it copies the parent pointer from sfp (the buffer we've
already started changing), not oldsfp. This leads to directory
corruption since at that point we zeroed i8count, which means that we
save only the upper four bytes from the parent pointer entry.

This was found by fuzzing u3.sfdir3.hdr.i8count = ones in xfs/384.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_repair: don't crash on partially sparse inode clusters

While running xfs/364 to fuzz the middle bit of recs[2].holemask, I
observed a crash in xfs_repair stemming from the fact that each sparse
bit accounts for 4 inodes, but inode cluster buffers can map to more
than four inodes.

When the first inode in an inode cluster is marked sparse,
process_inode_chunk won't try to load the inode cluster buffer.
Unfortunately, if the holemask indicates that there are inodes present
anywhere in the rest of the cluster buffer, repair will try to check the
corresponding cluster buffer, even if we didn't load it. This leads to
a null pointer dereference, which crashes repair.

Avoid the null pointer dereference by marking the inode sparse and
moving on to the next inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs.xfs: fix ASSERT on too-small device with stripe geometry

When a too-small device is created with stripe geometry, we hit an
assert in align_ag_geometry():

mkfs.xfs: xfs_mkfs.c:2834: align_ag_geometry: Assertion `cfg->agcount != 0' failed.

This is because align_ag_geometry() finds that the size of the last
(only) AG is too small, and attempts to trim it off. Obviously 0
AGs is invalid, and we hit the ASSERT.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.9.0-rc0

Update all the necessary files for a 5.9.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix xfs_bmap_validate_extent_raw when checking attr fork of rt files

Source kernel commit: d0c20d38af135b2b4b90aa59df7878ef0c8fbef4

The realtime flag only applies to the data fork, so don't use the
realtime block number checks on the attr fork of a realtime file.

Fixes: 30b0984d9117 ("xfs: refactor bmap record validation")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: initialize the shortform attr header padding entry

Source kernel commit: 125eac243806e021f33a1fdea3687eccbb9f7636

Don't leak kernel memory contents into the shortform attr fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix boundary test in xfs_attr_shortform_verify

Source kernel commit: f4020438fab05364018c91f7e02ebdd192085933

The boundary test for the fixed-offset parts of xfs_attr_sf_entry in
xfs_attr_shortform_verify is off by one, because the variable array
at the end is defined as nameval[1] not nameval[].
Hence we need to subtract 1 from the calculation.

This can be shown by:

# touch file
# setfattr -n root.a file

and verifications will fail when it's written to disk.

This only matters for a last attribute which has a single-byte name
and no value, otherwise the combination of namelen & valuelen will
push endp further out and this test won't fail.

Fixes: 1e1bbd8e7ee06 ("xfs: create structure verifier function for shortform xattrs")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix off-by-one in inode alloc block reservation calculation

Source kernel commit: 657f101930bc6c5b41bd7d6c22565c4302a80d33

The inode chunk allocation transaction reserves inobt_maxlevels-1
blocks to accommodate a full split of the inode btree. A full split
requires an allocation for every existing level and a new root
block, which means inobt_maxlevels is the worst case block
requirement for a transaction that inserts to the inobt. This can
lead to a transaction block reservation overrun when tmpfile
creation allocates an inode chunk and expands the inobt to its
maximum depth. This problem has been observed in conjunction with
overlayfs, which makes frequent use of tmpfiles internally.

The existing reservation code goes back as far as the Linux git repo
history (v2.6.12). It was likely never observed as a problem because
the traditional file/directory creation transactions also include
worst case block reservation for directory modifications, which most
likely is able to make up for a single block deficiency in the inode
allocation portion of the calculation. tmpfile support is relatively
more recent (v3.15), less heavily used, and only includes the inode
allocation block reservation as tmpfiles aren't linked into the
directory tree on creation.

Fix up the inode alloc block reservation macro and a couple of the
block allocator minleft parameters that enforce an allocation to
leave enough free blocks in the AG for a full inobt split.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

writeback: Drop I_DIRTY_TIME_EXPIRE

Source kernel commit: 5fcd57505c002efc5823a7355e21f48dd02d5a51

The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: delete duplicated words + other fixes

Source kernel commit: b63da6c8dfa9b2ab3554e8c59ef294d1f28bb9bd

Delete repeated words in fs/xfs/.
{we, that, the, a, to, fork}
Change "it it" to "it is" in one location.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Lift -ENOSPC handler from xfs_attr_leaf_addname

Source kernel commit: 0f89edcd8e8484dd1790ec474a650dd774c6b75e

Lift -ENOSPC handler from xfs_attr_leaf_addname. This will help to
reorganize transitions between the attr forms later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Simplify xfs_attr_node_addname

Source kernel commit: bf4a5cfffee096624764d5aed82f786c708dfe90

Invert the rename logic in xfs_attr_node_addname to simplify the
delayed attr logic later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Simplify xfs_attr_leaf_addname

Source kernel commit: 5fdca0ad5c95464a60c5954f853c22127f8bb64f

Invert the rename logic in xfs_attr_leaf_addname to simplify the
delayed attr logic later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add helper function xfs_attr_node_removename_rmt

Source kernel commit: 72b97ea40d23b333175323251d6afecb565a0b53

This patch adds another new helper function
xfs_attr_node_removename_rmt. This will also help modularize
xfs_attr_node_removename when we add delay ready attributes later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add helper function xfs_attr_node_removename_setup

Source kernel commit: 674eb548cf0ced1487ee229f96af2c7cf0099d2a

This patch adds a new helper function xfs_attr_node_removename_setup.
This will help modularize xfs_attr_node_removename when we add delay
ready attributes later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
[darrick: fix unused variable complaints by 0day robot]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add remote block helper functions

Source kernel commit: 410c19885db5f7c4fca55b437e199e18252142b6

This patch adds two new helper functions xfs_attr_store_rmt_blk and
xfs_attr_restore_rmt_blk. These two helpers assist to remove redundant
code associated with storing and retrieving remote blocks during the
attr set operations.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add helper function xfs_attr_leaf_mark_incomplete

Source kernel commit: f44df68c82dc060b9b9942e204096447e1efc677

This patch helps to simplify xfs_attr_node_removename by modularizing
the code around the transactions into helper functions. This will make
the function easier to follow when we introduce delayed attributes.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add helpers xfs_attr_is_shortform and xfs_attr_set_shortform

Source kernel commit: db1a28cc591c1abe5470bedca1a489bb165c4c7c

In this patch, we hoist code from xfs_attr_set_args into two new helpers
xfs_attr_is_shortform and xfs_attr_set_shortform. These two will help
to simplify xfs_attr_set_args when we get into delayed attrs later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Remove xfs_trans_roll in xfs_attr_node_removename

Source kernel commit: a237f2ddae4b79797fcb4290116acda38323da16

A transaction roll is not necessary immediately after setting the
INCOMPLETE flag when removing a node xattr entry with remote value
blocks. The remote block invalidation that immediately follows setting
the flag is an in-core only change. The next step after that is to start
unmapping the remote blocks from the attr fork, but the xattr remove
transaction reservation includes reservation for full tree splits of the
dabtree and bmap tree. The remote block unmap code will roll the
transaction as extents are unmapped and freed.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Remove unneeded xfs_trans_roll_inode calls

Source kernel commit: 0feaef17db9ddf0d97676eaa258dc55ff504cfb6

Some calls to xfs_trans_roll_inode and xfs_defer_finish routines are not
needed. If they are the last operations executed in these functions, and
no further changes are made, then higher level routines will roll or

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add helper function xfs_attr_node_shrink

Source kernel commit: 3f6e011ee2bfcf5b70c5f8978673cd0e5bc64a33

This patch adds a new helper function xfs_attr_node_shrink used to
shrink an attr name into an inode if it is small enough. This helps to
modularize the greater calling function xfs_attr_node_removename.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Pull up xfs_attr_rmtval_invalidate

Source kernel commit: d4034c4662af5d40de0655c641ddc6eccde0e8fc

This patch pulls xfs_attr_rmtval_invalidate out of
xfs_attr_rmtval_remove and into the calling functions.  Eventually
__xfs_attr_rmtval_remove will replace xfs_attr_rmtval_remove when we
introduce delayed attributes.  These functions are exepcted to return
-EAGAIN when they need a new transaction.  Because the invalidate does
not need a new transaction, we need to separate it from the rest of the
function that does.  This will enable __xfs_attr_rmtval_remove to
smoothly replace xfs_attr_rmtval_remove later.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Refactor xfs_attr_rmtval_remove

Source kernel commit: 8b8e0cc0208befc5971d552a8798c8f5537afa45

Refactor xfs_attr_rmtval_remove to add helper function
__xfs_attr_rmtval_remove. We will use this later when we introduce
delayed attributes. This function will eventually replace
xfs_attr_rmtval_remove

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Pull up trans roll in xfs_attr3_leaf_clearflag

Source kernel commit: 1fc618d76266750230f19e4053f5422a58b84141

New delayed allocation routines cannot be handling transactions so
pull them out into the calling functions

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Factor out xfs_attr_rmtval_invalidate

Source kernel commit: 795141099abc2ef4c801ca847190757880371390

Because new delayed attribute routines cannot roll transactions, we
carve off the parts of xfs_attr_rmtval_remove that we can use. This
will help to reduce repetitive code later when we introduce delayed
attributes.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Pull up trans roll from xfs_attr3_leaf_setflag

Source kernel commit: 0949d317aee051fcb7ad9c8c7ec5d60b5cc412eb

New delayed allocation routines cannot be handling transactions so
pull them up into the calling functions

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Refactor xfs_attr_try_sf_addname

Source kernel commit: 6cc5b5f89840cfe85cbd14e20500f25353a7f241

To help pre-simplify xfs_attr_set_args, we need to hoist transaction
handling up, while modularizing the adjacent code down into helpers. In
this patch, hoist the commit in xfs_attr_try_sf_addname up into the
calling function, and also pull the attr list creation down.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Split apart xfs_attr_leaf_addname

Source kernel commit: 7c93d4a8fc39cf209c6a67ff26fc83646c9f7d61

Split out new helper function xfs_attr_leaf_try_add from
xfs_attr_leaf_addname. Because new delayed attribute routines cannot
roll transactions, we split off the parts of xfs_attr_leaf_addname that
we can use, and move the commit into the calling function.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Pull up trans handling in xfs_attr3_leaf_flipflags

Source kernel commit: e3be1272ddaf5f7482008578b467cf0ca8c35695

Since delayed operations cannot roll transactions, pull up the
transaction handling into the calling function

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Factor out new helper functions xfs_attr_rmtval_set

Source kernel commit: 1a485fc1e96533587e2eab0e4d60798035528b7a

Break xfs_attr_rmtval_set into two helper functions
xfs_attr_rmt_find_hole and xfs_attr_rmtval_set_value.
xfs_attr_rmtval_set rolls the transaction between the helpers, but
delayed operations cannot. We will use the helpers later when
constructing new delayed attribute routines.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Check for -ENOATTR or -EEXIST

Source kernel commit: deed9512872d094ad1eae4abd0ff1c674df251d5

Delayed operations cannot return error codes. So we must check for
these conditions first before starting set or remove operations

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Add xfs_has_attr and subroutines

Source kernel commit: 07120f1abdff80f3d1351f733661abe28d609535

This patch adds a new functions to check for the existence of an
attribute. Subroutines are also added to handle the cases of leaf
blocks, nodes or shortform. Common code that appears in existing attr
add and remove functions have been factored out to help reduce the
appearance of duplicated code. We will need these routines later for
delayed attributes since delayed operations cannot return error codes.

Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a leak-on-error bug reported by Dan Carpenter]
[darrick: fix unused variable warning reported by 0day]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reported-by: dan.carpenter@oracle.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Refactor xfs_da_state_alloc() helper

Source kernel commit: 4491a3dd7192fb0db54117c42f785a8eb524607d

Every call to xfs_da_state_alloc() also requires setting up state->args
and state->mp

Change xfs_da_state_alloc() to receive an xfs_da_args_t as argument and
return a xfs_da_state_t with both args and mp already set.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: reduce struct typedef usage]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Remove kmem_zone_zalloc() usage

Source kernel commit: 32a2b11f467642ea700bc0b01f4693e52ec0fabd

Use kmem_cache_zalloc() directly.

With the exception of xlog_ticket_alloc() which will be dealt on the
next patch for readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Remove kmem_zone_alloc() usage

Source kernel commit: 3050bd0bfe706381c36e4b48bf4de465b0ab94f7

Use kmem_cache_alloc() directly.

All kmem_zone_alloc() users pass 0 as flags, which are translated into:
GFP_KERNEL | __GFP_NOWARN, and kmem_zone_alloc() loops forever until the
allocation succeeds.

We can use __GFP_NOFAIL to tell the allocator to loop forever rather
than doing it ourself, and because the allocation will never fail, we do
not need to use __GFP_NOWARN anymore. Hence, all callers can be
converted to use GFP_KERNEL | __GFP_NOFAIL

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment back in about nofail]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: xfs_btree_staging.h: delete duplicated words

Source kernel commit: 26270c9f4cf77815397e749f38a343732c28c4a5

Drop the repeated words "with" and "be" in comments.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename the ondisk dquot d_flags to d_type

Source kernel commit: d8c1af0d6ad617df4563e78bbee70691f25ace58

The ondisk dquot stores the quota record type in the flags field.
Rename this field to d_type to make the _type relationship between the
ondisk and incore dquot more obvious.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: improve ondisk dquot flags checking

Source kernel commit: a990f7a84edc9941956ea3c1dfb89733c80f9ad0

Create an XFS_DQTYPE_ANY mask for ondisk dquots flags, and use that to
ensure that we never accept any garbage flags when we're loading dquots.
While we're at it, restructure the quota type flag checking to use the
proper masking.

Note that I plan to add y2038 support soon, which will require a new
xfs_dqtype_t flag for extended timestamp support, hence all the work to
make the type masking work correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: create xfs_dqtype_t to represent quota types

Source kernel commit: 1a7ed271653a4f418a6398465f861ee795d34468

Create a new type (xfs_dqtype_t) to represent the type of an incore
dquot (user, group, project, or none). Rename the incore dquot's
dq_flags field to q_type.

This allows us to replace all the "uint type" arguments to the quota
functions with "xfs_dqtype_t type", to make it obvious when we're
passing a quota type argument into a function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename XFS_DQ_{USER,GROUP,PROJ} to XFS_DQTYPE_*

Source kernel commit: 8cd4901da56caadc16b4e8d6b434291a8ce31d7c

We're going to split up the incore dquot state flags from the ondisk
dquot flags (eventually renaming this "type") so start by renaming the
three flags and the bitmask that are going to participate in this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: drop the type parameter from xfs_dquot_verify

Source kernel commit: f9751c4ad3d17fa93773c187732f10c8a49940e3

xfs_qm_reset_dqcounts (aka quotacheck) is the only xfs_dqblk_verify
caller that actually knows the specific quota type that it's looking
for. Since everything else just pass in type==0 (including the buffer
verifier), drop the parameter and open-code the check like
xfs_dquot_from_disk already does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove qcore from incore dquots

Source kernel commit: 51dbb1be52fedfe7f612854bfcba3400043a7a75

Now that we've stopped using qcore entirely, drop it from the incore
dquot.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: make XFS_DQUOT_CLUSTER_SIZE_FSB part of the ondisk format

Source kernel commit: cb64e1299364a51bf60e96f2a35df31f47aa2eee

Move the dquot cluster size #define to xfs_format.h. It is an important
part of the ondisk format because the ondisk dquot record size is not an
even power of two, which means that the buffer size we use is
significant here because the kernel leaves slack space at the end of the
buffer to avoid having to deal with a dquot record crossing a block
boundary.

This is also an excuse to fix one of the longstanding discrepancies
between kernel and userspace libxfs headers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename dquot incore state flags

Source kernel commit: 985a78fdde15e1730383f99867ca38b5648444bf

Rename the existing incore dquot "dq_flags" field to "q_flags" to match
everything else in the structure, then move the two actual dquot state
flags to the XFS_DQFLAG_ namespace from XFS_DQ_.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: fix inode allocation block res calculation precedence

Source kernel commit: b2a8864728683443f34a9fd33a2b78b860934cc1

The block reservation calculation for inode allocation is supposed
to consist of the blocks required for the inode chunk plus
(maxlevels-1) of the inode btree multiplied by the number of inode
btrees in the fs (2 when finobt is enabled, 1 otherwise).

Instead, the macro returns (ialloc_blocks + 2) due to a precedence
error in the calculation logic. This leads to block reservation
overruns via generic/531 on small block filesystems with finobt
enabled. Add braces to fix the calculation and reserve the
appropriate number of blocks.

Fixes: 9d43b180af67 ("xfs: update inode allocation/free transaction reservations for finobt")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: get rid of unnecessary xfs_perag_{get,put} pairs

Source kernel commit: 92a005448f6fed70b5e7a9f29a1f930118449f1b

In the course of some operations, we look up the perag from
the mount multiple times to get or change perag information.
These are often very short pieces of code, so while the
lookup cost is generally low, the cost of the lookup is far
higher than the cost of the operation we are doing on the
perag.

Since we changed buffers to hold references to the perag
they are cached in, many modification contexts already hold
active references to the perag that are held across these
operations. This is especially true for any operation that
is serialised by an allocation group header buffer.

In these cases, we can just use the buffer's reference to
the perag to avoid needing to do lookups to access the
perag. This means that many operations don't need to do
perag lookups at all to access the perag because they've
already looked up objects that own persistent references
and hence can use that reference instead.

Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: remove xfs_inobp_check()

Source kernel commit: e2705b0304778916db87831217ec642e34d9d9fa

This debug code is called on every xfs_iflush() call, which then
checks every inode in the buffer for non-zero unlinked list field.
Hence it checks every inode in the cluster buffer every time a
single inode on that cluster it flushed. This is resulting in:

- 38.91% 5.33% [kernel] [k] xfs_iflush
- 17.70% xfs_iflush
- 9.93% xfs_inobp_check
4.36% xfs_buf_offset

10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.

The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:

ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));

which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: attach inodes to the cluster buffer when dirtied

Source kernel commit: 48d55e2ae3ce837598c073995bbbac5d24a35fe1

Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. The means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: pin inode backing buffer to the inode log item

Source kernel commit: 298f7bec503f30bd98242ec02df6abe13b31a677

When we dirty an inode, we are going to have to write it disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.

When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
and results in more cluster buffers to be reclaimed, resulting in
more RMW cycles to be done in the AIL context and everything then
backs up on AIL progress. Only the synchronous inode cluster
writeback in the the inode reclaim code provides some level of
forwards progress guarantees that prevent OOM-killer rampages in
this situation.

Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This may mean the first modification of an inode that has been held
in cache for a long time may block on a cluster buffer read, but
we can do that in transaction context and block safely until the
buffer has been allocated and read.

Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.

When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again.

This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: add an inode item lock

Source kernel commit: 1319ebefd6ed7a9988b7b4bc9317fbcf61a28bfc

The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time time existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.

e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
completion time, and is read under the ILOCK_EXCL when the inode is
logged. Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.

We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.

However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.

This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.

This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Don't allow logging of XFS_ISTALE inodes

Source kernel commit: 96355d5a1f0ee6dcc182c37db4894ec0c29f1692

In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer and then the rcu callback
that freed inodes was assert failing because the inode still had an
active pointer to the cluster buffer after it had been reclaimed.

Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.

On journal commit dirty stale inodes as are handled by both
buffer and inode log items to run though xfs_istale_done() and
removed from the AIL (buffer log item commit) or the log item will
simply unpin it because the buffer log item will clean it. What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.

However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.

Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.

Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_free. It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.

Reproducer for this issue was generic/558 on a v4 filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: redesign the reflink remap loop to fix blkres depletion crash

Source kernel commit: 00fd1d56dd08a8ceaa9e4ee1a41fefd9f6c6bc7d

The existing reflink remapping loop has some structural problems that
need addressing:

The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file.  In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required.  On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.

The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
makes the frontend loop much simpler.

Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reported-by: Edwin Török <edwin@etorok.net>
Tested-by: Edwin Török <edwin@etorok.net>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: rename xfs_bmap_is_real_extent to is_written_extent

Source kernel commit: 877f58f53684f14ca3202640f70592bf44890924

The name of this predicate is a little misleading -- it decides if the
extent mapping is allocated and written. Change the name to be more
direct, as we're going to add a new predicate in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: preserve rmapbt swapext block reservation from freed blocks

Source kernel commit: f74681ba2006434be195402e0b15fc5763cddd7e

The rmapbt extent swap algorithm remaps individual extents between
the source inode and the target to trigger reverse mapping metadata
updates. If either inode straddles a format or other bmap allocation
boundary, the individual unmap and map cycles can trigger repeated
bmap block allocations and frees as the extent count bounces back
and forth across the boundary. While net block usage is bound across
the swap operation, this behavior can prematurely exhaust the
transaction block reservation because it continuously drains as the
transaction rolls. Each allocation accounts against the reservation
and each free returns to global free space on transaction roll.

The previous workaround to this problem attempted to detect this
boundary condition and provide surplus block reservation to
acommodate it. This is insufficient because more remaps can occur
than implied by the extent counts; if start offset boundaries are
not aligned between the two inodes, for example.

To address this problem more generically and dynamically, add a
transaction accounting mode that returns freed blocks to the
transaction reservation instead of the superblock counters on
transaction roll and use it when the rmapbt based algorithm is
active. This allows the chain of remap transactions to preserve the
block reservation based own its own frees and prevent premature
exhaustion regardless of the remap pattern. Note that this is only
safe for superblocks with lazy sb accounting, but the latter is
required for v5 supers and the rmap feature depends on v5.

Fixes: b3fed434822d0 ("xfs: account format bouncing into rmapbt swapext tx reservation")
Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: Couple of typo fixes in comments

Source kernel commit: 06734e3c95a34e4d71342f0583f1bb88c61ed9b7

./xfs/libxfs/xfs_inode_buf.c:56: unnecssary ==> unnecessary
./xfs/libxfs/xfs_inode_buf.c:59: behavour ==> behaviour
./xfs/libxfs/xfs_inode_buf.c:206: unitialized ==> uninitialized

Signed-off-by: Keyur Patel <iamkeyur96@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

libxfs: actually make buffers track the per-ag structures

One of the patches in 5.9 reduces the number of xfs_perag_get calls by
using the b_pag pointer in struct xfs_buf. Userspace doesn't actually
do anything with this field (and hence that change will cause null
pointer dereferences), so start tracking b_pag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.8.0

Update all the necessary files for a 5.8.0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: set b_ops to NULL in set_cur for types without verifiers

If we are using set_cur() to set a type that has no verifier ops,
be sure to set b_ops to NULL so that the old verifiers don't run
against the buffer anymore, which may have changed size.

Fixes: cdabe556 ("xfs_db: consolidate set_iocur_type behavior")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.8.0-rc1

Update all the necessary files for a 5.8.0-rc1 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: move custom interface definitions out of xfs_fs.h

There are several definitions and structures present in the userspace
copy of libxfs/xfs_fs.h which support older, custom xfs interfaces
which are now common definitions in the vfs.

Move them into their own compat header to minimize the shared file
differences.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_quota: state command should report ugp grace times

Since grace periods are now supported for three quota types (ugp),
modify xfs_quota state command to report times for all three.
Add a helper function for stat reporting.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_quota: display warning limits when printing quota type information

We should dump the default warning limits when we're printing quota
information.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_quota: command error message improvement

Make the error messages for rudimentary xfs_quota commands
(off, enable, disable) more user friendly, instead of the
terse sys error outputs.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: consolidate set_iocur_type behavior

Right now there are 3 cases to type_f: inode type, type with fields,
and a default. The first two were added to address issues with handling
V5 metadata.

The first two already use some version of set_cur, which handles all
of the validation etc. There's no reason to leave the open-coded bits
at the end, just send every non-inode type through set_cur and be done
with it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: short circuit type_f if type is unchanged

There's no reason to go through the type change code if the
type has not been changed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: use correct inode to set inode type

A test fails as:
  # xfs_db -c "inode 133" -c "addr" -c "p core.size" -c "type inode" -c "addr" -c "p core.size" /dev/sdb1
  current
          byte offset 68096, length 512
          buffer block 128 (fsbno 16), 32 bbs
          inode 133, dir inode -1, type inode
  core.size = 123142
  current
          byte offset 65536, length 512
          buffer block 128 (fsbno 16), 32 bbs
          inode 128, dir inode 128, type inode
  core.size = 42

The "type inode" command accidentally moves the io cursor because it
forgets to include the io cursor's buffer offset when it computes the
inode number from the io cursor's location.

Fixes: 533d1d229a88 ("xfs_db: properly set inode type")
Reported-by: Jianhong Yin <jiyin@redhat.com>
Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

mkfs: allow setting dax flag on root directory

Teach mkfs to set the DAX flag on the root directory so that all new
files can be created in dax mode. This is a complement to removing the
mount option.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

man: update mkfs.xfs inode flag option documentation

The mkfs manpage says that the extent size, cow extent size, realtime,
and project id inheritance bits are passed on to "newly created
children". This isn't technically true -- it's only passed on to newly
created regular files and directories. It is not passed on to special
files.

Fix this minor inaccuracy in the documentation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: report the inode dax flag

Report the inode DAX flag when we're printing an inode, just like we do
for other v3 inode flags.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs_db: fix nlink usage in check

process_inode uses a local convenience variable to abstract the
differences between the ondisk nlink fields in a v1 inode and a v2
inode. Use this variable for checking and reporting errors.

Fixes: 6526f30e4801 ("xfs_db: stop misusing an onstack inode")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfsprogs: Release v5.8.0-rc0

Update all the necessary files for a 5.8.0-rc0 release.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: more lockdep whackamole with kmem_alloc*

Source kernel commit: 6dcde60efd946e38fac8d276a6ca47492103e856

Dave Airlie reported the following lockdep complaint:

>  ======================================================
>  WARNING: possible circular locking dependency detected
>  5.7.0-0.rc5.20200515git1ae7efb38854.1.fc33.x86_64 #1 Not tainted
>  ------------------------------------------------------
>  kswapd0/159 is trying to acquire lock:
>  ffff9b38d01a4470 (&xfs_nondir_ilock_class){++++}-{3:3},
>  at: xfs_ilock+0xde/0x2c0 [xfs]
>
>  but task is already holding lock:
>  ffffffffbbb8bd00 (fs_reclaim){+.+.}-{0:0}, at:
>  __fs_reclaim_acquire+0x5/0x30
>
>  which lock already depends on the new lock.
>
>
>  the existing dependency chain (in reverse order) is:
>
>  -> #1 (fs_reclaim){+.+.}-{0:0}:
>         fs_reclaim_acquire+0x34/0x40
>         __kmalloc+0x4f/0x270
>         kmem_alloc+0x93/0x1d0 [xfs]
>         kmem_alloc_large+0x4c/0x130 [xfs]
>         xfs_attr_copy_value+0x74/0xa0 [xfs]
>         xfs_attr_get+0x9d/0xc0 [xfs]
>         xfs_get_acl+0xb6/0x200 [xfs]
>         get_acl+0x81/0x160
>         posix_acl_xattr_get+0x3f/0xd0
>         vfs_getxattr+0x148/0x170
>         getxattr+0xa7/0x240
>         path_getxattr+0x52/0x80
>         do_syscall_64+0x5c/0xa0
>         entry_SYSCALL_64_after_hwframe+0x49/0xb3
>
>  -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}:
>         __lock_acquire+0x1257/0x20d0
>         lock_acquire+0xb0/0x310
>         down_write_nested+0x49/0x120
>         xfs_ilock+0xde/0x2c0 [xfs]
>         xfs_reclaim_inode+0x3f/0x400 [xfs]
>         xfs_reclaim_inodes_ag+0x20b/0x410 [xfs]
>         xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
>         super_cache_scan+0x190/0x1e0
>         do_shrink_slab+0x184/0x420
>         shrink_slab+0x182/0x290
>         shrink_node+0x174/0x680
>         balance_pgdat+0x2d0/0x5f0
>         kswapd+0x21f/0x510
>         kthread+0x131/0x150
>         ret_from_fork+0x3a/0x50
>
>  other info that might help us debug this:
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(fs_reclaim);
>                                 lock(&xfs_nondir_ilock_class);
>                                 lock(fs_reclaim);
>    lock(&xfs_nondir_ilock_class);
>
>   *** DEADLOCK ***
>
>  4 locks held by kswapd0/159:
>   #0: ffffffffbbb8bd00 (fs_reclaim){+.+.}-{0:0}, at:
>  __fs_reclaim_acquire+0x5/0x30
>   #1: ffffffffbbb7cef8 (shrinker_rwsem){++++}-{3:3}, at:
>  shrink_slab+0x115/0x290
>   #2: ffff9b39f07a50e8
>  (&type->s_umount_key#56){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
>   #3: ffff9b39f077f258
>  (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at:
>  xfs_reclaim_inodes_ag+0x82/0x410 [xfs]

This is a known false positive because inodes cannot simultaneously be
getting reclaimed and the target of a getxattr operation, but lockdep
doesn't know that.  We can (selectively) shut up lockdep until either
it gets smarter or we change inode reclaim not to require the ILOCK by
applying a stupid GFP_NOLOCKDEP bandaid.

Reported-by: Dave Airlie <airlied@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Dave Airlie <airlied@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: force writes to delalloc regions to unwritten

Source kernel commit: a5949d3faedf492fa7863b914da408047ab46eb0

When writing to a delalloc region in the data fork, commit the new
allocations (of the da reservation) as unwritten so that the mappings
are only marked written once writeback completes successfully. This
fixes the problem of stale data exposure if the system goes down during
targeted writeback of a specific region of a file, as tested by
generic/042.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: always return -ENOSPC on project quota reservation failure

Source kernel commit: dcf1ccc99e6db06a3a3cc9f72161f7d084a38d40

XFS project quota treats project hierarchies as "mini filesysems" and
so rather than -EDQUOT, the intent is to return -ENOSPC when a quota
reservation fails, but this behavior is not consistent.

The only place we make a decision between -EDQUOT and -ENOSPC
returns based on quota type is in xfs_trans_dqresv().

This behavior is currently controlled by whether or not the
XFS_QMOPT_ENOSPC flag gets passed into the quota reservation. However,
its use is not consistent; paths such as xfs_create() and xfs_symlink()
don't set the flag, so a reservation failure will return -EDQUOT for
project quota reservation failures rather than -ENOSPC for these sorts
of operations, even for project quota:

# mkdir mnt/project
# xfs_quota -x -c "project -s -p mnt/project 42" mnt
# xfs_quota -x -c 'limit -p isoft=2 ihard=3 42' mnt
# touch mnt/project/file{1,2,3}
touch: cannot touch ‘mnt/project/file3’: Disk quota exceeded

We can make this consistent by not requiring the flag to be set at the
top of the callchain; instead we can simply test whether we are
reserving a project quota with XFS_QM_ISPDQ in xfs_trans_dqresv and if
so, return -ENOSPC for that failure. This removes the need for the
XFS_QMOPT_ENOSPC altogether and simplifies the code a fair bit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

xfs: cleanup xfs_idestroy_fork

Source kernel commit: ef8385128d4b31a382d496b1c433697993bd0bfb

Move freeing the dynamically allocated attr and COW fork, as well
as zeroing the pointers where actually needed into the callers, and
just pass the xfs_ifork structure to xfs_idestroy_fork. Also simplify
the kmem_free calls by not checking for NULL first.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>