]> git.ipfire.org Git - thirdparty/kernel/stable.git/log
thirdparty/kernel/stable.git
7 years agoxfs: add helpers to dispose of old btree blocks after a repair
Darrick J. Wong [Wed, 30 May 2018 05:18:10 +0000 (22:18 -0700)] 
xfs: add helpers to dispose of old btree blocks after a repair

Now that we've plumbed in the ability to construct a list of dead btree
blocks following a repair, add more helpers to dispose of them.  This is
done by examining the rmapbt -- if the btree was the only owner we can
free the block, otherwise it's crosslinked and we can only remove the
rmapbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: add helpers to collect and sift btree block pointers during repair
Darrick J. Wong [Wed, 30 May 2018 05:18:09 +0000 (22:18 -0700)] 
xfs: add helpers to collect and sift btree block pointers during repair

Add some helpers to assemble a list of fs block extents.  Generally,
repair functions will iterate the rmapbt to make a list (1) of all
extents owned by the nominal owner of the metadata structure; then they
will iterate all other structures with the same rmap owner to make a
list (2) of active blocks; and finally we have a subtraction function to
subtract all the blocks in (2) from (1), with the result that (1) is now
a list of blocks that were owned by the old btree and must be disposed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: add helpers to allocate and initialize fresh btree roots
Darrick J. Wong [Wed, 30 May 2018 05:18:09 +0000 (22:18 -0700)] 
xfs: add helpers to allocate and initialize fresh btree roots

Add a pair of helper functions to allocate and initialize fresh btree
roots.  The repair functions will use these as part of recreating
corrupted metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
7 years agoxfs: add helpers to deal with transaction allocation and rolling
Darrick J. Wong [Wed, 30 May 2018 05:18:08 +0000 (22:18 -0700)] 
xfs: add helpers to deal with transaction allocation and rolling

For repairs, we need to reserve at least as many blocks as we think
we're going to need to rebuild the data structure, and we're going to
need some helpers to roll transactions while maintaining locks on the AG
headers so that other threads cannot wander into the middle of a repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
7 years agoxfs: grab the per-ag structure whenever relevant
Darrick J. Wong [Wed, 30 May 2018 05:24:44 +0000 (22:24 -0700)] 
xfs: grab the per-ag structure whenever relevant

Grab and hold the per-AG data across a scrub run whenever relevant.
This helps us avoid repeated trips through rcu and the radix tree
in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agofs: xfs: Change return type to vm_fault_t
Souptick Joarder [Tue, 29 May 2018 17:39:03 +0000 (10:39 -0700)] 
fs: xfs: Change return type to vm_fault_t

Use new return type vm_fault_t for fault handlers.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: fix inobt magic number check
Darrick J. Wong [Thu, 24 May 2018 15:54:59 +0000 (08:54 -0700)] 
xfs: fix inobt magic number check

In commit a6a781a58befcbd467c ("xfs: have buffer verifier functions
report failing address") the bad magic number return was ported
incorrectly.

Fixes: a6a781a58befcbd467ce843af4eaca3906aa1f08
Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
7 years agofs: clear writeback errors in inode_init_always
Darrick J. Wong [Tue, 22 May 2018 18:48:08 +0000 (11:48 -0700)] 
fs: clear writeback errors in inode_init_always

In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.

This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file.  This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoiomap: don't allow holes in swapfiles
Omar Sandoval [Wed, 16 May 2018 18:13:34 +0000 (11:13 -0700)] 
iomap: don't allow holes in swapfiles

generic_swapfile_activate() doesn't allow holes, so we should be
consistent here. This is also a bit safer: if the user creates a
swapfile with, say, truncate -s $SIZE followed by mkswap, they should
really get an error and not much less swap space than they expected.
swapon(8) will error out before calling swapon(2) if the file has holes,
anyways.

Fixes: 9d93388b0afe ("iomap: add a swapfile activation function")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoiomap: provide more useful errors for invalid swap files
Omar Sandoval [Wed, 16 May 2018 18:13:34 +0000 (11:13 -0700)] 
iomap: provide more useful errors for invalid swap files

Currently, for an invalid swap file, we print the same error message
regardless of the reason. This isn't very useful for an admin, who will
likely want to know why exactly they can't use their swap file. So,
let's add specific error messages for each reason, and also move the
bdev check after the flags checks, since the latter are more
fundamental.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: implement online get/set fs label
Eric Sandeen [Tue, 15 May 2018 20:21:48 +0000 (13:21 -0700)] 
xfs: implement online get/set fs label

The GET ioctl is trivial, just return the current label.

The SET ioctl is more involved:
It transactionally modifies the superblock to write a new filesystem
label to the primary super.

A new variant of xfs_sync_sb then writes the superblock buffer
immediately to disk so that the change is visible from userspace.

It then invalidates any page cache that userspace might have previously
read on the block device so that i.e. blkid can see the change
immediately, and updates all secondary superblocks as userspace relable
does.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[darrick: use dchinner's new xfs_update_secondary_sbs function]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agofs: copy BTRFS_IOC_[SG]ET_FSLABEL to vfs
Eric Sandeen [Tue, 15 May 2018 20:20:03 +0000 (13:20 -0700)] 
fs: copy BTRFS_IOC_[SG]ET_FSLABEL to vfs

This retains 256 chars as the maximum size through the interface, which
is the btrfs limit and AFAIK exceeds any other filesystem's maximum
label size.

This just copies the ioctl for now and leaves it in place for btrfs
for the time being.  A later patch will allow btrfs to use the new
common ioctl definition, but it may be sent after this is merged.

(Note, Reviewed-by's were originally given for the combined vfs+btrfs
patch, some license taken here.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: factor the ag length extension code into libxfs
Dave Chinner [Mon, 14 May 2018 06:10:08 +0000 (23:10 -0700)] 
xfs: factor the ag length extension code into libxfs

Growfs currently manually codes the extension of the last AG in a
filesytem during the growfs process. Factor that out of the growfs
code and move it into libxfs along with teh rest of the AG header
modification code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: move growfs core to libxfs
Dave Chinner [Mon, 14 May 2018 06:10:08 +0000 (23:10 -0700)] 
xfs: move growfs core to libxfs

So it can be shared with userspace (e.g. mkfs) easily.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: rework secondary superblock updates in growfs
Dave Chinner [Mon, 14 May 2018 06:10:08 +0000 (23:10 -0700)] 
xfs: rework secondary superblock updates in growfs

Right now we wait until we've committed changes to the primary
superblock before we initialise any of the new secondary
superblocks. This means that if we have any write errors for new
secondary superblocks we end up with garbage in place rather than
zeros or even an "in progress" superblock to indicate a grow
operation is being done.

To ensure we can write the secondary superblocks, initialise them
earlier in the same loop that initialises the AG headers. We stamp
the new secondary superblocks here with the old geometry, but set
the "sb_inprogress" field to indicate that updates are being done to
the superblock so they cannot be used.  This will result in the
secondary superblock fields being updated or triggering errors that
will abort the grow before we commit any permanent changes.

This also means we can change the update mechanism of the secondary
superblocks.  We know that we are going to wholly overwrite the
information in the struct xfs_sb in the buffer, so there's no point
reading it from disk. Just allocate an uncached buffer, zero it in
memory, stamp the new superblock structure in it and write it out.
If we fail to write it out, then we'll leave the existing sb (old or
new w/ inprogress) on disk for repair to deal with later.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: separate secondary sb update in growfs
Dave Chinner [Mon, 14 May 2018 06:10:07 +0000 (23:10 -0700)] 
xfs: separate secondary sb update in growfs

This happens after all the transactions to update the superblock
occur, and errors need to be handled slightly differently. Seperate
out the code into it's own function, and clean up the error goto
stack in the core growfs code as it is now much simpler.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: make imaxpct changes in growfs separate
Dave Chinner [Mon, 14 May 2018 06:10:07 +0000 (23:10 -0700)] 
xfs: make imaxpct changes in growfs separate

When growfs changes the imaxpct value of the filesystem, it runs
through all the "change size" growfs code, whether it needs to or
not. Separate out changing imaxpct into it's own function and
transaction to simplify the rest of the growfs code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: turn ag header initialisation into a table driven operation
Dave Chinner [Mon, 14 May 2018 06:10:06 +0000 (23:10 -0700)] 
xfs: turn ag header initialisation into a table driven operation

There's still more cookie cutter code in setting up each AG header.
Separate all the variables into a simple structure and iterate a
table of header definitions to initialise everything.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: factor ag btree root block initialisation
Dave Chinner [Mon, 14 May 2018 06:10:06 +0000 (23:10 -0700)] 
xfs: factor ag btree root block initialisation

Cookie cutter code, easily factored.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: convert growfs AG header init to use buffer lists
Dave Chinner [Mon, 14 May 2018 06:10:06 +0000 (23:10 -0700)] 
xfs: convert growfs AG header init to use buffer lists

We currently write all new AG headers synchronously, which can be
slow for large grow operations. All we really need to do is ensure
all the headers are on disk before we run the growfs transaction, so
convert this to a buffer list and a delayed write operation. We
block waiting for the delayed write buffer submission to complete,
so this will fulfill the requirement to have all the buffers written
correctly before proceeding.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: factor out AG header initialisation from growfs core
Dave Chinner [Mon, 14 May 2018 06:10:05 +0000 (23:10 -0700)] 
xfs: factor out AG header initialisation from growfs core

The intialisation of new AG headers is mostly common with the
userspace mkfs code and growfs in the kernel, so start factoring it
out so we can move it to libxfs and use it in both places.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: one-shot cached buffers
Dave Chinner [Mon, 14 May 2018 06:10:05 +0000 (23:10 -0700)] 
xfs: one-shot cached buffers

For the new growfs work, we want to ensure that we serialise
secondary superblock updates with other operations (e.g. scrub)
correctly, but we don't want to cache the buffers for long term
reuse. We need cached buffers for serialisation, however.

To solve this, introduce a "oneshot" buffer which will be marshalled
through the cache but then released once the last current reference
goes away. If the buffer is already cached, then we ignore the
"one-shot" behaviour and leave the buffer in the state it was prior
to the one-shot command being run. This means we don't perturb
either the working set or existing cached buffer state by a one-shot
operation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: implement the metadata repair ioctl flag
Darrick J. Wong [Mon, 14 May 2018 13:34:36 +0000 (06:34 -0700)] 
xfs: implement the metadata repair ioctl flag

Plumb in the pieces necessary to make the "scrub" subfunction of
the scrub ioctl actually work.  This means that we make the IFLAG_REPAIR
flag to the scrub ioctl actually do something, and we add an errortag
knob so that xfstests can force the kernel to rebuild a metadata
structure even if there's nothing wrong with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: create tracepoints for online repair
Darrick J. Wong [Mon, 14 May 2018 13:34:35 +0000 (06:34 -0700)] 
xfs: create tracepoints for online repair

These tracepoints will be used to debug the online repair routines.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: teach xfs_bmapi_remap to accept some bmapi flags
Darrick J. Wong [Mon, 14 May 2018 13:34:35 +0000 (06:34 -0700)] 
xfs: teach xfs_bmapi_remap to accept some bmapi flags

Teach xfs_bmapi_remap how to map in unwritten extent and to skip rmap
updates.  This enables us to rebuild real and unwritten extents from the
rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: make xfs_bmapi_remapi work with attribute forks
Darrick J. Wong [Mon, 14 May 2018 13:34:34 +0000 (06:34 -0700)] 
xfs: make xfs_bmapi_remapi work with attribute forks

Add a new flags argument to xfs_bmapi_remapi so that we can pass BMAPI
flags into the function.  This enables us to pass in BMAPI_ATTRFORK so
that we can remap things into the attribute fork.  Eventually the
online repair code will use this to rebuild attribute forks, so make it
non-static.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: hoist xfs_scrub_agfl_walk to libxfs as xfs_agfl_walk
Darrick J. Wong [Mon, 14 May 2018 13:34:34 +0000 (06:34 -0700)] 
xfs: hoist xfs_scrub_agfl_walk to libxfs as xfs_agfl_walk

This function is basically a generic AGFL block iterator, so promote it
to libxfs ahead of online repair wanting to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: avoid ABBA deadlock when scrubbing parent pointers
Darrick J. Wong [Mon, 14 May 2018 13:34:34 +0000 (06:34 -0700)] 
xfs: avoid ABBA deadlock when scrubbing parent pointers

In normal operation, the XFS convention is to take an inode's iolock
and then allocate a transaction.  However, when scrubbing parent inodes
this is inverted -- we allocated the transaction to do the scrub, and
now we're trying to grab the parent's iolock.  This can lead to ABBA
deadlocks: some thread grabbed the parent's iolock and is waiting for
space for a transaction while our parent scrubber is sitting on a
transaction trying to get the parent's iolock.

Therefore, convert all iolock attempts to use trylock; if that fails,
they can use the existing mechanisms to back off and try again.

The ABBA deadlock didn't happen with a non-repair scrub because the
transactions don't reserve any space, but repair scrubs require
reservation in order to update metadata.  However, any other concurrent
metadata update (e.g. directory create in the parent) could also induce
this deadlock with the parent scrubber.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: scrub the data fork of the realtime inodes
Darrick J. Wong [Mon, 14 May 2018 13:34:33 +0000 (06:34 -0700)] 
xfs: scrub the data fork of the realtime inodes

The realtime bitmap and summary inodes live on the metadata device, so
we can scrub their data forks with the regular scrubbers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: quota scrub should use bmapbtd scrubber
Darrick J. Wong [Mon, 14 May 2018 13:34:33 +0000 (06:34 -0700)] 
xfs: quota scrub should use bmapbtd scrubber

Replace the quota scrubber's open-coded data fork scrubber with a
redirected call to the bmapbtd scrubber.  This strengthens the quota
scrub to include all the cross-referencing that it does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: don't continue scrub if already corrupt
Darrick J. Wong [Mon, 14 May 2018 13:34:32 +0000 (06:34 -0700)] 
xfs: don't continue scrub if already corrupt

If we've already decided that something is corrupt, we might as well
abort all the loops and exit as quickly as possible.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: refactor quota limits initialization
Darrick J. Wong [Mon, 14 May 2018 13:34:32 +0000 (06:34 -0700)] 
xfs: refactor quota limits initialization

Replace all the if (!error) weirdness with helper functions that follow
our regular coding practices, and factor out the ternary expression soup.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: superblock scrub should use short-lived buffers
Darrick J. Wong [Mon, 14 May 2018 13:34:31 +0000 (06:34 -0700)] 
xfs: superblock scrub should use short-lived buffers

Secondary superblocks are rarely used, so create a helper to read a
given non-primary AG's superblock and ensure that it won't stick around
hogging memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: skip scrub xref if corruption already noted
Darrick J. Wong [Mon, 14 May 2018 13:34:31 +0000 (06:34 -0700)] 
xfs: skip scrub xref if corruption already noted

Don't bother looking for cross-referencing problems if the metadata is
already corrupt or we've already found a cross-referencing problem.
Since we added a helper function for flags testing, convert existing
users to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: clear sb->s_fs_info on mount failure
Dave Chinner [Fri, 11 May 2018 04:50:23 +0000 (21:50 -0700)] 
xfs: clear sb->s_fs_info on mount failure

We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.

Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.

With the superblock shrinker problem fixed at teh VFS level, this
stale s_fs_info pointer is still a problem - we use it
unconditionally in ->put_super when the superblock is being torn
down, and hence we can still trip over it after a ->fill_super
call failure. Hence we need to clear s_fs_info if
xfs-fs_fill_super() fails, and we need to check if it's valid in
the places it can potentially be dereferenced after a ->fill_super
failure.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: add mount delay debug option
Dave Chinner [Fri, 11 May 2018 04:50:23 +0000 (21:50 -0700)] 
xfs: add mount delay debug option

Similar to log_recovery_delay, this delay occurs between the VFS
superblock being initialised and the xfs_mount being fully
initialised. It also poisons the per-ag radix tree node so that it
can be used for triggering shrinker races during mount
such as the following:

<run memory pressure workload in background>

$ cat dirty-mount.sh
#! /bin/bash

umount -f /dev/pmem0
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
rm -f /mnt/test/foo
xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
umount /dev/pmem0

# let's crash it now!
echo 30 > /sys/fs/xfs/debug/mount_delay
mount /dev/pmem0 /mnt/test
echo 0 > /sys/fs/xfs/debug/mount_delay
umount /dev/pmem0
$ sudo ./dirty-mount.sh
.....
[   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
[   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
[   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
[   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
[   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
[   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
[   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
[   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
[   60.394519] Call Trace:
[   60.395252]  radix_tree_gang_lookup_tag+0xc4/0x130
[   60.395948]  xfs_perag_get_tag+0x37/0xf0
[   60.396522]  xfs_reclaim_inodes_count+0x32/0x40
[   60.397178]  xfs_fs_nr_cached_objects+0x11/0x20
[   60.397837]  super_cache_count+0x35/0xc0
[   60.399159]  shrink_slab.part.66+0xb1/0x370
[   60.400194]  shrink_node+0x7e/0x1a0
[   60.401058]  try_to_free_pages+0x199/0x470
[   60.402081]  __alloc_pages_slowpath+0x3a1/0xd20
[   60.403729]  __alloc_pages_nodemask+0x1c3/0x200
[   60.404941]  cache_grow_begin+0x20b/0x2e0
[   60.406164]  fallback_alloc+0x160/0x200
[   60.407088]  kmem_cache_alloc+0x111/0x4e0
[   60.408038]  ? xfs_buf_rele+0x61/0x430
[   60.408925]  kmem_zone_alloc+0x61/0xe0
[   60.409965]  xfs_inode_alloc+0x24/0x1d0
.....

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: factor out nodiscard helpers
Brian Foster [Thu, 10 May 2018 16:35:42 +0000 (09:35 -0700)] 
xfs: factor out nodiscard helpers

The changes to skip discards of speculative preallocation and
unwritten extents introduced several new wrapper functions through
the bunmapi -> extent free codepath to reduce churn in all of the
associated callers. In several cases, these wrappers simply toggle a
single flag to skip or not skip discards for the resulting blocks.

The explicit _nodiscard() wrappers for such an isolated set of
callers is a bit overkill. Kill off these wrappers and replace with
the calls to the underlying functions in the contexts that need to
control discard behavior. Retain the wrappers that preserve the
original calling conventions to serve the original purpose of
reducing code churn.

This is a refactoring patch and does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoiomap: add a swapfile activation function
Darrick J. Wong [Thu, 10 May 2018 15:38:15 +0000 (08:38 -0700)] 
iomap: add a swapfile activation function

Add a new iomap_swapfile_activate function so that filesystems can
activate swap files without having to use the obsolete and slow bmap
function.  This enables XFS to support fallocate'd swap files and
swap files on realtime devices.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
7 years agoxfs: halt auto-reclamation activities while rebuilding rmap
Darrick J. Wong [Wed, 9 May 2018 17:03:56 +0000 (10:03 -0700)] 
xfs: halt auto-reclamation activities while rebuilding rmap

Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
the filesystem, so we must stop background reclamation of post-EOF and
CoW prealloc blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt
Darrick J. Wong [Wed, 9 May 2018 17:02:32 +0000 (10:02 -0700)] 
xfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt

Add a new flag, XFS_BMAPI_NORMAP, which will perform file block
remapping without updating the rmapbt.  This will be used by the repair
code to reconstruct bmbts from the rmapbt, in which case we don't want
the rmapbt update.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: add repair helpers for the reference count btree
Darrick J. Wong [Wed, 9 May 2018 17:02:03 +0000 (10:02 -0700)] 
xfs: add repair helpers for the reference count btree

Add a couple of functions to the refcount btree and generic btree code
that will be used to repair the refcountbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: add repair helpers for the reverse mapping btree
Darrick J. Wong [Wed, 9 May 2018 17:02:02 +0000 (10:02 -0700)] 
xfs: add repair helpers for the reverse mapping btree

Add a couple of functions to the reverse mapping btree that will be used
to repair the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: expose various functions to repair code
Darrick J. Wong [Wed, 9 May 2018 17:02:02 +0000 (10:02 -0700)] 
xfs: expose various functions to repair code

Expose various helpers that the repair code will want to use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: add helpers to calculate btree size
Darrick J. Wong [Wed, 9 May 2018 17:02:01 +0000 (10:02 -0700)] 
xfs: add helpers to calculate btree size

Add a bunch of helper functions that calculate the sizes of various
btrees.  These will be used to repair btrees and btree headers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: refactor scrub transaction allocation function
Darrick J. Wong [Wed, 9 May 2018 17:02:01 +0000 (10:02 -0700)] 
xfs: refactor scrub transaction allocation function

Since the transaction allocation helper is about to become more complex,
move it to common.c and remove the redundant parameters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: btree scrub should check minrecs
Darrick J. Wong [Wed, 9 May 2018 17:02:00 +0000 (10:02 -0700)] 
xfs: btree scrub should check minrecs

Strengthen the btree block header checks to detect the number of records
being less than the btree type's minimum record count.  Certain blocks
are allowed to violate this constraint -- specifically any btree block
at the top of the tree can have fewer than minrecs records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: clean up scrub usage of KM_NOFS
Darrick J. Wong [Wed, 9 May 2018 17:02:00 +0000 (10:02 -0700)] 
xfs: clean up scrub usage of KM_NOFS

All scrub code runs in transaction context, which means that memory
allocations are automatically run in PF_MEMALLOC_NOFS context.  It's
therefore unnecessary to pass in KM_NOFS to allocation routines, so
clean them all out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: avoid ilock games in the quota scrubber
Darrick J. Wong [Wed, 9 May 2018 17:02:00 +0000 (10:02 -0700)] 
xfs: avoid ilock games in the quota scrubber

Refactor the quota scrubber to take the quotaofflock and grab the quota
inode in the setup function so that we can treat quota in the same
"scrub in the context of this inode" (i.e. sc->ip) manner as we treat
any other inode.  We do have to drop the quota inode's ILOCK_EXCL to use
dqiterate, but since dquots have their own individual locks the ILOCK
wasn't helping us anyway.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: refactor dquot iteration
Darrick J. Wong [Fri, 4 May 2018 22:31:21 +0000 (15:31 -0700)] 
xfs: refactor dquot iteration

Create a helper function to iterate all the dquots of a given type in
the system, and refactor the dquot scrub to use it.  This will get more
use in the quota repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: rename on-disk dquot counter zap functions
Darrick J. Wong [Fri, 4 May 2018 22:31:20 +0000 (15:31 -0700)] 
xfs: rename on-disk dquot counter zap functions

The function 'xfs_qm_dqiterate' doesn't iterate dquots at all, it
iterates all dquot blocks of a quota inode and clears the counters.
Therefore, change the name to something more descriptive so that we can
introduce a real dquot iterator later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: replace XFS_QMOPT_DQALLOC with a simple boolean
Darrick J. Wong [Fri, 4 May 2018 22:30:24 +0000 (15:30 -0700)] 
xfs: replace XFS_QMOPT_DQALLOC with a simple boolean

DQALLOC is only ever used with xfs_qm_dqget*, and the only flag that the
_dqget family of functions cares about is DQALLOC.  Therefore, change
it to a boolean 'can alloc?' flag for the dqget interfaces where that
makes sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: remove direct calls to _qm_dqread
Darrick J. Wong [Fri, 4 May 2018 22:30:23 +0000 (15:30 -0700)] 
xfs: remove direct calls to _qm_dqread

The quota initialization code needs an "uncached" variant of _dqget to
read in default quota limits and timers before the dquot cache is fully
set up.  We've already split up _dqget into its component pieces so
create a fourth variant to address this need, and make dqread internal
to xfs_dquot.c again.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: refactor xfs_qm_dqtobp and xfs_qm_dqalloc
Darrick J. Wong [Fri, 4 May 2018 22:30:23 +0000 (15:30 -0700)] 
xfs: refactor xfs_qm_dqtobp and xfs_qm_dqalloc

Separate the disk dquot read and allocation functionality into
two helper functions, then refactor dqread to call them directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: refactor incore dquot initialization functions
Darrick J. Wong [Fri, 4 May 2018 22:30:23 +0000 (15:30 -0700)] 
xfs: refactor incore dquot initialization functions

Create two incore dquot initialization functions that will help us to
disentangle dqget and dqread.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: fetch dquots directly during quotacheck
Darrick J. Wong [Fri, 4 May 2018 22:30:22 +0000 (15:30 -0700)] 
xfs: fetch dquots directly during quotacheck

Quotacheck only runs during mount, which means that there are no other
processes in the system that could be doing chown or chproj.  Therefore
there's no potential for racing to attach dquots to the inode so we can
drop all the ILOCK and race detection bits from quotacheck.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: split out dqget for inodes from regular dqget
Darrick J. Wong [Fri, 4 May 2018 22:30:22 +0000 (15:30 -0700)] 
xfs: split out dqget for inodes from regular dqget

There are two uses of dqget here -- one is to return the dquot for a
given type and id, and the other is to return the dquot for a given type
and inode.  Those are two separate things, so split them into two
smaller functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: remove unnecessary xfs_qm_dqattach parameter
Darrick J. Wong [Fri, 4 May 2018 22:30:21 +0000 (15:30 -0700)] 
xfs: remove unnecessary xfs_qm_dqattach parameter

The flags argument is always zero, get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: delegate dqget input checks to helper function
Darrick J. Wong [Fri, 4 May 2018 22:30:21 +0000 (15:30 -0700)] 
xfs: delegate dqget input checks to helper function

Move the dqget input checks to a separate function in preparation for
splitting up the dqget functionality.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: refactor dquot cache handling
Darrick J. Wong [Fri, 4 May 2018 22:30:20 +0000 (15:30 -0700)] 
xfs: refactor dquot cache handling

Delegate the dquot cache handling (radix tree lookup and insertion) to
separate helper functions so that we can continue to simplify the body
of dqget.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: refactor XFS_QMOPT_DQNEXT out of existence
Darrick J. Wong [Fri, 4 May 2018 22:30:20 +0000 (15:30 -0700)] 
xfs: refactor XFS_QMOPT_DQNEXT out of existence

There's only one caller of DQNEXT and its semantics can be moved into a
separate function, so create the function and get rid of the flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years agoxfs: don't spray logs when dquot flush/purge fail
Darrick J. Wong [Fri, 4 May 2018 22:30:20 +0000 (15:30 -0700)] 
xfs: don't spray logs when dquot flush/purge fail

When dquot flush or purge fail there's no need to spam the logs, we've
already logged the IO error or fs shutdown that caused the flush
failures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: release new dquot buffer on defer_finish error
Darrick J. Wong [Fri, 4 May 2018 22:30:19 +0000 (15:30 -0700)] 
xfs: release new dquot buffer on defer_finish error

In commit efa092f3d4c6 "[XFS] Fixes a bug in the quota code when
allocating a new dquot record", we allocate a new dquot block, grab a
buffer to initialize it, and return the locked initialized dquot buffer
to the caller for further in-core dquot initialization.  Unfortunately,
if the _bmap_finish errored out, _qm_dqalloc would also error out
without bothering to free the (locked) buffer.  Leaking a locked buffer
caused hangs in generic/388 when quotas are enabled.

Furthermore, the _bmap_finish -> _defer_finish conversion in
310a75a3c6c747 ("xfs: change xfs_bmap_{finish,cancel,init,free} ->
xfs_defer_*") failed to observe that the buffer was held going into
_defer_finish and therefore failed to notice that the buffer lock is
/not/ maintained afterwards.  Now that we can bjoin a buffer to a
defer_ops, use this mechanism to ensure that the buffer stays locked
across the _defer_finish.  Release the holds and locks on the buffer as
appropriate if we have to error out.

There is a subtlety here for the caller in that the buffer emerges
locked and held to the transaction, so if the _trans_commit fails we
have to release the buffer explicitly.  This fixes the unmount hang
in generic/388 when quotas are enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: don't discard on free of unwritten extents
Brian Foster [Wed, 9 May 2018 15:45:05 +0000 (08:45 -0700)] 
xfs: don't discard on free of unwritten extents

Unwritten extents by definition have not been written to until they
are converted to normal written extents. If unwritten extents are
freed from a file, it is therefore guaranteed that the blocks have
not been written to since allocation (note that zero range punches
and reallocates blocks).

To cut down on online discards generated from workloads that make
use of preallocation, skip discards of extents if they are in the
unwritten state when the extent is freed.

Note that this optimization does not apply to log recovery, during
which all freed extents are discarded if online discard is enabled.
Also note that it may be possible for a filesystem crash to occur
after write completion of an unwritten extent but before unwritten
conversion such that the extent remains unwritten after log
recovery. Since this pseudo-inconsistency may already be possible
after a crash (consider writing to recently allocated blocks where
the allocation transaction is lost after a crash), this change
shouldn't introduce any fundamental limitations that don't already
exist. In short, on storage stacks where discards are important,
it's good practice to run an occasional fstrim even with online
discard enabled in the filesystem, particularly after a crash.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: skip online discard during eofblocks trims
Brian Foster [Wed, 9 May 2018 15:45:04 +0000 (08:45 -0700)] 
xfs: skip online discard during eofblocks trims

We've had reports of online discard operations being sent from XFS
on write-only workloads. These discards occur as a result of
eofblocks trims that can occur after a large file copy completes.

These discards are slightly confusing for users who might be paying
close attention to online discards (i.e., vdo) due to performance
sensitivity. They also happen to be spurious because freed post-eof
blocks by definition have not been written to during the current
allocation cycle.

Update xfs_free_eofblocks() to skip discards that are purely
attributed to eofblocks trims. This cuts down the number of spurious
discards that may occur on write-only workloads due to normal
preallocation activity.

Note that discards of post-eof extents can still occur from other
codepaths that do not isolate handling of post-eof blocks from those
within eof. For example, file unlinks and truncates may still cause
discards for any file blocks affected by the operation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: add bmapi nodiscard flag
Brian Foster [Wed, 9 May 2018 15:45:04 +0000 (08:45 -0700)] 
xfs: add bmapi nodiscard flag

Freed extents are unconditionally discarded when online discard is
enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
discards when unnecessary. For example, this will be useful for
eofblocks trimming.

This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: get rid of the log item descriptor
Dave Chinner [Wed, 9 May 2018 14:49:37 +0000 (07:49 -0700)] 
xfs: get rid of the log item descriptor

It's just a connector between a transaction and a log item. There's
a 1:1 relationship between a log item descriptor and a log item,
and a 1:1 relationship between a log item descriptor and a
transaction. Both relationships are created and terminated at the
same time, so why do we even have the descriptor?

Replace it with a specific list_head in the log item and a new
log item dirtied flag to replace the XFS_LID_DIRTY flag.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix up deferred agfl intent finish_item use of LID_DIRTY]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: add some more debug checks to buffer log item reuse
Dave Chinner [Wed, 9 May 2018 14:49:10 +0000 (07:49 -0700)] 
xfs: add some more debug checks to buffer log item reuse

Just to make sure the item isn't associated with another
transaction when we try to reuse it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: fix double ijoin in xfs_reflink_clear_inode_flag()
Dave Chinner [Wed, 9 May 2018 14:49:10 +0000 (07:49 -0700)] 
xfs: fix double ijoin in xfs_reflink_clear_inode_flag()

xfs_reflink_clear_inode_flag double-joins an inode to a transaction,
which is not allowed.  Fix that and document that the caller must have
already joined it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edit out trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: fix double ijoin in xfs_reflink_cancel_cow_range
Dave Chinner [Wed, 9 May 2018 14:49:09 +0000 (07:49 -0700)] 
xfs: fix double ijoin in xfs_reflink_cancel_cow_range

xfs_reflink_cancel_cow_range joins an inode twice to the same
transaction.  This is not allowed, so fix it and document that the
callers of xfs_reflink_cancel_cow_blocks() must have already joined the
inode to the permanent transaction passed in.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edited the commit log to remove trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: fix double ijoin in xfs_inactive_symlink_rmt()
Dave Chinner [Wed, 9 May 2018 14:49:09 +0000 (07:49 -0700)] 
xfs: fix double ijoin in xfs_inactive_symlink_rmt()

xfs_inactive_symlink_rmt() does something nasty - it joins an inode
into a transaction it is already joined to. This means the inode can
have multiple log item descriptors attached to the transaction for
it. This breaks teh 1:1 mapping that is supposed to exist
between the log item and log item descriptor.

This results in the log item being processed twice during
transaction commit and CIL formatting, and there are lots of other
potential issues tha arise from double processing of log items in
the transaction commit state machine.

In this case, the inode is already held by the rolling transaction
returned from xfs_defer_finish(), so there's no need to join it
again.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: don't assert fail with AIL lock held
Dave Chinner [Wed, 9 May 2018 14:49:09 +0000 (07:49 -0700)] 
xfs: don't assert fail with AIL lock held

Been hitting AIL ordering assert failures recently, but been unable
to trace them down because the system immediately hangs up onteh
spinlock that was held when this assert fires:

XFS: Assertion failed: XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) <= 0, file: fs/xfs/xfs_trans_ail.c, line: 52

Move the assertions outside of the spinlock so the corpse can
be dissected. Thanks to Brian Foster for supplying a clean
way of doing this.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: adder caller IP to xfs_defer* tracepoints
Dave Chinner [Wed, 9 May 2018 14:48:52 +0000 (07:48 -0700)] 
xfs: adder caller IP to xfs_defer* tracepoints

So it's clear in the trace where they are being called from.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: add tracing to high level transaction operations
Dave Chinner [Wed, 9 May 2018 14:47:57 +0000 (07:47 -0700)] 
xfs: add tracing to high level transaction operations

Because currently we have no idea what the transaction context we
are operating in is, and I need to know that information to track
down bugs in multiple log item joins to transactions.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: log item flags are racy
Dave Chinner [Wed, 9 May 2018 14:47:34 +0000 (07:47 -0700)] 
xfs: log item flags are racy

The log item flags contain a field that is protected by the AIL
lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to
set and clear these flags, but most of the updates and checks are
not done with the AIL lock held and so are susceptible to update
races.

Fix this by changing the log item flags to use atomic bitops rather
than be reliant on the AIL lock for update serialisation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: add missing rmap error return
Darrick J. Wong [Fri, 4 May 2018 22:31:21 +0000 (15:31 -0700)] 
xfs: add missing rmap error return

xfs_rmap_lookup_le_range can return errors, so we need to check for
them and bail out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: bmap debugging should never panic the system
Darrick J. Wong [Fri, 4 May 2018 22:31:21 +0000 (15:31 -0700)] 
xfs: bmap debugging should never panic the system

Don't panic() the system if the bmap records are garbage, just call
ASSERT which gives us the same backtrace but enables developers to
control if the system goes down or not.  This makes debugging with
generic/388 much easier because it won't reboot the machine midway
through a run just because btree_read_bufl returns EIO when the fs has
already shut down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: defer agfl frees from directory op transactions
Brian Foster [Tue, 8 May 2018 00:38:48 +0000 (17:38 -0700)] 
xfs: defer agfl frees from directory op transactions

Directory operations can perform block allocations as entries are
added/removed from directories. Defer AGFL block frees from the
remaining directory operation transactions. This covers the hard
link, remove and rename operations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: defer frees from common inode allocation paths
Brian Foster [Tue, 8 May 2018 00:38:48 +0000 (17:38 -0700)] 
xfs: defer frees from common inode allocation paths

Inode allocation can require block allocation for physical inode
chunk allocation, inode btree record insertion, and/or directory
block allocation for entry insertion. Any of these block allocation
requests can require AGFL fixups prior to the actual allocation.
Update the common file creation transacions to defer AGFL frees from
these contexts to avoid too much log reservation consumption
per-transaction.

Since these transactions are already passed down through the btree
cursors and da_args structure, this simply requires to attach dfops
to the transaction. Note that this covers tr_create, tr_mkdir and
tr_symlink. Other transactions such as tr_create_tmpfile do not
already make use of deferred operations and so are left alone for
the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: defer agfl frees from inode inactivation
Brian Foster [Tue, 8 May 2018 00:38:48 +0000 (17:38 -0700)] 
xfs: defer agfl frees from inode inactivation

XFS inode chunks are already freed via deferred operations (which
now also defer AGFL block frees), but inode btree blocks are freed
directly in the associated context. This has been known to lead to
log reservation overruns in particular workloads where an inobt
block free may require several AGFL block frees (and thus several
allocation btree modifications) before the inobt block itself is
actually freed.

To avoid this problem, defer the frees of any AGFL blocks before the
inobt block free takes place. This requires passing the dfops from
xfs_inactive_ifree() down through the inobt ->[alloc|free]_block()
callouts, which essentially only requires to attach the dfops to the
transaction since it is already carried all the way through to the
inobt update and allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: defer agfl block frees from deferred ops processing context
Brian Foster [Tue, 8 May 2018 00:38:47 +0000 (17:38 -0700)] 
xfs: defer agfl block frees from deferred ops processing context

Now that AGFL block frees are deferred when dfops is set in the
transaction, start deferring AGFL block frees from contexts that are
known to push the limits of existing log reservations.

The first such context is deferred operation processing itself. This
primarily targets deferred extent frees (such as file extents and
inode chunks), but in doing so covers all allocation operations that
occur in deferred operation processing context.

Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the
processing sequence. This means that any AGFL block frees due to
allocation events result in the addition of new EFIs to the dfops
rather than being processed immediately. xfs_defer_finish() rolls
the transaction at least once more to process the frees of the AGFL
blocks back to the allocation btrees and returns once the AGFL is
rectified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: defer agfl block frees when dfops is available
Brian Foster [Tue, 8 May 2018 00:38:47 +0000 (17:38 -0700)] 
xfs: defer agfl block frees when dfops is available

The AGFL fixup code executes before every block allocation/free and
rectifies the AGFL based on the current, dynamic allocation
requirements of the fs. The AGFL must hold a minimum number of
blocks to satisfy a worst case split of the free space btrees caused
by the impending allocation operation. The AGFL is also updated to
maintain the implicit requirement for a minimum number of free slots
to satisfy a worst case join of the free space btrees.

Since the AGFL caches individual blocks, AGFL reduction typically
involves multiple, single block frees. We've had reports of
transaction overrun problems during certain workloads that boil down
to AGFL reduction freeing multiple blocks and consuming more space
in the log than was reserved for the transaction.

Since the objective of freeing AGFL blocks is to ensure free AGFL
free slots are available for the upcoming allocation, one way to
address this problem is to release surplus blocks from the AGFL
immediately but defer the free of those blocks (similar to how
file-mapped blocks are unmapped from the file in one transaction and
freed via a deferred operation) until the transaction is rolled.
This turns AGFL reduction into an operation with predictable log
reservation consumption.

Add the capability to defer AGFL block frees when a deferred ops
list is available to the AGFL fixup code. Add a dfops pointer to the
transaction to carry dfops through various contexts to the allocator
context. Deferring AGFL frees is  conditional behavior based on
whether the transaction pointer is populated. The long term
objective is to reuse the transaction pointer to clean up all
unrelated callchains that pass dfops on the stack along with a
transaction and in doing so, consistently defer AGFL blocks from the
allocator.

A bit of customization is required to handle deferred completion
processing because AGFL blocks are accounted against a per-ag
reservation pool and AGFL blocks are not inserted into the extent
busy list when freed (they are inserted when used and released back
to the AGFL). Reuse the majority of the existing deferred extent
free infrastructure and customize it appropriately to handle AGFL
blocks.

Note that this patch only adds infrastructure. It does not change
behavior because no callers have been updated to pass ->t_agfl_dfops
into the allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: create agfl block free helper function
Brian Foster [Tue, 8 May 2018 00:38:46 +0000 (17:38 -0700)] 
xfs: create agfl block free helper function

Refactor the AGFL block free code into a new helper such that it can
be invoked from deferred context. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: print specific dqblk that failed verifiers
Eric Sandeen [Mon, 7 May 2018 16:20:47 +0000 (09:20 -0700)] 
xfs: print specific dqblk that failed verifiers

Rather than printing the top of the buffer that held a corrupted dqblk,
restructure things to print out the specific one that failed by pushing
the calls to the verifier_error function down into the verifier which
iterates over the buffer and detects the error.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: add full xfs_dqblk verifier
Eric Sandeen [Mon, 7 May 2018 16:20:18 +0000 (09:20 -0700)] 
xfs: add full xfs_dqblk verifier

Add an xfs_dqblk verifier so that it can check the uuid on V5 filesystems;
it calls the existing xfs_dquot_verify verifier to validate the
xfs_disk_dquot_t contained inside it.  This lets us move the uuid
verification out of the crc verifier, which makes little sense.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: pass full xfs_dqblk to repair during quotacheck
Eric Sandeen [Mon, 7 May 2018 16:20:17 +0000 (09:20 -0700)] 
xfs: pass full xfs_dqblk to repair during quotacheck

It's a bit dicey to pass in the smaller xfs_disk_dquot and then cast it to
something larger; pass in the full xfs_dqblk so we know the caller has sent
us the right thing.  Rename the function to xfs_dqblk_repair for
clarity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: check type in quota verifier during quotacheck
Eric Sandeen [Mon, 7 May 2018 16:20:17 +0000 (09:20 -0700)] 
xfs: check type in quota verifier during quotacheck

During quotacheck we send in the quota type, so verify that as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: remove unused flags arg from xfs_dquot_verify
Eric Sandeen [Fri, 4 May 2018 22:15:48 +0000 (15:15 -0700)] 
xfs: remove unused flags arg from xfs_dquot_verify

Long ago the flags argument was used to determine whether to issue warnings
about corruptions, but that's done elsewhere now and the flag is unused
here, so remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: clean up locking in xfs_file_iomap_begin
Dave Chinner [Wed, 2 May 2018 19:54:54 +0000 (12:54 -0700)] 
xfs: clean up locking in xfs_file_iomap_begin

Rather than checking what kind of locking is needed in a helper
function and then jumping through hoops to do the locking in line,
move the locking to the helper function that does all the checks
and rename it to xfs_ilock_for_iomap().

This also allows us to hoist all the nonblocking checks up into the
locking helper, further simplifier the code flow in
xfs_file_iomap_begin() and making it easier to understand.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: simplify xfs_file_iomap_begin() logic
Dave Chinner [Wed, 2 May 2018 19:54:53 +0000 (12:54 -0700)] 
xfs: simplify xfs_file_iomap_begin() logic

The current logic that determines whether allocation should be done
has grown somewhat spaghetti like with the addition of IOMAP_NOWAIT
functionality. Separate out each of the different cases into single,
obvious checks to get rid most of the nested IOMAP_NOWAIT checks
in the allocation logic.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoiomap: Use FUA for pure data O_DSYNC DIO writes
Dave Chinner [Wed, 2 May 2018 19:54:53 +0000 (12:54 -0700)] 
iomap: Use FUA for pure data O_DSYNC DIO writes

If we are doing direct IO writes with datasync semantics, we often
have to flush metadata changes along with the data write. However,
if we are overwriting existing data, there are no metadata changes
that we need to flush. In this case, optimising the IO by using
FUA write makes sense.

We know from the IOMAP_F_DIRTY flag as to whether a specific inode
requires a metadata flush - this is currently used by DAX to ensure
extent modification as stable in page fault operations. For direct
IO writes, we can use it to determine if we need to flush metadata
or not once the data is on disk.

Hence if we have been returned a mapped extent that is not new and
the IO mapping is not dirty, then we can use a FUA write to provide
datasync semantics. This allows us to short-cut the
generic_write_sync() call in IO completion and hence avoid
unnecessary operations. This makes pure direct IO data write
behaviour identical to the way block devices use REQ_FUA to provide
datasync semantics.

On a FUA enabled device, a synchronous direct IO write workload
(sequential 4k overwrites in 32MB file) had the following results:

# xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo

kernel time write()s write iops Write b/w
------ ---- -------- ---------- ---------
(no dsync)  4s 2173/s 2173 8.5MB/s
vanilla 22s  370/s  750 1.4MB/s
patched 19s  420/s  420 1.6MB/s

The patched code clearly doesn't send cache flushes anymore, but
instead uses FUA (confirmed via blktrace), and performance improves
a bit as a result. However, the benefits will be higher on workloads
that mix O_DSYNC overwrites with other write IO as we won't be
flushing the entire device cache on every DSYNC overwrite IO
anymore.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoiomap: iomap_dio_rw() handles all sync writes
Dave Chinner [Wed, 2 May 2018 19:54:52 +0000 (12:54 -0700)] 
iomap: iomap_dio_rw() handles all sync writes

Currently iomap_dio_rw() only handles (data)sync write completions
for AIO. This means we can't optimised non-AIO IO to minimise device
flushes as we can't tell the caller whether a flush is required or
not.

To solve this problem and enable further optimisations, make
iomap_dio_rw responsible for data sync behaviour for all IO, not
just AIO.

In doing so, the sync operation is now accounted as part of the DIO
IO by inode_dio_end(), hence post-IO data stability updates will no
long race against operations that serialise via inode_dio_wait()
such as truncate or hole punch.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: move generic_write_sync calls inwards
Dave Chinner [Wed, 2 May 2018 19:54:52 +0000 (12:54 -0700)] 
xfs: move generic_write_sync calls inwards

To prepare for iomap iinfrastructure based DSYNC optimisations.

While moving the code araound, move the XFS write bytes metric
update for direct IO into xfs_dio_write_end_io callback so that we
always capture the amount of data written via AIO+DIO. This fixes
the problem where queued AIO+DIO writes are not accounted to this
metric.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: don't retry xfs_buf_find on XBF_TRYLOCK failure
Dave Chinner [Wed, 18 Apr 2018 15:25:21 +0000 (08:25 -0700)] 
xfs: don't retry xfs_buf_find on XBF_TRYLOCK failure

When looking at an event trace recently, I noticed that non-blocking
buffer lookup attempts would fail on cached locked buffers and then
run the slow cache-miss path. This means we are doing an xfs_buf
allocation, lookup and free unnecessarily every time we avoid
blocking on a locked buffer.

Fix this by changing _xfs_buf_find() to return an error status to
the caller to indicate that we failed the lock attempt rather than
just returning a NULL. This allows the higher level code to
discriminate between a cache miss and an cache hit that we failed to
lock.

This also allows us to return a -EFSCORRUPTED state if we are asked
to look up a block number outside the range of the filesystem in
_xfs_buf_find(), which moves us one step closer to being able to
handle such errors in a more graceful manner at the higher levels.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: make xfs_buf_incore out of line
Dave Chinner [Wed, 18 Apr 2018 15:25:20 +0000 (08:25 -0700)] 
xfs: make xfs_buf_incore out of line

Move xfs_buf_incore out of line and make it the only way to look up
a buffer in the buffer cache from outside the buffer cache. Convert
the external users of _xfs_buf_find() to xfs_buf_incore() and make
_xfs_buf_find() static.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: actually rename xfs_incore -> xfs_buf_incore]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: trace ATTR flags in xattr tracepoints
Eric Sandeen [Wed, 18 Apr 2018 00:17:35 +0000 (17:17 -0700)] 
xfs: trace ATTR flags in xattr tracepoints

This will trace i.e. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE
flags as well as the OP_FLAGS.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: validate allocated inode number
Dave Chinner [Wed, 18 Apr 2018 00:17:35 +0000 (17:17 -0700)] 
xfs: validate allocated inode number

When we have corrupted free inode btrees, we can attempt to
allocate inodes that we know are already allocated. Catch allocation
of these inodes and report corruption as early as possible to
prevent corruption propagation or deadlocks.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: validate cached inodes are free when allocated
Dave Chinner [Wed, 18 Apr 2018 00:17:34 +0000 (17:17 -0700)] 
xfs: validate cached inodes are free when allocated

A recent fuzzed filesystem image cached random dcache corruption
when the reproducer was run. This often showed up as panics in
lookup_slow() on a null inode->i_ops pointer when doing pathwalks.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
....
Call Trace:
 lookup_slow+0x44/0x60
 walk_component+0x3dd/0x9f0
 link_path_walk+0x4a7/0x830
 path_lookupat+0xc1/0x470
 filename_lookup+0x129/0x270
 user_path_at_empty+0x36/0x40
 path_listxattr+0x98/0x110
 SyS_listxattr+0x13/0x20
 do_syscall_64+0xf5/0x280
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

but had many different failure modes including deadlocks trying to
lock the inode that was just allocated or KASAN reports of
use-after-free violations.

The cause of the problem was a corrupt INOBT on a v4 fs where the
root inode was marked as free in the inobt record. Hence when we
allocated an inode, it chose the root inode to allocate, found it in
the cache and re-initialised it.

We recently fixed a similar inode allocation issue caused by inobt
record corruption problem in xfs_iget_cache_miss() in commit
ee457001ed6c ("xfs: catch inode allocation state mismatch
corruption"). This change adds similar checks to the cache-hit path
to catch it, and turns the reproducer into a corruption shutdown
situation.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoLinux 4.17-rc4 v4.17-rc4
Linus Torvalds [Mon, 7 May 2018 02:57:38 +0000 (16:57 -1000)] 
Linux 4.17-rc4

7 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 6 May 2018 15:46:29 +0000 (05:46 -1000)] 
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pll KVM fixes from Radim Krčmář:
 "ARM:
   - Fix proxying of GICv2 CPU interface accesses
   - Fix crash when switching to BE
   - Track source vcpu git GICv2 SGIs
   - Fix an outdated bit of documentation

  x86:
   - Speed up injection of expired timers (for stable)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: remove APIC Timer periodic/oneshot spikes
  arm64: vgic-v2: Fix proxying of cpuif access
  KVM: arm/arm64: vgic_init: Cleanup reference to process_maintenance
  KVM: arm64: Fix order of vcpu_write_sys_reg() arguments
  KVM: arm/arm64: vgic: Fix source vcpu issues for GICv2 SGI

7 years agoMerge tag 'iommu-fixes-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 6 May 2018 15:42:24 +0000 (05:42 -1000)] 
Merge tag 'iommu-fixes-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:

 - fix a compile warning in the AMD IOMMU driver with irq remapping
   disabled

 - fix for VT-d interrupt remapping and invalidation size (caused a
   BUG_ON when trying to invalidate more than 4GB)

 - build fix and a regression fix for broken graphics with old DTS for
   the rockchip iommu driver

 - a revert in the PCI window reservation code which fixes a regression
   with VFIO.

* tag 'iommu-fixes-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu: rockchip: fix building without CONFIG_OF
  iommu/vt-d: Use WARN_ON_ONCE instead of BUG_ON in qi_flush_dev_iotlb()
  iommu/vt-d: fix shift-out-of-bounds in bug checking
  iommu/dma: Move PCI window region reservation back into dma specific path.
  iommu/rockchip: Make clock handling optional
  iommu/amd: Hide unused iommu_table_lock
  iommu/vt-d: Fix usage of force parameter in intel_ir_reconfigure_irte()