pci-pm-add-needs_resume-flag-to-avoid-suspend-complete-optimization.patch
xfs-fix-missed-holes-in-seek_hole-implementation.patch
xfs-fix-off-by-one-on-max-nr_pages-in-xfs_find_get_desired_pgoff.patch
+xfs-fix-over-copying-of-getbmap-parameters-from-userspace.patch
+xfs-handle-array-index-overrun-in-xfs_dir2_leaf_readbuf.patch
+xfs-prevent-multi-fsb-dir-readahead-from-reading-random-blocks.patch
+xfs-fix-up-quotacheck-buffer-list-error-handling.patch
+xfs-support-ability-to-wait-on-new-inodes.patch
+xfs-update-ag-iterator-to-support-wait-on-new-inodes.patch
+xfs-wait-on-new-inodes-during-quotaoff-dquot-release.patch
+xfs-fix-indlen-accounting-error-on-partial-delalloc-conversion.patch
+xfs-bad-assertion-for-delalloc-an-extent-that-start-at-i_size.patch
+xfs-fix-unaligned-access-in-xfs_btree_visit_blocks.patch
--- /dev/null
+From 892d2a5f705723b2cb488bfb38bcbdcf83273184 Mon Sep 17 00:00:00 2001
+From: Zorro Lang <zlang@redhat.com>
+Date: Mon, 15 May 2017 08:40:02 -0700
+Subject: xfs: bad assertion for delalloc an extent that start at i_size
+
+From: Zorro Lang <zlang@redhat.com>
+
+commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 upstream.
+
+By running fsstress for long enough on RHEL-7, I hit an assertion
+failure (it is harder to reproduce on linux-4.11, but the problem
+is still there):
+
+ XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c
+
+The assertion is in the xfs_getbmap() function:
+
+ if (map[i].br_startblock == DELAYSTARTBLOCK &&
+--> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
+ ASSERT((iflags & BMV_IF_DELALLOC) != 0);
+
+When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
+startoff is exactly at EOF. But the assertion only needs to cover
+delalloc extents that start within EOF, not at EOF itself.
+
+Signed-off-by: Zorro Lang <zlang@redhat.com>
+Reviewed-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_bmap_util.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/fs/xfs/xfs_bmap_util.c
++++ b/fs/xfs/xfs_bmap_util.c
+@@ -682,7 +682,7 @@ xfs_getbmap(
+ * extents.
+ */
+ if (map[i].br_startblock == DELAYSTARTBLOCK &&
+- map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
++ map[i].br_startoff < XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
+ ASSERT((iflags & BMV_IF_DELALLOC) != 0);
+
+ if (map[i].br_startblock == HOLESTARTBLOCK &&
--- /dev/null
+From 0daaecacb83bc6b656a56393ab77a31c28139bc7 Mon Sep 17 00:00:00 2001
+From: Brian Foster <bfoster@redhat.com>
+Date: Fri, 12 May 2017 10:44:08 -0700
+Subject: xfs: fix indlen accounting error on partial delalloc conversion
+
+From: Brian Foster <bfoster@redhat.com>
+
+commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 upstream.
+
+The delalloc -> real block conversion path uses an incorrect
+calculation in the case where the middle part of a delalloc extent
+is being converted. This is documented as a rare situation because
+XFS generally attempts to maximize contiguity by converting as much
+of a delalloc extent as possible.
+
+If this situation does occur, the indlen reservation for the two new
+delalloc extents left behind by the conversion of the middle range
+is calculated and compared with the original reservation. If more
+blocks are required, the delta is allocated from the global block
+pool. This delta value can be characterized as the difference
+between the new total requirement (temp + temp2) and the currently
+available reservation minus those blocks that have already been
+allocated (startblockval(PREV.br_startblock) - allocated).
+
+The problem is that the current code does not account for previously
+allocated blocks correctly. It subtracts the current allocation
+count from the (new - old) delta rather than the old indlen
+reservation. This means that more indlen blocks than have been
+allocated end up stashed in the remaining extents and free space
+accounting is broken as a result.
+
+Fix up the calculation to subtract the allocated block count from
+the original extent indlen and thus correctly allocate the
+reservation delta based on the difference between the new total
+requirement and the unused blocks from the original reservation.
+Also remove a bogus assert that contradicts the fact that the new
+indlen reservation can be larger than the original indlen
+reservation.
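+
+To make the arithmetic concrete, a worked example with hypothetical
+numbers: suppose the two new extents need temp + temp2 = 10 indlen
+blocks, the original reservation was
+startblockval(PREV.br_startblock) = 8, and allocated = 3. The old
+calculation,
+
+ 	diff = 10 - 8 - 3 = -1
+
+allocates nothing, yet only 8 - 3 = 5 unused blocks remain from the
+original reservation to back the 10 blocks now stashed in the
+remaining extents. The fixed calculation,
+
+ 	diff = 10 - (8 - 3) = 5
+
+allocates exactly the shortfall.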
+
+Signed-off-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/libxfs/xfs_bmap.c | 7 ++++---
+ 1 file changed, 4 insertions(+), 3 deletions(-)
+
+--- a/fs/xfs/libxfs/xfs_bmap.c
++++ b/fs/xfs/libxfs/xfs_bmap.c
+@@ -2179,8 +2179,10 @@ xfs_bmap_add_extent_delay_real(
+ }
+ temp = xfs_bmap_worst_indlen(bma->ip, temp);
+ temp2 = xfs_bmap_worst_indlen(bma->ip, temp2);
+- diff = (int)(temp + temp2 - startblockval(PREV.br_startblock) -
+- (bma->cur ? bma->cur->bc_private.b.allocated : 0));
++ diff = (int)(temp + temp2 -
++ (startblockval(PREV.br_startblock) -
++ (bma->cur ?
++ bma->cur->bc_private.b.allocated : 0)));
+ if (diff > 0) {
+ error = xfs_mod_fdblocks(bma->ip->i_mount,
+ -((int64_t)diff), false);
+@@ -2232,7 +2234,6 @@ xfs_bmap_add_extent_delay_real(
+ temp = da_new;
+ if (bma->cur)
+ temp += bma->cur->bc_private.b.allocated;
+- ASSERT(temp <= da_old);
+ if (temp < da_old)
+ xfs_mod_fdblocks(bma->ip->i_mount,
+ (int64_t)(da_old - temp), false);
--- /dev/null
+From be6324c00c4d1e0e665f03ed1fc18863a88da119 Mon Sep 17 00:00:00 2001
+From: "Darrick J. Wong" <darrick.wong@oracle.com>
+Date: Mon, 3 Apr 2017 15:17:57 -0700
+Subject: xfs: fix over-copying of getbmap parameters from userspace
+
+From: Darrick J. Wong <darrick.wong@oracle.com>
+
+commit be6324c00c4d1e0e665f03ed1fc18863a88da119 upstream.
+
+In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
+from userspace, or else we end up copying random stack contents into the
+kernel. struct getbmap is a strict subset of getbmapx, so a partial
+structure copy should work fine.
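+
+For reference, a sketch of the layout involved (field list as
+defined in the XFS userspace API header of this era):
+
+ 	struct getbmapx {
+ 		__s64	bmv_offset;	/* these five fields form */
+ 		__s64	bmv_block;	/* struct getbmap; copying */
+ 		__s64	bmv_length;	/* offsetof(struct getbmapx, */
+ 		__s32	bmv_count;	/* bmv_iflags) bytes copies */
+ 		__s32	bmv_entries;	/* exactly this prefix */
+ 		__s32	bmv_iflags;	/* getbmapx-only from here; */
+ 		__s32	bmv_oflags;	/* left zeroed by the */
+ 		__s32	bmv_unused1;	/* " = { 0 }" initializer */
+ 		__s32	bmv_unused2;
+ 	};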
+
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Reviewed-by: Christoph Hellwig <hch@lst.de>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_ioctl.c | 5 +++--
+ 1 file changed, 3 insertions(+), 2 deletions(-)
+
+--- a/fs/xfs/xfs_ioctl.c
++++ b/fs/xfs/xfs_ioctl.c
+@@ -1379,10 +1379,11 @@ xfs_ioc_getbmap(
+ unsigned int cmd,
+ void __user *arg)
+ {
+- struct getbmapx bmx;
++ struct getbmapx bmx = { 0 };
+ int error;
+
+- if (copy_from_user(&bmx, arg, sizeof(struct getbmapx)))
++ /* struct getbmap is a strict subset of struct getbmapx. */
++ if (copy_from_user(&bmx, arg, offsetof(struct getbmapx, bmv_iflags)))
+ return -EFAULT;
+
+ if (bmx.bmv_count < 2)
--- /dev/null
+From a4d768e702de224cc85e0c8eac9311763403b368 Mon Sep 17 00:00:00 2001
+From: Eric Sandeen <sandeen@sandeen.net>
+Date: Mon, 22 May 2017 19:54:10 -0700
+Subject: xfs: fix unaligned access in xfs_btree_visit_blocks
+
+From: Eric Sandeen <sandeen@sandeen.net>
+
+commit a4d768e702de224cc85e0c8eac9311763403b368 upstream.
+
+This structure copy was throwing unaligned access warnings on sparc64:
+
+Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]
+
+xfs_btree_copy_ptrs does a memcpy, which avoids it.
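+
+A minimal sketch of the failure mode (generic C, not the XFS code;
+the union mirrors union xfs_btree_ptr):
+
+ 	union ptr64 {
+ 		__be64	l;
+ 		__be32	s;
+ 	};
+
+ 	void save(union ptr64 *dst, const union ptr64 *src)
+ 	{
+ 		/*
+ 		 * *dst = *src; -- the compiler may assume natural
+ 		 * (8-byte) alignment and emit aligned load/store
+ 		 * instructions, which trap on sparc64 when src
+ 		 * points into a buffer at a smaller alignment.
+ 		 */
+ 		memcpy(dst, src, sizeof(*dst));	/* no such assumption */
+ 	}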
+
+Signed-off-by: Eric Sandeen <sandeen@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/libxfs/xfs_btree.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/fs/xfs/libxfs/xfs_btree.c
++++ b/fs/xfs/libxfs/xfs_btree.c
+@@ -4064,7 +4064,7 @@ xfs_btree_change_owner(
+ xfs_btree_readahead_ptr(cur, ptr, 1);
+
+ /* save for the next iteration of the loop */
+- lptr = *ptr;
++ xfs_btree_copy_ptrs(cur, &lptr, ptr, 1);
+ }
+
+ /* for each buffer in the level */
--- /dev/null
+From 20e8a063786050083fe05b4f45be338c60b49126 Mon Sep 17 00:00:00 2001
+From: Brian Foster <bfoster@redhat.com>
+Date: Fri, 21 Apr 2017 12:40:44 -0700
+Subject: xfs: fix up quotacheck buffer list error handling
+
+From: Brian Foster <bfoster@redhat.com>
+
+commit 20e8a063786050083fe05b4f45be338c60b49126 upstream.
+
+The quotacheck error handling of the delwri buffer list assumes the
+resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
+on the buffers that are dequeued. This can lead to assert failures
+on buffer release and possibly other locking problems.
+
+Move this code to a delwri queue cancel helper function to
+encapsulate the logic required to properly release buffers from a
+delwri queue. Update the helper to clear the delwri queue flag and
+call it from quotacheck.
+
+Signed-off-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_buf.c | 24 ++++++++++++++++++++++++
+ fs/xfs/xfs_buf.h | 1 +
+ fs/xfs/xfs_qm.c | 7 +------
+ 3 files changed, 26 insertions(+), 6 deletions(-)
+
+--- a/fs/xfs/xfs_buf.c
++++ b/fs/xfs/xfs_buf.c
+@@ -979,6 +979,8 @@ void
+ xfs_buf_unlock(
+ struct xfs_buf *bp)
+ {
++ ASSERT(xfs_buf_islocked(bp));
++
+ XB_CLEAR_OWNER(bp);
+ up(&bp->b_sema);
+
+@@ -1713,6 +1715,28 @@ error:
+ }
+
+ /*
++ * Cancel a delayed write list.
++ *
++ * Remove each buffer from the list, clear the delwri queue flag and drop the
++ * associated buffer reference.
++ */
++void
++xfs_buf_delwri_cancel(
++ struct list_head *list)
++{
++ struct xfs_buf *bp;
++
++ while (!list_empty(list)) {
++ bp = list_first_entry(list, struct xfs_buf, b_list);
++
++ xfs_buf_lock(bp);
++ bp->b_flags &= ~_XBF_DELWRI_Q;
++ list_del_init(&bp->b_list);
++ xfs_buf_relse(bp);
++ }
++}
++
++/*
+ * Add a buffer to the delayed write list.
+ *
+ * This queues a buffer for writeout if it hasn't already been. Note that
+--- a/fs/xfs/xfs_buf.h
++++ b/fs/xfs/xfs_buf.h
+@@ -304,6 +304,7 @@ extern void xfs_buf_iomove(xfs_buf_t *,
+ extern void *xfs_buf_offset(struct xfs_buf *, size_t);
+
+ /* Delayed Write Buffer Routines */
++extern void xfs_buf_delwri_cancel(struct list_head *);
+ extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+ extern int xfs_buf_delwri_submit(struct list_head *);
+ extern int xfs_buf_delwri_submit_nowait(struct list_head *);
+--- a/fs/xfs/xfs_qm.c
++++ b/fs/xfs/xfs_qm.c
+@@ -1355,12 +1355,7 @@ xfs_qm_quotacheck(
+ mp->m_qflags |= flags;
+
+ error_return:
+- while (!list_empty(&buffer_list)) {
+- struct xfs_buf *bp =
+- list_first_entry(&buffer_list, struct xfs_buf, b_list);
+- list_del_init(&bp->b_list);
+- xfs_buf_relse(bp);
+- }
++ xfs_buf_delwri_cancel(&buffer_list);
+
+ if (error) {
+ xfs_warn(mp,
--- /dev/null
+From 023cc840b40fad95c6fe26fff1d380a8c9d45939 Mon Sep 17 00:00:00 2001
+From: Eric Sandeen <sandeen@redhat.com>
+Date: Thu, 13 Apr 2017 15:15:47 -0700
+Subject: xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
+
+From: Eric Sandeen <sandeen@redhat.com>
+
+commit 023cc840b40fad95c6fe26fff1d380a8c9d45939 upstream.
+
+Carlos had a case where "find" seemed to start spinning
+forever and never return.
+
+This was on a filesystem with non-default multi-fsb (8k)
+directory blocks, and a fragmented directory with extents
+like this:
+
+0:[0,133646,2,0]
+1:[2,195888,1,0]
+2:[3,195890,1,0]
+3:[4,195892,1,0]
+4:[5,195894,1,0]
+5:[6,195896,1,0]
+6:[7,195898,1,0]
+7:[8,195900,1,0]
+8:[9,195902,1,0]
+9:[10,195908,1,0]
+10:[11,195910,1,0]
+11:[12,195912,1,0]
+12:[13,195914,1,0]
+...
+
+i.e. the first extent is a contiguous 2-fsb dir block, but
+after that it is fragmented into 1 block extents.
+
+At the top of the readdir path, we allocate a mapping array
+which (for this filesystem geometry) can hold 10 extents; see
+the assignment to map_info->map_size. During readdir, we are
+therefore able to map extents 0 through 9 above into the array
+for readahead purposes. If we count by 2, we see that the last
+mapped index (9) is the first block of a 2-fsb directory block.
+
+At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
+more readahead; the outer loop assumes one full dir block is
+processed each loop iteration, and an inner loop that ensures
+that this is so by advancing to the next extent until a full
+directory block is mapped.
+
+The problem is that this inner loop may step past the last
+extent in the mapping array as it tries to reach the end of
+the directory block. This will read garbage for the extent
+length, and as a result the loop control variable 'j' may
+become corrupted and never fail the loop conditional.
+
+The number of valid mappings we have in our array is stored
+in mip->map_valid, so stop this inner loop based on that limit.
+
+There is an ASSERT at the top of the outer loop for this
+same condition, but we never made it out of the inner loop,
+so the ASSERT never fired.
+
+Huge appreciation to Carlos for debugging and isolating
+the problem.
+
+Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
+Signed-off-by: Eric Sandeen <sandeen@redhat.com>
+Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
+Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
+Reviewed-by: Bill O'Donnell <billodo@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_dir2_readdir.c | 10 ++++++++--
+ 1 file changed, 8 insertions(+), 2 deletions(-)
+
+--- a/fs/xfs/xfs_dir2_readdir.c
++++ b/fs/xfs/xfs_dir2_readdir.c
+@@ -406,6 +406,7 @@ xfs_dir2_leaf_readbuf(
+
+ /*
+ * Do we need more readahead?
++ * Each loop tries to process 1 full dir blk; last may be partial.
+ */
+ blk_start_plug(&plug);
+ for (mip->ra_index = mip->ra_offset = i = 0;
+@@ -437,9 +438,14 @@ xfs_dir2_leaf_readbuf(
+ }
+
+ /*
+- * Advance offset through the mapping table.
++ * Advance offset through the mapping table, processing a full
++ * dir block even if it is fragmented into several extents.
++ * But stop if we have consumed all valid mappings, even if
++ * it's not yet a full directory block.
+ */
+- for (j = 0; j < geo->fsbcount; j += length ) {
++ for (j = 0;
++ j < geo->fsbcount && mip->ra_index < mip->map_valid;
++ j += length ) {
+ /*
+ * The rest of this extent but not more than a dir
+ * block.
--- /dev/null
+From cb52ee334a45ae6c78a3999e4b473c43ddc528f4 Mon Sep 17 00:00:00 2001
+From: Brian Foster <bfoster@redhat.com>
+Date: Thu, 20 Apr 2017 08:06:47 -0700
+Subject: xfs: prevent multi-fsb dir readahead from reading random blocks
+
+From: Brian Foster <bfoster@redhat.com>
+
+commit cb52ee334a45ae6c78a3999e4b473c43ddc528f4 upstream.
+
+Directory block readahead uses a complex iteration mechanism to map
+between high-level directory blocks and underlying physical extents.
+This mechanism attempts to traverse the higher-level dir blocks in a
+manner that handles multi-fsb directory blocks and simultaneously
+maintains a reference to the corresponding physical blocks.
+
+This logic doesn't handle certain (discontiguous) physical extent
+layouts correctly with multi-fsb directory blocks. For example,
+consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
+block size and a directory with the following extent layout:
+
+ EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
+ 0: [0..7]: 88..95 0 (88..95) 8
+ 1: [8..15]: 80..87 0 (80..87) 8
+ 2: [16..39]: 168..191 0 (168..191) 24
+ 3: [40..63]: 5242952..5242975 1 (72..95) 24
+
+Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
+entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
+extent 2 is larger than the directory block size, the readahead code
+erroneously assumes the block is contiguous and issues a readahead
+based on the physical mapping of the first fsb of the dirblk. This
+results in read verifier failure and a spurious corruption or crc
+failure, depending on the filesystem format.
+
+Further, the subsequent readahead code responsible for walking
+through the physical table doesn't correctly advance the physical
+block reference for dirblk 2. Instead of advancing two physical
+filesystem blocks, the first iteration of the loop advances 1 block
+(correctly), but the subsequent iteration advances 2 more physical
+blocks because the next physical extent (extent 3, above) happens to
+cover more than dirblk 2. At this point, the higher-level directory
+block walking is completely off the rails of the actual physical
+layout of the directory for the respective mapping table.
+
+Update the contiguous dirblock logic to consider the current offset
+in the physical extent to avoid issuing directory readahead to
+unrelated blocks. Also, update the mapping table advancing code to
+consider the current offset within the current dirblock to avoid
+advancing the mapping reference too far beyond the dirblock.
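+
+Walking the example above (geo->fsbcount = 2): when readahead
+reaches dirblk 2, ra_index refers to extent 2 and ra_offset = 2,
+since dirblk 1 consumed two of that extent's three fsbs. The old
+contiguity test,
+
+ 	map[ra_index].br_blockcount >= geo->fsbcount	/* 3 >= 2 */
+
+wrongly treats dirblk 2 as contiguous, while the fixed test,
+
+ 	(br_blockcount - ra_offset) >= geo->fsbcount	/* 1 >= 2 */
+
+falls through to the discontiguous path. Similarly, when advancing
+past dirblk 2's second fsb (j = 1, now at extent 3 with
+ra_offset = 0), the old length of min(fsbcount, br_blockcount) = 2
+skips one block too many, while min(fsbcount - j, br_blockcount) = 1
+stops exactly at the end of the dirblock.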
+
+Signed-off-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_dir2_readdir.c | 5 +++--
+ 1 file changed, 3 insertions(+), 2 deletions(-)
+
+--- a/fs/xfs/xfs_dir2_readdir.c
++++ b/fs/xfs/xfs_dir2_readdir.c
+@@ -417,7 +417,8 @@ xfs_dir2_leaf_readbuf(
+ * Read-ahead a contiguous directory block.
+ */
+ if (i > mip->ra_current &&
+- map[mip->ra_index].br_blockcount >= geo->fsbcount) {
++ (map[mip->ra_index].br_blockcount - mip->ra_offset) >=
++ geo->fsbcount) {
+ xfs_dir3_data_readahead(dp,
+ map[mip->ra_index].br_startoff + mip->ra_offset,
+ XFS_FSB_TO_DADDR(dp->i_mount,
+@@ -450,7 +451,7 @@ xfs_dir2_leaf_readbuf(
+ * The rest of this extent but not more than a dir
+ * block.
+ */
+- length = min_t(int, geo->fsbcount,
++ length = min_t(int, geo->fsbcount - j,
+ map[mip->ra_index].br_blockcount -
+ mip->ra_offset);
+ mip->ra_offset += length;
--- /dev/null
+From 756baca27fff3ecaeab9dbc7a5ee35a1d7bc0c7f Mon Sep 17 00:00:00 2001
+From: Brian Foster <bfoster@redhat.com>
+Date: Wed, 26 Apr 2017 08:30:39 -0700
+Subject: xfs: support ability to wait on new inodes
+
+From: Brian Foster <bfoster@redhat.com>
+
+commit 756baca27fff3ecaeab9dbc7a5ee35a1d7bc0c7f upstream.
+
+Inodes that are inserted into the perag tree but still under
+construction are flagged with the XFS_INEW bit. Most contexts either
+skip such inodes when they are encountered or have the ability to
+handle them.
+
+The runtime quotaoff sequence introduces a context that must wait
+for construction of such inodes to correctly ensure that all dquots
+in the fs are released. In anticipation of this, support the ability
+to wait on new inodes. Wake the appropriate bit when XFS_INEW is
+cleared.
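+
+The waiter side added by the next patch (xfs_inew_wait()) pairs
+with these wakeups via the standard bit-wait pattern, roughly:
+
+ 	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);
+
+ 	do {
+ 		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+ 		if (!xfs_iflags_test(ip, XFS_INEW))
+ 			break;
+ 		schedule();
+ 	} while (true);
+ 	finish_wait(wq, &wait.wait);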
+
+Signed-off-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_icache.c | 5 ++++-
+ fs/xfs/xfs_inode.h | 4 +++-
+ 2 files changed, 7 insertions(+), 2 deletions(-)
+
+--- a/fs/xfs/xfs_icache.c
++++ b/fs/xfs/xfs_icache.c
+@@ -210,14 +210,17 @@ xfs_iget_cache_hit(
+
+ error = inode_init_always(mp->m_super, inode);
+ if (error) {
++ bool wake;
+ /*
+ * Re-initializing the inode failed, and we are in deep
+ * trouble. Try to re-add it to the reclaim list.
+ */
+ rcu_read_lock();
+ spin_lock(&ip->i_flags_lock);
+-
++ wake = !!__xfs_iflags_test(ip, XFS_INEW);
+ ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
++ if (wake)
++ wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
+ ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
+ trace_xfs_iget_reclaim_fail(ip);
+ goto out_error;
+--- a/fs/xfs/xfs_inode.h
++++ b/fs/xfs/xfs_inode.h
+@@ -208,7 +208,8 @@ xfs_get_initial_prid(struct xfs_inode *d
+ #define XFS_IRECLAIM (1 << 0) /* started reclaiming this inode */
+ #define XFS_ISTALE (1 << 1) /* inode has been staled */
+ #define XFS_IRECLAIMABLE (1 << 2) /* inode can be reclaimed */
+-#define XFS_INEW (1 << 3) /* inode has just been allocated */
++#define __XFS_INEW_BIT 3 /* inode has just been allocated */
++#define XFS_INEW (1 << __XFS_INEW_BIT)
+ #define XFS_ITRUNCATED (1 << 5) /* truncated down so flush-on-close */
+ #define XFS_IDIRTY_RELEASE (1 << 6) /* dirty release already seen */
+ #define __XFS_IFLOCK_BIT 7 /* inode is being flushed right now */
+@@ -453,6 +454,7 @@ static inline void xfs_finish_inode_setu
+ xfs_iflags_clear(ip, XFS_INEW);
+ barrier();
+ unlock_new_inode(VFS_I(ip));
++ wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
+ }
+
+ static inline void xfs_setup_existing_inode(struct xfs_inode *ip)
--- /dev/null
+From ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 Mon Sep 17 00:00:00 2001
+From: Brian Foster <bfoster@redhat.com>
+Date: Wed, 26 Apr 2017 08:30:39 -0700
+Subject: xfs: update ag iterator to support wait on new inodes
+
+From: Brian Foster <bfoster@redhat.com>
+
+commit ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 upstream.
+
+The AG inode iterator currently skips new inodes as such inodes are
+inserted into the inode radix tree before they are fully
+constructed. Certain contexts require the ability to wait on the
+construction of new inodes, however. The fs-wide dquot release from
+the quotaoff sequence is an example of this.
+
+Update the AG inode iterator to support the ability to wait on
+inodes flagged with XFS_INEW upon request. Create a new
+xfs_inode_ag_iterator_flags() interface and support a set of
+iteration flags to modify the iteration behavior. When the
+XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
+radix tree inode lookup and wait on them before the callback is
+executed.
+
+Signed-off-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_icache.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--------
+ fs/xfs/xfs_icache.h | 8 +++++++
+ 2 files changed, 53 insertions(+), 8 deletions(-)
+
+--- a/fs/xfs/xfs_icache.c
++++ b/fs/xfs/xfs_icache.c
+@@ -366,6 +366,22 @@ out_destroy:
+ return error;
+ }
+
++static void
++xfs_inew_wait(
++ struct xfs_inode *ip)
++{
++ wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_INEW_BIT);
++ DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);
++
++ do {
++ prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
++ if (!xfs_iflags_test(ip, XFS_INEW))
++ break;
++ schedule();
++ } while (true);
++ finish_wait(wq, &wait.wait);
++}
++
+ /*
+ * Look up an inode by number in the given file system.
+ * The inode is looked up in the cache held in each AG.
+@@ -470,9 +486,11 @@ out_error_or_again:
+
+ STATIC int
+ xfs_inode_ag_walk_grab(
+- struct xfs_inode *ip)
++ struct xfs_inode *ip,
++ int flags)
+ {
+ struct inode *inode = VFS_I(ip);
++ bool newinos = !!(flags & XFS_AGITER_INEW_WAIT);
+
+ ASSERT(rcu_read_lock_held());
+
+@@ -490,7 +508,8 @@ xfs_inode_ag_walk_grab(
+ goto out_unlock_noent;
+
+ /* avoid new or reclaimable inodes. Leave for reclaim code to flush */
+- if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
++ if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
++ __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
+ goto out_unlock_noent;
+ spin_unlock(&ip->i_flags_lock);
+
+@@ -518,7 +537,8 @@ xfs_inode_ag_walk(
+ void *args),
+ int flags,
+ void *args,
+- int tag)
++ int tag,
++ int iter_flags)
+ {
+ uint32_t first_index;
+ int last_error = 0;
+@@ -560,7 +580,7 @@ restart:
+ for (i = 0; i < nr_found; i++) {
+ struct xfs_inode *ip = batch[i];
+
+- if (done || xfs_inode_ag_walk_grab(ip))
++ if (done || xfs_inode_ag_walk_grab(ip, iter_flags))
+ batch[i] = NULL;
+
+ /*
+@@ -588,6 +608,9 @@ restart:
+ for (i = 0; i < nr_found; i++) {
+ if (!batch[i])
+ continue;
++ if ((iter_flags & XFS_AGITER_INEW_WAIT) &&
++ xfs_iflags_test(batch[i], XFS_INEW))
++ xfs_inew_wait(batch[i]);
+ error = execute(batch[i], flags, args);
+ IRELE(batch[i]);
+ if (error == -EAGAIN) {
+@@ -640,12 +663,13 @@ xfs_eofblocks_worker(
+ }
+
+ int
+-xfs_inode_ag_iterator(
++xfs_inode_ag_iterator_flags(
+ struct xfs_mount *mp,
+ int (*execute)(struct xfs_inode *ip, int flags,
+ void *args),
+ int flags,
+- void *args)
++ void *args,
++ int iter_flags)
+ {
+ struct xfs_perag *pag;
+ int error = 0;
+@@ -655,7 +679,8 @@ xfs_inode_ag_iterator(
+ ag = 0;
+ while ((pag = xfs_perag_get(mp, ag))) {
+ ag = pag->pag_agno + 1;
+- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
++ error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1,
++ iter_flags);
+ xfs_perag_put(pag);
+ if (error) {
+ last_error = error;
+@@ -667,6 +692,17 @@ xfs_inode_ag_iterator(
+ }
+
+ int
++xfs_inode_ag_iterator(
++ struct xfs_mount *mp,
++ int (*execute)(struct xfs_inode *ip, int flags,
++ void *args),
++ int flags,
++ void *args)
++{
++ return xfs_inode_ag_iterator_flags(mp, execute, flags, args, 0);
++}
++
++int
+ xfs_inode_ag_iterator_tag(
+ struct xfs_mount *mp,
+ int (*execute)(struct xfs_inode *ip, int flags,
+@@ -683,7 +719,8 @@ xfs_inode_ag_iterator_tag(
+ ag = 0;
+ while ((pag = xfs_perag_get_tag(mp, ag, tag))) {
+ ag = pag->pag_agno + 1;
+- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag);
++ error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag,
++ 0);
+ xfs_perag_put(pag);
+ if (error) {
+ last_error = error;
+--- a/fs/xfs/xfs_icache.h
++++ b/fs/xfs/xfs_icache.h
+@@ -48,6 +48,11 @@ struct xfs_eofblocks {
+ #define XFS_IGET_UNTRUSTED 0x2
+ #define XFS_IGET_DONTCACHE 0x4
+
++/*
++ * flags for AG inode iterator
++ */
++#define XFS_AGITER_INEW_WAIT 0x1 /* wait on new inodes */
++
+ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
+ uint flags, uint lock_flags, xfs_inode_t **ipp);
+
+@@ -72,6 +77,9 @@ void xfs_eofblocks_worker(struct work_st
+ int xfs_inode_ag_iterator(struct xfs_mount *mp,
+ int (*execute)(struct xfs_inode *ip, int flags, void *args),
+ int flags, void *args);
++int xfs_inode_ag_iterator_flags(struct xfs_mount *mp,
++ int (*execute)(struct xfs_inode *ip, int flags, void *args),
++ int flags, void *args, int iter_flags);
+ int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
+ int (*execute)(struct xfs_inode *ip, int flags, void *args),
+ int flags, void *args, int tag);
--- /dev/null
+From e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 Mon Sep 17 00:00:00 2001
+From: Brian Foster <bfoster@redhat.com>
+Date: Wed, 26 Apr 2017 08:30:40 -0700
+Subject: xfs: wait on new inodes during quotaoff dquot release
+
+From: Brian Foster <bfoster@redhat.com>
+
+commit e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 upstream.
+
+The quotaoff operation has a race with inode allocation that results
+in a livelock. An inode allocation that occurs before the quota
+status flags are updated acquires the appropriate dquots for the
+inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
+into the perag radix tree, sometime later attaches the dquots to the
+inode and finally clears the XFS_INEW flag. Quotaoff expects to
+release the dquots from all inodes in the filesystem via
+xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
+which skips inodes in the XFS_INEW state because they are not fully
+constructed. If the scan occurs after dquots have been attached to
+an inode, but before XFS_INEW is cleared, the newly allocated inode
+will continue to hold a reference to the applicable dquots. When
+quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
+dquot(s) remain elevated and the dqpurge scan spins indefinitely.
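+
+The race, in the usual two-column form:
+
+ 	inode allocation		quotaoff
+ 	----------------		--------
+ 	xfs_qm_vop_dqalloc()
+ 	radix tree insert (XFS_INEW)
+ 	attach dquots to inode
+ 					update quota state flags
+ 					xfs_qm_dqrele_all_inodes()
+ 					  (skips the XFS_INEW inode)
+ 	clear XFS_INEW
+ 					xfs_qm_dqpurge_all()
+ 					  spins on the elevated
+ 					  dquot reference count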
+
+To address this problem, update the xfs_qm_dqrele_all_inodes() scan
+to wait on inodes in the XFS_INEW state. We wait on the
+inodes explicitly rather than skip and retry to avoid continuous
+retry loops due to a parallel inode allocation workload. Since
+quotaoff updates the quota state flags and uses a synchronous
+transaction before the dqrele scan, and dquots are attached to
+inodes after radix tree insertion iff quota is enabled, one INEW
+waiting pass through the AG guarantees that the scan has processed
+all inodes that could possibly hold dquot references.
+
+Reported-by: Eryu Guan <eguan@redhat.com>
+Signed-off-by: Brian Foster <bfoster@redhat.com>
+Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/xfs/xfs_qm_syscalls.c | 3 ++-
+ 1 file changed, 2 insertions(+), 1 deletion(-)
+
+--- a/fs/xfs/xfs_qm_syscalls.c
++++ b/fs/xfs/xfs_qm_syscalls.c
+@@ -764,5 +764,6 @@ xfs_qm_dqrele_all_inodes(
+ uint flags)
+ {
+ ASSERT(mp->m_quotainfo);
+- xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, NULL);
++ xfs_inode_ag_iterator_flags(mp, xfs_dqrele_inode, flags, NULL,
++ XFS_AGITER_INEW_WAIT);
+ }