From: Greg Kroah-Hartman Date: Fri, 29 Apr 2022 09:17:53 +0000 (+0200) Subject: 5.15-stable patches X-Git-Tag: v4.19.241~11 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=bd9ef141c0753a6aec0e58a272acab2bb78b30fa;p=thirdparty%2Fkernel%2Fstable-queue.git 5.15-stable patches added patches: btrfs-fallback-to-blocking-mode-when-doing-async-dio-over-multiple-extents.patch btrfs-fix-deadlock-due-to-page-faults-during-direct-io-reads-and-writes.patch gfs2-add-wrapper-for-iomap_file_buffered_write.patch gfs2-clean-up-function-may_grant.patch gfs2-eliminate-ip-i_gh.patch gfs2-fix-mmap-page-fault-deadlocks-for-buffered-i-o.patch gfs2-fix-mmap-page-fault-deadlocks-for-direct-i-o.patch gfs2-introduce-flag-for-glock-holder-auto-demotion.patch gfs2-move-the-inode-glock-locking-to-gfs2_file_buffered_write.patch gup-introduce-foll_nofault-flag-to-disable-page-faults.patch gup-turn-fault_in_pages_-readable-writeable-into-fault_in_-readable-writeable.patch iomap-add-done_before-argument-to-iomap_dio_rw.patch iomap-fix-iomap_dio_rw-return-value-for-user-copies.patch iomap-support-partial-direct-i-o-on-user-copy-failures.patch iov_iter-introduce-fault_in_iov_iter_writeable.patch iov_iter-introduce-nofault-flag-to-disable-page-faults.patch iov_iter-turn-iov_iter_fault_in_readable-into-fault_in_iov_iter_readable.patch mm-gup-make-fault_in_safe_writeable-use-fixup_user_fault.patch mm-kfence-fix-objcgs-vector-allocation.patch --- diff --git a/queue-5.15/btrfs-fallback-to-blocking-mode-when-doing-async-dio-over-multiple-extents.patch b/queue-5.15/btrfs-fallback-to-blocking-mode-when-doing-async-dio-over-multiple-extents.patch new file mode 100644 index 00000000000..8d94795f8ad --- /dev/null +++ b/queue-5.15/btrfs-fallback-to-blocking-mode-when-doing-async-dio-over-multiple-extents.patch @@ -0,0 +1,322 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:55 +0800 +Subject: btrfs: fallback to blocking mode when doing async dio over multiple extents +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Filipe Manana , Josef Bacik , David Sterba , Anand Jain +Message-ID: <9127cbbcd2bf2f8efd46298d8799e36282e1a311.1649951733.git.anand.jain@oracle.com> + +From: Filipe Manana + +commit ca93e44bfb5fd7996b76f0f544999171f647f93b upstream + +Some users recently reported that MariaDB was getting a read corruption +when using io_uring on top of btrfs. This started to happen in 5.16, +after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults +during direct IO reads and writes"). That changed btrfs to use the new +iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling +iomap_dio_rw(). This was necessary to fix deadlocks when the iovector +corresponds to a memory mapped file region. That type of scenario is +exercised by test case generic/647 from fstests. + +For this MariaDB scenario, we attempt to read 16K from file offset X +using IOCB_NOWAIT and io_uring. 
In that range we have 4 extents, each +with a size of 4K, and what happens is the following: + +1) btrfs_direct_read() disables page faults and calls iomap_dio_rw(); + +2) iomap creates a struct iomap_dio object, its reference count is + initialized to 1 and its ->size field is initialized to 0; + +3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds + the first 4K extent, and setups an iomap for this extent consisting + of a single page; + +4) At iomap_dio_bio_iter(), we are able to access the first page of the + buffer (struct iov_iter) with bio_iov_iter_get_pages() without + triggering a page fault; + +5) iomap submits a bio for this 4K extent + (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments + the refcount on the struct iomap_dio object to 2; The ->size field + of the struct iomap_dio object is incremented to 4K; + +6) iomap calls btrfs_iomap_begin() again, this time with a file + offset of X + 4K. There we setup an iomap for the next extent + that also has a size of 4K; + +7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(), + which tries to access the next page (2nd page) of the buffer. + This triggers a page fault and returns -EFAULT; + +8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error + to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and + the struct iomap_dio object has a ->size value of 4K (we submitted + a bio for an extent already). The 'wait_for_completion' variable + is not set to true, because our iocb has IOCB_NOWAIT set; + +9) At the bottom of __iomap_dio_rw(), we decrement the reference count + of the struct iomap_dio object from 2 to 1. Because we were not + the only ones holding a reference on it and 'wait_for_completion' is + set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which + just returns it up the callchain, up to io_uring; + +10) The bio submitted for the first extent (step 5) completes and its + bio endio function, iomap_dio_bio_end_io(), decrements the last + reference on the struct iomap_dio object, resulting in calling + iomap_dio_complete_work() -> iomap_dio_complete(). + +11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K + and return 4K (the amount of io done) to iomap_dio_complete_work(); + +12) iomap_dio_complete_work() calls the iocb completion callback, + iocb->ki_complete() with a second argument value of 4K (total io + done) and the iocb with the adjust ki_pos of X + 4K. This results + in completing the read request for io_uring, leaving it with a + result of 4K bytes read, and only the first page of the buffer + filled in, while the remaining 3 pages, corresponding to the other + 3 extents, were not filled; + +13) For the application, the result is unexpected because if we ask + to read N bytes, it expects to get N bytes read as long as those + N bytes don't cross the EOF (i_size). + +MariaDB reports this as an error, as it's not expecting a short read, +since it knows it's asking for read operations fully within the i_size +boundary. This is typical in many applications, but it may also be +questionable if they should react to such short reads by issuing more +read calls to get the remaining data. Nevertheless, the short read +happened due to a change in btrfs regarding how it deals with page +faults while in the middle of a read operation, and there's no reason +why btrfs can't have the previous behaviour of returning the whole data +that was requested by the application. 
+ +The problem can also be triggered with the following simple program: + + /* Get O_DIRECT */ + #ifndef _GNU_SOURCE + #define _GNU_SOURCE + #endif + + #include + #include + #include + #include + #include + #include + #include + + int main(int argc, char *argv[]) + { + char *foo_path; + struct io_uring ring; + struct io_uring_sqe *sqe; + struct io_uring_cqe *cqe; + struct iovec iovec; + int fd; + long pagesize; + void *write_buf; + void *read_buf; + ssize_t ret; + int i; + + if (argc != 2) { + fprintf(stderr, "Use: %s \n", argv[0]); + return 1; + } + + foo_path = malloc(strlen(argv[1]) + 5); + if (!foo_path) { + fprintf(stderr, "Failed to allocate memory for file path\n"); + return 1; + } + strcpy(foo_path, argv[1]); + strcat(foo_path, "/foo"); + + /* + * Create file foo with 2 extents, each with a size matching + * the page size. Then allocate a buffer to read both extents + * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing + * the read with io_uring, access the first page of the buffer + * to fault it in, so that during the read we only trigger a + * page fault when accessing the second page of the buffer. + */ + fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY | + O_DIRECT, 0666); + if (fd == -1) { + fprintf(stderr, + "Failed to create file 'foo': %s (errno %d)", + strerror(errno), errno); + return 1; + } + + pagesize = sysconf(_SC_PAGE_SIZE); + ret = posix_memalign(&write_buf, pagesize, 2 * pagesize); + if (ret) { + fprintf(stderr, "Failed to allocate write buffer\n"); + return 1; + } + + memset(write_buf, 0xab, pagesize); + memset(write_buf + pagesize, 0xcd, pagesize); + + /* Create 2 extents, each with a size matching page size. */ + for (i = 0; i < 2; i++) { + ret = pwrite(fd, write_buf + i * pagesize, pagesize, + i * pagesize); + if (ret != pagesize) { + fprintf(stderr, + "Failed to write to file, ret = %ld errno %d (%s)\n", + ret, errno, strerror(errno)); + return 1; + } + ret = fsync(fd); + if (ret != 0) { + fprintf(stderr, "Failed to fsync file\n"); + return 1; + } + } + + close(fd); + fd = open(foo_path, O_RDONLY | O_DIRECT); + if (fd == -1) { + fprintf(stderr, + "Failed to open file 'foo': %s (errno %d)", + strerror(errno), errno); + return 1; + } + + ret = posix_memalign(&read_buf, pagesize, 2 * pagesize); + if (ret) { + fprintf(stderr, "Failed to allocate read buffer\n"); + return 1; + } + + /* + * Fault in only the first page of the read buffer. + * We want to trigger a page fault for the 2nd page of the + * read buffer during the read operation with io_uring + * (O_DIRECT and IOCB_NOWAIT). 
+ */ + memset(read_buf, 0, 1); + + ret = io_uring_queue_init(1, &ring, 0); + if (ret != 0) { + fprintf(stderr, "Failed to create io_uring queue\n"); + return 1; + } + + sqe = io_uring_get_sqe(&ring); + if (!sqe) { + fprintf(stderr, "Failed to get io_uring sqe\n"); + return 1; + } + + iovec.iov_base = read_buf; + iovec.iov_len = 2 * pagesize; + io_uring_prep_readv(sqe, fd, &iovec, 1, 0); + + ret = io_uring_submit_and_wait(&ring, 1); + if (ret != 1) { + fprintf(stderr, + "Failed at io_uring_submit_and_wait()\n"); + return 1; + } + + ret = io_uring_wait_cqe(&ring, &cqe); + if (ret < 0) { + fprintf(stderr, "Failed at io_uring_wait_cqe()\n"); + return 1; + } + + printf("io_uring read result for file foo:\n\n"); + printf(" cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize); + printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n", + memcmp(read_buf, write_buf, 2 * pagesize)); + + io_uring_cqe_seen(&ring, cqe); + io_uring_queue_exit(&ring); + + return 0; + } + +When running it on an unpatched kernel: + + $ gcc io_uring_test.c -luring + $ mkfs.btrfs -f /dev/sda + $ mount /dev/sda /mnt/sda + $ ./a.out /mnt/sda + io_uring read result for file foo: + + cqe->res == 4096 (expected 8192) + memcmp(read_buf, write_buf) == -205 (expected 0) + +After this patch, the read always returns 8192 bytes, with the buffer +filled with the correct data. Although that reproducer always triggers +the bug in my test vms, it's possible that it will not be so reliable +on other environments, as that can happen if the bio for the first +extent completes and decrements the reference on the struct iomap_dio +object before we do the atomic_dec_and_test() on the reference at +__iomap_dio_rw(). + +Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN +whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag +set) over a range that spans multiple extents (or a mix of extents and +holes). This avoids returning success to the caller when we only did +partial IO, which is not optimal for writes and for reads it's actually +incorrect, as the caller doesn't expect to get less bytes read than it has +requested (unless EOF is crossed), as previously mentioned. This is also +the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()), +even though it doesn't use IOMAP_DIO_PARTIAL. + +A test case for fstests will follow soon. + +Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/ +Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/ +CC: stable@vger.kernel.org # 5.16+ +Reviewed-by: Josef Bacik +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/inode.c | 28 ++++++++++++++++++++++++++++ + 1 file changed, 28 insertions(+) + +--- a/fs/btrfs/inode.c ++++ b/fs/btrfs/inode.c +@@ -7961,6 +7961,34 @@ static int btrfs_dio_iomap_begin(struct + } + + len = min(len, em->len - (start - em->start)); ++ ++ /* ++ * If we have a NOWAIT request and the range contains multiple extents ++ * (or a mix of extents and holes), then we return -EAGAIN to make the ++ * caller fallback to a context where it can do a blocking (without ++ * NOWAIT) request. This way we avoid doing partial IO and returning ++ * success to the caller, which is not optimal for writes and for reads ++ * it can result in unexpected behaviour for an application. 
++ * ++ * When doing a read, because we use IOMAP_DIO_PARTIAL when calling ++ * iomap_dio_rw(), we can end up returning less data then what the caller ++ * asked for, resulting in an unexpected, and incorrect, short read. ++ * That is, the caller asked to read N bytes and we return less than that, ++ * which is wrong unless we are crossing EOF. This happens if we get a ++ * page fault error when trying to fault in pages for the buffer that is ++ * associated to the struct iov_iter passed to iomap_dio_rw(), and we ++ * have previously submitted bios for other extents in the range, in ++ * which case iomap_dio_rw() may return us EIOCBQUEUED if not all of ++ * those bios have completed by the time we get the page fault error, ++ * which we return back to our caller - we should only return EIOCBQUEUED ++ * after we have submitted bios for all the extents in the range. ++ */ ++ if ((flags & IOMAP_NOWAIT) && len < length) { ++ free_extent_map(em); ++ ret = -EAGAIN; ++ goto unlock_err; ++ } ++ + if (write) { + ret = btrfs_get_blocks_direct_write(&em, inode, dio_data, + start, len); diff --git a/queue-5.15/btrfs-fix-deadlock-due-to-page-faults-during-direct-io-reads-and-writes.patch b/queue-5.15/btrfs-fix-deadlock-due-to-page-faults-during-direct-io-reads-and-writes.patch new file mode 100644 index 00000000000..399b6164629 --- /dev/null +++ b/queue-5.15/btrfs-fix-deadlock-due-to-page-faults-during-direct-io-reads-and-writes.patch @@ -0,0 +1,358 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:54 +0800 +Subject: btrfs: fix deadlock due to page faults during direct IO reads and writes +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Filipe Manana , Josef Bacik , David Sterba , Anand Jain +Message-ID: + +From: Filipe Manana + +commit 51bd9563b6783de8315f38f7baed949e77c42311 upstream + +If we do a direct IO read or write when the buffer given by the user is +memory mapped to the file range we are going to do IO, we end up ending +in a deadlock. This is triggered by the new test case generic/647 from +fstests. + +For a direct IO read we get a trace like this: + + [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds. + [967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1 + [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. + [967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000 + [967.875992] Call Trace: + [967.875999] __schedule+0x3ca/0xe10 + [967.876015] schedule+0x43/0xe0 + [967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs] + [967.876109] ? do_wait_intr_irq+0xb0/0xb0 + [967.876118] lock_extent_bits+0x37/0x90 [btrfs] + [967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs] + [967.876184] ? extent_readahead+0xa7/0x530 [btrfs] + [967.876214] extent_readahead+0x32d/0x530 [btrfs] + [967.876253] ? lru_cache_add+0x104/0x220 + [967.876255] ? kvm_sched_clock_read+0x14/0x40 + [967.876258] ? sched_clock_cpu+0xd/0x110 + [967.876263] ? lock_release+0x155/0x4a0 + [967.876271] read_pages+0x86/0x270 + [967.876274] ? lru_cache_add+0x125/0x220 + [967.876281] page_cache_ra_unbounded+0x1a3/0x220 + [967.876291] filemap_fault+0x626/0xa20 + [967.876303] __do_fault+0x36/0xf0 + [967.876308] __handle_mm_fault+0x83f/0x15f0 + [967.876322] handle_mm_fault+0x9e/0x260 + [967.876327] __get_user_pages+0x204/0x620 + [967.876332] ? 
get_user_pages_unlocked+0x69/0x340 + [967.876340] get_user_pages_unlocked+0xd3/0x340 + [967.876349] internal_get_user_pages_fast+0xbca/0xdc0 + [967.876366] iov_iter_get_pages+0x8d/0x3a0 + [967.876374] bio_iov_iter_get_pages+0x82/0x4a0 + [967.876379] ? lock_release+0x155/0x4a0 + [967.876387] iomap_dio_bio_actor+0x232/0x410 + [967.876396] iomap_apply+0x12a/0x4a0 + [967.876398] ? iomap_dio_rw+0x30/0x30 + [967.876414] __iomap_dio_rw+0x29f/0x5e0 + [967.876415] ? iomap_dio_rw+0x30/0x30 + [967.876420] ? lock_acquired+0xf3/0x420 + [967.876429] iomap_dio_rw+0xa/0x30 + [967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs] + [967.876460] new_sync_read+0x118/0x1a0 + [967.876472] vfs_read+0x128/0x1b0 + [967.876477] __x64_sys_pread64+0x90/0xc0 + [967.876483] do_syscall_64+0x3b/0xc0 + [967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae + [967.876490] RIP: 0033:0x7fb6f2c038d6 + [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 + [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6 + [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003 + [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000 + [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003 + [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000 + +This happens because at btrfs_dio_iomap_begin() we lock the extent range +and return with it locked - we only unlock in the endio callback, at +end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after +iomap called the btrfs_dio_iomap_begin() callback, it triggers the page +faults that resulting in reading the pages, through the readahead callback +btrfs_readahead(), and through there we end to attempt to lock again the +same extent range (or a subrange of what we locked before), resulting in +the deadlock. + +For a direct IO write, the scenario is a bit different, and it results in +trace like this: + + [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35 + [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds. + [1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1 + [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. + [1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000 + [1330.351906] Call Trace: + [1330.351913] __schedule+0x3ca/0xe10 + [1330.351930] schedule+0x43/0xe0 + [1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs] + [1330.352020] ? do_wait_intr_irq+0xb0/0xb0 + [1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs] + [1330.352064] ? extent_readahead+0xa7/0x530 [btrfs] + [1330.352094] extent_readahead+0x32d/0x530 [btrfs] + [1330.352133] ? lru_cache_add+0x104/0x220 + [1330.352135] ? kvm_sched_clock_read+0x14/0x40 + [1330.352138] ? sched_clock_cpu+0xd/0x110 + [1330.352143] ? lock_release+0x155/0x4a0 + [1330.352151] read_pages+0x86/0x270 + [1330.352155] ? lru_cache_add+0x125/0x220 + [1330.352162] page_cache_ra_unbounded+0x1a3/0x220 + [1330.352172] filemap_fault+0x626/0xa20 + [1330.352176] ? filemap_map_pages+0x18b/0x660 + [1330.352184] __do_fault+0x36/0xf0 + [1330.352189] __handle_mm_fault+0x1253/0x15f0 + [1330.352203] handle_mm_fault+0x9e/0x260 + [1330.352208] __get_user_pages+0x204/0x620 + [1330.352212] ? 
get_user_pages_unlocked+0x69/0x340 + [1330.352220] get_user_pages_unlocked+0xd3/0x340 + [1330.352229] internal_get_user_pages_fast+0xbca/0xdc0 + [1330.352246] iov_iter_get_pages+0x8d/0x3a0 + [1330.352254] bio_iov_iter_get_pages+0x82/0x4a0 + [1330.352259] ? lock_release+0x155/0x4a0 + [1330.352266] iomap_dio_bio_actor+0x232/0x410 + [1330.352275] iomap_apply+0x12a/0x4a0 + [1330.352278] ? iomap_dio_rw+0x30/0x30 + [1330.352292] __iomap_dio_rw+0x29f/0x5e0 + [1330.352294] ? iomap_dio_rw+0x30/0x30 + [1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs] + [1330.352339] new_sync_write+0x11f/0x1b0 + [1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e + [1330.352354] vfs_write+0x292/0x3c0 + [1330.352359] __x64_sys_pwrite64+0x90/0xc0 + [1330.352365] do_syscall_64+0x3b/0xc0 + [1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae + [1330.352372] RIP: 0033:0x7f4b0a580986 + [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012 + [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986 + [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003 + [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000 + [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 + [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 + +Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent +range unlocked, but later when the page faults are triggered and we try +to read the extents, we end up btrfs_lock_and_flush_ordered_range() where +we find the ordered extent for our write, created by the iomap callback +btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us +deadlock since we can't complete the ordered extent without reading the +pages (the iomap code only submits the bio after the pages are faulted +in). + +Fix this by setting the nofault attribute of the given iov_iter and retry +the direct IO read/write if we get an -EFAULT error returned from iomap. +For reads, also disable page faults completely, this is because when we +read from a hole or a prealloc extent, we can still trigger page faults +due to the call to iov_iter_zero() done by iomap - at the moment, it is +oblivious to the value of the ->nofault attribute of an iov_iter. +We also need to keep track of the number of bytes written or read, and +pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL. + +This depends on the iov_iter and iomap changes introduced in commit +c03098d4b9ad ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of +git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2"). 
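+
+In outline, the read side of the fix takes roughly the following shape (this
+is a condensed sketch of the btrfs_direct_read() changes in the diff below;
+error handling is omitted and the write side is analogous):
+
+    size_t prev_left = 0;
+    ssize_t read = 0, ret;
+
+    btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
+again:
+    pagefault_disable();
+    to->nofault = true;
+    ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+                       IOMAP_DIO_PARTIAL, read);
+    to->nofault = false;
+    pagefault_enable();
+
+    /* No increment (+=): iomap returns a cumulative byte count. */
+    if (ret > 0)
+        read = ret;
+
+    if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+        const size_t left = iov_iter_count(to);
+
+        if (left != prev_left) {
+            /* Fault in the remaining pages and retry. */
+            fault_in_iov_iter_writeable(to, left);
+            prev_left = left;
+            goto again;
+        }
+        /* No progress since the last attempt: return what we have. */
+        ret = read;
+    }
+    btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+    return ret < 0 ? ret : read;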
+ +Reviewed-by: Josef Bacik +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/file.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++------- + 1 file changed, 123 insertions(+), 16 deletions(-) + +--- a/fs/btrfs/file.c ++++ b/fs/btrfs/file.c +@@ -1903,16 +1903,17 @@ static ssize_t check_direct_IO(struct bt + + static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) + { ++ const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC); + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + loff_t pos; + ssize_t written = 0; + ssize_t written_buffered; ++ size_t prev_left = 0; + loff_t endbyte; + ssize_t err; + unsigned int ilock_flags = 0; +- struct iomap_dio *dio = NULL; + + if (iocb->ki_flags & IOCB_NOWAIT) + ilock_flags |= BTRFS_ILOCK_TRY; +@@ -1955,23 +1956,80 @@ relock: + goto buffered; + } + +- dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops, +- 0, 0); ++ /* ++ * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw() ++ * calls generic_write_sync() (through iomap_dio_complete()), because ++ * that results in calling fsync (btrfs_sync_file()) which will try to ++ * lock the inode in exclusive/write mode. ++ */ ++ if (is_sync_write) ++ iocb->ki_flags &= ~IOCB_DSYNC; ++ ++ /* ++ * The iov_iter can be mapped to the same file range we are writing to. ++ * If that's the case, then we will deadlock in the iomap code, because ++ * it first calls our callback btrfs_dio_iomap_begin(), which will create ++ * an ordered extent, and after that it will fault in the pages that the ++ * iov_iter refers to. During the fault in we end up in the readahead ++ * pages code (starting at btrfs_readahead()), which will lock the range, ++ * find that ordered extent and then wait for it to complete (at ++ * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since ++ * obviously the ordered extent can never complete as we didn't submit ++ * yet the respective bio(s). This always happens when the buffer is ++ * memory mapped to the same file range, since the iomap DIO code always ++ * invalidates pages in the target file range (after starting and waiting ++ * for any writeback). ++ * ++ * So here we disable page faults in the iov_iter and then retry if we ++ * got -EFAULT, faulting in the pages before the retry. ++ */ ++again: ++ from->nofault = true; ++ err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops, ++ IOMAP_DIO_PARTIAL, written); ++ from->nofault = false; ++ ++ /* No increment (+=) because iomap returns a cumulative value. */ ++ if (err > 0) ++ written = err; ++ ++ if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) { ++ const size_t left = iov_iter_count(from); ++ /* ++ * We have more data left to write. Try to fault in as many as ++ * possible of the remainder pages and retry. We do this without ++ * releasing and locking again the inode, to prevent races with ++ * truncate. ++ * ++ * Also, in case the iov refers to pages in the file range of the ++ * file we want to write to (due to a mmap), we could enter an ++ * infinite loop if we retry after faulting the pages in, since ++ * iomap will invalidate any pages in the range early on, before ++ * it tries to fault in the pages of the iov. So we keep track of ++ * how much was left of iov in the previous EFAULT and fallback ++ * to buffered IO in case we haven't made any progress. 
++ */ ++ if (left == prev_left) { ++ err = -ENOTBLK; ++ } else { ++ fault_in_iov_iter_readable(from, left); ++ prev_left = left; ++ goto again; ++ } ++ } + + btrfs_inode_unlock(inode, ilock_flags); + +- if (IS_ERR_OR_NULL(dio)) { +- err = PTR_ERR_OR_ZERO(dio); +- if (err < 0 && err != -ENOTBLK) +- goto out; +- } else { +- written = iomap_dio_complete(dio); +- } ++ /* ++ * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do ++ * the fsync (call generic_write_sync()). ++ */ ++ if (is_sync_write) ++ iocb->ki_flags |= IOCB_DSYNC; + +- if (written < 0 || !iov_iter_count(from)) { +- err = written; ++ /* If 'err' is -ENOTBLK then it means we must fallback to buffered IO. */ ++ if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from)) + goto out; +- } + + buffered: + pos = iocb->ki_pos; +@@ -1996,7 +2054,7 @@ buffered: + invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT, + endbyte >> PAGE_SHIFT); + out: +- return written ? written : err; ++ return err < 0 ? err : written; + } + + static ssize_t btrfs_file_write_iter(struct kiocb *iocb, +@@ -3659,6 +3717,8 @@ static int check_direct_read(struct btrf + static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to) + { + struct inode *inode = file_inode(iocb->ki_filp); ++ size_t prev_left = 0; ++ ssize_t read = 0; + ssize_t ret; + + if (fsverity_active(inode)) +@@ -3668,10 +3728,57 @@ static ssize_t btrfs_direct_read(struct + return 0; + + btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED); ++again: ++ /* ++ * This is similar to what we do for direct IO writes, see the comment ++ * at btrfs_direct_write(), but we also disable page faults in addition ++ * to disabling them only at the iov_iter level. This is because when ++ * reading from a hole or prealloc extent, iomap calls iov_iter_zero(), ++ * which can still trigger page fault ins despite having set ->nofault ++ * to true of our 'to' iov_iter. ++ * ++ * The difference to direct IO writes is that we deadlock when trying ++ * to lock the extent range in the inode's tree during he page reads ++ * triggered by the fault in (while for writes it is due to waiting for ++ * our own ordered extent). This is because for direct IO reads, ++ * btrfs_dio_iomap_begin() returns with the extent range locked, which ++ * is only unlocked in the endio callback (end_bio_extent_readpage()). ++ */ ++ pagefault_disable(); ++ to->nofault = true; + ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, +- 0, 0); ++ IOMAP_DIO_PARTIAL, read); ++ to->nofault = false; ++ pagefault_enable(); ++ ++ /* No increment (+=) because iomap returns a cumulative value. */ ++ if (ret > 0) ++ read = ret; ++ ++ if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) { ++ const size_t left = iov_iter_count(to); ++ ++ if (left == prev_left) { ++ /* ++ * We didn't make any progress since the last attempt, ++ * fallback to a buffered read for the remainder of the ++ * range. This is just to avoid any possibility of looping ++ * for too long. ++ */ ++ ret = read; ++ } else { ++ /* ++ * We made some progress since the last retry or this is ++ * the first time we are retrying. Fault in as many pages ++ * as possible and retry. ++ */ ++ fault_in_iov_iter_writeable(to, left); ++ prev_left = left; ++ goto again; ++ } ++ } + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED); +- return ret; ++ return ret < 0 ? 
ret : read; + } + + static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to) diff --git a/queue-5.15/gfs2-add-wrapper-for-iomap_file_buffered_write.patch b/queue-5.15/gfs2-add-wrapper-for-iomap_file_buffered_write.patch new file mode 100644 index 00000000000..847c13742e8 --- /dev/null +++ b/queue-5.15/gfs2-add-wrapper-for-iomap_file_buffered_write.patch @@ -0,0 +1,80 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:42 +0800 +Subject: gfs2: Add wrapper for iomap_file_buffered_write +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: + +From: Andreas Gruenbacher + +commit 2eb7509a05443048fb4df60b782de3f03c6c298b upstream + +Add a wrapper around iomap_file_buffered_write. We'll add code for when +the operation needs to be retried here later. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/file.c | 27 +++++++++++++++++---------- + 1 file changed, 17 insertions(+), 10 deletions(-) + +--- a/fs/gfs2/file.c ++++ b/fs/gfs2/file.c +@@ -877,6 +877,20 @@ out_uninit: + return written ? written : ret; + } + ++static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from) ++{ ++ struct file *file = iocb->ki_filp; ++ struct inode *inode = file_inode(file); ++ ssize_t ret; ++ ++ current->backing_dev_info = inode_to_bdi(inode); ++ ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); ++ current->backing_dev_info = NULL; ++ if (ret > 0) ++ iocb->ki_pos += ret; ++ return ret; ++} ++ + /** + * gfs2_file_write_iter - Perform a write to a file + * @iocb: The io context +@@ -928,9 +942,7 @@ static ssize_t gfs2_file_write_iter(stru + goto out_unlock; + + iocb->ki_flags |= IOCB_DSYNC; +- current->backing_dev_info = inode_to_bdi(inode); +- buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); +- current->backing_dev_info = NULL; ++ buffered = gfs2_file_buffered_write(iocb, from); + if (unlikely(buffered <= 0)) { + if (!ret) + ret = buffered; +@@ -944,7 +956,6 @@ static ssize_t gfs2_file_write_iter(stru + * the direct I/O range as we don't know if the buffered pages + * made it to disk. 
+ */ +- iocb->ki_pos += buffered; + ret2 = generic_write_sync(iocb, buffered); + invalidate_mapping_pages(mapping, + (iocb->ki_pos - buffered) >> PAGE_SHIFT, +@@ -952,13 +963,9 @@ static ssize_t gfs2_file_write_iter(stru + if (!ret || ret2 > 0) + ret += ret2; + } else { +- current->backing_dev_info = inode_to_bdi(inode); +- ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); +- current->backing_dev_info = NULL; +- if (likely(ret > 0)) { +- iocb->ki_pos += ret; ++ ret = gfs2_file_buffered_write(iocb, from); ++ if (likely(ret > 0)) + ret = generic_write_sync(iocb, ret); +- } + } + + out_unlock: diff --git a/queue-5.15/gfs2-clean-up-function-may_grant.patch b/queue-5.15/gfs2-clean-up-function-may_grant.patch new file mode 100644 index 00000000000..22bea88bd41 --- /dev/null +++ b/queue-5.15/gfs2-clean-up-function-may_grant.patch @@ -0,0 +1,201 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:43 +0800 +Subject: gfs2: Clean up function may_grant +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <16061e1d0b15ee024905913510b9569e0c5011b4.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit 6144464937fe1e6135b13a30502a339d549bf093 upstream + +Pass the first current glock holder into function may_grant and +deobfuscate the logic there. + +While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant. To make +that build cleanly, de-constify the may_grant arguments. + +We're now using function find_first_holder in do_promote, so move the +function's definition above do_promote. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/glock.c | 119 ++++++++++++++++++++++++++++++++------------------------ + 1 file changed, 69 insertions(+), 50 deletions(-) + +--- a/fs/gfs2/glock.c ++++ b/fs/gfs2/glock.c +@@ -301,46 +301,59 @@ void gfs2_glock_put(struct gfs2_glock *g + } + + /** +- * may_grant - check if its ok to grant a new lock ++ * may_grant - check if it's ok to grant a new lock + * @gl: The glock ++ * @current_gh: One of the current holders of @gl + * @gh: The lock request which we wish to grant + * +- * Returns: true if its ok to grant the lock ++ * With our current compatibility rules, if a glock has one or more active ++ * holders (HIF_HOLDER flag set), any of those holders can be passed in as ++ * @current_gh; they are all the same as far as compatibility with the new @gh ++ * goes. ++ * ++ * Returns true if it's ok to grant the lock. + */ + +-static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh) +-{ +- const struct gfs2_holder *gh_head = list_first_entry(&gl->gl_holders, const struct gfs2_holder, gh_list); ++static inline bool may_grant(struct gfs2_glock *gl, ++ struct gfs2_holder *current_gh, ++ struct gfs2_holder *gh) ++{ ++ if (current_gh) { ++ GLOCK_BUG_ON(gl, !test_bit(HIF_HOLDER, ¤t_gh->gh_iflags)); ++ ++ switch(current_gh->gh_state) { ++ case LM_ST_EXCLUSIVE: ++ /* ++ * Here we make a special exception to grant holders ++ * who agree to share the EX lock with other holders ++ * who also have the bit set. If the original holder ++ * has the LM_FLAG_NODE_SCOPE bit set, we grant more ++ * holders with the bit set. 
++ */ ++ return gh->gh_state == LM_ST_EXCLUSIVE && ++ (current_gh->gh_flags & LM_FLAG_NODE_SCOPE) && ++ (gh->gh_flags & LM_FLAG_NODE_SCOPE); + +- if (gh != gh_head) { +- /** +- * Here we make a special exception to grant holders who agree +- * to share the EX lock with other holders who also have the +- * bit set. If the original holder has the LM_FLAG_NODE_SCOPE bit +- * is set, we grant more holders with the bit set. +- */ +- if (gh_head->gh_state == LM_ST_EXCLUSIVE && +- (gh_head->gh_flags & LM_FLAG_NODE_SCOPE) && +- gh->gh_state == LM_ST_EXCLUSIVE && +- (gh->gh_flags & LM_FLAG_NODE_SCOPE)) +- return 1; +- if ((gh->gh_state == LM_ST_EXCLUSIVE || +- gh_head->gh_state == LM_ST_EXCLUSIVE)) +- return 0; ++ case LM_ST_SHARED: ++ case LM_ST_DEFERRED: ++ return gh->gh_state == current_gh->gh_state; ++ ++ default: ++ return false; ++ } + } ++ + if (gl->gl_state == gh->gh_state) +- return 1; ++ return true; + if (gh->gh_flags & GL_EXACT) +- return 0; ++ return false; + if (gl->gl_state == LM_ST_EXCLUSIVE) { +- if (gh->gh_state == LM_ST_SHARED && gh_head->gh_state == LM_ST_SHARED) +- return 1; +- if (gh->gh_state == LM_ST_DEFERRED && gh_head->gh_state == LM_ST_DEFERRED) +- return 1; ++ return gh->gh_state == LM_ST_SHARED || ++ gh->gh_state == LM_ST_DEFERRED; + } +- if (gl->gl_state != LM_ST_UNLOCKED && (gh->gh_flags & LM_FLAG_ANY)) +- return 1; +- return 0; ++ if (gh->gh_flags & LM_FLAG_ANY) ++ return gl->gl_state != LM_ST_UNLOCKED; ++ return false; + } + + static void gfs2_holder_wake(struct gfs2_holder *gh) +@@ -381,6 +394,24 @@ static void do_error(struct gfs2_glock * + } + + /** ++ * find_first_holder - find the first "holder" gh ++ * @gl: the glock ++ */ ++ ++static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl) ++{ ++ struct gfs2_holder *gh; ++ ++ if (!list_empty(&gl->gl_holders)) { ++ gh = list_first_entry(&gl->gl_holders, struct gfs2_holder, ++ gh_list); ++ if (test_bit(HIF_HOLDER, &gh->gh_iflags)) ++ return gh; ++ } ++ return NULL; ++} ++ ++/** + * do_promote - promote as many requests as possible on the current queue + * @gl: The glock + * +@@ -393,14 +424,15 @@ __releases(&gl->gl_lockref.lock) + __acquires(&gl->gl_lockref.lock) + { + const struct gfs2_glock_operations *glops = gl->gl_ops; +- struct gfs2_holder *gh, *tmp; ++ struct gfs2_holder *gh, *tmp, *first_gh; + int ret; + + restart: ++ first_gh = find_first_holder(gl); + list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) { + if (test_bit(HIF_HOLDER, &gh->gh_iflags)) + continue; +- if (may_grant(gl, gh)) { ++ if (may_grant(gl, first_gh, gh)) { + if (gh->gh_list.prev == &gl->gl_holders && + glops->go_lock) { + spin_unlock(&gl->gl_lockref.lock); +@@ -723,23 +755,6 @@ out: + } + + /** +- * find_first_holder - find the first "holder" gh +- * @gl: the glock +- */ +- +-static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl) +-{ +- struct gfs2_holder *gh; +- +- if (!list_empty(&gl->gl_holders)) { +- gh = list_first_entry(&gl->gl_holders, struct gfs2_holder, gh_list); +- if (test_bit(HIF_HOLDER, &gh->gh_iflags)) +- return gh; +- } +- return NULL; +-} +- +-/** + * run_queue - do all outstanding tasks related to a glock + * @gl: The glock in question + * @nonblock: True if we must not block in run_queue +@@ -1354,8 +1369,12 @@ __acquires(&gl->gl_lockref.lock) + GLOCK_BUG_ON(gl, true); + + if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) { +- if (test_bit(GLF_LOCK, &gl->gl_flags)) +- try_futile = !may_grant(gl, gh); ++ if (test_bit(GLF_LOCK, &gl->gl_flags)) { ++ struct 
gfs2_holder *first_gh; ++ ++ first_gh = find_first_holder(gl); ++ try_futile = !may_grant(gl, first_gh, gh); ++ } + if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags)) + goto fail; + } diff --git a/queue-5.15/gfs2-eliminate-ip-i_gh.patch b/queue-5.15/gfs2-eliminate-ip-i_gh.patch new file mode 100644 index 00000000000..3915a6d74f8 --- /dev/null +++ b/queue-5.15/gfs2-eliminate-ip-i_gh.patch @@ -0,0 +1,124 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:46 +0800 +Subject: gfs2: Eliminate ip->i_gh +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <844b20e15b0e730c43faa93347d7a65ac4e7b465.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit 1b223f7065bc7d89c4677c27381817cc95b117a8 upstream + +Now that gfs2_file_buffered_write is the only remaining user of +ip->i_gh, we can move the glock holder to the stack (or rather, use the +one we already have on the stack); there is no need for keeping the +holder in the inode anymore. + +This is slightly complicated by the fact that we're using ip->i_gh for +the statfs inode in gfs2_file_buffered_write as well. Writing to the +statfs inode isn't very common, so allocate the statfs holder +dynamically when needed. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/file.c | 34 +++++++++++++++++++++------------- + fs/gfs2/incore.h | 3 +-- + 2 files changed, 22 insertions(+), 15 deletions(-) + +--- a/fs/gfs2/file.c ++++ b/fs/gfs2/file.c +@@ -877,16 +877,25 @@ out_uninit: + return written ? written : ret; + } + +-static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from) ++static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, ++ struct iov_iter *from, ++ struct gfs2_holder *gh) + { + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); + struct gfs2_inode *ip = GFS2_I(inode); + struct gfs2_sbd *sdp = GFS2_SB(inode); ++ struct gfs2_holder *statfs_gh = NULL; + ssize_t ret; + +- gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); +- ret = gfs2_glock_nq(&ip->i_gh); ++ if (inode == sdp->sd_rindex) { ++ statfs_gh = kmalloc(sizeof(*statfs_gh), GFP_NOFS); ++ if (!statfs_gh) ++ return -ENOMEM; ++ } ++ ++ gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh); ++ ret = gfs2_glock_nq(gh); + if (ret) + goto out_uninit; + +@@ -894,7 +903,7 @@ static ssize_t gfs2_file_buffered_write( + struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); + + ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, +- GL_NOCACHE, &m_ip->i_gh); ++ GL_NOCACHE, statfs_gh); + if (ret) + goto out_unlock; + } +@@ -905,16 +914,15 @@ static ssize_t gfs2_file_buffered_write( + if (ret > 0) + iocb->ki_pos += ret; + +- if (inode == sdp->sd_rindex) { +- struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); +- +- gfs2_glock_dq_uninit(&m_ip->i_gh); +- } ++ if (inode == sdp->sd_rindex) ++ gfs2_glock_dq_uninit(statfs_gh); + + out_unlock: +- gfs2_glock_dq(&ip->i_gh); ++ gfs2_glock_dq(gh); + out_uninit: +- gfs2_holder_uninit(&ip->i_gh); ++ gfs2_holder_uninit(gh); ++ if (statfs_gh) ++ kfree(statfs_gh); + return ret; + } + +@@ -969,7 +977,7 @@ static ssize_t gfs2_file_write_iter(stru + goto out_unlock; + + iocb->ki_flags |= IOCB_DSYNC; +- buffered = gfs2_file_buffered_write(iocb, from); ++ buffered = gfs2_file_buffered_write(iocb, from, &gh); + if (unlikely(buffered <= 0)) { + if (!ret) + ret = buffered; +@@ -990,7 +998,7 @@ static ssize_t 
gfs2_file_write_iter(stru + if (!ret || ret2 > 0) + ret += ret2; + } else { +- ret = gfs2_file_buffered_write(iocb, from); ++ ret = gfs2_file_buffered_write(iocb, from, &gh); + if (likely(ret > 0)) + ret = generic_write_sync(iocb, ret); + } +--- a/fs/gfs2/incore.h ++++ b/fs/gfs2/incore.h +@@ -387,9 +387,8 @@ struct gfs2_inode { + u64 i_generation; + u64 i_eattr; + unsigned long i_flags; /* GIF_... */ +- struct gfs2_glock *i_gl; /* Move into i_gh? */ ++ struct gfs2_glock *i_gl; + struct gfs2_holder i_iopen_gh; +- struct gfs2_holder i_gh; /* for prepare/commit_write only */ + struct gfs2_qadata *i_qadata; /* quota allocation data */ + struct gfs2_holder i_rgd_gh; + struct gfs2_blkreserv i_res; /* rgrp multi-block reservation */ diff --git a/queue-5.15/gfs2-fix-mmap-page-fault-deadlocks-for-buffered-i-o.patch b/queue-5.15/gfs2-fix-mmap-page-fault-deadlocks-for-buffered-i-o.patch new file mode 100644 index 00000000000..95274bce16b --- /dev/null +++ b/queue-5.15/gfs2-fix-mmap-page-fault-deadlocks-for-buffered-i-o.patch @@ -0,0 +1,211 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:47 +0800 +Subject: gfs2: Fix mmap + page fault deadlocks for buffered I/O +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <087a752bc8848ad8814bee4648d8b9d855c8438c.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit 00bfe02f479688a67a29019d1228f1470e26f014 upstream + +In the .read_iter and .write_iter file operations, we're accessing +user-space memory while holding the inode glock. There is a possibility +that the memory is mapped to the same file, in which case we'd recurse +on the same glock. + +We could detect and work around this simple case of recursive locking, +but more complex scenarios exist that involve multiple glocks, +processes, and cluster nodes, and working around all of those cases +isn't practical or even possible. + +Avoid these kinds of problems by disabling page faults while holding the +inode glock. If a page fault would occur, we either end up with a +partial read or write or with -EFAULT if nothing could be read or +written. In either case, we know that we're not done with the +operation, so we indicate that we're willing to give up the inode glock +and then we fault in the missing pages. If that made us lose the inode +glock, we return a partial read or write. Otherwise, we resume the +operation. + +This locking problem was originally reported by Jan Kara. Linus came up +with the idea of disabling page faults. Many thanks to Al Viro and +Matthew Wilcox for their feedback. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/file.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--- + 1 file changed, 94 insertions(+), 5 deletions(-) + +--- a/fs/gfs2/file.c ++++ b/fs/gfs2/file.c +@@ -777,6 +777,36 @@ static int gfs2_fsync(struct file *file, + return ret ? 
ret : ret1; + } + ++static inline bool should_fault_in_pages(ssize_t ret, struct iov_iter *i, ++ size_t *prev_count, ++ size_t *window_size) ++{ ++ char __user *p = i->iov[0].iov_base + i->iov_offset; ++ size_t count = iov_iter_count(i); ++ int pages = 1; ++ ++ if (likely(!count)) ++ return false; ++ if (ret <= 0 && ret != -EFAULT) ++ return false; ++ if (!iter_is_iovec(i)) ++ return false; ++ ++ if (*prev_count != count || !*window_size) { ++ int pages, nr_dirtied; ++ ++ pages = min_t(int, BIO_MAX_VECS, ++ DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE)); ++ nr_dirtied = max(current->nr_dirtied_pause - ++ current->nr_dirtied, 1); ++ pages = min(pages, nr_dirtied); ++ } ++ ++ *prev_count = count; ++ *window_size = (size_t)PAGE_SIZE * pages - offset_in_page(p); ++ return true; ++} ++ + static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to, + struct gfs2_holder *gh) + { +@@ -841,9 +871,17 @@ static ssize_t gfs2_file_read_iter(struc + { + struct gfs2_inode *ip; + struct gfs2_holder gh; ++ size_t prev_count = 0, window_size = 0; + size_t written = 0; + ssize_t ret; + ++ /* ++ * In this function, we disable page faults when we're holding the ++ * inode glock while doing I/O. If a page fault occurs, we indicate ++ * that the inode glock may be dropped, fault in the pages manually, ++ * and retry. ++ */ ++ + if (iocb->ki_flags & IOCB_DIRECT) { + ret = gfs2_file_direct_read(iocb, to, &gh); + if (likely(ret != -ENOTBLK)) +@@ -865,13 +903,34 @@ static ssize_t gfs2_file_read_iter(struc + } + ip = GFS2_I(iocb->ki_filp->f_mapping->host); + gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); ++retry: + ret = gfs2_glock_nq(&gh); + if (ret) + goto out_uninit; ++retry_under_glock: ++ pagefault_disable(); + ret = generic_file_read_iter(iocb, to); ++ pagefault_enable(); + if (ret > 0) + written += ret; +- gfs2_glock_dq(&gh); ++ ++ if (should_fault_in_pages(ret, to, &prev_count, &window_size)) { ++ size_t leftover; ++ ++ gfs2_holder_allow_demote(&gh); ++ leftover = fault_in_iov_iter_writeable(to, window_size); ++ gfs2_holder_disallow_demote(&gh); ++ if (leftover != window_size) { ++ if (!gfs2_holder_queued(&gh)) { ++ if (written) ++ goto out_uninit; ++ goto retry; ++ } ++ goto retry_under_glock; ++ } ++ } ++ if (gfs2_holder_queued(&gh)) ++ gfs2_glock_dq(&gh); + out_uninit: + gfs2_holder_uninit(&gh); + return written ? written : ret; +@@ -886,8 +945,17 @@ static ssize_t gfs2_file_buffered_write( + struct gfs2_inode *ip = GFS2_I(inode); + struct gfs2_sbd *sdp = GFS2_SB(inode); + struct gfs2_holder *statfs_gh = NULL; ++ size_t prev_count = 0, window_size = 0; ++ size_t read = 0; + ssize_t ret; + ++ /* ++ * In this function, we disable page faults when we're holding the ++ * inode glock while doing I/O. If a page fault occurs, we indicate ++ * that the inode glock may be dropped, fault in the pages manually, ++ * and retry. 
++ */ ++ + if (inode == sdp->sd_rindex) { + statfs_gh = kmalloc(sizeof(*statfs_gh), GFP_NOFS); + if (!statfs_gh) +@@ -895,10 +963,11 @@ static ssize_t gfs2_file_buffered_write( + } + + gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh); ++retry: + ret = gfs2_glock_nq(gh); + if (ret) + goto out_uninit; +- ++retry_under_glock: + if (inode == sdp->sd_rindex) { + struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); + +@@ -909,21 +978,41 @@ static ssize_t gfs2_file_buffered_write( + } + + current->backing_dev_info = inode_to_bdi(inode); ++ pagefault_disable(); + ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); ++ pagefault_enable(); + current->backing_dev_info = NULL; +- if (ret > 0) ++ if (ret > 0) { + iocb->ki_pos += ret; ++ read += ret; ++ } + + if (inode == sdp->sd_rindex) + gfs2_glock_dq_uninit(statfs_gh); + ++ if (should_fault_in_pages(ret, from, &prev_count, &window_size)) { ++ size_t leftover; ++ ++ gfs2_holder_allow_demote(gh); ++ leftover = fault_in_iov_iter_readable(from, window_size); ++ gfs2_holder_disallow_demote(gh); ++ if (leftover != window_size) { ++ if (!gfs2_holder_queued(gh)) { ++ if (read) ++ goto out_uninit; ++ goto retry; ++ } ++ goto retry_under_glock; ++ } ++ } + out_unlock: +- gfs2_glock_dq(gh); ++ if (gfs2_holder_queued(gh)) ++ gfs2_glock_dq(gh); + out_uninit: + gfs2_holder_uninit(gh); + if (statfs_gh) + kfree(statfs_gh); +- return ret; ++ return read ? read : ret; + } + + /** diff --git a/queue-5.15/gfs2-fix-mmap-page-fault-deadlocks-for-direct-i-o.patch b/queue-5.15/gfs2-fix-mmap-page-fault-deadlocks-for-direct-i-o.patch new file mode 100644 index 00000000000..cbdeb9d3abb --- /dev/null +++ b/queue-5.15/gfs2-fix-mmap-page-fault-deadlocks-for-direct-i-o.patch @@ -0,0 +1,181 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:53 +0800 +Subject: gfs2: Fix mmap + page fault deadlocks for direct I/O +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <02aca00403b19d316add3a4c835d40436a615103.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit b01b2d72da25c000aeb124bc78daf3fb998be2b6 upstream + +Also disable page faults during direct I/O requests and implement a +similar kind of retry logic as in the buffered I/O case. + +The retry logic in the direct I/O case differs from the buffered I/O +case in the following way: direct I/O doesn't provide the kinds of +consistency guarantees between concurrent reads and writes that buffered +I/O provides, so once we lose the inode glock while faulting in user +pages, we always resume the operation. We never need to return a +partial read or write. + +This locking problem was originally reported by Jan Kara. Linus came up +with the idea of disabling page faults. Many thanks to Al Viro and +Matthew Wilcox for their feedback. 
+ +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/file.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++-------- + 1 file changed, 88 insertions(+), 13 deletions(-) + +--- a/fs/gfs2/file.c ++++ b/fs/gfs2/file.c +@@ -812,22 +812,64 @@ static ssize_t gfs2_file_direct_read(str + { + struct file *file = iocb->ki_filp; + struct gfs2_inode *ip = GFS2_I(file->f_mapping->host); +- size_t count = iov_iter_count(to); ++ size_t prev_count = 0, window_size = 0; ++ size_t written = 0; + ssize_t ret; + +- if (!count) ++ /* ++ * In this function, we disable page faults when we're holding the ++ * inode glock while doing I/O. If a page fault occurs, we indicate ++ * that the inode glock may be dropped, fault in the pages manually, ++ * and retry. ++ * ++ * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger ++ * physical as well as manual page faults, and we need to disable both ++ * kinds. ++ * ++ * For direct I/O, gfs2 takes the inode glock in deferred mode. This ++ * locking mode is compatible with other deferred holders, so multiple ++ * processes and nodes can do direct I/O to a file at the same time. ++ * There's no guarantee that reads or writes will be atomic. Any ++ * coordination among readers and writers needs to happen externally. ++ */ ++ ++ if (!iov_iter_count(to)) + return 0; /* skip atime */ + + gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); ++retry: + ret = gfs2_glock_nq(gh); + if (ret) + goto out_uninit; +- +- ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0); +- gfs2_glock_dq(gh); ++retry_under_glock: ++ pagefault_disable(); ++ to->nofault = true; ++ ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, ++ IOMAP_DIO_PARTIAL, written); ++ to->nofault = false; ++ pagefault_enable(); ++ if (ret > 0) ++ written = ret; ++ ++ if (should_fault_in_pages(ret, to, &prev_count, &window_size)) { ++ size_t leftover; ++ ++ gfs2_holder_allow_demote(gh); ++ leftover = fault_in_iov_iter_writeable(to, window_size); ++ gfs2_holder_disallow_demote(gh); ++ if (leftover != window_size) { ++ if (!gfs2_holder_queued(gh)) ++ goto retry; ++ goto retry_under_glock; ++ } ++ } ++ if (gfs2_holder_queued(gh)) ++ gfs2_glock_dq(gh); + out_uninit: + gfs2_holder_uninit(gh); +- return ret; ++ if (ret < 0) ++ return ret; ++ return written; + } + + static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, +@@ -836,11 +878,21 @@ static ssize_t gfs2_file_direct_write(st + struct file *file = iocb->ki_filp; + struct inode *inode = file->f_mapping->host; + struct gfs2_inode *ip = GFS2_I(inode); +- size_t len = iov_iter_count(from); +- loff_t offset = iocb->ki_pos; ++ size_t prev_count = 0, window_size = 0; ++ size_t read = 0; + ssize_t ret; + + /* ++ * In this function, we disable page faults when we're holding the ++ * inode glock while doing I/O. If a page fault occurs, we indicate ++ * that the inode glock may be dropped, fault in the pages manually, ++ * and retry. ++ * ++ * For writes, iomap_dio_rw only triggers manual page faults, so we ++ * don't need to disable physical ones. ++ */ ++ ++ /* + * Deferred lock, even if its a write, since we do no allocation on + * this path. All we need to change is the atime, and this lock mode + * ensures that other nodes have flushed their buffered read caches +@@ -849,22 +901,45 @@ static ssize_t gfs2_file_direct_write(st + * VFS does. 
+ */ + gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); ++retry: + ret = gfs2_glock_nq(gh); + if (ret) + goto out_uninit; +- ++retry_under_glock: + /* Silently fall back to buffered I/O when writing beyond EOF */ +- if (offset + len > i_size_read(&ip->i_inode)) ++ if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode)) + goto out; + +- ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0); ++ from->nofault = true; ++ ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, ++ IOMAP_DIO_PARTIAL, read); ++ from->nofault = false; ++ + if (ret == -ENOTBLK) + ret = 0; ++ if (ret > 0) ++ read = ret; ++ ++ if (should_fault_in_pages(ret, from, &prev_count, &window_size)) { ++ size_t leftover; ++ ++ gfs2_holder_allow_demote(gh); ++ leftover = fault_in_iov_iter_readable(from, window_size); ++ gfs2_holder_disallow_demote(gh); ++ if (leftover != window_size) { ++ if (!gfs2_holder_queued(gh)) ++ goto retry; ++ goto retry_under_glock; ++ } ++ } + out: +- gfs2_glock_dq(gh); ++ if (gfs2_holder_queued(gh)) ++ gfs2_glock_dq(gh); + out_uninit: + gfs2_holder_uninit(gh); +- return ret; ++ if (ret < 0) ++ return ret; ++ return read; + } + + static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) diff --git a/queue-5.15/gfs2-introduce-flag-for-glock-holder-auto-demotion.patch b/queue-5.15/gfs2-introduce-flag-for-glock-holder-auto-demotion.patch new file mode 100644 index 00000000000..f3da955cfb9 --- /dev/null +++ b/queue-5.15/gfs2-introduce-flag-for-glock-holder-auto-demotion.patch @@ -0,0 +1,425 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:44 +0800 +Subject: gfs2: Introduce flag for glock holder auto-demotion +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Bob Peterson , Andreas Gruenbacher , Anand Jain +Message-ID: <51a4309baa83be7f31064db7fad3b9d3649d239d.1649951733.git.anand.jain@oracle.com> + +From: Bob Peterson + +commit dc732906c2450939c319fec6e258aa89ecb5a632 upstream + +This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that +will allow glocks to be demoted automatically on locking conflicts. +When a locking request comes in that isn't compatible with the locking +state of an active holder and that holder has the HIF_MAY_DEMOTE flag +set, the holder will be demoted before the incoming locking request is +granted. + +Note that this mechanism demotes active holders (with the HIF_HOLDER +flag set), while before we were only demoting glocks without any active +holders. This allows processes to keep hold of locks that may form a +cyclic locking dependency; the core glock logic will then break those +dependencies in case a conflicting locking request occurs. We'll use +this to avoid giving up the inode glock proactively before faulting in +pages. + +Processes that allow a glock holder to be taken away indicate this by +calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag. +Later, they call gfs2_holder_disallow_demote() to clear the flag again, +and then they check if their holder is still queued: if it is, they are +still holding the glock; if it isn't, they can re-acquire the glock (or +abort). 
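+
+In code, that caller-side pattern looks roughly as follows (a condensed sketch
+based on the gfs2_file_read_iter() changes elsewhere in this series, with
+error handling omitted):
+
+    gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
+    ret = gfs2_glock_nq(&gh);
+    if (ret)
+        goto out_uninit;
+retry_under_glock:
+    pagefault_disable();
+    ret = generic_file_read_iter(iocb, to);
+    pagefault_enable();
+    if (ret > 0)
+        written += ret;
+
+    if (should_fault_in_pages(ret, to, &prev_count, &window_size)) {
+        size_t leftover;
+
+        /* Allow the holder to be demoted while faulting in pages. */
+        gfs2_holder_allow_demote(&gh);
+        leftover = fault_in_iov_iter_writeable(to, window_size);
+        gfs2_holder_disallow_demote(&gh);
+        if (leftover != window_size) {
+            /* Still queued: we kept the glock, resume directly. */
+            if (gfs2_holder_queued(&gh))
+                goto retry_under_glock;
+            /* The glock was taken away: return a partial result,
+             * or re-acquire it and retry. */
+            if (written)
+                goto out_uninit;
+            goto retry;
+        }
+    }
+    if (gfs2_holder_queued(&gh))
+        gfs2_glock_dq(&gh);
+out_uninit:
+    gfs2_holder_uninit(&gh);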
+ +Signed-off-by: Bob Peterson +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/glock.c | 215 +++++++++++++++++++++++++++++++++++++++++++++---------- + fs/gfs2/glock.h | 20 +++++ + fs/gfs2/incore.h | 1 + 3 files changed, 200 insertions(+), 36 deletions(-) + +--- a/fs/gfs2/glock.c ++++ b/fs/gfs2/glock.c +@@ -58,6 +58,7 @@ struct gfs2_glock_iter { + typedef void (*glock_examiner) (struct gfs2_glock * gl); + + static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target); ++static void __gfs2_glock_dq(struct gfs2_holder *gh); + + static struct dentry *gfs2_root; + static struct workqueue_struct *glock_workqueue; +@@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_g + + if (gl->gl_state == LM_ST_UNLOCKED) + return 0; ++ /* ++ * Note that demote_ok is used for the lru process of disposing of ++ * glocks. For this purpose, we don't care if the glock's holders ++ * have the HIF_MAY_DEMOTE flag set or not. If someone is using ++ * them, don't demote. ++ */ + if (!list_empty(&gl->gl_holders)) + return 0; + if (glops->go_demote_ok) +@@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock * + struct gfs2_holder *gh, *tmp; + + list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) { +- if (test_bit(HIF_HOLDER, &gh->gh_iflags)) ++ if (!test_bit(HIF_WAIT, &gh->gh_iflags)) + continue; + if (ret & LM_OUT_ERROR) + gh->gh_error = -EIO; +@@ -394,6 +401,40 @@ static void do_error(struct gfs2_glock * + } + + /** ++ * demote_incompat_holders - demote incompatible demoteable holders ++ * @gl: the glock we want to promote ++ * @new_gh: the new holder to be promoted ++ */ ++static void demote_incompat_holders(struct gfs2_glock *gl, ++ struct gfs2_holder *new_gh) ++{ ++ struct gfs2_holder *gh; ++ ++ /* ++ * Demote incompatible holders before we make ourselves eligible. ++ * (This holder may or may not allow auto-demoting, but we don't want ++ * to demote the new holder before it's even granted.) ++ */ ++ list_for_each_entry(gh, &gl->gl_holders, gh_list) { ++ /* ++ * Since holders are at the front of the list, we stop when we ++ * find the first non-holder. ++ */ ++ if (!test_bit(HIF_HOLDER, &gh->gh_iflags)) ++ return; ++ if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) && ++ !may_grant(gl, new_gh, gh)) { ++ /* ++ * We should not recurse into do_promote because ++ * __gfs2_glock_dq only calls handle_callback, ++ * gfs2_glock_add_to_lru and __gfs2_glock_queue_work. ++ */ ++ __gfs2_glock_dq(gh); ++ } ++ } ++} ++ ++/** + * find_first_holder - find the first "holder" gh + * @gl: the glock + */ +@@ -412,6 +453,26 @@ static inline struct gfs2_holder *find_f + } + + /** ++ * find_first_strong_holder - find the first non-demoteable holder ++ * @gl: the glock ++ * ++ * Find the first holder that doesn't have the HIF_MAY_DEMOTE flag set. 
++ */ ++static inline struct gfs2_holder * ++find_first_strong_holder(struct gfs2_glock *gl) ++{ ++ struct gfs2_holder *gh; ++ ++ list_for_each_entry(gh, &gl->gl_holders, gh_list) { ++ if (!test_bit(HIF_HOLDER, &gh->gh_iflags)) ++ return NULL; ++ if (!test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags)) ++ return gh; ++ } ++ return NULL; ++} ++ ++/** + * do_promote - promote as many requests as possible on the current queue + * @gl: The glock + * +@@ -425,14 +486,20 @@ __acquires(&gl->gl_lockref.lock) + { + const struct gfs2_glock_operations *glops = gl->gl_ops; + struct gfs2_holder *gh, *tmp, *first_gh; ++ bool incompat_holders_demoted = false; + int ret; + + restart: +- first_gh = find_first_holder(gl); ++ first_gh = find_first_strong_holder(gl); + list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) { +- if (test_bit(HIF_HOLDER, &gh->gh_iflags)) ++ if (!test_bit(HIF_WAIT, &gh->gh_iflags)) + continue; + if (may_grant(gl, first_gh, gh)) { ++ if (!incompat_holders_demoted) { ++ demote_incompat_holders(gl, first_gh); ++ incompat_holders_demoted = true; ++ first_gh = gh; ++ } + if (gh->gh_list.prev == &gl->gl_holders && + glops->go_lock) { + spin_unlock(&gl->gl_lockref.lock); +@@ -458,6 +525,11 @@ restart: + gfs2_holder_wake(gh); + continue; + } ++ /* ++ * If we get here, it means we may not grant this holder for ++ * some reason. If this holder is the head of the list, it ++ * means we have a blocked holder at the head, so return 1. ++ */ + if (gh->gh_list.prev == &gl->gl_holders) + return 1; + do_error(gl, 0); +@@ -1372,7 +1444,7 @@ __acquires(&gl->gl_lockref.lock) + if (test_bit(GLF_LOCK, &gl->gl_flags)) { + struct gfs2_holder *first_gh; + +- first_gh = find_first_holder(gl); ++ first_gh = find_first_strong_holder(gl); + try_futile = !may_grant(gl, first_gh, gh); + } + if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags)) +@@ -1381,7 +1453,8 @@ __acquires(&gl->gl_lockref.lock) + + list_for_each_entry(gh2, &gl->gl_holders, gh_list) { + if (unlikely(gh2->gh_owner_pid == gh->gh_owner_pid && +- (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK))) ++ (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK) && ++ !test_bit(HIF_MAY_DEMOTE, &gh2->gh_iflags))) + goto trap_recursive; + if (try_futile && + !(gh2->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) { +@@ -1477,51 +1550,83 @@ int gfs2_glock_poll(struct gfs2_holder * + return test_bit(HIF_WAIT, &gh->gh_iflags) ? 0 : 1; + } + +-/** +- * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock) +- * @gh: the glock holder +- * +- */ ++static inline bool needs_demote(struct gfs2_glock *gl) ++{ ++ return (test_bit(GLF_DEMOTE, &gl->gl_flags) || ++ test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags)); ++} + +-void gfs2_glock_dq(struct gfs2_holder *gh) ++static void __gfs2_glock_dq(struct gfs2_holder *gh) + { + struct gfs2_glock *gl = gh->gh_gl; + struct gfs2_sbd *sdp = gl->gl_name.ln_sbd; + unsigned delay = 0; + int fast_path = 0; + +- spin_lock(&gl->gl_lockref.lock); + /* +- * If we're in the process of file system withdraw, we cannot just +- * dequeue any glocks until our journal is recovered, lest we +- * introduce file system corruption. We need two exceptions to this +- * rule: We need to allow unlocking of nondisk glocks and the glock +- * for our own journal that needs recovery. ++ * This while loop is similar to function demote_incompat_holders: ++ * If the glock is due to be demoted (which may be from another node ++ * or even if this holder is GL_NOCACHE), the weak holders are ++ * demoted as well, allowing the glock to be demoted. 
+ */ +- if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) && +- glock_blocked_by_withdraw(gl) && +- gh->gh_gl != sdp->sd_jinode_gl) { +- sdp->sd_glock_dqs_held++; +- spin_unlock(&gl->gl_lockref.lock); +- might_sleep(); +- wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY, +- TASK_UNINTERRUPTIBLE); +- spin_lock(&gl->gl_lockref.lock); +- } +- if (gh->gh_flags & GL_NOCACHE) +- handle_callback(gl, LM_ST_UNLOCKED, 0, false); ++ while (gh) { ++ /* ++ * If we're in the process of file system withdraw, we cannot ++ * just dequeue any glocks until our journal is recovered, lest ++ * we introduce file system corruption. We need two exceptions ++ * to this rule: We need to allow unlocking of nondisk glocks ++ * and the glock for our own journal that needs recovery. ++ */ ++ if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) && ++ glock_blocked_by_withdraw(gl) && ++ gh->gh_gl != sdp->sd_jinode_gl) { ++ sdp->sd_glock_dqs_held++; ++ spin_unlock(&gl->gl_lockref.lock); ++ might_sleep(); ++ wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY, ++ TASK_UNINTERRUPTIBLE); ++ spin_lock(&gl->gl_lockref.lock); ++ } + +- list_del_init(&gh->gh_list); +- clear_bit(HIF_HOLDER, &gh->gh_iflags); +- if (list_empty(&gl->gl_holders) && +- !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) && +- !test_bit(GLF_DEMOTE, &gl->gl_flags)) +- fast_path = 1; ++ /* ++ * This holder should not be cached, so mark it for demote. ++ * Note: this should be done before the check for needs_demote ++ * below. ++ */ ++ if (gh->gh_flags & GL_NOCACHE) ++ handle_callback(gl, LM_ST_UNLOCKED, 0, false); ++ ++ list_del_init(&gh->gh_list); ++ clear_bit(HIF_HOLDER, &gh->gh_iflags); ++ trace_gfs2_glock_queue(gh, 0); ++ ++ /* ++ * If there hasn't been a demote request we are done. ++ * (Let the remaining holders, if any, keep holding it.) ++ */ ++ if (!needs_demote(gl)) { ++ if (list_empty(&gl->gl_holders)) ++ fast_path = 1; ++ break; ++ } ++ /* ++ * If we have another strong holder (we cannot auto-demote) ++ * we are done. It keeps holding it until it is done. ++ */ ++ if (find_first_strong_holder(gl)) ++ break; ++ ++ /* ++ * If we have a weak holder at the head of the list, it ++ * (and all others like it) must be auto-demoted. If there ++ * are no more weak holders, we exit the while loop. 
++ */ ++ gh = find_first_holder(gl); ++ } + + if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl)) + gfs2_glock_add_to_lru(gl); + +- trace_gfs2_glock_queue(gh, 0); + if (unlikely(!fast_path)) { + gl->gl_lockref.count++; + if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) && +@@ -1530,6 +1635,19 @@ void gfs2_glock_dq(struct gfs2_holder *g + delay = gl->gl_hold_time; + __gfs2_glock_queue_work(gl, delay); + } ++} ++ ++/** ++ * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock) ++ * @gh: the glock holder ++ * ++ */ ++void gfs2_glock_dq(struct gfs2_holder *gh) ++{ ++ struct gfs2_glock *gl = gh->gh_gl; ++ ++ spin_lock(&gl->gl_lockref.lock); ++ __gfs2_glock_dq(gh); + spin_unlock(&gl->gl_lockref.lock); + } + +@@ -1692,6 +1810,7 @@ void gfs2_glock_dq_m(unsigned int num_gh + + void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state) + { ++ struct gfs2_holder mock_gh = { .gh_gl = gl, .gh_state = state, }; + unsigned long delay = 0; + unsigned long holdtime; + unsigned long now = jiffies; +@@ -1706,6 +1825,28 @@ void gfs2_glock_cb(struct gfs2_glock *gl + if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags)) + delay = gl->gl_hold_time; + } ++ /* ++ * Note 1: We cannot call demote_incompat_holders from handle_callback ++ * or gfs2_set_demote due to recursion problems like: gfs2_glock_dq -> ++ * handle_callback -> demote_incompat_holders -> gfs2_glock_dq ++ * Plus, we only want to demote the holders if the request comes from ++ * a remote cluster node because local holder conflicts are resolved ++ * elsewhere. ++ * ++ * Note 2: if a remote node wants this glock in EX mode, lock_dlm will ++ * request that we set our state to UNLOCKED. Here we mock up a holder ++ * to make it look like someone wants the lock EX locally. Any SH ++ * and DF requests should be able to share the lock without demoting. ++ * ++ * Note 3: We only want to demote the demoteable holders when there ++ * are no more strong holders. The demoteable holders might as well ++ * keep the glock until the last strong holder is done with it. 
++ */ ++ if (!find_first_strong_holder(gl)) { ++ if (state == LM_ST_UNLOCKED) ++ mock_gh.gh_state = LM_ST_EXCLUSIVE; ++ demote_incompat_holders(gl, &mock_gh); ++ } + handle_callback(gl, state, delay, true); + __gfs2_glock_queue_work(gl, delay); + spin_unlock(&gl->gl_lockref.lock); +@@ -2097,6 +2238,8 @@ static const char *hflags2str(char *buf, + *p++ = 'H'; + if (test_bit(HIF_WAIT, &iflags)) + *p++ = 'W'; ++ if (test_bit(HIF_MAY_DEMOTE, &iflags)) ++ *p++ = 'D'; + *p = 0; + return buf; + } +--- a/fs/gfs2/glock.h ++++ b/fs/gfs2/glock.h +@@ -150,6 +150,8 @@ static inline struct gfs2_holder *gfs2_g + list_for_each_entry(gh, &gl->gl_holders, gh_list) { + if (!test_bit(HIF_HOLDER, &gh->gh_iflags)) + break; ++ if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags)) ++ continue; + if (gh->gh_owner_pid == pid) + goto out; + } +@@ -325,6 +327,24 @@ static inline void glock_clear_object(st + spin_unlock(&gl->gl_lockref.lock); + } + ++static inline void gfs2_holder_allow_demote(struct gfs2_holder *gh) ++{ ++ struct gfs2_glock *gl = gh->gh_gl; ++ ++ spin_lock(&gl->gl_lockref.lock); ++ set_bit(HIF_MAY_DEMOTE, &gh->gh_iflags); ++ spin_unlock(&gl->gl_lockref.lock); ++} ++ ++static inline void gfs2_holder_disallow_demote(struct gfs2_holder *gh) ++{ ++ struct gfs2_glock *gl = gh->gh_gl; ++ ++ spin_lock(&gl->gl_lockref.lock); ++ clear_bit(HIF_MAY_DEMOTE, &gh->gh_iflags); ++ spin_unlock(&gl->gl_lockref.lock); ++} ++ + extern void gfs2_inode_remember_delete(struct gfs2_glock *gl, u64 generation); + extern bool gfs2_inode_already_deleted(struct gfs2_glock *gl, u64 generation); + +--- a/fs/gfs2/incore.h ++++ b/fs/gfs2/incore.h +@@ -252,6 +252,7 @@ struct gfs2_lkstats { + + enum { + /* States */ ++ HIF_MAY_DEMOTE = 1, + HIF_HOLDER = 6, /* Set for gh that "holds" the glock */ + HIF_WAIT = 10, + }; diff --git a/queue-5.15/gfs2-move-the-inode-glock-locking-to-gfs2_file_buffered_write.patch b/queue-5.15/gfs2-move-the-inode-glock-locking-to-gfs2_file_buffered_write.patch new file mode 100644 index 00000000000..0b4d5f29210 --- /dev/null +++ b/queue-5.15/gfs2-move-the-inode-glock-locking-to-gfs2_file_buffered_write.patch @@ -0,0 +1,177 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:45 +0800 +Subject: gfs2: Move the inode glock locking to gfs2_file_buffered_write +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: + +From: Andreas Gruenbacher + +commit b924bdab7445946e2ed364a0e6e249d36f1f1158 upstream + +So far, for buffered writes, we were taking the inode glock in +gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of +not holding the inode glock while iomap_write_actor faults in user +pages. It turns out that iomap_write_actor is called inside iomap_begin +... iomap_end, so the user pages were still faulted in while holding the +inode glock and the locking code in iomap_begin / iomap_end was +completely pointless. + +Move the locking into gfs2_file_buffered_write instead. We'll take care +of the potential deadlocks due to faulting in user pages while holding a +glock in a subsequent patch. 
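Condensed from the fs/gfs2/file.c hunk below (illustrative only; the sd_rindex special case and error paths are elided), the buffered write path now brackets the whole iomap call with the inode glock:

	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
	ret = gfs2_glock_nq(&ip->i_gh);
	if (ret)
		goto out_uninit;

	current->backing_dev_info = inode_to_bdi(inode);
	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
	current->backing_dev_info = NULL;
	if (ret > 0)
		iocb->ki_pos += ret;

	gfs2_glock_dq(&ip->i_gh);
out_uninit:
	gfs2_holder_uninit(&ip->i_gh);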
+ +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/gfs2/bmap.c | 60 --------------------------------------------------------- + fs/gfs2/file.c | 27 +++++++++++++++++++++++++ + 2 files changed, 28 insertions(+), 59 deletions(-) + +--- a/fs/gfs2/bmap.c ++++ b/fs/gfs2/bmap.c +@@ -961,46 +961,6 @@ hole_found: + goto out; + } + +-static int gfs2_write_lock(struct inode *inode) +-{ +- struct gfs2_inode *ip = GFS2_I(inode); +- struct gfs2_sbd *sdp = GFS2_SB(inode); +- int error; +- +- gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); +- error = gfs2_glock_nq(&ip->i_gh); +- if (error) +- goto out_uninit; +- if (&ip->i_inode == sdp->sd_rindex) { +- struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); +- +- error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, +- GL_NOCACHE, &m_ip->i_gh); +- if (error) +- goto out_unlock; +- } +- return 0; +- +-out_unlock: +- gfs2_glock_dq(&ip->i_gh); +-out_uninit: +- gfs2_holder_uninit(&ip->i_gh); +- return error; +-} +- +-static void gfs2_write_unlock(struct inode *inode) +-{ +- struct gfs2_inode *ip = GFS2_I(inode); +- struct gfs2_sbd *sdp = GFS2_SB(inode); +- +- if (&ip->i_inode == sdp->sd_rindex) { +- struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); +- +- gfs2_glock_dq_uninit(&m_ip->i_gh); +- } +- gfs2_glock_dq_uninit(&ip->i_gh); +-} +- + static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos, + unsigned len) + { +@@ -1118,11 +1078,6 @@ out_qunlock: + return ret; + } + +-static inline bool gfs2_iomap_need_write_lock(unsigned flags) +-{ +- return (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT); +-} +- + static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length, + unsigned flags, struct iomap *iomap, + struct iomap *srcmap) +@@ -1135,12 +1090,6 @@ static int gfs2_iomap_begin(struct inode + iomap->flags |= IOMAP_F_BUFFER_HEAD; + + trace_gfs2_iomap_start(ip, pos, length, flags); +- if (gfs2_iomap_need_write_lock(flags)) { +- ret = gfs2_write_lock(inode); +- if (ret) +- goto out; +- } +- + ret = __gfs2_iomap_get(inode, pos, length, flags, iomap, &mp); + if (ret) + goto out_unlock; +@@ -1168,10 +1117,7 @@ static int gfs2_iomap_begin(struct inode + ret = gfs2_iomap_begin_write(inode, pos, length, flags, iomap, &mp); + + out_unlock: +- if (ret && gfs2_iomap_need_write_lock(flags)) +- gfs2_write_unlock(inode); + release_metapath(&mp); +-out: + trace_gfs2_iomap_end(ip, iomap, ret); + return ret; + } +@@ -1219,15 +1165,11 @@ static int gfs2_iomap_end(struct inode * + } + + if (unlikely(!written)) +- goto out_unlock; ++ return 0; + + if (iomap->flags & IOMAP_F_SIZE_CHANGED) + mark_inode_dirty(inode); + set_bit(GLF_DIRTY, &ip->i_gl->gl_flags); +- +-out_unlock: +- if (gfs2_iomap_need_write_lock(flags)) +- gfs2_write_unlock(inode); + return 0; + } + +--- a/fs/gfs2/file.c ++++ b/fs/gfs2/file.c +@@ -881,13 +881,40 @@ static ssize_t gfs2_file_buffered_write( + { + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); ++ struct gfs2_inode *ip = GFS2_I(inode); ++ struct gfs2_sbd *sdp = GFS2_SB(inode); + ssize_t ret; + ++ gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); ++ ret = gfs2_glock_nq(&ip->i_gh); ++ if (ret) ++ goto out_uninit; ++ ++ if (inode == sdp->sd_rindex) { ++ struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); ++ ++ ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, ++ GL_NOCACHE, &m_ip->i_gh); ++ if (ret) ++ goto out_unlock; ++ } ++ + current->backing_dev_info = inode_to_bdi(inode); + ret = 
iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); + current->backing_dev_info = NULL; + if (ret > 0) + iocb->ki_pos += ret; ++ ++ if (inode == sdp->sd_rindex) { ++ struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); ++ ++ gfs2_glock_dq_uninit(&m_ip->i_gh); ++ } ++ ++out_unlock: ++ gfs2_glock_dq(&ip->i_gh); ++out_uninit: ++ gfs2_holder_uninit(&ip->i_gh); + return ret; + } + diff --git a/queue-5.15/gup-introduce-foll_nofault-flag-to-disable-page-faults.patch b/queue-5.15/gup-introduce-foll_nofault-flag-to-disable-page-faults.patch new file mode 100644 index 00000000000..42ccc552b3d --- /dev/null +++ b/queue-5.15/gup-introduce-foll_nofault-flag-to-disable-page-faults.patch @@ -0,0 +1,57 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:51 +0800 +Subject: gup: Introduce FOLL_NOFAULT flag to disable page faults +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <2ee1e383ae1cca975426b54ab251257f6d4e12c0.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit 55b8fe703bc51200d4698596c90813453b35ae63 upstream + +Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return +-EFAULT when it would otherwise trigger a page fault. This is roughly +similar to FOLL_FAST_ONLY but available on all architectures, and less +fragile. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + include/linux/mm.h | 3 ++- + mm/gup.c | 4 +++- + 2 files changed, 5 insertions(+), 2 deletions(-) + +--- a/include/linux/mm.h ++++ b/include/linux/mm.h +@@ -2858,7 +2858,8 @@ struct page *follow_page(struct vm_area_ + #define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ + #define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO + * and return without waiting upon it */ +-#define FOLL_POPULATE 0x40 /* fault in page */ ++#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */ ++#define FOLL_NOFAULT 0x80 /* do not fault in pages */ + #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ + #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ + #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ +--- a/mm/gup.c ++++ b/mm/gup.c +@@ -943,6 +943,8 @@ static int faultin_page(struct vm_area_s + /* mlock all present pages, but do not fault in new pages */ + if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK) + return -ENOENT; ++ if (*flags & FOLL_NOFAULT) ++ return -EFAULT; + if (*flags & FOLL_WRITE) + fault_flags |= FAULT_FLAG_WRITE; + if (*flags & FOLL_REMOTE) +@@ -2868,7 +2870,7 @@ static int internal_get_user_pages_fast( + + if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | + FOLL_FORCE | FOLL_PIN | FOLL_GET | +- FOLL_FAST_ONLY))) ++ FOLL_FAST_ONLY | FOLL_NOFAULT))) + return -EINVAL; + + if (gup_flags & FOLL_PIN) diff --git a/queue-5.15/gup-turn-fault_in_pages_-readable-writeable-into-fault_in_-readable-writeable.patch b/queue-5.15/gup-turn-fault_in_pages_-readable-writeable-into-fault_in_-readable-writeable.patch new file mode 100644 index 00000000000..60a7d9c7d24 --- /dev/null +++ b/queue-5.15/gup-turn-fault_in_pages_-readable-writeable-into-fault_in_-readable-writeable.patch @@ -0,0 +1,340 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:39 +0800 +Subject: gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} +To: stable@vger.kernel.org +Cc: 
linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <92b6e65e73dd2764bef59e0e20b65143ab28914a.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit bb523b406c849eef8f265a07cd7f320f1f177743 upstream + +Turn fault_in_pages_{readable,writeable} into versions that return the +number of bytes not faulted in, similar to copy_to_user, instead of +returning a non-zero value when any of the requested pages couldn't be +faulted in. This supports the existing users that require all pages to +be faulted in as well as new users that are happy if any pages can be +faulted in. + +Rename the functions to fault_in_{readable,writeable} to make sure +this change doesn't silently break things. + +Neither of these functions is entirely trivial and it doesn't seem +useful to inline them, so move them to mm/gup.c. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + arch/powerpc/kernel/kvm.c | 3 + + arch/powerpc/kernel/signal_32.c | 4 +- + arch/powerpc/kernel/signal_64.c | 2 - + arch/x86/kernel/fpu/signal.c | 7 +-- + drivers/gpu/drm/armada/armada_gem.c | 7 +-- + fs/btrfs/ioctl.c | 5 +- + include/linux/pagemap.h | 57 +--------------------------- + lib/iov_iter.c | 10 ++--- + mm/filemap.c | 2 - + mm/gup.c | 72 ++++++++++++++++++++++++++++++++++++ + 10 files changed, 93 insertions(+), 76 deletions(-) + +--- a/arch/powerpc/kernel/kvm.c ++++ b/arch/powerpc/kernel/kvm.c +@@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(vo + on_each_cpu(kvm_map_magic_page, &features, 1); + + /* Quick self-test to see if the mapping works */ +- if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) { ++ if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE, ++ sizeof(u32))) { + kvm_patching_worked = false; + return; + } +--- a/arch/powerpc/kernel/signal_32.c ++++ b/arch/powerpc/kernel/signal_32.c +@@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucon + if (new_ctx == NULL) + return 0; + if (!access_ok(new_ctx, ctx_size) || +- fault_in_pages_readable((u8 __user *)new_ctx, ctx_size)) ++ fault_in_readable((char __user *)new_ctx, ctx_size)) + return -EFAULT; + + /* +@@ -1239,7 +1239,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct + #endif + + if (!access_ok(ctx, sizeof(*ctx)) || +- fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx))) ++ fault_in_readable((char __user *)ctx, sizeof(*ctx))) + return -EFAULT; + + /* +--- a/arch/powerpc/kernel/signal_64.c ++++ b/arch/powerpc/kernel/signal_64.c +@@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucon + if (new_ctx == NULL) + return 0; + if (!access_ok(new_ctx, ctx_size) || +- fault_in_pages_readable((u8 __user *)new_ctx, ctx_size)) ++ fault_in_readable((char __user *)new_ctx, ctx_size)) + return -EFAULT; + + /* +--- a/arch/x86/kernel/fpu/signal.c ++++ b/arch/x86/kernel/fpu/signal.c +@@ -205,7 +205,7 @@ retry: + fpregs_unlock(); + + if (ret) { +- if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size)) ++ if (!fault_in_writeable(buf_fx, fpu_user_xstate_size)) + goto retry; + return -EFAULT; + } +@@ -278,10 +278,9 @@ retry: + if (ret != -EFAULT) + return -EINVAL; + +- ret = fault_in_pages_readable(buf, size); +- if (!ret) ++ if (!fault_in_readable(buf, size)) + goto retry; +- return ret; ++ return -EFAULT; + } + + /* +--- a/drivers/gpu/drm/armada/armada_gem.c ++++ b/drivers/gpu/drm/armada/armada_gem.c +@@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_d + struct drm_armada_gem_pwrite *args = data; + struct armada_gem_object *dobj; + char 
__user *ptr; +- int ret; ++ int ret = 0; + + DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n", + args->handle, args->offset, args->size, args->ptr); +@@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_d + if (!access_ok(ptr, args->size)) + return -EFAULT; + +- ret = fault_in_pages_readable(ptr, args->size); +- if (ret) +- return ret; ++ if (fault_in_readable(ptr, args->size)) ++ return -EFAULT; + + dobj = armada_gem_object_lookup(file, args->handle); + if (dobj == NULL) +--- a/fs/btrfs/ioctl.c ++++ b/fs/btrfs/ioctl.c +@@ -2258,9 +2258,8 @@ static noinline int search_ioctl(struct + key.offset = sk->min_offset; + + while (1) { +- ret = fault_in_pages_writeable(ubuf + sk_offset, +- *buf_size - sk_offset); +- if (ret) ++ ret = -EFAULT; ++ if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset)) + break; + + ret = btrfs_search_forward(root, &key, path, sk->min_transid); +--- a/include/linux/pagemap.h ++++ b/include/linux/pagemap.h +@@ -733,61 +733,10 @@ int wait_on_page_private_2_killable(stru + extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter); + + /* +- * Fault everything in given userspace address range in. ++ * Fault in userspace address range. + */ +-static inline int fault_in_pages_writeable(char __user *uaddr, size_t size) +-{ +- char __user *end = uaddr + size - 1; +- +- if (unlikely(size == 0)) +- return 0; +- +- if (unlikely(uaddr > end)) +- return -EFAULT; +- /* +- * Writing zeroes into userspace here is OK, because we know that if +- * the zero gets there, we'll be overwriting it. +- */ +- do { +- if (unlikely(__put_user(0, uaddr) != 0)) +- return -EFAULT; +- uaddr += PAGE_SIZE; +- } while (uaddr <= end); +- +- /* Check whether the range spilled into the next page. */ +- if (((unsigned long)uaddr & PAGE_MASK) == +- ((unsigned long)end & PAGE_MASK)) +- return __put_user(0, end); +- +- return 0; +-} +- +-static inline int fault_in_pages_readable(const char __user *uaddr, size_t size) +-{ +- volatile char c; +- const char __user *end = uaddr + size - 1; +- +- if (unlikely(size == 0)) +- return 0; +- +- if (unlikely(uaddr > end)) +- return -EFAULT; +- +- do { +- if (unlikely(__get_user(c, uaddr) != 0)) +- return -EFAULT; +- uaddr += PAGE_SIZE; +- } while (uaddr <= end); +- +- /* Check whether the range spilled into the next page. 
*/ +- if (((unsigned long)uaddr & PAGE_MASK) == +- ((unsigned long)end & PAGE_MASK)) { +- return __get_user(c, end); +- } +- +- (void)c; +- return 0; +-} ++size_t fault_in_writeable(char __user *uaddr, size_t size); ++size_t fault_in_readable(const char __user *uaddr, size_t size); + + int add_to_page_cache_locked(struct page *page, struct address_space *mapping, + pgoff_t index, gfp_t gfp_mask); +--- a/lib/iov_iter.c ++++ b/lib/iov_iter.c +@@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(st + buf = iov->iov_base + skip; + copy = min(bytes, iov->iov_len - skip); + +- if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) { ++ if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) { + kaddr = kmap_atomic(page); + from = kaddr + offset; + +@@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec( + buf = iov->iov_base + skip; + copy = min(bytes, iov->iov_len - skip); + +- if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) { ++ if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) { + kaddr = kmap_atomic(page); + to = kaddr + offset; + +@@ -447,13 +447,11 @@ int iov_iter_fault_in_readable(const str + bytes = i->count; + for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) { + size_t len = min(bytes, p->iov_len - skip); +- int err; + + if (unlikely(!len)) + continue; +- err = fault_in_pages_readable(p->iov_base + skip, len); +- if (unlikely(err)) +- return err; ++ if (fault_in_readable(p->iov_base + skip, len)) ++ return -EFAULT; + bytes -= len; + } + } +--- a/mm/filemap.c ++++ b/mm/filemap.c +@@ -90,7 +90,7 @@ + * ->lock_page (filemap_fault, access_process_vm) + * + * ->i_rwsem (generic_perform_write) +- * ->mmap_lock (fault_in_pages_readable->do_page_fault) ++ * ->mmap_lock (fault_in_readable->do_page_fault) + * + * bdi->wb.list_lock + * sb_lock (fs/fs-writeback.c) +--- a/mm/gup.c ++++ b/mm/gup.c +@@ -1682,6 +1682,78 @@ finish_or_fault: + #endif /* !CONFIG_MMU */ + + /** ++ * fault_in_writeable - fault in userspace address range for writing ++ * @uaddr: start of address range ++ * @size: size of address range ++ * ++ * Returns the number of bytes not faulted in (like copy_to_user() and ++ * copy_from_user()). ++ */ ++size_t fault_in_writeable(char __user *uaddr, size_t size) ++{ ++ char __user *start = uaddr, *end; ++ ++ if (unlikely(size == 0)) ++ return 0; ++ if (!PAGE_ALIGNED(uaddr)) { ++ if (unlikely(__put_user(0, uaddr) != 0)) ++ return size; ++ uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); ++ } ++ end = (char __user *)PAGE_ALIGN((unsigned long)start + size); ++ if (unlikely(end < start)) ++ end = NULL; ++ while (uaddr != end) { ++ if (unlikely(__put_user(0, uaddr) != 0)) ++ goto out; ++ uaddr += PAGE_SIZE; ++ } ++ ++out: ++ if (size > uaddr - start) ++ return size - (uaddr - start); ++ return 0; ++} ++EXPORT_SYMBOL(fault_in_writeable); ++ ++/** ++ * fault_in_readable - fault in userspace address range for reading ++ * @uaddr: start of user address range ++ * @size: size of user address range ++ * ++ * Returns the number of bytes not faulted in (like copy_to_user() and ++ * copy_from_user()). 
++ */ ++size_t fault_in_readable(const char __user *uaddr, size_t size) ++{ ++ const char __user *start = uaddr, *end; ++ volatile char c; ++ ++ if (unlikely(size == 0)) ++ return 0; ++ if (!PAGE_ALIGNED(uaddr)) { ++ if (unlikely(__get_user(c, uaddr) != 0)) ++ return size; ++ uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr); ++ } ++ end = (const char __user *)PAGE_ALIGN((unsigned long)start + size); ++ if (unlikely(end < start)) ++ end = NULL; ++ while (uaddr != end) { ++ if (unlikely(__get_user(c, uaddr) != 0)) ++ goto out; ++ uaddr += PAGE_SIZE; ++ } ++ ++out: ++ (void)c; ++ if (size > uaddr - start) ++ return size - (uaddr - start); ++ return 0; ++} ++EXPORT_SYMBOL(fault_in_readable); ++ ++/** + * get_dump_page() - pin user page in memory while writing it to core dump + * @addr: user address + * diff --git a/queue-5.15/iomap-add-done_before-argument-to-iomap_dio_rw.patch b/queue-5.15/iomap-add-done_before-argument-to-iomap_dio_rw.patch new file mode 100644 index 00000000000..cf02fcb15ba --- /dev/null +++ b/queue-5.15/iomap-add-done_before-argument-to-iomap_dio_rw.patch @@ -0,0 +1,242 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:50 +0800 +Subject: iomap: Add done_before argument to iomap_dio_rw +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , "Darrick J . Wong" , Anand Jain +Message-ID: + +From: Andreas Gruenbacher + +commit 4fdccaa0d184c202f98d73b24e3ec8eeee88ab8d upstream + +Add a done_before argument to iomap_dio_rw that indicates how much of +the request has already been transferred. When the request succeeds, we +report that done_before additional bytes were tranferred. This is +useful for finishing a request asynchronously when part of the request +has already been completed synchronously. + +We'll use that to allow iomap_dio_rw to be used with page faults +disabled: when a page fault occurs while submitting a request, we +synchronously complete the part of the request that has already been +submitted. The caller can then take care of the page fault and call +iomap_dio_rw again for the rest of the request, passing in the number of +bytes already tranferred. + +Signed-off-by: Andreas Gruenbacher +Reviewed-by: Darrick J. 
Wong +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/file.c | 5 +++-- + fs/erofs/data.c | 2 +- + fs/ext4/file.c | 5 +++-- + fs/gfs2/file.c | 4 ++-- + fs/iomap/direct-io.c | 19 ++++++++++++++++--- + fs/xfs/xfs_file.c | 6 +++--- + fs/zonefs/super.c | 4 ++-- + include/linux/iomap.h | 4 ++-- + 8 files changed, 32 insertions(+), 17 deletions(-) + +--- a/fs/btrfs/file.c ++++ b/fs/btrfs/file.c +@@ -1956,7 +1956,7 @@ relock: + } + + dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops, +- 0); ++ 0, 0); + + btrfs_inode_unlock(inode, ilock_flags); + +@@ -3668,7 +3668,8 @@ static ssize_t btrfs_direct_read(struct + return 0; + + btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED); +- ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0); ++ ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, ++ 0, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED); + return ret; + } +--- a/fs/erofs/data.c ++++ b/fs/erofs/data.c +@@ -287,7 +287,7 @@ static ssize_t erofs_file_read_iter(stru + + if (!err) + return iomap_dio_rw(iocb, to, &erofs_iomap_ops, +- NULL, 0); ++ NULL, 0, 0); + if (err < 0) + return err; + } +--- a/fs/ext4/file.c ++++ b/fs/ext4/file.c +@@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct + return generic_file_read_iter(iocb, to); + } + +- ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0); ++ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0); + inode_unlock_shared(inode); + + file_accessed(iocb->ki_filp); +@@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struc + if (ilock_shared) + iomap_ops = &ext4_iomap_overwrite_ops; + ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, +- (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0); ++ (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0, ++ 0); + if (ret == -ENOTBLK) + ret = 0; + +--- a/fs/gfs2/file.c ++++ b/fs/gfs2/file.c +@@ -823,7 +823,7 @@ static ssize_t gfs2_file_direct_read(str + if (ret) + goto out_uninit; + +- ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0); ++ ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0); + gfs2_glock_dq(gh); + out_uninit: + gfs2_holder_uninit(gh); +@@ -857,7 +857,7 @@ static ssize_t gfs2_file_direct_write(st + if (offset + len > i_size_read(&ip->i_inode)) + goto out; + +- ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0); ++ ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0); + if (ret == -ENOTBLK) + ret = 0; + out: +--- a/fs/iomap/direct-io.c ++++ b/fs/iomap/direct-io.c +@@ -31,6 +31,7 @@ struct iomap_dio { + atomic_t ref; + unsigned flags; + int error; ++ size_t done_before; + bool wait_for_completion; + + union { +@@ -124,6 +125,9 @@ ssize_t iomap_dio_complete(struct iomap_ + if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) + ret = generic_write_sync(iocb, ret); + ++ if (ret > 0) ++ ret += dio->done_before; ++ + kfree(dio); + + return ret; +@@ -450,13 +454,21 @@ static loff_t iomap_dio_iter(const struc + * may be pure data writes. In that case, we still need to do a full data sync + * completion. + * ++ * When page faults are disabled and @dio_flags includes IOMAP_DIO_PARTIAL, ++ * __iomap_dio_rw can return a partial result if it encounters a non-resident ++ * page in @iter after preparing a transfer. In that case, the non-resident ++ * pages can be faulted in and the request resumed with @done_before set to the ++ * number of bytes previously transferred. 
The request will then complete with ++ * the correct total number of bytes transferred; this is essential for ++ * completing partial requests asynchronously. ++ * + * Returns -ENOTBLK In case of a page invalidation invalidation failure for + * writes. The callers needs to fall back to buffered I/O in this case. + */ + struct iomap_dio * + __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, + const struct iomap_ops *ops, const struct iomap_dio_ops *dops, +- unsigned int dio_flags) ++ unsigned int dio_flags, size_t done_before) + { + struct address_space *mapping = iocb->ki_filp->f_mapping; + struct inode *inode = file_inode(iocb->ki_filp); +@@ -486,6 +498,7 @@ __iomap_dio_rw(struct kiocb *iocb, struc + dio->dops = dops; + dio->error = 0; + dio->flags = 0; ++ dio->done_before = done_before; + + dio->submit.iter = iter; + dio->submit.waiter = current; +@@ -652,11 +665,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw); + ssize_t + iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, + const struct iomap_ops *ops, const struct iomap_dio_ops *dops, +- unsigned int dio_flags) ++ unsigned int dio_flags, size_t done_before) + { + struct iomap_dio *dio; + +- dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags); ++ dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before); + if (IS_ERR_OR_NULL(dio)) + return PTR_ERR_OR_ZERO(dio); + return iomap_dio_complete(dio); +--- a/fs/xfs/xfs_file.c ++++ b/fs/xfs/xfs_file.c +@@ -259,7 +259,7 @@ xfs_file_dio_read( + ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); + if (ret) + return ret; +- ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0); ++ ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0); + xfs_iunlock(ip, XFS_IOLOCK_SHARED); + + return ret; +@@ -569,7 +569,7 @@ xfs_file_dio_write_aligned( + } + trace_xfs_file_direct_write(iocb, from); + ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops, +- &xfs_dio_write_ops, 0); ++ &xfs_dio_write_ops, 0, 0); + out_unlock: + if (iolock) + xfs_iunlock(ip, iolock); +@@ -647,7 +647,7 @@ retry_exclusive: + + trace_xfs_file_direct_write(iocb, from); + ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops, +- &xfs_dio_write_ops, flags); ++ &xfs_dio_write_ops, flags, 0); + + /* + * Retry unaligned I/O with exclusive blocking semantics if the DIO +--- a/fs/zonefs/super.c ++++ b/fs/zonefs/super.c +@@ -852,7 +852,7 @@ static ssize_t zonefs_file_dio_write(str + ret = zonefs_file_dio_append(iocb, from); + else + ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, +- &zonefs_write_dio_ops, 0); ++ &zonefs_write_dio_ops, 0, 0); + if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && + (ret > 0 || ret == -EIOCBQUEUED)) { + if (ret > 0) +@@ -987,7 +987,7 @@ static ssize_t zonefs_file_read_iter(str + } + file_accessed(iocb->ki_filp); + ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops, +- &zonefs_read_dio_ops, 0); ++ &zonefs_read_dio_ops, 0, 0); + } else { + ret = generic_file_read_iter(iocb, to); + if (ret == -EIO) +--- a/include/linux/iomap.h ++++ b/include/linux/iomap.h +@@ -339,10 +339,10 @@ struct iomap_dio_ops { + + ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, + const struct iomap_ops *ops, const struct iomap_dio_ops *dops, +- unsigned int dio_flags); ++ unsigned int dio_flags, size_t done_before); + struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, + const struct iomap_ops *ops, const struct iomap_dio_ops *dops, +- unsigned int dio_flags); ++ unsigned int dio_flags, size_t done_before); + ssize_t iomap_dio_complete(struct iomap_dio *dio); + int 
iomap_dio_iopoll(struct kiocb *kiocb, bool spin); + diff --git a/queue-5.15/iomap-fix-iomap_dio_rw-return-value-for-user-copies.patch b/queue-5.15/iomap-fix-iomap_dio_rw-return-value-for-user-copies.patch new file mode 100644 index 00000000000..988f688f762 --- /dev/null +++ b/queue-5.15/iomap-fix-iomap_dio_rw-return-value-for-user-copies.patch @@ -0,0 +1,47 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:48 +0800 +Subject: iomap: Fix iomap_dio_rw return value for user copies +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , "Darrick J . Wong" , Christoph Hellwig , Anand Jain +Message-ID: <63440885619fdfa1a520a9528e38207311f44f2a.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit 42c498c18a94eed79896c50871889af52fa0822e upstream + +When a user copy fails in one of the helpers of iomap_dio_rw, fail with +-EFAULT instead of returning 0. This matches what iomap_dio_bio_actor +returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these +changes, iomap_dio_actor now consistently fails with -EFAULT when a user +page cannot be faulted in. + +Signed-off-by: Andreas Gruenbacher +Reviewed-by: Darrick J. Wong +Reviewed-by: Christoph Hellwig +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/iomap/direct-io.c | 4 ++++ + 1 file changed, 4 insertions(+) + +--- a/fs/iomap/direct-io.c ++++ b/fs/iomap/direct-io.c +@@ -371,6 +371,8 @@ static loff_t iomap_dio_hole_iter(const + loff_t length = iov_iter_zero(iomap_length(iter), dio->submit.iter); + + dio->size += length; ++ if (!length) ++ return -EFAULT; + return length; + } + +@@ -402,6 +404,8 @@ static loff_t iomap_dio_inline_iter(cons + copied = copy_to_iter(inline_data, length, iter); + } + dio->size += copied; ++ if (!copied) ++ return -EFAULT; + return copied; + } + diff --git a/queue-5.15/iomap-support-partial-direct-i-o-on-user-copy-failures.patch b/queue-5.15/iomap-support-partial-direct-i-o-on-user-copy-failures.patch new file mode 100644 index 00000000000..3f86cac5e4d --- /dev/null +++ b/queue-5.15/iomap-support-partial-direct-i-o-on-user-copy-failures.patch @@ -0,0 +1,57 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:49 +0800 +Subject: iomap: Support partial direct I/O on user copy failures +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , "Darrick J . Wong" , Anand Jain +Message-ID: + +From: Andreas Gruenbacher + +commit 97308f8b0d867e9ef59528cd97f0db55ffdf5651 upstream + +In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the +IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and +return a partial result. This allows the caller to deal with the page +fault and retry the remainder of the request. + +Signed-off-by: Andreas Gruenbacher +Reviewed-by: Darrick J. 
Wong +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/iomap/direct-io.c | 6 ++++++ + include/linux/iomap.h | 7 +++++++ + 2 files changed, 13 insertions(+) + +--- a/fs/iomap/direct-io.c ++++ b/fs/iomap/direct-io.c +@@ -581,6 +581,12 @@ __iomap_dio_rw(struct kiocb *iocb, struc + if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size) + iov_iter_revert(iter, iomi.pos - dio->i_size); + ++ if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) { ++ if (!(iocb->ki_flags & IOCB_NOWAIT)) ++ wait_for_completion = true; ++ ret = 0; ++ } ++ + /* magic error code to fall back to buffered I/O */ + if (ret == -ENOTBLK) { + wait_for_completion = true; +--- a/include/linux/iomap.h ++++ b/include/linux/iomap.h +@@ -330,6 +330,13 @@ struct iomap_dio_ops { + */ + #define IOMAP_DIO_OVERWRITE_ONLY (1 << 1) + ++/* ++ * When a page fault occurs, return a partial synchronous result and allow ++ * the caller to retry the rest of the operation after dealing with the page ++ * fault. ++ */ ++#define IOMAP_DIO_PARTIAL (1 << 2) ++ + ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, + const struct iomap_ops *ops, const struct iomap_dio_ops *dops, + unsigned int dio_flags); diff --git a/queue-5.15/iov_iter-introduce-fault_in_iov_iter_writeable.patch b/queue-5.15/iov_iter-introduce-fault_in_iov_iter_writeable.patch new file mode 100644 index 00000000000..25020def7a4 --- /dev/null +++ b/queue-5.15/iov_iter-introduce-fault_in_iov_iter_writeable.patch @@ -0,0 +1,169 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:41 +0800 +Subject: iov_iter: Introduce fault_in_iov_iter_writeable +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <8181618a0badc14fd9bbe13e26164bc601c59df9.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit cdd591fc86e38ad3899196066219fbbd845f3162 upstream + +Introduce a new fault_in_iov_iter_writeable helper for safely faulting +in an iterator for writing. Uses get_user_pages() to fault in the pages +without actually writing to them, which would be destructive. + +We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that +the iterator passed to .read_iter isn't in memory. + +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + include/linux/pagemap.h | 1 + include/linux/uio.h | 1 + lib/iov_iter.c | 39 +++++++++++++++++++++++++++++ + mm/gup.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++ + 4 files changed, 104 insertions(+) + +--- a/include/linux/pagemap.h ++++ b/include/linux/pagemap.h +@@ -736,6 +736,7 @@ extern void add_page_wait_queue(struct p + * Fault in userspace address range. 
+ */ + size_t fault_in_writeable(char __user *uaddr, size_t size); ++size_t fault_in_safe_writeable(const char __user *uaddr, size_t size); + size_t fault_in_readable(const char __user *uaddr, size_t size); + + int add_to_page_cache_locked(struct page *page, struct address_space *mapping, +--- a/include/linux/uio.h ++++ b/include/linux/uio.h +@@ -134,6 +134,7 @@ size_t copy_page_from_iter_atomic(struct + void iov_iter_advance(struct iov_iter *i, size_t bytes); + void iov_iter_revert(struct iov_iter *i, size_t bytes); + size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes); ++size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes); + size_t iov_iter_single_seg_count(const struct iov_iter *i); + size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes, + struct iov_iter *i); +--- a/lib/iov_iter.c ++++ b/lib/iov_iter.c +@@ -468,6 +468,45 @@ size_t fault_in_iov_iter_readable(const + } + EXPORT_SYMBOL(fault_in_iov_iter_readable); + ++/* ++ * fault_in_iov_iter_writeable - fault in iov iterator for writing ++ * @i: iterator ++ * @size: maximum length ++ * ++ * Faults in the iterator using get_user_pages(), i.e., without triggering ++ * hardware page faults. This is primarily useful when we already know that ++ * some or all of the pages in @i aren't in memory. ++ * ++ * Returns the number of bytes not faulted in, like copy_to_user() and ++ * copy_from_user(). ++ * ++ * Always returns 0 for non-user-space iterators. ++ */ ++size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size) ++{ ++ if (iter_is_iovec(i)) { ++ size_t count = min(size, iov_iter_count(i)); ++ const struct iovec *p; ++ size_t skip; ++ ++ size -= count; ++ for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) { ++ size_t len = min(count, p->iov_len - skip); ++ size_t ret; ++ ++ if (unlikely(!len)) ++ continue; ++ ret = fault_in_safe_writeable(p->iov_base + skip, len); ++ count -= len - ret; ++ if (ret) ++ break; ++ } ++ return count + size; ++ } ++ return 0; ++} ++EXPORT_SYMBOL(fault_in_iov_iter_writeable); ++ + void iov_iter_init(struct iov_iter *i, unsigned int direction, + const struct iovec *iov, unsigned long nr_segs, + size_t count) +--- a/mm/gup.c ++++ b/mm/gup.c +@@ -1716,6 +1716,69 @@ out: + } + EXPORT_SYMBOL(fault_in_writeable); + ++/* ++ * fault_in_safe_writeable - fault in an address range for writing ++ * @uaddr: start of address range ++ * @size: length of address range ++ * ++ * Faults in an address range using get_user_pages, i.e., without triggering ++ * hardware page faults. This is primarily useful when we already know that ++ * some or all of the pages in the address range aren't in memory. ++ * ++ * Other than fault_in_writeable(), this function is non-destructive. ++ * ++ * Note that we don't pin or otherwise hold the pages referenced that we fault ++ * in. There's no guarantee that they'll stay in memory for any duration of ++ * time. ++ * ++ * Returns the number of bytes not faulted in, like copy_to_user() and ++ * copy_from_user(). 
++ */ ++size_t fault_in_safe_writeable(const char __user *uaddr, size_t size) ++{ ++ unsigned long start = (unsigned long)untagged_addr(uaddr); ++ unsigned long end, nstart, nend; ++ struct mm_struct *mm = current->mm; ++ struct vm_area_struct *vma = NULL; ++ int locked = 0; ++ ++ nstart = start & PAGE_MASK; ++ end = PAGE_ALIGN(start + size); ++ if (end < nstart) ++ end = 0; ++ for (; nstart != end; nstart = nend) { ++ unsigned long nr_pages; ++ long ret; ++ ++ if (!locked) { ++ locked = 1; ++ mmap_read_lock(mm); ++ vma = find_vma(mm, nstart); ++ } else if (nstart >= vma->vm_end) ++ vma = vma->vm_next; ++ if (!vma || vma->vm_start >= end) ++ break; ++ nend = end ? min(end, vma->vm_end) : vma->vm_end; ++ if (vma->vm_flags & (VM_IO | VM_PFNMAP)) ++ continue; ++ if (nstart < vma->vm_start) ++ nstart = vma->vm_start; ++ nr_pages = (nend - nstart) / PAGE_SIZE; ++ ret = __get_user_pages_locked(mm, nstart, nr_pages, ++ NULL, NULL, &locked, ++ FOLL_TOUCH | FOLL_WRITE); ++ if (ret <= 0) ++ break; ++ nend = nstart + ret * PAGE_SIZE; ++ } ++ if (locked) ++ mmap_read_unlock(mm); ++ if (nstart == end) ++ return 0; ++ return size - min_t(size_t, nstart - start, size); ++} ++EXPORT_SYMBOL(fault_in_safe_writeable); ++ + /** + * fault_in_readable - fault in userspace address range for reading + * @uaddr: start of user address range diff --git a/queue-5.15/iov_iter-introduce-nofault-flag-to-disable-page-faults.patch b/queue-5.15/iov_iter-introduce-nofault-flag-to-disable-page-faults.patch new file mode 100644 index 00000000000..94faa821a15 --- /dev/null +++ b/queue-5.15/iov_iter-introduce-nofault-flag-to-disable-page-faults.patch @@ -0,0 +1,92 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:52 +0800 +Subject: iov_iter: Introduce nofault flag to disable page faults +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <56bf354a8e9c5f2d3d9482c90510d4ff0890d996.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit 3337ab08d08b1a375f88471d9c8b1cac968cb054 upstream + +Introduce a new nofault flag to indicate to iov_iter_get_pages not to +fault in user pages. + +This is implemented by passing the FOLL_NOFAULT flag to get_user_pages, +which causes get_user_pages to fail when it would otherwise fault in a +page. We'll use the ->nofault flag to prevent iomap_dio_rw from faulting +in pages when page faults are not allowed. 
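The intended caller-side pattern, as it appears in the gfs2 direct I/O hunks earlier in this series (illustrative excerpt, not part of this patch), is:

	from->nofault = true;		/* don't fault in pages inside iomap_dio_rw */
	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
			   IOMAP_DIO_PARTIAL, read);
	from->nofault = false;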
+ +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + include/linux/uio.h | 1 + + lib/iov_iter.c | 20 +++++++++++++++----- + 2 files changed, 16 insertions(+), 5 deletions(-) + +--- a/include/linux/uio.h ++++ b/include/linux/uio.h +@@ -35,6 +35,7 @@ struct iov_iter_state { + + struct iov_iter { + u8 iter_type; ++ bool nofault; + bool data_source; + size_t iov_offset; + size_t count; +--- a/lib/iov_iter.c ++++ b/lib/iov_iter.c +@@ -514,6 +514,7 @@ void iov_iter_init(struct iov_iter *i, u + WARN_ON(direction & ~(READ | WRITE)); + *i = (struct iov_iter) { + .iter_type = ITER_IOVEC, ++ .nofault = false, + .data_source = direction, + .iov = iov, + .nr_segs = nr_segs, +@@ -1529,13 +1530,17 @@ ssize_t iov_iter_get_pages(struct iov_it + return 0; + + if (likely(iter_is_iovec(i))) { ++ unsigned int gup_flags = 0; + unsigned long addr; + ++ if (iov_iter_rw(i) != WRITE) ++ gup_flags |= FOLL_WRITE; ++ if (i->nofault) ++ gup_flags |= FOLL_NOFAULT; ++ + addr = first_iovec_segment(i, &len, start, maxsize, maxpages); + n = DIV_ROUND_UP(len, PAGE_SIZE); +- res = get_user_pages_fast(addr, n, +- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0, +- pages); ++ res = get_user_pages_fast(addr, n, gup_flags, pages); + if (unlikely(res <= 0)) + return res; + return (res == n ? len : res * PAGE_SIZE) - *start; +@@ -1651,15 +1656,20 @@ ssize_t iov_iter_get_pages_alloc(struct + return 0; + + if (likely(iter_is_iovec(i))) { ++ unsigned int gup_flags = 0; + unsigned long addr; + ++ if (iov_iter_rw(i) != WRITE) ++ gup_flags |= FOLL_WRITE; ++ if (i->nofault) ++ gup_flags |= FOLL_NOFAULT; ++ + addr = first_iovec_segment(i, &len, start, maxsize, ~0U); + n = DIV_ROUND_UP(len, PAGE_SIZE); + p = get_pages_array(n); + if (!p) + return -ENOMEM; +- res = get_user_pages_fast(addr, n, +- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0, p); ++ res = get_user_pages_fast(addr, n, gup_flags, p); + if (unlikely(res <= 0)) { + kvfree(p); + *pages = NULL; diff --git a/queue-5.15/iov_iter-turn-iov_iter_fault_in_readable-into-fault_in_iov_iter_readable.patch b/queue-5.15/iov_iter-turn-iov_iter_fault_in_readable-into-fault_in_iov_iter_readable.patch new file mode 100644 index 00000000000..945276d2128 --- /dev/null +++ b/queue-5.15/iov_iter-turn-iov_iter_fault_in_readable-into-fault_in_iov_iter_readable.patch @@ -0,0 +1,181 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:40 +0800 +Subject: iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Andreas Gruenbacher , Anand Jain +Message-ID: <2f18cef5634943c5bcd007b3753c3839feee9bd9.1649951733.git.anand.jain@oracle.com> + +From: Andreas Gruenbacher + +commit a6294593e8a1290091d0b078d5d33da5e0cd3dfe upstream + +Turn iov_iter_fault_in_readable into a function that returns the number +of bytes not faulted in, similar to copy_to_user, instead of returning a +non-zero value when any of the requested pages couldn't be faulted in. +This supports the existing users that require all pages to be faulted in +as well as new users that are happy if any pages can be faulted in. + +Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make +sure this change doesn't silently break things. 
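In practice (illustrative sketch, drawn from the callers converted below and from the gfs2 hunks earlier in this series), existing "all or nothing" callers keep treating any nonzero return as a failure, while new callers can retry on partial progress:

	/* all-or-nothing caller, e.g. a buffered write path */
	if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
		status = -EFAULT;
		break;
	}

	/* partial-progress caller: retry as long as something was faulted in */
	leftover = fault_in_iov_iter_readable(from, window_size);
	if (leftover != window_size)
		goto retry;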
+ +Signed-off-by: Andreas Gruenbacher +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/file.c | 2 +- + fs/f2fs/file.c | 2 +- + fs/fuse/file.c | 2 +- + fs/iomap/buffered-io.c | 2 +- + fs/ntfs/file.c | 2 +- + fs/ntfs3/file.c | 2 +- + include/linux/uio.h | 2 +- + lib/iov_iter.c | 33 +++++++++++++++++++++------------ + mm/filemap.c | 2 +- + 9 files changed, 29 insertions(+), 20 deletions(-) + +--- a/fs/btrfs/file.c ++++ b/fs/btrfs/file.c +@@ -1709,7 +1709,7 @@ static noinline ssize_t btrfs_buffered_w + * Fault pages before locking them in prepare_pages + * to avoid recursive lock + */ +- if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) { ++ if (unlikely(fault_in_iov_iter_readable(i, write_bytes))) { + ret = -EFAULT; + break; + } +--- a/fs/f2fs/file.c ++++ b/fs/f2fs/file.c +@@ -4279,7 +4279,7 @@ static ssize_t f2fs_file_write_iter(stru + size_t target_size = 0; + int err; + +- if (iov_iter_fault_in_readable(from, iov_iter_count(from))) ++ if (fault_in_iov_iter_readable(from, iov_iter_count(from))) + set_inode_flag(inode, FI_NO_PREALLOC); + + if ((iocb->ki_flags & IOCB_NOWAIT)) { +--- a/fs/fuse/file.c ++++ b/fs/fuse/file.c +@@ -1164,7 +1164,7 @@ static ssize_t fuse_fill_write_pages(str + + again: + err = -EFAULT; +- if (iov_iter_fault_in_readable(ii, bytes)) ++ if (fault_in_iov_iter_readable(ii, bytes)) + break; + + err = -ENOMEM; +--- a/fs/iomap/buffered-io.c ++++ b/fs/iomap/buffered-io.c +@@ -757,7 +757,7 @@ again: + * same page as we're writing to, without it being marked + * up-to-date. + */ +- if (unlikely(iov_iter_fault_in_readable(i, bytes))) { ++ if (unlikely(fault_in_iov_iter_readable(i, bytes))) { + status = -EFAULT; + break; + } +--- a/fs/ntfs/file.c ++++ b/fs/ntfs/file.c +@@ -1829,7 +1829,7 @@ again: + * pages being swapped out between us bringing them into memory + * and doing the actual copying. + */ +- if (unlikely(iov_iter_fault_in_readable(i, bytes))) { ++ if (unlikely(fault_in_iov_iter_readable(i, bytes))) { + status = -EFAULT; + break; + } +--- a/fs/ntfs3/file.c ++++ b/fs/ntfs3/file.c +@@ -989,7 +989,7 @@ static ssize_t ntfs_compress_write(struc + frame_vbo = pos & ~(frame_size - 1); + index = frame_vbo >> PAGE_SHIFT; + +- if (unlikely(iov_iter_fault_in_readable(from, bytes))) { ++ if (unlikely(fault_in_iov_iter_readable(from, bytes))) { + err = -EFAULT; + goto out; + } +--- a/include/linux/uio.h ++++ b/include/linux/uio.h +@@ -133,7 +133,7 @@ size_t copy_page_from_iter_atomic(struct + size_t bytes, struct iov_iter *i); + void iov_iter_advance(struct iov_iter *i, size_t bytes); + void iov_iter_revert(struct iov_iter *i, size_t bytes); +-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes); ++size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes); + size_t iov_iter_single_seg_count(const struct iov_iter *i); + size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes, + struct iov_iter *i); +--- a/lib/iov_iter.c ++++ b/lib/iov_iter.c +@@ -431,33 +431,42 @@ out: + } + + /* ++ * fault_in_iov_iter_readable - fault in iov iterator for reading ++ * @i: iterator ++ * @size: maximum length ++ * + * Fault in one or more iovecs of the given iov_iter, to a maximum length of +- * bytes. For each iovec, fault in each page that constitutes the iovec. ++ * @size. For each iovec, fault in each page that constitutes the iovec. ++ * ++ * Returns the number of bytes not faulted in (like copy_to_user() and ++ * copy_from_user()). 
+ * +- * Return 0 on success, or non-zero if the memory could not be accessed (i.e. +- * because it is an invalid address). ++ * Always returns 0 for non-userspace iterators. + */ +-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes) ++size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size) + { + if (iter_is_iovec(i)) { ++ size_t count = min(size, iov_iter_count(i)); + const struct iovec *p; + size_t skip; + +- if (bytes > i->count) +- bytes = i->count; +- for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) { +- size_t len = min(bytes, p->iov_len - skip); ++ size -= count; ++ for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) { ++ size_t len = min(count, p->iov_len - skip); ++ size_t ret; + + if (unlikely(!len)) + continue; +- if (fault_in_readable(p->iov_base + skip, len)) +- return -EFAULT; +- bytes -= len; ++ ret = fault_in_readable(p->iov_base + skip, len); ++ count -= len - ret; ++ if (ret) ++ break; + } ++ return count + size; + } + return 0; + } +-EXPORT_SYMBOL(iov_iter_fault_in_readable); ++EXPORT_SYMBOL(fault_in_iov_iter_readable); + + void iov_iter_init(struct iov_iter *i, unsigned int direction, + const struct iovec *iov, unsigned long nr_segs, +--- a/mm/filemap.c ++++ b/mm/filemap.c +@@ -3760,7 +3760,7 @@ again: + * same page as we're writing to, without it being marked + * up-to-date. + */ +- if (unlikely(iov_iter_fault_in_readable(i, bytes))) { ++ if (unlikely(fault_in_iov_iter_readable(i, bytes))) { + status = -EFAULT; + break; + } diff --git a/queue-5.15/mm-gup-make-fault_in_safe_writeable-use-fixup_user_fault.patch b/queue-5.15/mm-gup-make-fault_in_safe_writeable-use-fixup_user_fault.patch new file mode 100644 index 00000000000..4b104ff1a9e --- /dev/null +++ b/queue-5.15/mm-gup-make-fault_in_safe_writeable-use-fixup_user_fault.patch @@ -0,0 +1,117 @@ +From foo@baz Fri Apr 29 11:07:48 AM CEST 2022 +From: Anand Jain +Date: Fri, 15 Apr 2022 06:28:56 +0800 +Subject: mm: gup: make fault_in_safe_writeable() use fixup_user_fault() +To: stable@vger.kernel.org +Cc: linux-btrfs@vger.kernel.org, Linus Torvalds , Andreas Gruenbacher , David Hildenbrand , Anand Jain +Message-ID: + +From: Linus Torvalds + +commit fe673d3f5bf1fc50cdc4b754831db91a2ec10126 upstream + +Instead of using GUP, make fault_in_safe_writeable() actually force a +'handle_mm_fault()' using the same fixup_user_fault() machinery that +futexes already use. + +Using the GUP machinery meant that fault_in_safe_writeable() did not do +everything that a real fault would do, ranging from not auto-expanding +the stack segment, to not updating accessed or dirty flags in the page +tables (GUP sets those flags on the pages themselves). + +The latter causes problems on architectures (like s390) that do accessed +bit handling in software, which meant that fault_in_safe_writeable() +didn't actually do all the fault handling it needed to, and trying to +access the user address afterwards would still cause faults. 
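The replacement fault loop is essentially the following (condensed from the mm/gup.c hunk below; the size and wrap-around checks are elided):

	mmap_read_lock(mm);
	do {
		if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE, &unlocked))
			break;
		start = (start + PAGE_SIZE) & PAGE_MASK;
	} while (start != end);
	mmap_read_unlock(mm);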
+ +Reported-and-tested-by: Andreas Gruenbacher +Fixes: cdd591fc86e3 ("iov_iter: Introduce fault_in_iov_iter_writeable") +Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/ +Acked-by: David Hildenbrand +Signed-off-by: Linus Torvalds +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + mm/gup.c | 57 +++++++++++++++++++-------------------------------------- + 1 file changed, 19 insertions(+), 38 deletions(-) + +--- a/mm/gup.c ++++ b/mm/gup.c +@@ -1723,11 +1723,11 @@ EXPORT_SYMBOL(fault_in_writeable); + * @uaddr: start of address range + * @size: length of address range + * +- * Faults in an address range using get_user_pages, i.e., without triggering +- * hardware page faults. This is primarily useful when we already know that +- * some or all of the pages in the address range aren't in memory. ++ * Faults in an address range for writing. This is primarily useful when we ++ * already know that some or all of the pages in the address range aren't in ++ * memory. + * +- * Other than fault_in_writeable(), this function is non-destructive. ++ * Unlike fault_in_writeable(), this function is non-destructive. + * + * Note that we don't pin or otherwise hold the pages referenced that we fault + * in. There's no guarantee that they'll stay in memory for any duration of +@@ -1738,46 +1738,27 @@ EXPORT_SYMBOL(fault_in_writeable); + */ + size_t fault_in_safe_writeable(const char __user *uaddr, size_t size) + { +- unsigned long start = (unsigned long)untagged_addr(uaddr); +- unsigned long end, nstart, nend; ++ unsigned long start = (unsigned long)uaddr, end; + struct mm_struct *mm = current->mm; +- struct vm_area_struct *vma = NULL; +- int locked = 0; ++ bool unlocked = false; + +- nstart = start & PAGE_MASK; ++ if (unlikely(size == 0)) ++ return 0; + end = PAGE_ALIGN(start + size); +- if (end < nstart) ++ if (end < start) + end = 0; +- for (; nstart != end; nstart = nend) { +- unsigned long nr_pages; +- long ret; + +- if (!locked) { +- locked = 1; +- mmap_read_lock(mm); +- vma = find_vma(mm, nstart); +- } else if (nstart >= vma->vm_end) +- vma = vma->vm_next; +- if (!vma || vma->vm_start >= end) +- break; +- nend = end ? 
min(end, vma->vm_end) : vma->vm_end; +- if (vma->vm_flags & (VM_IO | VM_PFNMAP)) +- continue; +- if (nstart < vma->vm_start) +- nstart = vma->vm_start; +- nr_pages = (nend - nstart) / PAGE_SIZE; +- ret = __get_user_pages_locked(mm, nstart, nr_pages, +- NULL, NULL, &locked, +- FOLL_TOUCH | FOLL_WRITE); +- if (ret <= 0) ++ mmap_read_lock(mm); ++ do { ++ if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE, &unlocked)) + break; +- nend = nstart + ret * PAGE_SIZE; +- } +- if (locked) +- mmap_read_unlock(mm); +- if (nstart == end) +- return 0; +- return size - min_t(size_t, nstart - start, size); ++ start = (start + PAGE_SIZE) & PAGE_MASK; ++ } while (start != end); ++ mmap_read_unlock(mm); ++ ++ if (size > (unsigned long)uaddr - start) ++ return size - ((unsigned long)uaddr - start); ++ return 0; + } + EXPORT_SYMBOL(fault_in_safe_writeable); + diff --git a/queue-5.15/mm-kfence-fix-objcgs-vector-allocation.patch b/queue-5.15/mm-kfence-fix-objcgs-vector-allocation.patch new file mode 100644 index 00000000000..9aeffb9a21d --- /dev/null +++ b/queue-5.15/mm-kfence-fix-objcgs-vector-allocation.patch @@ -0,0 +1,80 @@ +From 8f0b36497303487d5a32c75789c77859cc2ee895 Mon Sep 17 00:00:00 2001 +From: Muchun Song +Date: Fri, 1 Apr 2022 11:28:36 -0700 +Subject: mm: kfence: fix objcgs vector allocation + +From: Muchun Song + +commit 8f0b36497303487d5a32c75789c77859cc2ee895 upstream. + +If the kfence object is allocated to be used for objects vector, then +this slot of the pool eventually being occupied permanently since the +vector is never freed. The solutions could be (1) freeing vector when +the kfence object is freed or (2) allocating all vectors statically. + +Since the memory consumption of object vectors is low, it is better to +chose (2) to fix the issue and it is also can reduce overhead of vectors +allocating in the future. + +Link: https://lkml.kernel.org/r/20220328132843.16624-1-songmuchun@bytedance.com +Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB") +Signed-off-by: Muchun Song +Reviewed-by: Marco Elver +Reviewed-by: Roman Gushchin +Cc: Alexander Potapenko +Cc: Dmitry Vyukov +Cc: Xiongchun Duan +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + mm/kfence/core.c | 11 ++++++++++- + mm/kfence/kfence.h | 3 +++ + 2 files changed, 13 insertions(+), 1 deletion(-) + +--- a/mm/kfence/core.c ++++ b/mm/kfence/core.c +@@ -528,6 +528,8 @@ static bool __init kfence_init_pool(void + * enters __slab_free() slow-path. + */ + for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { ++ struct page *page = &pages[i]; ++ + if (!i || (i % 2)) + continue; + +@@ -535,7 +537,11 @@ static bool __init kfence_init_pool(void + if (WARN_ON(compound_head(&pages[i]) != &pages[i])) + goto err; + +- __SetPageSlab(&pages[i]); ++ __SetPageSlab(page); ++#ifdef CONFIG_MEMCG ++ page->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg | ++ MEMCG_DATA_OBJCGS; ++#endif + } + + /* +@@ -911,6 +917,9 @@ void __kfence_free(void *addr) + { + struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr); + ++#ifdef CONFIG_MEMCG ++ KFENCE_WARN_ON(meta->objcg); ++#endif + /* + * If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing + * the object, as the object page may be recycled for other-typed +--- a/mm/kfence/kfence.h ++++ b/mm/kfence/kfence.h +@@ -89,6 +89,9 @@ struct kfence_metadata { + struct kfence_track free_track; + /* For updating alloc_covered on frees. 
*/ + u32 alloc_stack_hash; ++#ifdef CONFIG_MEMCG ++ struct obj_cgroup *objcg; ++#endif + }; + + extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS]; diff --git a/queue-5.15/series b/queue-5.15/series index a5f95ec48a2..79a632b3ff5 100644 --- a/queue-5.15/series +++ b/queue-5.15/series @@ -11,3 +11,22 @@ bpf-selftests-test-ptr_to_rdonly_mem.patch bpf-fix-crash-due-to-out-of-bounds-access-into-reg2btf_ids.patch spi-cadence-quadspi-fix-write-completion-support.patch arm-dts-socfpga-change-qspi-to-intel-socfpga-qspi.patch +mm-kfence-fix-objcgs-vector-allocation.patch +gup-turn-fault_in_pages_-readable-writeable-into-fault_in_-readable-writeable.patch +iov_iter-turn-iov_iter_fault_in_readable-into-fault_in_iov_iter_readable.patch +iov_iter-introduce-fault_in_iov_iter_writeable.patch +gfs2-add-wrapper-for-iomap_file_buffered_write.patch +gfs2-clean-up-function-may_grant.patch +gfs2-introduce-flag-for-glock-holder-auto-demotion.patch +gfs2-move-the-inode-glock-locking-to-gfs2_file_buffered_write.patch +gfs2-eliminate-ip-i_gh.patch +gfs2-fix-mmap-page-fault-deadlocks-for-buffered-i-o.patch +iomap-fix-iomap_dio_rw-return-value-for-user-copies.patch +iomap-support-partial-direct-i-o-on-user-copy-failures.patch +iomap-add-done_before-argument-to-iomap_dio_rw.patch +gup-introduce-foll_nofault-flag-to-disable-page-faults.patch +iov_iter-introduce-nofault-flag-to-disable-page-faults.patch +gfs2-fix-mmap-page-fault-deadlocks-for-direct-i-o.patch +btrfs-fix-deadlock-due-to-page-faults-during-direct-io-reads-and-writes.patch +btrfs-fallback-to-blocking-mode-when-doing-async-dio-over-multiple-extents.patch +mm-gup-make-fault_in_safe_writeable-use-fixup_user_fault.patch
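
Note on the new calling convention (illustrative only, not part of any patch above): iov_iter_fault_in_readable() used to return 0 on success or -EFAULT, while fault_in_iov_iter_readable() returns the number of bytes not faulted in, 0 meaning full success, in the style of copy_from_user(). The buffered-write loops touched in the diffs above simply treat any nonzero return as -EFAULT and break out. The small userspace sketch below mimics that return convention so the caller-side checks are easier to follow; fake_fault_in() and its "accessible" parameter are made-up stand-ins for demonstration, not kernel APIs.

/*
 * Userspace sketch of the fault-in return convention introduced by
 * this series: the helper reports how many bytes could NOT be made
 * accessible (0 == everything faulted in), and the caller decides
 * whether to bail out or shrink the copy and make partial progress.
 */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for fault_in_iov_iter_readable(): bytes not faulted in. */
static size_t fake_fault_in(const char *base, size_t len, size_t accessible)
{
	(void)base;		/* a real implementation would touch the pages */
	return len > accessible ? len - accessible : 0;
}

/* Caller pattern matching the buffered-write loops in the diffs. */
static int copy_step(const char *buf, size_t bytes, size_t accessible)
{
	size_t not_faulted = fake_fault_in(buf, bytes, accessible);

	if (not_faulted) {
		/*
		 * Old API: only "failed" was visible, so -EFAULT even if
		 * most of the range was usable.  New API: a caller could
		 * instead retry with bytes - not_faulted and still make
		 * progress; the loops in these patches keep bailing out.
		 */
		return -EFAULT;
	}
	return 0;		/* safe to copy 'bytes' without faulting */
}

int main(void)
{
	printf("all readable:  %d\n", copy_step("x", 16, 16));	/* 0 */
	printf("partly mapped: %d\n", copy_step("x", 16, 4));	/* -EFAULT */
	return 0;
}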