From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Mon, 17 Feb 2020 19:19:38 +0000 (+0100)
Subject: 4.4-stable patches
X-Git-Tag: v4.19.105~26
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=e9c75ea1ac2374e053718da6eb055f368af36f22;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
	btrfs-fix-race-between-using-extent-maps-and-merging-them.patch
	btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch
	ext4-fix-checksum-errors-with-indexed-dirs.patch
---

diff --git a/queue-4.4/btrfs-fix-race-between-using-extent-maps-and-merging-them.patch b/queue-4.4/btrfs-fix-race-between-using-extent-maps-and-merging-them.patch
new file mode 100644
index 00000000000..804677192c3
--- /dev/null
+++ b/queue-4.4/btrfs-fix-race-between-using-extent-maps-and-merging-them.patch
@@ -0,0 +1,128 @@
+From ac05ca913e9f3871126d61da275bfe8516ff01ca Mon Sep 17 00:00:00 2001
+From: Filipe Manana <fdmanana@suse.com>
+Date: Fri, 31 Jan 2020 14:06:07 +0000
+Subject: Btrfs: fix race between using extent maps and merging them
+
+From: Filipe Manana <fdmanana@suse.com>
+
+commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream.
+
+We have a few cases where we allow an extent map that is in an extent map
+tree to be merged with other extents in the tree. Such cases include the
+unpinning of an extent after the respective ordered extent completed or
+after logging an extent during a fast fsync. This can lead to subtle and
+dangerous problems because when doing the merge some other task might be
+using the same extent map and as consequence see an inconsistent state of
+the extent map - for example sees the new length but has seen the old start
+offset.
+
+With luck this triggers a BUG_ON(), and not some silent bug, such as the
+following one in __do_readpage():
+
+  $ cat -n fs/btrfs/extent_io.c
+  3061  static int __do_readpage(struct extent_io_tree *tree,
+  3062                           struct page *page,
+  (...)
+  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
+  3128                                        end - cur + 1, get_extent, em_cached);
+  3129                  if (IS_ERR_OR_NULL(em)) {
+  3130                          SetPageError(page);
+  3131                          unlock_extent(tree, cur, end);
+  3132                          break;
+  3133                  }
+  3134                  extent_offset = cur - em->start;
+  3135                  BUG_ON(extent_map_end(em) <= cur);
+  (...)
+
+Consider the following example scenario, where we end up hitting the
+BUG_ON() in __do_readpage().
+
+We have an inode with a size of 8KiB and 2 extent maps:
+
+  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
+            a previous transaction
+
+  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
+            persisted but writeback started for it already. The extent map
+	    is pinned since there's writeback and an ordered extent in
+	    progress, so it can not be merged with extent map A yet
+
+The following sequence of steps leads to the BUG_ON():
+
+1) The ordered extent for extent B completes, the respective page gets its
+   writeback bit cleared and the extent map is unpinned, at that point it
+   is not yet merged with extent map A because it's in the list of modified
+   extents;
+
+2) Due to memory pressure, or some other reason, the MM subsystem releases
+   the page corresponding to extent B - btrfs_releasepage() is called and
+   returns 1, meaning the page can be released as it's not dirty, not under
+   writeback anymore and the extent range is not locked in the inode's
+   iotree. However the extent map is not released, either because we are
+   not in a context that allows memory allocations to block or because the
+   inode's size is smaller than 16MiB - in this case our inode has a size
+   of 8KiB;
+
+3) Task B needs to read extent B and ends up __do_readpage() through the
+   btrfs_readpage() callback. At __do_readpage() it gets a reference to
+   extent map B;
+
+4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
+   while holding the write lock on the inode's extent map tree - this
+   results in try_merge_map() being called and since it's possible to merge
+   extent map B with extent map A now (the extent map B was removed from
+   the list of modified extents), the merging begins - it sets extent map
+   B's start offset to 0 (was 4KiB), but before it increments the map's
+   length to 8KiB (4kb + 4KiB), task A is at:
+
+   BUG_ON(extent_map_end(em) <= cur);
+
+   The call to extent_map_end() sees the extent map has a start of 0
+   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
+   the BUG_ON() is triggered.
+
+So it's dangerous to modify an extent map that is in the tree, because some
+other task might have got a reference to it before and still using it, and
+needs to see a consistent map while using it. Generally this is very rare
+since most paths that lookup and use extent maps also have the file range
+locked in the inode's iotree. The fsync path is pretty much the only
+exception where we don't do it to avoid serialization with concurrent
+reads.
+
+Fix this by not allowing an extent map do be merged if if it's being used
+by tasks other then the one attempting to merge the extent map (when the
+reference count of the extent map is greater than 2).
+
+Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
+Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
+Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
+CC: stable@vger.kernel.org # 4.4+
+Reviewed-by: Josef Bacik <josef@toxicpanda.com>
+Signed-off-by: Filipe Manana <fdmanana@suse.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/extent_map.c |   11 +++++++++++
+ 1 file changed, 11 insertions(+)
+
+--- a/fs/btrfs/extent_map.c
++++ b/fs/btrfs/extent_map.c
+@@ -227,6 +227,17 @@ static void try_merge_map(struct extent_
+ 	struct extent_map *merge = NULL;
+ 	struct rb_node *rb;
+ 
++	/*
++	 * We can't modify an extent map that is in the tree and that is being
++	 * used by another task, as it can cause that other task to see it in
++	 * inconsistent state during the merging. We always have 1 reference for
++	 * the tree and 1 for this task (which is unpinning the extent map or
++	 * clearing the logging flag), so anything > 2 means it's being used by
++	 * other tasks too.
++	 */
++	if (refcount_read(&em->refs) > 2)
++		return;
++
+ 	if (em->start != 0) {
+ 		rb = rb_prev(&em->rb_node);
+ 		if (rb)
diff --git a/queue-4.4/btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch b/queue-4.4/btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch
new file mode 100644
index 00000000000..4a633e49a06
--- /dev/null
+++ b/queue-4.4/btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch
@@ -0,0 +1,37 @@
+From 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 Mon Sep 17 00:00:00 2001
+From: David Sterba <dsterba@suse.com>
+Date: Wed, 5 Feb 2020 17:12:28 +0100
+Subject: btrfs: log message when rw remount is attempted with unclean tree-log
+
+From: David Sterba <dsterba@suse.com>
+
+commit 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 upstream.
+
+A remount to a read-write filesystem is not safe when there's tree-log
+to be replayed. Files that could be opened until now might be affected
+by the changes in the tree-log.
+
+A regular mount is needed to replay the log so the filesystem presents
+the consistent view with the pending changes included.
+
+CC: stable@vger.kernel.org # 4.4+
+Reviewed-by: Anand Jain <anand.jain@oracle.com>
+Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/super.c |    2 ++
+ 1 file changed, 2 insertions(+)
+
+--- a/fs/btrfs/super.c
++++ b/fs/btrfs/super.c
+@@ -1702,6 +1702,8 @@ static int btrfs_remount(struct super_bl
+ 		}
+ 
+ 		if (btrfs_super_log_root(fs_info->super_copy) != 0) {
++			btrfs_warn(fs_info,
++		"mount required to replay tree-log, cannot remount read-write");
+ 			ret = -EINVAL;
+ 			goto restore;
+ 		}
diff --git a/queue-4.4/ext4-fix-checksum-errors-with-indexed-dirs.patch b/queue-4.4/ext4-fix-checksum-errors-with-indexed-dirs.patch
new file mode 100644
index 00000000000..c84fa818f01
--- /dev/null
+++ b/queue-4.4/ext4-fix-checksum-errors-with-indexed-dirs.patch
@@ -0,0 +1,125 @@
+From 48a34311953d921235f4d7bbd2111690d2e469cf Mon Sep 17 00:00:00 2001
+From: Jan Kara <jack@suse.cz>
+Date: Mon, 10 Feb 2020 15:43:16 +0100
+Subject: ext4: fix checksum errors with indexed dirs
+
+From: Jan Kara <jack@suse.cz>
+
+commit 48a34311953d921235f4d7bbd2111690d2e469cf upstream.
+
+DIR_INDEX has been introduced as a compat ext4 feature. That means that
+even kernels / tools that don't understand the feature may modify the
+filesystem. This works because for kernels not understanding indexed dir
+format, internal htree nodes appear just as empty directory entries.
+Index dir aware kernels then check the htree structure is still
+consistent before using the data. This all worked reasonably well until
+metadata checksums were introduced. The problem is that these
+effectively made DIR_INDEX only ro-compatible because internal htree
+nodes store checksums in a different place than normal directory blocks.
+Thus any modification ignorant to DIR_INDEX (or just clearing
+EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
+and trigger kernel errors. So we have to be more careful when dealing
+with indexed directories on filesystems with checksumming enabled.
+
+1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
+DIR_INDEX is not enabled. This is harsh but it should be very rare (it
+means someone disabled DIR_INDEX on existing filesystem and didn't run
+e2fsck), e2fsck can fix the problem, and we don't want to answer the
+difficult question: "Should we rather corrupt the directory more or
+should we ignore that DIR_INDEX feature is not set?"
+
+2) When we find out htree structure is corrupted (but the filesystem and
+the directory should in support htrees), we continue just ignoring htree
+information for reading but we refuse to add new entries to the
+directory to avoid corrupting it more.
+
+Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
+Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes")
+Reviewed-by: Andreas Dilger <adilger@dilger.ca>
+Signed-off-by: Jan Kara <jack@suse.cz>
+Signed-off-by: Theodore Ts'o <tytso@mit.edu>
+Cc: stable@kernel.org
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/ext4/dir.c   |   14 ++++++++------
+ fs/ext4/ext4.h  |    5 ++++-
+ fs/ext4/inode.c |   12 ++++++++++++
+ fs/ext4/namei.c |    7 +++++++
+ 4 files changed, 31 insertions(+), 7 deletions(-)
+
+--- a/fs/ext4/dir.c
++++ b/fs/ext4/dir.c
+@@ -125,12 +125,14 @@ static int ext4_readdir(struct file *fil
+ 		if (err != ERR_BAD_DX_DIR) {
+ 			return err;
+ 		}
+-		/*
+-		 * We don't set the inode dirty flag since it's not
+-		 * critical that it get flushed back to the disk.
+-		 */
+-		ext4_clear_inode_flag(file_inode(file),
+-				      EXT4_INODE_INDEX);
++		/* Can we just clear INDEX flag to ignore htree information? */
++		if (!ext4_has_metadata_csum(sb)) {
++			/*
++			 * We don't set the inode dirty flag since it's not
++			 * critical that it gets flushed back to the disk.
++			 */
++			ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
++		}
+ 	}
+ 
+ 	if (ext4_has_inline_data(inode)) {
+--- a/fs/ext4/ext4.h
++++ b/fs/ext4/ext4.h
+@@ -2381,8 +2381,11 @@ int ext4_insert_dentry(struct inode *dir
+ 		       struct ext4_filename *fname);
+ static inline void ext4_update_dx_flag(struct inode *inode)
+ {
+-	if (!ext4_has_feature_dir_index(inode->i_sb))
++	if (!ext4_has_feature_dir_index(inode->i_sb)) {
++		/* ext4_iget() should have caught this... */
++		WARN_ON_ONCE(ext4_has_feature_metadata_csum(inode->i_sb));
+ 		ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
++	}
+ }
+ static unsigned char ext4_filetype_table[] = {
+ 	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+--- a/fs/ext4/inode.c
++++ b/fs/ext4/inode.c
+@@ -4325,6 +4325,18 @@ struct inode *ext4_iget(struct super_blo
+ 		ret = -EFSCORRUPTED;
+ 		goto bad_inode;
+ 	}
++	/*
++	 * If dir_index is not enabled but there's dir with INDEX flag set,
++	 * we'd normally treat htree data as empty space. But with metadata
++	 * checksumming that corrupts checksums so forbid that.
++	 */
++	if (!ext4_has_feature_dir_index(sb) && ext4_has_metadata_csum(sb) &&
++	    ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) {
++		ext4_error_inode(inode, function, line, 0,
++			 "iget: Dir with htree data on filesystem without dir_index feature.");
++		ret = -EFSCORRUPTED;
++		goto bad_inode;
++	}
+ 	ei->i_disksize = inode->i_size;
+ #ifdef CONFIG_QUOTA
+ 	ei->i_reserved_quota = 0;
+--- a/fs/ext4/namei.c
++++ b/fs/ext4/namei.c
+@@ -2121,6 +2121,13 @@ static int ext4_add_entry(handle_t *hand
+ 		retval = ext4_dx_add_entry(handle, &fname, dentry, inode);
+ 		if (!retval || (retval != ERR_BAD_DX_DIR))
+ 			goto out;
++		/* Can we just ignore htree data? */
++		if (ext4_has_metadata_csum(sb)) {
++			EXT4_ERROR_INODE(dir,
++				"Directory has corrupted htree index.");
++			retval = -EFSCORRUPTED;
++			goto out;
++		}
+ 		ext4_clear_inode_flag(dir, EXT4_INODE_INDEX);
+ 		dx_fallback++;
+ 		ext4_mark_inode_dirty(handle, dir);
diff --git a/queue-4.4/series b/queue-4.4/series
index 05a0b1aef7c..0c8da8f7eaa 100644
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -1,2 +1,5 @@
 alsa-usb-audio-apply-sample-rate-quirk-for-audioengine-d1.patch
 ubifs-fix-deadlock-in-concurrent-bulk-read-and-writepage.patch
+ext4-fix-checksum-errors-with-indexed-dirs.patch
+btrfs-fix-race-between-using-extent-maps-and-merging-them.patch
+btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch