From: Greg Kroah-Hartman Date: Mon, 17 Feb 2020 19:19:38 +0000 (+0100) Subject: 4.4-stable patches X-Git-Tag: v4.19.105~26 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=e9c75ea1ac2374e053718da6eb055f368af36f22;p=thirdparty%2Fkernel%2Fstable-queue.git 4.4-stable patches added patches: btrfs-fix-race-between-using-extent-maps-and-merging-them.patch btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch ext4-fix-checksum-errors-with-indexed-dirs.patch --- diff --git a/queue-4.4/btrfs-fix-race-between-using-extent-maps-and-merging-them.patch b/queue-4.4/btrfs-fix-race-between-using-extent-maps-and-merging-them.patch new file mode 100644 index 00000000000..804677192c3 --- /dev/null +++ b/queue-4.4/btrfs-fix-race-between-using-extent-maps-and-merging-them.patch @@ -0,0 +1,128 @@ +From ac05ca913e9f3871126d61da275bfe8516ff01ca Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Fri, 31 Jan 2020 14:06:07 +0000 +Subject: Btrfs: fix race between using extent maps and merging them + +From: Filipe Manana + +commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream. + +We have a few cases where we allow an extent map that is in an extent map +tree to be merged with other extents in the tree. Such cases include the +unpinning of an extent after the respective ordered extent completed or +after logging an extent during a fast fsync. This can lead to subtle and +dangerous problems because when doing the merge some other task might be +using the same extent map and as consequence see an inconsistent state of +the extent map - for example sees the new length but has seen the old start +offset. + +With luck this triggers a BUG_ON(), and not some silent bug, such as the +following one in __do_readpage(): + + $ cat -n fs/btrfs/extent_io.c + 3061 static int __do_readpage(struct extent_io_tree *tree, + 3062 struct page *page, + (...) + 3127 em = __get_extent_map(inode, page, pg_offset, cur, + 3128 end - cur + 1, get_extent, em_cached); + 3129 if (IS_ERR_OR_NULL(em)) { + 3130 SetPageError(page); + 3131 unlock_extent(tree, cur, end); + 3132 break; + 3133 } + 3134 extent_offset = cur - em->start; + 3135 BUG_ON(extent_map_end(em) <= cur); + (...) + +Consider the following example scenario, where we end up hitting the +BUG_ON() in __do_readpage(). + +We have an inode with a size of 8KiB and 2 extent maps: + + extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by + a previous transaction + + extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet + persisted but writeback started for it already. The extent map + is pinned since there's writeback and an ordered extent in + progress, so it can not be merged with extent map A yet + +The following sequence of steps leads to the BUG_ON(): + +1) The ordered extent for extent B completes, the respective page gets its + writeback bit cleared and the extent map is unpinned, at that point it + is not yet merged with extent map A because it's in the list of modified + extents; + +2) Due to memory pressure, or some other reason, the MM subsystem releases + the page corresponding to extent B - btrfs_releasepage() is called and + returns 1, meaning the page can be released as it's not dirty, not under + writeback anymore and the extent range is not locked in the inode's + iotree. However the extent map is not released, either because we are + not in a context that allows memory allocations to block or because the + inode's size is smaller than 16MiB - in this case our inode has a size + of 8KiB; + +3) Task B needs to read extent B and ends up __do_readpage() through the + btrfs_readpage() callback. At __do_readpage() it gets a reference to + extent map B; + +4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B + while holding the write lock on the inode's extent map tree - this + results in try_merge_map() being called and since it's possible to merge + extent map B with extent map A now (the extent map B was removed from + the list of modified extents), the merging begins - it sets extent map + B's start offset to 0 (was 4KiB), but before it increments the map's + length to 8KiB (4kb + 4KiB), task A is at: + + BUG_ON(extent_map_end(em) <= cur); + + The call to extent_map_end() sees the extent map has a start of 0 + and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so + the BUG_ON() is triggered. + +So it's dangerous to modify an extent map that is in the tree, because some +other task might have got a reference to it before and still using it, and +needs to see a consistent map while using it. Generally this is very rare +since most paths that lookup and use extent maps also have the file range +locked in the inode's iotree. The fsync path is pretty much the only +exception where we don't do it to avoid serialization with concurrent +reads. + +Fix this by not allowing an extent map do be merged if if it's being used +by tasks other then the one attempting to merge the extent map (when the +reference count of the extent map is greater than 2). + +Reported-by: ryusuke1925 +Reported-by: Koki Mitani +Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211 +CC: stable@vger.kernel.org # 4.4+ +Reviewed-by: Josef Bacik +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/extent_map.c | 11 +++++++++++ + 1 file changed, 11 insertions(+) + +--- a/fs/btrfs/extent_map.c ++++ b/fs/btrfs/extent_map.c +@@ -227,6 +227,17 @@ static void try_merge_map(struct extent_ + struct extent_map *merge = NULL; + struct rb_node *rb; + ++ /* ++ * We can't modify an extent map that is in the tree and that is being ++ * used by another task, as it can cause that other task to see it in ++ * inconsistent state during the merging. We always have 1 reference for ++ * the tree and 1 for this task (which is unpinning the extent map or ++ * clearing the logging flag), so anything > 2 means it's being used by ++ * other tasks too. ++ */ ++ if (refcount_read(&em->refs) > 2) ++ return; ++ + if (em->start != 0) { + rb = rb_prev(&em->rb_node); + if (rb) diff --git a/queue-4.4/btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch b/queue-4.4/btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch new file mode 100644 index 00000000000..4a633e49a06 --- /dev/null +++ b/queue-4.4/btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch @@ -0,0 +1,37 @@ +From 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 Mon Sep 17 00:00:00 2001 +From: David Sterba +Date: Wed, 5 Feb 2020 17:12:28 +0100 +Subject: btrfs: log message when rw remount is attempted with unclean tree-log + +From: David Sterba + +commit 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 upstream. + +A remount to a read-write filesystem is not safe when there's tree-log +to be replayed. Files that could be opened until now might be affected +by the changes in the tree-log. + +A regular mount is needed to replay the log so the filesystem presents +the consistent view with the pending changes included. + +CC: stable@vger.kernel.org # 4.4+ +Reviewed-by: Anand Jain +Reviewed-by: Johannes Thumshirn +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/super.c | 2 ++ + 1 file changed, 2 insertions(+) + +--- a/fs/btrfs/super.c ++++ b/fs/btrfs/super.c +@@ -1702,6 +1702,8 @@ static int btrfs_remount(struct super_bl + } + + if (btrfs_super_log_root(fs_info->super_copy) != 0) { ++ btrfs_warn(fs_info, ++ "mount required to replay tree-log, cannot remount read-write"); + ret = -EINVAL; + goto restore; + } diff --git a/queue-4.4/ext4-fix-checksum-errors-with-indexed-dirs.patch b/queue-4.4/ext4-fix-checksum-errors-with-indexed-dirs.patch new file mode 100644 index 00000000000..c84fa818f01 --- /dev/null +++ b/queue-4.4/ext4-fix-checksum-errors-with-indexed-dirs.patch @@ -0,0 +1,125 @@ +From 48a34311953d921235f4d7bbd2111690d2e469cf Mon Sep 17 00:00:00 2001 +From: Jan Kara +Date: Mon, 10 Feb 2020 15:43:16 +0100 +Subject: ext4: fix checksum errors with indexed dirs + +From: Jan Kara + +commit 48a34311953d921235f4d7bbd2111690d2e469cf upstream. + +DIR_INDEX has been introduced as a compat ext4 feature. That means that +even kernels / tools that don't understand the feature may modify the +filesystem. This works because for kernels not understanding indexed dir +format, internal htree nodes appear just as empty directory entries. +Index dir aware kernels then check the htree structure is still +consistent before using the data. This all worked reasonably well until +metadata checksums were introduced. The problem is that these +effectively made DIR_INDEX only ro-compatible because internal htree +nodes store checksums in a different place than normal directory blocks. +Thus any modification ignorant to DIR_INDEX (or just clearing +EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch +and trigger kernel errors. So we have to be more careful when dealing +with indexed directories on filesystems with checksumming enabled. + +1) We just disallow loading any directory inodes with EXT4_INDEX_FL when +DIR_INDEX is not enabled. This is harsh but it should be very rare (it +means someone disabled DIR_INDEX on existing filesystem and didn't run +e2fsck), e2fsck can fix the problem, and we don't want to answer the +difficult question: "Should we rather corrupt the directory more or +should we ignore that DIR_INDEX feature is not set?" + +2) When we find out htree structure is corrupted (but the filesystem and +the directory should in support htrees), we continue just ignoring htree +information for reading but we refuse to add new entries to the +directory to avoid corrupting it more. + +Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz +Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") +Reviewed-by: Andreas Dilger +Signed-off-by: Jan Kara +Signed-off-by: Theodore Ts'o +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + fs/ext4/dir.c | 14 ++++++++------ + fs/ext4/ext4.h | 5 ++++- + fs/ext4/inode.c | 12 ++++++++++++ + fs/ext4/namei.c | 7 +++++++ + 4 files changed, 31 insertions(+), 7 deletions(-) + +--- a/fs/ext4/dir.c ++++ b/fs/ext4/dir.c +@@ -125,12 +125,14 @@ static int ext4_readdir(struct file *fil + if (err != ERR_BAD_DX_DIR) { + return err; + } +- /* +- * We don't set the inode dirty flag since it's not +- * critical that it get flushed back to the disk. +- */ +- ext4_clear_inode_flag(file_inode(file), +- EXT4_INODE_INDEX); ++ /* Can we just clear INDEX flag to ignore htree information? */ ++ if (!ext4_has_metadata_csum(sb)) { ++ /* ++ * We don't set the inode dirty flag since it's not ++ * critical that it gets flushed back to the disk. ++ */ ++ ext4_clear_inode_flag(inode, EXT4_INODE_INDEX); ++ } + } + + if (ext4_has_inline_data(inode)) { +--- a/fs/ext4/ext4.h ++++ b/fs/ext4/ext4.h +@@ -2381,8 +2381,11 @@ int ext4_insert_dentry(struct inode *dir + struct ext4_filename *fname); + static inline void ext4_update_dx_flag(struct inode *inode) + { +- if (!ext4_has_feature_dir_index(inode->i_sb)) ++ if (!ext4_has_feature_dir_index(inode->i_sb)) { ++ /* ext4_iget() should have caught this... */ ++ WARN_ON_ONCE(ext4_has_feature_metadata_csum(inode->i_sb)); + ext4_clear_inode_flag(inode, EXT4_INODE_INDEX); ++ } + } + static unsigned char ext4_filetype_table[] = { + DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK +--- a/fs/ext4/inode.c ++++ b/fs/ext4/inode.c +@@ -4325,6 +4325,18 @@ struct inode *ext4_iget(struct super_blo + ret = -EFSCORRUPTED; + goto bad_inode; + } ++ /* ++ * If dir_index is not enabled but there's dir with INDEX flag set, ++ * we'd normally treat htree data as empty space. But with metadata ++ * checksumming that corrupts checksums so forbid that. ++ */ ++ if (!ext4_has_feature_dir_index(sb) && ext4_has_metadata_csum(sb) && ++ ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) { ++ ext4_error_inode(inode, function, line, 0, ++ "iget: Dir with htree data on filesystem without dir_index feature."); ++ ret = -EFSCORRUPTED; ++ goto bad_inode; ++ } + ei->i_disksize = inode->i_size; + #ifdef CONFIG_QUOTA + ei->i_reserved_quota = 0; +--- a/fs/ext4/namei.c ++++ b/fs/ext4/namei.c +@@ -2121,6 +2121,13 @@ static int ext4_add_entry(handle_t *hand + retval = ext4_dx_add_entry(handle, &fname, dentry, inode); + if (!retval || (retval != ERR_BAD_DX_DIR)) + goto out; ++ /* Can we just ignore htree data? */ ++ if (ext4_has_metadata_csum(sb)) { ++ EXT4_ERROR_INODE(dir, ++ "Directory has corrupted htree index."); ++ retval = -EFSCORRUPTED; ++ goto out; ++ } + ext4_clear_inode_flag(dir, EXT4_INODE_INDEX); + dx_fallback++; + ext4_mark_inode_dirty(handle, dir); diff --git a/queue-4.4/series b/queue-4.4/series index 05a0b1aef7c..0c8da8f7eaa 100644 --- a/queue-4.4/series +++ b/queue-4.4/series @@ -1,2 +1,5 @@ alsa-usb-audio-apply-sample-rate-quirk-for-audioengine-d1.patch ubifs-fix-deadlock-in-concurrent-bulk-read-and-writepage.patch +ext4-fix-checksum-errors-with-indexed-dirs.patch +btrfs-fix-race-between-using-extent-maps-and-merging-them.patch +btrfs-log-message-when-rw-remount-is-attempted-with-unclean-tree-log.patch