git.ipfire.org Git - thirdparty/kernel/stable-queue.git/blob

   1 From stable+bounces-179491-greg=kroah.com@vger.kernel.org Sat Sep 13 16:13:05 2025
   2 From: Sasha Levin <sashal@kernel.org>
   3 Date: Sat, 13 Sep 2025 10:12:33 -0400
   4 Subject: btrfs: fix corruption reading compressed range when block size is smaller than page size
   5 To: stable@vger.kernel.org
   6 Cc: Qu Wenruo <wqu@suse.com>, Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
   7 Message-ID: <20250913141233.1363589-2-sashal@kernel.org>
   8
   9 From: Qu Wenruo <wqu@suse.com>
  10
  11 [ Upstream commit 9786531399a679fc2f4630d2c0a186205282ab2f ]
  12
  13 [BUG]
  14 With 64K page size (aarch64 with 64K page size config) and 4K btrfs
  15 block size, the following workload can easily lead to a corrupted read:
  16
  17         mkfs.btrfs -f -s 4k $dev > /dev/null
  18         mount -o compress $dev $mnt
  19         xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
  20         echo "correct result:"
  21         od -Ad -t x1 $mnt/base
  22         xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
  23                   -c "reflink $mnt/base 0 32k 32k" \
  24                   -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
  25         echo "incorrect result:"
  26         od -Ad -t x1 $mnt/new
  27         umount $mnt
  28
  29 This shows the following result:
  30
  31 correct result:
  32 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  33 *
  34 0065536
  35 incorrect result:
  36 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  37 *
  38 0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  39 *
  40 0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  41 *
  42 0065536
  43
  44 Notice the zero in the range [32K, 60K), which is incorrect.
  45
  46 [CAUSE]
  47 With extra trace printk, it shows the following events during od:
  48 (some unrelated info removed like CPU and context)
  49
  50  od-3457   btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
  51
  52 The "r/i" is indicating the root and inode number. In our case the file
  53 "new" is using ino 258 from fs tree (root 5).
  54
  55 Here notice the @prev_em_start pointer is NULL. This means the
  56 btrfs_do_readpage() is called from btrfs_read_folio(), not from
  57 btrfs_readahead().
  58
  59  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
  60  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
  61  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
  62  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
  63  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
  64  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
  65  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
  66  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
  67
  68 These above 32K blocks will be read from the first half of the
  69 compressed data extent.
  70
  71  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
  72
  73 Note here there is no btrfs_submit_compressed_read() call. Which is
  74 incorrect now.
  75 Although both extent maps at 0 and 32K are pointing to the same compressed
  76 data, their offsets are different thus can not be merged into the same
  77 read.
  78
  79 So this means the compressed data read merge check is doing something
  80 wrong.
  81
  82  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
  83  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
  84  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
  85  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
  86  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
  87  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
  88  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
  89  od-3457   btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
  90
  91 The function btrfs_submit_compressed_read() is only called at the end of
  92 folio read. The compressed bio will only have an extent map of range [0,
  93 32K), but the original bio passed in is for the whole 64K folio.
  94
  95 This will cause the decompression part to only fill the first 32K,
  96 leaving the rest untouched (aka, filled with zero).
  97
  98 This incorrect compressed read merge leads to the above data corruption.
  99
 100 There were similar problems that happened in the past, commit 808f80b46790
 101 ("Btrfs: update fix for read corruption of compressed and shared
 102 extents") is doing pretty much the same fix for readahead.
 103
 104 But that's back to 2015, where btrfs still only supports bs (block size)
 105 == ps (page size) cases.
 106 This means btrfs_do_readpage() only needs to handle a folio which
 107 contains exactly one block.
 108
 109 Only btrfs_readahead() can lead to a read covering multiple blocks.
 110 Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
 111
 112 With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
 113 assumption that a folio can only contain one block.
 114
 115 Now btrfs_read_folio() can also read multiple blocks in one go.
 116 But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
 117 existing bio force submission check will never be triggered.
 118
 119 In theory, this can also happen for btrfs with large folios, but since
 120 large folio is still experimental, we don't need to bother it, thus only
 121 bs < ps support is affected for now.
 122
 123 [FIX]
 124 Instead of passing @prev_em_start to do the proper compressed extent
 125 check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
 126 the existing bio force submission logic will always be triggered.
 127
 128 CC: stable@vger.kernel.org # 5.15+
 129 Reviewed-by: Filipe Manana <fdmanana@suse.com>
 130 Signed-off-by: Qu Wenruo <wqu@suse.com>
 131 Signed-off-by: David Sterba <dsterba@suse.com>
 132 Signed-off-by: Sasha Levin <sashal@kernel.org>
 133 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 134 ---
 135  fs/btrfs/extent_io.c |   40 ++++++++++++++++++++++++++++++----------
 136  1 file changed, 30 insertions(+), 10 deletions(-)
 137
 138 --- a/fs/btrfs/extent_io.c
 139 +++ b/fs/btrfs/extent_io.c
 140 @@ -111,6 +111,24 @@ struct btrfs_bio_ctrl {
 141          */
 142         unsigned long submit_bitmap;
 143         struct readahead_control *ractl;
 144 +
 145 +       /*
 146 +        * The start offset of the last used extent map by a read operation.
 147 +        *
 148 +        * This is for proper compressed read merge.
 149 +        * U64_MAX means we are starting the read and have made no progress yet.
 150 +        *
 151 +        * The current btrfs_bio_is_contig() only uses disk_bytenr as
 152 +        * the condition to check if the read can be merged with previous
 153 +        * bio, which is not correct. E.g. two file extents pointing to the
 154 +        * same extent but with different offset.
 155 +        *
 156 +        * So here we need to do extra checks to only merge reads that are
 157 +        * covered by the same extent map.
 158 +        * Just extent_map::start will be enough, as they are unique
 159 +        * inside the same inode.
 160 +        */
 161 +       u64 last_em_start;
 162  };
 163
 164  static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 165 @@ -910,7 +928,7 @@ static void btrfs_readahead_expand(struc
 166   * return 0 on success, otherwise return error
 167   */
 168  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 169 -                     struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
 170 +                            struct btrfs_bio_ctrl *bio_ctrl)
 171  {
 172         struct inode *inode = folio->mapping->host;
 173         struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 174 @@ -1020,12 +1038,11 @@ static int btrfs_do_readpage(struct foli
 175                  * non-optimal behavior (submitting 2 bios for the same extent).
 176                  */
 177                 if (compress_type != BTRFS_COMPRESS_NONE &&
 178 -                   prev_em_start && *prev_em_start != (u64)-1 &&
 179 -                   *prev_em_start != em->start)
 180 +                   bio_ctrl->last_em_start != U64_MAX &&
 181 +                   bio_ctrl->last_em_start != em->start)
 182                         force_bio_submit = true;
 183
 184 -               if (prev_em_start)
 185 -                       *prev_em_start = em->start;
 186 +               bio_ctrl->last_em_start = em->start;
 187
 188                 btrfs_free_extent_map(em);
 189                 em = NULL;
 190 @@ -1239,12 +1256,15 @@ int btrfs_read_folio(struct file *file,
 191         const u64 start = folio_pos(folio);
 192         const u64 end = start + folio_size(folio) - 1;
 193         struct extent_state *cached_state = NULL;
 194 -       struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
 195 +       struct btrfs_bio_ctrl bio_ctrl = {
 196 +               .opf = REQ_OP_READ,
 197 +               .last_em_start = U64_MAX,
 198 +       };
 199         struct extent_map *em_cached = NULL;
 200         int ret;
 201
 202         lock_extents_for_read(inode, start, end, &cached_state);
 203 -       ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
 204 +       ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 205         btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
 206
 207         btrfs_free_extent_map(em_cached);
 208 @@ -2582,7 +2602,8 @@ void btrfs_readahead(struct readahead_co
 209  {
 210         struct btrfs_bio_ctrl bio_ctrl = {
 211                 .opf = REQ_OP_READ | REQ_RAHEAD,
 212 -               .ractl = rac
 213 +               .ractl = rac,
 214 +               .last_em_start = U64_MAX,
 215         };
 216         struct folio *folio;
 217         struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
 218 @@ -2590,12 +2611,11 @@ void btrfs_readahead(struct readahead_co
 219         const u64 end = start + readahead_length(rac) - 1;
 220         struct extent_state *cached_state = NULL;
 221         struct extent_map *em_cached = NULL;
 222 -       u64 prev_em_start = (u64)-1;
 223
 224         lock_extents_for_read(inode, start, end, &cached_state);
 225
 226         while ((folio = readahead_folio(rac)) != NULL)
 227 -               btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
 228 +               btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 229
 230         btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
 231