git.ipfire.org Git - thirdparty/kernel/stable-queue.git/blob

   1 From stable+bounces-179510-greg=kroah.com@vger.kernel.org Sat Sep 13 18:14:04 2025
   2 From: Sasha Levin <sashal@kernel.org>
   3 Date: Sat, 13 Sep 2025 12:13:53 -0400
   4 Subject: btrfs: fix corruption reading compressed range when block size is smaller than page size
   5 To: stable@vger.kernel.org
   6 Cc: Qu Wenruo <wqu@suse.com>, Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
   7 Message-ID: <20250913161353.1443138-2-sashal@kernel.org>
   8
   9 From: Qu Wenruo <wqu@suse.com>
  10
  11 [ Upstream commit 9786531399a679fc2f4630d2c0a186205282ab2f ]
  12
  13 [BUG]
  14 With 64K page size (aarch64 with 64K page size config) and 4K btrfs
  15 block size, the following workload can easily lead to a corrupted read:
  16
  17         mkfs.btrfs -f -s 4k $dev > /dev/null
  18         mount -o compress $dev $mnt
  19         xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
  20         echo "correct result:"
  21         od -Ad -t x1 $mnt/base
  22         xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
  23                   -c "reflink $mnt/base 0 32k 32k" \
  24                   -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
  25         echo "incorrect result:"
  26         od -Ad -t x1 $mnt/new
  27         umount $mnt
  28
  29 This shows the following result:
  30
  31 correct result:
  32 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  33 *
  34 0065536
  35 incorrect result:
  36 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  37 *
  38 0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  39 *
  40 0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  41 *
  42 0065536
  43
  44 Notice the zero in the range [32K, 60K), which is incorrect.
  45
  46 [CAUSE]
  47 With extra trace printk, it shows the following events during od:
  48 (some unrelated info removed like CPU and context)
  49
  50  od-3457   btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
  51
  52 The "r/i" is indicating the root and inode number. In our case the file
  53 "new" is using ino 258 from fs tree (root 5).
  54
  55 Here notice the @prev_em_start pointer is NULL. This means the
  56 btrfs_do_readpage() is called from btrfs_read_folio(), not from
  57 btrfs_readahead().
  58
  59  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
  60  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
  61  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
  62  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
  63  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
  64  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
  65  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
  66  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
  67
  68 These above 32K blocks will be read from the first half of the
  69 compressed data extent.
  70
  71  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
  72
  73 Note here there is no btrfs_submit_compressed_read() call. Which is
  74 incorrect now.
  75 Although both extent maps at 0 and 32K are pointing to the same compressed
  76 data, their offsets are different thus can not be merged into the same
  77 read.
  78
  79 So this means the compressed data read merge check is doing something
  80 wrong.
  81
  82  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
  83  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
  84  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
  85  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
  86  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
  87  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
  88  od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
  89  od-3457   btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
  90
  91 The function btrfs_submit_compressed_read() is only called at the end of
  92 folio read. The compressed bio will only have an extent map of range [0,
  93 32K), but the original bio passed in is for the whole 64K folio.
  94
  95 This will cause the decompression part to only fill the first 32K,
  96 leaving the rest untouched (aka, filled with zero).
  97
  98 This incorrect compressed read merge leads to the above data corruption.
  99
 100 There were similar problems that happened in the past, commit 808f80b46790
 101 ("Btrfs: update fix for read corruption of compressed and shared
 102 extents") is doing pretty much the same fix for readahead.
 103
 104 But that's back to 2015, where btrfs still only supports bs (block size)
 105 == ps (page size) cases.
 106 This means btrfs_do_readpage() only needs to handle a folio which
 107 contains exactly one block.
 108
 109 Only btrfs_readahead() can lead to a read covering multiple blocks.
 110 Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
 111
 112 With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
 113 assumption that a folio can only contain one block.
 114
 115 Now btrfs_read_folio() can also read multiple blocks in one go.
 116 But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
 117 existing bio force submission check will never be triggered.
 118
 119 In theory, this can also happen for btrfs with large folios, but since
 120 large folio is still experimental, we don't need to bother it, thus only
 121 bs < ps support is affected for now.
 122
 123 [FIX]
 124 Instead of passing @prev_em_start to do the proper compressed extent
 125 check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
 126 the existing bio force submission logic will always be triggered.
 127
 128 CC: stable@vger.kernel.org # 5.15+
 129 Reviewed-by: Filipe Manana <fdmanana@suse.com>
 130 Signed-off-by: Qu Wenruo <wqu@suse.com>
 131 Signed-off-by: David Sterba <dsterba@suse.com>
 132 [ Adjust context ]
 133 Signed-off-by: Sasha Levin <sashal@kernel.org>
 134 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 135 ---
 136  fs/btrfs/extent_io.c |   40 ++++++++++++++++++++++++++++++----------
 137  1 file changed, 30 insertions(+), 10 deletions(-)
 138
 139 --- a/fs/btrfs/extent_io.c
 140 +++ b/fs/btrfs/extent_io.c
 141 @@ -109,6 +109,24 @@ struct btrfs_bio_ctrl {
 142          */
 143         unsigned long submit_bitmap;
 144         struct readahead_control *ractl;
 145 +
 146 +       /*
 147 +        * The start offset of the last used extent map by a read operation.
 148 +        *
 149 +        * This is for proper compressed read merge.
 150 +        * U64_MAX means we are starting the read and have made no progress yet.
 151 +        *
 152 +        * The current btrfs_bio_is_contig() only uses disk_bytenr as
 153 +        * the condition to check if the read can be merged with previous
 154 +        * bio, which is not correct. E.g. two file extents pointing to the
 155 +        * same extent but with different offset.
 156 +        *
 157 +        * So here we need to do extra checks to only merge reads that are
 158 +        * covered by the same extent map.
 159 +        * Just extent_map::start will be enough, as they are unique
 160 +        * inside the same inode.
 161 +        */
 162 +       u64 last_em_start;
 163  };
 164
 165  static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 166 @@ -955,7 +973,7 @@ static void btrfs_readahead_expand(struc
 167   * return 0 on success, otherwise return error
 168   */
 169  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 170 -                     struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
 171 +                            struct btrfs_bio_ctrl *bio_ctrl)
 172  {
 173         struct inode *inode = folio->mapping->host;
 174         struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 175 @@ -1066,12 +1084,11 @@ static int btrfs_do_readpage(struct foli
 176                  * non-optimal behavior (submitting 2 bios for the same extent).
 177                  */
 178                 if (compress_type != BTRFS_COMPRESS_NONE &&
 179 -                   prev_em_start && *prev_em_start != (u64)-1 &&
 180 -                   *prev_em_start != em->start)
 181 +                   bio_ctrl->last_em_start != U64_MAX &&
 182 +                   bio_ctrl->last_em_start != em->start)
 183                         force_bio_submit = true;
 184
 185 -               if (prev_em_start)
 186 -                       *prev_em_start = em->start;
 187 +               bio_ctrl->last_em_start = em->start;
 188
 189                 free_extent_map(em);
 190                 em = NULL;
 191 @@ -1115,12 +1132,15 @@ int btrfs_read_folio(struct file *file,
 192         const u64 start = folio_pos(folio);
 193         const u64 end = start + folio_size(folio) - 1;
 194         struct extent_state *cached_state = NULL;
 195 -       struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
 196 +       struct btrfs_bio_ctrl bio_ctrl = {
 197 +               .opf = REQ_OP_READ,
 198 +               .last_em_start = U64_MAX,
 199 +       };
 200         struct extent_map *em_cached = NULL;
 201         int ret;
 202
 203         btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
 204 -       ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
 205 +       ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 206         unlock_extent(&inode->io_tree, start, end, &cached_state);
 207
 208         free_extent_map(em_cached);
 209 @@ -2391,7 +2411,8 @@ void btrfs_readahead(struct readahead_co
 210  {
 211         struct btrfs_bio_ctrl bio_ctrl = {
 212                 .opf = REQ_OP_READ | REQ_RAHEAD,
 213 -               .ractl = rac
 214 +               .ractl = rac,
 215 +               .last_em_start = U64_MAX,
 216         };
 217         struct folio *folio;
 218         struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
 219 @@ -2399,12 +2420,11 @@ void btrfs_readahead(struct readahead_co
 220         const u64 end = start + readahead_length(rac) - 1;
 221         struct extent_state *cached_state = NULL;
 222         struct extent_map *em_cached = NULL;
 223 -       u64 prev_em_start = (u64)-1;
 224
 225         btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
 226
 227         while ((folio = readahead_folio(rac)) != NULL)
 228 -               btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
 229 +               btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 230
 231         unlock_extent(&inode->io_tree, start, end, &cached_state);
 232