
btrfs: fix incorrect buffered IO fallback for append direct writes

[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with the nodatasum
mount option:

generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has previously been run with the default data
checksum, the failure will not reproduce. The following fix is needed to
make it reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if the write end is beyond the current isize.

But if the direct IO finishes short, we revert the isize neither to the
previous value nor to the end of the short write.

If we then need to fall back to buffered writes and the write has the
IOCB_APPEND flag, the buffered write will be positioned at the
incorrect isize.

The call chain looks like this:

btrfs_direct_write(pos=0, length=4K)
|- __iomap_dio_rw()
|  |- iomap_iter()
|  |  |- btrfs_dio_iomap_begin()
|  |     |- btrfs_get_blocks_direct_write()
|  |        |- i_size_write()
|  |           Which updates the isize to the write end (4K).
|  |
|  |- iomap_dio_iter()
|  |  Failed with -EFAULT on the first page.
|  |
|  |- iomap_iter()
|  |  |- btrfs_dio_iomap_end()
|  |     Detects a short write and returns -ENOTBLK.
|  |- if (ret == -ENOTBLK) { ret = 0;}
|     Which resets the return value.
|
|- ret = iomap_dio_complete()
|  Which returns 0.
|
|- btrfs_buffered_write(iocb, from);
    |- generic_write_checks()
       |- iocb->ki_pos = i_size_read()
          Which is still the new size (4K), rather than the original
          isize 0.
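The effect can be reproduced in a small userspace model. This is a hypothetical sketch, not btrfs code: `struct toy_inode`, `toy_dio_begin()` and `toy_append_write_buggy()` are invented names, and the model assumes an IOCB_APPEND write whose direct part completes only `dio_written` bytes before the remainder falls back to buffered IO, which re-reads isize to pick its position.

```c
#include <assert.h>

/* Hypothetical toy inode: only the file size is modelled. */
struct toy_inode {
	long long isize;
};

/* Stand-in for btrfs_dio_iomap_begin(): isize grows before any data lands. */
static void toy_dio_begin(struct toy_inode *inode, long long pos, long long len)
{
	if (pos + len > inode->isize)
		inode->isize = pos + len;
}

/*
 * Model of an IOCB_APPEND write of @len bytes whose direct part only
 * completes @dio_written bytes; the rest falls back to buffered IO.
 * Returns the final file size.
 */
static long long toy_append_write_buggy(struct toy_inode *inode, long long len,
					long long dio_written)
{
	long long pos = inode->isize;	/* IOCB_APPEND: start at EOF */
	long long remaining = len - dio_written;

	assert(dio_written >= 0 && dio_written <= len);

	toy_dio_begin(inode, pos, len);
	/* Short direct write: the enlarged isize is NOT reverted (the bug). */

	if (remaining > 0) {
		/* Buffered fallback: IOCB_APPEND re-reads isize as its position. */
		long long buffered_pos = inode->isize;

		/* Appending at that position always extends the file. */
		inode->isize = buffered_pos + remaining;
	}
	return inode->isize;
}
```

With a 4K append to an empty file and a direct write that completes nothing (the -EFAULT case in the call chain), `toy_append_write_buggy()` returns 8192, matching the size reported by generic/362; a direct write that completes all 4096 bytes returns 4096.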

[FIX]
Introduce the following btrfs_dio_data members:

- old_isize
  The isize before the direct write.

- updated_isize
  Whether the direct write has enlarged the isize.

Then, if we get a short write and btrfs_dio_data::updated_isize is set,
revert the isize to the correct value based on old_isize and the current
file position.
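A minimal sketch of the revert, assuming a toy inode that only tracks its size. `old_isize` and `updated_isize` follow the member names above, but `struct toy_dio_data`, `toy_dio_begin()` and `toy_dio_short_write()` are hypothetical, and taking the larger of `old_isize` and the current file position as the correct size is an assumption about what "based on old_isize and current file position" means.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy inode: only the file size is modelled. */
struct toy_inode {
	long long isize;
};

/* Hypothetical stand-in for the two new btrfs_dio_data members. */
struct toy_dio_data {
	long long old_isize;	/* isize before the direct write */
	bool updated_isize;	/* did this direct write enlarge isize? */
};

/* Like the iomap_begin step: remember the old size, then enlarge it. */
static void toy_dio_begin(struct toy_inode *inode, struct toy_dio_data *dd,
			  long long pos, long long len)
{
	assert(pos >= 0 && len > 0);
	dd->old_isize = inode->isize;
	dd->updated_isize = false;
	if (pos + len > inode->isize) {
		inode->isize = pos + len;
		dd->updated_isize = true;
	}
}

/*
 * Short write detected: if this write enlarged isize, shrink it back.
 * Assumed rule: keep whatever the completed bytes cover (@cur_pos), but
 * never go below the size the file had before the write.
 */
static void toy_dio_short_write(struct toy_inode *inode,
				const struct toy_dio_data *dd, long long cur_pos)
{
	if (!dd->updated_isize)
		return;
	inode->isize = cur_pos > dd->old_isize ? cur_pos : dd->old_isize;
}
```

Replaying the generic/362 scenario (a 4K write at offset 0 on an empty file that completes nothing) through these helpers leaves `isize` at 0, so an IOCB_APPEND buffered fallback would start at offset 0 and the file would end up 4096 bytes long.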

Here we call i_size_write() without holding the extent lock. This is a
special case where doing so is safe:

- Only a single writer can be enlarging isize
  Enlarging isize takes the exclusive inode lock.

- Buffered readers need to wait for the ordered extent (OE) we're holding
  Buffered readers lock the extent range and wait for the OE covering
  the folio range. Sometimes the OE wait can be skipped, but since the
  page cache has been invalidated, it cannot be skipped here.
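The first point can be illustrated with a hypothetical lock-mode helper. This is a simplification under stated assumptions: `toy_dio_write_lock_mode()` and its enum are invented, and the sketch assumes that a direct write staying within EOF may share the inode lock while any write that would enlarge isize takes it exclusively. The real btrfs decision may depend on further conditions that are not modelled here.

```c
#include <assert.h>

/* Hypothetical inode lock modes. */
enum toy_ilock_mode {
	TOY_ILOCK_SHARED,
	TOY_ILOCK_EXCLUSIVE,
};

/*
 * Assumed rule: a write ending beyond the current isize may enlarge it,
 * so it must hold the inode lock exclusively. At most one such writer
 * can then be inside i_size_write() at any time.
 */
static enum toy_ilock_mode toy_dio_write_lock_mode(long long pos, long long len,
						   long long isize)
{
	assert(pos >= 0 && len >= 0);
	return pos + len > isize ? TOY_ILOCK_EXCLUSIVE : TOY_ILOCK_SHARED;
}
```

Under this rule a 4K append to an empty file gets the exclusive mode, while a 4K overwrite inside an 8K file may run in shared mode alongside other non-extending writers.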

But I do not think this is the most elegant solution, nor does it cover
all cases. E.g. if the bio is submitted but the IO fails, we are unable
to revert the isize.

I believe the more elegant solution would be to extend the
EXTENT_DIO_LOCKED lifespan for direct writes, so that we can update the
isize when a write beyond EOF finishes successfully.
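For comparison, the isize side of that alternative could look like the hypothetical completion hook below. It does not model the EXTENT_DIO_LOCKED lifespan change itself, only the consequence described above: isize is left alone when the write is set up and is extended on completion, to the end of the bytes that were actually written, so a short or failed write leaves nothing to revert. `struct toy_inode` and `toy_dio_write_end_io()` are invented names.

```c
#include <assert.h>

/* Hypothetical toy inode: only the file size is modelled. */
struct toy_inode {
	long long isize;
};

/*
 * Hypothetical completion hook: @bytes_done is how many bytes starting
 * at @pos actually completed (0 if the IO failed outright). isize only
 * ever grows here, and only to cover completed data.
 */
static void toy_dio_write_end_io(struct toy_inode *inode, long long pos,
				 long long bytes_done)
{
	assert(pos >= 0 && bytes_done >= 0);
	if (pos + bytes_done > inode->isize)
		inode->isize = pos + bytes_done;
}
```

A failed append (`bytes_done == 0`) leaves isize untouched, which is exactly the case the current fix cannot handle once the bio has been submitted.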

However, that change is too large for a small bug fix, so only the
minimal partial fix is implemented for now.

[REASON FOR NO FIXES TAG]
The bug is again very old: even before commit f85781fb505e ("btrfs:
switch to iomap for direct IO"), we were already increasing isize
without a proper rollback for short writes.

Thus only a CC to stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

author	Qu Wenruo <wqu@suse.com>
	Thu, 4 Jun 2026 00:29:47 +0000 (09:59 +0930)
committer	Johannes Thumshirn <johannes.thumshirn@wdc.com>
	Tue, 9 Jun 2026 16:22:46 +0000 (18:22 +0200)
commit	ff66fe6662330226b3f486014c375538d91c44aa
tree	23ab758eda3e1dac4b9427a0c6dd5bd0629f0200
parent	66ff4d366e7eb4d31813d2acabf3af512ce03aa5