btrfs: fix incorrect buffered IO fallback for append direct writes
[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error when the
filesystem is mounted with the nodatasum option:
generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
--- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
+++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930
@@ -1,2 +1,3 @@
QA output created by 362
+Wrong file size after first write, got 8192 expected 4096
Silence is golden
...
*: If the test case has been executed before with the default data
checksum, the failure will not reproduce. The following fix is needed
to make it reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/
[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if the write end is beyond the current isize.
But if the direct IO finishes short, we revert the isize neither to the
previous value nor to the short write end.
Then if we need to fall back to buffered writes, and the write has the
IOCB_APPEND flag, the buffered write will be positioned at the
incorrect isize.
The call chain looks like this:
btrfs_direct_write(pos=0, length=4K)
|- __iomap_dio_rw()
|  |- iomap_iter()
|  |  |- btrfs_dio_iomap_begin()
|  |     |- btrfs_get_blocks_direct_write()
|  |        |- i_size_write()
|  |           Which updates the isize to the write end (4K).
|  |
|  |- iomap_dio_iter()
|  |  Failed with -EFAULT on the first page.
|  |
|  |- iomap_iter()
|  |  |- btrfs_dio_iomap_end()
|  |     Detects a short write, returns -ENOTBLK.
|  |- if (ret == -ENOTBLK) { ret = 0; }
|     Which resets the return value.
|
|- ret = iomap_dio_complete()
|  Which returns 0.
|
|- btrfs_buffered_write(iocb, from);
   |- generic_write_checks()
      |- iocb->ki_pos = i_size_read()
         Which is still the new size (4K), rather than the original
         isize 0.
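
The effect of the call chain above can be modelled in user space. This
is only a toy sketch, not kernel code: struct toy_inode and all toy_*()
helpers are hypothetical names made up for illustration, and the model
assumes the direct write copies nothing before hitting -EFAULT.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for an inode; only the size matters here. */
struct toy_inode {
	uint64_t isize;
};

/*
 * Model of the buggy direct write: isize is bumped to the write end
 * before any data is copied, and never rolled back if only @copied
 * bytes (copied < len) actually made it.
 */
static uint64_t toy_short_direct_write(struct toy_inode *inode, uint64_t pos,
				       uint64_t len, uint64_t copied)
{
	assert(copied <= len);
	if (pos + len > inode->isize)
		inode->isize = pos + len;
	return copied;
}

/* Append buffered write: the position is taken from the current isize. */
static uint64_t toy_append_fallback(struct toy_inode *inode, uint64_t len)
{
	uint64_t pos = inode->isize;

	inode->isize = pos + len;
	return pos;
}

/* Append write: try direct IO first, fall back to buffered for the rest. */
static uint64_t toy_append_write(struct toy_inode *inode, uint64_t len,
				 uint64_t dio_copied)
{
	uint64_t done = toy_short_direct_write(inode, inode->isize, len,
					       dio_copied);

	if (done < len)
		toy_append_fallback(inode, len - done);
	return inode->isize;
}
```

In this model, a 4K append write to an empty file whose direct IO part
copies nothing ends with isize 8192 instead of 4096, which matches the
error message of generic/362.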
[FIX]
Introduce the following btrfs_dio_data members:
- old_isize
- updated_isize
  Set if the direct write has enlarged the isize.
Then if we get a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on old_isize and the current file
position.
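
A minimal sketch of the revert rule, again as a user-space model rather
than the actual patch. The struct and helper names are hypothetical, and
picking the larger of old_isize and the short write end is an assumption
about what "based on old_isize and current file position" means here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two new btrfs_dio_data members. */
struct toy_dio_data {
	uint64_t old_isize;	/* isize before the direct write */
	bool updated_isize;	/* the direct write has enlarged isize */
};

/*
 * Return the isize to keep after a direct write at @pos that asked for
 * @length bytes but only submitted @submitted bytes.
 *
 * Assumption: on a short write that enlarged isize, fall back to the
 * larger of the old isize and the end of the data really submitted.
 */
static uint64_t toy_isize_after_dio(const struct toy_dio_data *dd,
				    uint64_t cur_isize, uint64_t pos,
				    uint64_t length, uint64_t submitted)
{
	uint64_t short_end = pos + submitted;

	/* Full write, or isize untouched: nothing to revert. */
	if (submitted >= length || !dd->updated_isize)
		return cur_isize;

	return short_end > dd->old_isize ? short_end : dd->old_isize;
}
```

With old_isize 0 and nothing submitted, the model returns 0, so a
following append buffered write would start at offset 0 again.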
Here we call i_size_write() without holding an extent lock. This is a
very special case where doing so is safe:
- Only a single writer can be enlarging isize
  Enlarging isize will take the exclusive inode lock.
- Buffered readers need to wait for the OE we're holding
  Buffered readers will lock the extent and wait for the OE of the
  folio range.
  Sometimes we can skip the OE wait, but since all page cache is
  invalidated, the OE wait cannot be skipped.
But I do not think this is the most elegant solution, nor does it cover
all cases. E.g. if the bio is submitted but the IO fails, we are unable
to do the revert.
I believe the more elegant one would be to extend the EXTENT_DIO_LOCKED
lifespan for direct writes, so that we can update the isize when a
write beyond EOF finishes successfully.
However that change is too large for a small bug fix.
So only implement the minimal partial fix for now.
[REASON FOR NO FIXES TAG]
The bug is again very old: before commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we were already increasing isize without a
proper rollback for short writes.
Thus only a CC to stable.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>