[WARNING]
When running test cases with injected errors or shutdown, e.g.
generic/388 or generic/475, there is a chance that the following kernel
warning is triggered:
BTRFS info (device dm-2): first mount of filesystem
d8a19a28-3232-4809-b0df-
38df83e71bff
BTRFS info (device dm-2): using crc32c checksum algorithm
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): turning on async discard
BTRFS info (device dm-2): enabling free space tree
BTRFS critical (device dm-2 state E): emergency shutdown
------------[ cut here ]------------
WARNING: extent_io.c:1742 at extent_writepage_io+0x437/0x520 [btrfs], CPU#2: kworker/u43:2/651591
CPU: 2 UID: 0 PID: 651591 Comm: kworker/u43:2 Tainted: G W OE 7.0.0-rc6-custom+ #365 PREEMPT(full)
5804053f02137e627472d94b5128cc9fcb110e88
RIP: 0010:extent_writepage_io+0x437/0x520 [btrfs]
Call Trace:
<TASK>
extent_write_cache_pages+0x2a5/0x820 [btrfs
70299925d0856939e93b17d480651713b3cbba58]
btrfs_writepages+0x74/0x130 [btrfs
70299925d0856939e93b17d480651713b3cbba58]
do_writepages+0xd0/0x160
__writeback_single_inode+0x42/0x340
writeback_sb_inodes+0x22d/0x580
wb_writeback+0xc6/0x360
wb_workfn+0xbd/0x470
process_one_work+0x198/0x3b0
worker_thread+0x1c8/0x330
kthread+0xee/0x120
ret_from_fork+0x2a6/0x330
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace
0000000000000000 ]---
BTRFS error (device dm-2 state E): root 5 ino 259 folio
1323008 is marked dirty without notifying the fs
BTRFS error (device dm-2 state E): failed to submit blocks, root=5 inode=259 folio=
1323008 submit_bitmap=0: -117
BTRFS info (device dm-2 state E): last unmount of filesystem
d8a19a28-3232-4809-b0df-
38df83e71bff
[CAUSE]
Inside btrfs we have the following pattern in several locations, for
example inside btrfs_dirty_folio():
btrfs_clear_extent_bit(&inode->io_tree, start_pos, end_of_last_block,
EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
cached);
ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
extra_bits, cached);
if (ret)
return ret;
However btrfs_set_extent_delalloc() can return IO errors other than -ENOMEM
through the following callchain:
btrfs_set_extent_delalloc()
\- btrfs_find_new_delalloc_bytes()
\- btrfs_get_extent()
\- btrfs_lookup_file_extent()
\- btrfs_search_slot()
When such IO error happened, the previous btrfs_clear_extent_bit() has
cleared the EXTENT_DELALLOC for the range, and we're expecting
btrfs_set_extent_delalloc() to re-set EXTENT_DELALLOC.
But since btrfs_set_extent_delalloc() failed before
btrfs_set_extent_bit(), EXTENT_DELALLOC flag is no longer present.
And if the folio range is dirty before entering
btrfs_set_extent_delalloc(), we got a dirty folio but no EXTENT_DELALLOC
flag now.
Then we hit the folio writeback:
extent_writepage()
|- writepage_delalloc()
| No ordered extent is created, as there is no EXTENT_DELALLOC set
| for the folio range.
| This also means the folio has no ordered flag set.
|
|- extent_writepage_io()
\- if (unlikely(!folio_test_ordered(folio))
Now we hit the warning.
[FIX]
Introduce a new helper, btrfs_reset_extent_delalloc() to replace the
currently open-coded btrfs_clear_extent_bit() +
btrfs_set_extent_delalloc() combination.
Instead of calling btrfs_clear_extent_bit() first, update
EXTENT_DELALLOC_NEW first, as that part can fail due to metadata IO,
meanwhile btrfs_clear_extent_bit() and btrfs_set_extent_bit() won't
return any error but retry memory allocation until succeeded.
This allows us to fail early without clearing EXTENT_DELALLOC bit, so
even if that new btrfs_reset_extent_delalloc() failed before touching
EXTENT_DELALLOC, the existing dirty range will still have their old
EXTENT_DELALLOC flag present, thus avoid the warning.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
unsigned int extra_bits,
struct extent_state **cached_state);
+int btrfs_reset_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
+ unsigned int extra_bits, struct extent_state **cached_state);
struct btrfs_new_inode_args {
/* Input */
end_of_last_block = start_pos + num_bytes - 1;
- /*
- * The pages may have already been dirty, clear out old accounting so
- * we can set things up properly
- */
- btrfs_clear_extent_bit(&inode->io_tree, start_pos, end_of_last_block,
- EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
- cached);
-
- ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
- extra_bits, cached);
+ ret = btrfs_reset_extent_delalloc(inode, start_pos, end_of_last_block,
+ extra_bits, cached);
if (ret)
return ret;
}
}
- /*
- * page_mkwrite gets called when the page is firstly dirtied after it's
- * faulted in, but write(2) could also dirty a page and set delalloc
- * bits, thus in this case for space account reason, we still need to
- * clear any delalloc bits within this page range since we have to
- * reserve data&meta space before lock_page() (see above comments).
- */
- btrfs_clear_extent_bit(io_tree, page_start, end,
- EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
- EXTENT_DEFRAG, &cached_state);
-
- ret = btrfs_set_extent_delalloc(inode, page_start, end, 0, &cached_state);
+ ret = btrfs_reset_extent_delalloc(inode, page_start, end, 0, &cached_state);
if (ret < 0) {
btrfs_unlock_extent(io_tree, page_start, page_end, &cached_state);
goto out_unlock;
unsigned int extra_bits,
struct extent_state **cached_state)
{
- WARN_ON(PAGE_ALIGNED(end));
+ const u32 blocksize = inode->root->fs_info->sectorsize;
+
+ /* Basic alignment check. */
+ ASSERT(IS_ALIGNED(start, blocksize), "start=%llu blocksize=%u",
+ start, blocksize);
+ ASSERT(IS_ALIGNED(end + 1, blocksize), "inclusive end=%llu blocksize=%u",
+ end, blocksize);
if (start >= i_size_read(&inode->vfs_inode) &&
!(inode->flags & BTRFS_INODE_PREALLOC)) {
EXTENT_DELALLOC | extra_bits, cached_state);
}
+/*
+ * Clear the old accounting flags and set EXTENT_DELALLOC for the range.
+ *
+ * Return <0 for error, in that case no range has EXTENT_DELALLOC bit cleared or set.
+ */
+int btrfs_reset_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
+ unsigned int extra_bits, struct extent_state **cached_state)
+{
+ const u32 blocksize = inode->root->fs_info->sectorsize;
+
+ /* The @extra_bits can only be EXTENT_NORESERVE for now. */
+ ASSERT(!(extra_bits & ~EXTENT_NORESERVE), "extra_bits=0x%x", extra_bits);
+
+ /* Basic alignment check. */
+ ASSERT(IS_ALIGNED(start, blocksize), "start=%llu blocksize=%u",
+ start, blocksize);
+ ASSERT(IS_ALIGNED(end + 1, blocksize), "inclusive end=%llu blocksize=%u",
+ end, blocksize);
+
+ /*
+ * Check and set DELALLOC_NEW flag, this needs to search tree thus can
+ * fail early. Thus we want to do this before clearing EXTENT_DELALLOC.
+ */
+ if (start >= i_size_read(&inode->vfs_inode) &&
+ !(inode->flags & BTRFS_INODE_PREALLOC)) {
+ /*
+ * There can't be any extents following EOF in this case so just
+ * set the delalloc new bit for the range directly.
+ */
+ extra_bits |= EXTENT_DELALLOC_NEW;
+ } else {
+ int ret;
+
+ ret = btrfs_find_new_delalloc_bytes(inode, start, end + 1 - start,
+ NULL);
+ if (unlikely(ret))
+ return ret;
+ }
+ /* Clear the old accounting as the range may already be dirty. */
+ btrfs_clear_extent_bit(&inode->io_tree, start, end,
+ EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
+ EXTENT_DEFRAG, cached_state);
+ return btrfs_set_extent_bit(&inode->io_tree, start, end,
+ EXTENT_DELALLOC | extra_bits, cached_state);
+}
+
static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, u64 file_pos,
struct btrfs_file_extent_item *stack_fi,
goto again;
}
- btrfs_clear_extent_bit(&inode->io_tree, block_start, block_end,
- EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
- &cached_state);
-
- ret = btrfs_set_extent_delalloc(inode, block_start, block_end, 0,
- &cached_state);
+ ret = btrfs_reset_extent_delalloc(inode, block_start, block_end, 0, &cached_state);
if (ret) {
btrfs_unlock_extent(io_tree, block_start, block_end, &cached_state);
goto out_unlock;
if (ret < 0)
goto out_unlock;
- btrfs_clear_extent_bit(&inode->io_tree, file_offset, range_end,
- EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, NULL);
- ret = btrfs_set_extent_delalloc(inode, file_offset, range_end, 0, NULL);
+ ret = btrfs_reset_extent_delalloc(inode, file_offset, range_end, 0, NULL);
if (ret)
goto out_unlock;