From: Filipe Manana
Date: Tue, 26 May 2026 13:44:30 +0000 (+0100)
Subject: btrfs: fix deadlock cloning inline extent when using flushoncommit
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=532085d00eb54c074bdeae648b194765239f4d11;p=thirdparty%2Flinux.git

btrfs: fix deadlock cloning inline extent when using flushoncommit

In commit b48c980b6a7e ("btrfs: fix deadlock between reflink and
transaction commit when using flushoncommit") a deadlock was fixed between
reflinks and transaction commits when the fs is mounted with the
flushoncommit option. This happened when we had to copy an inline extent's
data to the destination file.

However the issue was fixed only for the case where the destination offset
is 0, it missed the case when the offset is greater than zero.

Fix this by ensuring we get i_size update whenever we copied an inline
extent's data into the destination file.

Syzbot reported this with the following trace:

  INFO: task kworker/u8:3:57 blocked for more than 143 seconds.
        Not tainted syzkaller #0
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u8:3 state:D stack:21600 pid:57 tgid:57 ppid:2 task_flags:0x4208160 flags:0x00080000
  Workqueue: writeback wb_workfn (flush-btrfs-129)
  Call Trace:
   context_switch kernel/sched/core.c:5402 [inline]
   __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
   __schedule_loop kernel/sched/core.c:7283 [inline]
   schedule+0x164/0x360 kernel/sched/core.c:7298
   wait_extent_bit fs/btrfs/extent-io-tree.c:905 [inline]
   btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:2008
   btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline]
   btrfs_invalidate_folio+0x440/0xc00 fs/btrfs/inode.c:7718
   extent_writepage fs/btrfs/extent_io.c:1848 [inline]
   extent_write_cache_pages fs/btrfs/extent_io.c:2552 [inline]
   btrfs_writepages+0x12f3/0x2410 fs/btrfs/extent_io.c:2684
   do_writepages+0x32e/0x550 mm/page-writeback.c:2571
   __writeback_single_inode+0x133/0x10e0 fs/fs-writeback.c:1764
   writeback_sb_inodes+0x97f/0x1980 fs/fs-writeback.c:2056
   wb_writeback+0x445/0xb00 fs/fs-writeback.c:2241
   wb_do_writeback fs/fs-writeback.c:2388 [inline]
   wb_workfn+0x3fd/0xf20 fs/fs-writeback.c:2428
   process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
   process_scheduled_works kernel/workqueue.c:3401 [inline]
   worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
   kthread+0x388/0x470 kernel/kthread.c:436
   ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
  INFO: task syz.0.145:8523 blocked for more than 143 seconds.
        Not tainted syzkaller #0
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz.0.145 state:D stack:22752 pid:8523 tgid:8522 ppid:5850 task_flags:0x400140 flags:0x00080002
  Call Trace:
   context_switch kernel/sched/core.c:5402 [inline]
   __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
   __schedule_loop kernel/sched/core.c:7283 [inline]
   schedule+0x164/0x360 kernel/sched/core.c:7298
   wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227
   __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2847
   try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2895
   btrfs_start_delalloc_flush fs/btrfs/transaction.c:2182 [inline]
   btrfs_commit_transaction+0x813/0x2fc0 fs/btrfs/transaction.c:2371
   btrfs_sync_file+0xdf4/0x1230 fs/btrfs/file.c:1822
   generic_write_sync include/linux/fs.h:2663 [inline]
   btrfs_do_write_iter+0x6a9/0x840 fs/btrfs/file.c:1473
   new_sync_write fs/read_write.c:595 [inline]
   vfs_write+0x629/0xba0 fs/read_write.c:688
   ksys_write+0x156/0x270 fs/read_write.c:740
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f5a0bdece59
  RSP: 002b:00007f5a0b446028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 00007f5a0c065fa0 RCX: 00007f5a0bdece59
  RDX: 000000000000029f RSI: 0000200000000200 RDI: 0000000000000004
  RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  R13: 00007f5a0c066038 R14: 00007f5a0c065fa0 R15: 00007ffe149206b8
  INFO: task syz.0.145:8539 blocked for more than 143 seconds.
        Not tainted syzkaller #0
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz.0.145 state:D stack:23704 pid:8539 tgid:8522 ppid:5850 task_flags:0x400140 flags:0x00080002
  Call Trace:
   context_switch kernel/sched/core.c:5402 [inline]
   __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
   __schedule_loop kernel/sched/core.c:7283 [inline]
   schedule+0x164/0x360 kernel/sched/core.c:7298
   wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:536
   start_transaction+0xbd8/0x1820 fs/btrfs/transaction.c:716
   clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline]
   btrfs_clone+0x1316/0x2540 fs/btrfs/reflink.c:574
   btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:795
   btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:948
   vfs_clone_file_range+0x435/0x7b0 fs/remap_range.c:403
   ioctl_file_clone fs/ioctl.c:239 [inline]
   ioctl_file_clone_range fs/ioctl.c:257 [inline]
   do_vfs_ioctl+0xe15/0x1540 fs/ioctl.c:544
   __do_sys_ioctl fs/ioctl.c:595 [inline]
   __se_sys_ioctl+0x82/0x170 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f5a0bdece59
  RSP: 002b:00007f5a0b425028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007f5a0c066090 RCX: 00007f5a0bdece59
  RDX: 00002000000000c0 RSI: 000000004020940d RDI: 0000000000000004
  RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  R13: 00007f5a0c066128 R14: 00007f5a0c066090 R15: 00007ffe149206b8

Reported-by: syzbot+c7443384724bb0f9e913@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a150a09.820a0220.e7972.0006.GAE@google.com/
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Boris Burkov
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba
---

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 0a4628b3007df..9a49d2ecb9494 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -179,10 +179,12 @@ static int clone_copy_inline_extent(struct btrfs_inode *inode,
 	struct btrfs_drop_extents_args drop_args = { 0 };
 	int ret;
 	struct btrfs_key key;
+	bool copied_inline_to_page = false;
 
 	if (new_key->offset > 0) {
 		ret = copy_inline_to_page(inode, new_key->offset,
 					  inline_data, size, datal, comp_type);
+		copied_inline_to_page = (ret == 0);
 		goto out;
 	}
 
@@ -288,6 +290,60 @@ copy_inline_extent:
 		btrfs_abort_transaction(trans, ret);
 out:
 	if (!ret && !trans) {
+		if (copied_inline_to_page &&
+		    new_key->offset + datal > i_size_read(&inode->vfs_inode)) {
+			/*
+			 * If we copied the inline extent data to a page/folio
+			 * beyond the i_size of the destination inode, then we
+			 * need to increase the i_size before we start a
+			 * transaction to update the inode item. This is to
+			 * prevent a deadlock when the flushoncommit mount
+			 * option is used, which happens like this:
+			 *
+			 * 1) Task A clones an inline extent from inode X to an
+			 *    offset of inode Y that is beyond Y's current
+			 *    i_size. This means we copied the inline extent's
+			 *    data to a folio of inode Y that is beyond its EOF,
+			 *    using the call above to copy_inline_to_page();
+			 *
+			 * 2) Task B starts a transaction commit and calls
+			 *    btrfs_start_delalloc_flush() to flush delalloc;
+			 *
+			 * 3) The delalloc flushing sees the new dirty folio of
+			 *    inode Y and when it attempts to flush it, it ends
+			 *    up at extent_writepage() and sees that the offset
+			 *    of the folio is beyond the i_size of inode Y, so
+			 *    it attempts to invalidate the folio by calling
+			 *    folio_invalidate(), which ends up at btrfs' folio
+			 *    invalidate callback - btrfs_invalidate_folio().
+			 *    There it tries to lock the folio's range in inode
+			 *    Y's extent io tree, but it blocks since it's
+			 *    currently locked by task A - during reflink we
+			 *    lock the inodes and the source and destination
+			 *    ranges after flushing all delalloc and waiting for
+			 *    ordered extent completion - after that we don't
+			 *    expect to have dirty folios in the ranges, the
+			 *    exception is if we have to copy an inline extent's
+			 *    data (because the destination offset is not zero);
+			 *
+			 * 4) Task A then does the 'goto out' below and attempts
+			 *    to start a transaction to update the inode item,
+			 *    and then it's blocked since the current
+			 *    transaction is in the TRANS_STATE_COMMIT_START
+			 *    state. Therefore task A has to wait for the
+			 *    current transaction to become unblocked (its
+			 *    state >= TRANS_STATE_UNBLOCKED).
+			 *
+			 * This leads to a deadlock - the task committing the
+			 * transaction waiting for the delalloc flushing which
+			 * is blocked during folio invalidation on the inode's
+			 * extent lock and the reflink task waiting for the
+			 * current transaction to be unblocked so that it can
+			 * start a new one to update the inode item (while
+			 * holding the extent lock).
+			 */
+			i_size_write(&inode->vfs_inode, new_key->offset + datal);
+		}
 		/*
 		 * No transaction here means we copied the inline extent into a
 		 * page of the destination inode.
@@ -320,50 +376,7 @@ copy_to_page:
 
 	ret = copy_inline_to_page(inode, new_key->offset,
 				  inline_data, size, datal, comp_type);
-
-	/*
-	 * If we copied the inline extent data to a page/folio beyond the i_size
-	 * of the destination inode, then we need to increase the i_size before
-	 * we start a transaction to update the inode item. This is to prevent a
-	 * deadlock when the flushoncommit mount option is used, which happens
-	 * like this:
-	 *
-	 * 1) Task A clones an inline extent from inode X to an offset of inode
-	 *    Y that is beyond Y's current i_size. This means we copied the
-	 *    inline extent's data to a folio of inode Y that is beyond its EOF,
-	 *    using the call above to copy_inline_to_page();
-	 *
-	 * 2) Task B starts a transaction commit and calls
-	 *    btrfs_start_delalloc_flush() to flush delalloc;
-	 *
-	 * 3) The delalloc flushing sees the new dirty folio of inode Y and when
-	 *    it attempts to flush it, it ends up at extent_writepage() and sees
-	 *    that the offset of the folio is beyond the i_size of inode Y, so
-	 *    it attempts to invalidate the folio by calling folio_invalidate(),
-	 *    which ends up at btrfs' folio invalidate callback -
-	 *    btrfs_invalidate_folio(). There it tries to lock the folio's range
-	 *    in inode Y's extent io tree, but it blocks since it's currently
-	 *    locked by task A - during reflink we lock the inodes and the
-	 *    source and destination ranges after flushing all delalloc and
-	 *    waiting for ordered extent completion - after that we don't expect
-	 *    to have dirty folios in the ranges, the exception is if we have to
-	 *    copy an inline extent's data (because the destination offset is
-	 *    not zero);
-	 *
-	 * 4) Task A then does the 'goto out' below and attempts to start a
-	 *    transaction to update the inode item, and then it's blocked since
-	 *    the current transaction is in the TRANS_STATE_COMMIT_START state.
-	 *    Therefore task A has to wait for the current transaction to become
-	 *    unblocked (its state >= TRANS_STATE_UNBLOCKED).
-	 *
-	 * This leads to a deadlock - the task committing the transaction
-	 * waiting for the delalloc flushing which is blocked during folio
-	 * invalidation on the inode's extent lock and the reflink task waiting
-	 * for the current transaction to be unblocked so that it can start a
-	 * a new one to update the inode item (while holding the extent lock).
-	 */
-	if (ret == 0 && new_key->offset + datal > i_size_read(&inode->vfs_inode))
-		i_size_write(&inode->vfs_inode, new_key->offset + datal);
+	copied_inline_to_page = (ret == 0);
 	goto out;
 }
 
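
Note (not part of the patch): the rule the fix enforces can be sketched as a small userspace model. This is only an illustration under stated assumptions. struct model_inode, model_update_i_size() and model_folio_beyond_eof() are hypothetical names that do not exist in the kernel, and the "at or beyond i_size" test is a simplified approximation of the EOF check that the code comment attributes to extent_writepage().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the destination inode; only i_size is modeled. */
struct model_inode {
	uint64_t i_size;
};

/*
 * Model of the check the patch places at the 'out' label: if inline data was
 * copied to a page and the copy ends past EOF, grow i_size before any
 * transaction is started. Returns true when i_size was updated.
 */
static bool model_update_i_size(struct model_inode *inode,
				bool copied_inline_to_page,
				uint64_t dest_offset, uint64_t datal)
{
	uint64_t end;

	/* Guard against wrap-around in this userspace model. */
	assert(dest_offset <= UINT64_MAX - datal);
	end = dest_offset + datal;

	if (copied_inline_to_page && end > inode->i_size) {
		inode->i_size = end;
		return true;
	}
	return false;
}

/*
 * Simplified model of the writeback side: a dirty folio that starts at or
 * beyond i_size gets invalidated instead of written, and invalidation is the
 * step that needs the extent lock held by the reflink task.
 */
static bool model_folio_beyond_eof(const struct model_inode *inode,
				   uint64_t folio_start)
{
	return folio_start >= inode->i_size;
}
```

In this model, a copy of 100 bytes to offset 4096 of an empty file leaves the folio at 4096 beyond EOF until model_update_i_size() runs. After the update the folio is inside i_size, so the writeback path would no longer take the invalidation branch that blocks on the extent lock.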