]> git.ipfire.org Git - thirdparty/kernel/stable-queue.git/blame - queue-4.4/btrfs-fix-race-between-ranged-fsync-and-writeback-of-adjacent-ranges.patch
4.4-stable patches
[thirdparty/kernel/stable-queue.git] / queue-4.4 / btrfs-fix-race-between-ranged-fsync-and-writeback-of-adjacent-ranges.patch
CommitLineData
4c58ed05
GKH
1From 0c713cbab6200b0ab6473b50435e450a6e1de85d Mon Sep 17 00:00:00 2001
2From: Filipe Manana <fdmanana@suse.com>
3Date: Mon, 6 May 2019 16:44:02 +0100
4Subject: Btrfs: fix race between ranged fsync and writeback of adjacent ranges
5
6From: Filipe Manana <fdmanana@suse.com>
7
8commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream.
9
10When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
11inode) that happens to be ranged, which happens during a msync() or writes
12for files opened with O_SYNC for example, we can end up with a corrupt log,
13due to different file extent items representing ranges that overlap with
14each other, or hit some assertion failures.
15
16When doing a ranged fsync we only flush delalloc and wait for ordered
17exents within that range. If while we are logging items from our inode
18ordered extents for adjacent ranges complete, we end up in a race that can
19make us insert the file extent items that overlap with others we logged
20previously and the assertion failures.
21
22For example, if tree-log.c:copy_items() receives a leaf that has the
23following file extents items, all with a length of 4K and therefore there
24is an implicit hole in the range 68K to 72K - 1:
25
26 (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
27
28It copies them to the log tree. However due to the need to detect implicit
29holes, it may release the path, in order to look at the previous leaf to
30detect an implicit hole, and then later it will search again in the tree
31for the first file extent item key, with the goal of locking again the
32leaf (which might have changed due to concurrent changes to other inodes).
33
34However when it locks again the leaf containing the first key, the key
35corresponding to the extent at offset 72K may not be there anymore since
36there is an ordered extent for that range that is finishing (that is,
37somewhere in the middle of btrfs_finish_ordered_io()), and it just
38removed the file extent item but has not yet replaced it with a new file
39extent item, so the part of copy_items() that does hole detection will
40decide that there is a hole in the range starting from 68K to 76K - 1,
41and therefore insert a file extent item to represent that hole, having
42a key offset of 68K. After that we now have a log tree with 2 different
43extent items that have overlapping ranges:
44
45 1) The file extent item copied before copy_items() released the path,
46 which has a key offset of 72K and a length of 4K, representing the
47 file range 72K to 76K - 1.
48
49 2) And a file extent item representing a hole that has a key offset of
50 68K and a length of 8K, representing the range 68K to 76K - 1. This
51 item was inserted after releasing the path, and overlaps with the
52 extent item inserted before.
53
54The overlapping extent items can cause all sorts of unpredictable and
55incorrect behaviour, either when replayed or if a fast (non full) fsync
56happens later, which can trigger a BUG_ON() when calling
57btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
58trace like the following:
59
60 [61666.783269] ------------[ cut here ]------------
61 [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
62 [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
63 (...)
64 [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
65 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
66 [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
67 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
68 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
69 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
70 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
71 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
72 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
73 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
74 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
75 [61666.786253] Call Trace:
76 [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs]
77 [61666.786253] ? time_hardirqs_on+0x9/0x14
78 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
79 [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs]
80 [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs]
81 [61666.786253] ? lock_acquire+0x131/0x1c5
82 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
83 [61666.786253] ? arch_local_irq_save+0x9/0xc
84 [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
85 [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs]
86 [61666.786253] ? arch_local_irq_save+0x9/0xc
87 [61666.786253] ? lockref_get_not_zero+0x2c/0x34
88 [61666.786253] ? rcu_read_unlock+0x3e/0x5d
89 [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs]
90 [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs]
91 [61666.786253] vfs_fsync_range+0x8c/0x9e
92 [61666.786253] SyS_msync+0x13c/0x1c9
93 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad
94
95A sample of a corrupt log tree leaf with overlapping extents I got from
96running btrfs/072:
97
98 item 14 key (295 108 200704) itemoff 2599 itemsize 53
99 extent data disk bytenr 0 nr 0
100 extent data offset 0 nr 458752 ram 458752
101 item 15 key (295 108 659456) itemoff 2546 itemsize 53
102 extent data disk bytenr 4343541760 nr 770048
103 extent data offset 606208 nr 163840 ram 770048
104 item 16 key (295 108 663552) itemoff 2493 itemsize 53
105 extent data disk bytenr 4343541760 nr 770048
106 extent data offset 610304 nr 155648 ram 770048
107 item 17 key (295 108 819200) itemoff 2440 itemsize 53
108 extent data disk bytenr 4334788608 nr 4096
109 extent data offset 0 nr 4096 ram 4096
110
111The file extent item at offset 659456 (item 15) ends at offset 823296
112(659456 + 163840) while the next file extent item (item 16) starts at
113offset 663552.
114
115Another different problem that the race can trigger is a failure in the
116assertions at tree-log.c:copy_items(), which expect that the first file
117extent item key we found before releasing the path exists after we have
118released path and that the last key we found before releasing the path
119also exists after releasing the path:
120
121 $ cat -n fs/btrfs/tree-log.c
122 4080 if (need_find_last_extent) {
123 4081 /* btrfs_prev_leaf could return 1 without releasing the path */
124 4082 btrfs_release_path(src_path);
125 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key,
126 4084 src_path, 0, 0);
127 4085 if (ret < 0)
128 4086 return ret;
129 4087 ASSERT(ret == 0);
130 (...)
131 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) {
132 4104 ret = btrfs_next_leaf(inode->root, src_path);
133 4105 if (ret < 0)
134 4106 return ret;
135 4107 ASSERT(ret == 0);
136 4108 src = src_path->nodes[0];
137 4109 i = 0;
138 4110 need_find_last_extent = true;
139 4111 }
140 (...)
141
142The second assertion implicitly expects that the last key before the path
143release still exists, because the surrounding while loop only stops after
144we have found that key. When this assertion fails it produces a stack like
145this:
146
147 [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
148 [139590.037406] ------------[ cut here ]------------
149 [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
150 [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
151 [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1
152 (...)
153 [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
154 (...)
155 [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
156 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
157 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
158 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
159 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
160 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
161 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
162 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
163 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
164 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
165 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
166 [139590.044250] Call Trace:
167 [139590.044631] copy_items+0xa3f/0x1000 [btrfs]
168 [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
169 [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs]
170 [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
171 [139590.046143] ? do_raw_spin_unlock+0x49/0xc0
172 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
173 [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs]
174 [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
175 [139590.047592] __vfs_write+0x129/0x1c0
176 [139590.047932] vfs_write+0xc2/0x1b0
177 [139590.048270] ksys_write+0x55/0xc0
178 [139590.048608] do_syscall_64+0x60/0x1b0
179 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe
180 [139590.049287] RIP: 0033:0x7f2efc4be190
181 (...)
182 [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
183 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
184 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
185 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
186 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
187 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
188 (...)
189 [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
190
191So fix this race between a full ranged fsync and writeback of adjacent
192ranges by flushing all delalloc and waiting for all ordered extents to
193complete before logging the inode. This is the simplest way to solve the
194problem because currently the full fsync path does not deal with ranges
195at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
196to look at adjacent ranges for hole detection. For use cases of ranged
197fsyncs this can make a few fsyncs slower but on the other hand it can
198make some following fsyncs to other ranges do less work or no need to do
199anything at all. A full fsync is rare anyway and happens only once after
200loading/creating an inode and once after less common operations such as a
201shrinking truncate.
202
203This is an issue that exists for a long time, and was often triggered by
204generic/127, because it does mmap'ed writes and msync (which triggers a
205ranged fsync). Adding support for the tree checker to detect overlapping
206extents (next patch in the series) and trigger a WARN() when such cases
207are found, and then calling btrfs_check_leaf_full() at the end of
208btrfs_insert_file_extent() made the issue much easier to detect. Running
209btrfs/072 with that change to the tree checker and making fsstress open
210files always with O_SYNC made it much easier to trigger the issue (as
211triggering it with generic/127 is very rare).
212
213CC: stable@vger.kernel.org # 3.16+
214Reviewed-by: Josef Bacik <josef@toxicpanda.com>
215Signed-off-by: Filipe Manana <fdmanana@suse.com>
216Signed-off-by: David Sterba <dsterba@suse.com>
217Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
218
219---
220 fs/btrfs/file.c | 12 ++++++++++++
221 1 file changed, 12 insertions(+)
222
223--- a/fs/btrfs/file.c
224+++ b/fs/btrfs/file.c
225@@ -1901,6 +1901,18 @@ int btrfs_sync_file(struct file *file, l
226 u64 len;
227
228 /*
229+ * If the inode needs a full sync, make sure we use a full range to
230+ * avoid log tree corruption, due to hole detection racing with ordered
231+ * extent completion for adjacent ranges, and assertion failures during
232+ * hole detection.
233+ */
234+ if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
235+ &BTRFS_I(inode)->runtime_flags)) {
236+ start = 0;
237+ end = LLONG_MAX;
238+ }
239+
240+ /*
241 * The range length can be represented by u64, we have to do the typecasts
242 * to avoid signed overflow if it's [0, LLONG_MAX] eg. from fsync()
243 */