]>
Commit | Line | Data |
---|---|---|
4c58ed05 GKH |
1 | From 0c713cbab6200b0ab6473b50435e450a6e1de85d Mon Sep 17 00:00:00 2001 |
2 | From: Filipe Manana <fdmanana@suse.com> | |
3 | Date: Mon, 6 May 2019 16:44:02 +0100 | |
4 | Subject: Btrfs: fix race between ranged fsync and writeback of adjacent ranges | |
5 | ||
6 | From: Filipe Manana <fdmanana@suse.com> | |
7 | ||
8 | commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream. | |
9 | ||
10 | When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the | |
11 | inode) that happens to be ranged, which happens during a msync() or writes | |
12 | for files opened with O_SYNC for example, we can end up with a corrupt log, | |
13 | due to different file extent items representing ranges that overlap with | |
14 | each other, or hit some assertion failures. | |
15 | ||
16 | When doing a ranged fsync we only flush delalloc and wait for ordered | |
17 | exents within that range. If while we are logging items from our inode | |
18 | ordered extents for adjacent ranges complete, we end up in a race that can | |
19 | make us insert the file extent items that overlap with others we logged | |
20 | previously and the assertion failures. | |
21 | ||
22 | For example, if tree-log.c:copy_items() receives a leaf that has the | |
23 | following file extents items, all with a length of 4K and therefore there | |
24 | is an implicit hole in the range 68K to 72K - 1: | |
25 | ||
26 | (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... | |
27 | ||
28 | It copies them to the log tree. However due to the need to detect implicit | |
29 | holes, it may release the path, in order to look at the previous leaf to | |
30 | detect an implicit hole, and then later it will search again in the tree | |
31 | for the first file extent item key, with the goal of locking again the | |
32 | leaf (which might have changed due to concurrent changes to other inodes). | |
33 | ||
34 | However when it locks again the leaf containing the first key, the key | |
35 | corresponding to the extent at offset 72K may not be there anymore since | |
36 | there is an ordered extent for that range that is finishing (that is, | |
37 | somewhere in the middle of btrfs_finish_ordered_io()), and it just | |
38 | removed the file extent item but has not yet replaced it with a new file | |
39 | extent item, so the part of copy_items() that does hole detection will | |
40 | decide that there is a hole in the range starting from 68K to 76K - 1, | |
41 | and therefore insert a file extent item to represent that hole, having | |
42 | a key offset of 68K. After that we now have a log tree with 2 different | |
43 | extent items that have overlapping ranges: | |
44 | ||
45 | 1) The file extent item copied before copy_items() released the path, | |
46 | which has a key offset of 72K and a length of 4K, representing the | |
47 | file range 72K to 76K - 1. | |
48 | ||
49 | 2) And a file extent item representing a hole that has a key offset of | |
50 | 68K and a length of 8K, representing the range 68K to 76K - 1. This | |
51 | item was inserted after releasing the path, and overlaps with the | |
52 | extent item inserted before. | |
53 | ||
54 | The overlapping extent items can cause all sorts of unpredictable and | |
55 | incorrect behaviour, either when replayed or if a fast (non full) fsync | |
56 | happens later, which can trigger a BUG_ON() when calling | |
57 | btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a | |
58 | trace like the following: | |
59 | ||
60 | [61666.783269] ------------[ cut here ]------------ | |
61 | [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! | |
62 | [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP | |
63 | (...) | |
64 | [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 | |
65 | [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] | |
66 | [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 | |
67 | [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 | |
68 | [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 | |
69 | [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 | |
70 | [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 | |
71 | [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 | |
72 | [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 | |
73 | [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | |
74 | [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 | |
75 | [61666.786253] Call Trace: | |
76 | [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] | |
77 | [61666.786253] ? time_hardirqs_on+0x9/0x14 | |
78 | [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] | |
79 | [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] | |
80 | [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] | |
81 | [61666.786253] ? lock_acquire+0x131/0x1c5 | |
82 | [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] | |
83 | [61666.786253] ? arch_local_irq_save+0x9/0xc | |
84 | [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] | |
85 | [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] | |
86 | [61666.786253] ? arch_local_irq_save+0x9/0xc | |
87 | [61666.786253] ? lockref_get_not_zero+0x2c/0x34 | |
88 | [61666.786253] ? rcu_read_unlock+0x3e/0x5d | |
89 | [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] | |
90 | [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] | |
91 | [61666.786253] vfs_fsync_range+0x8c/0x9e | |
92 | [61666.786253] SyS_msync+0x13c/0x1c9 | |
93 | [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad | |
94 | ||
95 | A sample of a corrupt log tree leaf with overlapping extents I got from | |
96 | running btrfs/072: | |
97 | ||
98 | item 14 key (295 108 200704) itemoff 2599 itemsize 53 | |
99 | extent data disk bytenr 0 nr 0 | |
100 | extent data offset 0 nr 458752 ram 458752 | |
101 | item 15 key (295 108 659456) itemoff 2546 itemsize 53 | |
102 | extent data disk bytenr 4343541760 nr 770048 | |
103 | extent data offset 606208 nr 163840 ram 770048 | |
104 | item 16 key (295 108 663552) itemoff 2493 itemsize 53 | |
105 | extent data disk bytenr 4343541760 nr 770048 | |
106 | extent data offset 610304 nr 155648 ram 770048 | |
107 | item 17 key (295 108 819200) itemoff 2440 itemsize 53 | |
108 | extent data disk bytenr 4334788608 nr 4096 | |
109 | extent data offset 0 nr 4096 ram 4096 | |
110 | ||
111 | The file extent item at offset 659456 (item 15) ends at offset 823296 | |
112 | (659456 + 163840) while the next file extent item (item 16) starts at | |
113 | offset 663552. | |
114 | ||
115 | Another different problem that the race can trigger is a failure in the | |
116 | assertions at tree-log.c:copy_items(), which expect that the first file | |
117 | extent item key we found before releasing the path exists after we have | |
118 | released path and that the last key we found before releasing the path | |
119 | also exists after releasing the path: | |
120 | ||
121 | $ cat -n fs/btrfs/tree-log.c | |
122 | 4080 if (need_find_last_extent) { | |
123 | 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ | |
124 | 4082 btrfs_release_path(src_path); | |
125 | 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, | |
126 | 4084 src_path, 0, 0); | |
127 | 4085 if (ret < 0) | |
128 | 4086 return ret; | |
129 | 4087 ASSERT(ret == 0); | |
130 | (...) | |
131 | 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { | |
132 | 4104 ret = btrfs_next_leaf(inode->root, src_path); | |
133 | 4105 if (ret < 0) | |
134 | 4106 return ret; | |
135 | 4107 ASSERT(ret == 0); | |
136 | 4108 src = src_path->nodes[0]; | |
137 | 4109 i = 0; | |
138 | 4110 need_find_last_extent = true; | |
139 | 4111 } | |
140 | (...) | |
141 | ||
142 | The second assertion implicitly expects that the last key before the path | |
143 | release still exists, because the surrounding while loop only stops after | |
144 | we have found that key. When this assertion fails it produces a stack like | |
145 | this: | |
146 | ||
147 | [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 | |
148 | [139590.037406] ------------[ cut here ]------------ | |
149 | [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! | |
150 | [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI | |
151 | [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 | |
152 | (...) | |
153 | [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] | |
154 | (...) | |
155 | [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 | |
156 | [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 | |
157 | [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 | |
158 | [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 | |
159 | [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 | |
160 | [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 | |
161 | [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 | |
162 | [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | |
163 | [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 | |
164 | [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | |
165 | [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 | |
166 | [139590.044250] Call Trace: | |
167 | [139590.044631] copy_items+0xa3f/0x1000 [btrfs] | |
168 | [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] | |
169 | [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] | |
170 | [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] | |
171 | [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 | |
172 | [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] | |
173 | [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] | |
174 | [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] | |
175 | [139590.047592] __vfs_write+0x129/0x1c0 | |
176 | [139590.047932] vfs_write+0xc2/0x1b0 | |
177 | [139590.048270] ksys_write+0x55/0xc0 | |
178 | [139590.048608] do_syscall_64+0x60/0x1b0 | |
179 | [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe | |
180 | [139590.049287] RIP: 0033:0x7f2efc4be190 | |
181 | (...) | |
182 | [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 | |
183 | [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 | |
184 | [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 | |
185 | [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 | |
186 | [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 | |
187 | [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 | |
188 | (...) | |
189 | [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- | |
190 | ||
191 | So fix this race between a full ranged fsync and writeback of adjacent | |
192 | ranges by flushing all delalloc and waiting for all ordered extents to | |
193 | complete before logging the inode. This is the simplest way to solve the | |
194 | problem because currently the full fsync path does not deal with ranges | |
195 | at all (it assumes a full range from 0 to LLONG_MAX) and it always needs | |
196 | to look at adjacent ranges for hole detection. For use cases of ranged | |
197 | fsyncs this can make a few fsyncs slower but on the other hand it can | |
198 | make some following fsyncs to other ranges do less work or no need to do | |
199 | anything at all. A full fsync is rare anyway and happens only once after | |
200 | loading/creating an inode and once after less common operations such as a | |
201 | shrinking truncate. | |
202 | ||
203 | This is an issue that exists for a long time, and was often triggered by | |
204 | generic/127, because it does mmap'ed writes and msync (which triggers a | |
205 | ranged fsync). Adding support for the tree checker to detect overlapping | |
206 | extents (next patch in the series) and trigger a WARN() when such cases | |
207 | are found, and then calling btrfs_check_leaf_full() at the end of | |
208 | btrfs_insert_file_extent() made the issue much easier to detect. Running | |
209 | btrfs/072 with that change to the tree checker and making fsstress open | |
210 | files always with O_SYNC made it much easier to trigger the issue (as | |
211 | triggering it with generic/127 is very rare). | |
212 | ||
213 | CC: stable@vger.kernel.org # 3.16+ | |
214 | Reviewed-by: Josef Bacik <josef@toxicpanda.com> | |
215 | Signed-off-by: Filipe Manana <fdmanana@suse.com> | |
216 | Signed-off-by: David Sterba <dsterba@suse.com> | |
217 | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | |
218 | ||
219 | --- | |
220 | fs/btrfs/file.c | 12 ++++++++++++ | |
221 | 1 file changed, 12 insertions(+) | |
222 | ||
223 | --- a/fs/btrfs/file.c | |
224 | +++ b/fs/btrfs/file.c | |
225 | @@ -1901,6 +1901,18 @@ int btrfs_sync_file(struct file *file, l | |
226 | u64 len; | |
227 | ||
228 | /* | |
229 | + * If the inode needs a full sync, make sure we use a full range to | |
230 | + * avoid log tree corruption, due to hole detection racing with ordered | |
231 | + * extent completion for adjacent ranges, and assertion failures during | |
232 | + * hole detection. | |
233 | + */ | |
234 | + if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, | |
235 | + &BTRFS_I(inode)->runtime_flags)) { | |
236 | + start = 0; | |
237 | + end = LLONG_MAX; | |
238 | + } | |
239 | + | |
240 | + /* | |
241 | * The range length can be represented by u64, we have to do the typecasts | |
242 | * to avoid signed overflow if it's [0, LLONG_MAX] eg. from fsync() | |
243 | */ |