Commit | Line | Data |
---|---|---|
959ef981 DC |
1 | # SPDX-License-Identifier: GPL-2.0 |
2 | ||
2bd0ea18 NS |
3 | A living document. The basic algorithm. |
4 | ||
5 | TODO: (D == DONE) | |
6 | ||
7 | 0) Need to bring some sanity into the case of flags that can | |
8 | be set in the secondaries at mkfs time but reset or cleared | |
9 | in the primary later in the filesystem's life. | |
10 | ||
11 | 0) Clear the persistent read-only bit if set. Clear the | |
12 | shared bit if set and the version number is zero. This | |
13 | brings the filesystem back to a known state. | |
14 | ||
15 | 0) make sure that superblock geometry code checks the logstart | |
16 | value against whether or not we have an internal log. | |
17 | If we have an internal log and a logdev, that's ok. | |
18 | (Maybe we just aren't using it). If we have an external | |
19 | log (logstart == 0) but no logdev, that's right out. | |
20 | ||
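The logstart/logdev rule above can be sketched as a small classifier. This is only an illustration: `check_log_config`, its parameters, and the returned labels are hypothetical names, not part of the repair code.

```python
def check_log_config(logstart: int, have_logdev: bool) -> str:
    """Classify the log configuration described in the TODO item.

    logstart != 0 means the superblock describes an internal log;
    logstart == 0 means the log is external.
    """
    if logstart != 0:
        # Internal log. A logdev supplied anyway is harmless; it is
        # simply not used.
        return "internal-logdev-ignored" if have_logdev else "internal"
    if have_logdev:
        return "external"
    # External log but nowhere to find it: this is the fatal case.
    raise ValueError("external log (logstart == 0) but no logdev given")
```

Only the last combination is rejected; the other three are accepted as described in the item.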
21 | 0) write secondary superblock search code. Rewrite initial | |
22 | superblock parsing code to be less complicated. Just | |
23 | use variables to indicate primary, secondary, etc., | |
24 | and use a function to get the SB given a specific location | |
25 | or something. | |
26 | ||
27 | 2) For inode alignment, if the SB bit is set and the | |
28 | inode alignment size field in the SB is set, then | |
29 | believe that the fs inodes MUST be aligned and | |
30 | disallow any non-aligned inodes. Likewise, if | |
31 | the SB bit isn't set (or earlier version) and | |
32 | the inode alignment size field is zero, then | |
33 | never set the bit even if the inodes are aligned. | |
34 | Note that the bits and alignment values are | |
35 | replicated in the secondary superblocks. | |
36 | ||
37 | 0) add feature specification options to parse_arguments | |
38 | ||
39 | 0) add logic to add_inode_ref(), add_inode_reached() | |
40 | to detect nlink overflows in cases where the fs | |
41 | (or user had indicated fs) doesn't support new nlinks. | |
42 | ||
43 | 6) check to make sure that the inodes containing btree blocks | |
44 | with # recs < minrecs aren't legit -- e.g. the only | |
45 | descendant of a root block. | |
46 | ||
47 | 7) inode di_size value sanity checking -- should never be greater than |
48 | the biggest filebno offset mentioned in the bmaps. Doesn't | |
49 | have to be equal though since we're allowed to overallocate | |
50 | (it just wastes a little space). This is for both regular | |
51 | files and directories (have to modify the existing directory | |
52 | check). | |
53 | ||
54 | Add tracking of largest offset in bmap scanning code. Compare | |
55 | value against di_size. Should be >= di_size. | |
56 | ||
57 | Alternatively, you could pass the inode down through |
58 | the extent record processing layer and make the checks | |
59 | there. | |
60 | ||
61 | Add knowledge of quota inodes. size of quota inode is | |
62 | always zero. We should maintain that. | |
63 | ||
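A minimal sketch of the di_size check in item 7, assuming the bmap scan yields extents as (file block offset, block count) pairs. The function names and the extent representation are assumptions made for illustration.

```python
def max_mapped_bytes(extents, blocksize):
    """Largest byte offset covered by any extent in the bmap.

    extents: iterable of (file_block_offset, block_count) pairs.
    """
    end_block = max((off + cnt for off, cnt in extents), default=0)
    return end_block * blocksize


def di_size_is_sane(di_size, extents, blocksize):
    # Overallocation is allowed, so the mapped range may exceed di_size,
    # but di_size must not exceed the mapped range.
    return di_size <= max_mapped_bytes(extents, blocksize)
```

An inode with no extents, such as a quota inode whose size is always zero, passes the check only when its di_size is 0.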
64 | 8) Basic quota stuff. | |
65 | ||
66 | Invariants | |
67 | if quota feature bit is set, the quota inodes | |
68 | if set, should point to disconnected, 0 len inodes. | |
69 | ||
70 | D - if quota inodes exist, the quota bits must be | |
71 | turned on. It's ok for the quota flags to be | |
72 | zeroed but they should be in a legal state | |
73 | (see xfs_quota.h). | |
74 | ||
dfc130f3 | 75 | D - if the quota flags are non-zero, the corresponding |
2bd0ea18 NS |
76 | quota inodes must exist. |
77 | ||
78 | quota inodes are never deleted, only their space | |
79 | is freed. | |
80 | ||
81 | if quotas are being downgraded, then check quota inodes | |
82 | at the end of phase 3. If they haven't been cleared yet, | |
83 | clear them. Regardless, then clear sb flags (quota inode | |
84 | fields, quota flags, and quota bit). | |
85 | ||
86 | ||
87 | 5) look at verify_inode_chunk(). it's probably really broken. | |
88 | ||
89 | ||
90 | 9) Complicated quota stuff. Add code to bmap scan code to | |
91 | track used blocks. Add another pair of AVL trees | |
92 | to track user and project quota limits. Set AVL | |
93 | trees up at the beginning of phase 3. Quota inodes | |
94 | can be rebuilt or corrected later if damaged. | |
95 | ||
96 | ||
97 | D - 0) fix directory processing. phase 3, if an entry references | |
98 | a free inode, *don't* mark it used. wait for the rest of | |
99 | phase 3 processing to hit that inode. If it looks like it's | |
100 | in use, we'll mark in use then. If not, we'll clear it and | |
101 | mark the inode map. then in phase 4, you can depend on the | |
102 | inode map. should probably set the parent info in phase 4. | |
103 | So we have a check_dups flag. Maybe we should change the | |
104 | name of check_dir to discover_inodes. During phase 3 | |
105 | (discover_inodes == 1), uncertain inodes are added to list. | |
106 | During phase 4 (discover_inodes == 0), they aren't. And | |
107 | we never mark inodes in use from the directory code. | |
108 | During phase 4, we shouldn't complain about names with | |
109 | a leading '/' since we made those names in phase 3. | |
110 | ||
111 | Have to change dino_chunks.c (parent setting), dinode.c | |
112 | and dir.c. | |
113 | ||
114 | D - 0) make sure we don't screw up filesystems with real-time inodes. | |
115 | remember to initialize real-time map with all blocks XR_E_FREE. | |
116 | ||
117 | D - 4) check contents of symlinks as well as lengths in process_symlinks() | |
118 | in dinode.c. Right now, we only check lengths. | |
119 | ||
120 | ||
121 | D - 1) Feature mismatches -- for quotas and attributes, | |
122 | if the stuff exists in the filesystem, set the | |
123 | superblock version bits. | |
124 | ||
125 | D - 0) rewrite directory leaf block holemap comparison code. | |
126 | probably should just check the leaf block hole info | |
127 | against our incore bitmap. If the hole flag is not | |
128 | set, then we know that there can only be one hole and | |
129 | it has to be between the entry table and the top of heap. | |
130 | If the hole flag is set, then it's ok if the on-disk | |
131 | holemap doesn't describe everything as long as what | |
132 | it does describe doesn't conflict with reality. | |
133 | ||
134 | D - 0) rewrite setting nlinks handling -- for version 1 | |
22bc10ed | 135 | inodes, set both nlinks and onlinks (zero projid_lo/hi |
2bd0ea18 NS |
136 | and pad) if we have to change anything. For |
137 | version 2, I think we're ok. | |
138 | ||
139 | D - 0) Put awareness of quota inode into mark_standalone_inodes. | |
140 | ||
141 | ||
142 | D - 8) redo handling of superblocks with bad version numbers. need | |
143 | to bail out (without harming) fs's that have sbs that | |
144 | are newer than we are. | |
145 | ||
146 | D - 0) How do we handle feature mismatches between fs and | |
147 | superblock? For nlink, check each inode after you | |
148 | know it's good. If onlinks is 0 and nlinks is > 0 | |
149 | and it's a version 2 inode, then it really is a version | |
150 | 2 inode and the nlinks flag in the SB needs to be set. | |
151 | If it's a version 2 inode and the SB agrees but onlink | |
152 | is non-zero, then clear onlink. | |
153 | ||
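The nlink rule above could look roughly like this. `Inode` is a hypothetical stand-in for the on-disk inode fields mentioned in the item, and `reconcile_nlink` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    # Hypothetical stand-in for the fields named in the note.
    version: int
    onlink: int
    nlink: int


def reconcile_nlink(ino, sb_has_nlink):
    """Return the (possibly updated) superblock nlink flag.

    May clear ino.onlink as a side effect.
    """
    if ino.version != 2:
        return sb_has_nlink
    if ino.onlink == 0 and ino.nlink > 0:
        # A genuine version 2 inode: the SB flag has to be set.
        return True
    if sb_has_nlink and ino.onlink != 0:
        # SB already agrees; a stale onlink is cleared.
        ino.onlink = 0
    return sb_has_nlink
```

The check is meant to run on each inode only after that inode is known to be good, as the item says.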
154 | D - 3) keep cumulative counts of freeblocks, inodes, etc. to set in | |
155 | the superblock at the end of phase 5. Remember that | |
156 | agf freeblock counters don't include blocks used by | |
157 | the non-root levels of the freespace trees but that | |
158 | the sb free block counters include those. | |
159 | ||
160 | D - 0) Do parent setting in directory code (called by phase 3). | |
161 | actually, I put it in process_inode_set and propagated | |
162 | the parent up to it from the process_dinode/process_dir | |
163 | routines. seemed cleaner than pushing the irec down | |
164 | and letting them bang on it. | |
165 | ||
166 | D - 0) If we clear a file in phase 4, make sure that if it's | |
167 | a directory that the parent info is cleared also. | |
168 | ||
169 | D - 0) put inode tree flashover (call to add_ino_backptrs) into phase 5. | |
170 | ||
171 | D - 0) do set/get_inode_parent functions in incore_ino.c. | |
172 | also do is/set/ inode_processed. | |
dfc130f3 | 173 | |
2bd0ea18 NS |
174 | D - 0) do a versions.c to extract feature info and set global vars |
175 | from the superblock version number and possibly feature bits | |
176 | ||
177 | D - 0) change longform_dir_entry_check + shortform_dir_entry_check | |
178 | to return a count of how many illegal '/' entries exist. | |
179 | if > 0, then process_dirstack needs to call prune_dir_entry | |
180 | with a hash value of 0 to delete the entries. | |
181 | ||
182 | D - 0) add the "processed" bitfield | |
183 | to the backptrs_t struct that gets attached after | |
184 | phase 4. | |
185 | ||
186 | D- ) Phase 6 !!! | |
187 | ||
188 | D - 0) look at usage of XFS_MAKE_IPTR(). It does the right | |
189 | arithmetic assuming you count your offsets from the | |
190 | beginning of the buffer. | |
191 | ||
192 | ||
193 | D - 0) look at references to XFS_INODES_PER_CHUNK. change the | |
14f8b681 | 194 | ones that really mean sizeof(uint64_t)*NBBY to |
2bd0ea18 NS |
195 | something else (e.g. a separately defined constant such as |
196 | INOS_PER_IREC). this isn't as important since |
197 | XFS_INODES_PER_CHUNK will never change |
198 | ||
199 | ||
200 | D - 0) look at junk_zerolen_dir_leaf_entries() to make sure it isn't hosing | |
201 | the freemap since it assumed that bytes between the | |
202 | end of the table and firstused didn't show up in the | |
203 | freemap when they actually do. | |
204 | ||
205 | D - 0) track down XFS_INO_TO_OFFSET() usage. I don't think I'm | |
206 | using it right. (e.g. I think | |
207 | it gives you the offset of an inode into a block but | |
208 | on small block filesystems, I may be reading in inodes | |
209 | in multiblock buffers and working from the start of | |
210 | the buffer plus I'm using it to get offsets into | |
211 | my ino_rec's which may not be a good idea since I | |
212 | use 64-inode ino_rec's whereas the offset macro | |
213 | works off blocksize). | |
214 | ||
215 | D - 0.0) put buffer -> dirblock conversion macros into xfs kernel code | |
216 | ||
217 | D - 0.2) put in sibling pointer checking and path fixup into | |
218 | bmap (long form) scan routines in scan.c | |
219 | D - 0.3) find out if bmap btrees with only root blocks are legal. I'm | |
220 | betting that they're not because they'd be extent inodes | |
221 | instead. If that's the case, rip some code out of | |
222 | process_btinode() | |
223 | ||
224 | ||
225 | Algorithm (XXX means not done yet): | |
226 | ||
227 | Phase 1 -- get a superblock and zero log | |
228 | ||
229 | get a superblock -- either read in primary or | |
230 | find a secondary (ag header), check ag headers | |
231 | ||
232 | To find secondary: | |
233 | ||
234 | Go for brute force and read in the filesystem N meg | |
235 | at a time looking for a superblock. as a | |
236 | slight optimization, we could maybe skip | |
237 | ahead some number of blocks to try and get | |
238 | towards the end of the first ag. | |
239 | ||
240 | After you find a secondary, try and find at least | |
241 | other ags as a verification that the | |
242 | secondary is a good superblock. | |
243 | ||
244 | XXX - Ugh. Have to take growfs'ed filesystems into account. | |
245 | The root superblock geometry info may not be right if | |
246 | recovery hasn't run or it's been trashed. The old ag's | |
247 | may or may not be right since the system could have crashed | |
248 | during growfs or the bwrite() to the superblocks could have | |
249 | failed and the buffer been reused. So we need to check | |
250 | to see if another ag exists beyond the "last" ag | |
251 | to see if a growfs happened. If not, then we know that | |
252 | the geometry info is good and treat the fs as a non-growfs'ed | |
253 | fs. If we do have inconsistencies, then the smaller geometry | |
254 | is the old fs and the larger the new. We can check the | |
255 | new superblocks to see if they're good. If not, then we | |
256 | know the system crashed at or soon after the growfs and | |
257 | we can choose to either accept the new geometry info or | |
258 | trash it and truncate the fs back to the old geometry | |
259 | parameters. | |
260 | ||
261 | Cross-check geometry information in secondary sb's with | |
262 | primary to ensure that it's correct. | |
263 | ||
264 | Use sim code to allow mount filesystems *without* reading | |
265 | in root inode. This sets up the xfs_mount_t structure | |
266 | and allows us to use XFS_* macros that we wouldn't | |
267 | otherwise be able to use. | |
268 | ||
269 | Note, I split phase 1 and 2 into separate pieces because I want | |
270 | to initialize the xfs_repair incore data structures after phase 1. | |
271 | ||
272 | parse superblock version and feature flags and set appropriate | |
273 | global vars to reflect the flags (attributes, quotas, etc.) | |
274 | ||
275 | Workaround for the mkfs "not zeroing the superblock buffer" bug. | |
276 | Determine what field is the last valid non-zero field in | |
277 | the superblock. The trick here is to be able to differentiate | |
278 | the last valid non-zero field in the primary superblock and | |
279 | secondaries because they may not be the same. Fields in | |
280 | the primary can be set as the filesystem gets upgraded but | |
281 | the upgrades won't touch the secondaries. This means that | |
282 | we need to find some number of secondaries and check them. | |
283 | So we do the checking here and the setting in phase2. | |
284 | ||
285 | Phase 2 -- check integrity of allocation group allocation structures | |
286 | ||
287 | zero the log if in no modify mode | |
288 | ||
289 | sanity check ag headers -- superblocks match, agi isn't | |
290 | trashed -- the agf and agfl | |
291 | don't really matter because we can | |
292 | just recreate them later. | |
293 | ||
294 | Zero part of the superblock buffer if necessary | |
295 | ||
296 | Walk the freeblock trees to get an | |
297 | initial idea of what the fs thinks is free. | |
298 | Files that disagree (claim free'd blocks) | |
299 | can be salvaged or deleted. If the btree is | |
300 | internally inconsistent, when in doubt, mark | |
301 | blocks free. If they're used, they'll be stolen | |
302 | back later. don't have to check sibling pointers | |
303 | for each level since we're going to regenerate | |
304 | all the trees anyway. | |
305 | Walk the inode allocation trees and | |
306 | make sure they're ok, otherwise the sim | |
307 | inode routines will probably just barf. | |
308 | mark inode allocation tree blocks and ag header | |
309 | blocks as used blocks. If the trees are | |
310 | corrupted, this phase will generate "uncertain" | |
311 | inode chunks. Those chunks go on a list and | |
312 | will have to be verified later. Record the blocks |
313 | that are used to detect corruption and multiply | |
314 | claimed blocks. These trees will be regenerated | |
315 | later. Mark the blocks containing inodes referenced | |
316 | by uncorrupted inode trees as being used by inodes. | |
317 | The other blocks will get marked when/if the inodes | |
318 | are verified. | |
319 | ||
320 | calculate root and realtime inode numbers from the | |
321 | filesystem geometry, fix up mount structure's | |
322 | incore superblock if they're wrong. | |
323 | ||
324 | ASSUMPTION: at end of phase 2, we've got superblocks and ag headers | |
325 | that are not garbage (some data in them like counters and the | |
326 | freeblock and inode trees may be inconsistent but the header | |
327 | is readable and otherwise makes sense). | |
328 | ||
329 | XXX if in no_modify mode, check for blocks claimed by one freespace | |
330 | btree and not the other | |
dfc130f3 | 331 | |
2bd0ea18 NS |
332 | Phase 3 -- traverse inodes to make the inodes, bmaps and freespace maps |
333 | consistent. For each ag, use either the incore inode map or | |
334 | scan the ag for inodes. | |
335 | Let's use the incore inode map, now that we've made one | |
336 | up in phase2. If we lose the maps, we'll locate inodes | |
337 | when we traverse the directory hierarchy. If we lose both, |
338 | we could scan the disk. Ugh. Maybe make that a command-line | |
339 | option that we support later. | |
dfc130f3 | 340 | |
2bd0ea18 NS |
341 | ASSUMPTION: we know if the ag allocation btrees are intact (phase 2) |
342 | ||
343 | First - Walk and clear the ag unlinked lists. We'll process | |
344 | the inodes later. Check and make sure that the unlinked | |
345 | lists reference known inodes. If not, add to the list | |
346 | of uncertain inodes. | |
347 | ||
348 | Second, check the uncertain inode list generated in phase2 and | |
349 | above and get them into the inode tree if they're good. | |
350 | The incore inode cluster tree *always* has good | |
351 | clusters (alignment, etc.) in it. | |
dfc130f3 | 352 | |
2bd0ea18 NS |
353 | Third, make sure that the root inode is known. If not, |
354 | and we know the inode number from the superblock, | |
ff1f79a7 | 355 | discover that inode and its chunk. |
2bd0ea18 NS |
356 | |
357 | Then, walk the incore inode-cluster tree. | |
358 | ||
359 | Maintain an in-core bitmap over the entire fs for block allocation. | |
360 | ||
361 | traverse each inode, make sure inode mode field matches free/allocated | |
362 | bit in the incore inode allocation tree. If there's a mismatch, | |
363 | assume that the inode is in use. | |
364 | ||
365 | - for each in-use inode, traverse each bmap/dir/attribute | |
366 | map or tree. Maintain a map (extent list?) for the | |
367 | current inode. | |
368 | ||
369 | - For each block marked as used, check to see if already known | |
370 | (referenced by another file or directory) and sanity | |
371 | check the contents of the block as well if possible | |
372 | (in the case of meta-blocks). | |
373 | ||
374 | - if the inode claims already used blocks, mark the blocks | |
375 | as multiply claimed (duplicate) and go on. the inode | |
376 | will be cleared in phase 4. | |
377 | ||
378 | - if metablocks are garbaged, clear the inode after | |
379 | traversing what you can of the bmap and | |
380 | proceed to next inode. We don't have to worry | |
381 | about trashing the maps or trees in cleared inodes | |
382 | because the blocks will show up as free in the | |
383 | ag freespace trees that we set up in phase 5. | |
384 | ||
385 | - clear the di_next_unlinked pointer -- all unlinked | |
386 | but active files go bye-bye. | |
387 | ||
388 | - All blocks start out unknown. We need the last state | |
389 | in case we run into a case where we need to step | |
390 | on a block to store filesystem meta-data and it | |
391 | turns out later that it's referenced by some inode's | |
392 | bmap. In that case, the inode loses because we've | |
393 | already trashed the block. This shouldn't happen | |
394 | in the first version unless some inode has a bogus | |
395 | bmap referencing blocks in the ag header but the | |
396 | 4th state will keep us from inadvertently doing | |
397 | something stupid in that case. | |
398 | ||
399 | - If inode is allocated, mark all blocks allocated to the | |
400 | current inode as allocated in the incore freespace | |
401 | bitmap. | |
402 | ||
dfc130f3 | 403 | - If inode is good and a directory, scan through it to |
2bd0ea18 | 404 | find leaf entries and discover any unknown inodes. |
dfc130f3 | 405 | |
2bd0ea18 NS |
406 | For shortform, we correct what we can. |
407 | ||
408 | If the directory is corrupt, we try and fix it in | |
409 | place. If it has zero good entries, then we blast it. | |
410 | ||
411 | All unknown inodes get put onto the uncertain inode | |
412 | list. This is safe because we only put inodes onto | |
413 | the list when we're processing known inodes so the | |
414 | uncertain inode list isn't in use. | |
415 | ||
416 | We fix only one problem -- an entry that has | |
417 | a mathematically invalid inode number in it. |
418 | If that's the case, we replace the inode number | |
419 | with NULLFSINO and we'll fix up the entry in | |
420 | phase 6. | |
421 | ||
422 | That info may conflict with the inode information, | |
423 | but we'll straighten out any inconsistencies there | |
424 | in phase4 when we process the inodes again. | |
425 | ||
426 | Errors involving bogus forward/back links, | |
427 | zero-length entries make the directory get | |
428 | trashed. | |
429 | ||
430 | if an entry references a free inode, ignore that | |
431 | fact for now. wait for the rest of phase 3 | |
432 | processing to hit that inode. If it looks like it's | |
433 | in use, we'll mark in use then. If not, we'll | |
434 | clear it and mark the inode map. then in phase | |
435 | 4, you can depend on the inode map. | |
dfc130f3 | 436 | |
2bd0ea18 NS |
437 | Entries that point to non-existent or free |
438 | inodes, and extra blocks in the directory | |
439 | will get fixed in place in a later pass. | |
440 | ||
441 | Entries that point to a quota inode are | |
442 | marked TBD. | |
443 | ||
444 | If the directory internally points to the same | |
445 | block twice, the directory gets blown away. | |
446 | ||
447 | Note that processing uncertain inodes can add more inodes | |
448 | to the uncertain list if they're directories. So we loop | |
449 | until the uncertain list is empty. | |
450 | ||
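The loop over the uncertain list is a plain worklist algorithm. A sketch, assuming two hypothetical callbacks: `verify(ino)` decides whether an uncertain inode is good, and `dir_entries(ino)` returns the inode numbers a directory references (empty for non-directories).

```python
from collections import deque


def drain_uncertain(initial, verify, dir_entries):
    """Process uncertain inodes until the list is empty.

    Returns the set of inodes that verified as good.
    """
    known = set()
    seen = set(initial)        # never queue the same inode twice
    pending = deque(initial)
    while pending:
        ino = pending.popleft()
        if not verify(ino):
            continue           # bad inode: dropped, not followed
        known.add(ino)
        # Good directories may reveal more uncertain inodes.
        for child in dir_entries(ino):
            if child not in seen:
                seen.add(child)
                pending.append(child)
    return known
```

The `seen` set guarantees termination even when directories reference each other in a cycle.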
451 | During inode verification, if the inode blocks are unknown, | |
452 | mark them as in-use by inodes. |
453 | ||
454 | XXX HEURISTIC -- if we blow an inode away that has space, | |
455 | assume that the freespace btree is now out of whack. |
456 | If it was ok earlier, it's certain to be wrong now. | |
457 | And the odds of this space free cancelling out the | |
458 | existing error is so small I'm willing to ignore it. | |
459 | Should probably do this via a global var and complain | |
460 | about this later. | |
461 | ||
462 | Assumption: All known inodes are now marked as in-use or free. Any | |
463 | inodes that we haven't found by now are hosed (lost) since | |
464 | we can't reach them via either the inode btrees or via directory | |
465 | entries. | |
466 | ||
467 | Directories are semi-clean. All '.' entries are good. | |
468 | Root '..' entry is good if root inode exists. All entries | |
dfc130f3 | 469 | referencing non-existent inodes, free inodes, etc. |
2bd0ea18 NS |
470 | |
471 | XXX verify that either quota inode is 0 or NULLFSINO or | |
472 | if sb quota flag is non zero, verify that quota inode | |
473 | is NULLFSINO or is referencing a used, but disconnected | |
474 | inode. | |
475 | ||
476 | XXX if in no_modify mode, check for unclaimed blocks | |
477 | ||
478 | - Phase 4 - Check for inodes referencing duplicate blocks | |
479 | ||
480 | At this point, all known duplicate blocks are marked in | |
481 | the block map. However, some of the claimed blocks in | |
482 | the bmap may in fact be free because they belong to inodes | |
483 | that have to be cleared either due to being a trashed | |
484 | directory or because it's the first inode to claim a | |
485 | block that was then claimed later. There's a similar | |
486 | problem with meta-data blocks that are referenced by | |
487 | inode bmaps that are going to be freed once the inode | |
488 | (or directory) gets cleared. | |
489 | ||
490 | So at this point, we collect the duplicate blocks into | |
491 | extents and put them into the duplicate extent list. | |
492 | ||
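Collecting duplicate blocks into extents amounts to merging runs of consecutive block numbers. A minimal sketch, assuming the duplicates are available as a flat collection of block numbers; `blocks_to_extents` is an illustrative name.

```python
def blocks_to_extents(blocks):
    """Merge block numbers into (start, length) extents."""
    extents = []
    for b in sorted(set(blocks)):
        if extents and extents[-1][0] + extents[-1][1] == b:
            # b directly follows the last extent: grow it by one block.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((b, 1))
    return extents
```

For example, blocks 3, 4, 5, 7, 8, 9 collapse into the two extents (3, 3) and (7, 3).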
493 | Mark the ag header blocks as in use. | |
494 | ||
495 | We then process each inode twice -- the first time | |
496 | we check to see if the inode claims a duplicate extent | |
497 | and we do NOT set the block bitmap. If the inode claims | |
498 | a duplicate extent, we clear the inode. Since the bitmap | |
499 | hasn't been set, that automatically frees all blocks associated | |
500 | with the cleared inode. If the inode is ok, process it a second | |
501 | time and set the bitmap since we know that this inode will live. | |
502 | ||
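The two-pass idea can be sketched as follows. The data layout is an assumption for illustration: each inode maps to a list of (start, length) extents, the duplicate extent list uses the same form, and a Python set stands in for the block bitmap.

```python
def phase4(inodes, dup_extents):
    """Return (surviving inode numbers, set of used blocks)."""

    def overlaps(a, b):
        # Half-open ranges [start, start + length) intersect.
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    used = set()
    survivors = []
    for ino, extents in inodes.items():
        # Pass 1: check only. The bitmap is not touched, so clearing
        # the inode implicitly frees everything it claimed.
        if any(overlaps(e, d) for e in extents for d in dup_extents):
            continue
        # Pass 2: the inode will live, so record its blocks.
        for start, length in extents:
            used.update(range(start, start + length))
        survivors.append(ino)
    return survivors, used
```

Because the bitmap is only written in the second pass, no separate "undo" step is needed for cleared inodes.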
503 | The unlinked list gets cleared in every inode at this point as | |
504 | well. We no longer need to preserve it since we've discovered | |
505 | every inode we're going to find from it. | |
506 | ||
507 | verify existence of root inode. if it exists, check for | |
508 | existence of "lost+found". If it exists, mark the entry | |
509 | to be deleted, and clear the inode. All the inodes that | |
510 | were connected to the lost+found will be reconnected later. | |
511 | ||
512 | XXX HEURISTIC -- if we blow an inode away that has space, | |
513 | assume that the freespace btree is now out of whack. |
514 | If it was ok earlier, it's certain to be wrong now. | |
515 | And the odds of this space free cancelling out the | |
516 | existing error is so small I'm willing to ignore it. | |
517 | Should probably do this via a global var and complain | |
518 | about this later. | |
519 | ||
520 | Clear the quota inodes if the inode btree says that | |
521 | they're not in use. The space freed will get picked | |
522 | up by phase 5. | |
dfc130f3 | 523 | |
2bd0ea18 NS |
524 | XXX Clear the quota inodes if the filesystem is being downgraded. |
525 | ||
526 | - Phase 5 - Build inode allocation trees, freespace trees and | |
527 | agfl's for each ag. After this, we should be able to | |
528 | unmount the filesystem and remount it for real. | |
529 | ||
530 | For each ag: (if not in no_modify mode) |
531 | ||
532 | scan bitmap first to figure out number of extents. | |
dfc130f3 | 533 | |
2bd0ea18 NS |
534 | calculate space required for all trees. Start with inode trees. |
535 | Setup the btree cursor which includes the list of preallocated | |
536 | blocks. As a by-product, this will delete the extents required | |
537 | for the inode tree from the incore extent tree. | |
dfc130f3 | 538 | |
2bd0ea18 NS |
539 | Calculate how many extents will be required to represent the |
540 | remaining free extent tree on disk (twice, one for bybno and | |
541 | one for bycnt). You have to iterate on this because consuming | |
542 | extents can alter the number of blocks required to represent | |
543 | the remaining extents. If there's slop left over, you can | |
544 | put it in the agfl though. | |
545 | ||
546 | Then, manually build the trees, agi, agfs, and agfls. | |
547 | ||
548 | XXX if in no_modify mode, scan the on-disk inode allocation | |
549 | trees and compare against the incore versions. Don't have | |
550 | to scan the freespace trees because we caught the problems | |
551 | there in phase2 and phase3. But if we cleared any inodes | |
552 | with space during phases 3 or 4, now is the time to complain. | |
553 | ||
dfc130f3 | 554 | XXX - Free duplicate extent lists. ??? |
2bd0ea18 NS |
555 | |
556 | Assumptions: at this point, sim code having to do with inode | |
557 | creation/modification/deletion and space allocation | |
558 | work because the inode maps, space maps, and bmaps | |
559 | for all files in the filesystem are good. The only | |
560 | structures that are screwed up are the directory contents, | |
561 | which means that lookup may not work for beans, the | |
562 | root inode which exists but may be completely bogus and | |
563 | the link counts on all inodes which may also be bogus. | |
564 | ||
565 | Free the bitmap, the freespace tree. | |
566 | ||
dfc130f3 | 567 | Flash the incore inode tree over from parent list to having |
2bd0ea18 NS |
568 | full backpointers. |
569 | ||
570 | realtime processing, if any -- | |
571 | ||
572 | (Skip to below if running in no_modify mode). | |
573 | ||
574 | Generate the realtime bitmap from the incore realtime | |
575 | extent map and slam the info into the realtime bitmap | |
576 | inode. Generate summary info from the realtime extent map. | |
dfc130f3 | 577 | |
2bd0ea18 NS |
578 | XXX if in no_modify mode, compare contents of realtime bitmap |
579 | inode to the incore realtime extent map. generate the | |
580 | summary info from the incore realtime extent map. | |
581 | compare against the contents of the realtime summary inode. | |
582 | complain if bad. | |
583 | ||
584 | reset superblock counters, sync version numbers | |
585 | ||
586 | - Phase 6 - directory traversal -- check reference counts, | |
587 | attach disconnected inodes, fix up bogus directories | |
588 | ||
589 | Assumptions: all on-disk space and inode trees are structurally | |
590 | sound. Incore and on-disk inode trees agree on whether | |
591 | an inode is in use. | |
592 | ||
593 | Directories are structurally sound. All hashvalues | |
594 | are monotonically increasing and interior nodes are | |
595 | correct so lookups work. All legal directory entries | |
596 | point to inodes that are in use and exist. Shortform | |
597 | directories are fine except that the links haven't been | |
598 | checked for conflicts (cycles, ".." being correct, etc.). | |
599 | Longform directories haven't been checked for those problems | |
600 | either PLUS longform directories may still contain | |
601 | entries beginning with '/'. No zero-length entries | |
602 | exist (they've been deleted or converted to '/'). | |
603 | ||
604 | Root directory may or may not exist. orphanage may |
605 | or may not exist. Contents of either may be completely | |
606 | bogus. | |
607 | ||
608 | Entries may point to free or non-existent inodes. | |
609 | ||
610 | At this point, we may need new incore structures and |
611 | may be able to trash an old one (like the filesystem | |
612 | block map) | |
613 | ||
614 | If '/' is trashed, then reinitialize it. | |
615 | ||
616 | If no realtime inodes, make them and if necessary, slam the | |
617 | summary info into the realtime summary | |
618 | inode. Ditto with the realtime bitmap inode. | |
dfc130f3 | 619 | |
2bd0ea18 NS |
620 | Make orphanage (lost+found ???). |
621 | ||
622 | Traverse each directory from '/' (unless it was created). | |
623 | Check directory structure and each directory entry. | |
624 | If the entry is bogus (points to a non-existent or | |
625 | free inode, for example), mark that entry TBD. Maintain | |
626 | link counts on all inodes. Currently, traversal is | |
627 | depth-first. | |
628 | ||
629 | Mark every inode reached as "reached" (includes | |
630 | bumping up link counts). | |
631 | ||
632 | If an entry points to a directory but the parent (..) |
633 | disagrees, then blow away the entry. if the directory | |
634 | being pointed to winds up disconnected, it'll be moved | |
635 | to the orphanage (and the link count incremented to | |
636 | account for the link and the reached bit set then). | |
637 | ||
638 | If an entry points to a directory that we've already | |
639 | reached, then some entry is bad and should be blown | |
640 | away. It's easiest to blow away the current entry | |
641 | plus since presumably the parent entry in the | |
642 | reached directory points to another directory, | |
643 | then it's far more likely that the current | |
644 | entry is bogus (otherwise the parent should point | |
645 | at it). | |
646 | ||
647 | If an entry points to a non-existent or free inode, |
648 | blow the entry away. | |
649 | ||
650 | Every time a good entry is encountered update the | |
651 | link count for the inode that the entry points to. | |
652 | ||
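The traversal rules above can be condensed into a short sketch. The inputs are assumptions for illustration: `dirs` maps a directory inode to its (name, child inode) entries, `parent_of` gives the inode each directory's ".." names, and `in_use` is the set of allocated inodes. Link counting is simplified to one count per good entry.

```python
def traverse(root, dirs, parent_of, in_use):
    """Depth-first walk from root.

    Returns (reached inodes, link counts, dropped (dir, name) entries).
    """
    reached = {root}
    nlinks = {}
    dropped = []
    stack = [root]
    while stack:
        d = stack.pop()
        for name, child in dirs.get(d, []):
            if child not in in_use:
                # Entry points to a non-existent or free inode.
                dropped.append((d, name))
                continue
            if child in dirs:
                # Directory: ".." must agree and it must be new to us.
                if parent_of.get(child) != d or child in reached:
                    dropped.append((d, name))
                    continue
                stack.append(child)
            reached.add(child)
            nlinks[child] = nlinks.get(child, 0) + 1
    return reached, nlinks, dropped
```

Directories that are in use but missing from `reached` afterwards are the disconnected ones that a later step moves to the orphanage.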
653 | After traversal, scan incore inode map for directories not | |
ff1f79a7 | 654 | reached. Go to first one and try and find its root |
2bd0ea18 NS |
655 | by following .. entries. Once at root, run traversal |
656 | algorithm. When algorithm terminates, move subtree | |
657 | root inode to the orphanage. Repeat as necessary | |
658 | until all disconnected directories are attached. | |
659 | ||
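Finding the root of a disconnected subtree by following ".." entries might look like this. `parent_of` (known directories mapped to the inode their ".." names) and `reached` are assumed inputs, and the cycle guard is an addition the text does not spell out.

```python
def subtree_root(start, parent_of, reached):
    """Climb '..' links from an unreached directory.

    Stops at the topmost directory that is still disconnected.
    """
    seen = {start}
    cur = start
    while True:
        parent = parent_of.get(cur)
        if (parent is None
                or parent not in parent_of   # ".." names an unknown dir
                or parent in reached         # ".." names a connected dir
                or parent in seen):          # ".." loop: stop here
            return cur
        seen.add(parent)
        cur = parent
```

The returned directory is the one to run the traversal from and then attach to the orphanage.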
660 | Move all disconnected inodes to orphanage. | |
661 | ||
662 | - Phase 7: reset reference counts if required. | |
663 | ||
664 | Now traverse the on-disk inodes again, and make sure on-disk | |
665 | reference counts are correct. Reset if necessary. | |
666 | ||
667 | SKIP all unused inodes -- that also makes us | |
668 | skip the orphanage inode which we think is | |
669 | unused but is really used. However, the ref counts | |
670 | on that should be right so that's ok. | |
671 | ||
672 | --- | |
673 | ||
674 | multiple TB xfs_repair | |
675 | ||
676 | modify above to work in a couple of AGs at a time. The bitmaps | |
677 | should span only the current set of AGs. | |
678 | ||
679 | The key is to scan the inode bmaps and keep a list of inodes |
680 | that span multiple AG sets and keep the list in a data structure | |
681 | that's keyed off AG set # as well as inode # and also has a bit | |
682 | to indicate whether or not the inode will be cleared. | |
683 | ||
684 | Then in each AG set, when doing duplicate extent processing, | |
685 | you have to process all multi-AG-set inodes that claim blocks in | |
686 | the current AG set. If there's a conflict, you clear the |
687 | inode in the current AG and you mark the multi-AG inode as | |
688 | "to be cleared". | |
689 | ||
690 | After going through all AGs, you can clear the to-be-cleared | |
691 | multi-AG-set inodes and pull them off the list. | |
692 | ||
693 | When building up the AG freespace trees, you walk the bmaps | |
694 | of all multi-AG-set inodes that are in the AG-set and include | |
695 | blocks claimed in the AG by the inode as used. | |
696 | ||
697 | This probably involves adding a phase 3-0 which would have to | |
698 | check all the inodes to see which ones are multi-AG-set inodes | |
699 | and set up the multi-AG-set inode data structure. Plus the | |
700 | process_dinode routines may have to be altered just a bit | |
701 | to do the right thing if running in terabyte mode (call |
702 | out to routines that check the multi-AG-set inodes when | |
703 | appropriate). | |
704 | ||
705 | To make things go faster, phase 3-0 could probably run | |
706 | in parallel. It should be possible to run phases 2-5 | |
707 | in parallel as well once the appropriate synchronization | |
708 | is added to the incore routines and the static directory | |
709 | leaf block bitmap is changed to be on the stack. | |
710 | ||
711 | Phase 7 probably can be in parallel as well. | |
712 | ||
713 | By in parallel, I mean that assuming that an AG-set | |
714 | contains 4 AGs, you could run 4 threads, 1 per AG | |
715 | in parallel to process the AG set. | |
716 | ||
717 | I don't see how phase 6 can be run in parallel though. | |
718 | ||
719 | And running Phase 8 in parallel is just silly. |