A living document. The basic algorithm.

TODO: (D == DONE)

0) Need to bring some sanity into the case of flags that can
    be set in the secondaries at mkfs time but reset or cleared
    in the primary later in the filesystem's life.

0) Clear the persistent read-only bit if set. Clear the
    shared bit if set and the version number is zero. This
    brings the filesystem back to a known state.

0) make sure that superblock geometry code checks the logstart
    value against whether or not we have an internal log.
    If we have an internal log and a logdev, that's ok.
    (Maybe we just aren't using it). If we have an external
    log (logstart == 0) but no logdev, that's right out.

0) write secondary superblock search code. Rewrite initial
    superblock parsing code to be less complicated. Just
    use variables to indicate primary, secondary, etc.,
    and use a function to get the SB given a specific location
    or something.

2) For inode alignment, if the SB bit is set and the
    inode alignment size field in the SB is set, then
    believe that the fs inodes MUST be aligned and
    disallow any non-aligned inodes. Likewise, if
    the SB bit isn't set (or earlier version) and
    the inode alignment size field is zero, then
    never set the bit even if the inodes are aligned.
    Note that the bits and alignment values are
    replicated in the secondary superblocks.

0) add feature specification options to parse_arguments

0) add logic to add_inode_ref(), add_inode_reached()
    to detect nlink overflows in cases where the fs
    (or user had indicated fs) doesn't support new nlinks.

6) check to make sure that the inodes containing btree blocks
    with # recs < minrecs aren't legit -- e.g. the only
    descendant of a root block.

7) inode di_size value sanity checking -- should always be less than
    or equal to the biggest filebno offset mentioned in the bmaps.
    Doesn't have to be equal though since we're allowed to
    overallocate (it just wastes a little space). This is for both
    regular files and directories (have to modify the existing
    directory check).

    Add tracking of largest offset in bmap scanning code. Compare
    value against di_size. Should be >= di_size.

    Alternatively, you could pass the inode down through
    the extent record processing layer and make the checks
    there.

    Add knowledge of quota inodes. size of quota inode is
    always zero. We should maintain that.

8) Basic quota stuff.

    Invariants
        if quota feature bit is set, the quota inodes
        if set, should point to disconnected, 0 len inodes.

D -     if quota inodes exist, the quota bits must be
        turned on. It's ok for the quota flags to be
        zeroed but they should be in a legal state
        (see xfs_quota.h).

D -     if the quota flags are non-zero, the corresponding
        quota inodes must exist.

        quota inodes are never deleted, only their space
        is freed.

        if quotas are being downgraded, then check quota inodes
        at the end of phase 3. If they haven't been cleared yet,
        clear them. Regardless, then clear sb flags (quota inode
        fields, quota flags, and quota bit).


5) look at verify_inode_chunk(). it's probably really broken.


9) Complicated quota stuff. Add code to bmap scan code to
    track used blocks. Add another pair of AVL trees
    to track user and project quota limits. Set AVL
    trees up at the beginning of phase 3. Quota inodes
    can be rebuilt or corrected later if damaged.


D - 0) fix directory processing. phase 3, if an entry references
    a free inode, *don't* mark it used. wait for the rest of
    phase 3 processing to hit that inode. If it looks like it's
    in use, we'll mark in use then. If not, we'll clear it and
    mark the inode map. then in phase 4, you can depend on the
    inode map. should probably set the parent info in phase 4.
    So we have a check_dups flag. Maybe we should change the
    name of check_dir to discover_inodes. During phase 3
    (discover_inodes == 1), uncertain inodes are added to list.
    During phase 4 (discover_inodes == 0), they aren't. And
    we never mark inodes in use from the directory code.
    During phase 4, we shouldn't complain about names with
    a leading '/' since we made those names in phase 3.

    Have to change dino_chunks.c (parent setting), dinode.c
    and dir.c.

D - 0) make sure we don't screw up filesystems with real-time inodes.
    remember to initialize real-time map with all blocks XR_E_FREE.

D - 4) check contents of symlinks as well as lengths in process_symlinks()
    in dinode.c. Right now, we only check lengths.


D - 1) Feature mismatches -- for quotas and attributes,
    if the stuff exists in the filesystem, set the
    superblock version bits.

D - 0) rewrite directory leaf block holemap comparison code.
    probably should just check the leaf block hole info
    against our incore bitmap. If the hole flag is not
    set, then we know that there can only be one hole and
    it has to be between the entry table and the top of heap.
    If the hole flag is set, then it's ok if the on-disk
    holemap doesn't describe everything as long as what
    it does describe doesn't conflict with reality.

D - 0) rewrite setting nlinks handling -- for version 1
    inodes, set both nlinks and onlinks (zero projid_lo/hi
    and pad) if we have to change anything. For
    version 2, I think we're ok.

D - 0) Put awareness of quota inode into mark_standalone_inodes.


D - 8) redo handling of superblocks with bad version numbers. need
    to bail out (without harming) fs's that have sbs that
    are newer than we are.

D - 0) How do we handle feature mismatches between fs and
    superblock? For nlink, check each inode after you
    know it's good. If onlinks is 0 and nlinks is > 0
    and it's a version 2 inode, then it really is a version
    2 inode and the nlinks flag in the SB needs to be set.
    If it's a version 2 inode and the SB agrees but onlink
    is non-zero, then clear onlink.

D - 3) keep cumulative counts of freeblocks, inodes, etc. to set in
    the superblock at the end of phase 5. Remember that
    agf freeblock counters don't include blocks used by
    the non-root levels of the freespace trees but that
    the sb free block counters include those.

D - 0) Do parent setting in directory code (called by phase 3).
    actually, I put it in process_inode_set and propagated
    the parent up to it from the process_dinode/process_dir
    routines. seemed cleaner than pushing the irec down
    and letting them bang on it.

D - 0) If we clear a file in phase 4, make sure that if it's
    a directory that the parent info is cleared also.

D - 0) put inode tree flashover (call to add_ino_backptrs) into phase 5.

D - 0) do set/get_inode_parent functions in incore_ino.c.
    also do is/set/ inode_processed.

D - 0) do a versions.c to extract feature info and set global vars
    from the superblock version number and possibly feature bits

D - 0) change longform_dir_entry_check + shortform_dir_entry_check
    to return a count of how many illegal '/' entries exist.
    if > 0, then process_dirstack needs to call prune_dir_entry
    with a hash value of 0 to delete the entries.

D - 0) add the "processed" bitfield
    to the backptrs_t struct that gets attached after
    phase 4.

D- ) Phase 6 !!!

D - 0) look at usage of XFS_MAKE_IPTR(). It does the right
    arithmetic assuming you count your offsets from the
    beginning of the buffer.


D - 0) look at references to XFS_INODES_PER_CHUNK. change the
    ones that really mean sizeof(uint64_t)*NBBY to
    something else (like that only defined as a constant
    INOS_PER_IREC). this isn't as important since
    XFS_INODES_PER_CHUNK will never change.


D - 0) look at junk_zerolen_dir_leaf_entries() to make sure it isn't hosing
    the freemap since it assumed that bytes between the
    end of the table and firstused didn't show up in the
    freemap when they actually do.

D - 0) track down XFS_INO_TO_OFFSET() usage. I don't think I'm
    using it right. (e.g. I think
    it gives you the offset of an inode into a block but
    on small block filesystems, I may be reading in inodes
    in multiblock buffers and working from the start of
    the buffer plus I'm using it to get offsets into
    my ino_rec's which may not be a good idea since I
    use 64-inode ino_rec's whereas the offset macro
    works off blocksize).

D - 0.0) put buffer -> dirblock conversion macros into xfs kernel code

D - 0.2) put in sibling pointer checking and path fixup into
    bmap (long form) scan routines in scan.c
D - 0.3) find out if bmap btrees with only root blocks are legal. I'm
    betting that they're not because they'd be extent inodes
    instead. If that's the case, rip some code out of
    process_btinode()


Algorithm (XXX means not done yet):

Phase 1 -- get a superblock and zero log

    get a superblock -- either read in primary or
        find a secondary (ag header), check ag headers

        To find secondary:

            Go for brute force and read in the filesystem N meg
                at a time looking for a superblock. as a
                slight optimization, we could maybe skip
                ahead some number of blocks to try and get
                towards the end of the first ag.

            After you find a secondary, try and find at least
                other ags as a verification that the
                secondary is a good superblock.

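A rough sketch of the brute-force search, in Python rather than the C that xfs_repair is written in. It reads the device a chunk at a time and reports every sector that starts with the superblock magic ("XFSB"). The function name, the chunk and sector sizes, and the `start` parameter (standing in for the "skip ahead" optimization) are all made up for illustration. A hit is only a candidate; it still needs the cross-check against other ags described above.

```python
import io

XFS_SB_MAGIC = b"XFSB"  # on-disk superblock magic number

def find_superblock_candidates(dev, chunk_size=1 << 20, sector_size=512, start=0):
    """Yield byte offsets of sectors that begin with the superblock magic.

    `dev` is any seekable binary file object.  `start` and `chunk_size`
    are assumed to be multiples of `sector_size` (hypothetical parameters).
    """
    offset = start
    dev.seek(offset)
    while True:
        chunk = dev.read(chunk_size)
        if not chunk:
            break
        # Superblocks are sector aligned, so only sector starts are checked.
        for pos in range(0, len(chunk), sector_size):
            if chunk[pos:pos + 4] == XFS_SB_MAGIC:
                yield offset + pos
        offset += len(chunk)

if __name__ == "__main__":
    image = bytearray(16384)
    image[0:4] = XFS_SB_MAGIC        # primary
    image[8192:8196] = XFS_SB_MAGIC  # a secondary
    print(list(find_superblock_candidates(io.BytesIO(bytes(image)), chunk_size=4096)))
```
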
XXX -   Ugh. Have to take growfs'ed filesystems into account.
        The root superblock geometry info may not be right if
        recovery hasn't run or it's been trashed. The old ag's
        may or may not be right since the system could have crashed
        during growfs or the bwrite() to the superblocks could have
        failed and the buffer been reused. So we need to check
        to see if another ag exists beyond the "last" ag
        to see if a growfs happened. If not, then we know that
        the geometry info is good and treat the fs as a non-growfs'ed
        fs. If we do have inconsistencies, then the smaller geometry
        is the old fs and the larger the new. We can check the
        new superblocks to see if they're good. If not, then we
        know the system crashed at or soon after the growfs and
        we can choose to either accept the new geometry info or
        trash it and truncate the fs back to the old geometry
        parameters.

    Cross-check geometry information in secondary sb's with
        primary to ensure that it's correct.

    Use sim code to allow mounting filesystems *without* reading
        in root inode. This sets up the xfs_mount_t structure
        and allows us to use XFS_* macros that we wouldn't
        otherwise be able to use.

    Note, I split phase 1 and 2 into separate pieces because I want
    to initialize the xfs_repair incore data structures after phase 1.

    parse superblock version and feature flags and set appropriate
        global vars to reflect the flags (attributes, quotas, etc.)

    Workaround for the mkfs "not zeroing the superblock buffer" bug.
        Determine what field is the last valid non-zero field in
        the superblock. The trick here is to be able to differentiate
        the last valid non-zero field in the primary superblock and
        secondaries because they may not be the same. Fields in
        the primary can be set as the filesystem gets upgraded but
        the upgrades won't touch the secondaries. This means that
        we need to find some number of secondaries and check them.
        So we do the checking here and the setting in phase2.

Phase 2 -- check integrity of allocation group allocation structures

    zero the log if in no modify mode

    sanity check ag headers -- superblocks match, agi isn't
        trashed -- the agf and agfl
        don't really matter because we can
        just recreate them later.

    Zero part of the superblock buffer if necessary

    Walk the freeblock trees to get an
        initial idea of what the fs thinks is free.
        Files that disagree (claim free'd blocks)
        can be salvaged or deleted. If the btree is
        internally inconsistent, when in doubt, mark
        blocks free. If they're used, they'll be stolen
        back later. don't have to check sibling pointers
        for each level since we're going to regenerate
        all the trees anyway.
    Walk the inode allocation trees and
        make sure they're ok, otherwise the sim
        inode routines will probably just barf.
        mark inode allocation tree blocks and ag header
        blocks as used blocks. If the trees are
        corrupted, this phase will generate "uncertain"
        inode chunks. Those chunks go on a list and
        will have to be verified later. Record the blocks
        that are used to detect corruption and multiply
        claimed blocks. These trees will be regenerated
        later. Mark the blocks containing inodes referenced
        by uncorrupted inode trees as being used by inodes.
        The other blocks will get marked when/if the inodes
        are verified.

    calculate root and realtime inode numbers from the
        filesystem geometry, fix up mount structure's
        incore superblock if they're wrong.

ASSUMPTION: at end of phase 2, we've got superblocks and ag headers
    that are not garbage (some data in them like counters and the
    freeblock and inode trees may be inconsistent but the header
    is readable and otherwise makes sense).

XXX if in no_modify mode, check for blocks claimed by one freespace
    btree and not the other

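The not-yet-done no_modify check could look something like the sketch below. It assumes each freespace btree has already been walked into a list of (start_block, length) extents; the function names and that representation are invented here, not taken from the repair code. Expanding extents into sets is only practical for a toy example, but it shows the comparison: anything in one tree and not the other is a disagreement worth reporting.

```python
def expand(extents):
    """Turn (start_block, length) extents into a set of block numbers."""
    blocks = set()
    for start, length in extents:
        blocks.update(range(start, start + length))
    return blocks

def freespace_disagreements(bybno_extents, bycnt_extents):
    """Return (only_in_bybno, only_in_bycnt) as sorted block lists."""
    by_bno = expand(bybno_extents)
    by_cnt = expand(bycnt_extents)
    return sorted(by_bno - by_cnt), sorted(by_cnt - by_bno)

if __name__ == "__main__":
    only_bno, only_cnt = freespace_disagreements(
        [(10, 4), (20, 2)],          # what the by-block-number tree says
        [(10, 2), (20, 2), (30, 1)]  # what the by-size tree says
    )
    print("only in bybno:", only_bno)
    print("only in bycnt:", only_cnt)
```
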
Phase 3 -- traverse inodes to make the inodes, bmaps and freespace maps
    consistent. For each ag, use either the incore inode map or
    scan the ag for inodes.
    Let's use the incore inode map, now that we've made one
    up in phase2. If we lose the maps, we'll locate inodes
    when we traverse the directory hierarchy. If we lose both,
    we could scan the disk. Ugh. Maybe make that a command-line
    option that we support later.

    ASSUMPTION: we know if the ag allocation btrees are intact (phase 2)

    First - Walk and clear the ag unlinked lists. We'll process
        the inodes later. Check and make sure that the unlinked
        lists reference known inodes. If not, add to the list
        of uncertain inodes.

    Second, check the uncertain inode list generated in phase2 and
        above and get them into the inode tree if they're good.
        The incore inode cluster tree *always* has good
        clusters (alignment, etc.) in it.

    Third, make sure that the root inode is known. If not,
        and we know the inode number from the superblock,
        discover that inode and its chunk.

    Then, walk the incore inode-cluster tree.

    Maintain an in-core bitmap over the entire fs for block allocation.

    traverse each inode, make sure inode mode field matches free/allocated
        bit in the incore inode allocation tree. If there's a mismatch,
        assume that the inode is in use.

        - for each in-use inode, traverse each bmap/dir/attribute
            map or tree. Maintain a map (extent list?) for the
            current inode.

        - For each block marked as used, check to see if already known
            (referenced by another file or directory) and sanity
            check the contents of the block as well if possible
            (in the case of meta-blocks).

        - if the inode claims already used blocks, mark the blocks
            as multiply claimed (duplicate) and go on. the inode
            will be cleared in phase 4.

        - if metablocks are garbaged, clear the inode after
            traversing what you can of the bmap and
            proceed to next inode. We don't have to worry
            about trashing the maps or trees in cleared inodes
            because the blocks will show up as free in the
            ag freespace trees that we set up in phase 5.

        - clear the di_next_unlinked pointer -- all unlinked
            but active files go bye-bye.

        - All blocks start out unknown. We need the last state
            in case we run into a case where we need to step
            on a block to store filesystem meta-data and it
            turns out later that it's referenced by some inode's
            bmap. In that case, the inode loses because we've
            already trashed the block. This shouldn't happen
            in the first version unless some inode has a bogus
            bmap referencing blocks in the ag header but the
            4th state will keep us from inadvertently doing
            something stupid in that case.

        - If inode is allocated, mark all blocks allocated to the
            current inode as allocated in the incore freespace
            bitmap.

        - If inode is good and a directory, scan through it to
            find leaf entries and discover any unknown inodes.

            For shortform, we correct what we can.

            If the directory is corrupt, we try and fix it in
            place. If it has zero good entries, then we blast it.

            All unknown inodes get put onto the uncertain inode
            list. This is safe because we only put inodes onto
            the list when we're processing known inodes so the
            uncertain inode list isn't in use.

            We fix only one problem -- an entry that has
            a mathematically invalid inode number in it.
            If that's the case, we replace the inode number
            with NULLFSINO and we'll fix up the entry in
            phase 6.

            That info may conflict with the inode information,
            but we'll straighten out any inconsistencies there
            in phase4 when we process the inodes again.

            Errors involving bogus forward/back links,
            zero-length entries make the directory get
            trashed.

            if an entry references a free inode, ignore that
            fact for now. wait for the rest of phase 3
            processing to hit that inode. If it looks like it's
            in use, we'll mark in use then. If not, we'll
            clear it and mark the inode map. then in phase
            4, you can depend on the inode map.

            Entries that point to non-existent or free
            inodes, and extra blocks in the directory
            will get fixed in place in a later pass.

            Entries that point to a quota inode are
            marked TBD.

            If the directory internally points to the same
            block twice, the directory gets blown away.

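The block bookkeeping in the bullets above (every block starts out unknown, a second claim turns it into a multiply claimed block) can be sketched as a small state map. This is a toy model in Python, not the incore bitmap the repair code uses. The state names and `claim_blocks()` are invented for illustration and are not meant to match the XR_E_* constants one for one.

```python
from enum import Enum

class BlockState(Enum):
    UNKNOWN = 0            # nothing has claimed or freed the block yet
    FREE = 1               # the freespace trees say it is free
    IN_USE = 2             # claimed by exactly one owner so far
    MULTIPLY_CLAIMED = 3   # claimed more than once (duplicate)

def claim_blocks(block_map, owners, ino, blocks):
    """Record that inode `ino` claims `blocks`.

    Returns True if any block was already claimed by someone else.
    A block listed as FREE is simply "stolen back" and marked in use.
    """
    conflict = False
    for b in blocks:
        state = block_map.get(b, BlockState.UNKNOWN)
        if state in (BlockState.IN_USE, BlockState.MULTIPLY_CLAIMED):
            block_map[b] = BlockState.MULTIPLY_CLAIMED
            conflict = True
        else:
            block_map[b] = BlockState.IN_USE
        owners.setdefault(b, []).append(ino)
    return conflict

if __name__ == "__main__":
    block_map, owners = {}, {}
    claim_blocks(block_map, owners, 128, [5, 6])
    if claim_blocks(block_map, owners, 129, [6, 7]):
        print("block 6 is a duplicate, owners:", owners[6])
```

Note that the sketch only flags the duplicate and moves on; deciding which inode gets cleared is left to phase 4, as the text says.
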
    Note that processing uncertain inodes can add more inodes
    to the uncertain list if they're directories. So we loop
    until the uncertain list is empty.

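That loop is a plain worklist. A minimal sketch, assuming a `verify` callback that says whether an uncertain inode turned out to be good and a `dir_entries` table of the inode numbers each directory references (both are stand-ins invented for this example):

```python
from collections import deque

def drain_uncertain(uncertain, verify, dir_entries, known):
    """Process the uncertain list until it is empty.

    Good inodes are added to `known`; if a good inode is a directory,
    the inodes it references that we don't know yet go back on the list.
    """
    queue = deque(uncertain)
    while queue:
        ino = queue.popleft()
        if ino in known or not verify(ino):
            continue
        known.add(ino)
        for child in dir_entries.get(ino, ()):
            if child not in known:
                queue.append(child)
    return known

if __name__ == "__main__":
    entries = {200: [300, 301], 300: [400]}
    found = drain_uncertain([200], lambda ino: ino != 301, entries, {128})
    print(sorted(found))
```
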
    During inode verification, if the inode blocks are unknown,
    mark them as in-use by inodes.

XXX HEURISTIC -- if we blow an inode away that has space,
    assume that the freespace btree is now out of whack.
    If it was ok earlier, it's certain to be wrong now.
    And the odds of this freed space cancelling out the
    existing error are so small I'm willing to ignore it.
    Should probably do this via a global var and complain
    about this later.

Assumption: All known inodes are now marked as in-use or free. Any
    inodes that we haven't found by now are hosed (lost) since
    we can't reach them via either the inode btrees or via directory
    entries.

    Directories are semi-clean. All '.' entries are good.
    Root '..' entry is good if root inode exists. All entries
    referencing non-existent inodes, free inodes, etc.

XXX verify that either quota inode is 0 or NULLFSINO or
    if sb quota flag is non zero, verify that quota inode
    is NULLFSINO or is referencing a used, but disconnected
    inode.

XXX if in no_modify mode, check for unclaimed blocks

- Phase 4 - Check for inodes referencing duplicate blocks

    At this point, all known duplicate blocks are marked in
    the block map. However, some of the claimed blocks in
    the bmap may in fact be free because they belong to inodes
    that have to be cleared either due to being a trashed
    directory or because it's the first inode to claim a
    block that was then claimed later. There's a similar
    problem with meta-data blocks that are referenced by
    inode bmaps that are going to be freed once the inode
    (or directory) gets cleared.

    So at this point, we collect the duplicate blocks into
    extents and put them into the duplicate extent list.

    Mark the ag header blocks as in use.

    We then process each inode twice -- the first time
    we check to see if the inode claims a duplicate extent
    and we do NOT set the block bitmap. If the inode claims
    a duplicate extent, we clear the inode. Since the bitmap
    hasn't been set, that automatically frees all blocks associated
    with the cleared inode. If the inode is ok, process it a second
    time and set the bitmap since we know that this inode will live.

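A toy version of the two steps above: collapse the duplicate blocks into extents, then make two passes over each inode. The data layout (a dict from inode number to a list of (start, length) extents) and all function names are assumptions made for the example. The point it illustrates is that pass one never touches the bitmap, so a cleared inode leaves nothing behind to undo.

```python
def blocks_to_extents(blocks):
    """Collapse block numbers into sorted (start, length) extents."""
    extents = []
    for b in sorted(set(blocks)):
        if extents and extents[-1][0] + extents[-1][1] == b:
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            extents.append((b, 1))
    return extents

def overlaps(extent, dup_extents):
    """True if `extent` shares at least one block with a duplicate extent."""
    start, length = extent
    return any(start < d + n and d < start + length for d, n in dup_extents)

def phase4(inodes, dup_blocks):
    """Return (surviving inode numbers, set of blocks marked in use)."""
    dup_extents = blocks_to_extents(dup_blocks)
    survivors, bitmap = [], set()
    for ino, extents in sorted(inodes.items()):
        # Pass 1: only look.  The bitmap is not set, so clearing the
        # inode frees its blocks without any further work.
        if any(overlaps(e, dup_extents) for e in extents):
            continue  # inode is cleared
        # Pass 2: the inode will live, so now mark its blocks.
        for start, length in extents:
            bitmap.update(range(start, start + length))
        survivors.append(ino)
    return survivors, bitmap

if __name__ == "__main__":
    inodes = {128: [(0, 4)], 129: [(4, 3)], 130: [(6, 2)]}
    print(phase4(inodes, dup_blocks=[6]))
```
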
    The unlinked list gets cleared in every inode at this point as
    well. We no longer need to preserve it since we've discovered
    every inode we're going to find from it.

    verify existence of root inode. if it exists, check for
    existence of "lost+found". If it exists, mark the entry
    to be deleted, and clear the inode. All the inodes that
    were connected to the lost+found will be reconnected later.

XXX HEURISTIC -- if we blow an inode away that has space,
    assume that the freespace btree is now out of whack.
    If it was ok earlier, it's certain to be wrong now.
    And the odds of this freed space cancelling out the
    existing error are so small I'm willing to ignore it.
    Should probably do this via a global var and complain
    about this later.

    Clear the quota inodes if the inode btree says that
    they're not in use. The space freed will get picked
    up by phase 5.

XXX Clear the quota inodes if the filesystem is being downgraded.

- Phase 5 - Build inode allocation trees, freespace trees and
    agfl's for each ag. After this, we should be able to
    unmount the filesystem and remount it for real.

    For each ag: (if not in no_modify mode)

    scan bitmap first to figure out number of extents.

    calculate space required for all trees. Start with inode trees.
    Setup the btree cursor which includes the list of preallocated
    blocks. As a by-product, this will delete the extents required
    for the inode tree from the incore extent tree.

    Calculate how many extents will be required to represent the
    remaining free extent tree on disk (twice, one for bybno and
    one for bycnt). You have to iterate on this because consuming
    extents can alter the number of blocks required to represent
    the remaining extents. If there's slop left over, you can
    put it in the agfl though.

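The iteration can be sketched as "reserve, recount, repeat until the reservation covers the need". Everything below is a simplified model rather than the repair code's actual accounting: free extents are just lengths, tree blocks are taken from the smallest extents first, and the per-block capacities `recs_per_leaf` and `ptrs_per_node` are hypothetical parameters.

```python
def btree_blocks(nrecs, recs_per_leaf, ptrs_per_node):
    """Blocks needed for a btree holding `nrecs` records (at least a root)."""
    blocks = level = max(1, -(-nrecs // recs_per_leaf))  # ceiling division
    while level > 1:
        level = -(-level // ptrs_per_node)
        blocks += level
    return blocks

def consume(extents, nblocks):
    """Take `nblocks` out of the free extents, smallest extents first."""
    remaining = []
    for length in sorted(extents):
        take = min(length, nblocks)
        nblocks -= take
        if length - take:
            remaining.append(length - take)
    if nblocks:
        raise ValueError("not enough free space for the trees")
    return remaining

def size_freespace_trees(extents, recs_per_leaf, ptrs_per_node):
    """Return (blocks reserved, remaining free extents, slop for the agfl)."""
    needed = 0
    while True:
        remaining = consume(extents, needed)
        # Two trees (bybno and bycnt), one record per remaining extent each.
        want = 2 * btree_blocks(len(remaining), recs_per_leaf, ptrs_per_node)
        if want <= needed:
            return needed, remaining, needed - want
        needed = want

if __name__ == "__main__":
    print(size_freespace_trees([1, 1, 1, 1, 100], recs_per_leaf=2, ptrs_per_node=2))
```

In this model, reserving blocks swallows the four one-block extents, which shrinks the trees, so part of the reservation ends up as slop. That is the effect the paragraph above warns about.
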
    Then, manually build the trees, agi, agfs, and agfls.

XXX if in no_modify mode, scan the on-disk inode allocation
    trees and compare against the incore versions. Don't have
    to scan the freespace trees because we caught the problems
    there in phase2 and phase3. But if we cleared any inodes
    with space during phases 3 or 4, now is the time to complain.

XXX - Free duplicate extent lists. ???

Assumptions: at this point, sim code having to do with inode
    creation/modification/deletion and space allocation
    work because the inode maps, space maps, and bmaps
    for all files in the filesystem are good. The only
    structures that are screwed up are the directory contents,
    which means that lookup may not work for beans, the
    root inode which exists but may be completely bogus and
    the link counts on all inodes which may also be bogus.

    Free the bitmap, the freespace tree.

    Flash the incore inode tree over from parent list to having
    full backpointers.

    realtime processing, if any --

    (Skip to below if running in no_modify mode).

    Generate the realtime bitmap from the incore realtime
    extent map and slam the info into the realtime bitmap
    inode. Generate summary info from the realtime extent map.

XXX if in no_modify mode, compare contents of realtime bitmap
    inode to the incore realtime extent map. generate the
    summary info from the incore realtime extent map.
    compare against the contents of the realtime summary inode.
    complain if bad.

    reset superblock counters, sync version numbers

- Phase 6 - directory traversal -- check reference counts,
    attach disconnected inodes, fix up bogus directories

    Assumptions: all on-disk space and inode trees are structurally
    sound. Incore and on-disk inode trees agree on whether
    an inode is in use.

    Directories are structurally sound. All hashvalues
    are monotonically increasing and interior nodes are
    correct so lookups work. All legal directory entries
    point to inodes that are in use and exist. Shortform
    directories are fine except that the links haven't been
    checked for conflicts (cycles, ".." being correct, etc.).
    Longform directories haven't been checked for those problems
    either PLUS longform directories may still contain
    entries beginning with '/'. No zero-length entries
    exist (they've been deleted or converted to '/').

    Root directory may or may not exist. orphanage may
    or may not exist. Contents of either may be completely
    bogus.

    Entries may point to free or non-existent inodes.

    At this point, we may need new incore structures and
    may be able to trash an old one (like the filesystem
    block map)

    If '/' is trashed, then reinitialize it.

    If no realtime inodes, make them and if necessary, slam the
    summary info into the realtime summary
    inode. Ditto with the realtime bitmap inode.

    Make orphanage (lost+found ???).

    Traverse each directory from '/' (unless it was created).
    Check directory structure and each directory entry.
    If the entry is bogus (points to a non-existent or
    free inode, for example), mark that entry TBD. Maintain
    link counts on all inodes. Currently, traversal is
    depth-first.

    Mark every inode reached as "reached" (includes
    bumping up link counts).

    If an entry points to a directory but the parent (..)
    disagrees, then blow away the entry. if the directory
    being pointed to winds up disconnected, it'll be moved
    to the orphanage (and the link count incremented to
    account for the link and the reached bit set then).

    If an entry points to a directory that we've already
    reached, then some entry is bad and should be blown
    away. It's easiest to blow away the current entry
    plus since presumably the parent entry in the
    reached directory points to another directory,
    then it's far more likely that the current
    entry is bogus (otherwise the parent should point
    at it).

    If an entry points to a non-existent or free inode,
    blow the entry away.

    Every time a good entry is encountered update the
    link count for the inode that the entry points to.

    After traversal, scan incore inode map for directories not
    reached. Go to first one and try and find its root
    by following .. entries. Once at root, run traversal
    algorithm. When algorithm terminates, move subtree
    root inode to the orphanage. Repeat as necessary
    until all disconnected directories are attached.

    Move all disconnected inodes to orphanage.

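A stripped-down model of the traversal, to make the pruning rules concrete. Directories are a dict of name to inode number, "." and ".." are left out entirely (so the parent-disagrees rule is not modelled), and the orphanage step is reduced to listing the inodes that were never reached instead of walking ".." to find each subtree root. All names here are made up for the sketch.

```python
def traverse(dirs, in_use, root):
    """Depth-first walk from `root`.

    dirs:   {directory inode: {entry name: child inode}}
    in_use: set of inode numbers that exist and are allocated
    Returns (link counts, reached set, pruned entries, orphans).
    """
    nlinks = {ino: 0 for ino in in_use}
    reached = {root}
    pruned = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for name, child in sorted(dirs.get(parent, {}).items()):
            if child not in in_use:
                pruned.append((parent, name))   # non-existent or free inode
            elif child in dirs and child in reached:
                pruned.append((parent, name))   # directory already reached
            else:
                nlinks[child] += 1              # good entry: bump link count
                reached.add(child)
                if child in dirs:
                    stack.append(child)
    orphans = sorted(in_use - reached)          # candidates for the orphanage
    return nlinks, reached, pruned, orphans

if __name__ == "__main__":
    dirs = {1: {"a": 2, "b": 3, "gone": 99}, 2: {"loop": 1, "f": 3}}
    print(traverse(dirs, in_use={1, 2, 3, 4}, root=1))
```
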
- Phase 7: reset reference counts if required.

    Now traverse the on-disk inodes again, and make sure on-disk
    reference counts are correct. Reset if necessary.

    SKIP all unused inodes -- that also makes us
    skip the orphanage inode which we think is
    unused but is really used. However, the ref counts
    on that should be right so that's ok.

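In sketch form, this phase is a comparison between the link counts stored on disk and the counts accumulated during the phase 6 traversal. The dicts below stand in for the on-disk inodes and the incore counts; unused inodes are simply absent from `on_disk`, which is how the "skip" shows up in this toy version.

```python
def fix_link_counts(on_disk, counted):
    """Reset on-disk link counts that disagree with the traversal.

    on_disk: {inode: nlink} for in-use inodes only (updated in place)
    counted: {inode: links seen during directory traversal}
    Returns the {inode: new nlink} corrections that were applied.
    """
    fixes = {}
    for ino, nlink in on_disk.items():
        expected = counted.get(ino, 0)
        if nlink != expected:
            fixes[ino] = expected
    on_disk.update(fixes)
    return fixes

if __name__ == "__main__":
    disk = {128: 2, 129: 1, 130: 5}
    print(fix_link_counts(disk, {128: 2, 129: 3, 130: 1}), disk)
```
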
---

multiple TB xfs_repair

modify above to work in a couple of AGs at a time. The bitmaps
should span only the current set of AGs.

The key is to scan the inode bmaps and keep a list of inodes
that span multiple AG sets and keep the list in a data structure
that's keyed off AG set # as well as inode # and also has a bit
to indicate whether or not the inode will be cleared.

Then in each AG set, when doing duplicate extent processing,
you have to process all multi-AG-set inodes that claim blocks in
the current AG set. If there's a conflict, you clear the
inode in the current AG and you mark the multi-AG inode as
"to be cleared".

After going through all AGs, you can clear the to-be-cleared
multi-AG-set inodes and pull them off the list.

When building up the AG freespace trees, you walk the bmaps
of all multi-AG-set inodes that are in the AG-set and include
blocks claimed in the AG by the inode as used.

This probably involves adding a phase 3-0 which would have to
check all the inodes to see which ones are multi-AG-set inodes
and set up the multi-AG-set inode data structure. Plus the
process_dinode routines may have to be altered just a bit
to do the right thing if running in terabyte mode (call
out to routines that check the multi-AG-set inodes when
appropriate).

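A sketch of what the phase 3-0 scan might build. Big assumptions: block numbers are treated as flat offsets into the filesystem (real XFS block numbers encode the AG number, which this ignores), only the extents are looked at (not which AG the inode itself lives in), and the table layout and function name are invented here. The result is keyed off AG set # and inode #, with a "to be cleared" bit that starts out false.

```python
from collections import defaultdict

def find_multi_ag_set_inodes(inode_extents, blocks_per_ag, ags_per_set):
    """Return {ag_set: {inode: to_be_cleared}} for inodes spanning AG sets.

    inode_extents: {inode: [(start_block, length), ...]}
    """
    blocks_per_set = blocks_per_ag * ags_per_set
    table = defaultdict(dict)
    for ino, extents in inode_extents.items():
        sets = set()
        for start, length in extents:
            first = start // blocks_per_set
            last = (start + length - 1) // blocks_per_set
            sets.update(range(first, last + 1))
        if len(sets) > 1:
            for ag_set in sets:
                table[ag_set][ino] = False  # not marked for clearing yet
    return {ag_set: dict(inos) for ag_set, inos in table.items()}

if __name__ == "__main__":
    extents = {128: [(10, 5)], 129: [(190, 20)], 130: [(50, 1), (450, 1)]}
    print(find_multi_ag_set_inodes(extents, blocks_per_ag=100, ags_per_set=2))
```

When duplicate extent processing for one AG set finds a conflict, flipping the flag in every AG set's entry for that inode would be the "mark as to be cleared" step.
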
To make things go faster, phase 3-0 could probably run
in parallel. It should be possible to run phases 2-5
in parallel as well once the appropriate synchronization
is added to the incore routines and the static directory
leaf block bitmap is changed to be on the stack.

Phase 7 probably can be in parallel as well.

By in parallel, I mean that assuming that an AG-set
contains 4 AGs, you could run 4 threads, 1 per AG
in parallel to process the AG set.

I don't see how phase 6 can be run in parallel though.

And running Phase 8 in parallel is just silly.