Commit | Line | Data |
---|---|---|
2bd0ea18 NS |
1 | A living document. The basic algorithm. |
2 | ||
3 | TODO: (D == DONE) | |
4 | ||
5 | 0) Need to bring some sanity into the case of flags that can | |
6 | be set in the secondaries at mkfs time but reset or cleared | |
7 | in the primary later in the filesystem's life. | |
8 | ||
9 | 0) Clear the persistent read-only bit if set. Clear the | |
10 | shared bit if set and the version number is zero. This | |
11 | brings the filesystem back to a known state. | |
12 | ||
13 | 0) make sure that superblock geometry code checks the logstart | |
14 | value against whether or not we have an internal log. | |
15 | If we have an internal log and a logdev, that's ok. | |
16 | (Maybe we just aren't using it). If we have an external | |
17 | log (logstart == 0) but no logdev, that's right out. | |
18 | ||
19 | 0) write secondary superblock search code. Rewrite initial | |
20 | superblock parsing code to be less complicated. Just | |
21 | use variables to indicate primary, secondary, etc., | |
22 | and use a function to get the SB given a specific location | |
23 | or something. | |
24 | ||
25 | 2) For inode alignment, if the SB bit is set and the | |
26 | inode alignment size field in the SB is set, then | |
27 | believe that the fs inodes MUST be aligned and | |
28 | disallow any non-aligned inodes. Likewise, if | |
29 | the SB bit isn't set (or earlier version) and | |
30 | the inode alignment size field is zero, then | |
31 | never set the bit even if the inodes are aligned. | |
32 | Note that the bits and alignment values are | |
33 | replicated in the secondary superblocks. | |
34 | ||
35 | 0) add feature specification options to parse_arguments | |
36 | ||
37 | 0) add logic to add_inode_ref(), add_inode_reached() | |
38 | to detect nlink overflows in cases where the fs | |
39 | (or user had indicated fs) doesn't support new nlinks. | |
40 | ||
41 | 6) check to make sure that the inodes containing btree blocks | |
42 | with # recs < minrecs aren't legit -- e.g. the only | |
43 | descendant of a root block. | |
44 | ||
45 | 7) inode di_size value sanity checking -- should never be greater than | |
46 | the biggest filebno offset mentioned in the bmaps. Doesn't | |
47 | have to be equal though since we're allowed to overallocate | |
48 | (it just wastes a little space). This is for both regular | |
49 | files and directories (have to modify the existing directory | |
50 | check). | |
51 | ||
52 | Add tracking of largest offset in bmap scanning code. Compare | |
53 | value against di_size. Should be >= di_size. | |
54 | ||
55 | Alternatively, you could pass the inode down through | |
56 | the extent record processing layer and make the checks | |
57 | there. | |
58 | ||
59 | Add knowledge of quota inodes. size of quota inode is | |
60 | always zero. We should maintain that. | |
61 | ||
62 | 8) Basic quota stuff. | |
63 | ||
64 | Invariants | |
65 | if quota feature bit is set, the quota inodes | |
66 | if set, should point to disconnected, 0 len inodes. | |
67 | ||
68 | D - if quota inodes exist, the quota bits must be | |
69 | turned on. It's ok for the quota flags to be | |
70 | zeroed but they should be in a legal state | |
71 | (see xfs_quota.h). | |
72 | ||
dfc130f3 | 73 | D - if the quota flags are non-zero, the corresponding |
2bd0ea18 NS |
74 | quota inodes must exist. |
75 | ||
76 | quota inodes are never deleted, only their space | |
77 | is freed. | |
78 | ||
79 | if quotas are being downgraded, then check quota inodes | |
80 | at the end of phase 3. If they haven't been cleared yet, | |
81 | clear them. Regardless, then clear sb flags (quota inode | |
82 | fields, quota flags, and quota bit). | |
83 | ||
84 | ||
85 | 5) look at verify_inode_chunk(). it's probably really broken. | |
86 | ||
87 | ||
88 | 9) Complicated quota stuff. Add code to bmap scan code to | |
89 | track used blocks. Add another pair of AVL trees | |
90 | to track user and project quota limits. Set AVL | |
91 | trees up at the beginning of phase 3. Quota inodes | |
92 | can be rebuilt or corrected later if damaged. | |
93 | ||
94 | ||
95 | D - 0) fix directory processing. phase 3, if an entry references | |
96 | a free inode, *don't* mark it used. wait for the rest of | |
97 | phase 3 processing to hit that inode. If it looks like it's | |
98 | in use, we'll mark in use then. If not, we'll clear it and | |
99 | mark the inode map. then in phase 4, you can depend on the | |
100 | inode map. should probably set the parent info in phase 4. | |
101 | So we have a check_dups flag. Maybe we should change the | |
102 | name of check_dir to discover_inodes. During phase 3 | |
103 | (discover_inodes == 1), uncertain inodes are added to list. | |
104 | During phase 4 (discover_inodes == 0), they aren't. And | |
105 | we never mark inodes in use from the directory code. | |
106 | During phase 4, we shouldn't complain about names with | |
107 | a leading '/' since we made those names in phase 3. | |
108 | ||
109 | Have to change dino_chunks.c (parent setting), dinode.c | |
110 | and dir.c. | |
111 | ||
112 | D - 0) make sure we don't screw up filesystems with real-time inodes. | |
113 | remember to initialize real-time map with all blocks XR_E_FREE. | |
114 | ||
115 | D - 4) check contents of symlinks as well as lengths in process_symlinks() | |
116 | in dinode.c. Right now, we only check lengths. | |
117 | ||
118 | ||
119 | D - 1) Feature mismatches -- for quotas and attributes, | |
120 | if the stuff exists in the filesystem, set the | |
121 | superblock version bits. | |
122 | ||
123 | D - 0) rewrite directory leaf block holemap comparison code. | |
124 | probably should just check the leaf block hole info | |
125 | against our incore bitmap. If the hole flag is not | |
126 | set, then we know that there can only be one hole and | |
127 | it has to be between the entry table and the top of heap. | |
128 | If the hole flag is set, then it's ok if the on-disk | |
129 | holemap doesn't describe everything as long as what | |
130 | it does describe doesn't conflict with reality. | |
131 | ||
132 | D - 0) rewrite setting nlinks handling -- for version 1 | |
22bc10ed | 133 | inodes, set both nlinks and onlinks (zero projid_lo/hi |
2bd0ea18 NS |
134 | and pad) if we have to change anything. For |
135 | version 2, I think we're ok. | |
136 | ||
137 | D - 0) Put awareness of quota inode into mark_standalone_inodes. | |
138 | ||
139 | ||
140 | D - 8) redo handling of superblocks with bad version numbers. need | |
141 | to bail out (without harming) fs's that have sbs that | |
142 | are newer than we are. | |
143 | ||
144 | D - 0) How do we handle feature mismatches between fs and | |
145 | superblock? For nlink, check each inode after you | |
146 | know it's good. If onlinks is 0 and nlinks is > 0 | |
147 | and it's a version 2 inode, then it really is a version | |
148 | 2 inode and the nlinks flag in the SB needs to be set. | |
149 | If it's a version 2 inode and the SB agrees but onlink | |
150 | is non-zero, then clear onlink. | |
151 | ||
152 | D - 3) keep cumulative counts of freeblocks, inodes, etc. to set in | |
153 | the superblock at the end of phase 5. Remember that | |
154 | agf freeblock counters don't include blocks used by | |
155 | the non-root levels of the freespace trees but that | |
156 | the sb free block counters include those. | |
157 | ||
158 | D - 0) Do parent setting in directory code (called by phase 3). | |
159 | actually, I put it in process_inode_set and propagated | |
160 | the parent up to it from the process_dinode/process_dir | |
161 | routines. seemed cleaner than pushing the irec down | |
162 | and letting them bang on it. | |
163 | ||
164 | D - 0) If we clear a file in phase 4, make sure that if it's | |
165 | a directory that the parent info is cleared also. | |
166 | ||
167 | D - 0) put inode tree flashover (call to add_ino_backptrs) into phase 5. | |
168 | ||
169 | D - 0) do set/get_inode_parent functions in incore_ino.c. | |
170 | also do is/set/ inode_processed. | |
dfc130f3 | 171 | |
2bd0ea18 NS |
172 | D - 0) do a versions.c to extract feature info and set global vars |
173 | from the superblock version number and possibly feature bits | |
174 | ||
175 | D - 0) change longform_dir_entry_check + shortform_dir_entry_check | |
176 | to return a count of how many illegal '/' entries exist. | |
177 | if > 0, then process_dirstack needs to call prune_dir_entry | |
178 | with a hash value of 0 to delete the entries. | |
179 | ||
180 | D - 0) add the "processed" bitfield | |
181 | to the backptrs_t struct that gets attached after | |
182 | phase 4. | |
183 | ||
184 | D- ) Phase 6 !!! | |
185 | ||
186 | D - 0) look at usage of XFS_MAKE_IPTR(). It does the right | |
187 | arithmetic assuming you count your offsets from the | |
188 | beginning of the buffer. | |
189 | ||
190 | ||
191 | D - 0) look at references to XFS_INODES_PER_CHUNK. change the | |
14f8b681 | 192 | ones that really mean sizeof(uint64_t)*NBBY to |
2bd0ea18 NS |
193 | something else (like that only defined as a constant |
194 | INOS_PER_IREC. this isn't as important since | |
195 | XFS_INODES_PER_CHUNK will never change. | |
196 | ||
197 | ||
198 | D - 0) look at junk_zerolen_dir_leaf_entries() to make sure it isn't hosing | |
199 | the freemap since it assumed that bytes between the | |
200 | end of the table and firstused didn't show up in the | |
201 | freemap when they actually do. | |
202 | ||
203 | D - 0) track down XFS_INO_TO_OFFSET() usage. I don't think I'm | |
204 | using it right. (e.g. I think | |
205 | it gives you the offset of an inode into a block but | |
206 | on small block filesystems, I may be reading in inodes | |
207 | in multiblock buffers and working from the start of | |
208 | the buffer plus I'm using it to get offsets into | |
209 | my ino_rec's which may not be a good idea since I | |
210 | use 64-inode ino_rec's whereas the offset macro | |
211 | works off blocksize). | |
212 | ||
213 | D - 0.0) put buffer -> dirblock conversion macros into xfs kernel code | |
214 | ||
215 | D - 0.2) put in sibling pointer checking and path fixup into | |
216 | bmap (long form) scan routines in scan.c | |
217 | D - 0.3) find out if bmap btrees with only root blocks are legal. I'm | |
218 | betting that they're not because they'd be extent inodes | |
219 | instead. If that's the case, rip some code out of | |
220 | process_btinode() | |
221 | ||
222 | ||
223 | Algorithm (XXX means not done yet): | |
224 | ||
225 | Phase 1 -- get a superblock and zero log | |
226 | ||
227 | get a superblock -- either read in primary or | |
228 | find a secondary (ag header), check ag headers | |
229 | ||
230 | To find secondary: | |
231 | ||
232 | Go for brute force and read in the filesystem N meg | |
233 | at a time looking for a superblock. as a | |
234 | slight optimization, we could maybe skip | |
235 | ahead some number of blocks to try and get | |
236 | towards the end of the first ag. | |
237 | ||
238 | After you find a secondary, try and find at least | |
239 | other ags as a verification that the | |
240 | secondary is a good superblock. | |
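The brute-force search described above can be sketched roughly as follows. This is a minimal illustration, not the xfs_repair implementation: `find_sb_candidate`, `be32_at` and `SB_MAGIC` are hypothetical names, the chunk is assumed to be already read into memory, and candidates are assumed to sit on sector-aligned offsets. The magic value is the on-disk "XFSB" signature. A match is only a candidate; it still needs the verification against other ags that the text calls for.

```c
#include <stddef.h>
#include <stdint.h>

#define SB_MAGIC 0x58465342u	/* "XFSB" read as a big-endian 32-bit value */

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Scan one in-memory chunk at sector-aligned offsets, beginning at the
 * first aligned offset >= start.  Returns the offset of the first
 * candidate superblock, or -1 if the chunk contains none.
 * (Hypothetical helper; names are not taken from the xfs_repair source.)
 */
long find_sb_candidate(const unsigned char *buf, size_t len,
		       size_t start, size_t sector_size)
{
	size_t off;

	if (sector_size < 4)
		return -1;

	/* round the starting point up to a sector boundary */
	off = (start + sector_size - 1) / sector_size * sector_size;

	for (; off + 4 <= len; off += sector_size) {
		if (be32_at(buf + off) == SB_MAGIC)
			return (long)off;
	}
	return -1;
}
```

Passing a non-zero `start` corresponds to the "skip ahead" optimization mentioned above: the caller can begin the scan near where it expects the first ag to end.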
241 | ||
242 | XXX - Ugh. Have to take growfs'ed filesystems into account. | |
243 | The root superblock geometry info may not be right if | |
244 | recovery hasn't run or it's been trashed. The old ag's | |
245 | may or may not be right since the system could have crashed | |
246 | during growfs or the bwrite() to the superblocks could have | |
247 | failed and the buffer been reused. So we need to check | |
248 | to see if another ag exists beyond the "last" ag | |
249 | to see if a growfs happened. If not, then we know that | |
250 | the geometry info is good and treat the fs as a non-growfs'ed | |
251 | fs. If we do have inconsistencies, then the smaller geometry | |
252 | is the old fs and the larger the new. We can check the | |
253 | new superblocks to see if they're good. If not, then we | |
254 | know the system crashed at or soon after the growfs and | |
255 | we can choose to either accept the new geometry info or | |
256 | trash it and truncate the fs back to the old geometry | |
257 | parameters. | |
258 | ||
259 | Cross-check geometry information in secondary sb's with | |
260 | primary to ensure that it's correct. | |
261 | ||
262 | Use sim code to allow mounting filesystems *without* reading | |
263 | in root inode. This sets up the xfs_mount_t structure | |
264 | and allows us to use XFS_* macros that we wouldn't | |
265 | otherwise be able to use. | |
266 | ||
267 | Note, I split phase 1 and 2 into separate pieces because I want | |
268 | to initialize the xfs_repair incore data structures after phase 1. | |
269 | ||
270 | parse superblock version and feature flags and set appropriate | |
271 | global vars to reflect the flags (attributes, quotas, etc.) | |
272 | ||
273 | Workaround for the mkfs "not zeroing the superblock buffer" bug. | |
274 | Determine what field is the last valid non-zero field in | |
275 | the superblock. The trick here is to be able to differentiate | |
276 | the last valid non-zero field in the primary superblock and | |
277 | secondaries because they may not be the same. Fields in | |
278 | the primary can be set as the filesystem gets upgraded but | |
279 | the upgrades won't touch the secondaries. This means that | |
280 | we need to find some number of secondaries and check them. | |
281 | So we do the checking here and the setting in phase2. | |
282 | ||
283 | Phase 2 -- check integrity of allocation group allocation structures | |
284 | ||
285 | zero the log if not in no_modify mode | |
286 | ||
287 | sanity check ag headers -- superblocks match, agi isn't | |
288 | trashed -- the agf and agfl | |
289 | don't really matter because we can | |
290 | just recreate them later. | |
291 | ||
292 | Zero part of the superblock buffer if necessary | |
293 | ||
294 | Walk the freeblock trees to get an | |
295 | initial idea of what the fs thinks is free. | |
296 | Files that disagree (claim free'd blocks) | |
297 | can be salvaged or deleted. If the btree is | |
298 | internally inconsistent, when in doubt, mark | |
299 | blocks free. If they're used, they'll be stolen | |
300 | back later. don't have to check sibling pointers | |
301 | for each level since we're going to regenerate | |
302 | all the trees anyway. | |
303 | Walk the inode allocation trees and | |
304 | make sure they're ok, otherwise the sim | |
305 | inode routines will probably just barf. | |
306 | mark inode allocation tree blocks and ag header | |
307 | blocks as used blocks. If the trees are | |
308 | corrupted, this phase will generate "uncertain" | |
309 | inode chunks. Those chunks go on a list and | |
310 | will have to be verified later. Record the blocks | |
311 | that are used to detect corruption and multiply | |
312 | claimed blocks. These trees will be regenerated | |
313 | later. Mark the blocks containing inodes referenced | |
314 | by uncorrupted inode trees as being used by inodes. | |
315 | The other blocks will get marked when/if the inodes | |
316 | are verified. | |
317 | ||
318 | calculate root and realtime inode numbers from the | |
319 | filesystem geometry, fix up mount structure's | |
320 | incore superblock if they're wrong. | |
321 | ||
322 | ASSUMPTION: at end of phase 2, we've got superblocks and ag headers | |
323 | that are not garbage (some data in them like counters and the | |
324 | freeblock and inode trees may be inconsistent but the header | |
325 | is readable and otherwise makes sense). | |
326 | ||
327 | XXX if in no_modify mode, check for blocks claimed by one freespace | |
328 | btree and not the other | |
dfc130f3 | 329 | |
2bd0ea18 NS |
330 | Phase 3 -- traverse inodes to make the inodes, bmaps and freespace maps |
331 | consistent. For each ag, use either the incore inode map or | |
332 | scan the ag for inodes. | |
333 | Let's use the incore inode map, now that we've made one | |
334 | up in phase2. If we lose the maps, we'll locate inodes | |
335 | when we traverse the directory hierarchy. If we lose both, | |
336 | we could scan the disk. Ugh. Maybe make that a command-line | |
337 | option that we support later. | |
dfc130f3 | 338 | |
2bd0ea18 NS |
339 | ASSUMPTION: we know if the ag allocation btrees are intact (phase 2) |
340 | ||
341 | First - Walk and clear the ag unlinked lists. We'll process | |
342 | the inodes later. Check and make sure that the unlinked | |
343 | lists reference known inodes. If not, add to the list | |
344 | of uncertain inodes. | |
345 | ||
346 | Second, check the uncertain inode list generated in phase2 and | |
347 | above and get them into the inode tree if they're good. | |
348 | The incore inode cluster tree *always* has good | |
349 | clusters (alignment, etc.) in it. | |
dfc130f3 | 350 | |
2bd0ea18 NS |
351 | Third, make sure that the root inode is known. If not, |
352 | and we know the inode number from the superblock, | |
ff1f79a7 | 353 | discover that inode and its chunk. |
2bd0ea18 NS |
354 | |
355 | Then, walk the incore inode-cluster tree. | |
356 | ||
357 | Maintain an in-core bitmap over the entire fs for block allocation. | |
358 | ||
359 | traverse each inode, make sure inode mode field matches free/allocated | |
360 | bit in the incore inode allocation tree. If there's a mismatch, | |
361 | assume that the inode is in use. | |
362 | ||
363 | - for each in-use inode, traverse each bmap/dir/attribute | |
364 | map or tree. Maintain a map (extent list?) for the | |
365 | current inode. | |
366 | ||
367 | - For each block marked as used, check to see if already known | |
368 | (referenced by another file or directory) and sanity | |
369 | check the contents of the block as well if possible | |
370 | (in the case of meta-blocks). | |
371 | ||
372 | - if the inode claims already used blocks, mark the blocks | |
373 | as multiply claimed (duplicate) and go on. the inode | |
374 | will be cleared in phase 4. | |
375 | ||
376 | - if metablocks are garbaged, clear the inode after | |
377 | traversing what you can of the bmap and | |
378 | proceed to next inode. We don't have to worry | |
379 | about trashing the maps or trees in cleared inodes | |
380 | because the blocks will show up as free in the | |
381 | ag freespace trees that we set up in phase 5. | |
382 | ||
383 | - clear the di_next_unlinked pointer -- all unlinked | |
384 | but active files go bye-bye. | |
385 | ||
386 | - All blocks start out unknown. We need the last state | |
387 | in case we run into a case where we need to step | |
388 | on a block to store filesystem meta-data and it | |
389 | turns out later that it's referenced by some inode's | |
390 | bmap. In that case, the inode loses because we've | |
391 | already trashed the block. This shouldn't happen | |
392 | in the first version unless some inode has a bogus | |
393 | bmap referencing blocks in the ag header but the | |
394 | 4th state will keep us from inadvertently doing | |
395 | something stupid in that case. | |
396 | ||
397 | - If inode is allocated, mark all blocks allocated to the | |
398 | current inode as allocated in the incore freespace | |
399 | bitmap. | |
400 | ||
dfc130f3 | 401 | - If inode is good and a directory, scan through it to |
2bd0ea18 | 402 | find leaf entries and discover any unknown inodes. |
dfc130f3 | 403 | |
2bd0ea18 NS |
404 | For shortform, we correct what we can. |
405 | ||
406 | If the directory is corrupt, we try and fix it in | |
407 | place. If it has zero good entries, then we blast it. | |
408 | ||
409 | All unknown inodes get put onto the uncertain inode | |
410 | list. This is safe because we only put inodes onto | |
411 | the list when we're processing known inodes so the | |
412 | uncertain inode list isn't in use. | |
413 | ||
414 | We fix only one problem -- entries that have | |
415 | mathematically invalid inode numbers in them. | |
416 | If that's the case, we replace the inode number | |
417 | with NULLFSINO and we'll fix up the entry in | |
418 | phase 6. | |
419 | ||
420 | That info may conflict with the inode information, | |
421 | but we'll straighten out any inconsistencies there | |
422 | in phase4 when we process the inodes again. | |
423 | ||
424 | Errors involving bogus forward/back links, | |
425 | zero-length entries make the directory get | |
426 | trashed. | |
427 | ||
428 | if an entry references a free inode, ignore that | |
429 | fact for now. wait for the rest of phase 3 | |
430 | processing to hit that inode. If it looks like it's | |
431 | in use, we'll mark in use then. If not, we'll | |
432 | clear it and mark the inode map. then in phase | |
433 | 4, you can depend on the inode map. | |
dfc130f3 | 434 | |
2bd0ea18 NS |
435 | Entries that point to non-existent or free |
436 | inodes, and extra blocks in the directory | |
437 | will get fixed in place in a later pass. | |
438 | ||
439 | Entries that point to a quota inode are | |
440 | marked TBD. | |
441 | ||
442 | If the directory internally points to the same | |
443 | block twice, the directory gets blown away. | |
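The block-claim rules in the bullets above (blocks start out unknown, a second claim on a used block makes it multiply claimed) can be sketched as a small state map. This is an illustrative sketch only: the `BLK_*` names and `claim_block` are hypothetical and are not the state names used by xfs_repair, and one array entry per block is used here purely for clarity.

```c
#include <stddef.h>

/* Hypothetical per-block states for the incore block map. */
enum blk_state {
	BLK_UNKNOWN = 0,	/* nothing has claimed or freed the block yet */
	BLK_FREE,		/* freespace trees say the block is free */
	BLK_USED,		/* claimed by exactly one owner */
	BLK_MULT		/* claimed more than once (duplicate) */
};

/*
 * Record that an inode claims block bno.
 * Returns 0 if this is the first claim, 1 if the block was already
 * claimed (it is then marked multiply claimed), and -1 if bno is out
 * of range.
 */
int claim_block(enum blk_state *map, size_t nblocks, size_t bno)
{
	if (bno >= nblocks)
		return -1;

	switch (map[bno]) {
	case BLK_UNKNOWN:
	case BLK_FREE:
		/* blocks marked free "when in doubt" get stolen back here */
		map[bno] = BLK_USED;
		return 0;
	case BLK_USED:
	case BLK_MULT:
		map[bno] = BLK_MULT;
		return 1;
	}
	return -1;
}
```

A caller walking an inode's bmap would call `claim_block` per block and just keep going on a return of 1, since the inode itself is only cleared later, in phase 4.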
444 | ||
445 | Note that processing uncertain inodes can add more inodes | |
446 | to the uncertain list if they're directories. So we loop | |
447 | until the uncertain list is empty. | |
448 | ||
449 | During inode verification, if the inode blocks are unknown, | |
450 | mark them as in-use by inodes. | |
451 | ||
452 | XXX HEURISTIC -- if we blow an inode away that has space, | |
453 | assume that the freespace btree is now out of whack. | |
454 | If it was ok earlier, it's certain to be wrong now. | |
455 | And the odds of this space free cancelling out the | |
456 | existing error is so small I'm willing to ignore it. | |
457 | Should probably do this via a global var and complain | |
458 | about this later. | |
459 | ||
460 | Assumption: All known inodes are now marked as in-use or free. Any | |
461 | inodes that we haven't found by now are hosed (lost) since | |
462 | we can't reach them via either the inode btrees or via directory | |
463 | entries. | |
464 | ||
465 | Directories are semi-clean. All '.' entries are good. | |
466 | Root '..' entry is good if root inode exists. All entries | |
dfc130f3 | 467 | referencing non-existent inodes, free inodes, etc. |
2bd0ea18 NS |
468 | |
469 | XXX verify that either quota inode is 0 or NULLFSINO or | |
470 | if sb quota flag is non zero, verify that quota inode | |
471 | is NULLFSINO or is referencing a used, but disconnected | |
472 | inode. | |
473 | ||
474 | XXX if in no_modify mode, check for unclaimed blocks | |
475 | ||
476 | - Phase 4 - Check for inodes referencing duplicate blocks | |
477 | ||
478 | At this point, all known duplicate blocks are marked in | |
479 | the block map. However, some of the claimed blocks in | |
480 | the bmap may in fact be free because they belong to inodes | |
481 | that have to be cleared either due to being a trashed | |
482 | directory or because it's the first inode to claim a | |
483 | block that was then claimed later. There's a similar | |
484 | problem with meta-data blocks that are referenced by | |
485 | inode bmaps that are going to be freed once the inode | |
486 | (or directory) gets cleared. | |
487 | ||
488 | So at this point, we collect the duplicate blocks into | |
489 | extents and put them into the duplicate extent list. | |
490 | ||
491 | Mark the ag header blocks as in use. | |
492 | ||
493 | We then process each inode twice -- the first time | |
494 | we check to see if the inode claims a duplicate extent | |
495 | and we do NOT set the block bitmap. If the inode claims | |
496 | a duplicate extent, we clear the inode. Since the bitmap | |
497 | hasn't been set, that automatically frees all blocks associated | |
498 | with the cleared inode. If the inode is ok, process it a second | |
499 | time and set the bitmap since we know that this inode will live. | |
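The two-pass idea above might look something like this in outline. It is a sketch under simplifying assumptions: inodes are reduced to a flat extent array, the bitmap is one byte per block, and `struct sketch_inode`, `claims_dup` and `process_inodes` are hypothetical names rather than xfs_repair routines.

```c
#include <stddef.h>

struct extent {
	unsigned long start;	/* first block */
	unsigned long len;	/* number of blocks */
};

/* Hypothetical, stripped-down inode: just its extents and a flag. */
struct sketch_inode {
	const struct extent *ext;
	size_t nextents;
	int cleared;
};

static int overlaps(const struct extent *a, const struct extent *b)
{
	return a->start < b->start + b->len && b->start < a->start + a->len;
}

/* First pass: does the inode claim anything on the duplicate list? */
static int claims_dup(const struct sketch_inode *ip,
		      const struct extent *dups, size_t ndups)
{
	size_t i, j;

	for (i = 0; i < ip->nextents; i++)
		for (j = 0; j < ndups; j++)
			if (overlaps(&ip->ext[i], &dups[j]))
				return 1;
	return 0;
}

/*
 * Process every inode twice.  An inode that touches a duplicate extent
 * is cleared before the bitmap is set, so its blocks stay free without
 * any extra work.  Survivors get their blocks set in the second pass.
 * Returns the number of inodes cleared.
 */
size_t process_inodes(struct sketch_inode *inodes, size_t ninodes,
		      const struct extent *dups, size_t ndups,
		      unsigned char *bitmap, size_t nblocks)
{
	size_t n, i, cleared = 0;
	unsigned long b;

	for (n = 0; n < ninodes; n++) {
		struct sketch_inode *ip = &inodes[n];

		if (claims_dup(ip, dups, ndups)) {
			ip->cleared = 1;
			cleared++;
			continue;
		}
		for (i = 0; i < ip->nextents; i++)
			for (b = ip->ext[i].start;
			     b < ip->ext[i].start + ip->ext[i].len &&
			     b < nblocks; b++)
				bitmap[b] = 1;
	}
	return cleared;
}
```

Note that every inode claiming a duplicate extent is cleared in this sketch, including the first one to have claimed it, which matches the description above.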
500 | ||
501 | The unlinked list gets cleared in every inode at this point as | |
502 | well. We no longer need to preserve it since we've discovered | |
503 | every inode we're going to find from it. | |
504 | ||
505 | verify existence of root inode. if it exists, check for | |
506 | existence of "lost+found". If it exists, mark the entry | |
507 | to be deleted, and clear the inode. All the inodes that | |
508 | were connected to the lost+found will be reconnected later. | |
509 | ||
510 | XXX HEURISTIC -- if we blow an inode away that has space, | |
511 | assume that the freespace btree is now out of whack. | |
512 | If it was ok earlier, it's certain to be wrong now. | |
513 | And the odds of this space free cancelling out the | |
514 | existing error is so small I'm willing to ignore it. | |
515 | Should probably do this via a global var and complain | |
516 | about this later. | |
517 | ||
518 | Clear the quota inodes if the inode btree says that | |
519 | they're not in use. The space freed will get picked | |
520 | up by phase 5. | |
dfc130f3 | 521 | |
2bd0ea18 NS |
522 | XXX Clear the quota inodes if the filesystem is being downgraded. |
523 | ||
524 | - Phase 5 - Build inode allocation trees, freespace trees and | |
525 | agfl's for each ag. After this, we should be able to | |
526 | unmount the filesystem and remount it for real. | |
527 | ||
528 | For each ag: (if not in no_modify mode) | |
529 | ||
530 | scan bitmap first to figure out number of extents. | |
dfc130f3 | 531 | |
2bd0ea18 NS |
532 | calculate space required for all trees. Start with inode trees. |
533 | Setup the btree cursor which includes the list of preallocated | |
534 | blocks. As a by-product, this will delete the extents required | |
535 | for the inode tree from the incore extent tree. | |
dfc130f3 | 536 | |
2bd0ea18 NS |
537 | Calculate how many extents will be required to represent the |
538 | remaining free extent tree on disk (twice, one for bybno and | |
539 | one for bycnt). You have to iterate on this because consuming | |
540 | extents can alter the number of blocks required to represent | |
541 | the remaining extents. If there's slop left over, you can | |
542 | put it in the agfl though. | |
543 | ||
544 | Then, manually build the trees, agi, agfs, and agfls. | |
545 | ||
546 | XXX if in no_modify mode, scan the on-disk inode allocation | |
547 | trees and compare against the incore versions. Don't have | |
548 | to scan the freespace trees because we caught the problems | |
549 | there in phase2 and phase3. But if we cleared any inodes | |
550 | with space during phases 3 or 4, now is the time to complain. | |
551 | ||
dfc130f3 | 552 | XXX - Free duplicate extent lists. ??? |
2bd0ea18 NS |
553 | |
554 | Assumptions: at this point, sim code having to do with inode | |
555 | creation/modification/deletion and space allocation | |
556 | work because the inode maps, space maps, and bmaps | |
557 | for all files in the filesystem are good. The only | |
558 | structures that are screwed up are the directory contents, | |
559 | which means that lookup may not work for beans, the | |
560 | root inode which exists but may be completely bogus and | |
561 | the link counts on all inodes which may also be bogus. | |
562 | ||
563 | Free the bitmap, the freespace tree. | |
564 | ||
dfc130f3 | 565 | Flash the incore inode tree over from parent list to having |
2bd0ea18 NS |
566 | full backpointers. |
567 | ||
568 | realtime processing, if any -- | |
569 | ||
570 | (Skip to below if running in no_modify mode). | |
571 | ||
572 | Generate the realtime bitmap from the incore realtime | |
573 | extent map and slam the info into the realtime bitmap | |
574 | inode. Generate summary info from the realtime extent map. | |
dfc130f3 | 575 | |
2bd0ea18 NS |
576 | XXX if in no_modify mode, compare contents of realtime bitmap |
577 | inode to the incore realtime extent map. generate the | |
578 | summary info from the incore realtime extent map. | |
579 | compare against the contents of the realtime summary inode. | |
580 | complain if bad. | |
581 | ||
582 | reset superblock counters, sync version numbers | |
583 | ||
584 | - Phase 6 - directory traversal -- check reference counts, | |
585 | attach disconnected inodes, fix up bogus directories | |
586 | ||
587 | Assumptions: all on-disk space and inode trees are structurally | |
588 | sound. Incore and on-disk inode trees agree on whether | |
589 | an inode is in use. | |
590 | ||
591 | Directories are structurally sound. All hashvalues | |
592 | are monotonically increasing and interior nodes are | |
593 | correct so lookups work. All legal directory entries | |
594 | point to inodes that are in use and exist. Shortform | |
595 | directories are fine except that the links haven't been | |
596 | checked for conflicts (cycles, ".." being correct, etc.). | |
597 | Longform directories haven't been checked for those problems | |
598 | either PLUS longform directories may still contain | |
599 | entries beginning with '/'. No zero-length entries | |
600 | exist (they've been deleted or converted to '/'). | |
601 | ||
602 | Root directory may or may not exist. orphanage may | |
603 | or may not exist. Contents of either may be completely | |
604 | bogus. | |
605 | ||
606 | Entries may point to free or non-existent inodes. | |
607 | ||
608 | At this point, we may need new incore structures and | |
609 | may be able to trash an old one (like the filesystem | |
610 | block map) | |
611 | ||
612 | If '/' is trashed, then reinitialize it. | |
613 | ||
614 | If no realtime inodes, make them and if necessary, slam the | |
615 | summary info into the realtime summary | |
616 | inode. Ditto with the realtime bitmap inode. | |
dfc130f3 | 617 | |
2bd0ea18 NS |
618 | Make orphanage (lost+found ???). |
619 | ||
620 | Traverse each directory from '/' (unless it was created). | |
621 | Check directory structure and each directory entry. | |
622 | If the entry is bogus (points to a non-existent or | |
623 | free inode, for example), mark that entry TBD. Maintain | |
624 | link counts on all inodes. Currently, traversal is | |
625 | depth-first. | |
626 | ||
627 | Mark every inode reached as "reached" (includes | |
628 | bumping up link counts). | |
629 | ||
630 | If an entry points to a directory but the parent (..) | |
631 | disagrees, then blow away the entry. if the directory | |
632 | being pointed to winds up disconnected, it'll be moved | |
633 | to the orphanage (and the link count incremented to | |
634 | account for the link and the reached bit set then). | |
635 | ||
636 | If an entry points to a directory that we've already | |
637 | reached, then some entry is bad and should be blown | |
638 | away. It's easiest to blow away the current entry | |
639 | plus since presumably the parent entry in the | |
640 | reached directory points to another directory, | |
641 | then it's far more likely that the current | |
642 | entry is bogus (otherwise the parent should point | |
643 | at it). | |
644 | ||
645 | If an entry points to a non-existent or free inode, | |
646 | blow the entry away. | |
647 | ||
648 | Every time a good entry is encountered update the | |
649 | link count for the inode that the entry points to. | |
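The traversal rules above can be sketched on a toy in-memory model. Everything here is an assumption made for illustration: `struct sk_inode`, `traverse` and the fixed-size entry arrays are hypothetical, inodes are indexed by small integers, and link counting covers only ordinary entries (no "." or ".." accounting). The caller is expected to mark the root as reached before starting.

```c
#include <stddef.h>

#define MAX_ENTRIES 8

/* Hypothetical toy inode used only for this sketch. */
struct sk_inode {
	int in_use;
	int is_dir;
	int parent;			/* ".." target, directories only */
	int nentries;
	int entry[MAX_ENTRIES];		/* inode index each entry points to */
	int dropped[MAX_ENTRIES];	/* 1 = entry marked for deletion */
	int reached;
	int links;			/* link count found by traversal */
};

/* Depth-first traversal starting at directory dir. */
void traverse(struct sk_inode *tab, int ninodes, int dir)
{
	int i;

	for (i = 0; i < tab[dir].nentries; i++) {
		int child = tab[dir].entry[i];

		/* entry points to a non-existent or free inode */
		if (child < 0 || child >= ninodes || !tab[child].in_use) {
			tab[dir].dropped[i] = 1;
			continue;
		}
		/*
		 * entry points to a directory whose ".." disagrees, or
		 * to a directory that has already been reached
		 */
		if (tab[child].is_dir &&
		    (tab[child].parent != dir || tab[child].reached)) {
			tab[dir].dropped[i] = 1;
			continue;
		}
		/* good entry: count the link and mark the inode reached */
		tab[child].links++;
		tab[child].reached = 1;
		if (tab[child].is_dir)
			traverse(tab, ninodes, child);
	}
}
```

Because a directory is only descended into the first time it is reached, the recursion terminates even if the directory graph contains cycles. Directories left with `reached == 0` afterwards are the disconnected ones that the next step goes looking for.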
650 | ||
651 | After traversal, scan incore inode map for directories not | |
ff1f79a7 | 652 | reached. Go to first one and try and find its root |
2bd0ea18 NS |
653 | by following .. entries. Once at root, run traversal |
654 | algorithm. When algorithm terminates, move subtree | |
655 | root inode to the orphanage. Repeat as necessary | |
656 | until all disconnected directories are attached. | |
657 | ||
658 | Move all disconnected inodes to orphanage. | |
659 | ||
660 | - Phase 7: reset reference counts if required. | |
661 | ||
662 | Now traverse the on-disk inodes again, and make sure on-disk | |
663 | reference counts are correct. Reset if necessary. | |
664 | ||
665 | SKIP all unused inodes -- that also makes us | |
666 | skip the orphanage inode which we think is | |
667 | unused but is really used. However, the ref counts | |
668 | on that should be right so that's ok. | |
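The phase 7 pass amounts to comparing the on-disk count with the count gathered during traversal. A minimal sketch, assuming a hypothetical `struct p7_inode` that carries both numbers (the real incore structures are different):

```c
#include <stddef.h>

/* Hypothetical record pairing on-disk and traversal link counts. */
struct p7_inode {
	int in_use;
	unsigned int disk_nlink;	/* reference count found on disk */
	unsigned int counted;		/* links counted during phase 6 */
};

/*
 * Reset on-disk reference counts that disagree with the counted value.
 * Unused inodes are skipped entirely.  Returns how many were reset.
 */
size_t reset_nlinks(struct p7_inode *tab, size_t n)
{
	size_t i, fixed = 0;

	for (i = 0; i < n; i++) {
		if (!tab[i].in_use)
			continue;
		if (tab[i].disk_nlink != tab[i].counted) {
			tab[i].disk_nlink = tab[i].counted;
			fixed++;
		}
	}
	return fixed;
}
```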
669 | ||
670 | --- | |
671 | ||
672 | multiple TB xfs_repair | |
673 | ||
674 | modify above to work in a couple of AGs at a time. The bitmaps | |
675 | should span only the current set of AGs. | |
676 | ||
677 | The key is to scan the inode bmaps and keep a list of inodes | |
678 | that span multiple AG sets and keep the list in a data structure | |
679 | that's keyed off AG set # as well as inode # and also has a bit | |
680 | to indicate whether or not the inode will be cleared. | |
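One possible shape for that data structure is a sorted array keyed on (AG set #, inode #) with a one-bit flag. This is only a sketch of the idea: `struct mags_inode` and the `mags_*` helpers are hypothetical names, and a sorted array with `bsearch` stands in for whatever tree or hash would actually be used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One record per (AG set, inode) pair for inodes spanning AG sets. */
struct mags_inode {
	uint32_t agset;			/* AG set the inode claims blocks in */
	uint64_t ino;			/* inode number */
	unsigned int to_clear : 1;	/* inode will be cleared */
};

/* Order records by AG set number first, then by inode number. */
int mags_cmp(const void *a, const void *b)
{
	const struct mags_inode *x = a, *y = b;

	if (x->agset != y->agset)
		return x->agset < y->agset ? -1 : 1;
	if (x->ino != y->ino)
		return x->ino < y->ino ? -1 : 1;
	return 0;
}

/* Look up one record in a list already sorted with mags_cmp. */
struct mags_inode *mags_find(struct mags_inode *list, size_t n,
			     uint32_t agset, uint64_t ino)
{
	struct mags_inode key = { agset, ino, 0 };

	return bsearch(&key, list, n, sizeof(*list), mags_cmp);
}

/*
 * Mark an inode "to be cleared" in every AG set it appears in.
 * Returns the number of records marked.
 */
size_t mags_mark_to_clear(struct mags_inode *list, size_t n, uint64_t ino)
{
	size_t i, marked = 0;

	for (i = 0; i < n; i++) {
		if (list[i].ino == ino) {
			list[i].to_clear = 1;
			marked++;
		}
	}
	return marked;
}
```

Keying on the AG set first keeps all records for the current AG set contiguous, which suits processing one AG set at a time.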
681 | ||
682 | Then in each AG set, when doing duplicate extent processing, | |
683 | you have to process all multi-AG-set inodes that claim blocks in | |
684 | the current AG set. If there's a conflict, you clear the | |
685 | inode in the current AG and you mark the multi-AG inode as | |
686 | "to be cleared". | |
687 | ||
688 | After going through all AGs, you can clear the to-be-cleared | |
689 | multi-AG-set inodes and pull them off the list. | |
690 | ||
691 | When building up the AG freespace trees, you walk the bmaps | |
692 | of all multi-AG-set inodes that are in the AG-set and include | |
693 | blocks claimed in the AG by the inode as used. | |
694 | ||
695 | This probably involves adding a phase 3-0 which would have to | |
696 | check all the inodes to see which ones are multi-AG-set inodes | |
697 | and set up the multi-AG-set inode data structure. Plus the | |
698 | process_dinode routines may have to be altered just a bit | |
699 | to do the right thing if running in tera-byte mode (call | |
700 | out to routines that check the multi-AG-set inodes when | |
701 | appropriate). | |
702 | ||
703 | To make things go faster, phase 3-0 could probably run | |
704 | in parallel. It should be possible to run phases 2-5 | |
705 | in parallel as well once the appropriate synchronization | |
706 | is added to the incore routines and the static directory | |
707 | leaf block bitmap is changed to be on the stack. | |
708 | ||
709 | Phase 7 probably can be in parallel as well. | |
710 | ||
711 | By in parallel, I mean that assuming that an AG-set | |
712 | contains 4 AGs, you could run 4 threads, 1 per AG | |
713 | in parallel to process the AG set. | |
714 | ||
715 | I don't see how phase 6 can be run in parallel though. | |
716 | ||
717 | And running Phase 8 in parallel is just silly. |