]>
Commit | Line | Data |
---|---|---|
41dfbb2d JK |
1 | Date: Wed, 16 Oct 2013 04:34:01 -0400 |
2 | From: Jeff King <peff@peff.net> | |
3 | Subject: pack corruption post-mortem | |
4 | Abstract: Recovering a corrupted object when no good copy is available. | |
5 | Content-type: text/asciidoc | |
6 | ||
7 | How to recover an object from scratch | |
8 | ===================================== | |
9 | ||
10 | I was recently presented with a repository with a corrupted packfile, | |
11 | and was asked if the data was recoverable. This post-mortem describes | |
12 | the steps I took to investigate and fix the problem. I thought others | |
13 | might find the process interesting, and it might help somebody in the | |
14 | same situation. | |
15 | ||
16 | ******************************** | |
17 | Note: In this case, no good copy of the repository was available. For | |
18 | the much easier case where you can get the corrupted object from | |
19 | elsewhere, see link:recover-corrupted-blob-object.html[this howto]. | |
20 | ******************************** | |
21 | ||
22 | I started with an fsck, which found a problem with exactly one object | |
23 | (I've used $pack and $obj below to keep the output readable, and also | |
24 | because I'll refer to them later): | |
25 | ||
26 | ----------- | |
27 | $ git fsck | |
28 | error: $pack SHA1 checksum mismatch | |
29 | error: index CRC mismatch for object $obj from $pack at offset 51653873 | |
30 | error: inflate: data stream error (incorrect data check) | |
31 | error: cannot unpack $obj from $pack at offset 51653873 | |
32 | ----------- | |
33 | ||
34 | The pack checksum failing means a byte is munged somewhere, and it is | |
35 | presumably in the object mentioned (since both the index checksum and | |
36 | zlib were failing). | |
37 | ||
38 | Reading the zlib source code, I found that "incorrect data check" means | |
39 | that the adler-32 checksum at the end of the zlib data did not match the | |
40 | inflated data. So stepping the data through zlib would not help, as it | |
f745acb0 | 41 | did not fail until the very end, when we realize the CRC does not match. |
41dfbb2d JK |
42 | The problematic bytes could be anywhere in the object data. |
43 | ||
44 | The first thing I did was pull the broken data out of the packfile. I | |
45 | needed to know how big the object was, which I found out with: | |
46 | ||
47 | ------------ | |
48 | $ git show-index <$idx | cut -d' ' -f1 | sort -n | grep -A1 51653873 | |
49 | 51653873 | |
50 | 51664736 | |
51 | ------------ | |
52 | ||
53 | Show-index gives us the list of objects and their offsets. We throw away | |
54 | everything but the offsets, and then sort them so that our interesting | |
55 | offset (which we got from the fsck output above) is followed immediately | |
56 | by the offset of the next object. Now we know that the object data is | |
57 | 10863 bytes long, and we can grab it with: | |
58 | ||
59 | ------------ | |
60 | dd if=$pack of=object bs=1 skip=51653873 count=10863 | |
61 | ------------ | |
62 | ||
63 | I inspected a hexdump of the data, looking for any obvious bogosity | |
64 | (e.g., a 4K run of zeroes would be a good sign of filesystem | |
65 | corruption). But everything looked pretty reasonable. | |
66 | ||
67 | Note that the "object" file isn't fit for feeding straight to zlib; it | |
68 | has the git packed object header, which is variable-length. We want to | |
69 | strip that off so we can start playing with the zlib data directly. You | |
70 | can either work your way through it manually (the format is described in | |
977c47b4 | 71 | linkgit:gitformat-pack[5]), |
41dfbb2d JK |
72 | or you can walk through it in a debugger. I did the latter, creating a |
73 | valid pack like: | |
74 | ||
75 | ------------ | |
76 | # pack magic and version | |
77 | printf 'PACK\0\0\0\2' >tmp.pack | |
78 | # pack has one object | |
79 | printf '\0\0\0\1' >>tmp.pack | |
80 | # now add our object data | |
81 | cat object >>tmp.pack | |
82 | # and then append the pack trailer | |
dae2ff9b | 83 | /path/to/git.git/t/helper/test-tool sha1 -b <tmp.pack >trailer |
41dfbb2d JK |
84 | cat trailer >>tmp.pack |
85 | ------------ | |
86 | ||
87 | and then running "git index-pack tmp.pack" in the debugger (stop at | |
88 | unpack_raw_entry). Doing this, I found that there were 3 bytes of header | |
89 | (and the header itself had a sane type and size). So I stripped those | |
90 | off with: | |
91 | ||
92 | ------------ | |
93 | dd if=object of=zlib bs=1 skip=3 | |
94 | ------------ | |
95 | ||
96 | I ran the result through zlib's inflate using a custom C program. And | |
97 | while it did report the error, I did get the right number of output | |
98 | bytes (i.e., it matched git's size header that we decoded above). But | |
99 | feeding the result back to "git hash-object" didn't produce the same | |
100 | sha1. So there were some wrong bytes, but I didn't know which. The file | |
101 | happened to be C source code, so I hoped I could notice something | |
102 | obviously wrong with it, but I didn't. I even got it to compile! | |
103 | ||
104 | I also tried comparing it to other versions of the same path in the | |
105 | repository, hoping that there would be some part of the diff that didn't | |
106 | make sense. Unfortunately, this happened to be the only revision of this | |
107 | particular file in the repository, so I had nothing to compare against. | |
108 | ||
109 | So I took a different approach. Working under the guess that the | |
110 | corruption was limited to a single byte, I wrote a program to munge each | |
111 | byte individually, and try inflating the result. Since the object was | |
112 | only 10K compressed, that worked out to about 2.5M attempts, which took | |
113 | a few minutes. | |
114 | ||
115 | The program I used is here: | |
116 | ||
117 | ---------------------------------------------- | |
118 | #include <stdio.h> | |
119 | #include <unistd.h> | |
120 | #include <string.h> | |
121 | #include <signal.h> | |
122 | #include <zlib.h> | |
123 | ||
124 | static int try_zlib(unsigned char *buf, int len) | |
125 | { | |
126 | /* make this absurdly large so we don't have to loop */ | |
127 | static unsigned char out[1024*1024]; | |
128 | z_stream z; | |
129 | int ret; | |
130 | ||
131 | memset(&z, 0, sizeof(z)); | |
132 | inflateInit(&z); | |
133 | ||
134 | z.next_in = buf; | |
135 | z.avail_in = len; | |
136 | z.next_out = out; | |
137 | z.avail_out = sizeof(out); | |
138 | ||
139 | ret = inflate(&z, 0); | |
140 | inflateEnd(&z); | |
141 | return ret >= 0; | |
142 | } | |
143 | ||
144 | /* eye candy */ | |
145 | static int counter = 0; | |
146 | static void progress(int sig) | |
147 | { | |
148 | fprintf(stderr, "\r%d", counter); | |
149 | alarm(1); | |
150 | } | |
151 | ||
152 | int main(void) | |
153 | { | |
154 | /* oversized so we can read the whole buffer in */ | |
155 | unsigned char buf[1024*1024]; | |
156 | int len; | |
157 | unsigned i, j; | |
158 | ||
159 | signal(SIGALRM, progress); | |
160 | alarm(1); | |
161 | ||
162 | len = read(0, buf, sizeof(buf)); | |
163 | for (i = 0; i < len; i++) { | |
164 | unsigned char c = buf[i]; | |
165 | for (j = 0; j <= 0xff; j++) { | |
166 | buf[i] = j; | |
167 | ||
168 | counter++; | |
169 | if (try_zlib(buf, len)) | |
170 | printf("i=%d, j=%x\n", i, j); | |
171 | } | |
172 | buf[i] = c; | |
173 | } | |
174 | ||
175 | alarm(0); | |
176 | fprintf(stderr, "\n"); | |
177 | return 0; | |
178 | } | |
179 | ---------------------------------------------- | |
180 | ||
181 | I compiled and ran with: | |
182 | ||
183 | ------- | |
184 | gcc -Wall -Werror -O3 munge.c -o munge -lz | |
185 | ./munge <zlib | |
186 | ------- | |
187 | ||
188 | ||
189 | There were a few false positives early on (if you write "no data" in the | |
190 | zlib header, zlib thinks it's just fine :) ). But I got a hit about | |
191 | halfway through: | |
192 | ||
193 | ------- | |
194 | i=5642, j=c7 | |
195 | ------- | |
196 | ||
197 | I let it run to completion, and got a few more hits at the end (where it | |
f745acb0 | 198 | was munging the CRC to match our broken data). So there was a good |
41dfbb2d JK |
199 | chance this middle hit was the source of the problem. |
200 | ||
201 | I confirmed by tweaking the byte in a hex editor, zlib inflating the | |
202 | result (no errors!), and then piping the output into "git hash-object", | |
203 | which reported the sha1 of the broken object. Success! | |
204 | ||
205 | I fixed the packfile itself with: | |
206 | ||
207 | ------- | |
208 | chmod +w $pack | |
209 | printf '\xc7' | dd of=$pack bs=1 seek=51659518 conv=notrunc | |
210 | chmod -w $pack | |
211 | ------- | |
212 | ||
213 | The `\xc7` comes from the replacement byte our "munge" program found. | |
214 | The offset 51659518 is derived by taking the original object offset | |
215 | (51653873), adding the replacement offset found by "munge" (5642), and | |
216 | then adding back in the 3 bytes of git header we stripped. | |
217 | ||
218 | After that, "git fsck" ran clean. | |
219 | ||
220 | As for the corruption itself, I was lucky that it was indeed a single | |
221 | byte. In fact, it turned out to be a single bit. The byte 0xc7 was | |
222 | corrupted to 0xc5. So presumably it was caused by faulty hardware, or a | |
223 | cosmic ray. | |
224 | ||
225 | And the aborted attempt to look at the inflated output to see what was | |
226 | wrong? I could have looked forever and never found it. Here's the diff | |
227 | between what the corrupted data inflates to, versus the real data: | |
228 | ||
229 | -------------- | |
230 | - cp = strtok (arg, "+"); | |
231 | + cp = strtok (arg, "."); | |
232 | -------------- | |
233 | ||
234 | It tweaked one byte and still ended up as valid, readable C that just | |
235 | happened to do something totally different! One takeaway is that on a | |
236 | less unlucky day, looking at the zlib output might have actually been | |
237 | helpful, as most random changes would actually break the C code. | |
238 | ||
239 | But more importantly, git's hashing and checksumming noticed a problem | |
240 | that easily could have gone undetected in another system. The result | |
241 | still compiled, but would have caused an interesting bug (that would | |
242 | have been blamed on some random commit). | |
2b8bd44a JK |
243 | |
244 | ||
245 | The adventure continues... | |
246 | -------------------------- | |
247 | ||
248 | I ended up doing this again! Same entity, new hardware. The assumption | |
249 | at this point is that the old disk corrupted the packfile, and then the | |
250 | corruption was migrated to the new hardware (because it was done by | |
251 | rsync or similar, and no fsck was done at the time of migration). | |
252 | ||
253 | This time, the affected blob was over 20 megabytes, which was far too | |
254 | large to do a brute-force on. I followed the instructions above to | |
255 | create the `zlib` file. I then used the `inflate` program below to pull | |
256 | the corrupted data from that. Examining that output gave me a hint about | |
257 | where in the file the corruption was. But now I was working with the | |
258 | file itself, not the zlib contents. So knowing the sha1 of the object | |
259 | and the approximate area of the corruption, I used the `sha1-munge` | |
260 | program below to brute-force the correct byte. | |
261 | ||
262 | Here's the inflate program (it's essentially `gunzip` but without the | |
263 | `.gz` header processing): | |
264 | ||
265 | -------------------------- | |
266 | #include <stdio.h> | |
267 | #include <string.h> | |
268 | #include <zlib.h> | |
269 | #include <stdlib.h> | |
270 | ||
271 | int main(int argc, char **argv) | |
272 | { | |
273 | /* | |
274 | * oversized so we can read the whole buffer in; | |
275 | * this could actually be switched to streaming | |
276 | * to avoid any memory limitations | |
277 | */ | |
278 | static unsigned char buf[25 * 1024 * 1024]; | |
279 | static unsigned char out[25 * 1024 * 1024]; | |
280 | int len; | |
281 | z_stream z; | |
282 | int ret; | |
283 | ||
284 | len = read(0, buf, sizeof(buf)); | |
285 | memset(&z, 0, sizeof(z)); | |
286 | inflateInit(&z); | |
287 | ||
288 | z.next_in = buf; | |
289 | z.avail_in = len; | |
290 | z.next_out = out; | |
291 | z.avail_out = sizeof(out); | |
292 | ||
293 | ret = inflate(&z, 0); | |
294 | if (ret != Z_OK && ret != Z_STREAM_END) | |
295 | fprintf(stderr, "initial inflate failed (%d)\n", ret); | |
296 | ||
297 | fprintf(stderr, "outputting %lu bytes", z.total_out); | |
298 | fwrite(out, 1, z.total_out, stdout); | |
299 | return 0; | |
300 | } | |
301 | -------------------------- | |
302 | ||
303 | And here is the `sha1-munge` program: | |
304 | ||
305 | -------------------------- | |
306 | #include <stdio.h> | |
307 | #include <unistd.h> | |
308 | #include <string.h> | |
309 | #include <signal.h> | |
310 | #include <openssl/sha.h> | |
311 | #include <stdlib.h> | |
312 | ||
313 | /* eye candy */ | |
314 | static int counter = 0; | |
315 | static void progress(int sig) | |
316 | { | |
317 | fprintf(stderr, "\r%d", counter); | |
318 | alarm(1); | |
319 | } | |
320 | ||
321 | static const signed char hexval_table[256] = { | |
322 | -1, -1, -1, -1, -1, -1, -1, -1, /* 00-07 */ | |
323 | -1, -1, -1, -1, -1, -1, -1, -1, /* 08-0f */ | |
324 | -1, -1, -1, -1, -1, -1, -1, -1, /* 10-17 */ | |
325 | -1, -1, -1, -1, -1, -1, -1, -1, /* 18-1f */ | |
326 | -1, -1, -1, -1, -1, -1, -1, -1, /* 20-27 */ | |
327 | -1, -1, -1, -1, -1, -1, -1, -1, /* 28-2f */ | |
328 | 0, 1, 2, 3, 4, 5, 6, 7, /* 30-37 */ | |
329 | 8, 9, -1, -1, -1, -1, -1, -1, /* 38-3f */ | |
330 | -1, 10, 11, 12, 13, 14, 15, -1, /* 40-47 */ | |
331 | -1, -1, -1, -1, -1, -1, -1, -1, /* 48-4f */ | |
332 | -1, -1, -1, -1, -1, -1, -1, -1, /* 50-57 */ | |
333 | -1, -1, -1, -1, -1, -1, -1, -1, /* 58-5f */ | |
334 | -1, 10, 11, 12, 13, 14, 15, -1, /* 60-67 */ | |
335 | -1, -1, -1, -1, -1, -1, -1, -1, /* 68-67 */ | |
336 | -1, -1, -1, -1, -1, -1, -1, -1, /* 70-77 */ | |
337 | -1, -1, -1, -1, -1, -1, -1, -1, /* 78-7f */ | |
338 | -1, -1, -1, -1, -1, -1, -1, -1, /* 80-87 */ | |
339 | -1, -1, -1, -1, -1, -1, -1, -1, /* 88-8f */ | |
340 | -1, -1, -1, -1, -1, -1, -1, -1, /* 90-97 */ | |
341 | -1, -1, -1, -1, -1, -1, -1, -1, /* 98-9f */ | |
342 | -1, -1, -1, -1, -1, -1, -1, -1, /* a0-a7 */ | |
343 | -1, -1, -1, -1, -1, -1, -1, -1, /* a8-af */ | |
344 | -1, -1, -1, -1, -1, -1, -1, -1, /* b0-b7 */ | |
345 | -1, -1, -1, -1, -1, -1, -1, -1, /* b8-bf */ | |
346 | -1, -1, -1, -1, -1, -1, -1, -1, /* c0-c7 */ | |
347 | -1, -1, -1, -1, -1, -1, -1, -1, /* c8-cf */ | |
348 | -1, -1, -1, -1, -1, -1, -1, -1, /* d0-d7 */ | |
349 | -1, -1, -1, -1, -1, -1, -1, -1, /* d8-df */ | |
350 | -1, -1, -1, -1, -1, -1, -1, -1, /* e0-e7 */ | |
351 | -1, -1, -1, -1, -1, -1, -1, -1, /* e8-ef */ | |
352 | -1, -1, -1, -1, -1, -1, -1, -1, /* f0-f7 */ | |
353 | -1, -1, -1, -1, -1, -1, -1, -1, /* f8-ff */ | |
354 | }; | |
355 | ||
356 | static inline unsigned int hexval(unsigned char c) | |
357 | { | |
358 | return hexval_table[c]; | |
359 | } | |
360 | ||
361 | static int get_sha1_hex(const char *hex, unsigned char *sha1) | |
362 | { | |
363 | int i; | |
364 | for (i = 0; i < 20; i++) { | |
365 | unsigned int val; | |
366 | /* | |
367 | * hex[1]=='\0' is caught when val is checked below, | |
368 | * but if hex[0] is NUL we have to avoid reading | |
369 | * past the end of the string: | |
370 | */ | |
371 | if (!hex[0]) | |
372 | return -1; | |
373 | val = (hexval(hex[0]) << 4) | hexval(hex[1]); | |
374 | if (val & ~0xff) | |
375 | return -1; | |
376 | *sha1++ = val; | |
377 | hex += 2; | |
378 | } | |
379 | return 0; | |
380 | } | |
381 | ||
382 | int main(int argc, char **argv) | |
383 | { | |
384 | /* oversized so we can read the whole buffer in */ | |
385 | static unsigned char buf[25 * 1024 * 1024]; | |
386 | char header[32]; | |
387 | int header_len; | |
388 | unsigned char have[20], want[20]; | |
389 | int start, len; | |
390 | SHA_CTX orig; | |
391 | unsigned i, j; | |
392 | ||
393 | if (!argv[1] || get_sha1_hex(argv[1], want)) { | |
394 | fprintf(stderr, "usage: sha1-munge <sha1> [start] <file.in\n"); | |
395 | return 1; | |
396 | } | |
397 | ||
398 | if (argv[2]) | |
399 | start = atoi(argv[2]); | |
400 | else | |
401 | start = 0; | |
402 | ||
403 | len = read(0, buf, sizeof(buf)); | |
404 | header_len = sprintf(header, "blob %d", len) + 1; | |
405 | fprintf(stderr, "using header: %s\n", header); | |
406 | ||
407 | /* | |
408 | * We keep a running sha1 so that if you are munging | |
409 | * near the end of the file, we do not have to re-sha1 | |
410 | * the unchanged earlier bytes | |
411 | */ | |
412 | SHA1_Init(&orig); | |
413 | SHA1_Update(&orig, header, header_len); | |
414 | if (start) | |
415 | SHA1_Update(&orig, buf, start); | |
416 | ||
417 | signal(SIGALRM, progress); | |
418 | alarm(1); | |
419 | ||
420 | for (i = start; i < len; i++) { | |
421 | unsigned char c; | |
422 | SHA_CTX x; | |
423 | ||
424 | #if 0 | |
425 | /* | |
426 | * deletion -- this would not actually work in practice, | |
427 | * I think, because we've already committed to a | |
428 | * particular size in the header. Ditto for addition | |
429 | * below. In those cases, you'd have to do the whole | |
430 | * sha1 from scratch, or possibly keep three running | |
431 | * "orig" sha1 computations going. | |
432 | */ | |
433 | memcpy(&x, &orig, sizeof(x)); | |
434 | SHA1_Update(&x, buf + i + 1, len - i - 1); | |
435 | SHA1_Final(have, &x); | |
436 | if (!memcmp(have, want, 20)) | |
437 | printf("i=%d, deletion\n", i); | |
438 | #endif | |
439 | ||
440 | /* | |
441 | * replacement -- note that this tries each of the 256 | |
442 | * possible bytes. If you suspect a single-bit flip, | |
443 | * it would be much shorter to just try the 8 | |
444 | * bit-flipped variants. | |
445 | */ | |
446 | c = buf[i]; | |
447 | for (j = 0; j <= 0xff; j++) { | |
448 | buf[i] = j; | |
449 | ||
450 | memcpy(&x, &orig, sizeof(x)); | |
451 | SHA1_Update(&x, buf + i, len - i); | |
452 | SHA1_Final(have, &x); | |
453 | if (!memcmp(have, want, 20)) | |
454 | printf("i=%d, j=%02x\n", i, j); | |
455 | } | |
456 | buf[i] = c; | |
457 | ||
458 | #if 0 | |
459 | /* addition */ | |
460 | for (j = 0; j <= 0xff; j++) { | |
461 | unsigned char extra = j; | |
462 | memcpy(&x, &orig, sizeof(x)); | |
463 | SHA1_Update(&x, &extra, 1); | |
464 | SHA1_Update(&x, buf + i, len - i); | |
465 | SHA1_Final(have, &x); | |
466 | if (!memcmp(have, want, 20)) | |
467 | printf("i=%d, addition=%02x", i, j); | |
468 | } | |
469 | #endif | |
470 | ||
471 | SHA1_Update(&orig, buf + i, 1); | |
472 | counter++; | |
473 | } | |
474 | ||
475 | alarm(0); | |
476 | fprintf(stderr, "\r%d\n", counter); | |
477 | return 0; | |
478 | } | |
479 | -------------------------- |