]>
Commit | Line | Data |
---|---|---|
48a8c26c | 1 | Git pack format |
9760662f JH |
2 | =============== |
3 | ||
17420eaf | 4 | == Checksums and object IDs |
5 | ||
6 | In a repository using the traditional SHA-1, pack checksums, index checksums, | |
7 | and object IDs (object names) mentioned below are all computed using SHA-1. | |
8 | Similarly, in SHA-256 repositories, these values are computed using SHA-256. | |
9 | ||
5316c8e9 | 10 | == pack-*.pack files have the following format: |
9760662f | 11 | |
71362bd5 | 12 | - A header appears at the beginning and consists of the following: |
9760662f | 13 | |
1361fa3e SP |
14 | 4-byte signature: |
15 | The signature is: {'P', 'A', 'C', 'K'} | |
16 | ||
17 | 4-byte version number (network byte order): | |
48a8c26c | 18 | Git currently accepts version number 2 or 3 but |
1361fa3e SP |
19 | generates version 2 only. |
20 | ||
9760662f JH |
21 | 4-byte number of objects contained in the pack (network byte order) |
22 | ||
23 | Observation: we cannot have more than 4G versions ;-) and | |
24 | more than 4G objects in a pack. | |
25 | ||
26 | - The header is followed by number of object entries, each of | |
27 | which looks like this: | |
28 | ||
29 | (undeltified representation) | |
979ea585 | 30 | n-byte type and length (3-bit type, (n-1)*7+4-bit length) |
9760662f JH |
31 | compressed data |
32 | ||
33 | (deltified representation) | |
979ea585 | 34 | n-byte type and length (3-bit type, (n-1)*7+4-bit length) |
17420eaf | 35 | base object name if OBJ_REF_DELTA or a negative relative |
06cb843f SS |
36 | offset from the delta object's position in the pack if this |
37 | is an OBJ_OFS_DELTA object | |
9760662f JH |
38 | compressed delta data |
39 | ||
40 | Observation: length of each object is encoded in a variable | |
41 | length format and is not constrained to 32-bit or anything. | |
42 | ||
17420eaf | 43 | - The trailer records a pack checksum of all of the above. |
9760662f | 44 | |
011b6486 NTND |
45 | === Object types |
46 | ||
47 | Valid object types are: | |
48 | ||
49 | - OBJ_COMMIT (1) | |
50 | - OBJ_TREE (2) | |
51 | - OBJ_BLOB (3) | |
52 | - OBJ_TAG (4) | |
53 | - OBJ_OFS_DELTA (6) | |
54 | - OBJ_REF_DELTA (7) | |
55 | ||
56 | Type 5 is reserved for future expansion. Type 0 is invalid. | |
57 | ||
7b77f5a1 MÅ |
58 | === Size encoding |
59 | ||
60 | This document uses the following "size encoding" of non-negative | |
61 | integers: From each byte, the seven least significant bits are | |
62 | used to form the resulting integer. As long as the most significant | |
63 | bit is 1, this process continues; the byte with MSB 0 provides the | |
64 | last seven bits. The seven-bit chunks are concatenated. Later | |
65 | values are more significant. | |
66 | ||
67 | This size encoding should not be confused with the "offset encoding", | |
68 | which is also used in this document. | |
69 | ||
011b6486 NTND |
70 | === Deltified representation |
71 | ||
72 | Conceptually there are only four object types: commit, tree, tag and | |
73 | blob. However to save space, an object could be stored as a "delta" of | |
74 | another "base" object. These representations are assigned new types | |
75 | ofs-delta and ref-delta, which is only valid in a pack file. | |
76 | ||
77 | Both ofs-delta and ref-delta store the "delta" to be applied to | |
78 | another object (called 'base object') to reconstruct the object. The | |
17420eaf | 79 | difference between them is, ref-delta directly encodes base object |
80 | name. If the base object is in the same pack, ofs-delta encodes | |
011b6486 NTND |
81 | the offset of the base object in the pack instead. |
82 | ||
83 | The base object could also be deltified if it's in the same pack. | |
84 | Ref-delta can also refer to an object outside the pack (i.e. the | |
85 | so-called "thin pack"). When stored on disk however, the pack should | |
86 | be self contained to avoid cyclic dependency. | |
87 | ||
7b77f5a1 MÅ |
88 | The delta data starts with the size of the base object and the |
89 | size of the object to be reconstructed. These sizes are | |
90 | encoded using the size encoding from above. The remainder of | |
91 | the delta data is a sequence of instructions to reconstruct the object | |
011b6486 NTND |
92 | from the base object. If the base object is deltified, it must be |
93 | converted to canonical form first. Each instruction appends more and | |
94 | more data to the target object until it's complete. There are two | |
95 | supported instructions so far: one for copy a byte range from the | |
96 | source object and one for inserting new data embedded in the | |
97 | instruction itself. | |
98 | ||
99 | Each instruction has variable length. Instruction type is determined | |
100 | by the seventh bit of the first octet. The following diagrams follow | |
101 | the convention in RFC 1951 (Deflate compressed data format). | |
102 | ||
103 | ==== Instruction to copy from base object | |
104 | ||
105 | +----------+---------+---------+---------+---------+-------+-------+-------+ | |
106 | | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 | | |
107 | +----------+---------+---------+---------+---------+-------+-------+-------+ | |
108 | ||
109 | This is the instruction format to copy a byte range from the source | |
110 | object. It encodes the offset to copy from and the number of bytes to | |
111 | copy. Offset and size are in little-endian order. | |
112 | ||
113 | All offset and size bytes are optional. This is to reduce the | |
114 | instruction size when encoding small offsets or sizes. The first seven | |
115 | bits in the first octet determines which of the next seven octets is | |
116 | present. If bit zero is set, offset1 is present. If bit one is set | |
117 | offset2 is present and so on. | |
118 | ||
119 | Note that a more compact instruction does not change offset and size | |
120 | encoding. For example, if only offset2 is omitted like below, offset3 | |
121 | still contains bits 16-23. It does not become offset2 and contains | |
122 | bits 8-15 even if it's right next to offset1. | |
123 | ||
124 | +----------+---------+---------+ | |
125 | | 10000101 | offset1 | offset3 | | |
126 | +----------+---------+---------+ | |
127 | ||
128 | In its most compact form, this instruction only takes up one byte | |
129 | (0x80) with both offset and size omitted, which will have default | |
130 | values zero. There is another exception: size zero is automatically | |
131 | converted to 0x10000. | |
132 | ||
133 | ==== Instruction to add new data | |
134 | ||
135 | +----------+============+ | |
136 | | 0xxxxxxx | data | | |
137 | +----------+============+ | |
138 | ||
139 | This is the instruction to construct target object without the base | |
140 | object. The following data is appended to the target object. The first | |
141 | seven bits of the first octet determines the size of data in | |
142 | bytes. The size must be non-zero. | |
143 | ||
144 | ==== Reserved instruction | |
145 | ||
146 | +----------+============ | |
147 | | 00000000 | | |
148 | +----------+============ | |
149 | ||
150 | This is the instruction reserved for future expansion. | |
151 | ||
5316c8e9 | 152 | == Original (version 1) pack-*.idx files have the following format: |
9760662f JH |
153 | |
154 | - The header consists of 256 4-byte network byte order | |
155 | integers. N-th entry of this table records the number of | |
156 | objects in the corresponding pack, the first byte of whose | |
71362bd5 | 157 | object name is less than or equal to N. This is called the |
9760662f JH |
158 | 'first-level fan-out' table. |
159 | ||
1361fa3e | 160 | - The header is followed by sorted 24-byte entries, one entry |
9760662f JH |
161 | per object in the pack. Each entry is: |
162 | ||
163 | 4-byte network byte order integer, recording where the | |
164 | object is stored in the packfile as the offset from the | |
165 | beginning. | |
166 | ||
17420eaf | 167 | one object name of the appropriate size. |
9760662f | 168 | |
9760662f JH |
169 | - The file is concluded with a trailer: |
170 | ||
17420eaf | 171 | A copy of the pack checksum at the end of the corresponding |
172 | packfile. | |
9760662f | 173 | |
17420eaf | 174 | Index checksum of all of the above. |
9760662f JH |
175 | |
176 | Pack Idx file: | |
177 | ||
71362bd5 | 178 | -- +--------------------------------+ |
179 | fanout | fanout[0] = 2 (for example) |-. | |
180 | table +--------------------------------+ | | |
9760662f JH |
181 | | fanout[1] | | |
182 | +--------------------------------+ | | |
183 | | fanout[2] | | | |
184 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | | |
71362bd5 | 185 | | fanout[255] = total objects |---. |
186 | -- +--------------------------------+ | | | |
187 | main | offset | | | | |
188 | index | object name 00XXXXXXXXXXXXXXXX | | | | |
189 | table +--------------------------------+ | | | |
190 | | offset | | | | |
191 | | object name 00XXXXXXXXXXXXXXXX | | | | |
192 | +--------------------------------+<+ | | |
193 | .-| offset | | | |
194 | | | object name 01XXXXXXXXXXXXXXXX | | | |
195 | | +--------------------------------+ | | |
196 | | | offset | | | |
197 | | | object name 01XXXXXXXXXXXXXXXX | | | |
198 | | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | | |
199 | | | offset | | | |
200 | | | object name FFXXXXXXXXXXXXXXXX | | | |
201 | --| +--------------------------------+<--+ | |
9760662f JH |
202 | trailer | | packfile checksum | |
203 | | +--------------------------------+ | |
204 | | | idxfile checksum | | |
205 | | +--------------------------------+ | |
a6080a0a | 206 | .-------. |
9760662f JH |
207 | | |
208 | Pack file entry: <+ | |
209 | ||
210 | packed object header: | |
979ea585 PE |
211 | 1-byte size extension bit (MSB) |
212 | type (next 3 bit) | |
a6080a0a | 213 | size0 (lower 4-bit) |
9760662f JH |
214 | n-byte sizeN (as long as MSB is set, each 7-bit) |
215 | size0..sizeN form 4+7+7+..+7 bit integer, size0 | |
979ea585 PE |
216 | is the least significant part, and sizeN is the |
217 | most significant part. | |
9760662f JH |
218 | packed object data: |
219 | If it is not DELTA, then deflated bytes (the size above | |
220 | is the size before compression). | |
9de328fe | 221 | If it is REF_DELTA, then |
17420eaf | 222 | base object name (the size above is the |
a6080a0a | 223 | size of the delta data that follows). |
9760662f | 224 | delta data, deflated. |
9de328fe PE |
225 | If it is OFS_DELTA, then |
226 | n-byte offset (see below) interpreted as a negative | |
227 | offset from the type-byte of the header of the | |
228 | ofs-delta entry (the size above is the size of | |
229 | the delta data that follows). | |
230 | delta data, deflated. | |
231 | ||
232 | offset encoding: | |
233 | n bytes with MSB set in all but the last one. | |
234 | The offset is then the number constructed by | |
235 | concatenating the lower 7 bit of each byte, and | |
236 | for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1)) | |
237 | to the result. | |
238 | ||
71362bd5 | 239 | |
240 | ||
5316c8e9 TA |
241 | == Version 2 pack-*.idx files support packs larger than 4 GiB, and |
242 | have some other reorganizations. They have the format: | |
71362bd5 | 243 | |
244 | - A 4-byte magic number '\377tOc' which is an unreasonable | |
245 | fanout[0] value. | |
246 | ||
247 | - A 4-byte version number (= 2) | |
248 | ||
249 | - A 256-entry fan-out table just like v1. | |
250 | ||
17420eaf | 251 | - A table of sorted object names. These are packed together |
252 | without offset values to reduce the cache footprint of the | |
253 | binary search for a specific object name. | |
71362bd5 | 254 | |
255 | - A table of 4-byte CRC32 values of the packed object data. | |
256 | This is new in v2 so compressed data can be copied directly | |
f1cdcc70 | 257 | from pack to pack during repacking without undetected |
71362bd5 | 258 | data corruption. |
259 | ||
260 | - A table of 4-byte offset values (in network byte order). | |
261 | These are usually 31-bit pack file offsets, but large | |
262 | offsets are encoded as an index into the next table with | |
263 | the msbit set. | |
264 | ||
265 | - A table of 8-byte offset entries (empty for pack files less | |
266 | than 2 GiB). Pack files are organized with heavily used | |
267 | objects toward the front, so most object references should | |
268 | not need to refer to this table. | |
269 | ||
270 | - The same trailer as a v1 pack file: | |
271 | ||
17420eaf | 272 | A copy of the pack checksum at the end of |
71362bd5 | 273 | corresponding packfile. |
274 | ||
17420eaf | 275 | Index checksum of all of the above. |
e0d1bcf8 | 276 | |
2f4ba2a8 TB |
277 | == pack-*.rev files have the format: |
278 | ||
279 | - A 4-byte magic number '0x52494458' ('RIDX'). | |
280 | ||
281 | - A 4-byte version identifier (= 1). | |
282 | ||
283 | - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256). | |
284 | ||
285 | - A table of index positions (one per packed object, num_objects in | |
286 | total, each a 4-byte unsigned integer in network order), sorted by | |
287 | their corresponding offsets in the packfile. | |
288 | ||
289 | - A trailer, containing a: | |
290 | ||
291 | checksum of the corresponding packfile, and | |
292 | ||
293 | a checksum of all of the above. | |
294 | ||
295 | All 4-byte numbers are in network order. | |
296 | ||
e0d1bcf8 DS |
297 | == multi-pack-index (MIDX) files have the following format: |
298 | ||
299 | The multi-pack-index files refer to multiple pack-files and loose objects. | |
300 | ||
301 | In order to allow extensions that add extra data to the MIDX, we organize | |
302 | the body into "chunks" and provide a lookup table at the beginning of the | |
303 | body. The header includes certain length values, such as the number of packs, | |
304 | the number of base MIDX files, hash lengths and types. | |
305 | ||
306 | All 4-byte numbers are in network order. | |
307 | ||
308 | HEADER: | |
309 | ||
310 | 4-byte signature: | |
311 | The signature is: {'M', 'I', 'D', 'X'} | |
312 | ||
313 | 1-byte version number: | |
314 | Git only writes or recognizes version 1. | |
315 | ||
316 | 1-byte Object Id Version | |
d9607542 DS |
317 | We infer the length of object IDs (OIDs) from this value: |
318 | 1 => SHA-1 | |
319 | 2 => SHA-256 | |
320 | If the hash type does not match the repository's hash algorithm, | |
321 | the multi-pack-index file should be ignored with a warning | |
322 | presented to the user. | |
e0d1bcf8 DS |
323 | |
324 | 1-byte number of "chunks" | |
325 | ||
326 | 1-byte number of base multi-pack-index files: | |
327 | This value is currently always zero. | |
328 | ||
329 | 4-byte number of pack files | |
330 | ||
331 | CHUNK LOOKUP: | |
332 | ||
333 | (C + 1) * 12 bytes providing the chunk offsets: | |
334 | First 4 bytes describe chunk id. Value 0 is a terminating label. | |
335 | Other 8 bytes provide offset in current file for chunk to start. | |
336 | (Chunks are provided in file-order, so you can infer the length | |
337 | using the next chunk position if necessary.) | |
338 | ||
a43a2e6c DS |
339 | The CHUNK LOOKUP matches the table of contents from |
340 | link:technical/chunk-format.html[the chunk-based file format]. | |
341 | ||
e0d1bcf8 DS |
342 | The remaining data in the body is described one chunk at a time, and |
343 | these chunks may be given in any order. Chunks are required unless | |
344 | otherwise specified. | |
345 | ||
346 | CHUNK DATA: | |
347 | ||
32f3c541 DS |
348 | Packfile Names (ID: {'P', 'N', 'A', 'M'}) |
349 | Stores the packfile names as concatenated, null-terminated strings. | |
350 | Packfiles must be listed in lexicographic order for fast lookups by | |
351 | name. This is the only chunk not guaranteed to be a multiple of four | |
352 | bytes in length, so should be the last chunk for alignment reasons. | |
353 | ||
d7cacf29 DS |
354 | OID Fanout (ID: {'O', 'I', 'D', 'F'}) |
355 | The ith entry, F[i], stores the number of OIDs with first | |
356 | byte at most i. Thus F[255] stores the total | |
357 | number of objects. | |
358 | ||
0d5b3a5e DS |
359 | OID Lookup (ID: {'O', 'I', 'D', 'L'}) |
360 | The OIDs for all objects in the MIDX are stored in lexicographic | |
361 | order in this chunk. | |
362 | ||
662148c4 DS |
363 | Object Offsets (ID: {'O', 'O', 'F', 'F'}) |
364 | Stores two 4-byte values for every object. | |
365 | 1: The pack-int-id for the pack storing this object. | |
366 | 2: The offset within the pack. | |
eb31044f | 367 | If all offsets are less than 2^32, then the large offset chunk |
662148c4 DS |
368 | will not exist and offsets are stored as in IDX v1. |
369 | If there is at least one offset value larger than 2^32-1, then | |
eb31044f JB |
370 | the large offset chunk must exist, and offsets larger than |
371 | 2^31-1 must be stored in it instead. If the large offset chunk | |
662148c4 DS |
372 | exists and the 31st bit is on, then removing that bit reveals |
373 | the row in the large offsets containing the 8-byte offset of | |
374 | this object. | |
375 | ||
376 | [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'}) | |
377 | 8-byte offsets into large packfiles. | |
e0d1bcf8 | 378 | |
95e8383b TB |
379 | [Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'}) |
380 | A list of MIDX positions (one per object in the MIDX, num_objects in | |
381 | total, each a 4-byte unsigned integer in network byte order), sorted | |
382 | according to their relative bitmap/pseudo-pack positions. | |
383 | ||
e0d1bcf8 DS |
384 | TRAILER: |
385 | ||
17420eaf | 386 | Index checksum of the above contents. |
b25fd24c TB |
387 | |
388 | == multi-pack-index reverse indexes | |
389 | ||
390 | Similar to the pack-based reverse index, the multi-pack index can also | |
391 | be used to generate a reverse index. | |
392 | ||
393 | Instead of mapping between offset, pack-, and index position, this | |
394 | reverse index maps between an object's position within the MIDX, and | |
395 | that object's position within a pseudo-pack that the MIDX describes | |
396 | (i.e., the ith entry of the multi-pack reverse index holds the MIDX | |
397 | position of ith object in pseudo-pack order). | |
398 | ||
399 | To clarify the difference between these orderings, consider a multi-pack | |
400 | reachability bitmap (which does not yet exist, but is what we are | |
401 | building towards here). Each bit needs to correspond to an object in the | |
402 | MIDX, and so we need an efficient mapping from bit position to MIDX | |
403 | position. | |
404 | ||
405 | One solution is to let bits occupy the same position in the oid-sorted | |
406 | index stored by the MIDX. But because oids are effectively random, their | |
407 | resulting reachability bitmaps would have no locality, and thus compress | |
408 | poorly. (This is the reason that single-pack bitmaps use the pack | |
409 | ordering, and not the .idx ordering, for the same purpose.) | |
410 | ||
411 | So we'd like to define an ordering for the whole MIDX based around | |
412 | pack ordering, which has far better locality (and thus compresses more | |
413 | efficiently). We can think of a pseudo-pack created by the concatenation | |
414 | of all of the packs in the MIDX. E.g., if we had a MIDX with three packs | |
415 | (a, b, c), with 10, 15, and 20 objects respectively, we can imagine an | |
416 | ordering of the objects like: | |
417 | ||
418 | |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19| | |
419 | ||
420 | where the ordering of the packs is defined by the MIDX's pack list, | |
421 | and then the ordering of objects within each pack is the same as the | |
422 | order in the actual packfile. | |
423 | ||
424 | Given the list of packs and their counts of objects, you can | |
425 | naïvely reconstruct that pseudo-pack ordering (e.g., the object at | |
426 | position 27 must be (c,1) because packs "a" and "b" consumed 25 of the | |
427 | slots). But there's a catch. Objects may be duplicated between packs, in | |
428 | which case the MIDX only stores one pointer to the object (and thus we'd | |
429 | want only one slot in the bitmap). | |
430 | ||
431 | Callers could handle duplicates themselves by reading objects in order | |
432 | of their bit-position, but that's linear in the number of objects, and | |
433 | much too expensive for ordinary bitmap lookups. Building a reverse index | |
434 | solves this, since it is the logical inverse of the index, and that | |
435 | index has already removed duplicates. But, building a reverse index on | |
436 | the fly can be expensive. Since we already have an on-disk format for | |
437 | pack-based reverse indexes, let's reuse it for the MIDX's pseudo-pack, | |
438 | too. | |
439 | ||
440 | Objects from the MIDX are ordered as follows to string together the | |
441 | pseudo-pack. Let `pack(o)` return the pack from which `o` was selected | |
442 | by the MIDX, and define an ordering of packs based on their numeric ID | |
443 | (as stored by the MIDX). Let `offset(o)` return the object offset of `o` | |
444 | within `pack(o)`. Then, compare `o1` and `o2` as follows: | |
445 | ||
446 | - If one of `pack(o1)` and `pack(o2)` is preferred and the other | |
447 | is not, then the preferred one sorts first. | |
448 | + | |
449 | (This is a detail that allows the MIDX bitmap to determine which | |
450 | pack should be used by the pack-reuse mechanism, since it can ask | |
451 | the MIDX for the pack containing the object at bit position 0). | |
452 | ||
453 | - If `pack(o1) ≠ pack(o2)`, then sort the two objects in descending | |
454 | order based on the pack ID. | |
455 | ||
456 | - Otherwise, `pack(o1) = pack(o2)`, and the objects are sorted in | |
457 | pack-order (i.e., `o1` sorts ahead of `o2` exactly when `offset(o1) | |
458 | < offset(o2)`). | |
459 | ||
460 | In short, a MIDX's pseudo-pack is the de-duplicated concatenation of | |
461 | objects in packs stored by the MIDX, laid out in pack order, and the | |
462 | packs arranged in MIDX order (with the preferred pack coming first). | |
463 | ||
95e8383b TB |
464 | The MIDX's reverse index is stored in the optional 'RIDX' chunk within |
465 | the MIDX itself. |