]>
Commit | Line | Data |
---|---|---|
1 | reftable | |
2 | -------- | |
3 | ||
4 | Overview | |
5 | ~~~~~~~~ | |
6 | ||
7 | Problem statement | |
8 | ^^^^^^^^^^^^^^^^^ | |
9 | ||
10 | Some repositories contain a lot of references (e.g. android at 866k, | |
11 | rails at 31k). The existing packed-refs format takes up a lot of space | |
12 | (e.g. 62M), and does not scale with additional references. Lookup of a | |
13 | single reference requires linearly scanning the file. | |
14 | ||
15 | Atomic pushes modifying multiple references require copying the entire | |
16 | packed-refs file, which can be a considerable amount of data moved | |
17 | (e.g. 62M in, 62M out) for even small transactions (2 refs modified). | |
18 | ||
19 | Repositories with many loose references occupy a large number of disk | |
20 | blocks from the local file system, as each reference is its own file | |
21 | storing 41 bytes (and another file for the corresponding reflog). This | |
22 | negatively affects the number of inodes available when a large number of | |
23 | repositories are stored on the same filesystem. Readers can be penalized | |
24 | due to the larger number of syscalls required to traverse and read the | |
25 | `$GIT_DIR/refs` directory. | |
26 | ||
27 | ||
28 | Objectives | |
29 | ^^^^^^^^^^ | |
30 | ||
31 | * Near constant time lookup for any single reference, even when the | |
32 | repository is cold and not in process or kernel cache. | |
33 | * Near constant time verification if an object name is referred to by at least | |
34 | one reference (for allow-tip-sha1-in-want). | |
35 | * Efficient enumeration of an entire namespace, such as `refs/tags/`. | |
36 | * Support atomic push with `O(size_of_update)` operations. | |
37 | * Combine reflog storage with ref storage for small transactions. | |
38 | * Separate reflog storage for base refs and historical logs. | |
39 | ||
40 | Description | |
41 | ^^^^^^^^^^^ | |
42 | ||
43 | A reftable file is a portable binary file format customized for | |
44 | reference storage. References are sorted, enabling linear scans, binary | |
45 | search lookup, and range scans. | |
46 | ||
47 | Storage in the file is organized into variable sized blocks. Prefix | |
48 | compression is used within a single block to reduce disk space. Block | |
49 | size and alignment is tunable by the writer. | |
50 | ||
51 | Performance | |
52 | ^^^^^^^^^^^ | |
53 | ||
54 | Space used, packed-refs vs. reftable: | |
55 | ||
56 | [cols=",>,>,>,>,>",options="header",] | |
57 | |=============================================================== | |
58 | |repository |packed-refs |reftable |% original |avg ref |avg obj | |
59 | |android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes | |
60 | |rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes | |
61 | |git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes | |
62 | |git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes | |
63 | |=============================================================== | |
64 | ||
65 | Scan (read 866k refs), by reference name lookup (single ref from 866k | |
66 | refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs): | |
67 | ||
68 | [cols=",>,>,>,>",options="header",] | |
69 | |========================================================= | |
70 | |format |cache |scan |by name |by SHA-1 | |
71 | |packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec | |
72 | |packed-refs |hot | |6,844.6 usec |20,110.1 usec | |
73 | |reftable |cold |112 ms |33.9 usec |323.2 usec | |
74 | |reftable |hot | |20.2 usec |320.8 usec | |
75 | |========================================================= | |
76 | ||
77 | Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable: | |
78 | ||
79 | [cols=",>,>",options="header",] | |
80 | |================================ | |
81 | |format |size |avg entry | |
82 | |$GIT_DIR/logs |173 M |1209 bytes | |
83 | |reftable |5 M |37 bytes | |
84 | |================================ | |
85 | ||
86 | Details | |
87 | ~~~~~~~ | |
88 | ||
89 | Peeling | |
90 | ^^^^^^^ | |
91 | ||
92 | References stored in a reftable are peeled, a record for an annotated | |
93 | (or signed) tag records both the tag object, and the object it refers | |
94 | to. This is analogous to storage in the packed-refs format. | |
95 | ||
96 | Reference name encoding | |
97 | ^^^^^^^^^^^^^^^^^^^^^^^ | |
98 | ||
99 | Reference names are an uninterpreted sequence of bytes that must pass | |
100 | linkgit:git-check-ref-format[1] as a valid reference name. | |
101 | ||
102 | Key unicity | |
103 | ^^^^^^^^^^^ | |
104 | ||
105 | Each entry must have a unique key; repeated keys are disallowed. | |
106 | ||
107 | Network byte order | |
108 | ^^^^^^^^^^^^^^^^^^ | |
109 | ||
110 | All multi-byte, fixed width fields are in network byte order. | |
111 | ||
112 | Varint encoding | |
113 | ^^^^^^^^^^^^^^^ | |
114 | ||
115 | Varint encoding is identical to the ofs-delta encoding method used | |
116 | within pack files. | |
117 | ||
118 | Decoder works such as: | |
119 | ||
120 | .... | |
121 | val = buf[ptr] & 0x7f | |
122 | while (buf[ptr] & 0x80) { | |
123 | ptr++ | |
124 | val = ((val + 1) << 7) | (buf[ptr] & 0x7f) | |
125 | } | |
126 | .... | |
127 | ||
128 | Ordering | |
129 | ^^^^^^^^ | |
130 | ||
131 | Blocks are lexicographically ordered by their first reference. | |
132 | ||
133 | Directory/file conflicts | |
134 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
135 | ||
136 | The reftable format accepts both `refs/heads/foo` and | |
137 | `refs/heads/foo/bar` as distinct references. | |
138 | ||
139 | This property is useful for retaining log records in reftable, but may | |
140 | confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain | |
141 | references. Users of reftable may choose to continue to reject `foo` and | |
142 | `foo/bar` type conflicts to prevent problems for peers. | |
143 | ||
144 | File format | |
145 | ~~~~~~~~~~~ | |
146 | ||
147 | Structure | |
148 | ^^^^^^^^^ | |
149 | ||
150 | A reftable file has the following high-level structure: | |
151 | ||
152 | .... | |
153 | first_block { | |
154 | header | |
155 | first_ref_block | |
156 | } | |
157 | ref_block* | |
158 | ref_index* | |
159 | obj_block* | |
160 | obj_index* | |
161 | log_block* | |
162 | log_index* | |
163 | footer | |
164 | .... | |
165 | ||
166 | A log-only file omits the `ref_block`, `ref_index`, `obj_block` and | |
167 | `obj_index` sections, containing only the file header and log block: | |
168 | ||
169 | .... | |
170 | first_block { | |
171 | header | |
172 | } | |
173 | log_block* | |
174 | log_index* | |
175 | footer | |
176 | .... | |
177 | ||
178 | in a log-only file the first log block immediately follows the file | |
179 | header, without padding to block alignment. | |
180 | ||
181 | Block size | |
182 | ^^^^^^^^^^ | |
183 | ||
184 | The file's block size is arbitrarily determined by the writer, and does | |
185 | not have to be a power of 2. The block size must be larger than the | |
186 | longest reference name or log entry used in the repository, as | |
187 | references cannot span blocks. | |
188 | ||
189 | Powers of two that are friendly to the virtual memory system or | |
190 | filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can | |
191 | yield better compression, with a possible increased cost incurred by | |
192 | readers during access. | |
193 | ||
194 | The largest block size is `16777215` bytes (15.99 MiB). | |
195 | ||
196 | Block alignment | |
197 | ^^^^^^^^^^^^^^^ | |
198 | ||
199 | Writers may choose to align blocks at multiples of the block size by | |
200 | including `padding` filled with NUL bytes at the end of a block to round | |
201 | out to the chosen alignment. When alignment is used, writers must | |
202 | specify the alignment with the file header's `block_size` field. | |
203 | ||
204 | Block alignment is not required by the file format. Unaligned files must | |
205 | set `block_size = 0` in the file header, and omit `padding`. Unaligned | |
206 | files with more than one ref block must include the link:#Ref-index[ref | |
207 | index] to support fast lookup. Readers must be able to read both aligned | |
208 | and non-aligned files. | |
209 | ||
210 | Very small files (e.g. a single ref block) may omit `padding` and the ref | |
211 | index to reduce total file size. | |
212 | ||
213 | Header (version 1) | |
214 | ^^^^^^^^^^^^^^^^^^ | |
215 | ||
216 | A 24-byte header appears at the beginning of the file: | |
217 | ||
218 | .... | |
219 | 'REFT' | |
220 | uint8( version_number = 1 ) | |
221 | uint24( block_size ) | |
222 | uint64( min_update_index ) | |
223 | uint64( max_update_index ) | |
224 | .... | |
225 | ||
226 | Aligned files must specify `block_size` to configure readers with the | |
227 | expected block alignment. Unaligned files must set `block_size = 0`. | |
228 | ||
229 | The `min_update_index` and `max_update_index` describe bounds for the | |
230 | `update_index` field of all log records in this file. When reftables are | |
231 | used in a stack for link:#Update-transactions[transactions], these | |
232 | fields can order the files such that the prior file's | |
233 | `max_update_index + 1` is the next file's `min_update_index`. | |
234 | ||
235 | Header (version 2) | |
236 | ^^^^^^^^^^^^^^^^^^ | |
237 | ||
238 | A 28-byte header appears at the beginning of the file: | |
239 | ||
240 | .... | |
241 | 'REFT' | |
242 | uint8( version_number = 2 ) | |
243 | uint24( block_size ) | |
244 | uint64( min_update_index ) | |
245 | uint64( max_update_index ) | |
246 | uint32( hash_id ) | |
247 | .... | |
248 | ||
249 | The header is identical to `version_number=1`, with the 4-byte hash ID | |
250 | ("sha1" for SHA1 and "s256" for SHA-256) append to the header. | |
251 | ||
252 | For maximum backward compatibility, it is recommended to use version 1 when | |
253 | writing SHA1 reftables. | |
254 | ||
255 | First ref block | |
256 | ^^^^^^^^^^^^^^^ | |
257 | ||
258 | The first ref block shares the same block as the file header, and is 24 | |
259 | bytes smaller than all other blocks in the file. The first block | |
260 | immediately begins after the file header, at position 24. | |
261 | ||
262 | If the first block is a log block (a log-only file), its block header | |
263 | begins immediately at position 24. | |
264 | ||
265 | Ref block format | |
266 | ^^^^^^^^^^^^^^^^ | |
267 | ||
268 | A ref block is written as: | |
269 | ||
270 | .... | |
271 | 'r' | |
272 | uint24( block_len ) | |
273 | ref_record+ | |
274 | uint24( restart_offset )+ | |
275 | uint16( restart_count ) | |
276 | ||
277 | padding? | |
278 | .... | |
279 | ||
280 | Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which | |
281 | encodes the number of bytes in the block up to, but not including the | |
282 | optional `padding`. This is always less than or equal to the file's | |
283 | block size. In the first ref block, `block_len` includes 24 bytes for | |
284 | the file header. | |
285 | ||
286 | The 2-byte `restart_count` stores the number of entries in the | |
287 | `restart_offset` list, which must not be empty. Readers can use | |
288 | `restart_count` to binary search between restarts before starting a | |
289 | linear scan. | |
290 | ||
291 | Exactly `restart_count` 3-byte `restart_offset` values precedes the | |
292 | `restart_count`. Offsets are relative to the start of the block and | |
293 | refer to the first byte of any `ref_record` whose name has not been | |
294 | prefix compressed. Entries in the `restart_offset` list must be sorted, | |
295 | ascending. Readers can start linear scans from any of these records. | |
296 | ||
297 | A variable number of `ref_record` fill the middle of the block, | |
298 | describing reference names and values. The format is described below. | |
299 | ||
300 | As the first ref block shares the first file block with the file header, | |
301 | all `restart_offset` in the first block are relative to the start of the | |
302 | file (position 0), and include the file header. This forces the first | |
303 | `restart_offset` to be `28`. | |
304 | ||
305 | ref record | |
306 | ++++++++++ | |
307 | ||
308 | A `ref_record` describes a single reference, storing both the name and | |
309 | its value(s). Records are formatted as: | |
310 | ||
311 | .... | |
312 | varint( prefix_length ) | |
313 | varint( (suffix_length << 3) | value_type ) | |
314 | suffix | |
315 | varint( update_index_delta ) | |
316 | value? | |
317 | .... | |
318 | ||
319 | The `prefix_length` field specifies how many leading bytes of the prior | |
320 | reference record's name should be copied to obtain this reference's | |
321 | name. This must be 0 for the first reference in any block, and also must | |
322 | be 0 for any `ref_record` whose offset is listed in the `restart_offset` | |
323 | table at the end of the block. | |
324 | ||
325 | Recovering a reference name from any `ref_record` is a simple concat: | |
326 | ||
327 | .... | |
328 | this_name = prior_name[0..prefix_length] + suffix | |
329 | .... | |
330 | ||
331 | The `suffix_length` value provides the number of bytes available in | |
332 | `suffix` to copy from `suffix` to complete the reference name. | |
333 | ||
334 | The `update_index` that last modified the reference can be obtained by | |
335 | adding `update_index_delta` to the `min_update_index` from the file | |
336 | header: `min_update_index + update_index_delta`. | |
337 | ||
338 | The `value` follows. Its format is determined by `value_type`, one of | |
339 | the following: | |
340 | ||
341 | * `0x0`: deletion; no value data (see transactions, below) | |
342 | * `0x1`: one object name; value of the ref | |
343 | * `0x2`: two object names; value of the ref, peeled target | |
344 | * `0x3`: symbolic reference: `varint( target_len ) target` | |
345 | ||
346 | Symbolic references use `0x3`, followed by the complete name of the | |
347 | reference target. No compression is applied to the target name. | |
348 | ||
349 | Types `0x4..0x7` are reserved for future use. | |
350 | ||
351 | Ref index | |
352 | ^^^^^^^^^ | |
353 | ||
354 | The ref index stores the name of the last reference from every ref block | |
355 | in the file, enabling reduced disk seeks for lookups. Any reference can | |
356 | be found by searching the index, identifying the containing block, and | |
357 | searching within that block. | |
358 | ||
359 | The index may be organized into a multi-level index, where the 1st level | |
360 | index block points to additional ref index blocks (2nd level), which may | |
361 | in turn point to either additional index blocks (e.g. 3rd level) or ref | |
362 | blocks (leaf level). Disk reads required to access a ref go up with | |
363 | higher index levels. Multi-level indexes may be required to ensure no | |
364 | single index block exceeds the file format's max block size of | |
365 | `16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for | |
366 | lookups the index must be a single level, which is permitted to exceed | |
367 | the file's configured block size, but not the format's max block size of | |
368 | 15.99 MiB. | |
369 | ||
370 | If present, the ref index block(s) appears after the last ref block. | |
371 | ||
372 | If there are at least 4 ref blocks, a ref index block should be written | |
373 | to improve lookup times. Cold reads using the index require 2 disk reads | |
374 | (read index, read block), and binary searching < 4 blocks also requires | |
375 | <= 2 reads. Omitting the index block from smaller files saves space. | |
376 | ||
377 | If the file is unaligned and contains more than one ref block, the ref | |
378 | index must be written. | |
379 | ||
380 | Index block format: | |
381 | ||
382 | .... | |
383 | 'i' | |
384 | uint24( block_len ) | |
385 | index_record+ | |
386 | uint24( restart_offset )+ | |
387 | uint16( restart_count ) | |
388 | ||
389 | padding? | |
390 | .... | |
391 | ||
392 | The index blocks begin with `block_type = 'i'` and a 3-byte `block_len` | |
393 | which encodes the number of bytes in the block, up to but not including | |
394 | the optional `padding`. | |
395 | ||
396 | The `restart_offset` and `restart_count` fields are identical in format, | |
397 | meaning and usage as in ref blocks. | |
398 | ||
399 | To reduce the number of reads required for random access in very large | |
400 | files the index block may be larger than other blocks. However, readers | |
401 | must hold the entire index in memory to benefit from this, so it's a | |
402 | time-space tradeoff in both file size and reader memory. | |
403 | ||
404 | Increasing the file's block size decreases the index size. Alternatively | |
405 | a multi-level index may be used, keeping index blocks within the file's | |
406 | block size, but increasing the number of blocks that need to be | |
407 | accessed. | |
408 | ||
409 | index record | |
410 | ++++++++++++ | |
411 | ||
412 | An index record describes the last entry in another block. Index records | |
413 | are written as: | |
414 | ||
415 | .... | |
416 | varint( prefix_length ) | |
417 | varint( (suffix_length << 3) | 0 ) | |
418 | suffix | |
419 | varint( block_position ) | |
420 | .... | |
421 | ||
422 | Index records use prefix compression exactly like `ref_record`. | |
423 | ||
424 | Index records store `block_position` after the suffix, specifying the | |
425 | absolute position in bytes (from the start of the file) of the block | |
426 | that ends with this reference. Readers can seek to `block_position` to | |
427 | begin reading the block header. | |
428 | ||
429 | Readers must examine the block header at `block_position` to determine | |
430 | if the next block is another level index block, or the leaf-level ref | |
431 | block. | |
432 | ||
433 | Reading the index | |
434 | +++++++++++++++++ | |
435 | ||
436 | Readers loading the ref index must first read the footer (below) to | |
437 | obtain `ref_index_position`. If not present, the position will be 0. The | |
438 | `ref_index_position` is for the 1st level root of the ref index. | |
439 | ||
440 | Obj block format | |
441 | ^^^^^^^^^^^^^^^^ | |
442 | ||
443 | Object blocks are optional. Writers may choose to omit object blocks, | |
444 | especially if readers will not use the object name to ref mapping. | |
445 | ||
446 | Object blocks use unique, abbreviated 2-32 object name keys, mapping to | |
447 | ref blocks containing references pointing to that object directly, or as | |
448 | the peeled value of an annotated tag. Like ref blocks, object blocks use | |
449 | the file's standard block size. The abbrevation length is available in | |
450 | the footer as `obj_id_len`. | |
451 | ||
452 | To save space in small files, object blocks may be omitted if the ref | |
453 | index is not present, as brute force search will only need to read a few | |
454 | ref blocks. When missing, readers should brute force a linear search of | |
455 | all references to lookup by object name. | |
456 | ||
457 | An object block is written as: | |
458 | ||
459 | .... | |
460 | 'o' | |
461 | uint24( block_len ) | |
462 | obj_record+ | |
463 | uint24( restart_offset )+ | |
464 | uint16( restart_count ) | |
465 | ||
466 | padding? | |
467 | .... | |
468 | ||
469 | Fields are identical to ref block. Binary search using the restart table | |
470 | works the same as in reference blocks. | |
471 | ||
472 | Because object names are abbreviated by writers to the shortest unique | |
473 | abbreviation within the reftable, obj key lengths have a variable length. Their | |
474 | length must be at least 2 bytes. Readers must compare only for common prefix | |
475 | match within an obj block or obj index. | |
476 | ||
477 | obj record | |
478 | ++++++++++ | |
479 | ||
480 | An `obj_record` describes a single object abbreviation, and the blocks | |
481 | containing references using that unique abbreviation: | |
482 | ||
483 | .... | |
484 | varint( prefix_length ) | |
485 | varint( (suffix_length << 3) | cnt_3 ) | |
486 | suffix | |
487 | varint( cnt_large )? | |
488 | varint( position_delta )* | |
489 | .... | |
490 | ||
491 | Like in reference blocks, abbreviations are prefix compressed within an | |
492 | obj block. On large reftables with many unique objects, higher block | |
493 | sizes (64k), and higher restart interval (128), a `prefix_length` of 2 | |
494 | or 3 and `suffix_length` of 3 may be common in obj records (unique | |
495 | abbreviation of 5-6 raw bytes, 10-12 hex digits). | |
496 | ||
497 | Each record contains `position_count` number of positions for matching | |
498 | ref blocks. For 1-7 positions the count is stored in `cnt_3`. When | |
499 | `cnt_3 = 0` the actual count follows in a varint, `cnt_large`. | |
500 | ||
501 | The use of `cnt_3` bets most objects are pointed to by only a single | |
502 | reference, some may be pointed to by a couple of references, and very | |
503 | few (if any) are pointed to by more than 7 references. | |
504 | ||
505 | A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no | |
506 | `position_delta`, but at least one reference starts with this | |
507 | abbreviation. A reader that needs exact reference names must scan all | |
508 | references to find which specific references have the desired object. | |
509 | Writers should use this format when the `position_delta` list would have | |
510 | overflowed the file's block size due to a high number of references | |
511 | pointing to the same object. | |
512 | ||
513 | The first `position_delta` is the position from the start of the file. | |
514 | Additional `position_delta` entries are sorted ascending and relative to | |
515 | the prior entry, e.g. a reader would perform: | |
516 | ||
517 | .... | |
518 | pos = position_delta[0] | |
519 | prior = pos | |
520 | for (j = 1; j < position_count; j++) { | |
521 | pos = prior + position_delta[j] | |
522 | prior = pos | |
523 | } | |
524 | .... | |
525 | ||
526 | With a position in hand, a reader must linearly scan the ref block, | |
527 | starting from the first `ref_record`, testing each reference's object names | |
528 | (for `value_type = 0x1` or `0x2`) for full equality. Faster searching by | |
529 | object name within a single ref block is not supported by the reftable format. | |
530 | Smaller block sizes reduce the number of candidates this step must | |
531 | consider. | |
532 | ||
533 | Obj index | |
534 | ^^^^^^^^^ | |
535 | ||
536 | The obj index stores the abbreviation from the last entry for every obj | |
537 | block in the file, enabling reduced disk seeks for all lookups. It is | |
538 | formatted exactly the same as the ref index, but refers to obj blocks. | |
539 | ||
540 | The obj index should be present if obj blocks are present, as obj blocks | |
541 | should only be written in larger files. | |
542 | ||
543 | Readers loading the obj index must first read the footer (below) to | |
544 | obtain `obj_index_position`. If not present, the position will be 0. | |
545 | ||
546 | Log block format | |
547 | ^^^^^^^^^^^^^^^^ | |
548 | ||
549 | Unlike ref and obj blocks, log blocks are always unaligned. | |
550 | ||
551 | Log blocks are variable in size, and do not match the `block_size` | |
552 | specified in the file header or footer. Writers should choose an | |
553 | appropriate buffer size to prepare a log block for deflation, such as | |
554 | `2 * block_size`. | |
555 | ||
556 | A log block is written as: | |
557 | ||
558 | .... | |
559 | 'g' | |
560 | uint24( block_len ) | |
561 | zlib_deflate { | |
562 | log_record+ | |
563 | uint24( restart_offset )+ | |
564 | uint16( restart_count ) | |
565 | } | |
566 | .... | |
567 | ||
568 | Log blocks look similar to ref blocks, except `block_type = 'g'`. | |
569 | ||
570 | The 4-byte block header is followed by the deflated block contents using | |
571 | zlib deflate. The `block_len` in the header is the inflated size | |
572 | (including 4-byte block header), and should be used by readers to | |
573 | preallocate the inflation output buffer. A log block's `block_len` may | |
574 | exceed the file's block size. | |
575 | ||
576 | Offsets within the log block (e.g. `restart_offset`) still include the | |
577 | 4-byte header. Readers may prefer prefixing the inflation output buffer | |
578 | with the 4-byte header. | |
579 | ||
580 | Within the deflate container, a variable number of `log_record` describe | |
581 | reference changes. The log record format is described below. See ref | |
582 | block format (above) for a description of `restart_offset` and | |
583 | `restart_count`. | |
584 | ||
585 | Because log blocks have no alignment or padding between blocks, readers | |
586 | must keep track of the bytes consumed by the inflater to know where the | |
587 | next log block begins. | |
588 | ||
589 | log record | |
590 | ++++++++++ | |
591 | ||
592 | Log record keys are structured as: | |
593 | ||
594 | .... | |
595 | ref_name '\0' reverse_int64( update_index ) | |
596 | .... | |
597 | ||
598 | where `update_index` is the unique transaction identifier. The | |
599 | `update_index` field must be unique within the scope of a `ref_name`. | |
600 | See the update transactions section below for further details. | |
601 | ||
602 | The `reverse_int64` function inverses the value so lexicographical | |
603 | ordering the network byte order encoding sorts the more recent records | |
604 | with higher `update_index` values first: | |
605 | ||
606 | .... | |
607 | reverse_int64(int64 t) { | |
608 | return 0xffffffffffffffff - t; | |
609 | } | |
610 | .... | |
611 | ||
612 | Log records have a similar starting structure to ref and index records, | |
613 | utilizing the same prefix compression scheme applied to the log record | |
614 | key described above. | |
615 | ||
616 | .... | |
617 | varint( prefix_length ) | |
618 | varint( (suffix_length << 3) | log_type ) | |
619 | suffix | |
620 | log_data { | |
621 | old_id | |
622 | new_id | |
623 | varint( name_length ) name | |
624 | varint( email_length ) email | |
625 | varint( time_seconds ) | |
626 | sint16( tz_offset ) | |
627 | varint( message_length ) message | |
628 | }? | |
629 | .... | |
630 | ||
631 | Log record entries use `log_type` to indicate what follows: | |
632 | ||
633 | * `0x0`: deletion; no log data. | |
634 | * `0x1`: standard git reflog data using `log_data` above. | |
635 | ||
636 | The `log_type = 0x0` is mostly useful for `git stash drop`, removing an | |
637 | entry from the reflog of `refs/stash` in a transaction file (below), | |
638 | without needing to rewrite larger files. Readers reading a stack of | |
639 | reflogs must treat this as a deletion. | |
640 | ||
641 | For `log_type = 0x1`, the `log_data` section follows | |
642 | linkgit:git-update-ref[1] logging and includes: | |
643 | ||
644 | * two object names (old id, new id) | |
645 | * varint string of committer's name | |
646 | * varint string of committer's email | |
647 | * varint time in seconds since epoch (Jan 1, 1970) | |
648 | * 2-byte timezone offset in minutes (signed) | |
649 | * varint string of message | |
650 | ||
651 | `tz_offset` is the absolute number of minutes from GMT the committer was | |
652 | at the time of the update. For example `GMT-0800` is encoded in reftable | |
653 | as `sint16(-480)` and `GMT+0230` is `sint16(150)`. | |
654 | ||
655 | The committer email does not contain `<` or `>`, it's the value normally | |
656 | found between the `<>` in a git commit object header. | |
657 | ||
658 | The `message_length` may be 0, in which case there was no message | |
659 | supplied for the update. | |
660 | ||
661 | Contrary to traditional reflog (which is a file), renames are encoded as | |
662 | a combination of ref deletion and ref creation. A deletion is a log | |
663 | record with a zero new_id, and a creation is a log record with a zero old_id. | |
664 | ||
665 | Reading the log | |
666 | +++++++++++++++ | |
667 | ||
668 | Readers accessing the log must first read the footer (below) to | |
669 | determine the `log_position`. The first block of the log begins at | |
670 | `log_position` bytes since the start of the file. The `log_position` is | |
671 | not block aligned. | |
672 | ||
673 | Importing logs | |
674 | ++++++++++++++ | |
675 | ||
676 | When importing from `$GIT_DIR/logs` writers should globally order all | |
677 | log records roughly by timestamp while preserving file order, and assign | |
678 | unique, increasing `update_index` values for each log line. Newer log | |
679 | records get higher `update_index` values. | |
680 | ||
681 | Although an import may write only a single reftable file, the reftable | |
682 | file must span many unique `update_index`, as each log line requires its | |
683 | own `update_index` to preserve semantics. | |
684 | ||
685 | Log index | |
686 | ^^^^^^^^^ | |
687 | ||
688 | The log index stores the log key | |
689 | (`refname \0 reverse_int64(update_index)`) for the last log record of | |
690 | every log block in the file, supporting bounded-time lookup. | |
691 | ||
692 | A log index block must be written if 2 or more log blocks are written to | |
693 | the file. If present, the log index appears after the last log block. | |
694 | There is no padding used to align the log index to block alignment. | |
695 | ||
696 | Log index format is identical to ref index, except the keys are 9 bytes | |
697 | longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`. | |
698 | Records use `block_position` to refer to the start of a log block. | |
699 | ||
700 | Reading the index | |
701 | +++++++++++++++++ | |
702 | ||
703 | Readers loading the log index must first read the footer (below) to | |
704 | obtain `log_index_position`. If not present, the position will be 0. | |
705 | ||
706 | Footer | |
707 | ^^^^^^ | |
708 | ||
709 | After the last block of the file, a file footer is written. It begins | |
710 | like the file header, but is extended with additional data. | |
711 | ||
712 | .... | |
713 | HEADER | |
714 | ||
715 | uint64( ref_index_position ) | |
716 | uint64( (obj_position << 5) | obj_id_len ) | |
717 | uint64( obj_index_position ) | |
718 | ||
719 | uint64( log_position ) | |
720 | uint64( log_index_position ) | |
721 | ||
722 | uint32( CRC-32 of above ) | |
723 | .... | |
724 | ||
725 | If a section is missing (e.g. ref index) the corresponding position | |
726 | field (e.g. `ref_index_position`) will be 0. | |
727 | ||
728 | * `obj_position`: byte position for the first obj block. | |
729 | * `obj_id_len`: number of bytes used to abbreviate object names in | |
730 | obj blocks. | |
731 | * `log_position`: byte position for the first log block. | |
732 | * `ref_index_position`: byte position for the start of the ref index. | |
733 | * `obj_index_position`: byte position for the start of the obj index. | |
734 | * `log_index_position`: byte position for the start of the log index. | |
735 | ||
736 | The size of the footer is 68 bytes for version 1, and 72 bytes for | |
737 | version 2. | |
738 | ||
739 | Reading the footer | |
740 | ++++++++++++++++++ | |
741 | ||
742 | Readers must first read the file start to determine the version | |
743 | number. Then they seek to `file_length - FOOTER_LENGTH` to access the | |
744 | footer. A trusted external source (such as `stat(2)`) is necessary to | |
745 | obtain `file_length`. When reading the footer, readers must verify: | |
746 | ||
747 | * 4-byte magic is correct | |
748 | * 1-byte version number is recognized | |
749 | * 4-byte CRC-32 matches the other 64 bytes (including magic, and | |
750 | version) | |
751 | ||
752 | Once verified, the other fields of the footer can be accessed. | |
753 | ||
754 | Empty tables | |
755 | ++++++++++++ | |
756 | ||
757 | A reftable may be empty. In this case, the file starts with a header | |
758 | and is immediately followed by a footer. | |
759 | ||
760 | Binary search | |
761 | ^^^^^^^^^^^^^ | |
762 | ||
763 | Binary search within a block is supported by the `restart_offset` fields | |
764 | at the end of the block. Readers can binary search through the restart | |
765 | table to locate between which two restart points the sought reference or | |
766 | key should appear. | |
767 | ||
768 | Each record identified by a `restart_offset` stores the complete key in | |
769 | the `suffix` field of the record, making the compare operation during | |
770 | binary search straightforward. | |
771 | ||
772 | Once a restart point lexicographically before the sought reference has | |
773 | been identified, readers can linearly scan through the following record | |
774 | entries to locate the sought record, terminating if the current record | |
775 | sorts after (and therefore the sought key is not present). | |
776 | ||
777 | Restart point selection | |
778 | +++++++++++++++++++++++ | |
779 | ||
780 | Writers determine the restart points at file creation. The process is | |
781 | arbitrary, but every 16 or 64 records is recommended. Every 16 may be | |
782 | more suitable for smaller block sizes (4k or 8k), every 64 for larger | |
783 | block sizes (64k). | |
784 | ||
785 | More frequent restart points reduces prefix compression and increases | |
786 | space consumed by the restart table, both of which increase file size. | |
787 | ||
788 | Less frequent restart points makes prefix compression more effective, | |
789 | decreasing overall file size, with increased penalties for readers | |
790 | walking through more records after the binary search step. | |
791 | ||
792 | A maximum of `65535` restart points per block is supported. | |
793 | ||
794 | Considerations | |
795 | ~~~~~~~~~~~~~~ | |
796 | ||
797 | Lightweight refs dominate | |
798 | ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
799 | ||
800 | The reftable format assumes the vast majority of references are single | |
801 | object names valued with common prefixes, such as Gerrit Code Review's | |
802 | `refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many | |
803 | lightweight tags in the `refs/tags/` namespace. | |
804 | ||
805 | Annotated tags storing the peeled object cost an additional object name per | |
806 | reference. | |
807 | ||
808 | Low overhead | |
809 | ^^^^^^^^^^^^ | |
810 | ||
811 | A reftable with very few references (e.g. git.git with 5 heads) is 269 | |
812 | bytes for reftable, vs. 332 bytes for packed-refs. This supports | |
813 | reftable scaling down for transaction logs (below). | |
814 | ||
815 | Block size | |
816 | ^^^^^^^^^^ | |
817 | ||
818 | For a Gerrit Code Review type repository with many change refs, larger | |
819 | block sizes (64 KiB) and less frequent restart points (every 64) yield | |
820 | better compression due to more references within the block compressing | |
821 | against the prior reference. | |
822 | ||
823 | Larger block sizes reduce the index size, as the reftable will require | |
824 | fewer blocks to store the same number of references. | |
825 | ||
826 | Minimal disk seeks | |
827 | ^^^^^^^^^^^^^^^^^^ | |
828 | ||
829 | Assuming the index block has been loaded into memory, binary searching | |
830 | for any single reference requires exactly 1 disk seek to load the | |
831 | containing block. | |
832 | ||
833 | Scans and lookups dominate | |
834 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
835 | ||
836 | Scanning all references and lookup by name (or namespace such as | |
837 | `refs/heads/`) are the most common activities performed on repositories. | |
838 | Object names are stored directly with references to optimize this use case. | |
839 | ||
840 | Logs are infrequently read | |
841 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
842 | ||
843 | Logs are infrequently accessed, but can be large. Deflating log blocks | |
844 | saves disk space, with some increased penalty at read time. | |
845 | ||
846 | Logs are stored in an isolated section from refs, reducing the burden on | |
847 | reference readers that want to ignore logs. Further, historical logs can | |
848 | be isolated into log-only files. | |
849 | ||
850 | Logs are read backwards | |
851 | ^^^^^^^^^^^^^^^^^^^^^^^ | |
852 | ||
853 | Logs are frequently accessed backwards (most recent N records for master | |
854 | to answer `master@{4}`), so log records are grouped by reference, and | |
855 | sorted descending by update index. | |
856 | ||
857 | Repository format | |
858 | ~~~~~~~~~~~~~~~~~ | |
859 | ||
860 | Version 1 | |
861 | ^^^^^^^^^ | |
862 | ||
863 | A repository must set its `$GIT_DIR/config` to configure reftable: | |
864 | ||
865 | .... | |
866 | [core] | |
867 | repositoryformatversion = 1 | |
868 | [extensions] | |
869 | refStorage = reftable | |
870 | .... | |
871 | ||
872 | Layout | |
873 | ^^^^^^ | |
874 | ||
875 | A collection of reftable files are stored in the `$GIT_DIR/reftable/` | |
876 | directory: | |
877 | ||
878 | .... | |
879 | 00000001-00000001.log | |
880 | 00000002-00000002.ref | |
881 | 00000003-00000003.ref | |
882 | .... | |
883 | ||
884 | where reftable files are named by a unique name such as produced by the | |
885 | function `${min_update_index}-${max_update_index}.ref`. | |
886 | ||
887 | Log-only files use the `.log` extension, while ref-only and mixed ref | |
888 | and log files use `.ref`. extension. | |
889 | ||
890 | The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the | |
891 | current files, one per line, in order, from oldest (base) to newest | |
892 | (most recent): | |
893 | ||
894 | .... | |
895 | $ cat .git/reftable/tables.list | |
896 | 00000001-00000001.log | |
897 | 00000002-00000002.ref | |
898 | 00000003-00000003.ref | |
899 | .... | |
900 | ||
901 | Readers must read `$GIT_DIR/reftable/tables.list` to determine which | |
902 | files are relevant right now, and search through the stack in reverse | |
903 | order (last reftable is examined first). | |
904 | ||
905 | Reftable files not listed in `tables.list` may be new (and about to be | |
906 | added to the stack by the active writer), or ancient and ready to be | |
907 | pruned. | |
908 | ||
909 | Backward compatibility | |
910 | ^^^^^^^^^^^^^^^^^^^^^^ | |
911 | ||
912 | Older clients should continue to recognize the directory as a git | |
913 | repository so they don't look for an enclosing repository in parent | |
914 | directories. To this end, a reftable-enabled repository must contain the | |
915 | following dummy files | |
916 | ||
917 | * `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`. | |
918 | * `.git/refs/`, a directory | |
919 | * `.git/refs/heads`, a regular file | |
920 | ||
921 | Readers | |
922 | ^^^^^^^ | |
923 | ||
924 | Readers can obtain a consistent snapshot of the reference space by | |
925 | following: | |
926 | ||
927 | 1. Open and read the `tables.list` file. | |
928 | 2. Open each of the reftable files that it mentions. | |
929 | 3. If any of the files is missing, goto 1. | |
930 | 4. Read from the now-open files as long as necessary. | |
931 | ||
932 | Update transactions | |
933 | ^^^^^^^^^^^^^^^^^^^ | |
934 | ||
935 | Although reftables are immutable, mutations are supported by writing a | |
936 | new reftable and atomically appending it to the stack: | |
937 | ||
938 | 1. Acquire `tables.list.lock`. | |
939 | 2. Read `tables.list` to determine current reftables. | |
940 | 3. Select `update_index` to be most recent file's | |
941 | `max_update_index + 1`. | |
942 | 4. Prepare temp reftable `tmp_XXXXXX`, including log entries. | |
943 | 5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}.ref`. | |
944 | 6. Copy `tables.list` to `tables.list.lock`, appending file from (5). | |
945 | 7. Rename `tables.list.lock` to `tables.list`. | |
946 | ||
947 | During step 4 the new file's `min_update_index` and `max_update_index` | |
948 | are both set to the `update_index` selected by step 3. All log records | |
949 | for the transaction use the same `update_index` in their keys. This | |
950 | enables later correlation of which references were updated by the same | |
951 | transaction. | |
952 | ||
953 | Because a single `tables.list.lock` file is used to manage locking, the | |
954 | repository is single-threaded for writers. Writers may have to busy-spin | |
955 | (with backoff) around creating `tables.list.lock`, for up to an | |
956 | acceptable wait period, aborting if the repository is too busy to | |
957 | mutate. Application servers wrapped around repositories (e.g. Gerrit | |
958 | Code Review) can layer their own lock/wait queue to improve fairness to | |
959 | writers. | |
960 | ||
961 | Reference deletions | |
962 | ^^^^^^^^^^^^^^^^^^^ | |
963 | ||
964 | Deletion of any reference can be explicitly stored by setting the `type` | |
965 | to `0x0` and omitting the `value` field of the `ref_record`. This serves | |
966 | as a tombstone, overriding any assertions about the existence of the | |
967 | reference from earlier files in the stack. | |
968 | ||
969 | Compaction | |
970 | ^^^^^^^^^^ | |
971 | ||
972 | A partial stack of reftables can be compacted by merging references | |
973 | using a straightforward merge join across reftables, selecting the most | |
974 | recent value for output, and omitting deleted references that do not | |
975 | appear in remaining, lower reftables. | |
976 | ||
977 | A compacted reftable should set its `min_update_index` to the smallest | |
978 | of the input files' `min_update_index`, and its `max_update_index` | |
979 | likewise to the largest input `max_update_index`. | |
980 | ||
981 | For sake of illustration, assume the stack currently consists of | |
982 | reftable files (from oldest to newest): A, B, C, and D. The compactor is | |
983 | going to compact B and C, leaving A and D alone. | |
984 | ||
985 | 1. Obtain lock `tables.list.lock` and read the `tables.list` file. | |
986 | 2. Obtain locks `B.lock` and `C.lock`. Ownership of these locks | |
987 | prevents other processes from trying to compact these files. | |
988 | 3. Release `tables.list.lock`. | |
989 | 4. Compact `B` and `C` into a temp file | |
990 | `${min_update_index}-${max_update_index}_XXXXXX`. | |
991 | 5. Reacquire lock `tables.list.lock`. | |
992 | 6. Verify that `B` and `C` are still in the stack, in that order. This | |
993 | should always be the case, assuming that other processes are adhering to | |
994 | the locking protocol. | |
995 | 7. Rename `${min_update_index}-${max_update_index}_XXXXXX` to | |
996 | `${min_update_index}-${max_update_index}.ref`. | |
997 | 8. Write the new stack to `tables.list.lock`, replacing `B` and `C` | |
998 | with the file from (4). | |
999 | 9. Rename `tables.list.lock` to `tables.list`. | |
1000 | 10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing | |
1001 | readers to backtrack. | |
1002 | ||
1003 | This strategy permits compactions to proceed independently of updates. | |
1004 | ||
1005 | Each reftable (compacted or not) is uniquely identified by its name, so | |
1006 | open reftables can be cached by their name. | |
1007 | ||
1008 | Alternatives considered | |
1009 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
1010 | ||
1011 | bzip packed-refs | |
1012 | ^^^^^^^^^^^^^^^^ | |
1013 | ||
1014 | `bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB | |
1015 | compresses to 23 MiB, 37%). However the bzip format does not support | |
1016 | random access to a single reference. Readers must inflate and discard | |
1017 | while performing a linear scan. | |
1018 | ||
1019 | Breaking packed-refs into chunks (individually compressing each chunk) | |
1020 | would reduce the amount of data a reader must inflate, but still leaves | |
1021 | the problem of indexing chunks to support readers efficiently locating | |
1022 | the correct chunk. | |
1023 | ||
1024 | Given the compression achieved by reftable's encoding, it does not seem | |
1025 | necessary to add the complexity of bzip/gzip/zlib. | |
1026 | ||
1027 | Michael Haggerty's alternate format | |
1028 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
1029 | ||
1030 | Michael Haggerty proposed | |
1031 | link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an | |
1032 | alternate] format to reftable on the Git mailing list. This format uses | |
1033 | smaller chunks, without the restart table, and avoids block alignment | |
1034 | with padding. Reflog entries immediately follow each ref, and are thus | |
1035 | interleaved between refs. | |
1036 | ||
1037 | Performance testing indicates reftable is faster for lookups (51% | |
1038 | faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly | |
1039 | larger file (+ ~3.2%, 28.3M vs 29.2M): | |
1040 | ||
1041 | [cols=">,>,>,>",options="header",] | |
1042 | |===================================== | |
1043 | |format |size |seek cold |seek hot | |
1044 | |mh-alt |28.3 M |23.4 usec |11.2 usec | |
1045 | |reftable |29.2 M |19.9 usec |5.4 usec | |
1046 | |===================================== | |
1047 | ||
1048 | JGit Ketch RefTree | |
1049 | ^^^^^^^^^^^^^^^^^^ | |
1050 | ||
1051 | https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch] | |
1052 | proposed | |
1053 | link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree], | |
1054 | an encoding of references inside Git tree objects stored as part of the | |
1055 | repository's object database. | |
1056 | ||
1057 | The RefTree format adds additional load on the object database storage | |
1058 | layer (more loose objects, more objects in packs), and relies heavily on | |
1059 | the packer's delta compression to save space. Namespaces which are flat | |
1060 | (e.g. thousands of tags in refs/tags) initially create very large loose | |
1061 | objects, and so RefTree does not address the problem of copying many | |
1062 | references to modify a handful. | |
1063 | ||
1064 | Flat namespaces are not efficiently searchable in RefTree, as tree | |
1065 | objects in canonical formatting cannot be binary searched. This fails | |
1066 | the need to handle a large number of references in a single namespace, | |
1067 | such as GitHub's `refs/pulls`, or a project with many tags. | |
1068 | ||
1069 | LMDB | |
1070 | ^^^^ | |
1071 | ||
1072 | David Turner proposed | |
1073 | https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using | |
1074 | LMDB], as LMDB is lightweight (64k of runtime code) and GPL-compatible | |
1075 | license. | |
1076 | ||
1077 | A downside of LMDB is its reliance on a single C implementation. This | |
1078 | makes embedding inside JGit (a popular reimplementation of Git) | |
1079 | difficult, and hoisting onto virtual storage (for JGit DFS) virtually | |
1080 | impossible. | |
1081 | ||
1082 | A common format that can be supported by all major Git implementations | |
1083 | (git-core, JGit, libgit2) is strongly preferred. |