]>
Commit | Line | Data |
---|---|---|
ceab693d DS |
1 | Multi-Pack-Index (MIDX) Design Notes |
2 | ==================================== | |
3 | ||
4 | The Git object directory contains a 'pack' directory containing | |
5 | packfiles (with suffix ".pack") and pack-indexes (with suffix | |
6 | ".idx"). The pack-indexes provide a way to lookup objects and | |
7 | navigate to their offset within the pack, but these must come | |
8 | in pairs with the packfiles. This pairing depends on the file | |
9 | names, as the pack-index differs only in suffix with its pack- | |
10 | file. While the pack-indexes provide fast lookup per packfile, | |
11 | this performance degrades as the number of packfiles increases, | |
12 | because abbreviations need to inspect every packfile and we are | |
13 | more likely to have a miss on our most-recently-used packfile. | |
14 | For some large repositories, repacking into a single packfile | |
15 | is not feasible due to storage space or excessive repack times. | |
16 | ||
17 | The multi-pack-index (MIDX for short) stores a list of objects | |
18 | and their offsets into multiple packfiles. It contains: | |
19 | ||
20 | - A list of packfile names. | |
21 | - A sorted list of object IDs. | |
22 | - A list of metadata for the ith object ID including: | |
23 | - A value j referring to the jth packfile. | |
24 | - An offset within the jth packfile for the object. | |
25 | - If large offsets are required, we use another list of large | |
26 | offsets similar to version 2 pack-indexes. | |
27 | ||
28 | Thus, we can provide O(log N) lookup time for any number | |
29 | of packfiles. | |
30 | ||
31 | Design Details | |
32 | -------------- | |
33 | ||
34 | - The MIDX is stored in a file named 'multi-pack-index' in the | |
35 | .git/objects/pack directory. This could be stored in the pack | |
36 | directory of an alternate. It refers only to packfiles in that | |
37 | same directory. | |
38 | ||
421c0ffb | 39 | - The core.multiPackIndex config setting must be on to consume MIDX files. |
ceab693d DS |
40 | |
41 | - The file format includes parameters for the object ID hash | |
42 | function, so a future change of hash algorithm does not require | |
43 | a change in format. | |
44 | ||
45 | - The MIDX keeps only one record per object ID. If an object appears | |
9218c6a4 TB |
46 | in multiple packfiles, then the MIDX selects the copy in the |
47 | preferred packfile, otherwise selecting from the most-recently | |
48 | modified packfile. | |
ceab693d DS |
49 | |
50 | - If there exist packfiles in the pack directory not registered in | |
51 | the MIDX, then those packfiles are loaded into the `packed_git` | |
52 | list and `packed_git_mru` cache. | |
53 | ||
54 | - The pack-indexes (.idx files) remain in the pack directory so we | |
55 | can delete the MIDX file, set core.midx to false, or downgrade | |
56 | without any loss of information. | |
57 | ||
58 | - The MIDX file format uses a chunk-based approach (similar to the | |
59 | commit-graph file) that allows optional data to be added. | |
60 | ||
61 | Future Work | |
62 | ----------- | |
63 | ||
ceab693d DS |
64 | - The multi-pack-index allows many packfiles, especially in a context |
65 | where repacking is expensive (such as a very large repo), or | |
66 | unexpected maintenance time is unacceptable (such as a high-demand | |
67 | build machine). However, the multi-pack-index needs to be rewritten | |
68 | in full every time. We can extend the format to be incremental, so | |
69 | writes are fast. By storing a small "tip" multi-pack-index that | |
70 | points to large "base" MIDX files, we can keep writes fast while | |
71 | still reducing the number of binary searches required for object | |
72 | lookups. | |
73 | ||
74 | - The reachability bitmap is currently paired directly with a single | |
75 | packfile, using the pack-order as the object order to hopefully | |
76 | compress the bitmaps well using run-length encoding. This could be | |
77 | extended to pair a reachability bitmap with a multi-pack-index. If | |
78 | the multi-pack-index is extended to store a "stable object order" | |
79 | (a function Order(hash) = integer that is constant for a given hash, | |
80 | even as the multi-pack-index is updated) then a reachability bitmap | |
81 | could point to a multi-pack-index and be updated independently. | |
82 | ||
83 | - Packfiles can be marked as "special" using empty files that share | |
84 | the initial name but replace ".pack" with ".keep" or ".promisor". | |
85 | We can add an optional chunk of data to the multi-pack-index that | |
86 | records flags of information about the packfiles. This allows new | |
87 | states, such as 'repacked' or 'redeltified', that can help with | |
88 | pack maintenance in a multi-pack environment. It may also be | |
89 | helpful to organize packfiles by object type (commit, tree, blob, | |
90 | etc.) and use this metadata to help that maintenance. | |
91 | ||
92 | - The partial clone feature records special "promisor" packs that | |
93 | may point to objects that are not stored locally, but available | |
94 | on request to a server. The multi-pack-index does not currently | |
95 | track these promisor packs. | |
96 | ||
97 | Related Links | |
98 | ------------- | |
99 | [0] https://bugs.chromium.org/p/git/issues/detail?id=6 | |
100 | Chromium work item for: Multi-Pack Index (MIDX) | |
101 | ||
3eae30e4 | 102 | [1] https://lore.kernel.org/git/20180107181459.222909-1-dstolee@microsoft.com/ |
ceab693d DS |
103 | An earlier RFC for the multi-pack-index feature |
104 | ||
3eae30e4 | 105 | [2] https://lore.kernel.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ |
ceab693d | 106 | Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) |