]>
Commit | Line | Data |
---|---|---|
0d4455a3 VM |
1 | GIT bitmap v1 format |
2 | ==================== | |
3 | ||
917a54c0 TB |
4 | == Pack and multi-pack bitmaps |
5 | ||
6 | Bitmaps store reachability information about the set of objects in a packfile, | |
7 | or a multi-pack index (MIDX). The former is defined obviously, and the latter is | |
8 | defined as the union of objects in packs contained in the MIDX. | |
9 | ||
10 | A bitmap may belong to either one pack, or the repository's multi-pack index (if | |
11 | it exists). A repository may have at most one bitmap. | |
12 | ||
13 | An object is uniquely described by its bit position within a bitmap: | |
14 | ||
15 | - If the bitmap belongs to a packfile, the __n__th bit corresponds to | |
16 | the __n__th object in pack order. For a function `offset` which maps | |
17 | objects to their byte offset within a pack, pack order is defined as | |
18 | follows: | |
19 | ||
20 | o1 <= o2 <==> offset(o1) <= offset(o2) | |
21 | ||
22 | - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the | |
23 | __n__th object in MIDX order. With an additional function `pack` which | |
24 | maps objects to the pack they were selected from by the MIDX, MIDX order | |
25 | is defined as follows: | |
26 | ||
27 | o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) | |
caea9002 AC |
28 | + |
29 | The ordering between packs is done according to the MIDX's .rev file. | |
30 | Notably, the preferred pack sorts ahead of all other packs. | |
917a54c0 TB |
31 | |
32 | The on-disk representation (described below) of a bitmap is the same regardless | |
33 | of whether or not that bitmap belongs to a packfile or a MIDX. The only | |
34 | difference is the interpretation of the bits, which is described above. | |
35 | ||
36 | Certain bitmap extensions are supported (see: Appendix B). No extensions are | |
37 | required for bitmaps corresponding to packfiles. For bitmaps that correspond to | |
38 | MIDXs, both the bit-cache and rev-cache extensions are required. | |
39 | ||
40 | == On-disk format | |
41 | ||
caea9002 AC |
42 | * A header appears at the beginning: |
43 | ||
44 | 4-byte signature: :: {'B', 'I', 'T', 'M'} | |
45 | ||
46 | 2-byte version number (network byte order): :: | |
47 | ||
48 | The current implementation only supports version 1 | |
49 | of the bitmap index (the same one as JGit). | |
50 | ||
51 | 2-byte flags (network byte order): :: | |
52 | ||
53 | The following flags are supported: | |
54 | ||
55 | ** {empty} | |
56 | BITMAP_OPT_FULL_DAG (0x1) REQUIRED: ::: | |
57 | ||
58 | This flag must always be present. It implies that the | |
59 | bitmap index has been generated for a packfile or | |
60 | multi-pack index (MIDX) with full closure (i.e. where | |
61 | every single object in the packfile/MIDX can find its | |
62 | parent links inside the same packfile/MIDX). This is a | |
63 | requirement for the bitmap index format, also present in | |
64 | JGit, that greatly reduces the complexity of the | |
65 | implementation. | |
66 | ||
67 | ** {empty} | |
68 | BITMAP_OPT_HASH_CACHE (0x4): ::: | |
69 | ||
70 | If present, the end of the bitmap file contains | |
71 | `N` 32-bit name-hash values, one per object in the | |
72 | pack/MIDX. The format and meaning of the name-hash is | |
73 | described below. | |
74 | ||
e9977b12 AC |
75 | ** {empty} |
76 | BITMAP_OPT_LOOKUP_TABLE (0x10): ::: | |
77 | If present, the end of the bitmap file contains a table | |
78 | containing a list of `N` <commit_pos, offset, xor_row> | |
79 | triplets. The format and meaning of the table is described | |
80 | below. | |
81 | + | |
82 | NOTE: Unlike the xor_offset used to compress an individual bitmap, | |
83 | `xor_row` stores an *absolute* index into the lookup table, not a location | |
84 | relative to the current entry. | |
85 | ||
caea9002 AC |
86 | 4-byte entry count (network byte order): :: |
87 | The total count of entries (bitmapped commits) in this bitmap index. | |
88 | ||
89 | 20-byte checksum: :: | |
90 | The SHA1 checksum of the pack/MIDX this bitmap index | |
91 | belongs to. | |
92 | ||
93 | * 4 EWAH bitmaps that act as type indexes | |
94 | + | |
95 | Type indexes are serialized after the hash cache in the shape | |
96 | of four EWAH bitmaps stored consecutively (see Appendix A for | |
97 | the serialization format of an EWAH bitmap). | |
98 | + | |
99 | There is a bitmap for each Git object type, stored in the following | |
100 | order: | |
101 | + | |
102 | - Commits | |
103 | - Trees | |
104 | - Blobs | |
105 | - Tags | |
106 | ||
107 | + | |
108 | In each bitmap, the `n`th bit is set to true if the `n`th object | |
109 | in the packfile or multi-pack index is of that type. | |
110 | + | |
111 | The obvious consequence is that the OR of all 4 bitmaps will result | |
112 | in a full set (all bits set), and the AND of all 4 bitmaps will | |
113 | result in an empty bitmap (no bits set). | |
114 | ||
115 | * N entries with compressed bitmaps, one for each indexed commit | |
116 | + | |
117 | Where `N` is the total amount of entries in this bitmap index. | |
118 | Each entry contains the following: | |
119 | ||
120 | ** {empty} | |
121 | 4-byte object position (network byte order): :: | |
122 | The position **in the index for the packfile or | |
123 | multi-pack index** where the bitmap for this commit is | |
124 | found. | |
125 | ||
126 | ** {empty} | |
127 | 1-byte XOR-offset: :: | |
128 | The xor offset used to compress this bitmap. For an entry | |
129 | in position `x`, a XOR offset of `y` means that the actual | |
130 | bitmap representing this commit is composed by XORing the | |
131 | bitmap for this entry with the bitmap in entry `x-y` (i.e. | |
132 | the bitmap `y` entries before this one). | |
133 | + | |
134 | NOTE: This compression can be recursive. In order to | |
135 | XOR this entry with a previous one, the previous entry needs | |
136 | to be decompressed first, and so on. | |
137 | + | |
138 | The hard-limit for this offset is 160 (an entry can only be | |
139 | xor'ed against one of the 160 entries preceding it). This | |
140 | number is always positive, and hence entries are always xor'ed | |
141 | with **previous** bitmaps, not bitmaps that will come afterwards | |
142 | in the index. | |
143 | ||
144 | ** {empty} | |
145 | 1-byte flags for this bitmap: :: | |
146 | At the moment the only available flag is `0x1`, which hints | |
147 | that this bitmap can be re-used when rebuilding bitmap indexes | |
148 | for the repository. | |
149 | ||
150 | ** The compressed bitmap itself, see Appendix A. | |
0d4455a3 | 151 | |
ac7667bd AC |
152 | * {empty} |
153 | TRAILER: :: | |
154 | Trailing checksum of the preceding contents. | |
155 | ||
0d4455a3 VM |
156 | == Appendix A: Serialization format for an EWAH bitmap |
157 | ||
158 | Ewah bitmaps are serialized in the same protocol as the JAVAEWAH | |
159 | library, making them backwards compatible with the JGit | |
160 | implementation: | |
161 | ||
162 | - 4-byte number of bits of the resulting UNCOMPRESSED bitmap | |
163 | ||
164 | - 4-byte number of words of the COMPRESSED bitmap, when stored | |
165 | ||
166 | - N x 8-byte words, as specified by the previous field | |
caea9002 AC |
167 | + |
168 | This is the actual content of the compressed bitmap. | |
0d4455a3 VM |
169 | |
170 | - 4-byte position of the current RLW for the compressed | |
171 | bitmap | |
172 | ||
173 | All words are stored in network byte order for their corresponding | |
174 | sizes. | |
175 | ||
176 | The compressed bitmap is stored in a form of run-length encoding, as | |
177 | follows. It consists of a concatenation of an arbitrary number of | |
178 | chunks. Each chunk consists of one or more 64-bit words | |
179 | ||
180 | H L_1 L_2 L_3 .... L_M | |
181 | ||
182 | H is called RLW (run length word). It consists of (from lower to higher | |
183 | order bits): | |
184 | ||
185 | - 1 bit: the repeated bit B | |
186 | ||
187 | - 32 bits: repetition count K (unsigned) | |
188 | ||
189 | - 31 bits: literal word count M (unsigned) | |
190 | ||
191 | The bitstream represented by the above chunk is then: | |
192 | ||
193 | - K repetitions of B | |
194 | ||
195 | - The bits stored in `L_1` through `L_M`. Within a word, bits at | |
196 | lower order come earlier in the stream than those at higher | |
197 | order. | |
198 | ||
199 | The next word after `L_M` (if any) must again be a RLW, for the next | |
200 | chunk. For efficient appending to the bitstream, the EWAH stores a | |
201 | pointer to the last RLW in the stream. | |
ae4f07fb VM |
202 | |
203 | ||
204 | == Appendix B: Optional Bitmap Sections | |
205 | ||
206 | These sections may or may not be present in the `.bitmap` file; their | |
207 | presence is indicated by the header flags section described above. | |
208 | ||
209 | Name-hash cache | |
210 | --------------- | |
211 | ||
212 | If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains | |
917a54c0 | 213 | a cache of 32-bit values, one per object in the pack/MIDX. The value at |
ae4f07fb | 214 | position `i` is the hash of the pathname at which the `i`th object |
917a54c0 TB |
215 | (counting in index or multi-pack index order) in the pack/MIDX can be found. |
216 | This can be fed into the delta heuristics to compare objects with similar | |
217 | pathnames. | |
ae4f07fb VM |
218 | |
219 | The hash algorithm used is: | |
220 | ||
221 | hash = 0; | |
222 | while ((c = *name++)) | |
223 | if (!isspace(c)) | |
224 | hash = (hash >> 2) + (c << 24); | |
225 | ||
226 | Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. | |
227 | If implementations want to choose a different hashing scheme, they are | |
228 | free to do so, but MUST allocate a new header flag (because comparing | |
229 | hashes made under two different schemes would be pointless). | |
e9977b12 AC |
230 | |
231 | Commit lookup table | |
232 | ------------------- | |
233 | ||
234 | If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)` | |
235 | bytes (preceding the name-hash cache and trailing hash) of the `.bitmap` | |
236 | file contains a lookup table specifying the information needed to get | |
237 | the desired bitmap from the entries without parsing previous unnecessary | |
238 | bitmaps. | |
239 | ||
240 | For a `.bitmap` containing `nr_entries` reachability bitmaps, the table | |
241 | contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets | |
242 | (sorted in the ascending order of `commit_pos`). The content of i'th | |
243 | triplet is - | |
244 | ||
245 | * {empty} | |
246 | commit_pos (4 byte integer, network byte order): :: | |
247 | It stores the object position of a commit (in the midx or pack | |
248 | index). | |
249 | ||
250 | * {empty} | |
251 | offset (8 byte integer, network byte order): :: | |
252 | The offset from which that commit's bitmap can be read. | |
253 | ||
254 | * {empty} | |
255 | xor_row (4 byte integer, network byte order): :: | |
256 | The position of the triplet whose bitmap is used to compress | |
257 | this one, or `0xffffffff` if no such bitmap exists. |