]>
Commit | Line | Data |
---|---|---|
0d4455a3 VM |
1 | GIT bitmap v1 format |
2 | ==================== | |
3 | ||
4 | - A header appears at the beginning: | |
5 | ||
6 | 4-byte signature: {'B', 'I', 'T', 'M'} | |
7 | ||
8 | 2-byte version number (network byte order) | |
9 | The current implementation only supports version 1 | |
10 | of the bitmap index (the same one as JGit). | |
11 | ||
12 | 2-byte flags (network byte order) | |
13 | ||
14 | The following flags are supported: | |
15 | ||
16 | - BITMAP_OPT_FULL_DAG (0x1) REQUIRED | |
17 | This flag must always be present. It implies that the bitmap | |
18 | index has been generated for a packfile with full closure | |
19 | (i.e. where every single object in the packfile can find | |
20 | its parent links inside the same packfile). This is a | |
21 | requirement for the bitmap index format, also present in JGit, | |
22 | that greatly reduces the complexity of the implementation. | |
23 | ||
ae4f07fb VM |
24 | - BITMAP_OPT_HASH_CACHE (0x4) |
25 | If present, the end of the bitmap file contains | |
26 | `N` 32-bit name-hash values, one per object in the | |
27 | pack. The format and meaning of the name-hash is | |
28 | described below. | |
29 | ||
0d4455a3 VM |
30 | 4-byte entry count (network byte order) |
31 | ||
32 | The total count of entries (bitmapped commits) in this bitmap index. | |
33 | ||
34 | 20-byte checksum | |
35 | ||
36 | The SHA1 checksum of the pack this bitmap index belongs to. | |
37 | ||
38 | - 4 EWAH bitmaps that act as type indexes | |
39 | ||
40 | Type indexes are serialized after the hash cache in the shape | |
41 | of four EWAH bitmaps stored consecutively (see Appendix A for | |
42 | the serialization format of an EWAH bitmap). | |
43 | ||
44 | There is a bitmap for each Git object type, stored in the following | |
45 | order: | |
46 | ||
47 | - Commits | |
48 | - Trees | |
49 | - Blobs | |
50 | - Tags | |
51 | ||
52 | In each bitmap, the `n`th bit is set to true if the `n`th object | |
53 | in the packfile is of that type. | |
54 | ||
55 | The obvious consequence is that the OR of all 4 bitmaps will result | |
56 | in a full set (all bits set), and the AND of all 4 bitmaps will | |
57 | result in an empty bitmap (no bits set). | |
58 | ||
59 | - N entries with compressed bitmaps, one for each indexed commit | |
60 | ||
61 | Where `N` is the total amount of entries in this bitmap index. | |
62 | Each entry contains the following: | |
63 | ||
64 | - 4-byte object position (network byte order) | |
65 | The position **in the index for the packfile** where the | |
66 | bitmap for this commit is found. | |
67 | ||
68 | - 1-byte XOR-offset | |
69 | The xor offset used to compress this bitmap. For an entry | |
70 | in position `x`, a XOR offset of `y` means that the actual | |
71 | bitmap representing this commit is composed by XORing the | |
72 | bitmap for this entry with the bitmap in entry `x-y` (i.e. | |
73 | the bitmap `y` entries before this one). | |
74 | ||
75 | Note that this compression can be recursive. In order to | |
76 | XOR this entry with a previous one, the previous entry needs | |
77 | to be decompressed first, and so on. | |
78 | ||
79 | The hard-limit for this offset is 160 (an entry can only be | |
80 | xor'ed against one of the 160 entries preceding it). This | |
81 | number is always positive, and hence entries are always xor'ed | |
82 | with **previous** bitmaps, not bitmaps that will come afterwards | |
83 | in the index. | |
84 | ||
85 | - 1-byte flags for this bitmap | |
86 | At the moment the only available flag is `0x1`, which hints | |
87 | that this bitmap can be re-used when rebuilding bitmap indexes | |
88 | for the repository. | |
89 | ||
90 | - The compressed bitmap itself, see Appendix A. | |
91 | ||
92 | == Appendix A: Serialization format for an EWAH bitmap | |
93 | ||
94 | Ewah bitmaps are serialized in the same protocol as the JAVAEWAH | |
95 | library, making them backwards compatible with the JGit | |
96 | implementation: | |
97 | ||
98 | - 4-byte number of bits of the resulting UNCOMPRESSED bitmap | |
99 | ||
100 | - 4-byte number of words of the COMPRESSED bitmap, when stored | |
101 | ||
102 | - N x 8-byte words, as specified by the previous field | |
103 | ||
104 | This is the actual content of the compressed bitmap. | |
105 | ||
106 | - 4-byte position of the current RLW for the compressed | |
107 | bitmap | |
108 | ||
109 | All words are stored in network byte order for their corresponding | |
110 | sizes. | |
111 | ||
112 | The compressed bitmap is stored in a form of run-length encoding, as | |
113 | follows. It consists of a concatenation of an arbitrary number of | |
114 | chunks. Each chunk consists of one or more 64-bit words | |
115 | ||
116 | H L_1 L_2 L_3 .... L_M | |
117 | ||
118 | H is called RLW (run length word). It consists of (from lower to higher | |
119 | order bits): | |
120 | ||
121 | - 1 bit: the repeated bit B | |
122 | ||
123 | - 32 bits: repetition count K (unsigned) | |
124 | ||
125 | - 31 bits: literal word count M (unsigned) | |
126 | ||
127 | The bitstream represented by the above chunk is then: | |
128 | ||
129 | - K repetitions of B | |
130 | ||
131 | - The bits stored in `L_1` through `L_M`. Within a word, bits at | |
132 | lower order come earlier in the stream than those at higher | |
133 | order. | |
134 | ||
135 | The next word after `L_M` (if any) must again be a RLW, for the next | |
136 | chunk. For efficient appending to the bitstream, the EWAH stores a | |
137 | pointer to the last RLW in the stream. | |
ae4f07fb VM |
138 | |
139 | ||
140 | == Appendix B: Optional Bitmap Sections | |
141 | ||
142 | These sections may or may not be present in the `.bitmap` file; their | |
143 | presence is indicated by the header flags section described above. | |
144 | ||
145 | Name-hash cache | |
146 | --------------- | |
147 | ||
148 | If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains | |
149 | a cache of 32-bit values, one per object in the pack. The value at | |
150 | position `i` is the hash of the pathname at which the `i`th object | |
151 | (counting in index order) in the pack can be found. This can be fed | |
152 | into the delta heuristics to compare objects with similar pathnames. | |
153 | ||
154 | The hash algorithm used is: | |
155 | ||
156 | hash = 0; | |
157 | while ((c = *name++)) | |
158 | if (!isspace(c)) | |
159 | hash = (hash >> 2) + (c << 24); | |
160 | ||
161 | Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. | |
162 | If implementations want to choose a different hashing scheme, they are | |
163 | free to do so, but MUST allocate a new header flag (because comparing | |
164 | hashes made under two different schemes would be pointless). |