]>
Commit | Line | Data |
---|---|---|
ae30d7b1 DS |
1 | Git Commit Graph Design Notes |
2 | ============================= | |
3 | ||
4 | Git walks the commit graph for many reasons, including: | |
5 | ||
6 | 1. Listing and filtering commit history. | |
7 | 2. Computing merge bases. | |
8 | ||
9 | These operations can become slow as the commit count grows. The merge | |
10 | base calculation shows up in many user-facing commands, such as 'merge-base' | |
11 | or 'status' and can take minutes to compute depending on history shape. | |
12 | ||
13 | There are two main costs here: | |
14 | ||
15 | 1. Decompressing and parsing commits. | |
16 | 2. Walking the entire graph to satisfy topological order constraints. | |
17 | ||
4c399442 | 18 | The commit-graph file is a supplemental data structure that accelerates |
ae30d7b1 DS |
19 | commit graph walks. If a user downgrades or disables the 'core.commitGraph' |
20 | config setting, then the existing ODB is sufficient. The file is stored | |
21 | as "commit-graph" either in the .git/objects/info directory or in the info | |
22 | directory of an alternate. | |
23 | ||
4c399442 | 24 | The commit-graph file stores the commit graph structure along with some |
031fd4b9 EN |
25 | extra metadata to speed up graph walks. By listing commit OIDs in |
26 | lexicographic order, we can identify an integer position for each commit | |
27 | and refer to the parents of a commit using those integer positions. We | |
28 | use binary search to find initial commits and then use the integer | |
29 | positions for fast lookups during the walk. | |
ae30d7b1 DS |
30 | |
31 | A consumer may load the following info for a commit from the graph: | |
32 | ||
33 | 1. The commit OID. | |
34 | 2. The list of parents, along with their integer position. | |
35 | 3. The commit date. | |
36 | 4. The root tree OID. | |
37 | 5. The generation number (see definition below). | |
38 | ||
39 | Values 1-4 satisfy the requirements of parse_commit_gently(). | |
40 | ||
5a3b130c AK |
41 | There are two definitions of generation number: |
42 | 1. Corrected committer dates (generation number v2) | |
bbb0c357 | 43 | 2. Topological levels (generation number v1) |
ae30d7b1 | 44 | |
5a3b130c | 45 | Define "corrected committer date" of a commit recursively as follows: |
ae30d7b1 | 46 | |
5a3b130c AK |
47 | * A commit with no parents (a root commit) has corrected committer date |
48 | equal to its committer date. | |
ae30d7b1 | 49 | |
5a3b130c | 50 | * A commit with at least one parent has corrected committer date equal to |
bbb0c357 | 51 | the maximum of its committer date and one more than the largest corrected |
5a3b130c AK |
52 | committer date among its parents. |
53 | ||
54 | * As a special case, a root commit with timestamp zero has corrected commit | |
55 | date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO | |
56 | (that is, an uncomputed corrected commit date). | |
57 | ||
58 | Define the "topological level" of a commit recursively as follows: | |
59 | ||
60 | * A commit with no parents (a root commit) has topological level of one. | |
61 | ||
62 | * A commit with at least one parent has topological level one more than | |
63 | the largest topological level among its parents. | |
64 | ||
65 | Equivalently, the topological level of a commit A is one more than the | |
ae30d7b1 DS |
66 | length of a longest path from A to a root commit. The recursive definition |
67 | is easier to use for computation and observing the following property: | |
68 | ||
69 | If A and B are commits with generation numbers N and M, respectively, | |
70 | and N <= M, then A cannot reach B. That is, we know without searching | |
71 | that B is not an ancestor of A because it is further from a root commit | |
72 | than A. | |
73 | ||
74 | Conversely, when checking if A is an ancestor of B, then we only need | |
75 | to walk commits until all commits on the walk boundary have generation | |
76 | number at most N. If we walk commits using a priority queue seeded by | |
77 | generation numbers, then we always expand the boundary commit with highest | |
78 | generation number and can easily detect the stopping condition. | |
79 | ||
5a3b130c AK |
80 | The property applies to both versions of generation number, that is both |
81 | corrected committer dates and topological levels. | |
82 | ||
ae30d7b1 DS |
83 | This property can be used to significantly reduce the time it takes to |
84 | walk commits and determine topological relationships. Without generation | |
85 | numbers, the general heuristic is the following: | |
86 | ||
87 | If A and B are commits with commit time X and Y, respectively, and | |
88 | X < Y, then A _probably_ cannot reach B. | |
89 | ||
5a3b130c AK |
90 | In absence of corrected commit dates (for example, old versions of Git or |
91 | mixed generation graph chains), | |
92 | this heuristic is currently used whenever the computation is allowed to | |
ae30d7b1 DS |
93 | violate topological relationships due to clock skew (such as "git log" |
94 | with default order), but is not used when the topological order is | |
95 | required (such as merge base calculations, "git log --graph"). | |
96 | ||
97 | In practice, we expect some commits to be created recently and not stored | |
98 | in the commit graph. We can treat these commits as having "infinite" | |
99 | generation number and walk until reaching commits with known generation | |
100 | number. | |
101 | ||
5a3b130c | 102 | We use the macro GENERATION_NUMBER_INFINITY to mark commits not |
1472978e DS |
103 | in the commit-graph file. If a commit-graph file was written by a version |
104 | of Git that did not compute generation numbers, then those commits will | |
105 | have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. | |
106 | ||
107 | Since the commit-graph file is closed under reachability, we can guarantee | |
108 | the following weaker condition on all commits: | |
109 | ||
031fd4b9 | 110 | If A and B are commits with generation numbers N and M, respectively, |
1472978e DS |
111 | and N < M, then A cannot reach B. |
112 | ||
113 | Note how the strict inequality differs from the inequality when we have | |
114 | fully-computed generation numbers. Using strict inequality may result in | |
115 | walking a few extra commits, but the simplicity in dealing with commits | |
116 | with generation number *_INFINITY or *_ZERO is valuable. | |
117 | ||
5a3b130c AK |
118 | We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose |
119 | topological levels (generation number v1) are computed to be at least | |
120 | this value. We limit at this value since it is the largest value that | |
121 | can be stored in the commit-graph file using the 30 bits available | |
122 | to topological levels. This presents another case where a commit can | |
123 | have generation number equal to that of a parent. | |
1472978e | 124 | |
ae30d7b1 DS |
125 | Design Details |
126 | -------------- | |
127 | ||
4c399442 | 128 | - The commit-graph file is stored in a file named 'commit-graph' in the |
ae30d7b1 DS |
129 | .git/objects/info directory. This could be stored in the info directory |
130 | of an alternate. | |
131 | ||
132 | - The core.commitGraph config setting must be on to consume graph files. | |
133 | ||
134 | - The file format includes parameters for the object ID hash function, | |
135 | so a future change of hash algorithm does not require a change in format. | |
136 | ||
950c62bd DS |
137 | - Commit grafts and replace objects can change the shape of the commit |
138 | history. The latter can also be enabled/disabled on the fly using | |
139 | `--no-replace-objects`. This leads to difficultly storing both possible | |
140 | interpretations of a commit id, especially when computing generation | |
141 | numbers. The commit-graph will not be read or written when | |
142 | replace-objects or grafts are present. | |
143 | ||
144 | - Shallow clones create grafts of commits by dropping their parents. This | |
145 | leads the commit-graph to think those commits have generation number 1. | |
146 | If and when those commits are made unshallow, those generation numbers | |
147 | become invalid. Since shallow clones are intended to restrict the commit | |
148 | history to a very small set of commits, the commit-graph feature is less | |
149 | helpful for these clones, anyway. The commit-graph will not be read or | |
150 | written when shallow commits are present. | |
151 | ||
890345ac DS |
152 | Commit Graphs Chains |
153 | -------------------- | |
154 | ||
155 | Typically, repos grow with near-constant velocity (commits per day). Over time, | |
156 | the number of commits added by a fetch operation is much smaller than the | |
157 | number of commits in the full history. By creating a "chain" of commit-graphs, | |
158 | we enable fast writes of new commit data without rewriting the entire commit | |
159 | history -- at least, most of the time. | |
160 | ||
161 | ## File Layout | |
162 | ||
163 | A commit-graph chain uses multiple files, and we use a fixed naming convention | |
164 | to organize these files. Each commit-graph file has a name | |
165 | `$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex- | |
166 | valued hash stored in the footer of that file (which is a hash of the file's | |
167 | contents before that hash). For a chain of commit-graph files, a plain-text | |
168 | file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the | |
169 | hashes for the files in order from "lowest" to "highest". | |
170 | ||
171 | For example, if the `commit-graph-chain` file contains the lines | |
172 | ||
173 | ``` | |
174 | {hash0} | |
175 | {hash1} | |
176 | {hash2} | |
177 | ``` | |
178 | ||
179 | then the commit-graph chain looks like the following diagram: | |
180 | ||
181 | +-----------------------+ | |
182 | | graph-{hash2}.graph | | |
183 | +-----------------------+ | |
184 | | | |
185 | +-----------------------+ | |
186 | | | | |
187 | | graph-{hash1}.graph | | |
188 | | | | |
189 | +-----------------------+ | |
190 | | | |
191 | +-----------------------+ | |
192 | | | | |
193 | | | | |
194 | | | | |
195 | | graph-{hash0}.graph | | |
196 | | | | |
197 | | | | |
198 | | | | |
199 | +-----------------------+ | |
200 | ||
201 | Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of | |
202 | commits in `graph-{hash1}.graph`, and X2 be the number of commits in | |
203 | `graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`, | |
204 | then we interpret this as being the commit in position (X0 + X1 + i), and that | |
205 | will be used as its "graph position". The commits in `graph-{hash2}.graph` use these | |
206 | positions to refer to their parents, which may be in `graph-{hash1}.graph` or | |
207 | `graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking | |
208 | its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + | |
209 | X2). | |
210 | ||
1771be90 DS |
211 | Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data |
212 | specifying the hashes of all files in the lower layers. In the above example, | |
213 | `graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains | |
214 | `{hash0}` and `{hash1}`. | |
215 | ||
216 | ## Merging commit-graph files | |
217 | ||
218 | If we only added a new commit-graph file on every write, we would run into a | |
219 | linear search problem through many commit-graph files. Instead, we use a merge | |
220 | strategy to decide when the stack should collapse some number of levels. | |
221 | ||
222 | The diagram below shows such a collapse. As a set of new commits are added, it | |
223 | is determined by the merge strategy that the files should collapse to | |
224 | `graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and | |
225 | the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` | |
226 | file. | |
227 | ||
228 | +---------------------+ | |
229 | | | | |
230 | | (new commits) | | |
231 | | | | |
232 | +---------------------+ | |
233 | | | | |
234 | +-----------------------+ +---------------------+ | |
6dfefe70 | 235 | | graph-{hash2} |->| | |
1771be90 DS |
236 | +-----------------------+ +---------------------+ |
237 | | | | | |
238 | +-----------------------+ +---------------------+ | |
239 | | | | | | |
6dfefe70 | 240 | | graph-{hash1} |->| | |
1771be90 DS |
241 | | | | | |
242 | +-----------------------+ +---------------------+ | |
243 | | tmp_graphXXX | |
244 | +-----------------------+ | |
245 | | | | |
246 | | | | |
247 | | | | |
6dfefe70 | 248 | | graph-{hash0} | |
1771be90 DS |
249 | | | |
250 | | | | |
251 | | | | |
252 | +-----------------------+ | |
253 | ||
254 | During this process, the commits to write are combined, sorted and we write the | |
255 | contents to a temporary file, all while holding a `commit-graph-chain.lock` | |
256 | lock-file. When the file is flushed, we rename it to `graph-{hash3}` | |
257 | according to the computed `{hash3}`. Finally, we write the new chain data to | |
258 | `commit-graph-chain.lock`: | |
259 | ||
260 | ``` | |
261 | {hash3} | |
262 | {hash0} | |
263 | ``` | |
264 | ||
265 | We then close the lock-file. | |
266 | ||
267 | ## Merge Strategy | |
268 | ||
269 | When writing a set of commits that do not exist in the commit-graph stack of | |
270 | height N, we default to creating a new file at level N + 1. We then decide to | |
271 | merge with the Nth level if one of two conditions hold: | |
272 | ||
c2bc6e6a DS |
273 | 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in |
274 | level N is less than X times the number of commits in level N + 1. | |
1771be90 | 275 | |
c2bc6e6a DS |
276 | 2. `--max-commits=<C>` is specified with non-zero C and the number of commits |
277 | in level N + 1 is more than C commits. | |
1771be90 DS |
278 | |
279 | This decision cascades down the levels: when we merge a level we create a new | |
280 | set of commits that then compares to the next level. | |
281 | ||
282 | The first condition bounds the number of levels to be logarithmic in the total | |
283 | number of commits. The second condition bounds the total number of commits in | |
284 | a `graph-{hashN}` file and not in the `commit-graph` file, preventing | |
285 | significant performance issues when the stack merges and another process only | |
286 | partially reads the previous stack. | |
287 | ||
288 | The merge strategy values (2 for the size multiple, 64,000 for the maximum | |
289 | number of commits) could be extracted into config settings for full | |
290 | flexibility. | |
291 | ||
5a3b130c AK |
292 | ## Handling Mixed Generation Number Chains |
293 | ||
294 | With the introduction of generation number v2 and generation data chunk, the | |
295 | following scenario is possible: | |
296 | ||
297 | 1. "New" Git writes a commit-graph with the corrected commit dates. | |
298 | 2. "Old" Git writes a split commit-graph on top without corrected commit dates. | |
299 | ||
300 | A naive approach of using the newest available generation number from | |
301 | each layer would lead to violated expectations: the lower layer would | |
302 | use corrected commit dates which are much larger than the topological | |
303 | levels of the higher layer. For this reason, Git inspects the topmost | |
304 | layer to see if the layer is missing corrected commit dates. In such a case | |
305 | Git only uses topological level for generation numbers. | |
306 | ||
307 | When writing a new layer in split commit-graph, we write corrected commit | |
308 | dates if the topmost layer has corrected commit dates written. This | |
309 | guarantees that if a layer has corrected commit dates, all lower layers | |
310 | must have corrected commit dates as well. | |
311 | ||
312 | When merging layers, we do not consider whether the merged layers had corrected | |
313 | commit dates. Instead, the new layer will have corrected commit dates if the | |
314 | layer below the new layer has corrected commit dates. | |
315 | ||
316 | While writing or merging layers, if the new layer is the only layer, it will | |
317 | have corrected commit dates when written by compatible versions of Git. Thus, | |
318 | rewriting split commit-graph as a single file (`--split=replace`) creates a | |
319 | single layer with corrected commit dates. | |
320 | ||
8d84097f DS |
321 | ## Deleting graph-{hash} files |
322 | ||
323 | After a new tip file is written, some `graph-{hash}` files may no longer | |
324 | be part of a chain. It is important to remove these files from disk, eventually. | |
325 | The main reason to delay removal is that another process could read the | |
326 | `commit-graph-chain` file before it is rewritten, but then look for the | |
327 | `graph-{hash}` files after they are deleted. | |
328 | ||
329 | To allow holding old split commit-graphs for a while after they are unreferenced, | |
330 | we update the modified times of the files when they become unreferenced. Then, | |
331 | we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}` | |
332 | files whose modified times are older than a given expiry window. This window | |
333 | defaults to zero, but can be changed using command-line arguments or a config | |
334 | setting. | |
335 | ||
c523035c DS |
336 | ## Chains across multiple object directories |
337 | ||
338 | In a repo with alternates, we look for the `commit-graph-chain` file starting | |
339 | in the local object directory and then in each alternate. The first file that | |
340 | exists defines our chain. As we look for the `graph-{hash}` files for | |
341 | each `{hash}` in the chain file, we follow the same pattern for the host | |
342 | directories. | |
343 | ||
344 | This allows commit-graphs to be split across multiple forks in a fork network. | |
345 | The typical case is a large "base" repo with many smaller forks. | |
346 | ||
347 | As the base repo advances, it will likely update and merge its commit-graph | |
348 | chain more frequently than the forks. If a fork updates their commit-graph after | |
349 | the base repo, then it should "reparent" the commit-graph chain onto the new | |
350 | chain in the base repo. When reading each `graph-{hash}` file, we track | |
351 | the object directory containing it. During a write of a new commit-graph file, | |
352 | we check for any changes in the source object directory and read the | |
353 | `commit-graph-chain` file for that source and create a new file based on those | |
354 | files. During this "reparent" operation, we necessarily need to collapse all | |
355 | levels in the fork, as all of the files are invalid against the new base file. | |
356 | ||
357 | It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph` | |
358 | files in this scenario. It falls to the user to define the proper settings for | |
359 | their custom environment: | |
360 | ||
361 | 1. When merging levels in the base repo, the unreferenced files may still be | |
362 | referenced by chains from fork repos. | |
363 | ||
364 | 2. The expiry time should be set to a length of time such that every fork has | |
365 | time to recompute their commit-graph chain to "reparent" onto the new base | |
366 | file(s). | |
367 | ||
368 | 3. If the commit-graph chain is updated in the base, the fork will not have | |
369 | access to the new chain until its chain is updated to reference those files. | |
370 | (This may change in the future [5].) | |
371 | ||
ae30d7b1 DS |
372 | Related Links |
373 | ------------- | |
374 | [0] https://bugs.chromium.org/p/git/issues/detail?id=8 | |
375 | Chromium work item for: Serialized Commit Graph | |
376 | ||
3eae30e4 | 377 | [1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/ |
ae30d7b1 DS |
378 | An abandoned patch that introduced generation numbers. |
379 | ||
3eae30e4 | 380 | [2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ |
ae30d7b1 DS |
381 | Discussion about generation numbers on commits and how they interact |
382 | with fsck. | |
383 | ||
3eae30e4 | 384 | [3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ |
ae30d7b1 DS |
385 | More discussion about generation numbers and not storing them inside |
386 | commit objects. A valuable quote: | |
387 | ||
388 | "I think we should be moving more in the direction of keeping | |
389 | repo-local caches for optimizations. Reachability bitmaps have been | |
390 | a big performance win. I think we should be doing the same with our | |
391 | properties of commits. Not just generation numbers, but making it | |
392 | cheap to access the graph structure without zlib-inflating whole | |
393 | commit objects (i.e., packv4 or something like the "metapacks" I | |
394 | proposed a few years ago)." | |
395 | ||
3eae30e4 | 396 | [4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u |
ae30d7b1 | 397 | A patch to remove the ahead-behind calculation from 'status'. |
c523035c | 398 | |
3eae30e4 | 399 | [5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/ |
c523035c DS |
400 | A discussion of a "two-dimensional graph position" that can allow reading |
401 | multiple commit-graph chains at the same time. |