]>
Commit | Line | Data |
---|---|---|
ae30d7b1 DS |
1 | Git Commit Graph Design Notes |
2 | ============================= | |
3 | ||
4 | Git walks the commit graph for many reasons, including: | |
5 | ||
6 | 1. Listing and filtering commit history. | |
7 | 2. Computing merge bases. | |
8 | ||
9 | These operations can become slow as the commit count grows. The merge | |
10 | base calculation shows up in many user-facing commands, such as 'merge-base' | |
11 | or 'status' and can take minutes to compute depending on history shape. | |
12 | ||
13 | There are two main costs here: | |
14 | ||
15 | 1. Decompressing and parsing commits. | |
16 | 2. Walking the entire graph to satisfy topological order constraints. | |
17 | ||
4c399442 | 18 | The commit-graph file is a supplemental data structure that accelerates |
ae30d7b1 DS |
19 | commit graph walks. If a user downgrades or disables the 'core.commitGraph' |
20 | config setting, then the existing ODB is sufficient. The file is stored | |
21 | as "commit-graph" either in the .git/objects/info directory or in the info | |
22 | directory of an alternate. | |
23 | ||
4c399442 | 24 | The commit-graph file stores the commit graph structure along with some |
031fd4b9 EN |
25 | extra metadata to speed up graph walks. By listing commit OIDs in |
26 | lexicographic order, we can identify an integer position for each commit | |
27 | and refer to the parents of a commit using those integer positions. We | |
28 | use binary search to find initial commits and then use the integer | |
29 | positions for fast lookups during the walk. | |
ae30d7b1 DS |
30 | |
31 | A consumer may load the following info for a commit from the graph: | |
32 | ||
33 | 1. The commit OID. | |
34 | 2. The list of parents, along with their integer position. | |
35 | 3. The commit date. | |
36 | 4. The root tree OID. | |
37 | 5. The generation number (see definition below). | |
38 | ||
39 | Values 1-4 satisfy the requirements of parse_commit_gently(). | |
40 | ||
41 | Define the "generation number" of a commit recursively as follows: | |
42 | ||
43 | * A commit with no parents (a root commit) has generation number one. | |
44 | ||
45 | * A commit with at least one parent has generation number one more than | |
46 | the largest generation number among its parents. | |
47 | ||
48 | Equivalently, the generation number of a commit A is one more than the | |
49 | length of a longest path from A to a root commit. The recursive definition | |
50 | is easier to use for computation and observing the following property: | |
51 | ||
52 | If A and B are commits with generation numbers N and M, respectively, | |
53 | and N <= M, then A cannot reach B. That is, we know without searching | |
54 | that B is not an ancestor of A because it is further from a root commit | |
55 | than A. | |
56 | ||
57 | Conversely, when checking if A is an ancestor of B, then we only need | |
58 | to walk commits until all commits on the walk boundary have generation | |
59 | number at most N. If we walk commits using a priority queue seeded by | |
60 | generation numbers, then we always expand the boundary commit with highest | |
61 | generation number and can easily detect the stopping condition. | |
62 | ||
63 | This property can be used to significantly reduce the time it takes to | |
64 | walk commits and determine topological relationships. Without generation | |
65 | numbers, the general heuristic is the following: | |
66 | ||
67 | If A and B are commits with commit time X and Y, respectively, and | |
68 | X < Y, then A _probably_ cannot reach B. | |
69 | ||
70 | This heuristic is currently used whenever the computation is allowed to | |
71 | violate topological relationships due to clock skew (such as "git log" | |
72 | with default order), but is not used when the topological order is | |
73 | required (such as merge base calculations, "git log --graph"). | |
74 | ||
75 | In practice, we expect some commits to be created recently and not stored | |
76 | in the commit graph. We can treat these commits as having "infinite" | |
77 | generation number and walk until reaching commits with known generation | |
78 | number. | |
79 | ||
1472978e DS |
80 | We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not |
81 | in the commit-graph file. If a commit-graph file was written by a version | |
82 | of Git that did not compute generation numbers, then those commits will | |
83 | have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. | |
84 | ||
85 | Since the commit-graph file is closed under reachability, we can guarantee | |
86 | the following weaker condition on all commits: | |
87 | ||
031fd4b9 | 88 | If A and B are commits with generation numbers N and M, respectively, |
1472978e DS |
89 | and N < M, then A cannot reach B. |
90 | ||
91 | Note how the strict inequality differs from the inequality when we have | |
92 | fully-computed generation numbers. Using strict inequality may result in | |
93 | walking a few extra commits, but the simplicity in dealing with commits | |
94 | with generation number *_INFINITY or *_ZERO is valuable. | |
95 | ||
96 | We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose | |
97 | generation numbers are computed to be at least this value. We limit at | |
98 | this value since it is the largest value that can be stored in the | |
99 | commit-graph file using the 30 bits available to generation numbers. This | |
100 | presents another case where a commit can have generation number equal to | |
101 | that of a parent. | |
102 | ||
ae30d7b1 DS |
103 | Design Details |
104 | -------------- | |
105 | ||
4c399442 | 106 | - The commit-graph file is stored in a file named 'commit-graph' in the |
ae30d7b1 DS |
107 | .git/objects/info directory. This could be stored in the info directory |
108 | of an alternate. | |
109 | ||
110 | - The core.commitGraph config setting must be on to consume graph files. | |
111 | ||
112 | - The file format includes parameters for the object ID hash function, | |
113 | so a future change of hash algorithm does not require a change in format. | |
114 | ||
950c62bd DS |
115 | - Commit grafts and replace objects can change the shape of the commit |
116 | history. The latter can also be enabled/disabled on the fly using | |
117 | `--no-replace-objects`. This leads to difficultly storing both possible | |
118 | interpretations of a commit id, especially when computing generation | |
119 | numbers. The commit-graph will not be read or written when | |
120 | replace-objects or grafts are present. | |
121 | ||
122 | - Shallow clones create grafts of commits by dropping their parents. This | |
123 | leads the commit-graph to think those commits have generation number 1. | |
124 | If and when those commits are made unshallow, those generation numbers | |
125 | become invalid. Since shallow clones are intended to restrict the commit | |
126 | history to a very small set of commits, the commit-graph feature is less | |
127 | helpful for these clones, anyway. The commit-graph will not be read or | |
128 | written when shallow commits are present. | |
129 | ||
890345ac DS |
130 | Commit Graphs Chains |
131 | -------------------- | |
132 | ||
133 | Typically, repos grow with near-constant velocity (commits per day). Over time, | |
134 | the number of commits added by a fetch operation is much smaller than the | |
135 | number of commits in the full history. By creating a "chain" of commit-graphs, | |
136 | we enable fast writes of new commit data without rewriting the entire commit | |
137 | history -- at least, most of the time. | |
138 | ||
139 | ## File Layout | |
140 | ||
141 | A commit-graph chain uses multiple files, and we use a fixed naming convention | |
142 | to organize these files. Each commit-graph file has a name | |
143 | `$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex- | |
144 | valued hash stored in the footer of that file (which is a hash of the file's | |
145 | contents before that hash). For a chain of commit-graph files, a plain-text | |
146 | file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the | |
147 | hashes for the files in order from "lowest" to "highest". | |
148 | ||
149 | For example, if the `commit-graph-chain` file contains the lines | |
150 | ||
151 | ``` | |
152 | {hash0} | |
153 | {hash1} | |
154 | {hash2} | |
155 | ``` | |
156 | ||
157 | then the commit-graph chain looks like the following diagram: | |
158 | ||
159 | +-----------------------+ | |
160 | | graph-{hash2}.graph | | |
161 | +-----------------------+ | |
162 | | | |
163 | +-----------------------+ | |
164 | | | | |
165 | | graph-{hash1}.graph | | |
166 | | | | |
167 | +-----------------------+ | |
168 | | | |
169 | +-----------------------+ | |
170 | | | | |
171 | | | | |
172 | | | | |
173 | | graph-{hash0}.graph | | |
174 | | | | |
175 | | | | |
176 | | | | |
177 | +-----------------------+ | |
178 | ||
179 | Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of | |
180 | commits in `graph-{hash1}.graph`, and X2 be the number of commits in | |
181 | `graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`, | |
182 | then we interpret this as being the commit in position (X0 + X1 + i), and that | |
183 | will be used as its "graph position". The commits in `graph-{hash2}.graph` use these | |
184 | positions to refer to their parents, which may be in `graph-{hash1}.graph` or | |
185 | `graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking | |
186 | its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + | |
187 | X2). | |
188 | ||
1771be90 DS |
189 | Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data |
190 | specifying the hashes of all files in the lower layers. In the above example, | |
191 | `graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains | |
192 | `{hash0}` and `{hash1}`. | |
193 | ||
194 | ## Merging commit-graph files | |
195 | ||
196 | If we only added a new commit-graph file on every write, we would run into a | |
197 | linear search problem through many commit-graph files. Instead, we use a merge | |
198 | strategy to decide when the stack should collapse some number of levels. | |
199 | ||
200 | The diagram below shows such a collapse. As a set of new commits are added, it | |
201 | is determined by the merge strategy that the files should collapse to | |
202 | `graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and | |
203 | the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` | |
204 | file. | |
205 | ||
206 | +---------------------+ | |
207 | | | | |
208 | | (new commits) | | |
209 | | | | |
210 | +---------------------+ | |
211 | | | | |
212 | +-----------------------+ +---------------------+ | |
213 | | graph-{hash2} |->| | | |
214 | +-----------------------+ +---------------------+ | |
215 | | | | | |
216 | +-----------------------+ +---------------------+ | |
217 | | | | | | |
218 | | graph-{hash1} |->| | | |
219 | | | | | | |
220 | +-----------------------+ +---------------------+ | |
221 | | tmp_graphXXX | |
222 | +-----------------------+ | |
223 | | | | |
224 | | | | |
225 | | | | |
226 | | graph-{hash0} | | |
227 | | | | |
228 | | | | |
229 | | | | |
230 | +-----------------------+ | |
231 | ||
232 | During this process, the commits to write are combined, sorted and we write the | |
233 | contents to a temporary file, all while holding a `commit-graph-chain.lock` | |
234 | lock-file. When the file is flushed, we rename it to `graph-{hash3}` | |
235 | according to the computed `{hash3}`. Finally, we write the new chain data to | |
236 | `commit-graph-chain.lock`: | |
237 | ||
238 | ``` | |
239 | {hash3} | |
240 | {hash0} | |
241 | ``` | |
242 | ||
243 | We then close the lock-file. | |
244 | ||
245 | ## Merge Strategy | |
246 | ||
247 | When writing a set of commits that do not exist in the commit-graph stack of | |
248 | height N, we default to creating a new file at level N + 1. We then decide to | |
249 | merge with the Nth level if one of two conditions hold: | |
250 | ||
c2bc6e6a DS |
251 | 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in |
252 | level N is less than X times the number of commits in level N + 1. | |
1771be90 | 253 | |
c2bc6e6a DS |
254 | 2. `--max-commits=<C>` is specified with non-zero C and the number of commits |
255 | in level N + 1 is more than C commits. | |
1771be90 DS |
256 | |
257 | This decision cascades down the levels: when we merge a level we create a new | |
258 | set of commits that then compares to the next level. | |
259 | ||
260 | The first condition bounds the number of levels to be logarithmic in the total | |
261 | number of commits. The second condition bounds the total number of commits in | |
262 | a `graph-{hashN}` file and not in the `commit-graph` file, preventing | |
263 | significant performance issues when the stack merges and another process only | |
264 | partially reads the previous stack. | |
265 | ||
266 | The merge strategy values (2 for the size multiple, 64,000 for the maximum | |
267 | number of commits) could be extracted into config settings for full | |
268 | flexibility. | |
269 | ||
8d84097f DS |
270 | ## Deleting graph-{hash} files |
271 | ||
272 | After a new tip file is written, some `graph-{hash}` files may no longer | |
273 | be part of a chain. It is important to remove these files from disk, eventually. | |
274 | The main reason to delay removal is that another process could read the | |
275 | `commit-graph-chain` file before it is rewritten, but then look for the | |
276 | `graph-{hash}` files after they are deleted. | |
277 | ||
278 | To allow holding old split commit-graphs for a while after they are unreferenced, | |
279 | we update the modified times of the files when they become unreferenced. Then, | |
280 | we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}` | |
281 | files whose modified times are older than a given expiry window. This window | |
282 | defaults to zero, but can be changed using command-line arguments or a config | |
283 | setting. | |
284 | ||
c523035c DS |
285 | ## Chains across multiple object directories |
286 | ||
287 | In a repo with alternates, we look for the `commit-graph-chain` file starting | |
288 | in the local object directory and then in each alternate. The first file that | |
289 | exists defines our chain. As we look for the `graph-{hash}` files for | |
290 | each `{hash}` in the chain file, we follow the same pattern for the host | |
291 | directories. | |
292 | ||
293 | This allows commit-graphs to be split across multiple forks in a fork network. | |
294 | The typical case is a large "base" repo with many smaller forks. | |
295 | ||
296 | As the base repo advances, it will likely update and merge its commit-graph | |
297 | chain more frequently than the forks. If a fork updates their commit-graph after | |
298 | the base repo, then it should "reparent" the commit-graph chain onto the new | |
299 | chain in the base repo. When reading each `graph-{hash}` file, we track | |
300 | the object directory containing it. During a write of a new commit-graph file, | |
301 | we check for any changes in the source object directory and read the | |
302 | `commit-graph-chain` file for that source and create a new file based on those | |
303 | files. During this "reparent" operation, we necessarily need to collapse all | |
304 | levels in the fork, as all of the files are invalid against the new base file. | |
305 | ||
306 | It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph` | |
307 | files in this scenario. It falls to the user to define the proper settings for | |
308 | their custom environment: | |
309 | ||
310 | 1. When merging levels in the base repo, the unreferenced files may still be | |
311 | referenced by chains from fork repos. | |
312 | ||
313 | 2. The expiry time should be set to a length of time such that every fork has | |
314 | time to recompute their commit-graph chain to "reparent" onto the new base | |
315 | file(s). | |
316 | ||
317 | 3. If the commit-graph chain is updated in the base, the fork will not have | |
318 | access to the new chain until its chain is updated to reference those files. | |
319 | (This may change in the future [5].) | |
320 | ||
ae30d7b1 DS |
321 | Related Links |
322 | ------------- | |
323 | [0] https://bugs.chromium.org/p/git/issues/detail?id=8 | |
324 | Chromium work item for: Serialized Commit Graph | |
325 | ||
3eae30e4 | 326 | [1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/ |
ae30d7b1 DS |
327 | An abandoned patch that introduced generation numbers. |
328 | ||
3eae30e4 | 329 | [2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ |
ae30d7b1 DS |
330 | Discussion about generation numbers on commits and how they interact |
331 | with fsck. | |
332 | ||
3eae30e4 | 333 | [3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ |
ae30d7b1 DS |
334 | More discussion about generation numbers and not storing them inside |
335 | commit objects. A valuable quote: | |
336 | ||
337 | "I think we should be moving more in the direction of keeping | |
338 | repo-local caches for optimizations. Reachability bitmaps have been | |
339 | a big performance win. I think we should be doing the same with our | |
340 | properties of commits. Not just generation numbers, but making it | |
341 | cheap to access the graph structure without zlib-inflating whole | |
342 | commit objects (i.e., packv4 or something like the "metapacks" I | |
343 | proposed a few years ago)." | |
344 | ||
3eae30e4 | 345 | [4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u |
ae30d7b1 | 346 | A patch to remove the ahead-behind calculation from 'status'. |
c523035c | 347 | |
3eae30e4 | 348 | [5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/ |
c523035c DS |
349 | A discussion of a "two-dimensional graph position" that can allow reading |
350 | multiple commit-graph chains at the same time. |