]>
Commit | Line | Data |
---|---|---|
ae30d7b1 DS |
1 | Git Commit Graph Design Notes |
2 | ============================= | |
3 | ||
4 | Git walks the commit graph for many reasons, including: | |
5 | ||
6 | 1. Listing and filtering commit history. | |
7 | 2. Computing merge bases. | |
8 | ||
9 | These operations can become slow as the commit count grows. The merge | |
10 | base calculation shows up in many user-facing commands, such as 'merge-base' | |
11 | or 'status' and can take minutes to compute depending on history shape. | |
12 | ||
13 | There are two main costs here: | |
14 | ||
15 | 1. Decompressing and parsing commits. | |
16 | 2. Walking the entire graph to satisfy topological order constraints. | |
17 | ||
4c399442 | 18 | The commit-graph file is a supplemental data structure that accelerates |
ae30d7b1 DS |
19 | commit graph walks. If a user downgrades or disables the 'core.commitGraph' |
20 | config setting, then the existing ODB is sufficient. The file is stored | |
21 | as "commit-graph" either in the .git/objects/info directory or in the info | |
22 | directory of an alternate. | |
23 | ||
4c399442 | 24 | The commit-graph file stores the commit graph structure along with some |
ae30d7b1 DS |
25 | extra metadata to speed up graph walks. By listing commit OIDs in lexi- |
26 | cographic order, we can identify an integer position for each commit and | |
27 | refer to the parents of a commit using those integer positions. We use | |
28 | binary search to find initial commits and then use the integer positions | |
29 | for fast lookups during the walk. | |
30 | ||
31 | A consumer may load the following info for a commit from the graph: | |
32 | ||
33 | 1. The commit OID. | |
34 | 2. The list of parents, along with their integer position. | |
35 | 3. The commit date. | |
36 | 4. The root tree OID. | |
37 | 5. The generation number (see definition below). | |
38 | ||
39 | Values 1-4 satisfy the requirements of parse_commit_gently(). | |
40 | ||
41 | Define the "generation number" of a commit recursively as follows: | |
42 | ||
43 | * A commit with no parents (a root commit) has generation number one. | |
44 | ||
45 | * A commit with at least one parent has generation number one more than | |
46 | the largest generation number among its parents. | |
47 | ||
48 | Equivalently, the generation number of a commit A is one more than the | |
49 | length of a longest path from A to a root commit. The recursive definition | |
50 | is easier to use for computation and observing the following property: | |
51 | ||
52 | If A and B are commits with generation numbers N and M, respectively, | |
53 | and N <= M, then A cannot reach B. That is, we know without searching | |
54 | that B is not an ancestor of A because it is further from a root commit | |
55 | than A. | |
56 | ||
57 | Conversely, when checking if A is an ancestor of B, then we only need | |
58 | to walk commits until all commits on the walk boundary have generation | |
59 | number at most N. If we walk commits using a priority queue seeded by | |
60 | generation numbers, then we always expand the boundary commit with highest | |
61 | generation number and can easily detect the stopping condition. | |
62 | ||
63 | This property can be used to significantly reduce the time it takes to | |
64 | walk commits and determine topological relationships. Without generation | |
65 | numbers, the general heuristic is the following: | |
66 | ||
67 | If A and B are commits with commit time X and Y, respectively, and | |
68 | X < Y, then A _probably_ cannot reach B. | |
69 | ||
70 | This heuristic is currently used whenever the computation is allowed to | |
71 | violate topological relationships due to clock skew (such as "git log" | |
72 | with default order), but is not used when the topological order is | |
73 | required (such as merge base calculations, "git log --graph"). | |
74 | ||
75 | In practice, we expect some commits to be created recently and not stored | |
76 | in the commit graph. We can treat these commits as having "infinite" | |
77 | generation number and walk until reaching commits with known generation | |
78 | number. | |
79 | ||
1472978e DS |
80 | We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not |
81 | in the commit-graph file. If a commit-graph file was written by a version | |
82 | of Git that did not compute generation numbers, then those commits will | |
83 | have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. | |
84 | ||
85 | Since the commit-graph file is closed under reachability, we can guarantee | |
86 | the following weaker condition on all commits: | |
87 | ||
88 | If A and B are commits with generation numbers N amd M, respectively, | |
89 | and N < M, then A cannot reach B. | |
90 | ||
91 | Note how the strict inequality differs from the inequality when we have | |
92 | fully-computed generation numbers. Using strict inequality may result in | |
93 | walking a few extra commits, but the simplicity in dealing with commits | |
94 | with generation number *_INFINITY or *_ZERO is valuable. | |
95 | ||
96 | We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose | |
97 | generation numbers are computed to be at least this value. We limit at | |
98 | this value since it is the largest value that can be stored in the | |
99 | commit-graph file using the 30 bits available to generation numbers. This | |
100 | presents another case where a commit can have generation number equal to | |
101 | that of a parent. | |
102 | ||
ae30d7b1 DS |
103 | Design Details |
104 | -------------- | |
105 | ||
4c399442 | 106 | - The commit-graph file is stored in a file named 'commit-graph' in the |
ae30d7b1 DS |
107 | .git/objects/info directory. This could be stored in the info directory |
108 | of an alternate. | |
109 | ||
110 | - The core.commitGraph config setting must be on to consume graph files. | |
111 | ||
112 | - The file format includes parameters for the object ID hash function, | |
113 | so a future change of hash algorithm does not require a change in format. | |
114 | ||
950c62bd DS |
115 | - Commit grafts and replace objects can change the shape of the commit |
116 | history. The latter can also be enabled/disabled on the fly using | |
117 | `--no-replace-objects`. This leads to difficultly storing both possible | |
118 | interpretations of a commit id, especially when computing generation | |
119 | numbers. The commit-graph will not be read or written when | |
120 | replace-objects or grafts are present. | |
121 | ||
122 | - Shallow clones create grafts of commits by dropping their parents. This | |
123 | leads the commit-graph to think those commits have generation number 1. | |
124 | If and when those commits are made unshallow, those generation numbers | |
125 | become invalid. Since shallow clones are intended to restrict the commit | |
126 | history to a very small set of commits, the commit-graph feature is less | |
127 | helpful for these clones, anyway. The commit-graph will not be read or | |
128 | written when shallow commits are present. | |
129 | ||
ae30d7b1 DS |
130 | Related Links |
131 | ------------- | |
132 | [0] https://bugs.chromium.org/p/git/issues/detail?id=8 | |
133 | Chromium work item for: Serialized Commit Graph | |
134 | ||
135 | [1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ | |
136 | An abandoned patch that introduced generation numbers. | |
137 | ||
138 | [2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ | |
139 | Discussion about generation numbers on commits and how they interact | |
140 | with fsck. | |
141 | ||
142 | [3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ | |
143 | More discussion about generation numbers and not storing them inside | |
144 | commit objects. A valuable quote: | |
145 | ||
146 | "I think we should be moving more in the direction of keeping | |
147 | repo-local caches for optimizations. Reachability bitmaps have been | |
148 | a big performance win. I think we should be doing the same with our | |
149 | properties of commits. Not just generation numbers, but making it | |
150 | cheap to access the graph structure without zlib-inflating whole | |
151 | commit objects (i.e., packv4 or something like the "metapacks" I | |
152 | proposed a few years ago)." | |
153 | ||
154 | [4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u | |
155 | A patch to remove the ahead-behind calculation from 'status'. |