Commit | Line | Data |
---|---|---|
637fc446 JH |
1 | Partial Clone Design Notes |
2 | ========================== | |
3 | ||
4 | The "Partial Clone" feature is a performance optimization for Git that | |
5 | allows Git to function without having a complete copy of the repository. | |
6 | The goal of this work is to allow Git to better handle extremely large | |
7 | repositories. | |
8 | ||
9 | During clone and fetch operations, Git downloads the complete contents | |
10 | and history of the repository. This includes all commits, trees, and | |
11 | blobs for the complete life of the repository. For extremely large | |
12 | repositories, clones can take hours (or days) and consume 100+GiB of disk | |
13 | space. | |
14 | ||
15 | Often in these repositories there are many blobs and trees that the user | |
16 | does not need, such as: | |
17 | ||
18 | 1. files outside of the user's work area in the tree. For example, in | |
19 | a repository with 500K directories and 3.5M files in every commit, | |
20 | we can avoid downloading many objects if the user only needs a | |
21 | narrow "cone" of the source tree. | |
22 | ||
23 | 2. large binary assets. For example, in a repository where large build | |
24 | artifacts are checked into the tree, we can avoid downloading all | |
25 | previous versions of these non-mergeable binary assets and only | |
26 | download versions that are actually referenced. | |
27 | ||
28 | Partial clone allows us to avoid downloading such unneeded objects *in | |
29 | advance* during clone and fetch operations and thereby reduce download | |
30 | times and disk usage. Missing objects can later be "demand fetched" | |
31 | if/when needed. | |
32 | ||
33 | Use of partial clone requires that the user be online and the origin | |
34 | remote be available for on-demand fetching of missing objects. This may | |
35 | or may not be problematic for the user. For example, if the user can | |
36 | stay within the pre-selected subset of the source tree, they may not | |
37 | encounter any missing objects. Alternatively, the user could try to | |
38 | pre-fetch various objects if they know that they are going offline. | |
39 | ||
40 | ||
41 | Non-Goals | |
42 | --------- | |
43 | ||
44 | Partial clone is a mechanism to limit the number of blobs and trees downloaded | |
45 | *within* a given range of commits -- and is therefore independent of and not | |
46 | intended to conflict with existing DAG-level mechanisms to limit the set of | |
47 | requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). | |
48 | ||
49 | ||
50 | Design Overview | |
51 | --------------- | |
52 | ||
53 | Partial clone logically consists of the following parts: | |
54 | ||
55 | - A mechanism for the client to describe unneeded or unwanted objects to | |
56 | the server. | |
57 | ||
58 | - A mechanism for the server to omit such unwanted objects from packfiles | |
59 | sent to the client. | |
60 | ||
61 | - A mechanism for the client to gracefully handle missing objects (that | |
62 | were previously omitted by the server). | |
63 | ||
64 | - A mechanism for the client to backfill missing objects as needed. | |
65 | ||
66 | ||
67 | Design Details | |
68 | -------------- | |
69 | ||
70 | - A new pack-protocol capability "filter" is added to the fetch-pack and | |
71 | upload-pack negotiation. | |
5641eb94 JN |
72 | + |
73 | This uses the existing capability discovery mechanism. | |
74 | See "filter" in Documentation/technical/pack-protocol.txt. | |
637fc446 JH |
75 | |
76 | - Clients pass a "filter-spec" to clone and fetch which is passed to the | |
77 | server to request filtering during packfile construction. | |
5641eb94 JN |
78 | + |
79 | There are various filters available to accommodate different situations. | |
80 | See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. | |
637fc446 JH |
81 | |
82 | - On the server pack-objects applies the requested filter-spec as it | |
83 | creates "filtered" packfiles for the client. | |
5641eb94 JN |
84 | + |
85 | These filtered packfiles are *incomplete* in the traditional sense because | |
86 | they may contain objects that reference objects not contained in the | |
87 | packfile and that the client doesn't already have. For example, the | |
88 | filtered packfile may contain trees or tags that reference missing blobs | |
89 | or commits that reference missing trees. | |
637fc446 JH |
90 | |
91 | - On the client these incomplete packfiles are marked as "promisor packfiles" | |
92 | and treated differently by various commands. | |
93 | ||
94 | - On the client a repository extension is added to the local config to | |
95 | prevent older versions of git from failing mid-operation because of | |
96 | missing objects that they cannot handle. | |
97 | See "extensions.partialClone" in Documentation/technical/repository-version.txt. | |
98 | ||
99 | ||
100 | Handling Missing Objects | |
101 | ------------------------ | |
102 | ||
103 | - An object may be missing due to a partial clone or fetch, or missing due | |
104 | to repository corruption. To differentiate these cases, the local | |
105 | repository specially marks filtered packfiles obtained from the | |
106 | promisor remote as "promisor packfiles". | |
5641eb94 JN |
107 | + |
108 | These promisor packfiles consist of a "<name>.promisor" file with | |
109 | arbitrary contents (like the "<name>.keep" files), in addition to | |
110 | their "<name>.pack" and "<name>.idx" files. | |
637fc446 JH |
111 | |
112 | - The local repository considers a "promisor object" to be an object that | |
113 | it knows (to the best of its ability) that the promisor remote has promised | |
114 | that it has, either because the local repository has that object in one of | |
115 | its promisor packfiles, or because another promisor object refers to it. | |
5641eb94 | 116 | + |
1747125e | 117 | When Git encounters a missing object, Git can see if it is a promisor object |
5641eb94 JN |
118 | and handle it appropriately. If not, Git can report a corruption. |
119 | + | |
120 | This means that there is no need for the client to explicitly maintain an | |
121 | expensive-to-modify list of missing objects.[a] | |
637fc446 JH |
122 | |
123 | - Since almost all Git code currently expects any referenced object to be | |
124 | present locally and because we do not want to force every command to do | |
125 | a dry-run first, a fallback mechanism is added to allow Git to attempt | |
126 | to dynamically fetch missing objects from the promisor remote. | |
5641eb94 JN |
127 | + |
128 | When the normal object lookup fails to find an object, Git invokes | |
129 | fetch-object to try to get the object from the server and then retry | |
130 | the object lookup. This allows objects to be "faulted in" without | |
131 | complicated prediction algorithms. | |
132 | + | |
133 | For efficiency reasons, no check as to whether the missing object is | |
134 | actually a promisor object is performed. | |
135 | + | |
136 | Dynamic object fetching tends to be slow as objects are fetched one at | |
137 | a time. | |
637fc446 JH |
138 | |
139 | - `checkout` (and any other command using `unpack-trees`) has been taught | |
140 | to bulk pre-fetch all required missing blobs in a single batch. | |
141 | ||
142 | - `rev-list` has been taught to print missing objects. | |
5641eb94 JN |
143 | + |
144 | This can be used by other commands to bulk prefetch objects. | |
145 | For example, a "git log -p A..B" may internally want to first do | |
146 | something like "git rev-list --objects --quiet --missing=print A..B" | |
147 | and prefetch those objects in bulk. | |
637fc446 JH |
148 | |
149 | - `fsck` has been updated to be fully aware of promisor objects. | |
150 | ||
151 | - `repack` in GC has been updated to not touch promisor packfiles at all, | |
152 | and to only repack other objects. | |
153 | ||
154 | - The global variable "fetch_if_missing" is used to control whether an | |
155 | object lookup will attempt to dynamically fetch a missing object or | |
156 | report an error. | |
5641eb94 JN |
157 | + |
158 | We are not happy with this global variable and would like to remove it, | |
159 | but that requires significant refactoring of the object code to pass an | |
160 | additional flag. We hope that concurrent efforts to add an ODB API can | |
161 | encompass this. | |
637fc446 JH |
162 | |
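The promisor inference and the fallback lookup described in the list above can be sketched as a small in-memory model. This is an illustration only, not Git's implementation: the `Obj` and `Repo` classes, the `remote` dictionary standing in for the promisor remote, and the choice to record a fetched object as promisor-packed are all assumptions made for the example. Only the `fetch_if_missing` name comes from the notes.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """A stored object and the OIDs it references (e.g. tree -> blobs)."""
    oid: str
    refs: tuple = ()


@dataclass
class Repo:
    # Objects that arrived in promisor packfiles vs. all other local objects.
    promisor_packed: dict = field(default_factory=dict)
    other: dict = field(default_factory=dict)
    # Hypothetical stand-in for the promisor remote: maps OID -> Obj.
    remote: dict = field(default_factory=dict)
    fetch_if_missing: bool = True

    def is_promisor_object(self, oid):
        """True if oid is in a promisor pack or referenced from one."""
        if oid in self.promisor_packed:
            return True
        return any(oid in o.refs for o in self.promisor_packed.values())

    def classify_missing(self, oid):
        """A missing object is either promised by the remote or corruption."""
        return "promised" if self.is_promisor_object(oid) else "corrupt"

    def lookup(self, oid):
        """Find an object locally, faulting it in from the remote if allowed."""
        obj = self.promisor_packed.get(oid) or self.other.get(oid)
        if obj is not None:
            return obj
        if not self.fetch_if_missing:
            raise KeyError(f"missing object {oid}")
        # As in the notes, no is-promisor check is made before fetching.
        fetched = self.remote.get(oid)
        if fetched is None:
            raise KeyError(f"could not fetch {oid}")
        # Assumption: the fetched object is stored as promisor-packed.
        self.promisor_packed[oid] = fetched
        return fetched
```

With a tree `t1` in a promisor pack that references blobs `b1` and `b2`, both blobs count as promisor objects even though neither is present locally, so no separate list of missing objects is needed.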
163 | ||
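The bulk-prefetch idea mentioned for `rev-list` above could be driven by a small parser like the one below. It is a sketch that assumes missing objects are reported one per line with a leading "?" (the format documented for `--missing=print`); the function name `missing_oids` and the sample text are made up for illustration, and no Git command is actually run.

```python
def missing_oids(rev_list_output):
    """Collect OIDs that `--missing=print` marked with a leading '?'.

    Duplicates are dropped and first-seen order is kept, so the result
    can be handed to a single batched fetch.
    """
    seen = set()
    ordered = []
    for line in rev_list_output.splitlines():
        if not line.startswith("?"):
            continue  # a present object, or an unrelated line
        fields = line[1:].split()
        if not fields:
            continue  # a bare '?' with no OID; ignore it
        oid = fields[0]
        if oid not in seen:
            seen.add(oid)
            ordered.append(oid)
    return ordered


# Hypothetical sample output: one present object, two missing ones.
sample = (
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad\n"
    "?8ab686eafeb1f44702738c8b0f24f2567c36da6d\n"
    "?4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "?8ab686eafeb1f44702738c8b0f24f2567c36da6d\n"
)
print(missing_oids(sample))
```

Collecting the OIDs first lets a command request them in one batch instead of faulting them in one at a time.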
164 | Fetching Missing Objects | |
165 | ------------------------ | |
166 | ||
167 | - Fetching of objects is done using the existing transport mechanism using | |
168 | transport_fetch_refs(), setting a new transport option | |
169 | TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are | |
170 | desired, not any object that they refer to. | |
5641eb94 JN |
171 | + |
172 | Because some transports invoke fetch_pack() in the same process, fetch_pack() | |
173 | has been updated to not use any object flags when the corresponding argument | |
174 | (no_dependents) is set. | |
637fc446 JH |
175 | |
176 | - The local repository sends a request with the hashes of all requested | |
177 | objects as "want" lines, and does not perform any packfile negotiation. | |
178 | It then receives a packfile. | |
179 | ||
180 | - Because we are reusing the existing fetch-pack mechanism, fetching | |
181 | currently fetches all objects referred to by the requested objects, even | |
182 | though they are not necessary. | |
183 | ||
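The request described above (only "want" lines, no negotiation) can be illustrated with pkt-line framing, where each line carries a four-digit hexadecimal length that includes the four length bytes themselves. This is a simplified sketch: the capability list that a real client appends to its first "want" line is omitted, and the helper names `pkt_line` and `build_want_request` are hypothetical.

```python
FLUSH_PKT = b"0000"  # a zero-length pkt-line ends the list of wants


def pkt_line(payload):
    """Frame a payload as a pkt-line: 4 hex length digits, then the data."""
    data = payload.encode("ascii")
    return b"%04x" % (len(data) + 4) + data


def build_want_request(oids):
    """Build a request naming exactly the wanted objects.

    No "have" lines are sent, so there is no packfile negotiation:
    the wants are followed directly by a flush and "done".
    """
    wants = b"".join(pkt_line(f"want {oid}\n") for oid in oids)
    return wants + FLUSH_PKT + pkt_line("done\n")


oid = "8ab686eafeb1f44702738c8b0f24f2567c36da6d"
print(build_want_request([oid]))
```

For a 40-character hexadecimal OID, each "want" line is 50 bytes long, which gives the familiar `0032` length prefix.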
184 | ||
185 | Current Limitations | |
186 | ------------------- | |
187 | ||
188 | - The remote used for a partial clone (or the first partial fetch | |
189 | following a regular clone) is marked as the "promisor remote". | |
5641eb94 JN |
190 | + |
191 | We are currently limited to a single promisor remote and only that | |
192 | remote may be used for subsequent partial fetches. | |
193 | + | |
194 | We accept this limitation because we believe initial users of this | |
195 | feature will be using it on repositories with a strong single central | |
196 | server. | |
637fc446 JH |
197 | |
198 | - Dynamic object fetching will only ask the promisor remote for missing | |
199 | objects. We assume that the promisor remote has a complete view of the | |
200 | repository and can satisfy all such requests. | |
201 | ||
202 | - Repack essentially treats promisor and non-promisor packfiles as 2 | |
203 | distinct partitions and does not mix them. Repack currently only works | |
204 | on non-promisor packfiles and loose objects. | |
205 | ||
206 | - Dynamic object fetching invokes fetch-pack once *for each item* | |
207 | because most algorithms stumble upon a missing object and need to have | |
208 | it resolved before continuing their work. This may incur significant | |
209 | overhead -- and multiple authentication requests -- if many objects are | |
210 | needed. | |
211 | ||
212 | - Dynamic object fetching currently uses the existing pack protocol V0 | |
213 | which means that each object is requested via fetch-pack. The server | |
214 | will send a full set of info/refs when the connection is established. | |
215 | If there are a large number of refs, this may incur significant overhead. | |
216 | ||
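The two partitions that repack keeps apart can be identified from the pack directory alone, because a promisor packfile has a "<name>.promisor" file next to its "<name>.pack" and "<name>.idx" files. The sketch below assumes a directory laid out that way; the function name `partition_packs` is hypothetical and the code only inspects file names, not pack contents.

```python
from pathlib import Path


def partition_packs(pack_dir):
    """Split packfiles into (promisor, non-promisor) lists by marker file."""
    promisor, regular = [], []
    for pack in sorted(Path(pack_dir).glob("*.pack")):
        # "<name>.pack" is a promisor pack if "<name>.promisor" exists.
        marker = pack.with_suffix(".promisor")
        (promisor if marker.exists() else regular).append(pack.name)
    return promisor, regular
```

A call such as `partition_packs(".git/objects/pack")` would return the two lists for a local repository. Under the limitation above, only the second list (plus loose objects) is eligible for repacking.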
217 | ||
218 | Future Work | |
219 | ----------- | |
220 | ||
221 | - Allow more than one promisor remote and define a strategy for fetching | |
222 | missing objects from specific promisor remotes or for iterating over the | |
223 | set of promisor remotes until a missing object is found. | |
5641eb94 JN |
224 | + |
225 | A user might want to have multiple geographically-close cache servers | |
226 | for fetching missing blobs while continuing to do filtered `git-fetch` | |
227 | commands from the central server, for example. | |
228 | + | |
229 | Or the user might want to work in a triangular work flow with multiple | |
230 | promisor remotes that each have an incomplete view of the repository. | |
637fc446 JH |
231 | |
232 | - Allow repack to work on promisor packfiles (while keeping them distinct | |
233 | from non-promisor packfiles). | |
234 | ||
235 | - Allow non-pathname-based filters to make use of packfile bitmaps (when | |
236 | present). This was just an omission during the initial implementation. | |
237 | ||
238 | - Investigate use of a long-running process to dynamically fetch a series | |
239 | of objects, such as proposed in [5,6] to reduce process startup and | |
240 | overhead costs. | |
5641eb94 JN |
241 | + |
242 | It would be nice if pack protocol V2 could allow that long-running | |
243 | process to make a series of requests over a single long-running | |
244 | connection. | |
637fc446 JH |
245 | |
246 | - Investigate pack protocol V2 to avoid the info/refs broadcast on | |
247 | each connection with the server to dynamically fetch missing objects. | |
248 | ||
249 | - Investigate the need to handle loose promisor objects. | |
5641eb94 JN |
250 | + |
251 | Objects in promisor packfiles are allowed to reference missing objects | |
252 | that can be dynamically fetched from the server. An assumption was | |
253 | made that loose objects are only created locally and therefore should | |
254 | not reference a missing object. We may need to revisit that assumption | |
255 | if, for example, we dynamically fetch a missing tree and store it as a | |
256 | loose object rather than a single object packfile. | |
257 | + | |
258 | This does not necessarily mean we need to mark loose objects as promisor; | |
259 | it may be sufficient to relax the object lookup or is-promisor functions. | |
637fc446 JH |
260 | |
261 | ||
262 | Non-Tasks | |
263 | --------- | |
264 | ||
265 | - Every time the subject of "demand loading blobs" comes up it seems | |
266 | that someone suggests that the server be allowed to "guess" and send | |
267 | additional objects that may be related to the requested objects. | |
5641eb94 JN |
268 | + |
269 | No work has gone into actually doing that; we're just documenting that | |
270 | it is a common suggestion. We're not sure how it would work and have | |
271 | no plans to work on it. | |
272 | + | |
273 | It is valid for the server to send more objects than requested (even | |
274 | for a dynamic object fetch), but we are not building on that. | |
637fc446 JH |
275 | |
276 | ||
277 | Footnotes | |
278 | --------- | |
279 | ||
280 | [a] expensive-to-modify list of missing objects: Earlier in the design of | |
281 | partial clone we discussed the need for a single list of missing objects. | |
282 | This would essentially be a sorted linear list of OIDs that were | |
283 | omitted by the server during a clone or subsequent fetches. | |
284 | ||
5641eb94 JN |
285 | This file would need to be loaded into memory on every object lookup. |
286 | It would need to be read, updated, and re-written (like the .git/index) | |
287 | on every explicit "git fetch" command *and* on any dynamic object fetch. | |
637fc446 | 288 | |
5641eb94 JN |
289 | The cost to read, update, and write this file could add significant |
290 | overhead to every command if there are many missing objects. For example, | |
291 | if there are 100M missing blobs, this file would be at least 2GiB on disk. | |
637fc446 | 292 | |
5641eb94 JN |
293 | With the "promisor" concept, we *infer* a missing object based upon the |
294 | type of packfile that references it. | |
637fc446 JH |
295 | |
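The size estimate in footnote [a] can be checked with a short calculation. The sketch assumes each entry is a raw 20-byte SHA-1 object ID with no per-entry overhead, which the footnote does not state; a textual or annotated format would be larger.

```python
OID_BYTES = 20  # assumption: raw (binary) SHA-1 object IDs


def missing_list_size_bytes(num_missing):
    """Lower bound on the size of a flat list of missing OIDs."""
    return num_missing * OID_BYTES


size = missing_list_size_bytes(100_000_000)
print(size)                    # 2000000000
print(round(size / 2**30, 2))  # 1.86
```

Under that assumption the list is 2,000,000,000 bytes, or about 1.86 GiB, which matches the order of magnitude given in the footnote. The file would have to be rewritten on every fetch.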
296 | ||
297 | Related Links | |
298 | ------------- | |
5641eb94 JN |
299 | [0] https://crbug.com/git/2 |
300 | Bug#2: Partial Clone | |
637fc446 | 301 | |
5641eb94 JN |
302 | [1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + |
303 | Subject: [RFC] Add support for downloading blobs on demand + | |
637fc446 JH |
304 | Date: Fri, 13 Jan 2017 10:52:53 -0500 |
305 | ||
5641eb94 JN |
306 | [2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ + |
307 | Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + | |
637fc446 JH |
308 | Date: Fri, 29 Sep 2017 13:11:36 -0700 |
309 | ||
5641eb94 JN |
310 | [3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + |
311 | Subject: Proposal for missing blob support in Git repos + | |
637fc446 JH |
312 | Date: Wed, 26 Apr 2017 15:13:46 -0700 |
313 | ||
5641eb94 JN |
314 | [4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + |
315 | Subject: [PATCH 00/10] RFC Partial Clone and Fetch + | |
637fc446 JH |
316 | Date: Wed, 8 Mar 2017 18:50:29 +0000 |
317 | ||
5641eb94 JN |
318 | [5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + |
319 | Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + | |
637fc446 JH |
320 | Date: Fri, 5 May 2017 11:27:52 -0400 |
321 | ||
5641eb94 JN |
322 | [6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + |
323 | Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + | |
637fc446 | 324 | Date: Fri, 14 Jul 2017 09:26:50 -0400 |