Commit | Line | Data |
---|---|---|
637fc446 JH |
1 | Partial Clone Design Notes |
2 | ========================== | |
3 | ||
4 | The "Partial Clone" feature is a performance optimization for Git that | |
5 | allows Git to function without having a complete copy of the repository. | |
6 | The goal of this work is to allow Git to better handle extremely large | |
7 | repositories. | |
8 | ||
9 | During clone and fetch operations, Git downloads the complete contents | |
10 | and history of the repository. This includes all commits, trees, and | |
11 | blobs for the complete life of the repository. For extremely large | |
12 | repositories, clones can take hours (or days) and consume 100+GiB of disk | |
13 | space. | |
14 | ||
15 | Often in these repositories there are many blobs and trees that the user | |
16 | does not need such as: | |
17 | ||
18 | 1. files outside of the user's work area in the tree. For example, in | |
19 | a repository with 500K directories and 3.5M files in every commit, | |
20 | we can avoid downloading many objects if the user only needs a | |
21 | narrow "cone" of the source tree. | |
22 | ||
23 | 2. large binary assets. For example, in a repository where large build | |
24 | artifacts are checked into the tree, we can avoid downloading all | |
25 | previous versions of these non-mergeable binary assets and only | |
26 | download versions that are actually referenced. | |
27 | ||
28 | Partial clone allows us to avoid downloading such unneeded objects *in | |
29 | advance* during clone and fetch operations and thereby reduce download | |
30 | times and disk usage. Missing objects can later be "demand fetched" | |
31 | if/when needed. | |
32 | ||
7e154bad CC |
33 | A remote that can later provide the missing objects is called a |
34 | promisor remote, as it promises to send the objects when | |
031fd4b9 | 35 | requested. Initially Git supported only one promisor remote, the origin |
7e154bad CC |
36 | remote from which the user cloned and that was configured in the |
37 | "extensions.partialClone" config option. Support for more than | |
38 | one promisor remote was added later. | |
39 | ||
637fc446 | 40 | Use of partial clone requires that the user be online and the origin |
7e154bad CC |
41 | remote or other promisor remotes be available for on-demand fetching |
42 | of missing objects. This may or may not be problematic for the user. | |
43 | For example, if the user can stay within the pre-selected subset of | |
44 | the source tree, they may not encounter any missing objects. | |
45 | Alternatively, the user could try to pre-fetch various objects if they | |
46 | know that they are going offline. | |
637fc446 JH |
47 | |
48 | ||
49 | Non-Goals | |
50 | --------- | |
51 | ||
52 | Partial clone is a mechanism to limit the number of blobs and trees downloaded | |
53 | *within* a given range of commits -- and is therefore independent of and not | |
54 | intended to conflict with existing DAG-level mechanisms to limit the set of | |
55 | requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). | |
56 | ||
57 | ||
58 | Design Overview | |
59 | --------------- | |
60 | ||
61 | Partial clone logically consists of the following parts: | |
62 | ||
63 | - A mechanism for the client to describe unneeded or unwanted objects to | |
64 | the server. | |
65 | ||
66 | - A mechanism for the server to omit such unwanted objects from packfiles | |
67 | sent to the client. | |
68 | ||
69 | - A mechanism for the client to gracefully handle missing objects (that | |
70 | were previously omitted by the server). | |
71 | ||
72 | - A mechanism for the client to backfill missing objects as needed. | |
73 | ||
74 | ||
75 | Design Details | |
76 | -------------- | |
77 | ||
78 | - A new pack-protocol capability "filter" is added to the fetch-pack and | |
79 | upload-pack negotiation. | |
5641eb94 JN |
80 | + |
81 | This uses the existing capability discovery mechanism. | |
82 | See "filter" in Documentation/technical/pack-protocol.txt. | |
637fc446 JH |
83 | |
84 | - Clients pass a "filter-spec" to clone and fetch which is passed to the | |
85 | server to request filtering during packfile construction. | |
5641eb94 JN |
86 | + |
87 | There are various filters available to accommodate different situations. | |
88 | See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. | |
637fc446 JH |
89 | |
90 | - On the server pack-objects applies the requested filter-spec as it | |
91 | creates "filtered" packfiles for the client. | |
5641eb94 JN |
92 | + |
93 | These filtered packfiles are *incomplete* in the traditional sense because | |
94 | they may contain objects that reference objects not contained in the | |
95 | packfile and that the client doesn't already have. For example, the | |
96 | filtered packfile may contain trees or tags that reference missing blobs | |
97 | or commits that reference missing trees. | |
637fc446 JH |
98 | |
99 | - On the client these incomplete packfiles are marked as "promisor packfiles" | |
100 | and treated differently by various commands. | |
101 | ||
102 | - On the client a repository extension is added to the local config to | |
103 | prevent older versions of git from failing mid-operation because of | |
104 | missing objects that they cannot handle. | |
105 | See "extensions.partialClone" in Documentation/technical/repository-version.txt. | |
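
As a rough illustration of what a filter-spec asks the server to do, the
following Python sketch models two of the filters described in
rev-list-options.txt, `blob:none` and `blob:limit=<n>`, applied to a list of
candidate objects. The `Obj` record and the `parse_limit` and `apply_filter`
helpers are hypothetical names for this sketch, and the 1024-based unit
suffixes are an assumption; real filtering happens inside pack-objects.

```python
from dataclasses import dataclass


# Hypothetical, simplified object record; real Git objects carry more state.
@dataclass(frozen=True)
class Obj:
    oid: str
    kind: str   # "commit", "tree", "blob", or "tag"
    size: int   # size in bytes


# Assumption: k/m/g suffixes are multiples of 1024.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_limit(text):
    """Parse the <n>[kmg] part of a blob:limit filter-spec into bytes."""
    suffix = text[-1].lower()
    if suffix in _UNITS:
        return int(text[:-1]) * _UNITS[suffix]
    return int(text)


def apply_filter(objects, filter_spec):
    """Return (sent, omitted) for the two filter-specs modelled here."""
    if filter_spec == "blob:none":
        def omit(obj):
            return obj.kind == "blob"
    elif filter_spec.startswith("blob:limit="):
        limit = parse_limit(filter_spec[len("blob:limit="):])

        def omit(obj):
            # blob:limit omits blobs whose size is at least the limit.
            return obj.kind == "blob" and obj.size >= limit
    else:
        raise ValueError("filter-spec not modelled here: " + filter_spec)
    sent = [obj for obj in objects if not omit(obj)]
    omitted = [obj for obj in objects if omit(obj)]
    return sent, omitted


objects = [
    Obj("c1", "commit", 250),
    Obj("t1", "tree", 120),
    Obj("b1", "blob", 512),
    Obj("b2", "blob", 5 * 1024 * 1024),
]
sent, omitted = apply_filter(objects, "blob:limit=1m")
print([obj.oid for obj in sent])
print([obj.oid for obj in omitted])
```

Commits and trees are always sent in this model, so the resulting pack is
"incomplete" in exactly the sense described above: `t1` may reference `b2`,
which the client does not receive.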
106 | ||
107 | ||
108 | Handling Missing Objects | |
109 | ------------------------ | |
110 | ||
7e154bad CC |
111 | - An object may be missing due to a partial clone or fetch, or missing |
112 | due to repository corruption. To differentiate these cases, the | |
113 | local repository specially marks such filtered packfiles | |
114 | obtained from promisor remotes as "promisor packfiles". | |
5641eb94 JN |
115 | + |
116 | These promisor packfiles consist of a "<name>.promisor" file with | |
117 | arbitrary contents (like the "<name>.keep" files), in addition to | |
118 | their "<name>.pack" and "<name>.idx" files. | |
637fc446 JH |
119 | |
120 | - The local repository considers a "promisor object" to be an object that | |
7e154bad CC |
121 | it knows (to the best of its ability) that promisor remotes have promised |
122 | that they have, either because the local repository has that object in one of | |
637fc446 | 123 | its promisor packfiles, or because another promisor object refers to it. |
5641eb94 | 124 | + |
1747125e | 125 | When Git encounters a missing object, Git can see if it is a promisor object |
5641eb94 JN |
126 | and handle it appropriately. If not, Git can report a corruption. |
127 | + | |
128 | This means that there is no need for the client to explicitly maintain an | |
129 | expensive-to-modify list of missing objects.[a] | |
637fc446 JH |
130 | |
131 | - Since almost all Git code currently expects any referenced object to be | |
132 | present locally and because we do not want to force every command to do | |
133 | a dry-run first, a fallback mechanism is added to allow Git to attempt | |
7e154bad | 134 | to dynamically fetch missing objects from promisor remotes. |
5641eb94 JN |
135 | + |
136 | When the normal object lookup fails to find an object, Git invokes | |
7e154bad CC |
137 | promisor_remote_get_direct() to try to get the object from a promisor |
138 | remote and then retry the object lookup. This allows objects to be | |
139 | "faulted in" without complicated prediction algorithms. | |
5641eb94 JN |
140 | + |
141 | For efficiency reasons, no check as to whether the missing object is | |
142 | actually a promisor object is performed. | |
143 | + | |
144 | Dynamic object fetching tends to be slow as objects are fetched one at | |
145 | a time. | |
637fc446 JH |
146 | |
147 | - `checkout` (and any other command using `unpack-trees`) has been taught | |
148 | to bulk pre-fetch all required missing blobs in a single batch. | |
149 | ||
150 | - `rev-list` has been taught to print missing objects. | |
5641eb94 JN |
151 | + |
152 | This can be used by other commands to bulk prefetch objects. | |
153 | For example, a "git log -p A..B" may internally want to first do | |
154 | something like "git rev-list --objects --quiet --missing=print A..B" | |
155 | and prefetch those objects in bulk. | |
637fc446 JH |
156 | |
157 | - `fsck` has been updated to be fully aware of promisor objects. | |
158 | ||
159 | - `repack` in GC has been updated to not touch promisor packfiles at all, | |
160 | and to only repack other objects. | |
161 | ||
162 | - The global variable "fetch_if_missing" is used to control whether an | |
163 | object lookup will attempt to dynamically fetch a missing object or | |
164 | report an error. | |
5641eb94 JN |
165 | + |
166 | We are not happy with this global variable and would like to remove it, | |
167 | but that requires significant refactoring of the object code to pass an | |
7e154bad | 168 | additional flag. |
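
The two ideas above, inferring promisor objects from promisor packfiles and
faulting in missing objects on lookup, can be sketched with a toy model. This
is a minimal Python sketch, not Git's implementation: the `Repo` class, its
dictionaries, and `MissingObjectError` are hypothetical, and each object is
reduced to the list of object IDs it references.

```python
class MissingObjectError(Exception):
    """Raised when an object is absent and cannot be fetched."""


class Repo:
    """Toy model: each mapping is oid -> list of referenced oids."""

    def __init__(self, promisor_pack, other_objects, remote):
        self.promisor_pack = dict(promisor_pack)  # objects in promisor packfiles
        self.other_objects = dict(other_objects)  # locally created objects
        self.remote = dict(remote)                # what the promisor remote holds
        self.fetch_if_missing = True              # models the flag described above
        self.fetch_count = 0

    def has_local(self, oid):
        return oid in self.promisor_pack or oid in self.other_objects

    def is_promisor_object(self, oid):
        # Promised if stored in a promisor packfile, or referenced by an
        # object stored in one. No separate list of missing objects is kept.
        if oid in self.promisor_pack:
            return True
        return any(oid in refs for refs in self.promisor_pack.values())

    def _fetch_from_promisor(self, oid):
        # As in the text above, no is_promisor_object() check is made here.
        if oid not in self.remote:
            raise MissingObjectError(oid)
        self.promisor_pack[oid] = self.remote[oid]
        self.fetch_count += 1

    def lookup(self, oid):
        """Return the object's references, faulting it in if allowed."""
        if not self.has_local(oid):
            if not self.fetch_if_missing:
                raise MissingObjectError(oid)
            self._fetch_from_promisor(oid)
        if oid in self.promisor_pack:
            return self.promisor_pack[oid]
        return self.other_objects[oid]


repo = Repo(
    promisor_pack={"commit1": ["tree1"], "tree1": ["blob1", "blob2"]},
    other_objects={"commit2": ["tree1"]},
    remote={"blob1": [], "blob2": []},
)
print(repo.is_promisor_object("blob1"))  # referenced from a promisor packfile
print(repo.is_promisor_object("blob9"))  # not promised: a sign of corruption
repo.lookup("blob1")                     # faulted in from the remote
print(repo.fetch_count)
```

A fetched object lands in a promisor packfile in this model, so anything it
references becomes a promisor object in turn.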
637fc446 JH |
169 | |
170 | ||
171 | Fetching Missing Objects | |
172 | ------------------------ | |
173 | ||
174 | - Fetching of objects is done through the existing transport mechanism using | |
175 | transport_fetch_refs(), setting a new transport option | |
176 | TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are | |
177 | desired, not any object that they refer to. | |
5641eb94 JN |
178 | + |
179 | Because some transports invoke fetch_pack() in the same process, fetch_pack() | |
180 | has been updated to not use any object flags when the corresponding argument | |
181 | (no_dependents) is set. | |
637fc446 JH |
182 | |
183 | - The local repository sends a request with the hashes of all requested | |
184 | objects as "want" lines, and does not perform any packfile negotiation. | |
185 | It then receives a packfile. | |
186 | ||
187 | - Because we are reusing the existing fetch-pack mechanism, fetching | |
188 | currently fetches all objects referred to by the requested objects, even | |
189 | though they are not necessary. | |
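
To make the shape of such a request concrete, here is a minimal Python sketch
that encodes "want" lines in pkt-line framing, followed by a flush packet and
"done", with no "have" lines because no negotiation takes place. It is a
simplification under stated assumptions: the capability list that normally
follows the first "want" is left out, and `pkt_line` and `build_want_request`
are hypothetical helper names, not Git functions.

```python
FLUSH_PKT = b"0000"


def pkt_line(payload):
    """Encode one pkt-line: 4 hex digits of total length, then the payload."""
    data = payload.encode("ascii")
    return "{:04x}".format(len(data) + 4).encode("ascii") + data


def build_want_request(oids):
    """Build a simplified request body: want lines, flush packet, done."""
    if not oids:
        raise ValueError("at least one object ID is required")
    lines = [pkt_line("want " + oid + "\n") for oid in oids]
    return b"".join(lines) + FLUSH_PKT + pkt_line("done\n")


# A made-up 40-character object ID, used only for illustration.
request = build_want_request(["a" * 40])
print(request)
```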
190 | ||
191 | ||
7e154bad CC |
192 | Using many promisor remotes |
193 | --------------------------- | |
194 | ||
195 | Many promisor remotes can be configured and used. | |
196 | ||
197 | This allows a user, for example, to have multiple geographically close | |
198 | cache servers for fetching missing blobs while continuing to do | |
199 | filtered `git-fetch` commands from the central server. | |
200 | ||
201 | When fetching objects, promisor remotes are tried one after the other | |
202 | until all the objects have been fetched. | |
203 | ||
204 | Remotes that are considered "promisor" remotes are those specified by | |
205 | the following configuration variables: | |
206 | ||
207 | - `extensions.partialClone = <name>` | |
208 | ||
209 | - `remote.<name>.promisor = true` | |
210 | ||
211 | - `remote.<name>.partialCloneFilter = ...` | |
212 | ||
213 | Only one promisor remote can be configured using the | |
214 | `extensions.partialClone` config variable. This promisor remote will | |
215 | be the last one tried when fetching objects. | |
216 | ||
217 | We decided to make it the last one we try, because it is likely that | |
218 | someone using many promisor remotes is doing so because the other | |
219 | promisor remotes are better for some reason (maybe they are closer or | |
220 | faster for some kind of objects) than the origin, and the origin is | |
221 | likely to be the remote specified by extensions.partialClone. | |
222 | ||
223 | This justification is not very strong, but one choice had to be made, | |
224 | and anyway the long term plan should be to make the order somehow | |
225 | fully configurable. | |
226 | ||
227 | For now though the other promisor remotes will be tried in the order | |
228 | they appear in the config file. | |
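
The ordering rule described in this section can be sketched as follows. This
is a minimal Python model under stated assumptions: the configuration is given
as a flat list of (key, value) pairs in config-file order, and
`promisor_remote_order` and `fetch_from_promisors` are hypothetical helpers,
not Git functions. The remote names in the example are made up.

```python
def promisor_remote_order(config_entries):
    """Return promisor remote names in the order they would be tried."""
    extension_remote = None
    ordered = []
    for key, value in config_entries:
        if key.lower() == "extensions.partialclone":
            extension_remote = value
        elif key[:7].lower() == "remote.":
            # The remote name may itself contain dots; split on the last one.
            name, _, var = key[7:].rpartition(".")
            var = var.lower()
            is_promisor = (
                (var == "promisor" and value.lower() == "true")
                or var == "partialclonefilter"
            )
            if name and is_promisor and name not in ordered:
                ordered.append(name)
    # The remote named by extensions.partialClone is always tried last.
    if extension_remote is not None:
        if extension_remote in ordered:
            ordered.remove(extension_remote)
        ordered.append(extension_remote)
    return ordered


def fetch_from_promisors(wanted, order, remotes):
    """Try remotes in order until nothing is missing.

    Returns (fetched_from, still_missing), where fetched_from maps each
    object ID to the remote that supplied it.
    """
    remaining = set(wanted)
    fetched_from = {}
    for name in order:
        if not remaining:
            break
        available = remaining & remotes.get(name, set())
        for oid in available:
            fetched_from[oid] = name
        remaining -= available
    return fetched_from, remaining


config = [
    ("remote.origin.promisor", "true"),
    ("remote.origin.partialCloneFilter", "blob:none"),
    ("remote.cache-eu.promisor", "true"),
    ("extensions.partialClone", "origin"),
    ("remote.cache-us.promisor", "true"),
]
order = promisor_remote_order(config)
print(order)

remotes = {
    "cache-eu": {"b1"},
    "cache-us": {"b2"},
    "origin": {"b1", "b2", "b3"},
}
fetched_from, still_missing = fetch_from_promisors(["b1", "b2", "b3"], order, remotes)
print(fetched_from, still_missing)
```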
229 | ||
637fc446 JH |
230 | Current Limitations |
231 | ------------------- | |
232 | ||
7e154bad CC |
233 | - It is not possible to specify the order in which the promisor |
234 | remotes are tried, other than by the order in which they appear | |
235 | in the config file. | |
5641eb94 | 236 | + |
7e154bad CC |
237 | It is also not possible to specify an order to be used when fetching |
238 | from one remote and a different order when fetching from another | |
239 | remote. | |
240 | ||
241 | - It is not possible to push only specific objects to a promisor | |
242 | remote. | |
5641eb94 | 243 | + |
7e154bad CC |
244 | It is also not possible to push to multiple promisor remotes at |
245 | the same time in a specific order. | |
637fc446 | 246 | |
7e154bad CC |
247 | - Dynamic object fetching will only ask promisor remotes for missing |
248 | objects. We assume that promisor remotes have a complete view of the | |
637fc446 JH |
249 | repository and can satisfy all such requests. |
250 | ||
251 | - Repack essentially treats promisor and non-promisor packfiles as 2 | |
252 | distinct partitions and does not mix them. Repack currently only works | |
253 | on non-promisor packfiles and loose objects. | |
254 | ||
255 | - Dynamic object fetching invokes fetch-pack once *for each item* | |
256 | because most algorithms stumble upon a missing object and need to have | |
257 | it resolved before continuing their work. This may incur significant | |
258 | overhead -- and multiple authentication requests -- if many objects are | |
259 | needed. | |
260 | ||
261 | - Dynamic object fetching currently uses the existing pack protocol V0 | |
262 | which means that each object is requested via fetch-pack. The server | |
263 | will send a full set of info/refs when the connection is established. | |
264 | If there is a large number of refs, this may incur significant overhead. | |
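
The per-object overhead described above is easiest to see by counting fetch
invocations. The Python sketch below parses output in the style of
"git rev-list --missing=print", where missing object IDs are prefixed with
"?", and compares fetching one object at a time against one batched request.
The sample output, the shortened object IDs, and the `CountingRemote` class
are all made up for illustration.

```python
def parse_missing(rev_list_output):
    """Collect object IDs reported as missing (lines starting with '?')."""
    missing = []
    for line in rev_list_output.splitlines():
        if line.startswith("?"):
            fields = line[1:].split()
            if fields:
                missing.append(fields[0])
    return missing


class CountingRemote:
    """Hypothetical remote that counts how many fetch requests it serves."""

    def __init__(self):
        self.invocations = 0

    def fetch(self, oids):
        self.invocations += 1
        return list(oids)


# Made-up sample output; real object IDs are full-length hashes.
sample_output = """\
1111111 src/main.c
?aaaaaaa
?bbbbbbb
2222222 README
?ccccccc
"""
missing = parse_missing(sample_output)

one_at_a_time = CountingRemote()
for oid in missing:
    one_at_a_time.fetch([oid])

batched = CountingRemote()
batched.fetch(missing)

print(missing)
print(one_at_a_time.invocations, batched.invocations)
```

Each invocation in the one-at-a-time case stands for a separate connection,
with its own ref advertisement and possibly its own authentication prompt.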
265 | ||
266 | ||
267 | Future Work | |
268 | ----------- | |
269 | ||
7e154bad CC |
270 | - Improve the way to specify the order in which promisor remotes are |
271 | tried. | |
5641eb94 | 272 | + |
7e154bad CC |
273 | For example, this could allow the user to explicitly specify something like: |
274 | "When fetching from this remote, I want to use these promisor remotes | |
275 | in this order, though, when pushing or fetching to that remote, I want | |
276 | to use those promisor remotes in that order." | |
277 | ||
278 | - Allow pushing to promisor remotes. | |
5641eb94 | 279 | + |
7e154bad | 280 | The user might want to work in a triangular work flow with multiple |
5641eb94 | 281 | promisor remotes that each have an incomplete view of the repository. |
637fc446 JH |
282 | |
283 | - Allow repack to work on promisor packfiles (while keeping them distinct | |
284 | from non-promisor packfiles). | |
285 | ||
286 | - Allow non-pathname-based filters to make use of packfile bitmaps (when | |
287 | present). This was just an omission during the initial implementation. | |
288 | ||
289 | - Investigate use of a long-running process to dynamically fetch a series | |
290 | of objects, such as proposed in [5,6] to reduce process startup and | |
291 | overhead costs. | |
5641eb94 JN |
292 | + |
293 | It would be nice if pack protocol V2 could allow that long-running | |
294 | process to make a series of requests over a single long-running | |
295 | connection. | |
637fc446 JH |
296 | |
297 | - Investigate pack protocol V2 to avoid the info/refs broadcast on | |
298 | each connection with the server to dynamically fetch missing objects. | |
299 | ||
300 | - Investigate the need to handle loose promisor objects. | |
5641eb94 JN |
301 | + |
302 | Objects in promisor packfiles are allowed to reference missing objects | |
303 | that can be dynamically fetched from the server. An assumption was | |
304 | made that loose objects are only created locally and therefore should | |
305 | not reference a missing object. We may need to revisit that assumption | |
306 | if, for example, we dynamically fetch a missing tree and store it as a | |
307 | loose object rather than a single object packfile. | |
308 | + | |
309 | This does not necessarily mean we need to mark loose objects as promisor; | |
310 | it may be sufficient to relax the object lookup or is-promisor functions. | |
637fc446 JH |
311 | |
312 | ||
313 | Non-Tasks | |
314 | --------- | |
315 | ||
316 | - Every time the subject of "demand loading blobs" comes up it seems | |
317 | that someone suggests that the server be allowed to "guess" and send | |
318 | additional objects that may be related to the requested objects. | |
5641eb94 JN |
319 | + |
320 | No work has gone into actually doing that; we're just documenting that | |
321 | it is a common suggestion. We're not sure how it would work and have | |
322 | no plans to work on it. | |
323 | + | |
324 | It is valid for the server to send more objects than requested (even | |
325 | for a dynamic object fetch), but we are not building on that. | |
637fc446 JH |
326 | |
327 | ||
328 | Footnotes | |
329 | --------- | |
330 | ||
331 | [a] expensive-to-modify list of missing objects: Earlier in the design of | |
332 | partial clone we discussed the need for a single list of missing objects. | |
333 | This would essentially be a sorted linear list of OIDs that were | |
334 | omitted by the server during a clone or subsequent fetches. | |
335 | ||
5641eb94 JN |
336 | This file would need to be loaded into memory on every object lookup. |
337 | It would need to be read, updated, and re-written (like the .git/index) | |
338 | on every explicit "git fetch" command *and* on any dynamic object fetch. | |
637fc446 | 339 | |
5641eb94 JN |
340 | The cost to read, update, and write this file could add significant |
341 | overhead to every command if there are many missing objects. For example, | |
342 | if there are 100M missing blobs, this file would be at least 2GiB on disk. | |
637fc446 | 343 | |
5641eb94 JN |
344 | With the "promisor" concept, we *infer* a missing object based upon the |
345 | type of packfile that references it. | |
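
The size estimate in footnote [a] can be checked with a little arithmetic.
This Python sketch assumes the list would store raw 20-byte SHA-1 object IDs
with no per-entry overhead; `missing_list_bytes` is a hypothetical helper.
Under that assumption the list comes to about 2 GB (roughly 1.86 GiB), the
same order of magnitude as the figure quoted above.

```python
SHA1_OID_BYTES = 20  # size of a raw SHA-1 object ID


def missing_list_bytes(count, oid_bytes=SHA1_OID_BYTES):
    """Lower bound on the list size: one raw OID per missing object."""
    return count * oid_bytes


size = missing_list_bytes(100_000_000)
print(size, "bytes")
print(round(size / 10 ** 9, 2), "GB")
print(round(size / 2 ** 30, 2), "GiB")
```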
637fc446 JH |
346 | |
347 | ||
348 | Related Links | |
349 | ------------- | |
5641eb94 JN |
350 | [0] https://crbug.com/git/2 |
351 | Bug#2: Partial Clone | |
637fc446 | 352 | |
3eae30e4 | 353 | [1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + |
5641eb94 | 354 | Subject: [RFC] Add support for downloading blobs on demand + |
637fc446 JH |
355 | Date: Fri, 13 Jan 2017 10:52:53 -0500 |
356 | ||
3eae30e4 | 357 | [2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ + |
5641eb94 | 358 | Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + |
637fc446 JH |
359 | Date: Fri, 29 Sep 2017 13:11:36 -0700 |
360 | ||
3eae30e4 | 361 | [3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + |
5641eb94 | 362 | Subject: Proposal for missing blob support in Git repos + |
637fc446 JH |
363 | Date: Wed, 26 Apr 2017 15:13:46 -0700 |
364 | ||
3eae30e4 | 365 | [4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + |
5641eb94 | 366 | Subject: [PATCH 00/10] RFC Partial Clone and Fetch + |
637fc446 JH |
367 | Date: Wed, 8 Mar 2017 18:50:29 +0000 |
368 | ||
3eae30e4 | 369 | [5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + |
5641eb94 | 370 | Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + |
637fc446 JH |
371 | Date: Fri, 5 May 2017 11:27:52 -0400 |
372 | ||
3eae30e4 | 373 | [6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + |
5641eb94 | 374 | Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + |
637fc446 | 375 | Date: Fri, 14 Jul 2017 09:26:50 -0400 |