Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

  1. files outside of the user's work area in the tree. For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

Use of partial clone requires that the user be online and the origin
remote be available for on-demand fetching of missing objects. This may
or may not be problematic for the user. For example, if the user can
stay within the pre-selected subset of the source tree, they may not
encounter any missing objects. Alternatively, the user could try to
pre-fetch various objects if they know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (e.g. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.
+
This uses the existing capability discovery mechanism.
See "filter" in Documentation/technical/pack-protocol.txt.

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.
+
There are various filters available to accommodate different situations.
See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.

- On the server, pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.
+
These filtered packfiles are *incomplete* in the traditional sense because
they may contain objects that reference objects not contained in the
packfile and that the client doesn't already have. For example, the
filtered packfile may contain trees or tags that reference missing blobs
or commits that reference missing trees.

- On the client, these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client, a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See "extensions.partialClone" in Documentation/technical/repository-version.txt.
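
To illustrate why a filtered packfile is "incomplete", the following is a
minimal Python sketch of a toy server applying a filter-spec while it selects
objects. It is not how pack-objects is implemented. The `Obj` type, the
`wanted()` and `build_filtered_pack()` helpers, and the four-object
repository are all hypothetical, and only the `blob:none` and
`blob:limit=<n>` filter-specs (with a plain byte count) are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str
    kind: str            # "commit", "tree", or "blob"
    size: int = 0        # only meaningful for blobs in this toy model
    refs: tuple = ()     # OIDs of the objects this object points at


def wanted(obj, filter_spec):
    """Decide whether the toy server includes obj in the packfile."""
    if obj.kind != "blob":
        return True                      # both filters modeled here only omit blobs
    if filter_spec == "blob:none":
        return False
    if filter_spec.startswith("blob:limit="):
        limit = int(filter_spec.split("=", 1)[1])   # plain byte count only
        return obj.size < limit          # blobs of at least `limit` bytes are omitted
    raise ValueError(f"unsupported filter-spec: {filter_spec}")


def build_filtered_pack(objects, filter_spec):
    """Return (pack, unresolved): the kept objects, and the OIDs they
    reference that were left out of the pack."""
    pack = {o.oid: o for o in objects if wanted(o, filter_spec)}
    unresolved = {ref for o in pack.values() for ref in o.refs if ref not in pack}
    return pack, unresolved


# Hypothetical repository: one commit, one tree, a small and a large blob.
repo = [
    Obj("c1", "commit", refs=("t1",)),
    Obj("t1", "tree", refs=("b1", "b2")),
    Obj("b1", "blob", size=10),
    Obj("b2", "blob", size=5000),
]

pack, unresolved = build_filtered_pack(repo, "blob:none")
print(sorted(pack), sorted(unresolved))   # ['c1', 't1'] ['b1', 'b2']
```

The `unresolved` set corresponds to the references that make the packfile
incomplete: the tree `t1` is sent, but the blobs it names are not.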


Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing due
  to repository corruption. To differentiate these cases, the local
  repository specially indicates such filtered packfiles obtained from the
  promisor remote as "promisor packfiles".
+
These promisor packfiles consist of a "<name>.promisor" file with
arbitrary contents (like the "<name>.keep" files), in addition to
their "<name>.pack" and "<name>.idx" files.

- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that the promisor remote has promised
  that it has, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from the promisor remote.
+
When the normal object lookup fails to find an object, Git invokes
fetch-object to try to get the object from the server and then retry
the object lookup. This allows objects to be "faulted in" without
complicated prediction algorithms.
+
For efficiency reasons, no check as to whether the missing object is
actually a promisor object is performed.
+
Dynamic object fetching tends to be slow as objects are fetched one at
a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.
+
This can be used by other commands to bulk prefetch objects.
For example, a "git log -p A..B" may internally want to first do
something like "git rev-list --objects --quiet --missing=print A..B"
and prefetch those objects in bulk.

- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.
+
We are not happy with this global variable and would like to remove it,
but that requires significant refactoring of the object code to pass an
additional flag. We hope that concurrent efforts to add an ODB API can
encompass this.
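
The promisor-object rule and the fault-in lookup described above can be
modeled roughly as follows. This is a sketch under stated assumptions, not
Git's object code: the `ToyRepo` class and its methods are hypothetical,
objects are reduced to a tuple of the OIDs they reference, and only the
`fetch_if_missing` flag takes its name from the text.

```python
class MissingObject(Exception):
    """Raised when an object is absent and could not be faulted in."""


class ToyRepo:
    def __init__(self, promisor_pack, other_objects, remote):
        self.promisor_pack = dict(promisor_pack)  # oid -> tuple of referenced OIDs
        self.other = dict(other_objects)          # objects outside promisor packfiles
        self.remote = remote                      # what the promisor remote can serve
        self.fetch_if_missing = True
        self.fetches = 0                          # number of dynamic fetches attempted

    def has(self, oid):
        return oid in self.promisor_pack or oid in self.other

    def is_promisor_object(self, oid):
        # Either the object sits in a promisor packfile, or some object in a
        # promisor packfile refers to it.
        if oid in self.promisor_pack:
            return True
        return any(oid in refs for refs in self.promisor_pack.values())

    def classify_missing(self, oid):
        # Inferred from the packfiles; no separate list of missing OIDs is kept.
        return "promised" if self.is_promisor_object(oid) else "corrupt"

    def lookup(self, oid):
        if not self.has(oid) and self.fetch_if_missing:
            # As in the text, no promisor check is made before fetching.
            self.fetches += 1
            if oid in self.remote:
                # Simplification: the fetched object joins the promisor set.
                self.promisor_pack[oid] = self.remote[oid]
        if oid in self.promisor_pack:
            return self.promisor_pack[oid]
        if oid in self.other:
            return self.other[oid]
        raise MissingObject(oid)


remote = {"c1": ("t1",), "t1": ("b1",), "b1": ()}
repo = ToyRepo({"c1": ("t1",), "t1": ("b1",)}, {}, remote)

print(repo.classify_missing("b1"))   # promised
print(repo.classify_missing("zz"))   # corrupt
repo.lookup("b1")                    # faulted in from the remote
print(repo.fetches)                  # 1
```

Setting `repo.fetch_if_missing = False` before a lookup makes a missing
object raise `MissingObject` instead of triggering a fetch, which is the
behavior a command would want when it only needs to report what is absent.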


Fetching Missing Objects
------------------------

- Fetching of objects is done using the existing transport mechanism using
  transport_fetch_refs(), setting a new transport option
  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
  desired, not any object that they refer to.
+
Because some transports invoke fetch_pack() in the same process, fetch_pack()
has been updated to not use any object flags when the corresponding argument
(no_dependents) is set.

- The local repository sends a request with the hashes of all requested
  objects as "want" lines, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch-pack mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.
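
As a rough illustration of a request made only of "want" lines with no
negotiation, the sketch below frames such a request using pkt-line
encoding (four hex digits giving the total length, then the payload). It is
simplified: a real request also lists capabilities on the first "want"
line, which is omitted here, and the helper names `pkt_line()` and
`build_want_request()` are hypothetical.

```python
FLUSH_PKT = b"0000"


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line: 4 hex digits holding the total length
    (including the 4 length bytes themselves), followed by the payload."""
    return b"%04x" % (len(payload) + 4) + payload


def build_want_request(oids):
    """One "want" line per requested object, a flush-pkt, then "done".
    No "have" lines are sent, so there is no packfile negotiation."""
    wants = [pkt_line(b"want " + oid.encode("ascii") + b"\n") for oid in oids]
    return b"".join(wants) + FLUSH_PKT + pkt_line(b"done\n")


# Two made-up 40-hex-digit object names.
request = build_want_request(["a" * 40, "b" * 40])
print(request.decode("ascii"))
```

Each "want" line for a 40-hex-digit object name is 50 bytes long, so it
carries the length prefix `0032`.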


Current Limitations
-------------------

- The remote used for a partial clone (or the first partial fetch
  following a regular clone) is marked as the "promisor remote".
+
We are currently limited to a single promisor remote and only that
remote may be used for subsequent partial fetches.
+
We accept this limitation because we believe initial users of this
feature will be using it on repositories with a strong single central
server.

- Dynamic object fetching will only ask the promisor remote for missing
  objects. We assume that the promisor remote has a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as 2
  distinct partitions and does not mix them. Repack currently only works
  on non-promisor packfiles and loose objects.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.
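
The per-item overhead can be made concrete with a toy model that only
counts connections. The sketch below contrasts faulting objects in one at a
time with fetching a known list of missing objects in a single batch, as
`checkout` does. The `CountingRemote` class and both helper functions are
hypothetical, and a "connection" here stands in for one fetch-pack
invocation with its ref advertisement and authentication.

```python
class CountingRemote:
    """Toy promisor remote that counts how many connections it serves."""

    def __init__(self, objects):
        self.objects = objects
        self.connections = 0

    def fetch(self, oids):
        self.connections += 1          # one connection per request
        return {oid: self.objects[oid] for oid in oids}


def fault_in_one_at_a_time(remote, needed):
    """Resolve each missing object only when the algorithm stumbles on it."""
    local = {}
    for oid in needed:
        if oid not in local:
            local.update(remote.fetch([oid]))
    return local


def prefetch_in_bulk(remote, needed):
    """Collect the missing OIDs first, then ask for all of them at once."""
    missing = list(dict.fromkeys(needed))   # de-duplicate, keep order
    return remote.fetch(missing) if missing else {}


blobs = {f"blob{i}": b"data" for i in range(1000)}

slow = CountingRemote(blobs)
fault_in_one_at_a_time(slow, blobs)
fast = CountingRemote(blobs)
prefetch_in_bulk(fast, blobs)

print(slow.connections, fast.connections)   # 1000 1
```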


Future Work
-----------

- Allow more than one promisor remote and define a strategy for fetching
  missing objects from specific promisor remotes or of iterating over the
  set of promisor remotes until a missing object is found.
+
A user might want to have multiple geographically-close cache servers
for fetching missing blobs while continuing to do filtered `git-fetch`
commands from the central server, for example.
+
Or the user might want to work in a triangular work flow with multiple
promisor remotes that each have an incomplete view of the repository.

- Allow repack to work on promisor packfiles (while keeping them distinct
  from non-promisor packfiles).

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6] to reduce process startup and
  overhead costs.
+
It would be nice if pack protocol V2 could allow that long-running
process to make a series of requests over a single long-running
connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.
+
Objects in promisor packfiles are allowed to reference missing objects
that can be dynamically fetched from the server. An assumption was
made that loose objects are only created locally and therefore should
not reference a missing object. We may need to revisit that assumption
if, for example, we dynamically fetch a missing tree and store it as a
loose object rather than a single object packfile.
+
This does not necessarily mean we need to mark loose objects as promisor;
it may be sufficient to relax the object lookup or is-promisor functions.
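
One possible shape for the "iterate over the set of promisor remotes"
strategy mentioned in the first item is sketched below. Nothing like this
exists in the design yet; the `fetch_missing()` function, the ordered list
of remotes, and the remote names are purely illustrative assumptions.

```python
def fetch_missing(oid, promisor_remotes):
    """Try each promisor remote in order and return (remote_name, data)
    from the first one that has the object.

    promisor_remotes is an ordered list of (name, objects) pairs, where
    objects maps OIDs to object data; nearer caches would be listed first.
    """
    tried = []
    for name, objects in promisor_remotes:
        tried.append(name)
        if oid in objects:
            return name, objects[oid]
    raise LookupError(f"{oid} not found on any of: {', '.join(tried)}")


# Hypothetical setup: a nearby cache with a partial view, then the
# central server.
remotes = [
    ("cache", {"b1": b"one"}),
    ("central", {"b1": b"one", "b2": b"two"}),
]

print(fetch_missing("b1", remotes)[0])   # cache
print(fetch_missing("b2", remotes)[0])   # central
```

With remotes that each hold an incomplete view, a lookup that exhausts the
list cannot by itself tell a promised object from a corrupt one, which is
part of why the strategy needs to be defined carefully.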


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.
+
No work has gone into actually doing that; we're just documenting that
it is a common suggestion. We're not sure how it would work and have
no plans to work on it.
+
It is valid for the server to send more objects than requested (even
for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup.
It would need to be read, updated, and re-written (like the .git/index)
on every explicit "git fetch" command *and* on any dynamic object fetch.

The cost to read, update, and write this file could add significant
overhead to every command if there are many missing objects. For example,
if there are 100M missing blobs, this file would be roughly 2GB on disk
(assuming 20-byte binary SHA-1 OIDs).

With the "promisor" concept, we *infer* a missing object based upon the
type of packfile that references it.
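
The size estimate for such a list can be reproduced with a few lines of
arithmetic. The sketch assumes the list stores nothing but binary SHA-1
object IDs of 20 bytes each, with no header or per-entry overhead; a
hex-encoded list would be twice as large.

```python
OID_BYTES = 20                 # assumption: binary SHA-1 object IDs
missing_blobs = 100_000_000    # the 100M missing blobs from the example

size_bytes = missing_blobs * OID_BYTES
size_gb = size_bytes / 10**9   # decimal gigabytes
size_gib = size_bytes / 2**30  # binary gibibytes

print(f"{size_bytes} bytes = {size_gb:.2f} GB = {size_gib:.2f} GiB")
# 2000000000 bytes = 2.00 GB = 1.86 GiB
```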


Related Links
-------------
[0] https://crbug.com/git/2
    Bug#2: Partial Clone

[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
    Subject: [RFC] Add support for downloading blobs on demand +
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
    Subject: Proposal for missing blob support in Git repos +
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
    Date: Fri, 14 Jul 2017 09:26:50 -0400