Bundle URIs
===========

Git bundles are files that store a pack-file along with some extra metadata,
including a set of refs and a (possibly empty) set of necessary commits. See
linkgit:git-bundle[1] and linkgit:gitformat-bundle[5] for more information.

Bundle URIs are locations where Git can download one or more bundles in
order to bootstrap the object database in advance of fetching the remaining
objects from a remote.

One goal is to speed up clones and fetches for users with poor network
connectivity to the origin server. Another benefit is to allow heavy users,
such as CI build farms, to use local resources for the majority of Git data
and thereby reduce the load on the origin server.

To enable the bundle URI feature, users can specify a bundle URI using
command-line options or the origin server can advertise one or more URIs
via a protocol v2 capability.

Design Goals
------------

The bundle URI standard aims to be flexible enough to satisfy multiple
workloads. The bundle provider and the Git client have several choices in
how they create and consume bundle URIs.

* Bundles can have whatever name the server desires. This name could refer
  to immutable data by using a hash of the bundle contents. However, this
  means that a new URI will be needed after every update of the content.
  This might be acceptable if the server is advertising the URI (and the
  server is aware of new bundles being generated) but would not be
  ergonomic for users using the command line option.

* The bundles could be organized specifically for bootstrapping full
  clones, but could also be organized with the intention of bootstrapping
  incremental fetches. The bundle provider must decide on one of several
  organization schemes to minimize client downloads during incremental
  fetches, but the Git client can also choose whether to use bundles for
  either of these operations.

* The bundle provider can choose to support full clones, partial clones,
  or both. The client can detect which bundles are appropriate for the
  repository's partial clone filter, if any.

* The bundle provider can use a single bundle (for clones only), or a
  list of bundles. When using a list of bundles, the provider can specify
  whether or not the client needs _all_ of the bundle URIs for a full
  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
  bundle provider to use different URIs for different geographies.

* The bundle provider can organize the bundles using heuristics, such as
  creation tokens, to help the client avoid downloading bundles it does
  not need. When the bundle provider does not provide these heuristics,
  the client can use optimizations to minimize how much of the data is
  downloaded.

* The bundle provider does not need to be associated with the Git server.
  The client can choose to use the bundle provider without it being
  advertised by the Git server.

* The client can choose to discover bundle providers that are advertised
  by the Git server. This could happen during `git clone`, during
  `git fetch`, both, or neither. The user can choose which combination
  works best for them.

* The client can choose to configure a bundle provider manually at any
  time. The client can also choose to specify a bundle provider manually
  as a command-line option to `git clone`.

Each repository is different and every Git server has different needs.
Hopefully the bundle URI feature is flexible enough to satisfy all needs.
If not, then the feature can be extended through its versioning mechanism.

Server requirements
-------------------

To provide a server-side implementation of bundle servers, no other parts
of the Git protocol are required. This allows server maintainers to use
static content solutions such as CDNs in order to serve the bundle files.

At the current scope of the bundle URI feature, all URIs are expected to
be HTTP(S) URLs where content is downloaded to a local file using a `GET`
request to that URL. The server could include authentication requirements
to those requests with the aim of triggering the configured credential
helper for secure access. (Future extensions could use "file://" URIs or
SSH URIs.)

Assuming a `200 OK` response from the server, the content at the URL is
inspected. First, Git attempts to parse the file as a bundle file of
version 2 or higher. If the file is not a bundle, then the file is parsed
as a plain-text file using Git's config parser. The key-value pairs in
that config file are expected to describe a list of bundle URIs. If
neither of these parse attempts succeeds, then Git will report an error to
the user that the bundle URI provided erroneous data.

Any other data provided by the server is considered erroneous.
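
The inspection order described above can be sketched in a few lines of
Python. This is an illustration only, not Git's implementation: the function
name `classify_download` is hypothetical, and the check for a bundle list is
a crude stand-in for Git's config parser (it only looks for a leading
`[bundle]` section header). The two signature lines are the ones documented
in linkgit:gitformat-bundle[5].

```python
# Signature lines of bundle format versions 2 and 3.
BUNDLE_SIGNATURES = (b"# v2 git bundle\n", b"# v3 git bundle\n")


def classify_download(content: bytes) -> str:
    """Return "bundle", "bundle-list" or "error" for downloaded content."""
    # Step 1: try to recognize a bundle file by its signature line.
    if content.startswith(BUNDLE_SIGNATURES):
        return "bundle"
    # Step 2: otherwise treat the content as plain text in config format.
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        return "error"
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # skip blank lines and config comments
        # Assumption: a bundle list starts with a [bundle] section.
        if stripped == "[bundle]" or stripped.startswith('[bundle "'):
            return "bundle-list"
        return "error"
    return "error"
```

A caller would treat the `"error"` result the same way as a failed download
and continue without the bundle URI.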

Bundle Lists
------------

The Git server can advertise bundle URIs using a set of `key=value` pairs.
A bundle URI can also serve a plain-text file in the Git config format
containing these same `key=value` pairs. In both cases, we consider this
to be a _bundle list_. The pairs specify information about the bundles
that the client can use to make decisions for which bundles to download
and which to ignore.

A few keys focus on properties of the list itself.

bundle.version::
	(Required) This value provides a version number for the bundle
	list. If a future Git change enables a feature that needs the Git
	client to react to a new key in the bundle list file, then this version
	will increment. The only current version number is 1, and if any other
	value is specified then Git will fail to use this file.

bundle.mode::
	(Required) This value is one of two values: `all` or `any`. When `all`
	is specified, then the client should expect to need all of the listed
	bundle URIs that match their repository's requirements. When `any` is
	specified, then the client should expect that any one of the bundle URIs
	that match their repository's requirements will suffice. Typically, the
	`any` option is used to list a number of different bundle servers
	located in different geographies.

bundle.heuristic::
	If this string-valued key exists, then the bundle list is designed to
	work well with incremental `git fetch` commands. The heuristic signals
	that there are additional keys available for each bundle that help
	determine which subset of bundles the client should download. The only
	heuristic currently planned is `creationToken`.

The remaining keys include an `<id>` segment which is a server-designated
name for each available bundle. The `<id>` must contain only alphanumeric
and `-` characters.

bundle.<id>.uri::
	(Required) This string value is the URI for downloading bundle `<id>`.
	If the URI begins with a protocol (`http://` or `https://`) then the URI
	is absolute. Otherwise, the URI is interpreted as relative to the URI
	used for the bundle list. If the URI begins with `/`, then that relative
	path is relative to the domain name used for the bundle list. (This use
	of relative paths is intended to make it easier to distribute a set of
	bundles across a large number of servers or CDNs with different domain
	names.)

bundle.<id>.filter::
	This string value represents an object filter that should also appear in
	the header of this bundle. The server uses this value to differentiate
	different kinds of bundles from which the client can choose those that
	match their object filters.

bundle.<id>.creationToken::
	This value is a nonnegative 64-bit integer used for sorting the bundles
	in the list. This is used to download a subset of bundles during a fetch
	when `bundle.heuristic=creationToken`.

bundle.<id>.location::
	This string value advertises a real-world location from where the bundle
	URI is served. This can be used to present the user with an option for
	which bundle URI to use or simply as an informative indicator of which
	bundle URI was selected by Git. This is only valuable when
	`bundle.mode` is `any`.

Here is an example bundle list using the Git config format:

	[bundle]
		version = 1
		mode = all
		heuristic = creationToken

	[bundle "2022-02-09-1644442601-daily"]
		uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
		creationToken = 1644442601

	[bundle "2022-02-02-1643842562"]
		uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
		creationToken = 1643842562

	[bundle "2022-02-09-1644442631-daily-blobless"]
		uri = 2022-02-09-1644442631-daily-blobless.bundle
		creationToken = 1644442631
		filter = blob:none

	[bundle "2022-02-02-1643842568-blobless"]
		uri = /git/git/2022-02-02-1643842568-blobless.bundle
		creationToken = 1643842568
		filter = blob:none

This example uses `bundle.mode=all` as well as the
`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
options to present two parallel sets of bundles: one for full clones and
another for blobless partial clones.
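
To make the key layout concrete, here is a minimal Python sketch that reads
a bundle list like the one above and applies the rules stated for
`bundle.version`, `bundle.mode`, `<id>` and `bundle.<id>.uri`. It is not
Git's config parser: it handles only `[bundle]` and `[bundle "<id>"]`
sections with simple `key = value` lines, and the name `parse_bundle_list`
is hypothetical. Keys are lowercased because Git config variable names are
case-insensitive.

```python
import re

SECTION_RE = re.compile(r'^\[bundle(?:\s+"([^"]*)")?\]$')
ID_RE = re.compile(r"^[A-Za-z0-9-]+$")  # alphanumeric and '-' only


def parse_bundle_list(text):
    """Return (list_keys, bundles); bundles maps <id> to its key/value dict."""
    list_keys, bundles = {}, {}
    current = None  # dict that receives keys of the current section
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        match = SECTION_RE.match(line)
        if match:
            bundle_id = match.group(1)
            if bundle_id is None:
                current = list_keys
            elif ID_RE.match(bundle_id):
                current = bundles.setdefault(bundle_id, {})
            else:
                raise ValueError(f"invalid bundle id: {bundle_id!r}")
            continue
        if current is None or "=" not in line:
            raise ValueError(f"unexpected line: {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        current[key.lower()] = value
    # Required list-level keys.
    if list_keys.get("version") != "1":
        raise ValueError("unsupported bundle.version")
    if list_keys.get("mode") not in ("all", "any"):
        raise ValueError("bundle.mode must be 'all' or 'any'")
    # Required per-bundle key.
    for bundle_id, keys in bundles.items():
        if "uri" not in keys:
            raise ValueError(f"bundle.{bundle_id}.uri is required")
    return list_keys, bundles
```

Rejecting the whole list on any violation is a choice made for this sketch;
it matches the statement that Git fails to use a list with an unknown
version.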

Suppose that this bundle list was found at the URI
`https://bundles.example.com/git/git/` and so the two blobless bundles have
the following fully-expanded URIs:

* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
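
The expansion above matches ordinary relative URL resolution. As a sketch,
Python's standard `urllib.parse.urljoin` reproduces both results; the
wrapper name `resolve_bundle_uri` is hypothetical, and Git's own resolution
may differ from `urljoin` in edge cases.

```python
from urllib.parse import urljoin

LIST_URI = "https://bundles.example.com/git/git/"


def resolve_bundle_uri(list_uri, bundle_uri):
    """Resolve a bundle.<id>.uri value against the bundle list URI."""
    return urljoin(list_uri, bundle_uri)


# Relative to the bundle list URI.
print(resolve_bundle_uri(LIST_URI, "2022-02-09-1644442631-daily-blobless.bundle"))
# Relative to the domain name, because the value begins with "/".
print(resolve_bundle_uri(LIST_URI, "/git/git/2022-02-02-1643842568-blobless.bundle"))
# Absolute URIs are returned unchanged.
print(resolve_bundle_uri(LIST_URI, "https://cdn.example.com/a.bundle"))
```

Note that with `urljoin` the trailing `/` on the list URI matters: without
it, the last path segment of the list URI is replaced rather than extended.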

Advertising Bundle URIs
-----------------------

If a user knows a bundle URI for the repository they are cloning, then
they can specify that URI manually through a command-line option. However,
a Git host may want to advertise bundle URIs during the clone operation,
helping users unaware of the feature.

The only thing required for this feature is that the server can advertise
one or more bundle URIs. This advertisement takes the form of a new
protocol v2 capability specifically for discovering bundle URIs.

The client could choose an arbitrary bundle URI as an option _or_ select
the URI with best performance by some exploratory checks. It is up to the
bundle provider to decide if having multiple URIs is preferable to a
single URI that is geodistributed through server-side infrastructure.

Cloning with Bundle URIs
------------------------

The primary need for bundle URIs is to speed up clones. The Git client
will interact with bundle URIs according to the following flow:

1. The user specifies a bundle URI with the `--bundle-uri` command-line
   option _or_ the client discovers a bundle list advertised by the
   Git server.

2. If the downloaded data from a bundle URI is a bundle, then the client
   inspects the bundle headers to check that the prerequisite commit OIDs
   are present in the client repository. If some are missing, then the
   client delays unbundling until other bundles have been unbundled,
   making those OIDs present. When all required OIDs are present, the
   client unbundles that data using a refspec. The default refspec is
   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
   are stored so that later `git fetch` negotiations can communicate the
   bundled refs as `have`s, reducing the size of the fetch over the Git
   protocol. To allow pruning refs from this ref namespace, Git may
   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
   stale bundle refs can be deleted.

3. If the file is instead a bundle list, then the client inspects the
   `bundle.mode` to see if the list is of the `all` or `any` form.

   a. If `bundle.mode=all`, then the client considers all bundle
      URIs. The list is reduced based on the `bundle.<id>.filter` options
      matching the client repository's partial clone filter. Then, all
      bundle URIs are requested. If the `bundle.<id>.creationToken`
      heuristic is provided, then the bundles are downloaded in decreasing
      order by the creation token, stopping when a bundle has all required
      OIDs. The bundles can then be unbundled in increasing creation token
      order. The client stores the latest creation token as a heuristic
      for avoiding future downloads if the bundle list does not advertise
      bundles with larger creation tokens.

   b. If `bundle.mode=any`, then the client can choose any one of the
      bundle URIs to inspect. The client can use a variety of ways to
      choose among these URIs. The client can also fall back to another URI
      if the initial choice fails to return a result.
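
Step 3a can be illustrated with a small in-memory model. The Python sketch
below is an assumption-laden illustration, not Git's code: bundles are plain
dicts, `prerequisites_of` is a hypothetical callback standing in for
"download the bundle and read its header", and "has all required OIDs" is
interpreted as "every prerequisite commit of the bundle is already in the
client repository".

```python
def matching_filter(bundles, client_filter=None):
    """Keep bundles whose 'filter' equals the client's partial clone filter."""
    return [b for b in bundles if b.get("filter") == client_filter]


def plan_unbundle_order(bundles, have_oids, prerequisites_of):
    """Return bundle ids in the order they would be unbundled.

    bundles          -- list of dicts with "id" and "creationToken"
    have_oids        -- set of commit OIDs already in the client repository
    prerequisites_of -- callable: bundle id -> set of prerequisite OIDs
    """
    downloaded = []
    # Download newest first (decreasing creation token).
    newest_first = sorted(bundles, key=lambda b: b["creationToken"], reverse=True)
    for bundle in newest_first:
        downloaded.append(bundle)
        if prerequisites_of(bundle["id"]) <= have_oids:
            break  # everything this bundle builds on is already present
    # Unbundle oldest first (increasing creation token).
    return [b["id"] for b in reversed(downloaded)]


if __name__ == "__main__":
    bundles = [
        {"id": "base", "creationToken": 1},
        {"id": "day-1", "creationToken": 2},
        {"id": "day-2", "creationToken": 3},
    ]
    prereqs = {"base": set(), "day-1": {"c1"}, "day-2": {"c2"}}
    # A fresh clone has no objects, so every bundle is needed.
    print(plan_unbundle_order(bundles, set(), prereqs.__getitem__))
    # A client that already has commit c1 only needs the two daily bundles.
    print(plan_unbundle_order(bundles, {"c1"}, prereqs.__getitem__))
```

The model ignores the case where the oldest bundle still has missing
prerequisites; the document treats that as an error condition.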

Note that during a clone we expect that all bundles will be required, and
heuristics such as `bundle.<id>.creationToken` can be used to download
bundles in chronological order or in parallel.

If a given bundle URI is a bundle list with a `bundle.heuristic`
value, then the client can choose to store that URI as its chosen bundle
URI. The client can then navigate directly to that URI during later
`git fetch` calls.

When downloading bundle URIs, the client can choose to inspect the initial
content before committing to downloading the entire content. This may
provide enough information to determine if the URI is a bundle list or
a bundle. In the case of a bundle, the client may inspect the bundle
header to determine that all advertised tips are already in the client
repository and cancel the remaining download.
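
As an illustration of inspecting the initial content, the following Python
sketch parses a bundle header from the first bytes of a download, following
the layout in linkgit:gitformat-bundle[5] (signature line, optional
`@capability` lines in version 3, `-<oid>` prerequisite lines, `<oid>
<refname>` lines, then an empty line before the pack data). The function
names are hypothetical and the sketch does not validate OIDs.

```python
def parse_bundle_header(data: bytes):
    """Parse a v2/v3 bundle header; return None if it is not complete yet."""
    end = data.find(b"\n\n")  # the header ends with an empty line
    if end < 0:
        return None  # caller should read more bytes and retry
    lines = data[:end].decode("utf-8").split("\n")
    versions = {"# v2 git bundle": 2, "# v3 git bundle": 3}
    if lines[0] not in versions:
        raise ValueError("not a bundle")
    header = {"version": versions[lines[0]], "capabilities": {},
              "prerequisites": [], "refs": {}}
    for line in lines[1:]:
        if line.startswith("@"):
            if header["version"] < 3:
                raise ValueError("capabilities require bundle version 3")
            key, _, value = line[1:].partition("=")
            header["capabilities"][key] = value
        elif line.startswith("-"):
            # "-<oid> <comment>": keep only the OID
            header["prerequisites"].append(line[1:].split(" ", 1)[0])
        else:
            oid, _, refname = line.partition(" ")
            header["refs"][refname] = oid
    return header


def tips_already_present(header, have_oids):
    """True if every advertised tip is already in the client repository."""
    return all(oid in have_oids for oid in header["refs"].values())
```

When `tips_already_present` returns true for a partially downloaded bundle,
a client following the paragraph above could cancel the rest of the
transfer.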

Fetching with Bundle URIs
-------------------------

When the client fetches new data, it can decide to fetch from bundle
servers before fetching from the origin remote. This could be done via a
command-line option, but it is more likely useful to use a config value
such as the one specified during the clone.

The fetch operation follows the same procedure to download bundles from a
bundle list (although we do _not_ want to use parallel downloads here). We
expect that the process will end when all prerequisite commit OIDs in a
thin bundle are already in the object database.

When using the `creationToken` heuristic, the client can avoid downloading
any bundles if their creation tokens are not larger than the stored
creation token. After fetching new bundles, Git updates this local
creation token.
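
The creation token check amounts to a simple comparison. In this Python
sketch the function names and the dict layout are hypothetical; the token
values are taken from the example bundle list earlier in this document.

```python
def bundles_to_fetch(bundles, stored_token):
    """Return bundles with a creation token larger than the stored one."""
    newer = [b for b in bundles if b["creationToken"] > stored_token]
    return sorted(newer, key=lambda b: b["creationToken"])


def updated_token(bundles, stored_token):
    """Return the creation token to store after fetching the given bundles."""
    return max([stored_token] + [b["creationToken"] for b in bundles])


if __name__ == "__main__":
    advertised = [
        {"id": "2022-02-09-1644442601-daily", "creationToken": 1644442601},
        {"id": "2022-02-02-1643842562", "creationToken": 1643842562},
    ]
    stored = 1643842562  # token recorded by an earlier clone or fetch
    todo = bundles_to_fetch(advertised, stored)
    print([b["id"] for b in todo])
    print(updated_token(todo, stored))
```

If no advertised token is larger than the stored one, the list is empty and
the client goes straight to the origin remote.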

If the bundle provider does not provide a heuristic, then the client
should attempt to inspect the bundle headers before downloading the full
bundle data in case the bundle tips already exist in the client
repository.

Error Conditions
----------------

If the Git client discovers something unexpected while downloading
information according to a bundle URI or the bundle list found at that
location, then Git can ignore that data and continue as if it were not
given a bundle URI. The remote Git server is the ultimate source of truth,
not the bundle URI.

Here are a few example error conditions:

* The client fails to connect with a server at the given URI or a connection
  is lost without any chance to recover.

* The client receives a 400-level response (such as `404 Not Found` or
  `401 Unauthorized`). The client should use the credential helper to
  find and provide a credential for the URI, but match the semantics of
  Git's other HTTP protocols in terms of handling specific 400-level
  errors.

* The server reports any other failure response.

* The client receives data that is not parsable as a bundle or bundle list.

* A bundle includes a filter that does not match expectations.

* The client cannot unbundle the bundles because the prerequisite commit OIDs
  are not in the object database and there are no more bundles to download.
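
The "ignore and continue" policy can be sketched as a wrapper that never
lets a bundle failure abort the operation. This Python sketch is
hypothetical throughout: `download` is a caller-supplied function, and the
sketch assumes it signals connection or HTTP failures with `OSError` and
unparsable data with `ValueError`.

```python
import sys


def bootstrap_from_bundle_uris(uris, download):
    """Try each URI in turn; return the first successful result or None."""
    for uri in uris:
        try:
            return download(uri)
        except (OSError, ValueError) as err:
            # Not fatal: the remote Git server remains the source of truth.
            print(f"warning: ignoring bundle URI {uri}: {err}", file=sys.stderr)
    return None
```

A `None` result means the caller proceeds with a normal clone or fetch from
the remote, exactly as if no bundle URI had been given. Trying the next URI
after a failure also covers the fallback described for `bundle.mode=any`.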

There are also situations that could be seen as wasteful, but are not
error conditions:

* The downloaded bundles contain more information than is requested by
  the clone or fetch request. A primary example is if the user requests
  a clone with `--single-branch` but downloads bundles that store every
  reachable commit from all `refs/heads/*` references. This might be
  initially wasteful, but perhaps these objects will become reachable by
  a later ref update that the client cares about.

* A bundle download during a `git fetch` contains objects already in the
  object database. This is probably unavoidable if we are using bundles
  for fetches, since the client will almost always be slightly ahead of
  the bundle servers after performing its "catch-up" fetch to the remote
  server. This extra work is most wasteful when the client is fetching
  much more frequently than the server is computing bundles, such as if
  the client is using hourly prefetches with background maintenance, but
  the server is computing bundles weekly. For this reason, the client
  should not use bundle URIs for fetch unless the server has explicitly
  recommended it through a `bundle.heuristic` value.

Example Bundle Provider organization
------------------------------------

The bundle URI feature is intentionally designed to be flexible to
different ways a bundle provider wants to organize the object data.
However, it can be helpful to have a complete organization model described
here so providers can start from that base.

This example organization is a simplified model of what is used by the
GVFS Cache Servers (see section near the end of this document) which have
been beneficial in speeding up clones and fetches for very large
repositories, although using extra software outside of Git.

The bundle provider deploys servers across multiple geographies. Each
server manages its own bundle set. The server can track a number of Git
repositories, but provides a bundle list for each based on a pattern. For
example, when mirroring a repository at `https://<domain>/<org>/<repo>`
the bundle server could have its bundle list available at
`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
list all of these servers under the "any" mode:

	[bundle]
		version = 1
		mode = any

	[bundle "eastus"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>

	[bundle "europe"]
		uri = https://europe.example.com/<domain>/<org>/<repo>

	[bundle "apac"]
		uri = https://apac.example.com/<domain>/<org>/<repo>

This "list of lists" is static and only changes if a bundle server is
added or removed.

Each bundle server manages its own set of bundles. The initial bundle list
contains only a single bundle, containing all of the objects received from
cloning the repository from the origin server. The list uses the
`creationToken` heuristic and a `creationToken` is made for the bundle
based on the server's timestamp.

The bundle server runs regularly-scheduled updates for the bundle list,
such as once a day. During this task, the server fetches the latest
contents from the origin server and generates a bundle containing the
objects reachable from the latest origin refs, but not contained in a
previously-computed bundle. This bundle is added to the list, with care
that the `creationToken` is strictly greater than the previous maximum
`creationToken`.

When the bundle list grows too large, say more than 30 bundles, then the
oldest "_N_ minus 30" bundles are combined into a single bundle. This
bundle's `creationToken` is equal to the maximum `creationToken` among the
merged bundles.
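
The two token rules in this scheme (a new token strictly greater than the
previous maximum, and a merged bundle taking the maximum token of its
inputs) can be sketched on token values alone. The function names are
hypothetical, and the number of bundles to merge is left as a parameter
because the provider decides the threshold.

```python
def next_creation_token(previous_max, now):
    """Pick a token for a new bundle, strictly greater than previous_max.

    `now` is the server's current timestamp. If the clock has not advanced
    past the previous maximum, fall back to previous_max + 1.
    """
    return max(now, previous_max + 1)


def merge_oldest(tokens, count):
    """Model merging the `count` oldest bundles into a single bundle.

    `tokens` holds the creationToken of each bundle in the list. The merged
    bundle keeps the maximum token among the bundles it replaces.
    """
    ordered = sorted(tokens)
    if count < 2 or count > len(ordered):
        return ordered  # nothing to merge
    merged_token = max(ordered[:count])
    return [merged_token] + ordered[count:]
```

Because the merged bundle reuses an existing token, a client whose stored
token is at or above that value does not download the merged bundle again.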

An example bundle list is provided here, although it only has two daily
bundles and not a full list of 30:

	[bundle]
		version = 1
		mode = all
		heuristic = creationToken

	[bundle "2022-02-13-1644770820-daily"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-13-1644770820-daily.bundle
		creationToken = 1644770820

	[bundle "2022-02-09-1644442601-daily"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
		creationToken = 1644442601

	[bundle "2022-02-02-1643842562"]
		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
		creationToken = 1643842562

To avoid storing and serving object data in perpetuity despite becoming
unreachable in the origin server, this bundle merge can be more careful.
Instead of taking an absolute union of the old bundles, the bundle
can be created by looking at the newer bundles and ensuring that their
necessary commits are all available in this merged bundle (or in another
one of the newer bundles). This allows "expiring" object data that is not
being used by new commits in this window of time. That data could be
reintroduced by a later push.

This data organization has two main goals. First, initial
clones of the repository become faster by downloading precomputed object
data from a closer source. Second, `git fetch` commands can be faster,
especially if the client has not fetched for a few days. However, if a
client does not fetch for 30 days, then the bundle list organization would
cause redownloading a large amount of object data.

One way to make this organization more useful to users who fetch frequently
is to have more frequent bundle creation. For example, bundles could be
created every hour, and then once a day those "hourly" bundles could be
merged into a "daily" bundle. The daily bundles are merged into the
oldest bundle after 30 days.

It is recommended that this bundle strategy is repeated with the `blob:none`
filter if clients of this repository are expecting to use blobless partial
clones. This list of blobless bundles stays in the same list as the full
bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
For very large repositories, the bundle provider may want to _only_ provide
blobless bundles.

Implementation Plan
-------------------

This design document is being submitted on its own as an aspirational
document, with the goal of implementing all of the mentioned client
features over the course of several patch series. Here is a potential
outline for submitting these features:

1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
   This will include a new `git fetch --bundle-uri` mode for use as the
   implementation underneath `git clone`. The initial version here will
   expect a single bundle at the given URI.

2. Implement the ability to parse a bundle list from a bundle URI and
   update the `git fetch --bundle-uri` logic to properly distinguish
   between `bundle.mode` options. Specifically design the feature so
   that the config format parsing feeds a list of key-value pairs into the
   bundle list logic.

3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
   bundle URIs using the key-value pairs. Plug into the existing key-value
   input to the bundle list logic. Allow `git clone` to discover these
   bundle URIs and bootstrap the client repository from the bundle data.
   (This choice is an opt-in via a config option and a command-line
   option.)

4. Allow the client to understand the `bundle.heuristic` configuration key
   and the `bundle.<id>.creationToken` heuristic. When `git clone`
   discovers a bundle URI with `bundle.heuristic`, it configures the
   client repository to check that bundle URI during later `git fetch <remote>`
   commands.

5. Allow clients to discover bundle URIs during `git fetch` and configure
   a bundle URI for later fetches if `bundle.heuristic` is set.

6. Implement the "inspect headers" heuristic to reduce data downloads when
   the `bundle.<id>.creationToken` heuristic is not available.

As these features are reviewed, this plan might be updated. We also expect
that new designs will be discovered and implemented as this feature
matures and becomes used in real-world scenarios.

Related Work: Packfile URIs
---------------------------

The Git protocol already has a capability where the Git server can list
a set of URLs along with the packfile response when serving a client
request. The client is then expected to download the packfiles at those
locations in order to have a complete understanding of the response.

This mechanism is used by the Gerrit server (implemented with JGit) and
has been effective at reducing CPU load and improving user performance for
clones.

A major downside to this mechanism is that the origin server needs to know
_exactly_ what is in those packfiles, and the packfiles need to be available
to the user for some time after the server has responded. This coupling
between the origin and the packfile data is difficult to manage.

Further, this implementation is extremely hard to make work with fetches.

Related Work: GVFS Cache Servers
--------------------------------

The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
the Git project before Git's partial clone was created. One feature of this
protocol is the idea of a "cache server" which can be colocated with build
machines or developer offices to transfer Git data without overloading the
central server.

The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
endpoint, which allows downloading an object on-demand. This is a critical
piece of the filesystem virtualization of that product.

However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
endpoint. Given an optional timestamp, the cache server responds with a list
of precomputed packfiles containing the commits and trees that were introduced
in those time intervals.

The cache server computes these "prefetch" packfiles using the following
strategy:

1. Every hour, an "hourly" pack is generated with a given timestamp.
2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
3. Nightly, all prefetch packs more than 30 days old are rolled up into
   one pack.

When a user runs `gvfs clone` or `scalar clone` against a repo with cache
servers, the client requests all prefetch packfiles, which is at most
`24 + 30 + 1` packfiles downloading only commits and trees. The client
then follows with a request to the origin server for the references, and
attempts to check out that tip reference. (There is an extra endpoint that
helps get all reachable trees from a given commit, in case that commit
was not already in a prefetch packfile.)

During a `git fetch`, a hook requests the prefetch endpoint using the
most-recent timestamp from a previously-downloaded prefetch packfile.
Only the packfiles with later timestamps are downloaded. Most
users fetch hourly, so they get at most one hourly prefetch pack. Users
whose machines have been off or otherwise have not fetched in over 30 days
might redownload all prefetch packfiles. This is rare.

It is important to note that the clients always contact the origin server
for the refs advertisement, so the refs are frequently "ahead" of the
prefetched pack data. The missing objects are downloaded on-demand using
the `GET /gvfs/objects/{oid}` requests, when needed by a command such as
`git checkout` or `git log`. Some Git optimizations disable checks that
would cause these on-demand downloads to be too aggressive.

See Also
--------

[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
    An earlier RFC for a bundle URI feature.

[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
    The GVFS Protocol