Commit | Line | Data |
---|---|---|
2da14fad DS |
1 | Bundle URIs |
2 | =========== | |
3 | ||
4 | Git bundles are files that store a pack-file along with some extra metadata, | |
5 | including a set of refs and a (possibly empty) set of necessary commits. See | |
086eaab8 | 6 | linkgit:git-bundle[1] and linkgit:gitformat-bundle[5] for more information. |
2da14fad DS |
7 | |
8 | Bundle URIs are locations where Git can download one or more bundles in | |
9 | order to bootstrap the object database in advance of fetching the remaining | |
10 | objects from a remote. | |
11 | ||
12 | One goal is to speed up clones and fetches for users with poor network | |
13 | connectivity to the origin server. Another benefit is to allow heavy users, | |
14 | such as CI build farms, to use local resources for the majority of Git data | |
15 | and thereby reduce the load on the origin server. | |
16 | ||
17 | To enable the bundle URI feature, users can specify a bundle URI using | |
18 | command-line options or the origin server can advertise one or more URIs | |
19 | via a protocol v2 capability. | |
20 | ||
21 | Design Goals | |
22 | ------------ | |
23 | ||
24 | The bundle URI standard aims to be flexible enough to satisfy multiple | |
25 | workloads. The bundle provider and the Git client have several choices in | |
26 | how they create and consume bundle URIs. | |
27 | ||
28 | * Bundles can have whatever name the server desires. This name could refer | |
29 | to immutable data by using a hash of the bundle contents. However, this | |
30 | means that a new URI will be needed after every update of the content. | |
31 | This might be acceptable if the server is advertising the URI (and the | |
32 | server is aware of new bundles being generated) but would not be | |
33 | ergonomic for users using the command line option. | |
34 | ||
35 | * The bundles could be organized specifically for bootstrapping full | |
36 | clones, but could also be organized with the intention of bootstrapping | |
37 | incremental fetches. The bundle provider must decide on one of several | |
38 | organization schemes to minimize client downloads during incremental | |
39 | fetches, but the Git client can also choose whether to use bundles for | |
40 | either of these operations. | |
41 | ||
42 | * The bundle provider can choose to support full clones, partial clones, | |
43 | or both. The client can detect which bundles are appropriate for the | |
44 | repository's partial clone filter, if any. | |
45 | ||
46 | * The bundle provider can use a single bundle (for clones only), or a | |
47 | list of bundles. When using a list of bundles, the provider can specify | |
48 | whether or not the client needs _all_ of the bundle URIs for a full | |
49 | clone, or if _any_ one of the bundle URIs is sufficient. This allows the | |
50 | bundle provider to use different URIs for different geographies. | |
51 | ||
52 | * The bundle provider can organize the bundles using heuristics, such as | |
53 | creation tokens, to help the client avoid downloading bundles it does | |
54 | not need. When the bundle provider does not provide these heuristics, | |
55 | the client can use optimizations to minimize how much of the data is | |
56 | downloaded. | |
57 | ||
58 | * The bundle provider does not need to be associated with the Git server. | |
59 | The client can choose to use the bundle provider without it being | |
60 | advertised by the Git server. | |
61 | ||
62 | * The client can choose to discover bundle providers that are advertised | |
63 | by the Git server. This could happen during `git clone`, during | |
64 | `git fetch`, both, or neither. The user can choose which combination | |
65 | works best for them. | |
66 | ||
67 | * The client can choose to configure a bundle provider manually at any | |
68 | time. The client can also choose to specify a bundle provider manually | |
69 | as a command-line option to `git clone`. | |
70 | ||
71 | Each repository is different and every Git server has different needs. | |
72 | Hopefully the bundle URI feature is flexible enough to satisfy all needs. | |
73 | If not, then the feature can be extended through its versioning mechanism. | |
74 | ||
75 | Server requirements | |
76 | ------------------- | |
77 | ||
78 | To provide a server-side implementation of bundle servers, no other parts | |
79 | of the Git protocol are required. This allows server maintainers to use | |
80 | static content solutions such as CDNs in order to serve the bundle files. | |
81 | ||
82 | At the current scope of the bundle URI feature, all URIs are expected to | |
83 | be HTTP(S) URLs where content is downloaded to a local file using a `GET` | |
84 | request to that URL. The server could include authentication requirements | |
85 | to those requests with the aim of triggering the configured credential | |
86 | helper for secure access. (Future extensions could use "file://" URIs or | |
87 | SSH URIs.) | |
88 | ||
89 | Assuming a `200 OK` response from the server, the content at the URL is | |
90 | inspected. First, Git attempts to parse the file as a bundle file of | |
91 | version 2 or higher. If the file is not a bundle, then the file is parsed | |
92 | as a plain-text file using Git's config parser. The key-value pairs in | |
93 | that config file are expected to describe a list of bundle URIs. If | |
94 | neither of these parse attempts succeeds, then Git will report an error to | |
95 | the user that the bundle URI provided erroneous data. | |
96 | ||
97 | Any other data provided by the server is considered erroneous. | |
98 | ||
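The inspection order described above can be sketched as follows. This is a minimal illustration, not Git's implementation: the function name `classify_download` is hypothetical, the signature lines are those documented in linkgit:gitformat-bundle[5], and the search for a `[bundle]` section header is only a stand-in for running Git's real config parser.

```python
# Signature lines that open a version 2 or version 3 bundle file.
BUNDLE_SIGNATURES = (b"# v2 git bundle\n", b"# v3 git bundle\n")


def classify_download(data: bytes) -> str:
    """Return "bundle", "bundle-list", or "error" for downloaded content."""
    # First attempt: does the content start like a bundle file?
    if data.startswith(BUNDLE_SIGNATURES):
        return "bundle"
    # Second attempt: treat the content as plain text in config format.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "error"
    # Assumption: a usable bundle list has a [bundle] section header.
    for line in text.splitlines():
        if line.strip().lower() == "[bundle]":
            return "bundle-list"
    # Neither parse attempt succeeded.
    return "error"
```

A client following the Error Conditions section would treat the `"error"` result as a reason to ignore the bundle URI and continue with the origin server.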
99 | Bundle Lists | |
100 | ------------ | |
101 | ||
102 | The Git server can advertise bundle URIs using a set of `key=value` pairs. | |
103 | A bundle URI can also serve a plain-text file in the Git config format | |
104 | containing these same `key=value` pairs. In both cases, we consider this | |
105 | to be a _bundle list_. The pairs specify information about the bundles | |
106 | that the client can use to make decisions for which bundles to download | |
107 | and which to ignore. | |
108 | ||
109 | A few keys focus on properties of the list itself. | |
110 | ||
111 | bundle.version:: | |
112 | (Required) This value provides a version number for the bundle | |
113 | list. If a future Git change enables a feature that needs the Git | |
114 | client to react to a new key in the bundle list file, then this version | |
115 | will increment. The only current version number is 1, and if any other | |
116 | value is specified then Git will fail to use this file. | |
117 | ||
118 | bundle.mode:: | |
119 | (Required) This value is one of `all` or `any`. When `all` | |
120 | is specified, then the client should expect to need all of the listed | |
121 | bundle URIs that match their repository's requirements. When `any` is | |
122 | specified, then the client should expect that any one of the bundle URIs | |
123 | that match their repository's requirements will suffice. Typically, the | |
124 | `any` option is used to list a number of different bundle servers | |
125 | located in different geographies. | |
126 | ||
127 | bundle.heuristic:: | |
128 | If this string-valued key exists, then the bundle list is designed to | |
129 | work well with incremental `git fetch` commands. The heuristic signals | |
130 | that there are additional keys available for each bundle that help | |
131 | determine which subset of bundles the client should download. The only | |
132 | heuristic currently planned is `creationToken`. | |
133 | ||
134 | The remaining keys include an `<id>` segment which is a server-designated | |
135 | name for each available bundle. The `<id>` must contain only alphanumeric | |
136 | and `-` characters. | |
137 | ||
138 | bundle.<id>.uri:: | |
139 | (Required) This string value is the URI for downloading bundle `<id>`. | |
140 | If the URI begins with a protocol (`http://` or `https://`) then the URI | |
141 | is absolute. Otherwise, the URI is interpreted as relative to the URI | |
142 | used for the bundle list. If the URI begins with `/`, then that relative | |
143 | path is relative to the domain name used for the bundle list. (This use | |
144 | of relative paths is intended to make it easier to distribute a set of | |
145 | bundles across a large number of servers or CDNs with different domain | |
146 | names.) | |
147 | ||
148 | bundle.<id>.filter:: | |
149 | This string value represents an object filter that should also appear in | |
150 | the header of this bundle. The server uses this value to differentiate | |
151 | different kinds of bundles from which the client can choose those that | |
152 | match their object filters. | |
153 | ||
154 | bundle.<id>.creationToken:: | |
155 | This value is a nonnegative 64-bit integer used for sorting the bundle | |
7190b7eb DS |
156 | list. This is used to download a subset of bundles during a fetch when |
157 | `bundle.heuristic=creationToken`. | |
2da14fad DS |
158 | |
159 | bundle.<id>.location:: | |
160 | This string value advertises a real-world location from where the bundle | |
161 | URI is served. This can be used to present the user with an option for | |
162 | which bundle URI to use or simply as an informative indicator of which | |
163 | bundle URI was selected by Git. This is only valuable when | |
164 | `bundle.mode` is `any`. | |
165 | ||
166 | Here is an example bundle list using the Git config format: | |
167 | ||
168 | [bundle] | |
169 | version = 1 | |
170 | mode = all | |
171 | heuristic = creationToken | |
172 | ||
173 | [bundle "2022-02-09-1644442601-daily"] | |
174 | uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle | |
175 | creationToken = 1644442601 | |
176 | ||
177 | [bundle "2022-02-02-1643842562"] | |
178 | uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle | |
179 | creationToken = 1643842562 | |
180 | ||
181 | [bundle "2022-02-09-1644442631-daily-blobless"] | |
182 | uri = 2022-02-09-1644442631-daily-blobless.bundle | |
183 | creationToken = 1644442631 | |
184 | filter = blob:none | |
185 | ||
186 | [bundle "2022-02-02-1643842568-blobless"] | |
187 | uri = /git/git/2022-02-02-1643842568-blobless.bundle | |
188 | creationToken = 1643842568 | |
189 | filter = blob:none | |
190 | ||
191 | This example uses `bundle.mode=all` as well as the | |
192 | `bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter` | |
193 | options to present two parallel sets of bundles: one for full clones and | |
194 | another for blobless partial clones. | |
195 | ||
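To make the structure of such a list concrete, here is a small sketch that reads a bundle list and selects the bundles matching a repository's partial clone filter. It is a simplified reader written for illustration only, not Git's config parser: the names `parse_bundle_list` and `matching_bundles` are hypothetical, key names are lowercased (so `creationToken` is stored as `creationtoken`), and quoting, escapes, and multi-line values are not handled.

```python
import re

# Matches "[bundle]" and '[bundle "<id>"]' section headers.
SECTION_RE = re.compile(r'^\[(\w+)(?:\s+"([^"]*)")?\]$')

EXAMPLE = '''
[bundle]
    version = 1
    mode = all
    heuristic = creationToken

[bundle "2022-02-02-1643842562"]
    uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
    creationToken = 1643842562

[bundle "2022-02-02-1643842568-blobless"]
    uri = /git/git/2022-02-02-1643842568-blobless.bundle
    creationToken = 1643842568
    filter = blob:none
'''


def parse_bundle_list(text):
    """Return (list_keys, bundles) for a bundle list in config format."""
    list_keys, bundles = {}, {}
    current = None  # dict that receives keys of the current section
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        header = SECTION_RE.match(line)
        if header:
            section, bundle_id = header.group(1).lower(), header.group(2)
            if section != "bundle":
                current = None  # ignore unrelated sections
            elif bundle_id is None:
                current = list_keys
            else:
                current = bundles.setdefault(bundle_id, {})
            continue
        if current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip().lower()] = value.strip()
    return list_keys, bundles


def matching_bundles(bundles, repo_filter=None):
    """Keep bundles whose filter equals the repository's filter (or none)."""
    return {i: b for i, b in bundles.items() if b.get("filter") == repo_filter}
```

With this sketch, `matching_bundles(bundles, "blob:none")` keeps only the blobless bundle, while `matching_bundles(bundles)` keeps only the bundle without a filter.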
196 | Suppose that this bundle list was found at the URI | |
197 | `https://bundles.example.com/git/git/`. Then the two blobless bundles have | |
198 | the following fully-expanded URIs: | |
199 | ||
200 | * `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle` | |
201 | * `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle` | |
202 | ||
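The expansion of relative URIs shown above follows ordinary URL reference resolution. As a rough illustration, Python's standard `urllib.parse.urljoin` produces the same results for these examples; the helper name `expand_bundle_uri` is hypothetical, and Git's own resolution code may differ in edge cases.

```python
from urllib.parse import urljoin


def expand_bundle_uri(list_uri, bundle_uri):
    """Resolve a bundle.<id>.uri value against the bundle list's URI."""
    # Absolute URIs are returned unchanged; a leading "/" is resolved
    # against the domain; anything else is relative to the list URI.
    return urljoin(list_uri, bundle_uri)


LIST_URI = "https://bundles.example.com/git/git/"
daily = expand_bundle_uri(LIST_URI, "2022-02-09-1644442631-daily-blobless.bundle")
weekly = expand_bundle_uri(LIST_URI, "/git/git/2022-02-02-1643842568-blobless.bundle")
```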
203 | Advertising Bundle URIs | |
204 | ----------------------- | |
205 | ||
206 | If a user knows a bundle URI for the repository they are cloning, then | |
207 | they can specify that URI manually through a command-line option. However, | |
208 | a Git host may want to advertise bundle URIs during the clone operation, | |
209 | helping users unaware of the feature. | |
210 | ||
211 | The only thing required for this feature is that the server can advertise | |
212 | one or more bundle URIs. This advertisement takes the form of a new | |
213 | protocol v2 capability specifically for discovering bundle URIs. | |
214 | ||
215 | The client could choose an arbitrary bundle URI as an option _or_ select | |
216 | the URI with best performance by some exploratory checks. It is up to the | |
217 | bundle provider to decide if having multiple URIs is preferable to a | |
218 | single URI that is geodistributed through server-side infrastructure. | |
219 | ||
220 | Cloning with Bundle URIs | |
221 | ------------------------ | |
222 | ||
223 | The primary need for bundle URIs is to speed up clones. The Git client | |
224 | will interact with bundle URIs according to the following flow: | |
225 | ||
226 | 1. The user specifies a bundle URI with the `--bundle-uri` command-line | |
227 | option _or_ the client discovers a bundle list advertised by the | |
228 | Git server. | |
229 | ||
230 | 2. If the downloaded data from a bundle URI is a bundle, then the client | |
231 | inspects the bundle headers to check that the prerequisite commit OIDs | |
232 | are present in the client repository. If some are missing, then the | |
233 | client delays unbundling until other bundles have been unbundled, | |
234 | making those OIDs present. When all required OIDs are present, the | |
235 | client unbundles that data using a refspec. The default refspec is | |
236 | `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs | |
7190b7eb DS |
237 | are stored so that later `git fetch` negotiations can communicate each |
238 | bundled ref as a `have`, reducing the size of the fetch over the Git | |
2da14fad DS |
239 | protocol. To allow pruning refs from this ref namespace, Git may |
240 | introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that | |
241 | stale bundle refs can be deleted. | |
242 | ||
243 | 3. If the file is instead a bundle list, then the client inspects the | |
244 | `bundle.mode` to see if the list is of the `all` or `any` form. | |
245 | ||
246 | a. If `bundle.mode=all`, then the client considers all bundle | |
247 | URIs. The list is reduced based on the `bundle.<id>.filter` options | |
248 | matching the client repository's partial clone filter. Then, all | |
249 | bundle URIs are requested. If the `bundle.<id>.creationToken` | |
250 | heuristic is provided, then the bundles are downloaded in decreasing | |
251 | order by the creation token, stopping when a bundle has all required | |
252 | OIDs. The bundles can then be unbundled in increasing creation token | |
253 | order. The client stores the latest creation token as a heuristic | |
254 | for avoiding future downloads if the bundle list does not advertise | |
255 | bundles with larger creation tokens. | |
256 | ||
257 | b. If `bundle.mode=any`, then the client can choose any one of the | |
258 | bundle URIs to inspect. The client can use a variety of ways to | |
259 | choose among these URIs. The client can also fall back to another URI | |
260 | if the initial choice fails to return a result. | |
261 | ||
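The `creationToken` ordering in step 3a can be sketched as below. This is a simplified model under stated assumptions, not Git's implementation: each bundle is represented as a dictionary with hypothetical `id`, `creationToken`, and `prereqs` fields, and `have` is the set of commit OIDs already in the client repository. Real bundles would be downloaded and their headers inspected to learn the prerequisites.

```python
def plan_bundle_downloads(bundles, have):
    """Return (download_order, unbundle_order) as lists of bundle ids."""
    downloaded = []
    # Download in decreasing creation token order.
    for bundle in sorted(bundles, key=lambda b: b["creationToken"], reverse=True):
        downloaded.append(bundle)
        # Stop once a bundle's prerequisite commits are all present locally.
        if bundle["prereqs"] <= have:
            break
    download_order = [b["id"] for b in downloaded]
    # Unbundle in increasing creation token order.
    unbundle_order = list(reversed(download_order))
    return download_order, unbundle_order


BUNDLES = [
    {"id": "base", "creationToken": 1, "prereqs": set()},
    {"id": "day2", "creationToken": 2, "prereqs": {"c1"}},
    {"id": "day3", "creationToken": 3, "prereqs": {"c2"}},
]
```

For a fresh clone, `have` is empty, so every bundle down to `base` is downloaded. A client that already has commit `c1` stops after `day2` and never requests `base`.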
262 | Note that during a clone we expect that all bundles will be required, and | |
263 | heuristics such as `bundle.<id>.creationToken` can be used to download | |
264 | bundles in chronological order or in parallel. | |
265 | ||
266 | If a given bundle URI is a bundle list with a `bundle.heuristic` | |
267 | value, then the client can choose to store that URI as its chosen bundle | |
268 | URI. The client can then navigate directly to that URI during later `git | |
269 | fetch` calls. | |
270 | ||
271 | When downloading bundle URIs, the client can choose to inspect the initial | |
272 | content before committing to downloading the entire content. This may | |
273 | provide enough information to determine if the URI is a bundle list or | |
274 | a bundle. In the case of a bundle, the client may inspect the bundle | |
275 | header to determine that all advertised tips are already in the client | |
276 | repository and cancel the remaining download. | |
277 | ||
278 | Fetching with Bundle URIs | |
279 | ------------------------- | |
280 | ||
281 | When the client fetches new data, it can decide to fetch from bundle | |
282 | servers before fetching from the origin remote. This could be done via a | |
283 | command-line option, but it is more likely useful to use a config value | |
284 | such as the one specified during the clone. | |
285 | ||
286 | The fetch operation follows the same procedure to download bundles from a | |
287 | bundle list (although we do _not_ want to use parallel downloads here). We | |
288 | expect that the process will end when all prerequisite commit OIDs in a | |
289 | thin bundle are already in the object database. | |
290 | ||
291 | When using the `creationToken` heuristic, the client can avoid downloading | |
bbb0c357 | 292 | any bundles if their creation tokens are not larger than the stored |
2da14fad DS |
293 | creation token. After fetching new bundles, Git updates this local |
294 | creation token. | |
295 | ||
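The stored creation token check can be illustrated with a short sketch. The function name `select_new_bundles` and the dictionary layout are assumptions made for this example; how Git persists the stored token is not shown here.

```python
def select_new_bundles(bundles, stored_token):
    """Return (bundles_to_download, new_stored_token)."""
    # Only bundles with a strictly larger creation token are of interest.
    newer = sorted(
        (b for b in bundles if b["creationToken"] > stored_token),
        key=lambda b: b["creationToken"],
    )
    # After fetching, the largest token seen becomes the stored token.
    new_stored = newer[-1]["creationToken"] if newer else stored_token
    return newer, new_stored


ADVERTISED = [
    {"id": "2022-02-02-1643842562", "creationToken": 1643842562},
    {"id": "2022-02-09-1644442601-daily", "creationToken": 1644442601},
    {"id": "2022-02-13-1644770820-daily", "creationToken": 1644770820},
]
```

A client whose stored token is `1644442601` downloads only the newest daily bundle. A client whose stored token is already `1644770820` downloads nothing from the bundle server and proceeds directly to the origin remote.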
296 | If the bundle provider does not provide a heuristic, then the client | |
297 | should attempt to inspect the bundle headers before downloading the full | |
298 | bundle data in case the bundle tips already exist in the client | |
299 | repository. | |
300 | ||
301 | Error Conditions | |
302 | ---------------- | |
303 | ||
304 | If the Git client discovers something unexpected while downloading | |
305 | information according to a bundle URI or the bundle list found at that | |
306 | location, then Git can ignore that data and continue as if it was not | |
307 | given a bundle URI. The remote Git server is the ultimate source of truth, | |
308 | not the bundle URI. | |
309 | ||
310 | Here are a few example error conditions: | |
311 | ||
312 | * The client fails to connect with a server at the given URI or a connection | |
313 | is lost without any chance to recover. | |
314 | ||
315 | * The client receives a 400-level response (such as `404 Not Found` or | |
316 | `401 Unauthorized`). The client should use the credential helper to | |
317 | find and provide a credential for the URI, but match the semantics of | |
318 | Git's other HTTP protocols in terms of handling specific 400-level | |
319 | errors. | |
320 | ||
bbb0c357 | 321 | * The server reports any other failure response. |
2da14fad DS |
322 | |
323 | * The client receives data that is not parsable as a bundle or bundle list. | |
324 | ||
325 | * A bundle includes a filter that does not match expectations. | |
326 | ||
327 | * The client cannot unbundle the bundles because the prerequisite commit OIDs | |
328 | are not in the object database and there are no more bundles to download. | |
329 | ||
330 | There are also situations that could be seen as wasteful, but are not | |
331 | error conditions: | |
332 | ||
333 | * The downloaded bundles contain more information than is requested by | |
334 | the clone or fetch request. A primary example is if the user requests | |
335 | a clone with `--single-branch` but downloads bundles that store every | |
336 | reachable commit from all `refs/heads/*` references. This might be | |
337 | initially wasteful, but perhaps these objects will become reachable by | |
338 | a later ref update that the client cares about. | |
339 | ||
340 | * A bundle download during a `git fetch` contains objects already in the | |
341 | object database. This is probably unavoidable if we are using bundles | |
342 | for fetches, since the client will almost always be slightly ahead of | |
343 | the bundle servers after performing its "catch-up" fetch to the remote | |
344 | server. This extra work is most wasteful when the client is fetching | |
345 | much more frequently than the server is computing bundles, such as if | |
346 | the client is using hourly prefetches with background maintenance, but | |
347 | the server is computing bundles weekly. For this reason, the client | |
348 | should not use bundle URIs for fetch unless the server has explicitly | |
349 | recommended it through a `bundle.heuristic` value. | |
350 | ||
d06ed85d DS |
351 | Example Bundle Provider organization |
352 | ------------------------------------ | |
353 | ||
354 | The bundle URI feature is intentionally designed to be flexible to | |
355 | different ways a bundle provider wants to organize the object data. | |
356 | However, it can be helpful to have a complete organization model described | |
357 | here so providers can start from that base. | |
358 | ||
359 | This example organization is a simplified model of what is used by the | |
360 | GVFS Cache Servers (see section near the end of this document) which have | |
361 | been beneficial in speeding up clones and fetches for very large | |
362 | repositories, although using extra software outside of Git. | |
363 | ||
364 | The bundle provider deploys servers across multiple geographies. Each | |
365 | server manages its own bundle set. The server can track a number of Git | |
366 | repositories, but provides a bundle list for each based on a pattern. For | |
367 | example, when mirroring a repository at `https://<domain>/<org>/<repo>` | |
368 | the bundle server could have its bundle list available at | |
369 | `https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can | |
370 | list all of these servers under the "any" mode: | |
371 | ||
372 | [bundle] | |
373 | version = 1 | |
374 | mode = any | |
375 | ||
376 | [bundle "eastus"] | |
377 | uri = https://eastus.example.com/<domain>/<org>/<repo> | |
378 | ||
379 | [bundle "europe"] | |
380 | uri = https://europe.example.com/<domain>/<org>/<repo> | |
381 | ||
382 | [bundle "apac"] | |
383 | uri = https://apac.example.com/<domain>/<org>/<repo> | |
384 | ||
385 | This "list of lists" is static and only changes if a bundle server is | |
386 | added or removed. | |
387 | ||
388 | Each bundle server manages its own set of bundles. The initial bundle list | |
389 | contains only a single bundle, containing all of the objects received from | |
390 | cloning the repository from the origin server. The list uses the | |
391 | `creationToken` heuristic and a `creationToken` is made for the bundle | |
392 | based on the server's timestamp. | |
393 | ||
394 | The bundle server runs regularly-scheduled updates for the bundle list, | |
395 | such as once a day. During this task, the server fetches the latest | |
396 | contents from the origin server and generates a bundle containing the | |
397 | objects reachable from the latest origin refs, but not contained in a | |
398 | previously-computed bundle. This bundle is added to the list, with care | |
399 | that the `creationToken` is strictly greater than the previous maximum | |
400 | `creationToken`. | |
401 | ||
402 | When the bundle list grows too large, say more than 30 bundles, then the | |
403 | oldest "_N_ minus 30" bundles are combined into a single bundle. This | |
404 | bundle's `creationToken` is equal to the maximum `creationToken` among the | |
405 | merged bundles. | |
406 | ||
407 | An example bundle list is provided here, although it only has two daily | |
408 | bundles and not a full list of 30: | |
409 | ||
410 | [bundle] | |
411 | version = 1 | |
412 | mode = all | |
413 | heuristic = creationToken | |
414 | ||
415 | [bundle "2022-02-13-1644770820-daily"] | |
416 | uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-13-1644770820-daily.bundle | |
417 | creationToken = 1644770820 | |
418 | ||
419 | [bundle "2022-02-09-1644442601-daily"] | |
420 | uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle | |
421 | creationToken = 1644442601 | |
422 | ||
423 | [bundle "2022-02-02-1643842562"] | |
424 | uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle | |
425 | creationToken = 1643842562 | |
426 | ||
427 | To avoid storing and serving object data in perpetuity despite becoming | |
428 | unreachable in the origin server, this bundle merge can be more careful. | |
429 | Instead of taking an absolute union of the old bundles, the bundle | |
430 | can be created by looking at the newer bundles and ensuring that their | |
431 | necessary commits are all available in this merged bundle (or in another | |
432 | one of the newer bundles). This allows "expiring" object data that is not | |
433 | being used by new commits in this window of time. That data could be | |
434 | reintroduced by a later push. | |
435 | ||
436 | This data organization has two main goals. First, initial | |
437 | clones of the repository become faster by downloading precomputed object | |
438 | data from a closer source. Second, `git fetch` commands can be faster, | |
439 | especially if the client has not fetched for a few days. However, if a | |
440 | client does not fetch for 30 days, then the bundle list organization would | |
441 | cause redownloading a large amount of object data. | |
442 | ||
443 | One way to make this organization more useful to users who fetch frequently | |
444 | is to have more frequent bundle creation. For example, bundles could be | |
445 | created every hour, and then once a day those "hourly" bundles could be | |
446 | merged into a "daily" bundle. The daily bundles are merged into the | |
447 | oldest bundle after 30 days. | |
448 | ||
bbb0c357 | 449 | It is recommended that this bundle strategy is repeated with the `blob:none` |
d06ed85d DS |
450 | filter if clients of this repository are expecting to use blobless partial |
451 | clones. This list of blobless bundles stays in the same list as the full | |
452 | bundles, but uses the `bundle.<id>.filter` key to separate the two groups. | |
453 | For very large repositories, the bundle provider may want to _only_ provide | |
454 | blobless bundles. | |
455 | ||
2da14fad DS |
456 | Implementation Plan |
457 | ------------------- | |
458 | ||
459 | This design document is being submitted on its own as an aspirational | |
460 | document, with the goal of implementing all of the mentioned client | |
461 | features over the course of several patch series. Here is a potential | |
462 | outline for submitting these features: | |
463 | ||
464 | 1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option. | |
465 | This will include a new `git fetch --bundle-uri` mode for use as the | |
466 | implementation underneath `git clone`. The initial version here will | |
467 | expect a single bundle at the given URI. | |
468 | ||
469 | 2. Implement the ability to parse a bundle list from a bundle URI and | |
470 | update the `git fetch --bundle-uri` logic to properly distinguish | |
471 | between `bundle.mode` options. Specifically design the feature so | |
472 | that the config format parsing feeds a list of key-value pairs into the | |
473 | bundle list logic. | |
474 | ||
475 | 3. Create the `bundle-uri` protocol v2 command so Git servers can advertise | |
476 | bundle URIs using the key-value pairs. Plug into the existing key-value | |
477 | input to the bundle list logic. Allow `git clone` to discover these | |
478 | bundle URIs and bootstrap the client repository from the bundle data. | |
479 | (This choice is an opt-in via a config option and a command-line | |
480 | option.) | |
481 | ||
0524ad35 | 482 | 4. Allow the client to understand the `bundle.heuristic` configuration key |
2da14fad | 483 | and the `bundle.<id>.creationToken` heuristic. When `git clone` |
0524ad35 DS |
484 | discovers a bundle URI with `bundle.heuristic`, it configures the client |
485 | repository to check that bundle URI during later `git fetch <remote>` | |
2da14fad DS |
486 | commands. |
487 | ||
488 | 5. Allow clients to discover bundle URIs during `git fetch` and configure | |
0524ad35 | 489 | a bundle URI for later fetches if `bundle.heuristic` is set. |
2da14fad DS |
490 | |
491 | 6. Implement the "inspect headers" heuristic to reduce data downloads when | |
492 | the `bundle.<id>.creationToken` heuristic is not available. | |
493 | ||
494 | As these features are reviewed, this plan might be updated. We also expect | |
495 | that new designs will be discovered and implemented as this feature | |
496 | matures and becomes used in real-world scenarios. | |
497 | ||
498 | Related Work: Packfile URIs | |
499 | --------------------------- | |
500 | ||
501 | The Git protocol already has a capability where the Git server can list | |
502 | a set of URLs along with the packfile response when serving a client | |
503 | request. The client is then expected to download the packfiles at those | |
504 | locations in order to have a complete understanding of the response. | |
505 | ||
506 | This mechanism is used by the Gerrit server (implemented with JGit) and | |
507 | has been effective at reducing CPU load and improving user performance for | |
508 | clones. | |
509 | ||
510 | A major downside to this mechanism is that the origin server needs to know | |
511 | _exactly_ what is in those packfiles, and the packfiles need to be available | |
512 | to the user for some time after the server has responded. This coupling | |
513 | between the origin and the packfile data is difficult to manage. | |
514 | ||
515 | Further, this implementation is extremely hard to make work with fetches. | |
516 | ||
517 | Related Work: GVFS Cache Servers | |
518 | -------------------------------- | |
519 | ||
520 | The GVFS Protocol [2] is a set of HTTP endpoints designed independently of | |
521 | the Git project before Git's partial clone was created. One feature of this | |
522 | protocol is the idea of a "cache server" which can be colocated with build | |
523 | machines or developer offices to transfer Git data without overloading the | |
524 | central server. | |
525 | ||
526 | The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}` | |
527 | endpoint, which allows downloading an object on-demand. This is a critical | |
528 | piece of the filesystem virtualization of that product. | |
529 | ||
530 | However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>` | |
531 | endpoint. Given an optional timestamp, the cache server responds with a list | |
532 | of precomputed packfiles containing the commits and trees that were introduced | |
533 | in those time intervals. | |
534 | ||
535 | The cache server computes these "prefetch" packfiles using the following | |
536 | strategy: | |
537 | ||
538 | 1. Every hour, an "hourly" pack is generated with a given timestamp. | |
539 | 2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack. | |
540 | 3. Nightly, all prefetch packs more than 30 days old are rolled up into | |
541 | one pack. | |
542 | ||
543 | When a user runs `gvfs clone` or `scalar clone` against a repo with cache | |
544 | servers, the client requests all prefetch packfiles, which is at most | |
545 | `24 + 30 + 1` packfiles downloading only commits and trees. The client | |
546 | then follows with a request to the origin server for the references, and | |
547 | attempts to check out that tip reference. (There is an extra endpoint that | |
548 | helps get all reachable trees from a given commit, in case that commit | |
549 | was not already in a prefetch packfile.) | |
550 | ||
551 | During a `git fetch`, a hook requests the prefetch endpoint using the | |
552 | most-recent timestamp from a previously-downloaded prefetch packfile. | |
553 | Only the packfiles with later timestamps are downloaded. Most | |
554 | users fetch hourly, so they get at most one hourly prefetch pack. Users | |
555 | whose machines have been off or otherwise have not fetched in over 30 days | |
556 | might redownload all prefetch packfiles. This is rare. | |
557 | ||
558 | It is important to note that the clients always contact the origin server | |
559 | for the refs advertisement, so the refs are frequently "ahead" of the | |
560 | prefetched pack data. The missing objects are downloaded on-demand using | |
561 | the `GET /gvfs/objects/{oid}` requests, when needed by a command such as | |
562 | `git checkout` or `git log`. Some Git optimizations disable checks that | |
563 | would cause these on-demand downloads to be too aggressive. | |
564 | ||
565 | See Also | |
566 | -------- | |
567 | ||
568 | [1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/ | |
569 | An earlier RFC for a bundle URI feature. | |
570 | ||
571 | [2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md | |
572 | The GVFS Protocol |