Commit | Line | Data |
---|---|---|
752414ae JN |
1 | Git hash function transition |
2 | ============================ | |
3 | ||
4 | Objective | |
5 | --------- | |
6 | Migrate Git from SHA-1 to a stronger hash function. | |
7 | ||
8 | Background | |
9 | ---------- | |
10 | At its core, the Git version control system is a content addressable | |
11 | filesystem. It uses the SHA-1 hash function to name content. For | |
12 | example, files, directories, and revisions are referred to by hash | |
13 | values unlike in other traditional version control systems where files | |
14 | or versions are referred to via sequential numbers. The use of a hash | |
15 | function to address its content delivers a few advantages: | |
16 | ||
17 | * Integrity checking is easy. Bit flips, for example, are easily | |
18 | detected, as the hash of corrupted content does not match its name. | |
19 | * Lookup of objects is fast. | |
20 | ||
21 | Using a cryptographically secure hash function brings additional | |
22 | advantages: | |
23 | ||
24 | * Object names can be signed and third parties can trust the hash to | |
25 | address the signed object and all objects it references. | |
26 | * Communication using Git protocol and out of band communication | |
27 | methods have a short string that can be used to reliably | |
28 | address stored content. | |
29 | ||
30 | Over time some flaws in SHA-1 have been discovered by security | |
5988eb63 ÆAB |
31 | researchers. On 23 February 2017 the SHAttered attack |
32 | (https://shattered.io) demonstrated a practical SHA-1 hash collision. | |
33 | ||
34 | Git v2.13.0 and later subsequently moved to a hardened SHA-1 | |
35 | implementation by default, which isn't vulnerable to the SHAttered | |
36 | attack. | |
37 | ||
38 | Thus Git has in effect already migrated to a new hash that isn't SHA-1 | |
39 | and doesn't share its vulnerabilities; its new hash function just | |
40 | happens to produce exactly the same output for all known inputs, | |
41 | except two PDFs published by the SHAttered researchers, and the new | |
42 | implementation (written by those researchers) claims to detect future | |
43 | cryptanalytic collision attacks. | |
44 | ||
45 | Regardless, it's considered prudent to move past any variant of SHA-1 | |
46 | to a new hash. There's no guarantee that new attacks on SHA-1 won't | |
47 | be published in the future, and those attacks may not have viable | |
48 | mitigations. | |
49 | ||
50 | If SHA-1 and its variants were to be truly broken, Git's hash function | |
51 | could not be considered cryptographically secure any more. This would | |
52 | impact the communication of hash values because we could not trust | |
53 | that a given hash value represented the known good version of content | |
54 | that the speaker intended. | |
752414ae JN |
55 | |
56 | SHA-1 still possesses the other properties such as fast object lookup | |
57 | and safe error checking, but other hash functions that are believed | |
58 | to be cryptographically secure are equally suitable. | |
59 | ||
60 | Goals | |
61 | ----- | |
0ed8d8da | 62 | 1. The transition to SHA-256 can be done one local repository at a time. |
752414ae | 63 | a. Requiring no action by any other party. |
0ed8d8da | 64 | b. A SHA-256 repository can communicate with SHA-1 Git servers |
752414ae | 65 | (push/fetch). |
0ed8d8da | 66 | c. Users can use SHA-1 and SHA-256 identifiers for objects |
752414ae JN |
67 | interchangeably (see "Object names on the command line", below). |
68 | d. New signed objects make use of a stronger hash function than | |
69 | SHA-1 for their security guarantees. | |
70 | 2. Allow a complete transition away from SHA-1. | |
71 | a. Local metadata for SHA-1 compatibility can be removed from a | |
72 | repository if compatibility with SHA-1 is no longer needed. | |
73 | 3. Maintainability throughout the process. | |
74 | a. The object format is kept simple and consistent. | |
75 | b. Creation of a generalized repository conversion tool. | |
76 | ||
77 | Non-Goals | |
78 | --------- | |
0ed8d8da | 79 | 1. Add SHA-256 support to Git protocol. This is valuable and the |
752414ae JN |
80 | logical next step but it is out of scope for this initial design. |
81 | 2. Transparently improving the security of existing SHA-1 signed | |
82 | objects. | |
83 | 3. Intermixing objects using multiple hash functions in a single | |
84 | repository. | |
85 | 4. Taking the opportunity to fix other bugs in Git's formats and | |
86 | protocols. | |
0ed8d8da JN |
87 | 5. Shallow clones and fetches into a SHA-256 repository. (This will |
88 | change when we add SHA-256 support to Git protocol.) | |
89 | 6. Skip fetching some submodules of a project into a SHA-256 | |
90 | repository. (This also depends on SHA-256 support in Git | |
752414ae JN |
91 | protocol.) |
92 | ||
93 | Overview | |
94 | -------- | |
95 | We introduce a new repository format extension. Repositories with this | |
0ed8d8da | 96 | extension enabled use SHA-256 instead of SHA-1 to name their objects. |
de82095a | 97 | This affects both object names and object content -- both the names |
752414ae JN |
98 | of objects and all references to other objects within an object are |
99 | switched to the new hash function. | |
100 | ||
0ed8d8da | 101 | SHA-256 repositories cannot be read by older versions of Git. |
752414ae | 102 | |
0ed8d8da JN |
103 | Alongside the packfile, a SHA-256 repository stores a bidirectional |
104 | mapping between SHA-256 and SHA-1 object names. The mapping is generated | |
752414ae | 105 | locally and can be verified using "git fsck". Object lookups use this |
0ed8d8da | 106 | mapping to allow naming objects using either their SHA-1 or SHA-256 names |
752414ae JN |
107 | interchangeably. |
108 | ||
109 | "git cat-file" and "git hash-object" gain options to display an object | |
af9b1e9a | 110 | in its SHA-1 form and write an object given its SHA-1 form. This |
752414ae JN |
111 | requires all objects referenced by that object to be present in the |
112 | object database so that they can be named using the appropriate name | |
113 | (using the bidirectional hash mapping). | |
114 | ||
115 | Fetches from a SHA-1 based server convert the fetched objects into | |
0ed8d8da | 116 | SHA-256 form and record the mapping in the bidirectional mapping table |
752414ae | 117 | (see below for details). Pushes to a SHA-1 based server convert the |
af9b1e9a | 118 | objects being pushed into SHA-1 form so the server does not have to be |
752414ae JN |
119 | aware of the hash function the client is using. |
120 | ||
121 | Detailed Design | |
122 | --------------- | |
123 | Repository format extension | |
124 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
0ed8d8da | 125 | A SHA-256 repository uses repository format version `1` (see |
752414ae JN |
126 | Documentation/technical/repository-version.txt) with extensions |
127 | `objectFormat` and `compatObjectFormat`: | |
128 | ||
129 | [core] | |
130 | repositoryFormatVersion = 1 | |
131 | [extensions] | |
0ed8d8da | 132 | objectFormat = sha256 |
752414ae JN |
133 | compatObjectFormat = sha1 |
134 | ||
45fa195f ÆAB |
135 | The combination of setting `core.repositoryFormatVersion=1` and |
136 | populating `extensions.*` ensures that all versions of Git later than | |
0ed8d8da | 137 | `v0.99.9l` will die with an error message instead of trying to |
45fa195f | 138 | operate on the SHA-256 repository. |
752414ae | 139 | |
45fa195f ÆAB |
140 | # Between v0.99.9l and v2.7.0 |
141 | $ git status | |
142 | fatal: Expected git repo version <= 0, found 1 | |
143 | # After v2.7.0 | |
752414ae JN |
144 | $ git status |
145 | fatal: unknown repository extensions found: | |
146 | objectformat | |
147 | compatobjectformat | |
148 | ||
149 | See the "Transition plan" section below for more details on these | |
150 | repository extensions. | |
151 | ||
152 | Object names | |
153 | ~~~~~~~~~~~~ | |
af9b1e9a TA |
154 | Objects can be named by their 40 hexadecimal digit SHA-1 name or 64 |
155 | hexadecimal digit SHA-256 name, plus names derived from those (see | |
752414ae JN |
156 | gitrevisions(7)). |
157 | ||
af9b1e9a TA |
158 | The SHA-1 name of an object is the SHA-1 of the concatenation of its |
159 | type, length, a nul byte, and the object's SHA-1 content. This is the | |
752414ae JN |
160 | traditional <sha1> used in Git to name objects. |
161 | ||
af9b1e9a TA |
162 | The SHA-256 name of an object is the SHA-256 of the concatenation of its |
163 | type, length, a nul byte, and the object's SHA-256 content. | |
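The naming rule above can be illustrated with a short Python sketch. It assumes the header layout Git uses today, "<type> <decimal length>" followed by a NUL byte; the helper name `object_name` is hypothetical and not part of Git. For a blob the SHA-1 content and SHA-256 content are identical, so only the hash function differs.

```python
import hashlib

def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>\\0<content>" and return the hex object name."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()

blob = b"hello\n"
print(object_name("blob", blob, "sha1"))    # 40 hexadecimal digits
print(object_name("blob", blob, "sha256"))  # 64 hexadecimal digits
```

For trees, commits, and tags the `content` argument would have to be the SHA-1 content or the SHA-256 content respectively, since those differ in how they name referenced objects.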
752414ae JN |
164 | |
165 | Object format | |
166 | ~~~~~~~~~~~~~ | |
167 | The content as a byte sequence of a tag, commit, or tree object named | |
af9b1e9a TA |
168 | by SHA-1 and SHA-256 differ because an object named by SHA-256 name refers to |
169 | other objects by their SHA-256 names and an object named by SHA-1 name | |
170 | refers to other objects by their SHA-1 names. | |
752414ae | 171 | |
af9b1e9a TA |
172 | The SHA-256 content of an object is the same as its SHA-1 content, except |
173 | that objects referenced by the object are named using their SHA-256 names | |
174 | instead of SHA-1 names. Because a blob object does not refer to any | |
175 | other object, its SHA-1 content and SHA-256 content are the same. | |
752414ae | 176 | |
af9b1e9a TA |
177 | The format allows round-trip conversion between SHA-256 content and |
178 | SHA-1 content. | |
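As a rough illustration of the round-trip property, the sketch below rewrites a tree object's content from one hash to the other. It assumes the existing tree entry layout ("<mode> <name>", a NUL byte, then the raw binary object name) and a caller-supplied `mapping` dictionary standing in for the translation table; `convert_tree` is a hypothetical helper, not a Git function.

```python
def convert_tree(content: bytes, mapping: dict, old_len: int) -> bytes:
    """Rewrite every raw object name in a tree using `mapping`.

    old_len is the raw name length in the source format
    (20 bytes for SHA-1, 32 bytes for SHA-256).
    """
    out = bytearray()
    pos = 0
    while pos < len(content):
        nul = content.index(b"\x00", pos)   # end of "<mode> <name>"
        name_start = nul + 1
        old_name = content[name_start:name_start + old_len]
        out += content[pos:name_start]      # mode and path are copied verbatim
        out += mapping[old_name]            # KeyError if the referenced object is unknown
        pos = name_start + old_len
    return bytes(out)
```

Because mode and path bytes are copied unchanged, applying the function with the forward mapping and then with the inverse mapping returns the original bytes, including any malformed modes or ordering.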
752414ae JN |
179 | |
180 | Object storage | |
181 | ~~~~~~~~~~~~~~ | |
182 | Loose objects use zlib compression and packed objects use the packed | |
183 | format described in Documentation/technical/pack-format.txt, just like | |
af9b1e9a TA |
184 | today. The content that is compressed and stored uses SHA-256 content |
185 | instead of SHA-1 content. | |
752414ae JN |
186 | |
187 | Pack index | |
188 | ~~~~~~~~~~ | |
189 | Pack index (.idx) files use a new v3 format that supports multiple | |
190 | hash functions. They have the following format (all integers are in | |
191 | network byte order): | |
192 | ||
193 | - A header appears at the beginning and consists of the following: | |
de82095a TA |
194 | * The 4-byte pack index signature: '\377t0c' |
195 | * 4-byte version number: 3 | |
196 | * 4-byte length of the header section, including the signature and | |
752414ae | 197 | version number |
de82095a TA |
198 | * 4-byte number of objects contained in the pack |
199 | * 4-byte number of object formats in this pack index: 2 | |
200 | * For each object format: | |
201 | ** 4-byte format identifier (e.g., 'sha1' for SHA-1) | |
202 | ** 4-byte length in bytes of shortened object names. This is the | |
752414ae JN |
203 | shortest possible length needed to make names in the shortened |
204 | object name table unambiguous. | |
de82095a | 205 | ** 4-byte integer, recording where tables relating to this format |
752414ae | 206 | are stored in this index file, as an offset from the beginning. |
de82095a TA |
207 | * 4-byte offset to the trailer from the beginning of this file. |
208 | * Zero or more additional key/value pairs (4-byte key, 4-byte | |
752414ae JN |
209 | value). Only one key is supported: 'PSRC'. See the "Loose objects |
210 | and unreachable objects" section for supported values and how this | |
211 | is used. All other keys are reserved. Readers must ignore | |
212 | unrecognized keys. | |
213 | - Zero or more NUL bytes. This can optionally be used to improve the | |
214 | alignment of the full object name table below. | |
215 | - Tables for the first object format: | |
de82095a | 216 | * A sorted table of shortened object names. These are prefixes of |
752414ae JN |
217 | the names of all objects in this pack file, packed together |
218 | without offset values to reduce the cache footprint of the binary | |
219 | search for a specific object name. | |
220 | ||
de82095a | 221 | * A table of full object names in pack order. This allows resolving |
752414ae JN |
222 | a reference to "the nth object in the pack file" (from a |
223 | reachability bitmap or from the next table of another object | |
224 | format) to its object name. | |
225 | ||
de82095a | 226 | * A table of 4-byte values mapping object name order to pack order. |
752414ae JN |
227 | For an object in the table of sorted shortened object names, the |
228 | value at the corresponding index in this table is the index in the | |
229 | previous table for that same object. | |
752414ae JN |
230 | This can be used to look up the object in reachability bitmaps or |
231 | to look up its name in another object format. | |
232 | ||
de82095a | 233 | * A table of 4-byte CRC32 values of the packed object data, in the |
752414ae JN |
234 | order that the objects appear in the pack file. This is to allow |
235 | compressed data to be copied directly from pack to pack during | |
236 | repacking without undetected data corruption. | |
237 | ||
de82095a | 238 | * A table of 4-byte offset values. For an object in the table of |
752414ae JN |
239 | sorted shortened object names, the value at the corresponding |
240 | index in this table indicates where that object can be found in | |
241 | the pack file. These are usually 31-bit pack file offsets, but | |
242 | large offsets are encoded as an index into the next table with the | |
243 | most significant bit set. | |
244 | ||
de82095a | 245 | * A table of 8-byte offset entries (empty for pack files less than |
752414ae JN |
246 | 2 GiB). Pack files are organized with heavily used objects toward |
247 | the front, so most object references should not need to refer to | |
248 | this table. | |
249 | - Zero or more NUL bytes. | |
250 | - Tables for the second object format, with the same layout as above, | |
251 | up to and not including the table of CRC32 values. | |
252 | - Zero or more NUL bytes. | |
253 | - The trailer consists of the following: | |
de82095a | 254 | * A copy of the 32-byte SHA-256 checksum at the end of the |
752414ae JN |
255 | corresponding packfile. |
256 | ||
de82095a | 257 | * 32-byte SHA-256 checksum of all of the above. |
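The header layout described above could be read along the following lines. This is a minimal sketch under the stated field order, not Git's implementation: `parse_idx_v3_header` and the dictionary keys are hypothetical names, and the sketch assumes that the key/value pairs run up to the header length recorded in the third field.

```python
import struct

IDX_SIGNATURE = b"\377t0c"

def parse_idx_v3_header(data: bytes) -> dict:
    """Parse the fixed fields, per-format records and key/value pairs."""
    sig, version, header_len, num_objects, num_formats = struct.unpack_from(
        ">4sIIII", data, 0)  # network byte order
    if sig != IDX_SIGNATURE or version != 3:
        raise ValueError("not a v3 pack index")
    pos = 20
    formats = []
    for _ in range(num_formats):
        fmt_id, short_len, tables_offset = struct.unpack_from(">4sII", data, pos)
        formats.append({"id": fmt_id, "short_len": short_len,
                        "tables_offset": tables_offset})
        pos += 12
    (trailer_offset,) = struct.unpack_from(">I", data, pos)
    pos += 4
    extras = {}
    while pos + 8 <= header_len:            # 4-byte key, 4-byte value
        key, value = struct.unpack_from(">4sI", data, pos)
        extras[key] = value                 # callers ignore keys they do not recognize
        pos += 8
    return {"num_objects": num_objects, "formats": formats,
            "trailer_offset": trailer_offset, "extras": extras}
```

A reader would then seek to each format's `tables_offset` to find the shortened-name table for that format.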
752414ae JN |
258 | |
259 | Loose object index | |
260 | ~~~~~~~~~~~~~~~~~~ | |
261 | A new file $GIT_OBJECT_DIR/loose-object-idx contains information about | |
262 | all loose objects. Its format is | |
263 | ||
264 | # loose-object-idx | |
0ed8d8da | 265 | (sha256-name SP sha1-name LF)* |
752414ae JN |
266 | |
267 | where the object names are in hexadecimal format. The file is not | |
268 | sorted. | |
269 | ||
270 | The loose object index is protected against concurrent writes by a | |
271 | lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose | |
272 | object: | |
273 | ||
274 | 1. Write the loose object to a temporary file, like today. | |
275 | 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. | |
276 | 3. Rename the loose object into place. | |
277 | 4. Open loose-object-idx with O_APPEND and write the new object's entry. | |
278 | 5. Unlink loose-object-idx.lock to release the lock. | |
279 | ||
280 | To remove entries (e.g. in "git repack" or "git prune"): | |
281 | ||
282 | 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the | |
283 | lock. | |
284 | 2. Write the new content to loose-object-idx.lock. | |
285 | 3. Unlink any loose objects being removed. | |
286 | 4. Rename to replace loose-object-idx, releasing the lock. | |
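The procedure for adding a loose object might look roughly like this in Python. The function name and arguments are hypothetical, step 1 (writing the temporary file) is left to the caller, and writing the "# loose-object-idx" header line when the file is first created is an assumption based on the format shown above.

```python
import os

def add_loose_object(objdir: str, tmp_path: str, final_path: str,
                     sha256_name: str, sha1_name: str) -> None:
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    # Step 2: O_CREAT | O_EXCL raises FileExistsError if the lock is already held.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    try:
        # Step 3: rename the loose object into place.
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        os.rename(tmp_path, final_path)
        # Step 4: append the mapping entry ("ab" opens the file in append mode).
        is_new = not os.path.exists(idx)
        with open(idx, "ab") as f:
            if is_new:
                f.write(b"# loose-object-idx\n")
            f.write(f"{sha256_name} {sha1_name}\n".encode("ascii"))
    finally:
        # Step 5: release the lock.
        os.unlink(lock)
```

A real implementation would also need to decide how long to wait and retry when the lock is held; this sketch simply lets the exception propagate.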
287 | ||
288 | Translation table | |
289 | ~~~~~~~~~~~~~~~~~ | |
af9b1e9a TA |
290 | The index files support a bidirectional mapping between SHA-1 names |
291 | and SHA-256 names. The lookup proceeds similarly to ordinary object | |
292 | lookups. For example, to convert a SHA-1 name to a SHA-256 name: | |
752414ae JN |
293 | |
294 | 1. Look for the object in idx files. If a match is present in the | |
af9b1e9a TA |
295 | idx's sorted list of truncated SHA-1 names, then: |
296 | a. Read the corresponding entry in the SHA-1 name order to pack | |
752414ae | 297 | name order mapping. |
af9b1e9a | 298 | b. Read the corresponding entry in the full SHA-1 name table to |
752414ae | 299 | verify we found the right object. If it is, then |
af9b1e9a TA |
300 | c. Read the corresponding entry in the full SHA-256 name table. |
301 | That is the object's SHA-256 name. | |
752414ae JN |
302 | 2. Check for a loose object. Read lines from loose-object-idx until |
303 | we find a match. | |
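Step 1 can be sketched with an in-memory model of the per-format tables in a single idx. The class and function names below are hypothetical, object names are plain hex strings, and the shortened-name length is assumed to be long enough to be unambiguous, as the pack index section requires.

```python
from bisect import bisect_left

class FormatTables:
    """Tables for one object format in one pack index."""
    def __init__(self, full_names_in_pack_order, short_len):
        self.full = list(full_names_in_pack_order)          # indexed by pack order
        order = sorted(range(len(self.full)), key=lambda i: self.full[i])
        self.short = [self.full[i][:short_len] for i in order]  # sorted shortened names
        self.name_to_pack = order                           # name order -> pack order
        self.short_len = short_len

def translate(name, src, dst):
    """Return the dst-format name of `name`, or None if this idx lacks it."""
    prefix = name[:src.short_len]
    i = bisect_left(src.short, prefix)        # binary search in the sorted table
    if i == len(src.short) or src.short[i] != prefix:
        return None
    pack_pos = src.name_to_pack[i]            # step (a)
    if src.full[pack_pos] != name:            # step (b): verify the full name
        return None
    return dst.full[pack_pos]                 # step (c)
```

When `translate` returns None for every idx file, the caller would fall back to scanning loose-object-idx as in step 2.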
304 | ||
305 | Step (1) takes the same amount of time as an ordinary object lookup: | |
306 | O(number of packs * log(objects per pack)). Step (2) takes O(number of | |
307 | loose objects) time. To maintain good performance it will be necessary | |
308 | to keep the number of loose objects low. See the "Loose objects and | |
309 | unreachable objects" section below for more details. | |
310 | ||
311 | Since all operations that make new objects (e.g., "git commit") add | |
312 | the new objects to the corresponding index, this mapping is possible | |
313 | for all objects in the object store. | |
314 | ||
af9b1e9a TA |
315 | Reading an object's SHA-1 content |
316 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
317 | The SHA-1 content of an object can be read by converting all SHA-256 names | |
318 | that its SHA-256 content references to SHA-1 names using the translation table. | |
752414ae JN |
319 | |
320 | Fetch | |
321 | ~~~~~ | |
322 | Fetching from a SHA-1 based server requires translating between SHA-1 | |
0ed8d8da | 323 | and SHA-256 based representations on the fly. |
752414ae JN |
324 | |
325 | SHA-1s named in the ref advertisement that are present on the client | |
0ed8d8da | 326 | can be translated to SHA-256 and looked up as local objects using the |
752414ae JN |
327 | translation table. |
328 | ||
329 | Negotiation proceeds as today. Any "have"s generated locally are | |
330 | converted to SHA-1 before being sent to the server, and SHA-1s | |
0ed8d8da | 331 | mentioned by the server are converted to SHA-256 when looking them up |
752414ae JN |
332 | locally. |
333 | ||
334 | After negotiation, the server sends a packfile containing the | |
0ed8d8da | 335 | requested objects. We convert the packfile to SHA-256 format using |
752414ae JN |
336 | the following steps: |
337 | ||
338 | 1. index-pack: inflate each object in the packfile and compute its | |
339 | SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against | |
340 | objects the client has locally. These objects can be looked up | |
af9b1e9a | 341 | using the translation table and their SHA-1 content read as |
752414ae JN |
342 | described above to resolve the deltas. |
343 | 2. topological sort: starting at the "want"s from the negotiation | |
344 | phase, walk through objects in the pack and emit a list of them, | |
345 | excluding blobs, in reverse topologically sorted order, with each | |
346 | object coming later in the list than all objects it references. | |
347 | (This list only contains objects reachable from the "wants". If the | |
348 | pack from the server contained additional extraneous objects, then | |
349 | they will be discarded.) | |
af9b1e9a | 350 | 3. convert to SHA-256: open a new SHA-256 packfile. Read the topologically |
752414ae | 351 | sorted list just generated. For each object, inflate its |
af9b1e9a TA |
352 | SHA-1 content, convert to SHA-256 content, and write it to the SHA-256 |
353 | pack. Record the new SHA-1<-->SHA-256 mapping entry for use in the idx. | |
752414ae | 354 | 4. sort: reorder entries in the new pack to match the order of objects |
af9b1e9a | 355 | in the pack the server generated and include blobs. Write a SHA-256 idx |
752414ae JN |
356 | file. |
357 | 5. clean up: remove the SHA-1 based pack file, index, and | |
358 | topologically sorted list obtained from the server in steps 1 | |
359 | and 2. | |
360 | ||
361 | Step 3 requires every object referenced by the new object to be in the | |
362 | translation table. This is why the topological sort step is necessary. | |
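The ordering required by step 2 amounts to a post-order walk from the "want"s. The sketch below assumes a hypothetical `refs` dictionary that maps each non-blob object in the received pack to the non-blob objects it references (the kind of data the optimization described for step 1 could record); names absent from `refs` are treated as objects the client already has.

```python
def reverse_topo_order(wants, refs):
    """List pack objects so that each comes after everything it references."""
    order, seen = [], set()
    for want in wants:
        if want in seen or want not in refs:
            continue
        seen.add(want)
        stack = [(want, iter(refs[want]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child in refs and child not in seen:
                    seen.add(child)
                    stack.append((child, iter(refs[child])))
                    break                     # descend into the child first
            else:
                order.append(node)            # all references already emitted
                stack.pop()
    return order
```

Objects in `refs` that are not reachable from any "want" never appear in the result, which matches the note that extraneous objects are discarded.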
363 | ||
364 | As an optimization, step 1 could write a file describing what non-blob | |
365 | objects each object it has inflated from the packfile references. This | |
366 | makes the topological sort in step 2 possible without inflating the | |
367 | objects in the packfile for a second time. The objects need to be | |
368 | inflated again in step 3, for a total of two inflations. | |
369 | ||
370 | Step 4 is probably necessary for good read-time performance. "git | |
371 | pack-objects" on the server optimizes the pack file for good data | |
372 | locality (see Documentation/technical/pack-heuristics.txt). | |
373 | ||
374 | Details of this process are likely to change. It will take some | |
375 | experimenting to get this to perform well. | |
376 | ||
377 | Push | |
378 | ~~~~ | |
379 | Push is simpler than fetch because the objects referenced by the | |
af9b1e9a | 380 | pushed objects are already in the translation table. The SHA-1 content |
752414ae | 381 | of each object being pushed can be read as described in the "Reading |
af9b1e9a | 382 | an object's SHA-1 content" section to generate the pack written by git |
752414ae JN |
383 | send-pack. |
384 | ||
385 | Signed Commits | |
386 | ~~~~~~~~~~~~~~ | |
0ed8d8da | 387 | We add a new field "gpgsig-sha256" to the commit object format to allow |
752414ae | 388 | signing commits without relying on SHA-1. It is similar to the |
af9b1e9a | 389 | existing "gpgsig" field. Its signed payload is the SHA-256 content of the |
0ed8d8da | 390 | commit object with any "gpgsig" and "gpgsig-sha256" fields removed. |
752414ae JN |
391 | |
392 | This means commits can be signed | |
de82095a | 393 | |
752414ae | 394 | 1. using SHA-1 only, as in existing signed commit objects |
0ed8d8da | 395 | 2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig |
752414ae | 396 | fields. |
0ed8d8da | 397 | 3. using only SHA-256, by only using the gpgsig-sha256 field. |
752414ae JN |
398 | |
399 | Old versions of "git verify-commit" can verify the gpgsig signature in | |
400 | cases (1) and (2) without modifications and view case (3) as an | |
401 | ordinary unsigned commit. | |
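The signed payload for "gpgsig-sha256" can be approximated as follows. This sketch assumes the existing commit header conventions (one header per line, continuation lines starting with a single space, and a blank line before the message); `signed_payload` is a hypothetical helper name.

```python
def signed_payload(commit: bytes,
                   drop=(b"gpgsig", b"gpgsig-sha256")) -> bytes:
    """Return the commit content with the signature headers removed."""
    headers, sep, body = commit.partition(b"\n\n")
    kept, skipping = [], False
    for line in headers.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: it shares the fate of the header it belongs to.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.split(b" ", 1)[0] in drop
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body
```

The input is expected to be the SHA-256 content of the commit, so the resulting payload never depends on SHA-1 names.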
402 | ||
403 | Signed Tags | |
404 | ~~~~~~~~~~~ | |
0ed8d8da | 405 | We add a new field "gpgsig-sha256" to the tag object format to allow |
752414ae | 406 | signing tags without relying on SHA-1. Its signed payload is the |
af9b1e9a | 407 | SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP |
752414ae JN |
408 | SIGNATURE-----" delimited in-body signature removed. |
409 | ||
410 | This means tags can be signed | |
de82095a | 411 | |
752414ae | 412 | 1. using SHA-1 only, as in existing signed tag objects |
0ed8d8da | 413 | 2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body |
752414ae | 414 | signature. |
0ed8d8da | 415 | 3. using only SHA-256, by only using the gpgsig-sha256 field. |
752414ae JN |
416 | |
417 | Mergetag embedding | |
418 | ~~~~~~~~~~~~~~~~~~ | |
af9b1e9a TA |
419 | The mergetag field in the SHA-1 content of a commit contains the |
420 | SHA-1 content of a tag that was merged by that commit. | |
752414ae | 421 | |
af9b1e9a TA |
422 | The mergetag field in the SHA-256 content of the same commit contains the |
423 | SHA-256 content of the same tag. | |
752414ae JN |
424 | |
425 | Submodules | |
426 | ~~~~~~~~~~ | |
427 | To convert recorded submodule pointers, you need to have the converted | |
428 | submodule repository in place. The translation table of the submodule | |
429 | can be used to look up the new hash. | |
430 | ||
431 | Loose objects and unreachable objects | |
432 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
433 | Fast lookups in the loose-object-idx require that the number of loose | |
434 | objects not grow too high. | |
435 | ||
436 | "git gc --auto" currently waits for there to be 6700 loose objects | |
437 | present before consolidating them into a packfile. We will need to | |
438 | measure to find a more appropriate threshold for it to use. | |
439 | ||
440 | "git gc --auto" currently waits for there to be 50 packs present | |
441 | before combining packfiles. Packing loose objects more aggressively | |
442 | may cause the number of pack files to grow too quickly. This can be | |
443 | mitigated by using a strategy similar to Martin Fick's exponential | |
444 | rolling garbage collection script: | |
445 | https://gerrit-review.googlesource.com/c/gerrit/+/35215 | |
446 | ||
447 | "git gc" currently expels any unreachable objects it encounters in | |
448 | pack files to loose objects in an attempt to prevent a race when | |
449 | pruning them (in case another process is simultaneously writing a new | |
450 | object that refers to the about-to-be-deleted object). This leads to | |
451 | an explosion in the number of loose objects present and disk space | |
452 | usage due to the objects in delta form being replaced with independent | |
453 | loose objects. Worse, the race is still present for loose objects. | |
454 | ||
455 | Instead, "git gc" will need to move unreachable objects to a new | |
456 | packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see | |
457 | below). To avoid the race when writing new objects referring to an | |
458 | about-to-be-deleted object, code paths that write new objects will | |
459 | need to copy any objects that they refer to from UNREACHABLE_GARBAGE | |
24966cd9 | 460 | packs into new, non-UNREACHABLE_GARBAGE packs (or loose objects). |
752414ae JN |
461 | UNREACHABLE_GARBAGE packs are then safe to delete if their creation time (as |
462 | indicated by the file's mtime) is long enough ago. | |
463 | ||
464 | To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be | |
465 | combined under certain circumstances. If "gc.garbageTtl" is set to | |
466 | greater than one day, then packs created within a single calendar day, | |
467 | UTC, can be coalesced together. The resulting packfile would have an | |
468 | mtime before midnight on that day, so this makes the effective maximum | |
469 | ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, | |
470 | then we divide the calendar day into intervals one-third of that ttl | |
471 | in duration. Packs created within the same interval can be coalesced | |
472 | together. The resulting packfile would have an mtime before the end of | |
473 | the interval, so this makes the effective maximum ttl equal to the | |
474 | garbageTtl * 4/3. | |
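The coalescing rule can be expressed as a bucket computation: two UNREACHABLE_GARBAGE packs may be combined when they fall into the same bucket. The sketch below is an illustration only. The function name is hypothetical, times are Unix seconds in UTC, and treating a ttl of exactly one day like the "greater than one day" case is an assumption, since the text leaves that case open.

```python
DAY = 24 * 60 * 60

def coalesce_bucket(created: int, ttl: int) -> tuple:
    """Return an identifier for the interval in which a pack was created."""
    day, second_of_day = divmod(created, DAY)   # calendar day in UTC
    if ttl >= DAY:
        return (day, 0)                         # one bucket per calendar day
    interval = max(1, ttl // 3)                 # one-third of the ttl
    return (day, second_of_day // interval)
```

With a ttl of three hours, for example, the intervals are one hour long, so an object in a pack created at the start of an interval can survive up to one extra hour beyond the ttl, which is the 4/3 factor mentioned above.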
475 | ||
476 | This rule comes from Thirumala Reddy Mutchukota's JGit change | |
477 | https://git.eclipse.org/r/90465. | |
478 | ||
479 | The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack | |
480 | index. More generally, that field indicates where a pack came from: | |
481 | ||
482 | - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network | |
483 | - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight | |
484 | "gc --auto" operation | |
485 | - 3 (PACK_SOURCE_GC) for a pack created by a full gc | |
486 | - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage | |
487 | discovered by gc | |
488 | - 5 (PACK_SOURCE_INSERT) for locally created objects that were | |
489 | written directly to a pack file, e.g. from "git add ." | |
490 | ||
491 | This information can be useful for debugging and for "gc --auto" to | |
492 | make appropriate choices about which packs to coalesce. | |
493 | ||
494 | Caveats | |
495 | ------- | |
496 | Invalid objects | |
497 | ~~~~~~~~~~~~~~~ | |
af9b1e9a | 498 | The conversion from SHA-1 content to SHA-256 content retains any |
752414ae JN |
499 | brokenness in the original object (e.g., tree entry modes encoded with |
500 | leading 0, tree objects whose paths are not sorted correctly, and | |
501 | commit objects without an author or committer). This is a deliberate | |
502 | feature of the design to allow the conversion to round-trip. | |
503 | ||
504 | More profoundly broken objects (e.g., a commit with a truncated "tree" | |
505 | header line) cannot be converted but were not usable by current Git | |
506 | anyway. | |
507 | ||
508 | Shallow clone and submodules | |
509 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
510 | Because it requires all referenced objects to be available in the | |
511 | locally generated translation table, this design does not support | |
512 | shallow clone or unfetched submodules. Protocol improvements might | |
513 | allow lifting this restriction. | |
514 | ||
515 | Alternates | |
516 | ~~~~~~~~~~ | |
af9b1e9a TA |
517 | For the same reason, a SHA-256 repository cannot borrow objects from a |
518 | SHA-1 repository using objects/info/alternates or | |
752414ae JN |
519 | $GIT_ALTERNATE_OBJECT_REPOSITORIES. |
520 | ||
521 | git notes | |
522 | ~~~~~~~~~ | |
af9b1e9a | 523 | The "git notes" tool annotates objects using their SHA-1 name as key. |
752414ae | 524 | This design does not describe a way to migrate notes trees to use |
af9b1e9a | 525 | SHA-256 names. That migration is expected to happen separately (for |
752414ae JN |
526 | example using a file at the root of the notes tree to describe which |
527 | hash it uses). | |
528 | ||
529 | Server-side cost | |
530 | ~~~~~~~~~~~~~~~~ | |
0ed8d8da | 531 | Until Git protocol gains SHA-256 support, using SHA-256 based storage |
752414ae | 532 | on public-facing Git servers is strongly discouraged. Once Git |
0ed8d8da | 533 | protocol gains SHA-256 support, SHA-256 based servers are likely not |
752414ae | 534 | to support SHA-1 compatibility, to avoid what may be a very expensive |
031fd4b9 | 535 | hash re-encode during clone and to encourage peers to modernize. |
752414ae JN |
536 | |
537 | The design described here allows fetches by SHA-1 clients of a | |
0ed8d8da | 538 | personal SHA-256 repository because it's not much more difficult than |
752414ae JN |
539 | allowing pushes from that repository. This support needs to be guarded |
540 | by a configuration option --- servers like git.kernel.org that serve a | |
541 | large number of clients would not be expected to bear that cost. | |
542 | ||
543 | Meaning of signatures | |
544 | ~~~~~~~~~~~~~~~~~~~~~ | |
545 | The signed payload for signed commits and tags does not explicitly | |
546 | name the hash used to identify objects. If some day Git adopts a new | |
547 | hash function with the same length as the current SHA-1 (40 | |
0ed8d8da | 548 | hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the |
752414ae JN |
549 | intent behind the PGP signed payload in an object signature is |
550 | unclear: | |
551 | ||
552 | object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 | |
553 | type commit | |
554 | tag v2.12.0 | |
555 | tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 | |
556 | ||
557 | Git 2.12 | |
558 | ||
af9b1e9a | 559 | Does this mean Git v2.12.0 is the commit with SHA-1 name |
752414ae JN |
560 | e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with |
561 | new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? | |
562 | ||
0ed8d8da | 563 | Fortunately SHA-256 and SHA-1 have different lengths. If Git starts |
752414ae JN |
564 | using another hash with the same length to name objects, then it will |
565 | need to change the format of signed payloads using that hash to | |
566 | address this issue. | |
567 | ||
568 | Object names on the command line | |
569 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
570 | To support the transition (see Transition plan below), this design | |
571 | supports four different modes of operation: | |
572 | ||
573 | 1. ("dark launch") Treat object names input by the user as SHA-1 and | |
574 | convert any object names written to output to SHA-1, but store | |
0ed8d8da | 575 | objects using SHA-256. This allows users to test the code with no |
752414ae JN |
576 | visible behavior change except for performance. This allows |
577 | running even tests that assume the SHA-1 hash function, to | |
578 | sanity-check the behavior of the new mode. | |
579 | ||
0ed8d8da | 580 | 2. ("early transition") Allow both SHA-1 and SHA-256 object names in |
752414ae JN |
581 | input. Any object names written to output use SHA-1. This allows |
582 | users to continue to make use of SHA-1 to communicate with peers | |
583 | (e.g. by email) that have not migrated yet and prepares for mode 3. | |
584 | ||
0ed8d8da JN |
585 | 3. ("late transition") Allow both SHA-1 and SHA-256 object names in |
586 | input. Any object names written to output use SHA-256. In this | |
752414ae JN |
587 | mode, users are using a more secure object naming method by |
588 | default. The disruption is minimal as long as most of their peers | |
589 | are in mode 2 or mode 3. | |
590 | ||
591 | 4. ("post-transition") Treat object names input by the user as | |
0ed8d8da | 592 | SHA-256 and write output using SHA-256. This is safer than mode 3 |
752414ae JN |
593 | because there is less risk that input is incorrectly interpreted |
594 | using the wrong hash function. | |
595 | ||
596 | The mode is specified in configuration. | |
597 | ||
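The four modes can be summarized as a small table of which hash functions are accepted in input and which one is used for output. The sketch below is illustrative only: the `Mode` class and the dictionary keys are assumptions made for this example, not Git configuration names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """Illustrative description of one mode of operation."""
    accepted_input: frozenset  # hash functions accepted for object names in input
    output: str                # hash function used for object names in output


# Keys are the informal mode names used in the list above.
MODES = {
    "dark launch": Mode(frozenset({"sha1"}), "sha1"),
    "early transition": Mode(frozenset({"sha1", "sha256"}), "sha1"),
    "late transition": Mode(frozenset({"sha1", "sha256"}), "sha256"),
    "post-transition": Mode(frozenset({"sha256"}), "sha256"),
}


def accepts(mode_name: str, algorithm: str) -> bool:
    """Return True if the mode accepts input names in the given hash."""
    return algorithm in MODES[mode_name].accepted_input


for name, mode in MODES.items():
    print(name, sorted(mode.accepted_input), "->", mode.output)
```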
598 | The user can also explicitly specify which format to use for a | |
599 | particular revision specifier and for output, overriding the mode. For | |
600 | example: | |
601 | ||
de82095a | 602 | git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} |
752414ae | 603 | |
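To illustrate how the per-revision override could be read, here is a rough sketch that splits a two-dot range and picks out a trailing ^{sha1} or ^{sha256}. It is not Git's revision parser: the function names are hypothetical, and it handles only hexadecimal names and ".." ranges.

```python
import re

# Matches a hexadecimal name followed by an explicit ^{sha1} or ^{sha256}.
_SUFFIX = re.compile(r"^(?P<name>[0-9a-f]+)\^\{(?P<algo>sha1|sha256)\}$")


def parse_object_name(spec: str, default_algo: str):
    """Return (name, algorithm); fall back to the mode's default algorithm."""
    match = _SUFFIX.match(spec)
    if match:
        return match.group("name"), match.group("algo")
    return spec, default_algo


def parse_range(range_spec: str, default_algo: str):
    """Split "A..B" and resolve the hash function for each side."""
    left, right = range_spec.split("..", 1)
    return (parse_object_name(left, default_algo),
            parse_object_name(right, default_algo))


print(parse_range("abac87a^{sha1}..f787cac^{sha256}", default_algo="sha1"))
```

A name without a suffix takes whatever default the configured mode supplies.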
0ed8d8da JN |
604 | Choice of Hash |
605 | -------------- | |
031fd4b9 | 606 | In early 2005, around the time that Git was written, Xiaoyun Wang, |
752414ae JN |
607 | Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 |
608 | collisions in 2^69 operations. In August they published details. | |
609 | Luckily, no practical demonstrations of a collision in full SHA-1 were | |
610 | published until 12 years later, in 2017. | |
611 | ||
0ed8d8da JN |
612 | Git v2.13.0 and later subsequently moved to a hardened SHA-1 |
613 | implementation by default that mitigates the SHAttered attack, but | |
614 | SHA-1 is still believed to be weak. | |
615 | ||
616 | The hash to replace this hardened SHA-1 should be stronger than SHA-1 | |
617 | was: we would like it to be trustworthy and useful in practice for at | |
618 | least 10 years. | |
752414ae JN |
619 | |
620 | Some other relevant properties: | |
621 | ||
622 | 1. A 256-bit hash (long enough to match common security practice; not | |
623 | so long as to hurt performance and disk usage). | |
624 | ||
0ed8d8da JN |
625 | 2. High quality implementations should be widely available (e.g., in |
626 | OpenSSL and Apple CommonCrypto). | |
752414ae JN |
627 | |
628 | 3. The hash function's properties should match Git's needs (e.g. Git | |
629 | requires collision and 2nd preimage resistance and does not require | |
630 | length extension resistance). | |
631 | ||
632 | 4. As a tiebreaker, the hash should be fast to compute (fortunately | |
633 | many contenders are faster than SHA-1). | |
634 | ||
0ed8d8da | 635 | We choose SHA-256. |
752414ae JN |
636 | |
637 | Transition plan | |
638 | --------------- | |
639 | Some initial steps can be implemented independently of one another: | |
de82095a | 640 | |
752414ae | 641 | - adding a hash function API (vtable) |
0ed8d8da | 642 | - teaching fsck to tolerate the gpgsig-sha256 field |
752414ae JN |
643 | - excluding gpgsig-* from the fields copied by "git commit --amend" |
644 | - annotating tests that depend on SHA-1 values with a SHA1 test | |
645 | prerequisite | |
646 | - using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ | |
647 | consistently instead of "unsigned char *" and the hardcoded | |
648 | constants 20 and 40. | |
649 | - introducing index v3 | |
650 | - adding support for the PSRC field and safer object pruning | |
651 | ||
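The idea behind a hash function API and size constants derived from it can be sketched as follows. Git's real API is a C vtable; this Python analogue is only an illustration, and the names `HashAlgo`, `ALGOS`, `MAX_RAWSZ`, `MAX_HEXSZ` and `hash_bytes` are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HashAlgo:
    """One entry of an illustrative hash function table."""
    name: str
    rawsz: int                 # digest size in bytes
    hexsz: int                 # digest size in hexadecimal digits
    new: Callable[[], object]  # factory returning a fresh hash context


ALGOS = {
    "sha1": HashAlgo("sha1", 20, 40, hashlib.sha1),
    "sha256": HashAlgo("sha256", 32, 64, hashlib.sha256),
}

# Buffers sized from these maxima can hold a name from any supported hash,
# which is the role the hardcoded constants 20 and 40 cannot play.
MAX_RAWSZ = max(algo.rawsz for algo in ALGOS.values())
MAX_HEXSZ = max(algo.hexsz for algo in ALGOS.values())


def hash_bytes(algo: HashAlgo, data: bytes) -> str:
    """Hash data through the table entry rather than a hardcoded function."""
    ctx = algo.new()
    ctx.update(data)
    return ctx.hexdigest()


print(MAX_RAWSZ, MAX_HEXSZ, len(hash_bytes(ALGOS["sha1"], b"example")))
```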
752414ae JN |
652 | The first user-visible change is the introduction of the objectFormat |
653 | extension (without compatObjectFormat). This requires: | |
de82095a | 654 | |
752414ae JN |
655 | - teaching fsck about this mode of operation |
656 | - using the hash function API (vtable) when computing object names | |
657 | - signing objects and verifying signatures | |
658 | - rejecting attempts to fetch from or push to an incompatible | |
659 | repository | |
660 | ||
661 | Next comes introduction of compatObjectFormat: | |
de82095a | 662 | |
2ae12e56 | 663 | - implementing the loose-object-idx |
752414ae JN |
664 | - translating object names between object formats |
665 | - translating object content between object formats | |
666 | - generating and verifying signatures in the compat format | |
667 | - adding appropriate index entries when adding a new object to the | |
668 | object store | |
669 | - --output-format option | |
0ed8d8da | 670 | - ^{sha1} and ^{sha256} revision notation |
752414ae JN |
671 | - configuration to specify default input and output format (see |
672 | "Object names on the command line" above) | |
673 | ||
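Translating object names between formats amounts to a two-way lookup. A minimal in-memory sketch follows; it makes no attempt to model the on-disk loose-object-idx or pack index formats. The class name is hypothetical and the object names are placeholders rather than real hashes.

```python
class TranslationTable:
    """Illustrative two-way mapping between SHA-1 and SHA-256 object names."""

    def __init__(self):
        self._by_algo = {"sha1": {}, "sha256": {}}

    def add(self, sha1_name: str, sha256_name: str) -> None:
        """Record that both names refer to the same object."""
        self._by_algo["sha1"][sha1_name] = sha256_name
        self._by_algo["sha256"][sha256_name] = sha1_name

    def translate(self, name: str, source: str):
        """Translate a name given in `source` format; None if it is unknown."""
        return self._by_algo[source].get(name)


table = TranslationTable()
table.add("1" * 40, "2" * 64)  # placeholder names, not real hashes
print(table.translate("1" * 40, source="sha1") == "2" * 64)
```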
674 | The next step is supporting fetches and pushes to SHA-1 repositories: | |
de82095a | 675 | |
752414ae JN |
676 | - allow pushes to a repository using the compat format |
677 | - generate a topologically sorted list of the SHA-1 names of fetched | |
678 | objects | |
af9b1e9a | 679 | - convert the fetched packfile to SHA-256 format and generate an idx |
752414ae JN |
680 | file |
681 | - re-sort to match the order of objects in the fetched packfile | |
682 | ||
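The topological sort in the fetch steps above can be sketched with Python's standard graphlib module. The assumption here is that an object should be converted only after every object it references. The object names and the `conversion_order` helper are made up for the example, and blobs are left out, as the document history below notes.

```python
from graphlib import TopologicalSorter


def conversion_order(references):
    """Order objects so that each one comes after the objects it references.

    `references` maps an object name to the names of the objects it refers to.
    """
    return list(TopologicalSorter(references).static_order())


# Made-up names standing in for the SHA-1 names of fetched commits and trees.
fetched = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": [],
    "tree1": [],
}
order = conversion_order(fetched)
print(order)
```

The exact order among unrelated objects is not significant; only the constraint that referenced objects come first matters.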
683 | The infrastructure supporting fetch also allows converting an existing | |
684 | repository. In converted repositories and new clones, end users can | |
685 | gain support for the new hash function without any visible change in | |
686 | behavior (see "dark launch" in the "Object names on the command line" | |
0ed8d8da | 687 | section). In particular this allows users to verify SHA-256 signatures |
752414ae JN |
688 | on objects in the repository, and it should ensure the transition code |
689 | is stable in production in preparation for using it more widely. | |
690 | ||
691 | Over time projects would encourage their users to adopt the "early | |
692 | transition" and then "late transition" modes to take advantage of the | |
0ed8d8da | 693 | new, more futureproof SHA-256 object names. |
752414ae JN |
694 | |
695 | When objectFormat and compatObjectFormat are both set, commands | |
0ed8d8da | 696 | generating signatures would generate both SHA-1 and SHA-256 signatures |
752414ae JN |
697 | by default to support both new and old users. |
698 | ||
0ed8d8da | 699 | In projects using SHA-256 heavily, users could be encouraged to adopt |
752414ae JN |
700 | the "post-transition" mode to avoid accidentally making implicit use |
701 | of SHA-1 object names. | |
702 | ||
703 | Once a critical mass of users have upgraded to a version of Git that | |
0ed8d8da | 704 | can verify SHA-256 signatures and have converted their existing |
752414ae | 705 | repositories to support verifying them, we can add support for a |
0ed8d8da | 706 | setting to generate only SHA-256 signatures. This is expected to be at |
752414ae JN |
707 | least a year later. |
708 | ||
709 | That is also a good moment to advertise the ability to convert | |
0ed8d8da | 710 | repositories to use SHA-256 only, stripping out all SHA-1 related |
752414ae JN |
711 | metadata. This improves performance by eliminating translation |
712 | overhead and security by avoiding the possibility of accidentally | |
713 | relying on the safety of SHA-1. | |
714 | ||
715 | Updating Git's protocols to allow a server to specify which hash | |
716 | functions it supports is also an important part of this transition. It | |
717 | is not discussed in detail in this document but this transition plan | |
718 | assumes it happens. :) | |
719 | ||
720 | Alternatives considered | |
721 | ----------------------- | |
722 | Upgrading everyone working on a particular project on a flag day | |
723 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
724 | Projects like the Linux kernel are large and complex enough that | |
725 | flipping the switch for all projects based on the repository at once | |
726 | is infeasible. | |
727 | ||
728 | Not only would all developers and server operators supporting | |
729 | developers have to switch on the same flag day, but supporting tooling | |
730 | (continuous integration, code review, bug trackers, etc.) would have to | |
731 | be adapted as well. This also makes it difficult to get early feedback | |
732 | from some project participants testing before it is time for mass | |
733 | adoption. | |
734 | ||
735 | Using hash functions in parallel | |
736 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
3eae30e4 | 737 | (e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) |
752414ae JN |
738 | Objects newly created would be addressed by the new hash, but inside |
739 | such an object (e.g. commit) it is still possible to address objects | |
740 | using the old hash function. | |
de82095a | 741 | |
752414ae JN |
742 | * You cannot trust its history (needed for bisectability) in the |
743 | future without further work | |
744 | * Maintenance burden as the number of supported hash functions grows | |
745 | (they will never go away, so they accumulate). In this proposal, by | |
746 | comparison, converted objects lose all references to SHA-1. | |
747 | ||
748 | Signed objects with multiple hashes | |
749 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
0ed8d8da | 750 | Instead of introducing the gpgsig-sha256 field in commit and tag objects |
af9b1e9a TA |
751 | for SHA-256 content based signatures, an earlier version of this design |
752 | added "hash sha256 <SHA-256 name>" fields to strengthen the existing | |
753 | SHA-1 content based signatures. | |
752414ae JN |
754 | |
755 | In other words, a single signature was used to attest to the object | |
756 | content using both hash functions. This had some advantages: | |
de82095a | 757 | |
752414ae JN |
758 | * Using one signature instead of two speeds up the signing process. |
759 | * Having one signed payload with both hashes allows the signer to | |
af9b1e9a | 760 | attest to the SHA-1 name and SHA-256 name referring to the same object. |
752414ae JN |
761 | * All users consume the same signature. Broken signatures are likely |
762 | to be detected quickly using current versions of Git. | |
763 | ||
764 | However, it also came with disadvantages: | |
de82095a | 765 | |
af9b1e9a | 766 | * Verifying a signed object requires access to the SHA-1 names of all |
752414ae JN |
767 | objects it references, even after the transition is complete and |
768 | the translation table is no longer needed for anything else. To support | |
af9b1e9a TA |
769 | this, the design added fields such as "hash sha1 tree <SHA-1 name>" |
770 | and "hash sha1 parent <SHA-1 name>" to the SHA-256 content of a signed | |
752414ae | 771 | commit, complicating the conversion process. |
af9b1e9a | 772 | * Allowing signed objects without a SHA-1 (for after the transition is |
752414ae | 773 | complete) complicated the design further, requiring a "nohash sha1" |
af9b1e9a | 774 | field to suppress including "hash sha1" fields in the SHA-256 content |
752414ae JN |
775 | and signed payload. |
776 | ||
777 | Lazily populated translation table | |
778 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
779 | Some of the work of building the translation table could be deferred to | |
780 | push time, but that would significantly complicate and slow down pushes. | |
af9b1e9a TA |
781 | Calculating the SHA-1 name at object creation time at the same time it is |
782 | being streamed to disk and having its SHA-256 name calculated should be | |
752414ae JN |
783 | an acceptable cost. |
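For a blob, whose content is the same under both hash functions, computing both names in a single pass over the data could look like the sketch below. It relies on Git naming an object by hashing a "<type> <size>" header, a NUL byte, and then the content. The function name is hypothetical. Trees, commits and tags would also need their embedded object names translated, which this sketch does not attempt.

```python
import hashlib
import io


def blob_names_while_streaming(stream, size, chunk_size=65536):
    """Return (SHA-1 name, SHA-256 name) of a blob, reading the data once."""
    header = b"blob " + str(size).encode("ascii") + b"\0"
    sha1 = hashlib.sha1(header)
    sha256 = hashlib.sha256(header)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed the same chunk to both contexts so the data is read only once.
        sha1.update(chunk)
        sha256.update(chunk)
    return sha1.hexdigest(), sha256.hexdigest()


data = b"hello\n"
sha1_name, sha256_name = blob_names_while_streaming(io.BytesIO(data), len(data))
print(sha1_name, len(sha256_name))
```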
784 | ||
785 | Document History | |
786 | ---------------- | |
787 | ||
788 | 2017-03-03 | |
789 | bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, | |
790 | sbeller@google.com | |
791 | ||
de82095a | 792 | * Initial version sent to http://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com |
752414ae JN |
793 | |
794 | 2017-03-03 jrnieder@gmail.com | |
795 | Incorporated suggestions from jonathantanmy and sbeller: | |
de82095a | 796 | |
810372f8 TA |
797 | * Describe purpose of signed objects with each hash type |
798 | * Redefine signed object verification using object content under the | |
752414ae JN |
799 | first hash function |
800 | ||
801 | 2017-03-06 jrnieder@gmail.com | |
de82095a | 802 | |
752414ae | 803 | * Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] |
af9b1e9a | 804 | * Make SHA3-based signatures a separate field, avoiding the need for |
752414ae JN |
805 | "hash" and "nohash" fields (thanks to peff[3]). |
806 | * Add a sorting phase to fetch (thanks to Junio for noticing the need | |
807 | for this). | |
808 | * Omit blobs from the topological sort during fetch (thanks to peff). | |
809 | * Discuss alternates, git notes, and git servers in the caveats | |
810 | section (thanks to Junio Hamano, brian m. carlson[4], and Shawn | |
811 | Pearce). | |
812 | * Clarify language throughout (thanks to various commenters, | |
813 | especially Junio). | |
814 | ||
815 | 2017-09-27 jrnieder@gmail.com, sbeller@google.com | |
de82095a | 816 | |
810372f8 TA |
817 | * Use placeholder NewHash instead of SHA3-256 |
818 | * Describe criteria for picking a hash function. | |
819 | * Include a transition plan (thanks especially to Brandon Williams | |
752414ae | 820 | for fleshing these ideas out) |
810372f8 | 821 | * Define the translation table (thanks, Shawn Pearce[5], Jonathan |
752414ae | 822 | Tan, and Masaya Suzuki) |
810372f8 | 823 | * Avoid loose object overhead by packing more aggressively in |
752414ae JN |
824 | "git gc --auto" |
825 | ||
13f5e098 ÆAB |
826 | Later history: |
827 | ||
de82095a TA |
828 | * See the history of this file in git.git for the history of subsequent |
829 | edits. This document history is no longer being maintained as it | |
830 | would now be superfluous to the commit log. | |
831 | ||
832 | References: | |
13f5e098 | 833 | |
de82095a TA |
834 | [1] http://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ |
835 | [2] http://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ | |
836 | [3] http://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ | |
837 | [4] http://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net | |
838 | [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ |