Git hash function transition
============================

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
At its core, the Git version control system is a content addressable
filesystem. It uses the SHA-1 hash function to name content. For
example, files, directories, and revisions are referred to by hash
values unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional
advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. On 23 February 2017 the SHAttered attack
(https://shattered.io) demonstrated a practical SHA-1 hash collision.

Git v2.13.0 and later subsequently moved to a hardened SHA-1
implementation by default, which isn't vulnerable to the SHAttered
attack.

Thus Git has in effect already migrated to a new hash that isn't SHA-1
and doesn't share its vulnerabilities. Its new hash function just
happens to produce exactly the same output for all known inputs,
except two PDFs published by the SHAttered researchers, and the new
implementation (written by those researchers) claims to detect future
cryptanalytic collision attacks.

Regardless, it's considered prudent to move past any variant of SHA-1
to a new hash. There's no guarantee that further attacks on SHA-1 won't
be published in the future, and those attacks may not have viable
mitigations.

If SHA-1 and its variants were to be truly broken, Git's hash function
could not be considered cryptographically secure any more. This would
impact the communication of hash values because we could not trust
that a given hash value represented the known good version of content
that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.

Goals
-----
1. The transition to SHA-256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA-256 repository can communicate with SHA-1 Git servers
      (push/fetch).
   c. Users can use SHA-1 and SHA-256 identifiers for objects
      interchangeably (see "Object names on the command line", below).
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be removed from a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA-256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in Git's formats and
   protocols.
5. Shallow clones and fetches into a SHA-256 repository. (This will
   change when we add SHA-256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA-256
   repository. (This also depends on SHA-256 support in Git
   protocol.)

Overview
--------
We introduce a new repository format extension. Repositories with this
extension enabled use SHA-256 instead of SHA-1 to name their objects.
This affects both object names and object content -- both the names
of objects and all references to other objects within an object are
switched to the new hash function.

SHA-256 repositories cannot be read by older versions of Git.

Alongside the packfile, a SHA-256 repository stores a bidirectional
mapping between SHA-256 and SHA-1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their SHA-1 or SHA-256
names interchangeably.

"git cat-file" and "git hash-object" gain options to display an object
in its SHA-1 form and write an object given its SHA-1 form. This
requires all objects referenced by that object to be present in the
object database so that they can be named using the appropriate name
(using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
SHA-256 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into SHA-1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Repository format extension
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A SHA-256 repository uses repository format version `1` (see
Documentation/technical/repository-version.txt) with extensions
`objectFormat` and `compatObjectFormat`:

    [core]
        repositoryFormatVersion = 1
    [extensions]
        objectFormat = sha256
        compatObjectFormat = sha1

The combination of setting `core.repositoryFormatVersion=1` and
populating `extensions.*` ensures that all versions of Git later than
`v0.99.9l` that do not understand these extensions will die with an
error message instead of trying to operate on the SHA-256 repository.

    # Between v0.99.9l and v2.7.0
    $ git status
    fatal: Expected git repo version <= 0, found 1
    # After v2.7.0
    $ git status
    fatal: unknown repository extensions found:
        objectformat
        compatobjectformat

See the "Transition plan" section below for more details on these
repository extensions.
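
The two failure modes shown above can be sketched in a few lines of
Python. This is only an illustration of the check, not Git's code:
`check_repository_format`, its parameters, and the `known` set are
hypothetical names chosen for the example.

```python
def check_repository_format(version, extensions, max_version, known=()):
    """Return None if the repository may be used, else an error message."""
    if version > max_version:
        # Behaviour of a Git that predates repository format version 1.
        return f"fatal: Expected git repo version <= {max_version}, found {version}"
    if version >= 1:
        # Extension names are compared case-insensitively in this sketch.
        known_lower = {name.lower() for name in known}
        unknown = [e.lower() for e in extensions if e.lower() not in known_lower]
        if unknown:
            lines = ["fatal: unknown repository extensions found:"]
            lines += ["\t" + name for name in unknown]
            return "\n".join(lines)
    return None


old_git = check_repository_format(1, ["objectFormat", "compatObjectFormat"], 0)
newer_git = check_repository_format(1, ["objectFormat", "compatObjectFormat"], 1)
```

Either way the older client refuses to touch the repository, which is
the property the extension mechanism is relied on for here.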

Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit SHA-1 name or 64
hexadecimal digit SHA-256 name, plus names derived from those (see
gitrevisions(7)).

The SHA-1 name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's SHA-1 content. This is the
traditional <sha1> used in Git to name objects.

The SHA-256 name of an object is the SHA-256 of the concatenation of its
type, length, a nul byte, and the object's SHA-256 content.
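
As a rough illustration of the naming rule, the following Python
sketch computes both names for a blob. It assumes the existing object
header layout (type, a space, the decimal length, then a NUL byte);
`object_name` is a hypothetical helper, not a Git API.

```python
import hashlib


def object_name(obj_type, content, algo="sha1"):
    """Hash the header "<type> <length>" + NUL followed by the content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()


# A blob references no other object, so the same bytes are hashed for
# both names; only the hash function differs.
empty_sha1 = object_name("blob", b"", "sha1")
empty_sha256 = object_name("blob", b"", "sha256")
```

For the empty blob this yields the familiar 40-digit name
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 and a 64-digit SHA-256 name.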

Object format
~~~~~~~~~~~~~
The content as a byte sequence of a tag, commit, or tree object named
by SHA-1 differs from that of the same object named by SHA-256, because
an object named by SHA-256 name refers to other objects by their
SHA-256 names and an object named by SHA-1 name refers to other objects
by their SHA-1 names.

The SHA-256 content of an object is the same as its SHA-1 content, except
that objects referenced by the object are named using their SHA-256 names
instead of SHA-1 names. Because a blob object does not refer to any
other object, its SHA-1 content and SHA-256 content are the same.

The format allows round-trip conversion between SHA-256 content and
SHA-1 content.
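
A minimal sketch of what such a conversion could look like for a tree
object, assuming the existing tree entry layout (mode, space, path,
NUL, raw object name). The function name `convert_tree` and the plain
dict standing in for the translation table are assumptions made for
the example.

```python
def convert_tree(content, names, src_len=20):
    """Rewrite every raw object name in a tree using the names mapping.

    Mode and path bytes are copied untouched, so the conversion can be
    reversed with the inverse mapping.
    """
    out, pos = bytearray(), 0
    while pos < len(content):
        nul = content.index(b"\x00", pos)          # end of "<mode> <path>"
        old_name = content[nul + 1:nul + 1 + src_len]
        out += content[pos:nul + 1] + names[old_name]
        pos = nul + 1 + src_len
    return bytes(out)


# Made-up raw names: 20 bytes for SHA-1, 32 bytes for SHA-256.
a1, b1 = b"\x01" * 20, b"\x02" * 20
a256, b256 = b"\xa1" * 32, b"\xb2" * 32
to_sha256 = {a1: a256, b1: b256}
to_sha1 = {v: k for k, v in to_sha256.items()}

tree_sha1 = b"100644 a.txt\x00" + a1 + b"40000 dir\x00" + b1
tree_sha256 = convert_tree(tree_sha1, to_sha256)
round_trip = convert_tree(tree_sha256, to_sha1, src_len=32)
```

Each converted entry grows by 12 bytes, and converting back yields the
original bytes exactly.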

Object storage
~~~~~~~~~~~~~~
Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today. The content that is compressed and stored uses SHA-256 content
instead of SHA-1 content.

Pack index
~~~~~~~~~~
Pack index (.idx) files use a new v3 format that supports multiple
hash functions. They have the following format (all integers are in
network byte order):

- A header appears at the beginning and consists of the following:
  * The 4-byte pack index signature: '\377tOc'
  * 4-byte version number: 3
  * 4-byte length of the header section, including the signature and
    version number
  * 4-byte number of objects contained in the pack
  * 4-byte number of object formats in this pack index: 2
  * For each object format:
    ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
    ** 4-byte length in bytes of shortened object names. This is the
       shortest possible length needed to make names in the shortened
       object name table unambiguous.
    ** 4-byte integer, recording where tables relating to this format
       are stored in this index file, as an offset from the beginning.
  * 4-byte offset to the trailer from the beginning of this file.
  * Zero or more additional key/value pairs (4-byte key, 4-byte
    value). Only one key is supported: 'PSRC'. See the "Loose objects
    and unreachable objects" section for supported values and how this
    is used. All other keys are reserved. Readers must ignore
    unrecognized keys.
- Zero or more NUL bytes. This can optionally be used to improve the
  alignment of the full object name table below.
- Tables for the first object format:
  * A sorted table of shortened object names. These are prefixes of
    the names of all objects in this pack file, packed together
    without offset values to reduce the cache footprint of the binary
    search for a specific object name.

  * A table of full object names in pack order. This allows resolving
    a reference to "the nth object in the pack file" (from a
    reachability bitmap or from the next table of another object
    format) to its object name.

  * A table of 4-byte values mapping object name order to pack order.
    For an object in the table of sorted shortened object names, the
    value at the corresponding index in this table is the index in the
    previous table for that same object.
    This can be used to look up the object in reachability bitmaps or
    to look up its name in another object format.

  * A table of 4-byte CRC32 values of the packed object data, in the
    order that the objects appear in the pack file. This is to allow
    compressed data to be copied directly from pack to pack during
    repacking without undetected data corruption.

  * A table of 4-byte offset values. For an object in the table of
    sorted shortened object names, the value at the corresponding
    index in this table indicates where that object can be found in
    the pack file. These are usually 31-bit pack file offsets, but
    large offsets are encoded as an index into the next table with the
    most significant bit set.

  * A table of 8-byte offset entries (empty for pack files less than
    2 GiB). Pack files are organized with heavily used objects toward
    the front, so most object references should not need to refer to
    this table.
- Zero or more NUL bytes.
- Tables for the second object format, with the same layout as above,
  up to and not including the table of CRC32 values.
- Zero or more NUL bytes.
- The trailer consists of the following:
  * A copy of the 32-byte SHA-256 checksum at the end of the
    corresponding packfile.

  * 32-byte SHA-256 checksum of all of the above.
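
To make the header layout concrete, here is a small Python sketch that
packs and parses just the header fields listed above. It is an
illustration under assumptions, not a reference implementation:
`build_header` and `parse_header` are hypothetical names, and the
'sha1' and 's256' identifiers used in the usage lines are assumed
4-byte format identifiers.

```python
import struct

SIGNATURE = b"\xfftOc"  # the bytes written as '\377tOc'


def build_header(num_objects, formats, trailer_offset, pairs=()):
    """formats: (4-byte id, shortened name length, tables offset) tuples."""
    body = b"".join(struct.pack(">4sII", *fmt) for fmt in formats)
    body += struct.pack(">I", trailer_offset)
    body += b"".join(struct.pack(">4sI", key, value) for key, value in pairs)
    header_len = 20 + len(body)  # five fixed 4-byte fields come first
    fixed = struct.pack(">4sIIII", SIGNATURE, 3, header_len,
                        num_objects, len(formats))
    return fixed + body


def parse_header(data):
    sig, version, header_len, num_objects, num_formats = struct.unpack_from(
        ">4sIIII", data, 0)
    if sig != SIGNATURE or version != 3:
        raise ValueError("not a v3 pack index")
    pos, formats = 20, []
    for _ in range(num_formats):
        formats.append(struct.unpack_from(">4sII", data, pos))
        pos += 12
    (trailer_offset,) = struct.unpack_from(">I", data, pos)
    pos += 4
    pairs = {}
    while pos + 8 <= header_len:  # remaining bytes are key/value pairs
        key, value = struct.unpack_from(">4sI", data, pos)
        pairs[key] = value        # a real reader ignores unknown keys
        pos += 8
    return {"header_len": header_len, "num_objects": num_objects,
            "formats": formats, "trailer_offset": trailer_offset,
            "pairs": pairs}


header = build_header(3, [(b"s256", 4, 64), (b"sha1", 4, 256)], 512,
                      [(b"PSRC", 4)])
parsed = parse_header(header)
```

All integers use the ">" (network byte order) struct prefix, matching
the byte order stated for the format.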

Loose object index
~~~~~~~~~~~~~~~~~~
A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
all loose objects. Its format is

    # loose-object-idx
    (sha256-name SP sha1-name LF)*

where the object names are in hexadecimal format. The file is not
sorted.

The loose object index is protected against concurrent writes by a
lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
object:

1. Write the loose object to a temporary file, like today.
2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
3. Rename the loose object into place.
4. Open loose-object-idx with O_APPEND and write the new object.
5. Unlink loose-object-idx.lock to release the lock.

To remove entries (e.g. in "git pack-refs" or "git prune"):

1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
   lock.
2. Write the new content to loose-object-idx.lock.
3. Unlink any loose objects being removed.
4. Rename to replace loose-object-idx, releasing the lock.
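
The "add a new loose object" sequence could be sketched as follows in
Python. The function name and arguments are hypothetical, step 1 is
assumed to have happened already (the caller passes the temporary
file), and writing the header line when the file is first created is
an assumption of this sketch.

```python
import os

IDX_HEADER = b"# loose-object-idx\n"


def add_loose_object(objdir, tmp_path, final_path, sha256_hex, sha1_hex):
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    # Step 2: O_CREAT | O_EXCL raises FileExistsError while another
    # writer holds the lock.
    os.close(os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))
    try:
        # Step 3: move the already-written temporary file into place.
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        os.rename(tmp_path, final_path)
        # Step 4: append one "sha256-name SP sha1-name LF" record.
        is_new = not os.path.exists(idx)
        fd = os.open(idx, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            if is_new:
                os.write(fd, IDX_HEADER)
            os.write(fd, f"{sha256_hex} {sha1_hex}\n".encode("ascii"))
        finally:
            os.close(fd)
    finally:
        # Step 5: release the lock.
        os.unlink(lock)
```

A caller that finds the lock already taken gets FileExistsError and
would have to retry or give up; that policy is left out here.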

Translation table
~~~~~~~~~~~~~~~~~
The index files support a bidirectional mapping between SHA-1 names
and SHA-256 names. The lookup proceeds similarly to ordinary object
lookups. For example, to convert a SHA-1 name to a SHA-256 name:

 1. Look for the object in idx files. If a match is present in the
    idx's sorted list of truncated SHA-1 names, then:
    a. Read the corresponding entry in the SHA-1 name order to pack
       order mapping.
    b. Read the corresponding entry in the full SHA-1 name table to
       verify we found the right object. If it is, then
    c. Read the corresponding entry in the full SHA-256 name table.
       That is the object's SHA-256 name.
 2. Check for a loose object. Read lines from loose-object-idx until
    we find a match.

Step (1) takes the same amount of time as an ordinary object lookup:
O(number of packs * log(objects per pack)). Step (2) takes O(number of
loose objects) time. To maintain good performance it will be necessary
to keep the number of loose objects low. See the "Loose objects and
unreachable objects" section below for more details.

Since all operations that make new objects (e.g., "git commit") add
the new objects to the corresponding index, this mapping is possible
for all objects in the object store.
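
The lookup can be sketched in Python for a single pack as below. The
tables are modelled as in-memory lists of hexadecimal strings rather
than on-disk tables, and `build_tables` and `sha1_to_sha256` are
hypothetical helper names; a real lookup would repeat step 1 for each
pack index.

```python
from bisect import bisect_left


def build_tables(pairs, short_len):
    """pairs: (sha1_name, sha256_name) tuples in pack order."""
    full_sha1 = [sha1 for sha1, _ in pairs]
    full_sha256 = [sha256 for _, sha256 in pairs]
    # Name order to pack order mapping for the SHA-1 names.
    name_to_pack = sorted(range(len(pairs)), key=lambda i: full_sha1[i])
    short_sha1 = [full_sha1[i][:short_len] for i in name_to_pack]
    return short_len, short_sha1, name_to_pack, full_sha1, full_sha256


def sha1_to_sha256(tables, loose_idx_lines, sha1_name):
    short_len, short_sha1, name_to_pack, full_sha1, full_sha256 = tables
    # Step 1: binary search in the sorted table of shortened names.
    prefix = sha1_name[:short_len]
    pos = bisect_left(short_sha1, prefix)
    if pos < len(short_sha1) and short_sha1[pos] == prefix:
        pack_pos = name_to_pack[pos]               # step 1a
        if full_sha1[pack_pos] == sha1_name:       # step 1b
            return full_sha256[pack_pos]           # step 1c
    # Step 2: linear scan of loose-object-idx.
    for line in loose_idx_lines:
        if line.startswith("#"):
            continue
        sha256_name, loose_sha1 = line.split()
        if loose_sha1 == sha1_name:
            return sha256_name
    return None
```

The binary search touches only the compact shortened-name table; the
full name tables are read once, to confirm the match and fetch the
other name.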

Reading an object's SHA-1 content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The SHA-1 content of an object can be read by converting all SHA-256
names that its SHA-256 content references into SHA-1 names using the
translation table.

Fetch
~~~~~
Fetching from a SHA-1 based server requires translating between SHA-1
and SHA-256 based representations on the fly.

SHA-1s named in the ref advertisement that are present on the client
can be translated to SHA-256 and looked up as local objects using the
translation table.

Negotiation proceeds as today. Any "have"s generated locally are
converted to SHA-1 before being sent to the server, and SHA-1s
mentioned by the server are converted to SHA-256 when looking them up
locally.

After negotiation, the server sends a packfile containing the
requested objects. We convert the packfile to SHA-256 format using
the following steps:

1. index-pack: inflate each object in the packfile and compute its
   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
   objects the client has locally. These objects can be looked up
   using the translation table and their SHA-1 content read as
   described above to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
   phase, walk through objects in the pack and emit a list of them,
   excluding blobs, in reverse topologically sorted order, with each
   object coming later in the list than all objects it references.
   (This list only contains objects reachable from the "wants". If the
   pack from the server contained additional extraneous objects, then
   they will be discarded.)
3. convert to SHA-256: open a new SHA-256 packfile. Read the topologically
   sorted list just generated. For each object, inflate its
   SHA-1 content, convert to SHA-256 content, and write it to the SHA-256
   pack. Record the new SHA-1<-->SHA-256 mapping entry for use in the idx.
4. sort: reorder entries in the new pack to match the order of objects
   in the pack the server generated and include blobs. Write a SHA-256 idx
   file.
5. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server in steps 1
   and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.
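
The ordering that step 2 asks for can be sketched with an iterative
depth-first walk. In this Python illustration `references` is an
assumed in-memory dict from each non-blob object in the pack to the
non-blob objects it references; the function name is hypothetical.

```python
def reverse_topological_order(wants, references):
    """List pack objects so that each comes after everything it references."""
    ordered, seen = [], set()
    for want in wants:
        if want in seen or want not in references:
            continue
        seen.add(want)
        stack = [(want, iter(references[want]))]
        while stack:
            name, children = stack[-1]
            for child in children:
                # Objects outside the pack are already local; skip them.
                if child in references and child not in seen:
                    seen.add(child)
                    stack.append((child, iter(references[child])))
                    break
            else:
                # All referenced objects were emitted; emit this one.
                ordered.append(name)
                stack.pop()
    return ordered


refs = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["tree1"],
    "tree1": [],
    "extra": [],  # not reachable from the want, so it is dropped
}
order = reverse_topological_order(["commit2"], refs)
```

Converting objects in this order means the SHA-256 name of every
referenced object is already in the translation table when it is
needed.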

As an optimization, step 1 could write a file describing what non-blob
objects each object it has inflated from the packfile references. This
makes the topological sort in step 2 possible without inflating the
objects in the packfile for a second time. The objects need to be
inflated again in step 3, for a total of two inflations.

Step 4 is probably necessary for good read-time performance. "git
pack-objects" on the server optimizes the pack file for good data
locality (see Documentation/technical/pack-heuristics.txt).

Details of this process are likely to change. It will take some
experimenting to get this to perform well.

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The SHA-1 content
of each object being pushed can be read as described in the "Reading
an object's SHA-1 content" section to generate the pack written by git
send-pack.

Signed Commits
~~~~~~~~~~~~~~
We add a new field "gpgsig-sha256" to the commit object format to allow
signing commits without relying on SHA-1. It is similar to the
existing "gpgsig" field. Its signed payload is the SHA-256 content of the
commit object with any "gpgsig" and "gpgsig-sha256" fields removed.

This means commits can be signed

1. using SHA-1 only, as in existing signed commit objects
2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
   fields.
3. using only SHA-256, by only using the gpgsig-sha256 field.

Old versions of "git verify-commit" can verify the gpgsig signature in
cases (1) and (2) without modifications and view case (3) as an
ordinary unsigned commit.
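
As an illustration of how the signed payload could be derived, the
following Python sketch drops both signature fields from a commit's
content. It assumes the existing commit header layout, in which a
multi-line header value continues on lines that start with a space;
`signed_payload` is a hypothetical name.

```python
SIGNATURE_FIELDS = (b"gpgsig", b"gpgsig-sha256")


def signed_payload(commit_content):
    """Return the commit content with all signature header fields removed."""
    headers, separator, message = commit_content.partition(b"\n\n")
    kept, skipping = [], False
    for line in headers.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: belongs to the preceding header field.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.split(b" ", 1)[0] in SIGNATURE_FIELDS
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + separator + message


commit = (
    b"tree 1234\n"
    b"author A U Thor <author@example.com> 1 +0000\n"
    b"committer A U Thor <author@example.com> 1 +0000\n"
    b"gpgsig -----BEGIN PGP SIGNATURE-----\n \n abc\n"
    b" -----END PGP SIGNATURE-----\n"
    b"gpgsig-sha256 -----BEGIN PGP SIGNATURE-----\n \n def\n"
    b" -----END PGP SIGNATURE-----\n"
    b"\n"
    b"subject line\n"
)
payload = signed_payload(commit)
```

Because both fields are removed before signing, adding or dropping
either signature later does not invalidate the other one.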

Signed Tags
~~~~~~~~~~~
We add a new field "gpgsig-sha256" to the tag object format to allow
signing tags without relying on SHA-1. Its signed payload is the
SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
SIGNATURE-----" delimited in-body signature removed.

This means tags can be signed

1. using SHA-1 only, as in existing signed tag objects
2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
   signature.
3. using only SHA-256, by only using the gpgsig-sha256 field.

Mergetag embedding
~~~~~~~~~~~~~~~~~~
The mergetag field in the SHA-1 content of a commit contains the
SHA-1 content of a tag that was merged by that commit.

The mergetag field in the SHA-256 content of the same commit contains the
SHA-256 content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The translation table of the submodule
can be used to look up the new hash.

Loose objects and unreachable objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fast lookups in the loose-object-idx require that the number of loose
objects not grow too high.

"git gc --auto" currently waits for there to be 6700 loose objects
present before consolidating them into a packfile. We will need to
measure to find a more appropriate threshold for it to use.

"git gc --auto" currently waits for there to be 50 packs present
before combining packfiles. Packing loose objects more aggressively
may cause the number of pack files to grow too quickly. This can be
mitigated by using a strategy similar to Martin Fick's exponential
rolling garbage collection script:
https://gerrit-review.googlesource.com/c/gerrit/+/35215

"git gc" currently expels any unreachable objects it encounters in
pack files to loose objects in an attempt to prevent a race when
pruning them (in case another process is simultaneously writing a new
object that refers to the about-to-be-deleted object). This leads to
an explosion in the number of loose objects present and disk space
usage due to the objects in delta form being replaced with independent
loose objects. Worse, the race is still present for loose objects.

Instead, "git gc" will need to move unreachable objects to a new
packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
below). To avoid the race when writing new objects referring to an
about-to-be-deleted object, code paths that write new objects will
need to copy any objects from UNREACHABLE_GARBAGE packs that they
refer to into new, non-UNREACHABLE_GARBAGE packs (or loose objects).
UNREACHABLE_GARBAGE packs are then safe to delete if their creation
time (as indicated by the file's mtime) is long enough ago.

To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
combined under certain circumstances. If "gc.garbageTtl" is set to
greater than one day, then packs created within a single calendar day,
UTC, can be coalesced together. The resulting packfile would have an
mtime before midnight on that day, so this makes the effective maximum
ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
then we divide the calendar day into intervals one-third of that ttl
in duration. Packs created within the same interval can be coalesced
together. The resulting packfile would have an mtime before the end of
the interval, so this makes the effective maximum ttl equal to the
garbageTtl * 4/3.
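
The arithmetic of this coalescing rule can be checked with a short
Python sketch. Times are seconds since the epoch (UTC), the function
names are hypothetical, and treating a ttl of exactly one day like the
"greater than one day" case is an assumption, since the text leaves
that case open.

```python
DAY = 24 * 60 * 60


def coalesce_interval(garbage_ttl):
    """Length of the window within which garbage packs may be merged."""
    # Assumption: a ttl of exactly one day uses the whole-day window.
    return DAY if garbage_ttl >= DAY else garbage_ttl / 3


def bucket(mtime, garbage_ttl):
    """Start of the interval a pack falls into; equal buckets may merge."""
    interval = coalesce_interval(garbage_ttl)
    day_start = mtime - mtime % DAY
    return day_start + ((mtime - day_start) // interval) * interval


def effective_max_ttl(garbage_ttl):
    # A merged pack may be up to one interval newer than its oldest
    # member, so garbage can live that much longer than the ttl.
    return garbage_ttl + coalesce_interval(garbage_ttl)
```

With a ttl of six hours the interval is two hours, and the effective
maximum ttl is eight hours, which is the 4/3 factor stated above.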

This rule comes from Thirumala Reddy Mutchukota's JGit change
https://git.eclipse.org/r/90465.

The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
index. More generally, that field indicates where a pack came from:

 - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
 - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight
   "gc --auto" operation
 - 3 (PACK_SOURCE_GC) for a pack created by a full gc
 - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage
   discovered by gc
 - 5 (PACK_SOURCE_INSERT) for locally created objects that were
   written directly to a pack file, e.g. from "git add ."

This information can be useful for debugging and for "gc --auto" to
make appropriate choices about which packs to coalesce.

Caveats
-------
Invalid objects
~~~~~~~~~~~~~~~
The conversion from SHA-1 content to SHA-256 content retains any
brokenness in the original object (e.g., tree entry modes encoded with
leading 0, tree objects whose paths are not sorted correctly, and
commit objects without an author or committer). This is a deliberate
feature of the design to allow the conversion to round-trip.

More profoundly broken objects (e.g., a commit with a truncated "tree"
header line) cannot be converted but were not usable by current Git
anyway.

Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because it requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules. Protocol improvements might
allow lifting this restriction.

Alternates
~~~~~~~~~~
For the same reason, a SHA-256 repository cannot borrow objects from a
SHA-1 repository using objects/info/alternates or
$GIT_ALTERNATE_OBJECT_DIRECTORIES.

git notes
~~~~~~~~~
The "git notes" tool annotates objects using their SHA-1 name as key.
This design does not describe a way to migrate notes trees to use
SHA-256 names. That migration is expected to happen separately (for
example using a file at the root of the notes tree to describe which
hash it uses).

Server-side cost
~~~~~~~~~~~~~~~~
Until Git protocol gains SHA-256 support, using SHA-256 based storage
on public-facing Git servers is strongly discouraged. Once Git
protocol gains SHA-256 support, SHA-256 based servers are likely not
to support SHA-1 compatibility, to avoid what may be a very expensive
hash re-encode during clone and to encourage peers to modernize.

The design described here allows fetches by SHA-1 clients of a
personal SHA-256 repository because it's not much more difficult than
allowing pushes from that repository. This support needs to be guarded
by a configuration option --- servers like git.kernel.org that serve a
large number of clients would not be expected to bear that cost.

Meaning of signatures
~~~~~~~~~~~~~~~~~~~~~
The signed payload for signed commits and tags does not explicitly
name the hash used to identify objects. If some day Git adopts a new
hash function with the same length as the current SHA-1 (40
hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the
intent behind the PGP signed payload in an object signature is
unclear:

    object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
    type commit
    tag v2.12.0
    tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800

    Git 2.12

Does this mean Git v2.12.0 is the commit with SHA-1 name
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?

Fortunately SHA-256 and SHA-1 have different lengths. If Git starts
using another hash with the same length to name objects, then it will
need to change the format of signed payloads using that hash to
address this issue.

Object names on the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To support the transition (see Transition plan below), this design
supports four different modes of operation:

 1. ("dark launch") Treat object names input by the user as SHA-1 and
    convert any object names written to output to SHA-1, but store
    objects using SHA-256. This allows users to test the code with no
    visible behavior change except for performance. This also allows
    running even tests that assume the SHA-1 hash function, to
    sanity-check the behavior of the new mode.

 2. ("early transition") Allow both SHA-1 and SHA-256 object names in
    input. Any object names written to output use SHA-1. This allows
    users to continue to make use of SHA-1 to communicate with peers
    (e.g. by email) that have not migrated yet and prepares for mode 3.

 3. ("late transition") Allow both SHA-1 and SHA-256 object names in
    input. Any object names written to output use SHA-256. In this
    mode, users are using a more secure object naming method by
    default. The disruption is minimal as long as most of their peers
    are in mode 2 or mode 3.

 4. ("post-transition") Treat object names input by the user as
    SHA-256 and write output using SHA-256. This is safer than mode 3
    because there is less risk that input is incorrectly interpreted
    using the wrong hash function.

The mode is specified in configuration.

The user can also explicitly specify which format to use for a
particular revision specifier and for output, overriding the mode. For
example:

    git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
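
The four modes and the two overrides can be summarized as a small
lookup table. The Python sketch below is only a restatement of the
list above; the `MODES` table and both function names are hypothetical
and do not correspond to Git configuration keys.

```python
# mode name -> (formats accepted on input, format used for output)
MODES = {
    "dark launch": ({"sha1"}, "sha1"),
    "early transition": ({"sha1", "sha256"}, "sha1"),
    "late transition": ({"sha1", "sha256"}, "sha256"),
    "post-transition": ({"sha256"}, "sha256"),
}


def input_formats(mode, suffix=None):
    """Formats under which an object name given as input is interpreted."""
    if suffix is not None:
        # An explicit ^{sha1} or ^{sha256} suffix overrides the mode.
        return {suffix}
    return set(MODES[mode][0])


def output_format(mode, override=None):
    """Format for printed object names; --output-format overrides the mode."""
    return override or MODES[mode][1]
```

In every mode objects are stored using SHA-256; the modes differ only
in how names are read and printed.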

Choice of Hash
--------------
In early 2005, around the time that Git was written, Xiaoyun Wang,
Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
collisions in 2^69 operations. In August they published details.
Luckily, no practical demonstrations of a collision in full SHA-1 were
published until more than 10 years later, in 2017.

Git v2.13.0 and later subsequently moved to a hardened SHA-1
implementation by default that mitigates the SHAttered attack, but
SHA-1 is still believed to be weak.

The hash to replace this hardened SHA-1 should be stronger than SHA-1
was: we would like it to be trustworthy and useful in practice for at
least 10 years.

Some other relevant properties:

1. A 256-bit hash (long enough to match common security practice; not
   excessively long to hurt performance and disk usage).

2. High quality implementations should be widely available (e.g., in
   OpenSSL and Apple CommonCrypto).

3. The hash function's properties should match Git's needs (e.g. Git
   requires collision and 2nd preimage resistance and does not require
   length extension resistance).

4. As a tiebreaker, the hash should be fast to compute (fortunately
   many contenders are faster than SHA-1).

We choose SHA-256.
752414ae
JN
636
637Transition plan
638---------------
639Some initial steps can be implemented independently of one another:
de82095a 640
752414ae 641- adding a hash function API (vtable)
0ed8d8da 642- teaching fsck to tolerate the gpgsig-sha256 field
752414ae
JN
643- excluding gpgsig-* from the fields copied by "git commit --amend"
644- annotating tests that depend on SHA-1 values with a SHA1 test
645 prerequisite
646- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ
647 consistently instead of "unsigned char *" and the hardcoded
648 constants 20 and 40.
649- introducing index v3
650- adding support for the PSRC field and safer object pruning
651
The first user-visible change is the introduction of the objectFormat
extension (without compatObjectFormat). This requires:

- teaching fsck about this mode of operation
- using the hash function API (vtable) when computing object names
- signing objects and verifying signatures
- rejecting attempts to fetch from or push to an incompatible
  repository

Next comes the introduction of compatObjectFormat:

- implementing the loose-object-idx
- translating object names between object formats
- translating object content between object formats
- generating and verifying signatures in the compat format
- adding appropriate index entries when adding a new object to the
  object store
- adding the --output-format option
- supporting the ^{sha1} and ^{sha256} revision notation
- configuration to specify default input and output format (see
  "Object names on the command line" above)

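Translating object names and object content between formats can be
pictured with the toy sketch below. It assumes that an object's SHA-1
content is obtained from its SHA-256 content by rewriting the names of the
objects it references, and it handles only the "tree" and "parent" lines of
a commit. `TranslationTable` and `commit_to_sha1_content` are hypothetical
names; this is not the on-disk format or Git's implementation.

```python
import hashlib


def object_name(algo: str, obj_type: str, content: bytes) -> str:
    """Hypothetical helper: hex name of an object under `algo`."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.new(algo, header + content).hexdigest()


class TranslationTable:
    """Hypothetical in-memory stand-in for the bidirectional mapping."""

    def __init__(self) -> None:
        self.sha1_to_sha256 = {}
        self.sha256_to_sha1 = {}

    def add(self, sha1: str, sha256: str) -> None:
        self.sha1_to_sha256[sha1] = sha256
        self.sha256_to_sha1[sha256] = sha1


def commit_to_sha1_content(sha256_content: bytes, table: TranslationTable) -> bytes:
    """Rewrite the tree/parent names in a commit's SHA-256 content."""
    headers, sep, message = sha256_content.partition(b"\n\n")
    out = []
    for line in headers.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            sha1 = table.sha256_to_sha1[value.decode("ascii")]
            line = key + b" " + sha1.encode("ascii")
        out.append(line)  # every other header line is copied unchanged
    return b"\n".join(out) + sep + message


table = TranslationTable()

# The empty tree references nothing, so its content is the same in both formats.
tree_sha1 = object_name("sha1", "tree", b"")
tree_sha256 = object_name("sha256", "tree", b"")
table.add(tree_sha1, tree_sha256)

commit_sha256_content = (
    f"tree {tree_sha256}\n"
    "author A U Thor <author@example.com> 1 +0000\n"
    "committer A U Thor <author@example.com> 1 +0000\n"
    "\n"
    "initial commit\n"
).encode("ascii")
commit_sha1_content = commit_to_sha1_content(commit_sha256_content, table)

# Record the commit's two names so that later objects can reference it.
table.add(
    object_name("sha1", "commit", commit_sha1_content),
    object_name("sha256", "commit", commit_sha256_content),
)
print(commit_sha1_content.decode("ascii"))
```

Because each object's converted content depends on the names of the objects
it references, those objects must already be in the table when it is
converted.
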
The next step is supporting fetches and pushes to SHA-1 repositories:

- allow pushes to a repository using the compat format
- generate a topologically sorted list of the SHA-1 names of fetched
  objects
- convert the fetched packfile to SHA-256 format and generate an idx
  file
- re-sort to match the order of objects in the fetched packfile

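The ordering of the fetch steps above can be illustrated with a small
sketch. Everything here is a toy model: the short labels stand in for SHA-1
names, the "content" is not Git's object format, and `convert_fetched` is a
hypothetical name. It also assumes every referenced object is present in
the fetched pack. It uses `graphlib.TopologicalSorter` from the Python
standard library (Python 3.9 or later).

```python
import hashlib
from graphlib import TopologicalSorter

# Toy fetched pack, in pack order:
# stand-in SHA-1 name -> (payload, stand-in SHA-1 names it references)
fetched = {
    "c2": (b"commit 2", ["c1", "t2"]),
    "t2": (b"tree 2", []),
    "c1": (b"commit 1", ["t1"]),
    "t1": (b"tree 1", []),
}


def convert_fetched(fetched):
    # Topological sort: every object comes after the objects it references.
    graph = {name: set(refs) for name, (_, refs) in fetched.items()}
    order = list(TopologicalSorter(graph).static_order())

    # Convert in that order, so referenced names are already translated.
    sha1_to_sha256 = {}
    for name in order:
        payload, refs = fetched[name]
        new_refs = [sha1_to_sha256[ref] for ref in refs]
        toy_content = payload + b"\0" + " ".join(new_refs).encode("ascii")
        sha1_to_sha256[name] = hashlib.sha256(toy_content).hexdigest()

    # Re-sort to match the order of objects in the fetched pack.
    in_pack_order = [(name, sha1_to_sha256[name]) for name in fetched]
    return order, in_pack_order


order, in_pack_order = convert_fetched(fetched)
print(order)
for old, new in in_pack_order:
    print(old, new[:12])
```

The two orderings serve different purposes: the topological order makes
conversion possible, while the pack order is what the resulting index has
to describe.
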
The infrastructure supporting fetch also allows converting an existing
repository. In converted repositories and new clones, end users can
gain support for the new hash function without any visible change in
behavior (see "dark launch" in the "Object names on the command line"
section). In particular this allows users to verify SHA-256 signatures
on objects in the repository, and it should ensure the transition code
is stable in production in preparation for using it more widely.

Over time projects would encourage their users to adopt the "early
transition" and then "late transition" modes to take advantage of the
new, more futureproof SHA-256 object names.

When objectFormat and compatObjectFormat are both set, commands
generating signatures would generate both SHA-1 and SHA-256 signatures
by default to support both new and old users.

In projects using SHA-256 heavily, users could be encouraged to adopt
the "post-transition" mode to avoid accidentally making implicit use
of SHA-1 object names.

Once a critical mass of users have upgraded to a version of Git that
can verify SHA-256 signatures and have converted their existing
repositories to support verifying them, we can add support for a
setting to generate only SHA-256 signatures. This is expected to be at
least a year later.

That is also a good moment to advertise the ability to convert
repositories to use SHA-256 only, stripping out all SHA-1 related
metadata. This improves performance by eliminating translation
overhead and security by avoiding the possibility of accidentally
relying on the safety of SHA-1.

Updating Git's protocols to allow a server to specify which hash
functions it supports is also an important part of this transition. It
is not discussed in detail in this document but this transition plan
assumes it happens. :)

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc.) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.

* You cannot trust its history (needed for bisectability) in the
  future without further work
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1.

Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha256 field in commit and tag objects
for SHA-256 content based signatures, an earlier version of this design
added "hash sha256 <SHA-256 name>" fields to strengthen the existing
SHA-1 content based signatures.

In other words, a single signature was used to attest to the object
content using both hash functions. This had some advantages:

* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
  attest to the SHA-1 name and SHA-256 name referring to the same object.
* All users consume the same signature. Broken signatures are likely
  to be detected quickly using current versions of Git.

However, it also came with disadvantages:

* Verifying a signed object requires access to the SHA-1 names of all
  objects it references, even after the transition is complete and
  the translation table is no longer needed for anything else. To support
  this, the design added fields such as "hash sha1 tree <SHA-1 name>"
  and "hash sha1 parent <SHA-1 name>" to the SHA-256 content of a signed
  commit, complicating the conversion process.
* Allowing signed objects without a SHA-1 (for after the transition is
  complete) complicated the design further, requiring a "nohash sha1"
  field to suppress including "hash sha1" fields in the SHA-256 content
  and signed payload.

Lazily populated translation table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of the work of building the translation table could be deferred to
push time, but that would significantly complicate and slow down pushes.
Calculating the SHA-1 name at object creation time, while the object is
being streamed to disk and its SHA-256 name is being calculated, should
be an acceptable cost.

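The cost argument can be seen in a minimal sketch: both names come out of a
single pass over the data while it is written out. This is a simplification
that assumes the content of a blob, which is identical in both formats;
`stream_object` is a hypothetical name, and real object storage (for
example, compression) is left out.

```python
import hashlib
import io


def stream_object(src, dst, obj_type: str, size: int, chunk_size: int = 65536):
    """Copy `src` to `dst` once, feeding both hashes along the way."""
    header = f"{obj_type} {size}\0".encode("ascii")
    sha1 = hashlib.sha1(header)
    sha256 = hashlib.sha256(header)
    written = 0
    while chunk := src.read(chunk_size):
        sha1.update(chunk)      # eager SHA-1 name for the translation table
        sha256.update(chunk)    # SHA-256 name of the stored object
        dst.write(chunk)
        written += len(chunk)
    if written != size:
        raise ValueError("stream length does not match the declared size")
    return sha1.hexdigest(), sha256.hexdigest()


data = b"hello\n"
out = io.BytesIO()
sha1_name, sha256_name = stream_object(io.BytesIO(data), out, "blob", len(data))
print(sha1_name)
print(sha256_name)
```

The extra work per chunk is one additional hash update, with no second read
of the data, which is why doing it eagerly is preferable to deferring it to
push time.
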
Document History
----------------

2017-03-03
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
sbeller@google.com

* Initial version sent to http://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com

2017-03-03 jrnieder@gmail.com
Incorporated suggestions from jonathantanmy and sbeller:

* Describe purpose of signed objects with each hash type
* Redefine signed object verification using object content under the
  first hash function

2017-03-06 jrnieder@gmail.com

* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
* Make SHA3-based signatures a separate field, avoiding the need for
  "hash" and "nohash" fields (thanks to peff[3]).
* Add a sorting phase to fetch (thanks to Junio for noticing the need
  for this).
* Omit blobs from the topological sort during fetch (thanks to peff).
* Discuss alternates, git notes, and git servers in the caveats
  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
  Pearce).
* Clarify language throughout (thanks to various commenters,
  especially Junio).

2017-09-27 jrnieder@gmail.com, sbeller@google.com

* Use placeholder NewHash instead of SHA3-256
* Describe criteria for picking a hash function.
* Include a transition plan (thanks especially to Brandon Williams
  for fleshing these ideas out)
* Define the translation table (thanks, Shawn Pearce[5], Jonathan
  Tan, and Masaya Suzuki)
* Avoid loose object overhead by packing more aggressively in
  "git gc --auto"

Later history:

* See the history of this file in git.git for the history of subsequent
  edits. This document history is no longer being maintained as it
  would now be superfluous to the commit log

References:

 [1] http://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
 [2] http://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
 [3] http://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
 [4] http://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
 [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/