Git hash function transition
============================

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
At its core, the Git version control system is a content addressable
filesystem. It uses the SHA-1 hash function to name content. For
example, files, directories, and revisions are referred to by hash
values unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional
advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. On 23 February 2017 the SHAttered attack
(https://shattered.io) demonstrated a practical SHA-1 hash collision.

Git v2.13.0 and later subsequently moved to a hardened SHA-1
implementation by default, which isn't vulnerable to the SHAttered
attack, but SHA-1 is still weak.

Thus it's considered prudent to move past any variant of SHA-1
to a new hash. There's no guarantee that further attacks on SHA-1 won't
be published in the future, and those attacks may not have viable
mitigations.

If SHA-1 and its variants were to be truly broken, Git's hash function
could not be considered cryptographically secure any more. This would
impact the communication of hash values because we could not trust
that a given hash value represented the known good version of content
that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable in those respects.

Choice of Hash
--------------
The hash to replace the hardened SHA-1 should be stronger than SHA-1
was: we would like it to be trustworthy and useful in practice for at
least 10 years.

Some other relevant properties:

1. A 256-bit hash (long enough to match common security practice; not
   so long as to hurt performance and disk usage).

2. High quality implementations should be widely available (e.g., in
   OpenSSL and Apple CommonCrypto).

3. The hash function's properties should match Git's needs (e.g. Git
   requires collision and 2nd preimage resistance and does not require
   length extension resistance).

4. As a tiebreaker, the hash should be fast to compute (fortunately
   many contenders are faster than SHA-1).

There were several contenders for a successor hash to SHA-1, including
SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.

In late 2018 the project picked SHA-256 as its successor hash.

See 0ed8d8da374 (doc hash-function-transition: pick SHA-256 as
NewHash, 2018-08-04) and numerous mailing list threads at the time,
particularly the one starting at
https://lore.kernel.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/
for more information.

Goals
-----
1. The transition to SHA-256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA-256 repository can communicate with SHA-1 Git servers
      (push/fetch).
   c. Users can use SHA-1 and SHA-256 identifiers for objects
      interchangeably (see "Object names on the command line", below).
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be removed from a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA-256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in Git's formats and
   protocols.
5. Shallow clones and fetches into a SHA-256 repository. (This will
   change when we add SHA-256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA-256
   repository. (This also depends on SHA-256 support in Git
   protocol.)

Overview
--------
We introduce a new repository format extension. Repositories with this
extension enabled use SHA-256 instead of SHA-1 to name their objects.
This affects both object names and object content -- both the names
of objects and all references to other objects within an object are
switched to the new hash function.

SHA-256 repositories cannot be read by older versions of Git.

Alongside the packfile, a SHA-256 repository stores a bidirectional
mapping between SHA-256 and SHA-1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their SHA-1 or SHA-256 names
interchangeably.

"git cat-file" and "git hash-object" gain options to display an object
in its SHA-1 form and write an object given its SHA-1 form. This
requires all objects referenced by that object to be present in the
object database so that they can be named using the appropriate name
(using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
SHA-256 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into SHA-1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Repository format extension
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A SHA-256 repository uses repository format version `1` (see
Documentation/technical/repository-version.txt) with extensions
`objectFormat` and `compatObjectFormat`:

	[core]
		repositoryFormatVersion = 1
	[extensions]
		objectFormat = sha256
		compatObjectFormat = sha1

The combination of setting `core.repositoryFormatVersion=1` and
populating `extensions.*` ensures that all versions of Git later than
`v0.99.9l` will die with an error message instead of trying to operate
on the SHA-256 repository:

	# Between v0.99.9l and v2.7.0
	$ git status
	fatal: Expected git repo version <= 0, found 1
	# After v2.7.0
	$ git status
	fatal: unknown repository extensions found:
		objectformat
		compatobjectformat

See the "Transition plan" section below for more details on these
repository extensions.

Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit SHA-1 name or 64
hexadecimal digit SHA-256 name, plus names derived from those (see
gitrevisions(7)).

The SHA-1 name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's SHA-1 content. This is the
traditional <sha1> used in Git to name objects.

The SHA-256 name of an object is the SHA-256 of the concatenation of its
type, length, a nul byte, and the object's SHA-256 content.

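The naming rule above can be sketched in a few lines of Python. This
is an illustration only, not Git's implementation: the helper name
`object_name` is hypothetical, and the single space between the type
and the length follows the header Git already uses for loose objects.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>\\0<content>" with the chosen hash function.

    `content` must already be in the matching form: SHA-1 content for
    algo="sha1", SHA-256 content for algo="sha256". For blobs the two
    forms are the same bytes.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()


blob = b"hello\n"
sha1_name = object_name("blob", blob, "sha1")
sha256_name = object_name("blob", blob, "sha256")
print(len(sha1_name), len(sha256_name))  # 40 and 64 hexadecimal digits
```

The same bytes get two unrelated names, which is why the design needs
an explicit translation table instead of deriving one name from the
other.
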
Object format
~~~~~~~~~~~~~
The content as a byte sequence of a tag, commit, or tree object named
by SHA-1 differs from that of the same object named by SHA-256, because
an object named by SHA-256 name refers to other objects by their
SHA-256 names and an object named by SHA-1 name refers to other
objects by their SHA-1 names.

The SHA-256 content of an object is the same as its SHA-1 content, except
that objects referenced by the object are named using their SHA-256 names
instead of SHA-1 names. Because a blob object does not refer to any
other object, its SHA-1 content and SHA-256 content are the same.

The format allows round-trip conversion between SHA-256 content and
SHA-1 content.

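As a rough sketch of what such a conversion involves, the following
Python rewrites the raw object names inside a tree object. It relies
on the existing tree entry layout ("<mode> <path>", a NUL byte, then
the raw binary object name). The function name `convert_tree` and the
plain dictionary standing in for the translation table are
assumptions made for illustration.

```python
def convert_tree(content: bytes, mapping: dict, src_len: int) -> bytes:
    """Rewrite every raw object name in a tree object.

    mapping: raw source name -> raw destination name (a stand-in for
             the translation table described later in this document).
    src_len: raw name length in the source format (20 for SHA-1,
             32 for SHA-256).
    """
    out = bytearray()
    pos = 0
    while pos < len(content):
        nul = content.index(b"\x00", pos)
        entry_header = content[pos:nul + 1]          # b"<mode> <path>\0"
        raw_name = content[nul + 1:nul + 1 + src_len]
        if len(raw_name) != src_len:
            raise ValueError("truncated tree entry")
        # Modes and paths are copied byte for byte, so any oddities in
        # the original entry survive the conversion.
        out += entry_header + mapping[raw_name]
        pos = nul + 1 + src_len
    return bytes(out)
```

Running the function a second time with the inverse mapping and the
other name length gives back the original bytes, which is the
round-trip property the format is meant to preserve.
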
Object storage
~~~~~~~~~~~~~~
Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today. The content that is compressed and stored uses SHA-256 content
instead of SHA-1 content.

Pack index
~~~~~~~~~~
Pack index (.idx) files use a new v3 format that supports multiple
hash functions. They have the following format (all integers are in
network byte order):

- A header appears at the beginning and consists of the following:
  * The 4-byte pack index signature: '\377t0c'
  * 4-byte version number: 3
  * 4-byte length of the header section, including the signature and
    version number
  * 4-byte number of objects contained in the pack
  * 4-byte number of object formats in this pack index: 2
  * For each object format:
    ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
    ** 4-byte length in bytes of shortened object names. This is the
       shortest possible length needed to make names in the shortened
       object name table unambiguous.
    ** 4-byte integer, recording where tables relating to this format
       are stored in this index file, as an offset from the beginning.
  * 4-byte offset to the trailer from the beginning of this file.
  * Zero or more additional key/value pairs (4-byte key, 4-byte
    value). Only one key is supported: 'PSRC'. See the "Loose objects
    and unreachable objects" section for supported values and how this
    is used. All other keys are reserved. Readers must ignore
    unrecognized keys.
- Zero or more NUL bytes. This can optionally be used to improve the
  alignment of the full object name table below.
- Tables for the first object format:
  * A sorted table of shortened object names. These are prefixes of
    the names of all objects in this pack file, packed together
    without offset values to reduce the cache footprint of the binary
    search for a specific object name.

  * A table of full object names in pack order. This allows resolving
    a reference to "the nth object in the pack file" (from a
    reachability bitmap or from the next table of another object
    format) to its object name.

  * A table of 4-byte values mapping object name order to pack order.
    For an object in the table of sorted shortened object names, the
    value at the corresponding index in this table is the index in the
    previous table for that same object.
    This can be used to look up the object in reachability bitmaps or
    to look up its name in another object format.

  * A table of 4-byte CRC32 values of the packed object data, in the
    order that the objects appear in the pack file. This is to allow
    compressed data to be copied directly from pack to pack during
    repacking without undetected data corruption.

  * A table of 4-byte offset values. For an object in the table of
    sorted shortened object names, the value at the corresponding
    index in this table indicates where that object can be found in
    the pack file. These are usually 31-bit pack file offsets, but
    large offsets are encoded as an index into the next table with the
    most significant bit set.

  * A table of 8-byte offset entries (empty for pack files less than
    2 GiB). Pack files are organized with heavily used objects toward
    the front, so most object references should not need to refer to
    this table.
- Zero or more NUL bytes.
- Tables for the second object format, with the same layout as above,
  up to and not including the table of CRC32 values.
- Zero or more NUL bytes.
- The trailer consists of the following:
  * A copy of the 32-byte SHA-256 checksum at the end of the
    corresponding packfile.

  * 32-byte SHA-256 checksum of all of the above.

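To make the header layout concrete, here is a minimal Python sketch
that serializes and parses only the header fields listed above. It is
not a reader for real index files. The helper names are hypothetical,
and the identifier b"s256" for the second format is an assumption,
since this document only spells out 'sha1'.

```python
import struct

SIGNATURE = b"\xfft0c"   # '\377t0c'
VERSION = 3


def pack_header(num_objects, formats, trailer_offset, extras=None):
    """formats: list of (4-byte id, shortened-name length, table offset).
    extras:  optional {4-byte key: integer value}, e.g. {b"PSRC": 4}.
    """
    extras = extras or {}
    # signature, version, header length, object count, format count,
    # one 12-byte record per format, trailer offset, key/value pairs
    header_len = 20 + 12 * len(formats) + 4 + 8 * len(extras)
    out = SIGNATURE + struct.pack(
        ">IIII", VERSION, header_len, num_objects, len(formats))
    for fmt_id, short_len, table_offset in formats:
        out += fmt_id + struct.pack(">II", short_len, table_offset)
    out += struct.pack(">I", trailer_offset)
    for key, value in extras.items():
        out += key + struct.pack(">I", value)
    return out


def parse_header(data):
    if data[:4] != SIGNATURE:
        raise ValueError("not a pack index")
    version, header_len, num_objects, num_formats = struct.unpack_from(
        ">IIII", data, 4)
    if version != VERSION:
        raise ValueError("unsupported index version")
    pos, formats = 20, []
    for _ in range(num_formats):
        short_len, table_offset = struct.unpack_from(">II", data, pos + 4)
        formats.append((data[pos:pos + 4], short_len, table_offset))
        pos += 12
    (trailer_offset,) = struct.unpack_from(">I", data, pos)
    pos += 4
    extras = {}
    while pos + 8 <= header_len:   # the caller decides which keys it knows
        (extras[data[pos:pos + 4]],) = struct.unpack_from(">I", data, pos + 4)
        pos += 8
    return {"version": version, "header_len": header_len,
            "num_objects": num_objects, "formats": formats,
            "trailer_offset": trailer_offset, "extras": extras}
```

Because the header records its own length, a reader can step over
key/value pairs it does not recognize and still find the tables.
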
Loose object index
~~~~~~~~~~~~~~~~~~
A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
all loose objects. Its format is

	# loose-object-idx
	(sha256-name SP sha1-name LF)*

where the object names are in hexadecimal format. The file is not
sorted.

The loose object index is protected against concurrent writes by a
lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
object:

1. Write the loose object to a temporary file, like today.
2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
3. Rename the loose object into place.
4. Open loose-object-idx with O_APPEND and write the new object.
5. Unlink loose-object-idx.lock to release the lock.

To remove entries (e.g. in "git pack-refs" or "git prune"):

1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
   lock.
2. Write the new content to loose-object-idx.lock.
3. Unlink any loose objects being removed.
4. Rename to replace loose-object-idx, releasing the lock.

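The "add a new loose object" procedure can be sketched as follows.
This is a hedged illustration, not Git code: the function name and
its parameters are hypothetical, and writing the "# loose-object-idx"
header when the file is first created is an assumption about how that
first line gets there.

```python
import os


def add_loose_object(objdir, tmp_path, final_path, sha256_hex, sha1_hex):
    """Steps 2-5 above; step 1 (writing tmp_path) is left to the caller."""
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"

    # Step 2: O_CREAT | O_EXCL fails with FileExistsError if another
    # writer currently holds the lock.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    try:
        # Step 3: rename the loose object into place.
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        os.rename(tmp_path, final_path)

        # Step 4: append one "sha256-name SP sha1-name LF" line.
        is_new = not os.path.exists(idx) or os.path.getsize(idx) == 0
        with open(idx, "ab") as f:
            if is_new:
                f.write(b"# loose-object-idx\n")
            f.write(f"{sha256_hex} {sha1_hex}\n".encode("ascii"))
    finally:
        # Step 5: release the lock.
        os.unlink(lock)
```

The removal procedure differs in that the replacement content is
written into the lock file itself, so the final rename both publishes
the new index and releases the lock.
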
Translation table
~~~~~~~~~~~~~~~~~
The index files support a bidirectional mapping between SHA-1 names
and SHA-256 names. The lookup proceeds similarly to ordinary object
lookups. For example, to convert a SHA-1 name to a SHA-256 name:

 1. Look for the object in idx files. If a match is present in the
    idx's sorted list of truncated SHA-1 names, then:
    a. Read the corresponding entry in the SHA-1 name order to pack
       name order mapping.
    b. Read the corresponding entry in the full SHA-1 name table to
       verify we found the right object. If it is, then
    c. Read the corresponding entry in the full SHA-256 name table.
       That is the object's SHA-256 name.
 2. Check for a loose object. Read lines from loose-object-idx until
    we find a match.

Step (1) takes the same amount of time as an ordinary object lookup:
O(number of packs * log(objects per pack)). Step (2) takes O(number of
loose objects) time. To maintain good performance it will be necessary
to keep the number of loose objects low. See the "Loose objects and
unreachable objects" section below for more details.

Since all operations that make new objects (e.g., "git commit") add
the new objects to the corresponding index, this mapping is possible
for all objects in the object store.

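The two lookup steps can be modelled in memory as below. The class
`PackIndex` and the helper functions are hypothetical stand-ins for
the on-disk tables, using hexadecimal strings instead of raw names,
and assume the shortened length is long enough to be unambiguous, as
the pack index format requires.

```python
from bisect import bisect_left


class PackIndex:
    """In-memory model of the per-format tables of one pack index."""

    def __init__(self, pairs, short_len):
        # pairs: (sha1_hex, sha256_hex) for each object, in pack order
        self.full_sha1 = [p[0] for p in pairs]       # full names, pack order
        self.full_sha256 = [p[1] for p in pairs]
        order = sorted(range(len(pairs)), key=lambda i: self.full_sha1[i])
        self.short_sha1 = [self.full_sha1[i][:short_len] for i in order]
        self.name_to_pack = order                    # name order -> pack order
        self.short_len = short_len

    def sha1_to_sha256(self, sha1):
        prefix = sha1[:self.short_len]
        i = bisect_left(self.short_sha1, prefix)     # binary search
        if i == len(self.short_sha1) or self.short_sha1[i] != prefix:
            return None
        pack_pos = self.name_to_pack[i]              # step 1a
        if self.full_sha1[pack_pos] != sha1:         # step 1b
            return None
        return self.full_sha256[pack_pos]            # step 1c


def lookup_loose(idx_lines, sha1):
    """Step 2: linear scan of loose-object-idx lines."""
    for line in idx_lines:
        if line.startswith("#"):
            continue
        sha256_name, sha1_name = line.split()
        if sha1_name == sha1:
            return sha256_name
    return None


def sha1_to_sha256(packs, idx_lines, sha1):
    for pack in packs:
        found = pack.sha1_to_sha256(sha1)
        if found is not None:
            return found
    return lookup_loose(idx_lines, sha1)
```

The packed path costs one binary search per pack, while the loose
path reads the whole file in the worst case, which matches the
complexity estimates above.
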
Reading an object's SHA-1 content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The SHA-1 content of an object can be read by converting all SHA-256 names
of its SHA-256 content references to SHA-1 names using the translation table.

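For a commit, the conversion amounts to rewriting the hexadecimal
names in its "tree" and "parent" header lines. The sketch below is
deliberately partial: the function name and the `translate` callback
are hypothetical, and it leaves mergetag and signature fields
untouched even though a full conversion has to handle them (see the
sections on signatures and mergetag embedding).

```python
def convert_commit(content: bytes, translate) -> bytes:
    """Rewrite the object names in a commit's "tree" and "parent" lines.

    translate: callable mapping a hex name in the source format to the
               hex name in the destination format (a stand-in for the
               translation table).
    """
    header, sep, body = content.partition(b"\n\n")
    out = []
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        # Continuation lines start with a space, so their key is empty
        # and they pass through unchanged.
        if key in (b"tree", b"parent"):
            new_name = translate(value.decode("ascii"))
            line = key + b" " + new_name.encode("ascii")
        out.append(line)
    return b"\n".join(out) + sep + body
```

The author, committer, and message bytes are copied verbatim, so
converting back with the inverse mapping reproduces the original
content.
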
Fetch
~~~~~
Fetching from a SHA-1 based server requires translating between SHA-1
and SHA-256 based representations on the fly.

SHA-1s named in the ref advertisement that are present on the client
can be translated to SHA-256 and looked up as local objects using the
translation table.

Negotiation proceeds as today. Any "have"s generated locally are
converted to SHA-1 before being sent to the server, and SHA-1s
mentioned by the server are converted to SHA-256 when looking them up
locally.

After negotiation, the server sends a packfile containing the
requested objects. We convert the packfile to SHA-256 format using
the following steps:

1. index-pack: inflate each object in the packfile and compute its
   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
   objects the client has locally. These objects can be looked up
   using the translation table and their SHA-1 content read as
   described above to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
   phase, walk through objects in the pack and emit a list of them,
   excluding blobs, in reverse topologically sorted order, with each
   object coming later in the list than all objects it references.
   (This list only contains objects reachable from the "want"s. If the
   pack from the server contained additional extraneous objects, then
   they will be discarded.)
3. convert to SHA-256: open a new SHA-256 packfile. Read the topologically
   sorted list just generated. For each object, inflate its
   SHA-1 content, convert to SHA-256 content, and write it to the SHA-256
   pack. Record the new SHA-1<-->SHA-256 mapping entry for use in the idx.
4. sort: reorder entries in the new pack to match the order of objects
   in the pack the server generated and include blobs. Write a SHA-256 idx
   file.
5. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server in steps 1
   and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.

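The ordering requirement of step 2 can be met with a post-order walk
starting at the "want"s. The following Python is a sketch under
stated assumptions: `refs` is a hypothetical dictionary holding, for
every non-blob object in the received pack, the non-blob objects it
references, and names missing from `refs` are treated as already
present locally.

```python
def topo_order(wants, refs):
    """Return objects so that each one follows everything it references."""
    order, done = [], set()
    stack = [(name, False) for name in reversed(wants)]
    while stack:
        name, expanded = stack.pop()
        if name in done or name not in refs:
            continue          # already emitted, or not part of this pack
        if expanded:
            # Every object it references has been emitted by now.
            done.add(name)
            order.append(name)
            continue
        stack.append((name, True))
        for child in refs[name]:
            if child not in done:
                stack.append((child, False))
    return order
```

Objects in the pack that are not reachable from the "want"s never
enter the stack, so they are left out of the list, as described in
step 2. The explicit stack avoids recursion limits on long histories.
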
As an optimization, step 1 could write a file describing what non-blob
objects each object it has inflated from the packfile references. This
makes the topological sort in step 2 possible without inflating the
objects in the packfile for a second time. The objects need to be
inflated again in step 3, for a total of two inflations.

Step 4 is probably necessary for good read-time performance. "git
pack-objects" on the server optimizes the pack file for good data
locality (see Documentation/technical/pack-heuristics.txt).

Details of this process are likely to change. It will take some
experimenting to get this to perform well.

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The SHA-1 content
of each object being pushed can be read as described in the "Reading
an object's SHA-1 content" section to generate the pack written by git
send-pack.

Signed Commits
~~~~~~~~~~~~~~
We add a new field "gpgsig-sha256" to the commit object format to allow
signing commits without relying on SHA-1. It is similar to the
existing "gpgsig" field. Its signed payload is the SHA-256 content of the
commit object with any "gpgsig" and "gpgsig-sha256" fields removed.

This means commits can be signed

1. using SHA-1 only, as in existing signed commit objects
2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
   fields.
3. using only SHA-256, by only using the gpgsig-sha256 field.

Old versions of "git verify-commit" can verify the gpgsig signature in
cases (1) and (2) without modifications and view case (3) as an
ordinary unsigned commit.

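Computing the signed payload means dropping both signature fields,
including their continuation lines, from the commit header. The
sketch below relies on the existing commit header convention that a
multi-line field value continues on lines beginning with a space; the
function name `signed_payload` is hypothetical.

```python
SIGNATURE_FIELDS = (b"gpgsig", b"gpgsig-sha256")


def signed_payload(commit: bytes) -> bytes:
    """Return the commit content with all signature fields removed."""
    header, sep, body = commit.partition(b"\n\n")
    kept, skipping = [], False
    for line in header.split(b"\n"):
        if line.startswith(b" "):
            if skipping:
                continue      # continuation of a removed field
        else:
            skipping = line.split(b" ", 1)[0] in SIGNATURE_FIELDS
            if skipping:
                continue
        kept.append(line)
    return b"\n".join(kept) + sep + body
```

Because both fields are stripped before signing, adding or removing
one signature does not invalidate the other.
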
Signed Tags
~~~~~~~~~~~
We add a new field "gpgsig-sha256" to the tag object format to allow
signing tags without relying on SHA-1. Its signed payload is the
SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
SIGNATURE-----" delimited in-body signature removed.

This means tags can be signed

1. using SHA-1 only, as in existing signed tag objects
2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
   signature.
3. using only SHA-256, by only using the gpgsig-sha256 field.

Mergetag embedding
~~~~~~~~~~~~~~~~~~
The mergetag field in the SHA-1 content of a commit contains the
SHA-1 content of a tag that was merged by that commit.

The mergetag field in the SHA-256 content of the same commit contains the
SHA-256 content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The translation table of the submodule
can be used to look up the new hash.

Loose objects and unreachable objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fast lookups in the loose-object-idx require that the number of loose
objects not grow too high.

"git gc --auto" currently waits for there to be 6700 loose objects
present before consolidating them into a packfile. We will need to
measure to find a more appropriate threshold for it to use.

"git gc --auto" currently waits for there to be 50 packs present
before combining packfiles. Packing loose objects more aggressively
may cause the number of pack files to grow too quickly. This can be
mitigated by using a strategy similar to Martin Fick's exponential
rolling garbage collection script:
https://gerrit-review.googlesource.com/c/gerrit/+/35215

"git gc" currently expels any unreachable objects it encounters in
pack files to loose objects in an attempt to prevent a race when
pruning them (in case another process is simultaneously writing a new
object that refers to the about-to-be-deleted object). This leads to
an explosion in the number of loose objects present and disk space
usage due to the objects in delta form being replaced with independent
loose objects. Worse, the race is still present for loose objects.

Instead, "git gc" will need to move unreachable objects to a new
packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
below). To avoid the race when writing new objects referring to an
about-to-be-deleted object, code paths that write new objects will
need to copy any objects they refer to from UNREACHABLE_GARBAGE packs
into new, non-UNREACHABLE_GARBAGE packs (or loose objects).
UNREACHABLE_GARBAGE packs are then safe to delete if their creation
time (as indicated by the file's mtime) is long enough ago.

To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
combined under certain circumstances. If "gc.garbageTtl" is set to
greater than one day, then packs created within a single calendar day,
UTC, can be coalesced together. The resulting packfile would have an
mtime before midnight on that day, so this makes the effective maximum
ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
then we divide the calendar day into intervals one-third of that ttl
in duration. Packs created within the same interval can be coalesced
together. The resulting packfile would have an mtime before the end of
the interval, so this makes the effective maximum ttl equal to the
garbageTtl * 4/3.

This rule comes from Thirumala Reddy Mutchukota's JGit change
https://git.eclipse.org/r/90465.

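The coalescing rule can be expressed as a small calculation on
timestamps. The following sketch is not taken from JGit or Git: the
function names are hypothetical, times are seconds since the Unix
epoch (so UTC days start at multiples of 86400), and a ttl of exactly
one day, which the text above does not cover, is treated here like
the shorter case.

```python
DAY = 24 * 60 * 60


def coalesce_interval(garbage_ttl: int) -> int:
    """Length in seconds of the window whose packs may be combined."""
    if garbage_ttl > DAY:
        return DAY
    return max(1, garbage_ttl // 3)


def bucket(created: int, garbage_ttl: int):
    """Return (start, end) of the window containing a pack's creation time."""
    interval = coalesce_interval(garbage_ttl)
    day_start = created - created % DAY
    index = (created - day_start) // interval
    start = day_start + index * interval
    # Windows never cross midnight UTC, so the last one may be shorter.
    return start, min(start + interval, day_start + DAY)


def can_coalesce(created_a: int, created_b: int, garbage_ttl: int) -> bool:
    return bucket(created_a, garbage_ttl) == bucket(created_b, garbage_ttl)


def effective_max_ttl(garbage_ttl: int) -> int:
    """Worst case: the oldest pack's lifetime restarts at the window's end."""
    return garbage_ttl + coalesce_interval(garbage_ttl)
```

With a ttl of three hours the windows are one hour long, so packs
created at 10:05 and 10:55 UTC may be combined while packs created at
10:55 and 11:05 may not, and no object outlives four hours.
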
The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
index. More generally, that field indicates where a pack came from:

 - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
 - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight
   "gc --auto" operation
 - 3 (PACK_SOURCE_GC) for a pack created by a full gc
 - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage
   discovered by gc
 - 5 (PACK_SOURCE_INSERT) for locally created objects that were
   written directly to a pack file, e.g. from "git add ."

This information can be useful for debugging and for "gc --auto" to
make appropriate choices about which packs to coalesce.

Caveats
-------
Invalid objects
~~~~~~~~~~~~~~~
The conversion from SHA-1 content to SHA-256 content retains any
brokenness in the original object (e.g., tree entry modes encoded with
leading 0, tree objects whose paths are not sorted correctly, and
commit objects without an author or committer). This is a deliberate
feature of the design to allow the conversion to round-trip.

More profoundly broken objects (e.g., a commit with a truncated "tree"
header line) cannot be converted but were not usable by current Git
anyway.

Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because it requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules. Protocol improvements might
allow lifting this restriction.

Alternates
~~~~~~~~~~
For the same reason, a SHA-256 repository cannot borrow objects from a
SHA-1 repository using objects/info/alternates or
$GIT_ALTERNATE_OBJECT_REPOSITORIES.

git notes
~~~~~~~~~
The "git notes" tool annotates objects using their SHA-1 name as key.
This design does not describe a way to migrate notes trees to use
SHA-256 names. That migration is expected to happen separately (for
example using a file at the root of the notes tree to describe which
hash it uses).

Server-side cost
~~~~~~~~~~~~~~~~
Until Git protocol gains SHA-256 support, using SHA-256 based storage
on public-facing Git servers is strongly discouraged. Once Git
protocol gains SHA-256 support, SHA-256 based servers are likely not
to support SHA-1 compatibility, to avoid what may be a very expensive
hash re-encode during clone and to encourage peers to modernize.

The design described here allows fetches by SHA-1 clients of a
personal SHA-256 repository because it's not much more difficult than
allowing pushes from that repository. This support needs to be guarded
by a configuration option --- servers like git.kernel.org that serve a
large number of clients would not be expected to bear that cost.

Meaning of signatures
~~~~~~~~~~~~~~~~~~~~~
The signed payload for signed commits and tags does not explicitly
name the hash used to identify objects. If some day Git adopts a new
hash function with the same length as the current SHA-1 (40
hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the
intent behind the PGP signed payload in an object signature is
unclear:

	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
	type commit
	tag v2.12.0
	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800

	Git 2.12

Does this mean Git v2.12.0 is the commit with SHA-1 name
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?

Fortunately SHA-256 and SHA-1 have different lengths. If Git starts
using another hash with the same length to name objects, then it will
need to change the format of signed payloads using that hash to
address this issue.

Object names on the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To support the transition (see Transition plan below), this design
supports four different modes of operation:

 1. ("dark launch") Treat object names input by the user as SHA-1 and
    convert any object names written to output to SHA-1, but store
    objects using SHA-256. This allows users to test the code with no
    visible behavior change except for performance. It also allows
    running even tests that assume the SHA-1 hash function, to
    sanity-check the behavior of the new mode.

 2. ("early transition") Allow both SHA-1 and SHA-256 object names in
    input. Any object names written to output use SHA-1. This allows
    users to continue to make use of SHA-1 to communicate with peers
    (e.g. by email) that have not migrated yet and prepares for mode 3.

 3. ("late transition") Allow both SHA-1 and SHA-256 object names in
    input. Any object names written to output use SHA-256. In this
    mode, users are using a more secure object naming method by
    default. The disruption is minimal as long as most of their peers
    are in mode 2 or mode 3.

 4. ("post-transition") Treat object names input by the user as
    SHA-256 and write output using SHA-256. This is safer than mode 3
    because there is less risk that input is incorrectly interpreted
    using the wrong hash function.

The mode is specified in configuration.

The user can also explicitly specify which format to use for a
particular revision specifier and for output, overriding the mode. For
example:

	git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

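The four modes differ only in which name formats they accept on input
and which format they print, so they can be summarized as a small
table. This Python sketch is illustrative: the `Mode` class, the
`MODES` dictionary, and telling formats apart by the length of a
full, unabbreviated name are assumptions, not configuration names or
behavior defined by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    accepts: frozenset        # name formats allowed in input
    output: str               # name format written to output
    storage: str = "sha256"   # every mode stores objects using SHA-256


MODES = {
    "dark launch": Mode(frozenset({"sha1"}), "sha1"),
    "early transition": Mode(frozenset({"sha1", "sha256"}), "sha1"),
    "late transition": Mode(frozenset({"sha1", "sha256"}), "sha256"),
    "post-transition": Mode(frozenset({"sha256"}), "sha256"),
}


def format_of(full_name: str):
    """Classify a full hexadecimal object name by its length."""
    return {40: "sha1", 64: "sha256"}.get(len(full_name))


def accepts(mode_name: str, full_name: str) -> bool:
    return format_of(full_name) in MODES[mode_name].accepts
```

Abbreviated names cannot be classified by length, which is one reason
the explicit ^{sha1} and ^{sha256} suffixes shown above are useful.
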
752414ae
JN
629Transition plan
630---------------
631Some initial steps can be implemented independently of one another:
de82095a 632
752414ae 633- adding a hash function API (vtable)
0ed8d8da 634- teaching fsck to tolerate the gpgsig-sha256 field
752414ae
JN
635- excluding gpgsig-* from the fields copied by "git commit --amend"
636- annotating tests that depend on SHA-1 values with a SHA1 test
637 prerequisite
638- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ
639 consistently instead of "unsigned char *" and the hardcoded
640 constants 20 and 40.
641- introducing index v3
642- adding support for the PSRC field and safer object pruning
643
The first user-visible change is the introduction of the objectFormat
extension (without compatObjectFormat). This requires:

- teaching fsck about this mode of operation
- using the hash function API (vtable) when computing object names
- signing objects and verifying signatures
- rejecting attempts to fetch from or push to an incompatible
  repository

Next comes introduction of compatObjectFormat:

- implementing the loose-object-idx
- translating object names between object formats
- translating object content between object formats
- generating and verifying signatures in the compat format
- adding appropriate index entries when adding a new object to the
  object store
- --output-format option
- ^{sha1} and ^{sha256} revision notation
- configuration to specify default input and output format (see
  "Object names on the command line" above)

The next step is supporting fetches and pushes to SHA-1 repositories:

- allow pushes to a repository using the compat format
- generate a topologically sorted list of the SHA-1 names of fetched
  objects
- convert the fetched packfile to SHA-256 format and generate an idx
  file
- re-sort to match the order of objects in the fetched packfile

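The sorting steps in the list above can be sketched as follows. This is
a minimal Python illustration, not Git's implementation: topo_sort,
pack_order and the sample object names are hypothetical. It assumes the
object graph is acyclic and leaves out blobs, since they reference no
other objects. Each object is emitted after everything it references,
so an object's references can be translated before the object itself
is converted.

```python
def topo_sort(refs):
    """Order names so each object follows the objects it references.

    refs maps an object name to the list of names it references.
    """
    order, seen = [], set()
    for root in refs:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(refs[root]))]
        while stack:
            name, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(refs.get(child, ()))))
                    break
            else:
                # All references handled: the object can be converted now.
                stack.pop()
                order.append(name)
    return order


def pack_order(names, pack_position):
    """Re-sort names to match their position in the fetched packfile."""
    return sorted(names, key=lambda n: pack_position[n])


refs = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": [],
    "tree1": [],
}
order = topo_sort(refs)
print(order)
print(pack_order(order, {"commit2": 0, "tree2": 1, "commit1": 2, "tree1": 3}))
```
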
The infrastructure supporting fetch also allows converting an existing
repository. In converted repositories and new clones, end users can
gain support for the new hash function without any visible change in
behavior (see "dark launch" in the "Object names on the command line"
section). In particular this allows users to verify SHA-256 signatures
on objects in the repository, and it should ensure the transition code
is stable in production in preparation for using it more widely.

Over time projects would encourage their users to adopt the "early
transition" and then "late transition" modes to take advantage of the
new, more futureproof SHA-256 object names.

When objectFormat and compatObjectFormat are both set, commands
generating signatures would generate both SHA-1 and SHA-256 signatures
by default to support both new and old users.

In projects using SHA-256 heavily, users could be encouraged to adopt
the "post-transition" mode to avoid accidentally making implicit use
of SHA-1 object names.

Once a critical mass of users have upgraded to a version of Git that
can verify SHA-256 signatures and have converted their existing
repositories to support verifying them, we can add support for a
setting to generate only SHA-256 signatures. This is expected to be at
least a year later.

That is also a good moment to advertise the ability to convert
repositories to use SHA-256 only, stripping out all SHA-1 related
metadata. This improves performance by eliminating translation
overhead and security by avoiding the possibility of accidentally
relying on the safety of SHA-1.

Updating Git's protocols to allow a server to specify which hash
functions it supports is also an important part of this transition. It
is not discussed in detail in this document but this transition plan
assumes it happens. :)

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc.) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.

* You cannot trust its history (needed for bisectability) in the
  future without further work
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1.

Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha256 field in commit and tag objects
for SHA-256 content based signatures, an earlier version of this design
added "hash sha256 <SHA-256 name>" fields to strengthen the existing
SHA-1 content based signatures.

In other words, a single signature was used to attest to the object
content using both hash functions. This had some advantages:

* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
  attest to the SHA-1 name and SHA-256 name referring to the same object.
* All users consume the same signature. Broken signatures are likely
  to be detected quickly using current versions of git.

However, it also came with disadvantages:

* Verifying a signed object requires access to the SHA-1 names of all
  objects it references, even after the transition is complete and
  the translation table is no longer needed for anything else. To support
  this, the design added fields such as "hash sha1 tree <SHA-1 name>"
  and "hash sha1 parent <SHA-1 name>" to the SHA-256 content of a signed
  commit, complicating the conversion process.
* Allowing signed objects without a SHA-1 (for after the transition is
  complete) complicated the design further, requiring a "nohash sha1"
  field to suppress including "hash sha1" fields in the SHA-256 content
  and signed payload.

Lazily populated translation table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of the work of building the translation table could be deferred to
push time, but that would significantly complicate and slow down pushes.
Calculating the SHA-1 name at object creation time, while the object is
being streamed to disk and its SHA-256 name is being calculated, should
be an acceptable cost.

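To make the cost argument concrete, here is a rough Python sketch of
computing both names in a single pass while the content is written
out. It is an illustration under simplifying assumptions, not Git's
code: the function name is hypothetical, no compression is applied, and
it applies as written only to blobs, whose content embeds no other
object names and is therefore identical under both hash functions.

```python
import hashlib
import io


def write_blob_with_both_names(size, chunks, out):
    """Stream chunks to out, returning (SHA-1 name, SHA-256 name)."""
    header = b"blob %d\0" % size
    sha1 = hashlib.sha1(header)
    sha256 = hashlib.sha256(header)
    for chunk in chunks:
        # Each chunk is read once and fed to both hashers and the output.
        sha1.update(chunk)
        sha256.update(chunk)
        out.write(chunk)
    return sha1.hexdigest(), sha256.hexdigest()


buf = io.BytesIO()
names = write_blob_with_both_names(12, [b"hello ", b"world\n"], buf)
print(names)
```

The extra work per object is one additional hash update per chunk, with
no second read of the data.
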
Document History
----------------

2017-03-03
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
sbeller@google.com

* Initial version sent to https://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com

2017-03-03 jrnieder@gmail.com
Incorporated suggestions from jonathantanmy and sbeller:

* Describe purpose of signed objects with each hash type
* Redefine signed object verification using object content under the
  first hash function

2017-03-06 jrnieder@gmail.com

* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
* Make SHA3-based signatures a separate field, avoiding the need for
  "hash" and "nohash" fields (thanks to peff[3]).
* Add a sorting phase to fetch (thanks to Junio for noticing the need
  for this).
* Omit blobs from the topological sort during fetch (thanks to peff).
* Discuss alternates, git notes, and git servers in the caveats
  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
  Pearce).
* Clarify language throughout (thanks to various commenters,
  especially Junio).

2017-09-27 jrnieder@gmail.com, sbeller@google.com

* Use placeholder NewHash instead of SHA3-256
* Describe criteria for picking a hash function.
* Include a transition plan (thanks especially to Brandon Williams
  for fleshing these ideas out)
* Define the translation table (thanks, Shawn Pearce[5], Jonathan
  Tan, and Masaya Suzuki)
* Avoid loose object overhead by packing more aggressively in
  "git gc --auto"

Later history:

* See the history of this file in git.git for the history of subsequent
  edits. This document history is no longer being maintained as it
  would now be superfluous to the commit log

References:

 [1] https://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
 [2] https://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
 [3] https://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
 [4] https://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
 [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/