////////////////////////////////////////////////////////////////

        GIT - the stupid content tracker

////////////////////////////////////////////////////////////////
"git" can mean anything, depending on your mood.

 - random three-letter combination that is pronounceable, and not
   actually used by any common UNIX command. The fact that it is a
   mispronunciation of "get" may or may not be relevant.
 - stupid. contemptible and despicable. simple. Take your pick from the
   dictionary of slang.
 - "global information tracker": you're in a good mood, and it actually
   works for you. Angels sing, and a light suddenly fills the room.
 - "goddamn idiotic truckload of sh*t": when it breaks

This is a stupid (but extremely fast) directory content manager. It
doesn't do a whole lot, but what it _does_ do is track directory
contents efficiently.

There are two object abstractions: the "object database", and the
"current directory cache" aka "index".

The Object Database
~~~~~~~~~~~~~~~~~~~
The object database is literally just a content-addressable collection
of objects. All objects are named by their content, which is
approximated by the SHA1 hash of the object itself. Objects may refer
to other objects (by referencing their SHA1 hash), and so you can
build up a hierarchy of objects.

All objects have a statically determined "type" aka "tag", which is
determined at object creation time, and which identifies the format of
the object (i.e. how it is used, and how it can refer to other
objects). There are currently five different object types: "blob",
"tree", "commit", "tag" and "delta".

A "blob" object cannot refer to any other object, and is, like the tag
implies, a pure storage object containing some user data. It is used to
actually store the file data, i.e. a blob object is associated with some
particular version of some file.

A "tree" object is an object that ties one or more "blob" objects into a
directory structure. In addition, a tree object can refer to other tree
objects, thus creating a directory hierarchy.

A "commit" object ties such directory hierarchies together into
a DAG of revisions - each "commit" is associated with exactly one tree
(the directory hierarchy at the time of the commit). In addition, a
"commit" refers to one or more "parent" commit objects that describe the
history of how we arrived at that directory hierarchy.

As a special case, a commit object with no parents is called the "root"
object, and is the point of an initial project commit. Each project
must have at least one root, and while you can tie several different
root objects together into one project by creating a commit object which
has two or more separate roots as its ultimate parents, that's probably
just going to confuse people. So aim for the notion of "one root object
per project", even if git itself does not enforce that.

A "tag" object symbolically identifies and can be used to sign other
objects. It contains the identifier and type of another object, a
symbolic name (of course!) and, optionally, a signature.

A "delta" object is used internally by the object database to minimise
disk usage. Instead of storing the entire contents of a revision, git
can behave in a similar manner to RCS et al and simply store a delta.

Regardless of object type, all objects share the following
characteristics: they are all deflated with zlib, and have a header
that not only specifies their tag, but also provides size information
about the data in the object. It's worth noting that the SHA1 hash
that is used to name the object is the hash of the original data or
the delta. (Historical note: in the dawn of the age of git the hash
was the sha1 of the _compressed_ object.)

As a result, the general consistency of an object can always be tested
independently of the contents or the type of the object: all objects can
be validated by verifying that (a) their hashes match the content of the
file and (b) the object successfully inflates to a stream of bytes that
forms a sequence of <ascii tag without space> + <space> + <ascii decimal
size> + <byte\0> + <binary object data>.
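
The two checks (a) and (b) can be sketched in a few lines of Python.
This is only an illustration, not git's own code: the helper names
`encode_object`, `object_name` and `validate` are made up here, and the
sketch assumes the hash is taken over the inflated header plus data,
which is how current git names objects.

```python
import hashlib
import zlib


def encode_object(tag: str, data: bytes) -> bytes:
    """Build the inflated form: <tag> <space> <decimal size> <NUL> <data>."""
    return tag.encode("ascii") + b" " + str(len(data)).encode("ascii") + b"\0" + data


def object_name(tag: str, data: bytes) -> str:
    """Name an object by the SHA1 of its inflated header plus data (assumption)."""
    return hashlib.sha1(encode_object(tag, data)).hexdigest()


def validate(name: str, deflated: bytes) -> tuple[str, bytes]:
    """Check (a) the hash and (b) the header layout; return (tag, data)."""
    raw = zlib.decompress(deflated)
    if hashlib.sha1(raw).hexdigest() != name:
        raise ValueError("hash does not match content")
    header, sep, data = raw.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL byte after header")
    tag, space, size = header.partition(b" ")
    if not space or not tag or not size.isdigit() or int(size) != len(data):
        raise ValueError("malformed header")
    return tag.decode("ascii"), data


# Store a blob the way the object database would (deflated), then verify it.
stored = zlib.compress(encode_object("blob", b"hello\n"))
name = object_name("blob", b"hello\n")
print(name, validate(name, stored))
```

Note that nothing in `validate` needs to know what a blob, tree or
commit means: the check is the same for every object type.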

The structured objects can further have their structure and
connectivity to other objects verified. This is generally done with
the "git-fsck-cache" program, which generates a full dependency graph
of all objects, and verifies their internal consistency (in addition
to just verifying their superficial consistency through the hash).

The object types in some more detail:

Blob Object
~~~~~~~~~~~
A "blob" object is nothing but a binary blob of data, and doesn't
refer to anything else. There is no signature or any other
verification of the data, so while the object is consistent (it _is_
indexed by its sha1 hash, so the data itself is certainly correct), it
has absolutely no other attributes. No name associations, no
permissions. It is purely a blob of data (i.e. normally "file
contents").

In particular, since the blob is entirely defined by its data, if two
files in a directory tree (or in multiple different versions of the
repository) have the same contents, they will share the same blob
object. The object is totally independent of its location in the
directory tree, and renaming a file does not change the object that
file is associated with in any way.

A blob is created with link:git-write-blob.html[git-write-blob] and
its data can be accessed by link:git-cat-file.html[git-cat-file].

Tree Object
~~~~~~~~~~~
The next hierarchical object type is the "tree" object. A tree object
is a list of mode/name/blob data, sorted by name. Alternatively, the
mode data may specify a directory mode, in which case instead of
naming a blob, that name is associated with another TREE object.

Like the "blob" object, a tree object is uniquely determined by the
set contents, and so two separate but identical trees will always
share the exact same object. This is true at all levels, i.e. it's
true for a "leaf" tree (which does not refer to any other trees, only
blobs) as well as for a whole subdirectory.

For that reason a "tree" object is just a pure data abstraction: it
has no history, no signatures, no verification of validity, except
that since the contents are again protected by the hash itself, we can
trust that the tree is immutable and its contents never change.

So you can trust the contents of a tree to be valid, the same way you
can trust the contents of a blob, but you don't know where those
contents _came_ from.

Side note on trees: since a "tree" object is a sorted list of
"filename+content", you can create a diff between two trees without
actually having to unpack two trees. Just ignore all common parts,
and your diff will look right. In other words, you can effectively
(and efficiently) tell the difference between any two random trees in
O(n) where "n" is the size of the difference, rather than the size of
the tree.
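
As a rough sketch of that side note, here is a toy tree diff in Python.
The in-memory model is an assumption made for the example: a tree is a
dict of name -> (kind, object id), `store` maps a tree id to its tree,
and the short ids stand in for SHA1 names. The important line is the
early return: two subtrees with the same name are never opened at all.

```python
from typing import Iterator, Optional

# Hypothetical model: a tree maps entry name -> (kind, object id).
Tree = dict[str, tuple[str, str]]
Change = tuple[str, Optional[str], Optional[str]]


def diff_trees(store: dict[str, Tree], old: Optional[str], new: Optional[str],
               prefix: str = "") -> Iterator[Change]:
    """Yield (path, old blob id, new blob id) for every blob that differs."""
    if old == new:
        return  # same name means same content: skip the whole subtree
    old_tree = store[old] if old is not None else {}
    new_tree = store[new] if new is not None else {}
    for name in sorted(old_tree.keys() | new_tree.keys()):
        a, b = old_tree.get(name), new_tree.get(name)
        if a == b:
            continue  # common part, ignore it
        path = prefix + name
        a_sub = a[1] if a and a[0] == "tree" else None
        b_sub = b[1] if b and b[0] == "tree" else None
        if a_sub or b_sub:
            yield from diff_trees(store, a_sub, b_sub, path + "/")
        a_blob = a[1] if a and a[0] == "blob" else None
        b_blob = b[1] if b and b[0] == "blob" else None
        if a_blob or b_blob:
            yield (path, a_blob, b_blob)


store = {
    "t_doc": {"README": ("blob", "b3")},
    "t_src": {"main.c": ("blob", "b1")},
    "t_src2": {"main.c": ("blob", "b2")},
    "root1": {"doc": ("tree", "t_doc"), "src": ("tree", "t_src")},
    "root2": {"doc": ("tree", "t_doc"), "src": ("tree", "t_src2")},
}
changes = list(diff_trees(store, "root1", "root2"))
print(changes)
```

In this model the work done is bounded by the trees that lie on a
changed path, so an untouched "doc" directory costs one comparison no
matter how many files it holds.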

Side note 2 on trees: since the name of a "blob" depends entirely and
exclusively on its contents (i.e. there are no names or permissions
involved), you can see trivial renames or permission changes by
noticing that the blob stayed the same. However, renames with data
changes need a smarter "diff" implementation.
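
A minimal sketch of spotting such trivial renames, assuming each tree
has already been flattened into a hypothetical path -> blob id dict
(the function name and the ids are invented for the example):

```python
def find_trivial_renames(old: dict[str, str], new: dict[str, str]) -> list[tuple[str, str]]:
    """Pair removed paths with added paths that carry the same blob id."""
    removed = {path: blob for path, blob in old.items() if path not in new}
    added = {path: blob for path, blob in new.items() if path not in old}
    by_blob: dict[str, list[str]] = {}
    for path, blob in sorted(removed.items()):
        by_blob.setdefault(blob, []).append(path)
    renames = []
    for path, blob in sorted(added.items()):
        candidates = by_blob.get(blob)
        if candidates:
            # Same blob under a new name: the content did not change.
            renames.append((candidates.pop(0), path))
    return renames


old = {"README": "b1", "src/util.c": "b2"}
new = {"README": "b1", "src/helpers.c": "b2"}
renames = find_trivial_renames(old, new)
print(renames)
```

A file that was renamed and edited in the same step gets a new blob id,
so this approach will not see it. That is the "smarter diff" case.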

A tree is created with link:git-write-tree.html[git-write-tree] and
its data can be accessed by link:git-ls-tree.html[git-ls-tree].

Commit Object
~~~~~~~~~~~~~
The "commit" object is an object that introduces the notion of
history into the picture. In contrast to the other objects, it
doesn't just describe the physical state of a tree, it describes how
we got there, and why.

A "commit" is defined by the tree-object that it results in, the
parent commits (zero, one or more) that led up to that point, and a
comment on what happened. Again, a commit is not trusted per se:
the contents are well-defined and "safe" due to the cryptographically
strong signatures at all levels, but there is no reason to believe
that the tree is "good" or that the merge information makes sense.
The parents do not have to actually have any relationship with the
result, for example.

Note on commits: unlike real SCMs, commits do not contain
rename information or file mode change information. All of that is
implicit in the trees involved (the result tree, and the result trees
of the parents), and describing that makes no sense in this idiotic
file manager.

A commit is created with link:git-commit-tree.html[git-commit-tree] and
its data can be accessed by link:git-cat-file.html[git-cat-file].

Trust
~~~~~
An aside on the notion of "trust". Trust is really outside the scope
of "git", but it's worth noting a few things. First off, since
everything is hashed with SHA1, you _can_ trust that an object is
intact and has not been messed with by external sources. So the name
of an object uniquely identifies a known state - just not a state that
you may want to trust.

Furthermore, since the SHA1 signature of a commit refers to the
SHA1 signatures of the tree it is associated with and the signatures
of the parent, a single named commit specifies uniquely a whole set
of history, with full contents. You can't later fake any step of the
way once you have the name of a commit.
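
To see why one commit name pins down the whole history, here is a toy
hash chain in Python. The text layout hashed by `commit_name` is made up
for the illustration and is not git's on-disk commit format; only the
idea (each name covers the tree name and the parent names) is taken
from the text above.

```python
import hashlib


def commit_name(tree: str, parents: list[str], message: str) -> str:
    """Name a toy commit by hashing its tree name, parent names and message."""
    lines = ["tree " + tree] + ["parent " + p for p in parents] + ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def head_of(history: list[tuple[str, str]]) -> str:
    """Chain (tree, message) pairs into a linear history; return the top name."""
    parents: list[str] = []
    name = ""
    for tree, message in history:
        name = commit_name(tree, parents, message)
        parents = [name]  # the next commit refers to this one by name
    return name


original = [("tree-a", "initial"), ("tree-b", "fix bug"), ("tree-c", "add feature")]
tampered = [("tree-a", "initial"), ("tree-X", "fix bug"), ("tree-c", "add feature")]

# Changing a tree in the middle of the history changes every later name.
print(head_of(original) != head_of(tampered))
```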

So to introduce some real trust in the system, the only thing you need
to do is to digitally sign just _one_ special note, which includes the
name of a top-level commit. Your digital signature shows others
that you trust that commit, and the immutability of the history of
commits tells others that they can trust the whole history.

In other words, you can easily validate a whole archive by just
sending out a single email that tells the people the name (SHA1 hash)
of the top commit, and digitally signing that email using something
like GPG/PGP.

To assist in this, git also provides the tag object...

Tag Object
~~~~~~~~~~
Git provides the "tag" object to simplify creating, managing and
exchanging symbolic and signed tokens. The "tag" object at its
simplest simply symbolically identifies another object by containing
the sha1, type and symbolic name.

However it can optionally contain additional signature information
(which git doesn't care about as long as there's less than 8k of
it). This can then be verified externally to git.

Note that despite the tag features, "git" itself only handles content
integrity; the trust framework (and signature provision and
verification) has to come from outside.

A tag is created with link:git-mktag.html[git-mktag] and
its data can be accessed by link:git-cat-file.html[git-cat-file].

Delta Object
~~~~~~~~~~~~

The "delta" object is used internally by the object database to
minimise storage usage by using xdeltas (byte level diffs). Deltas can
form chains of arbitrary length as RCS does (although this is
configurable at creation time). Most operations won't see or even be
aware of delta objects as they are automatically 'applied' and appear
as 'real' git objects. In other words, if you write your own routines
to look at the contents of the object database then you need to know
about this - otherwise you don't. Actually, that's not quite true -
one important area where deltas are likely to prove very valuable is
in reducing bandwidth loads - so the more sophisticated network tools
for git repositories will be aware of them too.

Finally, git repositories can (and must) be deltafied in the
background - the work to calculate the differences does not take place
automatically at commit time.

A delta can be created (or undeltafied) with
link:git-mkdelta.html[git-mkdelta]; its raw data cannot be accessed at
present.


The "index" aka "Current Directory Cache"
-----------------------------------------
The index is a simple binary file, which contains an efficient
representation of a virtual directory content at some random time. It
does so by a simple array that associates a set of names, dates,
permissions and content (aka "blob") objects together. The cache is
always kept ordered by name, and names are unique (with a few very
specific rules) at any point in time, but the cache has no long-term
meaning, and can be partially updated at any time.

In particular, the index certainly does not need to be consistent with
the current directory contents (in fact, most operations will depend on
different ways to make the index _not_ be consistent with the directory
hierarchy), but it has three very important attributes:

'(a) it can re-generate the full state it caches (not just the
directory structure: it contains pointers to the "blob" objects so
that it can regenerate the data too)'

As a special case, there is a clear and unambiguous one-way mapping
from a current directory cache to a "tree object", which can be
efficiently created from just the current directory cache without
actually looking at any other data. So a directory cache at any one
time uniquely specifies one and only one "tree" object (but has
additional data to make it easy to match up that tree object with what
has happened in the directory).

'(b) it has efficient methods for finding inconsistencies between that
cached state ("tree object waiting to be instantiated") and the
current state.'

'(c) it can additionally efficiently represent information about merge
conflicts between different tree objects, allowing each pathname to be
associated with sufficient information about the trees involved that
you can create a three-way merge between them.'

Those are the three ONLY things that the directory cache does. It's a
cache, and the normal operation is to re-generate it completely from a
known tree object, or update/compare it with a live tree that is being
developed. If you blow the directory cache away entirely, you generally
haven't lost any information as long as you have the name of the tree
that it described.

At the same time, the directory index is also the
staging area for creating new trees, and creating a new tree always
involves a controlled modification of the index file. In particular,
the index file can have the representation of an intermediate tree that
has not yet been instantiated. So the index can be thought of as a
write-back cache, which can contain dirty information that has not yet
been written back to the backing store.
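
The one-way mapping from the index to a tree object, mentioned under
(a), can be sketched like this. Everything here is a stand-in: the
index is reduced to a flat path -> blob id dict, `write_tree` is a
hypothetical helper, and the text encoding of a tree is invented for
the example rather than git's binary tree format.

```python
import hashlib


def write_tree(index: dict[str, str], store: dict[str, str]) -> str:
    """Turn a flat {path: blob id} index into nested toy tree objects.

    Each tree is stored as text lines "<kind> <id> <name>" sorted by name.
    Returns the name of the top-level tree.
    """
    blobs: dict[str, str] = {}
    subdirs: dict[str, dict[str, str]] = {}
    for path, blob in index.items():
        head, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(head, {})[rest] = blob  # entry lives in a subdirectory
        else:
            blobs[head] = blob
    entries = [("blob", blob, name) for name, blob in blobs.items()]
    entries += [("tree", write_tree(sub, store), name) for name, sub in subdirs.items()]
    entries.sort(key=lambda entry: entry[2])
    body = "".join(f"{kind} {oid} {name}\n" for kind, oid, name in entries)
    tree_id = hashlib.sha1(body.encode("utf-8")).hexdigest()
    store[tree_id] = body
    return tree_id


store: dict[str, str] = {}
index = {"README": "b1", "src/main.c": "b2", "src/util.c": "b3"}
root = write_tree(index, store)

# The same entries in a different order give the same tree name.
same = write_tree(dict(reversed(list(index.items()))), {})
print(root == same, len(store))
```

Only the index entries are consulted: no file in the working directory
is opened, which is what makes the mapping cheap.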


The Workflow
------------
Generally, all "git" operations work on the index file. Some operations
work *purely* on the index file (showing the current state of the
index), but most operations move data to and from the index file, either
from the database or from the working directory. Thus there are four
main combinations:

1) working directory -> index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You update the index with information from the working directory with
the link:git-update-cache.html[git-update-cache] command. You
generally update the index information by just specifying the filename
you want to update, like so:

        git-update-cache filename

but to avoid common mistakes with filename globbing etc, the command
will not normally add totally new entries or remove old entries,
i.e. it will normally just update existing cache entries.

To tell git that yes, you really do realize that certain files no
longer exist in the archive, or that new files should be added, you
should use the "--remove" and "--add" flags respectively.

NOTE! A "--remove" flag does _not_ mean that subsequent filenames will
necessarily be removed: if the files still exist in your directory
structure, the index will be updated with their new status, not
removed. The only thing "--remove" means is that update-cache will be
considering a removed file to be a valid thing, and if the file really
does not exist any more, it will update the index accordingly.

As a special case, you can also do "git-update-cache --refresh", which
will refresh the "stat" information of each index entry to match the
current stat information. It will _not_ update the object status itself,
and it will only update the fields that are used to quickly test whether
an object still matches its old backing store object.
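
Here is a small Python model of what such a refresh could look like.
It is a sketch under assumptions, not a description of
git-update-cache: the entry layout, the choice of size and mtime as the
cached "stat" fields, and the helper names are all invented for the
example.

```python
import hashlib
import os
import tempfile


def blob_name(data: bytes) -> str:
    """Hash file contents; a stand-in for the real blob naming."""
    return hashlib.sha1(data).hexdigest()


def stat_key(path: str) -> tuple[int, int]:
    """The cheap fields compared here: file size and modification time."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def refresh(entry: dict, path: str) -> bool:
    """Re-sync cached stat data if the content still matches the recorded blob.

    Returns True when the entry is up to date afterwards. The blob id in
    the entry is never changed here.
    """
    if entry["stat"] == stat_key(path):
        return True                     # fast path: stat data still matches
    with open(path, "rb") as f:
        if blob_name(f.read()) != entry["blob"]:
            return False                # really modified: needs a real update
    entry["stat"] = stat_key(path)      # same content, only the stat data drifted
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "wb") as f:
        f.write(b"one\n")
    entry = {"blob": blob_name(b"one\n"), "stat": stat_key(path)}

    later = entry["stat"][1] + 1_000_000_000
    os.utime(path, ns=(later, later))   # like "touch": new mtime, same content
    stale = entry["stat"] != stat_key(path)
    refreshed = refresh(entry, path)

    with open(path, "wb") as f:
        f.write(b"one, two\n")          # a genuine content change
    modified = not refresh(entry, path)

print(stale, refreshed, modified)
```

The point of the cached stat data is the fast path: as long as it
matches, the file never has to be read and hashed.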

2) index -> object database
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You write your current index file to a "tree" object with the program

        git-write-tree

that doesn't come with any options - it will just write out the
current index into the set of tree objects that describe that state,
and it will return the name of the resulting top-level tree. You can
use that tree to re-generate the index at any time by going in the
other direction:

3) object database -> index
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You read a "tree" file from the object database, and use that to
populate (and overwrite - don't do this if your index contains any
unsaved state that you might want to restore later!) your current
index. Normal operation is just

        git-read-tree <sha1 of tree>

and your index file will now be equivalent to the tree that you saved
earlier. However, that is only your _index_ file: your working
directory contents have not been modified.

4) index -> working directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You update your working directory from the index by "checking out"
files. This is not a very common operation, since normally you'd just
keep your files updated, and rather than write to your working
directory, you'd tell the index files about the changes in your
working directory (i.e. "git-update-cache").

However, if you decide to jump to a new version, or check out somebody
else's version, or just restore a previous tree, you'd populate your
index file with read-tree, and then you need to check out the result
with

        git-checkout-cache filename

or, if you want to check out all of the index, use "-a".

NOTE! git-checkout-cache normally refuses to overwrite old files, so
if you have an old version of the tree already checked out, you will
need to use the "-f" flag (_before_ the "-a" flag or the filename) to
_force_ the checkout.


Finally, there are a few odds and ends which are not purely moving
from one representation to the other:
8ac866a8
DG
3915) Tying it all together
392~~~~~~~~~~~~~~~~~~~~~~~~
7096a645
DG
393To commit a tree you have instantiated with "git-write-tree", you'd
394create a "commit" object that refers to that tree and the history
395behind it - most notably the "parent" commits that preceded it in
396history.
6ad6d3d3 397
8ac866a8
DG
398Normally a "commit" has one parent: the previous state of the tree
399before a certain change was made. However, sometimes it can have two
400or more parent commits, in which case we call it a "merge", due to the
401fact that such a commit brings together ("merges") two or more
402previous states represented by other commits.
6ad6d3d3 403
8ac866a8
DG
404In other words, while a "tree" represents a particular directory state
405of a working directory, a "commit" represents that state in "time",
406and explains how we got there.
6ad6d3d3 407
8ac866a8
DG
408You create a commit object by giving it the tree that describes the
409state at the time of the commit, and a list of parents:
6ad6d3d3 410
7096a645 411 git-commit-tree <tree> -p <parent> [-p <parent2> ..]
6ad6d3d3 412
8ac866a8
DG
413and then giving the reason for the commit on stdin (either through
414redirection from a pipe or file, or by just typing it at the tty).
6ad6d3d3 415
7096a645
DG
416git-commit-tree will return the name of the object that represents
417that commit, and you should save it away for later use. Normally,
418you'd commit a new "HEAD" state, and while git doesn't care where you
419save the note about that state, in practice we tend to just write the
8ac866a8
DG
420result to the file ".git/HEAD", so that we can always see what the
421last committed state was.

6) Examining the data
~~~~~~~~~~~~~~~~~~~~~

You can examine the data represented in the object database and the
index with various helper tools. For every object, you can use
link:git-cat-file.html[git-cat-file] to examine details about the
object:

        git-cat-file -t <objectname>

shows the type of the object, and once you have the type (which is
usually implicit in where you find the object), you can use

        git-cat-file blob|tree|commit <objectname>

to show its contents. NOTE! Trees have binary content, and as a result
there is a special helper for showing that content, called
"git-ls-tree", which turns the binary content into a more easily
readable form.

It's especially instructive to look at "commit" objects, since those
tend to be small and fairly self-explanatory. In particular, if you
follow the convention of having the top commit name in ".git/HEAD",
you can do

        git-cat-file commit $(cat .git/HEAD)

to see what the top commit was.

7) Merging multiple trees
~~~~~~~~~~~~~~~~~~~~~~~~~

Git helps you do a three-way merge, which you can expand to n-way by
repeating the merge procedure an arbitrary number of times until you
finally "commit" the state. The normal situation is that you'd only do
one three-way merge (two parents), and commit it, but if you like to,
you can do multiple parents in one go.

To do a three-way merge, you need the two sets of "commit" objects
that you want to merge, use those to find the closest common parent (a
third "commit" object), and then use those commit objects to find the
state of the directory ("tree" object) at these points.

To get the "base" for the merge, you first look up the common parent
of two commits with

        git-merge-base <commit1> <commit2>

which will return you the commit they are both based on. You should
now look up the "tree" objects of those commits, which you can easily
do with (for example)

        git-cat-file commit <commitname> | head -1

since the tree object information is always the first line in a commit
object.
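
If you would rather pick a commit apart in a script than with "head",
something like the following Python sketch would do. The only layout
fact taken from this document is that the tree comes first; the
"parent" header lines, the blank line before the message, and the
sample commit text (with obviously fake object names) are assumptions
made for the example.

```python
def parse_commit(text: str) -> dict:
    """Split commit text into its tree name, parent names and message."""
    header, _, message = text.partition("\n\n")
    lines = header.split("\n")
    kind, _, tree = lines[0].partition(" ")
    if kind != "tree":
        raise ValueError("first line of a commit must name its tree")
    parents = [line.split(" ", 1)[1] for line in lines[1:] if line.startswith("parent ")]
    return {"tree": tree, "parents": parents, "message": message}


# Made-up commit text, as "git-cat-file commit <commitname>" might print it.
sample = (
    "tree 1111111111111111111111111111111111111111\n"
    "parent 2222222222222222222222222222222222222222\n"
    "parent 3333333333333333333333333333333333333333\n"
    "author A U Thor <author@example.com> 1112911993 -0700\n"
    "committer A U Thor <author@example.com> 1112911993 -0700\n"
    "\n"
    "Merge two lines of development\n"
)
commit = parse_commit(sample)
print(commit["tree"], len(commit["parents"]))
```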

Once you know the three trees you are going to merge (the one
"original" tree, aka the common case, and the two "result" trees, aka
the branches you want to merge), you do a "merge" read into the
index. This will throw away your old index contents, so you should
make sure that you've committed those - in fact you would normally
always do a merge against your last commit (which should thus match
what you have in your current index anyway).

To do the merge, do

        git-read-tree -m <origtree> <target1tree> <target2tree>

which will do all trivial merge operations for you directly in the
index file, and you can just write the result out with
"git-write-tree".

NOTE! Because the merge is done in the index file, and not in your
working directory, your working directory will no longer match your
index. You can use "git-checkout-cache -f -a" to make the effect of
the merge be seen in your working directory.

NOTE2! Sadly, many merges aren't trivial. If there are files that have
been added, moved or removed, or if both branches have modified the
same file, you will be left with an index tree that contains "merge
entries" in it. Such an index tree can _NOT_ be written out to a tree
object, and you will have to resolve any such merge clashes using
other tools before you can write out the result.
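
The split between trivial and non-trivial cases can be modelled in a
few lines of Python. This is a simplified model that follows the
description above, not the actual rules inside git-read-tree: trees are
flattened to hypothetical path -> blob id dicts, and any path that is
missing from one of the three trees is simply left unresolved.

```python
from typing import Optional

Tree = dict[str, str]  # hypothetical flattened tree: path -> blob id
Stages = tuple[Optional[str], Optional[str], Optional[str]]


def trivial_merge(base: Tree, ours: Tree, theirs: Tree) -> tuple[Tree, dict[str, Stages]]:
    """Resolve the easy paths; report the rest as unresolved merge entries."""
    merged: Tree = {}
    unresolved: dict[str, Stages] = {}
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if None in (b, o, t):
            unresolved[path] = (b, o, t)   # added or removed somewhere
        elif o == t:
            merged[path] = o               # both sides agree
        elif b == o:
            merged[path] = t               # only "theirs" changed it
        elif b == t:
            merged[path] = o               # only "ours" changed it
        else:
            unresolved[path] = (b, o, t)   # both changed it differently
    return merged, unresolved


base = {"a": "a1", "b": "b1", "c": "c1"}
ours = {"a": "a2", "b": "b1", "c": "c2"}
theirs = {"a": "a1", "b": "b1", "c": "c3", "d": "d1"}
merged, unresolved = trivial_merge(base, ours, theirs)
print(merged)
print(unresolved)
```

As long as `unresolved` is not empty, the result is not a plain tree
yet, which matches the rule that an index with merge entries cannot be
written out.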


[ fixme: talk about resolving merges here ]