gitformat-chunk(5)
==================

NAME
----
gitformat-chunk - Chunk-based file formats

SYNOPSIS
--------

Used by linkgit:gitformat-commit-graph[5] and the "MIDX" format (see
the pack format documentation in linkgit:gitformat-pack[5]).

DESCRIPTION
-----------

Some file formats in Git use a common concept of "chunks" to describe
sections of the file. This allows structured access to a large file by
scanning a small "table of contents" for the remaining data. This common
format is used by the `commit-graph` and `multi-pack-index` files. See
the `multi-pack-index` format in linkgit:gitformat-pack[5] and
the `commit-graph` format in linkgit:gitformat-commit-graph[5] for
how they use the chunks to describe structured data.

A chunk-based file format begins with some header information custom to
that format. That header should include enough information to identify
the file type, format version, and number of chunks in the file. From this
information, a reader can determine the start of the chunk-based region.

The chunk-based region starts with a table of contents describing where
each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
where C is the number of chunks. Consider the following table:

  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
  |--------------------|------------------------|
  | ID[0]              | OFFSET[0]              |
  | ...                | ...                    |
  | ID[C-1]            | OFFSET[C-1]            |
  | 0x00000000         | OFFSET[C]              |

Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
Each integer is stored in network byte order.

The chunk identifier `ID[i]` is a label for the data stored within this
file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
and `OFFSET[i]`. This requires that the chunk data appears contiguously
in the same order as the table of contents.

The chunk identifier of the final entry in the table of contents must be
four zero bytes. This confirms that the table of contents is ending and
provides the offset for the end of the chunk-based data.

Note: The chunk-based format expects that the file contains _at least_ a
trailing hash after `OFFSET[C]`.
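
The table-of-contents layout described above can be illustrated with a
small standalone decoder. This is only a sketch, not the code in
`chunk-format.c`: the names `toc_row`, `decode_toc()` and `chunk_size()`
are hypothetical, and the caller is assumed to have already checked that
the buffer holds one more 12-byte row than there are chunks.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID followed by 8-byte offset */

/* Hypothetical in-memory form of one table-of-contents row. */
struct toc_row {
	uint32_t id;
	uint64_t offset;
};

/* Decode a 32-bit integer stored in network byte order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode a 64-bit integer stored in network byte order. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Decode num_chunks + 1 rows from toc into rows[].
 * Returns 0 on success, or -1 if the offsets decrease or the final
 * row does not carry the all-zero terminating ID.
 */
static int decode_toc(const unsigned char *toc, size_t num_chunks,
		      struct toc_row *rows)
{
	size_t i;

	for (i = 0; i <= num_chunks; i++) {
		rows[i].id = read_be32(toc + i * TOC_ROW_SIZE);
		rows[i].offset = read_be64(toc + i * TOC_ROW_SIZE + 4);
		if (i > 0 && rows[i].offset < rows[i - 1].offset)
			return -1;
	}
	return rows[num_chunks].id == 0 ? 0 : -1;
}

/* Size of chunk i: the next row's offset minus this row's offset. */
static uint64_t chunk_size(const struct toc_row *rows, size_t i)
{
	return rows[i + 1].offset - rows[i].offset;
}
```

Because each size is derived from the following row, the terminating row
is what gives the last chunk its length.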

Functions for working with chunk-based file formats are declared in
`chunk-format.h`. Using these functions provides extra checks that assist
developers when creating new file formats.

Writing chunk-based file formats
--------------------------------

To write a chunk-based file format, create a `struct chunkfile` by
calling `init_chunkfile()` and passing it a `struct hashfile` pointer. The
caller is responsible for opening the `hashfile` and writing header
information so the file format is identifiable before the chunk-based
format begins.

Then, call `add_chunk()` for each chunk that is to be written. This
populates the `chunkfile` with information about the order and size of
each chunk to write. Provide a `chunk_write_fn` function pointer to
perform the write of the chunk data upon request.

Call `write_chunkfile()` to write the table of contents to the `hashfile`
followed by each of the chunks. This will verify that each chunk wrote
the expected amount of data so the table of contents is correct.

Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
caller is responsible for finalizing the `hashfile` by writing the trailing
hash and closing the file.
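
To make the relationship between the registered chunk sizes and the
written table of contents concrete, the following standalone sketch
serializes a table from a list of IDs and sizes. It does not use the
`chunk-format.h` API: `build_toc()` and its parameters are hypothetical,
and it assumes that offsets are measured from the start of the file and
that the first chunk begins immediately after the table.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID followed by 8-byte offset */

/* Store a 32-bit integer in network byte order. */
static void write_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Store a 64-bit integer in network byte order. */
static void write_be64(unsigned char *p, uint64_t v)
{
	write_be32(p, (uint32_t)(v >> 32));
	write_be32(p + 4, (uint32_t)v);
}

/*
 * Serialize a table of contents for nr chunks into out, which must
 * have room for (nr + 1) * TOC_ROW_SIZE bytes. toc_start is the file
 * offset at which the table itself begins (an assumption of this
 * sketch). Returns the offset of the end of the chunk data.
 */
static uint64_t build_toc(unsigned char *out, uint64_t toc_start,
			  const uint32_t *ids, const uint64_t *sizes,
			  size_t nr)
{
	uint64_t offset = toc_start + (uint64_t)(nr + 1) * TOC_ROW_SIZE;
	size_t i;

	for (i = 0; i < nr; i++) {
		write_be32(out, ids[i]);
		write_be64(out + 4, offset);
		out += TOC_ROW_SIZE;
		offset += sizes[i];
	}

	/* Terminating row: zero ID plus the end of the chunk data. */
	write_be32(out, 0);
	write_be64(out + 4, offset);
	return offset;
}
```

Since every offset is computed from the declared sizes before any chunk
data is written, a chunk that writes more or fewer bytes than it declared
would make the table wrong, which is why the size check matters.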

Reading chunk-based file formats
--------------------------------

To read a chunk-based file format, the file must be opened as a
memory-mapped region. The chunk-format API expects that the entire file
is mapped as a contiguous memory region.

Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.

After reading the header information from the beginning of the file,
including the chunk count, call `read_table_of_contents()` to populate
the `struct chunkfile` with the list of chunks, their offsets, and their
sizes.

Extract the data for each chunk using `pair_chunk()` or
`read_chunk()`:

* `pair_chunk()` sets a given pointer to the location inside the
  memory-mapped file corresponding to that chunk's offset. If the chunk
  does not exist, then the pointer is not modified.

* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
  with the appropriate initial pointer and size information. The function
  is not called if the chunk does not exist. Use this function to read
  chunks if you need to perform immediate parsing or if you need to
  execute logic based on the size of the chunk.

After calling these functions, call `free_chunkfile()` to clear the
`struct chunkfile` data. This will not close the memory-mapped region.
Callers are expected to keep the region mapped for as long as the pointers
into it are needed.
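
The two lookup styles described above differ mainly in what happens when
a chunk is found. The standalone sketch below mimics that behavior on an
already-decoded chunk list. It is not the `chunk-format.h` API: the
`chunk_info` struct, `lookup_chunk()`, `visit_chunk()`, the `chunk_cb`
callback type and the return values are all hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded chunk: ID, start pointer in the map, and size. */
struct chunk_info {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

/*
 * Pointer-assignment style: set *p to the start of the chunk.
 * If no chunk has the given ID, *p is left untouched.
 * Returns 0 if the chunk was found, 1 otherwise.
 */
static int lookup_chunk(const struct chunk_info *chunks, size_t nr,
			uint32_t id, const unsigned char **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (chunks[i].id == id) {
			*p = chunks[i].start;
			return 0;
		}
	}
	return 1;
}

/* Hypothetical callback type receiving the chunk start and size. */
typedef int (*chunk_cb)(const unsigned char *start, size_t size,
			void *data);

/*
 * Callback style: call fn with the chunk's start and size.
 * fn is not called if the chunk does not exist.
 * Returns fn's result, or 1 if the chunk was not found.
 */
static int visit_chunk(const struct chunk_info *chunks, size_t nr,
		       uint32_t id, chunk_cb fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (chunks[i].id == id)
			return fn(chunks[i].start, chunks[i].size, data);
	}
	return 1;
}

/* Example callback: record the chunk size in the caller's variable. */
static int record_size(const unsigned char *start, size_t size,
		       void *data)
{
	(void)start;
	*(size_t *)data = size;
	return 0;
}
```

Leaving the pointer untouched for a missing chunk lets a caller
initialize it to `NULL` beforehand and treat optional chunks uniformly.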

Examples
--------

These file formats use the chunk-format API, and can be used as examples
for future formats:

* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
  in `commit-graph.c` for how the chunk-format API is used to write and
  parse the commit-graph file format documented in
  linkgit:gitformat-commit-graph[5].

* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
  in `midx.c` for how the chunk-format API is used to write and
  parse the multi-pack-index file format documented in
  the multi-pack-index section of linkgit:gitformat-pack[5].

GIT
---
Part of the linkgit:git[1] suite