[thirdparty/git.git] / Documentation / technical / parallel-checkout.txt

Parallel Checkout Design Notes
==============================

The "Parallel Checkout" feature attempts to use multiple processes to
parallelize the work of uncompressing the blobs, applying in-core
filters, and writing the resulting contents to the working tree during a
checkout operation. It can be used by all checkout-related commands,
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.

These commands share the following basic structure:

* Step 1: Read the current index file into memory.

* Step 2: Modify the in-memory index based upon the command, and
  temporarily mark all cache entries that need to be updated.

* Step 3: Populate the working tree to match the new candidate index.
  This includes iterating over all of the to-be-updated cache entries
  and delete, create, or overwrite the associated files in the working
  tree.

* Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.

Sequential Implementation
-------------------------

For the purposes of discussion here, the current sequential
implementation of Step 3 is divided in 3 parts, each one implemented in
its own function:

* Step 3a: `unpack-trees.c:check_updates()` contains a series of
  sequential loops iterating over the `cache_entry`'s array. The main
  loop in this function calls the Step 3b function for each of the
  to-be-updated entries.

* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
  for file conflicts, collisions, and unsaved changes. It removes files
  and creates leading directories as necessary. It calls the Step 3c
  function for each entry to be written.

* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
  it if necessary, creates the file in the working tree, writes the
  smudged contents, calls `fstat()` or `lstat()`, and updates the
  associated `cache_entry` struct with the stat information gathered.

It wouldn't be safe to perform Step 3b in parallel, as there could be
race conditions between file creations and removals. Instead, the
parallel checkout framework lets the sequential code handle Step 3b,
and uses parallel workers to replace the sequential
`entry.c:write_entry()` calls from Step 3c.

Rejected Multi-Threaded Solution
--------------------------------

The most "straightforward" implementation would be to spread the set of
to-be-updated cache entries across multiple threads. But due to the
thread-unsafe functions in the object database code, we would have to use locks to
coordinate the parallel operation. An early prototype of this solution
showed that the multi-threaded checkout would bring performance
improvements over the sequential code, but there was still too much lock
contention. A `perf` profiling indicated that around 20% of the runtime
during a local Linux clone (on an SSD) was spent in locking functions.
For this reason this approach was rejected in favor of using multiple
child processes, which led to a better performance.

Multi-Process Solution
----------------------

Parallel checkout alters the aforementioned Step 3 to use multiple
`checkout--worker` background processes to distribute the work. The
long-running worker processes are controlled by the foreground Git
command using the existing run-command API.

Overview
~~~~~~~~

Step 3b is only slightly altered; for each entry to be checked out, the
main process performs the following steps:

* M1: Check whether there is any untracked or unclean file in the
  working tree which would be overwritten by this entry, and decide
  whether to proceed (removing the file(s)) or not.

* M2: Create the leading directories.

* M3: Load the conversion attributes for the entry's path.

* M4: Check, based on the entry's type and conversion attributes,
  whether the entry is eligible for parallel checkout (more on this
  later). If it is eligible, enqueue the entry and the loaded
  attributes to later write the entry in parallel. If not, write the
  entry right away, using the default sequential code.

Note: we save the conversion attributes associated with each entry
because the workers don't have access to the main process' index state,
so they can't load the attributes by themselves (and the attributes are
needed to properly smudge the entry). Additionally, this has a positive
impact on performance as (1) we don't need to load the attributes twice
and (2) the attributes machinery is optimized to handle paths in
sequential order.

After all entries have passed through the above steps, the main process
checks if the number of enqueued entries is sufficient to spread among
the workers. If not, it just writes them sequentially. Otherwise, it
spawns the workers and distributes the queued entries uniformly in
continuous chunks. This aims to minimize the chances of two workers
writing to the same directory simultaneously, which could increase lock
contention in the kernel.

Then, for each assigned item, each worker:

* W1: Checks if there is any non-directory file in the leading part of
  the entry's path or if there already exists a file at the entry' path.
  If so, mark the entry with `PC_ITEM_COLLIDED` and skip it (more on
  this later).

* W2: Creates the file (with O_CREAT and O_EXCL).

* W3: Loads the blob into memory (inflating and delta reconstructing
  it).

* W4: Applies any required in-process filter, like end-of-line
  conversion and re-encoding.

* W5: Writes the result to the file descriptor opened at W2.

* W6: Calls `fstat()` or lstat()` on the just-written path, and sends
  the result back to the main process, together with the end status of
  the operation and the item's identification number.

Note that, when possible, steps W3 to W5 are delegated to the streaming
machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree,
it removes the created file to avoid leaving empty files behind. This is
the *only* time a worker is allowed to remove a file.

As mentioned earlier, it is the responsibility of the main process to
remove any file that blocks the checkout operation (or abort if the
removal(s) would cause data loss and the user didn't ask to `--force`).
This is crucial to avoid race conditions and also to properly detect
path collisions at Step W1.

After the workers finish writing the items and sending back the required
information, the main process handles the results in two steps:

- First, it updates the in-memory index with the `lstat()` information
  sent by the workers. (This must be done first as this information
  might me required in the following step.)

- Then it writes the items which collided on disk (i.e. items marked
  with `PC_ITEM_COLLIDED`). More on this below.

Path Collisions
---------------

Path collisions happen when two different paths correspond to the same
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
case-insensitive file system.

The sequential checkout deals with collisions in the same way that it
deals with files that were already present in the working tree before
checkout. Basically, it checks if the path that it wants to write
already exists on disk, makes sure the existing file doesn't have
unsaved data, and then overwrites it. (To be more pedantic: it deletes
the existing file and creates the new one.) So, if there are multiple
colliding files to be checked out, the sequential code will write each
one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we
cannot let the workers racily write to the same file on disk. Instead,
the workers detect when the entry that they want to check out would
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
Later, the main process can sequentially feed these entries back to
`checkout_entry()` without the risk of race conditions. On clone, this
also has the effect of marking the colliding entries to later emit a
warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being
concurrently written and collisions between a parallel-eligible entry
and an ineligible entry. The general idea for collision detection is
quite straightforward: for each parallel-eligible entry, the main
process must remove all files that prevent this entry from being written
(before enqueueing it). This includes any non-directory file in the
leading path of the entry. Later, when a worker gets assigned the entry,
it looks again for the non-directories files and for an already existing
file at the entry's path. If any of these checks finds something, the
worker knows that there was a path collision.

Because parallel checkout can distinguish path collisions from the case
where the file was already present in the working tree before checkout,
we could alternatively choose to skip the checkout of colliding entries.
However, each entry that doesn't get written would have NULL `lstat()`
fields on the index. This could cause performance penalties for
subsequent commands that need to refresh the index, as they would have
to go to the file system to see if the entry is dirty. Thus, if we have
N entries in a colliding group and we decide to write and `lstat()` only
one of them, every subsequent `git-status` will have to read, convert,
and hash the written file N - 1 times. By checking out all colliding
entries (like the sequential code does), we only pay the overhead once,
during checkout.

Eligible Entries for Parallel Checkout
--------------------------------------

As previously mentioned, not all entries passed to `checkout_entry()`
will be considered eligible for parallel checkout. More specifically, we
exclude:

- Symbolic links; to avoid race conditions that, in combination with
  path collisions, could cause workers to write files at the wrong
  place. For example, if we were to concurrently check out a symlink
  'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
  we could potentially end up writing the file 'A/f' at 'a/f', due to a
  race condition.

- Regular files that require external filters (either "one shot" filters
  or long-running process filters). These filters are black-boxes to Git
  and may have their own internal locking or non-concurrent assumptions.
  So it might not be safe to run multiple instances in parallel.
+
Besides, long-running filters may use the delayed checkout feature to
postpone the return of some filtered blobs. The delayed checkout queue
and the parallel checkout queue are not compatible and should remain
separate.
+
Note: regular files that only require internal filters, like end-of-line
conversion and re-encoding, are eligible for parallel checkout.

Ineligible entries are checked out by the classic sequential codepath
*before* spawning workers.

Note: submodules's files are also eligible for parallel checkout (as
long as they don't fall into any of the excluding categories mentioned
above). But since each submodule is checked out in its own child
process, we don't mix the superproject's and the submodules' files in
the same parallel checkout process or queue.

The API
-------

The parallel checkout API was designed with the goal of minimizing
changes to the current users of the checkout machinery. This means that
they don't have to call a different function for sequential or parallel
checkout. As already mentioned, `checkout_entry()` will automatically
insert the given entry in the parallel checkout queue when this feature
is enabled and the entry is eligible; otherwise, it will just write the
entry right away, using the sequential code. In general, callers of the
parallel checkout API should look similar to this:

----------------------------------------------
int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
	init_parallel_checkout();

for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
----------------------------------------------
Commit	Line	Data
68e66f29 MT	1	Parallel Checkout Design Notes
	2	==============================
	3
	4	The "Parallel Checkout" feature attempts to use multiple processes to
	5	parallelize the work of uncompressing the blobs, applying in-core
	6	filters, and writing the resulting contents to the working tree during a
	7	checkout operation. It can be used by all checkout-related commands,
	8	such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.
	9
	10	These commands share the following basic structure:
	11
	12	* Step 1: Read the current index file into memory.
	13
	14	* Step 2: Modify the in-memory index based upon the command, and
	15	temporarily mark all cache entries that need to be updated.
	16
	17	* Step 3: Populate the working tree to match the new candidate index.
	18	This includes iterating over all of the to-be-updated cache entries
	19	and delete, create, or overwrite the associated files in the working
	20	tree.
	21
	22	* Step 4: Write the new index to disk.
	23
	24	Step 3 is the focus of the "parallel checkout" effort described here.
	25
	26	Sequential Implementation
	27	-------------------------
	28
	29	For the purposes of discussion here, the current sequential
	30	implementation of Step 3 is divided in 3 parts, each one implemented in
	31	its own function:
	32
	33	* Step 3a: `unpack-trees.c:check_updates()` contains a series of
	34	sequential loops iterating over the `cache_entry`'s array. The main
	35	loop in this function calls the Step 3b function for each of the
	36	to-be-updated entries.
	37
	38	* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
	39	for file conflicts, collisions, and unsaved changes. It removes files
	40	and creates leading directories as necessary. It calls the Step 3c
	41	function for each entry to be written.
	42
	43	* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
	44	it if necessary, creates the file in the working tree, writes the
	45	smudged contents, calls `fstat()` or `lstat()`, and updates the
	46	associated `cache_entry` struct with the stat information gathered.
	47
	48	It wouldn't be safe to perform Step 3b in parallel, as there could be
	49	race conditions between file creations and removals. Instead, the
	50	parallel checkout framework lets the sequential code handle Step 3b,
	51	and uses parallel workers to replace the sequential
	52	`entry.c:write_entry()` calls from Step 3c.
	53
	54	Rejected Multi-Threaded Solution
	55	--------------------------------
	56
	57	The most "straightforward" implementation would be to spread the set of
	58	to-be-updated cache entries across multiple threads. But due to the
fa8e8d5b	59	thread-unsafe functions in the object database code, we would have to use locks to
68e66f29 MT	60	coordinate the parallel operation. An early prototype of this solution
	61	showed that the multi-threaded checkout would bring performance
	62	improvements over the sequential code, but there was still too much lock
	63	contention. A `perf` profiling indicated that around 20% of the runtime
	64	during a local Linux clone (on an SSD) was spent in locking functions.
	65	For this reason this approach was rejected in favor of using multiple
	66	child processes, which led to a better performance.
	67
	68	Multi-Process Solution
	69	----------------------
	70
	71	Parallel checkout alters the aforementioned Step 3 to use multiple
	72	`checkout--worker` background processes to distribute the work. The
	73	long-running worker processes are controlled by the foreground Git
	74	command using the existing run-command API.
	75
	76	Overview
	77	~~~~~~~~
	78
	79	Step 3b is only slightly altered; for each entry to be checked out, the
	80	main process performs the following steps:
	81
	82	* M1: Check whether there is any untracked or unclean file in the
	83	working tree which would be overwritten by this entry, and decide
	84	whether to proceed (removing the file(s)) or not.
	85
	86	* M2: Create the leading directories.
	87
	88	* M3: Load the conversion attributes for the entry's path.
	89
	90	* M4: Check, based on the entry's type and conversion attributes,
	91	whether the entry is eligible for parallel checkout (more on this
	92	later). If it is eligible, enqueue the entry and the loaded
	93	attributes to later write the entry in parallel. If not, write the
	94	entry right away, using the default sequential code.
	95
	96	Note: we save the conversion attributes associated with each entry
	97	because the workers don't have access to the main process' index state,
	98	so they can't load the attributes by themselves (and the attributes are
	99	needed to properly smudge the entry). Additionally, this has a positive
	100	impact on performance as (1) we don't need to load the attributes twice
	101	and (2) the attributes machinery is optimized to handle paths in
	102	sequential order.
	103
	104	After all entries have passed through the above steps, the main process
	105	checks if the number of enqueued entries is sufficient to spread among
	106	the workers. If not, it just writes them sequentially. Otherwise, it
	107	spawns the workers and distributes the queued entries uniformly in
	108	continuous chunks. This aims to minimize the chances of two workers
	109	writing to the same directory simultaneously, which could increase lock
	110	contention in the kernel.
	111
	112	Then, for each assigned item, each worker:
	113
	114	* W1: Checks if there is any non-directory file in the leading part of
	115	the entry's path or if there already exists a file at the entry' path.
	116	If so, mark the entry with `PC_ITEM_COLLIDED` and skip it (more on
	117	this later).
	118
	119	* W2: Creates the file (with O_CREAT and O_EXCL).
	120
	121	* W3: Loads the blob into memory (inflating and delta reconstructing
	122	it).
	123
124	* W4: Applies any required in-process filter, like end-of-line
125	conversion and re-encoding.
126
127	* W5: Writes the result to the file descriptor opened at W2.
128
129	* W6: Calls `fstat()` or lstat()` on the just-written path, and sends
130	the result back to the main process, together with the end status of
131	the operation and the item's identification number.
132
133	Note that, when possible, steps W3 to W5 are delegated to the streaming
134	machinery, removing the need to keep the entire blob in memory.
135
136	If the worker fails to read the blob or to write it to the working tree,
137	it removes the created file to avoid leaving empty files behind. This is
138	the only time a worker is allowed to remove a file.
139
140	As mentioned earlier, it is the responsibility of the main process to
141	remove any file that blocks the checkout operation (or abort if the
142	removal(s) would cause data loss and the user didn't ask to `--force`).
143	This is crucial to avoid race conditions and also to properly detect
144	path collisions at Step W1.
145
146	After the workers finish writing the items and sending back the required
147	information, the main process handles the results in two steps:
148
149	- First, it updates the in-memory index with the `lstat()` information
150	sent by the workers. (This must be done first as this information
151	might me required in the following step.)
152
153	- Then it writes the items which collided on disk (i.e. items marked
154	with `PC_ITEM_COLLIDED`). More on this below.
155
156	Path Collisions
157	---------------
158
159	Path collisions happen when two different paths correspond to the same
160	entry in the file system. E.g. the paths 'a' and 'A' would collide in a
161	case-insensitive file system.
162
163	The sequential checkout deals with collisions in the same way that it
164	deals with files that were already present in the working tree before
165	checkout. Basically, it checks if the path that it wants to write
166	already exists on disk, makes sure the existing file doesn't have
167	unsaved data, and then overwrites it. (To be more pedantic: it deletes
168	the existing file and creates the new one.) So, if there are multiple
169	colliding files to be checked out, the sequential code will write each
170	one of them but only the last will actually survive on disk.
171
172	Parallel checkout aims to reproduce the same behavior. However, we
173	cannot let the workers racily write to the same file on disk. Instead,
174	the workers detect when the entry that they want to check out would
175	collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
176	Later, the main process can sequentially feed these entries back to
177	`checkout_entry()` without the risk of race conditions. On clone, this
178	also has the effect of marking the colliding entries to later emit a
179	warning for the user, like the classic sequential checkout does.
180
181	The workers are able to detect both collisions among the entries being
182	concurrently written and collisions between a parallel-eligible entry
183	and an ineligible entry. The general idea for collision detection is
184	quite straightforward: for each parallel-eligible entry, the main
185	process must remove all files that prevent this entry from being written
186	(before enqueueing it). This includes any non-directory file in the
187	leading path of the entry. Later, when a worker gets assigned the entry,
188	it looks again for the non-directories files and for an already existing
189	file at the entry's path. If any of these checks finds something, the
190	worker knows that there was a path collision.
191
192	Because parallel checkout can distinguish path collisions from the case
193	where the file was already present in the working tree before checkout,
194	we could alternatively choose to skip the checkout of colliding entries.
195	However, each entry that doesn't get written would have NULL `lstat()`
196	fields on the index. This could cause performance penalties for
197	subsequent commands that need to refresh the index, as they would have
198	to go to the file system to see if the entry is dirty. Thus, if we have
199	N entries in a colliding group and we decide to write and `lstat()` only
200	one of them, every subsequent `git-status` will have to read, convert,
201	and hash the written file N - 1 times. By checking out all colliding
202	entries (like the sequential code does), we only pay the overhead once,
203	during checkout.
204
205	Eligible Entries for Parallel Checkout
206	--------------------------------------
207
208	As previously mentioned, not all entries passed to `checkout_entry()`
209	will be considered eligible for parallel checkout. More specifically, we
210	exclude:
211
212	- Symbolic links; to avoid race conditions that, in combination with
213	path collisions, could cause workers to write files at the wrong
214	place. For example, if we were to concurrently check out a symlink
215	'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
216	we could potentially end up writing the file 'A/f' at 'a/f', due to a
217	race condition.
218
219	- Regular files that require external filters (either "one shot" filters
220	or long-running process filters). These filters are black-boxes to Git
221	and may have their own internal locking or non-concurrent assumptions.
222	So it might not be safe to run multiple instances in parallel.
223	+
224	Besides, long-running filters may use the delayed checkout feature to
225	postpone the return of some filtered blobs. The delayed checkout queue
226	and the parallel checkout queue are not compatible and should remain
227	separate.
228	+
229	Note: regular files that only require internal filters, like end-of-line
230	conversion and re-encoding, are eligible for parallel checkout.
231
232	Ineligible entries are checked out by the classic sequential codepath
233	before spawning workers.
234
235	Note: submodules's files are also eligible for parallel checkout (as
236	long as they don't fall into any of the excluding categories mentioned
237	above). But since each submodule is checked out in its own child
238	process, we don't mix the superproject's and the submodules' files in
239	the same parallel checkout process or queue.
240
241	The API
242	-------
243
244	The parallel checkout API was designed with the goal of minimizing
245	changes to the current users of the checkout machinery. This means that
246	they don't have to call a different function for sequential or parallel
247	checkout. As already mentioned, `checkout_entry()` will automatically
248	insert the given entry in the parallel checkout queue when this feature
249	is enabled and the entry is eligible; otherwise, it will just write the
250	entry right away, using the sequential code. In general, callers of the
251	parallel checkout API should look similar to this:
252
253	----------------------------------------------
254	int pc_workers, pc_threshold, err = 0;
255	struct checkout state;
256
257	get_parallel_checkout_configs(&pc_workers, &pc_threshold);
258
259	/*
260	* This check is not strictly required, but it
261	* should save some time in sequential mode.
262	*/
263	if (pc_workers > 1)
264	init_parallel_checkout();
265
266	for (each cache_entry ce to-be-updated)
267	err \|= checkout_entry(ce, &state, NULL, NULL);
268
269	err \|= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
270	----------------------------------------------