]>
Commit | Line | Data |
---|---|---|
68e66f29 MT |
1 | Parallel Checkout Design Notes |
2 | ============================== | |
3 | ||
4 | The "Parallel Checkout" feature attempts to use multiple processes to | |
5 | parallelize the work of uncompressing the blobs, applying in-core | |
6 | filters, and writing the resulting contents to the working tree during a | |
7 | checkout operation. It can be used by all checkout-related commands, | |
8 | such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others. | |
9 | ||
10 | These commands share the following basic structure: | |
11 | ||
12 | * Step 1: Read the current index file into memory. | |
13 | ||
14 | * Step 2: Modify the in-memory index based upon the command, and | |
15 | temporarily mark all cache entries that need to be updated. | |
16 | ||
17 | * Step 3: Populate the working tree to match the new candidate index. | |
18 | This includes iterating over all of the to-be-updated cache entries | |
19 | and delete, create, or overwrite the associated files in the working | |
20 | tree. | |
21 | ||
22 | * Step 4: Write the new index to disk. | |
23 | ||
24 | Step 3 is the focus of the "parallel checkout" effort described here. | |
25 | ||
26 | Sequential Implementation | |
27 | ------------------------- | |
28 | ||
29 | For the purposes of discussion here, the current sequential | |
30 | implementation of Step 3 is divided in 3 parts, each one implemented in | |
31 | its own function: | |
32 | ||
33 | * Step 3a: `unpack-trees.c:check_updates()` contains a series of | |
34 | sequential loops iterating over the `cache_entry`'s array. The main | |
35 | loop in this function calls the Step 3b function for each of the | |
36 | to-be-updated entries. | |
37 | ||
38 | * Step 3b: `entry.c:checkout_entry()` examines the existing working tree | |
39 | for file conflicts, collisions, and unsaved changes. It removes files | |
40 | and creates leading directories as necessary. It calls the Step 3c | |
41 | function for each entry to be written. | |
42 | ||
43 | * Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges | |
44 | it if necessary, creates the file in the working tree, writes the | |
45 | smudged contents, calls `fstat()` or `lstat()`, and updates the | |
46 | associated `cache_entry` struct with the stat information gathered. | |
47 | ||
48 | It wouldn't be safe to perform Step 3b in parallel, as there could be | |
49 | race conditions between file creations and removals. Instead, the | |
50 | parallel checkout framework lets the sequential code handle Step 3b, | |
51 | and uses parallel workers to replace the sequential | |
52 | `entry.c:write_entry()` calls from Step 3c. | |
53 | ||
54 | Rejected Multi-Threaded Solution | |
55 | -------------------------------- | |
56 | ||
57 | The most "straightforward" implementation would be to spread the set of | |
58 | to-be-updated cache entries across multiple threads. But due to the | |
fa8e8d5b | 59 | thread-unsafe functions in the object database code, we would have to use locks to |
68e66f29 MT |
60 | coordinate the parallel operation. An early prototype of this solution |
61 | showed that the multi-threaded checkout would bring performance | |
62 | improvements over the sequential code, but there was still too much lock | |
63 | contention. A `perf` profiling indicated that around 20% of the runtime | |
64 | during a local Linux clone (on an SSD) was spent in locking functions. | |
65 | For this reason this approach was rejected in favor of using multiple | |
66 | child processes, which led to a better performance. | |
67 | ||
68 | Multi-Process Solution | |
69 | ---------------------- | |
70 | ||
71 | Parallel checkout alters the aforementioned Step 3 to use multiple | |
72 | `checkout--worker` background processes to distribute the work. The | |
73 | long-running worker processes are controlled by the foreground Git | |
74 | command using the existing run-command API. | |
75 | ||
76 | Overview | |
77 | ~~~~~~~~ | |
78 | ||
79 | Step 3b is only slightly altered; for each entry to be checked out, the | |
80 | main process performs the following steps: | |
81 | ||
82 | * M1: Check whether there is any untracked or unclean file in the | |
83 | working tree which would be overwritten by this entry, and decide | |
84 | whether to proceed (removing the file(s)) or not. | |
85 | ||
86 | * M2: Create the leading directories. | |
87 | ||
88 | * M3: Load the conversion attributes for the entry's path. | |
89 | ||
90 | * M4: Check, based on the entry's type and conversion attributes, | |
91 | whether the entry is eligible for parallel checkout (more on this | |
92 | later). If it is eligible, enqueue the entry and the loaded | |
93 | attributes to later write the entry in parallel. If not, write the | |
94 | entry right away, using the default sequential code. | |
95 | ||
96 | Note: we save the conversion attributes associated with each entry | |
97 | because the workers don't have access to the main process' index state, | |
98 | so they can't load the attributes by themselves (and the attributes are | |
99 | needed to properly smudge the entry). Additionally, this has a positive | |
100 | impact on performance as (1) we don't need to load the attributes twice | |
101 | and (2) the attributes machinery is optimized to handle paths in | |
102 | sequential order. | |
103 | ||
104 | After all entries have passed through the above steps, the main process | |
105 | checks if the number of enqueued entries is sufficient to spread among | |
106 | the workers. If not, it just writes them sequentially. Otherwise, it | |
107 | spawns the workers and distributes the queued entries uniformly in | |
108 | continuous chunks. This aims to minimize the chances of two workers | |
109 | writing to the same directory simultaneously, which could increase lock | |
110 | contention in the kernel. | |
111 | ||
112 | Then, for each assigned item, each worker: | |
113 | ||
114 | * W1: Checks if there is any non-directory file in the leading part of | |
115 | the entry's path or if there already exists a file at the entry' path. | |
116 | If so, mark the entry with `PC_ITEM_COLLIDED` and skip it (more on | |
117 | this later). | |
118 | ||
119 | * W2: Creates the file (with O_CREAT and O_EXCL). | |
120 | ||
121 | * W3: Loads the blob into memory (inflating and delta reconstructing | |
122 | it). | |
123 | ||
124 | * W4: Applies any required in-process filter, like end-of-line | |
125 | conversion and re-encoding. | |
126 | ||
127 | * W5: Writes the result to the file descriptor opened at W2. | |
128 | ||
129 | * W6: Calls `fstat()` or lstat()` on the just-written path, and sends | |
130 | the result back to the main process, together with the end status of | |
131 | the operation and the item's identification number. | |
132 | ||
133 | Note that, when possible, steps W3 to W5 are delegated to the streaming | |
134 | machinery, removing the need to keep the entire blob in memory. | |
135 | ||
136 | If the worker fails to read the blob or to write it to the working tree, | |
137 | it removes the created file to avoid leaving empty files behind. This is | |
138 | the *only* time a worker is allowed to remove a file. | |
139 | ||
140 | As mentioned earlier, it is the responsibility of the main process to | |
141 | remove any file that blocks the checkout operation (or abort if the | |
142 | removal(s) would cause data loss and the user didn't ask to `--force`). | |
143 | This is crucial to avoid race conditions and also to properly detect | |
144 | path collisions at Step W1. | |
145 | ||
146 | After the workers finish writing the items and sending back the required | |
147 | information, the main process handles the results in two steps: | |
148 | ||
149 | - First, it updates the in-memory index with the `lstat()` information | |
150 | sent by the workers. (This must be done first as this information | |
151 | might me required in the following step.) | |
152 | ||
153 | - Then it writes the items which collided on disk (i.e. items marked | |
154 | with `PC_ITEM_COLLIDED`). More on this below. | |
155 | ||
156 | Path Collisions | |
157 | --------------- | |
158 | ||
159 | Path collisions happen when two different paths correspond to the same | |
160 | entry in the file system. E.g. the paths 'a' and 'A' would collide in a | |
161 | case-insensitive file system. | |
162 | ||
163 | The sequential checkout deals with collisions in the same way that it | |
164 | deals with files that were already present in the working tree before | |
165 | checkout. Basically, it checks if the path that it wants to write | |
166 | already exists on disk, makes sure the existing file doesn't have | |
167 | unsaved data, and then overwrites it. (To be more pedantic: it deletes | |
168 | the existing file and creates the new one.) So, if there are multiple | |
169 | colliding files to be checked out, the sequential code will write each | |
170 | one of them but only the last will actually survive on disk. | |
171 | ||
172 | Parallel checkout aims to reproduce the same behavior. However, we | |
173 | cannot let the workers racily write to the same file on disk. Instead, | |
174 | the workers detect when the entry that they want to check out would | |
175 | collide with an existing file, and mark it with `PC_ITEM_COLLIDED`. | |
176 | Later, the main process can sequentially feed these entries back to | |
177 | `checkout_entry()` without the risk of race conditions. On clone, this | |
178 | also has the effect of marking the colliding entries to later emit a | |
179 | warning for the user, like the classic sequential checkout does. | |
180 | ||
181 | The workers are able to detect both collisions among the entries being | |
182 | concurrently written and collisions between a parallel-eligible entry | |
183 | and an ineligible entry. The general idea for collision detection is | |
184 | quite straightforward: for each parallel-eligible entry, the main | |
185 | process must remove all files that prevent this entry from being written | |
186 | (before enqueueing it). This includes any non-directory file in the | |
187 | leading path of the entry. Later, when a worker gets assigned the entry, | |
188 | it looks again for the non-directories files and for an already existing | |
189 | file at the entry's path. If any of these checks finds something, the | |
190 | worker knows that there was a path collision. | |
191 | ||
192 | Because parallel checkout can distinguish path collisions from the case | |
193 | where the file was already present in the working tree before checkout, | |
194 | we could alternatively choose to skip the checkout of colliding entries. | |
195 | However, each entry that doesn't get written would have NULL `lstat()` | |
196 | fields on the index. This could cause performance penalties for | |
197 | subsequent commands that need to refresh the index, as they would have | |
198 | to go to the file system to see if the entry is dirty. Thus, if we have | |
199 | N entries in a colliding group and we decide to write and `lstat()` only | |
200 | one of them, every subsequent `git-status` will have to read, convert, | |
201 | and hash the written file N - 1 times. By checking out all colliding | |
202 | entries (like the sequential code does), we only pay the overhead once, | |
203 | during checkout. | |
204 | ||
205 | Eligible Entries for Parallel Checkout | |
206 | -------------------------------------- | |
207 | ||
208 | As previously mentioned, not all entries passed to `checkout_entry()` | |
209 | will be considered eligible for parallel checkout. More specifically, we | |
210 | exclude: | |
211 | ||
212 | - Symbolic links; to avoid race conditions that, in combination with | |
213 | path collisions, could cause workers to write files at the wrong | |
214 | place. For example, if we were to concurrently check out a symlink | |
215 | 'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system, | |
216 | we could potentially end up writing the file 'A/f' at 'a/f', due to a | |
217 | race condition. | |
218 | ||
219 | - Regular files that require external filters (either "one shot" filters | |
220 | or long-running process filters). These filters are black-boxes to Git | |
221 | and may have their own internal locking or non-concurrent assumptions. | |
222 | So it might not be safe to run multiple instances in parallel. | |
223 | + | |
224 | Besides, long-running filters may use the delayed checkout feature to | |
225 | postpone the return of some filtered blobs. The delayed checkout queue | |
226 | and the parallel checkout queue are not compatible and should remain | |
227 | separate. | |
228 | + | |
229 | Note: regular files that only require internal filters, like end-of-line | |
230 | conversion and re-encoding, are eligible for parallel checkout. | |
231 | ||
232 | Ineligible entries are checked out by the classic sequential codepath | |
233 | *before* spawning workers. | |
234 | ||
235 | Note: submodules's files are also eligible for parallel checkout (as | |
236 | long as they don't fall into any of the excluding categories mentioned | |
237 | above). But since each submodule is checked out in its own child | |
238 | process, we don't mix the superproject's and the submodules' files in | |
239 | the same parallel checkout process or queue. | |
240 | ||
241 | The API | |
242 | ------- | |
243 | ||
244 | The parallel checkout API was designed with the goal of minimizing | |
245 | changes to the current users of the checkout machinery. This means that | |
246 | they don't have to call a different function for sequential or parallel | |
247 | checkout. As already mentioned, `checkout_entry()` will automatically | |
248 | insert the given entry in the parallel checkout queue when this feature | |
249 | is enabled and the entry is eligible; otherwise, it will just write the | |
250 | entry right away, using the sequential code. In general, callers of the | |
251 | parallel checkout API should look similar to this: | |
252 | ||
253 | ---------------------------------------------- | |
254 | int pc_workers, pc_threshold, err = 0; | |
255 | struct checkout state; | |
256 | ||
257 | get_parallel_checkout_configs(&pc_workers, &pc_threshold); | |
258 | ||
259 | /* | |
260 | * This check is not strictly required, but it | |
261 | * should save some time in sequential mode. | |
262 | */ | |
263 | if (pc_workers > 1) | |
264 | init_parallel_checkout(); | |
265 | ||
266 | for (each cache_entry ce to-be-updated) | |
267 | err |= checkout_entry(ce, &state, NULL, NULL); | |
268 | ||
269 | err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL); | |
270 | ---------------------------------------------- |