Parallel Checkout Design Notes
==============================

The "Parallel Checkout" feature attempts to use multiple processes to
parallelize the work of uncompressing the blobs, applying in-core
filters, and writing the resulting contents to the working tree during a
checkout operation. It can be used by all checkout-related commands,
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.

These commands share the following basic structure:

* Step 1: Read the current index file into memory.

* Step 2: Modify the in-memory index based upon the command, and
  temporarily mark all cache entries that need to be updated.

* Step 3: Populate the working tree to match the new candidate index.
  This includes iterating over all of the to-be-updated cache entries
  and deleting, creating, or overwriting the associated files in the
  working tree.

* Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.

Sequential Implementation
-------------------------

For the purposes of discussion here, the current sequential
implementation of Step 3 is divided into three parts, each implemented
in its own function and sketched in pseudo-C below:

* Step 3a: `unpack-trees.c:check_updates()` contains a series of
  sequential loops iterating over the `cache_entry`'s array. The main
  loop in this function calls the Step 3b function for each of the
  to-be-updated entries.

* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
  for file conflicts, collisions, and unsaved changes. It removes files
  and creates leading directories as necessary. It calls the Step 3c
  function for each entry to be written.

* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
  it if necessary, creates the file in the working tree, writes the
  smudged contents, calls `fstat()` or `lstat()`, and updates the
  associated `cache_entry` struct with the stat information gathered.

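In rough pseudo-C (in the same style as the API example at the end of
this document), these three parts have the following shape. This is a
simplified sketch; the real functions take more parameters and handle
many more cases:

----------------------------------------------
/* Step 3a: unpack-trees.c:check_updates(), simplified */
for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

/* Step 3b: entry.c:checkout_entry(), simplified */
check the working tree for conflicts, collisions, and unsaved changes;
remove blocking files and create leading directories as needed;
return write_entry(ce, path, ...);

/* Step 3c: entry.c:write_entry(), simplified */
load and smudge the blob;
create the file and write the smudged contents;
fstat() or lstat() the result and update ce's cached stat data;
----------------------------------------------
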
It wouldn't be safe to perform Step 3b in parallel, as there could be
race conditions between file creations and removals. Instead, the
parallel checkout framework lets the sequential code handle Step 3b,
and uses parallel workers to replace the sequential
`entry.c:write_entry()` calls from Step 3c.

Rejected Multi-Threaded Solution
--------------------------------

The most "straightforward" implementation would be to spread the set of
to-be-updated cache entries across multiple threads. But due to the
thread-unsafe functions in the object database code, we would have to
use locks to coordinate the parallel operation. An early prototype of
this solution showed that the multi-threaded checkout would bring
performance improvements over the sequential code, but there was still
too much lock contention. Profiling with `perf` indicated that around
20% of the runtime during a local Linux clone (on an SSD) was spent in
locking functions. For this reason, this approach was rejected in favor
of using multiple child processes, which led to better performance.

Multi-Process Solution
----------------------

Parallel checkout alters the aforementioned Step 3 to use multiple
`checkout--worker` background processes to distribute the work. The
long-running worker processes are controlled by the foreground Git
command using the existing run-command API.

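Each worker is spawned roughly as follows; this is a simplified sketch
of the setup code (the actual code, in `parallel-checkout.c`, also
handles the communication with the worker over the two pipes):

----------------------------------------------
struct child_process cp = CHILD_PROCESS_INIT;

cp.git_cmd = 1;
cp.in = -1;	/* pipe to send the queued items to the worker */
cp.out = -1;	/* pipe to receive the results back */
strvec_push(&cp.args, "checkout--worker");

if (start_command(&cp))
	die("failed to spawn checkout worker");
----------------------------------------------
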
Overview
~~~~~~~~

Step 3b is only slightly altered; for each entry to be checked out, the
main process performs the following steps:

* M1: Check whether there is any untracked or unclean file in the
  working tree which would be overwritten by this entry, and decide
  whether to proceed (removing the file(s)) or not.

* M2: Create the leading directories.

* M3: Load the conversion attributes for the entry's path.

* M4: Check, based on the entry's type and conversion attributes,
  whether the entry is eligible for parallel checkout (more on this
  later). If it is eligible, enqueue the entry and the loaded
  attributes to later write the entry in parallel. If not, write the
  entry right away, using the default sequential code. (This decision
  is sketched in pseudo-C after the note below.)

Note: we save the conversion attributes associated with each entry
because the workers don't have access to the main process' index state,
so they can't load the attributes by themselves (and the attributes are
needed to properly smudge the entry). Additionally, this has a positive
impact on performance as (1) we don't need to load the attributes twice
and (2) the attributes machinery is optimized to handle paths in
sequential order.

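In pseudo-C, the M3/M4 decision inside `checkout_entry()` has roughly
this shape (a simplified sketch; the real calls take more arguments
than shown here):

----------------------------------------------
convert_attrs(istate, &ca, ce->name);	/* M3: load conversion attrs */

/* M4: enqueue the entry if eligible, else write it right away */
if (parallel checkout is enabled &&
    ce is eligible, given its type and ca)
	return enqueue_checkout(ce, &ca, ...);

return write_entry(ce, path, ...);	/* default sequential code */
----------------------------------------------
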
After all entries have passed through the above steps, the main process
checks if the number of enqueued entries is sufficient to spread among
the workers. If not, it just writes them sequentially. Otherwise, it
spawns the workers and distributes the queued entries uniformly in
contiguous chunks. This aims to minimize the chances of two workers
writing to the same directory simultaneously, which could increase lock
contention in the kernel.

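The contiguous distribution itself is simple arithmetic. A minimal
sketch (the variable names are illustrative):

----------------------------------------------
/* Distribute N queued entries among W workers in contiguous chunks. */
size_t base = N / W, remainder = N % W, next = 0;

for (size_t i = 0; i < W; i++) {
	/* the first 'remainder' workers get one extra entry each */
	size_t chunk = base + (i < remainder ? 1 : 0);
	assign entries [next, next + chunk) to worker i;
	next += chunk;
}
----------------------------------------------

For example, 10 entries and 3 workers yield the chunks [0,4), [4,7),
and [7,10).
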
Then, for each assigned item, each worker performs the following steps,
which are sketched in pseudo-C after the notes below:

* W1: Checks if there is any non-directory file in the leading part of
  the entry's path or if there already exists a file at the entry's
  path. If so, it marks the entry with `PC_ITEM_COLLIDED` and skips it
  (more on this later).

* W2: Creates the file (with `O_CREAT` and `O_EXCL`).

* W3: Loads the blob into memory (inflating and delta reconstructing
  it).

* W4: Applies any required in-process filter, like end-of-line
  conversion and re-encoding.

* W5: Writes the result to the file descriptor opened at W2.

* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
  the result back to the main process, together with the end status of
  the operation and the item's identification number.

Note that, when possible, steps W3 to W5 are delegated to the streaming
machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree,
it removes the created file to avoid leaving empty files behind. This is
the *only* time a worker is allowed to remove a file.

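The per-item flow in a worker can be sketched like this (simplified;
`path_collision()` is sketched in the Path Collisions section below,
and the status codes mirror the `PC_ITEM_*` flags used by the real
code):

----------------------------------------------
if (path_collision(path)) {		/* W1 */
	item->status = PC_ITEM_COLLIDED;
	continue;	/* skip; the main process retries it later */
}

fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);	/* W2 */

if (loading, filtering, or writing the blob fails) {	/* W3 to W5 */
	close(fd);
	unlink(path);	/* don't leave an empty file behind */
	item->status = PC_ITEM_FAILED;
	continue;
}

fstat(fd, &item->st);	/* W6: stat data to send back */
item->status = PC_ITEM_WRITTEN;
----------------------------------------------
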
As mentioned earlier, it is the responsibility of the main process to
remove any file that blocks the checkout operation (or abort if the
removal(s) would cause data loss and the user didn't ask to `--force`).
This is crucial to avoid race conditions and also to properly detect
path collisions at Step W1.

After the workers finish writing the items and sending back the required
information, the main process handles the results in two steps, sketched
below:

- First, it updates the in-memory index with the `lstat()` information
  sent by the workers. (This must be done first as this information
  might be required in the following step.)

- Then it writes the items which collided on disk (i.e. items marked
  with `PC_ITEM_COLLIDED`). More on this below.

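A sketch of this result handling (the loop wording is illustrative; the
real logic lives in `parallel-checkout.c`):

----------------------------------------------
/* First pass: record the stat data sent back by the workers. */
for (each parallel checkout item)
	if (item->status == PC_ITEM_WRITTEN)
		update item->ce with the lstat() data from the worker;

/* Second pass: sequentially retry the entries that collided. */
for (each parallel checkout item)
	if (item->status == PC_ITEM_COLLIDED)
		err |= checkout_entry(item->ce, &state, NULL, NULL);
----------------------------------------------
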
Path Collisions
---------------

Path collisions happen when two different paths correspond to the same
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
case-insensitive file system.

The sequential checkout deals with collisions in the same way that it
deals with files that were already present in the working tree before
checkout. Basically, it checks if the path that it wants to write
already exists on disk, makes sure the existing file doesn't have
unsaved data, and then overwrites it. (To be more pedantic: it deletes
the existing file and creates the new one.) So, if there are multiple
colliding files to be checked out, the sequential code will write each
one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we
cannot let the workers racily write to the same file on disk. Instead,
the workers detect when the entry that they want to check out would
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
Later, the main process can sequentially feed these entries back to
`checkout_entry()` without the risk of race conditions. On clone, this
also has the effect of marking the colliding entries to later emit a
warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being
concurrently written and collisions between a parallel-eligible entry
and an ineligible entry. The general idea for collision detection is
quite straightforward: for each parallel-eligible entry, the main
process must remove all files that prevent this entry from being written
(before enqueueing it). This includes any non-directory file in the
leading path of the entry. Later, when a worker gets assigned the entry,
it looks again for the non-directory files and for an already existing
file at the entry's path. If any of these checks finds something, the
worker knows that there was a path collision.

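In pseudo-C, the worker-side check looks roughly like this (a
simplified sketch; the helper name is illustrative):

----------------------------------------------
static int path_collision(const char *path)
{
	struct stat st;

	/*
	 * The main process cleared the way before enqueueing the
	 * entry, so a non-directory in the leading path can only mean
	 * that a colliding entry re-created it in the meantime.
	 */
	if (a leading component of path exists and is not a directory)
		return 1;

	/* Likewise for a file that already exists at the path itself. */
	if (lstat(path, &st) == 0)
		return 1;

	return 0;
}
----------------------------------------------
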
Because parallel checkout can distinguish path collisions from the case
where the file was already present in the working tree before checkout,
we could alternatively choose to skip the checkout of colliding entries.
However, each entry that doesn't get written would have null `lstat()`
fields on the index. This could cause performance penalties for
subsequent commands that need to refresh the index, as they would have
to go to the file system to see if the entry is dirty. Thus, if we have
N entries in a colliding group and we decide to write and `lstat()` only
one of them, every subsequent `git-status` will have to read, convert,
and hash the written file N - 1 times. By checking out all colliding
entries (like the sequential code does), we only pay the overhead once,
during checkout.

Eligible Entries for Parallel Checkout
--------------------------------------

As previously mentioned, not all entries passed to `checkout_entry()`
will be considered eligible for parallel checkout. More specifically, we
exclude:

- Symbolic links; to avoid race conditions that, in combination with
  path collisions, could cause workers to write files at the wrong
  place. For example, if we were to concurrently check out a symlink
  'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
  we could potentially end up writing the file 'A/f' at 'a/f', due to a
  race condition.

- Regular files that require external filters (either "one shot" filters
  or long-running process filters). These filters are black boxes to Git
  and may have their own internal locking or non-concurrent assumptions.
  So it might not be safe to run multiple instances in parallel.
+
Besides, long-running filters may use the delayed checkout feature to
postpone the return of some filtered blobs. The delayed checkout queue
and the parallel checkout queue are not compatible and should remain
separate.
+
Note: regular files that only require internal filters, like end-of-line
conversion and re-encoding, are eligible for parallel checkout. A sketch
of the resulting eligibility test follows.

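This simplified pseudo-C illustrates the test (the real code in
`parallel-checkout.c` also distinguishes entries that can use the
streaming machinery):

----------------------------------------------
static int is_eligible_for_parallel_checkout(const struct cache_entry *ce,
					     const struct conv_attrs *ca)
{
	if (!S_ISREG(ce->ce_mode))
		return 0;	/* symlinks, gitlinks, etc. */

	if (ca requires an external "one shot" or long-running filter)
		return 0;	/* black boxes; possibly not parallel-safe */

	return 1;	/* no filters, or internal-only filters */
}
----------------------------------------------
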
Ineligible entries are checked out by the classic sequential codepath
*before* spawning workers.

Note: submodules' files are also eligible for parallel checkout (as
long as they don't fall into any of the excluding categories mentioned
above). But since each submodule is checked out in its own child
process, we don't mix the superproject's and the submodules' files in
the same parallel checkout process or queue.

The API
-------

The parallel checkout API was designed with the goal of minimizing
changes to the current users of the checkout machinery. This means that
they don't have to call a different function for sequential or parallel
checkout. As already mentioned, `checkout_entry()` will automatically
insert the given entry in the parallel checkout queue when this feature
is enabled and the entry is eligible; otherwise, it will just write the
entry right away, using the sequential code. In general, callers of the
parallel checkout API should look similar to this:

----------------------------------------------
int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
	init_parallel_checkout();

for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
----------------------------------------------
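
The two trailing `NULL` arguments of `run_parallel_checkout()` are an
optional progress meter and its associated counter. As described in the
Overview, if fewer than `pc_threshold` entries were enqueued, the call
falls back to writing the queued entries sequentially instead of
spawning workers.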