docs/ROOT_STORAGE_DAEMONS.md

---
title: Storage Daemons for the Root File System
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# systemd and Storage Daemons for the Root File System

a.k.a. _Pax Cellae pro Radix Arbor_

(or something like that, my Latin is a bit rusty)

A number of complex storage technologies on Linux (e.g. RAID, volume
management, networked storage) require user space services to run while the
storage is active and mountable. This requirement becomes tricky as soon as the
root file system of the Linux operating system is stored on such storage
technology. Previously, no clear path to make this work was available. This text
tries to clear up the resulting confusion, and to clarify what is now supported
and what is not.

## A Bit of Background

When complex storage technologies are used as backing for the root file system,
this needs to be set up by the initrd, i.e. on Fedora by Dracut. In newer
systemd versions tear-down of the root file system backing is also done by the
initrd: after terminating all remaining running processes and unmounting all
file systems it can (which means excluding the root file system), systemd will
jump back into the initrd code, allowing it to unmount the final file systems
(and their storage backing) that could not be unmounted as long as the OS was
still running from the main root file system. The job of the initrd is then to
detach/unmount the root file system, i.e. to invert the exact commands it used
to set it up in the first place. This is not only cleaner, but also allows
for the first time arbitrarily complex stacks of storage technology.

Previous attempts to handle root file system setups with complex storage as
backing usually tried to maintain the root storage with program code stored on
the root storage itself, thus creating a number of dependency loops. Safely
detaching such a root file system becomes messy, since the program code on the
storage needs to stay around longer than the storage, which is technically
contradictory.

## What's new?

As a result, we hereby clarify that we do not support storage technology setups
where the storage daemons are being run from the storage they maintain
themselves. In other words: a storage daemon backing the root file system cannot
be stored on the root file system itself.

What we do support instead is that these storage daemons are started from the
initrd, stay running all the time during normal operation, and are terminated
only after control has been returned to the initrd, and then by the initrd
itself. As such, storage daemons involved with maintaining the root file system
storage conceptually are more like kernel threads than like normal system
services: from the perspective of the init system (i.e. systemd), these services
have been started before systemd was initialized and stay around until after
systemd is already gone. These daemons can only be updated by updating the
initrd and rebooting; a takeover from initrd-supplied services to replacements
from the root file system is not supported.

## What does this mean?

Near the end of system shutdown, systemd executes a small tool called
systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as
it entirely replaces the systemd init process) then iterates through the
mounted file systems and running processes (as well as a couple of other
resources) and tries to unmount/read-only mount/detach/kill them. It continues
to do this in a tight loop as long as this results in any effect. From this
killing spree a couple of processes are automatically excluded: PID 1 itself of
course, as well as all kernel threads. After the killing/unmounting spree
control is passed back to the initrd, whose job is then to unmount/detach
whatever might be remaining.

The same killing spree logic (but not the unmount/detach/read-only logic) is
applied during the transition from the initrd to the main system (i.e. the
"`switch_root`" operation), so that no processes from the initrd survive to the
main system.

To implement the supported logic proposed above (i.e. where storage daemons
needed for the root file system which are started by the initrd stay around
during normal operation and are only killed after control is passed back to the
initrd), we need to exclude these daemons from the shutdown/switch_root killing
spree. To accomplish this, the following logic is available starting with
systemd 38:

Processes (run by the root user) whose first character of the zeroth command
line argument is `@` are excluded from the killing spree, much the same way as
kernel threads are excluded. Thus, a daemon which wants to take advantage
of this logic needs to place the following at the top of its `main()` function:

```c
/* ... */
argv[0][0] = '@';
/* ... */
```

And that's already it. Note that this functionality is only to be used by
programs running from the initrd, and **not** for programs running from the
root file system itself. Programs which use this functionality and are running
from the root file system are considered buggy since they effectively prohibit
clean unmounting/detaching of the root file system and its backing storage.

_Again: if your code is being run from the root file system, then this logic
suggested above is **NOT** for you. Sorry. Talk to us, we can probably help you
to find a different solution to your problem._

The recommended way to distinguish between run-from-initrd and run-from-rootfs
for a daemon is to check for `/etc/initrd-release` (which exists on all modern
initrd implementations, see the [initrd Interface](INITRD_INTERFACE.md) for
details). If the file exists, set `argv[0][0]` to `@`; otherwise leave
`argv[0]` alone. Something like this:

```c
#include <unistd.h>

int main(int argc, char *argv[]) {
        /* ... */
        /* Only mark the process when it is running from the initrd. */
        if (access("/etc/initrd-release", F_OK) >= 0)
                argv[0][0] = '@';
        /* ... */
}
```

Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without
precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify
they are login shells. This logic is also very easy to implement. We have been
looking for other ways to mark processes for exclusion from the killing spree,
but could not find any that was equally simple to implement and quick to read
when traversing through `/proc/`. As a side effect, replacing the first
character of `argv[0]` with `@` also visually invalidates the path normally
stored in `argv[0]` (which usually starts with `/`), thus helping the
administrator to understand that your daemon is actually not originating from
the actual root file system, but from a path in a completely different
namespace (i.e. the initrd namespace). Other than that we just think that `@`
is a cool character which looks pretty in the ps output... 😎

Note that your code should only modify `argv[0][0]` and leave the comm name
(i.e. `/proc/self/comm`) of your process untouched.

Since systemd v255, the `SurviveFinalKillSignal=yes` unit option can be set as
an alternative; it provides functionality equivalent to modifying `argv[0][0]`.
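To illustrate why the `argv[0][0]` marker is quick to read when traversing
`/proc/`, here is a minimal sketch of how a tool might test a single process
for it. This is not systemd's actual code: the function names
`cmdline_is_marked()` and `pid_is_marked()` are hypothetical, and the sketch
deliberately leaves out the check that the process is run by the root user.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Returns true if a raw /proc/<pid>/cmdline buffer begins with '@'.
 * An empty buffer is never considered marked. */
static bool cmdline_is_marked(const char *cmdline, size_t len) {
        return len > 0 && cmdline[0] == '@';
}

/* Hypothetical helper: looks at the first byte of /proc/<pid>/cmdline.
 * Returns false if the file cannot be opened or is empty. */
static bool pid_is_marked(pid_t pid) {
        char path[64];
        char first = '\0';
        size_t n;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);

        f = fopen(path, "r");
        if (!f)
                return false;

        /* Only the very first byte matters, so read exactly one. */
        n = fread(&first, 1, 1, f);
        fclose(f);

        return cmdline_is_marked(&first, n);
}
```

A single one-byte read per process is all that is needed, which is what makes
this marker attractive compared to more elaborate schemes.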

## To which technologies does this apply?

These recommendations apply to those storage daemons which need to stay around
until after the storage they maintain is unmounted. If your storage daemon is
fine with being shut down before its storage device is unmounted, you may ignore
the recommendations above.

This all applies to storage technology only, not to daemons with any other
(non-storage related) purposes.

## What else to keep in mind?

If your daemon implements the logic pointed out above, it should work nicely
from initrd environments. In many cases it might additionally be necessary to
support starting storage daemons from within the actual OS, for example
when complex storage setups are used for auxiliary file systems, i.e. not the
root file system, or created by the administrator during runtime. Here are a
few additional notes for supporting these setups:

* If your storage daemon is run from the main OS (i.e. not the initrd) it will
  also be terminated when the OS shuts down (i.e. before we pass control back
  to the initrd). Your daemon needs to handle this properly.

* It is not acceptable to spawn off background processes transparently from
  user commands or udev rules. Whenever a process is forked off on Unix it
  inherits a multitude of process attributes (ranging from the obvious to the
  not-so-obvious such as security contexts or audit trails) from its parent
  process. It is practically impossible to fully detach a service from the
  process context of the spawning process. In particular, systemd tracks which
  processes belong to a service or login session very closely, and by spawning
  off your storage daemon from udev or an administrator command you thus make
  it part of its service/login session. Effectively this means that whenever
  udev is shut down, your storage daemon is killed too, and whenever the login
  session goes away your storage daemon might be terminated as well. (Also note
  that recent udev versions will automatically kill all long running background
  processes forked off udev rules now.) So, in summary: double-forking off
  processes from user commands or udev rules is **NOT** OK!

* To automatically spawn storage daemons from udev rules or administrator
  commands, the recommended technology is socket-based activation as
  implemented by systemd. Transparently to your client code, connecting to the
  socket of your storage daemon will result in the storage daemon being
  started. For that it is simply necessary to inform systemd about the socket
  you'd like it to listen on on behalf of your daemon, and to modify the daemon
  to receive the listening socket for its services from systemd instead of
  creating it on its own. Such modifications can be minimal, and are easily
  written in a way that does not negatively impact usability on non-systemd
  systems. For more information on making use of socket activation in your
  program consult this blog story: [Socket
  Activation](https://0pointer.de/blog/projects/socket-activation.html)

* Consider having a look at the [initrd Interface of systemd](INITRD_INTERFACE.md).
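As a rough illustration of the socket activation point above, the following
sketch shows how a daemon might detect whether it was handed listening sockets.
It relies on the `LISTEN_PID` and `LISTEN_FDS` environment variables, with
passed file descriptors starting at 3. The helper names are hypothetical; real
daemons would more commonly call `sd_listen_fds()` from libsystemd, which this
dependency-free sketch only approximates (for example, it does not set
`FD_CLOEXEC` on the passed descriptors).

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* First passed file descriptor, as fixed by the socket activation protocol. */
#define LISTEN_FDS_START 3

/* Parses a non-negative decimal number; returns -1 on malformed input. */
static long parse_number(const char *s) {
        char *end;
        long v;

        errno = 0;
        v = strtol(s, &end, 10);
        if (errno != 0 || end == s || *end != '\0' || v < 0)
                return -1;
        return v;
}

/* Hypothetical helper: returns the number of passed sockets, 0 if the
 * process was not socket activated, or -1 if the values are malformed. */
static int parse_listen_env(const char *pid_str, const char *fds_str, pid_t self) {
        long pid, n;

        if (!pid_str || !fds_str)
                return 0;       /* not socket activated */

        pid = parse_number(pid_str);
        if (pid < 0)
                return -1;
        if (pid != (long) self)
                return 0;       /* the sockets were meant for another process */

        n = parse_number(fds_str);
        if (n < 0 || n > INT_MAX - LISTEN_FDS_START)
                return -1;
        return (int) n;
}

/* Convenience wrapper reading the values from the real environment. */
static int count_passed_sockets(void) {
        return parse_listen_env(getenv("LISTEN_PID"), getenv("LISTEN_FDS"), getpid());
}
```

A daemon could call `count_passed_sockets()` early in `main()`: if the result
is positive, it uses file descriptor `LISTEN_FDS_START` as its listening
socket, and otherwise it creates the socket itself as before. Keeping that
fallback is what lets the same binary work on non-systemd systems.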