---
title: Storage Daemons for the Root File System
category: Interfaces
layout: default
---

# systemd and Storage Daemons for the Root File System

a.k.a. _Pax Cellae pro Radix Arbor_

(or something like that, my Latin is a bit rusty)

A number of complex storage technologies on Linux (e.g. RAID, volume
management, networked storage) require user space services to run while the
storage is active and mountable. This requirement becomes tricky as soon as the
root file system of the Linux operating system is stored on such storage
technology. Previously no clear path to make this work was available. This text
tries to clear up the resulting confusion and to spell out what is now
supported and what is not.

## A Bit of Background

When complex storage technologies are used as backing for the root file system,
they need to be set up by the initial RAM file system (initrd), i.e. on Fedora
by Dracut. In newer systemd versions, tear-down of the root file system backing
is also done by the initrd: after terminating all remaining running processes
and unmounting all file systems it can (which excludes the root fs), systemd
jumps back into the initrd code, allowing it to unmount the final file systems
(and their storage backing) that could not be unmounted as long as the OS was
still running from the main root file system. The initrd's job is to
detach/unmount the root fs, i.e. to invert the exact commands it used to set it
up in the first place. This is not only cleaner, but also allows, for the first
time, arbitrarily complex stacks of storage technology.

Previous attempts to handle root file system setups with complex storage as
backing usually tried to maintain the root storage with program code stored on
the root storage itself, thus creating a number of dependency loops. Safely
detaching such a root file system becomes messy, since the program code on the
storage needs to stay around longer than the storage, which is a contradiction.

## What's new?

As a result, we hereby clarify that we do not support storage technology setups
where the storage daemons are run from the storage they themselves maintain. In
other words: a storage daemon backing the root file system cannot be stored on
the root file system itself.

What we do support instead is that these storage daemons are started from the
initrd, stay running all the time during normal operation, and are terminated
only after we have returned control to the initrd, and then by the initrd. As
such, storage daemons involved with maintaining the root file system storage
are conceptually more like kernel threads than like normal system services:
from the perspective of the init system (i.e. systemd) these services were
started before systemd got initialized and stay around until after systemd is
already gone. These daemons can only be updated by updating the initrd and
rebooting; a takeover from initrd-supplied services to replacements from the
root file system is not supported.

## What does this mean?

Near the end of system shutdown, systemd executes a small tool called
systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as
it entirely replaces the systemd init process) then iterates through the
mounted file systems and running processes (as well as a couple of other
resources) and tries to unmount/read-only mount/detach/kill them. It continues
to do this in a tight loop for as long as a pass still has any effect. A couple
of processes are automatically excluded from this killing spree: PID 1 itself
of course, as well as all kernel threads. After the killing/unmounting spree,
control is passed back to the initrd, whose job is then to unmount/detach
whatever might remain.
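
The "repeat as long as it has any effect" idea can be sketched as a simple
fixed-point loop. This is only an illustration of the control flow, not the
actual systemd-shutdown code: the `teardown_pass_fn` type, the
`teardown_until_stable()` function, the `max_rounds` safety limit and the
`release_one()` stand-in pass are all made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: one tear-down pass over a class of resources.
 * It returns true if it changed anything (e.g. unmounted a file system). */
typedef bool (*teardown_pass_fn)(void *userdata);

/* Runs all passes repeatedly until one full round changes nothing, or until
 * max_rounds is reached. Returns the number of rounds executed. */
unsigned teardown_until_stable(const teardown_pass_fn *passes, size_t n_passes,
                               void *userdata, unsigned max_rounds) {
        unsigned rounds = 0;
        bool changed = true;

        while (changed && rounds < max_rounds) {
                changed = false;
                for (size_t i = 0; i < n_passes; i++)
                        if (passes[i](userdata))
                                changed = true;
                rounds++;
        }

        return rounds;
}

/* Illustrative stand-in for a real pass: "releases" one item per call from a
 * counter passed in via userdata. */
bool release_one(void *userdata) {
        int *remaining = userdata;

        if (*remaining <= 0)
                return false;

        (*remaining)--;
        return true;
}
```

The loop always needs one final round in which nothing changes before it can
conclude that it is done. Whatever is still left over at that point is what
gets handed to the initrd.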

The same killing spree logic (but not the unmount/detach/read-only logic) is
applied during the transition from the initrd to the main system (i.e. the
"`switch_root`" operation), so that no processes from the initrd survive into
the main system.

To implement the supported logic described above (i.e. where storage daemons
needed for the root fs are started by the initrd, stay around during normal
operation and are only killed after control is passed back to the initrd) we
need to exclude these daemons from the shutdown/switch_root killing spree. To
accomplish this, the following logic is available starting with systemd 38:

Processes (run by the root user) whose zeroth command line argument starts with
`@` are excluded from the killing spree, in much the same way as kernel threads
are. Thus, a daemon which wants to take advantage of this logic needs to place
the following at the top of its `main()` function:

```c
...
argv[0][0] = '@';
...
```

And that's already it. Note that this functionality is only to be used by
programs running from the initrd, and **not** by programs running from the
root file system itself. Programs which use this functionality and are running
from the root file system are considered buggy since they effectively prohibit
clean unmounting/detaching of the root file system and its backing storage.

_Again: if your code is being run from the root file system, then the logic
suggested above is **NOT** for you. Sorry. Talk to us, we can probably help you
to find a different solution to your problem._

The recommended way for a daemon to distinguish between run-from-initrd and
run-from-rootfs is to check for `/etc/initrd-release` (which exists on all
modern initrd implementations, see the [initrd
Interface](http://www.freedesktop.org/wiki/Software/systemd/InitrdInterface)
for details): if the file exists, set `argv[0][0]` to `@`, otherwise leave it
alone. Something like this:

```c
#include <unistd.h>

int main(int argc, char *argv[]) {
        ...
        /* Only mark the process when it is running from the initrd */
        if (access("/etc/initrd-release", F_OK) >= 0)
                argv[0][0] = '@';
        ...
}
```

Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without
precedent: traditionally Unix login shells set `argv[0][0]` to `-` to indicate
that they are login shells. This logic is also very easy to implement. We have
been looking for other ways to mark processes for exclusion from the killing
spree, but could not find any that was equally simple to implement and quick to
read when traversing through `/proc/`. Also, as a side effect, replacing the
first character of `argv[0]` with `@` visually invalidates the path normally
stored in `argv[0]` (which usually starts with `/`), thus helping the
administrator to understand that your daemon does not originate from the actual
root file system, but from a path in a completely different namespace (i.e. the
initrd namespace). Other than that we just think that `@` is a cool character
which looks pretty in the ps output... 😎

Note that your code should only modify `argv[0][0]` and leave the comm name
(i.e. `/proc/self/comm`) of your process untouched.
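
To show why this marker is cheap to read while traversing `/proc/`, here is a
rough sketch of the check as seen from the scanning side. It is a simplified
illustration and not the code systemd actually uses: the function names
`cmdline_marks_exclusion()` and `pid_is_marked()` are invented for this
example, and the check that the process is run by the root user is left out.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Pure check on the raw contents of a cmdline file. argv[0] is the first
 * NUL-terminated string in it, so only the very first byte matters. An empty
 * cmdline (as shown by kernel threads) never counts as marked here. */
bool cmdline_marks_exclusion(const char *cmdline, size_t len) {
        return len > 0 && cmdline[0] == '@';
}

/* Reads the first byte of /proc/<pid>/cmdline and applies the check above.
 * Returns false if the file cannot be opened or read. */
bool pid_is_marked(pid_t pid) {
        char path[64], first = 0;
        size_t n;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);

        f = fopen(path, "r");
        if (!f)
                return false;

        n = fread(&first, 1, 1, f);
        fclose(f);

        return cmdline_marks_exclusion(&first, n);
}
```

A single one-byte read per process is enough, which is about as cheap as a
per-process check can get.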

## To which technologies does this apply?

These recommendations apply to those storage daemons which need to stay around
until after the storage they maintain is unmounted. If your storage daemon is
fine with being shut down before its storage device is unmounted, you may
ignore the recommendations above.

This all applies to storage technology only, not to daemons with any other
(non-storage related) purposes.

## What else to keep in mind?

If your daemon implements the logic pointed out above it should work nicely
from initrd environments. In many cases it might additionally be necessary to
support starting storage daemons from within the actual OS, for example when
complex storage setups are used for auxiliary file systems, i.e. not the root
file system, or are created by the administrator during runtime. Here are a
few additional notes for supporting these setups:

* If your storage daemon is run from the main OS (i.e. not the initrd) it will
  also be terminated when the OS shuts down (i.e. before we pass control back
  to the initrd). Your daemon needs to handle this properly.

* It is not acceptable to spawn off background processes transparently from
  user commands or udev rules. Whenever a process is forked off on Unix it
  inherits a multitude of process attributes (ranging from the obvious to the
  not-so-obvious, such as security contexts or audit trails) from its parent
  process. It is practically impossible to fully detach a service from the
  process context of the spawning process. In particular, systemd tracks very
  closely which processes belong to a service or login session, and by spawning
  off your storage daemon from udev or an administrator command you make it
  part of that service or login session. Effectively this means that whenever
  udev is shut down, your storage daemon is killed too, and likewise whenever
  the login session goes away your storage daemon might be terminated as well.
  (Also note that recent udev versions automatically kill all long-running
  background processes forked off from udev rules.) So, in summary:
  double-forking off processes from user commands or udev rules is **NOT** OK!

* To automatically spawn storage daemons from udev rules or administrator
  commands, the recommended technology is socket-based activation as
  implemented by systemd. Transparently for your client code, connecting to the
  socket of your storage daemon will result in the storage daemon being
  started. For that it is simply necessary to inform systemd about the socket
  you'd like it to listen on for your daemon, and to minimally modify the
  daemon to receive the listening socket for its services from systemd instead
  of creating it on its own. Such modifications can be minimal, and are easily
  written in a way that does not negatively impact usability on non-systemd
  systems. For more information on making use of socket activation in your
  program consult this blog story: [Socket
  Activation](http://0pointer.de/blog/projects/socket-activation.html)

* Consider having a look at the [initrd Interface of systemd](https://systemd.io/INITRD_INTERFACE/).
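
To give an idea of how small the daemon-side change for socket activation can
be, here is a minimal sketch of receiving the listening socket. It assumes
the environment variable protocol documented for `sd_listen_fds()`
(`$LISTEN_PID`, `$LISTEN_FDS`, passed descriptors starting at 3). Real code
would normally just call `sd_listen_fds()` from libsystemd; the function name
`passed_listen_fds()` and the `LISTEN_FDS_START` macro are made up for this
example.

```c
#include <limits.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* First passed descriptor, following the sd_listen_fds() convention */
#define LISTEN_FDS_START 3

/* Returns the number of sockets passed in by the service manager, or 0 if
 * none were passed to this process. In the latter case the daemon creates
 * its listening socket itself, as it would on a non-systemd system. */
int passed_listen_fds(void) {
        const char *e;
        char *end;
        long v;

        /* The descriptors are only meant for us if LISTEN_PID is our PID */
        e = getenv("LISTEN_PID");
        if (!e || *e == '\0')
                return 0;
        v = strtol(e, &end, 10);
        if (*end != '\0' || v <= 0 || (pid_t) v != getpid())
                return 0;

        e = getenv("LISTEN_FDS");
        if (!e || *e == '\0')
                return 0;
        v = strtol(e, &end, 10);
        if (*end != '\0' || v <= 0 || v > INT_MAX - LISTEN_FDS_START)
                return 0;

        return (int) v;
}
```

If this returns `n > 0`, the daemon would use the descriptors
`LISTEN_FDS_START` to `LISTEN_FDS_START + n - 1` as its already-listening
sockets and skip its own `socket()`/`bind()`/`listen()` calls; otherwise it
falls back to setting up the socket itself.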