Commit | Line | Data |
---|---|---|
6e47cac0 LP |
1 | --- |
2 | title: Storage Daemons for the Root File System | |
3 | category: Interfaces | |
4 | layout: default | |
0aff7b75 | 5 | SPDX-License-Identifier: LGPL-2.1-or-later |
6e47cac0 LP |
6 | --- |
7 | ||
8 | # systemd and Storage Daemons for the Root File System | |
9 | ||
10 | a.k.a. _Pax Cellae pro Radix Arbor_ | |
11 | ||
12 | (or something like that, my Latin is a bit rusty) | |
13 | ||
14 | A number of complex storage technologies on Linux (e.g. RAID, volume | |
15 | management, networked storage) require user space services to run while the | |
16 | storage is active and mountable. This requirement becomes tricky as soon as the | |
17 | root file system of the Linux operating system is stored on such storage | |
18 | technology. Previously no clear path to make this work was available. This text | |
19 | tries to clear up the resulting confusion and to spell out what is now supported | |
20 | and what is not. | |
21 | ||
22 | ## A Bit of Background | |
23 | ||
24 | When complex storage technologies are used as backing for the root file system | |
25 | this needs to be set up by the initial RAM file system (initrd), i.e. on Fedora | |
26 | by Dracut. In newer systemd versions tear-down of the root file system backing | |
27 | is also done by the initrd: after terminating all remaining running processes | |
28 | and unmounting all file systems it can (which means excluding the root fs) | |
29 | systemd will jump back into the initrd code allowing it to unmount the final | |
30 | file systems (and their storage backing) that could not be unmounted as long as | |
31 | the OS was still running from the main root file system. The initrd's job is to | |
32 | detach/unmount the root fs, i.e. to invert the exact commands it used to set | |
33 | it up in the first place. This is not only cleaner, but also allows, for the | |
34 | first time, arbitrarily complex stacks of storage technology. | |
35 | ||
36 | Previous attempts to handle root file system setups with complex storage as | |
37 | backing usually tried to maintain the root storage with program code stored on | |
38 | the root storage itself, thus creating a number of dependency loops. Safely | |
39 | detaching such a root file system becomes messy, since the program code on the | |
40 | storage needs to stay around longer than the storage, which is technically | |
41 | contradictory. | |
42 | ||
43 | ||
44 | ## What's new? | |
45 | ||
46 | As a result, we hereby clarify that we do not support storage technology setups | |
47 | where the storage daemons are run from the storage they themselves maintain. | |
48 | In other words: a storage daemon backing the root file system cannot be | |
49 | stored on the root file system itself. | |
50 | ||
51 | What we do support instead is that these storage daemons are started from the | |
52 | initrd, stay running all the time during normal operation and are terminated | |
53 | only after control has been returned to the initrd, and then by the initrd. As such, | |
54 | storage daemons involved with maintaining the root file system storage | |
55 | conceptually are more like kernel threads than like normal system services: | |
56 | from the perspective of the init system (i.e. systemd) these services have been | |
57 | started before systemd got initialized and stay around until after systemd is | |
58 | already gone. These daemons can only be updated by updating the initrd and | |
59 | rebooting; a takeover from initrd-supplied services to replacements from the | |
60 | root file system is not supported. | |
61 | ||
62 | ||
63 | ## What does this mean? | |
64 | ||
65 | Near the end of system shutdown, systemd executes a small tool called | |
66 | systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as | |
67 | it entirely replaces the systemd init process) then iterates through the | |
68 | mounted file systems and running processes (as well as a couple of other | |
69 | resources) and tries to unmount/read-only mount/detach/kill them. It continues | |
70 | to do this in a tight loop as long as doing so has any effect. From this | |
71 | killing spree a couple of processes are automatically excluded: PID 1 itself of | |
72 | course, as well as all kernel threads. After the killing/unmounting spree | |
73 | control is passed back to the initrd, whose job is then to unmount/detach | |
74 | whatever might be remaining. | |
75 | ||
76 | The same killing spree logic (but not the unmount/detach/read-only logic) is | |
77 | applied during the transition from the initrd to the main system (i.e. the | |
78 | "`switch_root`" operation), so that no processes from the initrd survive to the | |
79 | main system. | |
80 | ||
81 | To implement the supported logic proposed above (i.e. where storage daemons | |
82 | needed for the root fs which are started by the initrd stay around during | |
83 | normal operation and are only killed after control is passed back to the | |
84 | initrd) we need to exclude these daemons from the shutdown/switch_root killing | |
85 | spree. To accomplish this the following logic is available starting with | |
86 | systemd 38: | |
87 | ||
88 | Processes (run by the root user) whose zeroth command line argument starts | |
89 | with `@` are excluded from the killing spree, much the same way as kernel | |
90 | threads are. Thus, a daemon which wants to take advantage | |
faec9de8 | 91 | of this logic needs to place the following at the top of its `main()` function: |
6e47cac0 LP |
92 | |
93 | ```c | |
744c49e1 | 94 | ... |
faec9de8 | 95 | argv[0][0] = '@'; |
744c49e1 | 96 | ... |
6e47cac0 LP |
97 | ``` |
98 | ||
99 | And that's already it. Note that this functionality is only to be used by | |
100 | programs running from the initrd, and **not** for programs running from the | |
101 | root file system itself. Programs which use this functionality and are running | |
102 | from the root file system are considered buggy since they effectively prohibit | |
103 | clean unmounting/detaching of the root file system and its backing storage. | |
104 | ||
105 | _Again: if your code is being run from the root file system, then this logic | |
106 | suggested above is **NOT** for you. Sorry. Talk to us, we can probably help you | |
107 | to find a different solution to your problem._ | |
108 | ||
109 | The recommended way for a daemon to distinguish between run-from-initrd and | |
110 | run-from-rootfs is to check for `/etc/initrd-release` (which exists on all modern | |
111 | initrd implementations, see the [initrd | |
1d10005b | 112 | Interface](https://systemd.io/INITRD_INTERFACE) for details): if the file |
f856778b | 113 | exists, set `argv[0][0]` to `@`, and otherwise leave it alone. Something like |
114 | this: | |
6e47cac0 LP |
115 | |
116 | ```c | |
117 | #include <unistd.h> | |
118 | ||
119 | int main(int argc, char *argv[]) { | |
744c49e1 | 120 | ... |
6e47cac0 LP |
121 | if (access("/etc/initrd-release", F_OK) >= 0) |
122 | argv[0][0] = '@'; | |
744c49e1 | 123 | ... |
6e47cac0 LP |
124 | } |
125 | ``` | |
126 | ||
127 | Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without | |
128 | precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify | |
129 | they are login shells. This logic is also very easy to implement. We have been | |
130 | looking for other ways to mark processes for exclusion from the killing spree, | |
131 | but could not find any that was equally simple to implement and quick to read | |
132 | when traversing through `/proc/`. Also, as a side effect replacing the first | |
133 | character of `argv[0]` with `@` also visually invalidates the path normally | |
134 | stored in `argv[0]` (which usually starts with `/`) thus helping the | |
135 | administrator to understand that your daemon is actually not originating from | |
136 | the actual root file system, but from a path in a completely different | |
137 | namespace (i.e. the initrd namespace). Other than that we just think that `@` | |
138 | is a cool character which looks pretty in the ps output... 😎 | |
139 | ||
140 | Note that your code should only modify `argv[0][0]` and leave the comm name | |
141 | (i.e. `/proc/self/comm`) of your process untouched. | |
142 | ||
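To illustrate why the marker is cheap to check when walking `/proc/`, here is a minimal sketch of the reading side. It is not the code systemd-shutdown actually uses; the helper names `cmdline_is_marked()` and `pid_is_marked()` are made up for this example, and the sketch leaves out the additional "run by the root user" condition described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: true if a raw command line buffer starts with '@'.
 * An empty buffer (as seen for kernel threads) never counts as marked. */
static bool cmdline_is_marked(const char *cmdline, size_t len) {
        return len > 0 && cmdline[0] == '@';
}

/* Hypothetical helper: looks only at the first byte of /proc/<pid>/cmdline.
 * Returns false if the file cannot be opened (e.g. the process is gone). */
static bool pid_is_marked(pid_t pid) {
        char path[64];
        char first = 0;
        size_t n;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);

        f = fopen(path, "r");
        if (!f)
                return false;

        /* One byte is enough: only argv[0][0] matters. */
        n = fread(&first, 1, 1, f);
        fclose(f);

        return cmdline_is_marked(&first, n);
}
```

Because only the command line is inspected, the comm name in `/proc/self/comm` plays no role here, which fits the advice to leave it untouched.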
143 | ## To which technologies does this apply? | |
144 | ||
145 | These recommendations apply to those storage daemons which need to stay around | |
146 | until after the storage they maintain is unmounted. If your storage daemon is | |
147 | fine with being shut down before its storage device is unmounted you may ignore | |
148 | the recommendations above. | |
149 | ||
150 | This all applies to storage technology only, not to daemons with any other | |
151 | (non-storage related) purposes. | |
152 | ||
153 | ## What else to keep in mind? | |
154 | ||
155 | If your daemon implements the logic pointed out above it should work nicely | |
156 | from initrd environments. In many cases it might be necessary to additionally | |
157 | support starting storage daemons from within the actual OS, for example | |
158 | when complex storage setups are used for auxiliary file systems, i.e. not the | |
159 | root file system, or created by the administrator during runtime. Here are a | |
160 | few additional notes for supporting these setups: | |
161 | ||
162 | * If your storage daemon is run from the main OS (i.e. not the initrd) it will | |
163 | also be terminated when the OS shuts down (i.e. before we pass control back | |
164 | to the initrd). Your daemon needs to handle this properly. | |
165 | ||
166 | * It is not acceptable to spawn off background processes transparently from | |
167 | user commands or udev rules. Whenever a process is forked off on Unix it | |
168 | inherits a multitude of process attributes (ranging from the obvious to the | |
169 | not-so-obvious such as security contexts or audit trails) from its parent | |
170 | process. It is practically impossible to fully detach a service from the | |
171 | process context of the spawning process. In particular, systemd tracks which | |
172 | processes belong to a service or login session very closely, and by spawning | |
173 | off your storage daemon from udev or an administrator command you thus make | |
174 | it part of that service or login session. Effectively this means that whenever udev is | |
175 | shut down, your storage daemon is killed too, and whenever the login | |
176 | session goes away your storage daemon might be terminated as well. (Also note that | |
177 | recent udev versions automatically kill all long-running background | |
178 | processes forked off from udev rules.) So, in summary: double-forking off | |
179 | processes from user commands or udev rules is **NOT** OK! | |
180 | ||
181 | * To automatically spawn storage daemons from udev rules or administrator | |
182 | commands, the recommended technology is socket-based activation as | |
183 | implemented by systemd. Transparently to your client code, connecting to the | |
184 | socket of your storage daemon will result in the storage daemon being started. For | |
185 | that it is simply necessary to inform systemd about the socket you'd like it | |
186 | to listen on on behalf of your daemon and minimally modify the daemon to | |
187 | receive the listening socket for its services from systemd instead of | |
188 | creating it on its own. Such modifications can be minimal, and are easily | |
189 | written in a way that does not negatively impact usability on non-systemd | |
190 | systems. For more information on making use of socket activation in your | |
191 | program consult this blog story: [Socket | |
192 | Activation](http://0pointer.de/blog/projects/socket-activation.html) | |
193 | ||
1d10005b | 194 | * Consider having a look at the [initrd Interface of systemd](https://systemd.io/INITRD_INTERFACE). |
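
The socket activation note above can be sketched on the daemon side as follows. This is a simplified illustration that assumes the environment-variable protocol documented for `sd_listen_fds()` (`LISTEN_PID`, `LISTEN_FDS`, passed descriptors starting at fd 3); the function name `inherited_listen_fds()` is hypothetical, and a real daemon would more likely call `sd_listen_fds()` from libsystemd.

```c
#include <stdlib.h>
#include <unistd.h>

/* First passed file descriptor, per the sd_listen_fds() protocol. */
#define LISTEN_FDS_START 3

/* Hypothetical helper: returns how many listening sockets were passed in
 * by the service manager, 0 if none were passed to this process, or -1 if
 * the environment variables are malformed. */
static int inherited_listen_fds(void) {
        const char *pid_str = getenv("LISTEN_PID");
        const char *fds_str = getenv("LISTEN_FDS");
        char *end;
        long pid, n;

        if (!pid_str || !fds_str)
                return 0;

        pid = strtol(pid_str, &end, 10);
        if (*pid_str == '\0' || *end != '\0')
                return -1;

        /* The descriptors were meant for some other process. */
        if (pid != (long) getpid())
                return 0;

        n = strtol(fds_str, &end, 10);
        if (*fds_str == '\0' || *end != '\0' || n < 0)
                return -1;

        return (int) n;
}
```

If the function returns a positive count, the daemon can start accepting connections on `LISTEN_FDS_START` instead of creating its own socket. If it returns 0, the daemon falls back to creating and binding the socket itself, which keeps it usable on non-systemd systems.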