---
title: Storage Daemons for the Root File System
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# systemd and Storage Daemons for the Root File System

a.k.a. _Pax Cellae pro Radix Arbor_

(or something like that, my Latin is a bit rusty)

A number of complex storage technologies on Linux (e.g. RAID, volume
management, networked storage) require user space services to run while the
storage is active and mountable. This requirement becomes tricky as soon as the
root file system of the Linux operating system is stored on such storage
technology. Previously no clear path to make this work was available. This text
tries to clear up the resulting confusion, and to state what is now supported
and what is not.

## A Bit of Background

When complex storage technologies are used as backing for the root file system,
this needs to be set up by the initrd, i.e. on Fedora by Dracut. In newer
systemd versions tear-down of the root file system backing is also done by the
initrd: after terminating all remaining running processes and unmounting all
file systems it can (which means excluding the root file system) systemd will
jump back into the initrd code, allowing it to unmount the final file systems
(and their storage backing) that could not be unmounted as long as the OS was
still running from the main root file system. The job of the initrd is to
detach/unmount the root file system, i.e. inverting the exact commands it used
to set it up in the first place. This is not only cleaner, but also allows,
for the first time, arbitrarily complex stacks of storage technology.

Previous attempts to handle root file system setups with complex storage as
backing usually tried to maintain the root storage with program code stored on
the root storage itself, thus creating a number of dependency loops. Safely
detaching such a root file system becomes messy, since the program code on the
storage needs to stay around longer than the storage, which is contradictory.

## What's new?

As a result, we hereby clarify that we do not support storage technology setups
where the storage daemons are being run from the storage they maintain
themselves. In other words: a storage daemon backing the root file system cannot
be stored on the root file system itself.

What we do support instead is that these storage daemons are started from the
initrd, stay running all the time during normal operation and are terminated
only after we have returned control to the initrd, and by the initrd. As such,
storage daemons involved with maintaining the root file system storage
conceptually are more like kernel threads than like normal system services:
from the perspective of the init system (i.e. systemd), these services have been
started before systemd was initialized and stay around until after systemd is
already gone. These daemons can only be updated by updating the initrd and
rebooting; a takeover from initrd-supplied services to replacements from the
root file system is not supported.

## What does this mean?

Near the end of system shutdown, systemd executes a small tool called
systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as
it entirely replaces the systemd init process) then iterates through the
mounted file systems and running processes (as well as a couple of other
resources) and tries to unmount/read-only mount/detach/kill them. It continues
to do this in a tight loop as long as this results in any effect. From this
killing spree a couple of processes are automatically excluded: PID 1 itself of
course, as well as all kernel threads. After the killing/unmounting spree
control is passed back to the initrd, whose job is then to unmount/detach
whatever might be remaining.

The same killing spree logic (but not the unmount/detach/read-only logic) is
applied during the transition from the initrd to the main system (i.e. the
"`switch_root`" operation), so that no processes from the initrd survive to the
main system.

To implement the supported logic proposed above (i.e. where storage daemons
needed for the root file system which are started by the initrd stay around
during normal operation and are only killed after control is passed back to the
initrd), we need to exclude these daemons from the shutdown/switch_root killing
spree. To accomplish this, the following logic is available starting with
systemd 38:

Processes (run by the root user) whose first character of the zeroth command
line argument is `@` are excluded from the killing spree, much the same way as
kernel threads are excluded too. Thus, a daemon which wants to take advantage
of this logic needs to place the following at the top of its `main()` function:

```c
...
argv[0][0] = '@';
...
```

And that's already it. Note that this functionality is only to be used by
programs running from the initrd, and **not** for programs running from the
root file system itself. Programs which use this functionality and are running
from the root file system are considered buggy since they effectively prohibit
clean unmounting/detaching of the root file system and its backing storage.

_Again: if your code is being run from the root file system, then this logic
suggested above is **NOT** for you. Sorry. Talk to us, we can probably help you
to find a different solution to your problem._

The recommended way to distinguish between run-from-initrd and run-from-rootfs
for a daemon is to check for `/etc/initrd-release` (which exists on all modern
initrd implementations, see the [initrd Interface](/INITRD_INTERFACE) for
details): if it exists, `argv[0][0]` is set to `@`, and otherwise it is left
untouched. Something like this:

```c
#include <unistd.h>

int main(int argc, char *argv[]) {
        ...
        /* Only mark the process when running from the initrd */
        if (access("/etc/initrd-release", F_OK) >= 0)
                argv[0][0] = '@';
        ...
}
```
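
A slightly more defensive variant of the snippet above, as a sketch only: the
helper name `mark_if_initrd()` and its `marker_path` parameter are hypothetical
and not part of systemd. The parameter exists only so that the helper can be
exercised outside an initrd. The helper also skips the write when `argv[0]` is
missing or empty, which the caller of `execve()` is free to arrange.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Sets argv[0][0] to '@' if marker_path exists. Returns true if the mark was
 * applied. marker_path is a parameter only to keep this sketch testable; a
 * real daemon would pass "/etc/initrd-release". */
bool mark_if_initrd(char *argv[], const char *marker_path) {
        assert(marker_path);

        /* argv[0] may be missing or empty, so do not write to it blindly */
        if (!argv || !argv[0] || argv[0][0] == '\0')
                return false;

        /* Not running from the initrd: leave argv[0] untouched */
        if (access(marker_path, F_OK) < 0)
                return false;

        argv[0][0] = '@';
        return true;
}
```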

Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without
precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify
they are login shells. This logic is also very easy to implement. We have been
looking for other ways to mark processes for exclusion from the killing spree,
but could not find any that was equally simple to implement and quick to read
when traversing through `/proc/`. Also, as a side effect replacing the first
character of `argv[0]` with `@` also visually invalidates the path normally
stored in `argv[0]` (which usually starts with `/`) thus helping the
administrator to understand that your daemon is actually not originating from
the actual root file system, but from a path in a completely different
namespace (i.e. the initrd namespace). Other than that we just think that `@`
is a cool character which looks pretty in the ps output... 😎
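
To illustrate why the marker is quick to read when traversing `/proc/`, here is
a minimal sketch of such a check. It is not the code systemd-shutdown uses: the
function names are made up for this example, and the additional requirement
that the process is run by the root user is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Returns true if the raw contents of a /proc/<pid>/cmdline file start with
 * '@'. An empty cmdline is not treated as marked. */
bool cmdline_is_marked(const char *cmdline, size_t len) {
        assert(cmdline || len == 0);
        return len > 0 && cmdline[0] == '@';
}

/* Reads only the first byte of /proc/<pid>/cmdline and reports whether the
 * process carries the '@' marker. Unreadable entries count as unmarked. */
bool pid_is_marked(pid_t pid) {
        char path[64], first = 0;
        size_t n;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);
        f = fopen(path, "r");
        if (!f)
                return false;

        n = fread(&first, 1, 1, f);
        fclose(f);

        return cmdline_is_marked(&first, n);
}
```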

Note that your code should only modify `argv[0][0]` and leave the comm name
(i.e. `/proc/self/comm`) of your process untouched.

Since systemd v255, the `SurviveFinalKillSignal=yes` unit option can be set as
an alternative; it provides functionality equivalent to modifying `argv[0][0]`.

## To which technologies does this apply?

These recommendations apply to those storage daemons which need to stay around
until after the storage they maintain is unmounted. If your storage daemon is
fine with being shut down before its storage device is unmounted, you may ignore
the recommendations above.

This all applies to storage technology only, not to daemons with any other
(non-storage related) purposes.

## What else to keep in mind?

If your daemon implements the logic pointed out above, it should work nicely
from initrd environments. In many cases it might be necessary to additionally
support storage daemons to be started from within the actual OS, for example
when complex storage setups are used for auxiliary file systems, i.e. not the
root file system, or created by the administrator during runtime. Here are a
few additional notes for supporting these setups:

* If your storage daemon is run from the main OS (i.e. not the initrd) it will
  also be terminated when the OS shuts down (i.e. before we pass control back
  to the initrd). Your daemon needs to handle this properly.

* It is not acceptable to spawn off background processes transparently from
  user commands or udev rules. Whenever a process is forked off on Unix it
  inherits a multitude of process attributes (ranging from the obvious to the
  not-so-obvious such as security contexts or audit trails) from its parent
  process. It is practically impossible to fully detach a service from the
  process context of the spawning process. In particular, systemd tracks which
  processes belong to a service or login session very closely, and by spawning
  off your storage daemon from udev or an administrator command you thus make
  it part of its service/login session. Effectively this means that whenever
  udev is shut down, your storage daemon is killed too, and whenever the login
  session goes away your storage daemon might be terminated as well. (Also note
  that recent udev versions will automatically kill all long running background
  processes forked off udev rules now.) So, in summary: double-forking off
  processes from user commands or udev rules is **NOT** OK!

* To automatically spawn storage daemons from udev rules or administrator
  commands, the recommended technology is socket-based activation as
  implemented by systemd. Transparently for your client code, connecting to the
  socket of your storage daemon will result in the storage daemon being
  started. For that it is simply necessary to inform systemd about the socket
  you'd like it to listen on on behalf of your daemon and minimally modify the
  daemon to receive the listening socket for its services from systemd instead
  of creating it on its own. Such modifications can be minimal, and are easily
  written in a way that does not negatively impact usability on non-systemd
  systems. For more information on making use of socket activation in your
  program consult this blog story: [Socket
  Activation](https://0pointer.de/blog/projects/socket-activation.html)

* Consider having a look at the [initrd Interface of systemd](/INITRD_INTERFACE).
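
The socket activation note above can be sketched as follows. systemd passes
sockets as file descriptors starting at 3 and announces them via the
`$LISTEN_PID` and `$LISTEN_FDS` environment variables; the real API for this is
`sd_listen_fds()` from libsystemd. The sketch below reimplements just enough of
that protocol to stay self-contained. The function names and the upper bound of
1024 descriptors are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define LISTEN_FDS_START 3 /* first passed descriptor, same value as SD_LISTEN_FDS_START */

/* Parses a decimal string into *ret; returns 0 on success, -1 on any error. */
static int parse_long(const char *s, long *ret) {
        char *end;

        if (!s || !*s)
                return -1;

        errno = 0;
        *ret = strtol(s, &end, 10);
        return (errno != 0 || *end != '\0') ? -1 : 0;
}

/* Returns how many sockets were passed to process "self", given the values of
 * $LISTEN_PID and $LISTEN_FDS (either may be NULL if unset). Returns 0 if the
 * variables are missing, malformed, or addressed to a different process. */
int count_passed_fds(const char *listen_pid, const char *listen_fds, pid_t self) {
        long pid, n;

        assert(self > 0);

        if (parse_long(listen_pid, &pid) < 0 || pid != (long) self)
                return 0;

        /* The upper bound is an arbitrary sanity limit chosen for this sketch */
        if (parse_long(listen_fds, &n) < 0 || n <= 0 || n > 1024)
                return 0;

        return (int) n;
}

/* Returns the first passed-in listening socket, or -1 if the daemon has to
 * create its own socket (for example on a non-systemd system). */
int passed_listen_fd(void) {
        int n = count_passed_fds(getenv("LISTEN_PID"), getenv("LISTEN_FDS"), getpid());

        return n > 0 ? LISTEN_FDS_START : -1;
}
```

A daemon would call `passed_listen_fd()` early in `main()` and fall back to
creating its own socket if it returns -1, which keeps it usable on non-systemd
systems.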