---
title: Automatic Boot Assessment
category: Booting
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Automatic Boot Assessment

systemd provides support for automatically reverting back to the previous
version of the OS or kernel in case the system consistently fails to boot. This
support is built into several of its components. When used together, these
components provide a complete solution on UEFI systems, built as an add-on to
the [Boot Loader Specification](BOOT_LOADER_SPECIFICATION.md).
However, the different components may also be used independently, and in
combination with other software, to implement similar schemes, for example with
other boot loaders or for non-UEFI systems. Here's a brief overview of the
complete set of components:

* The
  [`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html)
  boot loader optionally maintains a per-boot-loader-entry counter that is
  decreased by one on each attempt to boot the entry, prioritizing entries that
  have non-zero counters over those which already reached a counter of zero
  when choosing the entry to boot.

* The
  [`systemd-bless-boot.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-bless-boot.service.html)
  service automatically marks a boot loader entry that has boot counting (as
  described above) enabled as "good" once a boot has been determined to be
  successful, thus turning off boot counting for it.

* The
  [`systemd-bless-boot-generator(8)`](https://www.freedesktop.org/software/systemd/man/systemd-bless-boot-generator.html)
  generator automatically pulls in `systemd-bless-boot.service` when use of
  `systemd-boot` with boot counting enabled is detected.

* The
  [`systemd-boot-check-no-failures.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-boot-check-no-failures.service.html)
  service is a simple health check tool that determines whether the boot
  completed successfully. When enabled, it becomes an indirect dependency of
  `systemd-bless-boot.service` (by means of `boot-complete.target`, see
  below), ensuring that the boot will not be considered successful if there are
  any failed services.

* The `boot-complete.target` target unit (see
  [`systemd.special(7)`](https://www.freedesktop.org/software/systemd/man/systemd.special.html))
  serves as a generic extension point both for units that are necessary to
  consider a boot successful (example: `systemd-boot-check-no-failures.service`
  as described above), and for units that want to act only if the boot is
  successful (example: `systemd-bless-boot.service` as described above).

* The
  [`kernel-install(8)`](https://www.freedesktop.org/software/systemd/man/kernel-install.html)
  script can optionally create boot loader entries that carry an initial boot
  counter (the initial counter is configurable in `/etc/kernel/tries`).

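As a rough illustration of the prioritization described for `systemd-boot` above, here is a minimal Python sketch. It assumes entry file names carry the counter tag explained in the Details section, and that the input list is already in the loader's normal preference order. The helpers `tries_left()` and `order_by_preference()` are hypothetical, and the real loader's ordering takes more factors into account than this.

```python
import re
from typing import List, Optional

# "+LEFT" or "+LEFT-DONE" right before the ".conf" suffix.
COUNTER_RE = re.compile(r"\+(?P<left>\d+)(?:-\d+)?\.conf$")


def tries_left(name: str) -> Optional[int]:
    """Return the "tries left" counter, or None if boot counting is off."""
    m = COUNTER_RE.search(name)
    return int(m.group("left")) if m else None


def order_by_preference(entries: List[str]) -> List[str]:
    """Demote entries whose counter reached zero; keep everything else in order.

    sorted() is stable, so entries that are not exhausted (including entries
    without any counter) keep their relative order ahead of exhausted ones.
    """
    return sorted(entries, key=lambda name: tries_left(name) == 0)


if __name__ == "__main__":
    entries = ["6.1.0+0-3.conf", "6.0.9+2-1.conf", "5.19.0.conf"]
    print(order_by_preference(entries))
```

Run as a script, this prints `['6.0.9+2-1.conf', '5.19.0.conf', '6.1.0+0-3.conf']`: the exhausted `6.1.0` entry drops behind both the entry still being counted and the entry without a counter.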
## Details

The boot counting data that `systemd-boot` and `systemd-bless-boot.service`
manage is stored in the names of the boot loader entry files. If a boot loader
entry file name contains `+` followed by one or two numbers (if two numbers,
they must be separated by `-`) right before the `.conf` suffix, then boot
counting is enabled for it. The first number is the "tries left" counter,
encoding how many more attempts to boot this entry shall be made. The second
number is the "tries done" counter, encoding how many failed attempts to boot
it have already been made. Each time a boot loader entry marked this way is
booted, the first counter is decreased by one and the second one increased by
one. (If the second counter is missing, it is assumed to be zero.) As long as
the "tries left" counter is above zero, the entry is still considered for
booting (entry state "indeterminate"). Once it reaches zero, the entry is not
tried anymore (entry state "bad"). If a boot attempt completes successfully,
the entry's counters are removed from the name (entry state "good"), thus
turning off boot counting for the future.

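The naming scheme can be made concrete with a small parser. The following Python sketch is not systemd's code: `parse_entry_name()`, `entry_state()` and `EntryCounter` are hypothetical names, and an entry without a counter tag is simply reported as "good", since its name alone cannot tell whether counting was ever enabled for it.

```python
import re
from typing import NamedTuple, Optional

# "<stem>+LEFT.conf" or "<stem>+LEFT-DONE.conf"; the tag is what follows
# the last "+" before the ".conf" suffix.
COUNTER_RE = re.compile(r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?\.conf$")


class EntryCounter(NamedTuple):
    stem: str                  # file name without counter tag and ".conf"
    tries_left: Optional[int]  # None when boot counting is off
    tries_done: Optional[int]  # None when boot counting is off


def parse_entry_name(name: str) -> EntryCounter:
    if not name.endswith(".conf"):
        raise ValueError(f"not a .conf entry file: {name!r}")
    m = COUNTER_RE.match(name)
    if m is None:
        return EntryCounter(name[: -len(".conf")], None, None)
    done = m.group("done")
    # A missing second number is treated as zero.
    return EntryCounter(m.group("stem"), int(m.group("left")), int(done) if done else 0)


def entry_state(name: str) -> str:
    counter = parse_entry_name(name)
    if counter.tries_left is None:
        return "good"  # no tag: counters removed (or never added)
    return "indeterminate" if counter.tries_left > 0 else "bad"


if __name__ == "__main__":
    for n in ("4.14.11-300.fc27.x86_64+3.conf",
              "4.14.11-300.fc27.x86_64+1-2.conf",
              "4.14.11-300.fc27.x86_64+0-3.conf",
              "4.14.11-300.fc27.x86_64.conf"):
        print(n, entry_state(n))
```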
## Walkthrough

Here's an example walkthrough of how this all fits together.

1. The user runs `echo 3 > /etc/kernel/tries` to enable boot counting.

2. A new kernel is installed. `kernel-install` is used to generate a new boot
   loader entry file for it. Let's say the version string for the new kernel is
   `4.14.11-300.fc27.x86_64`; a new boot loader entry
   `/boot/loader/entries/4.14.11-300.fc27.x86_64+3.conf` is hence created.

3. The system is booted for the first time after the new kernel is
   installed. The boot loader now sees the `+3` counter in the entry file
   name. It hence renames the file to `4.14.11-300.fc27.x86_64+2-1.conf`,
   indicating that one attempt has now been started and thus one fewer is
   left. After the rename completes, the entry is booted as usual.

4. Let's say this attempt to boot fails. On the following boot the boot loader
   will hence see the `+2-1` tag in the name, rename the entry file to
   `4.14.11-300.fc27.x86_64+1-2.conf`, and boot it.

5. Let's say the boot fails again. On the subsequent boot the loader will see
   the `+1-2` tag, rename the file to `4.14.11-300.fc27.x86_64+0-3.conf`, and
   boot it.

6. If this boot also fails, on the next boot the boot loader will see the
   tag `+0-3`, i.e. the counter reached zero. At this point the entry will be
   considered "bad", and ordered to the beginning of the list of entries. The
   next newest boot entry is now tried, i.e. the system automatically reverts
   back to an earlier version.

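The renames in steps 2 to 5 follow one simple rule, sketched below in Python. `install_entry()` and `start_boot_attempt()` are hypothetical names standing in for what `kernel-install` and the boot loader do; the sketch only computes the new file names and never touches the disk. It also assumes that an entry whose counter already reached zero keeps its name.

```python
import re

COUNTER_RE = re.compile(r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?\.conf$")


def install_entry(version: str, tries: int) -> str:
    """Entry file name carrying an initial counter (hypothetical kernel-install stand-in)."""
    return f"{version}+{tries}.conf"


def start_boot_attempt(name: str) -> str:
    """File name after the loader starts one more boot attempt of this entry.

    "Tries left" goes down by one and "tries done" up by one. Entries without
    a counter, or whose counter already reached zero, are returned unchanged.
    """
    m = COUNTER_RE.match(name)
    if m is None:
        return name
    left = int(m.group("left"))
    done = int(m.group("done") or 0)  # missing "tries done" counts as zero
    if left == 0:
        return name
    return f"{m.group('stem')}+{left - 1}-{done + 1}.conf"


if __name__ == "__main__":
    name = install_entry("4.14.11-300.fc27.x86_64", 3)  # step 2
    print(name)
    for step in (3, 4, 5):  # every attempt fails in this scenario
        name = start_boot_attempt(name)
        print(f"step {step}: {name}")
```

Running it prints the same sequence of names as the walkthrough: `+3`, then `+2-1`, `+1-2` and finally `+0-3`, at which point the entry is exhausted.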
The above describes the walkthrough when the selected boot entry fails on every
attempt. Let's have a look at an alternative ending to this walkthrough. In this
scenario the first 4 steps are the same as above:

1. *as above*

2. *as above*

3. *as above*

4. *as above*

5. Let's say the second boot succeeds. The kernel initializes properly, systemd
   is started and invokes all generators.

6. One of the generators started is `systemd-bless-boot-generator`, which
   detects that boot counting is used. It hence pulls
   `systemd-bless-boot.service` into the initial transaction.

7. `systemd-bless-boot.service` is ordered after and `Requires=` the generic
   `boot-complete.target` unit. This unit is hence also pulled into the initial
   transaction.

8. The `boot-complete.target` unit is ordered after and pulls in various units
   that are required to succeed for the boot process to be considered
   successful. One such unit is `systemd-boot-check-no-failures.service`.

9. `systemd-boot-check-no-failures.service` is run after all its own
   dependencies have completed, and assesses that the boot completed
   successfully. It hence exits cleanly.

10. This allows `boot-complete.target` to be reached. This signifies to the
    system that this boot attempt shall be considered successful.

11. This in turn permits `systemd-bless-boot.service` to run. It now
    determines which boot loader entry file was used to boot the system, and
    renames it, dropping the counter tag. Thus
    `4.14.11-300.fc27.x86_64+1-2.conf` is renamed to
    `4.14.11-300.fc27.x86_64.conf`. From this moment on, boot counting is
    turned off.

12. On the following boot (and all subsequent boots after that) the entry is
    seen with boot counting turned off, so no further renaming takes place.

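Steps 8 to 11 boil down to a gate followed by a rename. The Python sketch below models just that; `after_boot()` and `bless_entry_name()` are hypothetical, and the dictionary of check results stands in for the units pulled in by `boot-complete.target`. It is a simplification for illustration, not how systemd actually evaluates unit dependencies.

```python
import re
from typing import Dict

# Matches an entry name that carries a counter tag, so the tag can be dropped.
COUNTER_RE = re.compile(r"^(?P<stem>.+)\+\d+(?:-\d+)?\.conf$")


def bless_entry_name(name: str) -> str:
    """Name of the entry once marked "good": the counter tag is removed."""
    m = COUNTER_RE.match(name)
    if m is None:
        return name  # boot counting already off, nothing to rename
    return f"{m.group('stem')}.conf"


def after_boot(name: str, check_results: Dict[str, bool]) -> str:
    """Rough model of boot-complete.target gating the blessing step.

    `check_results` maps each check unit pulled in by the target to whether
    it succeeded. The entry is blessed only if all of them succeeded;
    otherwise the name, and thus the counter, stays as it is.
    """
    if all(check_results.values()):
        return bless_entry_name(name)
    return name


if __name__ == "__main__":
    entry = "4.14.11-300.fc27.x86_64+1-2.conf"
    print(after_boot(entry, {"systemd-boot-check-no-failures.service": True}))
    print(after_boot(entry, {"systemd-boot-check-no-failures.service": False}))
```

The first call yields `4.14.11-300.fc27.x86_64.conf`, matching step 11; the second leaves the counter in place, so the loader keeps counting on the next boot.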
## How to adapt this scheme to other setups

Many components of the stack described above may be replaced or augmented.
Here are a couple of recommendations.

1. To support alternative boot loaders in place of `systemd-boot`, two scenarios
   are recommended:

   a. Boot loaders already implementing the Boot Loader Specification can simply
      implement an equivalent file-rename-based logic, and thus integrate fully
      with the rest of the stack.

   b. Boot loaders that want to implement boot counting and store the counters
      elsewhere can provide their own replacements for
      `systemd-bless-boot.service` and `systemd-bless-boot-generator`, but should
      continue to use `boot-complete.target` and thus support any services
      ordered before it.

2. To support additional components that shall succeed before the boot is
   considered successful, simply place them in units (if they aren't already)
   and order them before the generic `boot-complete.target` target unit,
   combined with `Requires=` dependencies from the target, so that the target
   cannot be reached when any of the units fail. You may add any number of
   units like this, and the boot entry is marked as good only if they all
   succeed. Note that the target unit shall pull in these boot checking units,
   not the other way around.

3. To support additional components that shall only run on boot success, simply
   wrap them in a unit and order them after `boot-complete.target`, pulling it
   in.

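For recommendation 1a, a boot loader needs to apply the counter update to the entry file on disk before booting it. The Python sketch below shows one way to do that with a single `os.rename()` inside the entries directory; `count_boot_attempt()` is a hypothetical name, and a real boot loader would use its own (for example UEFI) file system interfaces rather than Python.

```python
import os
import re
import tempfile

COUNTER_RE = re.compile(r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?\.conf$")


def count_boot_attempt(entries_dir: str, name: str) -> str:
    """Record one boot attempt by renaming the entry file; return the new name.

    Only the directory entry changes; the file's contents are never rewritten.
    Entries without a counter, or with zero tries left, are left alone.
    """
    m = COUNTER_RE.match(name)
    if m is None:
        return name
    left = int(m.group("left"))
    done = int(m.group("done") or 0)
    if left == 0:
        return name
    new_name = f"{m.group('stem')}+{left - 1}-{done + 1}.conf"
    # The rename has to finish before the entry is actually booted.
    os.rename(os.path.join(entries_dir, name), os.path.join(entries_dir, new_name))
    return new_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as entries:
        open(os.path.join(entries, "6.1.0+3.conf"), "w").close()
        name = count_boot_attempt(entries, "6.1.0+3.conf")
        print(name, os.listdir(entries))
```

Keeping the update to a same-directory rename is what lets such a loader share the naming convention with `systemd-bless-boot.service` unchanged.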
## FAQ

1. *Why do you use file renames to store the counter? Why not a regular file?*
   — Mainly two reasons: it's relatively likely that renames can be implemented
   atomically even in simpler file systems, while writing to file contents has
   a much greater chance of resulting in incomplete or corrupt data, as renaming
   generally avoids allocating or releasing data blocks. Moreover, it has the
   benefit that the boot count metadata is directly attached to the boot loader
   entry file, and thus the lifecycle of the metadata and the entry itself are
   bound together. This means no additional clean-up needs to take place to
   drop the boot loader counting information for an entry when it is removed.

2. *Why not use EFI variables for storing the boot counter?* — The memory chips
   used to back the persistent EFI variables are generally not of the highest
   quality, hence shouldn't be written to more than necessary. This means we
   can't really use them for changes made regularly during boot, but can use
   them only for seldom-made configuration changes.

3. *I have a service which, when it fails, should immediately cause a
   reboot. How does that fit in with the above?* — Well, that's orthogonal to
   the above; please use `FailureAction=` in the unit file for this.

4. *Under some condition I want to mark the current boot loader entry as bad
   right away, so that it is never tried again. How do I do that?* — You may
   invoke `/usr/lib/systemd/systemd-bless-boot bad` at any time to mark the
   current boot loader entry as "bad" right away, so that it isn't tried again
   on later boots.
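The effect of marking an entry "bad" on its file name can be sketched as follows, assuming (consistent with the Details section) that "bad" simply means a "tries left" counter of zero. `mark_bad_name()` is a hypothetical helper: keeping the "tries done" counter and refusing entries without a counter tag are assumptions of this sketch, not a description of what `systemd-bless-boot` does internally.

```python
import re

COUNTER_RE = re.compile(r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?\.conf$")


def mark_bad_name(name: str) -> str:
    """Hypothetical: name of the entry after marking it "bad" right away.

    Sets "tries left" to zero and keeps "tries done"; raises if the entry has
    no counter tag, since there is then no counter to exhaust.
    """
    m = COUNTER_RE.match(name)
    if m is None:
        raise ValueError(f"boot counting is not enabled for {name!r}")
    done = int(m.group("done") or 0)
    return f"{m.group('stem')}+0-{done}.conf"


if __name__ == "__main__":
    print(mark_bad_name("4.14.11-300.fc27.x86_64+2-1.conf"))
```

For the example entry this yields `4.14.11-300.fc27.x86_64+0-1.conf`, which the loader then treats exactly like an entry that exhausted its tries.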