Eric DeVolder [Thu, 16 May 2019 13:59:01 +0000 (08:59 -0500)]
pstore: Tool to archive contents of pstore
This patch introduces the systemd pstore service which will archive the
contents of the Linux persistent storage filesystem, pstore, to other storage,
thus preserving the existing information contained in the pstore, and clearing
pstore storage for future error events.
Linux provides a persistent storage file system, pstore[1], that can store
error records when the kernel dies (or reboots or powers-off). These records in
turn can be referenced to debug kernel problems (currently the kernel stuffs
the tail of the dmesg, which also contains a stack backtrace, into pstore).
The pstore file system supports a variety of backends that map onto persistent
storage, such as the ACPI ERST[2, Section 18.5 Error Serialization] and UEFI
variables[3 Appendix N Common Platform Error Record]. The pstore backends
typically offer a relatively small amount of persistent storage, e.g. 64KiB,
which can quickly fill up and thus prevent subsequent kernel crashes from
recording errors. Thus there is a need to monitor and extract the pstore
contents so that future kernel problems can also record information in the
pstore.
The pstore service is independent of the kdump service. In cloud environments
specifically, host and guest filesystems are on remote filesystems (eg. iSCSI
or NFS), thus kdump relies [implicitly and/or explicitly] upon proper operation
of networking software *and* hardware *and* infrastructure. Thus it may not be
possible to capture a kernel coredump to a file since writes over the network
may not be possible.
The pstore backend, on the other hand, is completely local and provides a path
to store error records which will survive a reboot and aid in post-mortem
debugging.
Usage Notes:
This tool moves files from /sys/fs/pstore into /var/lib/systemd/pstore.
To enable kernel recording of error records into pstore, one must either pass
crash_kexec_post_notifiers[4] to the kernel command line or enable via 'echo Y
> /sys/module/kernel/parameters/crash_kexec_post_notifiers'. This option
invokes the recording of errors into pstore *before* an attempt to kexec/kdump
on a kernel crash.
Optionally, to record reboots and shutdowns in the pstore, one can either pass
the printk.always_kmsg_dump[4] to the kernel command line or enable via 'echo Y >
/sys/module/printk/parameters/always_kmsg_dump'. This option enables code on the
shutdown path to record information via pstore.
This pstore service is a oneshot service. When run, the service invokes
systemd-pstore which is a tool that performs the following:
- reads the pstore.conf configuration file
- collects the lists of files in the pstore (eg. /sys/fs/pstore)
- for certain file types (eg. dmesg) a handler is invoked
- for all other files, the file is moved from pstore
- In the case of dmesg handler, final processing occurs as such:
- files processed in reverse lexigraphical order to faciliate
reconstruction of original dmesg
- the filename is examined to determine which dmesg it is a part
- the file is appended to the reconstructed dmesg
For example, the following pstore contents:
root@vm356:~# ls -al /sys/fs/pstore
total 0
drwxr-x--- 2 root root 0 May 9 09:50 .
drwxr-xr-x 7 root root 0 May 9 09:50 ..
-r--r--r-- 1 root root 1610 May 9 09:49 dmesg-efi-155741337601001
-r--r--r-- 1 root root 1778 May 9 09:49 dmesg-efi-155741337602001
-r--r--r-- 1 root root 1726 May 9 09:49 dmesg-efi-155741337603001
-r--r--r-- 1 root root 1746 May 9 09:49 dmesg-efi-155741337604001
-r--r--r-- 1 root root 1686 May 9 09:49 dmesg-efi-155741337605001
-r--r--r-- 1 root root 1690 May 9 09:49 dmesg-efi-155741337606001
-r--r--r-- 1 root root 1775 May 9 09:49 dmesg-efi-155741337607001
-r--r--r-- 1 root root 1811 May 9 09:49 dmesg-efi-155741337608001
-r--r--r-- 1 root root 1817 May 9 09:49 dmesg-efi-155741337609001
-r--r--r-- 1 root root 1795 May 9 09:49 dmesg-efi-155741337710001
-r--r--r-- 1 root root 1770 May 9 09:49 dmesg-efi-155741337711001
-r--r--r-- 1 root root 1796 May 9 09:49 dmesg-efi-155741337712001
-r--r--r-- 1 root root 1787 May 9 09:49 dmesg-efi-155741337713001
-r--r--r-- 1 root root 1808 May 9 09:49 dmesg-efi-155741337714001
-r--r--r-- 1 root root 1754 May 9 09:49 dmesg-efi-155741337715001
results in the following:
root@vm356:~# ls -al /var/lib/systemd/pstore/155741337/
total 92
drwxr-xr-x 2 root root 4096 May 9 09:50 .
drwxr-xr-x 4 root root 40 May 9 09:50 ..
-rw-r--r-- 1 root root 1610 May 9 09:50 dmesg-efi-155741337601001
-rw-r--r-- 1 root root 1778 May 9 09:50 dmesg-efi-155741337602001
-rw-r--r-- 1 root root 1726 May 9 09:50 dmesg-efi-155741337603001
-rw-r--r-- 1 root root 1746 May 9 09:50 dmesg-efi-155741337604001
-rw-r--r-- 1 root root 1686 May 9 09:50 dmesg-efi-155741337605001
-rw-r--r-- 1 root root 1690 May 9 09:50 dmesg-efi-155741337606001
-rw-r--r-- 1 root root 1775 May 9 09:50 dmesg-efi-155741337607001
-rw-r--r-- 1 root root 1811 May 9 09:50 dmesg-efi-155741337608001
-rw-r--r-- 1 root root 1817 May 9 09:50 dmesg-efi-155741337609001
-rw-r--r-- 1 root root 1795 May 9 09:50 dmesg-efi-155741337710001
-rw-r--r-- 1 root root 1770 May 9 09:50 dmesg-efi-155741337711001
-rw-r--r-- 1 root root 1796 May 9 09:50 dmesg-efi-155741337712001
-rw-r--r-- 1 root root 1787 May 9 09:50 dmesg-efi-155741337713001
-rw-r--r-- 1 root root 1808 May 9 09:50 dmesg-efi-155741337714001
-rw-r--r-- 1 root root 1754 May 9 09:50 dmesg-efi-155741337715001
-rw-r--r-- 1 root root 26754 May 9 09:50 dmesg.txt
where dmesg.txt is reconstructed from the group of related
dmesg-efi-155741337* files.
Configuration file:
The pstore.conf configuration file has four settings, described below.
- Storage : one of "none", "external", or "journal". With "none", this
tool leaves the contents of pstore untouched. With "external", the
contents of the pstore are moved into the /var/lib/systemd/pstore,
as well as logged into the journal. With "journal", the contents of
the pstore are recorded only in the systemd journal. The default is
"external".
- Unlink : is a boolean. When "true", the default, then files in the
pstore are removed once processed. When "false", processing of the
pstore occurs normally, but the pstore files remain.
References:
[1] "Persistent storage for a kernel's dying breath",
March 23, 2011.
https://lwn.net/Articles/434821/
[2] "Advanced Configuration and Power Interface Specification",
version 6.2, May 2017.
https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
[3] "Unified Extensible Firmware Interface Specification",
version 2.8, March 2019.
https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[4] "The kernel’s command-line parameters",
https://static.lwn.net/kerneldoc/admin-guide/kernel-parameters.html
This file was testing a mix of functions from src/core/load-fragment.c and some
from src/shared/install.c. Let's name it more appropriately. I want to add
tests for the new unit-file.c too.
basic/hashmap: add hashops variant that does strdup/freeing on its own
So far, we'd use hashmap_free_free to free both keys and values along with
the hashmap. I think it's better to make this more encapsulated: in this variant
the way contents are freed can be decided when the hashmap is created, and
users of the hashmap can always use hashmap_free.
basic/unit-name: allow unit_name_to_instance() to be used to classify units
This could already be done by calling unit_name_is_*(), but if we don't know
if the argument is a valid unit name, it is more convenient to have a single
function which returns the type or possibly an error if the unit name is not
valid.
The values in the enum are sorted "by length". Not really important, but it
seems more natural to me.
pid1: order jobs that execute processes with lower priority
We can meaningfully compare jobs for units which have cpu weight or nice set.
But non-exec units those have those set.
Starting non-exec jobs first allows us to get them out of the queue quickly,
and consider more jobs for starting.
If we have service A, and socket B, and service C which is after socket B,
and we want to start both A and C, and C has higher cpu weight, if we get
B out of the way first, we'll know that we can start both A and C, and we'll
start C first.
Also invert the comparisons using CMP() so they are always done left vs. right,
and negate when returning instead.
At the moment the shutdown watchdog is set only when rebooting.
The set of "things that can go wrong" is not too far off when kexec'ing
and in fact we have a use case where it would be useful - moving to a
new kernel image.
Michael Biebl [Wed, 17 Jul 2019 23:24:00 +0000 (01:24 +0200)]
meson: make nologin path build time configurable
Some distros install nologin as /usr/sbin/nologin, others as
/sbin/nologin.
Since we can't really on merged-usr everywhere (where the path wouldn't
matter), make the path build time configurable via -Dnologin-path=.
shared/conf-parser: emit a nicer warning for something like "======"
Urlich Windl wrote on the mailing list:
> I noticed that a line of "=======" in "[Service]" cases the message " Unknown lvalue '' in section 'Service'".
This now becomes:
/etc/systemd/system/eqeqeqeq.service:3: Missing key name before '=', ignoring line.
shared/conf-parser: be nice and ignore lines without "="
We generally don't treat syntax error as fatal, but in this case we would
completely refuse to load the file. I think we should treat the the same
as assignment outside of a section, or an unknown key name.
Michael Olbrich [Wed, 22 May 2019 10:12:17 +0000 (12:12 +0200)]
job: make the run queue order deterministic
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.
As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results is a similar jitter for the boot time. This is undesirable in
general and make optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.
Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name
The last one is just there for deterministic sorting to avoid any jitter.
Michael Olbrich [Wed, 3 Jul 2019 14:11:47 +0000 (16:11 +0200)]
basic: reorder UnitType enum
The enum order will be used to order jobs in the job queue.
Make sure that unit types that fork aditional processes come first to
maximize parallelism.
resolved: switch cache option to a tri-state option (systemd#5552).
Change the resolved.conf Cache option to a tri-state "no, no-negative, yes" values.
If a lookup returns SERVFAIL systemd-resolved will cache the result for 30s (See 201d995),
however, there are several use cases on which this condition is not acceptable (See systemd#5552 comments)
and the only workaround would be to disable cache entirely or flush it , which isn't optimal.
This change adds the 'no-negative' option when set it avoids putting in cache
negative answers but still works the same heuristics for positive answers.
network: drop fallback mechanism to assign DHCPv6 addresses with IFA_F_NOPREFIXROUTE
The flag IFA_F_NOPREFIXROUTE was introduced in kernel-3.14. But even if
the kernel does not support the flag, it should be just ignored. So, it
is not necessary to do the fallback logic. Moreover, the current logic
is not a fallback mechanism but just retrying. So, it should not work.
Let's drop that.
systemctl: call the unit dbus path dbus_path everywhere
Similar variables had differing names: unit, path, unit_path. We also
have file system paths in surrounding code. Let's make this easier for
the reader and use "dbus_path" consistently.
It had two users, but it is just a very thin wrapper around
unit_file_find_dropin_paths(), so using it seems more complicated than directly
invoking unit_file_find_dropin_paths() twice.
man: rework the description of Aliases and .wants/.requires directories
The description of Alias= wasn't incorrect, but it sounded like Alias= creates
a different type of dependency, while it's just a glorified way to create
symlinks. Also recommend 'preset' in addition to 'enable'.
Describe .wants/.requires dirs as equals, without implying that the [Install]
section can only be used for .wants.
The text was partially out of date (systemd-networkd.service now creates as
alias in /etc, not /usr/lib, let's just not say anything about the full path).
This restores proper speed with asan builds with gcc 9.1.1.
Fixes #12997.
$ rpm -q gcc
gcc-9.1.1-2.fc31.x86_64
$ time ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1 UBSAN_OPTIONS=print_stacktrace=1:print_summary=1:halt_on_error=1 build-rawhide-sanitize/test-conf-parser
(old) 86.99s user 20.22s system 361% cpu 29.635 total
(new) 3.05s user 0.29s system 99% cpu 3.377 total
firstboot: fix hang waiting for second Enter on input
The comment explains the reason: we'd wait for the second \n
and then ungetc() it. Then the buffered \n would cause a problem
when the next prompt was issued, so in effect it wasn't possible
to answer the second question.
The user most likely knows the name of their locale/keymap/whatever,
and paging through multiple pages of output has little benefit.
The header that was printed before is now not printed anymore. But
now it's obvious from the context what we are printing, so we don't
need to print the header.
The commit
"util: Do not clear parent mount flags when setting up namespaces"
introduced a statvfs call read the flags of the original mount
and have them applied to the bind mount.
This has two problems:
(1) The mount flags returned by statvfs(2) do not match the flags
accepted by mount(2). For example, the value 4096 means ST_RELATIME
when returned by statvfs(2), but means MS_BIND when passed to mount(2).
(2) A call to statvfs blocks indefinitely when ran against a disconnected
network drive ( https://github.com/systemd/systemd/issues/12667 ).
We already use libmount to parse `/proc/self/mountinfo` but did not use the
mount flag information from there. This patch changes that to use the mount
flags parsed by libmount instead of calling statvfs. Only if getting the
flags through libmount fails we call statvfs.
The test would not pass before, because EXTRACT_UNQUOTE|EXTRACT_RETAIN_ESCAPE
didn't work (we'd get "KEY3=val with \\quotation\\" as the last string. Now we
are only doing EXTRACT_UNQUOTE, so we get the expected "KEY3=val with \"quotation\"".
It's hard to even say what exactly this combination means. Escaping is
necessary when quoting to have quotes within the string. So the escaping of
quote characters is inherently tied to quoting. When unquoting, it seems
natural to remove escaping which was done for the quoting purposes. But with
both flags we would be expected to re-add this escaping after unqouting? Or
maybe keep the escaping which is not necessary for quoting but otherwise
present? This all seems too complicated, let's just forbid such usage and
always fully unescape when unquoting.