core: Use a subdirectory of /run/ for PrivateDevices=
When we're starting early boot services such as systemd-userdbd.service,
/tmp might not yet be mounted, so let's use a directory in /run instead
which is guaranteed to be available.
Daan De Meyer [Sun, 1 Oct 2023 18:40:45 +0000 (20:40 +0200)]
mount: Log when we can't create the mount point
Debugging mount unit failures caused by systemd not being able to
create the mount point is currently rather hard. Let's log about
failures to create mount points to simplify debugging.
journalctl: find boot ID more gracefully in corrupted journal
In discover_next_boot(), first we find a new boot ID based on the value
stored in the entry object. Then, find the tail (or head when we are going
upwards) entry of the boot based on the _BOOT_ID= field data.
If boot IDs of an entry in the entry object and _BOOT_ID field data
are inconsistent, which may happen on corrupted journal, then previously
discover_next_boot() failed with -ENODATA.
This makes the function check if the two boot IDs in each entry are
consistent, and skip the entry if not.
Fixes the failure of `journalctl -b -1` for 'truncated' journal:
https://github.com/systemd/systemd/pull/29334#issuecomment-1736567951
This adds a new script tools/check-version-history.py and a corresponding
test when building in developer mode. It checks manpages (except dbus
documentation which is handled by update-dbus-docs) for missing version
history information.
It also adds ignore lists based on version 183 (the version that our version
annotations go back to). These can be augmented if we want to ignore other
elements if it doesn't make sense for them to have version annotations.
sd-journal: merge journal_file_next_entry_for_data() with generic_array_get_plus_one()
Because journal_file_next_entry_for_data() provides the first entry, while
journal_file_next_entry() actually provides the next entry of the input,
this also renames it to journal_file_move_to_entry_for_data().
Also, previously, on DIRECTION_UP the function did not fall back to the
'extra' entry when all entries linked in the chained array are broken.
This also fixes the issue, and now it fall back to the extra entry.
sd-journal: fix calculation of number of 'total' entries in the chained arrays
If there's corruption and we are going upwards, then the 'total'
must be decreased when we go to the previous array. However,
previously, we wrongly kept or increased the number. This fixes
the behavior.
In the do-while loop, we do not read any other entry array object, hence
the current object is always in the mmap cache and not necessary to re-read it.
Doing that in VMs without acceleration is prohibitively expensive (i.e.
20+ seconds in the C8S job). Thankfully, the recent [0] --lines=+n syntax
makes this all quite easy to fix.
test: shutdown the machine on fail after soft-reboot
Since the soft-reboot drops the enqueued end.service, we won't shutdown
the test VM if the test fails and have to wait for the watchdog to kill
us (which may take quite a long time). Let's just forcibly kill the
machine instead to save CI resources.
tpm2-setup: add new early boot tool for initializing the SRK
This adds an explicit service for initializing the TPM2 SRK. This is
implicitly also done by systemd-cryptsetup, hence strictly speaking
redundant, but doing this early has the benefit that we can parallelize
this in a nicer way. This also write a copy of the SRK public key in PEM
format to /run/ + /var/lib/, thus pinning the disk image to the TPM.
Making the SRK public key is also useful for allowing easy offline
encryption for a specific TPM.
Sooner or later we should probably grow what this service does, the
above is just the first step. For example, the service should probably
offer the ability to reset the TPM (clear the owner hierarchy?) on a
factory reset, if such a policy is needed. And we might want to install
some default AK (?).
Jan Janssen [Fri, 22 Sep 2023 12:41:47 +0000 (14:41 +0200)]
boot: Lift linker requirements
The biggest reason for forcing bfd was the use of linker scrips. Since
we don't rely on those anymore we can lift the requirement.
The biggest issue is gold as it does not understand -static-pie. Given
that it's pretty much on life support it's safe to just declare it not
supported anymore.
Don't link addons with libefi as clang/lld is sometimes very eager to
include memset etc., causing needless binary bloat and link errors with
LTO.
Jan Janssen [Fri, 22 Sep 2023 10:15:55 +0000 (12:15 +0200)]
elf2efi: Add --copy-sections option
This makes the special PE sections available again in our output EFI
images.
Since the compiler provides no way to mark a section as not allocated,
we use GNU assembler syntax to emit the sections instead. This ensures
the section data isn't emitted twice as load segments will only contain
allocating input sections.
Jan Janssen [Thu, 28 Sep 2023 14:09:42 +0000 (16:09 +0200)]
elf2efi: Rework ELF section conversion
The main reason we need to apply a whole lot of logic to the section
conversion logic is because PE sections have to be aligned to the page
size (although, currently not even EDK2 enforces this). The process of
achieving this with a linker script is fraught with errors, they are a
pain to set up correctly and suck in general. They are also not
supported by mold, which requires us to forcibly use bfd, which also
means that linker feature detection is easily at odds as meson has a
differnt idea of what linker is in use.
Instead of forcing a manual ELF segment layout with a linker script we
just let the linker do its thing. We then simply copy/concatenate the
sections while observing proper page boundaries.
Note that we could just copy the ELF load *segments* directly and
achieve the same result. Doing this manually allows us to strip sections
we don't need at runtime like the dynamic linking information (the
elf2efi conversion is effectively the dynamic loader).
Important sections like .sbat that we emit directly from code will
currently *not* be exposed as individual PE sections as they are
contained within the ELF segments. A future commit will fix this.
Dan Streetman [Fri, 30 Jun 2023 16:52:10 +0000 (12:52 -0400)]
tpm2: add tpm2_index_to_handle() and tpm2_index_from_handle()
Adjust the tpm2_esys_handle_from_tpm_handle() function into better-named
tpm2_index_to_handle(), which operates like tpm2_get_srk() but allows using any
handle index. Also add matching tpm2_index_from_handle().
Also change the references to 'location' in tpm2_persist_handle() to more
appropriate 'handle index'.
So the coverage-related drop-in [0] can kick in to avoid errors with
DynamicUser=true. Also, to not make the test confusing with this change,
replace "nft-test" with "test-nft" everywhere.
[0] See test/README.testsuite, section "Code coverage"
tpm2: move measurement log to /run/log/ (from /var/log/)
I have no idea what went on in my mind when I used a path in /var/ for
the tpm2 event log we now keep for userspace measurements. The
measurements are only valid for the current boot, hence should not be
persisted (in particular as they cannot be rotated, hence should not
grow without bounds).
Fix that, simply move from /var/log/ to /run/log/.
fix: do not check/verify slice units if recursive errors are to be ignored
Before this fix, when recursive-errors was set to 'no' during a systemd-analyze
verification, the parent slice was checked regardless. The 'no' setting means that,
only the specified unit should be looked at and verified and errors in the slices should be
ignored. This commit fixes that issue.
Example:
Say we have a sample.service file:
[Unit]
Description=Sample Service
[Service]
ExecStart=/bin/echo "a"
Slice=support.slice
Before Change:
systemd-analyze verify --recursive-errors=no maanya/sample.service
Assertion 'u' failed at src/core/unit.c:153, function unit_has_name(). Aborting.
Aborted (core dumped)
After Change:
systemd-analyze verify --recursive-errors=no maanya/sample.service
{No errors}
core: move pid watch/unwatch logic of the service manager to pidfd
This makes sure unit_watch_pid() and unit_unwatch_pid() will track
processes by pidfd if supported. Also ports over some related code.
Should not really change behaviour.
Note that this does *not* add support waiting for POLLIN on the pidfds
as additional exit notification. This is left for a later commit (this
commit is already large enough), in particular as that would add new
logic and not just convert existing logic.
This new helper can be used after reading process info from procfs, to
verify that the data that was just read actually matches the pidfd, and
does not belong to some new process that just reused the numeric PID of
the process we originally pinned.
pidref: add helpers for managing PidRef on the heap
Usually we want to embed PidRef in other structures, but sometimes it
makes sense to allocate it on the heap in case it should be used
standalone. Add helpers for that.
Primary usecase: use as key in Hashmap objects, that for example map
process to unit objects in PID 1.
This adds pidref_free()/pidref_freep() for freeing such an allocated
struct, as well as pidref_dup() (for duplicating an existing PidRef
on the heap 1:1), and pidref_new_pid() (for allocating a new PidRef from a
PID).
This helper truns a pid_t into a PidRef. It's different from
pidref_set_pid() in being "passive", i.e. it does not attempt to acquire
a pidfd for the pid.
This is useful when using the PidRef as a lookup key that shall also
work after a process is already dead, and hence no conversion to a pidfd
is possible anymore.
resolved: register ipv4only.arpa are private domain
From RFC 8880:
Because the 'ipv4only.arpa' zone has to be an insecure delegation,
DNSSEC cannot be used to protect these answers from tampering by
malicious devices on the path.
Consequently, the 'ipv4only.arpa' zone MUST be an insecure delegation to
give DNS64/NAT64 gateways the freedom to synthesize answers to those
queries at will, without the answers being rejected by DNSSEC-capable
resolvers. DNSSEC-capable resolvers that follow this specification MUST
NOT attempt to validate answers received in response to queries for the
IPv6 AAAA address records for 'ipv4only.arpa'. Note that the name
'ipv4only.arpa' has no use outside of being used for this special DNS
pseudo-query used to learn the DNS64/NAT64 address synthesis prefix, so
the lack of DNSSEC security for that name is not a problem.
There's no way for us to wait for specific virtiofs tags to appear,
so we have to try and make sure that the tags are all available by
the time we try to mount any virtiofs tag. Let's try to do that by
loading the necessary modules as early as we can.
Mike Yuan [Wed, 27 Sep 2023 10:38:10 +0000 (18:38 +0800)]
core: mark units as need daemon-reload if unit file operations are
performed
systemctl would issue daemon-reload after unit file operations
(enable/disable/preset/...) succeed. However, such operations
are not atomic, meaning that the unit file state could still change
even if the operation generally fails, and the unit_file_state
cached by manager becomes outdated.
pid1: add SurviveFinalKillSignal= to skip units on final sigterm/sigkill spree
Add a new boolean for units, SurviveFinalKillSignal=yes/no. Units that
set it will not have their process receive the final sigterm/sigkill in
the shutdown phase.
This is implemented by checking if a process is part of a cgroup marked
with a user.survive_final_kill_signal xattr (or a trusted xattr if we
can't set a user one, which were added only in kernel v5.7 and are not
supported in CentOS 8).
Rework unit_name_mangle_with_suffix() to (very slightly) simplify the path
'systemctl status /../dev' now looks for 'dev.mount', not '-..-dev.service',
and 'systemctl status /../foo' looks for 'foo.mount', not '-..-foo.service'. I
think this much more useful. I think the escaping is not very useful, so I plan
to submit a later series which changes that behaviour. But I think this first
step here is already useful on its own.
Note that the patch is smaller than it seems: before, is_device_path() would
return true only for absolute paths, so moving of is_device_path() under the
path_is_absolute() conditional doesn't influence the logic.
exec-util: print executed commands in do_execute()
kernel-install uses do_execute(). We would log whenever a spawned child
finished, but we would not log anything when the child is launched. When the
children log output without a prefix (as the kernel-install plugins do), it
is hard to see where that output is coming from.
For us, this is a compatibility mode, but most likely it is there to stay: the
kernel Makefile's install target expects to be able to call /bin/installkernel.
We want people who build their own kernels to use this, so that they use
kernel-install and get support for all the functionality provided by it,
including building of UKIs and other new features. So let's actually advertise
that this exists and works.
Because names beneath .alt are in an alternative namespace, they have no
significance in the regular DNS context. DNS stub and recursive
resolvers do not need to look them up in the DNS context.