In mkosi, I want to add a sysupdate verb to wrap systemd-sysupdate.
The definitions will be picked up from mkosi.sysupdate/ and passed
to systemd-sysupdate. I want users to be able to write transfer
definitions that are independent of the output directory used by
mkosi. To make this possible, it should be possible to specify the
directory that transfer sources should be looked up in on the sysupdate
command line. Let's allow this via a new --transfer-source= option.
Additionally, transfer sources that want to take advantage of this
feature should specify PathRelativeTo=directory to indicate the configured
Path= is interpreted relative to the tranfer source directory specified
on the CLI.
This allows for the following transfer definition to be put in
mkosi.sysupdate:
Luke T. Shumaker [Wed, 21 Aug 2024 23:29:10 +0000 (17:29 -0600)]
nspawn: enable FUSE in containers
Linux kernel v4.18 (2018-08-12) added user-namespace support to FUSE, and
bumped the FUSE version to 7.27 (see: da315f6e0398 (Merge tag
'fuse-update-4.18' of
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse, Linus Torvalds,
2018-06-07). This means that on such kernels it is safe to enable FUSE in
nspawn containers.
In outer_child(), before calling copy_devnodes(), check the FUSE version to
decide whether enable (>=7.27) or disable (<7.27) FUSE in the container. We
look at the FUSE version instead of the kernel version in order to enable FUSE
support on older-versioned kernels that may have the mentioned patchset
backported ([as requested by @poettering][1]). However, I am not sure that
this is safe; user-namespace support is not a documented part of the FUSE
protocol, which is what FUSE_KERNEL_VERSION/FUSE_KERNEL_MINOR_VERSION are meant
to capture. While the same patchset
- added FUSE_ABORT_ERROR (which is all that the 7.27 version bump
is documented as including),
- bumped FUSE_KERNEL_MINOR_VERSION from 26 to 27, and
- added user-namespace support
these 3 things are not inseparable; it is conceivable to me that a backport
could include the first 2 of those things and exclude the 3rd; perhaps it would
be safer to check the kernel version.
Do note that our get_fuse_version() function uses the fsopen() family of
syscalls, which were not added until Linux kernel v5.2 (2019-07-07); so if
nothing has been backported, then the minimum kernel version for FUSE-in-nspawn
is actually v5.2, not v4.18.
Pass whether or not to enable FUSE to copy_devnodes(); have copy_devnodes()
copy in /dev/fuse if enabled.
Pass whether or not to enable FUSE back over fd_outer_socket to run_container()
so that it can pass that to append_machine_properties() (via either
register_machine() or allocate_scope()); have append_machine_properties()
append "DeviceAllow=/dev/fuse rw" if enabled.
For testing, simply check that /dev/fuse can be opened for reading and writing,
but that actually reading from it fails with EPERM. The test assumes that if
FUSE is supported (/dev/fuse exists), then the testsuite is running on a kernel
with FUSE >= 7.27; I am unsure how to go about writing a test that validates
that the version check disables FUSE on old kernels.
The commit changed lookup_paths_init_or_warn() call to
be fatal to manager_reload(), but invoke_main_loop()
assumes that manager_reload() would only return
recoverable error, and put the manager back to
MANAGER_OK in that case, which is spurious.
Looking at it more, it appears to be utterly unnecessary
to reinitialize LookupPaths here, given that nothing during
the reload process would change the search dirs. Let's drop
the path altogether hence.
Luke T. Shumaker [Wed, 21 Aug 2024 23:29:10 +0000 (17:29 -0600)]
test: add a testcase for unprivileged nspawn
Right now it mostly duplicates a test that already exists in
TEST-50-DISSECT.mountfsd.sh, but it serves as a template for more unprivileged
nspawn tests.
Luke T. Shumaker [Thu, 22 Aug 2024 04:50:16 +0000 (22:50 -0600)]
nspawn: fix the comment about which namespaces outer_child is in
The comment says that it is still in the host's CLONE_NEWUSER namespace,
which is not true if !arg_privileged. Also, it says that the CLONE_NEWNS
namespace was created by clone(), but if !arg_privileged then it was
actually created by nsresource_allocate_userns() and switched into by
setns(). Fix those inaccuracies.
When trying to word it clearly, there are enough commas and nested clauses
that I think it's clearer to break it into a list/table.
pcrlock: be more careful when preparing credential name for pcrlock policy
The .cred suffix is stripped from a credential as it is imported from
the ESP, hence it should not be included in the credential name embedded
in the credential.
journald: mention the access mode we tried to open /dev/kmsg in
Let's make clearer what we are going to use /dev/kmsg for: read/write or just
writing. This hopefully should avoid confusion, such as the one #33975
is result of.
(Also while we are at it, add one extra debug message).
cryptenroll/cryptsetup: allow combined signed TPM2 PCR policy + pcrlock policy
So far you had to pick:
1. Use a signed PCR TPM2 policy to lock your disk to (i.e. UKI vendor
blesses your setup via signature)
or
2. Use a pcrlock policy (i.e. local system blesses your setup via
dynamic local policy stored in NV index)
It was not possible combine these two, because TPM2 access policies do
not allow the combination of PolicyAuthorize (used to implement #1
above) and PolicyAuthorizeNV (used to implement #2) in a single policy,
unless one is "further upstream" (and can simply remove the other from
the policy freely).
This is quite limiting of course, since we actually do want to enforce
on each TPM object that both the OS vendor policy and the local policy
must be fulfilled, without the chance for the vendor or the local system
to disable the other.
This patch addresses this: instead of trying to find a way to come up
with some adventurous scheme to combine both policy into one TPM2
policy, we simply shard the symmetric LUKS decryption key: one half we
protect via the signed PCR policy, and the other we protect via the
pcrlock policy. Only if both halves can be acquired the disk can be
decrypted.
This means:
1. we simply double the unlock key in length in case both policies shall
be used.
2. We store two resulting TPM policy hashes in the LUKS token JSON, one
for each policy
3. We store two sealed TPM policy key blobs in the LUKS token JSON, for
both halves of the LUKS unlock key.
This patch keeps the "sharding" logic relatively generic (i.e. the low
level logic is actually fine with more than 2 shards), because I figure
sooner or later we might have to encode more shards, for example if we
add further TPM2-based access policies, for example when combining FIDO2
with TPM2, or implementing TOTP for this.
Apparently _PATH_UTMPX is a glibc'ism. UTMPX_FILE is the same thing and
what everyone else uses. Since they are otherwise equivalent, let's just
switch.
We generally use utmpx instead of utmp (both are actually identical on
Linux, but utmpx is POSIX, while utmp is not). Let's fix one left-over
case.
UT_NAMESIZE does not exist in utmpx world, it has no direct counterpart,
hence let's just sizeof_field() to determine the size of the actual
field. (which comes to the same result of course: 32).
In https://github.com/systemd/systemd/pull/5283/commits/924453c22599cc246746a0233b2f52a27ade0819
ProtectHome was set to true for systemd-coredump in order to reduce risk, since an attacker could craft a malicious binary in order to compromise systemd-coredump.
At that point the object analysis was done in the main systemd-coredump process.
Because of this systemd-coredump is unable to product symbolicated call-stacks for binaries running under /home ("n/a" is shown instead of function names).
However, later in https://github.com/systemd/systemd/commit/61aea456c12c54f49c4a76259af130e576130ce9 systemd-coredump was changed to do the object analysis in a forked process,
covering those security concerns.
Let's set ProtectHome to read-only so that systemd-coredump produces symbolicated call-stacks for processes running under /home.
nspawn: only remount /usr/ with idmap when --volatile=yes
The root directory is already mounted with a picked UID shift, hence
it is not necessary to remount with idmap. However, /usr/ is a bind-mount,
hence it must be remounted with idmap.
With this change, now '-U --volatile=yes' works fine.
nspawn: mount /var/ after remount_idmap() when --volatile=state
Previously, remount_idmap() failed as /var/ was already mounted, thus
remounting (strictly speaking, unmounting old root directory) failed
with -EBUSY.
As tmpfs /var/ is mounted with picked UID shift, it should not be
remounted with idmap, but needs to be mounted after the root directory
being remounted.
This makes '-U --volatile=state' work as expected.
This also
- rename variable n -> address,
- use log_syntax_parse_error() where applicable,
- add one more assertion for lvalue in config_parse_address().
hwdb: Mark Apple Wireless keyboards as not having NumLock LED
Mark those Apple Wireless keyboards as not having a NumLock key:
https://en.wikipedia.org/wiki/Apple_Wireless_Keyboard
You can see that they don't have a NumLock LED because they didn't even
have a NumLock in the first place:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0fea6fe7d5ef1b5fa5f78048d4729f85181c04ca
The identifier 'stdin' is reserved in C. It can be #defined to any
statement that evaluates to a FILE*. We do not want that for our field,
so change to a more descriptive name.
Peter Rajnoha [Thu, 5 Sep 2024 10:31:20 +0000 (12:31 +0200)]
udev: allow persistent storage rules for rbd devices
The RADOS Block Device (rbd) can be used as any other block device with
further layers on top of it, hence allow the common persistent storage
rules to apply, including watching for changes.
time-util: rework localtime_or_gmtime() into localtime_or_gmtime_usec()
We typically want to deal in usec_t, hence let's change the prototype
accordingly, and do proper range checks. Also, make sure are not
confused by negative times.
Do something similar for mktime_or_timegm().
This is a more comprehensive alternative to #34065