Daan De Meyer [Tue, 9 Jan 2024 10:24:18 +0000 (11:24 +0100)]
Only run mount --make-rslave / if we didn't unshare a user namespace
When unsharing a mount namespace in a different user namespace than
the parent mount namespace, all mounts are marked as slave by default
so we don't need to explicitly mark all of them as slave mounts.
Daan De Meyer [Mon, 8 Jan 2024 22:31:37 +0000 (23:31 +0100)]
Simplify apivfs_cmd() and chroot_cmd()
We move the setpgid logic to run(), avoiding the need to pass a tools
argument to chroot_cmd() and apivfs_cmd().
We also try to remove as much logic from these functions as possible.
We can't really assume that any check we make while building the
command line will still hold true in the sandbox, so it's best to
delay such checks until we're already in the sandbox (using the
--ro-bind-try options of bubblewrap).
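As a rough illustration of the idea (not the actual mkosi code), the
sketch below builds bubblewrap options without checking on the host
whether the paths exist. The helper name ro_bind_try_options() is
hypothetical; --ro-bind-try is the bubblewrap option that skips the mount
when the source path is missing.

```python
from pathlib import Path


def ro_bind_try_options(paths: list[Path]) -> list[str]:
    """Build bubblewrap options that bind-mount each path read-only.

    Hypothetical helper for illustration only. No existence check is done
    here: --ro-bind-try makes bubblewrap skip sources that do not exist at
    the time the sandbox is set up.
    """
    options: list[str] = []
    for path in paths:
        # Mount the path at the same location inside the sandbox.
        options += ["--ro-bind-try", str(path), str(path)]
    return options
```

The resulting list can be appended to a bwrap command line as-is, which
keeps the Python side free of logic that might not hold in the sandbox.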
We also rework the /etc/resolv.conf handling to simply make sure that
/run/systemd/resolve exists in the chroot, since if /etc/resolv.conf
points to /run it'll almost certainly point to
/run/systemd/resolve/stub-resolv.conf.
Daan De Meyer [Mon, 8 Jan 2024 15:56:31 +0000 (16:56 +0100)]
Use /work for host scripts as well
Now that everything runs sandboxed, /work is free to use for host
scripts as well. At the same time, let's stop unconditionally
mounting the current working directory when running build scripts.
To keep things working smoothly, we'll make mounting the current
working directory the default value for BuildSources= instead.
Daan De Meyer [Mon, 8 Jan 2024 14:52:15 +0000 (15:52 +0100)]
Don't use host's /var/tmp in sandbox
Instead, use a subdirectory of the host's /var/tmp. Because we want
to limit the lifetime of this directory to the lifetime of the sandbox,
we use a shell command to create and remove the directory.
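A minimal sketch of tying a scratch directory's lifetime to a wrapped
command with a shell one-liner. The function name, the directory naming
scheme and the use of TMPDIR are assumptions made for illustration; a real
sandbox would more likely bind-mount the directory over /var/tmp.

```python
import uuid


def with_private_var_tmp(cmd: list[str], base: str = "/var/tmp") -> list[str]:
    """Wrap cmd so it runs with a private scratch directory under base.

    Hypothetical helper. The directory is passed to the shell as $0, which
    avoids any quoting of the path inside the script itself.
    """
    scratch = f"{base}/sandbox-{uuid.uuid4().hex}"
    script = (
        'mkdir "$0" && '              # create the scratch directory
        'trap \'rm -rf "$0"\' EXIT && '  # remove it when the shell exits
        'TMPDIR="$0" "$@"; '          # run the wrapped command
        'exit $?'                     # propagate the command's exit status
    )
    return ["sh", "-c", script, scratch, *cmd]
```

Because the cleanup lives in the shell's EXIT trap, the directory is
removed when the wrapper exits, whether the wrapped command succeeds or
fails.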
Daan De Meyer [Mon, 8 Jan 2024 14:21:01 +0000 (15:21 +0100)]
Put tmpfs on /tmp in sandbox when not in relaxed mode
Let's sandbox more by not using the host's /tmp but instead putting
a fresh tmpfs on /tmp. We used the host's /tmp before because the
definitions could potentially be in the host's /tmp but now that we
mount everything in explicitly that isn't a problem anymore.
Daan De Meyer [Tue, 2 Jan 2024 07:37:40 +0000 (08:37 +0100)]
Use bubblewrap to set up the tools tree instead of doing it ourselves
The problem with overmounting the host's /usr (in a private mount
namespace) is that we have no control over the symlinks in the root
directory (/lib, /bin, /lib64) and if these symlinks don't match
between the host distribution and the tools tree distribution, all
kinds of weird breakage starts happening. For example, using Fedora
tools trees on Arch Linux is currently broken because /lib64 on Arch
Linux points to /usr/lib whereas on Fedora it points to /usr/lib64.
Because we can't (and shouldn't) modify the symlinks of the host's
root filesystem, we need to set up the tools tree in a sandbox that
we chroot into, so that we have full control over the rootfs of the
sandbox and can make sure the symlinks are correct. Luckily, we
already do just that with bubblewrap, except that currently we mount
the tools tree over /usr ourselves and then just carry that over into
the bubblewrap sandbox.
Instead, we stop mounting over the host's /usr ourselves and have
bubblewrap pick the right /usr itself. We also copy the symlinks from
the tools tree or the host if there is no tools tree.
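Copying the root symlinks could look roughly like the sketch below, which
reads the links from a tree (the tools tree, or / for the host) and turns
them into bubblewrap --symlink options. The function name is hypothetical
and the list of names is limited to the three mentioned above.

```python
import os
from pathlib import Path


def root_symlink_options(tree: Path) -> list[str]:
    """Recreate the tree's top-level symlinks inside the sandbox.

    Hypothetical helper for illustration; only /lib, /bin and /lib64 are
    considered here.
    """
    options: list[str] = []
    for name in ("lib", "bin", "lib64"):
        link = tree / name
        if link.is_symlink():
            # bubblewrap's --symlink takes the link target first,
            # then the location of the link inside the sandbox.
            options += ["--symlink", os.readlink(link), f"/{name}"]
    return options
```

With a Fedora tools tree this would yield a /lib64 link pointing to
usr/lib64 regardless of where the host's /lib64 points.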
Because we don't mount over the host's /usr anymore, we have to run
every tool that should come from the tools tree with bubblewrap now.
The side effect of this is that almost all of our tools now run
sandboxed. We also have to make use of find_binary() everywhere
instead of shutil.which() to make sure we look for binaries in the
tools tree when required. Various other codepaths that look into /usr
also have to be modified to look into the tools tree when needed.
Also, because we no longer unshare the user namespace in the main mkosi
process, we can get rid of a lot of chown() calls in qemu.py, and
opening the qemu device file descriptors can be moved into
run_qemu() itself.
We also don't have to make sure all python modules are loaded anymore
as the host's /usr is never overmounted so the required python modules
will be available for the entire runtime of the mkosi process.
Because virtiofsd is now executed with bubblewrap, we use bubblewrap
to set up the required uidmap instead of relying on virtiofsd to do it
with newuidmap/newgidmap. Note that this breaks RuntimeTrees= as
virtiofsd unconditionally tries to drop groups with setgroups() which
fails with EPERM in an unprivileged user namespace set up by bubblewrap.
This is fixed by https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/207
which is still awaiting review.
To make this work codewise, this commit renames the bwrap() function
to sandbox_cmd() (similar to chroot_cmd() and apivfs_cmd()) which now
returns a command line instead of executing the command itself. run()
is modified to take an extra "sandbox" argument which is simply the
part of the full command that sets up the sandbox. Context and Config
both learn new sandbox() methods which set up the sandbox for each
object respectively (mostly by adding extra bind mounts).
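The split between building the sandbox command line and running a command
inside it can be sketched as follows. The signatures and the chosen mounts
are assumptions for illustration, not mkosi's real API; a usable sandbox
needs more mounts than shown here.

```python
import subprocess
from collections.abc import Sequence


def sandbox_cmd(*, binds: Sequence[tuple[str, str]] = ()) -> list[str]:
    """Return the command line prefix that sets up the sandbox.

    Illustrative only: nothing is executed here, the caller decides when
    and how to run the result.
    """
    cmdline = [
        "bwrap",
        "--ro-bind", "/usr", "/usr",
        "--dev", "/dev",
        "--proc", "/proc",
        "--tmpfs", "/tmp",
    ]
    for src, dst in binds:
        # Extra writable bind mounts requested by the caller.
        cmdline += ["--bind", src, dst]
    return cmdline


def run(cmd: Sequence[str], *, sandbox: Sequence[str] = ()) -> subprocess.CompletedProcess[str]:
    """Run cmd, prefixed by whatever command line sets up the sandbox."""
    return subprocess.run([*sandbox, *cmd], check=True, text=True, stdout=subprocess.PIPE)
```

A caller would then write something like
run(cmd, sandbox=sandbox_cmd(binds=[(src, dst)])), choosing the mounts per
tool.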
Because almost every call to run() now takes a sandbox, this gives us
a lot of control over the individual environment for each tool we run.
We make use of this to restrict each tool we run to the minimal possible
sandbox that the tool needs to run. By explicitly mounting in the
required paths for each tool we run, we also make sure these are always
available instead of relying on some other mount happening to contain
the tool's input.
Because we allow passing arbitrary options to mkosi qemu, mkosi boot and
various other verbs, we run these verbs with a relaxed sandbox, where we
mount in most directories from the host. This means that whatever
directories users specify will be available.
In terms of CI, the extra sandboxing means that our previous approach of
building various systemd binaries from source and symlinking them to
/usr/bin doesn't work anymore. Instead, we opt to always use tools trees
and drop the host builds from the testing matrix. This also simplifies
and speeds up the github action as we don't have to compile systemd and
xfsprogs from source and we have to install fewer packages.
Daan De Meyer [Fri, 5 Jan 2024 13:01:26 +0000 (14:01 +0100)]
Don't copy xattrs from mkosi.extra and friends
These directories and files might have selinux xattrs and such that
we don't want to end up in the image so let's make sure that we don't
copy xattrs from skeleton and extra trees.
Daan De Meyer [Thu, 4 Jan 2024 12:17:27 +0000 (13:17 +0100)]
Add RuntimeScratch= setting
When booting output formats that reside almost entirely in memory
(initrd, UKI, ESP), doing any kind of write heavy operation in the
booted VM has a high chance of leading to OOM errors as all files
will be written in memory.
When booting disk images, unless one is using RuntimeSize=, one will
often run into disk space issues when writing lots of data.
When booting off virtiofs and doing write heavy operations, virtiofsd
can run out of file descriptors or become very slow.
To allow doing write heavy operations in all these scenarios, let's
add RuntimeScratch= which mounts extra scratch space to /var/tmp that
can be used for write heavy operations.
Daan De Meyer [Fri, 5 Jan 2024 08:23:55 +0000 (09:23 +0100)]
Fix importlib usage
We have to use as_file() on the final path, not the module path.
Because as_file() only learned to support directories in python 3.12,
we backport the 3.12 implementation temporarily in mkosi itself.
Because as_file() does not apply the executable bit, we apply it
ourselves after parsing the config. This requires delaying the check
if scripts are executable to some later point so we can parse the
config without failing because scripts are not executable.
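A minimal sketch of both points, using only importlib.resources calls
available since Python 3.9. The helper names are hypothetical, and the
backported directory support mentioned above is not reproduced here.

```python
import contextlib
from collections.abc import Iterator
from importlib.resources import as_file, files
from pathlib import Path


@contextlib.contextmanager
def resource_path(package: str, *parts: str) -> Iterator[Path]:
    """Yield a filesystem path for a resource inside a package.

    as_file() is applied to the final path, not to the package itself.
    """
    resource = files(package)
    for part in parts:
        resource = resource / part
    with as_file(resource) as path:
        yield path


def make_executable(path: Path) -> None:
    """Add the executable bit wherever the read bit is already set."""
    mode = path.stat().st_mode
    path.chmod(mode | ((mode & 0o444) >> 2))
```

For example, a script extracted with resource_path() can be passed to
make_executable() once configuration parsing has finished.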
Daan De Meyer [Thu, 4 Jan 2024 14:59:24 +0000 (15:59 +0100)]
ci: Disable jobs with arch linux tools trees for now.
Arch has qemu 8.2 which has severely broken TCG acceleration
(see https://gitlab.com/qemu-project/qemu/-/issues/2070). Let's disable
the jobs with arch tools trees until the bug is fixed.
Daan De Meyer [Wed, 3 Jan 2024 21:14:31 +0000 (22:14 +0100)]
Allow building default ubuntu image for jammy
Useful for debugging CI failures since CI also runs jammy.
We also make sure the shared configuration is included after the
distribution specific configuration so we can set defaults in the
distribution specific configuration and use it in the shared
configuration.
Daan De Meyer [Wed, 3 Jan 2024 15:33:21 +0000 (16:33 +0100)]
initrd: Install util-linux-core on Fedora
With https://bodhi.fedoraproject.org/updates/FEDORA-2023-7ba9a1b546,
sulogin is moved to util-linux-core on Fedora which means util-linux-core
has everything required for the initrd so let's use util-linux-core instead
of util-linux to save on disk space.
Daan De Meyer [Wed, 3 Jan 2024 13:46:16 +0000 (14:46 +0100)]
Only mount cache overlay if base trees are specified and Overlay= is not enabled
The setup() method of some distributions creates files in the root
directory which means that checking if the root directory is empty
doesn't work. Instead, let's check if any base trees were specified
explicitly.
Daan De Meyer [Wed, 3 Jan 2024 13:44:09 +0000 (14:44 +0100)]
Cache skeleton trees
These are only intended for files that affect package manager
operation so we should be able to cache this step without any issues
since if the skeleton tree is changed, users are likely going to want
to throw away their cache regardless.
Daan De Meyer [Tue, 2 Jan 2024 16:11:12 +0000 (17:11 +0100)]
Unshare fewer namespaces
These were primarily unshared to get the systemd unit test suite passing.
Now that the systemd test suite passes even if these are not unshared,
let's stop unsharing them as they don't make much sense for the operations
we're doing and nspawn doesn't run when some of these are unshared.
Daan De Meyer [Mon, 1 Jan 2024 16:49:08 +0000 (17:49 +0100)]
Rename various symbols
- Let's get rid of the Mkosi prefix everywhere. Python has namespaced
modules for a reason, let's make use of that.
- Let's also rename State to Context, to match systemd where Context
is generally used as well instead of State.
Daan De Meyer [Tue, 2 Jan 2024 20:55:05 +0000 (21:55 +0100)]
Preserve target directories' stat when copying extra/skeleton trees
When copying extra and skeleton trees, let's not touch the permissions
of directories that already exist in the image's root directory. 99%
of the time, the directories are only in the extra tree to make sure
the files go in the right directory in the image's root directory and
serve no other purpose so it makes sense to ignore their metadata in
this case.
Because cp does not support this natively (either all permissions are
copied for directories and files or none are copied), we implement this
ourselves by saving the necessary permissions before we call cp and
restoring them afterwards.
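The save-and-restore approach can be sketched as below. Note the
substitutions: shutil.copytree() stands in for cp (it also overwrites the
metadata of directories that already exist), only the mode is restored,
and the function name is hypothetical.

```python
import os
import shutil
import stat
from pathlib import Path


def copy_tree_keep_existing_dir_modes(src: Path, dst: Path) -> None:
    """Copy src into dst without changing modes of pre-existing directories."""
    saved: dict[Path, int] = {}
    # os.walk() does not descend into symlinked directories by default.
    for root, _, _ in os.walk(src):
        target = dst / Path(root).relative_to(src)
        if target.is_dir() and not target.is_symlink():
            # Remember the mode of every directory that already exists.
            saved[target] = stat.S_IMODE(target.stat().st_mode)

    # Stand-in for cp: copies files and directory metadata from src.
    shutil.copytree(src, dst, symlinks=True, dirs_exist_ok=True)

    # Put back the modes the copy may have overwritten.
    for target, mode in saved.items():
        target.chmod(mode)
```

Directories that did not exist in the destination before the copy keep
whatever metadata the copy gave them.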
In (at least) Debian, some binaries such as awk point to
/etc/alternatives, which would not exist and would cause apt-key to fail
without specifying the exact keyring (e.g. when using /etc/apt/trusted.gpg.d).
Disable debsig for dpkg by default as they do in Debian.
From the default dpkg.conf:
# Do not enable debsig-verify by default; since the distribution is not using
# embedded signatures, debsig-verify would reject all packages.
no-debsig
Daan De Meyer [Fri, 22 Dec 2023 14:29:06 +0000 (15:29 +0100)]
Mount entire /etc from package manager tree into sandbox
Instead of mounting individual directories, let's just mount the
entire /etc into the sandbox. This allows any tool we run through
the sandbox to pick up configuration from the package manager tree
without having to add explicit support for it in mkosi.
This also removes our special casing for uki.conf. ukify will now
pick up its configuration from its canonical location just like all
the other tools.
Daan De Meyer [Fri, 22 Dec 2023 11:18:40 +0000 (12:18 +0100)]
Mount package manager trees
Now that /etc and /var are free game when running within bwrap()
because we don't mount in the directories from the host anymore,
let's take advantage of that by mounting all our package manager
configuration to the canonical location in /etc instead of configuring
the package managers via their CLI or config file to look in the
right directory.
This also makes us look for rpm configuration in /etc/rpm instead
of /usr/lib/rpm as that's now possible.
Malte Poll [Fri, 22 Dec 2023 11:41:10 +0000 (12:41 +0100)]
bubblewrap: try to mount /nix/store readonly
Similar to most usrmerged systems, NixOS stores all installed
binaries and libraries in /nix/store.
To make mkosi work on NixOS, the nix store should be mounted by default.
Co-authored-by: Paul Meyer <49727155+katexochen@users.noreply.github.com>
Daan De Meyer [Thu, 21 Dec 2023 15:00:44 +0000 (16:00 +0100)]
Run more binaries with bwrap()
Let's sandbox more of the image build. This isolates more of the
build from the host which reduces the chance of leaking in host
specific details into the image.
Daan De Meyer [Wed, 20 Dec 2023 20:31:56 +0000 (21:31 +0100)]
Sandbox more in bwrap()
Let's not make the full root filesystem available to commands
running in bwrap(). Instead, limit it to some select directories.
- /usr
- Various directories from /etc. Note that this also means we can
  get rid of mount_tools() as all these directories are now mounted
  in bwrap() instead. This also allows us to get rid of the overlay
  hack in mount_tools() to create the necessary mount points. The
  goal is to get rid of as many of these as possible over time.
- /var/tmp
- /tmp
Because we have to pass MkosiConfig into bwrap() to make this work,
we split off a new file bubblewrap.py with all the bubblewrap stuff.
To avoid having to import MkosiState and bwrap() into tree.py,
install_tree() is moved to __init__.py.
Daan De Meyer [Thu, 21 Dec 2023 10:07:36 +0000 (11:07 +0100)]
Run depmod and modinfo on host again
Running these in the chroot is much slower when building images for
another architecture. Also, we might soon have a way to prevent dnf
from running depmod (see
https://gitlab.com/cki-project/kernel-ark/-/merge_requests/2743), so
let's adopt that when it is merged.