bpf-firewall: close gap when updating the firewall
If we have BPF_F_ALLOW_MULTI support we can install the new program
before we drop the old (because we can install two program at the same
time). Let's do that, and thus fully close the firewall
gap.
Mostly logging related: let's downgrade logging in dlopen_bpf() for
example, and remove duplicate logging at various places. Add %m to log
messages and so on.
bpf-firewall: move destruction of IP firewall objects to bpf-firewall.c
These are so many runtime objects, let's add a bpf_firewall_close()
helper that destroys them all, and call that from unit_free(), simply as
an excercise of encapsulating more BPF code in bpf-firewall.c.
This also brings the destruction order and variable declaration order in
struct Unit into the same systematic order.
No change in behaviour just some minor refactoring.
In dns_server_unlink_marked() and dns_server_mark_all() we done recursively.
People might have dozens of servers defined, and it's better to avoid recursion
when a simple loop suffices.
dns_server_unlink_marked() would only unmark the first marked server.
tmpfiles: extend "Age" to accept an "age-by" argument
For "systemd-tmpfiles --cleanup", when the "Age" parameter
is specified, the criteria for deletion is determined from
the path's last modification timestamp ("mtime"), its last
access timestamp ("atime") and its last status change
timestamp ("ctime").
For instance, if one of those paths to be cleaned up are
opened, it results in the modification of "atime", which
results file system entry to not be removed because the
default aging algorithm would skip the entry.
Add an optional "age-by" argument by extending the "Age"
parameter to restrict the clean-up for a particular type
of file timestamp, which can be specified in "tmpfiles.d"
as follows:
[age-by:]cleanup-age, where age-by is "[abcmACBM]+"
For example:
d /foo/bar - - - abM:1m -
Would clean-up any files that were not accessed and created,
or directories that were not modified less than a minute ago
in "/foo/bar".
Allen Webb [Tue, 30 Mar 2021 14:37:11 +0000 (09:37 -0500)]
tmpfiles: add '=' action modifier.
Add the '=' action modifier that instructs tmpfiles.d to check the file
type of a path and remove objects that do not match before trying to
open or create the path.
core: do not serialize mounts and automounts for switch-root
When e.g. tmp.mount is present in the initrd, and we serialize it, switch root,
and deserialize, the new systemd is confused because it thinks /tmp is mounted.
In general, it doesn't make sense to serialize anything that refers to paths in
the old root file system.
This fixes two errors for me:
1. tmp.mount was not mounted properly before local-fs.target. It would be
mounted as some point (I guess when we re-read /proc/self/mountinfo for some
other reason). In effect systemd-tmpfiles-setup.service would see one fs, and
some other units started later a different one. In particular gdm.service would
fail because the pre-created /tmp/.X11-unix with proper permissions would not
exist at time it was started.
2. # systemd[1]: proc-sys-fs-binfmt_misc.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
# systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed with result 'unmounted'.
# systemd[1]: Mounting proc-sys-fs-binfmt_misc.mount...
# systemd[1]: Mounted proc-sys-fs-binfmt_misc.mount.
# systemd[1]: Starting systemd-binfmt.service...
# systemd[1]: Finished systemd-binfmt.service.
# systemd[1]: proc-sys-fs-binfmt_misc.automount: Path /proc/sys/fs/binfmt_misc is already a mount point, refusing start.
# systemd[1]: Failed to set up automount proc-sys-fs-binfmt_misc.automount.
# systemd[1]: proc-sys-fs-binfmt_misc.automount: Path /proc/sys/fs/binfmt_misc is already a mount point, refusing start.
# systemd[1]: Failed to set up automount proc-sys-fs-binfmt_misc.automount.
# systemd[1]: proc-sys-fs-binfmt_misc.automount: Path /proc/sys/fs/binfmt_misc is already a mount point, refusing start.
# systemd[1]: Failed to set up automount proc-sys-fs-binfmt_misc.automount.
# systemd[1]: Stopping systemd-binfmt.service...
# systemd[1]: systemd-binfmt.service: Deactivated successfully.
# systemd[1]: Stopped systemd-binfmt.service.
I couldn't understand the error here, but in retrospect the first line is entirely
correct: "someone or something else" was the old systemd unmounting the old root.
Luca Boccassi [Fri, 12 Mar 2021 20:17:09 +0000 (20:17 +0000)]
coredump: check cgroups memory limit if storing on tmpfs
When /var/lib/systemd/coredump/ is backed by a tmpfs, all disk usage
will be accounted under the systemd-coredump process cgroup memory
limit.
If MemoryMax is set, this might cause systemd-coredump to be terminated
by the kernel oom handler when writing large uncompressed core files,
even if the compressed core would fit within the limits.
Detect if a tmpfs is used, and if so check MemoryMax from the process
and slice cgroups, and do not write uncompressed core files that are
greater than half the available memory. If the limit is breached,
stop writing and compress the written chunk immediately, then delete
the uncompressed chunk to free more memory, and resume compressing
directly from STDIN.
Example debug log when this situation happens:
systemd-coredump[737455]: Setting max_size to limit writes to 51344896 bytes.
systemd-coredump[737455]: ZSTD compression finished (51344896 -> 3260 bytes, 0.0%)
systemd-coredump[737455]: ZSTD compression finished (1022786048 -> 47245 bytes, 0.0%)
systemd-coredump[737455]: Process 737445 (a.out) of user 1000 dumped core.
Luca Boccassi [Wed, 26 May 2021 18:16:48 +0000 (19:16 +0100)]
core: add MemoryAvailable unit property
Try to infer the unused memory that a unit can claim before the
memory.max limit is reached, including any limit set on any parent
slice above the unit itself.
We were effectively doing all post-upgrade scripts twice in Fedora. We got this
wrong, so it's likely other people will get it wrong too. So let's explain
what is actually needed to make this work, but also when it's not useful.
Yu Watanabe [Mon, 24 May 2021 05:59:09 +0000 (14:59 +0900)]
network: merge link_configure() and link_configure_continue() again
It is not necessary to stop whole configuration process until MTU and
IPv6LL address generation mode are set. But it is enough just setting
IPv6 MTU again after MTU is set, and dropping IPv6LL address after
setting the address generation mode.
Yu Watanabe [Tue, 18 May 2021 05:59:10 +0000 (14:59 +0900)]
network: make link enter failed state on failure in link_update() and link_reset_carrier()
Previously, several failures in link_carrier_gained() make link enter
failed state, and other errors are ignored. Now, all failures in
link_carrier_gained(), moreover, link_update() are critical.
install: allow adding plain templates to .wants/ or .requires/
Fixes #19437.
As reported in the bug:
> # drkonqi-coredump-processor@.service
> ...
> [Install]
> WantedBy=systemd-coredump@.service
>
> The plan here is to have a systemd-coredump@ instance start the same %i for
> drkonqi-coredump-processor@. Works perfectly when creating the symlink manually
> ln -sv /usr/lib/systemd/system/drkonqi-coredump-processor@.service
> /etc/systemd/system/systemd-coredump@.service.wants/.
When DefaultInstance is set, we replace template references with
template@default-inst. But in this case we want to create a symlink for the
template name, so that systemd will fill in the instance from the
wanting/requiring unit. This is only possible for those units that actually
have an instance set, so we create the symlink only from .requires/ or .wants
of an instantiated unit (then this specific instance will be used), or a
template (than some instance will be inherited later).
→ enable foo@.service creates other@.service.wants/foo@inst.service, and
other@a.service will want foo@inst.service, and other@b.service will want foo@inst.service,
and fixed.service will want foo@inst.service.
Without DefaultInstance,
→ enable foo@.service creates other@.service.wants/foo@.service, and
other@a.service would want foo@a.service, and other@b.service would want foo@b.service,
but enablement fails because no dependency can be created for fixed.service:
Failed to enable unit, unit fixed.service is a non-template unit.
Yu Watanabe [Mon, 7 Jun 2021 12:53:35 +0000 (21:53 +0900)]
network: address: always read address flag from IFA_FLAGS attribute
Otherwise, update flag become incomplete and the IFA_F_MANAGETEMPADDR flag
will not be stored, thus no temporary addresses will be removed when
networkd requests to remove the main address.
Initially I wanted to add ConditionPathExists=!/etc/initrd-release in various
units (ldconfig.service, systemd-sysusers.service, systemd-hwdb-update.service,
systemd-journal-catalog-update, systemd-update-done.service), but I think it's
better to just disable the mechanism in the initrd altogether. Initrd images
are put together in a very particular way, and there is not need to do
post-update steps on them. If a unit from some other package winds up in the
initrd, we wouldn't want to invoke it either.
Also, any modifications are ephemeral, so any update would happen on every
use. And finally, initrd images are all about speed, and we shouldn't invoke
any unneeded services.
core: downgrade errors about BPF loading when called from socket_bind_supported()
prepare_socket_bind_bpf() is called from two sites: socket_bind_supported() and
socket_bind_install_impl(). For the latter, when errors occur we certainly want
to log, since they'll be fatal for the unit. But for the former, we should be
quiet, at least on the "expected" errors like lack of permissions. I kept error
on map resizing and such, which should not fail, at log_warning(). They are not
fatal when called from socket_bind_suppported(), but still a sign that
something is off.
Currently BPF filters can only be used by privileged users. Thus each systemd
--user will fail in socket_bind_supported(). With the patch, we only log this
at debug level.
https://lwn.net/ml/bpf/cover.1620499942.git.yifeifz2@illinois.edu/ gives some
hope that unprivileged access will be possible, so let's keep the code trying.
We might get lucky and get support for filters in user mode without any changes
on our side.