Define 'microcode' file type for the kernel-install staging area.
steps:
- uses: actions/checkout@8e5e7e5ab8b370d6c329ec480221332ada57f0ab
- - uses: systemd/mkosi@87a6b70ea9ab529b95fc91306fc8151175999dca
+ - uses: systemd/mkosi@77cde8a1252767ffc448bfb0fa237fb586a689a2
- name: Configure
run: |
[Host]
ExtraSearchPaths=!*
QemuVsock=yes
+ Ephemeral=yes
EOF
# For erofs, we have to install linux-modules-extra-azure, but that doesn't match the running kernel
Features:
+* add another PE section ".fname" or so that encodes the intended filename for
+ the PE file, and validate that when loading add-ons and similar before using
+ it. This is particularly relevant when we load multiple add-ons and want to
+ sort them to apply them in a defined order. The order should not be under
+ control of the attacker.
+
* also include packaging metadata (á la
https://systemd.io/ELF_PACKAGE_METADATA/) in our UEFI PE binaries, using the
same JSON format.
2. `SetCredential=` may be used to set a credential to a literal string encoded
in the unit file. Because unit files are world-readable (both on disk and
via D-Bus), this should only be used for credentials that aren't sensitive,
- i.e. public keys/certificates – but not private keys.
+ e.g. public keys or certificates, but not private keys.
3. `LoadCredentialEncrypted=` is similar to `LoadCredential=` but will load an
encrypted credential, and decrypt it before passing it to the service. For
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly b DefaultBlockIOAccounting = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
+ readonly b DefaultIOAccounting = ...;
+ @org.freedesktop.DBus.Property.EmitsChangedSignal("const")
+ readonly b DefaultIPAccounting = ...;
+ @org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly b DefaultMemoryAccounting = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly b DefaultTasksAccounting = ...;
<!--property DefaultBlockIOAccounting is not documented!-->
+ <!--property DefaultIOAccounting is not documented!-->
+
+ <!--property DefaultIPAccounting is not documented!-->
+
<!--property DefaultMemoryAccounting is not documented!-->
<!--property DefaultTasksAccounting is not documented!-->
<variablelist class="dbus-property" generated="True" extra-ref="DefaultBlockIOAccounting"/>
+ <variablelist class="dbus-property" generated="True" extra-ref="DefaultIOAccounting"/>
+
+ <variablelist class="dbus-property" generated="True" extra-ref="DefaultIPAccounting"/>
+
<variablelist class="dbus-property" generated="True" extra-ref="DefaultMemoryAccounting"/>
<variablelist class="dbus-property" generated="True" extra-ref="DefaultTasksAccounting"/>
with a focus on implementing stateless operating system images.</para></listitem>
</varlistentry>
</variablelist>
+ </refsect2>
- </refsect2><refsect2>
+ <refsect2>
<title>Input/Output Options</title>
<variablelist>
</varlistentry>
</variablelist>
- </refsect2><refsect2>
- <title>Credentials</title>
-
- <variablelist>
- <varlistentry>
- <term><option>--load-credential=</option><replaceable>ID</replaceable>:<replaceable>PATH</replaceable></term>
- <term><option>--set-credential=</option><replaceable>ID</replaceable>:<replaceable>VALUE</replaceable></term>
-
- <listitem><para>Pass a credential to the container. These two options correspond to the
- <varname>LoadCredential=</varname> and <varname>SetCredential=</varname> settings in unit files. See
- <citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry> for
- details about these concepts, as well as the syntax of the option's arguments.</para>
-
- <para>Note: when <command>systemd-nspawn</command> runs as systemd system service it can propagate
- the credentials it received via <varname>LoadCredential=</varname>/<varname>SetCredential=</varname>
- to the container payload. A systemd service manager running as PID 1 in the container can further
- propagate them to the services it itself starts. It is thus possible to easily propagate credentials
- from a parent service manager to a container manager service and from there into its payload. This
- can even be done recursively.</para>
-
- <para>In order to embed binary data into the credential data for <option>--set-credential=</option>
- use C-style escaping (i.e. <literal>\n</literal> to embed a newline, or <literal>\x00</literal> to
- embed a <constant>NUL</constant> byte. Note that the invoking shell might already apply unescaping
- once, hence this might require double escaping!).</para>
-
- <para>The
- <citerefentry><refentrytitle>systemd-sysusers.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
- and
- <citerefentry><refentrytitle>systemd-firstboot</refentrytitle><manvolnum>1</manvolnum></citerefentry>
- services read credentials configured this way for the purpose of configuring the container's root
- user's password and shell, as well as system locale, keymap and timezone during the first boot
- process of the container. This is particularly useful in combination with
- <option>--volatile=yes</option> where every single boot appears as first boot, since configuration
- applied to <filename>/etc/</filename> is lost on container reboot cycles. See the respective man
- pages for details. Example:</para>
-
- <programlisting># systemd-nspawn -i image.raw \
- --volatile=yes \
- --set-credential=firstboot.locale:de_DE.UTF-8 \
- --set-credential=passwd.hashed-password.root:'$y$j9T$yAuRJu1o5HioZAGDYPU5d.$F64ni6J2y2nNQve90M/p0ZP0ECP/qqzipNyaY9fjGpC' \
- -b</programlisting>
-
- <para>The above command line will invoke the specified image file <filename>image.raw</filename> in
- volatile mode, i.e. with empty <filename>/etc/</filename> and <filename>/var/</filename>. The
- container payload will recognize this as a first boot, and will invoke
- <filename>systemd-firstboot.service</filename>, which then reads the two passed credentials to
- configure the system's initial locale and root password.</para>
- </listitem>
+ </refsect2>
+ <refsect2>
+ <title>Credentials</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>--load-credential=</option><replaceable>ID</replaceable>:<replaceable>PATH</replaceable></term>
+ <term><option>--set-credential=</option><replaceable>ID</replaceable>:<replaceable>VALUE</replaceable></term>
+
+ <listitem><para>Pass a credential to the container. These two options correspond to the
+ <varname>LoadCredential=</varname> and <varname>SetCredential=</varname> settings in unit files. See
+ <citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry> for
+ details about these concepts, as well as the syntax of the option's arguments.</para>
+
+ <para>Note: when <command>systemd-nspawn</command> runs as a systemd system service, it can propagate
+ the credentials it received via <varname>LoadCredential=</varname>/<varname>SetCredential=</varname>
+ to the container payload. A systemd service manager running as PID 1 in the container can further
+ propagate them to the services it itself starts. It is thus possible to easily propagate credentials
+ from a parent service manager to a container manager service and from there into its payload. This
+ can even be done recursively.</para>
+
+ <para>In order to embed binary data into the credential data for <option>--set-credential=</option>,
+ use C-style escaping (i.e. <literal>\n</literal> to embed a newline, or <literal>\x00</literal> to
+ embed a <constant>NUL</constant> byte). Note that the invoking shell might already apply unescaping
+ once, hence this might require double escaping!</para>
+
+ <para>The
+ <citerefentry><refentrytitle>systemd-sysusers.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
+ and
+ <citerefentry><refentrytitle>systemd-firstboot</refentrytitle><manvolnum>1</manvolnum></citerefentry>
+ services read credentials configured this way for the purpose of configuring the container's root
+ user's password and shell, as well as system locale, keymap and timezone during the first boot
+ process of the container. This is particularly useful in combination with
+ <option>--volatile=yes</option> where every single boot appears as first boot, since configuration
+ applied to <filename>/etc/</filename> is lost on container reboot cycles. See the respective man
+ pages for details. Example:</para>
+
+ <programlisting># systemd-nspawn -i image.raw \
+ --volatile=yes \
+ --set-credential=firstboot.locale:de_DE.UTF-8 \
+ --set-credential=passwd.hashed-password.root:'$y$j9T$yAuRJu1o5HioZAGDYPU5d.$F64ni6J2y2nNQve90M/p0ZP0ECP/qqzipNyaY9fjGpC' \
+ -b</programlisting>
+
+ <para>The above command line will invoke the specified image file <filename>image.raw</filename> in
+ volatile mode, i.e. with empty <filename>/etc/</filename> and <filename>/var/</filename>. The
+ container payload will recognize this as a first boot, and will invoke
+ <filename>systemd-firstboot.service</filename>, which then reads the two passed credentials to
+ configure the system's initial locale and root password.</para>
+ </listitem>
</varlistentry>
-
- </variablelist>
+ </variablelist>
</refsect2><refsect2>
<title>Other</title>
revoked with a SBAT policy update, without requiring blocklisting via DBX/MOKX. The
<citerefentry><refentrytitle>ukify</refentrytitle><manvolnum>1</manvolnum></citerefentry> tool will
add a SBAT policy by default if none is passed when building addons. For more information on SBAT see
- <ulink url="https://github.com/rhboot/shim/blob/main/SBAT.md">Shim's documentation.</ulink>
+ <ulink url="https://github.com/rhboot/shim/blob/main/SBAT.md">Shim's documentation</ulink>.
Addons are supposed to be used to pass additional kernel command line parameters, regardless of the
kernel image being booted, for example to allow platform vendors to ship platform-specific
configuration. The loaded command line addon files are sorted, loaded, measured into TPM PCR 12 (if a
<command>cpio</command> archive and placed in the <filename>/.extra/global_credentials/</filename>
directory of the initrd file hierarchy. This is supposed to be used to pass additional credentials to
the initrd, regardless of the kernel being booted. The generated <command>cpio</command> archive is
- measured into TPM PCR 12 (if a TPM is present)</para></listitem>
+ measured into TPM PCR 12 (if a TPM is present).</para></listitem>
<listitem><para>Additionally, files <filename>/loader/addons/*.addon.efi</filename> are loaded and
verified as PE binaries, and a <literal>.cmdline</literal> section is parsed from them. This is
<term><varname>DefaultIOAccounting=</varname></term>
<term><varname>DefaultIPAccounting=</varname></term>
- <listitem><para>Configure the default resource accounting settings, as configured per-unit by
+ <listitem>
+ <para>Configure the default resource accounting settings, as configured per-unit by
<varname>CPUAccounting=</varname>, <varname>MemoryAccounting=</varname>,
<varname>TasksAccounting=</varname>, <varname>IOAccounting=</varname> and
<varname>IPAccounting=</varname>. See
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>
- for details on the per-unit settings. <varname>DefaultTasksAccounting=</varname> defaults to yes,
- <varname>DefaultMemoryAccounting=</varname> to &MEMORY_ACCOUNTING_DEFAULT;.
- <varname>DefaultCPUAccounting=</varname> defaults to yes, but really has no effect if enabling CPU
- accounting doesn't require the <option>cpu</option> controller to be enabled (Linux 4.15+ using the
- unified hierarchy for resource control), otherwise it defaults to no. The other three settings
- default to no.</para></listitem>
+ for details on the per-unit settings.</para>
+
+ <para><varname>DefaultCPUAccounting=</varname> defaults to yes when running on kernel ≥4.15, and no on older versions.
+ <varname>DefaultMemoryAccounting=</varname> defaults to &MEMORY_ACCOUNTING_DEFAULT;.
+ <varname>DefaultTasksAccounting=</varname> defaults to yes.
+ The other settings default to no.</para>
+ </listitem>
</varlistentry>
<varlistentry>
<para>When multiple credentials of the same name are found, credentials found by
<varname>LoadCredential=</varname> and <varname>LoadCredentialEncrypted=</varname> take priority over
- credentials found by <varname>ImportCredential=</varname></para></listitem>.
+ credentials found by <varname>ImportCredential=</varname>.</para></listitem>
</varlistentry>
<varlistentry>
<refsect1>
<title>Options</title>
- <para>Units of the types listed above can have settings
- for resource control configuration:</para>
+ <para>Units of the types listed above can have settings for resource control configuration:</para>
+
+ <refsect2><title>CPU Accounting and Control</title>
<variablelist class='unit-directives'>
</listitem>
</varlistentry>
- <varlistentry>
- <term><varname>AllowedMemoryNodes=</varname></term>
- <term><varname>StartupAllowedMemoryNodes=</varname></term>
-
- <listitem>
- <para>These settings control the <option>cpuset</option> controller in the unified hierarchy.</para>
-
- <para>Restrict processes to be executed on specific memory NUMA nodes. Takes a list of memory NUMA nodes indices
- or ranges separated by either whitespace or commas. Memory NUMA nodes ranges are specified by the lower and upper
- NUMA nodes indices separated by a dash.</para>
-
- <para>Setting <varname>AllowedMemoryNodes=</varname> or <varname>StartupAllowedMemoryNodes=</varname> doesn't
- guarantee that all of the memory NUMA nodes will be used by the processes as it may be limited by parent units.
- The effective configuration is reported as <varname>EffectiveMemoryNodes=</varname>.</para>
+ </variablelist>
- <para>While <varname>StartupAllowedMemoryNodes=</varname> applies to the startup and shutdown phases of the system,
- <varname>AllowedMemoryNodes=</varname> applies to normal runtime of the system, and if the former is not set also to
- the startup and shutdown phases. Using <varname>StartupAllowedMemoryNodes=</varname> allows prioritizing specific services at
- boot-up and shutdown differently than during normal runtime.</para>
+ </refsect2><refsect2><title>Memory Accounting and Control</title>
- <para>This setting is supported only with the unified control group hierarchy.</para>
- </listitem>
- </varlistentry>
+ <variablelist class='unit-directives'>
<varlistentry>
<term><varname>MemoryAccounting=</varname></term>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>AllowedMemoryNodes=</varname></term>
+ <term><varname>StartupAllowedMemoryNodes=</varname></term>
+
+ <listitem>
+ <para>These settings control the <option>cpuset</option> controller in the unified hierarchy.</para>
+
+ <para>Restrict processes to be executed on specific memory NUMA nodes. Takes a list of memory NUMA nodes indices
+ or ranges separated by either whitespace or commas. Memory NUMA nodes ranges are specified by the lower and upper
+ NUMA nodes indices separated by a dash.</para>
+
+ <para>Setting <varname>AllowedMemoryNodes=</varname> or <varname>StartupAllowedMemoryNodes=</varname> doesn't
+ guarantee that all of the memory NUMA nodes will be used by the processes as it may be limited by parent units.
+ The effective configuration is reported as <varname>EffectiveMemoryNodes=</varname>.</para>
+
+ <para>While <varname>StartupAllowedMemoryNodes=</varname> applies to the startup and shutdown phases of the system,
+ <varname>AllowedMemoryNodes=</varname> applies to normal runtime of the system, and if the former is not set also to
+ the startup and shutdown phases. Using <varname>StartupAllowedMemoryNodes=</varname> allows prioritizing specific services at
+ boot-up and shutdown differently than during normal runtime.</para>
+
+ <para>This setting is supported only with the unified control group hierarchy.</para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </refsect2><refsect2><title>Process Accounting and Control</title>
+
+ <variablelist class='unit-directives'>
+
<varlistentry>
<term><varname>TasksAccounting=</varname></term>
</listitem>
</varlistentry>
+ </variablelist>
+
+ </refsect2><refsect2><title>IO Accounting and Control</title>
+
+ <variablelist class='unit-directives'>
+
<varlistentry>
<term><varname>IOAccounting=</varname></term>
</listitem>
</varlistentry>
+ </variablelist>
+
+ </refsect2><refsect2><title>Network Accounting and Control</title>
+
+ <variablelist class='unit-directives'>
+
<varlistentry>
<term><varname>IPAccounting=</varname></term>
</listitem>
</varlistentry>
- <varlistentry>
- <term><varname>IPIngressFilterPath=<replaceable>BPF_FS_PROGRAM_PATH</replaceable></varname></term>
- <term><varname>IPEgressFilterPath=<replaceable>BPF_FS_PROGRAM_PATH</replaceable></varname></term>
-
- <listitem>
- <para>Add custom network traffic filters implemented as BPF programs, applying to all IP packets
- sent and received over <constant>AF_INET</constant> and <constant>AF_INET6</constant> sockets.
- Takes an absolute path to a pinned BPF program in the BPF virtual filesystem (<filename>/sys/fs/bpf/</filename>).
- </para>
-
- <para>The filters configured with this option are applied to all sockets created by processes
- of this unit (or in the case of socket units, associated with it). The filters are loaded in addition
- to filters any of the parent slice units this unit might be a member of as well as any
- <varname>IPAddressAllow=</varname> and <varname>IPAddressDeny=</varname> filters in any of these units.
- By default there are no filters specified.</para>
-
- <para>If these settings are used multiple times in the same unit all the specified programs are attached. If an
- empty string is assigned to these settings the program list is reset and all previous specified programs ignored.</para>
-
- <para>If the path <replaceable>BPF_FS_PROGRAM_PATH</replaceable> in <varname>IPIngressFilterPath=</varname> assignment
- is already being handled by <varname>BPFProgram=</varname> ingress hook, e.g.
- <varname>BPFProgram=</varname><constant>ingress</constant>:<replaceable>BPF_FS_PROGRAM_PATH</replaceable>,
- the assignment will be still considered valid and the program will be attached to a cgroup. Same for
- <varname>IPEgressFilterPath=</varname> path and <constant>egress</constant> hook.</para>
-
- <para>Note that for socket-activated services, the IP filter programs configured on the socket unit apply to
- all sockets associated with it directly, but not to any sockets created by the ultimately activated services
- for it. Conversely, the IP filter programs configured for the service are not applied to any sockets passed into
- the service via socket activation. Thus, it is usually a good idea, to replicate the IP filter programs on both
- the socket and the service unit, however it often makes sense to maintain one configuration more open and the other
- one more restricted, depending on the usecase.</para>
-
- <para>Note that these settings might not be supported on some systems (for example if eBPF control group
- support is not enabled in the underlying kernel or container manager). These settings will fail the service in
- that case. If compatibility with such systems is desired it is hence recommended to attach your filter manually
- (requires <varname>Delegate=</varname><constant>yes</constant>) instead of using this setting.</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term><varname>BPFProgram=<replaceable>type</replaceable><constant>:</constant><replaceable>program-path</replaceable></varname></term>
- <listitem>
- <para>Add a custom cgroup BPF program.</para>
-
- <para><varname>BPFProgram=</varname> allows attaching BPF hooks to the cgroup of a systemd unit.
- (This generalizes the functionality exposed via <varname>IPEgressFilterPath=</varname> for egress and
- <varname>IPIngressFilterPath=</varname> for ingress.)
- Cgroup-bpf hooks in the form of BPF programs loaded to the BPF filesystem are attached with cgroup-bpf attach
- flags determined by the unit. For details about attachment types and flags see <ulink
- url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/include/uapi/linux/bpf.h"/>.
- For general BPF documentation please refer to <ulink url="https://docs.kernel.org/bpf/index.html"/>.</para>
-
- <para>The specification of BPF program consists of a <replaceable>type</replaceable> followed by a
- <replaceable>program-path</replaceable> with <literal>:</literal> as the separator:
- <replaceable>type</replaceable><constant>:</constant><replaceable>program-path</replaceable>.</para>
-
- <para><replaceable>type</replaceable> is the string name of BPF attach type also used in
- <command>bpftool</command>. <replaceable>type</replaceable> can be one of <constant>egress</constant>,
- <constant>ingress</constant>, <constant>sock_create</constant>, <constant>sock_ops</constant>,
- <constant>device</constant>, <constant>bind4</constant>, <constant>bind6</constant>,
- <constant>connect4</constant>, <constant>connect6</constant>, <constant>post_bind4</constant>,
- <constant>post_bind6</constant>, <constant>sendmsg4</constant>, <constant>sendmsg6</constant>,
- <constant>sysctl</constant>, <constant>recvmsg4</constant>, <constant>recvmsg6</constant>,
- <constant>getsockopt</constant>, <constant>setsockopt</constant>.</para>
-
- <para>Setting <varname>BPFProgram=</varname> to an empty value makes previous assignments ineffective.</para>
- <para>Multiple assignments of the same <replaceable>type</replaceable>:<replaceable>program-path</replaceable>
- value have the same effect as a single assignment: the program with the path <replaceable>program-path</replaceable>
- will be attached to cgroup hook <replaceable>type</replaceable> just once.</para>
- <para>If BPF <constant>egress</constant> pinned to <replaceable>program-path</replaceable> path is already being
- handled by <varname>IPEgressFilterPath=</varname>, <varname>BPFProgram=</varname>
- assignment will be considered valid and <varname>BPFProgram=</varname> will be attached to a cgroup.
- Similarly for <constant>ingress</constant> hook and <varname>IPIngressFilterPath=</varname> assignment.</para>
-
- <para>BPF programs passed with <varname>BPFProgram=</varname> are attached to the cgroup of a unit with BPF
- attach flag <constant>multi</constant>, that allows further attachments of the same
- <replaceable>type</replaceable> within cgroup hierarchy topped by the unit cgroup.</para>
-
- <para>Examples:<programlisting>
-BPFProgram=egress:/sys/fs/bpf/egress-hook
-BPFProgram=bind6:/sys/fs/bpf/sock-addr-hook
-</programlisting></para>
- </listitem>
- </varlistentry>
-
<varlistentry>
<term><varname>SocketBindAllow=<replaceable>bind-rule</replaceable></varname></term>
<term><varname>SocketBindDeny=<replaceable>bind-rule</replaceable></varname></term>
</listitem>
</varlistentry>
+ </variablelist>
+
+ </refsect2><refsect2><title>BPF Programs</title>
+
+ <variablelist class='unit-directives'>
+
+ <varlistentry>
+ <term><varname>IPIngressFilterPath=<replaceable>BPF_FS_PROGRAM_PATH</replaceable></varname></term>
+ <term><varname>IPEgressFilterPath=<replaceable>BPF_FS_PROGRAM_PATH</replaceable></varname></term>
+
+ <listitem>
+ <para>Add custom network traffic filters implemented as BPF programs, applying to all IP packets
+ sent and received over <constant>AF_INET</constant> and <constant>AF_INET6</constant> sockets.
+ Takes an absolute path to a pinned BPF program in the BPF virtual filesystem (<filename>/sys/fs/bpf/</filename>).
+ </para>
+
+ <para>The filters configured with this option are applied to all sockets created by processes
+ of this unit (or in the case of socket units, associated with it). The filters are loaded in addition
+ to the filters of any parent slice units this unit might be a member of, as well as any
+ <varname>IPAddressAllow=</varname> and <varname>IPAddressDeny=</varname> filters in any of these units.
+ By default there are no filters specified.</para>
+
+ <para>If these settings are used multiple times in the same unit all the specified programs are attached. If an
+ empty string is assigned to these settings the program list is reset and all previously specified programs are ignored.</para>
+
+ <para>If the path <replaceable>BPF_FS_PROGRAM_PATH</replaceable> in <varname>IPIngressFilterPath=</varname> assignment
+ is already being handled by <varname>BPFProgram=</varname> ingress hook, e.g.
+ <varname>BPFProgram=</varname><constant>ingress</constant>:<replaceable>BPF_FS_PROGRAM_PATH</replaceable>,
+ the assignment will be still considered valid and the program will be attached to a cgroup. Same for
+ <varname>IPEgressFilterPath=</varname> path and <constant>egress</constant> hook.</para>
+
+ <para>Note that for socket-activated services, the IP filter programs configured on the socket unit apply to
+ all sockets associated with it directly, but not to any sockets created by the ultimately activated services
+ for it. Conversely, the IP filter programs configured for the service are not applied to any sockets passed into
+ the service via socket activation. Thus, it is usually a good idea to replicate the IP filter programs on both
+ the socket and the service unit; however, it often makes sense to maintain one configuration more open and the other
+ one more restricted, depending on the use case.</para>
+
+ <para>Note that these settings might not be supported on some systems (for example if eBPF control group
+ support is not enabled in the underlying kernel or container manager). These settings will fail the service in
+ that case. If compatibility with such systems is desired it is hence recommended to attach your filter manually
+ (requires <varname>Delegate=</varname><constant>yes</constant>) instead of using this setting.</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><varname>BPFProgram=<replaceable>type</replaceable><constant>:</constant><replaceable>program-path</replaceable></varname></term>
+ <listitem>
+ <para>Add a custom cgroup BPF program.</para>
+
+ <para><varname>BPFProgram=</varname> allows attaching BPF hooks to the cgroup of a systemd unit.
+ (This generalizes the functionality exposed via <varname>IPEgressFilterPath=</varname> for egress and
+ <varname>IPIngressFilterPath=</varname> for ingress.)
+ Cgroup-bpf hooks in the form of BPF programs loaded to the BPF filesystem are attached with cgroup-bpf attach
+ flags determined by the unit. For details about attachment types and flags see <ulink
+ url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/include/uapi/linux/bpf.h"/>.
+ For general BPF documentation please refer to <ulink url="https://docs.kernel.org/bpf/index.html"/>.</para>
+
+ <para>The specification of BPF program consists of a <replaceable>type</replaceable> followed by a
+ <replaceable>program-path</replaceable> with <literal>:</literal> as the separator:
+ <replaceable>type</replaceable><constant>:</constant><replaceable>program-path</replaceable>.</para>
+
+ <para><replaceable>type</replaceable> is the string name of BPF attach type also used in
+ <command>bpftool</command>. <replaceable>type</replaceable> can be one of <constant>egress</constant>,
+ <constant>ingress</constant>, <constant>sock_create</constant>, <constant>sock_ops</constant>,
+ <constant>device</constant>, <constant>bind4</constant>, <constant>bind6</constant>,
+ <constant>connect4</constant>, <constant>connect6</constant>, <constant>post_bind4</constant>,
+ <constant>post_bind6</constant>, <constant>sendmsg4</constant>, <constant>sendmsg6</constant>,
+ <constant>sysctl</constant>, <constant>recvmsg4</constant>, <constant>recvmsg6</constant>,
+ <constant>getsockopt</constant>, <constant>setsockopt</constant>.</para>
+
+ <para>Setting <varname>BPFProgram=</varname> to an empty value makes previous assignments ineffective.</para>
+ <para>Multiple assignments of the same <replaceable>type</replaceable>:<replaceable>program-path</replaceable>
+ value have the same effect as a single assignment: the program with the path <replaceable>program-path</replaceable>
+ will be attached to cgroup hook <replaceable>type</replaceable> just once.</para>
+ <para>If BPF <constant>egress</constant> pinned to <replaceable>program-path</replaceable> path is already being
+ handled by <varname>IPEgressFilterPath=</varname>, <varname>BPFProgram=</varname>
+ assignment will be considered valid and <varname>BPFProgram=</varname> will be attached to a cgroup.
+ Similarly for <constant>ingress</constant> hook and <varname>IPIngressFilterPath=</varname> assignment.</para>
+
+ <para>BPF programs passed with <varname>BPFProgram=</varname> are attached to the cgroup of a unit with BPF
+ attach flag <constant>multi</constant>, that allows further attachments of the same
+ <replaceable>type</replaceable> within cgroup hierarchy topped by the unit cgroup.</para>
+
+ <para>Examples:<programlisting>
+BPFProgram=egress:/sys/fs/bpf/egress-hook
+BPFProgram=bind6:/sys/fs/bpf/sock-addr-hook
+</programlisting></para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </refsect2><refsect2><title>Device Access</title>
+
+ <variablelist class='unit-directives'>
+
<varlistentry>
<term><varname>DeviceAllow=</varname></term>
</listitem>
</varlistentry>
+ </variablelist>
+
+ </refsect2><refsect2><title>Control Group Management</title>
+
+ <variablelist class='unit-directives'>
+
<varlistentry>
<term><varname>Slice=</varname></term>
</listitem>
</varlistentry>
+ </variablelist>
+
+ </refsect2><refsect2><title>Memory Pressure Control</title>
+
+ <variablelist class='unit-directives'>
+
<varlistentry>
<term><varname>ManagedOOMSwap=auto|kill</varname></term>
<term><varname>ManagedOOMMemoryPressure=auto|kill</varname></term>
details on the permitted syntax.</para></listitem>
</varlistentry>
</variablelist>
+ </refsect2>
</refsect1>
<refsect1>
SD_BUS_PROPERTY("DefaultStartLimitBurst", "u", bus_property_get_unsigned, offsetof(Manager, default_start_limit_burst), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultCPUAccounting", "b", bus_property_get_bool, offsetof(Manager, default_cpu_accounting), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultBlockIOAccounting", "b", bus_property_get_bool, offsetof(Manager, default_blockio_accounting), SD_BUS_VTABLE_PROPERTY_CONST),
+ SD_BUS_PROPERTY("DefaultIOAccounting", "b", bus_property_get_bool, offsetof(Manager, default_io_accounting), SD_BUS_VTABLE_PROPERTY_CONST),
+ SD_BUS_PROPERTY("DefaultIPAccounting", "b", bus_property_get_bool, offsetof(Manager, default_ip_accounting), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultMemoryAccounting", "b", bus_property_get_bool, offsetof(Manager, default_memory_accounting), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultTasksAccounting", "b", bus_property_get_bool, offsetof(Manager, default_tasks_accounting), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultLimitCPU", "t", bus_property_get_rlimit, offsetof(Manager, rlimit[RLIMIT_CPU]), SD_BUS_VTABLE_PROPERTY_CONST),
return;
s->main_pid = 0;
- s->main_pid_known = false;
exec_status_exit(&s->main_exec_status, &s->exec_context, pid, code, status);
if (s->main_command) {
return 1;
}
+static int get_maximum_partition_size(
+ int fd,
+ struct fdisk_partition *p,
+ uint64_t *ret_maximum_partition_size) {
+
+ _cleanup_(fdisk_unref_contextp) struct fdisk_context *c = NULL;
+ uint64_t start_lba, start, last_lba, end;
+ int r;
+
+ assert(fd >= 0);
+ assert(p);
+ assert(ret_maximum_partition_size);
+
+ c = fdisk_new_context();
+ if (!c)
+ return log_oom();
+
+ r = fdisk_assign_device(c, FORMAT_PROC_FD_PATH(fd), 0);
+ if (r < 0)
+ return log_error_errno(r, "Failed to open device: %m");
+
+ start_lba = fdisk_partition_get_start(p);
+ assert(start_lba <= UINT64_MAX/512);
+ start = start_lba * 512;
+
+ last_lba = fdisk_get_last_lba(c); /* One sector before boundary where usable space ends */
+ assert(last_lba < UINT64_MAX/512);
+ end = DISK_SIZE_ROUND_DOWN((last_lba + 1) * 512); /* Round down to multiple of 4K */
+
+ if (start > end)
+ return log_error_errno(SYNTHETIC_ERRNO(EBADMSG), "Last LBA is before partition start.");
+
+ *ret_maximum_partition_size = DISK_SIZE_ROUND_DOWN(end - start);
+
+ return 1;
+}
+
static int ask_cb(struct fdisk_context *c, struct fdisk_ask *ask, void *userdata) {
char *result;
setup->partition_offset + setup->partition_size > old_image_size)
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Old partition doesn't fit in backing storage, refusing.");
+ /* Fetch the target partition information here, since the new_partition_size calculation below needs it */
+ r = prepare_resize_partition(
+ image_fd,
+ setup->partition_offset,
+ setup->partition_size,
+ &disk_uuid,
+ &table,
+ &partition);
+ if (r < 0)
+ return r;
+
if (S_ISREG(st.st_mode)) {
uint64_t partition_table_extra, largest_size;
new_partition_size = 0;
intention = INTENTION_SHRINK;
} else {
- uint64_t new_partition_size_rounded;
+ uint64_t new_partition_size_rounded = DISK_SIZE_ROUND_DOWN(h->disk_size);
- new_partition_size_rounded = DISK_SIZE_ROUND_DOWN(h->disk_size);
+ if (h->disk_size == UINT64_MAX && partition) {
+ r = get_maximum_partition_size(image_fd, partition, &new_partition_size_rounded);
+ if (r < 0)
+ return r;
+ }
if (setup->partition_size >= new_partition_size_rounded &&
setup->partition_size <= h->disk_size) {
special_glyph(SPECIAL_GLYPH_ARROW_RIGHT),
FORMAT_BYTES(new_fs_size));
- r = prepare_resize_partition(
- image_fd,
- setup->partition_offset,
- setup->partition_size,
- &disk_uuid,
- &table,
- &partition);
- if (r < 0)
- return r;
-
if (new_fs_size > old_fs_size) { /* → Grow */
if (S_ISREG(st.st_mode)) {
#include "parse-util.h"
#include "pretty-print.h"
#include "sigbus.h"
+#include "signal-util.h"
#include "tmpfile-util.h"
#define JOURNAL_WAIT_TIMEOUT (10*USEC_PER_SEC)
static int run(int argc, char *argv[]) {
_cleanup_(MHD_stop_daemonp) struct MHD_Daemon *d = NULL;
+ static const struct sigaction sigterm = {
+ .sa_handler = nop_signal_handler,
+ .sa_flags = SA_RESTART,
+ };
struct MHD_OptionItem opts[] = {
{ MHD_OPTION_EXTERNAL_LOGGER,
(intptr_t) microhttpd_logger, NULL },
return r;
sigbus_install();
+ assert_se(sigaction(SIGTERM, &sigterm, NULL) >= 0);
r = setup_gnutls_logger(NULL);
if (r < 0)
r = bus_wait_for_jobs_new(bus, &w);
if (r < 0)
- return log_oom();
+ return log_error_errno(r, "Could not watch jobs: %m");
for (int i = 1; i < argc; i++) {
_cleanup_(sd_bus_message_unrefp) sd_bus_message *reply = NULL;
}
#if ENABLE_POLKIT
-static int bus_message_append_strv_key_value(
- sd_bus_message *m,
- const char **l) {
-
+static int bus_message_append_strv_key_value(sd_bus_message *m, const char **l) {
int r;
assert(m);
return r;
}
+
+static int bus_message_new_polkit_auth_call(
+ sd_bus_message *m,
+ const char *action,
+ const char **details,
+ bool interactive,
+ sd_bus_message **ret) {
+
+ _cleanup_(sd_bus_message_unrefp) sd_bus_message *c = NULL;
+ const char *sender;
+ int r;
+
+ assert(m);
+ assert(action);
+ assert(ret);
+
+ sender = sd_bus_message_get_sender(m);
+ if (!sender)
+ return -EBADMSG;
+
+ r = sd_bus_message_new_method_call(
+ ASSERT_PTR(m->bus),
+ &c,
+ "org.freedesktop.PolicyKit1",
+ "/org/freedesktop/PolicyKit1/Authority",
+ "org.freedesktop.PolicyKit1.Authority",
+ "CheckAuthorization");
+ if (r < 0)
+ return r;
+
+ r = sd_bus_message_append(c, "(sa{sv})s", "system-bus-name", 1, "name", "s", sender, action);
+ if (r < 0)
+ return r;
+
+ r = bus_message_append_strv_key_value(c, details);
+ if (r < 0)
+ return r;
+
+ r = sd_bus_message_append(c, "us", interactive, NULL);
+ if (r < 0)
+ return r;
+
+ *ret = TAKE_PTR(c);
+ return 0;
+}
#endif
int bus_test_polkit(
r = sd_bus_query_sender_privilege(call, capability);
if (r < 0)
return r;
- else if (r > 0)
+ if (r > 0)
return 1;
-#if ENABLE_POLKIT
- else {
- _cleanup_(sd_bus_message_unrefp) sd_bus_message *request = NULL;
- _cleanup_(sd_bus_message_unrefp) sd_bus_message *reply = NULL;
- int authorized = false, challenge = false;
- const char *sender;
-
- sender = sd_bus_message_get_sender(call);
- if (!sender)
- return -EBADMSG;
-
- r = sd_bus_message_new_method_call(
- call->bus,
- &request,
- "org.freedesktop.PolicyKit1",
- "/org/freedesktop/PolicyKit1/Authority",
- "org.freedesktop.PolicyKit1.Authority",
- "CheckAuthorization");
- if (r < 0)
- return r;
-
- r = sd_bus_message_append(
- request,
- "(sa{sv})s",
- "system-bus-name", 1, "name", "s", sender,
- action);
- if (r < 0)
- return r;
-
- r = bus_message_append_strv_key_value(request, details);
- if (r < 0)
- return r;
- r = sd_bus_message_append(request, "us", 0, NULL);
- if (r < 0)
- return r;
+#if ENABLE_POLKIT
+ _cleanup_(sd_bus_message_unrefp) sd_bus_message *request = NULL, *reply = NULL;
+ int authorized = false, challenge = false;
- r = sd_bus_call(call->bus, request, 0, ret_error, &reply);
- if (r < 0) {
- /* Treat no PK available as access denied */
- if (bus_error_is_unknown_service(ret_error)) {
- sd_bus_error_free(ret_error);
- return -EACCES;
- }
+ r = bus_message_new_polkit_auth_call(call, action, details, /* interactive = */ false, &request);
+ if (r < 0)
+ return r;
- return r;
+ r = sd_bus_call(call->bus, request, 0, ret_error, &reply);
+ if (r < 0) {
+ /* Treat no PK available as access denied */
+ if (bus_error_is_unknown_service(ret_error)) {
+ sd_bus_error_free(ret_error);
+ return -EACCES;
}
- r = sd_bus_message_enter_container(reply, 'r', "bba{ss}");
- if (r < 0)
- return r;
+ return r;
+ }
- r = sd_bus_message_read(reply, "bb", &authorized, &challenge);
- if (r < 0)
- return r;
+ r = sd_bus_message_enter_container(reply, 'r', "bba{ss}");
+ if (r < 0)
+ return r;
- if (authorized)
- return 1;
+ r = sd_bus_message_read(reply, "bb", &authorized, &challenge);
+ if (r < 0)
+ return r;
- if (_challenge) {
- *_challenge = challenge;
- return 0;
- }
+ if (authorized)
+ return 1;
+
+ if (_challenge) {
+ *_challenge = challenge;
+ return 0;
}
#endif
sd_event_source *defer_event_source;
} AsyncPolkitQuery;
-static void async_polkit_query_free(AsyncPolkitQuery *q) {
+static AsyncPolkitQuery *async_polkit_query_free(AsyncPolkitQuery *q) {
if (!q)
- return;
+ return NULL;
sd_bus_slot_unref(q->slot);
strv_free(q->details);
sd_event_source_disable_unref(q->defer_event_source);
- free(q);
+
+ return mfree(q);
}
static int async_polkit_defer(sd_event_source *s, void *userdata) {
- AsyncPolkitQuery *q = userdata;
+ AsyncPolkitQuery *q = ASSERT_PTR(userdata);
assert(s);
return r;
}
+static int process_polkit_response(
+ AsyncPolkitQuery *q,
+ sd_bus_message *call,
+ const char *action,
+ const char **details,
+ Hashmap **registry,
+ sd_bus_error *ret_error) {
+
+ int authorized, challenge, r;
+
+ assert(q);
+ assert(call);
+ assert(action);
+ assert(registry);
+ assert(ret_error);
+
+ assert(q->action);
+ assert(q->reply);
+
+ /* If the operation we want to authenticate changed between the first and the second time,
+ * let's not use this authentication: it might be out of date, as the object and context we
+ * operate on might have changed. */
+ if (!streq(q->action, action) || !strv_equal(q->details, (char**) details))
+ return -ESTALE;
+
+ if (sd_bus_message_is_method_error(q->reply, NULL)) {
+ const sd_bus_error *e;
+
+ e = sd_bus_message_get_error(q->reply);
+
+ /* Treat no PK available as access denied */
+ if (bus_error_is_unknown_service(e))
+ return -EACCES;
+
+ /* Copy error from polkit reply */
+ sd_bus_error_copy(ret_error, e);
+ return -sd_bus_error_get_errno(e);
+ }
+
+ r = sd_bus_message_enter_container(q->reply, 'r', "bba{ss}");
+ if (r >= 0)
+ r = sd_bus_message_read(q->reply, "bb", &authorized, &challenge);
+ if (r < 0)
+ return r;
+
+ if (authorized)
+ return 1;
+
+ if (challenge)
+ return sd_bus_error_set(ret_error, SD_BUS_ERROR_INTERACTIVE_AUTHORIZATION_REQUIRED, "Interactive authentication required.");
+
+ return -EACCES;
+}
+
#endif
int bus_verify_polkit_async(
Hashmap **registry,
sd_bus_error *ret_error) {
- const char *sender;
int r;
assert(call);
#if ENABLE_POLKIT
AsyncPolkitQuery *q = hashmap_get(*registry, call);
- if (q) {
- int authorized, challenge;
-
- /* This is the second invocation of this function, and there's already a response from
- * polkit, let's process it */
- assert(q->reply);
-
- /* If the operation we want to authenticate changed between the first and the second time,
- * let's not use this authentication, it might be out of date as the object and context we
- * operate on might have changed. */
- if (!streq(q->action, action) ||
- !strv_equal(q->details, (char**) details))
- return -ESTALE;
-
- if (sd_bus_message_is_method_error(q->reply, NULL)) {
- const sd_bus_error *e;
-
- e = sd_bus_message_get_error(q->reply);
-
- /* Treat no PK available as access denied */
- if (bus_error_is_unknown_service(e))
- return -EACCES;
-
- /* Copy error from polkit reply */
- sd_bus_error_copy(ret_error, e);
- return -sd_bus_error_get_errno(e);
- }
-
- r = sd_bus_message_enter_container(q->reply, 'r', "bba{ss}");
- if (r >= 0)
- r = sd_bus_message_read(q->reply, "bb", &authorized, &challenge);
- if (r < 0)
- return r;
-
- if (authorized)
- return 1;
-
- if (challenge)
- return sd_bus_error_set(ret_error, SD_BUS_ERROR_INTERACTIVE_AUTHORIZATION_REQUIRED, "Interactive authentication required.");
-
- return -EACCES;
- }
+ /* This is the second invocation of this function, and there's already a response from
+ * polkit, let's process it */
+ if (q)
+ return process_polkit_response(q, call, action, details, registry, ret_error);
#endif
r = sd_bus_query_sender_privilege(call, capability);
if (r < 0)
return r;
- else if (r > 0)
+ if (r > 0)
return 1;
- sender = sd_bus_message_get_sender(call);
- if (!sender)
- return -EBADMSG;
-
#if ENABLE_POLKIT
_cleanup_(sd_bus_message_unrefp) sd_bus_message *pk = NULL;
if (r < 0)
return r;
- r = sd_bus_message_new_method_call(
- call->bus,
- &pk,
- "org.freedesktop.PolicyKit1",
- "/org/freedesktop/PolicyKit1/Authority",
- "org.freedesktop.PolicyKit1.Authority",
- "CheckAuthorization");
- if (r < 0)
- return r;
-
- r = sd_bus_message_append(
- pk,
- "(sa{sv})s",
- "system-bus-name", 1, "name", "s", sender,
- action);
- if (r < 0)
- return r;
-
- r = bus_message_append_strv_key_value(pk, details);
- if (r < 0)
- return r;
-
- r = sd_bus_message_append(pk, "us", interactive, NULL);
+ r = bus_message_new_polkit_auth_call(call, action, details, interactive, &pk);
if (r < 0)
return r;
/* Change into the new rootfs. */
if (fchdir(fd_newroot) < 0)
- return log_debug_errno(errno, "Failed to change into new rootfs '%s': %m", path);
+ return log_debug_errno(errno, "Failed to chdir into new rootfs '%s': %m", path);
/* Let the kernel tuck the new root under the old one. */
if (pivot_root(".", ".") < 0)
/* Change into the new rootfs. */
if (fchdir(fd_newroot) < 0)
- return log_debug_errno(errno, "Failed to change into new rootfs '%s': %m", path);
+ return log_debug_errno(errno, "Failed to chdir into new rootfs '%s': %m", path);
/* Move the new root fs */
if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
return log_debug_errno(errno, "Failed to move new rootfs '%s': %m", path);
- /* Also change chroot dir */
+ /* Also change root dir */
if (chroot(".") < 0)
return log_debug_errno(errno, "Failed to chroot to new rootfs '%s': %m", path);
if get_bool "$IS_BUILT_WITH_COVERAGE"; then
mkdir -p "$initdir/etc/systemd/system/service.d/"
echo -ne "[Service]\nProtectSystem=no\nProtectHome=no\n" >"$initdir/etc/systemd/system/service.d/99-gcov-override.conf"
- # Similarly, set ReadWritePaths= to the $BUILD_DIR in the test image
- # to make the coverage work with units using DynamicUser=yes. Do this
- # only for services with test- prefix, as setting this system-wide
- # has many undesirable side-effects, as it creates its own namespace.
- mkdir -p "$initdir/etc/systemd/system/test-.service.d/"
- echo -ne "[Service]\nReadWritePaths=${BUILD_DIR:?}\n" >"$initdir/etc/systemd/system/test-.service.d/99-gcov-rwpaths-override.conf"
+ # Similarly, set ReadWritePaths= to the $BUILD_DIR in the test image to make coverage work with
+ # units using DynamicUser=yes. Do this only for services with the test- prefix and a couple of
+ # services known to use DynamicUser=yes, since setting this system-wide makes every service
+ # run in its own namespace, which has many undesirable side-effects.
+ for service in test- systemd-journal-{gatewayd,upload}; do
+ mkdir -p "$initdir/etc/systemd/system/$service.service.d/"
+ echo -ne "[Service]\nReadWritePaths=${BUILD_DIR:?}\n" >"$initdir/etc/systemd/system/$service.service.d/99-gcov-rwpaths-override.conf"
+ done
# Ditto, but for the user daemon
mkdir -p "$initdir/etc/systemd/user/test-.service.d/"
echo -ne "[Service]\nReadWritePaths=${BUILD_DIR:?}\n" >"$initdir/etc/systemd/user/test-.service.d/99-gcov-rwpaths-override.conf"
# nsswitch.conf uses [SUCCESS=merge] (like on Arch Linux)
# delv, dig - pull in nss_resolve if `resolve` is in nsswitch.conf
# tar - called by machinectl in TEST-25
- bin_rx='/(agetty|chown|delv|dig|getfacl|getent|id|login|ls|mkfs\.[a-z0-9]+|mksquashfs|mkswap|setfacl|setpriv|stat|su|tar|useradd|userdel)$'
+ bin_rx='/(agetty|chown|curl|delv|dig|getfacl|getent|id|login|ls|mkfs\.[a-z0-9]+|mksquashfs|mkswap|setfacl|setpriv|stat|su|tar|useradd|userdel)$'
if get_bool "$IS_BUILT_WITH_ASAN" && [[ "$bin" =~ $bin_rx ]]; then
wrap_binary=1
fi