resolved: rework how we handle truncation in the stub resolver
When we a reply message gets longer than the client supports we need to
truncate the response and set the TC bit, and we already do that.
However, we are not supposed to send incomplete RRs in that case, but
instead truncate right at a record boundary. Do that.
This fixes the "Message parser reports malformed message packet."
warning the venerable "host" tool outputs when a very large response is
requested.
dynamic-user: don't use a UID that currently owns IPC objects (#6962)
This fixes a mostly theoretical potential security hole: if for some
reason we failed to remove IPC objects created for a dynamic user (maybe
because a MAC/SElinux erronously prohibited), then we should not hand
out the same UID again until they are successfully removed.
With this commit we'll enumerate the IPC objects currently existing, and
step away from using a UID for the dynamic UID logic if there are any
matching it.
man: document which special "systemctl" commands are synchronous and which asynchronous.
This documents the status quo, clarifying when we are synchronous and
when asynchronous by default and when --no-block is support to force
asynchronous operation.
systemctl: make sure "reboot", "suspend" and friends are always asynchronous
Currently, "systemctl reboot" behaves differently in setups with and
without logind. If logind is used (which is probably the more common
case) the operation is asynchronous, and otherwise synchronous (though
subject to --no-block in this case). Let's clean this up, and always
expose the same behaviour, regardless if logind is used or not: let's
always make it asynchronous.
It might make sense to add a "--block" mode in a future PR that makes
these operations synchronous, but this requires non-trivial work in
logind, and is outside of the scope of this change.
This adds new method calls Halt() and CanHalt() to the logind bus APIs.
They aren't overly useful (as the whole concept of halting isn't really
too useful), however they clean up one major asymmetry: currently, using
the "shutdown" legacy commands it is possibly to enqueue a "halt"
operation through logind, while logind officially doesn't actually
support this. Moreover, the path through "shutdown" currently ultimately
fails, since the referenced "halt" action isn't actually defined in
PolicyKit.
Finally, the current logic results in an unexpected asymmetry in
systemctl: "systemctl poweroff", "systemctl reboot" are currently
asynchronous (due to the logind involvement) while "systemctl halt"
isnt. Let's clean this up, and make all three APIs implemented by
logind natively, and all three hence asynchronous in "systemctl".
Yu Watanabe [Wed, 4 Oct 2017 17:29:36 +0000 (02:29 +0900)]
nss-systemd: if cannot open bus, then try to read user info directly (#6971)
If sd_bus_open_system() fail, then try to read information about
dynamic users from /run/systemd/dynamic-uid.
This makes services can successfully call getpwuid() or their friends
even if dbus.service is not started yet.
Alan Jenkins [Tue, 3 Oct 2017 11:26:02 +0000 (12:26 +0100)]
logind: use pid_is_valid() where appropriate
These two sites _do_ match the definition of pid_is_valid(); they don't
provide any special handling for the invalid PID value 0. (They're used
by dbus methods, so the PID value 0 is handled with reference to the dbus
client creds, outside of these functions).
The second part of this hunk, avoiding using `%m`
when we didn't actually have `errno` set, seems
like a nice enough cleanup to be worthwhile on
it's own.
Also use PID_FMT to improve the error message we print
(pid_t is signed).
The configuration option was called -Dresolve, but the internal define
was …RESOLVED. This options governs more than just resolved itself, so
let's settle on the version without "d".
It broke almost everywhere it touched. The places that
handn't been converted, were mostly followed by special
handling for the invalid PID `0`. That explains why they
tested for `pid < 0` instead of `pid <= 0`.
I think that one was the first commit I reviewed, heh.
Djalal Harouni [Tue, 3 Oct 2017 05:20:05 +0000 (07:20 +0200)]
seccomp: remove '@credentials' syscall set (#6958)
This removes the '@credentials' syscall set that was added in commit v234-468-gcd0ddf6f75.
Most of these syscalls are so simple that we do not want to filter them.
They work on the current calling process, doing only read operations,
they do not have a deep kernel path.
The problem may only be in 'capget' syscall since it can query arbitrary
processes, and used to discover processes, however sending signal 0 to
arbitrary processes can be used to discover if a process exists or not.
It is unfortunate that Linux allows to query processes of different
users. Lets put it now in '@process' syscall set, and later we may add
it to a new '@basic-process' set that allows most basic process
operations.
Also, tests for DynamicUser= should really run for system mode, as we
allocate from a system resource.
(This also increases the test timeout to 2min. If one of our tests
really hangs then waiting for 2min longer doesn't hurt either. The old
2s is really short, given that we run in potentially slow VM
environments for this test. This becomes noticable when the slow "find"
command this adds is triggered)
core: when looking for a UID to use for a dynamic UID start with the current owner of the StateDirectory= and friends
Let's optimize dynamic UID allocation a bit: if a StateDirectory= (or
suchlike) is configured, we start our allocation loop from that UID and
use it if it currently isn't used otherwise. This is beneficial as it
saves us from having to expensively recursively chown() these
directories in the typical case (which StateDirectory= does when it
notices that the owner of the directory doesn't match the UID picked).
With this in place we now have the a three-phase logic for allocating a
dynamic UID:
a) first, we try to use the owning UID of StateDirectory=,
CacheDirectory=, LogDirectory= if that exists and is currently
otherwise unused.
b) if that didn't work out, we hash the UID from the service name
c) if that didn't yield an unused UID either, randomly pick new ones
until we find a free one.
execute: make StateDirectory= and friends compatible with DynamicUser=1 and RootDirectory=/RootImage=
Let's clean up the interaction of StateDirectory= (and friends) to
DynamicUser=1: instead of creating these directories directly below
/var/lib, place them in /var/lib/private instead if DynamicUser=1 is
set, making that directory 0700 and owned by root:root. This way, if a
dynamic UID is later reused, access to the old run's state directory is
prohibited for that user. Then, use file system namespacing inside the
service to make /var/lib/private a readable tmpfs, hiding all state
directories that are not listed in StateDirectory=, and making access to
the actual state directory possible. Mount all directories listed in
StateDirectory= to the same places inside the service (which means
they'll now be mounted into the tmpfs instance). Finally, add a symlink
from the state directory name in /var/lib/ to the one in
/var/lib/private, so that both the host and the service can access the
path under the same location.
Here's an example: let's say a service runs with StateDirectory=foo.
When DynamicUser=0 is set, it will get the following setup, and no
difference between what the unit and what the host sees:
/var/lib/foo (created as directory)
Now, if DynamicUser=1 is set, we'll instead get this on the host:
/var/lib/private (created as directory with mode 0700, root:root)
/var/lib/private/foo (created as directory)
/var/lib/foo → private/foo (created as symlink)
And from inside the unit:
/var/lib/private (a tmpfs mount with mode 0755, root:root)
/var/lib/private/foo (bind mounted from the host)
/var/lib/foo → private/foo (the same symlink as above)
This takes inspiration from how container trees are protected below
/var/lib/machines: they generally reuse UIDs/GIDs of the host, but
because /var/lib/machines itself is set to 0700 host users cannot access
files in the container tree even if the UIDs/GIDs are reused. However,
for this commit we add one further trick: inside and outside of the unit
/var/lib/private is a different thing: outside it is a plain,
inaccessible directory, and inside it is a world-readable tmpfs mount
with only the whitelisted subdirs below it, bind mounte din. This
means, from the outside the dir acts as an access barrier, but from the
inside it does not. And the symlink created in /var/lib/foo itself
points across the barrier in both cases, so that root and the unit's
user always have access to these dirs without knowing the details of
this mounting magic.
This logic resolves a major shortcoming of DynamicUser=1 units:
previously they couldn't safely store persistant data. With this change
they can have their own private state, log and data directories, which
they can write to, but which are protected from UID recycling.
With this change, if RootDirectory= or RootImage= are used it is ensured
that the specified state/log/cache directories are always mounted in
from the host. This change of semantics I think is much preferable since
this means the root directory/image logic can be used easily for
read-only resource bundling (as all writable data resides outside of the
image). Note that this is a change of behaviour, but given that we
haven't released any systemd version with StateDirectory= and friends
implemented this should be a safe change to make (in particular as
previously it wasn't clear what would actually happen when used in
combination). Moreover, by making this change we can later add a "+"
modifier to these setings too working similar to the same modifier in
ReadOnlyPaths= and friends, making specified paths relative to the
container itself.
namespace: if we can create the destination of bind and PrivateTmp= mounts
When putting together the namespace, always create the file or directory
we are supposed to bind mount on, the same way we do it for most other
stuff, for example mount units or systemd-nspawn's --bind= option.
This has the big benefit that we can use namespace bind mounts on dirs
in /tmp or /var/tmp even in conjunction with PrivateTmp=.
namespace: properly handle bind mounts from the host
Before this patch we had an ordering problem: if we have no namespacing
enabled except for two bind mounts that intend to swap /a and /b via
bind mounts, then we'd execute the bind mount binding /b to /a, followed
by thebind mount from /a to /b, thus having the effect that /b is now
visible in both /a and /b, which was not intended.
With this change, as soon as any bind mount is configured we'll put
together the service mount namespace in a temporary directory instead of
operating directly in the root. This solves the problem in a
straightforward fashion: the source of bind mounts will always refer to
the host, and thus be unaffected from the bind mounts we already
created.
We already create /dev implicitly if PrivateTmp=yes is on, if it is
missing. Do so too for the other two API VFS, as well as for /dev if
PrivateTmp=yes is off but MountAPIVFS=yes is on (i.e. when /dev is bind
mounted from the host).
core: usually our enum's _INVALID and _MAX special values are named after the full type
In most cases we followed the rule that the special _INVALID and _MAX
values we use in our enums use the full type name as prefix (in contrast
to regular values that we often make shorter), do so for
ExecDirectoryType as well.
No functional changes, just a little bit of renaming to make this code
more like the rest.
core: chown() StateDirectory= and friends recursively when starting a service
This is particularly useful when used in conjunction with DynamicUser=1,
where the UID might change for every invocation, but is useful in other
cases too, for example, when these directories are shared between
systems where the UID assignments differ slightly.
Jouke Witteveen [Mon, 2 Oct 2017 14:35:27 +0000 (16:35 +0200)]
service: better detect when a Type=notify service cannot become active anymore (#6959)
No need to wait for a timeout when we know things are not going to work out.
When the main process goes away and only notifications from the main process are
accepted, then we will not receive any notifications anymore.
Ignoring duplicate entry: 0001C8 = "THOMAS CONRAD CORP.", "CONRAD CORP."
Ignoring duplicate entry: 080030 = "NETWORK RESEARCH CORPORATION", "ROYAL MELBOURNE INST OF TECH"
Ignoring duplicate entry: 080030 = "NETWORK RESEARCH CORPORATION", "CERN"
→ we have two vendor prefixes with duplicate entries. For the first one,
there are two entries with what appear to be the same company. In the
second case, the same prefix is assigned to three different entities.
I arbitrarily chose to prefer the first entry.
Without the original files it's hard to see what changed "upstream", and what
entries were added and removed. Upstream did not keep the entries sorted, and
our processing scripts did not sort the output either, so from just looking at
diffs it's hard to say what changed. So let's keep the original data, at least
for a few update cycles, so get a better handle on the upstream changes.
It's a few hundred kilobytes, so not that big, and text, so it should
compresses well.
During the initial meson conversion, custom_target:s and run_target:s behaved
the same, and the target name became a top-level command. Now custom_target:s
require the subdir to be included, e.g. we have man/man target to build man pages,
but run_target:s not. So I think this target got a name that is so generic because
of the confusion caused by changing rules. Let's rename it.
service: accept the fact that the three xyz_good() functions return ints
Currently, all three of cgroup_good(), main_pid_good(),
control_pid_good() all return an "int" (two of them propagate errors).
It's a good thing to keep the three functions similar, so let's leave it
at that, but then let's clean up the invocation of the three functions
so that they always clearly acknowledge that the return value is not a
bool, but potentially negative.
The compiler should be good enough to figure this out on its own if this
is a static function, and it makes control_pid_good() an outlier anyway,
and decorators like this tend to bitrot. Hence, to keep things simple
and automatic, let's just drop the decorator.
service: a cgroup empty notification isn't reason enough to go down
The processes associated with a service are not just the ones in its
cgroup, but also the control and main processes, which might possibly
live outside of it, for example if they transitioned into their own
cgroups because they registered a PAM session of their own. Hence, if we
get a cgroup empty notification always check if the main PID is still
around before taking action too eagerly.
resolved: make sure a non-existing PTR record never gets mangled into NODATA
Previously, if a PTR query is seen for a non-existing record, we'd
generate an empty response (but not NXDOMAIN or so). Fix that. If we
have no data about an IP address, then let's say so, so that the
original error is returned, instead of anything synthesized.
resolved: when there is no gateway, make sure _gateway results in NXDOMAIN
Let's ensure that "no gateway" translates to "no domain", instead of an
empty reply. This is in line with what nss-myhostname does in the same
case, hence let's unify behaviour here of nss-myhostname and resolved.
THe match cookie was used by kdbus to identify matches we install
uniquely. But given that kdbus is gone, the cookie serves no process
anymore, let's kill it.
sd-bus: when showing brief message info show error name in debug out put too
When debug logging is enabled we show brief information about every bus
message we send or receieve. Pretty much all information is shown,
except for the error name if a message is an error (interestingly we do
print the error text however). Fix that, and add the error name as well.
create-sys-script: adapt to separate build dir, modernize, add more checks
The script wasn't apparently used since the switch to meson, because
it required the sys subdirectory to be present in the same subdirectory
where the output script is located.
Let's use f-strings to make the whole thing more readable. Add some
extra checks.