Michal Sekletar [Mon, 17 Jul 2017 08:04:37 +0000 (10:04 +0200)]
journald: make sure we retain all stream fds across restarts (#6348)
Currently we set 4096 as maximum for number of stream connections that
we accept. However maximum number of file descriptors that systemd is
willing to accept from us is just 1024. This means we can't retain all
stream connections that we accepted. Hence bump the limit of fds in a
unit file so that systemd holds open all stream fds while we are
restarted.
I messed up when adding the definitions in 4278d1f5310f5acb4c6a6788233625234edb5145.
Unfortunately I didn't have the hardware at hand and went by
looking at the kernel headers.
So don't even try to added the filter to reduce noise.
The test is updated to skip calling _sysctl because the kernel prints
an oops-like message that is confusing and unhelpful:
core: support "nsdelegate" cgroup v2 mount option (#6294)
cgroup namespace wasn't useful for delegation because it allowed resource
control interface files (e.g. memory.high) to be written from inside the
namespace - this allowed the namespace parent's resource distribution to be
disturbed by its namespace-scoped children.
A new mount option, "nsdelegate", was added to cgroup v2 to address this issue.
The flag is meangingful only when mounting cgroup v2 in the init namespace and
makes a cgroup namespace a delegation boundary. The kernel feature is pending
for v4.13.
This should have been the default behavior on cgroup namespaces and this commit
makes systemd try "nsdelegate" first when trying to mount cgroup v2 and fall
back if the option is not supported.
Note that this has danger of breaking usages which depend on modifying the
parent's resource settings from the namespace root, which isn't a valid thing
to do, but such usages may still exist.
journal: use context_attach_window() in add_mmap() (#6339)
Instead of context_detach_window() and a manual attach of the new
window, simply call context_attach_window() which performs the
detach first if appropriate.
generator_add_symlink() is extended to ignore EEXIST. This should be fine
for all existing callers.
There's a small difference in behaviour when adding symlinks in sysv-generator:
the message is more generic and does not include ", ignored". But creation of
symlinks shouldn't ever fail except if things are very wrong, so in practice
this shouldn't matter.
Test needed updating: os.path.exists(os.readlink(link)) only works if the link
is absolute (or if we are in the right directory). Let's just use
os.path.exists(link), which properly tests that the symlink target exists.
journal_file_move_to_object() can skip the second
journal_file_move_to() call if the first one already mapped a
sufficiently large area.
Now that mmap_cache_get() returns the size of the mapped area
when asked, ask for the size and only perform the second call if
the required size exceeds the mapped size instead of the object
header size.
This results in a nice performance boost in my testing, even with
a corpus of many small logs burning much CPU time elsewhere:
Before:
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m16.330s
user 0m16.281s
sys 0m0.046s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m16.409s
user 0m16.358s
sys 0m0.048s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m16.625s
user 0m16.558s
sys 0m0.061s
After:
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m15.311s
user 0m15.257s
sys 0m0.046s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m15.201s
user 0m15.135s
sys 0m0.062s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m15.170s
user 0m15.113s
sys 0m0.053s
If requested, return the actual mapping size to the caller in
addition to the address.
journal_file_move_to_object() often performs two successive
mmap_cache_get() calls via journal_file_move_to(); one to get the
object header, then another to get the entire object when it's
larger than the header's size.
If mmap_cache_get() returned the actual mapping's size, it's
probable that the second mmap_cache_get() could be skipped when
the established mapping already encompassed the desired size.
core/load-fragment: refuse units with errors in certain directives
If an error is encountered in any of the Exec* lines, WorkingDirectory,
SELinuxContext, ApparmorProfile, SmackProcessLabel, Service (in .socket
units), User, or Group, refuse to load the unit. If the config stanza
has support, ignore the failure if '-' is present.
For those configuration directives, even if we started the unit, it's
pretty likely that it'll do something unexpected (like write files
in a wrong place, or with a wrong context, or run with wrong permissions,
etc). It seems better to refuse to start the unit and have the admin
clean up the configuration without giving the service a chance to mess
up stuff.
Note that all "security" options that restrict what the unit can do
(Capabilities, AmbientCapabilities, Restrict*, SystemCallFilter, Limit*,
PrivateDevices, Protect*, etc) are _not_ treated like this. Such options are
only supplementary, and are not always available depending on the architecture
and compilation options, so unit authors have to make sure that the service
runs correctly without them anyway.
Colin Walters [Tue, 11 Jul 2017 16:48:57 +0000 (12:48 -0400)]
fstab-generator: Chase symlinks where possible (#6293)
This has a long history; see see 5261ba901845c084de5a8fd06500ed09bfb0bd80
which originally introduced the behavior. Unfortunately that commit
doesn't include any rationale, but IIRC the basic issue is that
systemd wants to model the real mount state as units, and symlinks
make canonicalization much more difficult.
At the same time, on a RHEL6 system (upstart), one can make e.g. `/home` a
symlink, and things work as well as they always did; but one doesn't have
access to the sophistication of mount units (dependencies, introspection, etc.)
Supporting symlinks here will hence make it easier for people to do upgrades to
RHEL7 and beyond.
The `/home` as symlink case also appears prominently for OSTree; see
https://ostree.readthedocs.io/en/latest/manual/adapting-existing/
A basic limitation with doing this in the fstab generator (and that I hit while
doing some testing) is that we obviously can't chase symlinks into mounts,
since the generator runs early before mounts. Or at least - doing so would
require multiple passes over the fstab data (as well as looking at existing
mount units), and potentially doing multi-phase generation. I'm not sure it's
worth doing that without a real world use case. For now, this will fix at least
the OSTree + `/home` <https://bugzilla.redhat.com/show_bug.cgi?id=1382873> case
mentioned above, and in general anyone who for whatever reason has symlinks in
their `/etc/fstab`.
systemd: do not stop units bound to inactive units while coldplugging (#6316)
When running systemd-analyze verify I would get a random subset of warnings
(sometimes none, sometimes one or two):
dev-mapper-luks\x2d8db85dcf\x2d6230\x2d4e88\x2d940d\x2dba176d062b31.swap: Unit is bound to inactive unit dev-mapper-luks\x2d8db85dcf\x2d6230\x2d4e88\x2d940d\x2dba176d062b31.device. Stopping, too.
home.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device. Stopping, too.
boot.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-56c56bfd\x2d93f0\x2d48fb\x2dbc4b\x2d90aa67144ea5.device. Stopping, too.
When running with debug on, it's pretty obvious what is happening:
home.mount: Changed dead -> mounted
home.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device. Stopping, too.
home.mount: Trying to enqueue job home.mount/stop/fail
home.mount: Installed new job home.mount/stop as 27
home.mount: Enqueued job home.mount/stop as 27
...
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Installed new job dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device/start as 47
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Changed dead -> plugged
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Job dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device/start finished, result=done
resolved: allow resolution of names which libidn2 considers invalid (#6315)
https://tools.ietf.org/html/rfc5891#section-4.2.3.1 says that
> The Unicode string MUST NOT contain "--" (two consecutive hyphens) in the third
> and fourth character positions and MUST NOT start or end with a "-" (hyphen).
This means that libidn2 refuses to encode such names.
Let's just resolve them without trying to use IDN.
random-util: always cast from smaller to bigger type when comparing
When we compare two size values, let's make sure we cast from the
smaller to the bigger type first, if both types differ, rather than the
reverse in order to not run into overflows.
This way we have a MMapFileDescriptor reference external to the cache,
and can supply the handle directly to mmap_cache_get(), eliminating
hashmap lookups entirely from the hot path.
This should make output deterministic, and independent of the directory
layout on disk. Just using ordered hashmaps would be enough to make
the output deterministic on a specific machine, but to make it
identical on different machines with the same set of files and
directories, names are sorted after being use.
mount: change find_loop_device() error code when no loop device is found to ENXIO
ENOENT is a bit too likely to be returned for various reasons, for
example if /sys or /proc are not mounted and hence the files we need not
around. Hence, let's use ENXIO instead, which is equally fitting for the
purpose but has the benefit that the underlying calls won't generate
this error on their own, hence any ambiguity is removed.
mount: rework find_loop_device() to log about no errors
We should either log about all errors in a function, or about none (and
then leave the logging about it to the caller who we propagate the error
to). Given that the callers of find_loop_device() already log about the
returned errors let's hence suppress the log messages in
find_loop_device() itself.
mount: fix potential bad memory access when /proc/self/mountinfo is empty
It's unlikely this can ever be triggered, but let's be safe rather than
sorry, and handle the case where the list of mount points is zero, and
the "l" array thus NULL. let's ensure we allocate at least one entry.
systemctl link is the only systemctl verb that takes a filename (and not
a unit name) as argument
use path_strv_make_absolute_cwd to expand the provided filename in order
to make it easier to use from the command line
keep the absolute pathname requirement when --root is used
[zj: add explicit error messages for the cases of --root and plain filename
instead of skipping normalization and just relying on systemd to refuse
to link non-absolute arguments. This allows us to make the error message
more informative.]
Make agetty started by *getty* units pass '-p' option to "login", so it
doesn't clear the environment and passes whatever was setup by systemd
to shells. This is needed especially for programs which are specified as
user shells, but won't read locale settings from anywhere but
environment.
[zj: cherry-pick just the second patch from the series, see discussion
on the pull request.]
time-util: make parse_timestamp() set 0 if the input is very old date (#6297)
If the input is older than "1970-01-01 UTC", then `parse_timestamp()`
fails and returns -EINVAL. However, if the input is e.g. `-100years`,
then the function succeeds and sets `usec = 0`.
This commit makes the function also succeed for old dates and set
`usec = 0`.
shared: leave output_journal() output in buffer (#6304)
e268b81e moved an fflush() from output_json() to the generic
output_journal(), when it probably should have deleted all fflush()
calls from logs-show.c altogether.
The caller supplies the FILE * to these functions, and should be in
charge of flushing as needed. The current implementation essentially
defeats any buffering stdio was bringing to the table, resulting in
extraneous tiny write() calls in commands like `journalctl -b`.
This commit removes the fflush() call from output_journal(), and adds
them to journalctl before waiting for more entries and at completion.
This way in the hot path when journalctl loops on entries stdio can
combine multiple entries into bulkier write() calls.
Benjamin Robin [Thu, 6 Jul 2017 02:56:17 +0000 (04:56 +0200)]
resolve: Try to remove the ambiguity about the mtu parameter of dns_packet_new (#6285)
Actually the caller of dns_packet_new() pass 0 or the data size of the UDP message.
So try to reflect that, so rename the `mtu` parameter to `min_alloc_dsize`.
In fact `mtu` is the size of the whole UDP message, including the UDP header,
and here we just need to pass the size of data (without header). This was confusing.
Also add a check on the requested allocated size, since some caller do not check what is really allocated.
Indeed the function do not allocate more than DNS_PACKET_SIZE_MAX whatever the value of the `mtu` parameter.
secure_getenv does not work when the process has a nonempty permitted
capability set, which means that it's unduly hard to configure logging in
systemd-logind, systemd-resolved, and others.
secure_getenv is useful for code in libraries which might get called from a
setuid application. log_parse_environment() is never called from our library
code, but directly form various top-level executables. None of them are
installed suid, and none are prepared to be used this way, since many
additional changes would be required to make that safe. We may just as well
drop the check and allow SYSTEMD_LOG_* to properly parsed.
Mike Gilbert [Wed, 5 Jul 2017 03:22:47 +0000 (23:22 -0400)]
test-fs-util: re-order test_readlink_and_make_absolute and test_get_files_in_directory (#6288)
test_readlink_and_make_absolute switches to a temp directory, and then
removes it.
test_get_files_in_directory calls opendir(".") from a directory that has
been removed from the filesystem.
This call sequence triggers a bug in Gentoo's sandbox library. This
library attempts to resolve the "." to an absolute path, and aborts when
it ultimately fails to do so.
Re-ordering the calls works around the issue until the sandbox library
can be fixed to more gracefully deal with this.
When "bg" is specified for NFS mounts, and if the server is
not accessible, two behaviors are possible depending on networking
details.
If a definitive error is received, such a EHOSTUNREACH or ECONNREFUSED,
mount.nfs will fork and continue in the background, while /bin/mount
will report success.
If no definitive error is reported but the connection times out
instead, then the mount.nfs timeout will normally be longer than the
systemd.mount timeout, so mount.nfs will be killed by systemd.
In the first case the mount has appeared to succeed even though
it hasn't. This can be confusing. Also the background mount.nfs
will never get cleaned up, even if the mount unit is stopped.
In the second case, mount.nfs is killed early and so the mount will
not complete when the server comes back.
Neither of these are ideal.
This patch modifies the options when an NFS bg mount is detected to
force an "fg" mount, but retain the default "retry" time of 10000
minutes that applies to "bg" mounts.
It also imposes "nofail" behaviour and sets the TimeoutSec for the
mount to "infinity" so the retry= time is allowed to complete.
This provides near-identical behaviour to an NFS bg mount started directly
by "mount -a". The only difference is that systemd will not wait for
the first mount attempt, while "mount -a" will.
Christian Hesse [Tue, 4 Jul 2017 07:38:31 +0000 (09:38 +0200)]
core: link user keyring to session keyring (#6275)
Commit 74dd6b515fa968c5710b396a7664cac335e25ca8 (core: run each system
service with a fresh session keyring) broke adding keys to user keyring.
Added keys could not be accessed with error message: