build-sys: fix appending of CFLAGS and define __SANE_USERSPACE_TYPES__
It's pointless to call AC_SUBST more than once on the same variable. Because
of all the copypasta, we were mixing CFLAGS and LDFLAGS.
… and the assertion in the previous commit was wrong. PPC64 is a special snowflake.
__SANE_USERSPACE_TYPES__ is needed on PPC64 to make __u64 be llu, instead of
lu. Considering that both lu and llu are 64 bits, there's nothing sane about
this, maybe the flag should be called __INSANE_USERSPACE_TYPES__ instead. Sane
or not, this makes ppc64 kernel headers behave consistently with other
architectures. With this flag, no warnings are emitted at -O0 level.
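As a hedged illustration of what the flag changes: if __SANE_USERSPACE_TYPES__ is defined before the first kernel header is pulled in, __u64 is an unsigned long long, so "%llu" matches without a cast. The helper name format_u64() is made up for this sketch, and the define is placed in the source file only to keep the example self-contained; the commit itself sets it from the build system.

```c
/* Must be defined before the first kernel header is included. On x86 this
 * is a no-op; on ppc64 it switches __u64 from unsigned long to
 * unsigned long long. */
#define __SANE_USERSPACE_TYPES__
#include <linux/types.h>

#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats a kernel __u64 without any cast. */
static int format_u64(char *buf, size_t size, __u64 value) {
        assert(buf);
        return snprintf(buf, size, "%llu", value);
}
```

Compiled with -Wformat, this should stay quiet on both x86_64 and ppc64 as long as the define precedes the include.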
Martin Pitt [Tue, 8 Nov 2016 04:31:55 +0000 (05:31 +0100)]
nspawn: fix exit code for --help and --version (#4609)
Commit b006762 inverted the initial exit code which is relevant for --help and
--version without a particular reason. For these special options, parse_argv()
returns 0 so that our main() immediately skips to the end without adjusting
"ret". Otherwise, if an actual container is being started, ret is set on error
in run(), which still provides the "non-zero exit on error" behaviour.
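The control flow described above can be sketched roughly as follows. This is a simplified stand-in, not the nspawn code: sketch_main() is a hypothetical name, the option parser is reduced to the two options in question, and the run_result parameter stands in for the outcome of run().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified parser: returns 0 if the option was fully handled
 * (--help/--version) and 1 if the program should carry on. */
static int parse_argv(int argc, char *argv[]) {
        assert(argv);

        for (int i = 1; i < argc; i++)
                if (strcmp(argv[i], "--help") == 0 ||
                    strcmp(argv[i], "--version") == 0)
                        return 0;

        return 1;
}

/* run_result is an assumption of this sketch: negative means that
 * starting the container failed. */
static int sketch_main(int argc, char *argv[], int run_result) {
        int ret = EXIT_SUCCESS; /* the initial value decides --help/--version */

        if (parse_argv(argc, argv) == 0)
                return ret;     /* skip to the end, "ret" is left untouched */

        if (run_result < 0)
                ret = EXIT_FAILURE; /* non-zero exit on error is preserved */

        return ret;
}
```

With the initial value inverted, the early return for --help would report failure even though nothing went wrong, which is the regression this commit fixes.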
Martin Pitt [Mon, 7 Nov 2016 18:51:20 +0000 (19:51 +0100)]
tests: use less aggressive systemctl --wait timeout in TEST-03-JOBS (#4606)
If the "systemctl start" happens at an "unlucky" time such as 1000.9 seconds
and then e.g. runs for 2.6 s (sleep 2 plus the overhead of starting the unit
and waiting for it) the END_SEC would be 1003.5s which would round to 1004,
making the difference 4. On busier testbeds the overhead apparently can take a
bit more than 0.5s. The main point is really that it doesn't wait that much
longer, so "-le 4" seems perfectly fine. We allow up to 1.5s in the subsequent
"wait5fail" test below too.
In file included from ./src/basic/macro.h:415:0,
from ./src/shared/acl-util.h:28,
from src/coredump/coredump.c:36:
src/coredump/coredump.c: In function ‘submit_coredump’:
src/coredump/coredump.c:711:26: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 7 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat=]
log_info("The core will not be stored: size %zu is greater than %zu (the configured maximum)",
^
./src/basic/log.h:175:82: note: in definition of macro ‘log_full_errno’
? log_internal(_level, _e, __FILE__, __LINE__, __func__, __VA_ARGS__) \
^~~~~~~~~~~
./src/basic/log.h:183:28: note: in expansion of macro ‘log_full’
#define log_info(...) log_full(LOG_INFO, __VA_ARGS__)
^~~~~~~~
src/coredump/coredump.c:711:17: note: in expansion of macro ‘log_info’
log_info("The core will not be stored: size %zu is greater than %zu (the configured maximum)",
^~~~~~~~
src/coredump/coredump.c:711:26: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 8 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat=]
log_info("The core will not be stored: size %zu is greater than %zu (the configured maximum)",
^
./src/basic/log.h:175:82: note: in definition of macro ‘log_full_errno’
? log_internal(_level, _e, __FILE__, __LINE__, __func__, __VA_ARGS__) \
^~~~~~~~~~~
./src/basic/log.h:183:28: note: in expansion of macro ‘log_full’
#define log_info(...) log_full(LOG_INFO, __VA_ARGS__)
^~~~~~~~
src/coredump/coredump.c:711:17: note: in expansion of macro ‘log_info’
log_info("The core will not be stored: size %zu is greater than %zu (the configured maximum)",
^~~~~~~~
src/coredump/coredump.c:741:27: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 7 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat=]
log_debug("Not generating stack trace: core size %zu is greater than %zu (the configured maximum)",
^
./src/basic/log.h:175:82: note: in definition of macro ‘log_full_errno’
? log_internal(_level, _e, __FILE__, __LINE__, __func__, __VA_ARGS__) \
^~~~~~~~~~~
./src/basic/log.h:182:28: note: in expansion of macro ‘log_full’
#define log_debug(...) log_full(LOG_DEBUG, __VA_ARGS__)
^~~~~~~~
src/coredump/coredump.c:741:17: note: in expansion of macro ‘log_debug’
log_debug("Not generating stack trace: core size %zu is greater than %zu (the configured maximum)",
^~~~~~~~~
src/coredump/coredump.c:741:27: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 8 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat=]
log_debug("Not generating stack trace: core size %zu is greater than %zu (the configured maximum)",
^
./src/basic/log.h:175:82: note: in definition of macro ‘log_full_errno’
? log_internal(_level, _e, __FILE__, __LINE__, __func__, __VA_ARGS__) \
^~~~~~~~~~~
./src/basic/log.h:182:28: note: in expansion of macro ‘log_full’
#define log_debug(...) log_full(LOG_DEBUG, __VA_ARGS__)
^~~~~~~~
src/coredump/coredump.c:741:17: note: in expansion of macro ‘log_debug’
log_debug("Not generating stack trace: core size %zu is greater than %zu (the configured maximum)",
^~~~~~~~~
src/coredump/coredump.c:768:34: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 7 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat=]
log_info("The core will not be stored: size %zu is greater than %zu (the configured maximum)",
^
./src/basic/log.h:175:82: note: in definition of macro ‘log_full_errno’
? log_internal(_level, _e, __FILE__, __LINE__, __func__, __VA_ARGS__) \
^~~~~~~~~~~
./src/basic/log.h:183:28: note: in expansion of macro ‘log_full’
#define log_info(...) log_full(LOG_INFO, __VA_ARGS__)
^~~~~~~~
src/coredump/coredump.c:768:25: note: in expansion of macro ‘log_info’
log_info("The core will not be stored: size %zu is greater than %zu (the configured maximum)",
^~~~~~~~
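One portable way to avoid this class of warning is to print uint64_t values with PRIu64 from <inttypes.h> instead of "%zu". The sketch below uses a made-up helper, format_core_too_large(), and plain snprintf() rather than the log macros; it only illustrates the format specifier, not the actual change to coredump.c.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: PRIu64 expands to the right conversion for
 * uint64_t on every architecture, whereas %zu is only correct for size_t. */
static int format_core_too_large(char *buf, size_t buf_size,
                                 uint64_t size, uint64_t max) {
        assert(buf);
        return snprintf(buf, buf_size,
                        "size %" PRIu64 " is greater than %" PRIu64
                        " (the configured maximum)",
                        size, max);
}
```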
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
Djalal Harouni [Sun, 6 Nov 2016 21:51:49 +0000 (22:51 +0100)]
core: make RootDirectory= and ProtectKernelModules= work
Instead of having two fields inside BindMount struct where one is stack
based and the other one is heap, use one field to store the full path
and update it when we chase symlinks. This way we avoid dealing with
both at the same time.
This makes RootDirectory= work with ProtectHome= and ProtectKernelModules=yes.
Felipe Sateler [Sun, 6 Nov 2016 14:16:42 +0000 (11:16 -0300)]
delta: skip symlink paths when split-usr is enabled (#4591)
If systemd is built with --enable-split-usr, but the system is indeed a
merged-usr system, then systemd-delta gets all confused and reports
that all units and configuration files have been overridden.
Skip any prefix paths that are symlinks in this case.
core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namespace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
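The decision the filter encodes for unshare() and clone() can be modelled in user space as a bit-mask check. This is only a sketch under assumptions: the real enforcement happens in a seccomp filter inside the kernel, setns() (which takes a namespace type rather than clone flags) is not covered, and namespace_flags_permitted() as well as the NAMESPACE_FLAGS_ALL list are names invented here.

```c
#include <assert.h>
#include <linux/sched.h>
#include <stdbool.h>

/* Namespace-creating flags this sketch knows about (not necessarily the
 * complete list the real filter handles). */
#define NAMESPACE_FLAGS_ALL                                     \
        (CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWNET |            \
         CLONE_NEWPID | CLONE_NEWUSER | CLONE_NEWUTS)

/* "retain" is the set of namespace types the unit may still create:
 *   RestrictNamespaces=yes      ->  0
 *   RestrictNamespaces=mnt ipc  ->  CLONE_NEWNS | CLONE_NEWIPC
 * (With RestrictNamespaces=no, no filter is installed at all.) */
static bool namespace_flags_permitted(unsigned long flags, unsigned long retain) {
        unsigned long requested = flags & NAMESPACE_FLAGS_ALL;

        /* Any requested namespace type outside the retained set is refused. */
        return (requested & ~retain) == 0;
}
```

A call that requests no namespace flags, such as an ordinary fork-style clone(), passes the check regardless of the setting.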
Fixes:
$ ./libtool --mode execute valgrind --leak-check=full ./journalctl >/dev/null
==22309== Memcheck, a memory error detector
==22309== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==22309== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==22309== Command: /home/vagrant/systemd/.libs/lt-journalctl
==22309==
Hint: You are currently not seeing messages from other users and the system.
Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
Pass -q to turn off this notice.
==22309==
==22309== HEAP SUMMARY:
==22309== in use at exit: 8,680 bytes in 4 blocks
==22309== total heap usage: 5,543 allocs, 5,539 frees, 9,045,618 bytes allocated
==22309==
==22309== 488 (56 direct, 432 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 4
==22309== at 0x4C2BBAD: malloc (vg_replace_malloc.c:299)
==22309== by 0x6F37A0A: __new_var_obj_p (__libobj.c:36)
==22309== by 0x6F362F7: __acl_init_obj (acl_init.c:28)
==22309== by 0x6F37731: __acl_from_xattr (__acl_from_xattr.c:54)
==22309== by 0x6F36087: acl_get_file (acl_get_file.c:69)
==22309== by 0x4F15752: acl_search_groups (acl-util.c:172)
==22309== by 0x113A1E: access_check_var_log_journal (journalctl.c:1836)
==22309== by 0x113D8D: access_check (journalctl.c:1889)
==22309== by 0x115681: main (journalctl.c:2236)
==22309==
==22309== LEAK SUMMARY:
==22309== definitely lost: 56 bytes in 1 blocks
==22309== indirectly lost: 432 bytes in 1 blocks
==22309== possibly lost: 0 bytes in 0 blocks
==22309== still reachable: 8,192 bytes in 2 blocks
==22309== suppressed: 0 bytes in 0 blocks
Direct leak of 48492 byte(s) in 2694 object(s) allocated from:
#0 0x7fb4aba13e60 in malloc (/lib64/libasan.so.3+0xc6e60)
#1 0x7fb4ab5b2cc4 in malloc_multiply src/basic/alloc-util.h:70
#2 0x7fb4ab5b3194 in parse_field src/shared/logs-show.c:98
#3 0x7fb4ab5b4918 in output_short src/shared/logs-show.c:347
#4 0x7fb4ab5b7cb7 in output_journal src/shared/logs-show.c:977
#5 0x5650e29cd83d in main src/journal/journalctl.c:2581
#6 0x7fb4aabdb730 in __libc_start_main (/lib64/libc.so.6+0x20730)
SUMMARY: AddressSanitizer: 48492 byte(s) leaked in 2694 allocation(s).
Follow up for #4546:
> @@ -848,8 +848,7 @@ static int bus_kernel_make_message(sd_bus *bus, struct kdbus_msg *k) {
if (k->src_id == KDBUS_SRC_ID_KERNEL)
bus_message_set_sender_driver(bus, m);
else {
- xsprintf(m->sender_buffer, ":1.%llu",
- (unsigned long long)k->src_id);
+ xsprintf(m->sender_buffer, ":1.%"PRIu64, k->src_id);
This produces:
src/libsystemd/sd-bus/bus-kernel.c: In function ‘bus_kernel_make_message’:
src/libsystemd/sd-bus/bus-kernel.c:851:44: warning: format ‘%lu’ expects argument of type ‘long
unsigned int’, but argument 4 has type ‘__u64 {aka long long unsigned int}’ [-Wformat=]
xsprintf(m->sender_buffer, ":1.%"PRIu64, k->src_id);
^
If we encounter the (unlikely) situation where the combined path to the
new root and a path to a mount to be moved together exceeds the maximum path
length, we shouldn't crash, but fail this path instead.
This reverts some changes introduced in d054f0a4d4.
xsprintf should be used in cases where we calculated the right buffer
size by hand (using DECIMAL_STRING_MAX and such), and never in cases where
we are printing externally specified strings of arbitrary length.
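For externally supplied strings, a bounded snprintf() with an explicit truncation check lets the caller fail gracefully instead of asserting. The following is a minimal sketch of that pattern; join_path_checked() is a hypothetical helper and the choice of -ENAMETOOLONG and -EIO as error codes is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Concatenates root and path into buf. Returns 0 on success and
 * -ENAMETOOLONG if the result (including the trailing NUL) does not fit. */
static int join_path_checked(char *buf, size_t buf_size,
                             const char *root, const char *path) {
        int n;

        assert(buf);
        assert(root);
        assert(path);

        n = snprintf(buf, buf_size, "%s%s", root, path);
        if (n < 0)
                return -EIO;            /* output error reported by snprintf() */
        if ((size_t) n >= buf_size)
                return -ENAMETOOLONG;   /* would have been truncated */

        return 0;
}
```

An asserting formatter remains appropriate where the buffer size was derived from the format itself, for example a fixed-width decimal number.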
Unfortunately, github drops the original committer when a PR is "squashed" (even
if it is only a single commit) and replaces it with some rubbish
github-specific user id. Thus, to make the contributors list somewhat useful,
update the .mailmap file and undo all the weirdness github applied there.
pid1: fix fd memleak when we hit FileDescriptorStoreMax limit
Since service_add_fd_store() already does the check, remove the redundant check
from service_add_fd_store_set().
Also, print a warning when repopulating FDStore after daemon-reexec and we hit
the limit. This is a user visible issue, so we should not discard fds silently.
(Note that service_deserialize_item is impacted by the return value from
service_add_fd_store(), but we rely on the general error message, so the caller
does not need to be modified, and does not show up in the diff.)
core: change mount_synthesize_root() return to int
Let's propagate the error here, instead of eating it up early.
In a later change we should probably also change mount_enumerate() to propagate
errors up, but that would mean we'd have to change the unit vtable, and thus
change all unit types, hence is quite an invasive change.
nspawn: if we set up a loopback device, try to mount it with "discard"
Let's make sure that our loopback files remain sparse, hence let's set
"discard" as mount option on file systems that support it if the backing device
is a loopback.
systemctl: tweak the "systemctl list-units" output a bit
Make the underlining between the header and the body and between the units of
different types span the whole width of the table.
Let's never make the table wider than necessary (which is relevant due to the
above).
When space is limited and we can't show the full ID or description string
prefer showing the full ID over the full description. The ID is after all
something people might want to copy/paste, while the description is mostly just
helpful decoration.
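The width preference can be sketched as a two-step allocation: give the ID column what it needs first, then hand whatever is left to the description, and never hand out more than either string requires. The function split_columns() below is a hypothetical simplification that ignores the other table columns and any ellipsizing.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) {
        return a < b ? a : b;
}

/* Splits "avail" columns between the ID and the description. The ID is
 * served first; neither column gets more than its string needs, so the
 * table never grows wider than necessary. */
static void split_columns(size_t avail, size_t id_len, size_t desc_len,
                          size_t *id_width, size_t *desc_width) {
        assert(id_width);
        assert(desc_width);

        *id_width = min_size(id_len, avail);
        *desc_width = min_size(desc_len, avail - *id_width);
}
```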
sysctl: do not fail systemd-sysctl.service if /proc/sys is mounted read-only
Let's make missing write access to /proc/sys non-fatal to the sysctl service.
This is a follow-up to 411e869f497c7c7bd0688f1e3500f9043bc56e48 which altered
the condition for running the sysctl service to check for /proc/sys/net being
writable, accepting that /proc/sys might be read-only. In order to ensure the
boot-up stays clean in containers lower the log level for the EROFS errors
generated due to this.
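A rough sketch of the resulting policy is shown below. Only the special treatment of EROFS comes from the description above; the concrete log levels (LOG_DEBUG and LOG_WARNING), the handling of every other error as fatal, and the names WriteFailurePolicy and sysctl_write_failure_policy() are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <syslog.h>

/* Hypothetical result type: how loudly to log a failed write and whether
 * it should make the service fail. */
typedef struct WriteFailurePolicy {
        int log_level;
        bool fatal;
} WriteFailurePolicy;

/* "error" is a positive errno value from a failed write below /proc/sys. */
static WriteFailurePolicy sysctl_write_failure_policy(int error) {
        assert(error > 0);

        if (error == EROFS)
                /* Read-only /proc/sys is expected in containers: log
                 * quietly (assumed level) and carry on. */
                return (WriteFailurePolicy) { .log_level = LOG_DEBUG, .fatal = false };

        /* Everything else is reported prominently (assumed level). */
        return (WriteFailurePolicy) { .log_level = LOG_WARNING, .fatal = true };
}
```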
core: rework the "no_gc" unit flag to become a more generic "perpetual" flag
So far "no_gc" was set on -.slice and init.scope, i.e. on units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more functionality off it. Specifically, refuse
enqueuing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
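The behaviour derived from the flag can be sketched with a heavily reduced unit structure. The struct layout, the function names and the -EPERM error code are assumptions of this example and are not taken from the actual unit code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for a unit object. */
typedef struct SketchUnit {
        const char *id;
        bool perpetual; /* always running, never "inactive" */
} SketchUnit;

/* What a CanStop-style property would report. */
static bool sketch_unit_can_stop(const SketchUnit *u) {
        assert(u);
        return !u->perpetual;
}

/* Refuses stop jobs for perpetual units; the error code is an assumption. */
static int sketch_unit_enqueue_stop(const SketchUnit *u) {
        assert(u);

        if (!sketch_unit_can_stop(u))
                return -EPERM;

        return 0; /* a real manager would queue the stop job here */
}
```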
man: document that too strict system call filters may affect the service manager
If execve() or socket() is filtered the service manager might get into trouble
executing the service binary, or handling any failures when this fails. Mention
this in the documentation.
The other option would be to implicitly whitelist all system calls that are
required for these codepaths. However, that appears less than desirable as this
would mean socket() and many related calls have to be whitelisted
unconditionally. As writing system call filters requires a certain level of
expertise anyway it sounds like the better option to simply document these
issues and suggest that the user disables system call filters in the service
temporarily in order to debug any such failures.
execute: apply seccomp filters after changing selinux/aa/smack contexts
Seccomp is generally an unprivileged operation; changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.
This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).
This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.
@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes.
The system call is already part of @default hence implicitly allowed anyway.
Also, if it is actually blocked then systemd couldn't execute the service in
question anymore, since the application of seccomp is immediately followed by
it.