From: Iago López Galeiras
Date: Tue, 7 Nov 2023 10:06:56 +0000 (+0100)
Subject: core: allow using seccomp without no_new_privs when unprivileged
X-Git-Tag: v255-rc2~91^2~1
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=24832d10b604848cf46624bb439c7fac27f3ce3f;p=thirdparty%2Fsystemd.git

core: allow using seccomp without no_new_privs when unprivileged

Until now, using any form of seccomp while being unprivileged (User=)
resulted in systemd enabling no_new_privs.

There's no need for doing this because:

* We trust the filters we apply
* If User= is set and a process wants to apply a new seccomp filter, it
  will need to set no_new_privs itself

An example of application that might want seccomp + !no_new_privs is a
program that wants to run as an unprivileged user but uses file
capabilities to start a web server on a privileged port while
benefitting from a restrictive seccomp profile.

We now keep the privileges needed to do seccomp before calling
enforce_user() and drop them after the seccomp filters are applied.

If the syscall filter doesn't allow the needed syscalls to drop the
privileges, we keep the previous behavior by enabling no_new_privs.
---

diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml
index 8db8deb36df..203e5ab4f55 100644
--- a/man/systemd.exec.xml
+++ b/man/systemd.exec.xml
@@ -823,21 +823,10 @@ CapabilityBoundingSet=~CAP_B CAP_C
 Takes a boolean argument. If true, ensures that the service process and all its children can never gain new privileges through execve() (e.g. via setuid or setgid bits, or filesystem capabilities). This is the simplest and most effective way to ensure that
- a process and its children can never elevate privileges again. Defaults to false, but certain
- settings override this and ignore the value of this setting. This is the case when
- DynamicUser=, LockPersonality=,
- MemoryDenyWriteExecute=, PrivateDevices=,
- ProtectClock=, ProtectHostname=,
- ProtectKernelLogs=, ProtectKernelModules=,
- ProtectKernelTunables=, RestrictAddressFamilies=,
- RestrictNamespaces=, RestrictRealtime=,
- RestrictSUIDSGID=, SystemCallArchitectures=,
- SystemCallFilter=, or SystemCallLog= are specified. Note that
- even if this setting is overridden by them, systemctl show shows the original
- value of this setting. In case the service will be run in a new mount namespace anyway and SELinux is
- disabled, all file systems are mounted with MS_NOSUID flag. Also see
- the kernel document
- No New Privileges Flag.
+ a process and its children can never elevate privileges again. Defaults to false. In case the service
+ will be run in a new mount namespace anyway and SELinux is disabled, all file systems are mounted with
+ MS_NOSUID flag. Also see No New Privileges Flag.
 Note that this setting only has an effect on the unit's processes themselves (or any processes
@@ -1779,9 +1768,7 @@ BindReadOnlyPaths=/var/lib/systemd
 mmap2 of /dev/zero instead of using MAP_ANON. For this setting the same restrictions regarding mount propagation and privileges apply as for
- ReadOnlyPaths= and related calls, see above. If turned on and if running in user
- mode, or in system mode, but without the CAP_SYS_ADMIN capability (e.g. setting
- User=), NoNewPrivileges=yes is implied.
+ ReadOnlyPaths= and related calls, see above.
 Note that the implementation of this setting might be impossible (for example if mount namespaces are not available), and the unit should be written in a way that does not solely rely on
@@ -1973,10 +1960,6 @@ BindReadOnlyPaths=/var/lib/systemd
 the system into the service, it is hence not suitable for services that need to take notice of system hostname changes dynamically.
- If this setting is on, but the unit doesn't have the CAP_SYS_ADMIN
- capability (e.g. services for which User= is set),
- NoNewPrivileges=yes is implied.
-
@@ -1994,9 +1977,7 @@ BindReadOnlyPaths=/var/lib/systemd
 Effectively, /dev/rtc0, /dev/rtc1, etc. are made read-only to the service. See systemd.resource-control5
- for the details about DeviceAllow=. If this setting is on, but the unit doesn't
- have the CAP_SYS_ADMIN capability (e.g. services for which
- User= is set), NoNewPrivileges=yes is implied.
+ for the details about DeviceAllow=.
 It is recommended to turn this on for most services that do not need modify the clock or check its state.
@@ -2018,13 +1999,10 @@ BindReadOnlyPaths=/var/lib/systemd
 sysctl.d5 mechanism. Few services need to write to these at runtime; it is hence recommended to turn this on for most services. For this setting the same restrictions regarding mount propagation and privileges apply as for
- ReadOnlyPaths= and related calls, see above. Defaults to off. If this
- setting is on, but the unit doesn't have the CAP_SYS_ADMIN capability
- (e.g. services for which User= is set),
- NoNewPrivileges=yes is implied. Note that this option does not prevent
- indirect changes to kernel tunables effected by IPC calls to other processes. However,
- InaccessiblePaths= may be used to make relevant IPC file system objects
- inaccessible. If ProtectKernelTunables= is set,
+ ReadOnlyPaths= and related calls, see above. Defaults to off.
+ Note that this option does not prevent indirect changes to kernel tunables effected by IPC calls to
+ other processes. However, InaccessiblePaths= may be used to make relevant IPC file system
+ objects inaccessible. If ProtectKernelTunables= is set,
 MountAPIVFS=yes is implied.
@@ -2046,9 +2024,7 @@ BindReadOnlyPaths=/var/lib/systemd
 both privileged and unprivileged. To disable module auto-load feature please see sysctl.d5 kernel.modules_disabled mechanism and
- /proc/sys/kernel/modules_disabled documentation. If this setting is on,
- but the unit doesn't have the CAP_SYS_ADMIN capability (e.g. services for
- which User= is set), NoNewPrivileges=yes is implied.
+ /proc/sys/kernel/modules_disabled documentation.
@@ -2067,9 +2043,7 @@ BindReadOnlyPaths=/var/lib/systemd
 syslog3 for userspace logging). The kernel exposes its log buffer to userspace via /dev/kmsg and /proc/kmsg. If enabled, these are made inaccessible to all the processes in the unit.
- If this setting is on, but the unit doesn't have the CAP_SYS_ADMIN
- capability (e.g. services for which User= is set),
- NoNewPrivileges=yes is implied.
+
@@ -2113,12 +2087,9 @@ BindReadOnlyPaths=/var/lib/systemd
 including x86-64). Note that on systems supporting multiple ABIs (such as x86/x86-64) it is recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the restrictions of this option. Specifically, it is recommended to combine this option with
- SystemCallArchitectures=native or similar. If running in user mode, or in system
- mode, but without the CAP_SYS_ADMIN capability (e.g. setting
- User=), NoNewPrivileges=yes is implied. By default, no
- restrictions apply, all address families are accessible to processes. If assigned the empty string,
- any previous address family restriction changes are undone. This setting does not affect commands
- prefixed with +.
+ SystemCallArchitectures=native or similar. By default, no restrictions apply, all
+ address families are accessible to processes. If assigned the empty string, any previous address family
+ restriction changes are undone. This setting does not affect commands prefixed with +.
 Use this option to limit exposure of processes to remote access, in particular via exotic and sensitive network protocols, such as AF_PACKET. Note that in most cases, the local
@@ -2251,9 +2222,7 @@ RestrictFileSystems=ext4
 creation and switching of the specified types of namespaces (or all of them, if true) access to the setns() system call with a zero flags parameter is prohibited. This setting is only supported on x86, x86-64, mips, mips-le, mips64, mips64-le, mips64-n32, mips64-le-n32, ppc64, ppc64-le, s390
- and s390x, and enforces no restrictions on other architectures. If running in user mode, or in system mode, but
- without the CAP_SYS_ADMIN capability (e.g. setting User=),
- NoNewPrivileges=yes is implied.
+ and s390x, and enforces no restrictions on other architectures.
 Example: if a unit has the following, RestrictNamespaces=cgroup ipc
@@ -2274,9 +2243,7 @@ RestrictNamespaces=~cgroup net
 project='man-pages'>personality2 system call so that the kernel execution domain may not be changed from the default or the personality selected with Personality= directive. This may be useful to improve security, because odd personality
- emulations may be poorly tested and source of vulnerabilities. If running in user mode, or in system mode, but
- without the CAP_SYS_ADMIN capability (e.g. setting User=),
- NoNewPrivileges=yes is implied.
+ emulations may be poorly tested and source of vulnerabilities.
@@ -2308,9 +2275,7 @@ RestrictNamespaces=~cgroup net
 available on x86. Note that on systems supporting multiple ABIs (such as x86/x86-64) it is recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the restrictions of this option. Specifically, it is recommended to combine this option with
- SystemCallArchitectures=native or similar. If running in user mode, or in system
- mode, but without the CAP_SYS_ADMIN capability (e.g. setting
- User=), NoNewPrivileges=yes is implied.
+ SystemCallArchitectures=native or similar.
@@ -2322,9 +2287,7 @@ RestrictNamespaces=~cgroup net
 the unit are refused. This restricts access to realtime task scheduling policies such as SCHED_FIFO, SCHED_RR or SCHED_DEADLINE. See sched7
- for details about these scheduling policies. If running in user mode, or in system mode, but without the
- CAP_SYS_ADMIN capability (e.g. setting User=),
- NoNewPrivileges=yes is implied. Realtime scheduling policies may be used to monopolize CPU
+ for details about these scheduling policies. Realtime scheduling policies may be used to monopolize CPU
 time for longer periods of time, and may hence be used to lock up or otherwise trigger Denial-of-Service situations on the system. It is hence recommended to restrict access to realtime scheduling to the few programs that actually require them. Defaults to off.
@@ -2338,10 +2301,8 @@ RestrictNamespaces=~cgroup net
 Takes a boolean argument. If set, any attempts to set the set-user-ID (SUID) or set-group-ID (SGID) bits on files or directories will be denied (for details on these bits see inode7). If
- running in user mode, or in system mode, but without the CAP_SYS_ADMIN
- capability (e.g. setting User=), NoNewPrivileges=yes is
- implied. As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
+ project='man-pages'>inode7).
+ As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
 identity of other users, it is recommended to restrict creation of SUID/SGID files to the few programs that actually require them. Note that this restricts marking of any type of file system object with these bits, including both regular files and directories (where the SGID is a different
@@ -2457,15 +2418,12 @@ RestrictNamespaces=~cgroup net
 full list). This value will be returned when a deny-listed system call is triggered, instead of terminating the processes immediately. Special setting kill can be used to explicitly specify killing. This value takes precedence over the one given in
- SystemCallErrorNumber=, see below. If running in user mode, or in system mode,
- but without the CAP_SYS_ADMIN capability (e.g. setting
- User=), NoNewPrivileges=yes is implied. This feature
- makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful
- for enforcing a minimal sandboxing environment. Note that the execve(),
- exit(), exit_group(), getrlimit(),
- rt_sigreturn(), sigreturn() system calls and the system calls
- for querying time and sleeping are implicitly allow-listed and do not need to be listed
- explicitly. This option may be specified more than once, in which case the filter masks are
+ SystemCallErrorNumber=, see below. This feature makes use of the Secure Computing Mode 2
+ interfaces of the kernel ('seccomp filtering') and is useful for enforcing a minimal sandboxing environment.
+ Note that the execve(), exit(), exit_group(),
+ getrlimit(), rt_sigreturn(), sigreturn()
+ system calls and the system calls for querying time and sleeping are implicitly allow-listed and do not
+ need to be listed explicitly. This option may be specified more than once, in which case the filter masks are
 merged. If the empty string is assigned, the filter is reset, all prior assignments will have no effect. This does not affect commands prefixed with +.
@@ -2692,10 +2650,7 @@ SystemCallErrorNumber=EPERM
 as well as x32, mips64-n32, mips64-le-n32, and the special identifier native. The special identifier native implicitly maps to the native architecture of the system (or more precisely: to the architecture the system
- manager is compiled for). If running in user mode, or in system mode, but without the
- CAP_SYS_ADMIN capability (e.g. setting User=),
- NoNewPrivileges=yes is implied. By default, this option is set to the empty list, i.e. no
- filtering is applied.
+ manager is compiled for). By default, this option is set to the empty list, i.e. no filtering is applied.
 If this setting is used, processes of this unit will only be permitted to call native system calls, and system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated
@@ -2723,13 +2678,11 @@ SystemCallErrorNumber=EPERM
 Takes a space-separated list of system call names. If this setting is used, all system calls executed by the unit processes for the listed ones will be logged. If the first character of the list is ~, the effect is inverted: all system calls except the
- listed system calls will be logged. If running in user mode, or in system mode, but without the
- CAP_SYS_ADMIN capability (e.g. setting User=),
- NoNewPrivileges=yes is implied. This feature makes use of the Secure Computing
- Mode 2 interfaces of the kernel ('seccomp filtering') and is useful for auditing or setting up a
- minimal sandboxing environment. This option may be specified more than once, in which case the filter
- masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will
- have no effect. This does not affect commands prefixed with +.
+ listed system calls will be logged. This feature makes use of the Secure Computing Mode 2 interfaces
+ of the kernel ('seccomp filtering') and is useful for auditing or setting up a minimal sandboxing
+ environment. This option may be specified more than once, in which case the filter masks are merged.
+ If the empty string is assigned, the filter is reset, all prior assignments will have no effect.
+ This does not affect commands prefixed with +.
diff --git a/src/basic/capability-util.c b/src/basic/capability-util.c
index 1698ea02cca..e84e00a6f61 100644
--- a/src/basic/capability-util.c
+++ b/src/basic/capability-util.c
@@ -367,16 +367,16 @@ int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
         return 0;
 }

-int drop_capability(cap_value_t cv) {
+static int change_capability(cap_value_t cv, cap_flag_value_t flag) {
         _cleanup_cap_free_ cap_t tmp_cap = NULL;

         tmp_cap = cap_get_proc();
         if (!tmp_cap)
                 return -errno;

-        if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, CAP_CLEAR) < 0) ||
-            (cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, CAP_CLEAR) < 0) ||
-            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0))
+        if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, flag) < 0) ||
+            (cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, flag) < 0) ||
+            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, flag) < 0))
                 return -errno;

         if (cap_set_proc(tmp_cap) < 0)
@@ -385,6 +385,14 @@ int drop_capability(cap_value_t cv) {
         return 0;
 }

+int drop_capability(cap_value_t cv) {
+        return change_capability(cv, CAP_CLEAR);
+}
+
+int keep_capability(cap_value_t cv) {
+        return change_capability(cv, CAP_SET);
+}
+
 bool ambient_capabilities_supported(void) {
         static int cache = -1;

diff --git a/src/basic/capability-util.h b/src/basic/capability-util.h
index 3e0f901a3d3..f911de8b1b7 100644
--- a/src/basic/capability-util.h
+++ b/src/basic/capability-util.h
@@ -31,6 +31,7 @@ int capability_update_inherited_set(cap_t caps, uint64_t ambient_set);
 int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);

 int drop_capability(cap_value_t cv);
+int keep_capability(cap_value_t cv);

 DEFINE_TRIVIAL_CLEANUP_FUNC_FULL(cap_t, cap_free, NULL);
 #define _cleanup_cap_free_ _cleanup_(cap_freep)
diff --git a/src/core/exec-invoke.c b/src/core/exec-invoke.c
index b1467947e5a..245e9b5a3d9 100644
--- a/src/core/exec-invoke.c
+++ b/src/core/exec-invoke.c
@@ -1378,15 +1378,7 @@ static bool context_has_syscall_logs(const ExecContext *c) {
                 !hashmap_isempty(c->syscall_log);
 }

-static bool context_has_no_new_privileges(const ExecContext *c) {
-        assert(c);
-
-        if (c->no_new_privileges)
-                return true;
-
-        if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
-                return false;
-
+static bool context_has_seccomp(const ExecContext *c) {
         /* We need NNP if we have any form of seccomp and are unprivileged */
         return c->lock_personality ||
                 c->memory_deny_write_execute ||
@@ -1405,8 +1397,49 @@ static bool context_has_no_new_privileges(const ExecContext *c) {
                 context_has_syscall_logs(c);
 }

+static bool context_has_no_new_privileges(const ExecContext *c) {
+        assert(c);
+
+        if (c->no_new_privileges)
+                return true;
+
+        if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
+                return false;
+
+        return context_has_seccomp(c);
+}
+
 #if HAVE_SECCOMP

+static bool seccomp_allows_drop_privileges(const ExecContext *c) {
+        void *id, *val;
+        bool has_capget = false, has_capset = false, has_prctl = false;
+
+        assert(c);
+
+        /* No syscall filter, we are allowed to drop privileges */
+        if (hashmap_isempty(c->syscall_filter))
+                return true;
+
+        HASHMAP_FOREACH_KEY(val, id, c->syscall_filter) {
+                _cleanup_free_ char *name = NULL;
+
+                name = seccomp_syscall_resolve_num_arch(SCMP_ARCH_NATIVE, PTR_TO_INT(id) - 1);
+
+                if (streq(name, "capget"))
+                        has_capget = true;
+                else if (streq(name, "capset"))
+                        has_capset = true;
+                else if (streq(name, "prctl"))
+                        has_prctl = true;
+        }
+
+        if (c->syscall_allow_list)
+                return has_capget && has_capset && has_prctl;
+        else
+                return !(has_capget || has_capset || has_prctl);
+}
+
 static bool skip_seccomp_unavailable(const ExecContext *c, const ExecParameters *p, const char* msg) {

         if (is_seccomp_available())
@@ -3911,6 +3944,7 @@ int exec_invoke(
                 needs_setuid,           /* Do we need to do the actual setresuid()/setresgid() calls? */
                 needs_mount_namespace,  /* Do we need to set up a mount namespace for this kernel? */
                 needs_ambient_hack;     /* Do we need to apply the ambient capabilities hack? */
+        bool keep_seccomp_privileges = false;
 #if HAVE_SELINUX
         _cleanup_free_ char *mac_selinux_context_net = NULL;
         bool use_selinux = false;
@@ -3920,6 +3954,9 @@ int exec_invoke(
 #endif
 #if HAVE_APPARMOR
         bool use_apparmor = false;
+#endif
+#if HAVE_SECCOMP
+        uint64_t saved_bset = 0;
 #endif
         uid_t saved_uid = getuid();
         gid_t saved_gid = getgid();
@@ -4817,6 +4854,28 @@ int exec_invoke(
                                 (UINT64_C(1) << CAP_SETUID) |
                                 (UINT64_C(1) << CAP_SETGID);

+#if HAVE_SECCOMP
+                /* If the service has any form of a seccomp filter and it allows dropping privileges, we'll
+                 * keep the needed privileges to apply it even if we're not root. */
+                if (needs_setuid &&
+                    uid_is_valid(uid) &&
+                    context_has_seccomp(context) &&
+                    seccomp_allows_drop_privileges(context)) {
+                        keep_seccomp_privileges = true;
+
+                        if (prctl(PR_SET_KEEPCAPS, 1) < 0) {
+                                *exit_status = EXIT_USER;
+                                return log_exec_error_errno(context, params, errno, "Failed to enable keep capabilities flag: %m");
+                        }
+
+                        /* Save the current bounding set so we can restore it after applying the seccomp
+                         * filter */
+                        saved_bset = bset;
+                        bset |= (UINT64_C(1) << CAP_SYS_ADMIN) |
+                                (UINT64_C(1) << CAP_SETPCAP);
+                }
+#endif
+
                 if (!cap_test_all(bset)) {
                         r = capability_bounding_set_drop(bset, /* right_now= */ false);
                         if (r < 0) {
@@ -4858,6 +4917,26 @@ int exec_invoke(
                                         return log_exec_error_errno(context, params, r, "Failed to change UID to " UID_FMT ": %m", uid);
                                 }

+                                if (keep_seccomp_privileges) {
+                                        r = drop_capability(CAP_SETUID);
+                                        if (r < 0) {
+                                                *exit_status = EXIT_USER;
+                                                return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETUID: %m");
+                                        }
+
+                                        r = keep_capability(CAP_SYS_ADMIN);
+                                        if (r < 0) {
+                                                *exit_status = EXIT_USER;
+                                                return log_exec_error_errno(context, params, r, "Failed to keep CAP_SYS_ADMIN: %m");
+                                        }
+
+                                        r = keep_capability(CAP_SETPCAP);
+                                        if (r < 0) {
+                                                *exit_status = EXIT_USER;
+                                                return log_exec_error_errno(context, params, r, "Failed to keep CAP_SETPCAP: %m");
+                                        }
+                                }
+
                                 if (!needs_ambient_hack && capability_ambient_set != 0) {

                                         /* Raise the ambient capabilities after user change. */
@@ -5027,21 +5106,60 @@ int exec_invoke(
                         *exit_status = EXIT_SECCOMP;
                         return log_exec_error_errno(context, params, r, "Failed to apply system call log filters: %m");
                 }
+#endif

-                /* This really should remain the last step before the execve(), to make sure our own code is unaffected
+#if HAVE_LIBBPF
+                r = apply_restrict_filesystems(context, params);
+                if (r < 0) {
+                        *exit_status = EXIT_BPF;
+                        return log_exec_error_errno(context, params, r, "Failed to restrict filesystems: %m");
+                }
+#endif
+
+#if HAVE_SECCOMP
+                /* This really should remain as close to the execve() as possible, to make sure our own code is unaffected
                  * by the filter as little as possible. */
                 r = apply_syscall_filter(context, params, needs_ambient_hack);
                 if (r < 0) {
                         *exit_status = EXIT_SECCOMP;
                         return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
                 }
-#endif

-#if HAVE_LIBBPF
-                r = apply_restrict_filesystems(context, params);
-                if (r < 0) {
-                        *exit_status = EXIT_BPF;
-                        return log_exec_error_errno(context, params, r, "Failed to restrict filesystems: %m");
+                if (keep_seccomp_privileges) {
+                        /* Restore the capability bounding set with what's expected from the service + the
+                         * ambient capabilities hack */
+                        if (!cap_test_all(saved_bset)) {
+                                r = capability_bounding_set_drop(saved_bset, /* right_now= */ false);
+                                if (r < 0) {
+                                        *exit_status = EXIT_CAPABILITIES;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop bset capabilities: %m");
+                                }
+                        }
+
+                        /* Only drop CAP_SYS_ADMIN if it's not in the bounding set, otherwise we'll break
+                         * applications that use it. */
+                        if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SYS_ADMIN))) {
+                                r = drop_capability(CAP_SYS_ADMIN);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SYS_ADMIN: %m");
+                                }
+                        }
+
+                        /* Only drop CAP_SETPCAP if it's not in the bounding set, otherwise we'll break
+                         * applications that use it. */
+                        if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SETPCAP))) {
+                                r = drop_capability(CAP_SETPCAP);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETPCAP: %m");
+                                }
+                        }
+
+                        if (prctl(PR_SET_KEEPCAPS, 0) < 0) {
+                                *exit_status = EXIT_USER;
+                                return log_exec_error_errno(context, params, errno, "Failed to drop keep capabilities flag: %m");
+                        }
                 }
 #endif

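
The allow-list/deny-list decision made by seccomp_allows_drop_privileges() in the patch can be sketched without libseccomp as a plain function over syscall names. This is only an illustration: the name filter_allows_drop_privileges and the string-array input are hypothetical and not part of systemd, and skipping NULL entries is an assumption of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of systemd: decides whether a syscall filter
 * still permits capget(), capset() and prctl(), which are the calls needed to
 * drop the temporarily kept capabilities after the filter is installed. */
static bool filter_allows_drop_privileges(const char *const *names, size_t n_names, bool is_allow_list) {
        bool has_capget = false, has_capset = false, has_prctl = false;

        assert(names || n_names == 0);

        /* An empty filter restricts nothing. */
        if (n_names == 0)
                return true;

        for (size_t i = 0; i < n_names; i++) {
                if (!names[i]) /* assumption of this sketch: unresolved entries are ignored */
                        continue;

                if (strcmp(names[i], "capget") == 0)
                        has_capget = true;
                else if (strcmp(names[i], "capset") == 0)
                        has_capset = true;
                else if (strcmp(names[i], "prctl") == 0)
                        has_prctl = true;
        }

        /* Allow-list: all three must be listed. Deny-list: none may be listed. */
        if (is_allow_list)
                return has_capget && has_capset && has_prctl;

        return !(has_capget || has_capset || has_prctl);
}
```

When such a check returns false, the commit message says the previous behavior applies: no_new_privs is enabled instead of keeping the extra capabilities across enforce_user().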