From effbd6d2eadb61bd236d118afc7901940c4c6b37 Mon Sep 17 00:00:00 2001
From: Lennart Poettering
Date: Fri, 26 Aug 2016 12:24:37 +0200
Subject: [PATCH] man: rework documentation for ReadOnlyPaths= and related settings

This reworks the documentation for ReadOnlyPaths=, ReadWritePaths=, InaccessiblePaths=. It no longer claims that we'd follow symlinks relative to the host file system. (Which wasn't true actually, as we didn't follow symlinks at all in the most recent releases, and we now do follow them, but relative to RootDirectory=).

This also replaces all references to the fact that all fs namespacing options can be undone with enough privileges and disable propagation by a single one in the documentation of ReadOnlyPaths= and friends, and then directs the reader to this in all other places.

Moreover a hint is added to the documentation of SystemCallFilter=, suggesting usage of ~@mount in case any of the fs namespacing related options are used.
---
 man/systemd.exec.xml | 214 +++++++++++++++++++------------------------
 1 file changed, 92 insertions(+), 122 deletions(-)

diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml
index 67182f17dc5..84f81fe38ef 100644
--- a/man/systemd.exec.xml
+++ b/man/systemd.exec.xml
@@ -877,48 +877,34 @@
  ReadOnlyPaths=
  InaccessiblePaths=
- Sets up a new file system namespace for
- executed processes. These options may be used to limit access
- a process might have to the main file system hierarchy. Each
- setting takes a space-separated list of paths relative to
- the host's root directory (i.e. the system running the service manager).
- Note that if entries contain symlinks, they are resolved from the host's root directory as well.
- Entries (files or directories) listed in
- ReadWritePaths= are accessible from
- within the namespace with the same access rights as from
- outside. Entries listed in
- ReadOnlyPaths= are accessible for
- reading only, writing will be refused even if the usual file
- access controls would permit this.
- Entries listed in
- InaccessiblePaths= will be made
- inaccessible for processes inside the namespace, and may not
- countain any other mountpoints, including those specified by
- ReadWritePaths= or
- ReadOnlyPaths=.
- Note that restricting access with these options does not extend
- to submounts of a directory that are created later on.
- Non-directory paths can be specified as well. These
- options may be specified more than once, in which case all
- paths listed will have limited access from within the
- namespace. If the empty string is assigned to this option, the
- specific list is reset, and all prior assignments have no
- effect.
- Paths in
- ReadOnlyPaths=
- and
- InaccessiblePaths=
- may be prefixed with
- -, in which case
- they will be ignored when they do not
- exist. Note that using this
- setting will disconnect propagation of
- mounts from the service to the host
- (propagation in the opposite direction
- continues to work). This means that
- this setting may not be used for
- services which shall be able to
- install mount points in the main mount
- namespace.
+ Sets up a new file system namespace for executed processes. These options may be used to limit
+ access a process might have to the file system hierarchy. Each setting takes a space-separated list of paths
+ relative to the host's root directory (i.e. the system running the service manager). Note that if paths
+ contain symlinks, they are resolved relative to the root directory set with
+ RootDirectory=.
+
+ Paths listed in ReadWritePaths= are accessible from within the namespace with the same
+ access modes as from outside of it. Paths listed in ReadOnlyPaths= are accessible for
+ reading only, writing will be refused even if the usual file access controls would permit this. Nest
+ ReadWritePaths= inside of ReadOnlyPaths= in order to provide writable
+ subdirectories within read-only directories.
+ Use ReadWritePaths= in order to whitelist
+ specific paths for write access if ProtectSystem=strict is used. Paths listed in
+ InaccessiblePaths= will be made inaccessible for processes inside the namespace (along with
+ everything below them in the file system hierarchy).
+
+ Note that restricting access with these options does not extend to submounts of a directory that are
+ created later on. Non-directory paths may be specified as well. These options may be specified more than once,
+ in which case all paths listed will have limited access from within the namespace. If the empty string is
+ assigned to this option, the specific list is reset, and all prior assignments have no effect.
+
+ Paths in ReadOnlyPaths= and InaccessiblePaths= may be prefixed with
+ -, in which case they will be ignored when they do not exist. Note that using this setting
+ will disconnect propagation of mounts from the service to the host (propagation in the opposite direction
+ continues to work). This means that this setting may not be used for services which shall be able to install
+ mount points in the main mount namespace. Note that the effect of these settings may be undone by privileged
+ processes. In order to set up an effective sandboxed environment for a unit it is thus recommended to combine
+ these settings with either CapabilityBoundingSet=~CAP_SYS_ADMIN or
+ SystemCallFilter=~@mount.
@@ -933,37 +919,30 @@
  private /tmp and /var/tmp namespace by using the JoinsNamespaceOf= directive, see systemd.unit(5) for
- details. Note that using this setting will disconnect propagation of mounts from the service to the host
- (propagation in the opposite direction continues to work). This means that this setting may not be used for
- services which shall be able to install mount points in the main mount namespace. This setting is implied if
- DynamicUser= is set.
+ details. This setting is implied if DynamicUser= is set.
+ For this setting the same
+ restrictions regarding mount propagation and privileges apply as for ReadOnlyPaths= and
+ related calls, see above.
+
+
  PrivateDevices=
- Takes a boolean argument. If true, sets up a
- new /dev namespace for the executed processes and only adds
- API pseudo devices such as /dev/null,
- /dev/zero or
- /dev/random (as well as the pseudo TTY
- subsystem) to it, but no physical devices such as
- /dev/sda. This is useful to securely turn
- off physical device access by the executed process. Defaults
- to false. Enabling this option will also remove
- CAP_MKNOD from the capability bounding
- set for the unit (see above), and set
- DevicePolicy=closed (see
+ Takes a boolean argument. If true, sets up a new /dev namespace for the executed processes and
+ only adds API pseudo devices such as /dev/null, /dev/zero or
+ /dev/random (as well as the pseudo TTY subsystem) to it, but no physical devices such as
+ /dev/sda. This is useful to securely turn off physical device access by the executed
+ process. Defaults to false. Enabling this option will also remove CAP_MKNOD from the
+ capability bounding set for the unit (see above), and set DevicePolicy=closed (see
  systemd.resource-control(5)
- for details). Note that using this setting will disconnect
- propagation of mounts from the service to the host
- (propagation in the opposite direction continues to work).
- This means that this setting may not be used for services
- which shall be able to install mount points in the main mount
- namespace. The /dev namespace will be mounted read-only and 'noexec'.
- The latter may break old programs which try to set up executable
- memory by using mmap(2)
- of /dev/zero instead of using MAP_ANON.
+ for details). Note that using this setting will disconnect propagation of mounts from the service to the host
+ (propagation in the opposite direction continues to work).
+ This means that this setting may not be used for
+ services which shall be able to install mount points in the main mount namespace. The /dev namespace will be
+ mounted read-only and 'noexec'. The latter may break old programs which try to set up executable memory by
+ using mmap(2) of
+ /dev/zero instead of using MAP_ANON. This setting is implied if
+ DynamicUser= is set. For this setting the same restrictions regarding mount propagation and
+ privileges apply as for ReadOnlyPaths= and related calls, see above.
@@ -1023,33 +1002,23 @@
  operating system (and optionally its configuration, and local mounts) is prohibited for the service. It is recommended to enable this setting for all long-running services, unless they are involved with system updates or need to modify the operating system in other ways. If this option is used,
- ReadWritePaths= may be used to exclude specific directories from being made read-only. Note
- that processes retaining the CAP_SYS_ADMIN capability (and with no system call filter that
- prohibits mount-related system calls applied) can undo the effect of this setting. This setting is hence
- particularly useful for daemons which have this either the @mount set filtered using
- SystemCallFilter=, or have the CAP_SYS_ADMIN capability removed, for
- example with CapabilityBoundingSet=. Defaults to off.
+ ReadWritePaths= may be used to exclude specific directories from being made read-only. This
+ setting is implied if DynamicUser= is set. For this setting the same restrictions regarding
+ mount propagation and privileges apply as for ReadOnlyPaths= and related calls, see
+ above. Defaults to off.
  ProtectHome=
- Takes a boolean argument or
- read-only. If true, the directories
- /home, /root and
- /run/user
- are made inaccessible and empty for processes invoked by this
- unit. If set to read-only, the three
- directories are made read-only instead.
- It is recommended to
- enable this setting for all long-running services (in
- particular network-facing ones), to ensure they cannot get
- access to private user data, unless the services actually
- require access to the user's private data. Note however that
- processes retaining the CAP_SYS_ADMIN capability can undo the
- effect of this setting. This setting is hence particularly
- useful for daemons which have this capability removed, for
- example with CapabilityBoundingSet=.
- Defaults to off.
+ Takes a boolean argument or read-only. If true, the directories
+ /home, /root and /run/user are made inaccessible
+ and empty for processes invoked by this unit. If set to read-only, the three directories are
+ made read-only instead. It is recommended to enable this setting for all long-running services (in particular
+ network-facing ones), to ensure they cannot get access to private user data, unless the services actually
+ require access to the user's private data. This setting is implied if DynamicUser= is
+ set. For this setting the same restrictions regarding mount propagation and privileges apply as for
+ ReadOnlyPaths= and related calls, see above.
@@ -1059,48 +1028,41 @@
  /proc/sys and /sys will be made read-only to all processes of the unit. Usually, tunable kernel variables should only be written at boot-time, with the sysctl.d(5) mechanism. Almost
- no services need to write to these at runtime; it is hence recommended to turn this on for most
- services. Defaults to off.
+ no services need to write to these at runtime; it is hence recommended to turn this on for most services. For
+ this setting the same restrictions regarding mount propagation and privileges apply as for
+ ReadOnlyPaths= and related calls, see above. Defaults to off.
  ProtectControlGroups=
- Takes a boolean argument. If true, the Linux Control Groups ("cgroups") hierarchies accessible
- through /sys/fs/cgroup will be made read-only to all processes of the unit.
- Except for
- container managers no services should require write access to the control groups hierarchies; it is hence
- recommended to turn this on for most services. Defaults to off.
+ Takes a boolean argument. If true, the Linux Control Groups (cgroups(7)) hierarchies
+ accessible through /sys/fs/cgroup will be made read-only to all processes of the
+ unit. Except for container managers no services should require write access to the control groups hierarchies;
+ it is hence recommended to turn this on for most services. For this setting the same restrictions regarding
+ mount propagation and privileges apply as for ReadOnlyPaths= and related calls, see
+ above. Defaults to off.
  MountFlags=
- Takes a mount propagation flag:
- shared, slave or
- private, which control whether mounts in the
- file system namespace set up for this unit's processes will
- receive or propagate mounts or unmounts. See
- mount(2)
- for details. Defaults to shared. Use
- shared to ensure that mounts and unmounts are
- propagated from the host to the container and vice versa. Use
- slave to run processes so that none of their
- mounts and unmounts will propagate to the host. Use
- private to also ensure that no mounts and
- unmounts from the host will propagate into the unit processes'
- namespace. Note that slave means that file
- systems mounted on the host might stay mounted continuously in
- the unit's namespace, and thus keep the device busy. Note that
- the file system namespace related options
- (PrivateTmp=,
- PrivateDevices=,
- ProtectSystem=,
- ProtectHome=,
- ReadOnlyPaths=,
- InaccessiblePaths= and
- ReadWritePaths=) require that mount
- and unmount propagation from the unit's file system namespace
- is disabled, and hence downgrade shared to
+ Takes a mount propagation flag: shared, slave or
+ private, which control whether mounts in the file system namespace set up for this unit's
+ processes will receive or propagate mounts or unmounts. See mount(2) for
+ details. Defaults to shared.
+ Use shared to ensure that mounts and unmounts
+ are propagated from the host to the container and vice versa. Use slave to run processes so
+ that none of their mounts and unmounts will propagate to the host. Use private to also ensure
+ that no mounts and unmounts from the host will propagate into the unit processes' namespace. Note that
+ slave means that file systems mounted on the host might stay mounted continuously in the
+ unit's namespace, and thus keep the device busy. Note that the file system namespace related options
+ (PrivateTmp=, PrivateDevices=, ProtectSystem=,
+ ProtectHome=, ProtectKernelTunables=,
+ ProtectControlGroups=, ReadOnlyPaths=,
+ InaccessiblePaths=, ReadWritePaths=) require that mount and unmount
+ propagation from the unit's file system namespace is disabled, and hence downgrade shared to
+ slave.
@@ -1335,7 +1297,15 @@
  Note, that as new system calls are added to the kernel, additional system calls might be added to the groups
- above, so the contents of the sets may change between systemd versions.
+ above, so the contents of the sets may change between systemd versions.
+
+ It is recommended to combine the file system namespacing related options with
+ SystemCallFilter=~@mount, in order to prohibit the unit's processes to undo the
+ mappings. Specifically these are the options PrivateTmp=,
+ PrivateDevices=, ProtectSystem=, ProtectHome=,
+ ProtectKernelTunables=, ProtectControlGroups=,
+ ReadOnlyPaths=, InaccessiblePaths= and
+ ReadWritePaths=.
--
2.39.2
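
For illustration only, the recommendation documented by this patch could be applied in a unit file roughly as follows. This is a minimal sketch, not part of the patch: the unit name, the binary path and the directories are hypothetical placeholders, and only the setting names are taken from the documentation text above.

```ini
# example.service (hypothetical unit; all paths are placeholders)
[Service]
ExecStart=/usr/bin/example-daemon

# Make the file system hierarchy read-only for this service ...
ProtectSystem=strict
# ... and whitelist a single directory for write access.
ReadWritePaths=/var/lib/example

# Hide a directory and everything below it; the "-" prefix means the
# path is ignored if it does not exist.
InaccessiblePaths=-/srv/private

PrivateTmp=yes
ProtectHome=yes

# Prevent the service from undoing the namespacing set up above.
# The documentation asks for either of these; using both is an
# assumption of this sketch.
SystemCallFilter=~@mount
CapabilityBoundingSet=~CAP_SYS_ADMIN
```

Because these settings disconnect mount propagation from the service to the host, a unit like this would not be suitable for a service that has to install mount points in the main mount namespace.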