1 .\" Copyright (c) 2013, 2014 by Michael Kerrisk <mtk.manpages@gmail.com>
2 .\" and Copyright (c) 2012, 2014 by Eric W. Biederman <ebiederm@xmission.com>
4 .\" %%%LICENSE_START(VERBATIM)
5 .\" Permission is granted to make and distribute verbatim copies of this
6 .\" manual provided the copyright notice and this permission notice are
7 .\" preserved on all copies.
9 .\" Permission is granted to copy and distribute modified versions of this
10 .\" manual under the conditions for verbatim copying, provided that the
11 .\" entire resulting derived work is distributed under the terms of a
12 .\" permission notice identical to this one.
14 .\" Since the Linux kernel and libraries are constantly changing, this
15 .\" manual page may be incorrect or out-of-date. The author(s) assume no
16 .\" responsibility for errors or omissions, or for damages resulting from
17 .\" the use of the information contained herein. The author(s) may not
18 .\" have taken the same level of care in the production of this manual,
19 .\" which is licensed free of charge, as they might when working
22 .\" Formatted or processed versions of this manual, if unaccompanied by
23 .\" the source, must acknowledge the copyright and authors of this work.
27 .TH USER_NAMESPACES 7 2015-03-29 "Linux" "Linux Programmer's Manual"
29 user_namespaces \- overview of Linux user namespaces
31 For an overview of namespaces, see
34 User namespaces isolate security-related identifiers and attributes,
36 user IDs and group IDs (see
41 .\" FIXME: This page says very little about the interaction
42 .\" of user namespaces and keys. Add something on this topic.
44 .BR capabilities (7)).
45 A process's user and group IDs can be different
46 inside and outside a user namespace.
48 a process can have a normal unprivileged user ID outside a user namespace
49 while at the same time having a user ID of 0 inside the namespace;
51 the process has full privileges for operations inside the user namespace,
52 but is unprivileged for operations outside the namespace.
54 .\" ============================================================
56 .SS Nested namespaces, namespace membership
57 User namespaces can be nested;
58 that is, each user namespace\(emexcept the initial ("root")
59 namespace\(emhas a parent user namespace,
60 and can have zero or more child user namespaces.
61 The parent user namespace is the user namespace
62 of the process that creates the user namespace via a call to
70 The kernel imposes (since version 3.11) a limit of 32 nested levels of
71 .\" commit 8742f229b635bf1c1c84a3dfe5e47c814c20b5c8
73 .\" FIXME Explain the rationale for this limit. (What is the rationale?)
78 that would cause this limit to be exceeded fail with the error
81 Each process is a member of exactly one user namespace.
88 flag is a member of the same user namespace as its parent.
89 A single-threaded process can join another user namespace with
94 upon doing so, it gains a full set of capabilities in that namespace.
102 flag makes the new child process (for
106 a member of the new user namespace created by the call.
108 .\" ============================================================
111 The child process created by
115 flag starts out with a complete set
116 of capabilities in the new user namespace.
117 Likewise, a process that creates a new user namespace using
119 or joins an existing user namespace using
121 gains a full set of capabilities in that namespace.
123 that process has no capabilities in the parent (in the case of
125 or previous (in the case of
130 even if the new namespace is created or joined by the root user
131 (i.e., a process with user ID 0 in the root namespace).
135 will cause a process's capabilities to be recalculated in the usual way (see
136 .BR capabilities (7)).
138 unless the process has a user ID of 0 within the namespace,
139 or the executable file has a nonempty inheritable capabilities mask,
140 the process will lose all capabilities.
141 See the discussion of user and group ID mappings, below.
150 flag sets the "securebits" flags
152 .BR capabilities (7))
153 to their default values (all flags disabled) in the child (for
159 Note that because the caller no longer has capabilities
160 in its original user namespace after a call to
162 it is not possible for a process to reset its "securebits" flags while
163 retaining its user namespace membership by using a pair of
165 calls to move to another user namespace and then return to
166 its original user namespace.
168 The rules for determining whether or not a process has a capability
169 in a particular user namespace are as follows:
171 A process has a capability inside a user namespace
172 if it is a member of that namespace and
173 it has the capability in its effective capability set.
174 A process can gain capabilities in its effective capability
176 For example, it may execute a set-user-ID program or an
177 executable with associated file capabilities.
179 a process may gain capabilities via the effect of
184 as already described.
185 .\" In the 3.8 sources, see security/commoncap.c::cap_capable():
187 If a process has a capability in a user namespace,
188 then it has that capability in all child (and further removed descendant)
191 .\" * The owner of the user namespace in the parent of the
192 .\" * user namespace has all caps.
193 When a user namespace is created, the kernel records the effective
194 user ID of the creating process as being the "owner" of the namespace.
195 .\" (and likewise associates the effective group ID of the creating process
196 .\" with the namespace).
197 A process that resides
198 in the parent of the user namespace
199 .\" See kernel commit 520d9eabce18edfef76a60b7b839d54facafe1f9 for a fix
201 and whose effective user ID matches the owner of the namespace
202 has all capabilities in the namespace.
203 .\" This includes the case where the process executes a set-user-ID
204 .\" program that confers the effective UID of the creator of the namespace.
205 By virtue of the previous rule,
206 this means that the process has all capabilities in all
207 further removed descendant user namespaces as well.
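.PP
The three rules above can be illustrated with a toy model.
The names used below (toy_userns, toy_proc, toy_has_cap, TOY_CAP_SYS_ADMIN)
are hypothetical and are not kernel interfaces,
and the model assumes that the recorded owner UID and the process's
effective UID are expressed in comparable terms.
It is a sketch of how the rules combine,
not a description of the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAP_SYS_ADMIN 21    /* Bit number used for illustration */

/* Hypothetical model of a user namespace */
struct toy_userns {
    const struct toy_userns *parent;  /* NULL for the initial namespace */
    unsigned int owner_euid;          /* Effective UID of the creator */
};

/* Hypothetical model of a process */
struct toy_proc {
    const struct toy_userns *ns;      /* Namespace the process belongs to */
    unsigned int euid;                /* Effective UID of the process */
    unsigned long long effective;     /* Effective capability set (bit mask) */
};

/* Return true if 'p' has capability 'cap' in namespace 'target' */
static bool
toy_has_cap(const struct toy_proc *p, const struct toy_userns *target,
            int cap)
{
    const struct toy_userns *ns;

    /* Walk from the target namespace up toward the initial namespace */
    for (ns = target; ns != NULL; ns = ns->parent) {

        /* Rules 1 and 2: the process is a member of this namespace
           (the target itself or one of its ancestors), so the answer
           is whatever its effective set says */
        if (ns == p->ns)
            return ((p->effective >> cap) & 1) != 0;

        /* Rule 3: the process resides in the parent of this namespace
           and its effective UID matches the namespace owner */
        if (ns->parent == p->ns && ns->owner_euid == p->euid)
            return true;
    }

    /* The target is not the process's namespace or a descendant of it */
    return false;
}
```

In this model, a process with an empty effective set in the parent
namespace still has every capability in a child namespace that it owns,
while a process with a full effective set in the child has none in the parent.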
209 .\" ============================================================
211 .SS Effect of capabilities within a user namespace
212 Having a capability inside a user namespace
213 permits a process to perform operations (that require privilege)
214 only on resources governed by that namespace.
215 In other words, having a capability in a user namespace permits a process
216 to perform privileged operations on resources that are governed by (nonuser)
217 namespaces associated with the user namespace (see the next subsection).
219 On the other hand, there are many privileged operations that affect
220 resources that are not associated with any namespace type,
221 for example, changing the system time (governed by
223 loading a kernel module (governed by
224 .BR CAP_SYS_MODULE ),
225 and creating a device (governed by
227 Only a process with privileges in the
229 user namespace can perform such operations.
233 within the user namespace associated with a process's mount namespace
234 allows that process to create bind mounts
235 and mount the following types of filesystems:
236 .\" fs_flags = FS_USERNS_MOUNT in kernel sources
260 .\" commit b2197755b2633e164a439682fb05a9b5ea48f706
within the user namespace associated with a process's cgroup namespace
268 allows (since Linux 4.6)
269 that process to mount cgroup filesystems.
273 within the user namespace associated with a process's PID namespace
274 allows (since Linux 3.8)
275 that process to mount
Note, however, that mounting block-based filesystems can be done
280 only by a process that holds
282 in the initial user namespace.
284 .\" ============================================================
286 .SS Interaction of user namespaces and other types of namespaces
287 Starting in Linux 3.8, unprivileged processes can create user namespaces,
and the other types of namespaces can be created with just the
290 capability in the caller's user namespace.
292 When a non-user-namespace is created,
293 it is owned by the user namespace in which the creating process
294 was a member at the time of the creation of the namespace.
295 Actions on the non-user-namespace
296 require capabilities in the corresponding user namespace.
300 is specified along with other
306 call, the user namespace is guaranteed to be created first,
311 privileges over the remaining namespaces created by the call.
312 Thus, it is possible for an unprivileged caller to specify this combination
315 When a new namespace (other than a user namespace) is created via
319 the kernel records the user namespace of the creating process against
321 (This association can't be changed.)
322 When a process in the new namespace subsequently performs
323 privileged operations that operate on global
324 resources isolated by the namespace,
325 the permission checks are performed according to the process's capabilities
326 in the user namespace that the kernel associated with the new namespace.
327 For example, suppose that a process attempts to change the hostname
328 .RB ( sethostname (2)),
329 a resource governed by the UTS namespace.
331 the kernel will determine which user namespace is associated with
332 the process's UTS namespace, and check whether the process has the
334 .RB ( CAP_SYS_ADMIN )
335 in that user namespace.
337 .\" ============================================================
339 .SS Restrictions on mount namespaces
341 Note the following points with respect to mount namespaces:
343 A mount namespace has an owner user namespace.
344 A mount namespace whose owner user namespace is different from
345 the owner user namespace of its parent mount namespace is
346 considered a less privileged mount namespace.
348 When creating a less privileged mount namespace,
349 shared mounts are reduced to slave mounts.
350 This ensures that mappings performed in less
351 privileged mount namespaces will not propagate to more privileged
355 .\" What does "come as a single unit from more privileged mount" mean?
Mounts that come as a single unit from a more privileged mount namespace are
357 locked together and may not be separated in a less privileged mount
362 operation brings across all of the mounts from the original
363 mount namespace as a single unit,
364 and recursive mounts that propagate between
365 mount namespaces propagate as a single unit.)
373 and the "atime" flags
377 settings become locked
378 .\" commit 9566d6742852c527bf5af38af5cbb878dad75705
379 .\" Author: Eric W. Biederman <ebiederm@xmission.com>
380 .\" Date: Mon Jul 28 17:26:07 2014 -0700
382 .\" mnt: Correct permission checks in do_remount
384 when propagated from a more privileged to
385 a less privileged mount namespace,
386 and may not be changed in the less privileged mount namespace.
388 .\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
A file or directory that is a mount point in one namespace but not
a mount point in another namespace may be renamed, unlinked, or removed
392 in the mount namespace in which it is not a mount point
393 (subject to the usual permission checks).
395 Previously, attempting to unlink, rename, or remove a file or directory
396 that was a mount point in another mount namespace would result in the error
398 That behavior had technical problems of enforcement (e.g., for NFS)
and permitted denial-of-service attacks against more privileged users
(i.e., preventing individual files from being updated
401 by bind mounting on top of them).
403 .\" ============================================================
405 .SS User and group ID mappings: uid_map and gid_map
406 When a user namespace is created,
407 it starts out without a mapping of user IDs (group IDs)
408 to the parent user namespace.
410 .IR /proc/[pid]/uid_map
412 .IR /proc/[pid]/gid_map
413 files (available since Linux 3.5)
414 .\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
415 expose the mappings for user and group IDs
416 inside the user namespace for the process
418 These files can be read to view the mappings in a user namespace and
419 written to (once) to define the mappings.
421 The description in the following paragraphs explains the details for
425 but each instance of "user ID" is replaced by "group ID".
429 file exposes the mapping of user IDs from the user namespace
432 to the user namespace of the process that opened
434 (but see a qualification to this point below).
435 In other words, processes that are in different user namespaces
436 will potentially see different values when reading from a particular
438 file, depending on the user ID mappings for the user namespaces
439 of the reading processes.
443 file specifies a 1-to-1 mapping of a range of contiguous
444 user IDs between two user namespaces.
445 (When a user namespace is first created, this file is empty.)
446 The specification in each line takes the form of
447 three numbers delimited by white space.
448 The first two numbers specify the starting user ID in
449 each of the two user namespaces.
450 The third number specifies the length of the mapped range.
451 In detail, the fields are interpreted as follows:
453 The start of the range of user IDs in
454 the user namespace of the process
457 The start of the range of user
458 IDs to which the user IDs specified by field one map.
459 How field two is interpreted depends on whether the process that opened
463 are in the same user namespace, as follows:
466 If the two processes are in different user namespaces:
467 field two is the start of a range of
468 user IDs in the user namespace of the process that opened
471 If the two processes are in the same user namespace:
472 field two is the start of the range of
473 user IDs in the parent user namespace of the process
475 This case enables the opener of
477 (the common case here is opening
478 .IR /proc/self/uid_map )
479 to see the mapping of user IDs into the user namespace of the process
480 that created this user namespace.
483 The length of the range of user IDs that is mapped between the two
486 System calls that return user IDs (group IDs)\(emfor example,
489 and the credential fields in the structure returned by
490 .BR stat (2)\(emreturn
491 the user ID (group ID) mapped into the caller's user namespace.
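.PP
As an illustration of the three-field format described above,
the following sketch parses one map line and translates an ID in
either direction.
The names id_extent, parse_map_line, map_inside_to_outside, and
map_outside_to_inside are hypothetical helpers introduced for this example;
they are not part of any library or kernel interface,
and the sketch does not guard against ranges that would wrap past
the 32-bit limit.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One line of a uid_map or gid_map file (hypothetical helper type) */
struct id_extent {
    uint32_t inside;    /* Field 1: start of range in the namespace */
    uint32_t outside;   /* Field 2: start of range it maps to */
    uint32_t length;    /* Field 3: number of IDs in the range */
};

/* Parse "inside outside length"; return false if the line is malformed */
static bool
parse_map_line(const char *line, struct id_extent *e)
{
    unsigned long long f1, f2, f3;

    if (sscanf(line, "%llu %llu %llu", &f1, &f2, &f3) != 3)
        return false;
    if (f1 > UINT32_MAX || f2 > UINT32_MAX || f3 > UINT32_MAX || f3 == 0)
        return false;

    e->inside = (uint32_t) f1;
    e->outside = (uint32_t) f2;
    e->length = (uint32_t) f3;
    return true;
}

/* Translate an ID from field-1 terms to field-2 terms.
   Return false if the ID lies outside the mapped range. */
static bool
map_inside_to_outside(const struct id_extent *e, uint32_t id, uint32_t *out)
{
    if (id < e->inside || id - e->inside >= e->length)
        return false;
    *out = e->outside + (id - e->inside);
    return true;
}

/* Translate in the opposite direction (field-2 terms to field-1 terms) */
static bool
map_outside_to_inside(const struct id_extent *e, uint32_t id, uint32_t *out)
{
    if (id < e->outside || id - e->outside >= e->length)
        return false;
    *out = e->inside + (id - e->outside);
    return true;
}
```

With the line "0 1000 1", ID 0 in the namespace translates to 1000,
ID 1000 translates back to 0, and every other ID is unmapped.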
493 When a process accesses a file, its user and group IDs
494 are mapped into the initial user namespace for the purpose of permission
495 checking and assigning IDs when creating a file.
496 When a process retrieves file user and group IDs via
498 the IDs are mapped in the opposite direction,
499 to produce values relative to the process user and group ID mappings.
501 The initial user namespace has no parent namespace,
502 but, for consistency, the kernel provides dummy user and group
503 ID mapping files for this namespace.
508 is the same) from a shell in the initial namespace shows:
512 $ \fBcat /proc/$$/uid_map\fP
517 This mapping tells us
518 that the range starting at user ID 0 in this namespace
519 maps to a range starting at 0 in the (nonexistent) parent namespace,
520 and the length of the range is the largest 32-bit unsigned integer.
521 This leaves 4294967295 (the 32-bit signed \-1 value) unmapped.
524 is used in several interfaces (e.g.,
526 as a way to specify "no user ID".
529 unmapped and unusable guarantees that there will be no
530 confusion when using these interfaces.
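.PP
The arithmetic behind this can be checked with a short sketch.
The constant and function names below are hypothetical and exist only
for illustration;
the values are those of the dummy mapping shown above.

```c
#include <stdbool.h>
#include <stdint.h>

/* The dummy mapping of the initial namespace: "0 0 4294967295" */
static const uint32_t INITIAL_MAP_START = 0;
static const uint32_t INITIAL_MAP_LENGTH = 4294967295u;

/* Return true if 'id' falls inside the range covered by that mapping */
static bool
initial_map_covers(uint32_t id)
{
    return id - INITIAL_MAP_START < INITIAL_MAP_LENGTH;
}
```

The range covers IDs 0 through 4294967294;
the value (uint32_t) \-1, that is 4294967295, falls just outside it.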
532 .\" ============================================================
534 .SS Defining user and group ID mappings: writing to uid_map and gid_map
536 After the creation of a new user namespace, the
540 of the processes in the namespace may be written to
542 to define the mapping of user IDs in the new user namespace.
543 An attempt to write more than once to a
545 file in a user namespace fails with the error
547 Similar rules apply for
554 must conform to the following rules:
556 The three fields must be valid numbers,
557 and the last field must be greater than 0.
559 Lines are terminated by newline characters.
561 There is an (arbitrary) limit on the number of lines in the file.
562 As at Linux 3.18, the limit is five lines.
563 In addition, the number of bytes written to
564 the file must be less than the system page size,
565 .\" FIXME(Eric): the restriction "less than" rather than "less than or equal"
566 .\" seems strangely arbitrary. Furthermore, the comment does not agree
567 .\" with the code in kernel/user_namespace.c. Which is correct?
568 and the write must be performed at the start of the file (i.e.,
572 can't be used to write to nonzero offsets in the file).
574 The range of user IDs (group IDs)
575 specified in each line cannot overlap with the ranges
577 In the initial implementation (Linux 3.8), this requirement was
578 satisfied by a simplistic implementation that imposed the further
580 the values in both field 1 and field 2 of successive lines must be
581 in ascending numerical order,
582 which prevented some otherwise valid maps from being created.
584 .\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba
585 fix this limitation, allowing any valid set of nonoverlapping maps.
587 At least one line must be written to the file.
589 Writes that violate the above rules fail with the error
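.PP
The format rules listed above can be approximated by a user-space check
performed before writing.
The sketch below is an assumption-laden illustration, not the kernel's parser:
the names extent, ranges_overlap, and map_text_is_valid are hypothetical,
the line limit is the one stated above for Linux 3.18,
and the check is somewhat stricter than necessary
(for example, it rejects trailing blanks and blank lines).
A final line without a terminating newline is accepted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_MAP_LINES 5     /* Limit stated in the text for Linux 3.18 */

struct extent {
    uint64_t inside, outside, length;
};

/* Do the half-open ranges [a, a+alen) and [b, b+blen) intersect? */
static bool
ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Return true if 'text' looks like an acceptable map specification */
static bool
map_text_is_valid(const char *text, size_t page_size)
{
    struct extent seen[MAX_MAP_LINES];
    size_t nlines = 0;
    const char *p = text;

    /* The number of bytes written must be less than the page size */
    if (strlen(text) >= page_size)
        return false;

    while (*p != '\0') {
        unsigned long long in, out, len;
        int used = 0;
        const char *end = strchr(p, '\n');

        if (end == NULL)                /* Last line, no newline */
            end = p + strlen(p);
        if (nlines == MAX_MAP_LINES)    /* Too many lines */
            return false;

        /* Exactly three numbers, and nothing else, on this line */
        if (sscanf(p, "%llu %llu %llu%n", &in, &out, &len, &used) != 3)
            return false;
        if (p + used != end)
            return false;
        if (in > UINT32_MAX || out > UINT32_MAX || len > UINT32_MAX)
            return false;
        if (len == 0)                   /* Last field must be > 0 */
            return false;

        /* Neither range may overlap a range on an earlier line */
        for (size_t j = 0; j < nlines; j++) {
            if (ranges_overlap(in, len, seen[j].inside, seen[j].length) ||
                ranges_overlap(out, len, seen[j].outside, seen[j].length))
                return false;
        }

        seen[nlines].inside = in;
        seen[nlines].outside = out;
        seen[nlines].length = len;
        nlines++;

        p = (*end == '\n') ? end + 1 : end;
    }

    return nlines > 0;                  /* At least one line is required */
}
```

Passing such a check does not guarantee that a write will succeed,
since the permission requirements described next also apply.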
592 In order for a process to write to the
593 .I /proc/[pid]/uid_map
594 .RI ( /proc/[pid]/gid_map )
595 file, all of the following requirements must be met:
597 The writing process must have the
600 capability in the user namespace of the process
603 The writing process must either be in the user namespace of the process
605 or be in the parent user namespace of the process
608 The mapped user IDs (group IDs) must in turn have a mapping
609 in the parent user namespace.
611 One of the following two cases applies:
615 the writing process has the
623 No further restrictions apply:
624 the process can make mappings to arbitrary user IDs (group IDs)
625 in the parent user namespace.
629 otherwise all of the following restrictions apply:
635 must consist of a single line that maps
636 the writing process's effective user ID
637 (group ID) in the parent user namespace to a user ID (group ID)
638 in the user namespace.
640 The writing process must have the same effective user ID as the process
641 that created the user namespace.
647 system call must first be denied by writing
650 .I /proc/[pid]/setgroups
651 file (see below) before writing to
656 Writes that violate the above rules fail with the error
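.PP
For the unprivileged case described above, the data to be written is
a single line that maps one ID.
The following sketch builds such a line;
format_single_id_map is a hypothetical helper introduced for illustration,
and the caller is assumed to supply its own effective ID in the
parent user namespace as the second argument.

```c
#include <stddef.h>
#include <stdio.h>

/* Build the one-line map "<inside_id> <outside_id> 1" in 'buf'.
   Return the string length, or -1 if 'buf' is too small. */
static int
format_single_id_map(char *buf, size_t size, long inside_id,
                     long outside_id)
{
    int n = snprintf(buf, size, "%ld %ld 1", inside_id, outside_id);

    if (n < 0 || (size_t) n >= size)
        return -1;
    return n;
}
```

A caller would typically pass the value returned by geteuid() or getegid()
as outside_id.
For a group ID map, the string "deny" must be written to the
.I /proc/[pid]/setgroups
file before the resulting line is written to
.IR /proc/[pid]/gid_map .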
659 .\" ============================================================
661 .SS Interaction with system calls that change process UIDs or GIDs
662 In a user namespace where the
664 file has not been written, the system calls that change user IDs will fail.
667 file has not been written, the system calls that change group IDs will fail.
672 files have been written, only the mapped values may be used in
673 system calls that change user and group IDs.
675 For user IDs, the relevant system calls include
681 For group IDs, the relevant system calls include
692 .I /proc/[pid]/setgroups
693 file before writing to
694 .I /proc/[pid]/gid_map
695 .\" Things changed in Linux 3.19
696 .\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8
697 .\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272
698 .\" http://lwn.net/Articles/626665/
699 will permanently disable
701 in a user namespace and allow writing to
702 .I /proc/[pid]/gid_map
705 capability in the parent user namespace.
707 .\" ============================================================
709 .SS The /proc/[pid]/setgroups file
711 .\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8
712 .\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272
713 .\" http://lwn.net/Articles/626665/
714 .\" http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-8989
717 .I /proc/[pid]/setgroups
718 file displays the string
720 if processes in the user namespace that contains the process
722 are permitted to employ the
724 system call; it displays
728 is not permitted in that user namespace.
729 Note that regardless of the value in the
730 .I /proc/[pid]/setgroups
731 file (and regardless of the process's capabilities), calls to
733 are also not permitted if
734 .IR /proc/[pid]/gid_map
735 has not yet been set.
737 A privileged process (one with the
739 capability in the namespace) may write either of the strings
745 writing a group ID mapping
746 for this user namespace to the file
747 .IR /proc/[pid]/gid_map .
750 prevents any process in the user namespace from employing
753 The essence of the restrictions described in the preceding
754 paragraph is that it is permitted to write to
755 .I /proc/[pid]/setgroups
756 only so long as calling
758 is disallowed because
.I /proc/[pid]/gid_map
761 This ensures that a process cannot transition from a state where
763 is allowed to a state where
766 a process can only transition from
772 The default value of this file in the initial user namespace is
776 .IR /proc/[pid]/gid_map
778 (which has the effect of enabling
780 in the user namespace),
781 it is no longer possible to disallow
786 .IR /proc/[pid]/setgroups
787 (the write fails with the error
790 A child user namespace inherits the
791 .IR /proc/[pid]/setgroups
792 setting from its parent.
800 system call can't subsequently be reenabled (by writing
802 to the file) in this user namespace.
803 (Attempts to do so will fail with the error
805 This restriction also propagates down to all child user namespaces of
809 .I /proc/[pid]/setgroups
810 file was added in Linux 3.19,
811 but was backported to many earlier stable kernel series,
812 because it addresses a security issue.
813 The issue concerned files with permissions such as "rwx\-\-\-rwx".
814 Such files give fewer permissions to "group" than they do to "other".
815 This means that dropping groups using
817 might allow a process file access that it did not formerly have.
818 Before the existence of user namespaces this was not a concern,
819 since only a privileged process (one with the
821 capability) could call
823 However, with the introduction of user namespaces,
824 it became possible for an unprivileged process to create
825 a new namespace in which the user had all privileges.
826 This then allowed formerly unprivileged
827 users to drop groups and thus gain file access
828 that they did not previously have.
830 .I /proc/[pid]/setgroups
831 file was added to address this security issue,
832 by denying any pathway for an unprivileged process to drop groups with
835 .\" /proc/PID/setgroups
836 .\" [allow == setgroups() is allowed, "deny" == setgroups() is disallowed]
837 .\" * Can write if have CAP_SYS_ADMIN in NS
838 .\" * Must write BEFORE writing to /proc/PID/gid_map
841 .\" * Must already have written to gid_maps
842 .\" * /proc/PID/setgroups must be "allow"
844 .\" /proc/PID/gid_map -- writing
845 .\" * Must already have written "deny" to /proc/PID/setgroups
847 .\" ============================================================
849 .SS Unmapped user and group IDs
851 There are various places where an unmapped user ID (group ID)
852 may be exposed to user space.
853 For example, the first process in a new user namespace may call
855 before a user ID mapping has been defined for the namespace.
856 In most such cases, an unmapped user ID is converted
857 .\" from_kuid_munged(), from_kgid_munged()
858 to the overflow user ID (group ID);
859 the default value for the overflow user ID (group ID) is 65534.
860 See the descriptions of
861 .IR /proc/sys/kernel/overflowuid
863 .IR /proc/sys/kernel/overflowgid
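.PP
The conversion of unmapped IDs can be sketched as a lookup that falls
back to the overflow value.
The names map_entry and id_as_seen_inside are hypothetical,
and the overflow ID is passed in as a parameter because its value is
configurable; 65534 is only the default mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_OVERFLOW_ID 65534u  /* Default stated in the text */

/* One line of an ID map (hypothetical helper type) */
struct map_entry {
    uint32_t inside;    /* Start of range inside the namespace */
    uint32_t outside;   /* Start of range outside the namespace */
    uint32_t length;    /* Number of IDs in the range */
};

/* Return the ID that a process inside the namespace would see for
   'outside_id', or 'overflow_id' if no map entry covers it */
static uint32_t
id_as_seen_inside(const struct map_entry *map, size_t nentries,
                  uint32_t outside_id, uint32_t overflow_id)
{
    for (size_t j = 0; j < nentries; j++) {
        if (outside_id >= map[j].outside &&
            outside_id - map[j].outside < map[j].length)
            return map[j].inside + (outside_id - map[j].outside);
    }

    return overflow_id;     /* Unmapped: report the overflow ID */
}
```

With an empty map, which is the state of a newly created user namespace,
every ID is reported as the overflow ID.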
867 The cases where unmapped IDs are mapped in this fashion include
868 system calls that return user IDs
872 credentials passed over a UNIX domain socket,
874 credentials returned by
877 and the System V IPC "ctl"
880 credentials exposed by
883 .IR /proc/sysvipc/* ,
884 credentials returned via the
888 received with a signal (see
890 credentials written to the process accounting file (see
892 and credentials returned with POSIX message queue notifications (see
There is one notable case where unmapped user and group IDs are
.I not
897 .\" from_kuid(), from_kgid()
898 .\" Also F_GETOWNER_UIDS is an exception
899 converted to the corresponding overflow ID value.
904 file in which there is no mapping for the second field,
905 that field is displayed as 4294967295 (\-1 as an unsigned integer);
907 .\" ============================================================
909 .SS Set-user-ID and set-group-ID programs
911 When a process inside a user namespace executes
912 a set-user-ID (set-group-ID) program,
913 the process's effective user (group) ID inside the namespace is changed
914 to whatever value is mapped for the user (group) ID of the file.
915 However, if either the user
917 the group ID of the file has no mapping inside the namespace,
918 the set-user-ID (set-group-ID) bit is silently ignored:
919 the new program is executed,
920 but the process's effective user (group) ID is left unchanged.
921 (This mirrors the semantics of executing a set-user-ID or set-group-ID
922 program that resides on a filesystem that was mounted with the
924 flag, as described in
927 .\" ============================================================
931 When a process's user and group IDs are passed over a UNIX domain socket
932 to a process in a different user namespace (see the description of
936 they are translated into the corresponding values as per the
937 receiving process's user and group ID mappings.
940 Namespaces are a Linux-specific feature.
Over the years, many features have been added to the Linux kernel
that are available only to privileged users
because of their potential to confuse set-user-ID-root applications.
946 In general, it becomes safe to allow the root user in a user namespace to
947 use those features because it is impossible, while in a user namespace,
948 to gain more privilege than the root user of a user namespace has.
950 .\" ============================================================
953 Use of user namespaces requires a kernel that is configured with the
956 User namespaces require support in a range of subsystems across
958 When an unsupported subsystem is configured into the kernel,
it is not possible to configure user namespace support.
961 As at Linux 3.8, most relevant subsystems supported user namespaces,
962 but a number of filesystems did not have the infrastructure needed
963 to map user and group IDs between user namespaces.
964 Linux 3.9 added the required infrastructure support for many of
965 the remaining unsupported filesystems
966 (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA, NFS, and OCFS2).
Linux 3.12 added support for the last of the unsupported major filesystems,
968 .\" commit d6970d4b726cea6d7a9bc4120814f95c09571fc3
972 The program below is designed to allow experimenting with
973 user namespaces, as well as other types of namespaces.
974 It creates namespaces as specified by command-line options and then executes
975 a command inside those namespaces.
978 function inside the program provide a full explanation of the program.
979 The following shell session demonstrates its use.
981 First, we look at the run-time environment:
985 $ \fBuname -rs\fP # Need Linux 3.8 or later
987 $ \fBid -u\fP # Running as unprivileged user
994 Now start a new shell in new user
1000 namespaces, with user ID
1004 1000 mapped to 0 inside the user namespace:
1008 $ \fB./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash\fP
1012 The shell has PID 1, because it is the first process in the new
1022 Inside the user namespace, the shell has user and group ID 0,
1023 and a full set of permitted and effective capabilities:
1027 bash$ \fBcat /proc/$$/status | egrep '^[UG]id'\fP
1030 bash$ \fBcat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'\fP
1031 CapInh: 0000000000000000
1032 CapPrm: 0000001fffffffff
1033 CapEff: 0000001fffffffff
1039 filesystem and listing all of the processes visible
1040 in the new PID namespace shows that the shell can't see
1041 any processes outside the PID namespace:
1045 bash$ \fBmount -t proc proc /proc\fP
1047 PID TTY STAT TIME COMMAND
1049 22 pts/3 R+ 0:00 ps ax
1055 /* userns_child_exec.c
1057 Licensed under GNU General Public License v2 or later
1059 Create a child process that executes a shell command in new
1060 namespace(s); allow UID and GID mappings to be specified when
1061 creating a user namespace.
1067 #include <sys/wait.h>
1075 /* A simple error\-handling function: print an error message based
1076 on the value in \(aqerrno\(aq and terminate the calling process */
1078 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \\
1082 char **argv; /* Command to be executed by child, with args */
1083 int pipe_fd[2]; /* Pipe used to synchronize parent and child */
1091 fprintf(stderr, "Usage: %s [options] cmd [arg...]\\n\\n", pname);
1092 fprintf(stderr, "Create a child process that executes a shell "
1093 "command in a new user namespace,\\n"
1094 "and possibly also other new namespace(s).\\n\\n");
1095 fprintf(stderr, "Options can be:\\n\\n");
1096 #define fpe(str) fprintf(stderr, " %s", str);
1097 fpe("\-i New IPC namespace\\n");
1098 fpe("\-m New mount namespace\\n");
1099 fpe("\-n New network namespace\\n");
1100 fpe("\-p New PID namespace\\n");
1101 fpe("\-u New UTS namespace\\n");
1102 fpe("\-U New user namespace\\n");
1103 fpe("\-M uid_map Specify UID map for user namespace\\n");
1104 fpe("\-G gid_map Specify GID map for user namespace\\n");
1105 fpe("\-z Map user\(aqs UID and GID to 0 in user namespace\\n");
1106 fpe(" (equivalent to: \-M \(aq0 <uid> 1\(aq \-G \(aq0 <gid> 1\(aq)\\n");
1107 fpe("\-v Display verbose messages\\n");
1109 fpe("If \-z, \-M, or \-G is specified, \-U is required.\\n");
1110 fpe("It is not permitted to specify both \-z and either \-M or \-G.\\n");
1112 fpe("Map strings for \-M and \-G consist of records of the form:\\n");
1114 fpe(" ID\-inside\-ns ID\-outside\-ns len\\n");
1116 fpe("A map string can contain multiple records, separated"
1118 fpe("the commas are replaced by newlines before writing"
1119 " to map files.\\n");
1124 /* Update the mapping file \(aqmap_file\(aq, with the value provided in
1125 \(aqmapping\(aq, a string that defines a UID or GID mapping. A UID or
1126 GID mapping consists of one or more newline\-delimited records
       ID\-inside\-ns    ID\-outside\-ns    length
1131 Requiring the user to supply a string that contains newlines is
1132 of course inconvenient for command\-line use. Thus, we permit the
1133 use of commas to delimit records in this string, and replace them
1134 with newlines before writing the string to the file. */
1137 update_map(char *mapping, char *map_file)
1140 size_t map_len; /* Length of \(aqmapping\(aq */
1142 /* Replace commas in mapping string with newlines */
1144 map_len = strlen(mapping);
1145 for (j = 0; j < map_len; j++)
1146 if (mapping[j] == \(aq,\(aq)
1147 mapping[j] = \(aq\\n\(aq;
1149 fd = open(map_file, O_RDWR);
1151 fprintf(stderr, "ERROR: open %s: %s\\n", map_file,
1156 if (write(fd, mapping, map_len) != map_len) {
1157 fprintf(stderr, "ERROR: write %s: %s\\n", map_file,
1165 /* Linux 3.19 made a change in the handling of setgroups(2) and the
1166 \(aqgid_map\(aq file to address a security issue. The issue allowed
   *unprivileged* users to employ user namespaces in order to drop
   groups.
   The upshot of the 3.19 changes is that in order to update the
   \(aqgid_map\(aq file, use of the setgroups() system call in this
1170 user namespace must first be disabled by writing "deny" to one of
1171 the /proc/PID/setgroups files for this namespace. That is the
1172 purpose of the following function. */
static void
proc_setgroups_write(pid_t child_pid, char *str)
{
    char setgroups_path[PATH_MAX];
    int fd;

    snprintf(setgroups_path, PATH_MAX, "/proc/%ld/setgroups",
            (long) child_pid);

    fd = open(setgroups_path, O_RDWR);
    if (fd == \-1) {
        /* We may be on a system that doesn\(aqt support
           /proc/PID/setgroups. In that case, the file won\(aqt exist,
           and the system won\(aqt impose the restrictions that Linux 3.19
           added. That\(aqs fine: we don\(aqt need to do anything in order
           to permit \(aqgid_map\(aq to be updated.

           However, if the error from open() was something other than
           the ENOENT error that is expected for that case, let the
           user know. */

        if (errno != ENOENT)
            fprintf(stderr, "ERROR: open %s: %s\\n", setgroups_path,
                strerror(errno));
        return;
    }

    if (write(fd, str, strlen(str)) == \-1)
        fprintf(stderr, "ERROR: write %s: %s\\n", setgroups_path,
            strerror(errno));

    close(fd);
}
static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args\->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args\->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\\n");
        exit(EXIT_FAILURE);
    }

    /* Execute a shell command */

    printf("About to exec %s\\n", args\->argv[0]);
    execvp(args\->argv[0], args\->argv);
    errExit("execvp");
}
#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child\(aqs stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    const int MAP_BUF_SIZE = 100;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command\-line options. The initial \(aq+\(aq character in
       the final getopt() argument prevents GNU\-style permutation
       of command\-line options. That\(aqs useful, since sometimes
       the \(aqcommand\(aq to be executed by this program itself
       has command\-line options. We don\(aqt want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != \-1) {
        switch (opt) {
        case \(aqi\(aq: flags |= CLONE_NEWIPC;        break;
        case \(aqm\(aq: flags |= CLONE_NEWNS;         break;
        case \(aqn\(aq: flags |= CLONE_NEWNET;        break;
        case \(aqp\(aq: flags |= CLONE_NEWPID;        break;
        case \(aqu\(aq: flags |= CLONE_NEWUTS;        break;
        case \(aqv\(aq: verbose = 1;                  break;
        case \(aqz\(aq: map_zero = 1;                 break;
        case \(aqM\(aq: uid_map = optarg;             break;
        case \(aqG\(aq: gid_map = optarg;             break;
        case \(aqU\(aq: flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* \-M, \-G, or \-z without \-U is nonsensical, as is \-z combined
       with \-M or \-G */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child\(aqs effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process\(aqs capabilities during execve()). */

    if (pipe(args.pipe_fd) == \-1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == \-1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }

    if (gid_map != NULL || map_zero) {
        proc_setgroups_write(child_pid, "deny");

        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == \-1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\\n", argv[0]);

    exit(EXIT_SUCCESS);
}
.fi
.SH SEE ALSO
.BR newgidmap (1),      \" From the shadow package
.BR newuidmap (1),      \" From the shadow package
.BR subgid (5),         \" From the shadow package
.BR subuid (5),         \" From the shadow package
.BR capabilities (7),
.BR cgroup_namespaces (7),
.BR credentials (7),
.BR pid_namespaces (7)
.PP
The kernel source file
.IR Documentation/namespaces/resource\-control.txt .