.\" Copyright (c) 2013, 2014 by Michael Kerrisk <mtk.manpages@gmail.com>
.\" and Copyright (c) 2012, 2014 by Eric W. Biederman <ebiederm@xmission.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.\"
.TH USER_NAMESPACES 7 2015-03-29 "Linux" "Linux Programmer's Manual"
.SH NAME
user_namespaces \- overview of Linux user namespaces
.SH DESCRIPTION
For an overview of namespaces, see
.BR namespaces (7).

User namespaces isolate security-related identifiers and attributes,
in particular,
user IDs and group IDs (see
.BR credentials (7)),
the root directory,
keys (see
.BR keyctl (2)),
.\" FIXME: This page says very little about the interaction
.\" of user namespaces and keys. Add something on this topic.
and capabilities (see
.BR capabilities (7)).
A process's user and group IDs can be different
inside and outside a user namespace.
In particular,
a process can have a normal unprivileged user ID outside a user namespace
while at the same time having a user ID of 0 inside the namespace;
in other words,
the process has full privileges for operations inside the user namespace,
but is unprivileged for operations outside the namespace.
.\"
.\" ============================================================
.\"
.SS Nested namespaces, namespace membership
User namespaces can be nested;
that is, each user namespace\(emexcept the initial ("root")
namespace\(emhas a parent user namespace,
and can have zero or more child user namespaces.
The parent user namespace is the user namespace
of the process that creates the user namespace via a call to
.BR unshare (2)
or
.BR clone (2)
with the
.BR CLONE_NEWUSER
flag.

The kernel imposes (since version 3.11) a limit of 32 nested levels of
.\" commit 8742f229b635bf1c1c84a3dfe5e47c814c20b5c8
user namespaces.
.\" FIXME Explain the rationale for this limit. (What is the rationale?)
Calls to
.BR unshare (2)
or
.BR clone (2)
that would cause this limit to be exceeded fail with the error
.BR EUSERS .

Each process is a member of exactly one user namespace.
A process created via
.BR fork (2)
or
.BR clone (2)
without the
.BR CLONE_NEWUSER
flag is a member of the same user namespace as its parent.
A single-threaded process can join another user namespace with
.BR setns (2)
if it has the
.BR CAP_SYS_ADMIN
capability in that namespace;
upon doing so, it gains a full set of capabilities in that namespace.

A call to
.BR clone (2)
or
.BR unshare (2)
with the
.BR CLONE_NEWUSER
flag makes the new child process (for
.BR clone (2))
or the caller (for
.BR unshare (2))
a member of the new user namespace created by the call.
.\"
.\" ============================================================
.\"
.SS Capabilities
The child process created by
.BR clone (2)
with the
.BR CLONE_NEWUSER
flag starts out with a complete set
of capabilities in the new user namespace.
Likewise, a process that creates a new user namespace using
.BR unshare (2)
or joins an existing user namespace using
.BR setns (2)
gains a full set of capabilities in that namespace.
On the other hand,
that process has no capabilities in the parent (in the case of
.BR clone (2))
or previous (in the case of
.BR unshare (2)
and
.BR setns (2))
user namespace,
even if the new namespace is created or joined by the root user
(i.e., a process with user ID 0 in the root namespace).

Note that a call to
.BR execve (2)
will cause a process's capabilities to be recalculated in the usual way (see
.BR capabilities (7)).
Consequently,
unless the process has a user ID of 0 within the namespace,
or the executable file has a nonempty inheritable capabilities mask,
the process will lose all capabilities.
See the discussion of user and group ID mappings, below.

A call to
.BR clone (2),
.BR unshare (2),
or
.BR setns (2)
using the
.BR CLONE_NEWUSER
flag sets the "securebits" flags
(see
.BR capabilities (7))
to their default values (all flags disabled) in the child (for
.BR clone (2))
or caller (for
.BR unshare (2),
or
.BR setns (2)).
Note that because the caller no longer has capabilities
in its original user namespace after a call to
.BR setns (2),
it is not possible for a process to reset its "securebits" flags while
retaining its user namespace membership by using a pair of
.BR setns (2)
calls to move to another user namespace and then return to
its original user namespace.

The rules for determining whether or not a process has a capability
in a particular user namespace are as follows:
.IP 1. 3
A process has a capability inside a user namespace
if it is a member of that namespace and
it has the capability in its effective capability set.
A process can gain capabilities in its effective capability
set in various ways.
For example, it may execute a set-user-ID program or an
executable with associated file capabilities.
In addition,
a process may gain capabilities via the effect of
.BR clone (2),
.BR unshare (2),
or
.BR setns (2),
as already described.
.\" In the 3.8 sources, see security/commoncap.c::cap_capable():
.IP 2.
If a process has a capability in a user namespace,
then it has that capability in all child (and further removed descendant)
namespaces as well.
.IP 3.
.\" * The owner of the user namespace in the parent of the
.\" * user namespace has all caps.
When a user namespace is created, the kernel records the effective
user ID of the creating process as being the "owner" of the namespace.
.\" (and likewise associates the effective group ID of the creating process
.\" with the namespace).
A process that resides
in the parent of the user namespace
.\" See kernel commit 520d9eabce18edfef76a60b7b839d54facafe1f9 for a fix
.\" on this point
and whose effective user ID matches the owner of the namespace
has all capabilities in the namespace.
.\" This includes the case where the process executes a set-user-ID
.\" program that confers the effective UID of the creator of the namespace.
By virtue of the previous rule,
this means that the process has all capabilities in all
further removed descendant user namespaces as well.
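.PP
The three rules above can be sketched as a small model.
The Python below is illustrative only:
the UserNS and Process classes and the has_capability() function
are hypothetical names rather than kernel interfaces,
and effective user IDs are compared directly,
without any ID mapping.

.in +4n
.nf
```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(eq=False)
class UserNS:
    """Hypothetical model of a user namespace (not a kernel structure)."""
    owner_euid: int                      # effective UID of the creator
    parent: Optional["UserNS"] = None    # None for the initial namespace


@dataclass
class Process:
    """Hypothetical model of the credentials relevant to the rules."""
    userns: UserNS
    euid: int
    effective_caps: Set[str] = field(default_factory=set)


def has_capability(proc: Process, target: UserNS, cap: str) -> bool:
    """Apply rules 1 to 3 by walking from target toward the initial namespace."""
    ns = target
    while ns is not None:
        # Rule 1, and rule 2 when ns is an ancestor of target:
        # the process is a member of ns, so its effective set decides.
        if ns is proc.userns:
            return cap in proc.effective_caps
        # Rule 3, extended to descendants by rule 2: the process resides
        # in the parent of ns and its effective UID matches the owner.
        if ns.parent is proc.userns and ns.owner_euid == proc.euid:
            return True
        ns = ns.parent
    return False
```
.fi
.in

In this model, a process with effective user ID 1000 in the initial
namespace that created a child namespace has every capability in that
child and in its descendants, but gains nothing in the initial namespace.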
.\"
.\" ============================================================
.\"
.SS Effect of capabilities within a user namespace
Having a capability inside a user namespace
permits a process to perform operations (that require privilege)
only on resources governed by that namespace.
In other words, having a capability in a user namespace permits a process
to perform privileged operations on resources that are governed by (nonuser)
namespaces associated with the user namespace (see the next subsection).

On the other hand, there are many privileged operations that affect
resources that are not associated with any namespace type,
for example, changing the system time (governed by
.BR CAP_SYS_TIME ),
loading a kernel module (governed by
.BR CAP_SYS_MODULE ),
and creating a device (governed by
.BR CAP_MKNOD ).
Only a process with privileges in the
.I initial
user namespace can perform such operations.

Holding
.B CAP_SYS_ADMIN
within the user namespace associated with a process's mount namespace
allows that process to create bind mounts
and mount the following types of filesystems:
.\" fs_flags = FS_USERNS_MOUNT in kernel sources

.RS 4
.PD 0
.IP * 2
.IR /proc
(since Linux 3.8)
.IP *
.IR /sys
(since Linux 3.8)
.IP *
.IR devpts
(since Linux 3.9)
.IP *
.IR tmpfs
(since Linux 3.9)
.IP *
.IR ramfs
(since Linux 3.9)
.IP *
.IR mqueue
(since Linux 3.9)
.IP *
.IR bpf
.\" commit b2197755b2633e164a439682fb05a9b5ea48f706
(since Linux 4.4)
.PD
.RE
.PP
Holding
.B CAP_SYS_ADMIN
within the user namespace associated with a process's cgroup namespace
allows (since Linux 4.6)
that process to mount cgroup filesystems.

Holding
.B CAP_SYS_ADMIN
within the user namespace associated with a process's PID namespace
allows (since Linux 3.8)
that process to mount
.I /proc
filesystems.

Note, however, that mounting block-based filesystems can be done
only by a process that holds
.BR CAP_SYS_ADMIN
in the initial user namespace.
.\"
.\" ============================================================
.\"
.SS Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user namespaces,
and the other types of namespaces can be created with just the
.B CAP_SYS_ADMIN
capability in the caller's user namespace.

When a non-user-namespace is created,
it is owned by the user namespace in which the creating process
was a member at the time of the creation of the namespace.
Actions on the non-user-namespace
require capabilities in the corresponding user namespace.

If
.BR CLONE_NEWUSER
is specified along with other
.B CLONE_NEW*
flags in a single
.BR clone (2)
or
.BR unshare (2)
call, the user namespace is guaranteed to be created first,
giving the child
.RB ( clone (2))
or caller
.RB ( unshare (2))
privileges over the remaining namespaces created by the call.
Thus, it is possible for an unprivileged caller to specify this combination
of flags.

When a new namespace (other than a user namespace) is created via
.BR clone (2)
or
.BR unshare (2),
the kernel records the user namespace of the creating process against
the new namespace.
(This association can't be changed.)
When a process in the new namespace subsequently performs
privileged operations that operate on global
resources isolated by the namespace,
the permission checks are performed according to the process's capabilities
in the user namespace that the kernel associated with the new namespace.
For example, suppose that a process attempts to change the hostname
.RB ( sethostname (2)),
a resource governed by the UTS namespace.
In this case,
the kernel will determine which user namespace is associated with
the process's UTS namespace, and check whether the process has the
required capability
.RB ( CAP_SYS_ADMIN )
in that user namespace.
.\"
.\" ============================================================
.\"
.SS Restrictions on mount namespaces

Note the following points with respect to mount namespaces:
.IP * 3
A mount namespace has an owner user namespace.
A mount namespace whose owner user namespace is different from
the owner user namespace of its parent mount namespace is
considered a less privileged mount namespace.
.IP *
When creating a less privileged mount namespace,
shared mounts are reduced to slave mounts.
This ensures that mappings performed in less
privileged mount namespaces will not propagate to more privileged
mount namespaces.
.IP *
.\" FIXME .
.\" What does "come as a single unit from more privileged mount" mean?
Mounts that come as a single unit from a more privileged mount
namespace are locked together and may not be separated in a less
privileged mount namespace.
(The
.BR unshare (2)
.B CLONE_NEWNS
operation brings across all of the mounts from the original
mount namespace as a single unit,
and recursive mounts that propagate between
mount namespaces propagate as a single unit.)
.IP *
The
.BR mount (2)
flags
.BR MS_RDONLY ,
.BR MS_NOSUID ,
.BR MS_NOEXEC ,
and the "atime" flags
.RB ( MS_NOATIME ,
.BR MS_NODIRATIME ,
.BR MS_RELATIME )
settings become locked
.\" commit 9566d6742852c527bf5af38af5cbb878dad75705
.\" Author: Eric W. Biederman <ebiederm@xmission.com>
.\" Date: Mon Jul 28 17:26:07 2014 -0700
.\"
.\" mnt: Correct permission checks in do_remount
.\"
when propagated from a more privileged to
a less privileged mount namespace,
and may not be changed in the less privileged mount namespace.
.IP *
.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
A file or directory that is a mount point in one namespace but is not
a mount point in another namespace may be renamed, unlinked, or removed
.RB ( rmdir (2))
in the mount namespace in which it is not a mount point
(subject to the usual permission checks).
.IP
Previously, attempting to unlink, rename, or remove a file or directory
that was a mount point in another mount namespace would result in the error
.BR EBUSY .
That behavior had technical problems of enforcement (e.g., for NFS)
and permitted denial-of-service attacks against more privileged users
(i.e., preventing individual files from being updated
by bind mounting on top of them).
.\"
.\" ============================================================
.\"
.SS User and group ID mappings: uid_map and gid_map
When a user namespace is created,
it starts out without a mapping of user IDs (group IDs)
to the parent user namespace.
The
.IR /proc/[pid]/uid_map
and
.IR /proc/[pid]/gid_map
files (available since Linux 3.5)
.\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
expose the mappings for user and group IDs
inside the user namespace for the process
.IR pid .
These files can be read to view the mappings in a user namespace and
written to (once) to define the mappings.

The description in the following paragraphs explains the details for
.IR uid_map ;
.IR gid_map
is exactly the same,
but each instance of "user ID" is replaced by "group ID".

The
.I uid_map
file exposes the mapping of user IDs from the user namespace
of the process
.IR pid
to the user namespace of the process that opened
.IR uid_map
(but see a qualification to this point below).
In other words, processes that are in different user namespaces
will potentially see different values when reading from a particular
.I uid_map
file, depending on the user ID mappings for the user namespaces
of the reading processes.

Each line in the
.I uid_map
file specifies a 1-to-1 mapping of a range of contiguous
user IDs between two user namespaces.
(When a user namespace is first created, this file is empty.)
The specification in each line takes the form of
three numbers delimited by white space.
The first two numbers specify the starting user ID in
each of the two user namespaces.
The third number specifies the length of the mapped range.
In detail, the fields are interpreted as follows:
.IP (1) 4
The start of the range of user IDs in
the user namespace of the process
.IR pid .
.IP (2)
The start of the range of user
IDs to which the user IDs specified by field one map.
How field two is interpreted depends on whether the process that opened
.I uid_map
and the process
.IR pid
are in the same user namespace, as follows:
.RS
.IP a) 3
If the two processes are in different user namespaces:
field two is the start of a range of
user IDs in the user namespace of the process that opened
.IR uid_map .
.IP b)
If the two processes are in the same user namespace:
field two is the start of the range of
user IDs in the parent user namespace of the process
.IR pid .
This case enables the opener of
.I uid_map
(the common case here is opening
.IR /proc/self/uid_map )
to see the mapping of user IDs into the user namespace of the process
that created this user namespace.
.RE
.IP (3)
The length of the range of user IDs that is mapped between the two
user namespaces.
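.PP
The three-field format can be illustrated with a short parser.
This is a sketch under the assumption that the file contents have
already been read into a string;
parse_id_map(), map_inside_to_outside(), and map_outside_to_inside()
are hypothetical helper names, not part of any library.

.in +4n
.nf
```python
from typing import List, Optional, Tuple

Extent = Tuple[int, int, int]  # (field one, field two, field three)


def parse_id_map(text: str) -> List[Extent]:
    """Parse the contents of a uid_map or gid_map file into extents."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        extents.append((inside, outside, length))
    return extents


def map_inside_to_outside(extents: List[Extent], uid: int) -> Optional[int]:
    """Translate an ID in field-one terms to field-two terms."""
    for inside, outside, length in extents:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None  # no mapping covers this ID


def map_outside_to_inside(extents: List[Extent], uid: int) -> Optional[int]:
    """Translate an ID in field-two terms to field-one terms."""
    for inside, outside, length in extents:
        if outside <= uid < outside + length:
            return inside + (uid - outside)
    return None  # no mapping covers this ID
```
.fi
.in

For instance, the text of
.I /proc/self/uid_map
could be passed to parse_id_map(),
and the resulting list used to translate individual user IDs
in either direction.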
.PP
System calls that return user IDs (group IDs)\(emfor example,
.BR getuid (2),
.BR getgid (2),
and the credential fields in the structure returned by
.BR stat (2)\(emreturn
the user ID (group ID) mapped into the caller's user namespace.

When a process accesses a file, its user and group IDs
are mapped into the initial user namespace for the purpose of permission
checking and assigning IDs when creating a file.
When a process retrieves file user and group IDs via
.BR stat (2),
the IDs are mapped in the opposite direction,
to produce values relative to the process user and group ID mappings.

The initial user namespace has no parent namespace,
but, for consistency, the kernel provides dummy user and group
ID mapping files for this namespace.
Looking at the
.I uid_map
file
.RI ( gid_map
is the same) from a shell in the initial namespace shows:

.in +4n
.nf
$ \fBcat /proc/$$/uid_map\fP
0 0 4294967295
.fi
.in

This mapping tells us
that the range starting at user ID 0 in this namespace
maps to a range starting at 0 in the (nonexistent) parent namespace,
and the length of the range is the largest 32-bit unsigned integer.
This leaves 4294967295 (the 32-bit signed \-1 value) unmapped.
This is deliberate:
.IR "(uid_t)\ \-1"
is used in several interfaces (e.g.,
.BR setreuid (2))
as a way to specify "no user ID".
Leaving
.IR "(uid_t)\ \-1"
unmapped and unusable guarantees that there will be no
confusion when using these interfaces.
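.PP
The arithmetic behind this can be checked directly.
The following sketch assumes a 32-bit unsigned
.IR uid_t ;
the variable names are illustrative only.

.in +4n
.nf
```python
UID_MODULUS = 2 ** 32               # assumes uid_t is 32 bits wide

start, length = 0, 4294967295       # fields one and three of the line shown
last_mapped = start + length - 1    # highest user ID covered by the range
no_user_id = (-1) % UID_MODULUS     # (uid_t) -1 viewed as an unsigned value

print(last_mapped)                  # 4294967294
print(no_user_id)                   # 4294967295
print(no_user_id > last_mapped)     # True: the value lies outside the range
```
.fi
.in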
.\"
.\" ============================================================
.\"
.SS Defining user and group ID mappings: writing to uid_map and gid_map
.PP
After the creation of a new user namespace, the
.I uid_map
file of
.I one
of the processes in the namespace may be written to
.I once
to define the mapping of user IDs in the new user namespace.
An attempt to write more than once to a
.I uid_map
file in a user namespace fails with the error
.BR EPERM .
Similar rules apply for
.I gid_map
files.

The lines written to
.IR uid_map
.RI ( gid_map )
must conform to the following rules:
.IP * 3
The three fields must be valid numbers,
and the last field must be greater than 0.
.IP *
Lines are terminated by newline characters.
.IP *
There is an (arbitrary) limit on the number of lines in the file.
As at Linux 3.18, the limit is five lines.
In addition, the number of bytes written to
the file must be less than the system page size,
.\" FIXME(Eric): the restriction "less than" rather than "less than or equal"
.\" seems strangely arbitrary. Furthermore, the comment does not agree
.\" with the code in kernel/user_namespace.c. Which is correct?
and the write must be performed at the start of the file (i.e.,
.BR lseek (2)
and
.BR pwrite (2)
can't be used to write to nonzero offsets in the file).
.IP *
The range of user IDs (group IDs)
specified in each line cannot overlap with the ranges
in any other lines.
In the initial implementation (Linux 3.8), this requirement was
satisfied by a simplistic implementation that imposed the further
requirement that
the values in both field 1 and field 2 of successive lines must be
in ascending numerical order,
which prevented some otherwise valid maps from being created.
Linux 3.9 and later
.\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba
fix this limitation, allowing any valid set of nonoverlapping maps.
.IP *
At least one line must be written to the file.
.PP
Writes that violate the above rules fail with the error
.BR EINVAL .
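.PP
A program could pre-check map text against most of these rules
before writing it.
The sketch below is a user-space approximation, not the kernel's parser:
validate_id_map() is a hypothetical name,
the five-line limit is the one stated above,
overlap is checked on both sides of the mapping as an assumption,
and the page-size and file-offset rules are not modeled.

.in +4n
.nf
```python
from typing import List, Tuple

MAX_MAP_LINES = 5  # limit stated above for Linux 3.18


def validate_id_map(text: str) -> List[Tuple[int, int, int]]:
    """Return parsed extents; raise ValueError where EINVAL is expected."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("at least one line must be written")
    if len(lines) > MAX_MAP_LINES:
        raise ValueError("too many lines")
    extents: List[Tuple[int, int, int]] = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            raise ValueError("each line needs three fields")
        if not all(f.isascii() and f.isdigit() for f in fields):
            raise ValueError("fields must be valid numbers")
        inside, outside, length = (int(f) for f in fields)
        if length == 0:
            raise ValueError("the last field must be greater than 0")
        for p_in, p_out, p_len in extents:
            inside_overlap = inside < p_in + p_len and p_in < inside + length
            outside_overlap = outside < p_out + p_len and p_out < outside + length
            if inside_overlap or outside_overlap:
                raise ValueError("ranges overlap")
        extents.append((inside, outside, length))
    return extents
```
.fi
.in

A ValueError here corresponds only loosely to
.BR EINVAL ;
the kernel remains the authority on what it accepts.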

In order for a process to write to the
.I /proc/[pid]/uid_map
.RI ( /proc/[pid]/gid_map )
file, all of the following requirements must be met:
.IP 1. 3
The writing process must have the
.BR CAP_SETUID
.RB ( CAP_SETGID )
capability in the user namespace of the process
.IR pid .
.IP 2.
The writing process must either be in the user namespace of the process
.I pid
or be in the parent user namespace of the process
.IR pid .
.IP 3.
The mapped user IDs (group IDs) must in turn have a mapping
in the parent user namespace.
.IP 4.
One of the following two cases applies:
.RS
.IP * 3
.IR Either
the writing process has the
.BR CAP_SETUID
.RB ( CAP_SETGID )
capability in the
.I parent
user namespace.
.RS
.IP + 3
No further restrictions apply:
the process can make mappings to arbitrary user IDs (group IDs)
in the parent user namespace.
.RE
.IP * 3
.IR Or
otherwise all of the following restrictions apply:
.RS
.IP + 3
The data written to
.I uid_map
.RI ( gid_map )
must consist of a single line that maps
the writing process's effective user ID
(group ID) in the parent user namespace to a user ID (group ID)
in the user namespace.
.IP +
The writing process must have the same effective user ID as the process
that created the user namespace.
.IP +
In the case of
.IR gid_map ,
use of the
.BR setgroups (2)
system call must first be denied by writing
.RI \(dq deny \(dq
to the
.I /proc/[pid]/setgroups
file (see below) before writing to
.IR gid_map .
.RE
.RE
.PP
Writes that violate the above rules fail with the error
.BR EPERM .
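.PP
For the second case (a writer without
.BR CAP_SETUID
or
.BR CAP_SETGID
in the parent user namespace),
the data and the order of the writes can be sketched as follows.
The function unprivileged_map_writes() is a hypothetical helper
that only builds the strings; it performs no writes,
and the choice of ID 0 inside the namespace is an assumption
made for illustration.

.in +4n
.nf
```python
from typing import List, Tuple

NEWLINE = chr(10)  # map lines are terminated by a newline character


def unprivileged_map_writes(pid: int, outer_uid: int, outer_gid: int,
                            inner_uid: int = 0, inner_gid: int = 0
                            ) -> List[Tuple[str, str]]:
    """Return (path, data) pairs in the order they would be written."""
    base = "/proc/{}".format(pid)
    return [
        # A single line mapping the writer's effective user ID.
        (base + "/uid_map",
         "{} {} 1".format(inner_uid, outer_uid) + NEWLINE),
        # setgroups(2) must be denied before gid_map is written.
        (base + "/setgroups", "deny"),
        # A single line mapping the writer's effective group ID.
        (base + "/gid_map",
         "{} {} 1".format(inner_gid, outer_gid) + NEWLINE),
    ]
```
.fi
.in

Each file would then be opened and written once, from the start,
by a process that satisfies the requirements listed above.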
.\"
.\" ============================================================
.\"
.SS Interaction with system calls that change process UIDs or GIDs
In a user namespace where the
.I uid_map
file has not been written, the system calls that change user IDs will fail.
Similarly, if the
.I gid_map
file has not been written, the system calls that change group IDs will fail.
After the
.I uid_map
and
.I gid_map
files have been written, only the mapped values may be used in
system calls that change user and group IDs.

For user IDs, the relevant system calls include
.BR setuid (2),
.BR setfsuid (2),
.BR setreuid (2),
and
.BR setresuid (2).
For group IDs, the relevant system calls include
.BR setgid (2),
.BR setfsgid (2),
.BR setregid (2),
.BR setresgid (2),
and
.BR setgroups (2).

Writing
.RI \(dq deny \(dq
to the
.I /proc/[pid]/setgroups
file before writing to
.I /proc/[pid]/gid_map
.\" Things changed in Linux 3.19
.\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8
.\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272
.\" http://lwn.net/Articles/626665/
will permanently disable
.BR setgroups (2)
in a user namespace and allow writing to
.I /proc/[pid]/gid_map
without having the
.BR CAP_SETGID
capability in the parent user namespace.
.\"
.\" ============================================================
.\"
.SS The /proc/[pid]/setgroups file
.\"
.\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8
.\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272
.\" http://lwn.net/Articles/626665/
.\" http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-8989
.\"
The
.I /proc/[pid]/setgroups
file displays the string
.RI \(dq allow \(dq
if processes in the user namespace that contains the process
.I pid
are permitted to employ the
.BR setgroups (2)
system call; it displays
.RI \(dq deny \(dq
if
.BR setgroups (2)
is not permitted in that user namespace.
Note that regardless of the value in the
.I /proc/[pid]/setgroups
file (and regardless of the process's capabilities), calls to
.BR setgroups (2)
are also not permitted if
.IR /proc/[pid]/gid_map
has not yet been set.

A privileged process (one with the
.BR CAP_SYS_ADMIN
capability in the namespace) may write either of the strings
.RI \(dq allow \(dq
or
.RI \(dq deny \(dq
to this file
.I before
writing a group ID mapping
for this user namespace to the file
.IR /proc/[pid]/gid_map .
Writing the string
.RI \(dq deny \(dq
prevents any process in the user namespace from employing
.BR setgroups (2).

The essence of the restrictions described in the preceding
paragraph is that it is permitted to write to
.I /proc/[pid]/setgroups
only so long as calling
.BR setgroups (2)
is disallowed because
.I /proc/[pid]/gid_map
has not been set.
This ensures that a process cannot transition from a state where
.BR setgroups (2)
is allowed to a state where
.BR setgroups (2)
is denied;
a process can only transition from
.BR setgroups (2)
being disallowed to
.BR setgroups (2)
being allowed.

The default value of this file in the initial user namespace is
.RI \(dq allow \(dq.

Once
.IR /proc/[pid]/gid_map
has been written to
(which has the effect of enabling
.BR setgroups (2)
in the user namespace),
it is no longer possible to disallow
.BR setgroups (2)
by writing
.RI \(dq deny \(dq
to
.IR /proc/[pid]/setgroups
(the write fails with the error
.BR EPERM ).

A child user namespace inherits the
.IR /proc/[pid]/setgroups
setting from its parent.

If the
.I setgroups
file has the value
.RI \(dq deny \(dq,
then the
.BR setgroups (2)
system call can't subsequently be reenabled (by writing
.RI \(dq allow \(dq
to the file) in this user namespace.
(Attempts to do so will fail with the error
.BR EPERM .)
This restriction also propagates down to all child user namespaces of
this user namespace.
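.PP
The permitted transitions can be summarized as a small state model.
The SetgroupsState class below is a hypothetical illustration of the
rules just described, not a kernel interface;
the
.BR CAP_SYS_ADMIN
requirement is not modeled,
and PermissionError stands in for the error
.BR EPERM .

.in +4n
.nf
```python
class SetgroupsState:
    """Hypothetical model of one user namespace's setgroups setting."""

    def __init__(self, inherited: str = "allow") -> None:
        self.value = inherited          # setting inherited from the parent
        self.gid_map_written = False

    def write_setgroups(self, new_value: str) -> None:
        if new_value not in ("allow", "deny"):
            raise ValueError(new_value)
        # Once denied, setgroups(2) cannot be reenabled.
        if self.value == "deny" and new_value == "allow":
            raise PermissionError("EPERM: setgroups cannot be reenabled")
        # Once gid_map has been written, "deny" is no longer accepted.
        if self.gid_map_written and new_value == "deny":
            raise PermissionError("EPERM: gid_map has already been written")
        self.value = new_value

    def write_gid_map(self) -> None:
        if self.gid_map_written:
            raise PermissionError("EPERM: gid_map may be written only once")
        self.gid_map_written = True

    def setgroups_permitted(self) -> bool:
        # Both conditions are needed: a group ID map and the "allow" value.
        return self.gid_map_written and self.value == "allow"
```
.fi
.in

A child namespace would be modeled by passing the parent's current
value as the inherited argument.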

The
.I /proc/[pid]/setgroups
file was added in Linux 3.19,
but was backported to many earlier stable kernel series,
because it addresses a security issue.
The issue concerned files with permissions such as "rwx\-\-\-rwx".
Such files give fewer permissions to "group" than they do to "other".
This means that dropping groups using
.BR setgroups (2)
might allow a process file access that it did not formerly have.
Before the existence of user namespaces this was not a concern,
since only a privileged process (one with the
.BR CAP_SETGID
capability) could call
.BR setgroups (2).
However, with the introduction of user namespaces,
it became possible for an unprivileged process to create
a new namespace in which the user had all privileges.
This then allowed formerly unprivileged
users to drop groups and thus gain file access
that they did not previously have.
The
.I /proc/[pid]/setgroups
file was added to address this security issue,
by denying any pathway for an unprivileged process to drop groups with
.BR setgroups (2).
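.PP
The effect of such permissions can be demonstrated with a simplified
model of the traditional owner/group/other check.
The may_access() function is a hypothetical sketch that ignores
capabilities, ACLs, and other security mechanisms;
the IDs used in the example are arbitrary.

.in +4n
.nf
```python
from typing import Set


def may_access(mode: int, file_uid: int, file_gid: int,
               uid: int, groups: Set[int], want: int) -> bool:
    """Traditional permission check; want is a 3-bit rwx mask."""
    if uid == file_uid:
        bits = (mode >> 6) & 7      # owner class
    elif file_gid in groups:
        bits = (mode >> 3) & 7      # group class
    else:
        bits = mode & 7             # other class
    return (bits & want) == want


MODE = 0o707                        # "rwx---rwx"
READ = 4

# A member of group 50 is denied; after dropping group 50 the same
# user falls into the "other" class and is granted access.
before = may_access(MODE, 0, 50, 1000, {1000, 50}, READ)
after = may_access(MODE, 0, 50, 1000, {1000}, READ)
print(before, after)                # False True
```
.fi
.in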
.\"
.\" /proc/PID/setgroups
.\" [allow == setgroups() is allowed, "deny" == setgroups() is disallowed]
.\" * Can write if have CAP_SYS_ADMIN in NS
.\" * Must write BEFORE writing to /proc/PID/gid_map
.\"
.\" setgroups()
.\" * Must already have written to gid_maps
.\" * /proc/PID/setgroups must be "allow"
.\"
.\" /proc/PID/gid_map -- writing
.\" * Must already have written "deny" to /proc/PID/setgroups
.\"
.\" ============================================================
.\"
.SS Unmapped user and group IDs
.PP
There are various places where an unmapped user ID (group ID)
may be exposed to user space.
For example, the first process in a new user namespace may call
.BR getuid (2)
before a user ID mapping has been defined for the namespace.
In most such cases, an unmapped user ID is converted
.\" from_kuid_munged(), from_kgid_munged()
to the overflow user ID (group ID);
the default value for the overflow user ID (group ID) is 65534.
See the descriptions of
.IR /proc/sys/kernel/overflowuid
and
.IR /proc/sys/kernel/overflowgid
in
.BR proc (5).
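.PP
The conversion can be pictured as a lookup with a fallback.
In the sketch below, visible_id() is a hypothetical function,
each extent is a tuple holding the three
.I uid_map
fields,
and the overflow value is the default stated above rather than
a value read from the running system.

.in +4n
.nf
```python
from typing import List, Tuple

DEFAULT_OVERFLOW_ID = 65534  # default overflow user (group) ID


def visible_id(extents: List[Tuple[int, int, int]], outside_id: int,
               overflow_id: int = DEFAULT_OVERFLOW_ID) -> int:
    """Return the ID seen inside the namespace, or the overflow ID."""
    for inside, outside, length in extents:
        if outside <= outside_id < outside + length:
            return inside + (outside_id - outside)
    return overflow_id  # unmapped IDs are reported as the overflow ID
```
.fi
.in

With an empty list of extents, as in a newly created user namespace,
every ID is reported as the overflow ID.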

The cases where unmapped IDs are mapped in this fashion include
system calls that return user IDs
.RB ( getuid (2),
.BR getgid (2),
and similar),
credentials passed over a UNIX domain socket,
.\" also SO_PEERCRED
credentials returned by
.BR stat (2),
.BR waitid (2),
and the System V IPC "ctl"
.B IPC_STAT
operations,
credentials exposed by
.IR /proc/PID/status
and the files in
.IR /proc/sysvipc/* ,
credentials returned via the
.I si_uid
field in the
.I siginfo_t
received with a signal (see
.BR sigaction (2)),
credentials written to the process accounting file (see
.BR acct (5)),
and credentials returned with POSIX message queue notifications (see
.BR mq_notify (3)).

There is one notable case where unmapped user and group IDs are
.I not
.\" from_kuid(), from_kgid()
.\" Also F_GETOWNER_UIDS is an exception
converted to the corresponding overflow ID value.
When viewing a
.I uid_map
or
.I gid_map
file in which there is no mapping for the second field,
that field is displayed as 4294967295 (\-1 as an unsigned integer).
.\"
.\" ============================================================
.\"
.SS Set-user-ID and set-group-ID programs
.PP
When a process inside a user namespace executes
a set-user-ID (set-group-ID) program,
the process's effective user (group) ID inside the namespace is changed
to whatever value is mapped for the user (group) ID of the file.
However, if either the user
.I or
the group ID of the file has no mapping inside the namespace,
the set-user-ID (set-group-ID) bit is silently ignored:
the new program is executed,
but the process's effective user (group) ID is left unchanged.
(This mirrors the semantics of executing a set-user-ID or set-group-ID
program that resides on a filesystem that was mounted with the
.BR MS_NOSUID
flag, as described in
.BR mount (2).)
926 .\"
927 .\" ============================================================
928 .\"
929 .SS Miscellaneous
930 .PP
931 When a process's user and group IDs are passed over a UNIX domain socket
932 to a process in a different user namespace (see the description of
933 .B SCM_CREDENTIALS
934 in
935 .BR unix (7)),
936 they are translated into the corresponding values as per the
937 receiving process's user and group ID mappings.
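.PP
As an illustration of where the translated values appear,
the following sketch receives one datagram together with the sender's
credentials on a UNIX domain socket.
The helper names are hypothetical.
The demonstration sends to itself within a single user namespace,
so no translation is visible here;
a receiver in a different user namespace would see the IDs
as mapped by its own namespace.
.nf
```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Receive one byte from fd along with the sender's credentials.
   Returns 0 on success, or -1 if no credentials were received. */
static int
recv_with_creds(int fd, struct ucred *cred)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {                 /* Ensure suitable alignment for cmsghdr */
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(cred != NULL);

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(fd, &msg, 0) == -1)
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
            cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
                cmsg->cmsg_type == SCM_CREDENTIALS) {
            memcpy(cred, CMSG_DATA(cmsg), sizeof(*cred));
            return 0;
        }
    }

    return -1;
}

/* Send a byte to ourselves over a socket pair and fetch the
   credentials that the kernel attaches to it */
static int
self_creds(struct ucred *cred)
{
    int sv[2], on = 1, ret = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    /* Ask the kernel to deliver credentials to the receiving socket */
    if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) == 0 &&
            write(sv[0], "x", 1) == 1)
        ret = recv_with_creds(sv[1], cred);

    close(sv[0]);
    close(sv[1]);
    return ret;
}
```
.fi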
938 .\"
939 .SH CONFORMING TO
940 Namespaces are a Linux-specific feature.
941 .\"
942 .SH NOTES
943 Over the years, many features have been added to the Linux kernel
944 that are available only to privileged users
945 because of their potential to confuse set-user-ID-root applications.
946 In general, it becomes safe to allow the root user in a user namespace to
947 use those features because it is impossible, while in a user namespace,
948 to gain more privilege than the root user of a user namespace has.
949 .\"
950 .\" ============================================================
951 .\"
952 .SS Availability
953 Use of user namespaces requires a kernel that is configured with the
954 .B CONFIG_USER_NS
955 option.
956 User namespaces require support in a range of subsystems across
957 the kernel.
958 When an unsupported subsystem is configured into the kernel,
959 it is not possible to configure user namespaces support.
960
961 As of Linux 3.8, most relevant subsystems supported user namespaces,
962 but a number of filesystems did not have the infrastructure needed
963 to map user and group IDs between user namespaces.
964 Linux 3.9 added the required infrastructure support for many of
965 the remaining unsupported filesystems
966 (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA, NFS, and OCFS2).
967 Linux 3.12 added support for the last of the unsupported major filesystems,
968 .\" commit d6970d4b726cea6d7a9bc4120814f95c09571fc3
969 XFS.
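.PP
One way a program might probe for kernel support at run time is to
test for the existence of the user namespace entry under
.IR /proc/self/ns .
The following is a minimal sketch; the function name is hypothetical,
and the path is passed as an argument only to keep the helper easy to test.
A successful result indicates that the kernel provides user namespaces,
not that the caller is permitted to create one.
.nf
```c
#include <assert.h>
#include <unistd.h>

/* Return 1 if the namespace file ns_path exists, 0 otherwise.
   Intended to be called with "/proc/self/ns/user". */
static int
user_ns_supported(const char *ns_path)
{
    assert(ns_path != NULL);

    return access(ns_path, F_OK) == 0;
}
```
.fi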
970 .\"
971 .SH EXAMPLE
972 The program below is designed to allow experimenting with
973 user namespaces, as well as other types of namespaces.
974 It creates namespaces as specified by command-line options and then executes
975 a command inside those namespaces.
976 The comments and
977 .I usage()
978 function inside the program provide a full explanation of the program.
979 The following shell session demonstrates its use.
980
981 First, we look at the run-time environment:
982
983 .in +4n
984 .nf
985 $ \fBuname -rs\fP # Need Linux 3.8 or later
986 Linux 3.8.0
987 $ \fBid -u\fP # Running as unprivileged user
988 1000
989 $ \fBid -g\fP
990 1000
991 .fi
992 .in
993
994 Now start a new shell in new user
995 .RI ( \-U ),
996 mount
997 .RI ( \-m ),
998 and PID
999 .RI ( \-p )
1000 namespaces, with user ID
1001 .RI ( \-M )
1002 and group ID
1003 .RI ( \-G )
1004 1000 mapped to 0 inside the user namespace:
1005
1006 .in +4n
1007 .nf
1008 $ \fB./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash\fP
1009 .fi
1010 .in
1011
1012 The shell has PID 1, because it is the first process in the new
1013 PID namespace:
1014
1015 .in +4n
1016 .nf
1017 bash$ \fBecho $$\fP
1018 1
1019 .fi
1020 .in
1021
1022 Inside the user namespace, the shell has user and group ID 0,
1023 and a full set of permitted and effective capabilities:
1024
1025 .in +4n
1026 .nf
1027 bash$ \fBcat /proc/$$/status | egrep '^[UG]id'\fP
1028 Uid: 0 0 0 0
1029 Gid: 0 0 0 0
1030 bash$ \fBcat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'\fP
1031 CapInh: 0000000000000000
1032 CapPrm: 0000001fffffffff
1033 CapEff: 0000001fffffffff
1034 .fi
1035 .in
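.PP
The capability sets shown above are hexadecimal bit masks,
with one bit per capability.
The following sketch, using a hypothetical helper name,
counts the bits set in such a mask.
For the value 0000001fffffffff, the low 37 bits are set,
corresponding to capabilities numbered 0 to 36,
which is assumed here to be the full set defined by the kernel
used in the session.
.nf
```c
#include <assert.h>
#include <stdlib.h>

/* Count the capabilities present in a hexadecimal mask string of the
   kind shown in the Cap* fields of /proc/PID/status */
static int
count_caps(const char *hexmask)
{
    unsigned long long mask;
    int n = 0;

    assert(hexmask != NULL);

    mask = strtoull(hexmask, NULL, 16);
    while (mask != 0) {
        n += (int) (mask & 1);  /* Add the lowest bit */
        mask >>= 1;
    }

    return n;
}
```
.fi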
1036
1037 Mounting a new
1038 .I /proc
1039 filesystem and listing all of the processes visible
1040 in the new PID namespace shows that the shell can't see
1041 any processes outside the PID namespace:
1042
1043 .in +4n
1044 .nf
1045 bash$ \fBmount -t proc proc /proc\fP
1046 bash$ \fBps ax\fP
1047 PID TTY STAT TIME COMMAND
1048 1 pts/3 S 0:00 bash
1049 22 pts/3 R+ 0:00 ps ax
1050 .fi
1051 .in
1052 .SS Program source
1053 \&
1054 .nf
1055 /* userns_child_exec.c
1056
1057 Licensed under GNU General Public License v2 or later
1058
1059 Create a child process that executes a shell command in new
1060 namespace(s); allow UID and GID mappings to be specified when
1061 creating a user namespace.
1062 */
1063 #define _GNU_SOURCE
1064 #include <sched.h>
1065 #include <unistd.h>
1066 #include <stdlib.h>
1067 #include <sys/wait.h>
1068 #include <signal.h>
1069 #include <fcntl.h>
1070 #include <stdio.h>
1071 #include <string.h>
1072 #include <limits.h>
1073 #include <errno.h>
1074
1075 /* A simple error\-handling function: print an error message based
1076 on the value in \(aqerrno\(aq and terminate the calling process */
1077
1078 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \\
1079 } while (0)
1080
1081 struct child_args {
1082 char **argv; /* Command to be executed by child, with args */
1083 int pipe_fd[2]; /* Pipe used to synchronize parent and child */
1084 };
1085
1086 static int verbose;
1087
1088 static void
1089 usage(char *pname)
1090 {
1091 fprintf(stderr, "Usage: %s [options] cmd [arg...]\\n\\n", pname);
1092 fprintf(stderr, "Create a child process that executes a shell "
1093 "command in a new user namespace,\\n"
1094 "and possibly also other new namespace(s).\\n\\n");
1095 fprintf(stderr, "Options can be:\\n\\n");
1096 #define fpe(str) fprintf(stderr, " %s", str);
1097 fpe("\-i New IPC namespace\\n");
1098 fpe("\-m New mount namespace\\n");
1099 fpe("\-n New network namespace\\n");
1100 fpe("\-p New PID namespace\\n");
1101 fpe("\-u New UTS namespace\\n");
1102 fpe("\-U New user namespace\\n");
1103 fpe("\-M uid_map Specify UID map for user namespace\\n");
1104 fpe("\-G gid_map Specify GID map for user namespace\\n");
1105 fpe("\-z Map user\(aqs UID and GID to 0 in user namespace\\n");
1106 fpe(" (equivalent to: \-M \(aq0 <uid> 1\(aq \-G \(aq0 <gid> 1\(aq)\\n");
1107 fpe("\-v Display verbose messages\\n");
1108 fpe("\\n");
1109 fpe("If \-z, \-M, or \-G is specified, \-U is required.\\n");
1110 fpe("It is not permitted to specify both \-z and either \-M or \-G.\\n");
1111 fpe("\\n");
1112 fpe("Map strings for \-M and \-G consist of records of the form:\\n");
1113 fpe("\\n");
1114 fpe(" ID\-inside\-ns ID\-outside\-ns len\\n");
1115 fpe("\\n");
1116 fpe("A map string can contain multiple records, separated"
1117 " by commas;\\n");
1118 fpe("the commas are replaced by newlines before writing"
1119 " to map files.\\n");
1120
1121 exit(EXIT_FAILURE);
1122 }
1123
1124 /* Update the mapping file \(aqmap_file\(aq, with the value provided in
1125 \(aqmapping\(aq, a string that defines a UID or GID mapping. A UID or
1126 GID mapping consists of one or more newline\-delimited records
1127 of the form:
1128
1129 ID\-inside\-ns ID\-outside\-ns length
1130
1131 Requiring the user to supply a string that contains newlines is
1132 of course inconvenient for command\-line use. Thus, we permit the
1133 use of commas to delimit records in this string, and replace them
1134 with newlines before writing the string to the file. */
1135
1136 static void
1137 update_map(char *mapping, char *map_file)
1138 {
1139 int fd, j;
1140 size_t map_len; /* Length of \(aqmapping\(aq */
1141
1142 /* Replace commas in mapping string with newlines */
1143
1144 map_len = strlen(mapping);
1145 for (j = 0; j < map_len; j++)
1146 if (mapping[j] == \(aq,\(aq)
1147 mapping[j] = \(aq\\n\(aq;
1148
1149 fd = open(map_file, O_RDWR);
1150 if (fd == \-1) {
1151 fprintf(stderr, "ERROR: open %s: %s\\n", map_file,
1152 strerror(errno));
1153 exit(EXIT_FAILURE);
1154 }
1155
1156 if (write(fd, mapping, map_len) != map_len) {
1157 fprintf(stderr, "ERROR: write %s: %s\\n", map_file,
1158 strerror(errno));
1159 exit(EXIT_FAILURE);
1160 }
1161
1162 close(fd);
1163 }
1164
1165 /* Linux 3.19 made a change in the handling of setgroups(2) and the
1166 \(aqgid_map\(aq file to address a security issue. The issue allowed
1167 *unprivileged* users to employ user namespaces in order to drop groups.
1168 The upshot of the 3.19 changes is that in order to update the
1169 \(aqgid_map\(aq file, use of the setgroups() system call in this
1170 user namespace must first be disabled by writing "deny" to one of
1171 the /proc/PID/setgroups files for this namespace. That is the
1172 purpose of the following function. */
1173
1174 static void
1175 proc_setgroups_write(pid_t child_pid, char *str)
1176 {
1177 char setgroups_path[PATH_MAX];
1178 int fd;
1179
1180 snprintf(setgroups_path, PATH_MAX, "/proc/%ld/setgroups",
1181 (long) child_pid);
1182
1183 fd = open(setgroups_path, O_RDWR);
1184 if (fd == \-1) {
1185
1186 /* We may be on a system that doesn\(aqt support
1187 /proc/PID/setgroups. In that case, the file won\(aqt exist,
1188 and the system won\(aqt impose the restrictions that Linux 3.19
1189 added. That\(aqs fine: we don\(aqt need to do anything in order
1190 to permit \(aqgid_map\(aq to be updated.
1191
1192 However, if the error from open() was something other than
1193 the ENOENT error that is expected for that case, let the
1194 user know. */
1195
1196 if (errno != ENOENT)
1197 fprintf(stderr, "ERROR: open %s: %s\\n", setgroups_path,
1198 strerror(errno));
1199 return;
1200 }
1201
1202 if (write(fd, str, strlen(str)) == \-1)
1203 fprintf(stderr, "ERROR: write %s: %s\\n", setgroups_path,
1204 strerror(errno));
1205
1206 close(fd);
1207 }
1208
1209 static int /* Start function for cloned child */
1210 childFunc(void *arg)
1211 {
1212 struct child_args *args = (struct child_args *) arg;
1213 char ch;
1214
1215 /* Wait until the parent has updated the UID and GID mappings.
1216 See the comment in main(). We wait for end of file on a
1217 pipe that will be closed by the parent process once it has
1218 updated the mappings. */
1219
1220 close(args\->pipe_fd[1]); /* Close our descriptor for the write
1221 end of the pipe so that we see EOF
1222 when parent closes its descriptor */
1223 if (read(args\->pipe_fd[0], &ch, 1) != 0) {
1224 fprintf(stderr,
1225 "Failure in child: read from pipe returned != 0\\n");
1226 exit(EXIT_FAILURE);
1227 }
1228
1229 /* Execute a shell command */
1230
1231 printf("About to exec %s\\n", args\->argv[0]);
1232 execvp(args\->argv[0], args\->argv);
1233 errExit("execvp");
1234 }
1235
1236 #define STACK_SIZE (1024 * 1024)
1237
1238 static char child_stack[STACK_SIZE]; /* Space for child\(aqs stack */
1239
1240 int
1241 main(int argc, char *argv[])
1242 {
1243 int flags, opt, map_zero;
1244 pid_t child_pid;
1245 struct child_args args;
1246 char *uid_map, *gid_map;
1247 const int MAP_BUF_SIZE = 100;
1248 char map_buf[MAP_BUF_SIZE];
1249 char map_path[PATH_MAX];
1250
1251 /* Parse command\-line options. The initial \(aq+\(aq character in
1252 the final getopt() argument prevents GNU\-style permutation
1253 of command\-line options. That\(aqs useful, since sometimes
1254 the \(aqcommand\(aq to be executed by this program itself
1255 has command\-line options. We don\(aqt want getopt() to treat
1256 those as options to this program. */
1257
1258 flags = 0;
1259 verbose = 0;
1260 gid_map = NULL;
1261 uid_map = NULL;
1262 map_zero = 0;
1263 while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != \-1) {
1264 switch (opt) {
1265 case \(aqi\(aq: flags |= CLONE_NEWIPC; break;
1266 case \(aqm\(aq: flags |= CLONE_NEWNS; break;
1267 case \(aqn\(aq: flags |= CLONE_NEWNET; break;
1268 case \(aqp\(aq: flags |= CLONE_NEWPID; break;
1269 case \(aqu\(aq: flags |= CLONE_NEWUTS; break;
1270 case \(aqv\(aq: verbose = 1; break;
1271 case \(aqz\(aq: map_zero = 1; break;
1272 case \(aqM\(aq: uid_map = optarg; break;
1273 case \(aqG\(aq: gid_map = optarg; break;
1274 case \(aqU\(aq: flags |= CLONE_NEWUSER; break;
1275 default: usage(argv[0]);
1276 }
1277 }
1278
1279 /* \-M, \-G, or \-z without \-U is nonsensical, as is \-z with \-M or \-G */
1280
1281 if (((uid_map != NULL || gid_map != NULL || map_zero) &&
1282 !(flags & CLONE_NEWUSER)) ||
1283 (map_zero && (uid_map != NULL || gid_map != NULL)))
1284 usage(argv[0]);
1285
1286 args.argv = &argv[optind];
1287
1288 /* We use a pipe to synchronize the parent and child, in order to
1289 ensure that the parent sets the UID and GID maps before the child
1290 calls execve(). This ensures that the child maintains its
1291 capabilities during the execve() in the common case where we
1292 want to map the child\(aqs effective user ID to 0 in the new user
1293 namespace. Without this synchronization, the child would lose
1294 its capabilities if it performed an execve() with nonzero
1295 user IDs (see the capabilities(7) man page for details of the
1296 transformation of a process\(aqs capabilities during execve()). */
1297
1298 if (pipe(args.pipe_fd) == \-1)
1299 errExit("pipe");
1300
1301 /* Create the child in new namespace(s) */
1302
1303 child_pid = clone(childFunc, child_stack + STACK_SIZE,
1304 flags | SIGCHLD, &args);
1305 if (child_pid == \-1)
1306 errExit("clone");
1307
1308 /* Parent falls through to here */
1309
1310 if (verbose)
1311 printf("%s: PID of child created by clone() is %ld\\n",
1312 argv[0], (long) child_pid);
1313
1314 /* Update the UID and GID maps in the child */
1315
1316 if (uid_map != NULL || map_zero) {
1317 snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
1318 (long) child_pid);
1319 if (map_zero) {
1320 snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
1321 uid_map = map_buf;
1322 }
1323 update_map(uid_map, map_path);
1324 }
1325
1326 if (gid_map != NULL || map_zero) {
1327 proc_setgroups_write(child_pid, "deny");
1328
1329 snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
1330 (long) child_pid);
1331 if (map_zero) {
1332 snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
1333 gid_map = map_buf;
1334 }
1335 update_map(gid_map, map_path);
1336 }
1337
1338 /* Close the write end of the pipe, to signal to the child that we
1339 have updated the UID and GID maps */
1340
1341 close(args.pipe_fd[1]);
1342
1343 if (waitpid(child_pid, NULL, 0) == \-1) /* Wait for child */
1344 errExit("waitpid");
1345
1346 if (verbose)
1347 printf("%s: terminating\\n", argv[0]);
1348
1349 exit(EXIT_SUCCESS);
1350 }
1351 .fi
1352 .SH SEE ALSO
1353 .BR newgidmap (1), \" From the shadow package
1354 .BR newuidmap (1), \" From the shadow package
1355 .BR clone (2),
1356 .BR ptrace (2),
1357 .BR setns (2),
1358 .BR unshare (2),
1359 .BR proc (5),
1360 .BR subgid (5), \" From the shadow package
1361 .BR subuid (5), \" From the shadow package
1362 .BR credentials (7),
1363 .BR capabilities (7),
1364 .BR namespaces (7),
1365 .BR cgroup_namespaces (7),
1366 .BR pid_namespaces (7)
1367 .sp
1368 The kernel source file
1369 .IR Documentation/namespaces/resource-control.txt .