
   1 .\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com>
   2 .\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com>
   3 .\"
   4 .\" Permission is granted to make and distribute verbatim copies of this
   5 .\" manual provided the copyright notice and this permission notice are
   6 .\" preserved on all copies.
   7 .\"
   8 .\" Permission is granted to copy and distribute modified versions of this
   9 .\" manual under the conditions for verbatim copying, provided that the
  10 .\" entire resulting derived work is distributed under the terms of a
  11 .\" permission notice identical to this one.
  12 .\"
  13 .\" Since the Linux kernel and libraries are constantly changing, this
  14 .\" manual page may be incorrect or out-of-date.  The author(s) assume no
  15 .\" responsibility for errors or omissions, or for damages resulting from
  16 .\" the use of the information contained herein.  The author(s) may not
  17 .\" have taken the same level of care in the production of this manual,
  18 .\" which is licensed free of charge, as they might when working
  19 .\" professionally.
  20 .\"
  21 .\" Formatted or processed versions of this manual, if unaccompanied by
  22 .\" the source, must acknowledge the copyright and authors of this work.
  23 .\"
  24 .\"
  25 .TH USER_NAMESPACES 7 2013-01-14 "Linux" "Linux Programmer's Manual"
  26 .SH NAME
user_namespaces \- overview of Linux user namespaces
  28 .SH DESCRIPTION
  29 For an overview of namespaces, see
  30 .BR namespaces (7).
  31
  32 User namespaces isolate security-related identifiers, in particular,
user IDs and group IDs (see
.BR credentials (7)),
  35 keys (see
  36 .BR keyctl (2)),
  37 .\" FIXME: This page says very little about the interaction
  38 .\" of user namespaces and keys. Add something on this topic.
  39 and capabilities (see
  40 .BR capabilities (7)).
  41 A process's user and group IDs can be different
  42 inside and outside a user namespace.
  43 In particular,
  44 a process can have a normal unprivileged user ID outside a user namespace
  45 while at the same time having a user ID of 0 inside the namespace;
  46 in other words,
  47 the process has full privileges for operations inside the user namespace,
  48 but is unprivileged for operations outside the namespace.
  49 .\"
  50 .\" ============================================================
  51 .\"
  52 .SS Nested namespaces, namespace membership
  53 User namespaces can be nested;
  54 that is, each user namespace\(emexcept the initial ("root")
  55 namespace\(emhas a parent user namespace,
  56 and can have zero or more child user namespaces.
  57 The parent user namespace is the user namespace
  58 of the process that creates the user namespace via a call to
  59 .BR unshare (2)
  60 or
  61 .BR clone (2)
  62 with the
  63 .BR CLONE_NEWUSER
  64 flag.
  65
  66 The kernel imposes (since version 3.11) a limit of 32 nested levels of
  67 .\" commit 8742f229b635bf1c1c84a3dfe5e47c814c20b5c8
  68 user namespaces.
  69 .\" FIXME Explain the rationale for this limit. (What is the rationale?)
  70 Calls to
  71 .BR unshare (2)
  72 or
  73 .BR clone (2)
  74 that would cause this limit to be exceeded fail with the error
  75 .BR EUSERS .
  76
  77 Each process is a member of exactly one user namespace.
  78 A process created via
  79 .BR fork (2)
  80 or
  81 .BR clone (2)
  82 without the
  83 .BR CLONE_NEWUSER
  84 flag is a member of the same user namespace as its parent.
  85 A process can join another user namespace with
  86 .BR setns (2)
  87 if it has the
  88 .BR CAP_SYS_ADMIN
capability in that namespace;
  90 upon doing so, it gains a full set of capabilities in that namespace.
  91
  92 A call to
  93 .BR clone (2)
  94 or
  95 .BR unshare (2)
  96 with the
  97 .BR CLONE_NEWUSER
  98 flag makes the new child process (for
  99 .BR clone (2))
 100 or the caller (for
 101 .BR unshare (2))
 102 a member of the new user namespace created by the call.
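.PP
The membership rule for
.BR fork (2)
can be checked from user space.
The following is a minimal sketch, not part of the original page;
it assumes a kernel that provides the /proc/[pid]/ns/user entry
(Linux 3.8 or later), and the helper name is hypothetical.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a fork()ed child is in the same
   user namespace as the caller, 0 if it is not, and -1 on error.
   Two processes are in the same user namespace when their
   /proc/[pid]/ns/user entries refer to the same device and inode. */
int
child_shares_user_ns(void)
{
    struct stat parent_st, child_st;
    pid_t pid;
    int status;

    if (stat("/proc/self/ns/user", &parent_st) == -1)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {             /* Child: compare with parent's values */
        if (stat("/proc/self/ns/user", &child_st) == -1)
            _exit(2);
        _exit(child_st.st_dev == parent_st.st_dev &&
              child_st.st_ino == parent_st.st_ino ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Because the child is created without CLONE_NEWUSER,
the helper is expected to return 1 wherever the ns entry is available.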
 103 .\"
 104 .\" ============================================================
 105 .\"
 106 .SS Capabilities
 107 The child process created by
 108 .BR clone (2)
 109 with the
 110 .BR CLONE_NEWUSER
 111 flag starts out with a complete set
 112 of capabilities in the new user namespace.
 113 Likewise, a process that creates a new user namespace using
 114 .BR unshare (2)
 115 or joins an existing user namespace using
 116 .BR setns (2)
 117 gains a full set of capabilities in that namespace.
 118 On the other hand,
 119 that process has no capabilities in the parent (in the case of
 120 .BR clone (2))
 121 or previous (in the case of
 122 .BR unshare (2)
 123 and
 124 .BR setns (2))
 125 user namespace,
 126 even if the new namespace is created or joined by the root user
 127 (i.e., a process with user ID 0 in the root namespace).
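.PP
One way to observe these capability changes is to read the CapEff
line of /proc/[pid]/status.
The following sketch (the function name is an assumption made for
illustration) parses that line from an already opened stream.

```c
#include <stdio.h>

/* Hypothetical helper: scan a stream in the format of
   /proc/[pid]/status for the "CapEff:" line and store the
   hexadecimal mask in *cap_eff.
   Returns 0 on success, or -1 if no such line is found. */
int
read_cap_eff(FILE *fp, unsigned long long *cap_eff)
{
    char line[256];

    while (fgets(line, sizeof(line), fp) != NULL)
        if (sscanf(line, "CapEff: %llx", cap_eff) == 1)
            return 0;

    return -1;
}
```

A caller could pass the result of fopen("/proc/self/status", "r")
before and after a call to unshare(CLONE_NEWUSER) and compare the two masks;
the exact value of a full set depends on the kernel version.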
 128
 129 Note that a call to
 130 .BR execve (2)
 131 will cause a process to lose any capabilities that it has,
 132 unless it has a user ID of 0 within the namespace.
 133 Thus, before calling
 134 .BR execve (2),
 135 a user ID mapping for ID 0 must be defined,
 136 and the caller may also need to use
 137 .BR setuid (2)
 138 or similar to set its user ID to 0.
 139
 140 A call to
 141 .BR clone (2),
 142 .BR unshare (2),
 143 or
 144 .BR setns (2)
 145 using the
 146 .BR CLONE_NEWUSER
 147 flag sets the "securebits" flags
 148 (see
 149 .BR capabilities (7))
 150 to their default values (all flags disabled) in the child (for
 151 .BR clone (2))
 152 or caller (for
 153 .BR unshare (2),
 154 or
 155 .BR setns (2)).
 156 Note that because the caller no longer has capabilities
 157 in its original user namespace after a call to
 158 .BR setns (2),
 159 it is not possible for a process to reset its "securebits" flags while
 160 retaining its user namespace membership by using a pair of
 161 .BR setns (2)
 162 calls to move to another user namespace and then return to
 163 its original user namespace.
 164
 165 Having a capability inside a user namespace
 166 permits a process to perform operations (that require privilege)
 167 only on resources governed by that namespace.
 168 The rules for determining whether or not a process has a capability
 169 in a particular user namespace are as follows:
 170 .IP 1. 3
 171 A process has a capability inside a user namespace
 172 if it is a member of that namespace and
 173 it has the capability in its effective capability set.
 174 A process can gain capabilities in its effective capability
 175 set in various ways.
 176 For example, it may execute a set-user-ID program or an
 177 executable with associated file capabilities.
 178 In addition,
 179 a process may gain capabilities via the effect of
 180 .BR clone (2),
 181 .BR unshare (2),
 182 or
 183 .BR setns (2),
 184 as already described.
 185 .\" In the 3.8 sources, see security/commoncap.c::cap_capable():
 186 .IP 2.
 187 If a process has a capability in a user namespace,
 188 then it has that capability in all child (and further removed descendant)
 189 namespaces as well.
 190 .IP 3.
 191 .\" * The owner of the user namespace in the parent of the
 192 .\" * user namespace has all caps.
 193 When a user namespace is created, the kernel records the effective
 194 user ID of the creating process as being the "owner" of the namespace.
 195 .\" (and likewise associates the effective group ID of the creating process
 196 .\" with the namespace).
 197 A process that resides
 198 in the parent of the user namespace
 199 .\" See kernel commit 520d9eabce18edfef76a60b7b839d54facafe1f9 for a fix
 200 .\" on this point
 201 and whose effective user ID matches the owner of the namespace
 202 has all capabilities in the namespace.
 203 .\"     This includes the case where the process executes a set-user-ID
 204 .\"     program that confers the effective UID of the creator of the namespace.
 205 By virtue of the previous rule,
 206 this means that the process has all capabilities in all
 207 further removed descendant user namespaces as well.
 208 .\"
 209 .\" ============================================================
 210 .\"
 211 .SS Interaction of user namespaces and other types of namespaces
 212 Starting in Linux 3.8, unprivileged processes can create user namespaces,
 213 and mount, PID, IPC, network, and UTS namespaces can be created with just the
 214 .B CAP_SYS_ADMIN
 215 capability in the caller's user namespace.
 216
 217 If
 218 .BR CLONE_NEWUSER
 219 is specified along with other
 220 .B CLONE_NEW*
 221 flags in a single
 222 .BR clone (2)
 223 or
 224 .BR unshare (2)
 225 call, the user namespace is guaranteed to be created first,
 226 giving the child
 227 .RB ( clone (2))
 228 or caller
 229 .RB ( unshare (2))
 230 privileges over the remaining namespaces created by the call.
 231 Thus, it is possible for an unprivileged caller to specify this combination
 232 of flags.
 233
 234 When a new IPC, mount, network, PID, or UTS namespace is created via
 235 .BR clone (2)
 236 or
 237 .BR unshare (2),
 238 the kernel records the user namespace of the creating process against
 239 the new namespace.
 240 (This association can't be changed.)
 241 When a process in the new namespace subsequently performs
 242 privileged operations that operate on global
 243 resources isolated by the namespace,
 244 the permission checks are performed according to the process's capabilities
 245 in the user namespace that the kernel associated with the new namespace.
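.PP
As an illustration of combining CLONE_NEWUSER with another flag,
the sketch below creates user and UTS namespaces in one
unshare() call and then changes the hostname,
an operation that normally requires CAP_SYS_ADMIN.
The helper name and its return convention are assumptions;
the change is made in a child process so that the caller is unaffected.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 if the hostname was changed inside
   the new UTS namespace, 0 if the kernel refused one of the steps
   (for example, because user namespaces are disabled), and -1 if
   the child could not be created or waited for. */
int
set_hostname_in_new_ns(const char *name)
{
    char buf[256];
    pid_t pid;
    int status;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* The user namespace is created first, so the caller holds
           CAP_SYS_ADMIN over the UTS namespace created with it */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1)
            _exit(1);
        if (sethostname(name, strlen(name)) == -1)
            _exit(1);
        if (gethostname(buf, sizeof(buf)) == -1)
            _exit(1);
        buf[sizeof(buf) - 1] = '\0';
        _exit(strcmp(buf, name) == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The hostname seen by the caller and by the rest of the system is unchanged,
because the new name exists only in the child's UTS namespace.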
 246 .\"
 247 .\" ============================================================
 248 .\"
 249 .SS User and group ID mappings: uid_map and gid_map
 250 When a user namespace is created,
 251 it starts out without a mapping of user IDs (group IDs)
 252 to the parent user namespace.
 253 The
 254 .IR /proc/[pid]/uid_map
 255 and
 256 .IR /proc/[pid]/gid_map
 257 files (available since Linux 3.5)
 258 .\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
 259 expose the mappings for user and group IDs
 260 inside the user namespace for the process
 261 .IR pid .
 262 These files can be read to view the mappings in a user namespace and
 263 written to (once) to define the mappings.
 264
 265 The description in the following paragraphs explains the details for
 266 .IR uid_map ;
 267 .IR gid_map
 268 is exactly the same,
 269 but each instance of "user ID" is replaced by "group ID".
 270
 271 The
 272 .I uid_map
 273 file exposes the mapping of user IDs from the user namespace
 274 of the process
 275 .IR pid
 276 to the user namespace of the process that opened
 277 .IR uid_map
 278 (but see a qualification to this point below).
 279 In other words, processes that are in different user namespaces
 280 will potentially see different values when reading from a particular
 281 .I uid_map
 282 file, depending on the user ID mappings for the user namespaces
 283 of the reading processes.
 284
 285 Each line in the
 286 .I uid_map
 287 file specifies a 1-to-1 mapping of a range of contiguous
 288 user IDs between two user namespaces.
 289 (When a user namespace is first created, this file is empty.)
 290 The specification in each line takes the form of
 291 three numbers delimited by white space.
 292 The first two numbers specify the starting user ID in
 293 each of the two user namespaces.
 294 The third number specifies the length of the mapped range.
 295 In detail, the fields are interpreted as follows:
 296 .IP (1) 4
 297 The start of the range of user IDs in
 298 the user namespace of the process
 299 .IR pid .
 300 .IP (2)
 301 The start of the range of user
 302 IDs to which the user IDs specified by field one map.
 303 How field two is interpreted depends on whether the process that opened
 304 .I uid_map
 305 and the process
 306 .IR pid
 307 are in the same user namespace, as follows:
 308 .RS
 309 .IP a) 3
 310 If the two processes are in different user namespaces:
 311 field two is the start of a range of
 312 user IDs in the user namespace of the process that opened
 313 .IR uid_map .
 314 .IP b)
 315 If the two processes are in the same user namespace:
 316 field two is the start of the range of
 317 user IDs in the parent user namespace of the process
 318 .IR pid .
 319 This case enables the opener of
 320 .I uid_map
 321 (the common case here is opening
 322 .IR /proc/self/uid_map )
 323 to see the mapping of user IDs into the user namespace of the process
 324 that created this user namespace.
 325 .RE
 326 .IP (3)
 327 The length of the range of user IDs that is mapped between the two
 328 user namespaces.
 329 .PP
 330 System calls that return user IDs (group IDs)\(emfor example,
 331 .BR getuid (2),
 332 .BR getgid (2),
 333 and the credential fields in the structure returned by
 334 .BR stat (2)\(emreturn
 335 the user ID (group ID) mapped into the caller's user namespace.
 336
 337 When a process accesses a file, its user and group IDs
 338 are mapped into the initial user namespace for the purpose of permission
 339 checking and assigning IDs when creating a file.
 340 When a process retrieves file user and group IDs via
 341 .BR stat (2),
 342 the IDs are mapped in the opposite direction,
 343 to produce values relative to the process user and group ID mappings.
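.PP
The translation in each direction can be sketched as a lookup over the
lines of a map.
The structure and function names below are illustrative only and do not
correspond to kernel interfaces;
an ID with no mapping is reported as 4294967295.

```c
#include <stddef.h>

/* One line of a uid_map or gid_map file (illustrative type) */
struct id_extent {
    unsigned long inside;   /* Field 1: first ID inside the namespace */
    unsigned long outside;  /* Field 2: first ID in the other namespace */
    unsigned long length;   /* Field 3: number of IDs in the range */
};

#define ID_UNMAPPED 4294967295UL

/* Translate an ID used inside the namespace to the ID it maps to
   in the other namespace (the direction used for file access) */
unsigned long
map_id_down(const struct id_extent *map, size_t n, unsigned long id)
{
    for (size_t j = 0; j < n; j++)
        if (id >= map[j].inside && id - map[j].inside < map[j].length)
            return map[j].outside + (id - map[j].inside);
    return ID_UNMAPPED;
}

/* Translate in the opposite direction (the direction used when
   IDs are reported back to the process, as with stat(2)) */
unsigned long
map_id_up(const struct id_extent *map, size_t n, unsigned long id)
{
    for (size_t j = 0; j < n; j++)
        if (id >= map[j].outside && id - map[j].outside < map[j].length)
            return map[j].inside + (id - map[j].outside);
    return ID_UNMAPPED;
}
```

With the single extent "0 1000 1", map_id_down() turns 0 into 1000 and
map_id_up() turns 1000 back into 0; every other ID is unmapped.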
 344
 345 The initial user namespace has no parent namespace,
 346 but, for consistency, the kernel provides dummy user and group
 347 ID mapping files for this namespace.
 348 Looking at the
 349 .I uid_map
 350 file
 351 .RI ( gid_map
 352 is the same) from a shell in the initial namespace shows:
 353
 354 .in +4n
 355 .nf
 356 $ \fBcat /proc/$$/uid_map\fP
 357          0          0 4294967295
 358 .fi
 359 .in
 360
 361 This mapping tells us
 362 that the range starting at user ID 0 in this namespace
 363 maps to a range starting at 0 in the (nonexistent) parent namespace,
 364 and the length of the range is the largest 32-bit unsigned integer.
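.PP
A program can read a map file in the same way.
The sketch below parses the three fields of each line from an open stream;
the type and function names are assumptions made for this example.

```c
#include <stdio.h>

/* One line of a uid_map or gid_map file (illustrative type) */
struct id_extent {
    unsigned long inside;   /* Field 1 */
    unsigned long outside;  /* Field 2 */
    unsigned long length;   /* Field 3 */
};

/* Read up to 'max' mapping lines from 'fp', a stream in uid_map
   format. Returns the number of lines parsed (0 for an empty map),
   or -1 if a line does not contain three numbers. */
int
read_id_map(FILE *fp, struct id_extent *map, int max)
{
    char line[128];
    int n = 0;

    while (n < max && fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "%lu %lu %lu", &map[n].inside,
                   &map[n].outside, &map[n].length) != 3)
            return -1;
        n++;
    }

    return n;
}
```

To inspect the caller's own mapping, the stream would be obtained with
fopen("/proc/self/uid_map", "r").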
 365 .\"
 366 .\" ============================================================
 367 .\"
 368 .SS Defining user and group ID mappings: writing to uid_map and gid_map
 369 .PP
 370 After the creation of a new user namespace, the
 371 .I uid_map
 372 file of
 373 .I one
 374 of the processes in the namespace may be written to
 375 .I once
 376 to define the mapping of user IDs in the new user namespace.
 377 An attempt to write more than once to a
 378 .I uid_map
 379 file in a user namespace fails with the error
 380 .BR EPERM .
 381 Similar rules apply for
 382 .I gid_map
 383 files.
 384
 385 The lines written to
 386 .IR uid_map
 387 .RI ( gid_map )
 388 must conform to the following rules:
 389 .IP * 3
 390 The three fields must be valid numbers,
 391 and the last field must be greater than 0.
 392 .IP *
 393 Lines are terminated by newline characters.
 394 .IP *
 395 There is an (arbitrary) limit on the number of lines in the file.
 396 As at Linux 3.8, the limit is five lines.
 397 In addition, the number of bytes written to
 398 the file must be less than the system page size,
 399 .\" FIXME(Eric): the restriction "less than" rather than "less than or equal"
 400 .\" seems strangely arbitrary. Furthermore, the comment does not agree
 401 .\" with the code in kernel/user_namespace.c. Which is correct.
 402 and the write must be performed at the start of the file (i.e.,
 403 .BR lseek (2)
 404 and
 405 .BR pwrite (2)
 406 can't be used to write to nonzero offsets in the file).
 407 .IP *
 408 The range of user IDs (group IDs)
 409 specified in each line cannot overlap with the ranges
 410 in any other lines.
 411 In the initial implementation (Linux 3.8), this requirement was
 412 satisfied by a simplistic implementation that imposed the further
 413 requirement that
 414 the values in both field 1 and field 2 of successive lines must be
 415 in ascending numerical order,
 416 which prevented some otherwise valid maps from being created.
 417 Linux 3.9 and later
 418 .\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba
 419 fix this limitation, allowing any valid set of nonoverlapping maps.
 420 .IP *
 421 At least one line must be written to the file.
 422 .PP
 423 Writes that violate the above rules fail with the error
 424 .BR EINVAL .
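.PP
The rules above can be approximated by a user-space check made before
writing to the file.
This is a sketch only, not the kernel's algorithm.
It assumes that each field must fit in 32 bits,
that ranges may not overlap in either field 1 or field 2,
and that the final line may omit its newline (as the example program
in this page does);
the five-line limit is the one stated above for Linux 3.8.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_MAP_LINES 5             /* Limit given above for Linux 3.8 */
#define MAX_ID 4294967295ULL        /* Assumption: fields fit in 32 bits */

static int
overlaps(unsigned long long a, unsigned long long a_len,
         unsigned long long b, unsigned long long b_len)
{
    return a < b + b_len && b < a + a_len;
}

/* Hypothetical pre-check: returns 0 if 'text' appears to satisfy the
   rules for data written to uid_map or gid_map, or -1 otherwise */
int
check_map_text(const char *text)
{
    unsigned long long in[MAX_MAP_LINES], out[MAX_MAP_LINES],
                       len[MAX_MAP_LINES];
    long page_size = sysconf(_SC_PAGESIZE);
    const char *p = text;
    int n = 0;

    if (page_size > 0 && strlen(text) >= (size_t) page_size)
        return -1;                  /* Must be less than the page size */

    while (*p != '\0') {
        char line[128], extra;
        size_t line_len = strcspn(p, "\n");

        if (n == MAX_MAP_LINES || line_len >= sizeof(line))
            return -1;
        memcpy(line, p, line_len);
        line[line_len] = '\0';

        /* Exactly three unsigned numbers and nothing else on the line */
        if (strchr(line, '-') != NULL ||
                sscanf(line, "%llu %llu %llu %c", &in[n], &out[n],
                       &len[n], &extra) != 3)
            return -1;
        if (len[n] == 0 || in[n] > MAX_ID || out[n] > MAX_ID ||
                len[n] > MAX_ID)
            return -1;

        /* Reject ranges that overlap an earlier line */
        for (int j = 0; j < n; j++)
            if (overlaps(in[j], len[j], in[n], len[n]) ||
                    overlaps(out[j], len[j], out[n], len[n]))
                return -1;

        n++;
        p += line_len;
        if (*p == '\n')
            p++;
    }

    return n > 0 ? 0 : -1;          /* At least one line is required */
}
```

A string that passes this check can still be rejected by the kernel,
for example with EPERM if the permission requirements are not met.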
 425
 426 In order for a process to write to the
 427 .I /proc/[pid]/uid_map
 428 .RI ( /proc/[pid]/gid_map )
 429 file, all of the following requirements must be met:
 430 .IP 1. 3
 431 The writing process must have the
 432 .BR CAP_SETUID
 433 .RB ( CAP_SETGID )
 434 capability in the user namespace of the process
 435 .IR pid .
 436 .IP 2.
 437 The writing process must be in either the user namespace of the process
 438 .I pid
 439 or inside the parent user namespace of the process
 440 .IR pid .
 441 .IP 3.
 442 The mapped user IDs (group IDs) must in turn have a mapping
 443 in the parent user namespace.
 444 .IP 4.
 445 One of the following is true:
 446 .RS
 447 .IP * 3
 448 The data written to
 449 .I uid_map
 450 .RI ( gid_map )
 451 consists of a single line that maps the writing process's filesystem user ID
 452 (group ID) in the parent user namespace to a user ID (group ID)
 453 in the user namespace.
The usual case here is that this single line provides a mapping for
the user ID of the process that created the namespace.
 456 .IP * 3
 457 The process has the
 458 .BR CAP_SETUID
 459 .RB ( CAP_SETGID )
 460 capability in the parent user namespace.
 461 Thus, a privileged process can make mappings to arbitrary user IDs (group IDs)
 462 in the parent user namespace.
 463 .RE
 464 .PP
 465 Writes that violate the above rules fail with the error
 466 .BR EPERM .
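.PP
The single-line case in requirement 4 can be sketched as follows:
an unprivileged process creates a user namespace and maps its own
user ID to 0 by writing to its own uid_map file.
The helper name and return convention are assumptions;
only uid_map is written here,
and kernels later than those described in this page may place further
conditions on writing gid_map.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a child process, create a user namespace
   and write the mapping "0 <euid> 1" to /proc/self/uid_map.
   Returns 1 if the child then saw user ID 0, 0 if one of the steps
   was refused (for example, user namespaces are disabled), and -1
   if the child could not be created or waited for. */
int
map_self_to_root(void)
{
    char map[64];
    uid_t outer_uid = geteuid();    /* Must be read before unshare() */
    pid_t pid;
    int fd, len, status;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (unshare(CLONE_NEWUSER) == -1)
            _exit(1);

        len = snprintf(map, sizeof(map), "0 %lu 1\n",
                       (unsigned long) outer_uid);
        fd = open("/proc/self/uid_map", O_WRONLY);
        if (fd == -1 || write(fd, map, len) != len)
            _exit(1);
        close(fd);

        _exit(getuid() == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The write succeeds without privilege in the parent namespace because the
data is a single line that maps the writer's own user ID.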
 467 .\"
 468 .\" ============================================================
 469 .\"
 470 .SS Unmapped user and group IDs
 471 .PP
 472 There are various places where an unmapped user ID (group ID)
 473 may be exposed to user space.
 474 For example, the first process in a new user namespace may call
.BR getuid (2)
 476 before a user ID mapping has been defined for the namespace.
 477 In most such cases, an unmapped user ID is converted
 478 .\" from_kuid_munged(), from_kgid_munged()
 479 to the overflow user ID (group ID);
 480 the default value for the overflow user ID (group ID) is 65534.
 481 See the descriptions of
 482 .IR /proc/sys/kernel/overflowuid
 483 and
 484 .IR /proc/sys/kernel/overflowgid
 485 in
 486 .BR proc (5).
 487
 488 The cases where unmapped IDs are mapped in this fashion include
 489 system calls that return user IDs
.RB ( getuid (2),
 491 .BR getgid (2),
 492 and similar),
 493 credentials passed over a UNIX domain socket,
 494 .\" also SO_PEERCRED
 495 credentials returned by
 496 .BR stat (2),
 497 .BR waitid (2),
 498 and the System V IPC "ctl"
 499 .B IPC_STAT
 500 operations,
 501 credentials exposed by
.IR /proc/[pid]/status
 503 and the files in
 504 .IR /proc/sysvipc/* ,
 505 credentials returned via the
 506 .I si_uid
 507 field in the
 508 .I siginfo_t
 509 received with a signal (see
 510 .BR sigaction (2)),
 511 credentials written to the process accounting file (see
 512 .BR acct (5)),
 513 and credentials returned with POSIX message queue notifications (see
 514 .BR mq_notify (3)).
 515
 516 There is one notable case where unmapped user and group IDs are
 517 .I not
 518 .\" from_kuid(), from_kgid()
 519 .\" Also F_GETOWNER_UIDS is an exception
 520 converted to the corresponding overflow ID value.
 521 When viewing a
 522 .I uid_map
 523 or
 524 .I gid_map
 525 file in which there is no mapping for the second field,
that field is displayed as 4294967295 (\-1 as an unsigned integer).
 527 .\"
 528 .\" ============================================================
 529 .\"
 530 .SS Set-user-ID and set-group-ID programs
 531 .PP
 532 When a process inside a user namespace executes
 533 a set-user-ID (set-group-ID) program,
 534 the process's effective user (group) ID inside the namespace is changed
 535 to whatever value is mapped for the user (group) ID of the file.
 536 However, if either the user
 537 .I or
 538 the group ID of the file has no mapping inside the namespace,
 539 the set-user-ID (set-group-ID) bit is silently ignored:
 540 the new program is executed,
 541 but the process's effective user (group) ID is left unchanged.
 542 (This mirrors the semantics of executing a set-user-ID or set-group-ID
 543 program that resides on a filesystem that was mounted with the
 544 .BR MS_NOSUID
 545 flag, as described in
 546 .BR mount (2).)
 547 .\"
 548 .\" ============================================================
 549 .\"
 550 .SS Miscellaneous
 551 .PP
 552 When a process's user and group IDs are passed over a UNIX domain socket
 553 to a process in a different user namespace (see the description of
 554 .B SCM_CREDENTIALS
 555 in
 556 .BR unix (7)),
 557 they are translated into the corresponding values as per the
 558 receiving process's user and group ID mappings.
 559 .\"
 560 .SH CONFORMING TO
 561 Namespaces are a Linux-specific feature.
 562 .\"
 563 .SH NOTES
Over the years, many features have been added to the Linux kernel
that are available only to privileged users
because of their potential to confuse set-user-ID-root applications.
In general, it is safe to allow the root user in a user namespace to
use those features because it is impossible, while in a user namespace,
to gain more privilege than the root user of that user namespace has.
 570 .SS Availability
 571 Use of user namespaces requires a kernel that is configured with the
 572 .B CONFIG_USER_NS
 573 option.
 574 User namespaces require support in a range of subsystems across
 575 the kernel.
 576 When an unsupported subsystem is configured into the kernel,
 577 it is not possible to configure user namespaces support.
 578 As at Linux 3.8, most relevant subsystems support user namespaces,
 579 but there are a number of filesystems that do not.
 580 Linux 3.9 added user namespaces support for many of the remaining
 581 unsupported filesystems:
 582 Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA, NFS, and OCFS2.
 583 XFS support for user namespaces is not yet available.
 584 .\"
 585 .SH EXAMPLE
 586 The program below is designed to allow experimenting with
 587 user namespaces, as well as other types of namespaces.
 588 It creates namespaces as specified by command-line options and then executes
 589 a command inside those namespaces.
 590 The comments and
 591 .I usage()
 592 function inside the program provide a full explanation of the program.
 593 The following shell session demonstrates its use.
 594
 595 First, we look at the run-time environment:
 596
 597 .in +4n
 598 .nf
 599 $ \fBuname -rs\fP     # Need Linux 3.8 or later
 600 Linux 3.8.0
 601 $ \fBid -u\fP         # Running as unprivileged user
 602 1000
 603 $ \fBid -g\fP
 604 1000
 605 .fi
 606 .in
 607
 608 Now start a new shell in new user
 609 .RI ( \-U ),
 610 mount
 611 .RI ( \-m ),
 612 and PID
 613 .RI ( \-p )
 614 namespaces, with user ID
 615 .RI ( \-M )
 616 and group ID
 617 .RI ( \-G )
 618 1000 mapped to 0 inside the user namespace:
 619
 620 .in +4n
 621 .nf
 622 $ \fB./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash\fP
 623 .fi
 624 .in
 625
 626 The shell has PID 1, because it is the first process in the new
 627 PID namespace:
 628
 629 .in +4n
 630 .nf
 631 bash$ \fBecho $$\fP
 632 1
 633 .fi
 634 .in
 635
 636 Inside the user namespace, the shell has user and group ID 0,
 637 and a full set of permitted and effective capabilities:
 638
 639 .in +4n
 640 .nf
 641 bash$ \fBcat /proc/$$/status | egrep '^[UG]id'\fP
 642 Uid:    0       0       0       0
 643 Gid:    0       0       0       0
 644 bash$ \fBcat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'\fP
 645 CapInh: 0000000000000000
 646 CapPrm: 0000001fffffffff
 647 CapEff: 0000001fffffffff
 648 .fi
 649 .in
 650
 651 Mounting a new
 652 .I /proc
 653 filesystem and listing all of the processes visible
 654 in the new PID namespace shows that the shell can't see
 655 any processes outside the PID namespace:
 656
 657 .in +4n
 658 .nf
 659 bash$ \fBmount -t proc proc /proc\fP
 660 bash$ \fBps ax\fP
 661   PID TTY      STAT   TIME COMMAND
 662     1 pts/3    S      0:00 bash
 663    22 pts/3    R+     0:00 ps ax
 664 .fi
 665 .in
 666 .SS Program source
 667 \&
 668 .nf
/* userns_child_exec.c

   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>

/* A simple error\-handling function: print an error message based
   on the value in \(aqerrno\(aq and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \\
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with args */
    int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\\n\\n", pname);
    fprintf(stderr, "Create a child process that executes a shell "
            "command in a new user namespace,\\n"
            "and possibly also other new namespace(s).\\n\\n");
    fprintf(stderr, "Options can be:\\n\\n");
#define fpe(str) fprintf(stderr, "    %s", str)
    fpe("\-i          New IPC namespace\\n");
    fpe("\-m          New mount namespace\\n");
    fpe("\-n          New network namespace\\n");
    fpe("\-p          New PID namespace\\n");
    fpe("\-u          New UTS namespace\\n");
    fpe("\-U          New user namespace\\n");
    fpe("\-M uid_map  Specify UID map for user namespace\\n");
    fpe("\-G gid_map  Specify GID map for user namespace\\n");
    fpe("\-z          Map user\(aqs UID and GID to 0 in user namespace\\n");
    fpe("            (equivalent to: \-M \(aq0 <uid> 1\(aq \-G \(aq0 <gid> 1\(aq)\\n");
    fpe("\-v          Display verbose messages\\n");
    fpe("\\n");
    fpe("If \-z, \-M, or \-G is specified, \-U is required.\\n");
    fpe("It is not permitted to specify both \-z and either \-M or \-G.\\n");
    fpe("\\n");
    fpe("Map strings for \-M and \-G consist of records of the form:\\n");
    fpe("\\n");
    fpe("    ID\-inside\-ns   ID\-outside\-ns   len\\n");
    fpe("\\n");
    fpe("A map string can contain multiple records, separated"
        " by commas;\\n");
    fpe("the commas are replaced by newlines before writing"
        " to map files.\\n");

    exit(EXIT_FAILURE);
}

/* Update the mapping file \(aqmap_file\(aq, with the value provided in
   \(aqmapping\(aq, a string that defines a UID or GID mapping. A UID or
   GID mapping consists of one or more newline\-delimited records
   of the form:

       ID\-inside\-ns    ID\-outside\-ns   length

   Requiring the user to supply a string that contains newlines is
   of course inconvenient for command\-line use. Thus, we permit the
   use of commas to delimit records in this string, and replace them
   with newlines before writing the string to the file. */

static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t j;
    size_t map_len;     /* Length of \(aqmapping\(aq */

    /* Replace commas in mapping string with newlines */

    map_len = strlen(mapping);
    for (j = 0; j < map_len; j++)
        if (mapping[j] == \(aq,\(aq)
            mapping[j] = \(aq\\n\(aq;

    fd = open(map_file, O_RDWR);
    if (fd == \-1) {
        fprintf(stderr, "ERROR: open %s: %s\\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    if (write(fd, mapping, map_len) != (ssize_t) map_len) {
        fprintf(stderr, "ERROR: write %s: %s\\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
}

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args\->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args\->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\\n");
        exit(EXIT_FAILURE);
    }

    /* Execute a shell command */

    printf("About to exec %s\\n", args\->argv[0]);
    execvp(args\->argv[0], args\->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child\(aqs stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    const int MAP_BUF_SIZE = 100;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command\-line options. The initial \(aq+\(aq character in
       the final getopt() argument prevents GNU\-style permutation
       of command\-line options. That\(aqs useful, since sometimes
       the \(aqcommand\(aq to be executed by this program itself
       has command\-line options. We don\(aqt want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != \-1) {
        switch (opt) {
        case \(aqi\(aq: flags |= CLONE_NEWIPC;        break;
        case \(aqm\(aq: flags |= CLONE_NEWNS;         break;
        case \(aqn\(aq: flags |= CLONE_NEWNET;        break;
        case \(aqp\(aq: flags |= CLONE_NEWPID;        break;
        case \(aqu\(aq: flags |= CLONE_NEWUTS;        break;
        case \(aqv\(aq: verbose = 1;                  break;
        case \(aqz\(aq: map_zero = 1;                 break;
        case \(aqM\(aq: uid_map = optarg;             break;
        case \(aqG\(aq: gid_map = optarg;             break;
        case \(aqU\(aq: flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* \-M or \-G without \-U is nonsensical */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    /* A command to execute is required */

    if (optind >= argc)
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child\(aqs effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process\(aqs capabilities during execve()). */

    if (pipe(args.pipe_fd) == \-1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == \-1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }
    if (gid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == \-1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\\n", argv[0]);

    exit(EXIT_SUCCESS);
}
 918 .fi
 919 .SH SEE ALSO
 920 .BR newgidmap (1),      \" From the shadow package
 921 .BR newuidmap (1),      \" From the shadow package
 922 .BR clone (2),
 923 .BR setns (2),
 924 .BR unshare (2),
 925 .BR proc (5),
 926 .BR subgid (5),         \" From the shadow package
 927 .BR subuid (5),         \" From the shadow package
 928 .BR credentials (7),
 929 .BR capabilities (7),
 930 .BR namespaces (7),
 931 .BR pid_namespaces (7)
 932 .sp
 933 The kernel source file
 934 .IR Documentation/namespaces/resource-control.txt .