git.ipfire.org Git - thirdparty/man-pages.git/blame - man2/clone.2
clone.2: Describe the user namespace (CLONE_NEWUSER)
Commit Line Data
fea681da 1.\" Copyright (c) 1992 Drew Eckhardt <drew@cs.colorado.edu>, March 28, 1992
8c7b566c 2.\" and Copyright (c) Michael Kerrisk, 2001, 2002, 2005, 2013
2297bf0e 3.\"
fd0fc519 4.\" %%%LICENSE_START(GPL_NOVERSION_ONELINE)
fea681da 5.\" May be distributed under the GNU General Public License.
fd0fc519 6.\" %%%LICENSE_END
dccaff1e 7.\"
fea681da
MK
8.\" Modified by Michael Haardt <michael@moria.de>
9.\" Modified 24 Jul 1993 by Rik Faith <faith@cs.unc.edu>
10.\" Modified 21 Aug 1994 by Michael Chastain <mec@shell.portal.com>:
11.\" New man page (copied from 'fork.2').
12.\" Modified 10 June 1995 by Andries Brouwer <aeb@cwi.nl>
13.\" Modified 25 April 1998 by Xavier Leroy <Xavier.Leroy@inria.fr>
14.\" Modified 26 Jun 2001 by Michael Kerrisk
15.\" Mostly upgraded to 2.4.x
16.\" Added prototype for sys_clone() plus description
17.\" Added CLONE_THREAD with a brief description of thread groups
c13182ef 18.\" Added CLONE_PARENT and revised entire page to remove ambiguity
fea681da
MK
19.\" between "calling process" and "parent process"
20.\" Added CLONE_PTRACE and CLONE_VFORK
21.\" Added EPERM and EINVAL error codes
fd8a5be4 22.\" Renamed "__clone" to "clone" (which is the prototype in <sched.h>)
fea681da 23.\" various other minor tidy ups and clarifications.
c11b1abf 24.\" Modified 26 Jun 2001 by Michael Kerrisk <mtk.manpages@gmail.com>
d9bfdb9c 25.\" Updated notes for 2.4.7+ behavior of CLONE_THREAD
c11b1abf 26.\" Modified 15 Oct 2002 by Michael Kerrisk <mtk.manpages@gmail.com>
fea681da
MK
27.\" Added description for CLONE_NEWNS, which was added in 2.4.19
28.\" Slightly rephrased, aeb.
29.\" Modified 1 Feb 2003 - added CLONE_SIGHAND restriction, aeb.
30.\" Modified 1 Jan 2004 - various updates, aeb
0967c11f 31.\" Modified 2004-09-10 - added CLONE_PARENT_SETTID etc. - aeb.
d9bfdb9c 32.\" 2005-04-12, mtk, noted the PID caching behavior of NPTL's getpid()
31830ef0 33.\" wrapper under BUGS.
fd8a5be4
MK
34.\" 2005-05-10, mtk, added CLONE_SYSVSEM, CLONE_UNTRACED, CLONE_STOPPED.
35.\" 2005-05-17, mtk, Substantially enhanced discussion of CLONE_THREAD.
4e836144 36.\" 2008-11-18, mtk, order CLONE_* flags alphabetically
82ee147a 37.\" 2008-11-18, mtk, document CLONE_NEWPID
43ce9dda 38.\" 2008-11-19, mtk, document CLONE_NEWUTS
667417b3 39.\" 2008-11-19, mtk, document CLONE_NEWIPC
cfdc761b 40.\" 2008-11-19, Jens Axboe, mtk, document CLONE_IO
fea681da 41.\"
185341d4
MK
42.\" FIXME Document CLONE_NEWUSER, which is new in 2.6.23
43.\" (also supported for unshare()?)
360ed6b3 44.\"
8980a500 45.TH CLONE 2 2014-08-19 "Linux" "Linux Programmer's Manual"
fea681da 46.SH NAME
9b0e0996 47clone, __clone2 \- create a child process
fea681da 48.SH SYNOPSIS
c10859eb 49.nf
81f10dad
MK
50/* Prototype for the glibc wrapper function */
51
fea681da 52.B #include <sched.h>
c10859eb 53
ff929e3b
MK
54.BI "int clone(int (*" "fn" ")(void *), void *" child_stack ,
55.BI " int " flags ", void *" "arg" ", ... "
d3dbc9b1 56.BI " /* pid_t *" ptid ", struct user_desc *" tls \
ff929e3b 57", pid_t *" ctid " */ );"
81f10dad 58
e585064b 59/* Prototype for the raw system call */
81f10dad
MK
60
61.BI "long clone(unsigned long " flags ", void *" child_stack ,
62.BI " void *" ptid ", void *" ctid ,
63.BI " struct pt_regs *" regs );
c10859eb 64.fi
e73b3103
MK
65.sp
66.in -4n
81f10dad 67Feature Test Macro Requirements for glibc wrapper function (see
e73b3103
MK
68.BR feature_test_macros (7)):
69.in
70.sp
71.BR clone ():
72.ad l
73.RS 4
74.PD 0
75.TP 4
76Since glibc 2.14:
77_GNU_SOURCE
78.TP 4
bd297db0 79.\" See http://sources.redhat.com/bugzilla/show_bug.cgi?id=4749
e73b3103
MK
80Before glibc 2.14:
81_BSD_SOURCE || _SVID_SOURCE
82 /* _GNU_SOURCE also suffices */
83.PD
84.RE
85.ad b
fea681da 86.SH DESCRIPTION
edcc65ff
MK
87.BR clone ()
88creates a new process, in a manner similar to
fea681da 89.BR fork (2).
81f10dad
MK
90
91This page describes both the glibc
e511ffb6 92.BR clone ()
e585064b 93wrapper function and the underlying system call on which it is based.
81f10dad 94The main text describes the wrapper function;
e585064b 95the differences for the raw system call
81f10dad 96are described toward the end of this page.
fea681da
MK
97
98Unlike
99.BR fork (2),
81f10dad
MK
100.BR clone ()
101allows the child process to share parts of its execution context with
fea681da 102the calling process, such as the memory space, the table of file
c13182ef
MK
103descriptors, and the table of signal handlers.
104(Note that on this manual
105page, "calling process" normally corresponds to "parent process".
106But see the description of
107.B CLONE_PARENT
fea681da
MK
108below.)
109
110The main use of
edcc65ff 111.BR clone ()
fea681da
MK
112is to implement threads: multiple threads of control in a program that
113run concurrently in a shared memory space.
114
115When the child process is created with
c13182ef 116.BR clone (),
fea681da 117it executes the function
c13182ef 118.IR fn ( arg ).
fea681da 119(This differs from
c13182ef 120.BR fork (2),
fea681da 121where execution continues in the child from the point
c13182ef
MK
122of the
123.BR fork (2)
fea681da
MK
124call.)
125The
126.I fn
127argument is a pointer to a function that is called by the child
128process at the beginning of its execution.
129The
130.I arg
131argument is passed to the
132.I fn
133function.
134
c13182ef 135When the
fea681da 136.IR fn ( arg )
c13182ef
MK
137function application returns, the child process terminates.
138The integer returned by
fea681da 139.I fn
c13182ef
MK
140is the exit code for the child process.
141The child process may also terminate explicitly by calling
fea681da
MK
142.BR exit (2)
143or after receiving a fatal signal.
144
145The
146.I child_stack
c13182ef
MK
147argument specifies the location of the stack used by the child process.
148Since the child and calling process may share memory,
fea681da 149it is not possible for the child process to execute in the
c13182ef
MK
150same stack as the calling process.
151The calling process must therefore
fea681da
MK
152set up memory space for the child stack and pass a pointer to this
153space to
edcc65ff 154.BR clone ().
5fab2e7c 155Stacks grow downward on all processors that run Linux
fea681da
MK
156(except the HP PA processors), so
157.I child_stack
158usually points to the topmost address of the memory space set up for
159the child stack.
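
The stack setup and exit-code behavior described above can be sketched as follows.
This is a minimal illustration, not a definitive implementation: the names
child_fn, run_child and STACK_SIZE are hypothetical, the 1 MiB stack size is an
arbitrary choice, and the code assumes an architecture where the stack grows downward.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the clone() prototype */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)    /* arbitrary size chosen for this sketch */

/* Hypothetical child function: its return value becomes the exit code. */
static int child_fn(void *arg)
{
    return *(int *) arg;
}

/* Hypothetical helper: returns the child's exit status, or -1 on error. */
static int run_child(int code)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward, so pass the topmost address. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &code);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because SIGCHLD is given as the termination signal, the parent can use a plain
waitpid() call without the __WALL or __WCLONE options.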
160
161The low byte of
162.I flags
fd8a5be4
MK
163contains the number of the
164.I "termination signal"
165sent to the parent when the child dies.
166If this signal is specified as anything other than
fea681da
MK
167.BR SIGCHLD ,
168then the parent process must specify the
c13182ef
MK
169.B __WALL
170or
fea681da 171.B __WCLONE
c13182ef
MK
172options when waiting for the child with
173.BR wait (2).
fea681da
MK
174If no signal is specified, then the parent process is not signaled
175when the child terminates.
176
177.I flags
fd8a5be4
MK
178may also be bitwise-or'ed with zero or more of the following constants,
179in order to specify what is shared between the calling process
fea681da 180and the child process:
fea681da 181.TP
f5dbc7c8
MK
182.BR CLONE_CHILD_CLEARTID " (since Linux 2.5.49)"
183Erase child thread ID at location
d3dbc9b1 184.I ctid
f5dbc7c8
MK
185in child memory when the child exits, and do a wakeup on the futex
186at that address.
187The address involved may be changed by the
188.BR set_tid_address (2)
189system call.
190This is used by threading libraries.
191.TP
192.BR CLONE_CHILD_SETTID " (since Linux 2.5.49)"
193Store child thread ID at location
d3dbc9b1 194.I ctid
f5dbc7c8
MK
195in child memory.
196.TP
1603d6a1 197.BR CLONE_FILES " (since Linux 2.0)"
fea681da 198If
f5dbc7c8
MK
199.B CLONE_FILES
200is set, the calling process and the child process share the same file
201descriptor table.
202Any file descriptor created by the calling process or by the child
203process is also valid in the other process.
204Similarly, if one of the processes closes a file descriptor,
205or changes its associated flags (using the
206.BR fcntl (2)
207.B F_SETFD
208operation), the other process is also affected.
fea681da
MK
209
210If
f5dbc7c8
MK
211.B CLONE_FILES
212is not set, the child process inherits a copy of all file descriptors
213opened in the calling process at the time of
214.BR clone ().
215(The duplicated file descriptors in the child refer to the
216same open file descriptions (see
217.BR open (2))
218as the corresponding file descriptors in the calling process.)
219Subsequent operations that open or close file descriptors,
220or change file descriptor flags,
221performed by either the calling
222process or the child process do not affect the other process.
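
The difference between the two cases can be observed with a short experiment.
The sketch below is only an illustration under stated assumptions: the helper
names close_fd_fn and child_close_affects_parent are hypothetical, and it
assumes that /dev/null can be opened.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the clone() prototype */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* arbitrary size chosen for this sketch */

/* Hypothetical child function: closes the descriptor it is given. */
static int close_fd_fn(void *arg)
{
    return close(*(int *) arg) == 0 ? 0 : 1;
}

/* Returns 1 if the child's close() invalidated the parent's descriptor,
   0 if the parent's descriptor is still open, and -1 on error. */
static int child_close_affects_parent(int share_files)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fd);
        return -1;
    }

    int flags = SIGCHLD | (share_files ? CLONE_FILES : 0);
    pid_t pid = clone(close_fd_fn, stack + STACK_SIZE, flags, &fd);
    int status;
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        free(stack);
        close(fd);
        return -1;
    }
    free(stack);

    /* F_GETFD fails with EBADF if the descriptor no longer exists. */
    if (fcntl(fd, F_GETFD) == -1 && errno == EBADF)
        return 1;

    close(fd);
    return 0;
}
```

Calling the helper with a nonzero argument exercises the shared table;
calling it with 0 exercises the copied table.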
fea681da 223.TP
1603d6a1 224.BR CLONE_FS " (since Linux 2.0)"
fea681da
MK
225If
226.B CLONE_FS
9ee4a2b6 227is set, the caller and the child process share the same filesystem
c13182ef 228information.
9ee4a2b6 229This includes the root of the filesystem, the current
c13182ef
MK
230working directory, and the umask.
231Any call to
fea681da
MK
232.BR chroot (2),
233.BR chdir (2),
234or
235.BR umask (2)
edcc65ff 236performed by the calling process or the child process also affects the
fea681da
MK
237other process.
238
c13182ef 239If
fea681da 240.B CLONE_FS
9ee4a2b6 241is not set, the child process works on a copy of the filesystem
fea681da 242information of the calling process at the time of the
edcc65ff 243.BR clone ()
fea681da
MK
244call.
245Calls to
246.BR chroot (2),
247.BR chdir (2),
248.BR umask (2)
249performed later by one of the processes do not affect the other process.
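
A small experiment with the umask shows the same contrast for filesystem information.
This is a hedged sketch: the names set_umask_fn and child_umask_affects_parent
are hypothetical, and the values 022 and 077 are arbitrary.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the clone() prototype */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)    /* arbitrary size chosen for this sketch */

/* Hypothetical child function: changes the umask and exits. */
static int set_umask_fn(void *arg)
{
    (void) arg;
    umask(077);
    return 0;
}

/* Returns 1 if the child's umask() call changed the parent's umask,
   0 if it did not, and -1 on error. */
static int child_umask_affects_parent(int share_fs)
{
    mode_t saved = umask(022);      /* known starting value */

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        umask(saved);
        return -1;
    }

    int flags = SIGCHLD | (share_fs ? CLONE_FS : 0);
    pid_t pid = clone(set_umask_fn, stack + STACK_SIZE, flags, NULL);
    int status;
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        free(stack);
        umask(saved);
        return -1;
    }
    free(stack);

    /* Read the current umask and restore the caller's original value. */
    mode_t now = umask(saved);
    return now == 077;
}
```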
fea681da 250.TP
a4cc375e 251.BR CLONE_IO " (since Linux 2.6.25)"
11f27a1c
JA
252If
253.B CLONE_IO
254is set, then the new process shares an I/O context with
255the calling process.
256If this flag is not set, then (as with
257.BR fork (2))
258the new process has its own I/O context.
259
260.\" The following based on text from Jens Axboe
a113945f 261The I/O context is the I/O scope of the disk scheduler (i.e.,
11f27a1c
JA
262what the I/O scheduler uses to model scheduling of a process's I/O).
263If processes share the same I/O context,
264they are treated as one by the I/O scheduler.
265As a consequence, they get to share disk time.
266For some I/O schedulers,
267.\" the anticipatory and CFQ scheduler
268if two processes share an I/O context,
269they will be allowed to interleave their disk access.
270If several threads are doing I/O on behalf of the same process
271.RB ( aio_read (3),
272for instance), they should employ
273.BR CLONE_IO
274to get better I/O performance.
275.\" with CFQ and AS.
276
277If the kernel is not configured with the
278.B CONFIG_BLOCK
279option, this flag is a no-op.
280.TP
8722311b 281.BR CLONE_NEWIPC " (since Linux 2.6.19)"
667417b3
MK
282If
283.B CLONE_NEWIPC
284is set, then create the process in a new IPC namespace.
285If this flag is not set, then (as with
286.BR fork (2)),
287the process is created in the same IPC namespace as
288the calling process.
0236bea9 289This flag is intended for the implementation of containers.
667417b3 290
efbfd7ec 291An IPC namespace provides an isolated view of System\ V IPC objects (see
009a049e
MK
292.BR svipc (7))
293and (since Linux 2.6.30)
294.\" commit 7eafd7c74c3f2e67c27621b987b28397110d643f
295.\" https://lwn.net/Articles/312232/
296POSIX message queues
297(see
298.BR mq_overview (7)).
19911fa5
MK
299The common characteristic of these IPC mechanisms is that IPC
300objects are identified by mechanisms other than filesystem
301pathnames.
009a049e 302
c440fe01 303Objects created in an IPC namespace are visible to all other processes
667417b3
MK
304that are members of that namespace,
305but are not visible to processes in other IPC namespaces.
306
83c1f4b5 307When an IPC namespace is destroyed
009a049e 308(i.e., when the last process that is a member of the namespace terminates),
83c1f4b5
MK
309all IPC objects in the namespace are automatically destroyed.
310
667417b3
MK
311Use of this flag requires: a kernel configured with the
312.B CONFIG_SYSVIPC
313and
314.B CONFIG_IPC_NS
c8e18bd1 315options and that the process be privileged
667417b3
MK
316.RB ( CAP_SYS_ADMIN ).
317This flag can't be specified in conjunction with
318.BR CLONE_SYSVSEM .
319.TP
163bf178 320.BR CLONE_NEWNET " (since Linux 2.6.24)"
33a0ccb2 321(The implementation of this flag was completed only
9108d867 322by about kernel version 2.6.29.)
163bf178
MK
323
324If
325.B CLONE_NEWNET
326is set, then create the process in a new network namespace.
327If this flag is not set, then (as with
328.BR fork (2)),
329the process is created in the same network namespace as
330the calling process.
331This flag is intended for the implementation of containers.
332
333A network namespace provides an isolated view of the networking stack
334(network device interfaces, IPv4 and IPv6 protocol stacks,
335IP routing tables, firewall rules, the
336.I /proc/net
337and
338.I /sys/class/net
339directory trees, sockets, etc.).
340A physical network device can live in exactly one
341network namespace.
342A virtual network device ("veth") pair provides a pipe-like abstraction
bea08fec 343.\" FIXME . Add pointer to veth(4) page when it is eventually completed
163bf178
MK
344that can be used to create tunnels between network namespaces,
345and can be used to create a bridge to a physical network device
346in another namespace.
347
bf032425
SH
348When a network namespace is freed
349(i.e., when the last process in the namespace terminates),
350its physical network devices are moved back to the
351initial network namespace (not to the parent of the process).
352
163bf178
MK
353Use of this flag requires: a kernel configured with the
354.B CONFIG_NET_NS
355option and that the process be privileged
cae2ec15 356.RB ( CAP_SYS_ADMIN ).
163bf178 357.TP
c10859eb 358.BR CLONE_NEWNS " (since Linux 2.4.19)"
732e54dd 359Start the child in a new mount namespace.
fea681da 360
732e54dd 361Every process lives in a mount namespace.
c13182ef 362The
fea681da
MK
363.I namespace
364of a process is the data (the set of mounts) describing the file hierarchy
c13182ef
MK
365as seen by that process.
366After a
fea681da
MK
367.BR fork (2)
368or
2777b1ca 369.BR clone ()
fea681da
MK
370where the
371.B CLONE_NEWNS
732e54dd 372flag is not set, the child lives in the same mount
4df2eb09 373namespace as the parent.
fea681da
MK
374The system calls
375.BR mount (2)
376and
377.BR umount (2)
732e54dd 378change the mount namespace of the calling process, and hence affect
fea681da 379all processes that live in the same namespace, but do not affect
732e54dd 380processes in a different mount namespace.
fea681da
MK
381
382After a
2777b1ca 383.BR clone ()
fea681da
MK
384where the
385.B CLONE_NEWNS
732e54dd 386flag is set, the cloned child is started in a new mount namespace,
fea681da
MK
387initialized with a copy of the namespace of the parent.
388
0b9bdf82 389Only a privileged process (one having the \fBCAP_SYS_ADMIN\fP capability)
fea681da
MK
390may specify the
391.B CLONE_NEWNS
392flag.
393It is not permitted to specify both
394.B CLONE_NEWNS
395and
396.B CLONE_FS
397in the same
e511ffb6 398.BR clone ()
fea681da 399call.
70d21f17
EB
400.TP
401.BR CLONE_NEWUSER " (since Linux 3.6)"
402If
403.B CLONE_NEWUSER
404is set, then create the process in a new user namespace. If this flag is not set, then (as with
405.BR fork (2)),
406the process is created in the same user namespace as the calling process.
407
408A user namespace provides an isolated environment for security-related identifiers, in particular
409UIDs, GIDs, keys (see
410.BR keyctl (2)),
411and capabilities.
412
413When a user namespace is created, it starts out without a mapping of UIDs and GIDs
414to the parent user namespace. The desired mapping of UIDs to the parent user namespace
415may be set by writing to
416.IR /proc/[pid]/uid_map .
417The desired mapping of GIDs to the parent user namespace may be set by writing to
418.IR /proc/[pid]/gid_map .
419
420The first process in a user namespace starts out with a complete set of capabilities with
421respect to the new user namespace.
422
423System calls that return UIDs and GIDs return the UID or GID as mapped into the caller's
424user namespace if a mapping exists; otherwise, depending on the context, they return either
425the overflow UID (default 65534) or the overflow GID (default 65534). See
426.IR /proc/sys/kernel/overflowuid " and " /proc/sys/kernel/overflowgid .
427
428As of Linux 3.8, no privileges are needed to create a user namespace,
429and mount, PID, IPC, network, and UTS namespaces can be created with just
430the \fBCAP_SYS_ADMIN\fP capability in the caller's current user namespace.
431
432Over the years, many features have been added
433to the Linux kernel that are available only to privileged users
434because of their potential to confuse set-user-ID-root applications. In
435general, it is safe to allow the root user in a user namespace to
436use those features, because a process in a user namespace cannot
437gain more privilege than the root user of that user namespace has.
438
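
The behavior of an unmapped UID can be checked with a sketch like the following.
It is an illustration only: the names report_uid_fn and unmapped_uid_is_overflow
are hypothetical, the fallback value 65534 is the default mentioned above, and the
helper returns -1 on systems where unprivileged user namespace creation is not permitted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for clone() and CLONE_NEWUSER */
#endif
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* arbitrary size chosen for this sketch */

/* Hypothetical child function: runs in the new user namespace before
   any UID mapping has been written, and compares its UID with the
   expected overflow UID. */
static int report_uid_fn(void *arg)
{
    uid_t expected = *(uid_t *) arg;
    return getuid() == expected ? 0 : 1;
}

/* Returns 1 if the child saw the overflow UID, 0 if it saw another UID,
   and -1 if the user namespace could not be created. */
static int unmapped_uid_is_overflow(void)
{
    uid_t overflow = 65534;         /* default if the proc file is unreadable */
    FILE *fp = fopen("/proc/sys/kernel/overflowuid", "r");
    if (fp != NULL) {
        unsigned int value;
        if (fscanf(fp, "%u", &value) == 1)
            overflow = (uid_t) value;
        fclose(fp);
    }

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t pid = clone(report_uid_fn, stack + STACK_SIZE,
                      CLONE_NEWUSER | SIGCHLD, &overflow);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Writing a mapping to the child's uid_map file before the child calls getuid()
would make the child see the mapped UID instead.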
fea681da 439.TP
82ee147a
MK
440.BR CLONE_NEWPID " (since Linux 2.6.24)"
441.\" This explanation draws a lot of details from
442.\" http://lwn.net/Articles/259217/
443.\" Authors: Pavel Emelyanov <xemul@openvz.org>
444.\" and Kir Kolyshkin <kir@openvz.org>
445.\"
446.\" The primary kernel commit is 30e49c263e36341b60b735cbef5ca37912549264
447.\" Author: Pavel Emelyanov <xemul@openvz.org>
448If
5c95e5e8 449.B CLONE_NEWPID
82ee147a
MK
450is set, then create the process in a new PID namespace.
451If this flag is not set, then (as with
452.BR fork (2)),
453the process is created in the same PID namespace as
454the calling process.
0236bea9 455This flag is intended for the implementation of containers.
82ee147a
MK
456
457A PID namespace provides an isolated environment for PIDs:
458PIDs in a new namespace start at 1,
459somewhat like a standalone system, and calls to
460.BR fork (2),
461.BR vfork (2),
462or
27d47e71 463.BR clone ()
5584229c 464will produce processes with PIDs that are unique within the namespace.
82ee147a
MK
465
466The first process created in a new namespace
467(i.e., the process created using the
468.BR CLONE_NEWPID
469flag) has the PID 1, and is the "init" process for the namespace.
470Children that are orphaned within the namespace will be reparented
471to this process rather than
472.BR init (8).
473Unlike the traditional
474.B init
475process, the "init" process of a PID namespace can terminate,
476and if it does, all of the processes in the namespace are terminated.
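
The PID seen by the first process in a new PID namespace can be checked with a short sketch.
The names report_pid_fn and child_is_pid_one are hypothetical.
So that no privilege is needed, the sketch also sets CLONE_NEWUSER, which is an
assumption that relies on the Linux 3.8 behavior described under that flag;
the helper returns -1 where this combination is not permitted.
The raw getpid system call is used in order to bypass any PID caching in the C library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for clone() and the CLONE_NEW* flags */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* arbitrary size chosen for this sketch */

/* Hypothetical child function: checks whether its own PID is 1. */
static int report_pid_fn(void *arg)
{
    (void) arg;
    return syscall(SYS_getpid) == 1 ? 0 : 1;
}

/* Returns 1 if the child saw itself as PID 1, 0 if it did not,
   and -1 if the namespaces could not be created. */
static int child_is_pid_one(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t pid = clone(report_pid_fn, stack + STACK_SIZE,
                      CLONE_NEWUSER | CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* The value in pid is the child's PID in the caller's namespace. */
    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```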
477
478PID namespaces form a hierarchy.