Commit | Line | Data |
---|---|---|
a79bacf5 MK |
1 | .\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com> |
2 | .\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com> | |
3 | .\" | |
c228b4b4 | 4 | .\" %%%LICENSE_START(VERBATIM) |
a79bacf5 MK |
5 | .\" Permission is granted to make and distribute verbatim copies of this |
6 | .\" manual provided the copyright notice and this permission notice are | |
7 | .\" preserved on all copies. | |
8 | .\" | |
9 | .\" Permission is granted to copy and distribute modified versions of this | |
10 | .\" manual under the conditions for verbatim copying, provided that the | |
11 | .\" entire resulting derived work is distributed under the terms of a | |
12 | .\" permission notice identical to this one. | |
13 | .\" | |
14 | .\" Since the Linux kernel and libraries are constantly changing, this | |
15 | .\" manual page may be incorrect or out-of-date. The author(s) assume no | |
16 | .\" responsibility for errors or omissions, or for damages resulting from | |
17 | .\" the use of the information contained herein. The author(s) may not | |
18 | .\" have taken the same level of care in the production of this manual, | |
19 | .\" which is licensed free of charge, as they might when working | |
20 | .\" professionally. | |
21 | .\" | |
22 | .\" Formatted or processed versions of this manual, if unaccompanied by | |
23 | .\" the source, must acknowledge the copyright and authors of this work. | |
c228b4b4 | 24 | .\" %%%LICENSE_END |
a79bacf5 MK |
25 | .\" |
26 | .\" | |
9ba01802 | 27 | .TH PID_NAMESPACES 7 2019-03-06 "Linux" "Linux Programmer's Manual" |
a79bacf5 MK |
28 | .SH NAME |
29 | pid_namespaces \- overview of Linux PID namespaces | |
30 | .SH DESCRIPTION | |
31 | For an overview of namespaces, see | |
32 | .BR namespaces (7). | |
a721e8b2 | 33 | .PP |
a79bacf5 MK |
34 | PID namespaces isolate the process ID number space, |
35 | meaning that processes in different PID namespaces can have the same PID. | |
36b04745 MK |
36 | PID namespaces allow containers to provide functionality |
37 | such as suspending/resuming the set of processes in the container and | |
38 | migrating the container to a new host | |
a79bacf5 | 39 | while the processes inside the container maintain the same PIDs. |
a721e8b2 | 40 | .PP |
a79bacf5 MK |
41 | PIDs in a new PID namespace start at 1, |
42 | somewhat like a standalone system, and calls to | |
43 | .BR fork (2), | |
44 | .BR vfork (2), | |
45 | or | |
46 | .BR clone (2) | |
47 | will produce processes with PIDs that are unique within the namespace. | |
a721e8b2 | 48 | .PP |
84030779 MK |
49 | Use of PID namespaces requires a kernel that is configured with the |
50 | .B CONFIG_PID_NS | |
51 | option. | |
4085d4cd MK |
52 | .\" |
53 | .\" ============================================================ | |
54 | .\" | |
84030779 | 55 | .SS The namespace "init" process |
a79bacf5 MK |
56 | The first process created in a new namespace |
57 | (i.e., the process created using | |
58 | .BR clone (2) | |
59 | with the | |
60 | .BR CLONE_NEWPID | |
61 | flag, or the first child created by a process after a call to | |
62 | .BR unshare (2) | |
63 | using the | |
64 | .BR CLONE_NEWPID | |
65 | flag) has the PID 1, and is the "init" process for the namespace (see | |
66 | .BR init (1)). | |
4f1a13fe MK |
67 | This process becomes the parent of any child processes that are orphaned |
68 | because a process that resides in this PID namespace terminated | |
69 | (see below for further details). | |
a721e8b2 | 70 | .PP |
a79bacf5 MK |
71 | If the "init" process of a PID namespace terminates, |
72 | the kernel terminates all of the processes in the namespace via a | |
73 | .BR SIGKILL | |
74 | signal. | |
75 | This behavior reflects the fact that the "init" process | |
76 | is essential for the correct operation of a PID namespace. | |
7a9ab601 | 77 | In this case, a subsequent |
a79bacf5 | 78 | .BR fork (2) |
26cd31fd | 79 | into this PID namespace fails with the error |
a79bacf5 | 80 | .BR ENOMEM ; |
16f3fc88 | 81 | it is not possible to create a new process in a PID namespace whose "init" |
a79bacf5 | 82 | process has terminated. |
81ccc853 MK |
83 | Such scenarios can occur when, for example, |
84 | a process uses an open file descriptor for a | |
85 | .I /proc/[pid]/ns/pid | |
86 | file corresponding to a process that was in a namespace to | |
87 | .BR setns (2) | |
88 | into that namespace after the "init" process has terminated. | |
89 | Another possible scenario can occur after a call to | |
90 | .BR unshare (2): | |
91 | if the first child subsequently created by a | |
92 | .BR fork (2) | |
93 | terminates, then subsequent calls to | |
94 | .BR fork (2) | |
26cd31fd | 95 | fail with |
81ccc853 | 96 | .BR ENOMEM . |
a721e8b2 | 97 | .PP |
a79bacf5 MK |
98 | Only signals for which the "init" process has established a signal handler |
99 | can be sent to the "init" process by other members of the PID namespace. | |
100 | This restriction applies even to privileged processes, | |
101 | and prevents other members of the PID namespace from | |
102 | accidentally killing the "init" process. | |
a721e8b2 | 103 | .PP |
a79bacf5 MK |
104 | Likewise, a process in an ancestor namespace |
105 | can\(emsubject to the usual permission checks described in | |
106 | .BR kill (2)\(emsend | |
7a9ab601 | 107 | signals to the "init" process of a child PID namespace only |
a79bacf5 MK |
108 | if the "init" process has established a handler for that signal. |
109 | (Within the handler, the | |
110 | .I siginfo_t | |
111 | .I si_pid | |
112 | field described in | |
113 | .BR sigaction (2) | |
114 | will be zero.) | |
115 | .B SIGKILL | |
116 | or | |
117 | .B SIGSTOP | |
118 | are treated exceptionally: | |
119 | these signals are forcibly delivered when sent from an ancestor PID namespace. | |
120 | Neither of these signals can be caught by the "init" process, | |
121 | and so will result in the usual actions associated with those signals | |
122 | (respectively, terminating and stopping the process). | |
a721e8b2 | 123 | .PP |
f7ee0f51 | 124 | Starting with Linux 3.4, the |
78d6b55b | 125 | .BR reboot (2) |
891121f6 | 126 | system call causes a signal to be sent to the namespace "init" process. |
78d6b55b | 127 | See |
ff853168 | 128 | .BR reboot (2) |
78d6b55b | 129 | for more details. |
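The "init" role described in this section can be observed with a small C sketch. It assumes a kernel built with CONFIG_PID_NS and a caller that is permitted to create PID namespaces (typically one with CAP_SYS_ADMIN). The names child_main() and child_is_init_in_new_pidns() are hypothetical helpers introduced only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stack for the cloned child; 64 KiB is ample for this tiny function. */
static char child_stack[64 * 1024] __attribute__((aligned(16)));

/* Runs as the first process in the new PID namespace and reports,
   through its exit status, whether it sees itself as PID 1. */
static int child_main(void *arg)
{
    (void) arg;
    return getpid() == 1 ? 0 : 1;
}

/* Hypothetical helper: returns 1 if the child saw itself as PID 1,
   0 if it saw some other PID, and -1 if the namespace could not be
   created (for example, EPERM without CAP_SYS_ADMIN) or the child
   did not exit normally. */
int child_is_init_in_new_pidns(void)
{
    int status;
    /* The stack grows downward, so pass the top of the buffer. */
    pid_t pid = clone(child_main, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);

    if (pid == -1)
        return -1;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The PID returned by clone(2) to the parent is the child's PID in the parent's namespace, which is why the child has to report its own view through its exit status.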
4085d4cd MK |
130 | .\" |
131 | .\" ============================================================ | |
132 | .\" | |
84030779 | 133 | .SS Nesting PID namespaces |
546fb4ee MK |
134 | PID namespaces can be nested: |
135 | each PID namespace has a parent, | |
136 | except for the initial ("root") PID namespace. | |
137 | The parent of a PID namespace is the PID namespace of the process that | |
138 | created the namespace using | |
139 | .BR clone (2) | |
140 | or | |
141 | .BR unshare (2). | |
142 | PID namespaces thus form a tree, | |
143 | with all namespaces ultimately tracing their ancestry to the root namespace. | |
fb509133 MK |
144 | Since Linux 3.7, |
145 | .\" commit f2302505775fd13ba93f034206f1e2a587017929 | |
146 | .\" The kernel constant MAX_PID_NS_LEVEL | |
147 | the kernel limits the maximum nesting depth for PID namespaces to 32. | |
a721e8b2 | 148 | .PP |
546fb4ee MK |
149 | A process is visible to other processes in its PID namespace, |
150 | and to the processes in each direct ancestor PID namespace | |
151 | going back to the root PID namespace. | |
152 | In this context, "visible" means that one process | |
153 | can be the target of operations by another process using | |
154 | system calls that specify a process ID. | |
155 | Conversely, the processes in a child PID namespace can't see | |
891121f6 | 156 | processes in the parent and further removed ancestor namespaces. |
a79bacf5 | 157 | More succinctly: a process can see (e.g., send signals with |
ff853168 | 158 | .BR kill (2), |
546fb4ee MK |
159 | set nice values with |
160 | .BR setpriority (2), | |
161 | etc.) only processes contained in its own PID namespace | |
162 | and in descendants of that namespace. | |
a721e8b2 | 163 | .PP |
546fb4ee MK |
164 | A process has one process ID in each of the layers of the PID |
165 | namespace hierarchy in which it is visible, | |
166 | and walking back through each direct ancestor namespace | |
a79bacf5 | 167 | through to the root PID namespace. |
546fb4ee MK |
168 | System calls that operate on process IDs always |
169 | operate using the process ID that is visible in the | |
170 | PID namespace of the caller. | |
a79bacf5 MK |
171 | A call to |
172 | .BR getpid (2) | |
173 | always returns the PID associated with the namespace in which | |
546fb4ee | 174 | the process was created. |
a721e8b2 | 175 | .PP |
a79bacf5 MK |
176 | Some processes in a PID namespace may have parents |
177 | that are outside of the namespace. | |
178 | For example, the parent of the initial process in the namespace | |
546fb4ee | 179 | (i.e., the |
a79bacf5 MK |
180 | .BR init (1) |
181 | process with PID 1) is necessarily in another namespace. | |
182 | Likewise, the direct children of a process that uses | |
183 | .BR setns (2) | |
184 | to cause its children to join a PID namespace are in a different | |
185 | PID namespace from the caller of | |
186 | .BR setns (2). | |
187 | Calls to | |
188 | .BR getppid (2) | |
189 | for such processes return 0. | |
a721e8b2 | 190 | .PP |
fe376752 MK |
191 | While processes may freely descend into child PID namespaces |
192 | (e.g., using | |
ba7d7ed9 | 193 | .BR setns (2) |
7cae1f4a | 194 | with a PID namespace file descriptor), |
ba7d7ed9 MF |
195 | they may not move in the other direction. |
196 | That is to say, processes may not enter any ancestor namespaces | |
197 | (parent, grandparent, etc.). | |
6d891a81 | 198 | Changing PID namespaces is a one-way operation. |
a721e8b2 | 199 | .PP |
3889900a MK |
200 | The |
201 | .BR NS_GET_PARENT | |
202 | .BR ioctl (2) | |
203 | operation can be used to discover the parental relationship | |
204 | between PID namespaces; see | |
09860f31 | 205 | .BR ioctl_ns (2). |
4085d4cd MK |
206 | .\" |
207 | .\" ============================================================ | |
208 | .\" | |
84030779 | 209 | .SS setns(2) and unshare(2) semantics |
a79bacf5 MK |
210 | Calls to |
211 | .BR setns (2) | |
212 | that specify a PID namespace file descriptor | |
213 | and calls to | |
214 | .BR unshare (2) | |
215 | with the | |
216 | .BR CLONE_NEWPID | |
217 | flag cause children subsequently created | |
218 | by the caller to be placed in a different PID namespace from the caller. | |
df984681 MK |
219 | (Since Linux 4.12, that PID namespace is shown via the |
220 | .IR /proc/[pid]/ns/pid_for_children | |
221 | file, as described in | |
222 | .BR namespaces (7).) | |
a79bacf5 MK |
223 | These calls do not, however, |
224 | change the PID namespace of the calling process, | |
225 | because doing so would change the caller's idea of its own PID | |
226 | (as reported by | |
227 | .BR getpid (2)), | |
228 | which would break many applications and libraries. | |
a721e8b2 | 229 | .PP |
a79bacf5 MK |
230 | To put things another way: |
231 | a process's PID namespace membership is determined when the process is created | |
232 | and cannot be changed thereafter. | |
6e377abf | 233 | Among other things, this means that the parental relationship |
837ddeb9 | 234 | between processes mirrors the parental relationship between PID namespaces: |
6e377abf MK |
235 | the parent of a process is either in the same namespace |
236 | or resides in the immediate parent PID namespace. | |
e5cd406d MK |
237 | .PP |
238 | A process may call | |
239 | .BR unshare (2) | |
240 | with the | |
241 | .B CLONE_NEWPID | |
242 | flag only once. | |
df0a41df MK |
243 | After it has performed this operation, its |
244 | .IR /proc/PID/ns/pid_for_children | |
245 | symbolic link will be empty until the first child is created in the namespace. | |
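As a rough illustration of these semantics, the sketch below compares the caller's own PID namespace link with its pid_for_children link. It assumes Linux 4.12 or later and a procfs mounted at /proc. The helpers read_ns_link() and children_share_pidns() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads the target of /proc/self/ns/<name>
   into buf as a NUL-terminated string.
   Returns 0 on success and -1 on failure. */
int read_ns_link(const char *name, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    if (len == 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/self/ns/%s", name);
    n = readlink(path, buf, len - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';      /* readlink(2) does not terminate the string */
    return 0;
}

/* Returns 1 if the two links name the same PID namespace, 0 if they
   differ, and -1 if either link could not be read. */
int children_share_pidns(void)
{
    char own[64], kids[64];

    if (read_ns_link("pid", own, sizeof(own)) == -1 ||
        read_ns_link("pid_for_children", kids, sizeof(kids)) == -1)
        return -1;
    return strcmp(own, kids) == 0;
}
```

In a process that has called neither setns(2) on a PID namespace nor unshare(2) with CLONE_NEWPID, the two links are expected to match. After such a call, the helper would be expected to stop returning 1, while the caller's own "pid" link stays unchanged.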
e5cd406d | 246 | .\" |
4f1a13fe MK |
247 | .\" ============================================================ |
248 | .\" | |
249 | .SS Adoption of orphaned children | |
250 | When a child process becomes orphaned, it is reparented to the "init" | |
251 | process in the PID namespace of its parent | |
252 | (unless one of the nearer ancestors of the parent employed the | |
253 | .BR prctl (2) | |
254 | .B PR_SET_CHILD_SUBREAPER | |
255 | command to mark itself as the reaper of orphaned descendant processes). | |
256 | Note that because of the | |
257 | .BR setns (2) | |
258 | and | |
259 | .BR unshare (2) | |
260 | semantics described above, this may be the "init" process in the PID | |
261 | namespace that is the | |
262 | .I parent | |
263 | of the child's PID namespace, | |
264 | rather than the "init" process in the child's own PID namespace. | |
265 | .\" Furthermore, by definition, the parent of the "init" process | |
266 | .\" of a PID namespace resides in the parent PID namespace. | |
267 | .\" | |
268 | .\" ============================================================ | |
269 | .\" | |
98029e65 | 270 | .SS Compatibility of CLONE_NEWPID with other CLONE_* flags |
4026f8ba | 271 | In current versions of Linux, |
98029e65 | 272 | .BR CLONE_NEWPID |
e9fcae0f KF |
273 | can't be combined with |
274 | .BR CLONE_THREAD . | |
275 | Threads are required to be in the same PID namespace such that | |
98029e65 EB |
276 | the threads in a process can send signals to each other. |
277 | Similarly, it must be possible to see all of the threads | |
278 | of a process in the | |
279 | .BR proc (5) | |
4026f8ba MK |
280 | filesystem. |
281 | Additionally, if two threads were in different PID | |
e9fcae0f | 282 | namespaces, the process ID of the process sending a signal |
98029e65 EB |
283 | could not be meaningfully encoded when a signal is sent |
284 | (see the description of the | |
285 | .I siginfo_t | |
286 | type in | |
287 | .BR sigaction (2)). | |
4026f8ba | 288 | Since this is computed when a signal is enqueued, |
e9fcae0f KF |
289 | a signal queue shared by processes in multiple PID namespaces |
290 | would defeat that. | |
a721e8b2 | 291 | .PP |
e9fcae0f KF |
292 | .\" Note these restrictions were all introduced in |
293 | .\" 8382fcac1b813ad0a4e68a838fc7ae93fa39eda0 | |
294 | .\" when CLONE_NEWPID|CLONE_VM was disallowed | |
4026f8ba | 295 | In earlier versions of Linux, |
e9fcae0f | 296 | .BR CLONE_NEWPID |
4026f8ba MK |
297 | was additionally disallowed (failing with the error |
298 | .BR EINVAL ) | |
299 | in combination with | |
e9fcae0f KF |
300 | .BR CLONE_SIGHAND |
301 | .\" (restriction lifted in faf00da544045fdc1454f3b9e6d7f65c841de302) | |
4026f8ba | 302 | (before Linux 4.3) as well as |
e9fcae0f KF |
303 | .\" (restriction lifted in e79f525e99b04390ca4d2366309545a836c03bf1) |
304 | .BR CLONE_VM | |
4026f8ba MK |
305 | (before Linux 3.12). |
306 | The changes that lifted these restrictions have also been ported to | |
307 | earlier stable kernels. | |
4085d4cd MK |
308 | .\" |
309 | .\" ============================================================ | |
310 | .\" | |
805685dc | 311 | .SS /proc and PID namespaces |
bac61628 MK |
312 | A |
313 | .I /proc | |
ab3311aa | 314 | filesystem shows (in the |
750653a8 | 315 | .I /proc/[pid] |
bac61628 MK |
316 | directories) only processes visible in the PID namespace |
317 | of the process that performed the mount, even if the | |
318 | .I /proc | |
ab3311aa | 319 | filesystem is viewed from processes in other namespaces. |
a721e8b2 | 320 | .PP |
84030779 MK |
321 | After creating a new PID namespace, |
322 | it is useful for the child to change its root directory | |
323 | and mount a new procfs instance at | |
324 | .I /proc | |
325 | so that tools such as | |
326 | .BR ps (1) | |
327 | work correctly. | |
805685dc | 328 | If a new mount namespace is simultaneously created by including |
84030779 MK |
329 | .BR CLONE_NEWNS |
330 | in the | |
7a9ab601 | 331 | .IR flags |
84030779 MK |
332 | argument of |
333 | .BR clone (2) | |
334 | or | |
cbf542aa | 335 | .BR unshare (2), |
84030779 MK |
336 | then it isn't necessary to change the root directory: |
337 | a new procfs instance can be mounted directly over | |
805685dc | 338 | .IR /proc . |
a721e8b2 | 339 | .PP |
bac61628 MK |
340 | From a shell, the command to mount |
341 | .I /proc | |
342 | is: | |
019d9ee8 MK |
343 | .PP |
344 | .in +4n | |
345 | .EX | |
346 | $ mount -t proc proc /proc | |
347 | .EE | |
348 | .in | |
349 | .PP | |
6c3db754 MK |
350 | Calling |
351 | .BR readlink (2) | |
352 | on the path | |
353 | .I /proc/self | |
354 | yields the process ID of the caller in the PID namespace of the procfs mount | |
355 | (i.e., the PID namespace of the process that mounted the procfs). | |
5597d425 MK |
356 | This can be useful for introspection purposes, |
357 | when a process wants to discover its PID in other namespaces. | |
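The /proc/self lookup described above can be sketched in C as follows. This assumes a procfs is mounted at /proc. The helper name pid_seen_by_procfs() is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the caller's PID as seen from the PID
   namespace of the procfs mounted at /proc, or -1 if the link cannot
   be read or does not contain a positive decimal number. */
long pid_seen_by_procfs(void)
{
    char buf[32];
    char *end;
    long pid;
    ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);

    if (n <= 0)
        return -1;
    buf[n] = '\0';      /* readlink(2) does not terminate the string */

    pid = strtol(buf, &end, 10);
    if (end == buf || *end != '\0' || pid <= 0)
        return -1;
    return pid;
}
```

Comparing the result with getpid(2) gives a hint about whether the procfs instance belongs to the caller's own PID namespace: the two values can differ when /proc was mounted by a process in another PID namespace.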
805685dc MK |
358 | .\" |
359 | .\" ============================================================ | |
360 | .\" | |
10bd7553 MK |
361 | .SS /proc files |
362 | .TP | |
363 | .BR /proc/sys/kernel/ns_last_pid " (since Linux 3.3)" | |
364 | .\" commit b8f566b04d3cddd192cfd2418ae6d54ac6353792 | |
365 | This file displays the last PID that was allocated in this PID namespace. | |
366 | When the next PID is allocated, | |
367 | the kernel will search for the lowest unallocated PID | |
368 | that is greater than this value, | |
369 | and when this file is subsequently read it will show that PID. | |
370 | .IP | |
371 | This file is writable by a process that has the | |
372 | .B CAP_SYS_ADMIN | |
373 | capability inside its user namespace. | |
47d03138 | 374 | .\" This ability is necessary to support checkpoint restore in user-space |
10bd7553 | 375 | This makes it possible to determine the PID that is allocated |
47d03138 | 376 | to the next process that is created inside this PID namespace. |
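Reading the file needs no special privilege on typical systems, so a minimal sketch is possible. It assumes Linux 3.3 or later with procfs mounted at /proc. The helper name read_ns_last_pid() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the value currently shown in
   /proc/sys/kernel/ns_last_pid, or -1 if the file cannot be
   opened or parsed. */
long read_ns_last_pid(void)
{
    long value = -1;
    FILE *fp = fopen("/proc/sys/kernel/ns_last_pid", "r");

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}
```

Writing the file to influence the next PID requires CAP_SYS_ADMIN, as stated above. Any other process creation in the same PID namespace between the write and the subsequent fork(2) can consume the chosen PID, so a caller that needs a specific PID has to check the result.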
10bd7553 MK |
377 | .\" |
378 | .\" ============================================================ | |
379 | .\" | |
805685dc | 380 | .SS Miscellaneous |
7a9ab601 | 381 | When a process ID is passed over a UNIX domain socket to a |
a79bacf5 MK |
382 | process in a different PID namespace (see the description of |
383 | .B SCM_CREDENTIALS | |
384 | in | |
385 | .BR unix (7)), | |
386 | it is translated into the corresponding PID value in | |
387 | the receiving process's PID namespace. | |
a79bacf5 MK |
388 | .SH CONFORMING TO |
389 | Namespaces are a Linux-specific feature. | |
fa88d1a4 MK |
390 | .SH EXAMPLE |
391 | See | |
392 | .BR user_namespaces (7). | |
a79bacf5 | 393 | .SH SEE ALSO |
a79bacf5 | 394 | .BR clone (2), |
d64c7be5 | 395 | .BR reboot (2), |
a79bacf5 MK |
396 | .BR setns (2), |
397 | .BR unshare (2), | |
398 | .BR proc (5), | |
a79bacf5 | 399 | .BR capabilities (7), |
b10cb05c | 400 | .BR credentials (7), |
4bf43ba5 | 401 | .BR mount_namespaces (7), |
8f29c47d | 402 | .BR namespaces (7), |
a79bacf5 MK |
403 | .BR user_namespaces (7), |
404 | .BR switch_root (8) |