Commit | Line | Data |
---|---|---|
a79bacf5 MK |
1 | .\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com> |
2 | .\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com> | |
3 | .\" | |
5fbde956 | 4 | .\" SPDX-License-Identifier: Linux-man-pages-copyleft |
a79bacf5 MK |
5 | .\" |
6 | .\" | |
4c1c5274 | 7 | .TH pid_namespaces 7 (date) "Linux man-pages (unreleased)" |
a79bacf5 MK |
8 | .SH NAME |
9 | pid_namespaces \- overview of Linux PID namespaces | |
10 | .SH DESCRIPTION | |
11 | For an overview of namespaces, see | |
12 | .BR namespaces (7). | |
c6d039a3 | 13 | .P |
a79bacf5 MK |
14 | PID namespaces isolate the process ID number space, |
15 | meaning that processes in different PID namespaces can have the same PID. | |
36b04745 MK |
16 | PID namespaces allow containers to provide functionality |
17 | such as suspending/resuming the set of processes in the container and | |
18 | migrating the container to a new host | |
a79bacf5 | 19 | while the processes inside the container maintain the same PIDs. |
c6d039a3 | 20 | .P |
a79bacf5 MK |
21 | PIDs in a new PID namespace start at 1, |
22 | somewhat like a standalone system, and calls to | |
23 | .BR fork (2), | |
24 | .BR vfork (2), | |
25 | or | |
26 | .BR clone (2) | |
27 | will produce processes with PIDs that are unique within the namespace. | |
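The PID numbering described above can be observed with a short sketch. It assumes a Linux system where the caller is allowed to create user and PID namespaces; the function name and the use of ctypes to reach unshare(2) are choices made for this illustration, not part of the page. The unshare happens in a throwaway helper process so that the caller itself is left unchanged.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000


def first_pid_in_new_namespace():
    """Return the PID seen by the first process in a new PID namespace,
    or None if the namespace could not be created (e.g., EPERM)."""
    libc = ctypes.CDLL(None, use_errno=True)
    rfd, wfd = os.pipe()
    helper = os.fork()
    if helper == 0:
        # Helper: unshare here, not in the caller.
        os.close(rfd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0:
                child = os.fork()
                if child == 0:
                    # First child after unshare(): the namespace "init".
                    os.write(wfd, str(os.getpid()).encode())
                    os._exit(0)
                os.waitpid(child, 0)
        finally:
            os._exit(0)
    os.close(wfd)
    with os.fdopen(rfd) as pipe:
        data = pipe.read()
    os.waitpid(helper, 0)
    return int(data) if data else None
```

Where namespace creation is permitted, the child reports PID 1; where it is not, the function returns None instead of raising.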
c6d039a3 | 28 | .P |
84030779 MK |
29 | Use of PID namespaces requires a kernel that is configured with the |
30 | .B CONFIG_PID_NS | |
31 | option. | |
4085d4cd MK |
32 | .\" |
33 | .\" ============================================================ | |
34 | .\" | |
84030779 | 35 | .SS The namespace "init" process |
a79bacf5 MK |
36 | The first process created in a new namespace |
37 | (i.e., the process created using | |
38 | .BR clone (2) | |
39 | with the | |
1ae6b2c7 | 40 | .B CLONE_NEWPID |
a79bacf5 MK |
41 | flag, or the first child created by a process after a call to |
42 | .BR unshare (2) | |
43 | using the | |
1ae6b2c7 | 44 | .B CLONE_NEWPID |
a79bacf5 MK |
45 | flag) has the PID 1, and is the "init" process for the namespace (see |
46 | .BR init (1)). | |
4f1a13fe MK |
47 | This process becomes the parent of any child processes that are orphaned |
48 | because a process that resides in this PID namespace terminated | |
49 | (see below for further details). | |
c6d039a3 | 50 | .P |
a79bacf5 MK |
51 | If the "init" process of a PID namespace terminates, |
52 | the kernel terminates all of the processes in the namespace via a | |
1ae6b2c7 | 53 | .B SIGKILL |
a79bacf5 MK |
54 | signal. |
55 | This behavior reflects the fact that the "init" process | |
56 | is essential for the correct operation of a PID namespace. | |
7a9ab601 | 57 | In this case, a subsequent |
a79bacf5 | 58 | .BR fork (2) |
26cd31fd | 59 | into this PID namespace fails with the error |
a79bacf5 | 60 | .BR ENOMEM ; |
16f3fc88 | 61 | it is not possible to create a new process in a PID namespace whose "init" |
a79bacf5 | 62 | process has terminated. |
81ccc853 MK |
63 | Such scenarios can occur when, for example, |
64 | a process uses an open file descriptor for a | |
1ae6b2c7 | 65 | .IR /proc/ pid /ns/pid |
81ccc853 MK |
66 | file corresponding to a process that was in a namespace to |
67 | .BR setns (2) | |
68 | into that namespace after the "init" process has terminated. | |
69 | Another possible scenario can occur after a call to | |
70 | .BR unshare (2): | |
71 | if the first child subsequently created by a | |
72 | .BR fork (2) | |
73 | terminates, then subsequent calls to | |
74 | .BR fork (2) | |
26cd31fd | 75 | fail with |
81ccc853 | 76 | .BR ENOMEM . |
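The unshare(2) scenario can be reproduced with the following sketch, under the same assumptions as any unprivileged namespace experiment: the kernel must permit user and PID namespace creation. The function name is hypothetical. A helper process unshares, lets its first child (the namespace "init") exit, and then tries to fork again.

```python
import ctypes
import errno
import os

# Flag values from <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000


def fork_error_after_init_exit():
    """Return the errno name of a fork() attempted after the namespace
    "init" has exited, "no error" if the fork succeeded, or None if
    the namespace could not be created."""
    libc = ctypes.CDLL(None, use_errno=True)
    rfd, wfd = os.pipe()
    helper = os.fork()
    if helper == 0:
        os.close(rfd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0:
                init = os.fork()
                if init == 0:
                    os._exit(0)  # the namespace "init" exits at once
                os.waitpid(init, 0)
                try:
                    pid = os.fork()  # second fork into the dead namespace
                    if pid == 0:
                        os._exit(0)
                    os.waitpid(pid, 0)
                    result = "no error"
                except OSError as exc:
                    result = errno.errorcode.get(exc.errno, str(exc.errno))
                os.write(wfd, result.encode())
        finally:
            os._exit(0)
    os.close(wfd)
    with os.fdopen(rfd) as pipe:
        data = pipe.read()
    os.waitpid(helper, 0)
    return data or None
```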
c6d039a3 | 77 | .P |
a79bacf5 MK |
78 | Only signals for which the "init" process has established a signal handler |
79 | can be sent to the "init" process by other members of the PID namespace. | |
80 | This restriction applies even to privileged processes, | |
81 | and prevents other members of the PID namespace from | |
82 | accidentally killing the "init" process. | |
c6d039a3 | 83 | .P |
a79bacf5 | 84 | Likewise, a process in an ancestor namespace |
36546c38 AC |
85 | can\[em]subject to the usual permission checks described in |
86 | .BR kill (2)\[em]send | |
7a9ab601 | 87 | signals to the "init" process of a child PID namespace only |
a79bacf5 MK |
88 | if the "init" process has established a handler for that signal. |
89 | (Within the handler, the | |
90 | .I siginfo_t | |
91 | .I si_pid | |
92 | field described in | |
93 | .BR sigaction (2) | |
94 | will be zero.) | |
95 | .B SIGKILL | |
96 | and | |
97 | .B SIGSTOP | |
98 | are treated exceptionally: | |
99 | these signals are forcibly delivered when sent from an ancestor PID namespace. | |
100 | Neither of these signals can be caught by the "init" process, | |
101 | and so will result in the usual actions associated with those signals | |
102 | (respectively, terminating and stopping the process). | |
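The protection of the "init" process against signals from inside its own namespace can be checked with a sketch like the one below. It assumes that unprivileged user and PID namespaces are available; the function name is made up for this example. A child inside the namespace sends SIGTERM to PID 1, which has no handler installed, and the helper outside reports how "init" ended.

```python
import ctypes
import os
import signal

# Flag values from <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000


def sigterm_from_inside_namespace():
    """Return "survived" or "killed" for a namespace "init" that was
    sent SIGTERM by its own child, or None if no namespace was created."""
    libc = ctypes.CDLL(None, use_errno=True)
    rfd, wfd = os.pipe()
    helper = os.fork()
    if helper == 0:
        os.close(rfd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0:
                init = os.fork()
                if init == 0:
                    try:
                        # PID 1 in the new namespace, no SIGTERM handler.
                        signal.signal(signal.SIGTERM, signal.SIG_DFL)
                        child = os.fork()
                        if child == 0:
                            try:
                                os.kill(1, signal.SIGTERM)
                            finally:
                                os._exit(0)
                        os.waitpid(child, 0)
                    finally:
                        os._exit(0)
                _, status = os.waitpid(init, 0)
                verdict = "killed" if os.WIFSIGNALED(status) else "survived"
                os.write(wfd, verdict.encode())
        finally:
            os._exit(0)
    os.close(wfd)
    with os.fdopen(rfd) as pipe:
        data = pipe.read()
    os.waitpid(helper, 0)
    return data or None
```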
c6d039a3 | 103 | .P |
f7ee0f51 | 104 | Starting with Linux 3.4, the |
78d6b55b | 105 | .BR reboot (2) |
891121f6 | 106 | system call causes a signal to be sent to the namespace "init" process. |
78d6b55b | 107 | See |
ff853168 | 108 | .BR reboot (2) |
78d6b55b | 109 | for more details. |
4085d4cd MK |
110 | .\" |
111 | .\" ============================================================ | |
112 | .\" | |
84030779 | 113 | .SS Nesting PID namespaces |
546fb4ee MK |
114 | PID namespaces can be nested: |
115 | each PID namespace has a parent, | |
116 | except for the initial ("root") PID namespace. | |
117 | The parent of a PID namespace is the PID namespace of the process that | |
118 | created the namespace using | |
119 | .BR clone (2) | |
120 | or | |
121 | .BR unshare (2). | |
122 | PID namespaces thus form a tree, | |
123 | with all namespaces ultimately tracing their ancestry to the root namespace. | |
fb509133 MK |
124 | Since Linux 3.7, |
125 | .\" commit f2302505775fd13ba93f034206f1e2a587017929 | |
126 | .\" The kernel constant MAX_PID_NS_LEVEL | |
127 | the kernel limits the maximum nesting depth for PID namespaces to 32. | |
c6d039a3 | 128 | .P |
546fb4ee MK |
129 | A process is visible to other processes in its PID namespace, |
130 | and to the processes in each direct ancestor PID namespace | |
131 | going back to the root PID namespace. | |
132 | In this context, "visible" means that one process | |
133 | can be the target of operations by another process using | |
134 | system calls that specify a process ID. | |
135 | Conversely, the processes in a child PID namespace can't see | |
891121f6 | 136 | processes in the parent and further removed ancestor namespaces. |
a79bacf5 | 137 | More succinctly: a process can see (e.g., send signals with |
ff853168 | 138 | .BR kill (2), |
546fb4ee MK |
139 | set nice values with |
140 | .BR setpriority (2), | |
141 | etc.) only processes contained in its own PID namespace | |
142 | and in descendants of that namespace. | |
c6d039a3 | 143 | .P |
546fb4ee MK |
144 | A process has one process ID in each of the layers of the PID |
145 | namespace hierarchy in which it is visible, | |
146 | walking back through each direct ancestor namespace | |
a79bacf5 | 147 | through to the root PID namespace. |
546fb4ee MK |
148 | System calls that operate on process IDs always |
149 | operate using the process ID that is visible in the | |
150 | PID namespace of the caller. | |
a79bacf5 MK |
151 | A call to |
152 | .BR getpid (2) | |
153 | always returns the PID associated with the namespace in which | |
546fb4ee | 154 | the process was created. |
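One way to inspect the per-layer process IDs is the NSpid field of /proc/pid/status, which is documented in proc(5) rather than on this page and is available since Linux 4.1. The sketch below parses that field; the function names are illustrative only.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line of a /proc/pid/status dump.

    The leftmost value is the PID in the namespace of the procfs mount;
    the following values belong to successively nested namespaces.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field absent, e.g., on kernels older than 4.1


def own_pids(path="/proc/self/status"):
    """Return the caller's PID in each visible layer, or [] on error."""
    try:
        with open(path) as status:
            return parse_nspid(status.read())
    except OSError:
        return []
```

For a process nested two namespaces below the procfs mount, the list has three entries, and the last entry is the value that getpid(2) returns in that process.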
c6d039a3 | 155 | .P |
a79bacf5 MK |
156 | Some processes in a PID namespace may have parents |
157 | that are outside of the namespace. | |
158 | For example, the parent of the initial process in the namespace | |
546fb4ee | 159 | (i.e., the |
a79bacf5 MK |
160 | .BR init (1) |
161 | process with PID 1) is necessarily in another namespace. | |
162 | Likewise, the direct children of a process that uses | |
163 | .BR setns (2) | |
164 | to cause its children to join a PID namespace are in a different | |
165 | PID namespace from the caller of | |
166 | .BR setns (2). | |
167 | Calls to | |
168 | .BR getppid (2) | |
169 | for such processes return 0. | |
c6d039a3 | 170 | .P |
fe376752 MK |
171 | While processes may freely descend into child PID namespaces |
172 | (e.g., using | |
ba7d7ed9 | 173 | .BR setns (2) |
7cae1f4a | 174 | with a PID namespace file descriptor), |
ba7d7ed9 MF |
175 | they may not move in the other direction. |
176 | That is to say, processes may not enter any ancestor namespaces | |
177 | (parent, grandparent, etc.). | |
6d891a81 | 178 | Changing PID namespaces is a one-way operation. |
c6d039a3 | 179 | .P |
3889900a | 180 | The |
1ae6b2c7 | 181 | .B NS_GET_PARENT |
3889900a MK |
182 | .BR ioctl (2) |
183 | operation can be used to discover the parental relationship | |
184 | between PID namespaces; see | |
09860f31 | 185 | .BR ioctl_ns (2). |
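A minimal sketch of such a query follows. The numeric value of NS_GET_PARENT is taken from <linux/nsfs.h>; the function name is an assumption of this example. When the parent namespace is outside the caller's scope (as is the case for the initial namespace), the ioctl fails and the function returns None.

```python
import fcntl
import os

NS_GET_PARENT = 0xB702  # _IO(0xb7, 0x2) in <linux/nsfs.h>


def parent_pid_namespace_inode(path="/proc/self/ns/pid"):
    """Return the inode number identifying the parent PID namespace,
    or None if it cannot be obtained."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None
    try:
        # On success the ioctl returns a new file descriptor
        # that refers to the parent namespace.
        parent_fd = fcntl.ioctl(fd, NS_GET_PARENT)
    except OSError:
        return None  # e.g., EPERM when the parent is out of scope
    finally:
        os.close(fd)
    try:
        return os.fstat(parent_fd).st_ino
    finally:
        os.close(parent_fd)
```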
4085d4cd MK |
186 | .\" |
187 | .\" ============================================================ | |
188 | .\" | |
84030779 | 189 | .SS setns(2) and unshare(2) semantics |
a79bacf5 MK |
190 | Calls to |
191 | .BR setns (2) | |
192 | that specify a PID namespace file descriptor | |
193 | and calls to | |
194 | .BR unshare (2) | |
195 | with the | |
1ae6b2c7 | 196 | .B CLONE_NEWPID |
a79bacf5 MK |
197 | flag cause children subsequently created |
198 | by the caller to be placed in a different PID namespace from the caller. | |
df984681 | 199 | (Since Linux 4.12, that PID namespace is shown via the |
1ae6b2c7 | 200 | .IR /proc/ pid /ns/pid_for_children |
df984681 MK |
201 | file, as described in |
202 | .BR namespaces (7).) | |
a79bacf5 MK |
203 | These calls do not, however, |
204 | change the PID namespace of the calling process, | |
205 | because doing so would change the caller's idea of its own PID | |
206 | (as reported by | |
207 | .BR getpid ()), | |
208 | which would break many applications and libraries. | |
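The distinction between a process's own PID namespace and the namespace used for its future children can be seen by reading both symbolic links. The helper below is a sketch with a hypothetical name; it returns None for a link that is missing (for example, pid_for_children on kernels older than 4.12) or unreadable.

```python
import os


def pid_namespace_links(pid="self"):
    """Return the targets of /proc/<pid>/ns/pid and
    /proc/<pid>/ns/pid_for_children as a 2-tuple of strings or None."""
    def read(name):
        try:
            return os.readlink(f"/proc/{pid}/ns/{name}")
        except OSError:
            return None

    return read("pid"), read("pid_for_children")
```

For a process that has never called setns(2) or unshare(2) on a PID namespace, both links name the same namespace; after such a call only the second one differs.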
c6d039a3 | 209 | .P |
a79bacf5 MK |
210 | To put things another way: |
211 | a process's PID namespace membership is determined when the process is created | |
212 | and cannot be changed thereafter. | |
6e377abf | 213 | Among other things, this means that the parental relationship |
837ddeb9 | 214 | between processes mirrors the parental relationship between PID namespaces: |
6e377abf MK |
215 | the parent of a process is either in the same namespace |
216 | or resides in the immediate parent PID namespace. | |
c6d039a3 | 217 | .P |
e5cd406d MK |
218 | A process may call |
219 | .BR unshare (2) | |
220 | with the | |
221 | .B CLONE_NEWPID | |
222 | flag only once. | |
df0a41df | 223 | After it has performed this operation, its |
1ae6b2c7 | 224 | .IR /proc/ pid /ns/pid_for_children |
df0a41df | 225 | symbolic link will be empty until the first child is created in the namespace. |
e5cd406d | 226 | .\" |
4f1a13fe MK |
227 | .\" ============================================================ |
228 | .\" | |
229 | .SS Adoption of orphaned children | |
230 | When a child process becomes orphaned, it is reparented to the "init" | |
231 | process in the PID namespace of its parent | |
232 | (unless one of the nearer ancestors of the parent employed the | |
233 | .BR prctl (2) | |
234 | .B PR_SET_CHILD_SUBREAPER | |
235 | command to mark itself as the reaper of orphaned descendant processes). | |
236 | Note that because of the | |
237 | .BR setns (2) | |
238 | and | |
239 | .BR unshare (2) | |
240 | semantics described above, this may be the "init" process in the PID | |
241 | namespace that is the | |
242 | .I parent | |
243 | of the child's PID namespace, | |
244 | rather than the "init" process in the child's own PID namespace. | |
243d656f | 245 | .\" Furthermore, by definition, the parent of the "init" process |
4f1a13fe MK |
246 | .\" of a PID namespace resides in the parent PID namespace. |
247 | .\" | |
248 | .\" ============================================================ | |
249 | .\" | |
98029e65 | 250 | .SS Compatibility of CLONE_NEWPID with other CLONE_* flags |
4026f8ba | 251 | In current versions of Linux, |
1ae6b2c7 | 252 | .B CLONE_NEWPID |
e9fcae0f KF |
253 | can't be combined with |
254 | .BR CLONE_THREAD . | |
255 | Threads are required to be in the same PID namespace so that |
98029e65 EB |
256 | the threads in a process can send signals to each other. |
257 | Similarly, it must be possible to see all of the threads | |
067c60a7 | 258 | of a process in the |
98029e65 | 259 | .BR proc (5) |
4026f8ba MK |
260 | filesystem. |
261 | Additionally, if two threads were in different PID | |
e9fcae0f | 262 | namespaces, the process ID of the process sending a signal |
98029e65 EB |
263 | could not be meaningfully encoded when a signal is sent |
264 | (see the description of the | |
265 | .I siginfo_t | |
266 | type in | |
267 | .BR sigaction (2)). | |
4026f8ba | 268 | Since that process ID is computed when a signal is enqueued, |
e9fcae0f KF |
269 | a signal queue shared by processes in multiple PID namespaces |
270 | would defeat that. | |
c6d039a3 | 271 | .P |
e9fcae0f KF |
272 | .\" Note these restrictions were all introduced in |
273 | .\" 8382fcac1b813ad0a4e68a838fc7ae93fa39eda0 | |
274 | .\" when CLONE_NEWPID|CLONE_VM was disallowed | |
4026f8ba | 275 | In earlier versions of Linux, |
1ae6b2c7 | 276 | .B CLONE_NEWPID |
4026f8ba MK |
277 | was additionally disallowed (failing with the error |
278 | .BR EINVAL ) | |
279 | in combination with | |
1ae6b2c7 | 280 | .B CLONE_SIGHAND |
e9fcae0f | 281 | .\" (restriction lifted in faf00da544045fdc1454f3b9e6d7f65c841de302) |
4026f8ba | 282 | (before Linux 4.3) as well as |
e9fcae0f | 283 | .\" (restriction lifted in e79f525e99b04390ca4d2366309545a836c03bf1) |
1ae6b2c7 | 284 | .B CLONE_VM |
4026f8ba MK |
285 | (before Linux 3.12). |
286 | The changes that lifted these restrictions have also been ported to | |
287 | earlier stable kernels. | |
4085d4cd MK |
288 | .\" |
289 | .\" ============================================================ | |
290 | .\" | |
805685dc | 291 | .SS /proc and PID namespaces |
bac61628 MK |
292 | A |
293 | .I /proc | |
ab3311aa | 294 | filesystem shows (in the |
1ae6b2c7 | 295 | .IR /proc/ pid |
bac61628 MK |
296 | directories) only processes visible in the PID namespace |
297 | of the process that performed the mount, even if the | |
298 | .I /proc | |
ab3311aa | 299 | filesystem is viewed from processes in other namespaces. |
c6d039a3 | 300 | .P |
84030779 MK |
301 | After creating a new PID namespace, |
302 | it is useful for the child to change its root directory | |
303 | and mount a new procfs instance at | |
304 | .I /proc | |
305 | so that tools such as | |
306 | .BR ps (1) | |
307 | work correctly. | |
805685dc | 308 | If a new mount namespace is simultaneously created by including |
1ae6b2c7 | 309 | .B CLONE_NEWNS |
84030779 | 310 | in the |
1ae6b2c7 | 311 | .I flags |
84030779 MK |
312 | argument of |
313 | .BR clone (2) | |
314 | or | |
cbf542aa | 315 | .BR unshare (2), |
84030779 MK |
316 | then it isn't necessary to change the root directory: |
317 | a new procfs instance can be mounted directly over | |
805685dc | 318 | .IR /proc . |
c6d039a3 | 319 | .P |
bac61628 MK |
320 | From a shell, the command to mount |
321 | .I /proc | |
322 | is: | |
c6d039a3 | 323 | .P |
019d9ee8 MK |
324 | .in +4n |
325 | .EX | |
fb6d2c09 | 326 | $ mount \-t proc proc /proc |
019d9ee8 MK |
327 | .EE |
328 | .in | |
c6d039a3 | 329 | .P |
6c3db754 MK |
330 | Calling |
331 | .BR readlink (2) | |
332 | on the path | |
333 | .I /proc/self | |
334 | yields the process ID of the caller in the PID namespace of the procfs mount | |
335 | (i.e., the PID namespace of the process that mounted the procfs). | |
5597d425 MK |
336 | This can be useful for introspection purposes, |
337 | when a process wants to discover its PID in other namespaces. | |
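As a small illustration, the lookup can be wrapped in a function that takes the procfs mount point as a parameter, so that it can be pointed at a procfs instance mounted from another PID namespace. The function name and the parameter are assumptions of this sketch.

```python
import os


def pid_as_seen_by_procfs(mount_point="/proc"):
    """Return the caller's PID in the PID namespace of the procfs
    mounted at mount_point, or None if it cannot be determined
    (for example, when the caller is not visible in that namespace)."""
    try:
        return int(os.readlink(os.path.join(mount_point, "self")))
    except (OSError, ValueError):
        return None
```

When the procfs instance was mounted from the caller's own PID namespace, the result equals the value of getpid(2).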
805685dc MK |
338 | .\" |
339 | .\" ============================================================ | |
340 | .\" | |
10bd7553 MK |
341 | .SS /proc files |
342 | .TP | |
343 | .BR /proc/sys/kernel/ns_last_pid " (since Linux 3.3)" | |
344 | .\" commit b8f566b04d3cddd192cfd2418ae6d54ac6353792 | |
3f298932 MK |
345 | This file |
346 | (which is virtualized per PID namespace) | |
347 | displays the last PID that was allocated in this PID namespace. | |
10bd7553 MK |
348 | When the next PID is allocated, |
349 | the kernel will search for the lowest unallocated PID | |
350 | that is greater than this value, | |
351 | and when this file is subsequently read it will show that PID. | |
352 | .IP | |
353 | This file is writable by a process that has the | |
354 | .B CAP_SYS_ADMIN | |
1e516a82 MK |
355 | or (since Linux 5.9) |
356 | .B CAP_CHECKPOINT_RESTORE | |
439526d1 | 357 | capability inside the user namespace that owns the PID namespace. |
47d03138 | 358 | .\" This ability is necessary to support checkpoint restore in user-space |
10bd7553 | 359 | This makes it possible to control the PID that is allocated |
47d03138 | 360 | to the next process that is created inside this PID namespace. |
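A hedged sketch of reading and writing this file is shown below. The function names are invented for the example, and the write only succeeds for a caller with the capability described above. Nothing prevents another process from allocating a PID between the write and the subsequent fork(2), so real users of this interface need to serialize PID allocation themselves.

```python
NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def last_allocated_pid(path=NS_LAST_PID):
    """Return the last PID allocated in the caller's PID namespace,
    or None if the file is unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def request_next_pid(target, path=NS_LAST_PID):
    """Ask the kernel to try `target` for the next allocated PID by
    writing target - 1. Return True if the write succeeded."""
    try:
        with open(path, "w") as f:
            f.write(str(target - 1))
        return True
    except OSError:
        return False  # e.g., missing capability or read-only procfs
```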
10bd7553 MK |
361 | .\" |
362 | .\" ============================================================ | |
363 | .\" | |
805685dc | 364 | .SS Miscellaneous |
7a9ab601 | 365 | When a process ID is passed over a UNIX domain socket to a |
a79bacf5 MK |
366 | process in a different PID namespace (see the description of |
367 | .B SCM_CREDENTIALS | |
368 | in | |
369 | .BR unix (7)), | |
370 | it is translated into the corresponding PID value in | |
371 | the receiving process's PID namespace. | |
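The credential passing itself can be sketched as follows, assuming Linux and the SO_PASSCRED socket option described in unix(7). Both ends of the socket pair live in the same process here, so no translation is visible and the received PID simply equals the caller's own; across PID namespaces the receiver would see the translated value. The function name is illustrative.

```python
import os
import socket
import struct

UCRED_FORMAT = "iII"  # struct ucred: pid_t pid; uid_t uid; gid_t gid


def sender_pid_via_scm_credentials():
    """Send one datagram over a UNIX socket pair and return the sender
    PID that the kernel reports to the receiver, or None if absent."""
    receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with receiver, sender:
        # Ask the kernel to attach the sender's credentials on receipt.
        receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        sender.send(b"x")
        size = struct.calcsize(UCRED_FORMAT)
        _, ancdata, _, _ = receiver.recvmsg(1, socket.CMSG_SPACE(size))
        for level, kind, data in ancdata:
            if level == socket.SOL_SOCKET and kind == socket.SCM_CREDENTIALS:
                pid, _uid, _gid = struct.unpack(UCRED_FORMAT, data[:size])
                return pid
    return None
```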
3113c7f3 | 372 | .SH STANDARDS |
4131356c | 373 | Linux. |
a14af333 | 374 | .SH EXAMPLES |
fa88d1a4 MK |
375 | See |
376 | .BR user_namespaces (7). | |
a79bacf5 | 377 | .SH SEE ALSO |
a79bacf5 | 378 | .BR clone (2), |
d64c7be5 | 379 | .BR reboot (2), |
a79bacf5 MK |
380 | .BR setns (2), |
381 | .BR unshare (2), | |
382 | .BR proc (5), | |
a79bacf5 | 383 | .BR capabilities (7), |
b10cb05c | 384 | .BR credentials (7), |
4bf43ba5 | 385 | .BR mount_namespaces (7), |
8f29c47d | 386 | .BR namespaces (7), |
a79bacf5 MK |
387 | .BR user_namespaces (7), |
388 | .BR switch_root (8) |