Commit | Line | Data |
---|---|---|
a79bacf5 MK |
1 | .\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com> |
2 | .\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com> | |
3 | .\" | |
c228b4b4 | 4 | .\" %%%LICENSE_START(VERBATIM) |
a79bacf5 MK |
5 | .\" Permission is granted to make and distribute verbatim copies of this |
6 | .\" manual provided the copyright notice and this permission notice are | |
7 | .\" preserved on all copies. | |
8 | .\" | |
9 | .\" Permission is granted to copy and distribute modified versions of this | |
10 | .\" manual under the conditions for verbatim copying, provided that the | |
11 | .\" entire resulting derived work is distributed under the terms of a | |
12 | .\" permission notice identical to this one. | |
13 | .\" | |
14 | .\" Since the Linux kernel and libraries are constantly changing, this | |
15 | .\" manual page may be incorrect or out-of-date. The author(s) assume no | |
16 | .\" responsibility for errors or omissions, or for damages resulting from | |
17 | .\" the use of the information contained herein. The author(s) may not | |
18 | .\" have taken the same level of care in the production of this manual, | |
19 | .\" which is licensed free of charge, as they might when working | |
20 | .\" professionally. | |
21 | .\" | |
22 | .\" Formatted or processed versions of this manual, if unaccompanied by | |
23 | .\" the source, must acknowledge the copyright and authors of this work. | |
c228b4b4 | 24 | .\" %%%LICENSE_END |
a79bacf5 MK |
25 | .\" |
26 | .\" | |
3df541c0 | 27 | .TH PID_NAMESPACES 7 2016-07-17 "Linux" "Linux Programmer's Manual" |
a79bacf5 MK |
28 | .SH NAME |
29 | pid_namespaces \- overview of Linux PID namespaces | |
30 | .SH DESCRIPTION | |
31 | For an overview of namespaces, see | |
32 | .BR namespaces (7). | |
84030779 | 33 | |
a79bacf5 MK |
34 | PID namespaces isolate the process ID number space, |
35 | meaning that processes in different PID namespaces can have the same PID. | |
36b04745 MK |
36 | PID namespaces allow containers to provide functionality |
37 | such as suspending/resuming the set of processes in the container and | |
38 | migrating the container to a new host | |
a79bacf5 MK |
39 | while the processes inside the container maintain the same PIDs. |
40 | ||
41 | PIDs in a new PID namespace start at 1, | |
42 | somewhat like a standalone system, and calls to | |
43 | .BR fork (2), | |
44 | .BR vfork (2), | |
45 | or | |
46 | .BR clone (2) | |
47 | will produce processes with PIDs that are unique within the namespace. | |
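A minimal sketch of the numbering behavior described above, assuming the caller has the privilege (typically CAP_SYS_ADMIN) needed for CLONE_NEWPID. The function name first_pid_in_new_namespace() is hypothetical. It does the unshare(2) in a throwaway helper process so that the caller itself is left untouched, and it returns \-1 if the namespace cannot be created.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the PID that the first process created in a
   new PID namespace sees for itself, or -1 if the namespace could not be
   created (for example, because the caller lacks privilege). */
int first_pid_in_new_namespace(void)
{
    int pipefd[2];
    int result = -1;
    pid_t helper;

    if (pipe(pipefd) == -1)
        return -1;

    helper = fork();
    if (helper == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    if (helper == 0) {
        /* Helper: only its future children enter the new namespace. */
        pid_t child;

        close(pipefd[0]);
        if (unshare(CLONE_NEWPID) == -1)
            _exit(1);

        child = fork();
        if (child == -1)
            _exit(1);
        if (child == 0) {
            /* First process in the new namespace: report our own PID. */
            int self = (int) getpid();
            if (write(pipefd[1], &self, sizeof(self)) != (ssize_t) sizeof(self))
                _exit(1);
            _exit(0);
        }
        waitpid(child, NULL, 0);
        _exit(0);
    }

    /* Original caller: read the reported PID, if any was written. */
    close(pipefd[1]);
    if (read(pipefd[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -1;
    close(pipefd[0]);
    waitpid(helper, NULL, 0);
    return result;
}
```

When the namespace can be created, the value returned should be 1; an unprivileged caller should get \-1 instead.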
48 | ||
84030779 MK |
49 | Use of PID namespaces requires a kernel that is configured with the |
50 | .B CONFIG_PID_NS | |
51 | option. | |
4085d4cd MK |
52 | .\" |
53 | .\" ============================================================ | |
54 | .\" | |
84030779 | 55 | .SS The namespace "init" process |
a79bacf5 MK |
56 | The first process created in a new namespace |
57 | (i.e., the process created using | |
58 | .BR clone (2) | |
59 | with the | |
60 | .BR CLONE_NEWPID | |
61 | flag, or the first child created by a process after a call to | |
62 | .BR unshare (2) | |
63 | using the | |
64 | .BR CLONE_NEWPID | |
65 | flag) has the PID 1, and is the "init" process for the namespace (see | |
66 | .BR init (1)). | |
2a4b78e7 | 67 | A child process that is orphaned within the namespace will be reparented |
a79bacf5 | 68 | to this process rather than |
2a4b78e7 MK |
69 | .BR init (1) |
70 | (unless one of the ancestors of the child | |
1a1d8762 | 71 | in the same PID namespace employed the |
2a4b78e7 | 72 | .BR prctl (2) |
208c82ce | 73 | .B PR_SET_CHILD_SUBREAPER |
2a4b78e7 | 74 | command to mark itself as the reaper of orphaned descendant processes). |
a79bacf5 MK |
75 | |
76 | If the "init" process of a PID namespace terminates, | |
77 | the kernel terminates all of the processes in the namespace via a | |
78 | .BR SIGKILL | |
79 | signal. | |
80 | This behavior reflects the fact that the "init" process | |
81 | is essential for the correct operation of a PID namespace. | |
7a9ab601 | 82 | In this case, a subsequent |
a79bacf5 | 83 | .BR fork (2) |
81ccc853 | 84 | into this PID namespace will fail with the error |
a79bacf5 MK |
85 | .BR ENOMEM ; |
86 | it is not possible to create a new process in a PID namespace whose "init" | |
87 | process has terminated. | |
81ccc853 MK |
88 | Such scenarios can occur when, for example, |
89 | a process uses an open file descriptor for a | |
90 | .I /proc/[pid]/ns/pid | |
91 | file corresponding to a process that was in a namespace to | |
92 | .BR setns (2) | |
93 | into that namespace after the "init" process has terminated. | |
94 | Another possible scenario can occur after a call to | |
95 | .BR unshare (2): | |
96 | if the first child subsequently created by a | |
97 | .BR fork (2) | |
98 | terminates, then subsequent calls to | |
99 | .BR fork (2) | |
100 | will fail with | |
101 | .BR ENOMEM . | |
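The unshare(2) scenario above can be sketched as follows. This is an illustration under stated assumptions, not a definitive test: the function name fork_errno_after_init_exit() is hypothetical, and the sequence needs the privilege required for CLONE_NEWPID. It runs in a helper process, lets the first child (the namespace "init") exit, and then reports the error from the next fork(2).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno of a fork() attempted after the
   namespace "init" has terminated, 0 if that fork() succeeded, or -1 if
   the namespace could not be set up. */
int fork_errno_after_init_exit(void)
{
    int status;
    pid_t helper = fork();

    if (helper == -1)
        return -1;

    if (helper == 0) {
        pid_t init, second;

        if (unshare(CLONE_NEWPID) == -1)
            _exit(255);               /* 255 marks a setup failure */

        init = fork();                /* becomes PID 1 in the new namespace */
        if (init == -1)
            _exit(255);
        if (init == 0)
            _exit(0);                 /* "init" terminates immediately */
        if (waitpid(init, NULL, 0) == -1)
            _exit(255);

        second = fork();              /* namespace no longer has an "init" */
        if (second == 0)
            _exit(0);
        if (second == -1)
            _exit(errno);             /* pass the error back as exit status */
        waitpid(second, NULL, 0);
        _exit(0);
    }

    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

With sufficient privilege, the return value is expected to be ENOMEM, matching the text above.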
a79bacf5 MK |
102 | |
103 | Only signals for which the "init" process has established a signal handler | |
104 | can be sent to the "init" process by other members of the PID namespace. | |
105 | This restriction applies even to privileged processes, | |
106 | and prevents other members of the PID namespace from | |
107 | accidentally killing the "init" process. | |
108 | ||
109 | Likewise, a process in an ancestor namespace | |
110 | can\(emsubject to the usual permission checks described in | |
111 | .BR kill (2)\(emsend | |
7a9ab601 | 112 | signals to the "init" process of a child PID namespace only |
a79bacf5 MK |
113 | if the "init" process has established a handler for that signal. |
114 | (Within the handler, the | |
115 | .I siginfo_t | |
116 | .I si_pid | |
117 | field described in | |
118 | .BR sigaction (2) | |
119 | will be zero.) | |
120 | .B SIGKILL | |
121 | or | |
122 | .B SIGSTOP | |
123 | are treated exceptionally: | |
124 | these signals are forcibly delivered when sent from an ancestor PID namespace. | |
125 | Neither of these signals can be caught by the "init" process, | |
126 | and so will result in the usual actions associated with those signals | |
127 | (respectively, terminating and stopping the process). | |
78d6b55b | 128 | |
f7ee0f51 | 129 | Starting with Linux 3.4, the |
78d6b55b | 130 | .BR reboot (2) |
891121f6 | 131 | system call causes a signal to be sent to the namespace "init" process. |
78d6b55b | 132 | See |
ff853168 | 133 | .BR reboot (2) |
78d6b55b | 134 | for more details. |
4085d4cd MK |
135 | .\" |
136 | .\" ============================================================ | |
137 | .\" | |
84030779 | 138 | .SS Nesting PID namespaces |
546fb4ee MK |
139 | PID namespaces can be nested: |
140 | each PID namespace has a parent, | |
141 | except for the initial ("root") PID namespace. | |
142 | The parent of a PID namespace is the PID namespace of the process that | |
143 | created the namespace using | |
144 | .BR clone (2) | |
145 | or | |
146 | .BR unshare (2). | |
147 | PID namespaces thus form a tree, | |
148 | with all namespaces ultimately tracing their ancestry to the root namespace. | |
149 | ||
150 | A process is visible to other processes in its PID namespace, | |
151 | and to the processes in each direct ancestor PID namespace | |
152 | going back to the root PID namespace. | |
153 | In this context, "visible" means that one process | |
154 | can be the target of operations by another process using | |
155 | system calls that specify a process ID. | |
156 | Conversely, the processes in a child PID namespace can't see | |
891121f6 | 157 | processes in the parent and further removed ancestor namespaces. |
a79bacf5 | 158 | More succinctly: a process can see (e.g., send signals with |
ff853168 | 159 | .BR kill (2), |
546fb4ee MK |
160 | set nice values with |
161 | .BR setpriority (2), | |
162 | etc.) only processes contained in its own PID namespace | |
163 | and in descendants of that namespace. | |
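One way to probe visibility in the sense used above is to call kill(2) with signal 0, which performs the lookup and permission checks without sending anything. The helper name pid_is_visible() is hypothetical, and the sketch treats an EPERM failure as "visible", since that error means the PID resolved to a process in the caller's view.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if 'pid' names a process visible from the
   caller's PID namespace, 0 if no such process is visible, and -1 if 'pid'
   is not a plain process ID (kill() gives values <= 0 a special meaning). */
int pid_is_visible(pid_t pid)
{
    if (pid <= 0)
        return -1;

    /* Signal 0: existence and permission checks only, nothing is sent. */
    if (kill(pid, 0) == 0)
        return 1;

    /* EPERM means the process exists but we may not signal it. */
    return errno == ESRCH ? 0 : 1;
}
```

A process in a child PID namespace that passes the PID of a process in its parent namespace would get 0 here, because that PID either does not resolve or resolves to an unrelated process in the caller's own namespace.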
a79bacf5 | 164 | |
546fb4ee MK |
165 | A process has one process ID in each of the layers of the PID |
166 | namespace hierarchy in which it is visible, | |
167 | walking back through each direct ancestor namespace | |
a79bacf5 | 168 | to the root PID namespace. |
546fb4ee MK |
169 | System calls that operate on process IDs always |
170 | operate using the process ID that is visible in the | |
171 | PID namespace of the caller. | |
a79bacf5 MK |
172 | A call to |
173 | .BR getpid (2) | |
174 | always returns the PID associated with the namespace in which | |
546fb4ee | 175 | the process was created. |
a79bacf5 MK |
176 | |
177 | Some processes in a PID namespace may have parents | |
178 | that are outside of the namespace. | |
179 | For example, the parent of the initial process in the namespace | |
546fb4ee | 180 | (i.e., the |
a79bacf5 MK |
181 | .BR init (1) |
182 | process with PID 1) is necessarily in another namespace. | |
183 | Likewise, the direct children of a process that uses | |
184 | .BR setns (2) | |
185 | to cause its children to join a PID namespace are in a different | |
186 | PID namespace from the caller of | |
187 | .BR setns (2). | |
188 | Calls to | |
189 | .BR getppid (2) | |
190 | for such processes return 0. | |
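The getppid(2) behavior can be checked with a short sketch like the one below. It assumes the privilege needed for CLONE_NEWPID, and the function name init_sees_parent_as_zero() is hypothetical. The namespace "init" is created by a helper that stays in the outer namespace, so the helper is the parent that should be reported as 0.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the first process in a new PID
   namespace sees getppid() == 0, 0 if it sees some other value, and -1
   if the namespace could not be created. */
int init_sees_parent_as_zero(void)
{
    int status;
    pid_t helper = fork();

    if (helper == -1)
        return -1;

    if (helper == 0) {
        pid_t init;
        int init_status;

        if (unshare(CLONE_NEWPID) == -1)
            _exit(2);                 /* 2 marks a setup failure */

        init = fork();
        if (init == -1)
            _exit(2);
        if (init == 0)                /* parent is outside our namespace */
            _exit(getppid() == 0 ? 0 : 1);

        if (waitpid(init, &init_status, 0) == -1 || !WIFEXITED(init_status))
            _exit(2);
        _exit(WEXITSTATUS(init_status));
    }

    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```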
ba7d7ed9 | 191 | |
fe376752 MK |
192 | While processes may freely descend into child PID namespaces |
193 | (e.g., using | |
ba7d7ed9 MF |
194 | .BR setns (2) |
195 | with | |
196 | .BR CLONE_NEWPID ), | |
197 | they may not move in the other direction. | |
198 | That is to say, processes may not enter any ancestor namespaces | |
199 | (parent, grandparent, etc.). | |
200 | Changing PID namespaces is a one-way operation. | |
4085d4cd MK |
201 | .\" |
202 | .\" ============================================================ | |
203 | .\" | |
84030779 | 204 | .SS setns(2) and unshare(2) semantics |
a79bacf5 MK |
205 | Calls to |
206 | .BR setns (2) | |
207 | that specify a PID namespace file descriptor | |
208 | and calls to | |
209 | .BR unshare (2) | |
210 | with the | |
211 | .BR CLONE_NEWPID | |
212 | flag cause children subsequently created | |
213 | by the caller to be placed in a different PID namespace from the caller. | |
214 | These calls do not, however, | |
215 | change the PID namespace of the calling process, | |
216 | because doing so would change the caller's idea of its own PID | |
217 | (as reported by | |
218 | .BR getpid (2)), | |
219 | which would break many applications and libraries. | |
6e377abf | 220 | |
a79bacf5 MK |
221 | To put things another way: |
222 | a process's PID namespace membership is determined when the process is created | |
223 | and cannot be changed thereafter. | |
6e377abf | 224 | Among other things, this means that the parental relationship |
837ddeb9 | 225 | between processes mirrors the parental relationship between PID namespaces: |
6e377abf MK |
226 | the parent of a process is either in the same namespace |
227 | or resides in the immediate parent PID namespace. | |
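The point that unshare(2) leaves the caller's own PID alone can be illustrated with the sketch below, again assuming sufficient privilege for CLONE_NEWPID. The function name pid_unchanged_by_unshare() is hypothetical. A helper process compares its getpid(2) result before and after the call.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if getpid() reports the same value before
   and after unshare(CLONE_NEWPID), 0 if it differs, and -1 if unshare()
   failed (for example, for lack of privilege). */
int pid_unchanged_by_unshare(void)
{
    int status;
    pid_t helper = fork();

    if (helper == -1)
        return -1;

    if (helper == 0) {
        pid_t before = getpid();

        /* Affects only children created after this point. */
        if (unshare(CLONE_NEWPID) == -1)
            _exit(2);
        _exit(getpid() == before ? 0 : 1);
    }

    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```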
98029e65 EB |
228 | .SS Compatibility of CLONE_NEWPID with other CLONE_* flags |
229 | .BR CLONE_NEWPID | |
230 | can't be combined with some other | |
231 | .BR CLONE_* | |
232 | flags: | |
233 | .IP * 3 | |
234 | .B CLONE_THREAD | |
e4010a25 | 235 | requires being in the same PID namespace in order that |
98029e65 EB |
236 | the threads in a process can send signals to each other. |
237 | Similarly, it must be possible to see all of the threads | |
238 | of a process in the | |
239 | .BR proc (5) | |
ab3311aa | 240 | filesystem. |
98029e65 EB |
241 | .IP * |
242 | .BR CLONE_SIGHAND | |
243 | requires being in the same PID namespace; | |
244 | otherwise the process ID of the process sending a signal | |
245 | could not be meaningfully encoded when a signal is sent | |
246 | (see the description of the | |
247 | .I siginfo_t | |
248 | type in | |
249 | .BR sigaction (2)). | |
250 | A signal queue shared by processes in multiple PID namespaces | |
251 | would defeat that. | |
252 | .IP * | |
253 | .BR CLONE_VM | |
254 | requires all of the threads to be in the same PID namespace, | |
255 | because, from the point of view of a core dump, | |
891121f6 | 256 | if two processes share the same address space then they are threads and will |
98029e65 EB |
257 | be core dumped together. |
258 | When a core dump is written, the PID of each | |
259 | thread is written into the core dump. | |
260 | Writing the process IDs could not meaningfully succeed | |
261 | if some of the process IDs were in a parent PID namespace. | |
262 | .PP | |
263 | To summarize: there is a technical requirement for each of | |
264 | .BR CLONE_THREAD , | |
265 | .BR CLONE_SIGHAND , | |
266 | and | |
267 | .BR CLONE_VM | |
268 | to share a PID namespace. | |
269 | (Note furthermore that | |
270 | .BR clone (2) | |
271 | requires | |
272 | .BR CLONE_VM | |
273 | to be specified if | |
274 | .BR CLONE_THREAD | |
275 | or | |
276 | .BR CLONE_SIGHAND | |
277 | is specified.) | |
278 | Thus, call sequences such as the following will fail (with the error | |
47832b6d | 279 | .BR EINVAL ): |
a79bacf5 MK |
280 | |
281 | .nf | |
282 | unshare(CLONE_NEWPID); | |
283 | clone(..., CLONE_VM, ...); /* Fails */ | |
284 | ||
285 | setns(fd, CLONE_NEWPID); | |
286 | clone(..., CLONE_VM, ...); /* Fails */ | |
a79bacf5 | 287 | |
bd23efc7 MK |
288 | clone(..., CLONE_VM, ...); |
289 | setns(fd, CLONE_NEWPID); /* Fails */ | |
290 | ||
291 | clone(..., CLONE_VM, ...); | |
292 | unshare(CLONE_NEWPID); /* Fails */ | |
293 | .fi | |
4085d4cd MK |
294 | .\" |
295 | .\" ============================================================ | |
296 | .\" | |
805685dc | 297 | .SS /proc and PID namespaces |
bac61628 MK |
298 | A |
299 | .I /proc | |
ab3311aa | 300 | filesystem shows (in the |
750653a8 | 301 | .I /proc/[pid] |
bac61628 MK |
302 | directories) only processes visible in the PID namespace |
303 | of the process that performed the mount, even if the | |
304 | .I /proc | |
ab3311aa | 305 | filesystem is viewed from processes in other namespaces. |
bac61628 | 306 | |
84030779 MK |
307 | After creating a new PID namespace, |
308 | it is useful for the child to change its root directory | |
309 | and mount a new procfs instance at | |
310 | .I /proc | |
311 | so that tools such as | |
312 | .BR ps (1) | |
313 | work correctly. | |
805685dc | 314 | If a new mount namespace is simultaneously created by including |
84030779 MK |
315 | .BR CLONE_NEWNS |
316 | in the | |
7a9ab601 | 317 | .IR flags |
84030779 MK |
318 | argument of |
319 | .BR clone (2) | |
320 | or | |
cbf542aa | 321 | .BR unshare (2), |
84030779 MK |
322 | then it isn't necessary to change the root directory: |
323 | a new procfs instance can be mounted directly over | |
805685dc | 324 | .IR /proc . |
a79bacf5 | 325 | |
bac61628 MK |
326 | From a shell, the command to mount |
327 | .I /proc | |
328 | is: | |
329 | ||
330 | $ mount -t proc proc /proc | |
331 | ||
6c3db754 MK |
332 | Calling |
333 | .BR readlink (2) | |
334 | on the path | |
335 | .I /proc/self | |
336 | yields the process ID of the caller in the PID namespace of the procfs mount | |
337 | (i.e., the PID namespace of the process that mounted the procfs). | |
5597d425 MK |
338 | This can be useful for introspection purposes, |
339 | when a process wants to discover its PID in other namespaces. | |
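A small sketch of the readlink(2) technique follows. The function name pid_from_proc_self() is hypothetical, and the code assumes that a procfs instance is mounted at /proc. The value returned is the caller's PID as seen from the PID namespace of that procfs mount, which equals getpid(2) only when the mount belongs to the caller's own namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the caller's PID as recorded in the
   /proc/self symbolic link, or -1 if the link cannot be read or parsed. */
long pid_from_proc_self(void)
{
    char target[32];
    char *end;
    long pid;
    ssize_t len = readlink("/proc/self", target, sizeof(target) - 1);

    if (len <= 0)
        return -1;
    target[len] = '\0';               /* readlink() does not terminate */

    errno = 0;
    pid = strtol(target, &end, 10);
    if (errno != 0 || end == target || *end != '\0' || pid <= 0)
        return -1;
    return pid;
}
```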
805685dc MK |
340 | .\" |
341 | .\" ============================================================ | |
342 | .\" | |
343 | .SS Miscellaneous | |
7a9ab601 | 344 | When a process ID is passed over a UNIX domain socket to a |
a79bacf5 MK |
345 | process in a different PID namespace (see the description of |
346 | .B SCM_CREDENTIALS | |
347 | in | |
348 | .BR unix (7)), | |
349 | it is translated into the corresponding PID value in | |
350 | the receiving process's PID namespace. | |
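The sketch below shows where that translated PID appears on the receiving side. It is a single-process illustration, so sender and receiver share a PID namespace and no translation is observable; across namespaces the same field would carry the translated value. The function name sender_pid_via_scm_credentials() is hypothetical. The receiver enables SO_PASSCRED, so the kernel attaches the sender's credentials to the datagram.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: sends one byte over a UNIX domain socket pair and
   returns the sender PID found in the SCM_CREDENTIALS ancillary data, or
   -1 on failure. */
pid_t sender_pid_via_scm_credentials(void)
{
    int sv[2];
    int on = 1;
    char out = 'x', in;
    pid_t result = -1;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *c;
    struct ucred cred;
    union {                           /* keeps the buffer suitably aligned */
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } control;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    /* Ask the kernel to deliver credentials on the receiving socket. */
    if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) == 0 &&
        send(sv[0], &out, 1, 0) == 1) {
        iov.iov_base = &in;
        iov.iov_len = 1;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        if (recvmsg(sv[1], &msg, 0) != -1) {
            for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
                if (c->cmsg_level == SOL_SOCKET &&
                    c->cmsg_type == SCM_CREDENTIALS) {
                    memcpy(&cred, CMSG_DATA(c), sizeof(cred));
                    result = cred.pid;   /* PID in the receiver's namespace */
                }
            }
        }
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```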
a79bacf5 MK |
351 | .SH CONFORMING TO |
352 | Namespaces are a Linux-specific feature. | |
fa88d1a4 MK |
353 | .SH EXAMPLE |
354 | See | |
355 | .BR user_namespaces (7). | |
a79bacf5 | 356 | .SH SEE ALSO |
a79bacf5 MK |
357 | .BR clone (2), |
358 | .BR setns (2), | |
359 | .BR unshare (2), | |
360 | .BR proc (5), | |
a79bacf5 | 361 | .BR capabilities (7), |
b10cb05c | 362 | .BR credentials (7), |
8f29c47d | 363 | .BR namespaces (7), |
a79bacf5 MK |
364 | .BR user_namespaces (7), |
365 | .BR switch_root (8) |