clone.2: Use pid_t for clone3() {child,parent}_tid
Advertise to userspace that they should use proper pid_t types
for arguments returning a pid.
The kernel-internal struct kernel_clone_args currently uses int
as type and since POSIX mandates that pid_t is a signed integer
type and glibc and friends use int this is not an issue. After
the merge window for v5.5 closes we can switch struct
kernel_clone_args over to using pid_t as well without any danger
in regressing current userspace.
Also note, that the new set tid feature which will be merged for
v5.5 uses pid_t types as well.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 14 Nov 2019 11:19:21 +0000 (12:19 +0100)]
clone.2: Allocate child's stack using mmap(2) rather than malloc(3)
Christian Brauner suggested mmap(MAP_STACKED), rather than
malloc(), as the canonical way of allocating a stack for the
child of clone(), and Jann Horn noted some reasons why:
Not on Linux, but on OpenBSD, they do use MAP_STACK now
AFAIK; this was announced here:
<http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
Basically they periodically check whether the userspace
stack pointer points into a MAP_STACK region, and if not,
they kill the process. So even if it's a no-op on Linux, it
might make sense to advise people to use the flag to improve
portability? I'm not sure if that's something that belongs
in Linux manpages.
Another reason against malloc() is that when setting up
thread stacks in proper, reliable software, you'll probably
want to place a guard page (in other words, a 4K PROT_NONE
VMA) at the bottom of the stack to reliably catch stack
overflows; and you probably don't want to do that with
malloc, in particular with non-page-aligned allocations.
And the OpenBSD 6.5 manual pages says:
MAP_STACK
Indicate that the mapping is used as a stack. This
flag must be used in combination with MAP_ANON and
MAP_PRIVATE.
And I then noticed that MAP_STACK seems already to be on
FreeBSD for a long time:
MAP_STACK
Map the area as a stack. MAP_ANON is implied.
Offset should be 0, fd must be -1, and prot should
include at least PROT_READ and PROT_WRITE. This
option creates a memory region that grows to at
most len bytes in size, starting from the stack
top and growing down. The stack top is the startā
ing address returned by the call, plus len bytes.
The bottom of the stack at maximum growth is the
starting address returned by the call.
The entire area is reserved from the point of view
of other mmap() calls, even if not faulted in yet.
Reported-by: Jann Horn <jannh@google.com> Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sat, 9 Nov 2019 07:32:32 +0000 (08:32 +0100)]
clone.2: Tidy up the description of CLONE_DETACHED
The obsolete CLONE_DETACHED flag has never been properly
documented, but now the discussion CLONE_PIDFD also requires
mention of CLONE_DETACHED. So, properly document CLONE_DETACHED,
and mention its interactions with CLONE_PIDFD.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 8 Nov 2019 21:40:45 +0000 (22:40 +0100)]
clone.2: Consistently order paragraphs for CLONE_NEW* flags
Sometimes the descriptions of these flags mentioned the
corresponding section 7 namespace manual page and then the
required capabilities, and sometimes the order was the was
the reverse. Make it consistent.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 8 Nov 2019 21:32:47 +0000 (22:32 +0100)]
clone.2: Remove various details that are already covered in namespaces pages
Remove details of UTS, IPC, and network namespaces that are
already covered in the corresponding namespaces pages in
section 7. This change is for consistency, since corresponding
details were not provided for other namespace types in clone(2)
and these details do not appear in unshare(2).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Yang Xu [Fri, 25 Oct 2019 06:06:25 +0000 (14:06 +0800)]
quotactl.2: Add some details about Q_QUOTAON
For Q_QUOTAON, on old kernel we can use quotacheck -ug to generate
quota files. But in current kernel, we can also hide them in
system inodes and indicate them by using "quota" or project
feature.
For user or group quota, we can do as below (etc ext4):
Michael Kerrisk [Thu, 24 Oct 2019 07:13:44 +0000 (09:13 +0200)]
clone.2: Rename arguments for consistency with clone3()
Sometime soon, we'll have to add documentation of clone3() to this
page. As a preparatorys step, make the names of the clone()
arguments the same as the fields in the clone3() 'args' struct:
Michael Kerrisk [Sat, 12 Oct 2019 22:20:46 +0000 (00:20 +0200)]
wait.2: Clarify semantics of waitpid(0, ...)
As noted in kernel commit 821cc7b0b205c0df64cce59aacc330af251fa8f7,
threads create an ambiguity: what if the calling process's PGID
is changed by another thread while waitpid(0, ...) is blocked?
So, clarify that waitpid(0, ...) means wait for children whose
PGID matches the caller's PGID at the time of the call to
waitpid().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 10 Oct 2019 07:56:07 +0000 (09:56 +0200)]
select.2: POLLIN_SET/POLLOUT_SET/POLLEX_SET are now defined in terms of EPOLL*
Since kernel commit a9a08845e9acbd224e4ee466f5c1275ed50054e8, the
equivalence between select() and poll()/epoll is defined in terms
of the EPOLL* constants, rather than the POLL* constants.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 25 Sep 2019 13:59:12 +0000 (15:59 +0200)]
pidfd_open.2: Enhance the discussion of usage of fork() + pidfd_open()
After review comments from Christian and Daniel.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Reported-by: Daniel Colascione <dancol@google.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Mon, 23 Sep 2019 08:00:19 +0000 (10:00 +0200)]
pidfd_open.2: Opening /proc/PID doesn't yield a pollable file descriptor
Thus, pidfd_open() is the preferred way of obtaining a PID
file descriptor.
Notes from a conversation with Christian Brauner:
[[
> A further question... We now have three ways of getting a
> process file descriptor [*]:
>
> open() of /proc/PID
> pidfd_open()
> clone()/clone3() with CLONE_PIDFD
>
> I thought the FD was supposed to be equivalent in all three cases.
> However, if I try (on kernel 5.3) poll() an FD returned by opening
> /proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
> intentional? (I am guessing it is not.)
It's intentional.
The short answer is that /proc/<pid> is a convenience for sending
signals.
The longer answer is that this stems from a heavy debate about what a
process file descriptor was supposed to be and some people pushing for
at least being able to use /proc/<pid> dirfds while ignoring security
problems as soon as you're talking about returning those fds from
clone(); not to mention the additional problems discovered when trying
to implementing this.
A "real" pidfd is one from CLONE_PIDFD or pidfd_open() and all features
such as exit notification, read, and other future extensions will only
be implemented on top of them.
As much as we'd have liked to get rid of two different file descriptor
types it doesn't hurt us much and is not that much different from what
we will e.g. see with fsinfo() in the new mount api which needs to work
on regular fds gotten via open()/openat() and mountfds gotten from
fsopen() and fspick(). The mountfds will also allow for advanced
operations that the other ones will not. There's even an argument to be
made that fds you will get from open()/openat() and openat2() are
different types since they have very different behavior; openat2()
returning fds that are non arbitrarily upgradable etc.
]]
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 22 Sep 2019 21:05:48 +0000 (23:05 +0200)]
pidfd_open.2: New page documenting pidfd_open(2)
Notes from a conversation on linux-man@ with Christian Brauner:
[[
> [*} By the way, going forward, can we call these things
> "process FDs", rather than "PID FDs"? The API names are what
> they are, an that's okay, but these just as we have socket
> FDs that refer to sockets, directory FDs that refer to
> directories, and timer FDs that refer to timers, and so on,
> these are FDs that refer to *processes*, not "process IDs".
> It's a little thing, but I think the naming better, and
> it's what I propose to use in the manual pages.
The naming was another debate and we ended with this compromise.
I would just clarify that a pidfd is a process file descriptor. I
wouldn't make too much of a deal of hiding the shortcut "pidfd".
People are already using it out there in the wild and it's never
proven a good idea to go against accepted practice.
]]
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>