Michael Kerrisk [Thu, 10 Oct 2019 08:37:36 +0000 (10:37 +0200)]
pivot_root.2: Relegate text about what pivot_root() may or may not do to NOTES
The text stating that "pivot_root() may or may not change the
current root and the current working directory of any processes
or threads which use the old root directory" was written 19 years
ago, before the system call itself was even finalized in the
kernel. The implementation has never changed, and it won't
change in the future, since that would cause user-space breakage.
The existence of that text in DESCRIPTION, followed by qualifying
text stating what the implementation actually does (and has always
done) makes for confusing reading. Therefore, relegate this text
to a historical note in NOTES (so that readers with long memories
can see why the manual page was changed) and rework the text in
DESCRIPTION accordingly.
Reported-by: Philipp Wendler <ml@philippwendler.de> Reported-by: Eric W. Biederman <ebiederm@xmission.com> Reported-by: Reid Priedhorsky <reidpr@lanl.gov> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 9 Oct 2019 10:14:35 +0000 (12:14 +0200)]
pivot_root.2: Change "filesystem" to "mount" in various places
Quoting Eric:
If we are going to be pedantic "filesystem" is really the
wrong concept here. The section about bind mount clarifies
it, but I wonder if there is a better term.
I think I would say: "new_root and put_old must not be on
the same mount as the current root."
I think using "mount" instead of "filesystem" keeps the
concepts less confusing.
As I am reading through this email and seeing text that is
trying to be precise and clear then hitting the term
"filesystem" is a bit jarring. pivot_root doesn't care a
thing for file systems. pivot_root only cares about mounts.
And by a "mount" I mean the thing that you get when you
create a bind mount or you call mount normally.
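A minimal sketch of the "mount" notion Eric describes: bind-mounting a
directory onto itself turns it into a mount of its own, distinct from the
mount of the current root, even though no new filesystem is involved. The
helper name make_mount_point() is hypothetical and used only for
illustration; the call needs CAP_SYS_ADMIN in the relevant namespace.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Hypothetical helper: bind-mount 'dir' onto itself so that it becomes
 * a mount in its own right (same filesystem, new mount).
 * Returns 0 on success, -1 on failure with errno set by mount(2). */
int make_mount_point(const char *dir)
{
    if (mount(dir, dir, NULL, MS_BIND, NULL) == -1) {
        perror("mount(MS_BIND)");
        return -1;
    }
    return 0;
}
```

After a successful call, 'dir' could serve as 'new_root', since it is no
longer on the same mount as the current root.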
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 9 Oct 2019 07:37:35 +0000 (09:37 +0200)]
pivot_root.2: Simplify discussion of restrictions for 'new_root'
Philipp Wendler noted that the text on the restrictions for
'new_root' was slightly contradictory, and things could be
clarified and simplified by describing the restrictions on
'new_root' in one place.
Reported-by: Philipp Wendler <ml@philippwendler.de> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 9 Oct 2019 07:11:59 +0000 (09:11 +0200)]
pivot_root.2: Remove an imprecision in description
Remove the text that suggests that pivot_root() changes the root
directory and CWD of processes that have their root directory and
CWD on the old root *filesystem*. Change "filesystem" to "directory".
Reported-by: Philipp Wendler <ml@philippwendler.de> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 9 Oct 2019 06:59:22 +0000 (08:59 +0200)]
namespaces.7: Include manual page references in the summary table of namespace types
Make the page more compact by removing the stub subsections that
list the manual pages for the namespace types. And while we're
here, add an explanation of the table columns.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 8 Oct 2019 21:30:55 +0000 (23:30 +0200)]
mount_namespaces.7: Tweak discussion of "less privileged" mount namespace
Eric Biederman:
I hate to nitpick, but I am going to say that when I read
the text above the phrase "mount namespace of the process
that created the new mount namespace" feels wrong.
Either you use unshare(2) and the mount namespace of the
process that created the mount namespace changes.
Or you use clone(2) and you could argue it is the new child
that created the mount namespace.
Having a different mount namespace at the end of the
creation operation feels like it makes your phrase confusing
about what the starting mount namespace is. I hate to use
references that are ambiguous when things are changing.
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 8 Oct 2019 18:57:55 +0000 (20:57 +0200)]
pivot_root.2: Remove the term 'old_root'
Reid noted a confusion between 'old_root' (my attempt at a
shorthand for the old root mount point) and 'put_old'. Eliminate
the confusion by replacing the shorthand with "old root mount point".
Reported-by: Reid Priedhorsky <reidpr@lanl.gov> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 8 Oct 2019 14:20:59 +0000 (16:20 +0200)]
mount_namespaces.7: Clarify description of "less privileged" mount namespaces
The current text talks about "parent mount namespaces", but there
is no such concept. As confirmed by Eric Biederman, what is meant
here is "the mount namespace this mount namespace started as a
copy of". So, this change writes up Eric's description in a more
detailed way.
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Mon, 7 Oct 2019 09:12:40 +0000 (12:12 +0300)]
pivot_root.2: Simplify pivot_root(".", ".") example
Eric Biederman notes that the change in commit f646ac88ef83969 was
not strictly necessary for this example, since one of the already
documented requirements is that various mount points must not have
shared propagation, or else pivot_root() will fail. So, simplify
the example.
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Carlos O'Donell [Fri, 4 Oct 2019 21:12:08 +0000 (17:12 -0400)]
pthread_setcancelstate.3, pthreads.7, signal-safety.7: Describe issues with cancellation points in signal handlers
In a recent conversation with Mathieu Desnoyers I was reminded
that we haven't written up anything about how deferred
cancellation and asynchronous signal handlers interact. Mathieu
ran into some of this behaviour and I promised to improve the
documentation in this area to point out the potential pitfall.
Thoughts?
8< --- 8< --- 8<
In pthread_setcancelstate.3, pthreads.7, and signal-safety.7,
describe the following hazard: if an asynchronous signal handler
interrupts a region in which deferred cancellation is enabled, any
cancellation point in the handler may trigger a cancellation that
behaves as if it were an asynchronous cancellation. Such a
cancellation may have unexpected effects on the consistency of
the application. Therefore, care should be taken with asynchronous
signals and deferred cancellation.
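One conservative way to take that care (an assumption for illustration,
not something stated in this commit) is to block the signal around the
region that must not be cancelled asynchronously, so that a handler
containing a cancellation point cannot nest there. In the sketch below,
the names run_with_signal_blocked(), critical_region(), and
demo_blocked_region() are hypothetical; critical_region() merely stands in
for code that runs with deferred cancellation enabled.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

static volatile sig_atomic_t handler_runs = 0;      /* bumped by the handler */
static volatile sig_atomic_t runs_seen_inside = -1; /* snapshot taken in region */

static void on_signal(int signo)
{
    (void)signo;
    handler_runs++;
}

/* Hypothetical stand-in for a deferred cancellation region. */
static void critical_region(void *arg)
{
    int signo = *(int *)arg;

    raise(signo);                   /* stays pending: signo is blocked here */
    runs_seen_inside = handler_runs;
}

/* Run region(arg) with 'signo' blocked in the calling thread, then
 * restore the previous mask. Returns 0 or a pthread_sigmask() error. */
int run_with_signal_blocked(int signo, void (*region)(void *), void *arg)
{
    sigset_t block, old;
    int err;

    sigemptyset(&block);
    sigaddset(&block, signo);
    err = pthread_sigmask(SIG_BLOCK, &block, &old);
    if (err != 0)
        return err;

    region(arg);

    /* A pending signal is delivered when the old mask is restored. */
    return pthread_sigmask(SIG_SETMASK, &old, NULL);
}

/* Install the handler and exercise the helper once. */
int demo_blocked_region(int signo)
{
    struct sigaction sa;

    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(signo, &sa, NULL) == -1)
        return -1;
    return run_with_signal_blocked(signo, critical_region, &signo);
}
```

The handler runs only after the region has finished, which is the point
of the technique: the signal is deferred rather than lost.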
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 27 Sep 2019 12:12:01 +0000 (14:12 +0200)]
sched_setparam.2, pthread_mutexattr_init.3, pthread_mutexattr_setrobust.3, pthread_mutex_consistent.3, strtol.3, sched.7, uts_namespaces.7: SEE ALSO: correct list order
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Amir Goldstein [Thu, 26 Sep 2019 17:01:19 +0000 (20:01 +0300)]
copy_file_range.2: Kernel v5.3 updates
Update with all the missing errors the syscall can return, the
behaviour the syscall should have w.r.t. copies within single
files, etc.
[Amir] updates for final released version.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 25 Sep 2019 12:33:15 +0000 (14:33 +0200)]
sched_setaffinity.2: RETURN VALUE: sched_getaffinity() syscall differs from the wrapper
In RETURN VALUE, point the reader at the subsection noting that the
return value of the raw sched_getaffinity() system call differs from
that of the wrapper function in glibc.
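The difference can be observed with a short sketch like the following.
The function names are hypothetical. The glibc wrapper returns 0 on
success, whereas the raw sched_getaffinity() system call returns the
number of bytes it placed in the mask buffer.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Return value of the glibc wrapper: 0 on success, -1 on error. */
int wrapper_getaffinity(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    return sched_getaffinity(0, sizeof(set), &set);
}

/* Return value of the raw system call: on success, the number of
 * bytes copied into the mask buffer (a positive value). */
long raw_getaffinity(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    return syscall(SYS_sched_getaffinity, 0, sizeof(set), &set);
}
```

Code that calls the system call directly via syscall(2) therefore cannot
test for success with "== 0".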
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Mon, 23 Sep 2019 14:36:45 +0000 (16:36 +0200)]
signalfd.2: Rewrite the text on epoll semantics
I also verified the behavior reported by Andrew Clayton
with the program below.
$ ./epoll_signalfd
PID of parent: 5661
PID of child: 5662
epoll_wait() returned 0
PID 5662: got signal 10
Successfully read signal, even though epoll_wait() didn't say FD was ready!
Andrew Clayton [Fri, 20 Sep 2019 23:42:11 +0000 (00:42 +0100)]
signalfd.2: Note about interactions with epoll & fork
Using signalfd(2) with epoll(7) and fork(2) can lead to some head
scratching.
It seems that when a signalfd file descriptor is added to epoll
you will only get notifications for signals sent to the process
that added the file descriptor to epoll.
So if you have a signalfd fd registered with epoll and then call
fork(2), perhaps by way of daemon(3), you will find that you no
longer get notifications for signals sent to the newly forked
process.
User kentonv on ycombinator[0] explained it thus
"One place where the inconsistency gets weird is when you
use signalfd with epoll. The epoll will flag events on the
signalfd based on the process where the signalfd was
registered with epoll, not the process where the epoll is
being used. One case where this can be surprising is if you
set up a signalfd and an epoll and then fork() for the
purpose of daemonizing -- now you will find that your epoll
mysteriously doesn't deliver any events for the signalfd
despite the signalfd otherwise appearing to function as
expected."
And another post from the same person[1].
And then there is this snippet from this kernel commit message[2]
"If you share epoll fd which contains our sigfd with another
process you should blame yourself. signalfd is "really
special"."
So add a note to the man page that points this out where people
will hopefully find it sooner rather than later!
Michael Kerrisk [Mon, 23 Sep 2019 10:26:37 +0000 (12:26 +0200)]
pivot_root.2: Correct the list of mount points that can't be MS_SHARED
Eric Biederman noted that my list of directories that could not
have shared propagation was incorrect. I had written that
new_root could not be shared; rather it should be: the parent of
the current root mount point.
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 15 Sep 2019 08:09:08 +0000 (10:09 +0200)]
pivot_root.2: Tweak pivot_root(".", ".") example
Quoting Eric Biederman:
The concern from our conversation at the container
mini-summit was that there is a pathology if in your initial
mount namespace all of the mounts are marked MS_SHARED like
systemd does (and is almost necessary if you are going to
use mount propagation), that if new_root itself is MS_SHARED
then unmounting the old_root could propagate.
The change to new_root could be either MS_SLAVE or
MS_PRIVATE. So long as it is not MS_SHARED the mount won't
propagate back to the parent mount namespace.
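As a sketch of what such a propagation change can look like in C (the
helper name make_private() is hypothetical), the following marks a mount
and everything below it MS_PRIVATE, so that later mount and unmount
events under it do not propagate to peer mounts. It is meant to be run
in a new mount namespace by a process with CAP_SYS_ADMIN.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Hypothetical helper: recursively mark the mount at 'path' private.
 * Returns 0 on success, -1 on failure with errno set by mount(2). */
int make_private(const char *path)
{
    if (mount(NULL, path, NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount(MS_PRIVATE)");
        return -1;
    }
    return 0;
}
```

Calling make_private("/") before pivot_root() covers every mount in the
namespace; MS_SLAVE in place of MS_PRIVATE would also stop propagation
back to the original namespace.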
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As far as I can see, just the umount() is sufficient, since,
after pivot_root(), old_root is at the top of the stack
of mounts at "/" and thus (so long as CWD is at "/")
the umount will remove the mount at the top of the stack.
Eric Biederman confirmed my understanding by mail, and
Philipp Wendler verified my results by experiment.
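The sequence discussed here can be sketched as follows. The helper names
are hypothetical, and pivot_root() is invoked through syscall(2) because
glibc provides no wrapper for it. The sketch assumes that 'new_root' is
already a mount point and that the caller has CAP_SYS_ADMIN in its mount
namespace.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static int do_pivot_root(const char *new_root, const char *put_old)
{
    return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/* Hypothetical helper: make 'new_root' the root mount and drop the
 * old root without needing a separate 'put_old' directory.
 * Returns 0 on success, -1 on failure with errno set. */
int enter_new_root(const char *new_root)
{
    if (chdir(new_root) == -1)
        return -1;

    /* Stack the old root on top of the new root at "/". */
    if (do_pivot_root(".", ".") == -1)
        return -1;

    /* CWD still refers to the old root, which is at the top of the
     * stack of mounts at "/"; detach it. */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return 0;
}
```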
Helped-by: Eric W. Biederman <ebiederm@xmission.com> Helped-by: Philipp Wendler <ml@philippwendler.de> Helped-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 10 Sep 2019 09:35:37 +0000 (11:35 +0200)]
pivot_root.2: Eliminate text suggesting that behavior may change in the future
After around 19 years, the behavior of pivot_root() has not been
changed, and will almost certainly not change in the future.
So, reword to remove the suggestion that the behavior may change.
Also, more clearly document the effect of pivot_root() on
the calling process's current working directory.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 10 Sep 2019 08:36:27 +0000 (10:36 +0200)]
pivot_root.2: Remove a note about a historical idea/expectation
The idea that there might one day be a mechanism for kernel
threads to explicitly relinquish access to the filesystem never
came to pass (after 20 years), and the presence of text
describing this idea is, IMO, a distraction. So, remove it.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 30 Jul 2019 20:59:17 +0000 (22:59 +0200)]
pivot_root.2: Remove text describing case where current root is not a mount point
One kernel printk() later, my suspicions seem confirmed: the text
describing the situation where the current root is not a mount
point (because of a chroot()) seems to be bogus. (Perhaps it was
true once upon a time.) In my testing, if the current root is not
a mount point, an EINVAL error results.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 30 Jul 2019 12:23:17 +0000 (14:23 +0200)]
pivot_root.2: Fix a technical detail
In this text:
If the current root is not a mount point (e.g., after an
earlier chroot(2) or pivot_root())...
mention of pivot_root() makes no sense, since (as noted in an
earlier commit message for this page) 'new_root' in a previous
pivot_root() must (since Linux 2.4.5) have been a mount point.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 30 Jul 2019 12:14:10 +0000 (14:14 +0200)]
pivot_root.2: Remove BUGS section
One of these "bugs" is a philosophical point already covered
elsewhere in the page, while the other is a somewhat obscure joke.
Both pieces are a bit of a distraction, really.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 30 Jul 2019 10:17:34 +0000 (12:17 +0200)]
pivot_root.2: Remove a bogus EBUSY error case
The note that EBUSY is given if a filesystem is already mounted
on 'put_old' was never really true. That restriction was in
Linux 2.3.14, but removed in Linux 2.3.99-pre6 so it never made
it to mainline.
The relevant diff in pivot_root() was:
error = -EBUSY;
- if (d_new_root->d_sb == root->d_sb || d_put_old->d_sb == root->d_sb)
+ if (new_nd.mnt == root_mnt || old_nd.mnt == root_mnt)
goto out2; /* loop */
- if (d_put_old != d_put_old->d_covers)
- goto out2; /* mount point is busy */
error = -EINVAL;
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sat, 27 Jul 2019 07:40:34 +0000 (09:40 +0200)]
pivot_root.2: Rework the text on "future changes" to reflect that 20 years have passed
Some of the text was written long ago, and hinted that things
might change in the future. However, 20 years have passed
and these details have not changed, so rework the text to
hint at that fact.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sat, 27 Jul 2019 06:55:23 +0000 (08:55 +0200)]
pivot_root.2: There is no restriction against 'put_old' being a mount point
As far as I can see from the source code, the statement that
"No other filesystem may be mounted on 'put_old'" is incorrect.
Even looking at the 2.4.0 source code, I can't see such
a restriction. In addition, some testing on a 5.0 kernel
(mounting 'put_old' in the new mount namespace just before
pivot_root()) did not result in an error for this case when
calling pivot_root().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Mike Frysinger [Thu, 19 Sep 2019 05:43:42 +0000 (01:43 -0400)]
setns.2: Fix CLONE_NEWNS restriction info
Threads are allowed to switch mount namespaces if the filesystem
details aren't being shared. That's the purpose of the check in
the kernel quoted by the comment:
if (fs->users != 1)
return -EINVAL;
It's been this way since the code was originally merged in v3.8.
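In practice, a thread that shares its filesystem attributes with others
(for example, because it was created with CLONE_FS) can stop sharing
them before calling setns(2). A minimal sketch, with the hypothetical
helper name join_mount_ns(), and assuming the caller holds the
capabilities that setns(2) requires for mount namespaces:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: move the calling thread into the mount namespace
 * referred to by 'ns_path' (e.g., a /proc/PID/ns/mnt file).
 * Returns 0 on success, -1 on failure with errno set. */
int join_mount_ns(const char *ns_path)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Give this thread a private copy of the root directory, CWD, and
     * umask, so that the "fs->users != 1" check can pass. */
    if (unshare(CLONE_FS) == -1) {
        close(fd);
        return -1;
    }

    int ret = setns(fd, CLONE_NEWNS);
    close(fd);
    return ret;
}
```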
Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>