CLOCK_BOOTTIME            y  n (EINVAL)  y               y            y
CLOCK_BOOTTIME_ALARM      y  n (EINVAL)  y [1]           y [1]        y [1]
CLOCK_MONOTONIC           y  n (EINVAL)  y               y            y
CLOCK_MONOTONIC_COARSE    y  n (EINVAL)  n (ENOTSUP)     n (ENOTSUP)  n (EINVAL)
CLOCK_MONOTONIC_RAW       y  n (EINVAL)  n (ENOTSUP)     n (ENOTSUP)  n (EINVAL)
CLOCK_REALTIME            y  y           y               y            y
CLOCK_REALTIME_ALARM      y  n (EINVAL)  y [1]           y [1]        y [1]
CLOCK_REALTIME_COARSE     y  n (EINVAL)  n (ENOTSUP)     n (ENOTSUP)  n (EINVAL)
CLOCK_TAI                 y  n (EINVAL)  y               y            n (EINVAL)
CLOCK_PROCESS_CPUTIME_ID  y  n (EINVAL)  y               y            n (EINVAL)
CLOCK_THREAD_CPUTIME_ID   y  n (EINVAL)  n (EINVAL [2])  y            n (EINVAL)
pthread_getcpuclockid()   y  n (EINVAL)  y               y            n (EINVAL)
[1] The caller must have CAP_WAKE_ALARM, or the error EPERM results.
[2] This error is generated in the glibc wrapper.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 1 Apr 2020 06:16:40 +0000 (08:16 +0200)]
openat2.2: Improve text describing caveat for use of RESOLVE_NO_XDEV
From email discussions with Aleksa Sarai:
> .\" FIXME I find the "previously-functional systems" in the previous
> .\" sentence a little odd (since openat2() is a new syscall), so I would
> .\" like to clarify a little...
> .\" Are you referring to the scenario where someone might take an
> .\" existing application that uses openat() and replaces the uses
> .\" of openat() with openat2()? In which case, is it correct to
> .\" understand that you mean that one should not just indiscriminately
> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
> .\" If I'm not on the right track, could you point me in the right
> .\" direction please.
This is mostly meant as a warning to hopefully avoid applications
breaking because the developer didn't realise that system paths may
contain symlinks or bind-mounts. For an application which has switched
to openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
it's possible that on some distributions (or future versions of a
distribution) the application will stop working because a system
path suddenly contains a symlink or is a bind-mount.
This was a concern which was brought up on LWN some time ago. If you can
think of a phrasing that makes this more clear, I'd appreciate it.
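To make the caveat concrete, here is a minimal sketch of a caller that uses RESOLVE_NO_XDEV only as a preference, not as a security boundary, and so treats EXDEV as recoverable. The helper names (try_openat2, open_prefer_same_mount) and the local struct how_args are hypothetical; the struct mirrors the three-field layout of struct open_how so that the sketch builds with older headers, and the fallback syscall number 437 is an assumption that holds on most, but not all, architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437         /* assumption: not valid on every architecture */
#endif
#ifndef RESOLVE_NO_XDEV
#define RESOLVE_NO_XDEV 0x01    /* value from <linux/openat2.h> */
#endif

/* Hypothetical local mirror of struct open_how (flags, mode, resolve). */
struct how_args {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Open path read-only relative to the current directory with the given
   resolve flags. Returns a file descriptor, or -1 with errno set. */
static int try_openat2(const char *path, uint64_t resolve)
{
    struct how_args how;

    memset(&how, 0, sizeof(how));
    how.flags = O_RDONLY | O_CLOEXEC;
    how.resolve = resolve;
    return (int) syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
}

/* Try to stay on one mount; if resolution would cross a mount point
   (EXDEV), record that fact and retry without the restriction. */
static int open_prefer_same_mount(const char *path, int *crossed_mount)
{
    int fd;

    assert(path != NULL && crossed_mount != NULL);
    *crossed_mount = 0;
    fd = try_openat2(path, RESOLVE_NO_XDEV);
    if (fd == -1 && errno == EXDEV) {
        *crossed_mount = 1;
        fd = try_openat2(path, 0);
    }
    return fd;
}
```

A caller that needs RESOLVE_NO_XDEV for security reasons should not retry like this. The fallback is only reasonable when the flag was added for a non-security reason, which is the case the warning is about.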
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Mon, 30 Mar 2020 20:52:58 +0000 (22:52 +0200)]
timerfd_create.2: Negative changes to CLOCK_REALTIME may cause read() to return 0
Devi R K reported this issue, and went on to note:
> We have written a program using real time clock and it has been raised to
> the community.
>
> https://lore.kernel.org/lkml/alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de/T/
[...]
Thanks for pointing me at that thread. In particular, the test
program at
https://lore.kernel.org/lkml/alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de/T/#m489d81abdfbb2699743e18c37657311f8d52a4cd
[...]
I think this patch does not really capture the details
properly. The immediately preceding paragraph says:
    If the associated clock is either CLOCK_REALTIME or
    CLOCK_REALTIME_ALARM, the timer is absolute
    (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
    was specified when calling timerfd_settime(), then read(2)
    fails with the error ECANCELED if the real-time clock
    undergoes a discontinuous change. (This allows the reading
    application to discover such discontinuous changes to the
    clock.)
Following on from that, I think we should have a paragraph that says
something like:
    If the associated clock is either CLOCK_REALTIME or
    CLOCK_REALTIME_ALARM, the timer is absolute
    (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
    was not specified when calling timerfd_settime(), then a
    discontinuous negative change to the clock
    (e.g., clock_settime(2)) may cause read(2) to unblock, but
    return a value of 0 (i.e., no bytes read), if the clock
    change occurs after the time expired, but before the
    read(2) on the timerfd file descriptor.
This seems consistent with Thomas's observations in
https://lore.kernel.org/lkml/alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de/T/#m49b78122b573a2749a05b720dc9fa036546db490
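As an illustration of what this means for callers, the following sketch separates the three outcomes of read(2) on a timerfd that are discussed above: a normal expiration count, a return of 0 after a backward clock step, and ECANCELED when TFD_TIMER_CANCEL_ON_SET is in use. The enum and the helper read_timerfd() are hypothetical names introduced for this example only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical classification of a read(2) on a timerfd. */
enum timer_read {
    READ_EXPIRED,    /* an 8-byte expiration count was read */
    READ_ZERO,       /* read() returned 0: no expirations to report */
    READ_CANCELED,   /* ECANCELED: clock changed, CANCEL_ON_SET was set */
    READ_ERROR       /* any other failure; errno describes it */
};

static enum timer_read read_timerfd(int fd, uint64_t *expirations)
{
    ssize_t n;

    assert(expirations != NULL);

    /* Restart the read if a signal handler interrupts it. */
    do {
        n = read(fd, expirations, sizeof(*expirations));
    } while (n == -1 && errno == EINTR);

    if (n == (ssize_t) sizeof(*expirations))
        return READ_EXPIRED;
    if (n == 0)
        return READ_ZERO;
    if (n == -1 && errno == ECANCELED)
        return READ_CANCELED;
    return READ_ERROR;
}
```

A caller that sees READ_ZERO would typically go back to waiting on the file descriptor and not treat the wakeup as an expiration.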
==
Thomas Gleixner replied:
Yes, that's correct. Accurate as always!
This is pretty much in line with clock_nanosleep(CLOCK_REALTIME,
TIMER_ABSTIME) which has a similar problem vs. observability in user
space.
clock_nanosleep(2) mutters:
"POSIX.1 specifies that after changing the value of the CLOCK_REALTIME
clock via clock_settime(2), the new clock value shall be used to
determine the time at which a thread blocked on an absolute
clock_nanosleep() will wake up; if the new clock value falls past the
end of the sleep interval, then the clock_nanosleep() call will return
immediately."
which can be interpreted as a guarantee that clock_nanosleep() never
returns prematurely, i.e. the assert() in the below code would indicate
a kernel failure:
    ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL);
    if (!ret) {
        clock_gettime(CLOCK_REALTIME, &now);
        assert(now >= expiry);
    }
But that assert can trigger when CLOCK_REALTIME was modified after the
timer fired and the kernel decided to wake up the task and let it return
to user space.
    -> timer interrupt
       now = ktime_get_real();
       if (expires <= now)
          -------------------------------- After this point
          wakeup();                        clock_settime(2) or
                                           adjtimex(2) which
                                           makes CLOCK_REALTIME
                                           jump back far enough will
                                           cause the above assert
                                           to trigger.
       ...
       return from syscall (retval == 0)
There is no guarantee against clock_settime() coming after the
wakeup. Even if we put another check into the return to user path then
we won't catch a clock_settime() which comes right after that and before
user space invokes clock_gettime().
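The following is a minimal user-space sketch of the same point, under the assumption that the caller wants to detect the situation and not abort on it. struct timespec values cannot be compared with >= directly, so a small comparison helper is used. The names timespec_ge() and sleep_until_realtime() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Return true if *a is at or after *b: compare seconds first,
   then nanoseconds. */
static bool timespec_ge(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec > b->tv_sec;
    return a->tv_nsec >= b->tv_nsec;
}

/* Sleep until an absolute CLOCK_REALTIME deadline.
   Returns 0 if the clock reads at or past the deadline afterwards,
   1 if the clock reads earlier than the deadline (possible when the
   clock is set back after the wakeup), or -1 on error with errno set. */
static int sleep_until_realtime(const struct timespec *expiry)
{
    struct timespec now;
    int ret;

    assert(expiry != NULL);

    /* clock_nanosleep() returns an error number; it does not set errno. */
    do {
        ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, expiry, NULL);
    } while (ret == EINTR);
    if (ret != 0) {
        errno = ret;
        return -1;
    }

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        return -1;

    /* Report the outcome; an early reading here is not a kernel bug. */
    return timespec_ge(&now, expiry) ? 0 : 1;
}
```

A return value of 1 only says that the clock was observed behind the deadline at the time of the check. As described above, a later clock change can still go unnoticed.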
POSIX spec Issue 7 (2018 edition) says:
    The suspension for the absolute clock_nanosleep() function (that is,
    with the TIMER_ABSTIME flag set) shall be in effect at least until the
    value of the corresponding clock reaches the absolute time specified by
    rqtp.
And that's what the kernel implements for clock_nanosleep() and timerfd
behaves exactly the same way.
The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2),
read(2), poll(2), is not happening _before_ the absolute time specified
is reached.
If clock_settime() happens right before the expiry check, then it does
the right thing, but any modification to the clock after the wakeup
cannot be mitigated. At least not in a way which would make the assert()
in the example code above a reliable indicator for a kernel failure.
That's the reason why I rejected the attempt to mitigate that particular
0 tick issue in timerfd as it would just scratch a particular itch but
still not provide any guarantee. So having the '0' return documented is
the right way to go.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: devi R.K <devi.feb27@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Aleksa Sarai [Mon, 30 Mar 2020 06:34:45 +0000 (08:34 +0200)]
openat2.2: Document new openat2(2) syscall
Rather than trying to merge the new syscall documentation into
open.2 (which would probably result in the man-page being
incomprehensible), instead the new syscall gets its own dedicated
page with links between open(2) and openat2(2) to avoid
duplicating information such as the list of O_* flags or common
errors.
In addition to describing all of the key flags, information about
the extensibility design is provided so that users can better
understand why they need to pass sizeof(struct open_how) and how
their programs will work across kernels. After some discussions
with David Laight, I also included explicit instructions to zero
the structure to avoid issues when recompiling with new headers.
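A minimal sketch of the calling convention described here: the structure is zeroed in full before any field is set, and its size is passed as the final argument so the kernel knows which layout the caller was built against. The wrapper name open_readonly_at() and the local struct open_how_v0 are hypothetical; the struct mirrors the initial three-field layout of struct open_how so the sketch builds with older headers, and the fallback syscall number 437 is an assumption that does not hold on every architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437   /* assumption: not valid on every architecture */
#endif

/* Hypothetical local mirror of the initial struct open_how layout. */
struct open_how_v0 {
    uint64_t flags;     /* O_* flags */
    uint64_t mode;      /* file mode, only meaningful when creating */
    uint64_t resolve;   /* RESOLVE_* flags */
};

/* Open path read-only relative to dirfd.
   Returns a file descriptor, or -1 with errno set. */
static int open_readonly_at(int dirfd, const char *path, uint64_t resolve)
{
    struct open_how_v0 how;

    assert(path != NULL);

    /* Zero every byte first, so that any field this code does not set
       explicitly is passed to the kernel as 0. */
    memset(&how, 0, sizeof(how));
    how.flags = O_RDONLY | O_CLOEXEC;
    how.resolve = resolve;

    /* The size argument identifies the structure layout in use. */
    return (int) syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

With the real struct open_how from the kernel headers, the same memset() pattern means that fields added by a later header version are also zero after a recompile, without source changes.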
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 29 Mar 2020 20:08:10 +0000 (22:08 +0200)]
clock_getres.2: Move text in BUGS to NOTES
The fact that CLOCK_PROCESS_CPUTIME_ID and
CLOCK_THREAD_CPUTIME_ID are not settable isn't a bug,
since POSIX does allow the possibility that these clocks
are not settable.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>