git.ipfire.org Git - thirdparty/man-pages.git/log
Michael Kerrisk [Thu, 29 Oct 2020 18:41:22 +0000 (19:41 +0100)] 
seccomp_unotify.2: A cookie check is also required after reading target's memory

Quoting Jann Horn:

[[
As discussed at
<https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com>,
we need to re-check checkNotificationIdIsValid() after reading remote
memory but before using the read value in any way. Otherwise, the
syscall could in the meantime get interrupted by a signal handler, the
signal handler could return, and then the function that performed the
syscall could free() allocations or return (thereby freeing buffers on
the stack).

In essence, this pread() is (unavoidably) a potential use-after-free
read; and to make that not have any security impact, we need to check
whether a UAF read occurred before using the read value. This should
probably be called out elsewhere in the manpage, too...

Now, of course, **reading** is the easy case. The difficult case is if
we have to **write** to the remote process... because then we can't
play games like that. If we write data to a freed pointer, we're
screwed, that's it. (And for somewhat unrelated bonus fun, consider
that /proc/$pid/mem is originally intended for process debugging,
including installing breakpoints, and will therefore happily write
over "readonly" private mappings, such as typical mappings of
executable code.)

So, uuuuh... I guess if anyone wants to actually write memory back to
the target process, we'd better come up with some dedicated API for
that, using an ioctl on the seccomp fd that magically freezes the
target process inside the syscall while writing to its memory, or
something like that? And until then, the manpage should have a big fat
warning that writing to the target's memory is simply not possible
(safely).
]]

and
<https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com>:

[[
The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
  char pathbuf[4096];
  sprintf(pathbuf, "/tmp/blah.%d", some_number);
  mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
]]
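
The read-then-recheck ordering described above can be sketched as
follows. This is only an illustration: read_target_checked() and its
parameters are hypothetical names, 'mem_fd' is assumed to be an open
file descriptor for the target's /proc/PID/mem, and 'notify_fd' is
assumed to be the seccomp notification file descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: read up to 'len' bytes at 'addr' from the
   target's memory, then confirm that notification 'id' is still
   live. The caller may use 'buf' only if this returns true. */

static bool
read_target_checked(int mem_fd, int notify_fd, uint64_t id,
                    off_t addr, void *buf, size_t len, ssize_t *nread)
{
    assert(buf != NULL && nread != NULL);

    /* This read may be a use-after-free read in the target */

    *nread = pread(mem_fd, buf, len, addr);
    if (*nread <= 0)
        return false;

    /* Recheck the cookie AFTER the read and BEFORE any use of the
       data; if the target abandoned the system call in the
       meantime, the ioctl() fails and the data is discarded */

    __u64 cookie = id;
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &cookie) == -1)
        return false;

    return true;
}
```

A supervisor built this way never acts on bytes read from a buffer
that the target may already have freed; it covers reads only, not
writes to the target's memory.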

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 29 Oct 2020 18:23:47 +0000 (19:23 +0100)] 
seccomp_unotify.2: wfix

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 29 Oct 2020 16:15:50 +0000 (17:15 +0100)] 
seccomp_unotify.2: EXAMPLES: make SECCOMP_IOCTL_NOTIF_ID_VALID function return bool

- Rename the function that does the SECCOMP_IOCTL_NOTIF_ID_VALID
  check.
- Make that function return a 'bool' rather than terminating the
  process.
- Use that return value in the calling function.
- Rework/improve various related comments.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 29 Oct 2020 11:19:16 +0000 (12:19 +0100)] 
seccomp_unotify.2: EXAMPLES: Improve comments describing checkNotificationIdIsValid()

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 29 Oct 2020 09:46:10 +0000 (10:46 +0100)] 
seccomp_unotify.2: EXAMPLES: make getTargetPathname() a bit more generically useful

Allow the caller to specify which system call argument should
be looked up as a pathname.
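
As a rough sketch of what selecting the argument by index might
look like (target_arg_address() is a hypothetical name, not the
function in the page's example program), the supervisor can
bounds-check the index against seccomp_data.args[] before taking
the address it will read from:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fetch the value of system call argument
   'arg_num' from a notification, for use as an address in the
   target's memory. Returns false if the index is out of range. */

static bool
target_arg_address(const struct seccomp_notif *req, int arg_num,
                   uint64_t *addr)
{
    assert(req != NULL && addr != NULL);

    /* seccomp_data.args[] holds the six system call arguments */

    size_t nargs = sizeof(req->data.args) / sizeof(req->data.args[0]);

    if (arg_num < 0 || (size_t) arg_num >= nargs)
        return false;

    *addr = req->data.args[arg_num];
    return true;
}
```

For mount(2), for example, the target pathname would be argument
index 1 under this numbering.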

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 28 Oct 2020 18:18:56 +0000 (19:18 +0100)] 
seccomp_unotify.2: SEE ALSO: add pidfd_open(2) and pidfd_getfd(2)

pidfd_open(2) and pidfd_getfd(2) presumably have use cases
with the user-space notification feature.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 28 Oct 2020 12:14:08 +0000 (13:14 +0100)] 
seccomp_unotify.2: NOTES: describe an example use-case

The container manager use case was the original motivation
for this feature.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Tue, 27 Oct 2020 06:17:01 +0000 (07:17 +0100)] 
seccomp_unotify.2: Remove FIXME asking about usefulness of POLLOUT/EPOLLOUT

According to Tycho Andersen, he had no particular use case
in mind when building this detail into the API.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Mon, 26 Oct 2020 09:45:24 +0000 (10:45 +0100)] 
seccomp_unotify.2: srcfix: Add a further FIXME relating to SA_RESTART behavior

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Mon, 26 Oct 2020 09:11:09 +0000 (10:11 +0100)] 
seccomp_unotify.2: Various fixes after review comments from Kees Cook

Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 25 Oct 2020 14:02:54 +0000 (15:02 +0100)] 
seccomp_unotify.2: Update a FIXME

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 25 Oct 2020 12:54:05 +0000 (13:54 +0100)] 
cmsg.3, unix.7: Refer to seccomp_unotify(2) for an example of SCM_RIGHTS usage

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sat, 24 Oct 2020 10:54:11 +0000 (12:54 +0200)] 
signal.7: Add reference to seccomp_unotify(2)

The seccomp user-space notification feature can cause changes in
the semantics of SA_RESTART with respect to system calls that
would never normally be restarted. Point the reader to the page
that provides further details.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sat, 24 Oct 2020 12:29:11 +0000 (14:29 +0200)] 
seccomp_unotify.2: Describe the interaction with SA_RESTART signal handlers

And, as noted by Jann Horn, note how the user-space notification
mechanism causes a small breakage in the user-space API with
respect to nonrestartable system calls.

====

From the email discussion with Jann Horn

> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >>    SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >>    ***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >>    blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
> >> 6. SIGINT gets delivered to target; handler gets called;
> >>    and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >>    which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
> Thanks for the code, Jann (including the demo of the CLONE_FILES
> technique to pass the notification FD to the supervisor).
>
> But I think your code just demonstrates what I described in
> scenario A. So, it seems that I both understood what you
> meant (because my code demonstrates the same thing) and
> also misunderstood what you said (because I thought you
> were meaning something more like scenario B).

Ahh, sorry, I should've read your mail more carefully. Indeed, that
testcase only shows scenario A. But the following shows scenario B...

[Below, two pieces of code from Jann, with a lot of
cosmetic changes by mtk.]

====

[And from a follow-up in the same email thread:]

> If userspace relies on non-restarting behavior, it should be using
> something like epoll_pwait(). And that stuff only unblocks signals
> after we've already passed the seccomp checks on entry.
Thanks for elaborating that detail, since as soon as you talked
about "enlarging a preexisting race" above, I immediately wondered
about sigsuspend(), pselect(), etc.

(Mind you, I still wonder about the effect on system calls that
are normally nonrestartable because they have timeouts. My
understanding is that the kernel doesn't restart those system
calls because it's impossible for the kernel to restart the call
with the right timeout value. I wonder what happens when those
system calls are restarted in the scenario we're discussing.)

Anyway, returning to your point... So, to be clear (and to
quickly remind myself in case I one day reread this thread),
there is not a problem with sigsuspend(), pselect(), ppoll(),
and epoll_pwait() since:

* Before the syscall, signals are blocked in the target.
* Inside the syscall, signals are still blocked at the time
  the check is made for seccomp filters.
* If a seccomp user-space notification event kicks in, the target
  is put to sleep with the signals still blocked.
* The signal will only get delivered after the supervisor either
  triggers a spoofed success/failure return in the target or the
  supervisor sends a CONTINUE response to the kernel telling it
  to execute the target's system call. Either way, there won't be
  any restarting of the target's system call (and the supervisor
  thus won't see multiple notifications).
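
For illustration, a target that depends on being interrupted rather
than restarted might follow the usual pattern of keeping the signal
blocked and letting pselect() unblock it atomically. The sketch
below is not from the email thread; wait_interruptibly() and
on_sigusr1() are hypothetical names, and the raise() call stands in
for a signal sent by another process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t got_sigusr1;

static void
on_sigusr1(int sig)
{
    (void) sig;
    got_sigusr1 = 1;
}

/* Hypothetical helper: returns true if pselect() was interrupted
   (EINTR) by SIGUSR1 after the handler ran. */

static bool
wait_interruptibly(void)
{
    struct sigaction sa = { .sa_handler = on_sigusr1 };
    sigset_t block, orig, wait_mask;

    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return false;

    /* Keep SIGUSR1 blocked outside of the waiting call */

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &orig) == -1)
        return false;

    raise(SIGUSR1);             /* Stays pending while blocked */
    assert(!got_sigusr1);

    /* pselect() installs 'wait_mask' only for the duration of the
       call, so the signal is delivered inside the call */

    wait_mask = orig;
    sigdelset(&wait_mask, SIGUSR1);
    struct timespec timeout = { .tv_sec = 2 };
    int ret = pselect(0, NULL, NULL, NULL, &timeout, &wait_mask);
    int saved_errno = errno;

    sigprocmask(SIG_SETMASK, &orig, NULL);
    return ret == -1 && saved_errno == EINTR && got_sigusr1;
}
```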

====

Scenario A

$ ./seccomp_unotify_restart_scen_A
C: installed seccomp: fd 3
C: woke 1 waiters
P: child installed seccomp fd 3
C: About to call pause(): Success
P: going to send SIGUSR1...
C: sigusr1_handler handler invoked
P: about to terminate
C: got pdeath signal on parent termination
C: about to terminate

/* Modified version of code from Jann Horn */

#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <limits.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
    int seccomp_fd;
} *shared;

static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: sigusr1_handler handler invoked\n");
}

static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: got pdeath signal on parent termination\n");
    printf("C: about to terminate\n");
    exit(0);
}

int
main(void)
{
    setbuf(stdout, NULL);

    /* Allocate memory that will be shared by parent and child */

    shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        err(1, "mmap");
    shared->seccomp_fd = -1;

    /* glibc's clone() wrapper doesn't support fork()-style usage */
    /* Child process and parent share file descriptor table */

    pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
                          NULL, NULL, NULL, 0);
    if (child == -1)
        err(1, "clone");

    /* CHILD */

    if (child == 0) {
        /* don't outlive the parent */
        prctl(PR_SET_PDEATHSIG, SIGUSR2);

        if (getppid() == 1)
            exit(0);

        /* Install seccomp filter */

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns
        };
        int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (seccomp_ret < 0)
            err(1, "install");
        printf("C: installed seccomp: fd %d\n", seccomp_ret);

        /* Place the notifier FD number into the shared memory */

        __atomic_store(&shared->seccomp_fd, &seccomp_ret,
                       __ATOMIC_RELEASE);

        /* Wake the parent */

        int futex_ret =
            syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                    INT_MAX, NULL, NULL, 0);
        printf("C: woke %d waiters\n", futex_ret);

        /* Establish SA_RESTART handler for SIGUSR1 */

        struct sigaction act = {
            .sa_sigaction = sigusr1_handler,
            .sa_flags = SA_RESTART | SA_SIGINFO
        };
        if (sigaction(SIGUSR1, &act, NULL))
            err(1, "sigaction");

        struct sigaction act2 = {
            .sa_sigaction = sigusr2_handler,
            .sa_flags = 0
        };
        if (sigaction(SIGUSR2, &act2, NULL))
            err(1, "sigaction");

        /* Make a blocking system call */

        perror("C: About to call pause()");
        pause();
        perror("C: pause returned");

        exit(0);
    }

    /* PARENT */

    /* Wait for futex wake-up from child */

    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                            -1, NULL, NULL, 0);
    if (futex_ret == -1 && errno != EAGAIN)
        err(1, "futex wait");

    /* Get notification FD from the child */

    int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
    printf("\tP: child installed seccomp fd %d\n", fd);

    sleep(1);

    printf("\tP: going to send SIGUSR1...\n");
    kill(child, SIGUSR1);

    sleep(1);
    printf("\tP: about to terminate\n");

    exit(0);
}

====

Scenario B

$ ./seccomp_unotify_restart_scen_B
C: installed seccomp: fd 3
C: woke 1 waiters
C: About to call pause()
P: child installed seccomp fd 3
P: about to SECCOMP_IOCTL_NOTIF_RECV
P: got notif: id=17773741941218455591 pid=25052 nr=34
P: about to send SIGUSR1 to child...
P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
P: got notif: id=17773741941218455592 pid=25052 nr=34
P: about to send SIGUSR1 to child...
P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
P: got notif: id=17773741941218455593 pid=25052 nr=34
P: about to send SIGUSR1 to child...
P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
P: got notif: id=17773741941218455594 pid=25052 nr=34
P: about to send SIGUSR1 to child...
C: sigusr1_handler handler invoked
C: got pdeath signal on parent termination
C: about to terminate

/* Modified version of code from Jann Horn */

#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
    int seccomp_fd;
} *shared;

static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: sigusr1_handler handler invoked\n");
}

static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: got pdeath signal on parent termination\n");
    printf("C: about to terminate\n");
    exit(0);
}

static size_t
max_size(size_t a, size_t b)
{
    return (a > b) ? a : b;
}

int
main(void)
{
    setbuf(stdout, NULL);

    /* Allocate memory that will be shared by parent and child */

    shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        err(1, "mmap");
    shared->seccomp_fd = -1;

    /* glibc's clone() wrapper doesn't support fork()-style usage */
    /* Child process and parent share file descriptor table */
    pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
                          NULL, NULL, NULL, 0);
    if (child == -1)
        err(1, "clone");

    /* CHILD */

    if (child == 0) {
        /* don't outlive the parent */
        prctl(PR_SET_PDEATHSIG, SIGUSR2);
        if (getppid() == 1)
            exit(0);

        /* Install seccomp filter */

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns
        };
        int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (seccomp_ret < 0)
            err(1, "install");
        printf("C: installed seccomp: fd %d\n", seccomp_ret);

        /* Place the notifier FD number into the shared memory */

        __atomic_store(&shared->seccomp_fd, &seccomp_ret,
                       __ATOMIC_RELEASE);

        /* Wake the parent */

        int futex_ret =
            syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                    INT_MAX, NULL, NULL, 0);
        printf("C: woke %d waiters\n", futex_ret);

        /* Establish SA_RESTART handler for SIGUSR1 */

        struct sigaction act = {
            .sa_sigaction = sigusr1_handler,
            .sa_flags = SA_RESTART | SA_SIGINFO
        };
        if (sigaction(SIGUSR1, &act, NULL))
            err(1, "sigaction");

        struct sigaction act2 = {
            .sa_sigaction = sigusr2_handler,
            .sa_flags = 0
        };
        if (sigaction(SIGUSR2, &act2, NULL))
            err(1, "sigaction");

        /* Make a blocking system call */

        printf("C: About to call pause()\n");
        pause();
        perror("C: pause returned");

        exit(0);
    }

    /* PARENT */

    /* Wait for futex wake-up from child */

    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                            -1, NULL, NULL, 0);
    if (futex_ret == -1 && errno != EAGAIN)
        err(1, "futex wait");

    /* Get notification FD from the child */

    int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
    printf("\tP: child installed seccomp fd %d\n", fd);

    /* Discover seccomp buffer sizes and allocate notification buffer */

    struct seccomp_notif_sizes sizes;
    if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
        err(1, "notif_sizes");
    struct seccomp_notif *notif =
        malloc(max_size(sizeof(struct seccomp_notif),
                        sizes.seccomp_notif));
    if (!notif)
        err(1, "malloc");

    for (int i = 0; i < 4; i++) {
        printf("\tP: about to SECCOMP_IOCTL_NOTIF_RECV\n");
        memset(notif, '\0', sizes.seccomp_notif);
        if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
            err(1, "notif_recv");
        /* Cast: __u64 is not 'unsigned long long' on every architecture */
        printf("\tP: got notif: id=%llu pid=%u nr=%d\n",
               (unsigned long long) notif->id, notif->pid, notif->data.nr);
        sleep(1);
        printf("\tP: about to send SIGUSR1 to child...\n");
        kill(child, SIGUSR1);
    }
    sleep(1);

    exit(0);
}

====

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sat, 24 Oct 2020 08:46:28 +0000 (10:46 +0200)] 
seccomp_unotify.2: EXAMPLE: correct the check for NUL in buffer returned by read()

In the usual case, read(fd, buf, PATH_MAX) will return PATH_MAX
bytes that include trailing garbage after the pathname. So the
right check is to scan from the start of the buffer to see if
there's a NUL, and error if there is not.
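
A minimal sketch of such a check, assuming the supervisor has the
byte count returned by read() (buffer_has_nul() is a hypothetical
name, not the code in the page):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return true if the first 'nread' bytes of
   'buf' contain a NUL byte, that is, if 'buf' can safely be
   treated as a C string. */

static bool
buffer_has_nul(const char *buf, ssize_t nread)
{
    assert(buf != NULL);

    if (nread <= 0)
        return false;

    /* Scan from the start of the buffer; bytes after the first NUL
       are trailing garbage and are ignored */

    return memchr(buf, '\0', (size_t) nread) != NULL;
}
```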

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 18 Oct 2020 20:11:54 +0000 (22:11 +0200)] 
seccomp_unotify.2: Better handling of invalid target pathname

After some discussions with Jann Horn, perhaps a better way of
dealing with an invalid target pathname is to trigger an
error for the system call.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 16 Oct 2020 15:08:24 +0000 (17:08 +0200)] 
seccomp_unotify.2: EXAMPLE: rename a variable

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 16 Oct 2020 09:24:25 +0000 (11:24 +0200)] 
seccomp_unotify.2: EXAMPLE: Improve allocation of response buffer

From a conversation with Jann Horn:

[[
>>>>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
>>>
>>> This should probably do something like max(sizes.seccomp_notif_resp,
>>> sizeof(struct seccomp_notif_resp)) in case the program was built
>>> against new UAPI headers that make struct seccomp_notif_resp big, but
>>> is running under an old kernel where that struct is still smaller?
>>
>> I'm confused. Why? I mean, if the running kernel says that it expects
>> a buffer of a certain size, and we allocate a buffer of that size,
>> what's the problem?
>
> Because in userspace, we cast the result of malloc() to a "struct
> seccomp_notif_resp *". If the kernel tells us that it expects a size
> smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> pointer to a struct that consists partly of allocated memory, partly
> of out-of-bounds memory, which is generally a bad idea - I'm not sure
> whether the C standard permits that. And if userspace then e.g.
> decides to access some member of that struct that is beyond what the
> kernel thinks is the struct size, we get actual OOB memory accesses.
Got it. (But gosh, this seems like a fragile API mess.)

I added the following to the code:

    /* When allocating the response buffer, we must allow for the fact
       that the user-space binary may have been built with user-space
       headers where 'struct seccomp_notif_resp' is bigger than the
       response buffer expected by the (older) kernel. Therefore, we
       allocate a buffer that is the maximum of the two sizes. This
       ensures that if the supervisor places bytes into the response
       structure that are past the response size that the kernel expects,
       then the supervisor is not touching an invalid memory location. */

    size_t resp_size = sizes.seccomp_notif_resp;
    if (sizeof(struct seccomp_notif_resp) > resp_size)
        resp_size = sizeof(struct seccomp_notif_resp);

    struct seccomp_notif_resp *resp = malloc(resp_size);
]]

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 16 Oct 2020 09:02:08 +0000 (11:02 +0200)] 
seccomp_unotify.2: EXAMPLE: ensure path read() by the supervisor is null-terminated

From a conversation with Jann Horn:

    >> We should probably make sure here that the value we read is actually
    >> NUL-terminated?
    >
    > So, I was curious about that point also. But, (why) are we not
    > guaranteed that it will be NUL-terminated?

    Because it's random memory filled by another process, which we don't
    necessarily trust. While seccomp notifiers aren't usable for applying
    *extra* security restrictions, the supervisor will still often be more
    privileged than the supervised process.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 16 Oct 2020 08:58:38 +0000 (10:58 +0200)] 
seccomp_unotify.2: wfix in example program

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Fri, 16 Oct 2020 07:29:10 +0000 (09:29 +0200)] 
seccomp_unotify.2: Small wording fix

Change "read(2) will return 0" to "read(2) may return 0".

Quoting Jann Horn:

    Maybe make that "may return 0" instead of "will return 0" -
    reading from /proc/$pid/mem can only return 0 in the
    following cases AFAICS:

    1. task->mm was already gone at open() time
    2. mm->mm_users has dropped to zero (the mm only has lazytlb
       users; page tables and VMAs are being blown away or have
       been blown away)
    3. the syscall was called with length 0

    When a process has gone away, normally mm->mm_users will
    drop to zero, but someone else could theoretically still be
    holding a reference to the mm (e.g. someone else in the
    middle of accessing /proc/$pid/mem).  (Such references
    should normally not be very long-lived though.)

    Additionally, in the unlikely case that the OOM killer just
    chomped through the page tables of the target process, I
    think the read will return -EIO (same error as if the
    address was simply unmapped) if the address is within a
    non-shared mapping. (Maybe that's something procfs could do
    better...)

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 15 Oct 2020 11:33:27 +0000 (13:33 +0200)] 
seccomp_unotify.2: Minor wording change + add a FIXME

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 15 Oct 2020 10:27:33 +0000 (12:27 +0200)] 
seccomp_unotify.2: User-space notification can't be used to implement security policy

Add some strongly worded text warning the reader about the correct
uses of seccomp user-space notification.

Reported-by: Jann Horn <jannh@google.com>
Cowritten-by: Christian Brauner <christian@brauner.io>
Cowritten-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 14 Oct 2020 16:30:34 +0000 (18:30 +0200)] 
seccomp_unotify.2: Fixes after review comments from Christian Brauner

Reported-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 14 Oct 2020 06:05:15 +0000 (08:05 +0200)] 
seccomp.2, seccomp_unotify.2: Clarify that there can be only one SECCOMP_FILTER_FLAG_NEW_LISTENER

Reported-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 15 Oct 2020 08:14:09 +0000 (10:14 +0200)] 
seccomp_unotify.2: Note when FD indicates EOF/(E)POLLHUP in (e)poll/select

Verified by experiment.

Reported-by: Christian Brauner <christian.brauner@canonical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 14 Oct 2020 05:28:40 +0000 (07:28 +0200)] 
seccomp_unotify.2: Note when notification FD is indicated as writable by select/poll/epoll

Reported-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Sun, 4 Oct 2020 05:21:54 +0000 (07:21 +0200)] 
seccomp_unotify.2: Minor fixes

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Thu, 1 Oct 2020 09:33:16 +0000 (11:33 +0200)] 
seccomp_unotify.2: Fixes after review comments by Jann Horn

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 30 Sep 2020 20:32:46 +0000 (22:32 +0200)] 
seccomp_unotify.2: Add BUGS section describing SECCOMP_IOCTL_NOTIF_RECV bug

Tycho Andersen confirmed that this issue is present.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk [Wed, 30 Sep 2020 20:25:55 +0000 (22:25 +0200)] 
seccomp_unotify.2: srcfix: remove bogus FIXME

Pathname arguments are limited to PATH_MAX bytes.

Reported-by: Tycho Andersen <tycho@tycho.pizza>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp_unotify.2: Changes after feedback from Tycho Andersen
Michael Kerrisk [Wed, 30 Sep 2020 20:24:59 +0000 (22:24 +0200)] 
seccomp_unotify.2: Changes after feedback from Tycho Andersen

Reported-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp_unotify.2: Document the seccomp user-space notification mechanism
Michael Kerrisk [Mon, 28 Sep 2020 20:13:12 +0000 (22:13 +0200)] 
seccomp_unotify.2: Document the seccomp user-space notification mechanism

The APIs used by this mechanism comprise not only seccomp(2), but
also a number of ioctl(2) operations. And any useful example
demonstrating these APIs will necessarily be rather long.
Trying to cram all of this into the seccomp(2) page would make
that page unmanageably long. Therefore, let's document this
mechanism in a separate page.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Note that SECCOMP_RET_USER_NOTIF can be overridden
Michael Kerrisk [Thu, 15 Oct 2020 11:12:03 +0000 (13:12 +0200)] 
seccomp.2: Note that SECCOMP_RET_USER_NOTIF can be overridden

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: wfix: mention term "supervisor" in description of SECCOMP_RET_USER_NOTIF
Michael Kerrisk [Thu, 15 Oct 2020 11:11:08 +0000 (13:11 +0200)] 
seccomp.2: wfix: mention term "supervisor" in description of SECCOMP_RET_USER_NOTIF

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: SEE ALSO: add seccomp_unotify(2)
Michael Kerrisk [Mon, 28 Sep 2020 22:10:34 +0000 (00:10 +0200)] 
seccomp.2: SEE ALSO: add seccomp_unotify(2)

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Rework SECCOMP_GET_NOTIF_SIZES somewhat
Michael Kerrisk [Mon, 28 Sep 2020 07:42:38 +0000 (09:42 +0200)] 
seccomp.2: Rework SECCOMP_GET_NOTIF_SIZES somewhat

The existing text says the structures (plural!) contain a 'struct
seccomp_data'. But this is only true for the received notification
structure (seccomp_notif). So, reword the sentence to be more
general, noting simply that the structures may evolve over time.

Add some comments to the structure definition.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Add some details for SECCOMP_FILTER_FLAG_NEW_LISTENER
Michael Kerrisk [Sat, 26 Sep 2020 20:48:44 +0000 (22:48 +0200)] 
seccomp.2: Add some details for SECCOMP_FILTER_FLAG_NEW_LISTENER

Rework the description a little, and note that the close-on-exec
flag is set for the returned file descriptor.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Minor edits to Tycho's SECCOMP_FILTER_FLAG_NEW_LISTENER patch
Michael Kerrisk [Sat, 26 Sep 2020 13:45:45 +0000 (15:45 +0200)] 
seccomp.2: Minor edits to Tycho's SECCOMP_FILTER_FLAG_NEW_LISTENER patch

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Document SECCOMP_FILTER_FLAG_NEW_LISTENER
Tycho Andersen [Sat, 26 Sep 2020 13:42:36 +0000 (15:42 +0200)] 
seccomp.2: Document SECCOMP_FILTER_FLAG_NEW_LISTENER

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Reorder list of SECCOMP_SET_MODE_FILTER flags alphabetically
Michael Kerrisk [Sat, 26 Sep 2020 13:40:56 +0000 (15:40 +0200)] 
seccomp.2: Reorder list of SECCOMP_SET_MODE_FILTER flags alphabetically

(No content changes.)

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Some reworking of Tycho's SECCOMP_RET_USER_NOTIF patch
Michael Kerrisk [Sat, 26 Sep 2020 13:34:05 +0000 (15:34 +0200)] 
seccomp.2: Some reworking of Tycho's SECCOMP_RET_USER_NOTIF patch

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Document SECCOMP_RET_USER_NOTIF
Tycho Andersen [Sat, 26 Sep 2020 13:29:47 +0000 (15:29 +0200)] 
seccomp.2: Document SECCOMP_RET_USER_NOTIF

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Minor edits to Tycho Andersen's patch
Michael Kerrisk [Sat, 26 Sep 2020 13:18:38 +0000 (15:18 +0200)] 
seccomp.2: Minor edits to Tycho Andersen's patch

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoseccomp.2: Document SECCOMP_GET_NOTIF_SIZES
Tycho Andersen [Thu, 13 Dec 2018 00:11:05 +0000 (17:11 -0700)] 
seccomp.2: Document SECCOMP_GET_NOTIF_SIZES

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosocketcall.2: srcfix
Michael Kerrisk [Wed, 9 Jun 2021 22:26:48 +0000 (10:26 +1200)] 
socketcall.2: srcfix

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosocketcall.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 24 May 2021 18:19:47 +0000 (20:19 +0200)] 
socketcall.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosigprocmask.2: Use syscall(SYS_...); for raw system calls
Alejandro Colomar [Mon, 24 May 2021 18:19:46 +0000 (20:19 +0200)] 
sigprocmask.2: Use syscall(SYS_...); for raw system calls

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoshmop.2: Remove unused include
Alejandro Colomar [Mon, 24 May 2021 18:19:45 +0000 (20:19 +0200)] 
shmop.2: Remove unused include

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosgetmask.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 24 May 2021 18:19:44 +0000 (20:19 +0200)] 
sgetmask.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoset_tid_address.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 24 May 2021 18:19:43 +0000 (20:19 +0200)] 
set_tid_address.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoset_thread_area.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 24 May 2021 18:19:42 +0000 (20:19 +0200)] 
set_thread_area.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agort_sigqueueinfo.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 24 May 2021 18:19:39 +0000 (20:19 +0200)] 
rt_sigqueueinfo.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoopen.2: Remove unused <sys/stat.h>
Alejandro Colomar [Mon, 24 May 2021 18:19:38 +0000 (20:19 +0200)] 
open.2: Remove unused <sys/stat.h>

I can't see a reason to include it.  <fcntl.h> provides O_*
constants for 'flags', S_* constants for 'mode', and mode_t.

Probably a long time ago, some of those weren't defined in
<fcntl.h>, and both headers needed to be included, or maybe it's
a historical error.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
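
As an illustration of the point above, the sketch below calls open(2)
with both O_* flags and S_* mode bits while including only <fcntl.h>
(plus <unistd.h> for close(2)). The function name create_private_file()
is hypothetical, and the feature test macro is an assumption made so
that the S_* constants are exposed by <fcntl.h> on common C libraries:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>    /* open(), O_* flags, S_* mode bits, mode_t */
#include <unistd.h>   /* close() */

/* Hypothetical helper: create (or truncate) a file readable and
 * writable only by its owner.  Returns 0 on success, -1 on error. */
int create_private_file(const char *path)
{
    mode_t mode = S_IRUSR | S_IWUSR;    /* i.e., 0600 */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);

    if (fd == -1)
        return -1;
    return close(fd);
}
```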
4 years agosystem_data_types.7: Minor enhancement of description of mode_t
Michael Kerrisk [Wed, 9 Jun 2021 21:37:34 +0000 (09:37 +1200)] 
system_data_types.7: Minor enhancement of description of mode_t

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agomode_t.3: New link to system_data_types(7)
Alejandro Colomar [Sun, 23 May 2021 11:22:13 +0000 (13:22 +0200)] 
mode_t.3: New link to system_data_types(7)

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosystem_data_types.7: Add 'mode_t'
Alejandro Colomar [Sun, 23 May 2021 11:22:12 +0000 (13:22 +0200)] 
system_data_types.7: Add 'mode_t'

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoblksize_t.3: New link to system_data_types(7)
Alejandro Colomar [Sun, 23 May 2021 11:22:11 +0000 (13:22 +0200)] 
blksize_t.3: New link to system_data_types(7)

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosystem_data_types.7: Add 'blksize_t'
Alejandro Colomar [Sun, 23 May 2021 11:22:10 +0000 (13:22 +0200)] 
system_data_types.7: Add 'blksize_t'

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agocc_t.3: New link to system_data_types(7)
Alejandro Colomar [Sun, 23 May 2021 11:22:09 +0000 (13:22 +0200)] 
cc_t.3: New link to system_data_types(7)

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosystem_data_types.7: Add 'cc_t'
Alejandro Colomar [Sun, 23 May 2021 11:22:08 +0000 (13:22 +0200)] 
system_data_types.7: Add 'cc_t'

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoblkcnt_t.3: New link to system_data_types(7)
Alejandro Colomar [Sun, 23 May 2021 11:22:07 +0000 (13:22 +0200)] 
blkcnt_t.3: New link to system_data_types(7)

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosystem_data_types.7: Add 'blkcnt_t'
Alejandro Colomar [Sun, 23 May 2021 11:22:06 +0000 (13:22 +0200)] 
system_data_types.7: Add 'blkcnt_t'

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agokernel_lockdown.7: Remove additional text alluding to lifting via SysRq
dann frazier [Mon, 7 Jun 2021 22:19:43 +0000 (16:19 -0600)] 
kernel_lockdown.7: Remove additional text alluding to lifting via SysRq

My previous patch intended to drop the docs for the lockdown lift
SysRq, but it missed this other section that refers to lifting it
via a keyboard - an allusion to that same SysRq.

Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agokernel_lockdown.7: Remove description of lifting via SysRq (not upstream)
dann frazier [Thu, 27 May 2021 07:13:42 +0000 (09:13 +0200)] 
kernel_lockdown.7: Remove description of lifting via SysRq (not upstream)

The patch that implemented lockdown lifting via SysRq ended up
getting dropped[*] before the feature was merged upstream. Having
the feature documented but unsupported has caused some confusion
for our users.

[*] http://archive.lwn.net:8080/linux-kernel/CACdnJuuxAM06TcnczOA6NwxhnmQUeqqm3Ma8btukZpuCS+dOqg@mail.gmail.com/

Signed-off-by: dann frazier <dann.frazier@canonical.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Pedro Principeza <pedro.principeza@canonical.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Kyle McMartin <kyle@redhat.com>
Cc: Matthew Garrett <mjg59@google.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoMakefile, README: Break installation into a target for each mandir
Alejandro Colomar [Wed, 9 Jun 2021 17:01:08 +0000 (19:01 +0200)] 
Makefile, README: Break installation into a target for each mandir

Instead of having a monolithic 'make install', break it into
multiple targets such as 'make install-man3'.  This simplifies
packaging, for example in Debian, where they break this project
into several packages: 'manpages' and 'manpages-dev', each
containing different mandirs.

The above allows for parallel installation: 'make -j'

Also, don't overwrite files that don't need to be overwritten, by
having a target for files, which makes use of make's timestamp
comparison.

This allows for much faster installation times.

For comparison, on my laptop (i7-8850H; 6C/12T):

Old Makefile:
~/src/linux/man-pages$ time sudo make >/dev/null

real 0m7.509s
user 0m5.269s
sys 0m2.614s

The times with the old Makefile varied a lot, between
5 and 10 seconds.  The times after applying this patch
are much more consistent.  BTW, I compared these times to
the very old Makefile of man-pages-5-09, and those were
around 3.5 s, so the slowdown was partly my fault, from
when I changed the Makefile some weeks ago.

New Makefile (full clean install):
~/src/linux/man-pages$ time sudo make >/dev/null

real 0m5.160s
user 0m4.326s
sys 0m1.137s
~/src/linux/man-pages$ time sudo make -j2 >/dev/null

real 0m1.602s
user 0m2.529s
sys 0m0.289s
~/src/linux/man-pages$ time sudo make -j >/dev/null

real 0m1.398s
user 0m2.502s
sys 0m0.281s

Here we can see that 'make -j' drops times drastically,
compared to the old monolithic Makefile.  Not only that,
but since only a few pages are usually involved when we
are working on the man pages, times will be even better.

Here are some times with a single page changed (touched):

New Makefile (one page touched):
~/src/linux/man-pages$ touch man2/membarrier.2
~/src/linux/man-pages$ time sudo make install
- INSTALL /usr/local/share/man/man2/membarrier.2

real 0m0.988s
user 0m0.966s
sys 0m0.025s
~/src/linux/man-pages$ touch man2/membarrier.2
~/src/linux/man-pages$ time sudo make install -j
- INSTALL /usr/local/share/man/man2/membarrier.2

real 0m0.989s
user 0m0.943s
sys 0m0.049s

Also, modify the output of the make install and uninstall commands
so that a line is output for each file or directory that is
installed, similarly to the kernel's Makefile.  This doesn't apply
to html targets, which haven't been changed in this commit.

Also, make sure that for each invocation of $(INSTALL_DIR), no
parents are created (i.e., avoid `mkdir -p` behavior).  The GNU
make manual states that `mkdir -p` can create race conditions.
Instead, declare as a prerequisite for each directory its parent
directory, and let make resolve the order of creation.

Also, use ':=' instead of '=' to improve performance, by
evaluating each assignment only once.

Ensure that the shell is not called when not needed, by removing
all ";" and quotes in the commands.

See also: <https://stackoverflow.com/q/67862417/6872717>

Specify conventions and rationales used in the Makefile in a comment.

Add copyright.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosetresuid.2: tfix (Oxford comma)
Michael Kerrisk [Fri, 21 May 2021 08:19:28 +0000 (20:19 +1200)] 
setresuid.2: tfix (Oxford comma)

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoselect.2: Strengthen the warning regarding the low value of FD_SETSIZE
Michael Kerrisk [Wed, 19 May 2021 21:51:18 +0000 (09:51 +1200)] 
select.2: Strengthen the warning regarding the low value of FD_SETSIZE

All modern code should avoid select(2) in favor of poll(2)
or epoll(7).

For a long history of this problem, see:

https://marc.info/?l=bugtraq&m=110660879328901
    List:       bugtraq
    Subject:    SECURITY.NNOV: Multiple applications fd_set structure bitmap array index overflow
    From:       3APA3A <3APA3A () security ! nnov ! ru>
    Date:       2005-01-24 20:30:08

https://sourceware.org/legacy-ml/libc-alpha/2003-05/msg00171.html
    User-settable FD_SETSIZE and select()
    From: mtk-lists at gmx dot net
    To: libc-alpha at sources dot redhat dot com
    Date: Mon, 19 May 2003 14:49:03 +0200 (MEST)
    Subject: User-settable FD_SETSIZE and select()

https://sourceware.org/bugzilla/show_bug.cgi?id=10352

http://0pointer.net/blog/file-descriptor-limits.html
https://twitter.com/pid_eins/status/1394962183033868292

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
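
For reference, poll(2) takes an array of descriptors instead of a
fixed-size bitmap, so it is not subject to the FD_SETSIZE limit. A
minimal sketch of waiting for one descriptor to become readable follows;
the helper name wait_readable() is hypothetical:

```c
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to
 * become readable.
 * Returns 1 if a read would not block (data, hangup, or error pending),
 * 0 on timeout, and -1 on failure or if fd is not an open descriptor. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0)
        return n;               /* 0: timeout; -1: error, errno is set */
    if (pfd.revents & POLLNVAL)
        return -1;              /* fd was not open */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}
```

Unlike an fd_set, the pollfd array can be sized by the caller, so the
numeric value of the descriptor does not matter.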
4 years agoselect.2: Relocate sentence about the fd_set value-result arguments to BUGS
Michael Kerrisk [Wed, 19 May 2021 21:49:09 +0000 (09:49 +1200)] 
select.2: Relocate sentence about the fd_set value-result arguments to BUGS

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosched_setattr.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Sat, 15 May 2021 18:20:28 +0000 (20:20 +0200)] 
sched_setattr.2: Use syscall(SYS_...); for system calls without a wrapper

Document also why each header is required

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agos390_sthyi.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Sat, 15 May 2021 18:20:27 +0000 (20:20 +0200)] 
s390_sthyi.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agos390_sthyi.2: Replace numeric constant by its name (macro)
Alejandro Colomar [Sat, 15 May 2021 18:20:26 +0000 (20:20 +0200)] 
s390_sthyi.2: Replace numeric constant by its name (macro)

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agos390_runtime_instr.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Sat, 15 May 2021 18:20:25 +0000 (20:20 +0200)] 
s390_runtime_instr.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agos390_pci_mmio_write.2: Use syscall(SYS_...); for system calls without a wrapper;...
Alejandro Colomar [Sat, 15 May 2021 18:20:24 +0000 (20:20 +0200)] 
s390_pci_mmio_write.2: Use syscall(SYS_...); for system calls without a wrapper; fix includes too

This function doesn't use any flags or special types, so there's
no reason to include <asm/unistd.h>; remove it.  Add the includes
needed for syscall(2) only.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agos390_guarded_storage.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Sat, 15 May 2021 18:20:23 +0000 (20:20 +0200)] 
s390_guarded_storage.2: Use syscall(SYS_...); for system calls without a wrapper

Also document why each header is needed.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agorename.2: ffix
Alejandro Colomar [Sat, 15 May 2021 18:20:21 +0000 (20:20 +0200)] 
rename.2: ffix

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agomemfd_create.2, mmap.2, shmget.2: Document the EPERM for huge page allocations
Michael Kerrisk [Mon, 17 May 2021 03:31:08 +0000 (15:31 +1200)] 
memfd_create.2, mmap.2, shmget.2: Document the EPERM for huge page allocations

This error can occur if the caller does not have CAP_IPC_LOCK
and is not a member of the sysctl_hugetlb_shm_group.

Reported-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
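
A caller could detect this condition along the following lines. This is
only a sketch: the helper names try_hugetlb_shm() and
explain_hugetlb_error() are hypothetical, and it exercises just the
shmget(2) case with SHM_HUGETLB. Other errors (for example, ENOMEM when
no huge pages are configured) are equally possible:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* Linux value; assumed if the header hides it */
#endif

/* Hypothetical helper: try to create (and immediately remove) a
 * huge-page System V shared memory segment of the given size.
 * Returns 0 on success, or the errno value reported by shmget(). */
int try_hugetlb_shm(size_t size)
{
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | SHM_HUGETLB | 0600);

    if (id == -1)
        return errno;
    shmctl(id, IPC_RMID, NULL);     /* clean up the test segment */
    return 0;
}

/* Hypothetical helper: map the result to a human-readable hint. */
const char *explain_hugetlb_error(int err)
{
    switch (err) {
    case 0:
        return "huge page segment created";
    case EPERM:
        return "EPERM: caller lacks CAP_IPC_LOCK and is not in the "
               "group set by sysctl_hugetlb_shm_group";
    default:
        return "other failure (for example, no huge pages configured)";
    }
}
```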
4 years agoproc.5: Document /proc/sys/vm/sysctl_hugetlb_shm_group
Michael Kerrisk [Mon, 17 May 2021 03:18:10 +0000 (15:18 +1200)] 
proc.5: Document /proc/sys/vm/sysctl_hugetlb_shm_group

As a deprecated feature, it appears that RLIMIT_MEMLOCK
can also be used to permit huge page allocation, but let's
not document that for now.

In Linux 5.12, see fs/hugetlbfs/inode.c:

static int can_do_hugetlb_shm(void)
{
        kgid_t shm_group;
        shm_group = make_kgid(&init_user_ns, sysctl_hugetlb_shm_group);
        return capable(CAP_IPC_LOCK) || in_group_p(shm_group);
}

...

struct file *hugetlb_file_setup(const char *name, size_t size,
                                vm_flags_t acctflag, struct user_struct **user,
                                int creat_flags, int page_size_log)
{
        ...
        if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
                *user = current_user();
                if (user_shm_lock(size, *user)) {
                        task_lock(current);
                        pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
                                current->comm, current->pid);
                        task_unlock(current);
                } else {
                        *user = NULL;
                        return ERR_PTR(-EPERM);
                }
        }
        ...
}

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agocapabilities.7: CAP_IPC_LOCK also governs memory allocation using huge pages
Michael Kerrisk [Mon, 17 May 2021 02:08:37 +0000 (14:08 +1200)] 
capabilities.7: CAP_IPC_LOCK also governs memory allocation using huge pages

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoconsole_codes.4: tfix
Alejandro Colomar [Thu, 13 May 2021 15:21:55 +0000 (17:21 +0200)] 
console_codes.4: tfix

The correct meaning of SGR is "Select Graphic Rendition".

See:
<https://www.ecma-international.org/wp-content/uploads/ECMA-48_5th_edition_june_1991.pdf>
<https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub86.pdf>
<https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_(Select_Graphic_Rendition)_parameters>

Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoctime.3: Restore documentation of 'tm_gmtoff' field
Michael Kerrisk [Mon, 17 May 2021 00:56:46 +0000 (12:56 +1200)] 
ctime.3: Restore documentation of 'tm_gmtoff' field

Accidentally deleted in commit ba39b288ab0714941786.

Reported-by: Katsuhiro Numata <byakkomon@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoREADME: Update installation path
Alejandro Colomar [Tue, 11 May 2021 06:53:00 +0000 (08:53 +0200)] 
README: Update installation path

The installation path was changed recently (See 'prefix' in the
Makefile).  I forgot to update the README with those changes.
Fix it.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoexpm1.3: tfix
Akihiro Motoki [Tue, 11 May 2021 10:11:47 +0000 (12:11 +0200)] 
expm1.3: tfix

Signed-off-by: Akihiro Motoki <amotoki@gmail.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agosigvec.3: tfix
Akihiro Motoki [Tue, 11 May 2021 10:11:46 +0000 (12:11 +0200)] 
sigvec.3: tfix

Signed-off-by: Akihiro Motoki <amotoki@gmail.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agocapabilities.7: ffix
Akihiro Motoki [Tue, 11 May 2021 10:11:45 +0000 (12:11 +0200)] 
capabilities.7: ffix

Signed-off-by: Akihiro Motoki <amotoki@gmail.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agottyslot.3: tfix
Akihiro Motoki [Tue, 11 May 2021 10:11:44 +0000 (12:11 +0200)] 
ttyslot.3: tfix

Signed-off-by: Akihiro Motoki <amotoki@gmail.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agotgamma.3: tfix
Akihiro Motoki [Tue, 11 May 2021 10:11:43 +0000 (12:11 +0200)] 
tgamma.3: tfix

Signed-off-by: Akihiro Motoki <amotoki@gmail.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agogetdents.2: ffix
Michael Kerrisk [Tue, 11 May 2021 05:21:35 +0000 (17:21 +1200)] 
getdents.2: ffix

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoreboot.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 10 May 2021 17:55:48 +0000 (19:55 +0200)] 
reboot.2: Use syscall(SYS_...); for system calls without a wrapper

Explain also why headers are needed.
And some ffix.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoreadlink.2: ffix
Alejandro Colomar [Mon, 10 May 2021 17:55:47 +0000 (19:55 +0200)] 
readlink.2: ffix

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoreaddir.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 10 May 2021 17:55:46 +0000 (19:55 +0200)] 
readdir.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoquotactl.2: Better detail why <xfs/xqm.h> is included
Alejandro Colomar [Mon, 10 May 2021 17:55:45 +0000 (19:55 +0200)] 
quotactl.2: Better detail why <xfs/xqm.h> is included

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoprocess_madvise.2: Use syscall(SYS_...); for system calls without a wrapper. Fix...
Alejandro Colomar [Mon, 10 May 2021 17:55:44 +0000 (19:55 +0200)] 
process_madvise.2: Use syscall(SYS_...); for system calls without a wrapper. Fix includes too.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agopoll.2: Remove <signal.h>
Alejandro Colomar [Mon, 10 May 2021 17:55:43 +0000 (19:55 +0200)] 
poll.2: Remove <signal.h>

It is only used for providing 'sigset_t'.  We're only documenting
(with some exceptions) the includes needed for constants and the
prototype itself.  And 'sigset_t' is better documented in
system_data_types(7).  Remove that include.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agopivot_root.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 10 May 2021 17:55:42 +0000 (19:55 +0200)] 
pivot_root.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agopipe.2: wfix
Alejandro Colomar [Mon, 10 May 2021 17:55:41 +0000 (19:55 +0200)] 
pipe.2: wfix

For consistency with other pages.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agopidfd_send_signal.2: Use syscall(SYS_...); for system calls without a wrapper. Fix...
Alejandro Colomar [Mon, 10 May 2021 17:55:40 +0000 (19:55 +0200)] 
pidfd_send_signal.2: Use syscall(SYS_...); for system calls without a wrapper. Fix includes too

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agopidfd_open.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 10 May 2021 17:55:39 +0000 (19:55 +0200)] 
pidfd_open.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agopidfd_getfd.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 10 May 2021 17:55:38 +0000 (19:55 +0200)] 
pidfd_getfd.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4 years agoperf_event_open.2: Use syscall(SYS_...); for system calls without a wrapper
Alejandro Colomar [Mon, 10 May 2021 17:55:37 +0000 (19:55 +0200)] 
perf_event_open.2: Use syscall(SYS_...); for system calls without a wrapper

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>