When we deleted cgroups for unprivileged containers we used to allocate a new
mapping and clone a new user namespace each time we delete a cgroup. This of
course meant - on a cgroup v1 system - doing this >= 10 times when all
controllers were used. Let's not to do this and only allocate and establish a
mapping once.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When fully unprivileged users run a container that only maps their own {g,u}id
and they do not have access to setuid new{g,u}idmap binaries we will write the
idmapping directly. This however requires us to write "deny" to
/proc/[pid]/setgroups otherwise any write to /proc/[pid]/gid_map will be
denied.
On a sidenote, this patch enables fully unprivileged containers. If you now set
lxc.net.[i].type = empty no privilege whatsoever is required to run a container.
Enhances #2033.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Felix Abecassis <fabecassis@nvidia.com> Cc: Jonathan Calmels <jcalmels@nvidia.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Serge Hallyn [Thu, 4 Jan 2018 03:02:53 +0000 (21:02 -0600)]
configure.ac: fix the check for static libcap
The existing check doesn't work, because when you statically
link a program against libc, any functions not called are not
included. So cap_init() which we check for is not there in
the built binary.
So instead just check whether a "gcc -lcap -static" works.
If libcap.a is not available it will fail, if it is it will
succeed.
mainloop: capture output of short-lived init procs
The handler for the signal fd will detect when the init process of a container
has exited and cause the mainloop to close. However, this can happen before the
console handlers - or any other events for that matter - are handled. So in the
case of init exiting we still need to allow for all buffered input to the
console to be handled before exiting. This allows us to capture output from
short-lived init processes.
This is conceptually equivalent to my implementation of ExecReaderToChannel()
https://github.com/lxc/lxd/blob/master/shared/util_linux.go#L527
Closes #1694.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When set set{u,g}id() the kernel will make us undumpable. This is unnecessary
since we can guarantee that whatever is running inside the child process at
this point this is fully trusted by the parent. Making us dumpable let's users
use debuggers on the child process before the exec as well and also allows us
to open /proc/<child-pid> files in lieu of the child.
Note, that we only need to perform the prctl(PR_SET_DUMPABLE, ...) if our
effective uid on the host is not 0. If our effective uid on the host is 0 then
we will keep all capabilities in the child user namespace across set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Receive fd for LSM security module before we set{g,u}id(). The reason is that
on set{g,u}id() the kernel will a) make us undumpable and b) we will change our
effective uid. This means our effective uid will be different from the
effective uid of the process that created us which means that this processs no
longer has capabilities in our namespace including CAP_SYS_PTRACE. This means
we will not be able to read and /proc/<pid> files for the process anymore when
/proc is mounted with hidepid={1,2}. So let's get the lsm label fd before the
set{g,u}id().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
At this point, macros such DEBUG or ERROR does not take effect because
this code is called from cgroup_ops_init(cgroup.c), which runs with
__attribute__((constructor)), before any log level is set form any tool
like lxc-start, so these messages are lost.
For now on, use the same LXC_DEBUG_CGFSNG environment variable to
control these messages.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
This is based on raw_clone in systemd but adapted to our needs. The main reason
is that we need an implementation of fork()/clone() that does guarantee us that
no pthread_atfork() handlers are run. While clone() in glibc currently doesn't
run pthread_atfork() handlers we should be fine but there's no guarantee that
this won't be the case in the future. So let's do the syscall directly - or as
direct as we can. An additional nice feature is that we get fork() behavior,
i.e. lxc_raw_clone() returns 0 in the child and the child pid in the parent.
Our implementation tries to make sure that we cover all cases according to
kernel sources. Note that we are not interested in any arguments that could be
passed after the stack.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When we report STOPPED to a caller and then close the command socket it is
technically possible - and I've seen this happen on the test builders - that a
container start() right after a wait() will receive ECONNREFUSED because it
called open() before we close(). So for all new state clients simply close the
command socket. This will inform all state clients that the container is
STOPPED and also prevents a race between a open()/close() on the command socket
causing a new process to get ECONNREFUSED because we haven't yet closed the
command socket.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Tycho Andersen [Fri, 8 Dec 2017 23:23:26 +0000 (23:23 +0000)]
init: don't kill(-1) if we aren't in a pid ns
...otherwise we'll kill everyone on the machine. Instead, let's explicitly
try to kill our children. Let's do a best effort against fork bombs by
disabling forking via the pids cgroup if it exists. This is best effort for
a number of reasons:
* the pids cgroup may not be available
* the container may have bind mounted /dev/null over pids.max, so the write
doesn't do anything
Prior to this patch we raced with a very short-lived init process. Essentially,
the init process could exit before we had time to record the cgroup namespace
causing the container to abort and report ABORTING to the caller when it
actually started just fine. Let's not do this.
(This uses syscall(SYS_getpid) in the the child to retrieve the pid just in case
we're on an older glibc version and we end up in the namespace sharing branch
of the actual lxc_clone() call.)
Additionally this fixes the shortlived tests. They were faulty so far and
should have actually failed because of the cgroup namespace recording race but
the ret variable used to return from the function was not correctly
initialized. This fixes it.
Furthermore, the shortlived tests used the c->error_num variable to determine
success or failure but this is actually not correct when the container is
started daemonized.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
In the case the container has a console with a valid slave pty file descriptor
we duplicate std{in,out,err} to the slave file descriptor so console logging
works correctly. When the container does not have a valid slave pty file
descriptor for its console and is started daemonized we should dup to
/dev/null.
Closes #1646.
Signed-off-by: Li Feng <lifeng68@huawei.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
we made std{err,in,out} a duplicate of the slave file descriptor of the console
if it existed. This meant we also duplicated all of them when we executed
application containers in the foreground even if some std{err,in,out} file
descriptor did not refer to a {p,t}ty. This blocked use cases such as:
echo foo | lxc-execute -n -- cat
which are very valid and common with application containers but less common
with system containers where we don't have to care about this. So my suggestion
is to unconditionally duplicate std{err,in,out} to the console file descriptor
if we are either running daemonized - this ensures that daemonized application
containers with a single bash shell keep on working - or when we are not
running an application container. In other cases we only duplicate those file
descriptors that actually refer to a {p,t}ty. This logic is similar to what we
do for lxc-attach already.
Refers to #1690.
Closes #2028.
Reported-by: Felix Abecassis <fabecassis@nvidia.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Detaching network namespaces as an unprivileged user is currently not possible
and attaching to the user namespace will mean we are not allowed to move the
network device into an ancestor network namespace.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
tools: block using lxc-execute without config file
Moving away from internal symbols we can't do hacks like we currently do in
lxc-start and call internal functions like lxc_conf_init(). This is unsafe
anyway. Instead, we should simply error out if the user didn't give us a
configuration file to use. lxc-start refuses to start in that case already.
Relates to discussion in https://github.com/lxc/go-lxc/pull/96#discussion_r155075560 .
Closes #2023.
Reported-by: Felix Abecassis <fabecassis@nvidia.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
It doesn't make sense to error out when an app container doesn't pass explicit
arguments through c->start{l}(). This is especially true since we implemented
lxc.execute.cmd. However, even before we could have always relied on
lxc.init.cmd and errored out after that.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Callers can then make a decision whether they want to consider the peer closing
the connection an error or not. For example, a c->wait(c, "STOPPED", -1) call
can then consider a ECONNRESET not an error but rather see it - correctly - as
a container exiting before being able to register a state client.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Take the lock on the list after we've done all necessary work and check state.
If we are in requested state, do cleanup and return without adding the state
client to the state client list.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>