Shukui Yang [Fri, 16 Feb 2018 04:16:40 +0000 (23:16 -0500)]
confile: add "force" to cgroup:{mixed,ro,rw}
This lets users specify
lxc.mount.auto = cgroup:mixed:force
or
lxc.mount.auto = cgroup:ro:force
or
lxc.mount.auto = cgroup:rw:force
When cgroup namespaces are supported LXC will not mount cgroups for the
container since it assumes that the init system will mount cgroups itself if it
wants to. This assumption already broke when users wanted to run containers
without CAP_SYS_ADMIN. For example, systemd based containers wouldn't start
since systemd needs to mount cgroups (named systemd hierarchy for legacy
cgroups and the unified hierarchy for unified cgroups) to track processes. This
problem was solved by detecting whether the container had CAP_SYS_ADMIN. If it
didn't we performed the cgroup mounts for it.
However, there are more cases when we should be able to mount cgroups for the
container when cgroup namespaces are supported:
- init systems not mounting cgroups themselves:
An init system that doesn't mount cgroups would not have cgroups available
especially when combined with custom LSM profiles to prevent cgroup
{u}mount()ing inside containers.
- application containers:
Application containers will usually not mount cgroups themselves.
- read-only cgroups:
It is useful to be able to mount cgroups read-only to e.g. prevent
changing cgroup limits from inside the container while at the same time
allowing the applications to perform introspection on their own cgroups. This
again is mostly useful for application containers. System containers running
systemd will usually not work correctly when cgroups are mounted read-only.
To be fair, all of those use-cases could be covered by custom hooks or
lxc.mount.entry entries but exposing it through lxc.mount.auto takes care of
setting correct mount options and adding the necessary logic to e.g. mount
filesystems read-only correctly.
Currently we only extend this to cgroup:{mixed,ro,rw} but technically there's
no reason not to enable the same behavior for cgroup-full:{mixed,ro,rw} as
well. If someone requests this we can simply treat it as a bug and add "force"
for cgroup-full.
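The decision described above can be sketched roughly as follows. This is only an illustration of the logic in this message, not the code in the patch: token_has_force() and should_mount_cgroups() are hypothetical names, and the assumption that cgroups are always mounted when cgroup namespaces are unavailable is inferred from the first paragraph.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when an lxc.mount.auto token such as
 * "cgroup:ro:force" ends in ":force". */
static bool token_has_force(const char *token)
{
	static const char suffix[] = ":force";
	size_t slen = sizeof(suffix) - 1;
	size_t tlen;

	assert(token);
	tlen = strlen(token);
	return tlen >= slen && strcmp(token + tlen - slen, suffix) == 0;
}

/* Hypothetical helper: should the runtime perform the cgroup mounts
 * on behalf of the container? */
static bool should_mount_cgroups(bool cgns_supported, bool has_cap_sys_admin,
				 bool force)
{
	if (!cgns_supported)
		return true;	/* assumed: without cgroup namespaces we mount */
	if (!has_cap_sys_admin)
		return true;	/* the container's init cannot mount by itself */
	return force;		/* otherwise only when explicitly requested */
}
```

With cgroup namespaces and CAP_SYS_ADMIN both present, only the ":force" suffix makes the second helper return true.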
Replaces #2136.
Signed-off-by: Shukui Yang <yangshukui@huawei.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
If the handler closes the file descriptor for the peer or master fd it is
crucial that we mark it as -EBADF. This will prevent lxc_console_delete()
from calling close() on an already closed file descriptor again. I've
observed the double close in the attach code.
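A minimal sketch of the pattern, assuming a helper of our own invention (close_and_mark() is a hypothetical name, not a function from the tree): close the descriptor once and overwrite it with -EBADF, so that any later cleanup pass sees a negative value and skips it.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: close *fd if it still refers to an open
 * descriptor, then mark it invalid so later cleanup code skips it. */
static void close_and_mark(int *fd)
{
	assert(fd);
	if (*fd < 0)
		return;		/* already closed and marked */
	close(*fd);
	*fd = -EBADF;
}
```

Calling the helper a second time on the same variable is a no-op, which is what rules out the double close.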
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
If a file descriptor fd is passed to fdopen() and associated with a stream f,
it will **not** have been dup()ed. This means that fclose(f) will also close
the fd. So never call close(fd) after fdopen(fd) has succeeded.
This fixes a double close() Stéphane and I observed when debugging on aarch64
and armhf.
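The ownership rule can be illustrated with a small sketch. write_and_close() is a hypothetical helper written for this note, not code from the patch: once fdopen() succeeds the stream owns the descriptor and fclose() is the only close that is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write msg to fd through a stdio stream and
 * release everything. Returns 0 on success, -1 on failure. The fd is
 * closed exactly once on every path. */
static int write_and_close(int fd, const char *msg)
{
	FILE *f;
	int ret;

	assert(msg);
	f = fdopen(fd, "w");
	if (!f) {
		close(fd);	/* fdopen() failed: fd is still ours to close */
		return -1;
	}

	ret = fputs(msg, f) == EOF ? -1 : 0;
	if (fclose(f) != 0)	/* closes fd as well; no close(fd) afterwards */
		ret = -1;
	return ret;
}
```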
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Tycho Andersen [Fri, 9 Feb 2018 13:26:31 +0000 (13:26 +0000)]
fix userns helper error handling
In both of these cases if there is actually an error, we won't close the
pipe and the api call will hang. Instead, let's be sure to close the pipe
before waiting, so that it doesn't hang.
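A rough sketch of the failure mode, under the assumption that the helper blocks in read() on a pipe until the parent writes or closes its end. run_helper() and its fail flag are made up for illustration and do not mirror the real userns helper code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical example: fork a helper that waits on a pipe. `fail`
 * simulates an error in the parent before it would have written.
 * Returns the helper's exit code, or -1 on failure. */
static int run_helper(bool fail)
{
	int p[2], status;
	pid_t pid;

	if (pipe(p) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	if (pid == 0) {
		char c;

		close(p[1]);
		/* read() returns 0 (EOF) once every write end is closed */
		_exit(read(p[0], &c, 1) == 1 ? 0 : 1);
	}

	close(p[0]);
	if (!fail && write(p[1], "x", 1) != 1)
		fail = true;

	/* Close before waiting on every path. Waiting first on the error
	 * path would leave the helper blocked in read() forever. */
	close(p[1]);

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```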
Serge Hallyn [Thu, 8 Feb 2018 19:04:23 +0000 (13:04 -0600)]
Restore most cases of am_guest_unpriv
The only cases where we really need to be privileged with respect
to the host is when we are trying to mknod, and in some cases
to do with a physical network device. This patch leaves the
detection of the network device cases as a TODO.
This should fix the currently broken case of starting a privileged
container with at least one veth nic, nested inside an unprivileged
container.
Issues fixed:
- lxc-centos died because of a missing /run directory
- lxc-centos complained about some config files it couldn't modify
- the new container got stuck at startup time for a minute
(literally), waiting for systemd-remount-fs startup script
Of course it still works for RHEL 6, CentOS 6 and 7 as well. I did not
verify earlier CentOS or RHEL releases.
Signed-off-by: Harald Dunkel <harald.dunkel@aixigo.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Tycho Andersen [Mon, 5 Feb 2018 14:17:48 +0000 (14:17 +0000)]
monitor: send SIGTERM to the container when SIGHUP is received
For the ->execute() case, we want to make sure the application dies when
SIGHUP is received. The next patch will ignore SIGHUP in the lxc monitor,
because tasks inside the container send SIGHUP to init to have it reload
its config sometimes, and we don't want to do that with init.lxc, since it
might actually kill the container if it forwards SIGHUP to the child and
the child can't handle it.
Tycho Andersen [Fri, 26 Jan 2018 21:21:51 +0000 (21:21 +0000)]
better unprivileged detection
In particular, if we are already in a user namespace we are unprivileged,
and doing things like moving the physical nics back to the host netns won't
work. Let's do the same thing LXD does if euid == 0: inspect
/proc/self/uid_map and see what that says.
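One way such a check could look, as a sketch only. It assumes that the initial user namespace shows the single identity mapping "0 0 4294967295" in /proc/self/uid_map and that anything else means we are inside a user namespace. All three function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* True if `line` is the identity mapping of the full 32-bit uid range,
 * which is what the initial user namespace is assumed to show. */
static bool is_initial_userns_map(const char *line)
{
	unsigned long long inside, outside, count;

	assert(line);
	if (sscanf(line, "%llu %llu %llu", &inside, &outside, &count) != 3)
		return false;
	return inside == 0 && outside == 0 && count == 4294967295ULL;
}

/* Inspect the first line of /proc/self/uid_map. */
static bool in_user_namespace(void)
{
	char line[128];
	bool initial;
	FILE *f = fopen("/proc/self/uid_map", "r");

	if (!f)
		return false;	/* assumed: no uid_map means no user namespaces */
	initial = fgets(line, sizeof(line), f) && is_initial_userns_map(line);
	fclose(f);
	return !initial;
}

/* euid 0 only counts as privileged outside of a user namespace. */
static bool am_unprivileged(void)
{
	return geteuid() != 0 || in_user_namespace();
}
```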
Tycho Andersen [Fri, 26 Jan 2018 17:43:12 +0000 (17:43 +0000)]
better check for lock dir
Consider the case where we're running in a user namespace but in the host's
mount ns with the host's filesystem (something like
lxc-usernsexec ... lxc-execute ...), in this case, we'll be euid 0, but we
can't actually write to /run. Let's improve this locking check to make sure
we can actually write to /run before we decide to actually use it as our
locking dir.
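As a sketch of the idea, not of the actual patch: instead of trusting euid == 0, probe whether the directory is writable and fall back otherwise. pick_lock_dir() and its fallback parameter are hypothetical, and access() is just one possible way to do the probe.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: use `preferred` (e.g. "/run") as the lock dir
 * only if we can actually write to it, otherwise use `fallback`.
 * Note that access() checks against the real uid/gid of the caller. */
static const char *pick_lock_dir(const char *preferred, const char *fallback)
{
	assert(preferred && fallback);
	if (access(preferred, W_OK) == 0)
		return preferred;
	return fallback;
}
```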
Tycho Andersen [Fri, 19 Jan 2018 03:31:33 +0000 (03:31 +0000)]
lxc-execute: actually exit with the status of the spawned task
Now that we have things propagated through init and liblxc correctly, at
least in non-daemon mode, we can exit with the actual exit status of the
task, instead of always succeeding, which is not so helpful.
Tycho Andersen [Fri, 19 Jan 2018 03:29:05 +0000 (03:29 +0000)]
start: don't return false when the container's init exits nonzero
This seems slightly counter-intuitive, but IMO it's what we want.
Basically, ->start() should succeed if the container is spawned correctly
(similar to how golang's exec.Cmd.Start() returns nil if the thing spawns
correctly), and users can check error_num (i.e. golang's exec.Cmd.Wait())
to see how it exited.
This preserves previous behavior, which basically was that start was always
successful if the thing actually launched. Since we never kept track of
exit codes, this would always succeed too. Now that we do, it doesn't, and
this change is required.
Tycho Andersen [Fri, 19 Jan 2018 03:24:59 +0000 (03:24 +0000)]
remember the exit code from the init process
error_num seems to be trying to remember the exit code of the init process,
except that nothing actually keeps track of it anywhere. So, let's add a
field to the handler, so that we can keep track of the process' exit
status, and then propagate it to error_num in struct lxc_container so that
people can use it.
Note that this is a slight behavior change, essentially instead of making
error_num always == the return code from start, now it contains slightly
more useful information (the actual exit status). But, there is only one
internal user of error_num which I'll fix later in the series, so IMO
this is ok.
Tycho Andersen [Fri, 19 Jan 2018 03:21:10 +0000 (03:21 +0000)]
lxc.init: correctly exit with the app's error code
Based on the comments in the code (and the have_status flag), the intent
here (and IMO, the desired behavior) should be for init.lxc to propagate
the actual exit code from the real application process up through.
Otherwise, it is swallowed and nobody can access it.
The bug being fixed here is that ret held the correct exit code, but when
it went around the loop again (to wait for other children) ret was
clobbered. Let's save the desired exit status somewhere else, so it can't
get clobbered, and we propagate things correctly.
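The shape of the fix can be sketched like this. It is an assumption-laden illustration rather than the init.lxc code: reap_all() and exit_with are hypothetical names. The point is that the per-iteration wait status is overwritten on every pass, so the status of the main child is copied into a separate variable as soon as it is seen.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: reap every child of the calling process and
 * return the exit code of `main_pid` only (-1 if it was never seen
 * exiting normally). */
static int reap_all(pid_t main_pid)
{
	int exit_with = -1;	/* kept apart from the loop's status */

	assert(main_pid > 0);
	for (;;) {
		int status;
		pid_t pid = waitpid(-1, &status, 0);

		if (pid < 0) {
			if (errno == EINTR)
				continue;
			break;	/* ECHILD: nothing left to reap */
		}

		/* `status` is reused for every child, so save the one we
		 * care about before the next iteration overwrites it. */
		if (pid == main_pid && WIFEXITED(status))
			exit_with = WEXITSTATUS(status);
	}
	return exit_with;
}
```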
Tycho Andersen [Fri, 19 Jan 2018 03:20:08 +0000 (03:20 +0000)]
fix lxc_error_set_and_log to match the docs
The documentation for this function says if the task was killed by a
signal, the return code will be 128+n, where n is the signal number. Let's
make that actually true.
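For illustration, the documented convention maps a wait status to an exit code roughly as below. status_to_exit_code() and run_child() are hypothetical helpers written for this note. They are not lxc_error_set_and_log() itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a waitpid() status to a shell-style exit code: the exit status
 * for a normal exit, 128+n for death by signal n. */
static int status_to_exit_code(int status)
{
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	if (WIFSIGNALED(status))
		return 128 + WTERMSIG(status);
	return -1;	/* stopped or continued: not a final status */
}

/* Demo helper: fork a child that exits with `code`, or kills itself
 * with `sig` when sig > 0, and return the mapped exit code. */
static int run_child(int code, int sig)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (sig > 0)
			raise(sig);
		_exit(code);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return status_to_exit_code(status);
}
```

A child killed by SIGKILL is reported as 128 + SIGKILL, while a child that calls _exit(3) is reported as 3.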
Tycho Andersen [Fri, 19 Jan 2018 00:50:39 +0000 (00:50 +0000)]
start: don't log stop/continue for non-init processes
This non-init forwarding check should really be before all the log messages
about "init continued" or "init stopped", since they will otherwise lie
about some process that wasn't init being stopped or continued.