If we don't re-open these after clone, the init process has a pointer to the
parent's /dev/{zero,null}. CRIU sees these and wants to dump the parent's
mount namespace, which is unnecessary. Instead, we should just re-open
stdin/out/err after we do the clone and pivot root, to ensure that we have
pointers to the devices in init's rootfs instead of the host's.
v2: Only close fds if the container was daemonized. This didn't turn out as
nicely as described on the list because lxc_start() doesn't actually have
the struct lxc_container, so it can't see the flag. Instead, we just pass it
down everywhere.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
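For illustration, the re-open step described above can be sketched as
follows. This is a minimal sketch, not the code from the patch: the helper
name redirect_fd_to_devnull() is hypothetical, and it assumes /dev/null is
the device the standard fds should end up on. Because the path is resolved
at call time, calling it after pivot root yields a descriptor that refers to
the device node in the new rootfs.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: make 'target' refer to the /dev/null that is
 * visible in the current root. Returns 0 on success, -1 on failure.
 */
static int redirect_fd_to_devnull(int target)
{
	int fd, ret;

	assert(target >= 0);

	/* Resolved against the current root, i.e. the container's rootfs
	 * if this runs after pivot root. */
	fd = open("/dev/null", O_RDWR);
	if (fd < 0)
		return -1;

	/* dup2() atomically closes the old descriptor and installs the new one. */
	ret = dup2(fd, target) < 0 ? -1 : 0;

	if (fd != target)
		close(fd);
	return ret;
}
```

A daemonized container would call this for STDIN_FILENO, STDOUT_FILENO and
STDERR_FILENO in turn.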
As of criu 1.5, the --veth-pair argument supports an additional parameter that
is the bridge name to attach to. This enables us to get rid of the goofy
action-script hack that passed bridge names as environment variables.
This patch is on top of the systemd/lxcfs mount rework patch, as we probably
want to wait to use 1.5 options until it has been out for a while and is in
distros.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
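A rough sketch of building the option value with the bridge appended. The
INSIDE=OUTSIDE@BRIDGE spelling is an assumption about criu's syntax (this
message only says that a bridge name can be supplied), so check it against
the criu documentation. format_veth_pair() and its parameter names are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the value passed to criu's --veth-pair.
 * 'bridge' may be NULL, in which case no bridge suffix is added.
 * Returns 0 on success, -1 if the result does not fit in 'buf'.
 */
static int format_veth_pair(char *buf, size_t len, const char *inside,
			    const char *outside, const char *bridge)
{
	int n;

	assert(buf && inside && outside);

	if (bridge)
		n = snprintf(buf, len, "%s=%s@%s", inside, outside, bridge);
	else
		n = snprintf(buf, len, "%s=%s", inside, outside);

	/* snprintf() reports the length it wanted; treat truncation as failure. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the bridge carried in the argument itself, nothing has to be passed
through the environment to an action script.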
CRIU now supports autodetection of external mounts via the --ext-mount-map auto
--enable-external-sharing --enable-external-masters options, so we don't need
to explicitly pass the cgmanager mount or any of the mounts from the config.
This also means that lxcfs mounts (since they are bind mounts from outside the
container) are autodetected, meaning that c/r of containers using lxcfs works.
A further advantage of this patch is that it addresses some of the ugliness
that was in the exec_criu() function. There are other criu options that will
allow us to trim this even further, though.
Finally, with --enable-external-masters, criu understands slave mounts in the
container with shared mounts in the peer group that are outside the namespace.
This allows containers on a systemd host to be dumped and restored correctly.
However, these options have just landed in criu trunk today, and the next
tagged release will be 1.6 on June 1, so we should avoid merging this into any
stable releases until then.
v2: remount / as private before bind mounting the container's directory for
criu. The problem here is that if / is mounted as shared, even if we
unshare() the /var/lib/lxc/rootfs mountpoint propagates outside of our
mount namespace, which is bad, since we don't want to leak mounts. In
particular, this leak confuses criu the second time it goes to checkpoint
the container.
v3: whoops, we really want / as MS_SLAVE | MS_REC here, to match what start
does
v4: rebase onto master for revert of logging patch
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
In the past, lxc-cmd-stop would wait until the command pipe was closed
before returning, ensuring that the container monitor had exited.
Now that we accept the actual success return value, lxcapi_stop can
return success before the monitor has fully exited.
So explicitly wait for the container to stop when lxc-cmd-stop returns
success.
1. When we stop a container from the lxc_cmd stop handler, we kill its
init task, then we unfreeze the container to make sure it receives the
signal. When that unfreeze succeeds, we were immediately returning 0,
without sending a response to the invoker.
2. lxc_cmd returns the length of the field received. In the case of
an lxc_cmd_stop this is 16. But a comment claims we expect no response,
only a 0. In fact the handler does send a response, which may or may
not include an error. So don't call an error just because we got back a
response.
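The second point can be sketched as below. This is an illustration under
assumptions, not the patched function: struct cmd_rsp is a hypothetical
stand-in for the response structure (on a typical 64-bit build this layout
is 16 bytes, which would match the length mentioned above), and treating a
zero-length read as success follows the older "no response, only a 0"
expectation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the response sent back by the stop handler. */
struct cmd_rsp {
	int ret;      /* handler's own result, negative on error */
	int datalen;
	void *data;
};

/*
 * Interpret the outcome of a stop command.
 * 'nread' is what the command call returned: negative on transport
 * failure, otherwise the number of response bytes received.
 */
static int stop_cmd_result(int nread, const struct cmd_rsp *rsp)
{
	if (nread < 0)
		return -1;              /* could not talk to the monitor */
	if (nread == 0)
		return 0;               /* channel closed without a reply */

	assert(rsp);
	if ((size_t)nread < sizeof(*rsp))
		return -1;              /* short, unusable response */

	/* A response arrived: that alone is not an error, look inside it. */
	return rsp->ret < 0 ? -1 : 0;
}
```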
Since attach asks the restore process what the clone flags were, if we forgot
to set them then the attach command ran in the host's namespaces instead of the
container's, which is a Very Bad Thing :). Instead, we remember to set the clone
flags in the restore process' handler, so that we report them correctly to any
attach processes that ask.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Tycho Andersen [Fri, 20 Mar 2015 16:17:31 +0000 (10:17 -0600)]
lxcapi_restore shouldn't steal the calling process
Previously, lxcapi_restore used the calling process as the lxc monitor process
(and just never returned), requiring users to fork before calling it. This, of
course, would cause problems for things like LXD, which can't fork.
Now, restore() forks the monitor as a child of the process that calls it. Users
who want to daemonize the restore process need to fork themselves.
lxc-checkpoint has been updated to reflect this behavior change.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
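The change in process structure can be sketched with plain fork(). This is
a simplified illustration, not the restore code itself: spawn_monitor() and
example_monitor() are hypothetical names, and the real monitor does far more
than return a status.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the real monitor loop; returns the status it is given. */
static int example_monitor(void *arg)
{
	return *(int *)arg;
}

/*
 * Run 'monitor_fn' in a forked child so the caller keeps its own process.
 * Returns the child's pid to the caller, or -1 if fork() failed.
 */
static pid_t spawn_monitor(int (*monitor_fn)(void *), void *arg)
{
	pid_t pid;

	assert(monitor_fn);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: becomes the monitor and never returns into the
		 * caller's code. _exit() skips the parent's atexit handlers. */
		_exit(monitor_fn(arg) == 0 ? 0 : 1);
	}

	return pid;     /* parent: carries on immediately */
}
```

A caller that wants a fully detached monitor would fork once more before
calling this, as noted above.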
Fix incomplete destruction of unprivileged ephemeral containers
If an unprivileged ephemeral container is started as follows,
lxc-start-ephemeral -o trusty -n test_ephemeral
Then an empty directory remains upon exit from the container,
~/.local/share/lxc/test_ephemeral/tmpfs/delta0
(The tmpfs filesystem is successfully unmounted, but we seem to lack
permission to delete the delta0 directory).
This issue arose following commits 4799a1e and dd2271e.
The following patch resolves the issue. It has been tested on ubuntu
14.04 with the lxc-daily ppa.
Since gmail screws up the formatting of the patch via line-wrapping
etc, please copy the patch from the issue-tracker rather than from
this email.
Serge Hallyn [Mon, 16 Mar 2015 17:02:12 +0000 (17:02 +0000)]
lxc-destroy: actually work if underlying fs is overlayfs
One of the 'features' of overlayfs is that depending on whether a file
is on the upper or lower dir you get back a different device from stat.
That breaks our lxc_rmdir_onedev.
So at lxc_rmdir_onedev check the device of the directory being deleted.
If it is overlayfs, then skip the device check.
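One way to detect the filesystem type is statfs(), sketched below. This is
an assumption about how such a check could look, not a copy of the patch:
is_overlayfs() is a hypothetical name, and 0x794c7630 is the magic number of
the upstream kernel's overlay filesystem. Out-of-tree overlayfs builds may
report a different value.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/vfs.h>

#ifndef OVERLAYFS_SUPER_MAGIC
#define OVERLAYFS_SUPER_MAGIC 0x794c7630
#endif

/*
 * Hypothetical helper: report whether 'path' lives on an overlay
 * filesystem. Any statfs() failure is treated as "not overlayfs", so the
 * caller falls back to the normal device check.
 */
static bool is_overlayfs(const char *path)
{
	struct statfs sb;

	assert(path);

	if (statfs(path, &sb) < 0)
		return false;

	return (unsigned long)sb.f_type == OVERLAYFS_SUPER_MAGIC;
}
```

The recursive delete would evaluate this once for the top-level directory
and skip the per-entry st_dev comparison when it returns true.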
Note this is unrelated to overlayfs snapshots - in those cases when you
delete a container, /var/lib/lxc/$container/ does not actually have an
overlayfs under it. Rather, to reproduce this you would
Serge Hallyn [Wed, 18 Mar 2015 00:02:18 +0000 (19:02 -0500)]
cgmanager: put unprivileged containers under $(curcgroup)/lxc/$(container)
Currently if we are in /user.slice/user-1000.slice/session-c2.scope,
and we start an unprivileged container t1, it will be in cgroup
3:memory:/user.slice/user-1000.slice/session-c2.scope/t1. If
we then do a 'lxc-cgroup -n t1 freezer.tasks', cgm_get will
first switch to 3:memory:/user.slice/user-1000.slice/session-c2.scope
then look up 't1's values. The reasons for this are
1. cgmanager get_value is relative to your own cgroup, so we need
to be sure to be in t1's cgroup or an ancestor
2. we don't want to be in the container's cgroup bc it might freeze us.
But in Ubuntu 15.04 it was decided that
3:memory:/user.slice/user-1000.slice/session-c2.scope/tasks should
not be writeable by the user, making this fail.
Therefore put all unprivileged cgroups under "lxc/%n". That way
the "lxc" cgroup should always be owned by the user so that he can
enter.
Serge Hallyn [Wed, 11 Mar 2015 22:10:55 +0000 (22:10 +0000)]
logs: introduce a thread-local 'current' lxc_config
The logging code uses a global log_fd and log_level to direct
logging (ERROR(), etc). While the container configuration file allows
for lxc.loglevel and lxc.logfile, those are only used at configuration
file read time to set the global variables. This works ok in the
lxc front-end programs, but becomes a problem with threaded API users.
The simplest solution would be to not allow per-container configuration
files, but it'd be nice to avoid that.
Passing a logfd or lxc_conf into every ERROR/INFO/etc call is "possible",
but would be a huge complication as there are many functions, including
struct member functions and callbacks, which don't have that info and
would need to get it from somewhere.
So the approach I'm taking here is to say that all real container work
is done inside api calls, and therefore the API calls themselves can
set a thread-local variable indicating which log info to use. If
unset, then use the global values. The lxc-* programs, when called
with a '-o logfile' argument, set a global variable to indicate that
the user-specified value should be used.
In this patch:
If the lxc container configuration specifies a loglevel/logfile, only
set the lxc_config's logfd and loglevel according to those, not the
global values.
Each API call is wrapped to set/unset the current_config. (The few
exceptions are calls which do not result in any log actions)
Update logfile appender to use the logfile specified in lxc_conf if (a)
current_config is set and (b) the lxc-* command did not override it.
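The approach can be sketched as follows. All names here (struct log_conf,
global_log, cmdline_override, with_config, effective_log, report_logfd) are
hypothetical stand-ins chosen for the example, and __thread is the
GCC/Clang spelling of thread-local storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the log fields kept in a container's config. */
struct log_conf {
	int loglevel;
	int logfd;
};

/* Process-wide defaults, used when no API call is in progress. */
static struct log_conf global_log = { .loglevel = 3, .logfd = 2 };

/* Set by an lxc-* front end when the user passes '-o logfile'. */
static int cmdline_override;

/* One slot per thread, so concurrent API calls do not interfere. */
static __thread struct log_conf *current_config;

static void set_current_config(struct log_conf *conf)
{
	current_config = conf;
}

/* What a log appender would consult before writing a message. */
static const struct log_conf *effective_log(void)
{
	if (current_config && !cmdline_override)
		return current_config;
	return &global_log;
}

/* Wrapper pattern: set on entry to an API call, clear on exit. */
static int with_config(struct log_conf *conf, int (*body)(void))
{
	int ret;

	assert(body);
	set_current_config(conf);
	ret = body();
	set_current_config(NULL);
	return ret;
}

/* Example body: report which fd a log call would use right now. */
static int report_logfd(void)
{
	return effective_log()->logfd;
}
```

Code deep inside the call chain never receives the config as an argument.
It only asks effective_log(), which is why callbacks and struct member
functions need no signature changes.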
This patch enables seccomp support for LXC containers running on PowerPC
architectures. It is based on the latest PowerPC support added to libseccomp, on
the working-ppc64 branch [1].
Libseccomp has been tested on ppc, ppc64 and ppc64le architectures. LXC with
seccomp support has been tested on ppc and ppc64 architectures, using the
default seccomp policy example files delivered with the LXC package.
brauner [Sun, 8 Feb 2015 15:48:31 +0000 (16:48 +0100)]
config: Allow all containers to use fuse
This enables containers to mount fuse filesystems by default. The mount
is designed to be safe, so it can be enabled by default in
common.conf. It will lead to a cleaner boot for some unprivileged
systemd-based containers.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Stéphane Graber [Mon, 2 Feb 2015 09:21:20 +0000 (11:21 +0200)]
In lxc.mount.auto, skip on ENOENT
This resolves the case where /proc/sysrq-trigger doesn't exist by simply
ignoring any mount failure on ENOENT. With the current mount list, this
will always result in a safe environment (typically the read-only
underlay).
Closes #425
v2: Don't always show an error
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Tycho Andersen [Wed, 4 Feb 2015 12:02:02 +0000 (14:02 +0200)]
Process command line is null terminated
It turns out the process command line is in fact null terminated on the stack;
this caused a bug where when the new process title was smaller than the old
one, the first environment entry would be rendered as part of the process
title.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
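The general hazard can be illustrated with a small buffer exercise. This is
not the fix from the patch, only a sketch of the idea that a shorter title
must carry its own terminator and must not leave old bytes behind.
overwrite_title() is a hypothetical helper operating on an ordinary buffer
rather than on the real argv area.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: replace the title stored in region[0..region_len)
 * with 'new_title', truncating if needed. The rest of the region is
 * zero-filled, so the result is always NUL-terminated and nothing from
 * the old, longer title remains visible.
 */
static void overwrite_title(char *region, size_t region_len,
			    const char *new_title)
{
	size_t n;

	assert(region && new_title && region_len > 0);

	n = strlen(new_title);
	if (n > region_len - 1)
		n = region_len - 1;     /* leave room for the terminator */

	memcpy(region, new_title, n);
	memset(region + n, 0, region_len - n);
}
```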
Tycho Andersen [Fri, 30 Jan 2015 13:59:13 +0000 (14:59 +0100)]
set the monitor process title to something useful
Instead of having a parent process that's called whatever the caller of the
library is called, we instead set it to "[lxc monitor] <lxcpath> <container>"
Closes #180
v2: check for null in tok for loop, only truncate environment when necessary
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Thu, 29 Jan 2015 23:50:41 +0000 (23:50 +0000)]
apparmor: support lxc.ttydir when bind-mounting ptys
Because we now create the ttys from inside the container, we had to
add an apparmor rule for start-container to bind-mount /dev/pts/** -> /dev/tty*/.
However that's not sufficient if the container sets lxc.ttydir, in
which case we need to support mounting onto files in subdirs of /dev.
Serge Hallyn [Thu, 29 Jan 2015 16:09:45 +0000 (16:09 +0000)]
clone_paths: use 'rootfs' for destination directory
We were trying to be smart and use whatever the last part of
the container's rootfs path was. However for block devices
that doesn't make much sense. I.e. if lxc.rootfs = /dev/md-1,
chances are that /var/lib/lxc/c1/md-1 does not exist.
So always use the $lxcpath/$lxcname/rootfs, and if it does
not exist, try to create it.
With this, 'lxc-clone -s -o c1 -n c2' where c1 has an lvm backend
is fixed. See https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1414771
Serge Hallyn [Thu, 29 Jan 2015 10:13:36 +0000 (10:13 +0000)]
create lxc.tty ptys from container process
Lxc has always created the ptys for use by console and ttys early
on from the monitor process. This has some advantages, but also
has disadvantages, namely (1) container ptys counting against the
max ptys for the host, and (2) not having a /dev/pts/N in the
container to pass to getty. (2) was not a problem for us historically
because we bind-mounted the host's /dev/pts/N onto a /dev/ttyN in
the container. However, systemd hardcodes a check for container_ttys
that the path have 'pts/' in it. If it were only for (2) I'd have
opted for a systemd patch to check the device major number, but (1)
made it worth moving the openpty to the container namespace.
So this patch moves the tty creation into the task which becomes
the container init. It then passes the fds for the opened ptys
back to the monitor over a unix socketpair (for use by lxc-console).
The /dev/console is still created in the monitor process, so that
it can for instance be used by lxc.logfd.
So now if you have a foreground container with lxc.tty = 4, you
should end up with one host /dev/pts entry per container rather than 5.
And lxc-console now works with systemd containers.
Note that if the container init mounts its own devpts over the
one mounted by lxc, the tty /dev/pts/n will be hidden. This is ok
since it's only systemd that needs it, and systemd won't do that.
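Passing an open descriptor between two processes over a unix socket is done
with an SCM_RIGHTS control message. The sketch below shows the general
technique for a single fd. send_fd() and recv_fd() are hypothetical names,
and the actual patch may batch several fds or frame the message differently.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send 'fd' over the unix socket 'sock'. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';       /* at least one byte of real data is needed */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctrl u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from 'sock'. Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctrl u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	/* The kernel has installed a fresh descriptor in this process. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the scheme described above, the task that becomes container init would
play the sender role for each pty it opened and the monitor would play the
receiver.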
Serge Hallyn [Tue, 27 Jan 2015 23:06:22 +0000 (23:06 +0000)]
systemd: specify container_ttys in environment
The lxc.tty configuration item specifies a number of ttys to create.
Historically, for each of those, we create a /dev/pts/N entry and
symlink it to /dev/ttyN for older inits to use. For systemd, we should
instead specify each tty name in a $container_ttys environment variable
passed to init.
See http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/ and
https://github.com/lxc/lxc/issues/419.
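A minimal sketch of assembling the variable's value. The space-separated
layout is an assumption based on the systemd container interface page
linked above, and build_container_ttys() is a hypothetical helper. The tty
names themselves are supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: join 'count' tty names with single spaces into
 * 'buf'. Returns 0 on success, -1 if the result does not fit.
 */
static int build_container_ttys(char *buf, size_t len,
				const char *const *names, size_t count)
{
	size_t used = 0;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		/* Prefix every name except the first with a separator. */
		int n = snprintf(buf + used, len - used, "%s%s",
				 i ? " " : "", names[i]);

		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

The resulting string would then be exported as container_ttys in the
environment handed to the container's init, for example with setenv()
before exec.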