When users wanted to mount overlay directories with lxc.mount.entry they had to
create upperdirs and workdirs beforehand in order to mount them. To create it
for them we add the functions mount_entry_create_overlay_dirs() and
mount_entry_create_aufs_dirs() which do this for them. User can now simply
specify e.g.:
and /upper and /workdir will be created for them. /upper and /workdir need to
be absolute paths to directories which are created under the containerdir (e.g.
under $lxcpath/CONTAINERNAME/). Relative mountpoints, mountpoints outside the
containerdir, and mountpoints within the container's rootfs are ignored. (The
latter *might* change in the future should it be considered safe/useful.)
Serge Hallyn [Sat, 3 Oct 2015 21:52:16 +0000 (21:52 +0000)]
lxc_mount_auto_mounts: fix weirdness
The default_mounts[i].destination is never NULL except in the last
'stop here' entry. Coverity doesn't know about that and so is spewing
a warning. In any case, let's add a more stringent check in case someone
accidentally adds a NULL there later.
While lxc-copy is under review let users benefit (reboot survival etc.) from the
new lxc.ephemeral option already in lxc-start-ephemeral. This way we can remove
the lxc.hook.post-stop script-
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Enable aarch64 seccomp support for LXC containers running on ARM64
architectures. Tested with libseccomp 2.2.0 and the default seccomp
policy example files delivered with the LXC package.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Colin Watson [Wed, 30 Sep 2015 12:37:10 +0000 (13:37 +0100)]
lxc-start-ephemeral: Parse passwd directly
On Ubuntu 15.04, lxc-start-ephemeral's call to pwd.getpwnam always
fails. While I haven't been able to prove it or track down an exact
cause, I strongly suspect that glibc does not guarantee that you can
call NSS functions after a context switch without re-execing. (Running
"id root" in a subprocess from the same point works fine.)
It's safer to use getent to extract the relevant line from the passwd
file and parse it directly.
Serge Hallyn [Mon, 31 Aug 2015 17:57:20 +0000 (12:57 -0500)]
CVE-2015-1335: Protect container mounts against symlinks
When a container starts up, lxc sets up the container's inital fstree
by doing a bunch of mounting, guided by the container configuration
file. The container config is owned by the admin or user on the host,
so we do not try to guard against bad entries. However, since the
mount target is in the container, it's possible that the container admin
could divert the mount with symbolic links. This could bypass proper
container startup (i.e. confinement of a root-owned container by the
restrictive apparmor policy, by diverting the required write to
/proc/self/attr/current), or bypass the (path-based) apparmor policy
by diverting, say, /proc to /mnt in the container.
To prevent this,
1. do not allow mounts to paths containing symbolic links
2. do not allow bind mounts from relative paths containing symbolic
links.
Details:
Define safe_mount which ensures that the container has not inserted any
symbolic links into any mount targets for mounts to be done during
container setup.
The host's mount path may contain symbolic links. As it is under the
control of the administrator, that's ok. So safe_mount begins the check
for symbolic links after the rootfs->mount, by opening that directory.
It opens each directory along the path using openat() relative to the
parent directory using O_NOFOLLOW. When the target is reached, it
mounts onto /proc/self/fd/<targetfd>.
Use safe_mount() in mount_entry(), when mounting container proc,
and when needed. In particular, safe_mount() need not be used in
any case where:
1. the mount is done in the container's namespace
2. the mount is for the container's rootfs
3. the mount is relative to a tmpfs or proc/sysfs which we have
just safe_mount()ed ourselves
Since we were using proc/net as a temporary placeholder for /proc/sys/net
during container startup, and proc/net is a symbolic link, use proc/tty
instead.
Update the lxc.container.conf manpage with details about the new
restrictions.
Finally, add a testcase to test some symbolic link possibilities.
I've noticed that a bunch of the code we've included over the past few
weeks has been using 8-spaces rather than tabs, making it all very hard
to read depending on your tabstop setting.
This commit attempts to revert all of that back to proper tabs and fix a
few more cases I've noticed here and there.
No functional changes are included in this commit.
Otherwise the kernel will umount when it gets around to it, but
that on lxc_destroy we may race with it and fail the rmdir of
the overmounted (BUSY) rootfs.
On shutdown ephemeral containers will be destroyed. We use mod_all_rdeps() from
lxccontainer.c to update the lxc_snapshots file of the original container. We
also include lxclock.h to lock the container when mod_all_rdeps() is called to
avoid races.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This allows us to use standard string handling functions and we can avoid using
the GNU-extension memmem(). This simplifies removing the container from the
lxc_snapshots file. Wrap strstr() in a while loop to remove duplicate entries.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
We can't rsync the delta as unpriv user because we can't create
the chardevs representing a whiteout. We can however rsync the
rootfs and have the kernel create the whiteouts for us.
Test edge cases (removing first and last entries in lxc_snapshots and the very
last snapshot) and make sure original container isn't destroyed while there are
snapshots, and is when there are none.
Add a nesting.conf which can be included to support nesting containers (v2)
Newer kernels have added a new restriction: if /proc or /sys on the
host has files or non-empty directories which are over-mounted, and
there is no /proc which fully visible, then it assumes there is a
"security" reason for this. It prevents anyone in a non-initial user
namespace from creating a new proc or sysfs mount.
To work around this, this patch adds a new 'nesting.conf' which can be
lxc.include'd from a container configuration file. It adds a
non-overmounted mount of /proc and /sys under /dev/.lxc, so that the
kernel can see that we're not trying to *hide* things like /proc/uptime.
and /sys/devices/virtual/net. If the host adds this to the config file
for container w1, then container w1 will support unprivileged child
containers.
The nesting.conf file also sets the apparmor profile to the with-nesting
variant, since that is required anyway. This actually means that
supporting nesting isn't really more work than it used to be, just
different. Instead of adding
Finally, in order to maintain the current apparmor protections on
proc and sys, we make /dev/.lxc/{proc,sys} non-read/writeable.
We don't need to be able to use them, we're just showing the
kernel what's what.
Major Hayden [Wed, 2 Sep 2015 21:21:11 +0000 (16:21 -0500)]
Tear down network devices during container halt
On very busy systems, some virtual network devices won't be destroyed after a
container halts. This patch uses the lxc_delete_network() method to ensure
that network devices attached to the container are destroyed when the
container halts.
Without the patch, some virtual network devices are left over on the system
and must be removed with `ip link del <device>`. This caused containers
with lxc.network.veth.pair to not be able to start. For containers using
randomly generated virtual network device names, the old devices will hang
around on the bridge with their original MAC address.
David Ward [Tue, 23 Jun 2015 14:57:19 +0000 (10:57 -0400)]
Only mount /proc if needed, even without a rootfs
Use the same code with and without a rootfs to check if mounting
/proc is necessary before doing so. If mounting it is unsuccessful
and there is no rootfs, continue as before.
Signed-off-by: David Ward <david.ward@ll.mit.edu> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
David Ward [Tue, 23 Jun 2015 14:57:23 +0000 (10:57 -0400)]
Allow autodev without a rootfs
A container without a rootfs is useful for running a collection of
processes in separate namespaces (to provide separate networking as
an example), while sharing the host filesystem (except for specific
paths that are re-mounted as needed). For multiple processes to run
automatically when such a container is started, it can be launched
using lxc-start, and a separate instance of systemd can manage just
the processes inside the container. (This assumes that the path to
the systemd unit files is re-mounted and only contains the services
that should run inside the container.) For this use case, autodev
should be permitted for a container that does not have a rootfs.
Signed-off-by: David Ward <david.ward@ll.mit.edu> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Natanael Copa [Fri, 21 Aug 2015 09:48:10 +0000 (11:48 +0200)]
Clone bridge interface MTU setting
Instead of require static mtu setting in config we simply clone the
existing MTU setting of the bridge interface.
This fixes issue when bridge interface has bigger MTU (like 9000 for
jumbo frame support) than the default 1500. When veth interface is
created it has by default MTU set to 1500 and when this is added to the
bridge, the kernel wee reduce the MTU for the bridge to 1500. We solve
this by cloning the MTU value from bridge interface.
This simplifies managing containers with bridge interface who supports
jumbo frames (mtu 9000) and makes it easier to move containers between
hosts with different MTU settings.
Signed-off-by: Natanael Copa <ncopa@alpinelinux.org> Acked-by: Stéphane Graber <stgraber@ubuntu.com>
If we currently create clone-snapshots via lxc-clone only the plain total
number of the containers it serves as a base-container is written to the file
"lxc-snapshots". This commit modifies mod_rdep() so it will store the paths and
names to the containers that are clone-snapshots (similar to the "lxc_rdepends"
file for the clones). **Users which still have containers that have a non-empty
(with a number > 0 as an entry) "lxc-snapshots" file in the old format are not
affected by this change. It will be used until all old clones have been
deleted!** For all others, the "lxc_snapshots" file placed under the original
container now looks like this:
/var/lib/lxc
bb
/var/lib/lxc
cc
/opt
dd
This is an example of a container that provides the base for three
clone-snapshots bb, cc, and dd. Where bb and cc both are placed in the usual
path for privileged containers and dd is placed in a custom path.
- Add additional argument to function that takes in the clone-snapshotted
lxc_container.
- Have mod_rdep() write the path and name of the clone-snapshotted container the
file lxc_snapshots of the original container.
- If a clone-snapshot gets deleted the corresponding line in the file
lxc_snapshot of the original container will be deleted and the file updated
via mmap() + memmove() + munmap().
- Adapt has_fs_snapshots().
- **If an lxc-snapshot file in the old format is found we'll keep using it.**
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
- Passing the LXC_CLONE_KEEPNAME flag to do_lxcapi_clone() was not respected and
let to unexpected behaviour for e.g. lxc-clone. We wrap
clear_unexp_config_line() and set_config_item_line() in an appropriate
if-condition.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
- This enables the user to destroy a container with all its snapshots without
having to use lxc-snapshot first to destroy all snapshots. (The enum values
DESTROY and SNAP from the previous commit are reused here again.)
- Some unification regarding the usage of exit() and return has been done.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
- lxc_snapshot.c lacked necessary members in the associated lxc_arguments struct
in arguments.h. This commit extends the lxc_arguments struct to include
several parameters used by lxc-snapshot which allows a rewrite that is more
consistent with the rest of the lxc-* executables.
- All tests have been moved beyond the call to lxc_log_init() to allow for the
messages to be printed or saved.
- Some small changes to the my_args struct. (The enum task is set to SNAP (for
snapshot) per default and variables illustrating the usage of the command line
flags are written in all caps.)
- arguments.h has been extended to accommodate a future rewrite of lxc-clone
- Traditional behaviour of the executable has been retained in this commit.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>