CVE-2017-5985: Ensure target netns is caller-owned
Before this commit, lxc-user-nic could potentially have been tricked into
operating on a network namespace over which the caller did not hold privilege.
This commit ensures that the caller is privileged over the network namespace by
temporarily dropping privilege.
Launchpad: https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1654676 Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Stéphane Graber [Thu, 12 Nov 2015 17:44:38 +0000 (12:44 -0500)]
ubuntu-cloud: Various fixes
- Update list of supported releases
- Make the fallback release trusty
- Don't specify the compression algorithm (use auto-detection) so that
people passing tarballs to the template don't see regressions.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Tycho Andersen [Fri, 6 Nov 2015 20:50:33 +0000 (13:50 -0700)]
define PR_SET_MM_MAP & friends if necessary
PR_SET_MM_MAP only went in to the kernel at 3.18 (or 3.19), so we need to
define these for kernels before then. If there was an error, the code
simply logs the failure and continues on.
Tycho Andersen [Fri, 6 Nov 2015 19:34:47 +0000 (12:34 -0700)]
use PR_SET_MM_MAP instead of PR_SET_MM
PR_SET_MM_MAP can be called as non-root, which we are in the unprivileged
(or nested) case.
Also, let's not do the strcpy() for the new cmdline until after we're sure
the prctl succeeded. This means that even if it does fail, we won't
mutilate the command line like we did before, it just won't be as pretty.
v2: remember to chop off bits of the string that are too long
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
(1) This commit fixes the calculations when updating paths in lxc.hooks.*
entries. We now also update conf->unexpandend_alloced which hasn't been
done prior to this commit.
(2) Also we use the stricter check:
if (p >= lend)
continue;
This should deal better with invalid config files.
(3) Insert some spaces between operators to increase readability.
(4) Use gotos to simplify function and increase readability.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
When using overlay and aufs mounts with lxc.mount.entry users have to specify
absolute paths for upperdir and workdir which will then get created
automatically by mount_entry_create_overlay_dirs() and
mount_entry_create_aufs_dirs() in conf.c. When we clone a container with
overlay or aufs lxc.mount.entry entries we need to update these absolute paths.
In order to do this we add the function update_ovl_paths() in
lxccontainer.c. The function updates the mounts in two locations:
If we were to only update 2) we would end up with wrong upperdir and workdir
mounts as the absolute paths would still point to the container that serves as
the base for the clone. If we were to only update 1) we would end up with wrong
upperdir and workdir lxc.mount.entry entries in the clone's config as the
absolute paths in upperdir and workdir would still point to the container that
serves as the base for the clone. Updating both will get the job done.
NOTE: This function does not sanitize paths apart from removing trailing
slashes. (So when a user specifies //home//someone/// it will be cleaned to
//home//someone. This is the minimal path cleansing which is also done by
lxc_container_new().) But the mount_entry_create_overlay_dirs() and
mount_entry_create_aufs_dirs() functions both try to be extremely strict about
when to create upperdirs and workdirs. They will only accept sanitized paths,
i.e. they require /home/someone. I think this is a (safety) virtue and we
should consider sanitizing paths in general. In short: update_ovl_paths() does
update all absolute paths to the new container but
mount_entry_create_overlay_dirs() and mount_entry_create_aufs_dirs() will still
refuse to create upperdir and workdir when the updated path is unclean. This
happens easily when e.g. a user calls lxc-clone -o OLD -n NEW -P
//home//chb///.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This functions updates absolute paths for overlay upper- and workdirs so users
can simply clone and start new containers without worrying about absolute paths
in lxc.mount.entry overlay entries.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Jakub Sztandera [Fri, 30 Oct 2015 11:05:44 +0000 (12:05 +0100)]
arch template: Fix systemd-sysctl service
The systemd-sysctl service includes condition that /proc/sys/ has to be read-write.
In lxc only /proc/sys/net/ is read-write which causes the condition to fail and service not to run.
This patch changes the check to /proc/sys/net/ and makes the service apply only rules that are in net tree.
Instead of duplicating the cleanup-code, once for success and once for failure,
simply keep a variable fret which is -1 in the beginning and gets set to 0 on
success or stays -1 on failure.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Stéphane Graber <stgraber@ubuntu.com>
The mount_entry_overlay_dirs() and mount_entry_aufs_dirs() functions create
workdirs and upperdirs for overlay and aufs lxc.mount.entry entries. They try
to make sure that the workdirs and upperdirs can only be created under the
containerdir (e.g. /path/to/the/container/CONTAINERNAME). In order to do this
the right hand side of
was thought to check if the rootfs->path is not present in the workdir and
upperdir mount options. But the current check is bogus since it will be
trivially true whenever the container is a block-dev or overlay or aufs backed
since the rootfs->path will then have a form like e.g.
overlayfs:/some/path:/some/other/path
This patch adds the function ovl_get_rootfs_dir() which parses rootfs->path by
searching backwards for the first occurrence of the delimiter pair ":/". We do
not simply search for ":" since it might be used in path names. If ":/" is not
found we assume the container is directory backed and simply return
strdup(rootfs->path).
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Thu, 15 Oct 2015 18:56:17 +0000 (18:56 +0000)]
Ignore trailing /init.scope in init cgroups
The lxc monitor does not store the container's cgroups, rather it
recalculates them whenever needed.
Systemd moves itself into a /init.scope cgroup for the systemd
controller.
It might be worth changing that (by storing all cgroup info in the
lxc_handler), but for now go the hacky route and chop off any
trailing /init.scope.
I definately thinkg we want to switch to storing as that will be
more bullet-proof, but for now we need a quick backportable fix
for systemd 226 guests.
The mount_entry_create_*_dirs() functions currently assume that the rootfs of
the container is actually named "rootfs". This has the consequence that
del = strstr(lxcpath, "/rootfs");
if (!del) {
free(lxcpath);
lxc_free_array((void **)opts, free);
return -1;
}
*del = '\0';
will return NULL when the rootfs of a container is not actually named "rootfs".
This means the we return -1 and do not create the necessary upperdir/workdir
directories required for the overlay/aufs mount to work. Hence, let's not make
that assumption. We now pass lxc_path and lxc_name to
mount_entry_create_*_dirs() and create the path directly. To prevent failure we
also have mount_entry_create_*_dirs() check that lxc_name and lxc_path are not
empty when they are passed in.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
When users wanted to mount overlay directories with lxc.mount.entry they had to
create upperdirs and workdirs beforehand in order to mount them. To create it
for them we add the functions mount_entry_create_overlay_dirs() and
mount_entry_create_aufs_dirs() which do this for them. User can now simply
specify e.g.:
and /upper and /workdir will be created for them. /upper and /workdir need to
be absolute paths to directories which are created under the containerdir (e.g.
under $lxcpath/CONTAINERNAME/). Relative mountpoints, mountpoints outside the
containerdir, and mountpoints within the container's rootfs are ignored. (The
latter *might* change in the future should it be considered safe/useful.)
Serge Hallyn [Sat, 3 Oct 2015 21:52:16 +0000 (21:52 +0000)]
lxc_mount_auto_mounts: fix weirdness
The default_mounts[i].destination is never NULL except in the last
'stop here' entry. Coverity doesn't know about that and so is spewing
a warning. In any case, let's add a more stringent check in case someone
accidentally adds a NULL there later.
Enable aarch64 seccomp support for LXC containers running on ARM64
architectures. Tested with libseccomp 2.2.0 and the default seccomp
policy example files delivered with the LXC package.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Colin Watson [Wed, 30 Sep 2015 12:37:10 +0000 (13:37 +0100)]
lxc-start-ephemeral: Parse passwd directly
On Ubuntu 15.04, lxc-start-ephemeral's call to pwd.getpwnam always
fails. While I haven't been able to prove it or track down an exact
cause, I strongly suspect that glibc does not guarantee that you can
call NSS functions after a context switch without re-execing. (Running
"id root" in a subprocess from the same point works fine.)
It's safer to use getent to extract the relevant line from the passwd
file and parse it directly.
Serge Hallyn [Mon, 31 Aug 2015 17:57:20 +0000 (12:57 -0500)]
CVE-2015-1335: Protect container mounts against symlinks
When a container starts up, lxc sets up the container's inital fstree
by doing a bunch of mounting, guided by the container configuration
file. The container config is owned by the admin or user on the host,
so we do not try to guard against bad entries. However, since the
mount target is in the container, it's possible that the container admin
could divert the mount with symbolic links. This could bypass proper
container startup (i.e. confinement of a root-owned container by the
restrictive apparmor policy, by diverting the required write to
/proc/self/attr/current), or bypass the (path-based) apparmor policy
by diverting, say, /proc to /mnt in the container.
To prevent this,
1. do not allow mounts to paths containing symbolic links
2. do not allow bind mounts from relative paths containing symbolic
links.
Details:
Define safe_mount which ensures that the container has not inserted any
symbolic links into any mount targets for mounts to be done during
container setup.
The host's mount path may contain symbolic links. As it is under the
control of the administrator, that's ok. So safe_mount begins the check
for symbolic links after the rootfs->mount, by opening that directory.
It opens each directory along the path using openat() relative to the
parent directory using O_NOFOLLOW. When the target is reached, it
mounts onto /proc/self/fd/<targetfd>.
Use safe_mount() in mount_entry(), when mounting container proc,
and when needed. In particular, safe_mount() need not be used in
any case where:
1. the mount is done in the container's namespace
2. the mount is for the container's rootfs
3. the mount is relative to a tmpfs or proc/sysfs which we have
just safe_mount()ed ourselves
Since we were using proc/net as a temporary placeholder for /proc/sys/net
during container startup, and proc/net is a symbolic link, use proc/tty
instead.
Update the lxc.container.conf manpage with details about the new
restrictions.
Finally, add a testcase to test some symbolic link possibilities.
I've noticed that a bunch of the code we've included over the past few
weeks has been using 8-spaces rather than tabs, making it all very hard
to read depending on your tabstop setting.
This commit attempts to revert all of that back to proper tabs and fix a
few more cases I've noticed here and there.
No functional changes are included in this commit.
Otherwise the kernel will umount when it gets around to it, but
that on lxc_destroy we may race with it and fail the rmdir of
the overmounted (BUSY) rootfs.
We can't rsync the delta as unpriv user because we can't create
the chardevs representing a whiteout. We can however rsync the
rootfs and have the kernel create the whiteouts for us.
Add a nesting.conf which can be included to support nesting containers (v2)
Newer kernels have added a new restriction: if /proc or /sys on the
host has files or non-empty directories which are over-mounted, and
there is no /proc which fully visible, then it assumes there is a
"security" reason for this. It prevents anyone in a non-initial user
namespace from creating a new proc or sysfs mount.
To work around this, this patch adds a new 'nesting.conf' which can be
lxc.include'd from a container configuration file. It adds a
non-overmounted mount of /proc and /sys under /dev/.lxc, so that the
kernel can see that we're not trying to *hide* things like /proc/uptime.
and /sys/devices/virtual/net. If the host adds this to the config file
for container w1, then container w1 will support unprivileged child
containers.
The nesting.conf file also sets the apparmor profile to the with-nesting
variant, since that is required anyway. This actually means that
supporting nesting isn't really more work than it used to be, just
different. Instead of adding
Finally, in order to maintain the current apparmor protections on
proc and sys, we make /dev/.lxc/{proc,sys} non-read/writeable.
We don't need to be able to use them, we're just showing the
kernel what's what.
Major Hayden [Wed, 2 Sep 2015 21:21:11 +0000 (16:21 -0500)]
Tear down network devices during container halt
On very busy systems, some virtual network devices won't be destroyed after a
container halts. This patch uses the lxc_delete_network() method to ensure
that network devices attached to the container are destroyed when the
container halts.
Without the patch, some virtual network devices are left over on the system
and must be removed with `ip link del <device>`. This caused containers
with lxc.network.veth.pair to not be able to start. For containers using
randomly generated virtual network device names, the old devices will hang
around on the bridge with their original MAC address.
David Ward [Tue, 23 Jun 2015 14:57:19 +0000 (10:57 -0400)]
Only mount /proc if needed, even without a rootfs
Use the same code with and without a rootfs to check if mounting
/proc is necessary before doing so. If mounting it is unsuccessful
and there is no rootfs, continue as before.
Signed-off-by: David Ward <david.ward@ll.mit.edu> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
David Ward [Tue, 23 Jun 2015 14:57:23 +0000 (10:57 -0400)]
Allow autodev without a rootfs
A container without a rootfs is useful for running a collection of
processes in separate namespaces (to provide separate networking as
an example), while sharing the host filesystem (except for specific
paths that are re-mounted as needed). For multiple processes to run
automatically when such a container is started, it can be launched
using lxc-start, and a separate instance of systemd can manage just
the processes inside the container. (This assumes that the path to
the systemd unit files is re-mounted and only contains the services
that should run inside the container.) For this use case, autodev
should be permitted for a container that does not have a rootfs.
Signed-off-by: David Ward <david.ward@ll.mit.edu> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
- Passing the LXC_CLONE_KEEPNAME flag to do_lxcapi_clone() was not respected and
let to unexpected behaviour for e.g. lxc-clone. We wrap
clear_unexp_config_line() and set_config_item_line() in an appropriate
if-condition.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
KATOH Yasufumi [Wed, 19 Aug 2015 11:35:36 +0000 (20:35 +0900)]
doc: Update lxc.cgroup.use in lxc.system.conf(5)
LXC now uses lxc.cgroup.use even when cgmanager is used.
So remove the description for the case of using cgmanager.
And add the case of not specifying it.
This commit only updates en and ja man pages.
Robert Schiele [Fri, 21 Aug 2015 05:35:34 +0000 (07:35 +0200)]
check for NULL pointers before calling setenv()
Latest glibc release actually honours calling setenv with a NULL
pointer by causing SIGSEGV but checking pointers before submitting
to any system function is a good idea anyway.
Signed-off-by: Robert Schiele <rschiele@gmail.com>
Tycho Andersen [Fri, 14 Aug 2015 16:24:47 +0000 (10:24 -0600)]
c/r: enable tracefs
tracefs is a new filesystem that can be mounted by users. Only the options
and fs name need to be passed to restore the state, so we can use criu's
auto fs feature.
Robert LeBlanc [Thu, 13 Aug 2015 19:36:55 +0000 (13:36 -0600)]
Caps are getting lost when cloning an LXC. Adding the -X parameter copies the extended attributes. This allows things like ping to continue to be used by a non-privilged user in Debian at least.
Tycho Andersen [Mon, 10 Aug 2015 17:12:18 +0000 (11:12 -0600)]
c/r: get rid of dump_net_info()
This was originally used to propagate the bridge and veth names across
hosts, but now we extract both from the container's config file, and
nothing reads the files that dump_net_info() writes, so let's just get rid
of them.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
reuse label cleanup since free(NULL) is a no-op Signed-off-by: Arjun Sreedharan <arjun024@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
When setting lxc.network.veth.pair to get a fixed interface
name the recreation of it after a reboot caused an EEXIST.
-) The reboot flag is now a three-state value. It's set to
1 to request a reboot, and 2 during a reboot until after
lxc_spawn where it is reset to 0.
-) If the reboot is set (!= 0) within instantiate_veth and
a fixed name is used, the interface is now deleted before
being recreated.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Jiri Slaby [Wed, 5 Aug 2015 08:32:54 +0000 (10:32 +0200)]
templates: lxc-opensuse, use rpm to determine build version
zypper info's output is not usable for several reasons:
* it is localized -- there is no "Version: " in my output
* it shows results both from the repo and local system
So use plain rpm to determine whether build is installed and if proper
version is in place.