I've noticed that a bunch of the code we've included over the past few
weeks has been using 8-spaces rather than tabs, making it all very hard
to read depending on your tabstop setting.
This commit attempts to revert all of that back to proper tabs and fix a
few more cases I've noticed here and there.
No functional changes are included in this commit.
Otherwise the kernel will umount when it gets around to it, but
on lxc_destroy we may race with it and fail the rmdir of
the overmounted (BUSY) rootfs.
We can't rsync the delta as unpriv user because we can't create
the chardevs representing a whiteout. We can however rsync the
rootfs and have the kernel create the whiteouts for us.
Add a nesting.conf which can be included to support nesting containers (v2)
Newer kernels have added a new restriction: if /proc or /sys on the
host has files or non-empty directories which are over-mounted, and
there is no /proc which is fully visible, then it assumes there is a
"security" reason for this. It prevents anyone in a non-initial user
namespace from creating a new proc or sysfs mount.
To work around this, this patch adds a new 'nesting.conf' which can be
lxc.include'd from a container configuration file. It adds a
non-overmounted mount of /proc and /sys under /dev/.lxc, so that the
kernel can see that we're not trying to *hide* things like /proc/uptime
and /sys/devices/virtual/net. If the host adds this to the config file
for container w1, then container w1 will support unprivileged child
containers.
The nesting.conf file also sets the apparmor profile to the with-nesting
variant, since that is required anyway. This actually means that
supporting nesting isn't really more work than it used to be, just
different. Instead of adding
Finally, in order to maintain the current apparmor protections on
proc and sys, we make /dev/.lxc/{proc,sys} non-read/writeable.
We don't need to be able to use them, we're just showing the
kernel what's what.
Major Hayden [Wed, 2 Sep 2015 21:21:11 +0000 (16:21 -0500)]
Tear down network devices during container halt
On very busy systems, some virtual network devices won't be destroyed after a
container halts. This patch uses the lxc_delete_network() method to ensure
that network devices attached to the container are destroyed when the
container halts.
Without the patch, some virtual network devices are left over on the system
and must be removed with `ip link del <device>`. This caused containers
with lxc.network.veth.pair to not be able to start. For containers using
randomly generated virtual network device names, the old devices will hang
around on the bridge with their original MAC address.
David Ward [Tue, 23 Jun 2015 14:57:19 +0000 (10:57 -0400)]
Only mount /proc if needed, even without a rootfs
Use the same code with and without a rootfs to check if mounting
/proc is necessary before doing so. If mounting it is unsuccessful
and there is no rootfs, continue as before.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
David Ward [Tue, 23 Jun 2015 14:57:23 +0000 (10:57 -0400)]
Allow autodev without a rootfs
A container without a rootfs is useful for running a collection of
processes in separate namespaces (to provide separate networking as
an example), while sharing the host filesystem (except for specific
paths that are re-mounted as needed). For multiple processes to run
automatically when such a container is started, it can be launched
using lxc-start, and a separate instance of systemd can manage just
the processes inside the container. (This assumes that the path to
the systemd unit files is re-mounted and only contains the services
that should run inside the container.) For this use case, autodev
should be permitted for a container that does not have a rootfs.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
- The LXC_CLONE_KEEPNAME flag passed to do_lxcapi_clone() was not respected,
which led to unexpected behaviour for e.g. lxc-clone. We wrap
clear_unexp_config_line() and set_config_item_line() in an appropriate
if-condition.
Signed-off-by: Christian Brauner <christianvanbrauner@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
KATOH Yasufumi [Wed, 19 Aug 2015 11:35:36 +0000 (20:35 +0900)]
doc: Update lxc.cgroup.use in lxc.system.conf(5)
LXC now uses lxc.cgroup.use even when cgmanager is used.
So remove the description for the case of using cgmanager.
And add the case of not specifying it.
This commit only updates en and ja man pages.
Robert Schiele [Fri, 21 Aug 2015 05:35:34 +0000 (07:35 +0200)]
check for NULL pointers before calling setenv()
The latest glibc release actually honours calling setenv() with a NULL
pointer by causing SIGSEGV, but checking pointers before passing them
to any system function is a good idea anyway.
Signed-off-by: Robert Schiele <rschiele@gmail.com>
Tycho Andersen [Fri, 14 Aug 2015 16:24:47 +0000 (10:24 -0600)]
c/r: enable tracefs
tracefs is a new filesystem that can be mounted by users. Only the options
and fs name need to be passed to restore the state, so we can use criu's
auto fs feature.
Robert LeBlanc [Thu, 13 Aug 2015 19:36:55 +0000 (13:36 -0600)]
Caps are getting lost when cloning an LXC. Adding the -X parameter copies the extended attributes. This allows things like ping to continue to be used by a non-privileged user in Debian at least.
Tycho Andersen [Mon, 10 Aug 2015 17:12:18 +0000 (11:12 -0600)]
c/r: get rid of dump_net_info()
This was originally used to propagate the bridge and veth names across
hosts, but now we extract both from the container's config file, and
nothing reads the files that dump_net_info() writes, so let's just get rid
of them.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
reuse label cleanup since free(NULL) is a no-op
Signed-off-by: Arjun Sreedharan <arjun024@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
When setting lxc.network.veth.pair to get a fixed interface
name, recreating the interface after a reboot caused an EEXIST.
-) The reboot flag is now a three-state value. It's set to
1 to request a reboot, and 2 during a reboot until after
lxc_spawn, where it is reset to 0.
-) If the reboot flag is set (!= 0) within instantiate_veth and
a fixed name is used, the interface is now deleted before
being recreated.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
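The three-state reboot flag described above can be sketched as follows. This is an illustrative Python model, not the LXC code: the constant names, the `existing` set standing in for the host's interfaces, and the returned action list are all assumptions; only the values 0/1/2 and the delete-before-recreate rule come from the commit message.

```python
# Hypothetical names; only the 0/1/2 values and the
# "delete before recreate" rule come from the commit message.
REBOOT_NONE = 0         # no reboot pending
REBOOT_REQUESTED = 1    # a reboot has been requested
REBOOT_IN_PROGRESS = 2  # set during the reboot, reset to 0 after spawn


def instantiate_veth(existing, reboot, fixed_name=None):
    """Model (re)creating a veth pair and return the actions taken.

    `existing` is a set of interface names standing in for host state.
    """
    actions = []
    name = fixed_name or "vethRANDOM"  # placeholder for a generated name

    # With a fixed name and a reboot pending or in progress, remove any
    # stale interface first so the create below cannot collide with it.
    if reboot != REBOOT_NONE and fixed_name is not None:
        existing.discard(name)
        actions.append(("delete", name))

    if name in existing:
        # Models the EEXIST failure mentioned in the commit message.
        raise FileExistsError(name)

    existing.add(name)
    actions.append(("create", name))
    return actions
```

Calling `instantiate_veth({"veth-web"}, REBOOT_IN_PROGRESS, "veth-web")` deletes and recreates the interface, while the same call with `REBOOT_NONE` raises `FileExistsError`, which mirrors the failure before the fix.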
Jiri Slaby [Wed, 5 Aug 2015 08:32:54 +0000 (10:32 +0200)]
templates: lxc-opensuse, use rpm to determine build version
zypper info's output is not usable for several reasons:
* it is localized -- there is no "Version: " in my output
* it shows results both from the repo and local system
So use plain rpm to determine whether build is installed and whether
the proper version is in place.
1) Two checks on amd64 for whether compat_ctx has already
been generated were redundant, as compat_ctx is generally
generated before entering the parsing loop.
2) With introduction of reject_force_umount the check for
whether the syscall has the same id on both native and
compat archs results in incorrect behavior as this is an
internal keyword and thus produces a -1 on
seccomp_syscall_resolve_name_arch().
The result was that it was added to the native architecture
twice and never to the 32 bit architecture, causing it to
have no effect on 32 bit containers on 64 bit hosts.
3) I do not see a reason to care about whether the syscalls
have the same number on the two architectures. On the one
hand this check was there to avoid adding it to two archs
(and effectively leaving one arch unprotected), while on
the other hand it seemed to be okay to add it to the
same arch *twice*.
The entire architecture checking branches are now reduced to
three simple cases: 'native', 'non-native' and 'all'. With
'all' adding to both architectures regardless of the syscall
ID.
Also note that libseccomp had a bug in its architecture
checking, so architecture related filters weren't working as
expected before version 2.2.2, which may have contributed to
the confusion in the original architecture-related code.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
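The reduced branching can be pictured with a small model. This is a sketch in Python rather than the actual libseccomp-based C code: the function names and the use of plain lists as filter contexts are assumptions made for illustration.

```python
def target_contexts(rule_arch, native_ctx, compat_ctx):
    """Pick the filter contexts a rule is added to (hypothetical helper)."""
    if rule_arch == "native":
        return [native_ctx]
    if rule_arch == "non-native":
        return [compat_ctx]
    if rule_arch == "all":
        # Added to both architectures, regardless of the syscall ID.
        return [native_ctx, compat_ctx]
    raise ValueError("unknown architecture selector: %r" % (rule_arch,))


def add_rule(rule, rule_arch, native_ctx, compat_ctx):
    """Append `rule` once to every context selected by `rule_arch`."""
    for ctx in target_contexts(rule_arch, native_ctx, compat_ctx):
        ctx.append(rule)


# Plain lists stand in for the native and compat seccomp contexts.
native, compat = [], []
add_rule("reject_force_umount", "all", native, compat)
```

After this call both `native` and `compat` hold the rule exactly once, which is the behaviour the commit describes for 32 bit containers on 64 bit hosts.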
The Fedora 22 squashfs doesn't appear to work and the Fedora 21 one isn't
available, so let's use the Fedora archive mirror and pull the good old
Fedora 20 squashfs.
Loop devices can be added on the fly when needed, they're
not always created beforehand. The loop-control device can
be used to find and allocate the next available number
instead of going through the /dev directory contents (which
is now only a fallback mechanism).
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
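For readers unfamiliar with loop-control, the allocation step can be sketched like this. It is a minimal Python illustration, not the LXC implementation: the function name and the choice to return None on any failure are assumptions, while `LOOP_CTL_GET_FREE` is the ioctl request defined in `<linux/loop.h>`.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # ioctl request number from <linux/loop.h>


def next_free_loop(control_path="/dev/loop-control"):
    """Ask the kernel for the next free loop device number.

    Returns the number N (for /dev/loopN), or None when the control
    device is missing or the ioctl fails, in which case a caller would
    fall back to scanning /dev.
    """
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None
    try:
        # The ioctl's return value is the index of a free loop device.
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
```

Returning None rather than raising keeps the fallback path (walking the /dev directory) a simple conditional in the caller.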
CVE-2015-1334: Don't use the container's /proc during attach
A user could otherwise over-mount /proc and prevent the apparmor profile
or selinux label from being written which combined with a modified
/bin/sh or other commonly used binary would lead to unconfined code
execution.
Reported-by: Roman Fiedler
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
KATOH Yasufumi [Thu, 25 Jun 2015 09:14:04 +0000 (18:14 +0900)]
Support unprivileged ephemeral container using aufs
Since commit 31a882e, an unprivileged container can use aufs.
This patch removes the check for unpriv aufs, and changes the path of
the xino file so that an unprivileged user can mount aufs.
Signed-off-by: KATOH Yasufumi <karma@jazz.email.ne.jp>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Stéphane Graber [Thu, 18 Jun 2015 19:55:45 +0000 (15:55 -0400)]
lxc-net: Use iproute and relative paths everywhere (V2)
V2 changes:
- Keep using /var/lib for the lease file, but make it respect localstatedir
- Don't pass an empty --conf-file as that confuses dnsmasq when
/etc/dnsmasq.conf doesn't exist or isn't readable.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Lenz Grimmer [Fri, 12 Jun 2015 23:08:41 +0000 (01:08 +0200)]
use `hostname` for DHCP_HOSTNAME in ifcfg-eth0
Updated centos/fedora/oracle templates to use `hostname` for DHCP_HOSTNAME in
/etc/sysconfig/network/ifcfg-eth0, so the container's host name is propagated
to the host's DHCP server (e.g. dnsmasq, which also acts as the DNS server).
This resolves lxc/lxd#756
Dennis Schridde [Thu, 11 Jun 2015 17:51:02 +0000 (19:51 +0200)]
Adopt capability drop explanations from other distros on Gentoo, drop setpcap,sys_nice caps
Documents setpcap,sys_admin,sys_resources as breaking systemd, but does not drop them from lxc.cap.drop, as the default init system on Gentoo is OpenRC, thus stuff breaking systemd can be blocked anyway.
This also drops setpcap and sys_nice caps, as these are also dropped in other non-systemd distros.
Most of the explanatory blurb was copied from other distros' configs.
Serge Hallyn [Thu, 11 Jun 2015 04:08:15 +0000 (23:08 -0500)]
daemonized start: exit children on failure, don't return
When starting a daemonized container, only the original parent
thread should return to the caller. The first forked child
immediately exits after forking, but the grandchild
was in some places returning on error, causing a second instance
of the calling function to run.
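The rule "children exit, only the original parent returns" can be sketched with a double fork. This is a Python illustration of the pattern, not the LXC start code: `start_daemonized` and `run_container` are hypothetical names.

```python
import os


def start_daemonized(run_container):
    """Double-fork and run `run_container` in the grandchild.

    Only the original caller ever returns from this function; both
    forked processes leave via os._exit(), even on failure.
    """
    pid = os.fork()
    if pid > 0:
        # Original parent: reap the first child, then return to the caller.
        os.waitpid(pid, 0)
        return True

    try:
        # First child: detach from the session, fork again, exit at once.
        os.setsid()
        if os.fork() > 0:
            os._exit(0)
        # Grandchild: run the workload.
        run_container()
    except BaseException:
        # Returning here would resume the caller's code in a second
        # process, which is the bug the commit describes. Exit instead.
        os._exit(1)
    os._exit(0)
```

Using `os._exit()` rather than `sys.exit()` matters in this sketch: it skips the interpreter's normal cleanup, so the children cannot unwind back into the caller's stack.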
Tycho Andersen [Wed, 10 Jun 2015 21:57:50 +0000 (21:57 +0000)]
uniformly nullify std fds
In various places throughout the code, we want to "nullify" the std fds,
opening them to /dev/null or zero or so. Instead, let's unify this code and do
it in such a way that Coverity (probably) won't complain.
v2: use /dev/null for stdin as well
v3: add a comment about use of C's short circuiting
v4: axe comment, check errors on dup2, s/quiet/need_null_stdfds
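A unified helper of this kind might look like the sketch below. It is written in Python for illustration and is not the C helper from the patch: the name `null_fds` and the boolean return value are assumptions, while using /dev/null for all three descriptors and checking dup2 errors follow the v2 and v4 notes.

```python
import os


def null_fds(fds=(0, 1, 2)):
    """Point every descriptor in `fds` at /dev/null.

    Returns True on success and False if opening /dev/null or any
    dup2 call fails.
    """
    try:
        null = os.open(os.devnull, os.O_RDWR)
    except OSError:
        return False
    try:
        for fd in fds:
            os.dup2(null, fd)  # raises OSError on failure
        return True
    except OSError:
        return False
    finally:
        # Keep the descriptor only if it happens to be one of the targets.
        if null not in fds:
            os.close(null)
```

Taking the descriptor list as a parameter keeps the default behaviour (stdin, stdout, stderr) while letting the helper be exercised on other descriptors, such as the write end of a pipe.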
Daniel Golle [Tue, 9 Jun 2015 10:58:12 +0000 (12:58 +0200)]
fix build on mpc85xx
Initialize ret to 0 so compiler no longer complains about
monitor.c: In function 'lxc_monitor_open':
monitor.c:212:5: error: 'ret' may be used uninitialized in this function [-Werror=maybe-uninitialized]
https://github.com/openwrt/packages/issues/1356
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Serge Hallyn [Tue, 2 Jun 2015 22:33:34 +0000 (22:33 +0000)]
api_start: always close fds 0-2 when daemonized
commit 507cee3618237d3 moved the close and re-open of fds 0-2 into
do_start. But this means that the lxc monitor itself keeps the
caller's fds 0-2 open, which is wrong for daemonized containers.
Closes #548
Reported-by: Mathieu Le Marec - Pasquet <kiorky@cryptelium.net>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Serge Hallyn [Wed, 27 May 2015 10:05:16 +0000 (10:05 +0000)]
cgmanager: attach: never use 'all' controller
We were using the 'all' controller if current was in the same
cgroup for every controller. That doesn't suffice. We'd have to check
the target. At that point we may as well just attach
controller by controller.
An optimization to consider is to check the /proc/initpid/cgroup
for all identical controllers. Let's start by just getting it
right.
Stéphane Graber [Fri, 29 May 2015 15:39:25 +0000 (11:39 -0400)]
Fix ABI compatibility
Until we bump the SONAME to liblxc2, only symbol additions and struct
member additions are allowed.
Adding struct members in the middle of the struct breaks backward
compatibility.
This commit makes it clear when struct members were added and moves a
few members that were added in the middle of the 1.0 struct to the end
of it.
Note that unfortunately that means we're breaking backward compatibility
between LXC 1.1.0 and the state after this commit. Given that 1.1 is
reasonably new, this is the least damaging way of fixing the problem.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
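Why a member added in the middle breaks compatibility can be shown by comparing field offsets. The sketch below uses Python's ctypes to lay out three hypothetical structs; the struct and member names are invented for illustration and are not the liblxc definitions.

```python
import ctypes


class ContainerV1(ctypes.Structure):
    # Layout that already-compiled callers were built against.
    _fields_ = [("name", ctypes.c_void_p),
                ("start", ctypes.c_void_p)]


class BrokenV2(ctypes.Structure):
    # New member inserted in the middle: "start" moves.
    _fields_ = [("name", ctypes.c_void_p),
                ("new_member", ctypes.c_void_p),
                ("start", ctypes.c_void_p)]


class CompatibleV2(ctypes.Structure):
    # New member appended at the end: existing offsets are unchanged.
    _fields_ = [("name", ctypes.c_void_p),
                ("start", ctypes.c_void_p),
                ("new_member", ctypes.c_void_p)]


def offsets(struct):
    """Map each field name of `struct` to its byte offset."""
    return {name: getattr(struct, name).offset for name, _ in struct._fields_}
```

A binary compiled against `ContainerV1` reads `start` at a fixed offset. With `CompatibleV2` that offset still holds `start`; with `BrokenV2` it holds `new_member` instead, so old callers would silently use the wrong member.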
Dwight Schauer [Tue, 2 Jun 2015 04:41:09 +0000 (23:41 -0500)]
The yum in CentOS 5.11 does not know about '--releasever', which is used by: lxc-create ... -- release=VERSION
The release version only needs to be set in the outer bootstrap, not the inner one.
With this change an lxc-create bootstrap of CentOS 5.11 completes enough to be usable.
CentOS 5.11 containers can be created, started, stopped, and networking works.
Signed-off-by: Dwight Schauer <das@teegra.net>