So far, we opened a file descriptor refering to proc on the host inside the
host namespace and handed that fd to the attached process in
attach_child_main(). This was done to ensure that LSM labels were correctly
setup. However, by exploiting a potential kernel bug, ptrace could be used to
prevent the file descriptor from being closed which in turn could be used by an
unprivileged container to gain access to the host namespace. Aside from this
needing an upstream kernel fix, we should make sure that we don't pass the fd
for proc itself to the attached process. However, we cannot completely prevent
this, as the attached process needs to be able to change its apparmor profile
by writing to /proc/self/attr/exec or /proc/self/attr/current. To minimize the
attack surface, we only send the fd for /proc/self/attr/exec or
/proc/self/attr/current to the attached process. To do this we introduce a
little more IPC between the child and parent:
* IPC mechanism: (X is receiver)
* initial process intermediate attached
* X <--- send pid of
* attached proc,
* then exit
* send 0 ------------------------------------> X
* [do initialization]
* X <------------------------------------ send 1
* [add to cgroup, ...]
* send 2 ------------------------------------> X
* [set LXC_ATTACH_NO_NEW_PRIVS]
* X <------------------------------------ send 3
* [open LSM label fd]
* send 4 ------------------------------------> X
* [set LSM label]
* close socket close socket
* run program
The attached child tells the parent when it is ready to have its LSM labels set
up. The parent then opens an approriate fd for the child PID to
/proc/<pid>/attr/exec or /proc/<pid>/attr/current and sends it via SCM_RIGHTS
to the child. The child can then set its LSM laben. Both sides then close the
socket fds and the child execs the requested process.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Stéphane Graber [Mon, 14 Nov 2016 16:53:07 +0000 (11:53 -0500)]
debian: Don't depend on libui-dialog-perl
This package doesn't exist in stretch anymore, and it's unclear why we
were depending on a library to begin with (as opposed to having it
brought by whatever needs it).
Somehow this implementation of a cgroupfs backend decided to use the hierarchy
numbers it detects in /proc/cgroups and /proc/self/cgroups as indices for
the hierarchy struct. Controller numbering usually starts at 1 but may start at
0 if:
a) the controller is not mounted on a cgroups v1 hierarchy;
b) the controller is bound to the cgroups v2 single unified hierarchy; or
c) the controller is disabled
To avoid having to rework our fallback backend significantly, we should
explicitly check for each controller if hierarchy[i] != NULL.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
conf: merge network namespace move & rename on shutdown
On shutdown we move physical network interfaces back to the
host namespace and rename them afterwards as well as in the
later lxc_network_delete() step. However, if the device had
a name which already exists in the host namespace then the
moving fails and so do the subsequent rename attempts. When
the namespace ceases to exist the devices finally end up
in the host namespace named 'dev<ID>' by the kernel.
In order to avoid this, we do the moving and renaming in a
single step (lxc_netdev_move_by_*()'s move & rename happen
in a single netlink transaction).
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Evgeni Golov [Sat, 8 Oct 2016 16:29:30 +0000 (18:29 +0200)]
mark the python examples as having utf-8 encoding
this allows running them also under Python2, which otherwise
would choke on Stéphane's name and error out with
SyntaxError: Non-ASCII character '\xc3' in file …
- We expect destroy to fail in zfs_clone() so try to silence it so users are
not irritated when they create zfs snapshots.
- Add -r recursive to zfs_destroy(). This code is only hit when a) the
container has no snapshots or b) the user calls destroy with snapshots. So
this should be safe. Without -r snapshots will remain.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
James Cowgill [Mon, 15 Aug 2016 16:09:44 +0000 (16:09 +0000)]
seccomp: Implement MIPS seccomp handling
MIPS processors implement 3 ABIs: o32, n64 and n32 (similar to x32). The kernel
treats each ABI separately so syscalls disallowed on "all" arches should be
added to all three seccomp sets. This is implemented by expanding compat_arch
and compat_ctx to accept two compat architectures.
After this, the MIPS hostarch detection code and config section code is added.
Signed-off-by: James Cowgill <james410@cowgill.org.uk>
Stéphane Graber [Wed, 17 Aug 2016 19:42:34 +0000 (15:42 -0400)]
Use full GPG fingerprint instead of long IDs.
With how easy it is to create a collision on a short ID nowadays and
given that the user doesn't actually have to remember or manually enter
the key ID, lets just use the full fingerprint from now on.
This fixes a double free corruption on container-requested
reboots when lxc_spawn() fails before receiving the ttys, as
lxc_fini() (part of __lxc_start()'s cleanup) calls
lxc_delete_tty().
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Antonio Terceiro [Wed, 29 Jun 2016 17:58:35 +0000 (14:58 -0300)]
lxc-debian: fix regression when creating wheezy containers
The regression was introduced by commit 3c39b0b7a2b445e08d2e2aecb05566075f4f3423 which makes it possible to
create working stretch containers by forcinig `init` to be in the
included package list.
However, `init` didn't exit before jessie, so now for wheezy we
explicitly include `sysvinit`; sysvinit on wheezy is essential,
so it would already be included anyway.
Signed-off-by: Antonio Terceiro <terceiro@debian.org>
Preetam D'Souza [Tue, 28 Jun 2016 03:12:12 +0000 (23:12 -0400)]
Include all lxcmntent.h function declarations on Bionic
Newer versions of Android (5.0+, aka API Level 21+) include mntent.h,
which declares setmntent and endmntent. This hits an edge
case with the preprocessor checks in lxcmntent.h because HAVE_SETMNTENT
and HAVE_ENDMNTENT are both defined (in Bionic's mntent.h), but conf.c
always includes lxcmntent.h on Bionic! As a result, we get compiler
warnings of implicit function declarations for setmntent endmntent.
This patch always includes setmntent/endmntent/hasmntopt function
declarations on Bionic, which gets rid of these warnings.
Antonio Terceiro [Fri, 17 Jun 2016 22:00:56 +0000 (19:00 -0300)]
lxc-debian: make sure init is installed
init 1.34 is not "Essential" anymore, in order to make it not required
on minimal chroots, docker containers, etc. Because of that we now need
to manually include it on systems that are expected to boot.
Signed-off-by: Antonio Terceiro <terceiro@debian.org>
Stewart Brodie [Tue, 10 May 2016 12:57:00 +0000 (13:57 +0100)]
Allow configuration file values to be quoted
If the value starts and ends with matching quote characters, those
characters are stripped automatically. Quote characters are the
single quote (') or double quote ("). The quote removal is done after
the whitespace trimming.
This is needed particularly in order that lxc.environment values may
have trailing spaces. However, the quote removal is done for all values
in the parse_line function, as it has non-const access to the value.
Signed-off-by: Stewart Brodie <stewart@metahusky.net>
Aron Podrigal [Sun, 1 May 2016 15:06:53 +0000 (11:06 -0400)]
Fixed - set PyErr when Container.__init__ fails
When container init failed for whatever reason, previously it resulted
in a `SystemError: NULL result without error in PyObject_Call`
This will now result in a RuntimeError with the error message
previously printed to stderr.
nicer date format and support for SOURCE_DATE_EPOCH in LXC_GENERATE_DATE
Using $(date) for LXC_GENERATE_DATE has various flaws:
* formating depends on the locale of the system we execute configure on
* the output is not really a date but more a timestamp
Let's use $(date --utc '+%Y-%m-%d') instead.
While at it, also support SOURCE_DATE_EPOCH [1] to make the build
reproducible
All uses of netlink_open() assume that on error the
nl_handler doesn't need to be closed, but some error cases
happen after the socket was opened successfully and used to
simply return -errno.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
A change in kernel 4.2 caused btrfs_recursive_destroy to
fail to delete unprivileged containers. This patch restores
the pre-kernel-4.2 behaviour. Ref: Issue 935.
lxc-busybox: Remove warning for dynamically linked Busybox
The warning has been present since commit 32b37181ea (with no purpose stated).
Support for dynamically linked Busybox has been added since commit bf6cc73696.
Haven't encountered any issues with dynamically linked Busybox in my last
2 years' testing.
open_without_symlink: Don't SYSERROR on something else than ELOOP
The open_without_symlink routine has been specifically created to prevent
mounts with synlinks as source or destination. Keep SYSERROR'ing in that
particular scenario, but leave error handling to calling functions for the
other ones - e.g. optional bind mount when the source dir doesn't exist
throws a nasty error.
Serge Hallyn [Thu, 25 Feb 2016 19:01:12 +0000 (11:01 -0800)]
cgfs: make sure we use valid cgroup mountpoints
If lxcfs starts before cgroup-lite, then the first cgroup mountpoints in
/proc/self/mountinfo are /run/lxcfs/*. Unprivileged users cannot access
these. So privileged containers are ok, and unprivileged containers are ok
since they won't cache those to begin with. But unprivileged root-owned
containers cache /run/lxcfs/* and then try to use them.
So when doing cgroup automounting check whether the mountpoints we have
stored are accessible, and if not look for a new one to use.
Ubuntu [Sat, 20 Feb 2016 02:25:55 +0000 (02:25 +0000)]
lxc: cgfs: handle lxcfs
When containers have lxcfs mounted instead of cgroupfs, we have to
process /proc/self/mountinfo a bit differently. In particular, we
should look for fuse.lxcfs fstype, we need to look elsewhere for the
list of comounted controllers, and the mount_prefix is not a cgroup path
which was bind mounted, so we should ignore it, and named subsystems
show up without the 'name=' prefix.
With this patchset I can start containers inside a privileged lxd
container with lxcfs mounted (i.e. without cgroup namespaces).
Serge Hallyn [Fri, 19 Feb 2016 22:12:47 +0000 (14:12 -0800)]
cgroups: do not fail if setting devices cgroup fails due to EPERM
If we're trying to allow a device which was denied to our parent
container, just continue.
Cgmanager does not help us to distinguish between eperm and other
errors, so just always continue.
We may want to consider actually computing the range of devices
to which the container monitor has access, but OTOH that introduces
a whole new set of complexity to compute access sets.
Serge Hallyn [Mon, 15 Feb 2016 20:15:10 +0000 (12:15 -0800)]
log.c:__lxc_log_set_file: fname cannot be null
fname cannot be passed in as NULL by any of its current callers. If it
could, then build_dir() would crash as it doesn't check for it. So make
sure we are warned if in the future we pass in NULL.