Fix incorrect timeout handling of do_reboot_and_check()
Currently do_reboot_and_check() decrements the timeout variable even if
it is set to -1, so running 'lxc-stop --reboot --timeout=-1 ...' will
exit immediately at the end of the second iteration of the loop, without
waiting for the container to reboot.
Also, there is no need to call gettimeofday if timeout is set to -1, so
these statements should be evaluated only when timeout is enabled.
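The intended loop behaviour can be modelled roughly as follows. This is a minimal Python sketch of the logic, not the C code in lxc-stop; the names wait_for_reboot, has_rebooted, poll_interval, clock and sleep are hypothetical, and treating a timeout of 0 as "check once" is an assumption.

```python
import time


def wait_for_reboot(has_rebooted, timeout, poll_interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll has_rebooted() until it is true or the timeout runs out.

    timeout == -1 means "wait forever" (as in --timeout=-1); in that case
    the timeout is never decremented and the clock is never read.
    """
    while True:
        if has_rebooted():
            return True
        if timeout == 0:
            return False
        # Only read the clock when a timeout is actually enabled.
        started = clock() if timeout > 0 else None
        sleep(poll_interval)
        if timeout > 0:
            elapsed = clock() - started
            timeout = max(0, timeout - elapsed)
```

With a timeout of -1 the loop only ends once has_rebooted() reports success; with a positive timeout it gives up once the elapsed time has used up the budget.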
Signed-off-by: Yuto KAWAMURA(kawamuray) <kawamuray.dadada@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
chown_mapped_root: don't try chgrp if we don't own the file
New kernels require that, to have privilege over a file, your
userns must have both the old and new groups mapped into it.
So if a file is owned by our uid but another groupid, then we
have to chgrp the file to our primary group before we can try
(in a new user namespace) to chgrp the file to a group id in the
namespace.
But in some cases (when cloning) the file may already be mapped
into the container. Now we cannot chgrp the file to our own
primary group - and we don't have to.
So detect that case. Only try to chgrp the file to our primary
group if the file is owned by our euid (i.e. not by the container)
and the owning group is not already mapped into the container by
default.
With this patch, I'm again able to both create and clone containers
with no errors.
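The detection rule described above can be expressed as a small predicate. This is a sketch of the decision only, written in Python rather than taken from chown_mapped_root(); the function name and the container_gids parameter (the host gids mapped into the container) are hypothetical.

```python
def needs_chgrp_to_primary_group(file_uid, file_gid, euid, container_gids):
    """Return True if the file should first be chgrp'ed to our primary group.

    That is only needed when we own the file (it is not owned by the
    container) and its group is not already mapped into the container.
    """
    owned_by_us = file_uid == euid
    group_already_mapped = file_gid in container_gids
    return owned_by_us and not group_already_mapped
```

For a cloned container whose files already carry mapped ids, the predicate is false and the extra chgrp is skipped.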
TAMUKI Shoichi [Sat, 28 Jun 2014 09:39:54 +0000 (18:39 +0900)]
Fix to work lxc-destroy with unprivileged containers on recent kernel
Change idmap_add_id() to add both ID_TYPE_UID and ID_TYPE_GID entries
to an existing lxc_conf, not just an ID_TYPE_UID entry, so that
lxc-destroy works with unprivileged containers on recent kernels.
TAMUKI Shoichi [Fri, 27 Jun 2014 08:29:01 +0000 (17:29 +0900)]
Fix to work lxc-start with unprivileged containers on recent kernel
Change chown_mapped_root() to map in both the root uid and gid, not
just the uid, so that lxc-start works with unprivileged containers on
recent kernels.
Serge Hallyn [Thu, 26 Jun 2014 21:44:46 +0000 (16:44 -0500)]
cgmanager: have cgm_set and cgm_get use absolute path when possible
This allows users to get/set cgroup settings when logged into a different
session than that from which they started the container.
There is no cgmanager command to do an _abs variant of cgmanager_get_value
and cgmanager_set_value. So we fork off a new task, which enters the
parent cgroup of the started container, then can get/set the value from
there. The reason not to go straight into the container's cgroup is that
if we are freezing the container, or the container is already frozen, we'll
freeze as well :) The reason to fork off a new task is that if we are
in a cgroup which is set to remove-on-empty, we may not be able to return
to our original cgroup after making the change.
This should fix https://github.com/lxc/lxc/issues/246
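The fork-and-report pattern described above can be sketched as follows. This is an illustration in Python, not the cgmanager code: run_in_child and action are hypothetical names, and the action stands in for "enter the parent cgroup, then get or set the value". Whatever state the child changes is discarded when it exits, so the parent never has to find its way back.

```python
import os


def run_in_child(action):
    """Run action() in a forked child and return the string it produced."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the work, report the result through the pipe, then exit.
        status = 1
        try:
            os.close(read_fd)
            os.write(write_fd, action().encode())
            status = 0
        finally:
            os._exit(status)
    # Parent: read the child's answer and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("child task failed")
    return data.decode()
```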
Prevent write_config from corrupting container config
write_config doesn't check the value that the sig_name function returns;
this causes write_config to produce a corrupted container config when
non-predefined signal names are used.
Signed-off-by: Alexander Vladimirov <alexander.idkfa.vladimirov@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Fri, 20 Jun 2014 20:40:42 +0000 (15:40 -0500)]
ubuntu containers: use a seccomp filter by default (v2)
Blacklist module loading, kexec, and open_by_handle_at (the cause of the
not-docker-specific dockerinit mounts namespace escape).
This should be applied to all arches, but iiuc stgraber will be doing
some reworking of the commonizations which will simplify that, so I'm
not doing it here.
Serge Hallyn [Fri, 20 Jun 2014 19:58:41 +0000 (14:58 -0500)]
seccomp: fix 32-bit rules
When calling seccomp_rule_add(), you must pass the native syscall number
even if the context is a 32-bit context. So use resolve_name rather
than resolve_name_arch.
Enhance the check of /proc/self/status for Seccomp: so that we do not
enable seccomp policies if seccomp is not built into the kernel. This
is needed before we can enable by-default seccomp policies (which we
want to do next).
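The kind of check meant here can be illustrated with a short parser. This is a Python sketch, not the C check in lxc; the function names are hypothetical. It relies on the kernel printing a "Seccomp:" field in /proc/self/status only when seccomp support is built in.

```python
def seccomp_mode(status_text):
    """Return the value of the 'Seccomp:' field, or None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None


def kernel_has_seccomp(status_text):
    """Seccomp policies should only be enabled if the field is present."""
    return seccomp_mode(status_text) is not None
```

A caller would pass in the contents of /proc/self/status and skip loading any policy when kernel_has_seccomp() is false.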
Fix wrong return value check from seccomp_arch_exist, and remove
needless abstraction in arch handling.
Serge Hallyn [Thu, 19 Jun 2014 20:52:34 +0000 (20:52 +0000)]
seccomp: support 'all' arch sections (plus bugfixes)
seccomp_ctx is already a void*, so don't use 'scmp_filter_ctx *'
Separately track the native arch from the arch a rule is aimed at.
Clearly ignore irrelevant architectures (i.e. arm rules on x86)
Don't try to load seccomp (and don't fail) if we are already
seccomp-confined. Otherwise nested containers fail.
Make it clear that the extra seccomp ctx is only for compat calls
on 64-bit arch. (This will be extended to arm64 when libseccomp
supports it). Power may complicate this (if ever it is supported)
and require a rethink and rewrite.
NOTE - currently when starting a 32-bit container on 64-bit host,
rules pertaining to 32-bit syscalls (as opposed to ones which have
the same syscall #) appear to be ignored. I can reproduce that without
lxc, so either there is a bug in seccomp or a fundamental
misunderstanding in how I'm merging the contexts.
Rereading the seccomp_rule_add manpage suggests that keeping the second
seccomp context may not be necessary, but this is not something I care
to test right now. If it's true, then the code could be simplified, and
it may solve my concerns about power.
With this patch I'm able to start nested containers (with seccomp
policies defined) including 32-bit and 32-bit-in-64-bit.
[ this patch does not yet add the default seccomp policy ]
Dwight Engen [Thu, 19 Jun 2014 13:01:26 +0000 (09:01 -0400)]
don't force dropping capabilities in lxc-init
Commit 0af683cf added clearing of capabilities to lxc-init, but only
after lxc_setup_fs() was done, likely so that the mounting done in
that routine wouldn't fail.
However, in my testing lxc_caps_reset() wasn't really effective
anyway since it did not clear the bounding set. Adding prctl
PR_CAPBSET_DROP in a loop from 0 to CAP_LAST_CAP would fix this, but I
don't think it's necessary to forcefully clear all capabilities since
users can now specify lxc.cap.keep = none to drop all capabilities.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Wed, 18 Jun 2014 19:36:37 +0000 (19:36 +0000)]
seccomp: warn but continue on unresolvable syscalls
If a syscall is listed which is not resolvable, continue. This allows
us to keep a more complete list of syscalls in a global seccomp policy
without having to worry about older kernels not supporting the newer
syscalls.
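The warn-and-continue behaviour can be sketched like this. It is a Python illustration, not the lxc seccomp parser: resolve_policy and KNOWN_SYSCALLS are hypothetical, and the small table merely stands in for whatever the running kernel and libseccomp can resolve.

```python
import logging

log = logging.getLogger("seccomp-sketch")

# Illustrative stand-in for the syscalls the running kernel knows about.
KNOWN_SYSCALLS = {"init_module": 175, "kexec_load": 246, "open_by_handle_at": 304}


def resolve_policy(names, table=KNOWN_SYSCALLS):
    """Resolve syscall names to numbers, skipping unknown ones with a warning."""
    resolved = {}
    for name in names:
        number = table.get(name)
        if number is None:
            # Older kernels may not know newer syscalls: warn, do not fail.
            log.warning("seccomp: could not resolve syscall %s, skipping", name)
            continue
        resolved[name] = number
    return resolved
```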
Stéphane Graber [Fri, 13 Jun 2014 21:45:26 +0000 (17:45 -0400)]
tests: Avoid the download template when possible
The use of the download template with a hardcoded --arch=amd64 in aa.c
was causing test failures on any platform incapable of running amd64
binaries.
This wasn't noticed in the CI environment as we run the tests within
containers on an amd64 kernel but this caused failures on the Ubuntu CI
environment.
Instead, let's use the busybox template, tweaking the configuration when
needed to match the needs of the testcase.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Stéphane Graber [Mon, 9 Jun 2014 21:13:56 +0000 (17:13 -0400)]
tests: Wait 5s for init to respond in lxc-test-autostart
lxc-test-autostart occasionally fails at the restart test in the CI
environment. Looking at the current test case, the most obvious race
there is if lxc-wait exits successfully immediately after LXC marked the
container RUNNING (init spawned) but before init had a chance to set up
the signal handlers.
To avoid this potential race period, let's add a 5s delay between the
tests to give a chance for init to finish starting up.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Backport of autoboot/autostart rollup to stable-1.0
Full backport of the autostart / autoboot rollup patch from
master to stable-1.0.
lxc-autostart: rework boot and group handling
This adds new functionality to lxc-autostart.
*) The -g / --groups option may be specified multiple times and is cumulative.
This may be mixed freely with the previous comma separated
group list convention. Groups are processed in the
order they first appear in the aggregated group list.
*) The NULL group may be specified in the group list using either a
leading comma, a trailing comma, or an embedded comma.
*) Booting proceeds in order of the groups specified on the command line
then ordered by lxc.start.order and name collating sequence.
*) Default host bootup is now specified as "-g onboot," meaning that first
the "onboot" group is booted and then any remaining enabled
containers in the NULL group are booted.
*) Adds documentation to lxc-autostart for -g processing order and
combinations.
*) Parameterizes bootgroups, options, and shutdown delay in init scripts
and services.
*) Update the various init scripts to use lxc-autostart in a similar way.
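The -g aggregation rules listed above can be illustrated with a short sketch. This is Python, not the lxc-autostart implementation; parse_group_args is a hypothetical name, and representing the NULL group as None is an assumption made for the example.

```python
def parse_group_args(group_args):
    """Merge repeated -g values into one ordered, de-duplicated group list.

    Each value may itself be a comma separated list. An empty element
    (leading, trailing, or embedded comma) selects the NULL group, shown
    here as None. Groups keep the position of their first appearance.
    """
    ordered = []
    for arg in group_args:
        for name in arg.split(","):
            group = name if name else None
            if group not in ordered:
                ordered.append(group)
    return ordered
```

Under these assumptions the default host bootup value "onboot," yields the "onboot" group followed by the NULL group.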
Reported-by: CDR <venefax@gmail.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Michael H. Warfield <mhw@WittsEnd.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Stéphane Graber [Wed, 4 Jun 2014 18:05:25 +0000 (14:05 -0400)]
Try to be more helpful on container startup failure
This hides some of the confusing "command X failed to receive response"
errors, which are usually caused by another, more understandable error.
On failure to start() from lxc-start, a new error message is displayed,
suggesting the user sets logfile and loglevel and if using -d, restarts
the container in the foreground instead.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Wed, 4 Jun 2014 15:16:10 +0000 (10:16 -0500)]
Specially handle block device rootfs
It is not possible to mount a block device from a non-init user namespace.
Therefore if root on the host is starting a container with a uid
mapping, and the rootfs is a block device, then mount the rootfs before
we spawn the container init task.
This addresses https://github.com/lxc/lxc/issues/221
Serge Hallyn [Tue, 3 Jun 2014 03:04:12 +0000 (22:04 -0500)]
configure.ac: don't let -lcgmanager end up in LIBS
AC_SEARCH_LIBS always places the library being queried into LIBS. We
don't want that - we were only checking whether a function is
available. Not everything (notably not init.lxc.static) needs to
link against -lcgmanager.
Serge Hallyn [Thu, 22 May 2014 21:53:40 +0000 (16:53 -0500)]
attach: get personality through get_config command
Newer kernels optionally disallow reading /proc/$$/personality by
non-root users. We can get the personality through the lxc command
interface, so do so.
Also try to be more consistent about personality being a signed long.
We had it as int, unsigned long, signed long throughout the code.
Serge Hallyn [Tue, 20 May 2014 16:47:17 +0000 (11:47 -0500)]
cgmanager: slow down there (don't always grab abs cgroup path)
When I converted attach and enter to using move_pid_abs, these needed
to use the new get_pid_cgroup_abs method to get an absolute path. But
for some inexplicable reason I also converted the functions which get
and set cgroup properties to use the absolute paths. These are simply
not compatible with the cgmanager set_value and get_value methods.
This breaks for instance lxc-test-cgpath.
So undo that. With this patch lxc-test-cgpath, lxc-test-autotest,
and lxc-test-concurrent once again pass in a nested container.
Edvinas Klovas [Sat, 10 May 2014 14:47:52 +0000 (16:47 +0200)]
archlinux template: fix lxc.root for btrfs backend
When using the btrfs backend, lxc-create first creates the rootfs in the
/usr/lib/lxc/rootfs directory before moving it to /var/lib/lxc or another
directory supplied on the command line. The archlinux template relied on
$rootfs_path, which made containers created with the btrfs backend have
lxc.rootfs set to /usr/lib/lxc/rootfs. By using $path instead of
$rootfs_path we make sure that lxc.rootfs is always correct.
Signed-off-by: Edvinas Klovas <edvinas@pnd.io>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
On older cgmanager the support was broken. So rather than
fail container starts altogether, just keep the old lxc behavior
in this case by not using name= subsystems.
Edvinas Klovas [Sat, 3 May 2014 17:15:36 +0000 (19:15 +0200)]
archlinux template: added sigpwr handling to systemd (lxc-stop)
archlinux is using systemd and systemd's configuration does not have any
services set up to handle the sigpwr signal which is sent by the lxc-stop
command. By enabling the sigpwr service we make sure that lxc-stop will work.
Serge Hallyn [Thu, 1 May 2014 20:27:55 +0000 (15:27 -0500)]
cgmanager: use absolute cgroup path to switch cgroups at attach
If an unprivileged user does 'lxc-start -n u1' in one
login session, followed by 'lxc-attach -n u1' in another
session, the attach will fail if the sessions are in different
cgroups. The same is true of lxc-cgroup commands.
Address this by using the GetPidCgroupAbs and MovePidAbs methods,
which work with the container's cgroup path relative to
the cgproxy.
Since GetPidCgroupAbs is new to api version 3 in cgmanager,
use the old method if we are on an older cgmanager.
Serge Hallyn [Fri, 2 May 2014 18:36:32 +0000 (13:36 -0500)]
cgmanager: also handle named subsystems (like name=systemd)
Read /proc/self/cgroup instead of /proc/cgroups, so as to catch
named subsystems. Otherwise the containers will not be fully
moved into the container cgroups.
lxc.mount.auto: improve defaults for cgroup and cgroup-full
If the user specifies cgroup or cgroup-full without a specifier (:ro,
:rw or :mixed), this changes the behavior. Previously, these were
simple aliases for the :mixed variants; now they depend on whether the
container also has CAP_SYS_ADMIN; if it does they resolve to the :rw
variants, if it doesn't to the :mixed variants (as before).
If a container has CAP_SYS_ADMIN privileges, any filesystem can be
remounted read-write from within, so initially mounting the cgroup
filesystems partially read-only as a default creates a false sense of
security. It is better to default to full read-write mounts to show the
administrator what keeping CAP_SYS_ADMIN entails.
If an administrator really wants both CAP_SYS_ADMIN and the :mixed
variant of cgroup or cgroup-full automatic mounts, they can still
specify that explicitly; this commit just changes the default without
specifier.
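The new default can be summarised in a few lines. This is a Python sketch of the rule as described in this message, not the lxc config parser; resolve_cgroup_auto_mount and has_cap_sys_admin are hypothetical names.

```python
def resolve_cgroup_auto_mount(entry, has_cap_sys_admin):
    """Expand a bare 'cgroup' or 'cgroup-full' entry to its default variant.

    Entries that already carry a specifier (:ro, :rw, :mixed) and entries
    that are not cgroup mounts are returned unchanged.
    """
    if entry not in ("cgroup", "cgroup-full"):
        return entry
    return entry + (":rw" if has_cap_sys_admin else ":mixed")
```

An explicit "cgroup:mixed" therefore still wins, even for a container that keeps CAP_SYS_ADMIN.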
Currently, setup_caps and dropcaps_except both use the same parsing
logic for parsing capabilities (try to identify by name, but allow
numerical specification). Since this is a common routine, separate it
out to improve maintainability and reusability.
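The shared parsing routine described here (try the name first, then fall back to a number) could look roughly like this. It is a Python sketch, not the C helper from the patch; parse_capability is a hypothetical name, the table is deliberately only a small subset, and accepting only decimal numbers is an assumption.

```python
# Small subset of Linux capability numbers, keyed by the lower-case name
# without the CAP_ prefix (an assumption about how names are written).
CAPABILITIES = {"chown": 0, "dac_override": 1, "net_admin": 12, "sys_admin": 21}


def parse_capability(token):
    """Return the capability number for a name or numeric string, else None."""
    token = token.strip().lower()
    if token in CAPABILITIES:
        return CAPABILITIES[token]
    try:
        number = int(token)
    except ValueError:
        return None
    return number if number >= 0 else None
```

Both the keep list and the drop list can then call the same function instead of duplicating the lookup.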
Signed-off-by: Christian Seiler <christian@iwakd.de>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Ubuntu containers have had trouble with automatic cgroup mounting that
was not read-write (i.e. lxc.mount.auto = cgroup{,-full}:{ro,mixed}) in
containers without CAP_SYS_ADMIN. Ubuntu's mountall program reads
/lib/init/fstab, which contains an entry for /sys/fs/cgroup. Since
there is no ro option specified for that filesystem, mountall will try
to remount it readwrite if it is already mounted. Without
CAP_SYS_ADMIN, that fails and mountall will interrupt boot and wait for
user input on whether to proceed anyway or to manually fix it,
effectively hanging container bootup.
This patch makes sure that /sys/fs/cgroup is always a readwrite tmpfs,
but that the actual cgroup hierarchy paths (/sys/fs/cgroup/$subsystem)
are readonly if :ro or :mixed is used. This still has the desired
effect within the container (no cgroup escalation possible and programs
get errors if they try to do so anyway), while keeping Ubuntu
containers happy.
Stéphane Graber [Tue, 6 May 2014 03:34:04 +0000 (22:34 -0500)]
python-lxc: minor fixes to __init__.py
Set a base class for the network object and set the encoding in the
header. Neither of those changes are required for python3 but they do
make it easier for anyone trying to make a python2 binding.
Stéphane Graber [Mon, 5 May 2014 15:51:19 +0000 (10:51 -0500)]
lxc-ls: Force running against containers without python
When using --nesting, we exec ourselves in the container context. If we
somehow need to dynamically load modules from there, things break. So
make sure we pre-load everything we may need.
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Fri, 2 May 2014 16:35:10 +0000 (11:35 -0500)]
cgfs: don't mount /sys/fs/cgroup readonly
/sys/fs/cgroup is just a size-limited tmpfs, and making it ro does
nothing to affect our ability to alter mount settings of its subdirs.
OTOH making it ro can upset mountall in the container which tries
to remount it rw, which may be refused.
Dwight Engen [Thu, 1 May 2014 14:33:48 +0000 (10:33 -0400)]
lxc-oracle: fix warnings/errors from some rpm scriptlets
- Some scriptlets expect fstab to exist so create it before doing the
yum install
- Set the rootfs selinux label to the same as the host's, or else the PREIN
script from initscripts will fail when running groupadd utmp, which prevents
creation of OL4.x containers on hosts > OL6.x.
- Move creation of devices into a separate function