When managing containers, I need to take action based on the container's
exit status. For instance, if it exited abnormally (status != 0), I
sometimes want to respawn it automatically. Or, when invoking
`lxc-stop`, I want to know whether it terminated gracefully (i.e. on `SIGTERM`)
or on `SIGKILL` after a timeout.
This patch adds a new message type, `lxc_msg_exit_code`, to preserve
ABI. It sends the raw status code as returned by `waitpid`, so a
listening application may want to apply `WEXITSTATUS` before using it. This is
what `lxc-monitor` does.
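As a rough illustration of what a listener might do with the raw status, here is a minimal sketch. The helper name describe_wait_status is hypothetical, and how the status word is read from the monitor message is not shown. It only applies the standard wait-status macros to tell a normal exit from death by signal.

```python
import os


def describe_wait_status(status):
    """Classify a raw waitpid()-style status word (hypothetical helper)."""
    if os.WIFEXITED(status):
        # Normal termination: extract the exit code with WEXITSTATUS.
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # Killed by a signal, e.g. SIGTERM (15) or SIGKILL (9).
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)
```

A supervisor could respawn the container when the result is ("exited", n) with n != 0, and could compare the signal number against SIGTERM and SIGKILL to tell a graceful stop from a forced one.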
Signed-off-by: Jean-Tiare LE BIGOT <jean-tiare.le-bigot@ovh.net>
Serge Hallyn [Fri, 29 Aug 2014 14:20:44 +0000 (14:20 +0000)]
lxc-cgm: fix issue with nested chowning
To ask cgmanager to chown files as an unpriv user, we must send the
request from the container's namespace (with our own userid also
mapped in). However when we create a new namespace then we must
open a new dbus connection, so that our credential and the credential
on the dbus socket match. Otherwise the proxy will refuse the request.
Because we were warning about this failure but not exiting, the failure
was not noticed until the unprivileged container went on to try to
administer its cgroups, i.e. creating a container inside itself.
Fix this by having the do_chown_cgroup create a new cgmanager connection.
In order to reduce the number of connections, since the list of subsystems
is global anyway, don't call do_chown_cgroup once for each controller,
just call it once and have it run over all controllers.
(This patch does not change the fact that we don't fail if the
chown failed. I think we should change that, but let's do it in a
later patch)
Tycho Andersen [Tue, 26 Aug 2014 14:09:36 +0000 (09:09 -0500)]
Add support for checkpoint and restore via CRIU
This patch adds support for checkpointing and restoring containers via CRIU.
It adds two api calls, ->checkpoint and ->restore, which are wrappers around
the CRIU CLI. CRIU has an RPC API, but reasons for preferring exec() are
discussed in [1].
To checkpoint, users specify a directory into which the container metadata
(CRIU dump files, plus some additional information about veth pairs and which
bridges they are attached to) is dumped. On restore, this
information is read out of the directory, a CRIU command line is constructed,
and CRIU is exec()d. CRIU uses the lxc-restore-net callback (which in turn
inspects the image directory with the NIC data) to properly restore the
network.
This will only work with the current git master of CRIU; anything as of
a152c843 should work. There is a known bug where containers which have been
restored cannot be checkpointed [2].
v2: fixed some problems with the s/int/bool return code from api function
v3: added a testcase, fixed up the man page synopsis
v4: fix a small typo in lxc-test-checkpoint-restore
v5: remove a reference to the old CRIU_PATH, and a bad error about the same
Daniel Miranda [Mon, 25 Aug 2014 21:16:43 +0000 (18:16 -0300)]
build: Make setup.py run from srcdir to avoid distutils errors
distutils can't handle paths to source files containing '..'. It will
try to navigate away from the build directory and fail. To fix that,
before building the python module, transform all the path variables then
cd to the srcdir, and set the build directory manually.
This is hopefully the last needed fix to use separate build and
source directories.
Signed-off-by: Daniel Miranda <danielkza2@gmail.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Daniel Miranda [Mon, 25 Aug 2014 21:16:42 +0000 (18:16 -0300)]
build: don't remove configuration template on clean
Now that default.conf is generated/linked during the configuration
phase, it should no longer be removed in the 'clean' stage, or
subsequent builds will fail. Only remove it during 'dist-clean'.
Signed-off-by: Daniel Miranda <danielkza2@gmail.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Serge Hallyn [Fri, 22 Aug 2014 21:23:56 +0000 (16:23 -0500)]
statvfs: do nothing if statvfs does not exist (android/bionic)
If statvfs does not exist, then don't recalculate mount flags
at remount.
If someone does need this, they could replace the code (only
if !HAVE_STATVFS) with code parsing /proc/self/mountinfo (which
exists in the recent git history)
Serge Hallyn [Wed, 20 Aug 2014 23:18:40 +0000 (23:18 +0000)]
lxc_mount_auto_mounts: honor existing nodev etc at remounts
Same problem as we had with mount_entry(). lxc_mount_auto_mounts()
sometimes does bind mount followed by remount to change options.
With recent kernels it must pass any preexisting NODEV/NOSUID/etc
flags.
Serge Hallyn [Wed, 20 Aug 2014 22:51:43 +0000 (22:51 +0000)]
mount_entry: use statvfs
Use statvfs instead of parsing /proc/self/mountinfo to check for the
flags we need to and into the msbind mount flags. This will be faster
and the code is cleaner.
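A minimal sketch of the statvfs approach, assuming Linux. The function names are hypothetical; the MS_* values are the Linux mount(2) constants. It reads the flags of an existing mount and translates them into the mount flags a later remount would have to keep.

```python
import os

# Linux mount(2) flag values from <sys/mount.h>.
MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC = 1, 2, 4, 8

# Each statvfs f_flag bit paired with the mount flag that must be kept.
_ST_TO_MS = (
    (os.ST_RDONLY, MS_RDONLY),
    (os.ST_NOSUID, MS_NOSUID),
    (os.ST_NODEV, MS_NODEV),
    (os.ST_NOEXEC, MS_NOEXEC),
)


def flags_to_preserve(f_flag):
    """Translate statvfs f_flag bits into mount(2) flags (hypothetical helper)."""
    ms_flags = 0
    for st_bit, ms_bit in _ST_TO_MS:
        if f_flag & st_bit:
            ms_flags |= ms_bit
    return ms_flags


def existing_mount_flags(path):
    """Return the flags already set on the filesystem holding path."""
    return flags_to_preserve(os.statvfs(path).f_flag)
```

Compared with parsing /proc/self/mountinfo, this is a single system call per path and needs no text parsing.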
Daniel Miranda [Thu, 21 Aug 2014 10:56:39 +0000 (07:56 -0300)]
build: Fix support for split build and source dirs
Building LXC in a separate target directory, by running configure from
outside the source tree, failed with multiple errors, mostly in the
Python and Lua extensions, due to assuming the source dir and build dir
are the same in a few places. To fix that:
- Pre-process setup.py with the appropriate directories at configure
time
- Introduce the build dir as an include path in the Lua Makefile
- Link the default container configuration file from the alternatives
in the configure stage, instead of setting a variable and using it
in the Makefile
Signed-off-by: Daniel Miranda <danielkza2@gmail.com>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Serge Hallyn [Thu, 21 Aug 2014 16:02:18 +0000 (16:02 +0000)]
chmod container dir to 0770 (v2)
This prevents u2 from going into /home/u1/.local/share/lxc/u1/rootfs
and running setuid-root applications to get write access to u1's
container rootfs.
v2: set umask to 002 for the mkdir. Otherwise if umask happens to be,
say, 022, then user does not have write permissions under the container
dir and creation of $containerdir/partial file will fail.
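The interaction between umask and the requested mode can be sketched as follows. This is an illustration only; create_container_dir is a hypothetical name, not the liblxc function. It forces the umask to 002 around the mkdir so that the requested 0770 is not reduced by a stricter inherited umask.

```python
import os
import stat


def create_container_dir(path):
    """Create path with mode 0770 regardless of the inherited umask."""
    old_umask = os.umask(0o002)  # 0770 & ~0002 == 0770
    try:
        os.mkdir(path, 0o770)
    finally:
        os.umask(old_umask)  # always restore the caller's umask
    return stat.S_IMODE(os.stat(path).st_mode)
```

With an inherited umask of 022 and no override, the same mkdir would yield 0750, which removes group write access and is the failure the v2 note describes.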
Serge Hallyn [Fri, 22 Aug 2014 04:45:18 +0000 (04:45 +0000)]
load_config_locked: update unexp network
When we read a lxc.network.hwaddr line, if it contained any 'x's then
those get quietly filled in at config_network_hwaddr. If that happens
then we want to save the autogenerated hwaddr in the unexpanded config
so that when we write it to disk, it is saved.
This patch dumbly re-generates the network configuration in the
unexp configuration every time we load a config file, just as we do
after every clone.
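To make the idea concrete, here is a small sketch under simplifying assumptions: both function names are hypothetical, and instead of regenerating the whole network section as the patch does, it rewrites only the hwaddr lines of a textual config. It fills each 'x' in a template with a random hex digit and stores the result, so the generated address is what gets written to disk.

```python
import random

HEX_DIGITS = "0123456789abcdef"


def fill_hwaddr_template(template, rng=None):
    """Replace every 'x' or 'X' with a random hex digit."""
    rng = rng or random.SystemRandom()
    return "".join(
        rng.choice(HEX_DIGITS) if ch in "xX" else ch for ch in template
    )


def record_hwaddr(unexpanded, rng=None):
    """Store the generated address in the unexpanded config text."""
    out = []
    for line in unexpanded.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "lxc.network.hwaddr":
            filled = fill_hwaddr_template(value.strip(), rng)
            line = "lxc.network.hwaddr = " + filled
        out.append(line)
    return "\n".join(out)
```

An address without any 'x' passes through unchanged, so running this on every config load is harmless.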
S.Çağlar Onur [Fri, 22 Aug 2014 16:10:12 +0000 (12:10 -0400)]
show additional info if btrfs subvolume deletion fails (issue #315)
Unprivileged users require "-o user_subvol_rm_allowed" mount option for btrfs.
Raise the INFO level message to ERROR to make this clear. The output now reads:
[caglar@qop:~] lxc-destroy -n rubik
lxc_container: Is the rootfs mounted with -o user_subvol_rm_allowed?
lxc_container: Error destroying rootfs for rubik
Destroying rubik failed
Signed-off-by: S.Çağlar Onur <caglar@10ur.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Fri, 22 Aug 2014 03:50:36 +0000 (22:50 -0500)]
lxc_map_ids: don't do bogus check for newgidmap
If we didn't find newuidmap, then simply require the caller to be
root and write to /proc/self/uid_map manually. Checking for
newgidmap to exist is bogus.
TAMUKI Shoichi [Tue, 19 Aug 2014 00:29:49 +0000 (09:29 +0900)]
Update plamo template
- If "installpkg" command does not exist, lxc-plamo temporarily
installs the command with a statically linked tar command into the lxc
cache directory. The tar command does not refer to passwd/group
files, which means that only a few files/directories are extracted
with wrong user/group ownership. To avoid this, the installpkg
command now uses the standard tar command in the system.
- Change mode to 666 for $rootfs/dev/null to allow write access for
all users.
- Small fix in usage message.
Serge Hallyn [Mon, 18 Aug 2014 03:28:21 +0000 (03:28 +0000)]
do_mount_entry: add noexec, nosuid, nodev, rdonly flags if needed at remount
See http://lkml.org/lkml/2014/8/13/746 and its history. The kernel now refuses
the mount if we don't pass the ro,nosuid,nodev,noexec flags that were already there.
Also use the newly found info to skip remount if unneeded. For background, if
you want to create a read-only bind mount, then you must first mount(2) with
MS_BIND to create the bind mount, then re-mount(2) again to get the new mount
options to apply. So if this wasn't a bind mount, or no new mount options were
introduced, then we don't do the second mount(2).
null_endofword() and get_field() were not changed, only moved up in
the file.
(Note, while I can start containers inside a privileged container with
this patch, most of the lxc tests still fail with the kernel in question;
Andy's patch seems to still be needed - a kernel with it is available
at https://launchpad.net/~serge-hallyn/+archive/ubuntu/userns-natty
ppa:serge-hallyn/userns-natty)
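The remount decision described in this message can be sketched roughly as below. This is an illustration under stated assumptions, not the code in do_mount_entry: plan_remount is a hypothetical name, and "new mount options" is interpreted as option bits requested by the entry that are not already set on the mount. The MS_* values are the Linux mount(2) constants.

```python
MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC = 1, 2, 4, 8
MS_REMOUNT, MS_BIND = 32, 4096
OPTION_MASK = MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC


def plan_remount(requested, existing):
    """Return flags for the second mount(2), or None if it can be skipped.

    requested: flags from the mount entry; existing: flags already set.
    """
    if not (requested & MS_BIND):
        return None  # not a bind mount: the first mount applied the options
    wanted = requested & OPTION_MASK
    if (wanted & ~existing) == 0:
        return None  # nothing new to apply
    # Keep every pre-existing restriction, or the kernel refuses the remount.
    return requested | (existing & OPTION_MASK) | MS_REMOUNT
```

For a read-only bind mount over a nodev,nosuid filesystem, the result carries MS_RDONLY plus the inherited MS_NODEV and MS_NOSUID bits.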
Serge Hallyn [Sat, 9 Aug 2014 00:30:12 +0000 (00:30 +0000)]
monitor: fix sockname calculation for long lxcpaths
A long enough lxcpath (and small PATH_MAX through crappy defines) can cause
the creation of the string to be hashed to fail. So just use alloca to
get the size string we need.
More importantly, while I can't explain it, if lxcpath is too long, setting
sockname[sizeof(addr->sun_path)-2] to \0 simply doesn't seem to work. So set
sockname[sizeof(addr->sun_path)-3] to \0, which does work.
Serge Hallyn [Sat, 9 Aug 2014 00:28:18 +0000 (00:28 +0000)]
command socket: use hash if needed
The container command socket is an abstract unix socket containing
the lxcpath and container name. Those can be too long. In that case,
use the hash of the lxcpath and lxcname. Continue to use the path and
name if possible to avoid any back compat issues.
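A sketch of the fallback logic, with several assumptions labeled: the function names, the "lxc/<hash>/command" layout and the choice of 64-bit FNV-1a are illustrative and are not confirmed by this message. The 108-byte limit is the size of sun_path on Linux.

```python
SUN_PATH_SIZE = 108  # sizeof(((struct sockaddr_un *)0)->sun_path) on Linux


def fnv1a_64(data):
    """64-bit FNV-1a hash of a bytes object (assumed hash, for illustration)."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def command_socket_name(lxcpath, name):
    """Use the readable name when it fits, otherwise a fixed-length hash."""
    plain = f"{lxcpath}/{name}/command"
    # Reserve one byte for the leading NUL of an abstract socket and one
    # for the terminator.
    if len(plain.encode()) <= SUN_PATH_SIZE - 2:
        return plain
    digest = fnv1a_64(f"{lxcpath}/{name}".encode())
    return f"lxc/{digest:016x}/command"
```

Trying the plain form first means existing short paths keep their current socket names, which is the backward-compatibility point made above.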
Stéphane Graber [Sat, 16 Aug 2014 21:16:36 +0000 (17:16 -0400)]
Revert "chmod container dir to 0770"
This commit broke the testsuite for unprivileged containers as the
container directory is now 0750 with the owner being the container root
and the group being the user's group, meaning that the parent user can
only enter the directory, not create entries in there.
When "lxc.autodev = 1", LXC automatically creates a "/dev/.lxc/<name>.<hash>"
folder to put the container's devices in so that they are visible from both
the host and the container itself.
On container exit (be it normal or not), this folder was not cleaned up,
which made the "/dev" folder grow continuously.
We fix this by adding a new `int lxc_delete_autodev(struct lxc_handler
*handler)` called from `static void lxc_fini(const char *name, struct
lxc_handler *handler)`.
Signed-off-by: Jean-Tiare LE BIGOT <jean-tiare.le-bigot@ovh.net>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Serge Hallyn [Thu, 14 Aug 2014 18:29:55 +0000 (18:29 +0000)]
chmod container dir to 0770
This prevents u2 from going into /home/u1/.local/share/lxc/u1/rootfs
and running setuid-root applications to get write access to u1's
container rootfs.
S.Çağlar Onur [Sat, 9 Aug 2014 03:13:27 +0000 (23:13 -0400)]
introduce --with-distro=raspbian
Raspberry Pi kernel finally supports all the bits required by LXC [1].
This patch makes "./configure --with-distro=raspbian" install the lxcbr0
based config file and upstart jobs.
Also src/lxc/lxc.net now checks for the existence of the lxc-dnsmasq user
(and falls back to dnsmasq).
RPI users still need to pass
"MIRROR=http://archive.raspbian.org/raspbian/" parameter to lxc-create
to pick the correct packages
When `lxc.autodev = 0` and an empty tmpfs is mounted on /dev
and private pts are requested, we need to ensure '/dev/pts'
exists before attempting to mount devpts on it.
Signed-off-by: Jean-Tiare LE BIGOT <jean-tiare.le-bigot@ovh.net>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
With the current old CentOS template, dnsmasq was not able to resolve
the hostname of an lxc container after it had been created. This minor
change rectifies that.
Serge Hallyn [Thu, 7 Aug 2014 03:23:48 +0000 (03:23 +0000)]
ubuntu templates: don't check for $rootfs/run/shm
/dev/shm must be turned from a directory into a symlink to /run/shm.
The templates do this only if they find -d $rootfs/run/shm. Since /run
will be a tmpfs, checking for it in the rootfs is silly. It also is
currently broken as ubuntu cloud images have an empty /run.
(this should fix https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1353734)
Serge Hallyn [Wed, 6 Aug 2014 22:39:45 +0000 (22:39 +0000)]
add lxc.console.logpath
v2: add get_config_item
clear_config_item is not supported, as it isn't for lxc.console, because
you can do 'lxc.console.logfile =' to clear it. Likewise save_config
is not needed because the config is now just written through the
unexpanded char*.
Serge Hallyn [Fri, 1 Aug 2014 23:34:16 +0000 (23:34 +0000)]
unexpanded config file: turn into a string
Originally, we only kept a struct lxc_conf representing the current
container configuration. This was insufficient because lxc.include's
were expanded, so a clone or a snapshot would contain the expanded
include file contents, rather than the original "lxc.include". If
the host's include files are updated, clones and snapshots would not
inherit those updates.
To address this, we originally added a lxc_unexp_conf, which mirrored
the lxc_conf, except that lxc.include was not expanded.
This has its own shortcomings, however. In particular, if a lxc.include
has a lxc.cgroup setting, and you use the api to say:
c.clear_config_item("lxc.cgroup")
this is not representable in the lxc_unexp_conf. (The original problem,
which was pointed out to me by stgraber, was slightly different, but
unlike this problem it was not unsolvable).
This patch changes the unexpanded configuration to be a textual
representation of the configuration. This allows us to *order* the
configuration commands, which is what was not possible using the
struct lxc_conf *lxc_unexp_conf.
The write_config() now becomes a simple fwrite. However, lxc_clone
is slightly complicated in parts, the worst of which is the need to
rewrite the network configuration if we are changing the macaddrs.
With this patch, lxc-clone and clear_config_item do the right thing.
lxc-test-saveconfig and lxc-test-clonetest both pass.
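Why ordered text helps can be seen in a small sketch. It is illustrative only: clear_config_item here is a hypothetical stand-alone function, and recording the clear as a trailing empty assignment is an assumption, not necessarily what liblxc does. Because lines are ordered, "include this file, then clear lxc.cgroup" can be written down, which a struct mirroring the parsed state cannot express.

```python
def clear_config_item(unexpanded, key):
    """Drop direct settings of key and record the clear after any include."""
    kept = []
    for line in unexpanded.splitlines():
        name = line.partition("=")[0].strip()
        if name == key or name.startswith(key + "."):
            continue  # remove lxc.cgroup and lxc.cgroup.* lines
        kept.append(line)
    # The empty assignment comes after lxc.include, so values pulled in by
    # the include are cleared too when the text is parsed again.
    kept.append(f"{key} =")
    return "\n".join(kept) + "\n"
```

The lxc.include line itself is left untouched, so later updates to the host's include file are still inherited.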
Serge Hallyn [Fri, 1 Aug 2014 22:55:21 +0000 (22:55 +0000)]
btrfs: support recursive subvolume deletion (v2)
Pull the #defines and struct definitions for btrfs into a separate
.h file to not clutter bdev.c
Implement btrfs recursive delete support
A non-root user isn't allowed to do the ioctls needed for searching (as you can
verify with 'btrfs subvolume list'). So for an unprivileged user, if the
rootfs has subvolumes under it, deletion will fail. Otherwise, it will
succeed.
Changelog: Aug 1:
. Fix wrong objid passing when determining directory paths
. In do_remove_btrfs_children, avoid dereferencing NULL dirid
. Fix memleak in error case.
Martin Pitt [Thu, 31 Jul 2014 06:53:53 +0000 (08:53 +0200)]
Add systemd unit for lxc.net
This is the equivalent of the upstart lxc-net.conf to set up the LXC bridge.
This also drops "lxc.service" from tarballs. It is a built source which depends
on configure options, so the statically shipped file will not work on most
systems.
use non-thread-safe getpwuid and getpwgid for android
We only call it (so far) after doing a fork(), so this is fine. If we
ever need such a thing from threaded context, we'll simply need to write
our own version for android.
print a helpful message if creating unpriv container with no idmap
This gives me:
ubuntu@c-t1:~$ lxc-create -t download -n u1
lxc_container: No mapping for container root
lxc_container: Error chowning /home/ubuntu/.local/share/lxc/u1/rootfs to container root
lxc_container: You must either run as root, or define uid mappings
lxc_container: To pass uid mappings to lxc-create, you could create
lxc_container: ~/.config/lxc/default.conf:
lxc_container: lxc.include = /etc/lxc/default.conf
lxc_container: lxc.id_map = u 0 100000 65536
lxc_container: lxc.id_map = g 0 100000 65536
lxc_container: Error creating backing store type (none) for u1
lxc_container: Error creating container u1
when I create a container without having an id mapping defined.
provide an example SELinux policy for older releases
The virtd_lxc_t type provided by the default RHEL/CentOS/Oracle 6.5
policy is an unconfined_domain(), so it doesn't really enforce anything.
This change will provide a link in the documentation to an example
policy that does confine containers.
On more recent distributions with new enough policy, it is recommended
not to use this sample policy, but to use the types already available
on the system from /etc/selinux/targeted/contexts/lxc_contexts, ie:
process = "system_u:system_r:svirt_lxc_net_t:s0"
file = "system_u:object_r:svirt_sandbox_file_t:s0"
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Matt Palmer [Tue, 1 Jul 2014 07:01:39 +0000 (17:01 +1000)]
Support providing env vars to container init
It's quite useful to be able to configure containers by specifying
environment variables, which init (or initscripts) can use to adjust the
container's operation.
This patch adds one new configuration parameter, `lxc.environment`, which
can be specified zero or more times to define env vars to set in the
container.
Default operation is unchanged; if the user doesn't specify any
lxc.environment parameters, the container environment will be what it is
today ('container=lxc').
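The semantics described above can be sketched with a small parser. This is a hedged illustration: container_environment is a hypothetical function, and the variable names in the usage note are made up. It collects every `lxc.environment = NAME=VALUE` line on top of the default 'container=lxc'.

```python
def container_environment(config_text):
    """Build the init environment from lxc.environment lines."""
    env = {"container": "lxc"}  # default when nothing is configured
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep or key.strip() != "lxc.environment":
            continue
        # The value is itself NAME=VALUE; split on the first '=' only.
        name, eq, val = value.strip().partition("=")
        if eq and name:
            env[name] = val
    return env
```

For example, a config with the two hypothetical lines `lxc.environment = APP_ENV=production` and `lxc.environment = GREETING=hello world` gives init three variables: container, APP_ENV and GREETING.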
Signed-off-by: Matt Palmer <mpalmer@hezmatt.org>
Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
We detect whether ovs-vsctl is available. If so, then we support
adding network interfaces to openvswitch bridges with it.
Note that with this patch, veths do not appear to be removed from the
openvswitch bridge. This seems a bug in openvswitch, as the veths
in fact do disappear from the system. If lxc is required to remove
the port from the bridge manually, that becomes more complicated
for unprivileged containers, as it would require a setuid-root
wrapper to be called at shutdown.
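The detection step could look roughly like this. It is a sketch, not the liblxc code: ovs_add_port_command is a hypothetical helper, and the bridge and interface names in the usage note are examples. `ovs-vsctl add-port BRIDGE PORT` is the standard Open vSwitch command for attaching a port.

```python
import shutil


def ovs_add_port_command(bridge, nic, which=shutil.which):
    """Return the argv that attaches nic to bridge, or None without ovs-vsctl."""
    tool = which("ovs-vsctl")
    if tool is None:
        return None  # openvswitch support is simply unavailable
    return [tool, "add-port", bridge, nic]
```

A caller would pass the result to subprocess.run when it is not None, for instance with bridge "ovsbr0" and a freshly created veth name.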
lxc-test-{unpriv,usernic.in}: make sure to chgrp as well
These tests are failing on new kernels because the container root is
not privileged over the directories, since privilege now requires
the group being mapped into the container.
veth.pair is ignored for unprivileged containers as allowing an
unprivileged user to set a specific device name would allow them to
trigger actions in tools like NetworkManager or other uevent based
handlers that may react based on specific names or prefixes being used.
centos template: prevent mingetty from calling vhangup(2)
When using unprivileged containers, tty fails because of vhangup. Adding
--nohangup to mingetty fixes the issue. This is the same problem that
occurred for the oracle template, commit 2e83f7201c5d402478b9849f0a85c62d5b9f1589
confile: sanity-check netdev->type before setting netdev->priv elements
The netdev->priv is shared for the netdev types. A bad config file
could mix configuration for different types, resulting in a bad
netdev->priv when starting or even destroying a container. So sanity
check the netdev->type before setting a netdev->priv element.
This should fix https://github.com/lxc/lxc/issues/254
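The kind of check described here can be illustrated with a toy model. The Netdev class and set_priv function are hypothetical stand-ins for the C structures. The point is only that a type-specific field is rejected unless the device has already been declared with the matching type.

```python
class Netdev:
    """Toy stand-in for a network device with one shared priv area."""

    def __init__(self):
        self.type = None
        self.priv = {}


def set_priv(netdev, expected_type, field, value):
    """Set a type-specific field only if the device has the expected type."""
    if netdev.type != expected_type:
        raise ValueError(
            f"{field} requires network type {expected_type!r}, "
            f"got {netdev.type!r}"
        )
    netdev.priv[field] = value
```

A config that declares a veth device and then sets a macvlan-only option is rejected at parse time instead of corrupting the shared area.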