KATOH Yasufumi [Thu, 30 Oct 2014 11:31:20 +0000 (20:31 +0900)]
overlayfs: overlayfs.v22 or higher needs workdir option
This patch creates workdir as "olwork", and retry mount with workdir
option when mount is failed.
It is used to prepare files before atomically swithing with
destination, and needs to be on the same filesystem as upperdir. It's
OK for it to be empty.
Serge Hallyn [Thu, 16 Oct 2014 15:10:21 +0000 (15:10 +0000)]
overlay and aufs clone_paths: be more robust
Currently when we clone a container, bdev_copy passes NULL as dst argument
of bdev_init, then sees bdev->dest (as a result) is NULL, and sets
bdev->dest to $lxcpath/$name/rootfs. so $ops->clone_paths() can
assume that "/rootfs" is at the end of the path. The overlayfs and
aufs clonepaths do assume that and index to endofstring-6 and append
delta0. Let's be more robust by actually finding the last / in
the path.
Then, instead of always setting oldbdev->dest to $lxcpath/$name/rootfs,
set it to oldbdev->src. Else dir_clonepaths fails when mounting src
onto dest bc dest does not exist. We could also fix that by creating
bdev->dest if needed, but that addes an empty directory to the old
container.
This fixes 'lxc-clone -o x1 -n x2' if x1 has lxc.rootfs = /var/lib/lxc/x1/x
and makes the overlayfs and aufs paths less fragile should something else
change.
Serge Hallyn [Mon, 27 Oct 2014 14:23:10 +0000 (14:23 +0000)]
lxc_global_config_value: simplify the theme
Rather than try to free all the not-being-returned items at
each if clause where we assign one to return value, just NULL
the one we are returning so we can safely free all the
values. This should fix the newly reported coverity memory
leak
Serge Hallyn [Tue, 14 Oct 2014 11:04:35 +0000 (11:04 +0000)]
lxc-start: don't re-try to mount rootfs if we already did so
If we are root using a user namespace and are mounting a blockdev as rootfs,
then we do this before unsharing the userns, because we are not allowed to
do it in a userns. But after unsharing the userns, we unconditionally
retried mounting the rootfs, resulting in failure. stop that.
Tycho Andersen [Wed, 22 Oct 2014 22:25:02 +0000 (22:25 +0000)]
c/r: put lxc-restore-net in /usr/share
On restore, we pass criu a script to manage the network interfaces (i.e. the
full path to lxc-restore-net), which we previously installed into
/var/lib/<tuple>/lxc. However, this is also the directory that is the default
for use in mounting the rootfs locally before pivot_root()ing. So, we mounted
the rootfs and then happliy called criu, pointing it to this directory which
didn't have lxc-restore-net any more, it just had the container's rootfs.
Instead, we should put lxc-restore-net somewhere else, so that criu can still
see it after the rootfs is mounted.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Mon, 27 Oct 2014 03:01:30 +0000 (22:01 -0500)]
do_rootfs_setup: fix return bugs
Fix return value on bind mount failure.
If we've already mounted the rootfs, exit after the bind mount
rather than re-trying the rootfs mount. The only case where
this happens is when root is starting a container in a user
namespace and with a block device backing store.
In that case, pre-mount hooks will be executed in the initial
user namespace. That may be worth fixing. Or it may be what
we want. We should think about it and fix it.
Dark Templar [Wed, 22 Oct 2014 14:35:08 +0000 (09:35 -0500)]
Fix another gentoo template typo
I've found one more typo in the gentoo template, configuration in the
generated file /etc/conf.d/hostname was not valid, but it didn't impact
me due to "lxc.utsname" being set in the configuration file of container
and hostname service being not used. Anyway, I've made a patch and
sending it with this mail.
Signed-off-by: Dark Templar <dark_templar@hotbox.ru> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
When running unprivileged, lxc-create will touch a fstab file, with bind-mounts
for the ttys and other devices. Add this entry in the container config.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
busybox template: support for unprivileged containers
Apply the changes found in templates/lxc-download to the busybox template as
well. Change ownership of the config and fstab files to the unprivileged user,
and the ownership of the rootfs to root in the new user namespace.
Eliminate the "unsupported for userns" flag.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
KATOH Yasufumi [Thu, 2 Oct 2014 09:01:06 +0000 (18:01 +0900)]
lxc_global_config_value can return the default lxc.cgroup.pattern whether root or non-root
>>> On Tue, 30 Sep 2014 19:48:09 +0000
in message "Re: [lxc-devel] [PATCH] lxc-config can show lxc.cgroup.(use|pattern)"
Serge Hallyn-san wrote:
> I think it would be worth also augmenting
> lxc_global_config_value() to return a default lxc.cgroup.use
> for 'all', and a default lxc.cgroup.pattern ("/lxc/%n" for root
> or "%n" for non-root).
Tycho Andersen [Thu, 16 Oct 2014 13:13:59 +0000 (13:13 +0000)]
c/r: refactor the way we pass data to criu/scripts
We previously wrote a bunch of files (eth*, veth*, and bridge*) as hard coded
files which we used as the names of interfaces to restore via criu's
--veth-pair. This meant that if people, e.g. gave a different bridge on their
new host, we would use our saved bridge in bridge* and try to restore to the
wrong bridge. Instead, we can just generate a new veth id (if the user hasn't
provided one), and use whatever the user configured values for the interface
name and bridge are.
This allows people to switch the bridge that they restore onto simply by
migrating the rootfs and config, and then changing the bridge name in the
container's configuration before running lxc-checkpoint.
Tycho Andersen [Fri, 10 Oct 2014 14:55:37 +0000 (14:55 +0000)]
c/r: factor out network dump/restore code
Break the monolithic ->checkpoint and ->restore functions into smaller ones.
This is in preparation for the checkpoint/restore tty work, which has a similar
need to dump information outside of criu.
Serge Hallyn [Wed, 15 Oct 2014 14:20:45 +0000 (16:20 +0200)]
netdev_move_by_index: support wlan
The python lxc-device supported adding wlan devices, so add that
support as well. Since the python one did not support 'del',
I didn't try adding that support, though it should be trivial to
add.
We should be able to do the wlan adding using netlink, but I
went ahead and used 'iw' as the netlink path looked more
complicated than it does for other nics. Patches to switch that
over would be very welcome.
Dongsheng Yang [Tue, 16 Sep 2014 09:12:00 +0000 (17:12 +0800)]
lxc-device: rewrite lxc-device.
As there is a function named attach_interface to pass
a interface to container now, we do not need to relay on
python impolementation for lxc-device any more.
changelog: 10/15/2014: serge: fail immediately if run as non-root.
changelog: 10/15/2014: serge: add explicit error message on bad usage (fix build failure)
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Dongsheng Yang [Tue, 16 Sep 2014 04:58:55 +0000 (12:58 +0800)]
network: allow lxc_network_move_by_index() rename netdev in moving.
In netlink, we can set the dest_name of netdev when move netdev
between namespaces in one netlink request. And moving a netdev of
a src_name to a netdev with a dest_name is a common usecase.
So this patch add a parametaer to lxc_network_move_by_index() to
indicate the dest_name for the movement. NULL means same with
the src_name.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Thu, 9 Oct 2014 15:54:51 +0000 (10:54 -0500)]
fix lxc.mount.auto clearing
the way config_mount was structured, sending 'lxc.mount.auto = '
ended up actually clearing all lxc.mount.entrys. Fix that by
moving the check for an empty value to after the subkey checks.
Then, actually do the clearing of auto_mounts in config_mount_auto.
The 'strlen(subkey)' check being removed was bogus - the subkey
either known to be 'lxc.mount.entry', else subkey would have been
NULL (and forced a return in the block above).
This would have been clearer if the config_mount() and helper
fns were structured like the rest of confile.c. It's tempting
to switch it over, but there are subtleties in there so it's
not something to do without a lot of thought and testing.
Dwight Engen [Thu, 2 Oct 2014 20:56:02 +0000 (16:56 -0400)]
systemd/selinux init scripts fixups
- RHEL/OL 7 doesn't have the ifconfig command by default so have the
lxc-net script check for its existence before use, and fall back
to using the ip command if ifconfig is not available
- When lxc-net is run from systemd on a system with selinux enabled,
the mkdir -p ${varrun} will create /run/lxc as init_var_run_t which
dnsmasq can't write its pid into, so we restorecon it
after creation (to var_run_t)
- The lxc-net systemd .service file needs an [Install] section so that
"systemctl enable lxc-net" will work
Tycho Andersen [Tue, 7 Oct 2014 19:33:08 +0000 (19:33 +0000)]
restore: create cgroups for criu
Previously, we let criu create the cgroups for a container as it was restoring
things. In some cases (i.e. migration across hosts), if the container being
migrated was in /lxc/u1-3, it would be migrated to the target host in
/lxc/u1-3, even if there was no /lxc/u1-2 (or worse, if there was already an
alive container in u1-3).
Instead, we use lxc's cgroup_create, and then tell criu where to restore to.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Tycho Andersen [Wed, 8 Oct 2014 17:11:50 +0000 (12:11 -0500)]
restore: Hoist handler to function level
On Tue, Oct 07, 2014 at 07:33:07PM +0000, Tycho Andersen wrote:
> This commit is in preparation for the cgroups create work, since we will need
> the handler in both the parent and the child. This commit also re-works how
> errors are propagated to be less verbose.
Here is an updated version:
From 941623498a49551411ccf185146061f3f37d3a67 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho.andersen@canonical.com>
Date: Tue, 7 Oct 2014 19:13:51 +0000
Subject: [PATCH 1/2] restore: Hoist handler to function level
This commit is in preparation for the cgroups create work, since we will need
the handler in both the parent and the child. This commit also re-works how
errors are propagated to be less verbose.
v2: rename error to has_error, handle it correctly, and remove some diff noise
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Tycho Andersen [Wed, 8 Oct 2014 17:12:04 +0000 (17:12 +0000)]
criu: DECLARE_ARG should check for null arguments
This is in preparation for the cgroups creation work, but also probably just a
good idea in general. The ERROR message is handy since we print line nos. it
will to give people an indication of what arg was null.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Andrey Vagin [Sat, 4 Oct 2014 21:49:16 +0000 (01:49 +0400)]
lxc: don't call pivot_root if / is on a ramfs
pivot_root can't be called if / is on a ramfs. Currently chroot is
called before pivot_root. In this case the standard well-known
'chroot escape' technique allows to escape a container.
I think the best way to handle this situation is to make following actions:
* clean all mounts, which should not be visible in CT
* move CT's rootfs into /
* make chroot into /
I don't have a host, where / is on a ramfs, so I can't test this patch.
Serge Hallyn [Wed, 8 Oct 2014 05:14:26 +0000 (00:14 -0500)]
cgmanager: several fixes
These all fix various ways that cgroup actions could fail if an
unprivileged user's cgroup paths were not all the same for all
controllers.
1. in cgm_{g,s}et, use the right controller, not the first in the list,
to get the cgroup path.
2. when we pass 'all' to cgmanager for a ${METHOD}_abs, make sure that all
cgroup paths are the same. That isn't necessary for methods not
taking an absolute path, so split up the former
cgm_supports_multiple_controllers() function into two booleans, one
telling whether cgm supports it, and another telling us whether
cgm supports it AND all controller cgroup paths are the same.
3. separately, do_cgm_enter with abs=true couldn't work if all
cgroup paths were not the same. So just ditch that helper and
call lxc_cgmanager_enter() where needed, because the special
cases would be more complicated.
apparmor: restrict signal and ptrace for processes
Restrict signal and ptrace for processes running under the container
profile. Rules based on AppArmor base abstraction. Add unix rules for
processes running under the container profile.
- move action() from common to sysvinit wrapper since its only really
applicable for sysvinit and not the other init systems
- fix bug in action() fallback, need to shift away msg before executing action
- make lxc-net 98 so it starts before lxc-container (99), otherwise the lxcbr0
won't be available when containers are autostarted
- make the default RUNTIME_PATH be /var/run instead of /run. On older
distros (like ol6.5) /run doesn't exist. lxc-net will create this directory
and attempt to create the dnsmasq.pid file in it, but this will fail when
SELinux is enabled because the directory will have the default_t type.
Newer systems have /var/run symlinked to /run so you get to the same place
in that case.
- add %postun to remove lxc-dnsmasq user when pkgs are removed
- fix bug in lxc-oracle template that was creating /var/lock/subsys/lxc as
a dir and interfering with the init scripts
This commit is based on the work of: Signed-off-by: Michael H. Warfield <mhw@WittsEnd.com>
A generic changelog would be:
- Bring support for lxcbr0 to all distributions
- Share the container startup and network configuration logic across
distributions and init systems.
- Have all the init scripts call the helper script.
- Support for the various different distro-specific configuration
locations to configure lxc-net and container startup.
Changes on top of Mike's original version:
- Remove sysconfig/lxc-net as it's apparently only there as a
workaround for an RPM limitation and is breaking Debian systems by
including a useless file which will get registered as a package provided
conffile in the dpkg database and will therefore cause conffile prompts
on upgrades...
- Go with a consistant coding style in the various init scripts.
- Split out the common logic from the sysvinit scripts and ship both in
their respective location rather than have them be copies.
- Fix the upstart jobs so they actually work (there's no such thing as
libexec on Debian systems).
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
With cgmanager, the cgroups are polled on demand, so these steps aren't needed.
However, with cgfs, lxc doesn't know about the cgroups for a container and so
it can't report any of the statistics about e.g. how much memory or CPU a
container is using.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
The ->checkpoint() API call didn't exit correctly if criu was killed by a
signal instead of exiting, so lxc-checkpoint didn't fail correctly as a result.
To cover all the cases we have around, we need to:
- Attempt to use cgm if present (preferred)
- Attempt to use cgmanager directly over dbus otherwise
- Fallback to cgroupfs
apparmor: make sure sysfs and securityfs are mounted when checking for mount feature
Otherwise the check will return false if securityfs was not mounted
by the container's configuration. In the past we let that quietly
proceed, but unconfined. Now that we restrict such container
starts, this caused lxc-test-apparmor to fail.
apparmor: improve behavior when kernel lacks mount restrictions (v2)
(Dwight, I took the liberty of adding your Ack but the code did
change a bit to continue passing the char *label from attach.
Tested that "lxc-start -n u1 -s lxc.aa_profile=p2; lxc-attach -n u1"
does attach you to the p2 profile)
Apparmor policies require mount restrictions to fullfill many of
their promises - for instance if proc can be mounted anywhere,
then 'deny /proc/sysrq-trigger w' prevents only accidents, not
malice.
The mount restrictions are not available in the upstream kernel.
We can detect their presence through /sys. In the past, when
we detected it missing, we would not enable apparmor. But that
prevents apparmor from helping to prevent accidents.
At the same time, if the user accidentaly boots a kernel which
has regressed, we do not want them starting the container thinking
they are more protected than they are.
This patch:
1. adds a lxc.aa_allow_incomplete = 1 container config flag. If
not set, then any container which is not set to run unconfined
will refuse to run. If set, then the container will run with
apparmor protection.
2. to pass this flag to the apparmor driver, we pass the container
configuration (lxc_conf) to the lsm_label_set hook.
3. add a testcase. To test the case were a kernel does not
provide mount restrictions, we mount an empty directory over
the /sys/kernel/security/apparmor/features/mount directory. In
order to have that not be unmounted in a new namespace, we must
test using unprivileged containers (who cannot remove bind mounts
which hide existing mount contents).
This idea came from Andy Lutomirski. Instead of using a
temporary directory for the pivot_root put-old, use "." both
for new-root and old-root. Then fchdir into the old root
temporarily in order to unmount the old-root, and finally
chdir back into our '/'.
Drop lxc.pivotdir from the lxc.container.conf manpage.
Warn when we see a lxc.pivotdir entry (but keep it in the
lxc.conf for now).
support use of 'all' containers when cgmanager supports it
Introduce a new list of controllers just containing "all".
Make the lists of controllers null-terminated.
If the cgmanager api version is high enough, use the 'all' controller
rather than walking all controllers, which should greatly reduce the
amount of dbus overhead. This will be especially important for
those going through a cgproxy.
Also remove the call to cleanup cgroups when a cgroup existed. That
usually fails (and failure is ignored) since the to-be-cleaned-up
cgroup is busy, but we shouldn't even be trying. Note this can
create for extra un-cleanedup cgroups, however it's better than us
accidentally removing a cgroup that someone else had created and was
about to use.
CRIU 1.3 has a pretty crippling deadlock which will cause dumping containers to
fail fairly often. This is fixed in criu 1.3.1, so we shouldn't run the tests
on anything less than that.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
After looking through some logs, it is a little cleaner to do it as
below, instead of what I originally posted.
Tycho
In order for LXC to be the parent of the restored process, CRIU needs to
restore init as its sibling, not as its child. This was previously accomplished
essentially via luck :). CRIU now has a --restore-sibling option which forces
this behavior that LXC expects. See more discussion in this thread:
http://lists.openvz.org/pipermail/criu/2014-September/thread.html#16330
v2: don't pass --restore-sibling to dump. This is mostly cosmetic, but will
look less confusing in the logs if people ever look at them.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
lxc-gentoo: keep original uid/gid of files/dirs when installing
Call tar with --numeric-owner option to use numbers for user/group
names because the whole uid/gid in rootfs should be consistently
unchanged as in original stage3 tarball and private portage.
criu version 1.3 has been tagged, which has the minimal set of patches to allow
checkpointing and restoring containers. lxc-test-checkpoint-restore is now
skipped on any version of criu lower than 1.3.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
This option is required when migrating containers across hosts; it is used to
restore inotify via file paths instead of file handles, which aren't preserved
across hosts.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
lxc-plamo: keep original uid/gid of files/dirs when installing
Regardless of whether "installpkg" command exists or not, install the
command temporarily with static linked tar command into the lxc cache
directory to keep the original uid/gid of files/directories. Also,
use sed command instead of ed command for simplicity.
config: fix the handling of lxc.hook and hwaddrs in unexpanded config
And add a testcase.
The code to update hwaddrs in a clone was walking through the container
configuration and re-printing all network entries. However network
entries from an include file which should not be printed out were being
added to the unexpanded config. With this patch, at clone we simply
update the hwaddr in-place in the unexpanded configuration file, making
sure to make the same update to the expanded network configuration.
The code to update out lxc.hook statements had the same problem.
We also update it in-place in the unexpanded configuration, though
we mirror the logic we use when updating the expanded configuration.
(Perhaps that should be changed, to simplify future updates)
This code isn't particularly easy to review, so testcases are added
to make sure that (1) extra lxc.network entries are not added (or
removed), even if they are present in an included file, (2) lxc.hook
entries are not added, (3) hwaddr entries are updated, and (4)
the lxc.hook entries are properly updated (only when they should be).
When managing containers, I need to take action based on container
exit status. For instance, if it exited abnormally (status!=0), I
sometime want to respawn it automatically. Or, when invoking
`lxc-stop` I want to know if it terminated gracefully (ie on `SIGTERM`)
or on `SIGKILL` after a timeout.
This patch adds a new message type `lxc_msg_exit_code,` to preserve
ABI. It sends the raw status code as returned by `waitpid` so that
listening application may want to apply `WEXITSTATUS` before. This is
what `lxc-monitor` does.
Signed-off-by: Jean-Tiare LE BIGOT <jean-tiare.le-bigot@ovh.net>
Serge Hallyn [Fri, 29 Aug 2014 14:20:44 +0000 (14:20 +0000)]
lxc-cgm: fix issue with nested chowning
To ask cgmanager to chown files as an unpriv user, we must send the
request from the container's namespace (with our own userid also
mapped in). However when we create a new namespace then we must
open a new dbus connection, so that our credential and the credential
on the dbus socket match. Otherwise the proxy will refuse the request.
Because we were warning about this failure but not exiting, the failure
was not noticed until the unprivileged container went on to try to
administer its cgroups, i.e. creating a container inside itself.
Fix this by having the do_chown_cgroup create a new cgmanager connection.
In order to reduce the number of connections, since the list of subsystems
is global anyway, don't call do_chown_cgroup once for each controller,
just call it once and have it run over all controllers.
(This patch does not change the fact that we don't fail if the
chown failed. I think we should change that, but let's do it in a
later patch)
Tycho Andersen [Tue, 26 Aug 2014 14:09:36 +0000 (09:09 -0500)]
Add support for checkpoint and restore via CRIU
This patch adds support for checkpointing and restoring containers via CRIU.
It adds two api calls, ->checkpoint and ->restore, which are wrappers around
the CRIU CLI. CRIU has an RPC API, but reasons for preferring exec() are
discussed in [1].
To checkpoint, users specify a directory to dump the container metadata (CRIU
dump files, plus some additional information about veth pairs and which
bridges they are attached to) into this directory. On restore, this
information is read out of the directory, a CRIU command line is constructed,
and CRIU is exec()d. CRIU uses the lxc-restore-net callback (which in turn
inspects the image directory with the NIC data) to properly restore the
network.
This will only work with the current git master of CRIU; anything as of a152c843 should work. There is a known bug where containers which have been
restored cannot be checkpointed [2].
v2: fixed some problems with the s/int/bool return code form api function
v3: added a testcase, fixed up the man page synopsis
v4: fix a small typo in lxc-test-checkpoint-restore
v5: remove a reference to the old CRIU_PATH, and a bad error about the same
Daniel Miranda [Mon, 25 Aug 2014 21:16:43 +0000 (18:16 -0300)]
build: Make setup.py run from srcdir to avoid distutils errors
distutils can't handle paths to source files containing '..'. It will
try to navigate away from the build directory and fail. To fix that,
before building the python module, transform all the path variables then
cd to the srcdir, and set the build directory manually.
This is hopefully the last needed fix to use separate build and
source diretories.
Signed-off-by: Daniel Miranda <danielkza2@gmail.com> Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Daniel Miranda [Mon, 25 Aug 2014 21:16:42 +0000 (18:16 -0300)]
build: don't remove configuration template on clean
Now that default.conf is generated/linked during the configuration
phase, it should not longer be removed in the 'clean' stage, or
subsequent builds will fail. Only remove it during 'dist-clean'.
Signed-off-by: Daniel Miranda <danielkza2@gmail.com> Acked-by: Stéphane Graber <stgraber@ubuntu.com>