If the file "/sys/devices/system/cpu/isolated" doesn't exist, we can't just
simply bail. We still need to check whether we need to copy the parents cpu
settings.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Move the user namespace at the first position in the array so that we always
attach to it first when iterating over the struct and using setns() to switch
namespaces. This especially affects lxc_attach(): Suppose you cloned a new user
namespace and mount namespace as an unprivileged user on the host and want to
setns() to the mount namespace. This requires you to attach to the user
namespace first otherwise the kernel will fail this check:
Using custom structs in attach.c risks getting out of sync with the commonly
used ns_info[LXC_NS_MAX] struct and thus attaching to wrong namespaces. Switch
to using ns_info[LXC_NS_MAX].
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- Allocating an error message that the caller must free seems pointless. We can
just print the error message in preserve_ns() itself. This also allows us to
avoid using the GNU extension asprintf().
- Improve lxc_preserve_ns(): By passing in NULL or "" as the second argument
the function can now also be used to check whether namespaces are supported
by the kernel.
- Use lxc_preserve_ns() in preserve_ns().
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
- So far we blindly called lxc_delete_network() to make sure that we deleted
all network interfaces. This resulted in pointless netlink calls, especially
when a container had multiple networks defined. Let's be smarter and have
lxc_delete_network() return a boolean that indicates whether *all* configured
networks have been deleted. If so, don't needlessly try to delete them again
in start.c. This also decreases confusing error messages a user might see.
- When we receive -ENODEV from one of our lxc_netdev_delete_*() functions,
let's assume that either the network device already got deleted or that it
got moved to a different network namespace. Inform the user about this but do
not report an error in this case.
- When we have explicitly deleted the host side of a veth pair let's
immediately free(priv.veth_attr.pair) and NULL it, or
memset(priv.veth_attr.pair, ...) the corresponding member so we don't
needlessly try to destroy them again when we have to call
lxc_delete_network() again in start.c
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
When we set LXC_DEBUG_CGFSNG=1 we print out info about detected cgroup
hierarchies. When there's no named cgroup mounted we need to make sure that we
don't try to index an unallocated pointer.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
Adrian Reber [Mon, 14 Nov 2016 14:44:04 +0000 (14:44 +0000)]
lxc-checkpoint: enable dirty memory tracking in criu
CRIU supports dirty memory tracking to take incremental checkpoints.
Incremental checkpoints are one way of reducing downtime during
migration. The first checkpoint dumps all the memory pages and the
second (and third, and fourth, ...) only dumps pages which have changed.
Most of the necessary code has already been implemented. This just adds
the existing functionality to lxc-checkpoint:
-p, --pre-dump Only pre-dump the memory of the container.
Container keeps on running and following
checkpoints will only dump the changes.
--predump-dir=DIR path to images from previous dump (relative to -D)
The following is an example from a container running CentOS 7 with psql
and tomcat:
# lxc-checkpoint -n c7 -D /tmp/cp -p
Container keeps on running
# du -h /tmp/cp
229M /tmp/cp
Sync initial checkpoint to destination
# rsync -a /tmp/cp host2:/tmp/
Sync file-system
# rsync -a /var/lib/lxc/c7 host2:/var/lib/lxc/
Final dump; container is stopped
# lxc-checkpoint -n c7 -D /tmp/cp --predump-dir=../cp -s
# du -h /tmp/cp2
90M /tmp/cp2
After transferring the second (incremental checkpoint) and the changes
to the container's file system the container can be restored on the
second host by pointing lxc-checkpoint to the second checkpoint
directory:
Stéphane Graber [Mon, 14 Nov 2016 16:53:07 +0000 (11:53 -0500)]
debian: Don't depend on libui-dialog-perl
This package doesn't exist in stretch anymore, and it's unclear why we
were depending on a library to begin with (as opposed to having it
brought by whatever needs it).
we cannot simply copy the cpuset.cpus file from our parent cgroup. For example,
in the root cgroup cpuset.cpus will contain all of the cpus including the
isolated cpus. Copying the values of the root cgroup into a child cgroup will
lead to a wrong view in /proc/self/status: For the root cgroup
/sys/fs/cgroup/cpuset /proc/self/status will correctly show
Cpus_allowed_list: 0-1,3
even though cpuset.cpus will show
0-3
However, initializing a subcgroup in the cpuset controller by copying the
cpuset.cpus setting from the root cgroup will cause /proc/self/status to
incorrectly show
Cpus_allowed_list: 0-3
Hence, we need to make sure to remove the isolated cpus from cpuset.cpus. Seth
has argued that this is not a kernel bug but by design. So let us be the smart
guys and fix this in liblxc.
The solution is straightforward: To avoid having to work with raw cpulist
strings we create cpumasks based on uint32_t bit arrays.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
start: CLONE_NEWCGROUP after we have setup cgroups
If we do it earlier we end up with a wrong view of /proc/self/cgroup. For
example, assume we unshare(CLONE_NEWCGROUP) first, and then create the cgroup
for the container, say /sys/fs/cgroup/cpuset/lxc/c, then /proc/self/cgroup
would show us:
8:cpuset:/lxc/c
whereas it should actually show
8:cpuset:/
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>
conf: merge network namespace move & rename on shutdown
On shutdown we move physical network interfaces back to the
host namespace and rename them afterwards as well as in the
later lxc_network_delete() step. However, if the device had
a name which already exists in the host namespace then the
moving fails and so do the subsequent rename attempts. When
the namespace ceases to exist the devices finally end up
in the host namespace named 'dev<ID>' by the kernel.
In order to avoid this, we do the moving and renaming in a
single step (lxc_netdev_move_by_*()'s move & rename happen
in a single netlink transaction).
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Tycho Andersen [Mon, 31 Oct 2016 16:07:25 +0000 (10:07 -0600)]
c/r: explicitly emit bind mounts as criu arguments
We switched to --ext-mount-map auto because of "system" (liblxc) added
mounts like the cgmanager socket that weren't in the config file. This had
the added advantage that we could drop all the mount processing code,
because we no longer needed an --ext-mount-map argument.
The problem here is that mounts can move between hosts. While
--ext-mount-map auto does its best to detect this situation, it explicitly
disallows moves that change the path name. In LXD, we bind mount
/var/lib/lxd/shmounts/$container to /dev/.lxd-mounts for each container,
and so when a container is renamed in a migration, the name changes.
--ext-mount-map auto won't detect this, and so the migration fails.
We *could* implement mount rewriting in CRIU, but my experience with cgroup
and apparmor rewriting is that this is painful and error prone. Instead, it
is much easier to go back to explicitly listing --ext-mount-map arguments
from the config file, and allow the source of the bind to change. We leave
--ext-mount-map auto to catch any stragling (or future) system added
mounts.
Somehow this implementation of a cgroupfs backend decided to use the hierarchy
numbers it detects in /proc/cgroups and /proc/self/cgroups as indices for
the hierarchy struct. Controller numbering usually starts at 1 but may start at
0 if:
a) the controller is not mounted on a cgroups v1 hierarchy;
b) the controller is bound to the cgroups v2 single unified hierarchy; or
c) the controller is disabled
To avoid having to rework our fallback backend significantly, we should
explicitly check for each controller if hierarchy[i] != NULL.
Signed-off-by: Christian Brauner <christian.brauner@canonical.com>