Serge Hallyn [Fri, 9 Jan 2015 22:00:28 +0000 (22:00 +0000)]
Fix reversed args in mount call
Riya Khanna reported that with a ramfs rootfs the mount to make
/ rprivate was returning -EFAULT. NULL was being passed as the
mount target. Pass "/" instead.
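For illustration, the corrected call shape might look like the sketch below. It assumes a Linux host; the helper name make_rprivate is hypothetical and not taken from the LXC source.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: recursively mark the mount at `target` private.
 * For a propagation change the kernel ignores the source, filesystem type
 * and data arguments, so they may be NULL. The target must be a real path;
 * passing NULL there is the bug described above.
 */
int make_rprivate(const char *target)
{
	assert(target != NULL);
	return mount(NULL, target, NULL, MS_REC | MS_PRIVATE, NULL);
}
```

Calling make_rprivate("/") requires CAP_SYS_ADMIN; on failure it returns -1 and sets errno.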
Martin Pitt [Thu, 8 Jan 2015 12:09:37 +0000 (13:09 +0100)]
apparmor: Fix slave bind mounts
The permission to make a mount "slave" is spelt "make-slave", not "slave", see
https://launchpad.net/bugs/1401619. Also, we need to make all mounts slave, not
just the root dir.
Serge Hallyn [Fri, 19 Dec 2014 18:23:52 +0000 (18:23 +0000)]
Enable seccomp by default for unprivileged users.
In contrast to what the comment above the line disabling it said,
it seems to work just fine. It also is needed on current kernels
(until Eric's patch hits upstream) to prevent unprivileged containers
from hosing fuse filesystems they inherit.
Serge Hallyn [Fri, 19 Dec 2014 18:22:55 +0000 (18:22 +0000)]
seccomp: add rule to reject umount -f
If a container has a bind mount from a host nfs or fuse
filesystem, and does 'umount -f', it will disconnect the
host's filesystem. This patch adds a seccomp rule to
block umount -f from a container. It also adds that rule
to the default seccomp profile.
Shuai Zhang [Sun, 30 Nov 2014 13:03:37 +0000 (21:03 +0800)]
audit: added capacity and reserve() to nlmsg
There are now two (permitted) ways to add data to a netlink message:
1. put_xxx()
2. call nlmsg_reserve() to get a pointer to newly reserved room within the
original netlink message, then write or memcpy data to that area.
Both guarantee that adding data of the requested length does not overflow the
pre-allocated message buffer, by checking against its cap field first.
There may also be no need to access nlmsg_len outside the nl module, because
both put_xxx() and nlmsg_reserve() already do that for us.
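The capacity check described above can be sketched as follows. This is a simplified stand-in, not the LXC nl module: the struct sketch_msg type and the sketch_msg_* functions are hypothetical, the buffer is a fixed array, and netlink alignment is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64

/* Hypothetical message: `len` bytes are used out of `cap` allowed bytes. */
struct sketch_msg {
	unsigned char buf[SKETCH_BUF_SIZE];
	size_t len;
	size_t cap;
};

void sketch_msg_init(struct sketch_msg *m, size_t cap)
{
	assert(m != NULL);
	m->len = 0;
	m->cap = cap < SKETCH_BUF_SIZE ? cap : SKETCH_BUF_SIZE;
}

/* Reserve `len` bytes and return a pointer to them, or NULL if that
 * would exceed the capacity. The used length is updated here, so the
 * caller never has to touch it. */
void *sketch_msg_reserve(struct sketch_msg *m, size_t len)
{
	void *room;

	assert(m != NULL);
	if (len > m->cap - m->len)
		return NULL;
	room = m->buf + m->len;
	m->len += len;
	return room;
}

/* put-style helper built on top of reserve, so both paths share one check. */
int sketch_msg_put(struct sketch_msg *m, const void *data, size_t len)
{
	void *room = sketch_msg_reserve(m, len);

	if (room == NULL)
		return -1;
	memcpy(room, data, len);
	return 0;
}
```

Routing the put-style helper through the reserve function keeps the overflow check in a single place.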
KATOH Yasufumi [Wed, 5 Nov 2014 07:03:34 +0000 (16:03 +0900)]
Fix clone issues
This commit fixes two issues that occur when cloning:
* an unnecessary directory is created when cloning between overlayfs/aufs
* the clone fails when the rootfs path does not end in "/rootfs"
Signed-off-by: KATOH Yasufumi <karma@jazz.email.ne.jp> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
KATOH Yasufumi [Thu, 30 Oct 2014 11:31:20 +0000 (20:31 +0900)]
overlayfs: overlayfs.v22 or higher needs workdir option
This patch creates a workdir named "olwork" and retries the mount with the
workdir option when the first mount fails.
The workdir is used to prepare files before atomically switching them with
the destination, and needs to be on the same filesystem as upperdir. It's
OK for it to be empty.
Serge Hallyn [Thu, 16 Oct 2014 15:10:21 +0000 (15:10 +0000)]
overlay and aufs clone_paths: be more robust
Currently when we clone a container, bdev_copy passes NULL as the dst argument
of bdev_init, then sees that bdev->dest (as a result) is NULL, and sets
bdev->dest to $lxcpath/$name/rootfs. So $ops->clone_paths() can
assume that "/rootfs" is at the end of the path. The overlayfs and
aufs clonepaths do assume that: they index to endofstring-6 and append
delta0. Let's be more robust by actually finding the last / in
the path.
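A minimal sketch of the "find the last /" approach, for illustration only; the function name sibling_path and its signature are hypothetical, and paths with a trailing slash are not handled here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: replace the last component of `path` with `leaf`
 * and write the result to `out`. Returns 0 on success, -1 if `path`
 * contains no '/' or the result does not fit in `outlen` bytes.
 */
int sibling_path(const char *path, const char *leaf, char *out, size_t outlen)
{
	const char *slash;
	size_t dirlen;
	int n;

	assert(path != NULL && leaf != NULL && out != NULL);
	slash = strrchr(path, '/');
	if (slash == NULL)
		return -1;
	dirlen = (size_t)(slash - path) + 1;  /* keep the '/' itself */
	n = snprintf(out, outlen, "%.*s%s", (int)dirlen, path, leaf);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

Unlike indexing to a fixed offset from the end of the string, this gives the same delta0 location whether the rootfs path ends in "/rootfs" or in something else such as "/x".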
Then, instead of always setting oldbdev->dest to $lxcpath/$name/rootfs,
set it to oldbdev->src. Else dir_clonepaths fails when mounting src
onto dest because dest does not exist. We could also fix that by creating
bdev->dest if needed, but that adds an empty directory to the old
container.
This fixes 'lxc-clone -o x1 -n x2' if x1 has lxc.rootfs = /var/lib/lxc/x1/x
and makes the overlayfs and aufs paths less fragile should something else
change.
Cameron Norman [Mon, 1 Dec 2014 21:29:26 +0000 (13:29 -0800)]
lxc-debian: adjust init system configurations
Do as much as possible to allow containers switching from non-systemd to
systemd to work as intended (but nothing that will cause side effects).
Use update-rc.d disable instead of remove so the init scripts are not
re-enabled when the package is updated.
Signed-off-by: Cameron Norman <camerontnorman@gmail.com> Acked-by: Stéphane Graber <stgraber@ubuntu.com>
Antonio Terceiro [Mon, 24 Nov 2014 01:51:06 +0000 (23:51 -0200)]
lxc-debian: support systemd as PID 1
Containers with systemd need a somewhat special setup, which I borrowed
and adapted from lxc-fedora. These changes are required so that Debian 8
(jessie) containers work properly, and are a no-op for previous Debian
versions.
Signed-off-by: Antonio Terceiro <terceiro@debian.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Gu1 [Tue, 28 Oct 2014 01:14:28 +0000 (02:14 +0100)]
lxc-debian: Fix default mirrors
Fix a typo in the lines inserted in the default sources.list.
Change the default mirror to http.debian.net which is (supposedly) more
accurate and better than cdn.debian.net for a generic configuration.
Use security.debian.org directly for the {release}/updates repository.
Abin Shahab [Wed, 12 Nov 2014 00:06:52 +0000 (00:06 +0000)]
Remounts bind mounts if read-only flag is provided
Bind mounts do not honor filesystem mount options. This change will
remount filesystems that are bind mounted if there are changes to
filesystem mount options, specifically if the mount is readonly.
Signed-off-by: Abin Shahab <ashahab@altiscale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
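As a rough sketch of the remount step this change describes, assuming a Linux host: the helper name bind_mount and its readonly parameter are hypothetical, and the real change may handle more mount options than this.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: bind-mount `src` on `dst`, then remount the bind
 * read-only when `readonly` is non-zero. The initial MS_BIND call ignores
 * MS_RDONLY, so the flag only takes effect on the second, remount call.
 */
int bind_mount(const char *src, const char *dst, int readonly)
{
	assert(src != NULL && dst != NULL);
	if (mount(src, dst, NULL, MS_BIND, NULL) < 0)
		return -1;
	if (!readonly)
		return 0;
	return mount(src, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL);
}
```

Both calls need mount privileges; each failure is reported as -1 with errno set by mount(2).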
Silvio Fricke [Fri, 14 Nov 2014 19:56:12 +0000 (20:56 +0100)]
lxc/utils: bugfix freed pointer return value
We allocate a pointer and save its address in a static variable. After
that, we free the pointer and return, so the returned value points to
freed memory.
Here is an excerpt from a valgrind report:
[...]
==11568== Invalid read of size 1
==11568== at 0x4C2D524: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11568== by 0x5961C9B: puts (in /usr/lib/libc-2.20.so)
==11568== by 0x400890: main (lxc_config.c:73)
==11568== Address 0x6933e21 is 1 bytes inside a block of size 32 free'd
==11568== at 0x4C2B200: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11568== by 0x4E654F2: lxc_global_config_value (utils.c:415)
==11568== by 0x4E92177: lxc_get_global_config_item (lxccontainer.c:2287)
==11568== by 0x400883: main (lxc_config.c:71)
[...]
Serge Hallyn [Sun, 2 Nov 2014 14:01:18 +0000 (14:01 +0000)]
cgmanager: fix 'attach' with "all" controller support
"all" is not a supported keyword for cgmanager's get_pid_cgroup.
Pass the first mounted cgroup subsystem instead of passing "all" when
getting the container's cgroup to attach to.
Also, make sure that the target is in fact in identical cgroups for all
controllers before attaching with "all". If not, then we must attach to
each cgroup separately, or else we will not be in all the same cgroups
as the target container.
Serge Hallyn [Mon, 27 Oct 2014 14:23:10 +0000 (14:23 +0000)]
lxc_global_config_value: simplify the theme
Rather than try to free all the not-being-returned items at
each if clause where we assign one to return value, just NULL
the one we are returning so we can safely free all the
values. This should fix the newly reported Coverity memory
leak.
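The cleanup pattern described above can be sketched as follows. This is an illustration only, not the actual lxc_global_config_value code; dup_str, pick_value and the three candidate strings are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: heap-allocated copy of `s`, or NULL on failure. */
char *dup_str(const char *s)
{
	size_t n = strlen(s) + 1;
	char *copy = malloc(n);

	if (copy != NULL)
		memcpy(copy, s, n);
	return copy;
}

/*
 * Hypothetical sketch: build several candidate values, hand exactly one
 * back to the caller, and free the rest in a single cleanup pass.
 */
char *pick_value(int which)
{
	char *vals[3] = { dup_str("first"), dup_str("second"), dup_str("third") };
	char *ret;

	assert(which >= 0 && which < 3);
	ret = vals[which];
	vals[which] = NULL;      /* ownership moves to the caller */
	for (int i = 0; i < 3; i++)
		free(vals[i]);   /* free(NULL) is a no-op */
	return ret;
}
```

Because the returned slot is set to NULL before the loop, the loop can free every slot unconditionally without freeing the value handed to the caller.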
Serge Hallyn [Tue, 14 Oct 2014 11:04:35 +0000 (11:04 +0000)]
lxc-start: don't re-try to mount rootfs if we already did so
If we are root using a user namespace and are mounting a blockdev as rootfs,
then we do this before unsharing the userns, because we are not allowed to
do it in a userns. But after unsharing the userns, we unconditionally
retried mounting the rootfs, resulting in failure. Stop that.
Serge Hallyn [Mon, 27 Oct 2014 03:01:30 +0000 (22:01 -0500)]
do_rootfs_setup: fix return bugs
Fix return value on bind mount failure.
If we've already mounted the rootfs, exit after the bind mount
rather than re-trying the rootfs mount. The only case where
this happens is when root is starting a container in a user
namespace and with a block device backing store.
In that case, pre-mount hooks will be executed in the initial
user namespace. That may be worth fixing. Or it may be what
we want. We should think about it and fix it.
Dark Templar [Wed, 22 Oct 2014 14:35:08 +0000 (09:35 -0500)]
Fix another gentoo template typo
I've found one more typo in the gentoo template: the configuration in the
generated file /etc/conf.d/hostname was not valid. It didn't affect me,
because "lxc.utsname" is set in the container's configuration file and the
hostname service is not used. Anyway, I've made a patch and am
sending it with this mail.
Signed-off-by: Dark Templar <dark_templar@hotbox.ru> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
When running unprivileged, lxc-create will touch a fstab file, with bind-mounts
for the ttys and other devices. Add this entry in the container config.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
busybox template: support for unprivileged containers
Apply the changes found in templates/lxc-download to the busybox template as
well. Change ownership of the config and fstab files to the unprivileged user,
and the ownership of the rootfs to root in the new user namespace.
Eliminate the "unsupported for userns" flag.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
KATOH Yasufumi [Thu, 2 Oct 2014 09:01:06 +0000 (18:01 +0900)]
lxc_global_config_value can return the default lxc.cgroup.pattern whether root or non-root
>>> On Tue, 30 Sep 2014 19:48:09 +0000
in message "Re: [lxc-devel] [PATCH] lxc-config can show lxc.cgroup.(use|pattern)"
Serge Hallyn-san wrote:
> I think it would be worth also augmenting
> lxc_global_config_value() to return a default lxc.cgroup.use
> for 'all', and a default lxc.cgroup.pattern ("/lxc/%n" for root
> or "%n" for non-root).
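A small sketch of the defaults suggested in the quoted mail, for illustration only. The function names are hypothetical, and the expansion helper assumes that %n stands for the container name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: default pattern is "/lxc/%n" for root, "%n" otherwise. */
const char *default_cgroup_pattern(bool is_root)
{
	return is_root ? "/lxc/%n" : "%n";
}

/*
 * Hypothetical: copy `pattern` to `out`, replacing each "%n" with `name`.
 * Returns 0 on success, -1 if the result does not fit in `outlen` bytes.
 */
int expand_cgroup_pattern(const char *pattern, const char *name,
			  char *out, size_t outlen)
{
	size_t used = 0;

	assert(pattern != NULL && name != NULL && out != NULL && outlen > 0);
	for (const char *p = pattern; *p != '\0'; p++) {
		const char *piece = p;
		size_t n = 1;

		if (p[0] == '%' && p[1] == 'n') {
			piece = name;
			n = strlen(name);
			p++;  /* skip the 'n' as well */
		}
		if (used + n >= outlen)  /* leave room for the terminator */
			return -1;
		memcpy(out + used, piece, n);
		used += n;
	}
	out[used] = '\0';
	return 0;
}
```

A caller could pass geteuid() == 0 as is_root to pick the pattern for the current user.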
Dongsheng Yang [Tue, 16 Sep 2014 04:58:55 +0000 (12:58 +0800)]
network: allow lxc_network_move_by_index() rename netdev in moving.
In netlink, we can set the dest_name of a netdev when moving it
between namespaces, all in one netlink request. Moving a netdev named
src_name to a netdev named dest_name is a common use case.
So this patch adds a parameter to lxc_network_move_by_index() to
indicate the dest_name for the move. NULL means keep the
src_name.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Serge Hallyn [Thu, 9 Oct 2014 15:54:51 +0000 (10:54 -0500)]
fix lxc.mount.auto clearing
The way config_mount was structured, sending 'lxc.mount.auto = '
ended up actually clearing all lxc.mount.entry values. Fix that by
moving the check for an empty value to after the subkey checks.
Then, actually do the clearing of auto_mounts in config_mount_auto.
The 'strlen(subkey)' check being removed was bogus - the subkey is
either known to be 'lxc.mount.entry', or else subkey would have been
NULL (and forced a return in the block above).
This would have been clearer if the config_mount() and helper
fns were structured like the rest of confile.c. It's tempting
to switch it over, but there are subtleties in there so it's
not something to do without a lot of thought and testing.
Andrey Vagin [Sat, 4 Oct 2014 21:49:16 +0000 (01:49 +0400)]
lxc: don't call pivot_root if / is on a ramfs
pivot_root can't be called if / is on a ramfs. Currently chroot is
called before pivot_root. In this case the standard well-known
'chroot escape' technique allows escaping from the container.
I think the best way to handle this situation is to take the following steps:
* clean all mounts which should not be visible in the CT
* move the CT's rootfs into /
* chroot into /
I don't have a host where / is on a ramfs, so I can't test this patch.
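One possible shape for the last two steps listed above, as a hedged sketch for a Linux host. The function name enter_rootfs_without_pivot is hypothetical, and the first step (cleaning up mounts that should not be visible in the container) is deliberately omitted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Hypothetical helper: make `rootfs` the new root without pivot_root.
 * Assumes unwanted mounts have already been cleaned up by the caller.
 */
int enter_rootfs_without_pivot(const char *rootfs)
{
	assert(rootfs != NULL);
	if (chdir(rootfs) < 0)
		return -1;
	/* Move the container rootfs on top of "/". */
	if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
		return -1;
	/* chroot into the moved rootfs, then reset the working directory. */
	if (chroot(".") < 0)
		return -1;
	return chdir("/");
}
```

The helper needs CAP_SYS_ADMIN and CAP_SYS_CHROOT, and returns -1 with errno set as soon as any step fails.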
Serge Hallyn [Wed, 8 Oct 2014 05:14:26 +0000 (00:14 -0500)]
cgmanager: several fixes
These all fix various ways that cgroup actions could fail if an
unprivileged user's cgroup paths were not all the same for all
controllers.
1. in cgm_{g,s}et, use the right controller, not the first in the list,
to get the cgroup path.
2. when we pass 'all' to cgmanager for a ${METHOD}_abs, make sure that all
cgroup paths are the same. That isn't necessary for methods not
taking an absolute path, so split up the former
cgm_supports_multiple_controllers() function into two booleans, one
telling whether cgm supports it, and another telling us whether
cgm supports it AND all controller cgroup paths are the same.
3. separately, do_cgm_enter with abs=true couldn't work if all
cgroup paths were not the same. So just ditch that helper and
call lxc_cgmanager_enter() where needed, because the special
cases would be more complicated.
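The "all controller cgroup paths are the same" condition from item 2 can be sketched as a plain string comparison. The function name all_paths_identical and the array-of-paths representation are assumptions made for illustration, not the cgmanager code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical check: true when every controller's cgroup path equals
 * the first one (an empty or single-entry list is trivially identical).
 */
bool all_paths_identical(const char *const *paths, size_t n)
{
	assert(paths != NULL || n == 0);
	for (size_t i = 1; i < n; i++)
		if (strcmp(paths[0], paths[i]) != 0)
			return false;
	return true;
}
```

When the check fails, the text above implies handling each controller separately rather than passing 'all' with an absolute path.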
apparmor: restrict signal and ptrace for processes
Restrict signal and ptrace for processes running under the container
profile. Rules based on AppArmor base abstraction. Add unix rules for
processes running under the container profile.
To cover all the cases we have around, we need to:
- Attempt to use cgm if present (preferred)
- Attempt to use cgmanager directly over dbus otherwise
- Fallback to cgroupfs