syscall_wrappers: rename internal memfd_create to memfd_create_lxc
In case the internal memfd_create has to be used, make sure we don't
clash with the already existing memfd_create function from glibc.
This can happen if this glibc function is a stub. In this case, at
./configure time, the test for this function will return false, however
the declaration of that function is still available. This leads to
compilation errors.
Signed-off-by: Patrick Havelange <patrick.havelange@essensium.com>
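A minimal sketch of the renamed fallback, assuming it is a thin wrapper around the raw system call. Only the name memfd_create_lxc comes from the commit; the body and the ENOSYS fallback are illustrative assumptions, not the actual LXC source.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Fallback wrapper. The _lxc suffix keeps it from colliding with a
 * memfd_create() declaration that the glibc headers may still expose
 * even when the configure check reports the function as missing.
 */
static inline int memfd_create_lxc(const char *name, unsigned int flags)
{
	assert(name != NULL);
#ifdef __NR_memfd_create
	/* Invoke the raw system call directly, bypassing any libc stub. */
	return (int)syscall(__NR_memfd_create, name, flags);
#else
	/* Assumption: report "not implemented" if the syscall number is unknown. */
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```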
Rachid Koucha [Sat, 12 Oct 2019 11:05:50 +0000 (13:05 +0200)]
Bad sgml/man translation
When calling "man lxc.container.conf", an internal "man" keyword is displayed:
$ man lxc.container.conf
[...]
lxc.mount.entry
Specify a mount point corresponding to a line in the fstab format. Moreover lxc supports mount propagation, such as
rslave or rprivate, and adds three additional mount options. optional don't fail if mount does not work. create=dir
or create=file to create dir (or file) when the point will be mounted. relative source path is taken to be relative to
the mounted container root. For instance,
In the usual case the child runs in a separate pid namespace. So far we haven't
been able to reliably set the pdeath signal. When we set the pdeath signal we
need to verify that we haven't lost a race whereby we have been orphaned and
though we have set a pdeath signal it won't help us since, well, the parent is
dead.
We were able to correctly handle this case when we were in the same pidns since
getppid() will return a valid pid. When we are in a separate pidns 0 will be
returned since the parent doesn't exist in our pidns.
A while back, while Jann and I were discussing other things he came up with a
nifty idea: simply pass an fd for the parent's status file and check the
"State:" field. This is the implementation of that idea.
Suggested-by: Jann Horn <jann@thejh.net>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
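The check described above could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and treating the states 'Z' (zombie) and 'X' (dead) as "parent is gone" is an interpretation of the "State:" field, not a quote from the patch. The fd is assumed to have been opened on the parent's /proc/<pid>/status before the child entered its own pid namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the one-letter state from a /proc/<pid>/status buffer, or '\0'. */
static char state_from_status(const char *status)
{
	assert(status != NULL);

	for (const char *line = status; *line != '\0';) {
		if (strncmp(line, "State:", 6) == 0) {
			const char *p = line + 6;

			while (*p == ' ' || *p == '\t')
				p++;
			return (*p == '\n') ? '\0' : *p;
		}

		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line == NULL)
			break;
		line++;
	}

	return '\0';
}

/* 'Z' (zombie) and 'X' (dead) mean the process has already exited. */
static bool state_means_dead(char state)
{
	return state == 'Z' || state == 'X';
}

/* Returns 1 if the parent is dead, 0 if alive, -1 if the state is unknown. */
static int parent_is_dead(int status_fd)
{
	char buf[4096];
	ssize_t n;
	char state;

	/* Always read from offset 0 so the fd can be checked repeatedly. */
	n = pread(status_fd, buf, sizeof(buf) - 1, 0);
	if (n <= 0)
		return -1;
	buf[n] = '\0';

	state = state_from_status(buf);
	if (state == '\0')
		return -1;

	return state_means_dead(state) ? 1 : 0;
}
```

A caller would run parent_is_dead() right after setting the pdeath signal and decide for itself how to treat the -1 case.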
Julio Faracco [Thu, 5 Sep 2019 04:43:21 +0000 (01:43 -0300)]
utils: Copying source filename to avoid missing info.
Some applications use information from LOOP_GET_STATUS64. The file
associated with loop device is pointed inside structure field
`lo_file_name`. The current code is setting up a loop device without
this information. A legacy example of code checking this is cryptsetup:
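This is not the cryptsetup excerpt mentioned above, which is not reproduced here. It is only a sketch of how the source filename might be recorded, assuming a freshly set up loop device whose other status fields may be zeroed. The helper names are hypothetical; struct loop_info64, LO_NAME_SIZE and LOOP_SET_STATUS64 come from <linux/loop.h>.

```c
#include <assert.h>
#include <linux/loop.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Zero the status structure and copy the backing file name into it. */
static void fill_loop_file_name(struct loop_info64 *lo, const char *source)
{
	assert(lo != NULL && source != NULL);

	memset(lo, 0, sizeof(*lo));
	/* snprintf() truncates to LO_NAME_SIZE - 1 bytes and NUL-terminates. */
	snprintf((char *)lo->lo_file_name, LO_NAME_SIZE, "%s", source);
}

/*
 * Store the name on an open loop device. Assumption: the device was just
 * created, so resetting the remaining fields to zero is acceptable.
 * Returns the ioctl() result: 0 on success, -1 with errno set on failure.
 */
static int set_loop_file_name(int loop_fd, const char *source)
{
	struct loop_info64 lo;

	fill_loop_file_name(&lo, source);
	return ioctl(loop_fd, LOOP_SET_STATUS64, &lo);
}
```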
Antonio Terceiro [Sun, 18 Aug 2019 20:30:32 +0000 (17:30 -0300)]
lxc-attach: make sure exit status of command is returned
Commit ae68cad763d5b39a6a9e51de2acd1ad128b720ca introduced a regression that
makes lxc-attach ignore the exit status of the executed command. This was first
identified in 3.0.4 LTS, while it worked on 3.0.3.
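For illustration, propagating a child's exit status generally means decoding the waitpid() status rather than returning it raw or dropping it. The sketch below is not the lxc-attach code; the helper names are hypothetical, and mapping a fatal signal to 128 + signal number is the usual shell convention, assumed here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Decode a waitpid() status into a shell-style exit code. */
static int exit_code_from_status(int status)
{
	if (WIFEXITED(status))
		return WEXITSTATUS(status);

	/* Assumption: follow the shell convention for fatal signals. */
	if (WIFSIGNALED(status))
		return 128 + WTERMSIG(status);

	return -1;
}

/* Run a command through /bin/sh and return its exit code, or -1 on error. */
static int run_shell_command(const char *command)
{
	pid_t pid, ret;
	int status;

	assert(command != NULL);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		execl("/bin/sh", "sh", "-c", command, (char *)NULL);
		_exit(127); /* only reached if exec failed */
	}

	/* Retry if the wait is interrupted by a signal. */
	do {
		ret = waitpid(pid, &status, 0);
	} while (ret < 0 && errno == EINTR);
	if (ret < 0)
		return -1;

	return exit_code_from_status(status);
}
```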
cgfsng: mount pure unified cgroup layout correctly
When pure cgroup unified mode is used we cannot pre-mount a tmpfs as this
confuses systemd.
Users should also set lxc.mount.auto = cgroup:force to ensure that systemd in
the container and on the host use identical cgroup layouts.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
suppress false-negative error in templates and nvidia hook
``/proc`` might be mounted with ``hidepid=2``.
This makes ``/proc/1/…`` appear absent for non-root users.
When using the templates or the nvidia hook as a non-root user
(e.g., when creating unprivileged containers) the error
"/proc/1/uid_map: No such file or directory" is printed.
Since the script works correctly despite the error, this error
message might be confusing for users.
Julio Faracco [Sat, 3 Aug 2019 05:16:13 +0000 (02:16 -0300)]
utils: Fix wrong integer of a function parameter.
If SSL is enabled, utils will include the function `do_sha1_hash()` to
generate the SHA-1 digest of a buffer. The last argument of
`EVP_DigestFinal_ex()` must be an `unsigned int *`, but the current code
passes an `int *`.
See error:
utils.c:350:38: error: passing 'int *' to parameter of type 'unsigned int *' converts between pointers to integer types with different sign
[-Werror,-Wpointer-sign]
EVP_DigestFinal_ex(mdctx, md_value, md_len);
^~~~~~
/usr/include/openssl/evp.h:549:49: note: passing argument to parameter 's' here
unsigned int *s);
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
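The mismatch can be reproduced without OpenSSL. In the sketch below, demo_digest_final() is a hypothetical stand-in that only mirrors the out-parameter shape of EVP_DigestFinal_ex() (unsigned char *md, unsigned int *s); it computes no real digest. The point is the type of the length variable.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MD_SIZE 20 /* a SHA-1 digest is 20 bytes long */

/* Hypothetical stand-in: fills md with dummy bytes and reports the length. */
static int demo_digest_final(unsigned char *md, unsigned int *s)
{
	assert(md != NULL && s != NULL);

	memset(md, 0xab, DEMO_MD_SIZE);
	*s = DEMO_MD_SIZE;
	return 1; /* EVP_DigestFinal_ex() also returns 1 on success */
}

/* Return the digest length reported by the stand-in, or 0 on failure. */
static unsigned int demo_hash_length(void)
{
	unsigned char md_value[DEMO_MD_SIZE];
	/* unsigned int, not int: &md_len must be an 'unsigned int *'. */
	unsigned int md_len = 0;

	if (demo_digest_final(md_value, &md_len) != 1)
		return 0;

	return md_len;
}
```

Declaring md_len as int would bring back the -Wpointer-sign diagnostic shown above.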
Most kernels don't have this functionality yet, and so the warning is
printed a lot. Our people are scared of warnings, so let's make it INFO
instead in this case.
Rachid Koucha [Sat, 15 Jun 2019 13:17:50 +0000 (15:17 +0200)]
Fixed file descriptor leak for network namespace
In privileged mode, container startup leaks a file descriptor for "handler->nsfd[LXC_NS_NET]". At line 1782, we preserve the namespace file descriptors (in privileged mode, the network namespace is also preserved):
    for (i = 0; i < LXC_NS_MAX; i++)
        if (handler->ns_on_clone_flags & ns_info[i].clone_flag)
            INFO("Cloned %s", ns_info[i].flag_name);

    if (!lxc_try_preserve_namespaces(handler, handler->ns_on_clone_flags, handler->pid)) {
        ERROR("Failed to preserve cloned namespaces for lxc.hook.stop");
        goto out_delete_net;
    }
Then, at line 1830, we preserve the network namespace a second time:
    ret = lxc_try_preserve_ns(handler->pid, "net");
    if (ret < 0) {
        if (ret != -EOPNOTSUPP) {
            SYSERROR("Failed to preserve net namespace");
            goto out_delete_net;
        }
The latter overwrites the file descriptor already stored in handler->nsfd[LXC_NS_NET] at line 1786.
So, this fix checks that the entry is not already filled.
BugLink: https://bugs.launchpad.net/bugs/1831258
Cc: Dimitri John Ledkov <xnox@ubuntu.com>
Cc: Scott Moser <smoser@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
In addition to isolated cpus we also need to account for offline cpus when our
ancestor cgroup is the root cgroup and we have not been initialized yet.
Closes #2953.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
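The accounting amounts to removing both the isolated and the offline cpus from the candidate set. A sketch, assuming cpu lists in the kernel's "0-3,8" format (as found in /sys/devices/system/cpu/isolated and /sys/devices/system/cpu/offline) and, for brevity, at most 64 cpus; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a cpu list such as "0-3,8" into a bitmask (cpus >= 64 are ignored). */
static uint64_t parse_cpulist(const char *list)
{
	uint64_t mask = 0;
	const char *p = list;

	assert(list != NULL);

	while (*p != '\0' && *p != '\n') {
		char *end;
		unsigned long first, last;

		first = strtoul(p, &end, 10);
		if (end == p)
			break; /* not a number: stop parsing */
		last = first;
		p = end;

		if (*p == '-') {
			p++;
			last = strtoul(p, &end, 10);
			if (end == p)
				break;
			p = end;
		}

		for (unsigned long cpu = first; cpu <= last && cpu < 64; cpu++)
			mask |= UINT64_C(1) << cpu;

		if (*p == ',')
			p++;
	}

	return mask;
}

/* Candidate cpus minus the isolated ones and minus the offline ones. */
static uint64_t usable_cpus(const char *possible, const char *isolated,
			    const char *offline)
{
	return parse_cpulist(possible) & ~parse_cpulist(isolated) &
	       ~parse_cpulist(offline);
}
```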
Tycho Andersen [Thu, 9 May 2019 18:18:10 +0000 (14:18 -0400)]
lxc_clone: get rid of some indirection
We have a do_clone(), which just calls a void f(void *) that it gets
passed. We build up a struct consisting of two args that are just the
actual arg and actual function. Let's just have the syscall do this for us.
Tycho Andersen [Thu, 9 May 2019 17:52:30 +0000 (13:52 -0400)]
lxc_clone: pass non-stack allocated stack to clone
There are two problems with this code:
1. The math is wrong. We allocate a char *foo[__LXC_STACK_SIZE]; which
means it's really sizeof(char *) * __LXC_STACK_SIZE, instead of just
__LXC_STACK_SIZE.
2. We can't actually allocate it on our stack. When we use CLONE_VM (which
we do in the shared ns case) that means that the new thread is just
running one page lower on the stack, but anything that allocates a page
on the stack may clobber data. This is a pretty short race window since
we just do the shared ns stuff and then do a clone without CLONE_VM.
However, it does point out an interesting possible privilege escalation if
things aren't configured correctly: do_share_ns() sets up namespaces while
it shares the address space of the task that spawned it; once it enters the
pid ns of the thing it's sharing with, the thing it's sharing with can
ptrace it and write stuff into the host's address space. Since the function
that does the clone() is lxc_spawn(), it has a struct cgroup_ops* on the
stack, which itself has function pointers called later in the function, so
it's possible to allocate shellcode in the address space of the host and
run it fairly easily.
ASLR doesn't mitigate this since we know exactly the stack offsets; however
this patch has the kernel allocate a new stack, which will help. Of course,
the attacker could just check /proc/pid/maps to find the location of the
stack, but they'd still have to guess where to write stuff in.
The thing that does prevent this is the default configuration of apparmor.
Since the apparmor profile is set in the second clone, and apparmor
prevents ptracing things under a different profile, attackers confined by
apparmor can't do this. However, if users are using a custom configuration
with shared namespaces, care must be taken to avoid this race.
Shared namespaces aren't widely used now, so perhaps this isn't a problem,
but with the advent of crio-lxc for k8s, this functionality will be used
more.
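Both problems can be illustrated in a few lines. The sketch below uses a heap allocation as one way to get a stack that does not live on the caller's stack; the commit itself describes letting the kernel allocate the stack, so this is a substitute technique. DEMO_STACK_SIZE and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_STACK_SIZE (8 * 1024) /* stand-in for __LXC_STACK_SIZE */

/* Problem 1: an array of N pointers is sizeof(char *) * N bytes, not N. */
static size_t buggy_stack_bytes(void)
{
	return sizeof(char *[DEMO_STACK_SIZE]);
}

struct demo_stack {
	char *base; /* lowest address, as returned by malloc() */
	char *top;  /* one past the highest byte */
	size_t size;
};

/* Problem 2: allocate the child stack off the caller's stack. */
static int demo_stack_alloc(struct demo_stack *s, size_t size)
{
	assert(s != NULL && size > 0);

	s->base = malloc(size);
	if (s->base == NULL)
		return -1;

	s->size = size;
	/*
	 * Stacks grow downwards on most architectures, so a clone()-style
	 * call would be handed the top of the region, not its base.
	 */
	s->top = s->base + size;
	return 0;
}

static void demo_stack_free(struct demo_stack *s)
{
	free(s->base);
	s->base = NULL;
	s->top = NULL;
	s->size = 0;
}
```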
Thomas Parrott [Wed, 15 May 2019 14:54:12 +0000 (15:54 +0100)]
network: move phys netdevs back to monitor's net ns rather than pid 1's
Updates lxc_restore_phys_nics_to_netns() to move phys netdevs back to the monitor's network namespace rather than the previously hardcoded PID 1 net ns.
This fixes cases where LXC is started inside a net ns different from PID 1's: when the container was shut down, physical devices were moved back to a net ns other than the one the container was started from.
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
Rachid Koucha [Mon, 13 May 2019 11:13:18 +0000 (13:13 +0200)]
Config: check for %m availability
GLIBC supports %m to avoid calling strerror(). Using it saves some code space.
==> This check will define HAVE_M_FORMAT to be used wherever possible (e.g. log.h)
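A run-time probe for %m could look like the sketch below: format "%m" with a known errno and compare the result with strerror(). The function name is hypothetical and the actual configure check may be written differently.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the C library expands %m to strerror(errno), 0 otherwise. */
static int have_m_format(void)
{
	char buf[256];
	int n;

	errno = ENOENT;
	n = snprintf(buf, sizeof(buf), "%m");
	if (n < 0 || (size_t)n >= sizeof(buf))
		return 0;

	return strcmp(buf, strerror(ENOENT)) == 0;
}
```

With HAVE_M_FORMAT defined, a logging macro can put %m in its format string instead of passing strerror(errno) as an extra argument.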