This patch causes various problems during boot, where a "mount storm" occurs
naturally. Current approach is flakey, and it seems very risky to push a
feature like this which impacts boot right before a release. So let's revert
for now, and consider a more robust solution after later.
NeilBrown [Sun, 16 Dec 2018 22:32:58 +0000 (09:32 +1100)]
mount: disable mount-storm protection while mount unit is starting.
The starting of mount units requires that changes to
/proc/self/mountinfo be processed before the SIGCHILD from the
completion of /sbin/mount is processed, as described by the comment
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
The recently-added mount-storm protection can defeat this as it
will sometimes deliberately delay processing of /proc/self/mountinfo.
So we need to disable mount-storm protection when a mount unit is starting.
We do this by keeping a counter of the number of pending
mounts, and disabling the protection when this is non-zero.
Thanks to @asavah for finding and reporting this problem.
Apparently this originated in PHP, so the json output could be directly
embedded in HTML script tags.
See https://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped.
Since the output of our tools is not intended directly for web page generation,
let's not do this unescaping. If needed, the consumer can always do escaping as
appropriate for the target format.
fileio: replace read_nul_string() by read_line() with a special flag
read_line() is a lot more careful and optimized than read_nul_string()
but does mostly the same thing. let's replace the latter by the former,
just with a special flag that toggles between the slightly different EOL
rules if both.
units: set NoNewPrivileges= for all long-running services
Previously, setting this option by default was problematic due to
SELinux (as this would also prohibit the transition from PID1's label to
the service's label). However, this restriction has since been lifted,
hence let's start making use of this universally in our services.
On SELinux system this change should be synchronized with a policy
update that ensures that NNP-ful transitions from init_t to service
labels is permitted.
Let's split the commit in two: the sorting and the changes.
Because there's a requirement to update selinux policy, this change is
incompatible, strictly speaking. I expect that distributions might want to
revert this particular change. Let's make it easy.
meson: rename two more variables from _c to _sources
_c is misleading because .h files should be included in those lists too
(this tells meson that the build outputs should be rebuilt if the header
files change).
test-hashmap: add test to compare hashmap_free performance
The point here is to compare speed of hashmap_destroy with free and a different
freeing function, to the implementation details of hashmap_clear can be
evaluated.
Results:
current code:
/* test_hashmap_free (slow, 1048576 entries) */
string_hash_ops test took 2.494499s
custom_free_hash_ops test took 2.640449s
string_hash_ops test took 2.287734s
custom_free_hash_ops test took 2.557632s
string_hash_ops test took 2.299791s
custom_free_hash_ops test took 2.586975s
string_hash_ops test took 2.314099s
custom_free_hash_ops test took 2.589327s
string_hash_ops test took 2.319137s
custom_free_hash_ops test took 2.584038s
code with a patch which restores the "fast path" using:
for (idx = skip_free_buckets(h, 0); idx != IDX_NIL; idx = skip_free_buckets(h, idx + 1))
in the case where both free_key and free_value are either free or NULL:
/* test_hashmap_free (slow, 1048576 entries) */
string_hash_ops test took 2.347013s
custom_free_hash_ops test took 2.585104s
string_hash_ops test took 2.311583s
custom_free_hash_ops test took 2.578388s
string_hash_ops test took 2.283658s
custom_free_hash_ops test took 2.621675s
string_hash_ops test took 2.334675s
custom_free_hash_ops test took 2.601568s
So the test is noisy, but there clearly is no significant difference with the
"fast path" restored. I'm surprised by this, but it shows that the current
"safe" implementation does not cause a performance loss.
When the code is compiled with optimization, those times are significantly
lower (e.g. 1.1s and 1.4s), but again, there is no difference with the "fast
path" restored.
The difference between string_hash_ops and custom_free_hash_ops is the
additional cost of global modification and the extra function call.
lldp: add test coverage for sd_lldp_get_neighbors() with multiple neighbors
In particular, check that the order of the results is consistent.
This test coverage will be useful in order to refactor the compare_func
used while sorting the results.
When introduced, this test also uncovered a memory leak in sd_lldp_stop(),
which was then fixed by a separate commit using a specialized function
as destructor of the LLDP Hashmap.
Tested:
$ ninja -C build/ test
$ valgrind --leak-check=full build/test-lldp
hashmap: rework hashmap_clear() to be more defensive
Let's first remove an item from the hashmap and only then destroy it.
This makes sure that destructor functions can mdoify the hashtables in
their own codee and we won't be confused by that.
resolved: only attempt non-answer SOA RRs if they are parents of our query
There's no value in authenticating SOA RRs that are neither answer to
our question nor parent of our question (the latter being relevant so
that we have a TTL from the SOA field for negative caching of the actual
query).
By being to eager here, and trying to authenticate too much we run the
risk of creating cyclic deps between our transactions which then causes
the over-all authentication to fail.
KeyringMode option is useful for user services. Also, documentation for the
option suggests that the option applies to user services. However, setting the
option to any of its allowed values has no effect.
This commit fixes that and removes EXEC_NEW_KEYRING flag. The flag is no longer
necessary: instead of checking if the flag is set we can check if keyring_mode
is not equal to EXEC_KEYRING_INHERIT.
Michal Sekletar [Fri, 14 Dec 2018 14:17:27 +0000 (15:17 +0100)]
journald: correctly attribute log messages also with cgroupsv1
With cgroupsv1 a zombie process is migrated to root cgroup in all
hierarchies. This was changed for unified hierarchy and /proc/PID/cgroup
reports cgroup to which process belonged before it exited.
Be more suspicious about cgroup path reported by the kernel and use
unit_id provided by the log client if the kernel reports that process is
running in the root cgroup.
Users tend to care the most about 'log->unit_id' mapping so systemctl
status can correctly report last log lines. Also we wouldn't be able to
infer anything useful from "/" path anyway.
Tore Anderson [Mon, 17 Dec 2018 08:15:59 +0000 (09:15 +0100)]
resolve: enable EDNS0 towards the 127.0.0.53 stub resolver
This appears to be necessary for client software to ensure the reponse data
is validated with DNSSEC. For example, `ssh -v -o VerifyHostKeyDNS=yes -o
StrictHostKeyChecking=yes redpilllinpro01.ring.nlnog.net` fails if EDNS0 is
not enabled. The debugging output reveals that the `SSHFP` records were
found in DNS, but were considered insecure.
Note that the patch intentionally does *not* enable EDNS0 in the
`/run/systemd/resolve/resolv.conf` file (the one that contains `nameserver`
entries for the upstream DNS servers), as it is impossible to know for
certain that all the upstream DNS servers handles EDNS0 correctly.
dissect-image: wait for the main device and all partitions to be known by udev
Fixes #10526.
Even if we waited for the root device to appear, the mount could still fail if
we didn't wait for udev to initalize the device. In particular, the
/dev/block/n:m path used to mount the device is created by udev, and nspawn
would sometimes win the race and the mount would fail with -ENOENT.
The same wait is done for partitions, since if we try to mount them, the same
considerations apply.
Note: I first implemented a version which just does a loop (with a short wait).
In that approach, udev takes on average ~800 µs to initialize the loopback
device. The approach where we set up a monitor and avoid the loop is a bit
nicer. There doesn't seem to be a significant difference in speed.
With 1000 invocations of 'systemd-nspawn -i image.squashfs echo':
loop (previous approach):
real 4m52.625s
user 0m37.094s
sys 2m14.705s
monitor (this patch):
real 4m50.791s
user 0m36.619s
sys 2m14.039s
dissect-image would wait for the root device and paritions to appear. But if we
had an image with no partitions, we'd not wait at all. If the kernel or udev
were slow in creating device nodes or symlinks, subsequent mount attempt might
fail if nspawn won the race.
Calling wait_for_partitions_to_appear() in case of no partitions means that we
verify that the kernel agrees that there are no partitions. We verify that the
kernel sees the same number of partitions as blkid, so let's that also in this
case.
This makes the failure in #10526 much less likely, but doesn't eliminate it
completely. Stay tuned.
The function interface is the same, except that the output pointer may be NULL.
The implementation is slightly simplified by taking advantage of changes in
ancestor commit 'sd-device: attempt to read db again if it wasn't found', by
not creating a new sd_device object before re-checking the is_initialized
status.
v2:
- In v1, the old object was always used and the device received back from the
sd_device_monitor_start callback was ignored. I *think* the result will be
equivalent in both cases, because by the time we the callback gets called,
the db entry in the filesystem will also exist, and any subsequent access to
properties of the object would trigger a read of the database from disk. But
I'm not certain, and anyway, using the device object received in the callback
seems cleaner.
Normally, we don't care too much about what pahole reports. But this structure
could potentially be allocated for every device on the system, i.e. in a large
number of copies. 5 vs 7 cache lines is nice.
/* size: 400, cachelines: 7, members: 53 */
/* sum members: 330, holes: 12, sum holes: 70 */
/* last cacheline: 16 bytes */
/* size: 320, cachelines: 5, members: 53 */
/* bit holes: 1, sum bit holes: 6 bits */
/* bit_padding: 5 bits */
sd-device: reduce the number of implementations of device_read_db() we keep around
We had two very similar functions: device_read_db_aux and device_read_db,
and a number of wrappers for them:
device_read_db_aux
← device_read_db (in sd-device.c)
← all functions in sd-device.c, including sd_device_is_initialized
← device_read_db_force
← event_execute_rules_on_remove (in udev-event.c)
device_read_db (in device-private.c)
← functions in device_private.c (but not device_read_db_force):
device_get_devnode_{mode,uid,gid}
device_get_devlink_priority
device_get_watch_handle
device_clone_with_db
← called from udevadm, udev-{node,event,watch}.c
Before 7141e4f62c3f220872df3114c42d9e4b9525e43e (sd-device: don't retry loading
uevent/db files more than once), the two implementations were the same. In that
commit, device_read_db_aux was changed. Those changes were reverted in the parent
commit, so the two implementations are now again the same except for superficial
differences. This commit removes device_read_db (in sd-device.c), and renames
device_read_db_aux to device_read_db_internal and makes everyone use this one
implementation. There should be no functional change.
sd-device: attempt to read db again if it wasn't found
This mostly reverts "sd-device: don't retry loading uevent/db files more than
once", 7141e4f62c3f220872df3114c42d9e4b9525e43e. We will retry if we couldn't
access the file, but not if parsing failed.
Not re-reading the database at all just doesn't seem like a good idea. We have
two implementations of device_read_db, and one does that, and the other retries
to read the db. Re-reading seems more useful, since we can create the object
and then access properties as some later time when we know that the device has
been initialized and we can get useful results. Otherwise, we force the user to
destroy this object and create a new one.
This changes device_read_uevent_file() and device_read_db_aux(). See next
commit for description of where those functions are used.
NeilBrown [Thu, 4 Oct 2018 05:49:22 +0000 (15:49 +1000)]
core/mount: minimize impact on mount storm.
If we create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.
I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.
One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.
This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs off when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensures systemd won't use more than 10% of
real time processing mountinfo.
With this patch, my test above takes about 5 seconds.
Let's make sure the first hashmap we destroy also frees all machines,
because otherwise when freeing the other hashmaps we'll try to
deregister the contained machines from the hashmaps already destroyed.
Apparently, people do use nscd, hence play somewhat nice with it, and
let's explicitly flush nscd caches whenever we register a new
user/group.
This patch only adds the actual refresh request invocation. Later
commits then issue this call at appropriate moments.
Note that the nscd protocol is not officially documented though very
simple. This code is written very defensively so that incompatibilities
don't affect us much.
Given that glibc really has a duty to maintain compat between
differently compiled programs and their system nscd they can't break API
and thus it should be safe for us to implement an alternative,
minimalistic client.
Ideally this kind of explicit, global cache flushing would not be necessary.
However nscd currently has no cache coherency protocol, hence we can't
really implement this better. The only concept it knows is a TTL for
positive hosts lookups. Hoewver for negative lookups or any of the other
tables nothing is available.