Thomas Haller [Thu, 20 Dec 2018 10:56:02 +0000 (11:56 +0100)]
dhcp6: don't enforce DUID content for sd_dhcp6_client_set_duid()
There are various functions to set the DUID of a DHCPv6 client.
However, none of them allows to set arbitrary data. The closest is
sd_dhcp6_client_set_duid(), which would still do validation of the
DUID's content via dhcp_validate_duid_len().
Relax the validation and only log a debug message if the DUID
does not validate.
Note that dhcp_validate_duid_len() already is not very strict. For example
with DUID_TYPE_LLT it only ensures that the length is suitable to contain
hwtype and time. It does not further check that the length of hwaddr is non-zero
or suitable for hwtype. Also, non-well-known DUID types are accepted for
extensibility. Why reject certain DUIDs but allowing clearly wrong formats
otherwise?
The validation and failure should happen earlier, when accepting the
unsuitable DUID. At that point, there is more context of what is wrong,
and a better failure reason (or warning) can be reported to the user. Rejecting
the DUID when setting up the DHCPv6 client seems not optimal, in particular
because the DHCPv6 client does not care about actual content of the
DUID and treats it as opaque blob.
Also, NetworkManager (which uses this code) allows to configure the entire
binary DUID in binary. It intentionally does not validate the binary
content any further. Hence, it needs to be able to set _invalid_ DUIDs,
provided that some basic constraints are satisfied (like the maximum length).
sd_dhcp6_client_set_duid() has two callers: both set the DUID obtained
from link_get_duid(), which comes from configuration.
`man networkd.conf` says: "The configured DHCP DUID should conform to
the specification in RFC 3315, RFC 6355.". It does not not state that
it MUST conform.
Note that dhcp_validate_duid_len() has another caller: DHCPv4's
dhcp_client_set_iaid_duid_internal(). In this case, continue with
strict validation, as the callers are more controlled. Also, there is
already sd_dhcp_client_set_client_id() which can be used to bypass
this check and set arbitrary client identifiers.
Thomas Haller [Wed, 19 Dec 2018 09:05:37 +0000 (10:05 +0100)]
dhcp: don't enforce hardware address length for sd_dhcp_client_set_client_id()
sd_dhcp_client_set_client_id() is the only API for setting a raw client-id.
All other setters are more restricted and only allow to set a type 255 DUID.
Also, dhcp4_set_client_identifier() is the only caller, which already
does:
r = sd_dhcp_client_set_client_id(link->dhcp_client,
ARPHRD_ETHER,
(const uint8_t *) &link->mac,
sizeof(link->mac));
and hence ensures that the data length is indeed ETH_ALEN.
Drop additional input validation from sd_dhcp_client_set_client_id(). The client-id
is an opaque blob, and if a caller wishes to set type 1 (ethernet) or type 32
(infiniband) with unexpected address length, it should be allowed. The actual
client-id is not relevant to the DHCP client, and it's the responsibility of the
caller to generate a suitable client-id.
For example, in NetworkManager you can configure all the bytes of the
client-id, including such _invalid_ settings. I think it makes sense,
to allow the user to fully configure the identifier. Even if such configuration
would be rejected, it would be the responsibility of the higher layers (including
a sensible error message to the user) and not fail later during
sd_dhcp_client_set_client_id().
Still log a debug message if the length is unexpected.
Thomas Haller [Thu, 20 Dec 2018 12:05:13 +0000 (13:05 +0100)]
dhcp: fix sd_dhcp_client_set_client_id() for infiniband addresses
Infiniband addresses are 20 bytes (INFINIBAND_ALEN), but only the last
8 bytes are suitable for putting into the client-id.
This bug had no effect for networkd, because sd_dhcp_client_set_client_id()
has only one caller which always uses ARPHRD_ETHER type.
I was unable to find good references for why this is correct ([1]). Fedora/RHEL
has patches for ISC dhclient that also only use the last 8 bytes ([2], [3]).
RFC 4390 (Dynamic Host Configuration Protocol (DHCP) over InfiniBand) [4] does
not discuss the content of the client-id either.
NeilBrown [Sun, 16 Dec 2018 22:32:58 +0000 (09:32 +1100)]
mount: disable mount-storm protection while mount unit is starting.
The starting of mount units requires that changes to
/proc/self/mountinfo be processed before the SIGCHILD from the
completion of /sbin/mount is processed, as described by the comment
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
The recently-added mount-storm protection can defeat this as it
will sometimes deliberately delay processing of /proc/self/mountinfo.
So we need to disable mount-storm protection when a mount unit is starting.
We do this by keeping a counter of the number of pending
mounts, and disabling the protection when this is non-zero.
Thanks to @asavah for finding and reporting this problem.
Apparently this originated in PHP, so the json output could be directly
embedded in HTML script tags.
See https://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped.
Since the output of our tools is not intended directly for web page generation,
let's not do this unescaping. If needed, the consumer can always do escaping as
appropriate for the target format.
fileio: replace read_nul_string() by read_line() with a special flag
read_line() is a lot more careful and optimized than read_nul_string()
but does mostly the same thing. let's replace the latter by the former,
just with a special flag that toggles between the slightly different EOL
rules if both.
units: set NoNewPrivileges= for all long-running services
Previously, setting this option by default was problematic due to
SELinux (as this would also prohibit the transition from PID1's label to
the service's label). However, this restriction has since been lifted,
hence let's start making use of this universally in our services.
On SELinux system this change should be synchronized with a policy
update that ensures that NNP-ful transitions from init_t to service
labels is permitted.
Let's split the commit in two: the sorting and the changes.
Because there's a requirement to update selinux policy, this change is
incompatible, strictly speaking. I expect that distributions might want to
revert this particular change. Let's make it easy.
meson: rename two more variables from _c to _sources
_c is misleading because .h files should be included in those lists too
(this tells meson that the build outputs should be rebuilt if the header
files change).
test-hashmap: add test to compare hashmap_free performance
The point here is to compare speed of hashmap_destroy with free and a different
freeing function, to the implementation details of hashmap_clear can be
evaluated.
Results:
current code:
/* test_hashmap_free (slow, 1048576 entries) */
string_hash_ops test took 2.494499s
custom_free_hash_ops test took 2.640449s
string_hash_ops test took 2.287734s
custom_free_hash_ops test took 2.557632s
string_hash_ops test took 2.299791s
custom_free_hash_ops test took 2.586975s
string_hash_ops test took 2.314099s
custom_free_hash_ops test took 2.589327s
string_hash_ops test took 2.319137s
custom_free_hash_ops test took 2.584038s
code with a patch which restores the "fast path" using:
for (idx = skip_free_buckets(h, 0); idx != IDX_NIL; idx = skip_free_buckets(h, idx + 1))
in the case where both free_key and free_value are either free or NULL:
/* test_hashmap_free (slow, 1048576 entries) */
string_hash_ops test took 2.347013s
custom_free_hash_ops test took 2.585104s
string_hash_ops test took 2.311583s
custom_free_hash_ops test took 2.578388s
string_hash_ops test took 2.283658s
custom_free_hash_ops test took 2.621675s
string_hash_ops test took 2.334675s
custom_free_hash_ops test took 2.601568s
So the test is noisy, but there clearly is no significant difference with the
"fast path" restored. I'm surprised by this, but it shows that the current
"safe" implementation does not cause a performance loss.
When the code is compiled with optimization, those times are significantly
lower (e.g. 1.1s and 1.4s), but again, there is no difference with the "fast
path" restored.
The difference between string_hash_ops and custom_free_hash_ops is the
additional cost of global modification and the extra function call.
lldp: add test coverage for sd_lldp_get_neighbors() with multiple neighbors
In particular, check that the order of the results is consistent.
This test coverage will be useful in order to refactor the compare_func
used while sorting the results.
When introduced, this test also uncovered a memory leak in sd_lldp_stop(),
which was then fixed by a separate commit using a specialized function
as destructor of the LLDP Hashmap.
Tested:
$ ninja -C build/ test
$ valgrind --leak-check=full build/test-lldp
hashmap: rework hashmap_clear() to be more defensive
Let's first remove an item from the hashmap and only then destroy it.
This makes sure that destructor functions can mdoify the hashtables in
their own codee and we won't be confused by that.
resolved: only attempt non-answer SOA RRs if they are parents of our query
There's no value in authenticating SOA RRs that are neither answer to
our question nor parent of our question (the latter being relevant so
that we have a TTL from the SOA field for negative caching of the actual
query).
By being to eager here, and trying to authenticate too much we run the
risk of creating cyclic deps between our transactions which then causes
the over-all authentication to fail.
KeyringMode option is useful for user services. Also, documentation for the
option suggests that the option applies to user services. However, setting the
option to any of its allowed values has no effect.
This commit fixes that and removes EXEC_NEW_KEYRING flag. The flag is no longer
necessary: instead of checking if the flag is set we can check if keyring_mode
is not equal to EXEC_KEYRING_INHERIT.
Michal Sekletar [Fri, 14 Dec 2018 14:17:27 +0000 (15:17 +0100)]
journald: correctly attribute log messages also with cgroupsv1
With cgroupsv1 a zombie process is migrated to root cgroup in all
hierarchies. This was changed for unified hierarchy and /proc/PID/cgroup
reports cgroup to which process belonged before it exited.
Be more suspicious about cgroup path reported by the kernel and use
unit_id provided by the log client if the kernel reports that process is
running in the root cgroup.
Users tend to care the most about 'log->unit_id' mapping so systemctl
status can correctly report last log lines. Also we wouldn't be able to
infer anything useful from "/" path anyway.
Tore Anderson [Mon, 17 Dec 2018 08:15:59 +0000 (09:15 +0100)]
resolve: enable EDNS0 towards the 127.0.0.53 stub resolver
This appears to be necessary for client software to ensure the reponse data
is validated with DNSSEC. For example, `ssh -v -o VerifyHostKeyDNS=yes -o
StrictHostKeyChecking=yes redpilllinpro01.ring.nlnog.net` fails if EDNS0 is
not enabled. The debugging output reveals that the `SSHFP` records were
found in DNS, but were considered insecure.
Note that the patch intentionally does *not* enable EDNS0 in the
`/run/systemd/resolve/resolv.conf` file (the one that contains `nameserver`
entries for the upstream DNS servers), as it is impossible to know for
certain that all the upstream DNS servers handles EDNS0 correctly.
dissect-image: wait for the main device and all partitions to be known by udev
Fixes #10526.
Even if we waited for the root device to appear, the mount could still fail if
we didn't wait for udev to initalize the device. In particular, the
/dev/block/n:m path used to mount the device is created by udev, and nspawn
would sometimes win the race and the mount would fail with -ENOENT.
The same wait is done for partitions, since if we try to mount them, the same
considerations apply.
Note: I first implemented a version which just does a loop (with a short wait).
In that approach, udev takes on average ~800 µs to initialize the loopback
device. The approach where we set up a monitor and avoid the loop is a bit
nicer. There doesn't seem to be a significant difference in speed.
With 1000 invocations of 'systemd-nspawn -i image.squashfs echo':
loop (previous approach):
real 4m52.625s
user 0m37.094s
sys 2m14.705s
monitor (this patch):
real 4m50.791s
user 0m36.619s
sys 2m14.039s
dissect-image would wait for the root device and paritions to appear. But if we
had an image with no partitions, we'd not wait at all. If the kernel or udev
were slow in creating device nodes or symlinks, subsequent mount attempt might
fail if nspawn won the race.
Calling wait_for_partitions_to_appear() in case of no partitions means that we
verify that the kernel agrees that there are no partitions. We verify that the
kernel sees the same number of partitions as blkid, so let's that also in this
case.
This makes the failure in #10526 much less likely, but doesn't eliminate it
completely. Stay tuned.
The function interface is the same, except that the output pointer may be NULL.
The implementation is slightly simplified by taking advantage of changes in
ancestor commit 'sd-device: attempt to read db again if it wasn't found', by
not creating a new sd_device object before re-checking the is_initialized
status.
v2:
- In v1, the old object was always used and the device received back from the
sd_device_monitor_start callback was ignored. I *think* the result will be
equivalent in both cases, because by the time we the callback gets called,
the db entry in the filesystem will also exist, and any subsequent access to
properties of the object would trigger a read of the database from disk. But
I'm not certain, and anyway, using the device object received in the callback
seems cleaner.
Normally, we don't care too much about what pahole reports. But this structure
could potentially be allocated for every device on the system, i.e. in a large
number of copies. 5 vs 7 cache lines is nice.
/* size: 400, cachelines: 7, members: 53 */
/* sum members: 330, holes: 12, sum holes: 70 */
/* last cacheline: 16 bytes */
/* size: 320, cachelines: 5, members: 53 */
/* bit holes: 1, sum bit holes: 6 bits */
/* bit_padding: 5 bits */
sd-device: reduce the number of implementations of device_read_db() we keep around
We had two very similar functions: device_read_db_aux and device_read_db,
and a number of wrappers for them:
device_read_db_aux
← device_read_db (in sd-device.c)
← all functions in sd-device.c, including sd_device_is_initialized
← device_read_db_force
← event_execute_rules_on_remove (in udev-event.c)
device_read_db (in device-private.c)
← functions in device_private.c (but not device_read_db_force):
device_get_devnode_{mode,uid,gid}
device_get_devlink_priority
device_get_watch_handle
device_clone_with_db
← called from udevadm, udev-{node,event,watch}.c
Before 7141e4f62c3f220872df3114c42d9e4b9525e43e (sd-device: don't retry loading
uevent/db files more than once), the two implementations were the same. In that
commit, device_read_db_aux was changed. Those changes were reverted in the parent
commit, so the two implementations are now again the same except for superficial
differences. This commit removes device_read_db (in sd-device.c), and renames
device_read_db_aux to device_read_db_internal and makes everyone use this one
implementation. There should be no functional change.
sd-device: attempt to read db again if it wasn't found
This mostly reverts "sd-device: don't retry loading uevent/db files more than
once", 7141e4f62c3f220872df3114c42d9e4b9525e43e. We will retry if we couldn't
access the file, but not if parsing failed.
Not re-reading the database at all just doesn't seem like a good idea. We have
two implementations of device_read_db, and one does that, and the other retries
to read the db. Re-reading seems more useful, since we can create the object
and then access properties as some later time when we know that the device has
been initialized and we can get useful results. Otherwise, we force the user to
destroy this object and create a new one.
This changes device_read_uevent_file() and device_read_db_aux(). See next
commit for description of where those functions are used.
NeilBrown [Thu, 4 Oct 2018 05:49:22 +0000 (15:49 +1000)]
core/mount: minimize impact on mount storm.
If we create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.
I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.
One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.
This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs off when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensures systemd won't use more than 10% of
real time processing mountinfo.
With this patch, my test above takes about 5 seconds.
Let's make sure the first hashmap we destroy also frees all machines,
because otherwise when freeing the other hashmaps we'll try to
deregister the contained machines from the hashmaps already destroyed.
Apparently, people do use nscd, hence play somewhat nice with it, and
let's explicitly flush nscd caches whenever we register a new
user/group.
This patch only adds the actual refresh request invocation. Later
commits then issue this call at appropriate moments.
Note that the nscd protocol is not officially documented though very
simple. This code is written very defensively so that incompatibilities
don't affect us much.
Given that glibc really has a duty to maintain compat between
differently compiled programs and their system nscd they can't break API
and thus it should be safe for us to implement an alternative,
minimalistic client.
Ideally this kind of explicit, global cache flushing would not be necessary.
However nscd currently has no cache coherency protocol, hence we can't
really implement this better. The only concept it knows is a TTL for
positive hosts lookups. Hoewver for negative lookups or any of the other
tables nothing is available.