Florian Forster [Wed, 29 Nov 2023 12:41:32 +0000 (13:41 +0100)]
write_riemann plugin: use reference counting to decide when to free user data.
While reference counting was present previously, it had problems:
* The reference passed to `plugin_register_flush()` was not counted.
* If a flush plugin was registered, `free_func` was set to NULL but
the reference passed to `plugin_register_notification()` was still
counted, meaning in that case the counter never went to zero.
* Mutexes must be unlocked when calling `pthread_mutex_destroy()`.
* The code limped on after an error, returning a failure eventually.
This is unnecessarily complex control flow that has been simplified.
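A minimal sketch of the corrected pattern, assuming a simplified callback struct. Every registration takes a reference, the flush registration included. The shared free function releases the data only when the count reaches zero, and it unlocks the mutex before destroying it. The names `cb_data_t`, `cb_data_ref()`, `cb_data_unref()` and `cb_data_free_func()` are hypothetical; this is not the plugin's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the plugin's callback data. */
typedef struct {
  pthread_mutex_t lock;
  size_t ref_count; /* one reference per registered callback */
} cb_data_t;

static cb_data_t *cb_data_create(void) {
  cb_data_t *cb = calloc(1, sizeof(*cb));
  if (cb == NULL)
    return NULL;
  if (pthread_mutex_init(&cb->lock, NULL) != 0) {
    free(cb);
    return NULL;
  }
  return cb;
}

/* Take one reference per plugin_register_*() call that receives cb,
 * including the flush registration. */
static void cb_data_ref(cb_data_t *cb) {
  pthread_mutex_lock(&cb->lock);
  cb->ref_count++;
  pthread_mutex_unlock(&cb->lock);
}

/* Drop one reference; returns 1 if this was the last one and cb was freed. */
static int cb_data_unref(cb_data_t *cb) {
  pthread_mutex_lock(&cb->lock);
  assert(cb->ref_count > 0);
  cb->ref_count--;
  size_t remaining = cb->ref_count;
  /* Unlock first: destroying a locked mutex is undefined behaviour. */
  pthread_mutex_unlock(&cb->lock);
  if (remaining > 0)
    return 0;
  pthread_mutex_destroy(&cb->lock);
  free(cb);
  return 1;
}

/* Shape of a free callback taking void *, used for every registration. */
static void cb_data_free_func(void *arg) { (void)cb_data_unref(arg); }
```

Every registration installs the same free function, so the data outlives whichever callback happens to be unregistered last.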
Leonard Göhrs [Fri, 24 Mar 2023 09:49:24 +0000 (10:49 +0100)]
src/write_http.c: use reference counting to decide when to free user_data
The teardown code for the wh_callback_t struct previously relied on
the order in which the different callback functions are de-initialized
to be known and to never change.
This is prone to failure and is indeed currently broken, leading to a
segmentation fault on collectd exit.
Fix this by counting the active references to the user data and freeing
it once it reaches zero.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
Apparently we defined a bunch of `plugin_foo` variables that were never
used. The variables generated from the arguments to `AC_PLUGIN` now duplicate
them. This removes the previously unused definitions and keeps the generated
ones.
Florian Forster [Tue, 28 Nov 2023 18:31:11 +0000 (19:31 +0100)]
procevent plugin: remove use of a nested flexible array member.
The previous code used an ad-hoc struct to construct or parse a Netlink
message. This relied on allocating a field _after_ the struct with the
flexible array member, which is prohibited by the C standard, leading to
compiler warnings.
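A sketch of the general technique, assuming a made-up header type that ends in a flexible array member. The message is built and parsed in a plain byte buffer, and the payload is placed at `offsetof(..., data)`, so no struct with a flexible array member is ever embedded in another struct. `struct hdr_with_fam`, `build_msg()` and `parse_msg()` are hypothetical. Real Netlink code would also have to respect the kernel's alignment rules, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header type ending in a flexible array member. */
struct hdr_with_fam {
  uint32_t id;
  uint16_t len;         /* number of payload bytes in data[] */
  unsigned char data[]; /* flexible array member */
};

/* Not valid ISO C (a struct with a flexible array member must not be a
 * member of another struct), which is what triggers the warnings:
 *
 *   struct { struct hdr_with_fam hdr; uint32_t op; } msg;
 *
 * Instead, lay the message out in a plain byte buffer. */
#define PAYLOAD_OFFSET offsetof(struct hdr_with_fam, data)

static size_t build_msg(unsigned char *buf, size_t buf_len, uint32_t id,
                        uint32_t op) {
  assert(buf != NULL);
  size_t total = PAYLOAD_OFFSET + sizeof(op);
  if (buf_len < total)
    return 0;
  struct hdr_with_fam hdr = {.id = id, .len = (uint16_t)sizeof(op)};
  memcpy(buf, &hdr, PAYLOAD_OFFSET);             /* header fields only */
  memcpy(buf + PAYLOAD_OFFSET, &op, sizeof(op)); /* payload after header */
  return total;
}

static int parse_msg(const unsigned char *buf, size_t len, uint32_t *id,
                     uint32_t *op) {
  struct hdr_with_fam hdr;
  if (len < PAYLOAD_OFFSET)
    return -1;
  memcpy(&hdr, buf, PAYLOAD_OFFSET);
  if (hdr.len != sizeof(*op) || len < PAYLOAD_OFFSET + hdr.len)
    return -1;
  *id = hdr.id;
  memcpy(op, buf + PAYLOAD_OFFSET, sizeof(*op)); /* memcpy: no alignment needs */
  return 0;
}
```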
Florian Forster [Sat, 25 Nov 2023 13:12:59 +0000 (14:12 +0100)]
SMART plugin: initialize struct passed to `ioctl(2)`.
Valgrind is complaining about a conditional jump based on uninitialized
memory:
```
==66462== Conditional jump or move depends on uninitialised value(s)
==66462== at 0x10C500: smart_read_nvme_intel_disk (in /__w/collectd/collectd/test_plugin_smart)
==66462== by 0x10D366: test_x (in /__w/collectd/collectd/test_plugin_smart)
==66462== by 0x10D638: main (in /__w/collectd/collectd/test_plugin_smart)
```
This may be due to the `struct nvme_additional_smart_log` being
uninitialized when it's being passed to `ioctl(2)`.
While there, this removes an unnecessary level of indentation.
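A minimal sketch of why the struct should be initialized before the call. `struct smart_log_sketch` stands in for `struct nvme_additional_smart_log`. `fake_ioctl()` is a hypothetical stand-in for `ioctl(2)` that writes only part of the buffer, as a device returning less data might. With `= {0}`, the members the call does not touch hold zero rather than indeterminate values, which is the kind of read Valgrind flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct nvme_additional_smart_log. */
struct smart_log_sketch {
  uint8_t key;
  uint8_t norm;
  uint8_t raw[6];
};

/* Hypothetical stand-in for ioctl(2): fills only the first n bytes. */
static int fake_ioctl(void *out, size_t out_size, size_t n) {
  assert(out != NULL);
  memset(out, 0xAB, n < out_size ? n : out_size);
  return 0;
}

static int read_last_raw_byte(size_t device_bytes, uint8_t *result) {
  /* Zero the whole struct before the call so any member the call does
   * not write is well-defined. */
  struct smart_log_sketch log = {0};
  if (fake_ioctl(&log, sizeof(log), device_bytes) != 0)
    return -1;
  *result = log.raw[5];
  return 0;
}
```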
Florian Forster [Sat, 25 Nov 2023 12:51:57 +0000 (13:51 +0100)]
Netlink plugin: completely initialize structs used for testing.
Valgrind complains about a conditional jump based on uninitialized
memory:
```
==66438== Conditional jump or move depends on uninitialised value(s)
==66438== at 0x10CA06: vf_info_submit (in /__w/collectd/collectd/test_plugin_netlink)
==66438== by 0x1110F2: test_vf_submit_test (in /__w/collectd/collectd/test_plugin_netlink)
==66438== by 0x112EAC: main (in /__w/collectd/collectd/test_plugin_netlink)
```
This is likely caused by the `vf_stats_t` being only partially
initialized. Using a struct initializer is not only cleaner, it also
ensures the remainder of the struct is initialized to zero.
Jim Klimov [Wed, 31 Aug 2022 13:32:46 +0000 (15:32 +0200)]
configure.ac: if neither UPSCONN{,_t} type was found, refuse to build NUT plugin
NOTE: src/nut.c also has pragmas to error out in this situation,
but that handling is compiler-dependent and happens too late in
the checkout/configure/build loop.
Presumably this inability to find the type in the earlier-found header file
is also triggered by build environment "inconsistencies" like lack of basic
types in the libc implementation (maybe highlighting the need for additional
headers or macros for the platform).
Jim Klimov [Wed, 31 Aug 2022 09:40:01 +0000 (11:40 +0200)]
configure.ac, src/nut.c: detect int types required by NUT API we build against
Either use the stricter int types required by NUT headers since the v2.8.0
release, or the relaxed (arch-dependent) types required by older NUT releases,
depending on which NUT API version collectd is being built against.
Inspired by discussion at https://github.com/networkupstools/nut/issues/1638
Florian Forster [Tue, 28 Nov 2023 13:42:54 +0000 (14:42 +0100)]
cpu plugin: Fix potential buffer overflow.
```
In function 'cpu_commit_without_aggregation',
inlined from 'cpu_commit' at src/cpu.c:563:5,
inlined from 'cpu_read' at src/cpu.c:925:3:
src/cpu.c:534:50: note: directive argument in the range [0, 18446744073709551614]
534 | snprintf(cpu_num_str, sizeof(cpu_num_str), "%zu", cpu_num);
| ^~~~~
src/cpu.c:534:7: note: 'snprintf' output between 2 and 21 bytes into a destination of size 16
534 | snprintf(cpu_num_str, sizeof(cpu_num_str), "%zu", cpu_num);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
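A sketch of one way to remove the overflow, assuming the fix is to size the buffer for the worst case. A 64-bit `size_t` printed with `%zu` needs at most 20 digits plus the terminating NUL, which is the 21 bytes the compiler note reports. The code also checks the return value of `snprintf()` for truncation. `format_cpu_num()` and `CPU_NUM_STR_LEN` are hypothetical names, and the actual commit may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 20 digits for a 64-bit size_t plus the terminating NUL. */
#define CPU_NUM_STR_LEN 21

/* Returns 0 on success, -1 if the output would have been truncated. */
static int format_cpu_num(char *buf, size_t buf_len, size_t cpu_num) {
  assert(buf != NULL && buf_len > 0);
  int n = snprintf(buf, buf_len, "%zu", cpu_num);
  /* snprintf() returns the length it wanted to write, excluding the NUL. */
  if (n < 0 || (size_t)n >= buf_len)
    return -1;
  return 0;
}
```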
Florian Forster [Fri, 24 Nov 2023 07:28:06 +0000 (08:28 +0100)]
curl_stats: fix compatibility with new versions of cURL.
Use integer based keys for metrics if available.
cURL ≥ 7.55.0 provides additional keys that allow getting certain
metrics as integers rather than doubles, e.g. content length. In some
newer versions of cURL, the original keys (using doubles) are marked as
deprecated.
ChangeLog: cURL, cURL-JSON, cURL-XML, Write HTTP plugins: fix compatibility with new versions of cURL.
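The compile-time selection could look roughly like the guard in the comment below. libcurl encodes its version in `LIBCURL_VERSION_NUM` as `0xXXYYZZ`, so 7.55.0 is `0x073700`. To keep the sketch runnable without libcurl, the code only models that comparison. `SKETCH_CURL_VERSION` and `has_integer_getinfo_keys()` are hypothetical helpers, not part of cURL or collectd.

```c
#include <assert.h>

/* In plugin code the choice would typically be a preprocessor guard:
 *
 *   #if LIBCURL_VERSION_NUM >= 0x073700
 *     curl_off_t size;
 *     curl_easy_getinfo(curl, CURLINFO_SIZE_DOWNLOAD_T, &size);
 *   #else
 *     double size;
 *     curl_easy_getinfo(curl, CURLINFO_SIZE_DOWNLOAD, &size);
 *   #endif
 *
 * The helpers below only model the version comparison. */
#define SKETCH_CURL_VERSION(maj, min, pat) (((maj) << 16) | ((min) << 8) | (pat))

static int has_integer_getinfo_keys(unsigned long version_num) {
  return version_num >= SKETCH_CURL_VERSION(7UL, 55UL, 0UL);
}
```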
Florian Forster [Tue, 28 Nov 2023 12:56:27 +0000 (13:56 +0100)]
configure: disable all plugins not yet supporting collectd 6.
This should allow us (and users) to just run `./configure` without
further flags, lowering the barrier to entry. It also allows us to
remove these configure flags from the CI configuration.
Eero Tamminen [Fri, 24 Nov 2023 17:05:51 +0000 (19:05 +0200)]
gpu_sysman: rename "counter" output variant to more generic "base"
It also makes the option control the output of all base metric values, not
just counters. This allows disabling output of the values for:
- Memory usage
- Frequency
- Temperature
when one wants to see only their rates.
This will be useful with the new "LogMetrics" option in the next commit.
Also did a small optimization of the output variant checks (no need for
free() if they are moved earlier).
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
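A sketch of how output variants might be parsed into a bitmask, so that "base" controls all base values independently of "rate". The flag names, `parse_output_variants()`, and the `:` separator are assumptions for illustration. They do not describe gpu_sysman's actual option syntax or internals.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag values. */
enum {
  OUTPUT_BASE = 1 << 0, /* base values: counters, memory usage, frequency, temperature */
  OUTPUT_RATE = 1 << 1, /* rates derived from the base values */
};

/* Parses a ':'-separated list such as "base:rate" (separator assumed).
 * Returns 0 on success, -1 on an unknown name or an empty selection. */
static int parse_output_variants(const char *spec, unsigned int *out) {
  assert(spec != NULL && out != NULL);
  unsigned int flags = 0;
  const char *p = spec;
  while (*p != '\0') {
    const char *end = strchr(p, ':');
    size_t len = end != NULL ? (size_t)(end - p) : strlen(p);
    if (len == 4 && strncmp(p, "base", 4) == 0)
      flags |= OUTPUT_BASE;
    else if (len == 4 && strncmp(p, "rate", 4) == 0)
      flags |= OUTPUT_RATE;
    else
      return -1; /* unknown variant name */
    p = end != NULL ? end + 1 : p + len;
  }
  if (flags == 0)
    return -1; /* nothing would be reported */
  *out = flags;
  return 0;
}
```

With only "rate" selected, memory usage, frequency and temperature values are suppressed while their rates are still reported.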
- Move enabled/disabled metric reporting to a separate function
- Report metrics enabling and metric details enabling separately
- Error if all metrics are disabled, regardless of detail options
- Explicitly log what metrics are still being reported if any of
them were disabled at run-time
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Leonard Göhrs [Mon, 9 Jan 2023 13:43:44 +0000 (14:43 +0100)]
[collectd 6] exec: make PUTMETRIC work
The cmd_handle_putmetric function checks whether the command actually is
a PUTMETRIC command; at least, that is what it is supposed to check.
Prior to this fix it actually checked for PUTVAL and always printed a
-1 Unexpected command: `PUTMETRIC'.
error. While at it also remove the development printf that results in
Leonard Göhrs [Thu, 5 Jan 2023 09:11:23 +0000 (10:11 +0100)]
[collectd 6] exec: add PUTMETRIC command
Most existing setups using exec will use PUTVAL, which should just continue
to work with collectd 6 due to the plugin_dispatch_values compatibility
function.
New plugins should however use the new PUTMETRIC.
The respective command handler already exists. This commit just pipes it
through.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
Leonard Göhrs [Fri, 24 Mar 2023 13:03:43 +0000 (14:03 +0100)]
src/daemon/plugin.c: init fam.{time,interval} before uc_update(fam) on dispatch
The changes in commit
55efb56a ([collectd 6] src/daemon/plugin.c: Use one thread per write plugin)
changed the order in which the time and interval on a to-be-dispatched metric
family were set and uc_update() was called on the metric family.
The time and interval inside a metric family have to be set before calling
uc_update() on it, otherwise warnings like these will be generated en masse[1]:
uc_update: Value too old: … value time = 0.000; last cache update = 0.000;
Additionally, the "missing" handlers for the metric family will be called,
causing plugins like write_prometheus to drop the value so that it does not
show up when queried [2].
This change requires the introduction of a second metric_family_clone() in the
dispatch path:
    plugin_dispatch_metric_family()
     |  \
     |   Has to modify the passed fam to initialize the correct time and interval.
     |   So it has to metric_family_clone() the input.
     v
    plugin_dispatch_metric_internal()
     |  \
     |   Calls the filter chains and uc_update(fam), which both require correct
     |   times and intervals to be set.
     |
     |   plugin_dispatch_metric_internal() calls either fc_process_chain() or
     v <-- fc_default_action() depending on if a chain is configured.
     |\
     | fc_process_chain()
     |  |
     |  v
     | plugin_write()
     |    \
     |     A chain may call plugin_write() multiple times with the same fam.
     |     This means plugin_write() can not take "ownership" of the fam,
     |     put it in a queue and free it once it feels like it.
     |     It must instead create another clone to put into the queue.
     v
    fc_default_action()
     |
     v
    plugin_write()
       \
        This is the much more common case, as filter chains are a niche feature,
        and in this case we could actually transfer ownership of the fam to
        plugin_write and save a clone, but we would have to tell plugin_write()
        that it is responsible for freeing the passed fam.
The negative performance impact of the clone could be mitigated by adding a
reference count to the metric_family_t struct and only freeing it once all
references to it are gone. But that would be a larger change and not a bug fix.
Only fix the "uninitialized time and interval" bug for now.
Leonard Göhrs [Thu, 23 Mar 2023 12:12:27 +0000 (13:12 +0100)]
src/daemon/plugin.c: don't store references to stack allocated values
The changes in commit
55efb56a ([collectd 6] src/daemon/plugin.c: Use one thread per write plugin)
wrongly assume that the references passed in to plugin_register_write()
somehow outlive the spawned write thread.
While this is true for some plugins that pass statically allocated strings
and global user_data_t structs to plugin_register_write() it is not correct
for all plugins.
See [1] for an example of a plugin (write_http) misbehaving due to this.
Store owned versions of the passed values instead. For user_data this means
the content of the struct and for the name it means a strdup()ed version of
the string.
The behaviour of plugin_write with regards to fam being NULL has not changed
with the switch to one thread per write plugin, so the documentation should
not change as well.
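A sketch of "store owned versions" under simplified, hypothetical types: the registration keeps a heap copy of the name and a copy of the `user_data` struct contents rather than pointers into the caller's stack. `user_data_sketch_t`, `write_plugin_sketch_t`, `dup_string()` and `register_write_sketch()` are illustrative names, not collectd's definitions. `dup_string()` is written out to avoid depending on `strdup()` being declared.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-ins. */
typedef struct {
  void *data;
  void (*free_func)(void *);
} user_data_sketch_t;

typedef struct {
  char *name;            /* owned copy of the name */
  user_data_sketch_t ud; /* copy of the struct contents, not a pointer */
} write_plugin_sketch_t;

/* Equivalent to POSIX strdup(). */
static char *dup_string(const char *s) {
  size_t len = strlen(s) + 1;
  char *copy = malloc(len);
  if (copy != NULL)
    memcpy(copy, s, len);
  return copy;
}

static int register_write_sketch(write_plugin_sketch_t *wp, const char *name,
                                 const user_data_sketch_t *ud) {
  assert(wp != NULL && name != NULL);
  char *name_copy = dup_string(name);
  if (name_copy == NULL)
    return -1;
  wp->name = name_copy;
  if (ud != NULL)
    wp->ud = *ud;
  else
    wp->ud = (user_data_sketch_t){.data = NULL, .free_func = NULL};
  return 0;
}
```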
Leonard Göhrs [Fri, 28 Oct 2022 08:02:01 +0000 (10:02 +0200)]
[collectd 6] src/collectd.conf.pod polish the WriteQueueLimitHigh/Low docs
The behaviour of LimitHigh/LimitLow has changed with the switch to one
thread per write plugin. This warrants a rewrite of the respective
documentation. Thanks to @eero-t for the suggestion.
Leonard Göhrs [Mon, 15 Aug 2022 06:43:51 +0000 (08:43 +0200)]
[collectd 6] src/daemon/plugin.c: restore previous position of plugin_write
The new queue design resulted in plugin_write being based on
enqueue_metric_family, which resulted in the functions moving around in the
file. This made the diff harder to read. Restore old position.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
Leonard Göhrs [Tue, 19 Jul 2022 09:20:09 +0000 (11:20 +0200)]
[collectd 6] src/daemon/plugin.c: Use one thread per write plugin
ChangeLog: collectd: Use one write thread per write plugin
The previous write thread design used a single queue with a single read head
from which one of the write threads would de-queue an element and would then
sequentially call each registered write callback.
This meant that all write plugins would have to cooperate in order to not drop
values. If for example all write threads are stalled by the same write plugin's
callback function not returning, the queue will start to fill up until elements
start to be dropped, even though there are other plugins that could still make
progress. In addition to that, all write callbacks have to be designed to be
reentrant right now, which increases complexity.
This new design uses a single linked-list write queue with one read head per
output plugin. Each output plugin is serviced in a dedicated write thread.
Elements are freed based on a reference count, which is shown in the ASCII-Art
below:
          +- Thread #1 Head       +- Thread #2 Head       +- Tail
          v                       v                       v
    +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+
    | 0|->| 1|->| 1|->| 1|->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X
    +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+
    ^
    +- to be free()d
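The reference-counted queue in the diagram can be modelled with a single-threaded sketch. Each element starts with a count equal to the number of read heads, a head decrements it when it moves past, and elements at the front whose count reaches zero are freed. All names (`elem_t`, `queue_t`, `queue_enqueue()`, `queue_dequeue()`) are hypothetical. Locking, condition variables, and the LimitHigh/LimitLow dropping are deliberately omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, single-threaded model: one linked list, one read head per
 * write plugin. */
typedef struct elem_s {
  int value;        /* stand-in for a metric family */
  size_t ref_count; /* heads that have not consumed this element yet */
  struct elem_s *next;
} elem_t;

typedef struct {
  elem_t *oldest; /* first element that has not been freed */
  elem_t *tail;   /* most recently enqueued element */
  elem_t **heads; /* heads[i]: next element for plugin i, NULL if caught up */
  size_t num_heads;
} queue_t;

static int queue_enqueue(queue_t *q, int value) {
  if (q->num_heads == 0)
    return 0; /* no write plugins, nothing to keep */
  elem_t *e = calloc(1, sizeof(*e));
  if (e == NULL)
    return -1;
  e->value = value;
  e->ref_count = q->num_heads;
  if (q->tail != NULL)
    q->tail->next = e;
  else
    q->oldest = e;
  q->tail = e;
  for (size_t i = 0; i < q->num_heads; i++)
    if (q->heads[i] == NULL) /* caught-up heads start at the new element */
      q->heads[i] = e;
  return 0;
}

/* Heads only move forward, so unreferenced elements are always at the front. */
static void queue_collect(queue_t *q) {
  while (q->oldest != NULL && q->oldest->ref_count == 0) {
    elem_t *e = q->oldest;
    q->oldest = e->next;
    if (q->tail == e)
      q->tail = NULL;
    free(e);
  }
}

static int queue_dequeue(queue_t *q, size_t head, int *out) {
  assert(head < q->num_heads);
  elem_t *e = q->heads[head];
  if (e == NULL)
    return -1; /* nothing new for this plugin */
  *out = e->value;
  q->heads[head] = e->next;
  e->ref_count--;
  queue_collect(q);
  return 0;
}
```

A slow plugin only holds back memory reclamation. It does not block the other heads, which can keep consuming newer elements.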
The changes introduced by this commit have some side-effects:
- The WriteThreads config option no longer exists, as a strict 1:1 ratio of
write plugins and write threads is used.
- The data flow has changed. The previous data flow was:
    (From one of the ReadThreads)
    plugin_dispatch_{values,multivalue}()
      plugin_dispatch_metric_family()
        enqueue_metric_family()
          write_queue_enqueue() -----{Queue}----+
                                                |
    (In one of the WriteThreads threads)        |
    plugin_write_thread()                       |
      ^- plugin_write_dequeue()  <--------------+
      plugin_dispatch_metric_internal()
        ^- fc_process_chain(pre_cache_chain)
        fc_process_chain(post_cache_chain)
          fc_bit_write_invoke()
            plugin_write(NULL) / plugin_write(plugin_name)
              plugin callback()
The data flow now is:
    (From one of the ReadThreads)
    plugin_dispatch_{values,multivalue}()
      plugin_dispatch_metric_family()
        plugin_dispatch_metric_internal()
          ^- fc_process_chain(pre_cache_chain)
          fc_process_chain(post_cache_chain)
            fc_bit_write_invoke()
              plugin_write(NULL) / plugin_write(plugin_name)
                write_queue_enqueue() -----{Queue}----+
                                                      |
    (In one of the WriteThreads threads)              |
    plugin_write_thread()  <--------------------------+
      plugin callback()
One result of this change is that plugin_write() no longer runs the plugin
callback immediately in the calling thread; instead, it always enqueues the
value, which is then de-queued in the dedicated thread.
- The behaviour of the WriteQueueLimitHigh and WriteQueueLimitLow options has
changed. The queue will be capped to a length of LimitHigh by dropping
random queue elements between the queue end and LimitLow.
Setting LimitLow to a reasonably large value ensures that fast write plugins
do not lose values, even in the vicinity of a slow plugin.
The diagram below shows the random element selected for removal (###) in
Step 1 and the queue with the element removed in Step 2.
Leonard Göhrs [Tue, 27 Sep 2022 06:03:14 +0000 (08:03 +0200)]
mmc: cache open file descriptors to block devices
Udev rules can contain a "watch" option, which is described in the man page as:
Watch the device node with inotify; when the node is closed after being
opened for writing, a change uevent is synthesized.
This watch option is enabled by default for all block devices[1].
The intention behind this is to be notified about changes to the partition
table. The mmc plugin does however also need to open the block device for
writing, even though it never modifies its content, in order to be able to
issue ioctls with vendor defined MMC-commands.
Reduce the number of generated change events from one per read to one per
collectd runtime by caching the open file descriptor.
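A minimal sketch of the caching idea, assuming a single cached device. The block device is opened read-write once and the descriptor is reused on every read, so the close-after-write-open that makes udev synthesize a change uevent only happens at shutdown. `dev_fd_cache_t`, `dev_fd_get()` and `dev_fd_close()` are hypothetical and do not reflect the plugin's real bookkeeping.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical single-device cache. */
typedef struct {
  char path[256];
  int fd; /* -1 while nothing is open */
} dev_fd_cache_t;

/* Returns a read-write fd for path, opening it only on the first call. */
static int dev_fd_get(dev_fd_cache_t *c, const char *path) {
  assert(c != NULL && path != NULL);
  if (c->fd >= 0 && strcmp(c->path, path) == 0)
    return c->fd; /* reuse: no open/close, so no synthesized uevent */
  size_t len = strlen(path);
  if (len >= sizeof(c->path))
    return -1;
  int fd = open(path, O_RDWR);
  if (fd < 0)
    return -1;
  if (c->fd >= 0)
    close(c->fd);
  memcpy(c->path, path, len + 1);
  c->fd = fd;
  return fd;
}

/* Called once at shutdown instead of after every read. */
static void dev_fd_close(dev_fd_cache_t *c) {
  if (c->fd >= 0)
    close(c->fd);
  c->fd = -1;
}
```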
Leonard Göhrs [Fri, 3 Jun 2022 13:31:54 +0000 (15:31 +0200)]
mmc: add more vendor specific and generic data sources (#4006)
* mmc plugin: integrate into configure.ac
The mmc plugin is not fully integrated in the configure.ac.
Change that.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: Skip mmc paths in /sys that start with a '.' (like "." and "..")
The plugin tries to (and obviously fails to) use "." and "..", which come out of
listdir, as mmc devices.
Filter these two out by skipping hidden files/directories.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: read standard eMMC 5.0 health metrics
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: remove type-name defines
These defines can become confusing, especially when combined with the defines
for attribute names in the sysfs. This will only get worse when more
vendor-specific metrics are supported.
Remove the defines and use the type names directly.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: remove sysfs-attribute defines
These defines are used only once or twice and do not help with readability.
Replace them with just the raw strings.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: port to libudev
While using the sysfs directly works fine for the swissbit and generic eMMC
driver it does not scale well to other vendor-specific interfaces where one has
to open the block device in /dev to perform ioctls.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: add micron eMMC support
While this patch was only tested with a single product (MTFC16GAPALBH) I am
fairly confident that it will generalize to others as well, as micron
themselves ship a single tool[1], which this patch uses as a reference, to read
similar info from all of their eMMCs.
This patch also increases the maximum value of mmc_bad_blocks to infinity,
as it can be any 16 bit integer for micron eMMC but could be even larger for
other vendors.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
* mmc plugin: add sandisk eMMC support
While this patch was only tested with a single product (SDINBDG4-8G), I am
fairly confident that it should generalize to other devices as well,
as the current product portfolio on their website looks very similar to the one
I tested and new devices will likely use a Western Digital manufacturer ID.
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
[collectd 6] disk: cherry-pick "In macOS 12, `IOMasterPort` is deprecated in favor of `IOMainPort`"
git cherry-pick 2711ebe7d671c ("In macOS 12, `IOMasterPort` is deprecated in favor of `IOMainPort`")
from the collectd 5 branch on top of the collectd 6 version of the plugin.
Original commit message:
In macOS 12, `IOMasterPort` is deprecated in favor of `IOMainPort`
```
src/battery.c:250:7: error: 'kIOMasterPortDefault' is deprecated: first deprecated in macOS 12.0 [-Werror,-Wdeprecated-declarations]
kIOMasterPortDefault, IOServiceNameMatching("battery"), &iterator);
^~~~~~~~~~~~~~~~~~~~
kIOMainPortDefault
```
Emma Foley [Tue, 15 Feb 2022 07:46:21 +0000 (07:46 +0000)]
Fix CI failures caused by unsupported distros and updates to dependencies (#3975)
* [ci][gha] Replace Trusty with Bionic and Focal
Ubuntu 14.04 (Trusty) is out of standard support [1].
``make check`` fails for test_capabilities, as noted in [2].
[3] indicates that the cause is glibc, but that no updates to the version
in Trusty are expected.
This PR replaces trusty with Ubuntu 18.04 (Bionic) and 20.04 (Focal).
Eero Tamminen [Thu, 17 Nov 2022 20:19:07 +0000 (22:19 +0200)]
gpu_sysman: initialize struct .pNext members before use
The next Sysman spec will explicitly state that they need to be initialized:
https://github.com/oneapi-src/level-zero-spec/commit/98dfaaf041dedfd8c9bcf9a3957f334836e859e4
And latest Sysman backend versions corrupt memory / crash unless .pNext
values in some of the structs given to Get functions are initialized.
(Releases before fall 2022 did not use .pNext values in get* calls,
and worked fine. It just took a long time until I was able to verify
whether this was a regression that will be fixed, or intended change.)
Additionally, validate in test code that .pNext values are set to NULL
(because some structs lack those pointer members, ADD_METRIC() macro
cannot do that check for the <statename> functions given for it, but
otherwise everything is covered).
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
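A sketch of both halves of the change under hypothetical types. The caller initializes `.pNext` to NULL with a designated initializer, and a test mock rejects structs where it is not NULL. `mem_state_sketch_t`, `mock_mem_get_state()`, `query_free_bytes()` and the `stype` value are stand-ins, not Level Zero or gpu_sysman definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Sysman state struct, which carries a
 * structure type tag and a pNext extension pointer. */
typedef struct {
  int stype;
  void *pNext;
  uint64_t free_bytes;
} mem_state_sketch_t;

/* Hypothetical test mock: reject calls where pNext was not set to NULL. */
static int mock_mem_get_state(mem_state_sketch_t *state) {
  assert(state != NULL);
  if (state->pNext != NULL)
    return -1;
  state->free_bytes = 1024;
  return 0;
}

static int query_free_bytes(uint64_t *out) {
  /* pNext explicitly NULL; unnamed members are zero-initialized. */
  mem_state_sketch_t state = {.stype = 1 /* hypothetical tag */, .pNext = NULL};
  if (mock_mem_get_state(&state) != 0)
    return -1;
  *out = state.free_bytes;
  return 0;
}
```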
Eero Tamminen [Fri, 18 Nov 2022 12:26:06 +0000 (14:26 +0200)]
gpu_sysman: improve power limit handling
Limits can be reported to only a subset of power domains. Therefore
querying limits (for given GPU) should be disabled only when querying
fails for all domains.
Also added a TODO for an upcoming spec change I noticed in the spec tracker.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
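The rule can be reduced to a small predicate: disable power-limit queries for a GPU only if the query failed for every power domain. `should_disable_limit_queries()` and the status encoding (0 for success) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* results[i]: status of the limit query for power domain i,
 * 0 on success, non-zero on failure (hypothetical encoding). */
static bool should_disable_limit_queries(const int *results, size_t num_domains) {
  assert(results != NULL || num_domains == 0);
  if (num_domains == 0)
    return false; /* nothing was queried */
  for (size_t i = 0; i < num_domains; i++) {
    if (results[i] == 0)
      return false; /* at least one domain reports limits */
  }
  return true;
}
```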
Eero Tamminen [Tue, 8 Nov 2022 16:34:31 +0000 (18:34 +0200)]
gpu_sysman: do metric reset on every loop round
Not doing metric reset between loop rounds could result in extra
incorrect metric label being reported for a metric, when earlier
metric in the loop had a conditional label, but latter metric does not
satisfy that condition (Sysman call for the info failed, but fail is
ignored, or Sysman struct value used for given label is not set).
This can happen e.g. with the conditional memory "health", frequency
"throttled_by" and power "limit" labels.
The alternative would be either setting or removing (= using NULL)
values for each of the possible labels on every round. Just resetting
metric labels on every round seemed more robust (easier to review),
and allowed simplifying the code slightly.
Looking at the collectd metric implementation, it does cause more allocs /
deallocs for the label array & label names, though.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
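A reduced model of the stale-label problem, using a hypothetical fixed-size label set rather than collectd's `metric_t`. Without a reset, a conditional label set in one round (here "health") is still attached in the next round, when it should be absent. Resetting at the start of every round removes it. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, fixed-size label set. */
#define MAX_LABELS 4
typedef struct {
  const char *names[MAX_LABELS];
  const char *values[MAX_LABELS];
  size_t num;
} labels_sketch_t;

static void labels_reset(labels_sketch_t *l) { *l = (labels_sketch_t){0}; }

static void labels_set(labels_sketch_t *l, const char *name, const char *value) {
  for (size_t i = 0; i < l->num; i++) {
    if (strcmp(l->names[i], name) == 0) {
      l->values[i] = value;
      return;
    }
  }
  assert(l->num < MAX_LABELS);
  l->names[l->num] = name;
  l->values[l->num] = value;
  l->num++;
}

/* One loop round per memory module; "health" is only known for some.
 * Returns the label count of the metric built in the last round. */
static size_t label_count_of_last(const char *const *health, size_t n,
                                  int reset_each_round) {
  labels_sketch_t l = {0};
  for (size_t i = 0; i < n; i++) {
    if (reset_each_round)
      labels_reset(&l);
    labels_set(&l, "module", "mem");
    if (health[i] != NULL)
      labels_set(&l, "health", health[i]);
  }
  return l.num;
}
```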
Eero Tamminen [Tue, 8 Nov 2022 17:06:54 +0000 (19:06 +0200)]
gpu_sysman: make freq & mem handling more consistent
Readability/consistency improvement: change frequency and memory
metric handling to use new "reported" boolean instead of cache index,
for checking when metrics need to be submitted. This is more
consistent with how other metric functions handle that.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>