Eero Tamminen [Mon, 24 Oct 2022 14:25:23 +0000 (17:25 +0300)]
gpu_sysman: Minor improvements to test code
Decrease max value and increase how many decimals are shown for metric
values, so that the tests' verbose logging also shows useful values for
ratios (which are in the 0-1 range).
The rest of the changes improve 'gpu_sysman.c' test coverage by 1%.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Eero Tamminen [Fri, 28 Jan 2022 16:06:50 +0000 (18:06 +0200)]
gpu_sysman: Add "pci_dev" label
On a large cluster with different types of GPUs, it helps to know which
card is of which type, not just their metrics. The "pci_dev" label adds
the PCI device ID to the device metrics.
Because GPUs within each cluster node are normally supposed to be
identical i.e. differ only between nodes, and additional labels
increase processing load, this is enabled only with the GpuInfo
setting.
Getting additional strings out of gpu_info() function required
refactoring. GPU index in errors is now output only by gpu_scan(),
and gpu_info() gets pointers to label string pointers instead.
Eero Tamminen [Thu, 8 Sep 2022 17:18:59 +0000 (20:18 +0300)]
gpu_sysman: Add ratio variant for power metric type
Needs new internal disable flag because power limit requires new
Sysman call which can fail separately from others (or reported limits
could be disabled). Because it's not called on first round, that
needs some changes to test checks too.
With all 3 metric variants being supported, variants check can be
removed.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Eero Tamminen [Wed, 8 Jun 2022 15:23:34 +0000 (18:23 +0300)]
[collectd 6] Fix some gcc warnings with more strict checks (#3970)
* Remove unused dummy meta data functions
"format_stackdriver" was migrated over year ago.
* Fix rest of the GCC warnings from collectd core
* Properly initialize complex struct
* Fix signed vs unsigned comparisons
* Tell compiler which args are expected to be unused
Based on "-O3 -Werror -Wall -Wextra -Wformat-security" output.
* write_prometheus: fix static analysis warnings and comments
* Fix unused arguments reported by:
"-O3 -Werror -Wall -Wextra -Wformat-security"
* Fix obsolete comment to match MHD docs:
https://www.gnu.org/software/libmicrohttpd/ ("Queueing responses" section)
https://git.gnunet.org/libmicrohttpd.git/tree/src/include/microhttpd.h#n2398
* Fix use after free reported by Klocwork, "prom_fam" cannot be used
after it's been freed
* Fix signedness mismatch GCC warnings in few of the plugins
Based on "-O3 -Werror -Wall -Wextra -Wformat-security" output.
* Remove unused function arguments from few plugins
Based on "-O3 -Werror -Wall -Wextra -Wformat-security" output.
* Attribute unused functions arguments as such in few of the plugins
Based on "-O3 -Werror -Wall -Wextra -Wformat-security" output.
* turbostat: Satisfy clang-format CI check
Apparently CI has changed since this code was added to collectd.
CURLOPT_POSTFIELDSIZE allows specifying the data size, which is known in
advance and equal to cb->send_buffer_fill. When CURLOPT_POSTFIELDSIZE is not
set (or is set to -1), curl determines the data size using strlen(),
which has O(N) complexity, so setting it saves a few CPU cycles here.
Signed-off-by: Matwey V. Kornilov <matwey.kornilov@gmail.com>
* write_influxdb_udp: Split formatting functions to format_influxdb
Signed-off-by: Matwey V. Kornilov <matwey.kornilov@gmail.com>
* write_http: Add influxdb format
Signed-off-by: Matwey V. Kornilov <matwey.kornilov@gmail.com>
* write_http: Enable using unix socket in libcurl
Signed-off-by: Matwey V. Kornilov <matwey.kornilov@gmail.com>
Co-authored-by: Matthias Runge <mrunge@redhat.com>
Eero Tamminen [Tue, 7 Jun 2022 17:55:14 +0000 (20:55 +0300)]
[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics (#3968)
* Add 'gpu_sysman' plugin for (Intel) GPU metrics
Metrics data is provided by OneAPI Level Zero Sysman API.
* Add unit-testing for 'gpu_sysman' plugin
See comment at start of src/gpu_sysman_test.c for details.
* Integrate 'gpu_sysman' plugin and its unit-testing to collectd build
* Add 'gpu_sysman' plugin configuration and documentation
* gpu_sysman: use sizeof(*var) rather than sizeof(vartype) in var=calloc(...)
Except for gpu_subarray_alloc(), all allocs are done with calloc().
This way correctness of all of them is easy to check just by grepping
for calloc (especially now that clang-format does not wrap those lines
any more), and reviewing gpu_subarray_alloc().
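A minimal sketch of the sizeof(*var) idiom described above. The struct
and function names are hypothetical and are not taken from gpu_sysman.c:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical struct for illustration only; not from gpu_sysman.c. */
typedef struct {
  uint64_t prev_time;
  double value;
} sample_t;

/* Allocates 'count' zero-initialized samples. sizeof(*samples) follows the
 * pointer's own type, so the size stays correct even if the declared type
 * of 'samples' is changed later. Returns NULL on allocation failure. */
static sample_t *samples_alloc(size_t count) {
  sample_t *samples = calloc(count, sizeof(*samples));
  return samples;
}
```

With this form, a reviewer grepping for calloc only needs to check that
the variable on the left matches the one inside sizeof().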
* gpu_sysman: minimal v6 API support + add units to metric names
Prometheus & OpenMetrics require metric names to be suffixed by the
metric unit, and ratios (0-1) to be used instead of percentages
(0-100).
* gpu_sysman: update test code for minimal v6 API support + new metric names
There's now also support for multiple metrics per family although they
are not used yet. "sstrncpy" is not needed any more.
* gpu_sysman: split metric properties from their names to separate labels
Following labels are used:
- sub_dev: subdevice ID (unsigned integer)
- location: e.g. "gpu" / "memory"
- type: e.g. "request" / "actual"
- direction: "read" / "write"
Additionally:
* Two location label values were fixed
* GPU engine indices are now per engine type
(instead of single index being used for all types)
* All metric family and label names have been changed to use
underscores instead of dashes to separate words, as required by
Prometheus i.e. collectd does not need to convert them any more:
https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
* gpu_sysman: update test code to handle metrics split with labels
NOTE: providing NULL as label value to delete it is NOT supported.
Test code will assert on labels with NULL values.
* gpu_sysman: remove "GPU-" prefix from name and add it as "pci_bdf" label
Also rename GPU struct "name" member to more explicit "pci_bdf".
This allowed simplifying the code slightly.
Sysman API nowadays also supports devices other than GPUs, so the prefix
is removed to simplify code and to be more future-proof:
https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t
(Plugin will still query only GPU devices from Sysman though.)
* gpu_sysman: fix test code for "pci_bdf" added to metrics family
- do not add "pci_bdf" to metric name for matching
- fix for adding metric labels to family copies of them
* gpu_sysman: improvements to reported metrics
* Fix memory "type" label overwrite
* Replace "free" memory metric with "memory_usage_ratio" one,
and rename "memory_bytes" to "memory_used_bytes" metric
* Split metric value aggregate function name to a separate
"function" label
* Keep metric family declarations always in the same place in code
* Avoid both setting metric labels, and reporting empty metrics,
when higher internal sampling rate is used or there are L0
errors
* gpu_sysman: update tests for sysman plugin changes
* Add "memory_usage_ratio" checks
* Update validation for metrics that can be sampled at higher
rate i.e. have now the new aggregate function label
* With empty metrics avoided, dispatch mock-up can assert on them
* With extra L0 calls being skipped when not needed, number of calls
can differ between query rounds:
- refactor multi-sampling test to handle count changes
- change error handling checks to be done in single-sampled mode
* Debug output is needed to debug triggered multisample asserts,
so do that when assert would have been triggered, then abort
* gpu_sysman: add help information for all metric families
And document why const-qual cast is safe, and why GCC does
not warn about other assignments to .name & .help members.
* gpu_sysman: option to disable utilization metrics for single engines
More powerful GPUs can have a large number of engines of given type,
but user may be interested only in the utilization of the higher
level engine groups.
* gpu_sysman: option for specifying metrics output type
This can be used to specify whether output metric values will be
raw, derived or both.
This commit adds support just for the configuration option itself;
adding / changing metrics to use it happens in the next commit.
* gpu_sysman: optional raw metrics output for already supported metrics
This adds new counter type metrics for:
* memory bandwidth
* frequency throttle time
* engine execution time (activity)
* energy usage
Because collectd internally handles counters as integers, not all units
can be the ones recommended by Prometheus; the microseconds and
microjoules reported by Sysman are used instead.
* gpu_sysman: skip metrics with div-by-zero or time wrap around issues
Zero time intervals or max bandwidth would cause div-by-zero issues
and (very rare) time wrap around would cause bogus metric value.
Skip all of them.
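The skipping described above could look roughly like the following
sketch. The function name and parameters are hypothetical and the
plugin's actual checks may differ:

```c
#include <stdbool.h>
#include <stdint.h>

/* Computes a ratio from two counter samples and their timestamps.
 * Returns false, meaning the metric should be skipped, when the time
 * interval is zero (would divide by zero) or the timestamp went backwards
 * (wrap around, would give a bogus value). Hypothetical helper. */
static bool counter_ratio(uint64_t prev_count, uint64_t cur_count,
                          uint64_t prev_time, uint64_t cur_time,
                          double *ratio) {
  if (cur_time <= prev_time)
    return false;
  *ratio = (double)(cur_count - prev_count) / (double)(cur_time - prev_time);
  return true;
}
```

The caller leaves the metric out of the dispatch for that round whenever
the function returns false, instead of reporting a made-up value.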
* gpu_sysman: fix test code -Wpedantic + -Wcast-qual warnings
* gpu_sysman: add 'sub_dev' and 'type' labels only when needed
An empty label is equivalent to a missing one, and Prometheus queries can
check for non-existence of a label, so let's just skip empty / unneeded ones.
Main difference to earlier is that LevelZero error categories that
provide non-zero values only for uncorrectable type (according to
spec), are now without a type label. Correctable i.e. zero metrics for
those categories were skipped already earlier.
* Add "dev_file" label support
And contrib/format.sh include re-order.
"dev_file" support is behind a define (enabled by default) because it
needs functions that are only part of POSIX, not C99.
Intel Kubernetes GPU plugin uses primary GPU node device file names
(card0, card1...) as its GPU identifiers. This new label helps in
mapping Kubernetes custom metrics to them.
* Move test defines from Sysman plugin to its test code
And document with what GCC warning options the code is tested / passes.
* Change strcpy() in Sysman plugin to sstrncpy()
While for plugin that change does not really help (as target buffer is
always larger than source), for test code it is useful. And it shuts
up less capable static checking tools than GCC.
As test code cannot use existing collectd functionality for this (test
code needs modified versions of some collectd functions, and not all
collectd code passes the GCC warnings I use), sstrncpy() is copied
to test code.
For test code there's also a fix to size given for snprintf(), and
removal of redundant string termination for modified plugin_log() copy
(vsnprintf() already terminates string).
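For readers unfamiliar with the pattern, a bounded copy that always
terminates the destination can be sketched as below. This is an
illustration under a hypothetical name, not collectd's sstrncpy() itself:

```c
#include <string.h>

/* Copies at most n-1 bytes of 'src' into 'dest' and always terminates it.
 * Plain strncpy() leaves 'dest' unterminated when 'src' has n or more
 * bytes, which is what this wrapper guards against. Hypothetical name. */
static char *copy_terminated(char *dest, const char *src, size_t n) {
  if (n == 0)
    return dest; /* no room even for the terminator */
  strncpy(dest, src, n - 1);
  dest[n - 1] = '\0';
  return dest;
}
```

Passing sizeof() of the destination buffer as 'n' keeps the size and the
buffer tied together at the call site.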
* Pass clang-format check for gpu_sysman_test.c comments
* Add scalloc() wrapper similar to smalloc() to common utils
scalloc() wraps calloc() with exit on alloc failure,
similarly to what smalloc() does for malloc().
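A rough sketch of such a wrapper, assuming stderr as a stand-in for
collectd's own logging. The name is hypothetical so that it is not
mistaken for the code added to common utils:

```c
#include <stdio.h>
#include <stdlib.h>

/* calloc() wrapper that exits instead of returning NULL. Sketch only:
 * stderr stands in for the daemon's logging. Note that calloc() may also
 * return NULL for a zero-sized request, which this sketch treats as a
 * failure too. */
static void *scalloc_sketch(size_t nmemb, size_t size) {
  void *ptr = calloc(nmemb, size);
  if (ptr == NULL) {
    fprintf(stderr, "Not enough memory.\n");
    exit(EXIT_FAILURE);
  }
  return ptr;
}
```

Callers can then use the result directly, without a NULL check after
every allocation.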
* Replace Sysman plugin alloc+assert calls with smalloc/scalloc
If asserts were disabled, allocation failures would result in collectd
memory errors => replace alloc+assert in the plugin with collectd
smalloc/scalloc wrappers that exit after logging allocation error.
Downsides are that this does not invoke debugger (which could be in a
different control group with plenty of memory), nor tell where / what
allocation failed, like enabled assert would, so test code variants of
the wrappers still do asserts.
Emma Foley [Thu, 24 Feb 2022 07:25:47 +0000 (07:25 +0000)]
[ci][cirrus] Make Valgrind error on definite memory leaks only (#3977)
* [ci][cirrus] Replace trusty with bionic/focal in debian_default_toolchain
Ubuntu 14.04 (Trusty) is out of standard support [1].
``make check`` fails for test_capabilities, as noted in [2].
[3] indicates that the cause is glibc, but that updates are not expected
to the version in trusty.
This PR replaces trusty with Ubuntu 18.04 (Bionic) and 20.04 (Focal).
Matthias Runge [Tue, 15 Feb 2022 12:00:30 +0000 (13:00 +0100)]
[ci][cirrus] Replace trusty with bionic/focal in debian_default_toolchain (#3972)
Ubuntu 14.04 (Trusty) is out of standard support [1].
``make check`` fails for test_capabilities, as noted in [2].
[3] indicates that the cause is glibc, but that updates are not expected
to the version in trusty.
This PR replaces trusty with Ubuntu 18.04 (Bionic) and 20.04 (Focal).
excluderegex in the logparser plugin is optional and can therefore be NULL. If debug is enabled, the debug print then causes a CRITICAL error log such as: vfprintf %s NULL in "utils_match: match_create_callback: regex = %s, excluderegex = %s"
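One common way to avoid passing NULL to a "%s" conversion is sketched
below. The helper name and placeholder text are hypothetical, and this
is not necessarily the fix that was applied:

```c
#include <stddef.h>

/* Returns a printable stand-in when an optional string is unset, so that
 * NULL is never handed to a "%s" conversion (undefined behavior in C).
 * Hypothetical helper for illustration. */
static const char *str_or_placeholder(const char *s) {
  return (s != NULL) ? s : "(none)";
}
```

A debug print would then pass str_or_placeholder(excluderegex) instead of
excluderegex itself.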
However, REG_NOERROR is not defined by musl, and even GNU regex does not
mention REG_NOERROR, so just remove it to avoid the following build
failure:
src/netlink.c: In function 'check_ignorelist':
src/netlink.c:243:51: error: 'REG_NOERROR' undeclared (first use in this function); did you mean 'REG_NOTBOL'?
if (regexec(i->rdevice, dev, 0, NULL, 0) != REG_NOERROR)
^~~~~~~~~~~
REG_NOTBOL
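POSIX specifies that regexec() returns 0 on a successful match, so the
check can be written portably without the macro. A minimal sketch with a
hypothetical helper name, not the actual netlink.c change:

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when 'dev' matches the compiled pattern. regexec() returns
 * 0 on a match per POSIX, so no REG_NOERROR macro is needed. */
static bool device_matches(const regex_t *re, const char *dev) {
  return regexec(re, dev, 0, NULL, 0) == 0;
}
```

This compiles the same way against glibc and musl, because it relies only
on the return value documented by POSIX.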
```
src/snmp_agent.c:965:9: error: ‘strncat’ output may be truncated copying between 0 and 127 bytes from a string of length 127 [-Werror=stringop-truncation]
strncat(out, str, DATA_MAX_NAME_LEN - strlen(out) - 1);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```