From: Radosław Korzeniewski Date: Thu, 27 Jun 2019 14:17:46 +0000 (+0200) Subject: Add a Bacula statistics collection routine. X-Git-Tag: Release-9.6.0~203 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=1d4d3079cb71d6f3e75100d5d7fecba0bf2d2d0b;p=thirdparty%2Fbacula.git Add a Bacula statistics collection routine. It implements the internal, built-in runtime statistics collection. You can examine it manually with the 'statistics' command or save it regularly to external destinations, e.g. CSV files or the Graphite application. A detailed description of all the implemented functionality can be found in the README.Collector.txt file. --- diff --git a/.gitignore b/.gitignore index 2eb4af2d25..287efbf011 100644 --- a/.gitignore +++ b/.gitignore @@ -331,3 +331,6 @@ bacula/src/lib/sellist_test # win32 /bacula/src/win32/lib/bacula32.def /bacula/src/win32/lib/bacula64.def + +# OS and Editor/IDE files +.AppleDouble diff --git a/bacula/README.Collector.txt b/bacula/README.Collector.txt new file mode 100644 index 0000000000..047fd718c2 --- /dev/null +++ b/bacula/README.Collector.txt @@ -0,0 +1,497 @@ + +This is a brief description of the new Statistics collector functionality. + +0. Requirements and Assumptions + +>>>Eric Bollengier at 12/04/2018 +>>> +The main idea would be to generate statistics from our daemons (mainly the storage daemon, but the FD and Dir can be interesting too) and send this data to a Graphite daemon. + +Graphite is a bit like RRDTool; it is in charge of collecting the data and rendering the information. + +https://graphiteapp.org/ + +I think that the solution at the Bacula level should be able to use different "drivers" (a CSV file on disk with a configurable format for example, a native Graphite TCP connection, etc...). + +At the Bacula level, we probably need a new Resource in the configuration file to configure that (or only new directives, I don't know). Once you know what you need, Kern will review the names of the Resources and/or Directives.
+ +Each job should be able to send data to the Bacula collector (i.e. from multiple threads). Ideally, a job should not be blocked because the Graphite socket is hanging, for example. + +We need to define the interesting metrics that can be reported to the statistics collector, a few examples: +- nb of jobs at a given time +- device statistics (nb read, nb write, nb job...) +- total network throughput +- per job network throughput +- disk throughput +- current memory usage +- system information (cpu, load average, swap usage) +- number of files read for a job +- (basically what the status command reports) + +It might be interesting to let the user choose the metrics they want to see, and have a directive for it, similar to the message destination: + + Metrics = NbJob, NetworkThroughput, DiskThroughput, MemoryUsage + +(this is just an idea). + +We can start with a few basic statistics and enrich the program later. + +1. Statistics Collector Architecture + +The most important requirement is to never block a job thread because external communication stalled or introduced unexpected latency, which could negatively affect a running job. This requirement led to a strict separation of Collector functionality and behavior. + +The collector was designed as two separate entities: +- an internal collector class: bstatcollect +- an interface collector thread: COLLECTOR +The first functions as a metrics cache and the second is responsible for sending the collected metrics to an external data target (e.g. Graphite). + +1.1. Statistics Collection flow + +All Bacula metrics are collected in a push architecture, where the object code is responsible for preparing/generating a metric and "pushing" it to the internal collector for later use. This "push" operation must be extremely fast, which avoids unexpected latency in the job. + +Saving a metric in the internal collector is a two-step process: metric registration and metric update.
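The two-step register/update pattern described above can be sketched with a minimal, self-contained collector. All names here (MiniCollector, etc.) are illustrative only, not Bacula's actual API; the point is the O(n) name lookup at registration versus the O(1) indexed update under a single lock:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Illustrative sketch of the register-once / update-many pattern.
class MiniCollector {
   std::mutex mtx;
   std::vector<std::string> names;   // metric names; vector index == metric index
   std::vector<int64_t> values;      // metric values, same indexing
public:
   // O(n): linear scan, so re-registering an existing name
   // returns the same index.
   int registration_int64(const std::string &name, int64_t initial) {
      std::lock_guard<std::mutex> g(mtx);
      for (size_t i = 0; i < names.size(); i++) {
         if (names[i] == name) return (int)i;
      }
      names.push_back(name);
      values.push_back(initial);
      return (int)names.size() - 1;
   }
   // O(1): the only latency is the lock/unlock itself.
   void set_value_int64(int metric, int64_t v) {
      std::lock_guard<std::mutex> g(mtx);
      values[metric] = v;
   }
   int64_t get_int(int metric) {
      std::lock_guard<std::mutex> g(mtx);
      return values[metric];
   }
};
```

A job would call registration_int64() once at start and then only the cheap set_value_int64() in its hot path.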
Metric registration can take some time (relatively speaking), being O(n), and returns a metric index used in the metric update process, which is very fast at O(1). + +You should register metrics at the very beginning of the job or at daemon start, when additional latency is not a problem at all. The update process using the metric index from registration is very fast, so it can safely be used even in time-critical parts of the code, e.g.: + +metric_index = statcollector->registration_int64("bacula.jobs.all", + METRIC_UNIT_JOB, value, "The number of all jobs"); + +(...) + +while (true) + +(...) + + statcollector->set_value_int64(metric_index, newvalue); + +(...) + +statcollector->unregistration(metric_index); + +The only latency introduced by a metric update is the lock/unlock used to synchronize access to the internal collector. + +A metric should be unregistered when it is no longer needed. After unregistration the metric index becomes invalid and must not be used to address this metric. As the metric index is a regular integer number it will be reused by a subsequent registration of any new metric. + +You can get any or all metric values when required. The returned data is always a full copy of the metrics, so you can process it as you wish. + +1.2. Statistics Collector backend thread + +The collector background thread (COLLECTOR resource) is responsible for getting a copy of the current metrics list and saving it to the configured destination. The save/send process can be time-consuming, e.g. it can involve network communication, as for the Graphite collector. As the collector thread operates on a copy of the metrics list it does not affect standard job operations. The collector thread saves metrics at regular intervals. + Note: the current implementation provides two built-in backends: CSV file and Graphite. + +1.3.
Statistics Collector update thread + +There are two types of metrics to collect: easy to count/update and hard to count/update. The easy metrics are all statistics that correspond to an already available (in-memory) counter/variable, so we can update the metric every time the counter/variable is updated; we achieve perfect accuracy here without a problem. On the other hand, the hard metrics are all metrics that depend on external data (e.g. almost all permanent metrics based on catalog data), metrics that are not directly controllable by Bacula (e.g. the size of the heap), or metrics where a frequent update would have a huge performance impact (e.g. sm_pool_memory size). For this kind of metric we have developed a dedicated update mechanism. + +The main assumption here is that these metrics are not updated until necessary, so as long as nobody checks the value we do not update it. We accept that the real value of the metric (e.g. the number of error jobs) may change dozens of times in the meantime, but we want an exact value at the time of sampling, i.e. when saving/sending to an external backend. + +For this purpose we run a dedicated collector update thread which starts only when at least one collector backend thread is started. So, if no collectors are defined for the daemon, no update thread is needed. The collector update thread executes a function dedicated to each daemon, as every daemon has a different set of hard to count/update metrics. The collector update thread updates the required metrics as frequently as the minimal Interval parameter set across the defined collector resources. So, for two collector resources with Interval=5min and Interval=30sec, the update thread will use a 30-second interval.
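The interval selection just described can be sketched as follows; update_thread_interval is a hypothetical helper name, not Bacula's actual code:

```cpp
#include <vector>

// Sketch: the update thread wakes at the smallest Interval configured
// across all collector resources. A result of 0 means no collectors
// are defined, so no update thread is needed at all.
int update_thread_interval(const std::vector<int> &collector_intervals_secs) {
   int interval = 0;
   for (int i : collector_intervals_secs) {
      if (interval == 0 || i < interval) interval = i;
   }
   return interval;
}
```

For the example above, Interval=5min (300s) and Interval=30sec yield a 30-second update interval.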
Additionally, the "collect" command, which displays all available metrics at any time, executes the same update function as the update thread, to get up-to-date metrics. + +2. Architecture implementation and code + +2.1. The metrics class + +This is the basic building block for metrics collection: + +class bstatmetric : public SMARTALLOC { +public: + char *name; /* the metric name */ + metric_type_t type; /* the metric type */ + metric_unit_t unit; /* the metric unit */ + metric_value_t value; /* the metric value */ + char *description; /* the metric description */ + bstatmetric(); + bstatmetric(char *mname, metric_type_t mtype, metric_unit_t munit, + char *descr); + bstatmetric(char *mname, metric_unit_t munit, bool mvalue, char *descr); + bstatmetric(char *mname, metric_unit_t munit, int64_t mvalue, char *descr); + bstatmetric(char *mname, metric_unit_t munit, float mvalue, char *descr); + ~bstatmetric(); + bstatmetric& operator=(const bstatmetric& orig); + void render_metric_value(POOLMEM **buf, bool bstr=false); + void render_metric_value(POOL_MEM &buf, bool bstr=false); + const char *metric_type_str(); + const char *metric_unit_str(); + void dump(); +}; + +There are three (technically four) metric types in bstatmetric: +- METRIC_UNDEF - the bstatmetric is uninitialized +- METRIC_INT - the metric stores an integer value (int64_t) +- METRIC_BOOL - the metric stores a boolean value (True/False) +- METRIC_FLOAT - the metric stores a float value + +You can define a metric unit, which shows what entity the value represents, e.g. METRIC_UNIT_BYTE, METRIC_UNIT_BYTESEC, METRIC_UNIT_JOB, METRIC_UNIT_CLIENT, etc. When a value has no unit and is just a plain number you should use METRIC_UNIT_EMPTY. + +2.2.
The internal collector class + +The collection of metric objects is managed by the internal collector class: + +class bstatcollect : public SMARTALLOC { +public: + bstatcollect(); + /* registration returns a metric index */ + int registration(char *metric, metric_type_t type, metric_unit_t unit, + char *descr); + int registration_bool(char *metric, metric_unit_t unit, bool value, + char *descr); + int registration_int64(char *metric, metric_unit_t unit, int64_t value, + char *descr); + int registration_float(char *metric, metric_unit_t unit, float value, + char *descr); + /* unregistration */ + void unregistration(int metric); + /* update/set the metric value */ + int set_value_bool(int metric, bool value); + int set_value_int64(int metric, int64_t value); + int add_value_int64(int metric, int64_t value); + int add2_value_int64(int metric1, int64_t value1, int metric2, + int64_t value2); + int sub_value_int64(int metric, int64_t value); + int set_value_float(int metric, float value); + int inc_value_int64(int metric); + int dec_value_int64(int metric); + int dec_inc_values_int64(int metricd, int metrici); + /* get data */ + bool get_bool(int metric); + int64_t get_int(int metric); + float get_float(int metric); + alist *get_all(); + bstatmetric *get_metric(char *metric); + /* utility */ + void dump(); +}; + +You can register a metric of a particular type and unit with the bstatcollect::registration() method. In this case the value is initially set to zero. Using the other registration_*() methods you can set a different initial value. + +If you register a metric that already exists in the bstatcollect object you will always get the same metric index. If you unregister a metric and register it again later you may get a different metric index. + +Any metric can have a description string, which can be useful to users. The metric description can be set at first registration only.
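These registration semantics (same index for an existing metric, description kept from the first registration) can be sketched with a small self-contained registry; the names below are illustrative, not the real bstatcollect code:

```cpp
#include <string>
#include <vector>

struct Metric { std::string name, descr; };

// Illustrative sketch: a repeated registration returns the existing
// index and does NOT overwrite the stored description.
class Registry {
   std::vector<Metric> metrics;
public:
   int registration(const std::string &name, const std::string &descr) {
      for (size_t i = 0; i < metrics.size(); i++) {
         if (metrics[i].name == name) return (int)i;  // keep first description
      }
      metrics.push_back({name, descr});
      return (int)metrics.size() - 1;
   }
   const std::string &description(int idx) { return metrics[idx].descr; }
};
```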
Any subsequent registration of the metric does not update it. + +2.2.1. Updating the metric value + +Every metric value update should be performed as an atomic operation, so the internal collector class provides a number of such methods: +- set_value_*() - sets the metric value to the method argument "value"; the + old metric value is overwritten +- inc_value_int64()/dec_value_int64() - increments or decrements the metric + value accordingly, as a single atomic operation +- dec_inc_values_int64() - decrements the first metric value and increments + the second metric value as a single atomic operation; used to update related + metrics in a single step +- add_value_int64() - adds the numeric argument to the metric value as a + single atomic operation +- add2_value_int64() - adds the two numeric arguments to the two metric + values as a single atomic operation; used to update multiple metrics in a + single step + +The inc_value_int64()/dec_value_int64()/add_value_int64()/etc. methods should be used when managing a "shared" metric updated from different threads. + +2.3. Supporting utilities + +There are a few supporting utilities you can use when processing a metric or a metrics list: +- bstatmetric::render_metric_value() - renders a metric value as a string into + a buffer. +- bstatmetric::metric_type_str() - returns the metric type as a string. +- bstatmetric::metric_unit_str() - returns the metric unit as a string. +- free_metric_alist() - releases the memory for a list of metrics returned from + bstatcollect::get_all(). + +3. Statistics resource configuration + +The Statistics resource defines the attributes of a Statistics collector thread running on a daemon. You can define any number of Statistics resources, and every single Statistics resource spawns a single collector thread. This resource can be defined for any daemon (Dir, SD and FD). Resource directives: + +Name = The collector name used by the system administrator. This directive is required.
+ +Description = The text field contains a description of the Collector that will be displayed in the graphical user interface. This directive is optional. + +Interval = This directive instructs the Collector thread how long it should sleep between collection iterations. This directive is optional; when not specified, a default value of 300 seconds is used. + +Type = The Type directive specifies the Collector backend, which may be one of the following: CSV or Graphite. This directive is required. + +-> CSV is a simple file-level backend which saves all required metrics with the following format to the file: