git.ipfire.org Git - thirdparty/haproxy.git/log

MEDIUM: promex: switch to using stat_cols_px for front/back/server metrics

Now the stat_cols_px array contains all info that-prometheus requires
stop using the promex_st_metrics array that contains redundant infos.

As for ("MEDIUM: promex: switch to using stat_cols_info for global
metrics"), initial goal was to completely get rid of promex_st_metrics
array, but it turns out it is still required but only for the name
mapping part now. So in this commit we change it from complex structure
array (with redundant info) to a simple ist array with the
metric id:promex name mapping. If a metric name is not defined there, then
promex ignores it.

MINOR: promex: expose ST_I_INF_WARNINGS (AKA total_warnings) metric

It has been requested to have the ST_I_INF_WARNINGS metric available from
prometheus, let's define it in promex_global_metrics ist array so that
prometheus starts advertising it.

MEDIUM: promex: switch to using stat_cols_info for global metrics

Now the stat_cols_info array contains all info that prometheus requires,
stop using the promex_global_metrics array that contains redundant infos.

Initial goal was to completely drop the promex_global_metrics array.
However it was deemed no longer relevant as prometheus stats rely on a
custom name that cannot be derived from stat_cols_info[], unless we add
a specific ".promex_name" field or similar to name the stats for
prometheus. This is what was carried over on a first attempt but it proved
to burden stat_cols_info[] array (not only memory wise, it is quite
confusing to see promex in the main codebase, given that prometheus is
shipped as an optional add-on).

The new strategy consists in revamping the promex_global_metrics array
from promex_metric (with all redundant fields for metrics) to a simple
ID<==>IST mapping. If the metric is mapped, then it means promex addon
should advertise it (using the name provided in the mapping). Now for
all the metric retrieval, no longer rely on built-in hardcoded values
but instead leverage the new stat cols API.

The tricky part is the .type association because the general rule doesn't
apply for all metrics as it seems that we stated that some non-counters
oriented metrics (at least from haproxy point of view) had to be presented
as counter metrics. So in this patch we add some special treatment for
those metrics to emulate the old behavior. If that's not relevant in the
future, it may be removed. But this requires to ensure that promex users
will properly cope with that change. At least for now, no change of
behavior should be expected.

MINOR: stats: use stat_col storage stat_cols_info

Use stat_col storage for stat_cols_info[] array instead of name_desc.

As documented in 65624876f ("MINOR: stats: introduce a more expressive
stat definition method"), stat_col supersedes name_desc storage but
it remains backward compatible. Here we migrate to the new API to be
able to further extend stat_cols_info[] in following patches.

MINOR: stats: add .cap for some static metrics

Goal is to merge promex metrics definition into the main one.
Promex metrics will use the metric capability to know available scopes,
thus only metrics relevant for prometheus were updated.

MINOR: stats: STATS_PX_CAP___B_ macro

STATS_PX_CAP___B_ points to STATS_PX_CAP_BE, it is just an alias
for consistency, like STATS_PX_CAP____S which points to
STATS_PX_CAP_SRV.

MINOR: stats: add .generic explicit field in stat_col struct

Further extend logic implemented in 65624876 ("MINOR: stats: introduce a
more expressive stat definition method") and 4e9e8418 ("MINOR: stats:
prepare stats-file support for values other than FN_COUNTER"): we don't
rely anymore on the presence of the capability to know if the metric is
generic or not. This is because it prevents us from setting a capability
on static statistics. Yet it could be useful to set the capability even
on static metrics, thus we add a dedicated .generic bit to tell haproxy
that the metric is generic and can be handled automatically by the API.

Also, ME_NEW_* helpers are not explicitly associated to generic metric
definition (as it was already the case before) to avoid ambiguities.
It may change in the future as we may need to use the new definition
method to define static metrics (without the generic bit set). But for
now it isn't the case as this need definition was implemented for generic
metrics support in the first place. If we want to define static metrics
using the API, we could add a new set of helpers for instance.

BUG/MINOR: mux-h2: Reset streams with NO_ERROR code if full response was already sent

On frontend side, when a stream is shut while the response was already fully
sent, it was cancelled by sending a RST_STREAM(CANCEL) frame. However, it is
not accurrate. CANCEL error code must only be used if the response headers
were sent, but not the full response. As stated in the RFC 9113, when the
response was fully sent, to stop the request sending, a RST_STREAM with an
error code of NO_ERROR must be sent.

This patch should solve the issue #1219. It must be backported to all stable
versions.

MEDIUM: ssl/ckch: make the ckch_conf more generic

The ckch_store_load_files() function makes specific processing for
PARSE_TYPE_STR as if it was a type only used for paths.

This patch changes a little bit the way it's done,
PARSE_TYPE_STR is only meant to strdup() a string and stores the
resulting pointer in the ckch_conf structure.

Any processing regarding the path is now done in the callback.

Since the callbacks were basically doing the same thing, they were
transformed into the DECLARE_CKCH_CONF_LOAD() macros which allows to
do some templating of these functions.

The resulting ckch_conf_load_* functions will do the same as before,
except they will also do the path processing instead of letting
ckch_store_load_files() do it, which means we don't need the "base"
member anymore in the struct ckch_conf_kws.

MINOR: tools: path_base() concatenates a path with a base path

With the SSL configuration, crt-base, key-base are often used, these
keywords concatenates the base path with the path when the path does not
start by '/'.

This is done at several places in the code, so a function to do this
would be better to standardize the code.

BUG/MEDIUM: hlua/cli: fix cli applet UAF in hlua_applet_wakeup()

Recent commit e5e36ce09 ("BUG/MEDIUM: hlua/cli: Fix lua CLI commands
to work with applet's buffers") revealed a bug in hlua cli applet handling

Indeed, playing with Willy's lua tetris script on the cli, a segfault
would be encountered when forcefully closing the session by sending a
CTRL+C on the terminal.

In fact the crash was caused by a UAF: while the cli applet was already
freed, the lua task responsible for waking it up would still point to it.
Thus hlua_applet_wakeup() could be called even if the applet didn't exist
anymore.

To fix the issue, in hlua_cli_io_release_fct() we must also free the hlua
task linked to the applet, like we already do for
hlua_applet_tcp_release() and hlua_applet_http_release().

While this bug exists on stable versions (where it should be backported
too for precaution), it only seems to be triggered starting with 3.0.

MINOR: limits: fix check_if_maxsock_permitted description

Fix typo in check_if_maxsock_permitted() description.

BUG/MINOR: limits: compute_ideal_maxconn: don't cap remain if fd_hard_limit=0

'global.fd_hard_limit' stays uninitialized, if haproxy is started with -m
(global.rlimit_memmax). 'remain' is the MAX between soft and hard process fd
limits. It will be always bigger than 'global.fd_hard_limit' (0) in this case.

So, if we reassign 'remain' to the 'global.fd_hard_limit' unconditionally,
calculated then 'maxconn' will be even negative and the DEFAULT_MAXCONN (100)
will be set as the 'ideal_maxconn'.

During the 'global.maxconn' calculations in set_global_maxconn(), if the
provided 'global.rlimit_memmax' is quite big, system will refuse to calculate
based on its 'global.maxconn' and we will do a fallback to the 'ideal_maxconn',
which is 100.

Same problem for the configs with SSL frontends and backends.

This fixes the issue #2899.

This should be backported to v3.1.0.

MINOR: cli/server: don't take thread isolation to check for srv-removable

Thanks to the previous commits, we now know that "wait srv-removable"
does not require thread isolation, as long as 3372a2ea00 ("BUG/MEDIUM:
queues: Stricly respect maxconn for outgoing connections") and c880c32b16
("MINOR: stream: decrement srv->served after detaching from the list")
are present. Let's just get rid of thread_isolate() here, which can
consume a lot of CPU on highly threaded machines when removing many
servers at once.

CLEANUP: server: make it clear that srv_check_for_deletion() is thread-safe

This function was marked as requiring thread isolation because its code
was extracted from cli_parse_delete_server() and was running under
isolation. But upon closer inspection, and using atomic loads to check
a few counters, it is actually safe to run without isolation, so let's
reflect that in its description.

However, it remains true that cli_parse_delete_server() continues to call
it under isolation.

MINOR: server: simplify srv_has_streams()

Now that thanks to commit c880c32b16 ("MINOR: stream: decrement
srv->served after detaching from the list") we can trust srv->served,
let's use it and no longer loop on threads when checking if a server
still has streams attached to it. This will be much cheaper and will
result in keeping isolation for a shorter time in the "wait" command.

BUG/MINOR: hlua: fix optional timeout argument index for AppletTCP:receive()

Baptiste reported that using the new optional timeout argument introduced
in 19e48f2 ("MINOR: hlua: add an optional timeout to AppletTCP:receive()")
the following error would occur at some point:

runtime error: file.lua:lineno: bad argument #-2 to 'receive' (number
expected, got light userdata) from [C]: in method 'receive...

In fact this is caused by exp_date being retrieved using relative index -1
instead of absolute index 3. Indeed, while using relative index is fine
most of the time when we trust the stack, when combined with yielding the
top of the stack when resuming from yielding is not necessarily the same
as when the function was first called (ie: if some data was pushed to the
stack in the yieldable function itself). As such, it is safer to use
explicit index to access exp_date variable at position 3 on the stack.

It was confirmed that doing so addresses the issue.

No backport needed unless 19e48f2 is.

MINOR: stream: decrement srv->served after detaching from the list

In commit 3372a2ea00 ("BUG/MEDIUM: queues: Stricly respect maxconn for
outgoing connections"), it has been ensured that srv->served is held
as long as possible around the periods where a stream is attached to a
server. However, it's decremented early when entering sess_change_server,
and actually just before detaching from that server's list. While there
is theoretically nothing wrong with this, it prevents us from looking at
this counter to know if streams are still using a server or not.

We could imagine decrementing it much later but that wouldn't work with
leastconn, since that algo needs ->served to be final before calling
lbprm.server_drop_conn(). Thus what we're doing here is to detach from
the server, then decrement ->served, and only then call the LB callback
to update the server's position in the tree. At this moment the stream
doesn't know the server anymore anyway (except via this function's
local variable) so it's safe to consider that no stream knows the server
once the variable reaches zero.

BUG/MINOR: log: prevent saddr NULL deref in syslog_io_handler()

In ad0133cc ("MINOR: log: handle log-forward "option host""), we
de-reference saddr without first checking if saddr is NULL. In practise
saddr shouldn't be null, but it may be the case if memory error happens
for tcp syslog handler so we must assume that it can be NULL at some
point.

To fix the bug, we simply check for NULL before de-referencing it
under syslog_io_handler(), as the function comment suggests.

No backport needed unless ad0133cc is.

MINOR: jws: use jwt_alg type instead of a char

This patch implements the function EVP_PKEY_to_jws_algo() which returns
a jwt_alg compatible with the private key.

This value can then be passed to jws_b64_protected() and
jws_b64_signature() which modified to take an jwt_alg instead of a char.

MINOR: hlua: add an optional timeout to AppletTCP:receive()

TCP services might want to be interactive, and without a timeout on
receive(), the possibilities are a bit limited. Let's add an optional
timeout in the 3rd argument to possibly limit the wait time. In this
case if the timeout strikes before the requested size is complete,
a possibly incomplete block will be returned.

MINOR: cpu-topo: fix unused stack var 'cpu2' reported by coverity

Coverity has reported that cpu2 seems sometimes unused in
cpu_fixup_topology():

*** CID 1593776:  Code maintainability issues  (UNUSED_VALUE)
/src/cpu_topo.c: 690 in cpu_fixup_topology()
684                             continue;
685
686                     if (ha_cpu_topo[cpu].cl_gid != curr_id) {
687                             if (curr_id >= 0 && cl_cpu <= 2)
688                                     small_cl++;
689                             cl_cpu = 0;
>>>     CID 1593776:  Code maintainability issues  (UNUSED_VALUE)
>>>     Assigning value from "cpu" to "cpu2" here, but that stored value is overwritten before it can be used.
690                             cpu2 = cpu;
691                             curr_id = ha_cpu_topo[cpu].cl_gid;
692                     }
693                     cl_cpu++;
694             }
695

That's it. 'cpu2' automatic/stack variable is used only in for() loop scopes to
save cpus ID in which we are interested in. In the loop pointed by coverity
this variable is not used for further processing within the loop's scope.
Then it is always reinitialized to 0 in the another following loops.

This fixes GitHUb issue #2895.

MINOR: jws: add new functions in jws.h

Add signatures of jws_b64_payload(), jws_b64_protected(),
jws_b64_signature(), jws_flattened() which allows to create a complete
JWS flattened object.

MINOR: cpu-topo: add a new "resource" cpu-policy

This cpu policy keeps the smallest CPU cluster. This can
be used to limit the resource usage to the strict minimum
that still delivers decent performance, for example to
try to further reduce power consumption or minimize the
number of cores needed on some rented systems for a
sidecar setup, in order to scale the system down more
easily. Note that if a single cluster is present, it
will still be fully used.

When started on a 64-core EPYC gen3, it uses only one CCX
with 8 cores and 16 threads, all in the same group.

MINOR: cpu-topo: add a new "efficiency" cpu-policy

This cpu policy tries to evict performant core clusters and only
focuses on efficiency-oriented ones. On an intel i9-14900k, we can
get 525k rps using 8 performance cores, versus 405k when using all
24 efficiency cores. In some cases the power savings might be more
desirable (e.g. scalability tests on a developer's laptop), or the
performance cores might be better suited for another component
(application or security component).

MINOR: cpu-topo: add a new "performance" cpu-policy

This cpu policy tries to evict efficient core clusters and only
focuses on performance-oriented ones. On an intel i9-14900k, we can
get 525k rps using only 8 cores this way, versus 594k when using all
24 cores. The gains from using all these codes are not significant
enough to waste them on this. Also these cores can be much slower
at doing SSL handshakes so it can make sense to evict them. Better
keep the efficiency cores for network interrupts for example.

Also, on a developer's machine it can be convenient to keep all these
cores for the local tasks and extra tools (load generators etc).

MEDIUM: cpu-topo: let the "group-by-cluster" split groups

When a cluster is too large to fit into a single group, let's split it
into two equal groups, which will still be allowed to use all the CPUs
of the cluster. This allows haproxy to start all the threads with a
minimum number of groups (e.g. 2x40 for 80 cores).

MINOR: cpu-topo: add cpu-policy "group-by-cluster"

This policy forms thread groups from the CPU clusters, and bind all the
threads in them to all the CPUs of the cluster. This is recommended on
system with bad inter-CCX latencies. It was shown to simply triple the
performance with queuing on a 64-core EPYC without having to manually
assign the cores with cpu-map.

CLEANUP: thread: now remove the temporary CPU node binding code

This is now superseded by the default "safe" cpu-policy, and every time
it's used, that code was bypassed anyway since global.nbthread was set.
We can now safely remove it. Note that for other policies which do not
set a thread count nor further restrict CPUs (such as "none", or even
"safe" when finding a single node), we continue to go through the fallback
code that automatically assigns CPUs to threads and counts them.

MEDIUM: cpu-topo: use the "first-usable-node" cpu-policy by default

This now turns the cpu-policy to "first-usable-node" by default, so that
we preserve the current default behavior consisting in binding to the
first node if nothing was forced. If a second node is found,
global.nbthread is set and the previous code will be skipped.

MINOR: cpu-topo: add a 'first-usable-node' cpu policy

This is a reimplemlentation of the current default policy. It binds to
the first node having usable CPUs if found, and drops CPUs from the
second and next nodes.

MINOR: cpu-topo: add a CPU policy setting to the global section

We'll need to let the user decide what's best for their workload, and in
order to do this we'll have to provide tunable options. For that, we're
introducing struct ha_cpu_policy which contains a name, a description
and a function pointer. The purpose will be to use that function pointer
to choose the best CPUs to use and now to set the number of threads and
thread-groups, that will be called during the thread setup phase. The
only supported policy for now is "none" which doesn't set/touch anything
(i.e. all available CPUs are used).

MINOR: cpu-topo: add "only-cluster" and "drop-cluster" to cpu-set

These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
cluster number(s). It can be used to bind to only some clusters, such
as CCX or different energy efficiency cores. For this reason, here we
use the cluster's local ID (local to the node).

MINOR: cpu-topo: add "only-core" and "drop-core" to cpu-set

These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
core number(s). It can be used to bind to only some clusters as well
as to evict efficient cores whose number is known.

MINOR: cpu-topo: add "only-thread" and "drop-thread" to cpu-set

These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
thread number(s). It can be used to reserve even threads for HW IRQs
and odd threads for haproxy for example, or to evict efficient cores
that do only have thread #0.

MINOR: cpu-topo: add "only-node" and "drop-node" to cpu-set

These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated node(s).

MINOR: cpu-topo: ignore excess of too small clusters

On some Arm systems (typically A76/N1) where CPUs can be associated in
pairs, clusters are reported while they have no incidence on I/O etc.
Yet it's possible to have tens of clusters of 2 CPUs each, which is
counter productive since it does not even allow to start enough threads.

Let's detect this situation as soon as there are at least 4 clusters
having each 2 CPUs or less, which is already very suspcious. In this
case, all these clusters will be reset as meaningless. In the worst
case if needed they'll be re-assigned based on L2/L3.

MINOR: cpu-topo: create an array of the clusters

The goal here is to keep an array of the known CPU clusters, because
we'll use that often to decide of the performance of a cluster and
its relevance compared to other ones. We'll store the number of CPUs
in it, the total capacity etc. For the capacity, we count one unit
per core, and 1/3 of it per extra SMT thread, since this is roughly
what has been measured on modern CPUs.

In order to ease debugging, they're also dumped with -dc.

MINOR: cpu-topo: consider capacity when forming clusters

By using the cluster+capacity sorting function we can detect
heterogneous clusters which are not properly reported. Thanks to this,
the following misnumbered machine featuring 4 big cores, 4 medium ones
an 4 small ones is properly detected with its clusters correctly
assigned:

      [keep] thr=  0 -> cpu=  0 pk=00 no=00 cl=000 ts=000 capa=1024
      [keep] thr=  1 -> cpu=  1 pk=00 no=00 cl=002 ts=008 capa=278
      [keep] thr=  2 -> cpu=  2 pk=00 no=00 cl=002 ts=009 capa=278
      [keep] thr=  3 -> cpu=  3 pk=00 no=00 cl=002 ts=010 capa=278
      [keep] thr=  4 -> cpu=  4 pk=00 no=00 cl=002 ts=011 capa=278
      [keep] thr=  5 -> cpu=  5 pk=00 no=00 cl=001 ts=004 capa=905
      [keep] thr=  6 -> cpu=  6 pk=00 no=00 cl=001 ts=005 capa=905
      [keep] thr=  7 -> cpu=  7 pk=00 no=00 cl=001 ts=006 capa=866
      [keep] thr=  8 -> cpu=  8 pk=00 no=00 cl=001 ts=007 capa=866
      [keep] thr=  9 -> cpu=  9 pk=00 no=00 cl=000 ts=001 capa=984
      [keep] thr= 10 -> cpu= 10 pk=00 no=00 cl=000 ts=002 capa=984
      [keep] thr= 11 -> cpu= 11 pk=00 no=00 cl=000 ts=003 capa=1024

Also this has the benefit of always assigning highest performance
clusters with the smallest IDs so that simple configs can decide to
simply bind to cluster 0 or clusters 0,1 and benefit from optimal
performance.

MINOR: cpu-topo: add a function to sort by cluster+capacity

The purpose here is to detect heterogenous clusters which are not
properly reported, based on the exposed information about the cores
capacity. The algorithm here consists in sorting CPUs by capacity
within a cluster, and considering as equal all those which have 5%
or less difference in capacity with the previous one. This allows
large clusters of more than 5% total between extremities, while
keeping apart those where the limit is more pronounced. This is
quite common in embedded environments with big.little systems, as
well as on some laptops.

MINOR: cpu-topo: renumber cores to avoid holes and make them contiguous

Due to the way core numbers are assigned and the presence of SMT on
some of them, some holes may remain in the array. Let's renumber them
to plug holes once they're known, following pkg/node/die/llc etc, so
that they're local to a (pkg,node) set. Now an i7-14700 shows cores
0 to 19, not 0 to 27.

MINOR: cpu-topo: assign an L3 cache if more than 2 L2 instances

On some machines, L3 is not always reported (e.g. on some lx2 or some
armada8040). But some also don't have L3 (core 2 quad). However, no L3
when there are more than 2 L2 is quite unheard of, and while we don't
really care about firing 2 thread groups for 2 L2, we'd rather avoid
doing this if there are 8! In this case we'll declare an L3 instance
to fix the situation. This allows small machines to continue to start
with two groups while not derivating on large ones.

MINOR: cpu-topo: make sure we don't leave unassigned IDs in the cpu_topo

It's important that we don't leave unassigned IDs in the topology,
because the selection mechanism is based on index-based masks, so an
unassigned ID will never be kept. This is particularly visible on
systems where we cannot access the CPU topology, the package id, node id
and even thread id are set to -1, and all CPUs are evicted due to -1 not
being set in the "only-cpu" sets.

Here in new function "cpu_fixup_topology()", we assign them with the
smallest unassigned value. This function will be used to assign IDs
where missing in general.

MINOR: cpu-topo: assign clusters to cores without and renumber them

Due to the previous commit we can end up with cores not assigned
any cluster ID. For this, at the end we sort the CPUs by topology
and assign cluster IDs to remaining CPUs based on pkg/node/llc.
For example an 14900 now shows 5 clusters, one for the 8 p-cores,
and 4 of 4 e-cores each.

The local cluster numbers are per (node,pkg) ID so that any rule could
easily be applied on them, but we also keep the global numbers that
will help with thread group assignment.

We still need to force to assign distinct cluster IDs to cores
running on a different L3. For example the EPYC 74F3 is reported
as having 8 different L3s (which is true) and only one cluster.

Here we introduce a new function "cpu_compose_clusters()" that is called
from the main init code just after cpu_detect_topology() so that it's
not OS-dependent. It deals with this renumbering of all clusters in
topology order, taking care of considering any distinct LLC as being
on a distinct cluster.

MINOR: cpu-topo: ignore single-core clusters

Some platforms (several armv7, intel 14900 etc) report one distinct
cluster per core. This is problematic as it cannot let clusters be
used to distinguish real groups of cores, and cannot be used to build
thread groups.

Let's just compare the cluster cpus to the siblings, and ignore it if
they exactly match. We must also take care of not falling back to
core_cpus_list, which can enumerate cores that already have their
cluster assigned (e.g. intel 14900 has 4 4-Ecore clusters in addition
to the 8 Pcores).

MINOR: cpu-topo: implement a CPU sorting mechanism by cluster ID

This will be used to detect and fix incorrect setups which report
the same cluster ID for multiple L3 instances.

The arrangement of functions in this file is becoming a real problem.
Maybe we should move all this to cpu_topo for example, and better
distinguish OS-specific and generic code.

MINOR: cpu-topo: implement a sorting mechanism by CPU locality

Once we've kept only the CPUs we want, the next step will be to form
groups and these ones are based on locality. Thus we'll have to sort by
locality. For now the locality is only inferred by the index. No grouping
is made at this point. For this we add the "cpu_reorder_by_locality"
function with a locality-based comparison function.

MINOR: cpu-topo: implement a sorting mechanism for CPU index

CPU selection will be performed by sorting CPUs according to
various criteria. For dumps however, that's really not convenient
and we'll need to reorder the CPUs according to their index only.
This is what the new function cpu_reorder_by_index() does. It's
called in thread_detect_count() before dumping the CPU topology.

MINOR: cpu-topo: skip CPU properties that we've verified do not exist

A number of entries under /cpu/cpu%d only exist on certain kernel
versions, certain archs and/or with certain modules loaded. It's
pointless to insist on trying to read them all for all CPUs when
we've already verified they do not exist. Thus let's use stat()
the first time prior to checking some of them, and only try to
access them when they really exist. This almost completely
eliminates the large number of ENOENT that was visible in strace
during startup.

MINOR: cpu-topo: skip identification of non-existing CPUs

There's no point trying to read all entries under /cpu/cpu%d when that
one does not exist, so let's just skip it in this case.

MINOR: cpu-topo: skip CPU detection when /sys/.../cpu does not exist

There's no point scanning all entries when /cpu doesn't exist in the
first place. Let's check once for it and skip the loop in this case.

MINOR: cpu-topo: boost the capacity of performance cores with cpufreq

Cpufreq alone isn't a good metric on heterogenous CPUs because efficient
cores can reach almost as high frequencies as performant ones. Tests have
shown that majoring performance cores by 50% gives a pretty accurate
estimate of the performance to expect on modern CPUs, and that counting
+33% per extra SMT thread is reasonable as well. We don't have the info
about the core's quality, but using the presence of SMT is a reasonable
approach in this case, given that efficiency cores will not use it.

As an example, using one thread of each of the 8 P-cores of an intel
i9-14900k gives 395k rps for a corrected total capacity of 69.3k, using
the 16 E-cores gives 40.5k for a total capacity of 70.4k, and using both
threads of 6 P-cores gives 41.1k for a total capacity of 69.6k. Thus the
3 same scores deliver the same performance in various combinations.

MINOR: cpu-topo: use cpufreq before acpi cppc

The acpi_cppc method was found to take about 5ms per CPU on a 64-core
EPYC system, which is plain unacceptable as it delays the boot by half
a second. Let's use the less accurate cpufreq first, which should be
sufficient anyway since many systems do not have acpi_cppc. We'll only
fall back to acpi_cppc for systems without cpufreq. If it were to be
an issue over time, we could also automatically consider that all
threads of the same core or even of the same cluster run at the same
speed (when a cluster is known to be accurate).

MINOR: cpu-topo: fall back to nominal_perf and scaling_max_freq for the capacity

When cpu_capacity is not present, let's try to check acpi_cppc's
nominal_perf which is similar and commonly found on servers, then
scaling_max_freq (though that last one may vary a bit between CPUs
depending on die quality). That variation is not a problem since
we can absorb a ~5% variation without issue.

It was verified on an i9-14900 featuring 5.7-P, 6.0-P and 4.4-E GHz
that P-cores were not reordered and that E cores were placed last.
It was also OK on a W3-2345 with 4.3 to 4.5GHz.

MINOR: cpu-topo: refine cpu dump output to better show kept/dropped CPUs

It's becoming difficult to see which CPUs are going to be kept/dropped.
Let's just skip all offline CPUs, and indicate "keep" in front of those
that are going to be used, and "----" in front of the excluded ones. It
is way more readable this way.

Also let's just drop the array entry number, since it's always the same
as the CPU number and is only an internal representation anyway.

MEDIUM: cfgparse: remove now unused numa & thread-count detection

Ths is not needed anymore since already done before landing here
via thread_detect_count().

MEDIUM: thread: reimplement first numa node detection

Let's reimplement automatic binding to the first NUMA node when thread
count is not forced. It's the same thing as is already done in
check_config_validity() except that this time it's based on the
collected CPU information. The threads are automatically counted
and CPUs from non-first node(s) are evicted.

MEDIUM: cpu-topo: make sure to properly assign CPUs to threads as a fallback

If no cpu-map is done and no cpu-policy could be enforced, we still need
to count the number of usable CPUs, assign them to all threads and set
the nbthread value accordingly.

This already handles the part that was done in check_config_validity()
via thread_cpus_enabled_at_boot.

MEDIUM: thread: start to detect thread groups and threads min/max

By mutually refining the thread count and group count, we can try
to detect the most suitable setup for the current machine. Taskset
is implicitly handled correctly. tgroups automatically adapt to the
configured number of threads. cpu-map manages to limit tgroups to
the smallest supported value.

The thread-limit is enforced. Just like in cfgparse, if the thread
count was forced to a higher value, it's reduced and a warning is
emitted. But if it was not set, the thr_max value is bound to this
limit so that further calculations respect it.

We continue to default to the max number of available threads and 1
tgroup by default, with the limit. This normally allows to get rid
of that test in check_config_validity().

MINOR: cpu-topo: add "drop-cpu" and "only-cpu" to cpu-set

These allow respectively to disable binding to CPUs listed in a set, and
to disable binding to CPUs not in a set.

MINOR: cpu-topo: add a new "cpu-set" global directive to choose cpus

For now it's limited, it only supports "reset" to ask that any previous
"taskset" be ignored. The goal will be to later add more actions that
allow to symbolically define sets of cpus to bind to or to drop. This
also clears the cpu_mask_forced variable that is used to detect
that a taskset had been used.

MINOR: global: add a command-line option to enable CPU binding debugging

During development, everything related to CPU binding and the CPU topology
is debugged using state dumps at various places, but it does make sense to
have a real command line option so that this remains usable in production
to help users figure why some CPUs are not used by default. Let's add
"-dc" for this. Since the list of global.tune.options values is almost
full and does not 100% match this option, let's add a new "tune.debug"
field for this.

MINOR: cfgparse: use already known offline CPU information

No need to reparse cpu/online, let's just rely on the info we learned
previously about offline CPUs.

MINOR: cfgparse: move the binding detection into numa_detect_topology()

For now the function refrains from detecting the CPU topology when a
restrictive taskset or cpu-map was already performed on the process,
and it's documented as such, the reason being that until we're able
to automatically create groups, better not change user settings. But
we'll need to be able to detect bound CPUs and to process them as
desired by the user, so we now need to move that detection into the
function itself. It changes nothing to the logic, just gives more
freedom to the function.

MINOR: thread: turn thread_cpu_mask_forced() into an init-time variable

The function is not convenient because it doesn't allow us to undo the
startup changes, and depending on where it's being used, we don't know
whether the values read have already been altered (this is not the case
right now but it's going to evolve).

Let's just compute the status during cpu_detect_usable() and set a
variable accordingly. This way we'll always read the init value, and
if needed we can even afford to reset it. Also, placing it in cpu_topo.c
limits cross-file dependencies (e.g. threads without affinity etc).

MINOR: cpu-topo: add NUMA node identification to CPUs on FreeBSD

With this patch we're also NUMA node IDs to each CPU when the info is
found. The code is highly inspired from the one in commit f5d48f8b3
("MEDIUM: cfgparse: numa detect topology on FreeBSD."), the difference
being that we're just setting the value in ha_cpu_topo[].

MINOR: cpu-topo: add NUMA node identification to CPUs on Linux

With this patch we're also assigning NUMA node IDs to each CPU when one
is found. The code is highly inspired from the one in commit b56a7c89a
("MEDIUM: cfgparse: detect numa and set affinity if needed") that already
did the job, except that it could be simplified since we're just collecting
info to fill the ha_cpu_topo[] array.

MINOR: cpu-topo: also store the sibling ID with SMT

The sibling ID was not reported because it's not directly accessible
but we don't care, what matters is that we assign numbers to all the
threads we find using the same CPU so that some strategies permit to
allocate one thread at a time if we want to use few threads with max
performance.

MINOR: cpu-topo: add CPU topology detection for linux

This uses the publicly available information from /sys to figure the cache
and package arrangements between logical CPUs and fill ha_cpu_topo[], as
well as their SMT capabilities and relative capacity for those which expose
this. The functions clearly have to be OS-specific.

MINOR: cpu-topo: try to detect offline cpus at boot

When possible, the offline CPUs are detected at boot and their OFFLINE
flag is set in the ha_cpu_topo[] array. When the detection is not
possible (e.g. not linux, /sys not mounted etc), we just mark none of
them as being offline, as we don't want to infer wrong info that could
hinder automatic CPU placement detection. When valid, we take this
opportunity for refining cpu_topo_lastcpu so that we don't need to
manipulate CPUs beyond this value.

MINOR: cpu-topo: add detection of online CPUs on FreeBSD

On FreeBSD we can detect online CPUs at least by doing the bitwise-OR of
the CPUs of all domains, so we're using this and adding this detection
to ha_cpuset_detect_online(). If we find simpler later, we can always
rework it, but it's reasonably inexpensive since we only check existing
domains.

MINOR: cpu-topo: add detection of online CPUs on Linux

This adds a generic function ha_cpuset_detect_online() which for now
only supports linux via /sys. It fills a cpuset with the list of online
CPUs that were detected (or returns a failure).

REORG: cpu-topo: move bound cpu detection from cpuset to cpu-topo

The cpuset files are normally used only for cpu manipulations. It happens
that the initial CPU binding detection was initially placed there since
there was no better place, but in practice, being OS-specific, it should
really be in cpu-topo. This simplifies cpuset which doesn't need to know
about the OS anymore.

MINOR: cpu-topo: update CPU topology from excluded CPUs at boot

Now before trying to resolve the thread assignment to groups, we detect
which CPUs are not bound at boot so that we can mark them with
HA_CPU_F_EXCLUDED. This will be useful to better know on which CPUs we
can count later. Note that we purposely ignore cpu-map here as we
don't know how threads and groups will map to cpu-map entries, hence
which CPUs will really be used.

It's important to proceed this way so that when we have no info we
assume they're all available.

MINOR: cpu-topo: add a function to dump CPU topology

The new function cpu_dump_topology() will centralize most debugging
calls, and it can make efforts of not dumping some possibly irrelevant
fields (e.g. non-existing cache levels).

MINOR: cpu-topo: rely on _SC_NPROCESSORS_CONF to trim maxcpus

We don't want to constantly deal with as many CPUs as a cpuset can hold,
so let's first try to trim the value to what the system claims to support
via _SC_NPROCESSORS_CONF. It is obviously still subject to the limit of
the cpuset size though. The value is stored globally so that we can
reuse it elsewhere after initialization.

MINOR: cpu-topo: allocate and initialize the ha_cpu_topo array.

This does the bare minimum to allocate and initialize a global
ha_cpu_topo array for the number of supported CPUs and release
it at deinit time.

MINOR: cpu-topo: add ha_cpu_topo definition

This structure will be used to store information about each CPU's
topology (package ID, L3 cache ID, NUMA node ID etc). This will be used
in conjunction with CPU affinity setting to try to perform a mostly
optimal binding between threads and CPU numbers by default. Since it
was noticed during tests that absolutely none of the many machines
tested reports different die numbers, the die_id is not stored.
Also, it was found along experiments that the cluster ID will be used
a lot, half of the time as a node-local identifier, and half of the
time as a global identifier. So let's store the two versions at once
(cl_gid, cl_lid).

Some flags are added to indicate causes of exclusion (offline, excluded
at boot, excluded by rules, ignored by policy).

MINOR: thread: rely on the cpuset functions to count bound CPUs

let's just clean up the thread_cpus_enabled() code a little bit
by removing the OS-specific code and rely on ha_cpuset_detect_bound()
instead. On macos we continue to use sysconf() for now.

MINOR: cpuset: make the API support negative CPU IDs

Negative IDs are very convenient to mean "not set", so let's just make
the cpuset API robust against this, especially with ha_cpuset_isset()
so that we don't have to manually add this check everywhere when a
value is not known.

DOC: design-thoughts: commit numa-auto.txt

Lots of collected data and observations aggregated into a single commit
so as not to lose them. Some parts below come from several commit
messages and are incremental.

Add captures and analysis of intel 14900 where it's not easy to draw
the line between the desired P and E cores.

The 14900 raises some questions (imagine a dual-die variant in multi-socket).
That's the start of an algorithmic distribution of performance cores into
thread groups.

cpu-map currently conflicts a lot with the choices after auto-detection
but it doesn't have to. The problem is the inability to configure the
threads for the whole process like taskset does. By offering this ability
we can also start to designate groups of CPUs symbolically (package, die,
ccx, cores, smt).

It can also be useful to exploit the info from cpuinfo that is not
available in /sys, such as the model number. At least on arm, higher
numbers indicate bigger cores and can be useful to distinguish cores
inside a cluster. It will not indicate big vs medium ones of the same
type (e.g. a78 3.0 vs 2.4 GHz) but can still be effective at identifying
the efficient ones.

In short, infos such as cluster ID not always reliable, and are
local to the package. die_id as well. die number is not reported
here but should definitely be used, as a higher priority than L3.

We're still missing a discriminant between the l3 and cluster number
in order to address heterogenous CPUs (e.g. intel 14900), though in
terms of locality that's currently done correctly.

CPU selection is also a full topic, and some thoughts were noted
regarding sorting by perf vs locality so as never to mix inter-
socket CPUs due to sorting.

The proposed cpu-selection cannot work as-is, because it acts both on
restriction and preference, and these two are not actions but a sequence.
First restrictions must be enforced, and second the remaining CPUs are
sorted according to the preferred criterion, and a number of threads are
selected.

Currently we refine the OS-exposed cluster number but it's not correct
as we can end up with something poorly numbered. We need to respect the
LLC in any case so let's explain the approach.

DEV: ncpu: also emulate sysconf() for _SC_NPROCESSORS_*

This is also needed in order to make the requested number of CPUs
appear. For now we don't reroute to the original sysconf() call so
we return -1,EINVAL for all other info.

BUILD: tools: avoid a build warning on gcc-4.8 in resolve_sym_name()

A build warning is emitted with gcc-4.8 in tools.c since commit
e920d73f59 ("MINOR: tools: improve symbol resolution without dl_addr")
because the compiler doesn't see that <size> is necessarily initialized.
Let's just preset it.

MINOR: tools: teach resolve_sym_name() a few more common symbols

This adds run_poll_loop, run_tasks_from_lists, process_runnable_tasks,
ha_dump_backtrace and cli_io_handler which are fairly common in
backtraces. This will be less relative symbols when dladdr is not
usable.

MINOR: tools: ease the declaration of known symbols in resolve_sym_name()

Let's have a macro that declares both the symbol and its name, it will
avoid the risk of introducing typos, and encourages adding more when
needed. The macro also takes an optional second argument to permit an
inline declaration of an extern symbol.

MINOR: tools: improve symbol resolution without dl_addr

When dl_addr is not usable or fails, better fall back to the closest
symbol among the known ones instead of providing everything relative
to main. Most often, the location of the function will give some hints
about what it can be. Thus now we can emit fct+0xXXX in addition to
main+0xXXX or main-0xXXX. We keep a margin of +256kB maximum after a
function for a match, which is around the maximum size met in an object
file, otherwise it becomes pointless again.

MINOR: cli: export cli_io_handler() to ease symbol resolution

It's common to meet this function in backtraces, it's a bit annoying
that it's not resolved, so let's export it so that it becomes resolvable.

BUG/MINOR: stats: fix capabilities and hide settings for some generic metrics

Performing a diff on stats output before vs after commit 66152526
("MEDIUM: stats: convert counters to new column definition") revealed
that some metrics were not properly ported to to the new API. Namely,
"lbtot", "cli_abrt" and "srv_abrt" are now exposed on frontend and
listeners while it was not the case before.

Also, "hrsp_other" is exposed even when "mode http" wasn't set on the
proxy.

In this patch we restore original behavior by fixing the capabilities
and hide settings.

As this could be considered as a minor regression (looking at the commit
message it doesn't seem intended), better tag this as a bug. It should be
backported in 3.0 with 66152526.

DOC: management: rename some last occurences from domain "dns" to "resolvers"

This is a complementary patch to cf913c2f9 ("DOC: management: rename show
stats domain cli "dns" to "resolvers"). The doc still refered to the
legacy "dns" domain filter for stat command. Let's rename those occurences
to "resolvers".

It may be backported to all stable versions.

BUILD: backend: silence a build warning when threads are disabled

Since commit 8de8ed4f48 ("MEDIUM: connections: Allow taking over
connections from other tgroups.") we got this partially absurd
build warning when disabling threads:

src/backend.c: In function 'conn_backend_get':
src/backend.c:1371:27: warning: array subscript [0, 0] is outside array bounds of 'struct tgroup_info[1]' [-Warray-bounds]

The reason is that gcc sees that curtgid is not equal to tgid which is
defined as 1 in this case, thus it figures that tgroup_info[curtgid-1]
will be anything but zero and that doesn't fit. It is ridiculous as it
is a perfect case of dead code elimination which should not warrant a
warning. Nevertheless we know we don't need to do this when threads are
disabled and in this case there will not be more than 1 thread group, so
we can happily use that preliminary test to help the compiler eliminate
the dead condition and avoid spitting this warning.

No backport is needed.

BUILD: tools: silence a build warning when USE_THREAD=0

The dladdr_lock that was added to avoid re-entering into dladdr is
conditioned by threads, but the way it's declared causes a build
warning if threads are disabled due to the insertion of a lone semi
colon in the variables block. Let's switch to __decl_thread_var()
for this.

This can be backported wherever commit eb41d768f9 ("MINOR: tools:
use only opportunistic symbols resolution") is backported. It relies
on these previous two commits:

bb4addabb7 ("MINOR: compiler: add a simple macro to concatenate resolved strings")
69ac4cd315 ("MINOR: compiler: add a new __decl_thread_var() macro to declare local variables")

MINOR: compiler: add a new __decl_thread_var() macro to declare local variables

__decl_thread() already exists but is more suited for struct members.
When using it in a variables block, it appends the final trailing
semi-colon which is a statement that ends the variable block. Better
clean this up and have one precisely for variable blocks. In this
case we can simply define an unused enum value that will consume the
semi-colon. That's what the new macro __decl_thread_var() does.

MINOR: compiler: add a simple macro to concatenate resolved strings

It's often useful to be able to concatenate strings after resolving
them (e.g. __FILE__, __LINE__ etc). Let's just have a CONCAT() macro
to do that, which calls _CONCAT() with the same arguments to make
sure the contents are resolved before being concatenated.

BUG/MEDIUM: thread: use pthread_self() not ha_pthread[tid] in set_affinity

A bug was uncovered by the work on NUMA. It only triggers in the CI
with libmusl due to a race condition. What happens is that the call
to set_thread_cpu_affinity() is done very early in the polling loop,
and that it relies on ha_pthread[tid] instead of pthread_self(). The
problem is that ha_pthread[tid] is only set by the return from
pthread_create(), which might happen later depending on the number of
CPUs available to run the starting thread.

Let's just use pthread_self() here. ha_pthread[] is only used to send
signals between threads, there's no point in using it here.

This can be backported to 2.6.

MEDIUM: log: change default "host" strategy for log-forward section

Historically, log-forward proxy used to preserve host field from input
message as much as possible, and if syslog host wasn't provided
(rfc5424 '-' or bad rfc3164 or rfc5424 message) then "localhost" or "-"
would be used as host when outputting message using rfc3164 or rfc5424.

We change that behavior (which corresponds to "keep" host option), so that
log-forward now uses "fill" strategy as default: if the host is provided
in input message, it is preserved. However if it is missing and IP address
from sender is available, we use it.

MINOR: log: handle log-forward "option host"

Following previous patch, we know implement the logic for the host
option under log-forward section. Possible strategies are:

      replace If input message already contains a value for the host
              field, we replace it by the source IP address from the
              sender.
              If input message doesn't contain a value for the host field
              (ie: '-' as input rfc5424 message or non compliant rfc3164
              or rfc5424 message), we use the source IP address from the
              sender as host field.

      fill    If input message already contains a value for the host field,
              we keep it.
              If input message doesn't contain a value for the host field
              (ie: '-' as input rfc5424 message or non compliant rfc3164
              or rfc5424 message), we use the source IP address from the
              sender as host field.

      keep    If input message already contains a value for the host field,
              we keep it.
              If input message doesn't contain a value for the host field,
              we set it to localhost (rfc3164) or '-' (rfc5424).
              (This is the default)

      append  If input message already contains a value for the host field,
              we append a comma followed by the IP address from the sender.
              If input message doesn't contain a value for the host field,
              we use the source IP address from the sender.

Default value (unchanged) is "keep" strategy. option host is only relevant
with rfc3164 or rfc5424 format on log targets. Also, if the source address
is not available (ie: UNIX socket), default behavior prevails.

Documentation was updated.

MINOR: log: add "option host" log-forward option

add only the parsing part, options are currently unused

MINOR: tools: only print address in sa2str() when port == -1

Support special value for port in sa2str: if port is equal to -1, only
print the address without the port, also ignoring <map_ports> value.

MINOR: log: provide source address information in syslog_process_message()

provide struct sockaddr_storage pointer from the message sender in
syslog_process_message()

MINOR: log: migrate log-forward options from proxy->options2 to options3

Migrate recently added log-forward section options, currently stored under
proxy->options2 to proxy->options3 since proxy->options2 is running out of
space and we plan on adding more log-forward options.