git.ipfire.org Git - thirdparty/haproxy.git/log

MEDIUM: cli: apply spelling fixes for known commands before listing them

Entering "show tls" would still emit 35 entries. By measuring the distance
between all unknown words and the candidates, we can sort them and pick the
10 most likely candidates. This works reasonably well, as now "show tls"
only proposes "show tls-keys", "show threads", "show pools" and "show tasks".

If the distance is still too high or if a word is missing, the whole
prefix list continues to be dumped, thus "show" alone will still report
the entire list of commands beginning with "show".

It's still impossible to skip a word, for example "show conn" will not
propose "show servers conn" because the distance is calculated for each
word individually. Some changes to the distance calculation to support
updating an existing map could easily address this. But this is already
a great improvement.

MINOR: cli: filter the list of commands to the matching part

The error message on the CLI has become unreadable due to the long list
and it's not even sorted, making it even harder to figure the right
command.

This patch starts by looking if some of the words match something known,
and if so, will limit the listing only to those commands that start like
the current one. The "help", "prompt" and "quit" commands are always
shown to help the user try something else. Now thanks to this, typing
"add" or "del" will only list "add acl", "add map" and not 50 lines
anymore.

As a small bonus, we won't print "Unknown command" anymore in response
to the "help" command.

MINOR: cli: print the error message in the parser function itself

By doing so we can report more accurate information about what's wrong.
As a first step, we already distinguish the case of expert-only commands
from other ones.

MINOR: cli: test the appctx level for master access instead of comparing pointers

Now that the appctx contains the master level, it greatly simplifies
all the tests, as we can simply verify that keyword levels match the
effective level without having to cheat with applet pointers. This
also allows to fold the expert test in them.

MINOR: cli: set the ACCESS_MASTER* bits on the master bind_conf

Right now the code is a bit hackish, it tests for the keyword's level
flags but checks the applet's origin to compare the bits. Let's start
by properly setting the ACCESS_MASTER_ONLY and ACCESS_MASTER flags on
the master CLI's bind_conf so that they are automatically present
all the time.

CLEANUP: cli: fix misleading comment and better indent the access level flags

It was mentioned that ACCESS_MASTER_ONLY as for workers only instead of
master-only. And it wasn't clear that all ACCESS_* would belong to the
same thing.

BUG/MINOR: cli: make sure "help", "prompt", "quit" are enabled at master level

These 3 commands are functionally valid both in master and worker CLIs.
However, while they do have a valid handler, they are not permitted by
the code and work partially by chance in the master:
  - "prompt" and "quit" are intercepted by the request analyser
  - "help" triggers an error, which results in displaying the error
    message

Let's make sure they are permitted so that we don't count errors there and
that we can report appropriate help.

This bug has always been there but it doesn't have any functional effect
at the moment since "help" can only show the error message. As such, there
is no need to backport it.

CLEANUP: resolvers: Perform unsafe loop on requester list when possible

When answer list of a response is checked, it is useless to perform a safe
loop on the requester list.

CLEANUP: resolvers: Use ha_free() in srvrq_resolution_error_cb()

Two occurrences to "free(A);A=NULL;" may be replaced by a call to ha_free()
in the srvrq_resolution_error_cb() function.

MINOR: resolvers: Don't try to match immediatly renewed ADD items

The loop looking for existing ADD items to renew their last_seen must ignore
the items already renewed in the same loop. To do so, we rely on the
last_seen time. because it is now based on now_ms, it is safe.

Doing so avoid to match several time the same ADD item when the same IP
address is found in several ADD item. This reduces the number of extra DNS
resolutions.

This patch depends on "MINOR: resolvers: Use milliseconds for cached items
in resolver responses". Both may be backported as far as 2.2 if necessary.

MINOR: resolvers: Use milliseconds for cached items in resolver responses

The last time when an item was seen in a resolver responses is now stored in
milliseconds instead of seconds. This avoid some corner-cases at the
edges. This also simplifies time comparisons.

BUG/MEDIUM: resolvers: Skip DNS resolution at startup if SRV resolution is set

At startup, if a SRV resolution is set for a server, no DNS resolution is
created. We must wait the first SRV resolution to know if it must be
triggered. It is important to do so for two reasons.

First, during a "classical" startup, a server based on a SRV resolution has
no hostname. Thus the created DNS resolution is useless. Best waiting the
first SRV resolution. It is not really a bug at this stage, it is just
useless.

Second, in the same situation, if the server state is loaded from a file,
its hosname will be set a bit later. Thus, if there is no additionnal record
for this server, because there is already a DNS resolution, it inhibits any
new DNS resolution. But there is no hostname attached to the existing DNS
resolution. So no resolution is performed at all for this server.

To avoid any problem, it is fairly easier to handle this special case during
startup. But this means we must be prepared to have no "resolv_requester"
field for a server at runtime.

This patch must be backported as far as 2.2.

BUG/MEDIUM: resolvers: Don't release resolution from a requester callbacks

Another way to say it: "Safely unlink requester from a requester callbacks".

Requester callbacks must never try to unlink a requester from a resolution, for
the current requester or another one. First, these callback functions are called
in a loop on a request list, not necessarily safe. Thus unlink resolution at
this place, may be unsafe. And it is useless to try to make these loops safe
because, all this stuff is placed in a loop on a resolution list. Unlink a
requester may lead to release a resolution if it is the last requester.

However, the unkink is necessary because we cannot reset the server state
(hostname and IP) with some pending DNS resolution on it. So, to workaround
this issue, we introduce the "safe" unlink. It is only performed from a
requester callback. In this case, the unlink function never releases the
resolution, it only reset it if necessary. And when a resolution is found
with an empty requester list, it is released.

This patch depends on the following commits :

* MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
* MINOR: resolvers: Use a function to remove answers attached to a resolution
* MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
* MINOR: resolvers: Add function to change the srv status based on SRV resolution

All the series must be backported as far as 2.2. It fixes a regression
introduced by the commit b4badf720 ("BUG/MINOR: resolvers: new callback to
properly handle SRV record errors").

don't release resolution from requester cb

MINOR: resolvers: Directly call srvrq_update_srv_state() when possible

When the server status must be updated from the result of a SRV resolution,
we can directly call srvrq_update_srv_state(). It is simpler and this avoid
a test on the server DNS resolution.

This patch is mandatory for the next commit. It also rely on "MINOR:
resolvers: Directly call srvrq_update_srv_state() when possible".

MINOR: resolvers: Add function to change the srv status based on SRV resolution

srvrq_update_srv_status() update the server status based on result of SRV
resolution. For now, it is only used from snr_update_srv_status() when
appropriate.

MINOR: resolvers: Purge answer items when a SRV resolution triggers an error

When a SRV request trigger an error, if we decide to handle the error
because last_valid duration is expired, the answer list may be purged. All
items are considered as obsolete.

MINOR: resolvers: Use a function to remove answers attached to a resolution

resolv_purge_resolution_answer_records() must be used to removed all answers
attached to a resolution. For now, it is only used when a resolution is
released.

BUG/MEDIUM: resolvers: Trigger a DNS resolution if an ADD item is obsolete

When a ADD item attached to a SRV item is removed because it is obsolete, we
must trigger a DNS resolution to be sure the hostname still resolves or
not. There is no other way to be the entry is still valid. And we cannot set
the server in RMAINT immediatly, because a DNS server may be inconsitent and
may stop to add some additionnal records.

The opposite is also true. If a valid ADD item is still attached to a SRV
item, any DNS resolution must be stopped. There is no reason to perform
extra resolution in this case.

This patch must be backported as far as 2.2.

BUG/MINOR; resolvers: Ignore DNS resolution for expired SRV item

If no ADD item is found for a SRV item in a SRV response, a DNS resolution
is triggered. When it succeeds, we must be sure the SRV item is still
alive. Otherwise the DNS resolution must be ignored.

This patch depends on the commit "MINOR: resolvers: Move last_seen time of
an ADD into its corresponding SRV item". Both must be backported as far as
2.2.

MINOR: resolvers: new function find_srvrq_answer_record()

This function search for a SRV answer item associated to a requester
whose type is server.
This is mainly useful to "link" a server to its SRV record when no
additional record were found to configure the IP address.

This patch is required by a bug fix.

BUG/MEDIUM: resolvers: Fix the loop looking for an existing ADD item

For each ADD item found in a SRV response, we try to find a corresponding
ADD item already attached to an existing SRV item. If found, the ADD
last_seen time is updated, otherwise we try to find a SRV item with no ADD
to attached the new one.

However, the loop is buggy. Instead of comparing 2 ADD items, it compares
the new ADD item with the SRV item. Because of this bug, we are unable to
renew last_seen time of existing ADD.

This patch must be backported as far as 2.2.

BUG/MEDIUM: resolvers: Don't set an address-less server as UP

when a server status is updated based on a SRV item, it is always set to UP,
regardless it has an IP address defined or not. For instance, if only a SRV
item is received, with no additional record, only the server hostname is
defined. We must wait to have an IP address to set the server as UP.

This patch must be backported as far as 2.2.

BUG/MINOR: resolvers: Unlink DNS resolution to set RMAINT on SRV resolution

When a server is set in RMAINT becaues of a SRV resolution failure, the
server DNS resolution, if any, must be unlink first. It is mandatory to
handle the change in the context of a SRV resolution.

This patch must be backported as far as 2.2.

BUG/MINOR: resolvers: Reset server address on DNS error only on status change

When a DNS resolution error is detected, in snr_resolution_error_cb(), the
server address must be reset only if the server status has changed. It this
case, it means the server is set to RMAINT. Thus the server address may by
reset.

This patch fixes a bug introduced by commit d127ffa9f ("BUG/MEDIUM:
resolvers: Reset address for unresolved servers"). It must be backported as
far as 2.0.

BUG/MINOR: resolvers: Consider server to have no IP on DNS resolution error

When an error is received for a DNS resolution, for instance a NXDOMAIN
error, the server must be considered to have no address when its status is
updated, not the opposite.

Concretly, because this parameter is not used on error path in
snr_update_srv_status(), there is no impact.

This patch must be backported as far as 1.8.

Revert "BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record"

This reverts commit a331a1e8eb2ad4750711a477ca3e22d940495faf.

This commit fixes a real bug, but it also reveals some hidden bugs, mostly
because of some design issues. Thus, in itself, it create more problem than
it solves. So revert it for now. All known bugs will be addressed in next
commits.

This patch should be backported as far as 2.2.

BUG/MINOR: cfgparse/server: increment the extra keyword counter one at a time

This was introduced in previous commit 49c2b45c1 ("MINOR: cfgparse/server:
try to fix spelling mistakes on server lines"), the loop was changed but
the increment left. No backport is needed.

MINOR: tcp-rules: suggest approaching action names on mismatch

This adds support for action_suggest() in tcp-request and tcp-response
rules so as to propose the closest match in case of misspelling.

MINOR: http-rules: suggest approaching action names on mismatch

This adds support for action_suggest() in http-request, http-response
and http-after-response rulesets. For example:

parsing [/dev/stdin:2]: 'http-request' expects (...), but got 'del-hdr'. Did you mean 'del-header' maybe ?

MINOR: actions: add a function to suggest an action ressembling a given word

action_suggest() will return a pointer to an action whose keyword more or
less ressembles the passed argument. It also accepts to be more tolerant
against prefixes (since actions taking arguments are handled as prefixes).
This will be used to suggest approaching words.

MINOR: cfgparse/bind: suggest correct spelling for unknown bind keywords

Just like with the server keywords, now's the turn of "bind" keywords.
The difference is that 100% of the bind keywords are registered, thus
we do not need the list of extra keywords.

There are multiple bind line parsers today, all were updated:
  - peers
  - log
  - dgram-bind
  - cli

$ printf "listen f\nbind :8000 tcut\n" | ./haproxy -c -f /dev/stdin
[NOTICE] 070/101358 (25146) : haproxy version is 2.4-dev11-7b8787-26
[NOTICE] 070/101358 (25146) : path to executable is ./haproxy
[ALERT] 070/101358 (25146) : parsing [/dev/stdin:2] : 'bind :8000' unknown keyword 'tcut'; did you mean 'tcp-ut' maybe ?
[ALERT] 070/101358 (25146) : Error(s) found in configuration file : /dev/stdin
[ALERT] 070/101358 (25146) : Fatal errors found in configuration.

MINOR: cfgparse/server: try to fix spelling mistakes on server lines

Let's apply the fuzzy match to server keywords so that we can avoid
dumping the huge list of supported keywords each time there is a spelling
mistake, and suggest proper spelling instead:

  $ printf "listen f\nserver s 0 sendpx-v2\n" | ./haproxy -c -f /dev/stdin
  [NOTICE] 070/095718 (24152) : haproxy version is 2.4-dev11-caa6e3-25
  [NOTICE] 070/095718 (24152) : path to executable is ./haproxy
  [ALERT] 070/095718 (24152) : parsing [/dev/stdin:2] : 'server s' unknown keyword 'sendpx-v2'; did you mean 'send-proxy-v2' maybe ?
  [ALERT] 070/095718 (24152) : Error(s) found in configuration file : /dev/stdin
  [ALERT] 070/095718 (24152) : Fatal errors found in configuration.

MINOR: cfgparse: suggest correct spelling for unknown words in global section

The global section also knows a large number of keywords that are not
referenced in any list, so this needed them to be specifically listed.
It becomes particularly handy now because some tunables are never easy
to remember, but now it works remarkably well:

  $ printf "global\nsched.queue_depth\n" | ./haproxy -c -f /dev/stdin
  [NOTICE] 070/093007 (23457) : haproxy version is 2.4-dev11-dd8ee5-24
  [NOTICE] 070/093007 (23457) : path to executable is ./haproxy
  [ALERT] 070/093007 (23457) : parsing [/dev/stdin:2] : unknown keyword 'sched.queue_depth' in 'global' section; did you mean 'tune.runqueue-depth' maybe ?
  [ALERT] 070/093007 (23457) : Error(s) found in configuration file : /dev/stdin
  [ALERT] 070/093007 (23457) : Fatal errors found in configuration.

MINOR: cfgparse: suggest correct spelling for unknown words in proxy sections

Let's start by the largest keyword list, the listeners. Many keywords were
still not part of a list, so a common_kw_list array was added to list the
not enumerated ones. Now for example, typing "tmout" properly suggests
"timeout":

  $ printf "frontend f\ntmout client 10s\n" | ./haproxy -c -f /dev/stdin
  [NOTICE] 070/091355 (22545) : haproxy version is 2.4-dev11-3b728a-21
  [NOTICE] 070/091355 (22545) : path to executable is ./haproxy
  [ALERT] 070/091355 (22545) : parsing [/dev/stdin:2] : unknown keyword 'tmout' in 'frontend' section; did you mean 'timeout' maybe ?
  [ALERT] 070/091355 (22545) : Error(s) found in configuration file : /dev/stdin
  [ALERT] 070/091355 (22545) : Fatal errors found in configuration.

MINOR: cfgparse: add cfg_find_best_match() to suggest an existing word

Instead of just reporting "unknown keyword", let's provide a function which
will look through a list of registered keywords for a similar-looking word
to the one that wasn't matched. This will help callers suggest correct
spelling. Also, given that a large part of the config parser still relies
on a long chain of strcmp(), we'll need to be able to pass extra candidates.
Thus the function supports an optional extra list for this purpose.

MINOR: tools: add simple word fingerprinting to find similar-looking words

This introduces two functions, one which creates a fingerprint of a word,
and one which computes a distance between two words fingerprints. The
fingerprint is made by counting the transitions between one character and
another one. Here we consider the 26 alphabetic letters regardless of
their case, then any digit as a digit, and anything else as "other". We
also consider the first and last locations as transitions from begin to
first char, and last char to end. The distance is simply the sum of the
squares of the differences between two fingerprints. This way, doubling/
missing a letter has the same cost, however some repeated transitions
such as "e"->"r" like in "server" are very unlikely to match against
situations where they do not exist. This is a naive approach but it seems
to work sufficiently well for now. It may be refined in the future if
needed.

CLEANUP: actions: the keyword must always be const from the rule

There's no reason for a rule to want to modify an action keyword, let's
make sure it is always const.

CLEANUP: http-rules: remove the unexpected comma before the list of action keywords

The error message for http-request and http-response starts with a comma
that very likely is a leftover from a previous list construct. Let's remove
it: "'http-request' expects , 'wait-for-handshake', 'use-service' ...".

CLEANUP: vars: make the error message clearer on missing arguments for set-var

The error message after "http-response set-var" isn't very clear:

  [ALERT] 070/115043 (30526) : parsing [/dev/stdin:2] : error detected in proxy 'f' while parsing 'http-response set-var' rule : invalid variable 'set-var'. Expects 'set-var(<var-name>)' or 'unset-var(<var-name>)'.

Let's change it to this instead:

  [ALERT] 070/115608 (30799) : parsing [/dev/stdin:2] : error detected in proxy 'f' while parsing 'http-response set-var' rule : invalid or incomplete action 'set-var'. Expects 'set-var(<var-name>)' or 'unset-var(<var-name>)'.

With a wrong action name, it also works better (it's handled as a prefix
due to the opening parenthesis):

  [ALERT] 070/115608 (30799) : parsing [/dev/stdin:2] : error detected in proxy 'f' while parsing 'http-response set-varxxx' rule : invalid or incomplete action 'set-varxxx'. Expects 'set-var(<var-name>)' or 'unset-var(<var-name>)'.

CLEANUP: tcp-rules: add missing actions in the tcp-request error message

The tcp-request error message only mentions "accept", "reject" and
track-sc*, but there are a few other ones that were missing, so let's
add them.

This could be backported, though it's not likely that it will help anyone
with an existing config.

BUG/MINOR: server-state: use the argument, not the global state

The refactoring in commit 131b07be3 ("MEDIUM: server: Refactor
apply_server_state() to make it more readable") also had a copy-paste
error resulting in using global.server_state_file instead of the
function's argument, which easily crashes with a conf having a
state file in a backend and no global state file.

In addition, let's simplify the code and get rid of strcpy() which
almost certainly will break the build on OpenBSD.

This was introduced in 2.4-dev10, no backport is needed.

BUG/MINOR: server-state: properly handle the case where the base is not set

The refactoring in commit 131b07be3 ("MEDIUM: server: Refactor
apply_server_state() to make it more readable") made the global
server_state_base be dereferenced before being checked, resulting
in a crash on certain files.

This happened in 2.4-dev10, no backport is needed.

BUG/MINOR: tcpcheck: Fix double free on error path when parsing tcp/http-check

When a "tcp-check" or a "http-check" rule is parsed, we try to get the
previous rule in the ruleset to get its index. We must take care to reset
the pointer on this rule in case an error is triggered later on the
parsing. Otherwise, the same rule may be released twice. For instance, it
happens with such line :

http-check meth GET uri / ## note there is no "send" parameter

This patch must be backported as far as 2.2.

BUG/MINOR: proxy/session: Be sure to have a listener to increment its counters

It is possible to have a session without a listener. It happens for applets
on the client side. Thus all accesses to the listener info from the session
must be guarded. It was the purpose of the commit 36119de18 ("BUG/MEDIUM:
session: NULL dereference possible when accessing the listener"). However,
some tests on the session's listener existence are missing in proxy_inc_*
functions.

This patch should fix the issues #1171, #1172, #1173, #1174 and #1175. It
must be backported with the above commit as far as 1.8.

BUG/MINOR: tcpcheck: Update .health threshold of agent inside an agent-check

If an agent-check is configured for a server, When the response is parsed,
the .health threshold of the agent must be updated on up/down/stopped/fail
command and not the threshold of the health-check. Otherwise, the
agent-check will compete with the health-check and may mark a DOWN server as
UP.

This patch should fix the issue #1176. It must be backported as far as 2.2.

BUG/MEDIUM: filters: Set CF_FL_ANALYZE on channels when filters are attached

CF_FL_ANALYZE flag is used to know a channel is filtered. It is important to
synchronize request and response channels when the filtering ends.

However, it is possible to call all request analyzers before starting the
filtering on the response channel. This means flt_end_analyze() may be
called for the request channel before flt_start_analyze() on the response
channel. Thus because CF_FL_ANALYZE flag is not set on the response channel,
we consider the filtering is finished on both sides. The consequence is that
flt_end_analyze() is not called for the response and backend filters are
unregistered before their execution on the response channel.

It is possible to encounter this bug on TCP frontend or CONNECT request on
HTTP frontend if the client shutdown is reveiced with the first read.

To fix this bug, CF_FL_ANALYZE is set when filters are attached to the
stream. It means, on the request channel when the stream is created, in
flt_stream_start(). And on both channels when the backend is set, in
flt_set_stream_backend().

This patch must be backported as far as 1.7.

BUILD: atomic/arm64: force the register pairs to use in __ha_cas_dw()

Since commit f8fb4f75f ("MINOR: atomic: implement a more efficient arm64
__ha_cas_dw() using pairs"), on some modern arm64 (armv8.1+) compiled
with -march=armv8.1-a under gcc-7.5.0, a build error may appear on ev_poll.o :

/tmp/ccHD2lN8.s:1771: Error: reg pair must start from even reg at operand 1 -- `casp x27,x28,x22,x23,[x12]'
Makefile:927: recipe for target 'src/ev_poll.o' failed

It appears that the compiler cannot always assign register pairs there
for a structure made of two u64. It was possibly later addressed since
gcc-9.3 never caused this, but there's no trivially available info on
the subject in the changelogs. Unsuprizingly, using a u128 instead does
fix this, but it significantly inflates the code (+4kB for just 6 places,
very likely that it loaded some extra stubs) and the comparison is ugly,
involving two slower conditional jumps instead of a single one and a
conditional comparison. For example, ha_random64() grew from 144 bytes
to 232.

However, simply forcing the base register does work pretty well, and
makes the code even cleaner and more efficient by further reducing it
by about 4.5kB, possibly because it helps the compiler to pick suitable
registers for the pair there. And the perf on 64-cores looks steadily
0.5% above the previous one, so let's do this.

Note that the commit above was backported to 2.3 to fix scalability
issues on AWS Graviton2 platform, so this one will need to be as well.

BUG/MEDIUM: stick-tables: fix ref counter in table entry using multiple http tracksc.

Setting multiple http-request track-scX rules generates entries
which never expires.

If there was already an entry registered by a previous http rule
'stream_track_stkctr(&s->stkctr[rule->action], t, ts)' didn't
register the new 'ts' into the stkctr. And function is left
with no reference on 'ts' whereas refcount had been increased
by the '_get_entry'

The patch applies the same policy as the one showed on tcp track
rules and if there is successive rules the track counter keep the
first entry registered in the counter and nothing more is computed.

After validation this should be backported in all versions.

BUG/MINOR: connection: Missing QUIC initialization

The QUIC connection struct connection member was not initialized. This may
make randomly haproxy handle TLS connections as QUIC ones only when QUIC support
is enabled leading to such OpenSSL errors (captured from a reg test output, TLS
Client-Hello callback failed):

    OpenSSL error[0x10000085] OPENSSL_internal: CONNECTION_REJECTED
    OpenSSL error[0x10000410] OPENSSL_internal: SSLV3_ALERT_HANDSHAKE_FAILURE
    OpenSSL error[0x1000009a] OPENSSL_internal: HANDSHAKE_FAILURE_ON_CLIENT_HELLO

This patch should fix #1168 github issue.

OPTIM: task: automatically adjust the default runqueue-depth to the threads

The recent default runqueue size reduction appeared to have significantly
lowered performance on low-thread count configs. Testing various values
runqueue values on different workloads under thread counts ranging from
1 to 64, it appeared that lower values are more optimal for high thread
counts and conversely. It could even be drawn that the optimal value for
various workloads sits around 280/sqrt(nbthread), and probably has to do
with both the L3 cache usage and how to optimally interlace the threads'
activity to minimize contention. This is much easier to optimally
configure, so let's do this by default now.

MINOR: task: give the scheduler a bit more flexibility in the runqueue size

Instead of setting a hard-limit on runqueue-depth and keeping it short
to maintain fairness, let's allow the scheduler to automatically cut
the existing one in two equal halves if its size is between the configured
size and its double. This will allow to increase the default value while
keeping a low latency.

BUG/MINOR: sample: Rename SenderComID/TargetComID to SenderCompID/TargetCompID

The recently introduced Financial Information eXchange (FIX)
converters have some hard coded tags based on the specification that
were misspelled. Specifically, SenderComID and TargetComID should
be SenderCompID and TargetCompID according to the specification [1][2].

This patch updates all references, which includes the converters
themselves, the regression test, and the documentation.

[1] https://fiximate.fixtrading.org/en/FIX.5.0SP2_EP264/tag49.html
[2] https://fiximate.fixtrading.org/en/FIX.5.0SP2_EP264/tag56.html

BUG/MEDIUM: ssl: properly remove the TASK_HEAVY flag at end of handshake

Emeric found that SSL+keepalive traffic had dropped quite a bit in the
recent changes, which could be bisected to recent commit 9205ab31d
("MINOR: ssl: mark the SSL handshake tasklet as heavy"). Indeed, a
first incarnation of this commit made use of the TASK_SELF_WAKING
flag but the last version directly used TASK_HEAVY, but it would still
continue to remove the already absent TASK_SELF_WAKING one instead of
TASK_HEAVY. As such, the SSL traffic remained processed with low
granularity.

No backport is needed as this is only 2.4.

CLEANUP: config: also address the cfg_keyword API change in the compression code

The tests were made on slz and the zlib parsers for memlevel and windowsize
managed to escape the change made by commit 018251667 ("CLEANUP: config: make
the cfg_keyword parsers take a const for the defproxy"). This is now fixed.

BUG/MEDIUM: resolvers: handle huge responses over tcp servers.

Parameter "accepted_payload_size" is currently considered regardless
the used nameserver is using TCP or UDP. It remains mandatory to annouce
such capability to support e-dns, so a value have to be announced also
in TCP. Maximum DNS message size in TCP is limited by protocol to 65535
and so for UDP (65507) if system supports such UDP messages. But
the maximum value for this option was arbitrary forced to 8192.

This patch change this maximum to 65535 to allow user to set bigger value
for UDP if its system supports. It also sets accepted_payload_size
in TCP allowing to retrieve huge responses if the configuration uses
TCP nameservers.

The request announcing the accepted_payload_size capability is currently
built at resolvers level and is common to all used nameservers of the
section regardess transport protocol used. A further patch should be
made to at least specify a different payload size depending of the
transport, and perhaps could be forced to 65535 in case of TCP and
maximum would be forced back to 65507 matching UDP max.

This patch is appliable since 2.4 version

CLEANUP: stream: rename a few remaining occurrences of "stream *sess"

These are some leftovers from the ancient code where they were still
called sessions, but these areas in the code remain confusing due to
this naming. They were now called "strm" which will not even affect
indenting nor alignment.

MINOR: build: force CC to set a return code when probing options

gcc returns non zero code if an option is not supported (tested
from 6.5 to 10.2).

  $ gcc -Wfoobar -E -xc - -o /dev/null < /dev/null > /dev/null 2>&1 ; echo $?
  1

clang always return 0 if an option in not recognized unless
-Werror is also passed, preventing a correct probing of options
supported by the compiler (tested with clang 6.0.1 to 11.1.0).

  $ clang -Wfoobar -E -xc - -o /dev/null < /dev/null > /dev/null 2>&1 ; echo $?
  0
  $ clang -Werror -Wfoobar -E -xc - -o /dev/null < /dev/null > /dev/null 2>&1 ; echo $?
  1

Please note today this is not visible since clang 11 exit with SIGABRT
or with return code 1 on older version due to bad file descriptor from
file descriptor handling

  $ clang -Wfoobar -E -xc - -o /dev/null < /dev/null 2>&0 ; echo $?
  Aborted (core dumped)
  134
  $ clang -Wfoobar -E -xc - -o /dev/null < /dev/null ; echo $?
  warning: unknown warning option '-Wfoobar'; did you mean '-Wformat'? [-Wunknown-warning-option]
  1 warning generated.
  0
  $ clang-11 -Werror -Wfoobar -E -xc - -o /dev/null < /dev/null ; echo $?
  error: unknown warning option '-Wfoobar'; did you mean '-Wformat'? [-Werror,-Wunknown-warning-option]
  1

This specific issue is being tracked with clang upstream in https://bugs.llvm.org/show_bug.cgi?id=49463

BUG/MEDIUM: session: NULL dereference possible when accessing the listener

When implementing a client applet, a NULL dereference was encountered on
the error path which increment the counters.

Indeed, the counters incremented are the one in the listener which does
not exist in the case of client applets, so in sess->listener->counters,
listener is NULL.

This patch fixes the access to the listener structure when accessing
from a sesssion, most of the access are the counters in error paths.

Must be backported as far as 1.8.

BUILD: connection: do not use VAR_ARRAY in struct tlv

It was brought by commit c44b8de99 ("CLEANUP: connection: Use `VAR_ARRAY`
in `struct tlv` definition") but breaks the build with clang. Actually it
had already been done 6 months ago by commit 4987a4744 ("CLEANUP: tree-wide:
use VAR_ARRAY instead of [0] in various definitions") then reverted by
commit 441b6c31e ("BUILD: connection: fix build on clang after the VAR_ARRAY
cleanup") which explained the same thing but didn't place a comment in the
code to justify this (in short it's just an end of struct marker).

CLEANUP: config: make the cfg_keyword parsers take a const for the defproxy

The default proxy was passed as a variable to all parsers instead of a
const, which is not without risk, especially when some timeout parsers used
to make some int pointers point to the default values for comparisons. We
want to be certain that none of these parsers will modify the defaults
sections by accident, so it's important to mark this proxy as const.

This patch touches all occurrences found (89).

BUILD: bug: refine HA_LINK_ERROR() to only be used on gcc and derivatives

TCC happens to define __OPTIMIZE__ at -O2 but doesn't proceed with dead
code elimination, resulting in ha_free() to always reference the link
error symbol. Let's condition this test on __GCC__ which others like
Clang also define.

BUILD: task: fix build at -O0 with threads disabled

grq_total was incremented when picking tasks from the global run queue,
but this variable was not defined with threads disabled, and the code
was optimized away at -O2. No backport is needed.

CLEANUP: connection: Consistently use `struct ist` to process all TLV types

Instead of directly poking around within the `struct tlv tlv_packet` the actual
value will be consumed using a `struct ist`.

MINOR: connection: Use a `struct ist` to store proxy_authority

This makes the code cleaner, because proxy_authority can be handled like
proxy_unique_id.

CLEANUP: connection: Use istptr / istlen for proxy_unique_id

Don't access the ist's fields directly, use the helper functions instead.

CLEANUP: connection: Remove useless test for NULL before calling `pool_free()`

`pool_free()` is a noop when the given pointer is NULL. No need to test.

CLEANUP: connection: Use `VAR_ARRAY` in `struct tlv` definition

This is for consistency with `struct tlv_ssl`.

[RELEASE] Released version 2.4-dev11

Released version 2.4-dev11 with the following main changes :
    - CI: codespell: skip Makefile for spell check
    - CLEANUP: assorted typo fixes in the code and comments
    - BUG/MINOR: tcp-act: Don't forget to set the original port for IPv4 set-dst rule
    - BUG/MINOR: connection: Use the client's dst family for adressless servers
    - BUG/MEDIUM: spoe: Kill applets if there are pending connections and nbthread > 1
    - CLEANUP: Use ist2(const void*, size_t) whenever possible
    - CLEANUP: Use IST_NULL whenever possible
    - BUILD: proxy: Missing header inclusion for quic_transport_params_init()
    - BUILD: quic: Implicit conversion between SSL related enums.
    - DOC: spoe: Add a note about fragmentation support in HAProxy
    - MINOR: contrib: add support for heartbeat control messages.
    - MINOR: contrib: Enhance peers dissector heuristic.
    - BUG/MINOR: mux-h2: Fix typo in scheme adjustment
    - CLEANUP: Reapply the ist2() replacement patch
    - CLEANUP: Use istadv(const struct ist, const size_t) whenever possible
    - CLEANUP: Use isttest(const struct ist) whenever possible
    - Revert "CI: Pin VTest to a known good commit"
    - CLEANUP: backend: fix a wrong comment
    - BUG/MINOR: backend: free allocated bind_addr if reuse conn
    - MINOR: backend: handle reuse for conns with no server as target
    - REGTESTS: test http-reuse if no server target
    - BUG/MINOR: hlua: Don't strip last non-LWS char in hlua_pushstrippedstring()
    - BUG/MINOR: server-state: Don't load server-state file for disabled backends
    - CLEANUP: dns: Use DISGUISE() on a never-failing ring_attach() call
    - CLEANUP: dns: Remove useless test on ns->dgram in dns_connect_nameserver()
    - DOC: fix originalto except clause on destination address
    - CLEANUP: Use the ist() macro whenever possible
    - CLEANUP: Replace for loop with only a condition by while
    - REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h
    - BUG/MINOR: mt-list: always perform a cpu_relax call on failure
    - MINOR: atomic: add armv8.1-a atomics variant for cas-dw
    - MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs
    - BUG/MINOR: ssl: don't truncate the file descriptor to 16 bits in debug mode
    - MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS
    - MINOR: pools: double the local pool cache size to 1 MB
    - MINOR: stream: use ABORT_NOW() and not abort() in stream_dump_and_crash()
    - CLEANUP: stream: explain why we queue the stream at the head of the server list
    - MEDIUM: backend: use a trylock when trying to grab an idle connection
    - REORG: tools: promote the debug PRNG to more general use as a statistical one
    - OPTIM: lb-random: use a cheaper PRNG to pick a server
    - MINOR: task: stop abusing the nice field to detect a tasklet
    - MINOR: task: move the nice field to the struct task only
    - MEDIUM: task: extend the state field to 32 bits
    - MINOR: task: add an application specific flag to the state: TASK_F_USR1
    - MEDIUM: muxes: mark idle conns tasklets with TASK_F_USR1
    - MINOR: xprt: add new xprt_set_idle and xprt_set_used methods
    - MEDIUM: ssl: implement xprt_set_used and xprt_set_idle to relax context checks
    - MINOR: server: don't read curr_used_conns multiple times
    - CLEANUP: global: reorder some fields to respect cache lines
    - CLEANUP: sockpair: silence a coverity check about fcntl()
    - CLEANUP: lua: set a dummy file name and line number on the dummy servers
    - MINOR: server: add a global list of all known servers
    - MINOR: cfgparse: finish to set up servers outside of the proxy setup loop
    - MINOR: server: allocate a per-thread struct for the per-thread connections stuff
    - MINOR: server: move actconns to the per-thread structure
    - CLEANUP: server: reorder some fields in the server struct to respect cache lines
    - MINOR: backend: add a BUG_ON if conn mux NULL in connect_server
    - BUG/MINOR: backend: fix condition for reuse on mode HTTP
    - BUILD: Fix build when using clang without optimizing.
    - CLEANUP: assorted typo fixes in the code and comments

CLEANUP: assorted typo fixes in the code and comments

This is 19th iteration of typo fixes

BUILD: Fix build when using clang without optimizing.

ha_free() uses code that attempts to set a non-existant variable to provoke
a link-time error, with the expectation that the compiler will not omit that
if the code is unreachable. However, clang will emit it when compiling with
no optimization, so only do that if __OPTIMIZE__ is defined.

BUG/MINOR: backend: fix condition for reuse on mode HTTP

This commit is a fix/complement to the following one :
08d87b3f49867440f66aee09173c84bf58cbc859
BUG/MEDIUM: backend: never reuse a connection for tcp mode

It fixes the check for the early insertion of backend connections in
the reuse lists if the backend mode is HTTP.

The impact of this bug seems limited because :
- in tcp mode, no insertion is done in the avail list as mux_pt does not
  support multiple streams.
- in http mode, muxes are also responsible to insert backend connections
  in lists in their detach functions. Prior to this fix the reuse rate
  could be slightly inferior.

It can be backported to 2.3.

MINOR: backend: add a BUG_ON if conn mux NULL in connect_server

Currently, there seems to be no way to have the transport layer ready
but not the mux in the function connect_server. Add a BUG_ON to report
if this implicit condition is not true anymore.

This should fix coverity report from github issue #1120.

CLEANUP: server: reorder some fields in the server struct to respect cache lines

There's currently quite some thread contention in the server struct
because frequently fields accessed fields are mixed with those being
often written to by any thread. Let's split this a little bit to
separate a few areas:
  - pure config / admin / operating status (almost never changes)
  - idle and queuing (fast changes, done almost together)
  - LB (fast changes, not necessarily dependent on the above)
  - counters (fast changes, at a different instant again)

MINOR: server: move actconns to the per-thread structure

The actconns list creates massive contention on low server counts because
it's in fact a list of streams using a server, all threads compete on the
list's head and it's still possible to see some watchdog panics on 48
threads under extreme contention with 47 threads trying to add and one
thread trying to delete.

Moving this list per thread is trivial because it's only used by
srv_shutdown_streams(), which simply required to iterate over the list.

The field was renamed to "streams" as it's really a list of streams
rather than a list of connections.

MINOR: server: allocate a per-thread struct for the per-thread connections stuff

There are multiple per-thread lists in the listeners, which isn't the
most efficient in terms of cache, and doesn't easily allow to store all
the per-thread stuff.

Now we introduce an srv_per_thread structure which the servers will have an
array of, and place the idle/safe/avail conns tree heads into. Overall this
was a fairly mechanical change, and the array is now always initialized for
all servers since we'll put more stuff there. It's worth noting that the Lua
code still has to deal with its own deinit by itself despite being in a
global list, because its server is not dynamically allocated.

MINOR: cfgparse: finish to set up servers outside of the proxy setup loop

Till now servers were only initialized as part of the proxy setup loop,
which doesn't cover peers, tcp log, dns, lua etc. Let's move this part
out of this loop and instead iterate over all registered servers. This
way we're certain to visit them all.

The patch looks big but it's just a move of a large block with the
corresponding reindent (as can be checked with diff -b). It relies
on the two previous ones ("MINOR: server: add a global list of all
known servers and" and "CLEANUP: lua: set a dummy file name and line
number on the dummy servers").

MINOR: server: add a global list of all known servers

It's a real pain not to have access to the list of all registered servers,
because whenever there is a need to late adjust their configuration, only
those attached to regular proxies are seen, but not the peers, lua, logs
nor DNS.

What this patch does is that new_server() will automatically add the newly
created server to a global list, and it does so as well for the 1 or 2
statically allocated servers created for Lua. This way it will be possible
to iterate over all of them.

CLEANUP: lua: set a dummy file name and line number on the dummy servers

The "socket_tcp" and "socket_ssl" servers had no config file name nor
line number, but this is sometimes annoying during debugging or later
in error messages, while all other places using new_server() or
parse_server() make sure to have a valid file:line set. Let's set
something to address this.

CLEANUP: sockpair: silence a coverity check about fcntl()

This is about coverity complaining that we didn't check the fcntl call
which can't fail, let's consume it. This is issue #1158.

CLEANUP: global: reorder some fields to respect cache lines

Some entries are atomically updated by various threads, such as the
global counters, and they're mixed with others which are read all the
time like the mode. This explains why "perf" was seeing a huge access
cost on global.mode in process_stream()! Let's reorder them so that
the static config stuff is at the beginning and the live stuff is at
the end.

MINOR: server: don't read curr_used_conns multiple times

This one is added atomically and we reread it just after this, causing
a second memory load that is visible in the perf profile.

MEDIUM: ssl: implement xprt_set_used and xprt_set_idle to relax context checks

Currently the SSL layer checks the validity of its tasklet's context just
in case it would have been stolen, had the connection been idle. Now it
will be able to be notified by the mux when this situation happens so as
not to have to grab the idle connection lock on each pass. This reuses the
TASK_F_USR1 flag just as the muxes do.

MINOR: xprt: add new xprt_set_idle and xprt_set_used methods

These functions are used on the mux layer to indicate that the connection
is becoming idle and that the xprt ought to be careful before checking the
context or that it's not idle anymore and that the context is safe. The
purpose is to allow a mux which is going to release a connection to tell
the xprt to be careful when touching it. At the moment, the xprt are
always careful and that's costly so we want to have the ability to relax
this a bit.

No xprt layer uses this yet.

MEDIUM: muxes: mark idle conns tasklets with TASK_F_USR1

The muxes are touching the idle_conns_lock all the time now because
they need to be careful that no other thread has stolen their tasklet's
context.

This patch changes this a little bit by setting the TASK_F_USR1 flag on
the tasklet before marking a connection idle, and removing it once it's
not idle anymore. Thanks to this we have the guarantee that a tasklet
without this flag cannot be present in an idle list and does not need
to go through this costly lock. This is especially true for front
connections.

MINOR: task: add an application specific flag to the state: TASK_F_USR1

This flag will be usable by any application. It will be preserved across
wakeups so the application can use it to do various stuff. Some I/O
handlers will soon benefit from this.

MEDIUM: task: extend the state field to 32 bits

It's been too short for quite a while now and is now full. It's still
time to extend it to 32-bits since we have room for this without
wasting any space, so we now gained 16 new bits for future flags.

The values were not reassigned just in case there would be a few
hidden u16 or short somewhere in which these flags are placed (as
it used to be the case with stream->pending_events).

The patch is tagged MEDIUM because this required to update the task's
process() prototype to use an int instead of a short, that's quite a
bunch of places.

MINOR: task: move the nice field to the struct task only

The nice field isn't needed anymore for the tasklet so we can move it
from the TASK_COMMON area into the struct task which already has a
hole around the expire entry.

MINOR: task: stop abusing the nice field to detect a tasklet

It's cleaner to use a flag from the task's state to detect a tasklet
and it's even cheaper. One of the best benefits is that this will
allow to get the nice field out of the common part since the tasklet
doesn't need it anymore. This commit uses the last task bit available
but that's temporary as the purpose of the change is to extend this.

OPTIM: lb-random: use a cheaper PRNG to pick a server

The PRNG used by the "random" LB algorithm was the central one which tries
hard to produce "correct" (i.e. hardly predictable) values suitable for use
in UUIDs or cookies. It's much too expensive for pure load balancing where
a cheaper thread-local PRNG is sufficient, and the current PRNG is part of
the hot places when running with many threads.

Let's switch to the stastistical PRNG instead, it's thread-local, very
fast, and with a period of (2^32)-1 which is more than enough to decide
on a server.

REORG: tools: promote the debug PRNG to more general use as a statistical one

We frequently need to access a simple and fast PRNG for statistical
purposes. The debug_prng() function did exactly this using a xorshift
generator but its use was limited to debug only. Let's move this to
tools.h and tools.c to make it accessible everywhere. Since it needs to
be fast, its state is thread-local. An initialization function starts a
different initial value for each thread for better distribution.

MEDIUM: backend: use a trylock when trying to grab an idle connection

In conn_backend_get() we can cause some extreme contention due to the
idle_conns_lock. Indeed, even though it's per-thread, it still causes
high contention when running with many threads. The reason is that all
threads which do not have any idle connections are quickly skipped,
till the point where there are still some, so the first reaching that
point will grab the lock and the other ones wait behind. From this
point, all threads are synchronized waiting on the same lock, and
will follow the leader in small jumps, all hindering each other.

Here instead of doing this we're using a trylock. This way when a thread
is already checking a list, other ones will continue to next thread. In
the worst case, a high contention will lead to a few new connections to
be set up, but this may actually be what is required to avoid contention
in the first place. With this change, the contention has mostly
disappeared on this lock (it's still present in muxes and transport
layers due to the takeover).

Surprisingly, checking for emptiness of the tree root before taking
the lock didn't address any contention.

A few improvements are still possible and desirable here. The first
one would be to avoid seeing all threads jump to the next one. We
could have each thread use a different prime number as the increment
so as to spread them across the entire table instead of keeping them
synchronized. The second one is that the lock in the muck layers
shouldn't be needed to check for the tasklet's context availability.

CLEANUP: stream: explain why we queue the stream at the head of the server list

In stream_add_srv_conn() MT_LIST_ADD() is used instead of MT_LIST_ADDQ(),
resulting in the stream being queued at the end of the server list. This
has no particular effect since we cannot dump the streams on a server,
and this is only used by "shutdown sessions" on a server. But it also
turns out to be significantly faster due to the shorter recovery from
the conflict with an adjacent MT_LIST_DEL(), thus it remains desirable
to use it, but at least it deserves a comment.

In addition to this, it's worth mentioning that this list should creates
extreme contention with threads while almost never used. It should be
made per-thread just like the global streams list.

MINOR: stream: use ABORT_NOW() and not abort() in stream_dump_and_crash()

Using abort() occasionally results in unexploitable core due to issues
rewinding the stack. Let's use ABORT_NOW() which in addition to crashing
much closer to the call point also has the benefit of showing the call
trace.

MINOR: pools: double the local pool cache size to 1 MB

The reason is that H2 can already require 32 16kB buffers for the mux
output at once, which will deplete the local cache. Thus it makes sense
to go further to leave some time to other connection to release theirs.
In addition, the L2 cache on modern CPUs is already 1 MB, so this change
is welcome in any case.

MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS

We've reached a point where the global pools represent a significant
bottleneck with threads. On a 64-core machine, the performance was
divided by 8 between 32 and 64 H2 connections only because there were
not enough entries in the local caches to avoid picking from the global
pools, and the contention on the list there was very high. It becomes
obvious that we need to have an array of lists, but that will require
more changes.

In parallel, standard memory allocators have improved, with tcmalloc
and jemalloc finding their ways through mainstream systems, and glibc
having upgraded to a thread-aware ptmalloc variant, keeping this level
of contention here isn't justified anymore when we have both the local
per-thread pool caches and a fast process-wide allocator.

For these reasons, this patch introduces a new compile time setting
CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are
enabled with thread local pool caches, and we know we have a fast
thread-aware memory allocator (currently set for glibc>=2.26). In this
case we entirely bypass the global pool and directly use the standard
memory allocator when missing objects from the local pools. It is also
possible to force it at compile time when a good allocator is used with
another setup.

It is still possible to re-enable the global pools using
CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the
operating system's default allocator, or when building with a recent
libc but a different allocator which provides other benefits but does
not scale well with threads.

BUG/MINOR: ssl: don't truncate the file descriptor to 16 bits in debug mode

Errors reported by ssl_sock_dump_errors() to stderr would only report the
16 lower bits of the file descriptor because it used to be casted to ushort.
This can be backported to all versions but has really no importance in
practice since this is never seen.

MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs

There finally is a way to support register pairs on aarch64 assembly
under gcc, it's just undocumented, like many of the options there :-(
As indicated below, it's possible to pass "%H" to mention the high
part of a register pair (e.g. "%H0" to go with "%0"):

https://patchwork.ozlabs.org/project/gcc/patch/59368A74.2060908@foss.arm.com/

By making local variables from pairs of registers via a struct (as
is used in IST for example), we can let gcc choose the correct register
pairs and avoid a few moves in certain situations. The code is now slightly
more efficient than the previous one on AWS' Graviton2 platform, and
noticeably smaller (by 4.5kB approx). A few tests on older releases show
that even Linaro's gcc-4.7 used to support such register pairs and %H,
and by then ATOMICS were not supported so this should not cause build
issues, and as such this patch replaces the earlier implementation.

MINOR: atomic: add armv8.1-a atomics variant for cas-dw

This variant uses the CASP instruction available on armv8.1-a CPU cores,
which is detected when __ARM_FEATURE_ATOMICS is set (gcc-linaro >= 7,
mainline >= 9). This one was tested on cortex-A55 (S905D3) and on AWS'
Graviton2 CPUs.

The instruction performs way better on high thread counts since it
guarantees some forward progress when facing extreme contention while
the original LL/SC approach is light on low-thread counts but doesn't
guarantee progress.

The implementation is not the most optimal possible. In particular since
the instruction requires to work on register pairs and there doesn't seem
to be a way to force gcc to emit register pairs, we have to decide to force
to use the pair (x0,x1) to store the old value, and (x2,x3) to store the
new one, and this necessarily involves some extra moves. But at least it
does improve the situation with 16 threads and more. See issue #958 for
more context.

Note, a first implementation of this function was making use of an
input/output constraint passed using "+Q"(*(void**)target), which was
resulting in smaller overall code than passing "target" as an input
register only. It turned out that the cause was directly related to
whether the function was inlined or not, hence the "forceinline"
attribute. Any changes to this code should still pay attention to this
important factor.

BUG/MINOR: mt-list: always perform a cpu_relax call on failure

On highly threaded machines it is possible to occasionally trigger the
watchdog on certain contended areas like the server's connection list,
because while the mechanism inherently cannot guarantee a constant
progress, it lacks CPU relax calls which are absolutely necessary in
this situation to let a thread finish its job.

The loop's "while (1)" was changed to use a "for" statement calling
__ha_cpu_relax() as its continuation expression. This way the "continue"
statements jump to the unique place containing the pause without
excessively inflating the code.

This was sufficient to definitely fix the problem on 64-core ARM Graviton2
machines. This patch should probably be backported once it's confirmed it
also helps on many-cores x86 machines since some people are facing
contention in these environments. This patch depends on previous commit
"REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h".

An attempt was made to first read the value before exchanging, and it
significantly degraded the performance. It's very likely that this caused
other cores to lose exclusive ownership on their line and slow down their
next xchg operation.

In addition it was found that MT_LIST_ADD is significantly faster than
MT_LIST_ADDQ under high contention, because it fails one step earlier
when conflicting with an adjacent MT_LIST_DEL(). It might be worth
switching some operations' order to favor MT_LIST_ADDQ() instead.

REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h

There is some confusion here as we need to place some cpu_relax statements
in some loops where it's not easily possible to condition them on the use
of threads. That's what atomic.h already does. So let's take the various
pl_cpu_relax() implementations from there and place them in atomic.h under
the name __ha_cpu_relax() and let them adapt to the presence or absence of
threads and to the architecture (currently only x86 and aarch64 use a barrier
instruction), though it's very likely that arm would work well with a cache
flushing ISB instruction as well).

This time they were implemented as expressions returning 1 rather than
statements, in order to ease their placement as the loop condition or the
continuation expression inside "for" loops. We should probably do the same
with barriers and a few such other ones.