git.ipfire.org Git - thirdparty/haproxy.git/log

OPTIM: vars: use a cebtree instead of a list for variable names

Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.

But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:

  for i in {0..100}; do
    printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
    for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
    echo;
  done

The performance and a 4-core skylake 4.4 GHz reaches 85k RPS with a perf
profile showing:

  Samples: 170K of event 'cycles', Event count (approx.): 142378815419
  Overhead  Shared Object            Symbol
    56.39%  haproxy                  [.] var_to_smp
     6.65%  haproxy                  [.] var_set.part.0
     5.76%  haproxy                  [.] sample_process_cnv
     3.23%  haproxy                  [.] sample_conv_var2smp
     2.88%  haproxy                  [.] sample_conv_arith_add
     2.33%  haproxy                  [.] __pool_alloc
     2.19%  haproxy                  [.] action_store
     2.13%  haproxy                  [.] vars_get_by_desc
     1.87%  haproxy                  [.] smp_dup

[above, var_to_smp() calls var_get() under the read lock].

By switching to a binary tree, the cost is significantly lower, the
performance reaches 117k RPS (+37%) with this profile:

  Samples: 170K of event 'cycles', Event count (approx.): 142323631229
  Overhead  Shared Object            Symbol
    40.22%  haproxy                  [.] cebu64_lookup
     7.12%  haproxy                  [.] sample_process_cnv
     6.15%  haproxy                  [.] var_to_smp
     4.75%  haproxy                  [.] cebu64_insert
     3.79%  haproxy                  [.] sample_conv_var2smp
     3.40%  haproxy                  [.] cebu64_delete
     3.10%  haproxy                  [.] sample_conv_arith_add
     2.36%  haproxy                  [.] action_store
     2.32%  haproxy                  [.] __pool_alloc
     2.08%  haproxy                  [.] vars_get_by_desc
     1.96%  haproxy                  [.] smp_dup
     1.75%  haproxy                  [.] var_set.part.0
     1.74%  haproxy                  [.] cebu64_first
     1.07%  [kernel]                 [k] aq_hw_read_reg
     1.03%  haproxy                  [.] pool_put_to_cache
     1.00%  haproxy                  [.] sample_process

The performance lowers a bit earlier than with the list however. What
can be seen is that the performance maintains a plateau till 25 vars,
starts degrading a little bit for the tree while it remains stable till
28 vars for the list. Then both cross at 42 vars and the list continues
to degrade doing a hyperbole while the tree resists better. The biggest
loss is at around 32 variables where the list stays 10% higher.

Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.

In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.

IMPORT: import cebtree (compact elastic binary trees)

This is an import of the compact elastic binary trees at commit
a9cd84a ("OPTIM: descent: better prefetch less and for writes when
deleting")

These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).

The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.

MINOR: vars: remove the emptiness tests in callers before pruning

All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.

OPTIM: vars: inline vars_prune() to avoid many calls

Many configs don't have variables and call it for no reason, and even
configs with variables don't necessarily have some in all scopes.

OPTIM: vars: remove the unneeded lock in vars_prune_*

vars_prune() and vars_prune_all() take the variable lock while purging
all variables from a head. However this is not needed:
  - proc scope variables are only purged during deinit, hence no lock
    is needed ;
  - all other scopes are attached to entities bound to a single thread
    so no lock is needed either.

Removing the lock saves about 0.5% CPU on variables-intensive setups,
but above all simplify the code, so let's do it.

OPTIM: sample: don't check casts for samples of same type

Originally when converters were created, they were mostly for casting
types. Nowadays we have many artithmetic converters to perform operations
on integers, and a number of converters operating on strings. Both of
these categories most often do not need any cast since the input and
output types are the same, which is visible as the cast function is
c_none. However, profiling shows that when heavily using arithmetic
converters, it's possible to spend up to ~7% of the time in
sample_process_cnv(), a good part of which is only in accessing the
sample_casts[] array. Simply avoiding this lookup when input and ouput
types are equal saves about 2% CPU on such setups doing intensive use
of converters.

BUG/MEDIUM: queue: implement a flag to check for the dequeuing

As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.

As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeueing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially exepected to be extremely
rare (and still is, except that now it can become sticky). Previously
what was happening is that such queued requests would just time out
and since that was very rare, nobody would notice.

The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time but it doesn't offer only that
guarantee since it also prevents a thread from dequeuing if another
one is in the process of queuing. We need a different criterion.

What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.

It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.

Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
allowing to spot the problem.

This patch should be backported wherever the patch above was backported.

BUG/MINOR: clock: validate that now_offset still applies to the current date

We want to make sure that now_offset is still valid for the current
date: another thread could very well have updated it by detecting a
backwards jump, and at the very same moment the time got fixed again,
that we retrieve and add to the new offset, which results in a larger
jump. Normally, for this to happen, it would mean that before_poll
was also affected by the jump and was detected before and bounded
within 2 seconds, resulting in max 2 seconds perturbations.

Here we try to detect this situation and fall back to re-adjusting the
offset instead.

It's more of a strengthening of what's done by commit e8b1ad4c2b
("BUG/MEDIUM: clock: also update the date offset on time jumps") than a
pure fix, in that the issue was not direclty observed but it's visibly
possible by reading the code, so this should be backported along with
the patch above. This is related to issue GH #2704.

Note that this could be simplified in terms of operations by migrating
the deadlines to nanoseconds, but this was the path to least intrusive
changes.

BUG/MINOR: clock: make time jump corrections a bit more accurate

Since commit e8b1ad4c2b ("BUG/MEDIUM: clock: also update the date offset
on time jumps") we try to update the now_offet based on the last known
valid date. But if it's off compared to the global_now_ns date shared
by other threads, we'll get the time off a little bit. When this happens,
we should consider the most recent of these dates so that if the global
date was already known to be more recent, we should use it and stick to
it. This will avoid setting too large an offset that could in turn provoke
a larger jump on another thread.

This is related to issue GH #2704.

This can be backported to other branches having the patch above.

BUG/MINOR: polling: fix time reporting when using busy polling

Since commit beb859abce ("MINOR: polling: add an option to support
busy polling") the time and status passed to clock_update_local_date()
were incorrect. Indeed, what is considered is the before_poll date
related to the configured timeout which does not correspond to what
is passed to the poller. That's not correct because before_poll+the
syscall's timeout will be crossed by the current date 100 ms after
the start of the poller. In practice it didn't happen when the poller
was limited to 1s timeout but at one minute it happens all the time.

That's particularly visible when running a multi-threaded setup with
busy polling and only half of the threads working (bind ... thread even).
In this case, the fixup code of clock_update_local_date() is executed
for each round of busy polling. The issue was made really visible
starting with recent commit e8b1ad4c2b ("BUG/MEDIUM: clock: also
update the date offset on time jumps") because upon a jump, the
shared offset is reset, while it should not be in this specific
case.

What needs to be done instead is to pass the configured timeout of
the poller (and not of the syscall), and always pass "interrupted"
set so as to claim we got an event (which is sort of true as it just
means the poller returned instantly). In this case we can still
detect backwards/forward jumps and will use a correct boundary
for the maximum date that covers the whole loop.

This can be backported to all versions since the issue was introduced
with busy-polling in 1.9-dev8.

MEDIUM: h1: Accept invalid T-E values with accept-invalid-http-response option

Since the 2.6, A parsing error is reported when the chunked encoding is
found twice. As stated in RFC9112, A sender must not apply the chunked
transfer coding more than once to a message body. It means only one chunked
coding must be found. In addition, empty values are also rejected becaues it
is forbidden by RFC9110.

However, in both cases, it may be useful to relax the rules for trusted
legacy servers when accept-invalid-http-response option is set. Especially
because it was accepted on 2.4 and older. In addition, T-E header is now
sanitized before sending it. It is not a problem Because it is a hop-by-hop
header

Note that it remains invalid on client side because there is no good reason
to relax the parsing on this side. We can argue a server is trusted so we
can decide to support some legacy behavior. It is not true on client side
and it is highly suspicious if a client is sending an invalid T-E header.

Note also we continue to reject unsupported T-E values (so all codings except
"chunked"). Because the "TE" header is sanitized and cannot contain other value
than "Trailers", there is absolutely no reason for a server to use something
else.

This patch should fix the issue #2677. It could probably be backported as
far as 2.6 if necessary.

DOC: server: document what to check for when adding new server keywords

It's too easy to overlook the dynamic servers when adding new server
keywords, and the fields on each keyword line are totally obscure. This
commit adds a title to each column of the table and explains what is
expected and what to check for when adding a keyword.

MINOR: server: allow init-state for dynamic servers

Commit 50322df introduced the init-state keyword, but it didn't enable
it for dynamic servers. However, this feature is perfectly desirable
for virtual servers too, where someone would like a server inlived
through "set server be1/srv1 state ready" to be put out of maintenance
in down state until the next health check succeeds.
At reading the code, it seems that it's only a matter of allowing this
keyword for dynamic servers, as current code path calls
srv_adm_set_ready() which incidentally triggers a call to
_srv_update_status_adm().

REGTESTS: shorten a bit the delay for the h1/h2 upgrade test

Commit d6c4ed9a96 ("REGTESTS: h1/h2: Update script testing H1/H2
protocol upgrades") introduced a 0.5 second delay which is higher
than those of most other tests (usually 0.05 or 0.2) and triggers
timeouts on my side. Let's just shorten it to 0.2 since its goal
is only to send data separately.

Note: maybe a barrier approach would be possible, though not
studied.

BUG/MINOR: pattern: do not leave a leading comma on "set" error messages

Commit 4f2493f355 ("BUG/MINOR: pattern: pat_ref_set: fix UAF reported by
coverity") dropped the condition to concatenate error messages and as
such introduced a leading comma in front of all of them. Then commit
911f4d93d4 ("BUG/MINOR: pattern: pat_ref_set: return 0 if err was found")
changed the behavior to stop at the first error anyway, so all the
mechanics dedicated to the concatenation of error messages is no longer
needed and we can simply return the error as-is, without inserting any
comma.

This should be backported where the patches above are backported.

REGTESTS: fix random failures with wrong_ip_port_logging.vtc under load

This test has an expect rule for syslog that looks for [cC]D, to
indicate a client abort or timeout during the data phase. The purpose
was to say that when it fails it must be this, but the very low timeout
(1ms) still makes it prone to succeeding if the machine is highly loaded.

This has become more visible since commit e8b1ad4c2b ("BUG/MEDIUM: clock:
also update the date offset on time jumps") because the clock drift
adjustments are more systematic. Since this commit, running 50 such tests
at twice more than the number of CPUs in parallel is sufficient to yield
errors due to some lines appearing as succeeding:

make reg-tests -- --j $((($(nproc)+1)*2)) --vtestparams -n50 reg-tests/log/wrong_ip_port_logging.vtc

It was observed that pauses up to 300ms were observed in epoll_wait() in
such circumstances, which were properly fixed by the time drift detection..
Another approach would consist in increasing the permitted margin during
which we don't fix the clock drift but that would not be logical since the
base time had really been awaited for.

This should be backported to all stable releases since the commit above
will trigger the issue more often.

BUG/MINOR: h1-htx: Don't flag response as bodyless when a tunnel is established

This reverts commit 225a4d02e1f6a12c0b4f3584949fad3339d71708.

When a 200-OK response is replied to a CONNECT request or a
101-Switching-protocol, a tunnel is considered as established between the
client and the server. However, we must not declare the reponse as
bodyless. Of course, there is no payload, but tunneled data are expected.

Because of this bug, the zero-copy forwarding is disabled on the server
side.

This patch must be backported as far as 2.9.

BUG/MAJOR: mux-h1: Wake SC to perform 0-copy forwarding in CLOSING state

When the mux is woken up on I/O events, if the zero-copy forwarding is
enabled, receives are blocked. In this case, the SC is woken up to be able
to perform 0-copy forwarding to the other side. This works well, except for
the H1C in CLOSING state.

Indeed, in that case, in h1_process(), the SC is not woken up because only
RUNNING H1 connections are considered. As consequence, the mux will ignore
connection closure. The H1 connection remains blocked, waiting for the
shutdown timeout. If no timeout is configured, the H1 connection is never
closed leading to a leak.

This patch should fix leak reported by Damien Claisse in the issue #2697. It
should be backported as far as 2.8.

MEDIUM: ssl/cli: "dump ssl cert" allow to dump a certificate in PEM format

The new "dump ssl cert" CLI command allows to dump a certificate stored
into HAProxy memory. Until now it was only possible to dump the
description of the certificate using "show ssl cert", but with this new
command you can dump the PEM content on the filesystem.

This command is only available on a admin stats socket.

$ echo "@1 dump ssl cert cert.pem" | socat /tmp/master.sock -
-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----

BUG/MEDIUM: pattern: prevent UAF on reused pattern expr

Since c5959fd ("MEDIUM: pattern: merge same pattern"), UAF (leading to
crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr which the second default section relies on is mistakenly freed.

This issue was discovered while trying to reproduce GH #2708. The issue
was particularly tricky to reproduce given the config and sequence
required to make the UAF happen. Hopefully, Github user @asmnek not only
provided useful informations, but since he was able to consistently
trigger the crash in his environment he was able to nail down the crash to
the use of pattern file involved with 2 named default sections. Big thanks
to him.

To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on "do_free" variable to know if the expression should be freed
or not (which proved to be insufficient in our case), let's switch to a
simple refcounting logic. This way, no matter who owns the expression, the
last one attempting to free it will be responsible for freeing it.
Refcount is implemented using a 32bit value which fills a previous 4 bytes
structure gap:

        int                        mflags;               /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          lock;                 /*    88     8 */
(output from pahole)

Even though it was not reproduced in 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theorically affects all
stable versions (up to c5959fd), thus it should be backported to all
stable versions.

BUG/MEDIUM: pattern: prevent uninitialized reads in pat_match_{str,beg}

Using valgrind when running map_beg or map_str, the following error is
reported:

==242644== Conditional jump or move depends on uninitialised value(s)
==242644==    at 0x2E4AB1: pat_match_str (pattern.c:457)
==242644==    by 0x2E81ED: pattern_exec_match (pattern.c:2560)
==242644==    by 0x343176: sample_conv_map (map.c:211)
==242644==    by 0x27522F: sample_process_cnv (sample.c:1330)
==242644==    by 0x2752DB: sample_process (sample.c:1373)
==242644==    by 0x319917: action_store (vars.c:814)
==242644==    by 0x24D451: http_req_get_intercept_rule (http_ana.c:2697)

In fact, the error is legit, because in pat_match_{beg,str}, we
dereference the buffer on len+1 to check if a value was previously set,
and then decide to force NULL-byte if it wasn't set.

But the approach is no longer compatible with current architecture:
data past str.data is not guaranteed to be initialized in the buffer.
Thus we cannot dereference the value, else we expose us to uninitialized
read errors. Moreover, the check is useless, because we systematically
set the ending byte to 0 when the conditions are met.

Finally, restoring the older value after the lookup is not relevant:
indeed, either the sample is marked as const and in such case it
is already duplicated, or the sample is not const and we forcefully add
a terminating NULL byte outside from the actual string bytes (since we're
past str.data), so as we didn't alter effective string data and that data
past str.data cannot be dereferenced anyway as it isn't guaranteed to be
initialized, there's no point in restoring previous uninitialized data.

It could be backported in all stable versions. But since this was only
detected by valgrind and isn't known to cause issues in existing
deployments, it's probably better to wait a bit before backporting it
to avoid any breakage.. although the fix should be theoretically harmless.

BUG/MINOR: pattern: prevent const sample from being tampered in pat_match_beg()

This is a complementary patch to a68affeaa ("BUG/MINOR: pattern: a sample
marked as const could be written"). Indeed the same logic from
pat_match_str() is used there, but we lack the check to ensure that the
sample is not const before writing data to it.

It could be backported to all stable versions.

BUG/MEDIUM: clock: detect and cover jumps during execution

After commit e8b1ad4c2 ("BUG/MEDIUM: clock: also update the date offset
on time jumps"), @firexinghe mentioned that the issue was still present
in their case. In fact it depends on the load, which affects the
probability that the time changes between two poll() calls vs that it
changes during poll(). The time correction code used to only deal with
the latter. But under load if it changes between two poll() calls, what
happens then is that before_poll is off, and after returning from poll(),
the date is within bounds defined by before_poll, so no correction is
applied.

After many tests, it turns out that the most reliable solution without
using CLOCK_MONOTONIC is to prevent before_poll from being earlier than
the previous after_poll (trivial), and to cover forward jumps, we need
to enforce a margin. Given that the watchdog kills a looping task within
2 seconds and that no sane setup triggers it, it seems that 2 seconds
remains a safe enough margin. This means that in the worst case, some
forward jumps of up to 2 seconds will not be corrected, leading to an
apparent fast time and low rates. But this is supposed to be an exceptional
event anyway (typically an admin or crontab running ntpdate).

For future versions, given that we now opportunistically call
now_mono_time() before and after poll(), that returns zero if not
supported, we could imagine relying on this one for the thread's local
time when it's non-null.

REGTESTS: h1/h2: Update script testing H1/H2 protocol upgrades

"http-messaging/protocol_upgrade.vtc" script was updated to test upgrades
for requests with a payload. It should fail when the request is sent to a H2
server. When sent to a H1 server, it should succeed, except if the server
replies before the end of the request.

BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only

Since 1d2d77b27 ("MEDIUM: mux-h1: Return a 501-not-implemented for upgrade
requests with a body"), it is no longer possible to perform a protocol
upgrade for requests with a payload. The main reason was to be able to
support protocol upgrade for H1 client requesting a H2 server. In that case,
the upgrade request is converted to a CONNECT request. So, it is not
possible to convey a payload in that case.

But, it is a problem for anyone wanting to perform upgrades on H1 server
using requests with a payload. It is uncommon but valid. So, now, it is the
H2 multiplexer responsibility to reject upgrade requests, on server side, if
there is a payload. An INTERNAL_ERROR is returned for the H2S in that
case. On H1 side, the upgrade is now allowed, but only if the server waits
for the end of the request to return the 101-Switching-protocol
response. Indeed, it is quite hard to synchronise the frontend side and the
backend side in that case. Asking to servers to fully consume the request
payload before returned the response seems reasonable.

This patch should fix the issue #2684. It could be backported after a period
of observation, as far as 2.4 if possible. But only if it is not too
hard. It depends on "MINOR: mux-h1: Set EOI on SE during demux when both
side are in DONE state".

MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state

For now, this case is already handled for all requests except for those
waiting for a tunnel establishment (CONNECT and protocol upgrades). It is
not an issue because only bodyless requests are supported in these cases. So
the request is always finished at the end of headers and therefore before
the response.

However, to relax conditions for full H1 protocol upgrades (H1 client and
server), this case will be necessary. Indeed, the idea is to be able to
perform protocol upgrades for requests with a payload. Today, the "Upgrade:"
header is removed before sending the request to the server. But to support
this case, this patch is required to properly finish transaction when the
server does not perform the upgrade.

DOC: configuration: place the HAPROXY_HTTP_LOG_FMT example on the correct line

When HAPROXY_HTTP_LOG_FMT was added by commit 537b9e7f36 ("MINOR: config:
add environment variables for default log format"), the example was placed
by accident after the clf log format instead of the HTTP log format,
causing a bit of confusion.

This can be backported to 2.8.

[RELEASE] Released version 3.1-dev7

Released version 3.1-dev7 with the following main changes :
    - MINOR: config: Created env variables for http and tcp clf formats
    - MINOR: mux-quic: add buf_in_flight to QCC debug infos
    - MINOR: mux-quic: correct qcc_bufwnd_full() documentation
    - MINOR: tools: add helpers to backup/clean/restore env
    - MINOR: mworker: restore initial env before wait mode
    - BUG/MINOR: haproxy: free init_env in deinit only if allocated
    - BUILD: tools: environ is not defined in OS X and BSD
    - DEV: coccinelle: add a test to detect unchecked malloc()
    - DEV: coccinelle: add a test to detect unchecked calloc()
    - CI: QUIC Interop AWS-LC: enable ngtcp2 client
    - CI: fix missing comma introduced in 956839c0f68a7722acc586ecd91ffefad2ccb303
    - CI: QUIC Interop: do not run bandwidth measurement tests
    - CI: QUIC Interop: use different artifact names for uploading logs
    - BUILD: quic: 32bits build broken by wrong integer conversions for printf()
    - CLEANUP: ssl: cleanup the clienthello capture
    - MEDIUM: ssl: capture the supported_versions extension from Client Hello
    - MEDIUM: ssl/sample: add ssl_fc_supported_versions_bin sample fetch
    - MEDIUM: ssl: capture the signature_algorithms extension from Client Hello
    - MEDIUM: ssl/sample: add ssl_fc_sigalgs_bin sample fetch
    - MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status
    - BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding
    - BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready
    - BUG/MINIR: proxy: Match on 429 status when trying to perform a L7 retry
    - CLEANUP: haproxy: fix typos in code comment
    - CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE
    - MINOR: tools: Implement ipaddrcpy().
    - MINOR: quic: Implement quic_tls_derive_token_secret().
    - MINOR: quic: Token for future connections implementation.
    - BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
    - MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
    - MINOR: quic: Implement qc_ssl_eary_data_accepted().
    - MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
    - BUG/MEDIUM: quic: always validate sender address on 0-RTT
    - BUILD: quic: fix build errors on FreeBSD since recent GSO changes
    - MINOR: tools: extend str2sa_range to add an alt parameter
    - MINOR: server: add a alt_proto field for server
    - MEDIUM: sock: use protocol when creating socket
    - MEDIUM: protocol: add MPTCP per address support
    - BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC)
    - MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
    - MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit
    - BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown
    - BUG/MEDIUM: cli: Always release back endpoint between two commands on the mcli
    - BUG/MINOR: quic: unexploited retransmission cases for Initial pktns.
    - BUG/MEDIUM: mux-h1: Properly handle empty message when an error is triggered
    - MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places
    - BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
    - BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
    - BUG/MINOR: Crash on O-RTT RX packet after dropping Initial pktns
    - BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut()
    - CLEANUP: assorted typo fixes in the code and comments
    - DEV: patchbot: count the number of backported/non-backported patches
    - DEV: patchbot: add direct links to show only specific categories
    - DEV: patchbot: detect commit IDs starting with 7 chars
    - BUG/MEDIUM: clock: also update the date offset on time jumps
    - MEDIUM: server: add init-state

MEDIUM: server: add init-state

Allow the user to set the "initial state" of a server.

Context:

Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.

This introduces the "init-state {up|down}" configuration parameter to
the server.

- when set to 'fully-up', the server is considered immediately available
  and can turn to the DOWN sate when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
  available and will initiate a health check that can turn it to the DOWN
  state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
  will initiate a health check that can turn it to the UP state immediately
  if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
  and can turn to the UP state when ALL health checks succeed.

The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.

Link: https://github.com/haproxy/haproxy/issues/51

BUG/MEDIUM: clock: also update the date offset on time jumps

In GH issue #2704, @swimlessbird and @xanoxes reported problems handling
time jumps. Indeed, since 2.7 with commit 4eaf85f5d9 ("MINOR: clock: do
not update the global date too often") we refrain from updating the global
offset in case it didn't change. But there's a catch: in case of a large
time jump, if the poller was interrupted, the local time remains the same
and we return immediately from there without updating the offset. It then
becomes incorrect regarding the "date" value, and upon subsequent call to
the poller, there's no way to detect a jump anymore so we apply the old,
incorrect offset and the date becomes wrong. Worse, going back to the
original time (then in the past), global_now_ns remains higher than the
local time and neither get updated anymore.

What is missing in practice is to immediately update the offset when
detecting a time jump. In an ideal world, the offset would be updated
upon every call, that's what was being done prior to commit above but
it's extremely CPU intensive on large systems. However we can perfectly
afford to update the offset every time we detect a time jump, as it's
not as common.

This needs to be backported as far as 2.8. Thanks to both participants
above for providing very helpful details.

DEV: patchbot: detect commit IDs starting with 7 chars

Some commit messages contain commit IDs as short as 7 chars, let's detect
them.

DEV: patchbot: add direct links to show only specific categories

The per-category counters are now clickable so that it becomes possible
to list the relevant ones.

DEV: patchbot: count the number of backported/non-backported patches

It's useful to instantly see how many patches of each category have
already been backported and are still pending, let's count them and
report them at the top of the page.

CLEANUP: assorted typo fixes in the code and comments

This is 43rd iteration of typo fixes

BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut()

A regression was introduced in the commit 76fa71f7a ("BUG/MEDIUM: mux-pt:
Never fully close the connection on shutdown") because of a typo on the
connection flags. CO_FL_SOCK_WR_SH flag must be tested to prevent a call to
conn_sock_shutw() and not CO_FL_SOCK_RD_SH.

Concretly, most of time, it is harmeless because shutdown for writes is
always performed before any shutdown for reads. Except in case describe by
the commit above. But it is not clear if it has an impact or not.

This patch must be backported with the commit above, so as far as 2.9.

BUG/MINOR: Crash on O-RTT RX packet after dropping Initial pktns

This bug arrived with this naive commit:

BUG/MINOR: quic: Too shord datagram during O-RTT handshakes (aws-lc only)

which omitted to consider the case where the Initial packet number space
could be discarded before receiving 0-RTT packets.

To fix this, append/insert the O-RTT (early-data) packet number space
into the encryption level list depending on the presence or not of
the Initial packet number space.

This issue was revealed when using aws-lc as TLS stack in GH #2701 issue.
Thank you to @Tristan971 for having reported this issue.

Must be backported where the commit mentionned above is supposed to be
backported: as far as 2.9.

BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf

That's the equivalent of the mux-h2 one, except that here there's no
real risk to loop since normally we cannot feed data that bypass the
closed state check (e.g. no zero-copy forward). But it still remains
dirty to be able to leave and empty mbuf with MFULL and MROOM set, so
better clear them as well.

No backport is needed since this is only in 3.1.

BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf

There exists an extremely tricky code path that was revealed in 3.0 by
the glitches feature, though it might theoretically have existed before.

TL;DR: a mux mbuf may be full after successfully sending GOAWAY, and
discard its remaining contents without clearing H2_CF_MUX_MFULL and
H2_CF_DEM_MROOM, then endlessly loop in h2_send(), until the watchdog
takes care of it.

What can happen is the following: Some data are received, h2_io_cb() is
called. h2_recv() is called to receive the incoming data. Then
h2_process() is called and in turn calls h2_process_demux() to process
input data. At some point, a glitch limit is reached and h2c_error() is
called to close the connection. The input frame was incomplete, so some
data are left in the demux buffer. Then h2_send() is called, which in
turn calls h2_process_mux(), which manages to queue the GOAWAY frame,
turning the state to H2_CS_ERROR2. The frame is sent, and h2_process()
calls h2_send() a last time (doing nothing) and leaves. The streams
are all woken up to notify about the error.

Multiple backend streams were waiting to be scheduled and are woken up
in turn, before their parents being notified, and communicate with the
h2 mux in zero-copy-forward mode, request a buffer via h2_nego_ff(),
fill it, and commit it with h2_done_ff(). At some point the mux's output
buffer is full, and gets flags H2_CF_MUX_MFULL.

The io_cb is called again to process more incoming data. h2_send() isn't
called (polled) or does nothing (e.g. TCP socket buffers full). h2_recv()
may or may not do anything (doesn't matter). h2_process() is called since
some data remain in the demux buf. It goes till the end, where it finds
st0 == H2_CS_ERROR2 and clears the mbuf. We're now in a situation where
the mbuf is empty and MFULL is still present.

Then it calls h2_send(), which doesn't call h2_process_mux() due to
MFULL, doesn't enter the for() loop since all buffers are empty, then
keeps sent=0, which doesn't allow to clear the MFULL flag, and since
"done" was not reset, it loops forever there.

Note that the glitches make the issue more reproducible but theoretically
it could happen with any other GOAWAY (e.g. PROTOCOL_ERROR). What makes
it not happen with the data produced on the parsing side is that we
process a single buffer of input at once, and there's no way to amplify
this to 30 buffers of responses (RST_STREAM, GOAWAY, SETTINGS ACK,
WINDOW_UPDATE, PING ACK etc are all quite small), and since the mbuf is
cleared upon every exit from h2_process() once the error was sent, it is
not possible to accumulate response data across multiple calls. And the
regular h2_snd_buf() path checks for st0 >= H2_CS_ERROR so it will not
produce any data there either.

Probably that h2_nego_ff() should check for H2_CS_ERROR before accepting
to deliver a buffer, but this needs to be carefully studied. In the mean
time the real problem is that the MFULL flag was kept when clearing the
buffer, making the two inconsistent.

Since it doesn't seem possible to trigger this sequence without the
zero-copy-forward mechanism, this fix needs to be backported as far as
2.9, along with previous commit "MINOR: mux-h2: try to clear DEM_MROOM
and MUX_MFULL at more places" which will strengthen the consistency
between these checks.

Many thanks to Annika Wickert for her detailed report that allowed to
diagnose this problem. CVE-2024-45506 was assigned to this problem.

MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places

The code leading to H2_CF_MUX_MFULL and H2_CF_DEM_MROOM being cleared
is quite complex and assumptions about its state are extremely difficult
when reading the code. There are indeed long sequences where the mux might
possibly be empty, still having the flag set until it reaches h2_send()
which will clear it after the last send. Even then it's not obviour whether
it's always guaranteed to release the flag when invoked in multiple passes.
Let's just simplify the conditionnn so that h2_send() does not depend on
"sent" anymore and that h2_timeout_task() doesn't leave the flags set on
the buffer on emptiness. While it doesn't seem to fix anything, it will
make the code more robust against future changes.

BUG/MEDIUM: mux-h1: Properly handle empty message when an error is triggered

When a 400/408/500/501 error is returned by the H1 multiplexer, we first try
to get the error message of the proxy before using the default one. This may
be configured to be mapped on /dev/null or on an empty file. In that case,
no message is emitted, as expected. But everything is handled as the error
was successfully sent.

However, there is an bug here. In h1_send_error() function, this case is not
properly handled. The flag H1C_F_ABRTED is not set on the H1 connection as it
should be and h1_close() function is not called, leaving the H1 connection in an
undefined state.

It is especially an issue when a "empty" 408-Request-Time-out error is emitted
while there are data blocked in the output buffer. In that case, the connection
remains openned until the client closes and a "cR--"/408 is logged repeatedly, every
time the client timeout is reached.

This patch must backported as far as 2.8.

BUG/MINOR: quic: unexploited retransmission cases for Initial pktns.

qc_prep_hdshk_fast_retrans() job is to pick some packets to be retransmitted
from Initial and Handshake packet number spaces. A packet may be coalesced to
a first one into the same datagram. When a coalesced packet is inspected for
retransmission, it is skipped if its length would make the total datagram length
it is attached to exceeding the anti-amplification limit. But in this case, the
first packet must be kept for the current retransmission. This is tracked by
this trace statemement:
TRACE_PROTO("will probe Initial packet number space", QUIC_EV_CONN_SPPKTS, qc);
This was not the case because of the wrong "goto end" statement. This latter
must be run only if the Initial packet number space must not be probe with
the first packet found as coalesced to another one which must be skipped.

This bug was revealed by AWS-LC interop runner with handshakeloss and
handshakecorruption which always fail because this stack leads the server
to send more Initial packets.

Thank you to Ilya (@chipitsine) for this issue report in GH #2663.

Must be backported as far as 2.6.

BUG/MEDIUM: cli: Always release back endpoint between two commands on the mcli

When several commands are chained on the master CLI, the same client
connection is used. Because, it is a TCP connection, the mux PT is used. It
means there is no stream at the mux level. It is not possible to release the
applicative stream between each commands as for the HTTP. So, to work around
this limitation, between two commands, the master CLI is resetting the
stream. It does exactly what it was performed on HTTP to manage keep-alive
connections on old HAProxy versions.

But this part was copied from a code dealing with connection only while the
back endpoint can be an applet or a mux for the master cli. The previous fix
on the mux PT ("BUG/MEDIUM: mux-pt: Never fully close the connection on
shutdown") revealed a bug. Between two commands, the back endpoint was only
released if the connection's XPRT was closed. This works if the back
endpoint is an applet because there is no connection. But for commands sent
to a worker, a connection is used. At this stage, this only works if the
connection's XPRT is closed. Otherwise, the old endpoint is never detached
leading to undefined behavior on the next command execution (most probably a
crash).

Without the commit above, the connection's XPRT is always closed on
shutdown. It is no longer true. At this stage, we must inconditionnally
release the back endpoint by resetting the corresponding sedesc to fix the
bug.

This patch must be backported with the commit above in all stable
versions. On 2.4 and lower, it will need to be adapted.

BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown

When a shutdown is reported to the mux (shutdown for reads or writes), the
connexion is immediately fully closed if the mux detects the connexion is
closed in both directions. Only the passthrough multiplexer is able to
perform this action at this stage because there is no stream and no internal
data. Other muxes perform a full connection close during the mux's release
stage. It was working quite well since recently. But, in theory, the bug is
quite old.

In fact, it seems possible for the lower layer to report an error on the
connection in same time a shutdown is performed on the mux. Depending on how
events are scheduled, the following may happen:

1. An connection error is detected at the fd layer and a wakeup is
    scheduled on the mux to handle the event.

2. A shutdown for writes is performed on the mux. Here the mux decides to
    fully close the connexion. If the xprt is not used to log info, it is
    released.

3. The mux is finally woken up. It tries to retrieve data from the xprt
    because it is not awayre there was an error. This leads to a crash
    because of a NULL-deref.

By reading the code, it is not obvious. But it seems possible with SSL
connection when the handshake is rearmed. It happens when a
SSL_ERROR_WANT_WRITE is reported on a SSL_read() attempt or a
SSL_ERROR_WANT_READ on a SSL_write() attempt.

This bug is only visible if the XPRT is not used to log info. So it is no so
common.

This patch should fix the 2nd crash reported in the issue #2656. It must
first be backported as far as 2.9 and then slowly to all stable versions.

MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit

There is no reason to acquire a write-lock on the sticky session when a
shared limit is applied because only the frequency is updated. The sticky
session itself is not modified. We must just take care it is not removed in
the mean time. So a read-lock may be used instead.

MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates

Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bits limit of the
frequency counters. Thanks to this factor, it is possible to have bytes
rates beyond the 4GB. Instead of counting each bytes, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limit exceeding the 4GB/s.

For now, this parameter must be in the range ]0-1024].

BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC)

This bug follows this patch:
MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
where a new third variable was added to be dumped from QUIC_EV_CONN_IO_CB trace
event. The quic_trace() code did not reveal there was already another variable
passed as third argument but not dumped. This leaded to crash when dereferencing
a point to an int in place of a point to an SSL object.

This issue was reproduced only by handshakecorruption aws-lc interop test with
s2n-quic as client.

Note that this patch must be backported with this one:
BUG/MEDIUM: quic: always validate sender address on 0-RTT
which depends on the commit mentionned above.

MEDIUM: protocol: add MPTCP per address support

Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.

Link: https://www.rfc-editor.org/rfc/rfc8684.html
Link: https://www.tessares.net/apples-mptcp-story-so-far/
Link: https://www.mptcp.dev
Link: https://github.com/haproxy/haproxy/issues/1028
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515
Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

MEDIUM: sock: use protocol when creating socket

Use the protocol configured for a connection when creating the socket,
instead of always using 0.

This change is needed to allow new protocol to be used when creating
the sockets, such as MPTCP. Note however that this patch won't change
anything for now, as the only other value that proto->sock_prot could
hold is IPPROTO_TCP, which has the same behavior as 0 when passed to
socket.

MINOR: server: add a alt_proto field for server

Add a new field alt_proto to the server structures that
specify if an alternate protocol should be used for this server.

This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.

This change allows thus to create servers with different protocols,
and not only TCP anymore.

MINOR: tools: extend str2sa_range to add an alt parameter

Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.

This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.

This change is needed to allow for example the servers to know if it
need to use an alternate protocol or not.

BUILD: quic: fix build errors on FreeBSD since recent GSO changes

The following commits broke the build on FreeBSD when QUIC is enabled:

35470d518 ("MINOR: quic: activate UDP GSO for QUIC if supported")
448d3d388 ("MINOR: quic: add GSO parameter on quic_sock send API")

Indeed, it turns out that netinet/udp.h requires sys/types.h to be
included before. Let's just change the includes order to fix the build.
No backport is needed.

BUG/MEDIUM: quic: always validate sender address on 0-RTT

It has been reported by Wedl Michael, a student at the University of Applied
Sciences St. Poelten, a potential vulnerability into haproxy as described below.

An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker has not even to send a application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listen to bypass IP allow/block list and send HTTP3 requests.

To mitigate this vulnerability, one decided to use a token which can be provided
to the client each time it successfully managed to connect to haproxy. These
tokens may be reused for future connections to validate the address/path of the
remote peer as this is done with the Retry token which is used for the current
connection, not the next one. Such tokens are transported by NEW_TOKEN frames
which was not used at this time by haproxy.

So, each time a client connect to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such a token is presented by the client,
haproxy checks if the session is a 0-RTT one, so with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without token presented by the client.

quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.

quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secre()
and modified to be reused and derive the secret for the new token implementation.

quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.

When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done calling qc_ssl_eary_data_accepted()
new function. The secret TLS handshake is interrupted as soon as possible returnin
0 from ha_quic_add_handshake_data(). The connection is also marked as
requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from
ha_quic_add_handshake_data(). The the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.

About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.

This patch depends on these commits:

     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
     MINOR: quic: Implement qc_ssl_eary_data_accepted().
     MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
     BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
     MINOR: quic: Token for future connections implementation.
     MINOR: quic: Implement quic_tls_derive_token_secret().
     MINOR: tools: Implement ipaddrcpy().

Must be backported as far as 2.6.

MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.

Dump the early data status from QUIC_EV_CONN_IO_CB trace event.
This is very helpful to know if the QUIC server has accepted the
early data received from clients.

MINOR: quic: Implement qc_ssl_eary_data_accepted().

This function is a wrapper around SSL_get_early_data_status() for
OpenSSL derived stack and SSL_early_data_accepted() boringSSL derived
stacks like AWS-LC. It returns true for a TLS server if it has
accepted the early data received from a client.

Also implement quic_ssl_early_data_status_str() which is dedicated to be used
for debugging purposes (traces). This function converts the enum returned
by the two function mentionned above to a human readable string.

MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)

Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN
as size as defined by the token for future connections (quic_token.c).
Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()).
Also add comments to denote that the NEW_TOKEN parser function is used only by
clients and that its builder is used only by servers.

BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder

quic_build_new_token_frame() is the function which is called to build
a NEW_TOKEN frame into a buffer. The position pointer for this buffer
was not updated, leading the NEW_TOKEN frame to be malformed.

Must be backported as far as 2.6.

MINOR: quic: Token for future connections implementation.

There exist two sorts of token used by QUIC. They are both used to validate
the peer address (path validation). Retry are used for the current
connection the client want to open. This patch implement the other
sort of tokens which after having been received from a connection, may
be provided for the next connection from the same IP address to validate
it (or validate the network path between the client and the server).

The token generation is implemented by quic_generate_token(), and
the token validation by quic_token_chek(). The same method
is used as for Retry tokens to build such tokens to be reused for
future connections. The format is very simple: one byte for the format
identifier to distinguish these new tokens for the Retry token, followed
by a 32bits timestamps. As this part is ciphered with AEAD as cryptographic
algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes
are added to this token and a salt to derive the AEAD secret used
to cipher the token. In addition to this salt, this is the client IP address
which is used also as AAD to derive the AEAD secret. So, the length of
the token is fixed: 37 bytes.

MINOR: quic: Implement quic_tls_derive_token_secret().

This is function is similar to quic_tls_derive_retry_token_secret().
Its aim is to derive the secret used to cipher the token to be used
for future connections.

This patch renames quic_tls_derive_retry_token_secret() to a more
and reuses its code to produce a more generic one: quic_do_tls_derive_token_secret().
Two arguments are added to this latter to produce both quic_tls_derive_retry_token_secret()
and quic_tls_derive_token_secret() new function which calls
quic_do_tls_derive_token_secret().

MINOR: tools: Implement ipaddrcpy().

Implement ipaddrcpy() new function to copy only the IP address from
a sockaddr_storage struct object into a buffer.

CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE

There was a typo in the macro name, where LENGTH was incorrectly
written. This didn't cause any issue because the typo appeared in all
occurrences in the codebase.

CLEANUP: haproxy: fix typos in code comment

Use "from" instead of "form" in ha_random_boot function code comments.

BUG/MINIR: proxy: Match on 429 status when trying to perform a L7 retry

Support for 429 was recently added to L7 retries (0d142e075 "MINOR: proxy:
Add support of 429-Too-Many-Requests in retry-on status"). But the
l7_status_match() function was not properly updated. The switch statement
must match the 429 status to be able to perform a L7 retry.

This patch must be backported if the commit above is backported. It is
related to #2687.

BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready

If an early error occurred on the client connection, we must prevent any
multiplexer upgrades. Indeed, it is unexpected for a mux to be initialized
with no xprt. On a normal workflow it is impossible. So it is not an
issue. But if a mux upgrade is performed at the stream level, an early error
on the connection may have already been handled by the previous mux and the
connection may be already fully closed. If the mux upgrade is still
performed, a crash can be experienced.

It is possible to have a crash with an implicit TCP>HTTP upgrade if there is no
data in the input buffer. But it is also possible to get a crash with an
explicit "switch-mode http" rule.

It must be backported to all stable versions. In 2.2, the patch must be
applied directly in stream_set_backend() function.

BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding

When DATA frames are sent via the 0-copy data forwarding, we must take care
to set the ES flag on the last DATA frame. It should be performed in
h2_done_ff() when IOBUF_FL_EOI flag was set by the producer. This flag is
here to know when the producer has reached the end of input. When this
happens, the h2s state is also updated. It is switched to "half-closed
local" or "closed" state depending on its previous state.

It is mainly an issue on uploads because the server may be blocked waiting
for the end of the request. A workaround is to disable the 0-copy forwarding
support the the H2 by setting "tune.h2.zero-copy-fwd-send" directive to off
in your global section.

This patch should fix the issue #2665. It must be backported as far as 2.9.

MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status

The "429" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remains sorted.

This patch should fix the issue #2687. It is quite simple so it may safely
be backported to 3.0 if necessary.

MEDIUM: ssl/sample: add ssl_fc_sigalgs_bin sample fetch

This new sample fetch allow to extract the binary list contained in the
signature_algorithms (13) TLS extensions.

https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.3

MEDIUM: ssl: capture the signature_algorithms extension from Client Hello

Activate the capture of the TLS signature_algorithms extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.

MEDIUM: ssl/sample: add ssl_fc_supported_versions_bin sample fetch

This new sample fetch allow to extract the binary list contained in the
supported_versions (43) TLS extensions.

https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.1

MEDIUM: ssl: capture the supported_versions extension from Client Hello

Activate the capture of the TLS supported_versions extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.

CLEANUP: ssl: cleanup the clienthello capture

In order to add more extensions, clean up the clienthello capture
function a little bit.

BUILD: quic: 32bits build broken by wrong integer conversions for printf()

Since these commits the 32bits build is broken due to several errors as follow:

CC      src/quic_cli.o
src/quic_cli.c: In function ‘dump_quic_full’:
src/quic_cli.c:285:94: error: format ‘%ld’ expects argument of type ‘long int’,
        but argument 5 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
  285 |                         chunk_appendf(&trash, "  [initl] rx.ackrng=%-6zu tx.inflight=%-6zu(%ld%%)\n",
      |                                                                                            ~~^
      |                                                                                              |
      |                                                                                              long int
      |                                                                                            %lld
  286 |                                       pktns->rx.arngs.sz, pktns->tx.in_flight,
  287 |                                       pktns->tx.in_flight * 100 / qc->path->cwnd);
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                 |
      |                                                                 uint64_t {aka long long unsigned int}

Replace several %ld by %llu with ull as printf conversion in quic_clic.c and a
%ld by %lld with (long long) as printf conversion in quic_cc_cubic.c.

Thank you to Ilya (@chipitsine) for having reported this issue in GH #2689.

Must be backported to 3.0.

CI: QUIC Interop: use different artifact names for uploading logs

artifact names must be unique, otherwise only first failed logs are
uploaded, other encounter 409 conflict

CI: QUIC Interop: do not run bandwidth measurement tests

crosstraffic, goodput tests are intended to perform bandwidth measurement,
we do not consider GitHub runners for that purpose

GH issue: https://github.com/haproxy/haproxy/issues/2688

CI: fix missing comma introduced in 956839c0f68a7722acc586ecd91ffefad2ccb303

in 956839c0f68a7722acc586ecd91ffefad2ccb303 syntax was broken due to missing
comma. it is follow up.

CI: QUIC Interop AWS-LC: enable ngtcp2 client

Let's add it and see how it goes.
GH issue: https://github.com/haproxy/haproxy/issues/2688

DEV: coccinelle: add a test to detect unchecked calloc()

The coccinelle test "unchecked-calloc.cocci" detects various cases of
unchecked calloc().

DEV: coccinelle: add a test to detect unchecked malloc()

The coccinelle test "unchecked-malloc.cocci" detects various cases of
unchecked malloc().

BUILD: tools: environ is not defined in OS X and BSD

Add extern char **environ which in order to build the new functions to
manipulate the environment.

Indeed the variable environ is not required to be declared by POSIX, so
it need to be declared manually:

"In addition, the following variable, which must be declared by the user if it is to be used directly:

extern char **environ;"

https://pubs.opengroup.org/onlinepubs/9699919799/functions/environ.html

BUG/MINOR: haproxy: free init_env in deinit only if allocated

This fixes 7b78e1571 (" MINOR: mworker: restore initial env before wait
mode").

In cases, when haproxy starts without any configuration, for example:
'haproxy -vv', init_env array to backup env variables is never allocated. So,
we need to check in deinit(), when we free its memory, that init_env is not a
NULL ptr.

MINOR: mworker: restore initial env before wait mode

This patch is the follow-up of 1811d2a6ba (MINOR: tools: add helpers to
backup/clean/restore env).

In order to avoid unexpected behaviour in master-worker mode during the process
reload with a new configuration, when the old one has contained '*env' keywords,
let's backup its initial environment before calling parse_cfg() and let's clean
and restore it in the context of master process, just before it enters in a wait
polling loop.

This will garantee that new workers will have a new updated environment and not
the previous one inherited from the master, which does not read the configuration,
when it's in a wait-mode.

MINOR: tools: add helpers to backup/clean/restore env

'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could
modify the process runtime environment. In case of master-worker mode this
creates a problem, as the configuration is read only once before the forking a
worker and then the master process does the reexec without reading any config
files, just to free the memory. So, during the reload a new worker process will
be created, but it will inherited the previous unchanged environment from the
master in wait mode, thus it won't benefit the changes in configuration,
related to '*env' keywords. This may cause unexpected behavior or some parser
errors in master-worker mode.

So, let's add a helper to backup all process env variables just before it will
read its configuration. And let's also add helpers to clean up the current
runtime environment and to restore it to its initial state (as it was before
parsing the config).

MINOR: mux-quic: correct qcc_bufwnd_full() documentation

Fix returned value domment of qcc_bufwnd_full() which was incorrect.

MINOR: mux-quic: add buf_in_flight to QCC debug infos

Dump <buf_in_flight> QCC field both in QUIC MUX traces and "show quic".
This could help to detect if MUX does not allocate enough buffers
compared to quic_conn current congestion window.

MINOR: config: Created env variables for http and tcp clf formats

Since we already have variables for the other formats and the
change is trivial I thought it would be a nice addition for
completeness

[RELEASE] Released version 3.1-dev6

Released version 3.1-dev6 with the following main changes :
    - BUG/MINOR: proto_tcp: delete fd from fdtab if listen() fails
    - BUG/MINOR: proto_tcp: keep error msg if listen() fails
    - MINOR: proto_tcp: tcp_bind_listener: copy errno in errmsg
    - MINOR: channel: implement ci_insert() function
    - BUG/MEDIUM: mworker/cli: fix pipelined modes on master CLI
    - REGTESTS: mcli: test the pipelined commands on master CLI
    - MINOR: cfgparse: load_cfg_in_mem: fix null ptr dereference reported by coverity
    - MINOR: startup: fix unused value reported by coverity
    - BUG/MINOR: mux-quic: do not send too big MAX_STREAMS ID
    - BUG/MINOR: proto_uxst: delete fd from fdtab if listen() fails
    - BUG/MINOR: cfgparse: parse_cfg: fix null ptr dereference reported by coverity
    - MINOR: proto_uxst: copy errno in errmsg for syscalls
    - MINOR: mux-quic: do not trace error in qcc_send_frames() on empty list
    - BUG/MINOR: h3: properly reject too long header responses
    - CLEANUP: mworker/cli: clean up the mode handling
    - BUG/MINOR: tools: make fgets_from_mem() stop at the end of the input
    - BUG/MINOR: pattern: pat_ref_set: fix UAF reported by coverity
    - BUG/MINOR: pattern: pat_ref_set: return 0 if err was found
    - CI: keep logs for failed QIUC Interop jobs
    - BUG/MINOR: release-estimator: fix relative scheme in CHANGELOG URL
    - MINOR: release-estimator: add requirements.txt
    -  MINOR: release-estimator: add installation steps in README.md
    - MINOR: release-estimator: fix the shebang of the python script
    - DOC: config: correct the table for option tcplog
    - MEDIUM: log: relax some checks and emit diag warnings instead in lf_expr_postcheck()
    - MINOR: log: "drop" support for log-profile steps
    - CI: QUIC Interop LibreSSL: document chacha20 test status
    - CI: modernize codespell action, switch to node 16
    - CI: QUIC Interop AWS-LC: enable chrome client
    - DOC: lua: fix incorrect english in lua.txt
    - MINOR: Implements new log format of option tcplog clf
    - MINOR: cfgparse: limit file size loaded via /dev/stdin
    - BUG/MINOR: stats: fix color of input elements in dark mode
    - CLEANUP: stats: use modern DOCTYPE tag
    - BUG/MINOR: stats: add lang attribute to html tag
    - DOC: quic: fix default minimal value for max window size
    - DOC: quic: document nocc debug congestion algorithm
    - MINOR: quic: extract config window-size parsing
    - MINOR: quic: define max-window-size config setting
    - MINOR: quic: allocate stream txbuf via qc_stream_desc API
    - MINOR: mux-quic: account stream txbuf in QCC
    - MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams
    - MINOR: h3: mark control stream as metadata
    - MINOR: mux-quic: define buf_in_flight
    - MAJOR: mux-quic: allocate Tx buffers based on congestion window
    - MINOR: quic/config: adapt settings to new conn buffer limit
    - MINOR: quic: define sbuf pool
    - MINOR: quic: support sbuf allocation in quic_stream
    - MEDIUM: h3: allocate small buffers for headers frames
    - MINOR: mux-quic: retry after small buf alloc failure
    - BUG/MINOR: cfgparse-global: fix err msg in mworker keyword parser
    - BUG/MINOR: cfgparse-global: clean common_kw_list
    - BUG/MINOR: cfgparse-global: remove redundant goto
    - MINOR: cfgparse-global: move 'pidfile' in global keywords list
    - MINOR: cfgparse-global: move 'expose-*' in global keywords list
    - MINOR: cfgparse-global: move tune options in global keywords list
    - MINOR: cfgparse-global: move unsupported keywords in global list
    - BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list
    - MINOR: quic: store the lost packets counter in the quic_cc_event element
    - MINOR: quic: support a tolerance for spurious losses
    - MINOR: protocol: properly assign the sock_domain and sock_family
    - MINOR: protocol: add a family lookup
    - MEDIUM: socket: always properly use the sock_domain for requested families
    - MINOR: protocol: add the real address family to the protocol
    - MINOR: socket: don't ban all custom families from reuseport
    - MINOR: protocol: always initialize the receivers list on registration
    - CLEANUP: protocol: no longer initialize .receivers nor .nb_receivers

CLEANUP: protocol: no longer initialize .receivers nor .nb_receivers

Protocol definitions no longer need to initialize these internal fields,
as they're now properly initialized during protocol registration.

MINOR: protocol: always initialize the receivers list on registration

Till now, protocols were required to self-initialize their receivers
list head, which is not very convenient, and is quite error prone.
Indeed, it's too easy to copy-paste a protocol definition and forget
to update the .receivers field to point to itself, resulting in mixed
lists. Let's just do that in protocol_register(). And while we're at
it, let's also zero the nb_receivers entry that works with it, so that
the protocol definition isn't required to pre-initialize stuff related
to internal book-keeping.

MINOR: socket: don't ban all custom families from reuseport

The test on ss_family >= AF_MAX is too strict if we want to support new
custom families, let's apply this to the real_family instead so that we
check that the underlying socket supports reuseport.

MINOR: protocol: add the real address family to the protocol

For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.

Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).

MEDIUM: socket: always properly use the sock_domain for requested families

Now we make sure to always look up the protocol's domain for an address
family. Previously we would use it as-is, which prevented from properly
using custom addresses (which is when they differ).

This removes some hard-coded tests such as in log.c where UNIX vs UDP
was explicitly checked for example. It requires a bit of care, however,
so as to properly pass value 1 in the 3rd arg of the protocol_lookup()
for DGRAM stuff. Maybe one day we'll change these for defines or enums
to limit mistakes.

MINOR: protocol: add a family lookup

At plenty of places we have access to an address family which may
include some custom addresses but we cannot simply convert them to
the real families without performing some random protocol lookups.

Let's simply add a proto_fam table like we have for the protocols.
The protocols could even be indexed there, but for now it's not worth
it.

MINOR: protocol: properly assign the sock_domain and sock_family

When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).

This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.

This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.

MINOR: quic: support a tolerance for spurious losses

Tests performed between a 1 Gbps connected server and a 100 mbps client,
distant by 95ms showed that:

  - we need 1.1 MB in flight to fill the link
  - rare but inevitable losses are sufficient to make cubic's window
    collapse fast and long to recover
  - a 100 MB object takes 69s to download
  - tolerance for 1 loss between two ACKs suffices to shrink the download
    time to 20-22s
  - 2 losses go to 17-20s
  - 4 losses reach 14-17s

At 100 concurrent connections that fill the server's link:
  - 0 loss tolerance shows 2-3% losses
  - 1 loss tolerance shows 3-5% losses
  - 2 loss tolerance shows 10-13% losses
  - 4 loss tolerance shows 23-29% losses

As such while there can be a significant gain sometimes in setting this
tolerance above zero, it can also significantly waste bandwidth by sending
far more than can be received. While it's probably not a solution to real
world problems, it repeatedly proved to be a very effective troubleshooting
tool helping to figure different root causes of low transfer speeds. In
spirit it is comparable to the no-cc congestion algorithm, i.e. it must
not be used except for experimentation.

MINOR: quic: store the lost packets counter in the quic_cc_event element

Upon loss detection, qc_release_lost_pkts() notifies congestion
controllers about the event and its final time. However it does not
pass the number of lost packets, that can provide useful hints for
some controllers. Let's just pass this option.

BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list

Remove tune.fast-forward from common_kw_list. It was replaced by
'tune.disable-fast-forward' and it's no longer present in "if..else if.."
parser from cfg_parse_global(). Otherwise, it may be shown as the best-match
keyword for some tune options, which is now wrong.

Should be backported in versions 2.9 and 3.0.

MINOR: cfgparse-global: move unsupported keywords in global list

Following the previous commits and in order to clean up cfg_parse_global let's
move unsupported keywords in the global list and let's add for them a dedicated
parser.

MINOR: cfgparse-global: move tune options in global keywords list

In order to clean up cfg_parse_global() and to add the support of the new
MODE_DISCOVERY in configuration parsing, let's move the keywords related to
tune options into the global keywords list and let's add for them two dedicated
parsers. Tune options keywords are sorted between two parsers in dependency of
parameters number, which a given tune option needs.

tune options parser is called by section parser and follows the common API, i.e.
it returns -1 on failure, 0 on success and 1 on recoverable error. In case of
recoverable error we've previously returned ERR_ALERT (0x10) and we have emitted
an alert message at startup. Section parser treats all rc > 0 as ERR_WARN. So in
case, if some tune option was set twice in the global section, tune
options parser will return 1 (in order to respect the common API), section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of alert, as it was before.

MINOR: cfgparse-global: move 'expose-*' in global keywords list

Following the previous commit let's also move 'expose-*' keywords in the global
cfg_kws list and let's add for them a dedicated parser. This will simplify the
configuration parsing in the new MODE_DISCOVERY, which allows to read only the
keywords, needed at the early start of haproxy process (i.e. modes, pidfile,
chosen poller).

MINOR: cfgparse-global: move 'pidfile' in global keywords list

This commit cleans up cfg_parse_global() and prepares the config parser to
support MODE_DISCOVERY. This step is needed in early starting stage, just to
figura out in which mode the process was started, to set some necessary
parameteres needed for this mode and to continue the initialization
stage.

'pidfile' makes part of such common keywords, which are needed to be parsed
very early and which are used almost in all process modes (except the
foreground, '-d').

'pidfile' keyword parser is called by section parser and follows the common
API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable error. In
case of recoverable error we've previously returned ERR_ALERT (0x10) and we have
emitted an alert message at startup. Section parser treats all rc > 0 as
ERR_WARN. So in case, if pidfile was already specified via command line, the
keyword parser will return 1 (in order to respect the common API), section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of alert, as it was before.

BUG/MINOR: cfgparse-global: remove redundant goto

In the case, when the given keyword was found in the global 'cfg_kws' list, we
go to 'out' label anyway, after testing rc returned by the keyword's parser. So
there is not a much gain if we perform 'goto out' jump specifically when rc > 0.