git.ipfire.org Git - thirdparty/haproxy.git/log

MINOR: quic: add GSO parameter on quic_sock send API

Add <gso_size> parameter to qc_snd_buf(). When non-null, this specifies
the value for socket option SOL_UDP/UDP_SEGMENT. This allows to send
several datagrams in a single call by splitting data multiple times at
<gso_size> boundary.

For now, <gso_size> remains set to 0 by caller, as such there should not
be any functional change.

MINOR: quic: define quic_cc_path MTU as constant

Future commits will implement GSO support to be able to emit multiple
datagrams in a single syscall invocation. This will be used every time
there is more data to sent than the UDP network MTU.

No change will be done for Tx buffer encoding, in particular when using
extra metadata datagram header. When GSO will be used, length field will
contain the total length of all datagrams to emit in a single GSO
syscall send. As such, QUIC send functions will detect that GSO is in
use if total length is greater than MTU.

This last assumption forces to ensure that MTU is constant. Indeed, in
case qc_send() is interrupted, Tx buffer will be left with prepared
datagrams. These datagrams will be emitted at the next qc_send()
invocation. If MTU would change during these two calls, it would be
impossible to know if GSO was used or not. To prevent this, mark <mtu>
field of quic_cc_path as constant.

MINOR: quic: activate UDP GSO for QUIC if supported

Add a startup test for GSO support in quic_test_socketopts() and
automatically activate it in qc_prep_pkts() when building datagrams as
big as MTU.

Also define a new config option tune.quic.disable-udp-gso. This is
useful to prevent warning on older platform or to debug an issue which
may be related to GSO.

MINOR: quic: extend detection of UDP API OS features

QUIC haproxy implementation relies on specific OS features to activate
some UDP optimization. One of these is the ability to bind multiple
sockets on the same address, which is necessary to have a dedicated
socket for each QUIC connections. This feature support is tested during
startup via an internal proto-quic function. It automatically deactivate
socket per connection if OS is not compatible.

The purpose of this patch is to render this QUIC feature detection code
more generic. Function is renamed quic_test_socketopts() and is still
invoked on startup. Its internal code has been refactored to be able to
implement other features support test in it.

Return value has also been changed and is now taken into account. In
case of ERR_FATAL, haproxy startup will be interrupted. This happens on
socket() syscall failure used to duplicate a QUIC listener FD.

This commit will become necessary to detect GSO support on startup.

CLEANUP: quic: remove obsolete comment on send

Remove comment on send which is now obsolete since the introduction of
per-connection socket.

MINOR: limits: add is_any_limit_configured

Let's encapsulate the check of all supported for now process internal limits in
a separate function. This will help in cases, when we need to simply check if
we have even only one limit set in the configuration file. It's important, as
the default value for a one limit (fd-hard-limits, for example) sometimes must
not affect the computation of the others.

REORG: haproxy: move limits handlers to limits

This patch moves handlers to compute process related limits in 'limits'
compilation unit.

MINOR: haproxy: prepare to move limits-related code

This patch is done in order to prepare the move of handlers to compute and to
check process related limits as maxconn, maxsock, maxpipes.

So, these handlers become no longer static due to the future move.

We add the handlers declarations in limits.h in this patch as well, in order to
keep the next patch, dedicated to code replacement, without any additional
modifications.

Such split also assures that this patch can be compiled separately from the
next one, where we moving the handlers. This is important in case of
git-bisect.

REORG: global: move rlim_fd_*_at_boot in limits

Let's move in 'limits' compilation unit global variables to keep the initial
process fd limits.

CLEANUP: fd: rm struct rlimit definition

As raise_rlim_nofile() was moved to limits compilation unit, limits.h includes
the system <sys/resource.h>. So, this definition of rlimit system type
structure is no longer need for compilation of fd unit.

REORG: fd: move raise_rlim_nofile to limits

Let's move raise_rlim_nofile() from 'fd' compilation unit to 'limits', as it
wraps setrlimit to change process RLIMIT_NOFILE.

MINOR: limits: prepare to keep limits in one place

The code which gets, sets and checks initial and current fd limits and process
related limits (maxconn, maxsock, ulimit-n, fd-hard-limit) is spread around
different functions in haproxy.c and in fd.c. Let's group it together in
dedicated limits.c and limits.h.

This patch is done in order to prepare the moving of limits-related functions
from different places to the new 'limits' compilation unit. It helps to keep
clean the next patch, which will do only the move without any additional
modifications.

Such detailed split is needed in order to be sure not to break accidentally
limits logic and in order to be able to compile each commit separately in case
of git-bisect.

[RELEASE] Released version 3.1-dev3

Released version 3.1-dev3 with the following main changes :
    - BUG/MINOR: quic: Wrong datagram building when probing.
    - BUG/MEDIUM: quic: fix possible exit from qc_check_dcid() without unlocking
    - BUG/MINOR: promex: Remove Help prefix repeated twice for each metric
    - DOC: configuration: add details about crt-store in bind "crt" keyword
    - BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work with applet's buffers
    - DOC: configuration: more details about the master-worker mode
    - BUG/MEDIUM: server: fix race on server_atomic_sync()
    - BUG/MINOR: jwt: don't try to load files with HMAC algorithm
    - CLEANUP: quic: cleanup prototypes related to CIDs handling
    - CLEANUP: quic: remove non-existing quic_cid_tree definition
    - MINOR: quic: remove access to CID global tree outside of quic_cid module
    - REORG: quic: remove quic_cid_trees reference from proto_quic
    - MINOR: quic: add 2 BUG_ON() on datagram dispatch
    - MINOR: quic: ensure quic_conn is never removed on thread affinity rebind
    - MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD
    - DOC: configuration: update maxconn description
    - MINOR: proto: extend connection thread rebind API
    - BUG/MEDIUM: quic: prevent crash on accept queue full
    - BUG/MEDIUM: peers: Fix crash when syncing learn state of a peer without appctx
    - CI: add weekly QUIC Interop regression against LibreSSL
    - DEV: flags/quic: decode quic_conn flags
    - MINOR: quic: rename "ssl error" trace
    - BUG/MEDIUM: init: fix fd_hard_limit default in compute_ideal_maxconn
    - BUG/MINOR: jwt: fix variable initialisation
    - MINOR: ssl/sample: ssl_c_san returns a comma separated list of SAN
    - OPTIM: pool: improve needed_avg cache line access pattern
    - MAJOR: import: update mt_list to support exponential back-off (try #2)
    - CI: weekly QUIC Interop: try to fix private image
    - BUG/MINOR: h1: Fail to parse empty transfer coding names
    - BUG/MINOR: h1: Reject empty coding name as last transfer-encoding value
    - BUG/MEDIUM: h1: Reject empty Transfer-encoding header
    - BUG/MEDIUM: spoe: Be sure to create a SPOE applet if none on the current thread
    - BUILD: listener: silence a build warning about unused value without threads
    - DOC: architecture: remove the totally outdated architecture manual
    - SCRIPTS: create-release: no more need to skip architecture.txt

SCRIPTS: create-release: no more need to skip architecture.txt

Now that it's gone we won't stumble upon it by accident anymore.

DOC: architecture: remove the totally outdated architecture manual

We've discussed about removing it many times and I thought it had been
removed long ago, but apparently not as William proved me. Let's get
rid of it now. It's totally outdated (last updated 18 years ago, when
laptop processors were still 32 bits), mentions keywords and external
products that don't exist anymore. It's not even on docs.haproxy.org.
At some point, old stuff must really die.

BUILD: listener: silence a build warning about unused value without threads

A variable introduced in commit 1a43b9f32c ("MINOR: proto: extend
connection thread rebind API") is not used without threads and causes a
build warning. Let's just mark it maybe_unused.

Since the commit above is tagged for backporting, this one will need to
be backported along with it.

BUG/MEDIUM: spoe: Be sure to create a SPOE applet if none on the current thread

When a message is queued, waiting to be processed by a SPOE applet, there
are some heuristic to know if a new applet must be created or not. There are
2 conditions to skip the applet creation:

  1 - if there are enough idle applets on the current thread, or,

  2 - if the processing rate on the current thread is high enough to handle
      this new message

In the 2nd case, there is a flaw when the number of processed messages falls
to zero while the processing rate is still greater than zero. In that case,
we will skip the SPOE applet creation without taking care to check there is
at least one applet on the current thread.

So now, the conditions above to skip the SPOE applet creation are only
evaluated if there is at least one applet on the current thread.

This patch must be backported to every stable versions.

BUG/MEDIUM: h1: Reject empty Transfer-encoding header

The Transfer-Encoding headers list the transfer coding that have been
applied to the content in order to form the message body. It is a list of
tokens. And as specified by RFC 9110, a token cannot be empty. When several
coding names are specify as a comma-separated value, this case is properly
handled and an error is triggered. However, an empty header value will just
be skipped and no error is triggered. This could be an issue with some buggy
servers.

Now, empty Transfer-Encoding header are rejected too.

This patch must be backported as far as 2.6.

BUG/MINOR: h1: Reject empty coding name as last transfer-encoding value

The following Transfer-Encoding header is now rejected with a
400-bad-request:

Transfer-Encoding: chunked,\r\n

This case was not properly handled and the last empty value was just
ignored.

This patch must be backported as far as 2.6.

BUG/MINOR: h1: Fail to parse empty transfer coding names

Empty transfer coding names, inside a comma-separated list, are already
rejected. But it is only by chance. Today, it is detected as an unknown
coding names (not "chunked" concretly). Then, it is handled by the H1
multiplexer as an error and a 422-Unprocessable-Content response is
returned.

So, the error is properly detected in this case, but it is not accurate. A
400-bad-request response must be returned instead. Then, it is better to
catch the error during the header parsing. It is the purpose of this patch.

This patch should be backported as far as 2.6.

CI: weekly QUIC Interop: try to fix private image

for some reason image built in HAProxy workflow is "private", it
is succesfully built, but fails to pull. Let's try explicit docker login
for run job as well

MAJOR: import: update mt_list to support exponential back-off (try #2)

This is the second attempt at importing the updated mt_list code (commit
59459ea3). The previous one was attempted with commit c618ed5ff4 ("MAJOR:
import: update mt_list to support exponential back-off") but revealed
problems with QUIC connections and was reverted.

The problem that was faced was that elements deleted inside an iterator
were no longer reset, and that if they were to be recycled in this form,
they could appear as busy to the next user. This was trivially reproduced
with this:

  $ cat quic-repro.cfg
  global
          stats socket /tmp/sock1 level admin
          stats timeout 1h
          limited-quic

  frontend stats
          mode http
          bind quic4@:8443 ssl crt rsa+dh2048.pem alpn h3
          timeout client 5s
          stats uri /

  $ ./haproxy -db -f quic-repro.cfg  &

  $ h2load -c 10 -n 100000 --npn h3 https://127.0.0.1:8443/
  => hang

This was purely an API issue caused by the simplified usage of the macros
for the iterator. The original version had two backups (one full element
and one pointer) that the user had to take care of, while the new one only
uses one that is transparent for the user. But during removal, the element
still has to be unlocked if it's going to be reused.

All of this sparked discussions with Fred and Aurélien regarding the still
unclear state of locking. It was found that the lock API does too much at
once and is lacking granularity. The new version offers a much more fine-
grained control allowing to selectively lock/unlock an element, a link,
the rest of the list etc.

It was also found that plenty of places just want to free the current
element, or delete it to do anything with it, hence don't need to reset
its pointers (e.g. event_hdl). Finally it appeared obvious that the
root cause of the problem was the unclear usage of the list iterators
themselves because one does not necessarily expect the element to be
presented locked when not needed, which makes the unlock easy to overlook
during reviews.

The updated version of the list presents explicit lock status in the
macro name (_LOCKED or _UNLOCKED suffixes). When using the _LOCKED
suffix, the caller is expected to unlock the element if it intends to
reuse it. At least the status is advertised. The _UNLOCKED variant,
instead, always unlocks it before starting the loop block. This means
it's not necessary to think about unlocking it, though it's obviously
not usable with everything. A few _UNLOCKED were used at obvious places
(i.e. where the element is deleted and freed without any prior check).

Interestingly, the tests performed last year on QUIC forwarding, that
resulted in limited traffic for the original version and higher bit
rate for the new one couldn't be reproduced because since then the QUIC
stack has gaind in efficiency, and the 100 Gbps barrier is now reached
with or without the mt_list update. However the unit tests definitely
show a huge difference, particularly on EPYC platforms where the EBO
provides tremendous CPU savings.

Overall, the following changes are visible from the application code:

  - mt_list_for_each_entry_safe() + 1 back elem + 1 back ptr
    => MT_LIST_FOR_EACH_ENTRY_LOCKED() or MT_LIST_FOR_EACH_ENTRY_UNLOCKED()
       + 1 back elem

  - MT_LIST_DELETE_SAFE() no longer needed in MT_LIST_FOR_EACH_ENTRY_UNLOCKED()
      => just manually set iterator to NULL however.
    For MT_LIST_FOR_EACH_ENTRY_LOCKED()
      => mt_list_unlock_self() (if element going to be reused) + NULL

  - MT_LIST_LOCK_ELT => mt_list_lock_full()
  - MT_LIST_UNLOCK_ELT => mt_list_unlock_full()

  - l = MT_LIST_APPEND_LOCKED(h, e);  MT_LIST_UNLOCK_ELT();
    => l=mt_list_lock_prev(h); mt_list_lock_elem(e); mt_list_unlock_full(e, l)

OPTIM: pool: improve needed_avg cache line access pattern

On an AMD EPYC 3rd gen, 20% of the CPU is spent calculating the amount
of pools needed when using QUIC, because pool allocations/releases are
quite frequent and the inter-CCX communication is super slow. Still,
there's a way to save between 0.5 and 1% CPU by using fetch-add and
sub-fetch that are converted to XADD so that the result is directly
fed into the swrate_add argument without having to re-read the memory
area. That's what this patch does.

MINOR: ssl/sample: ssl_c_san returns a comma separated list of SAN

The ssl_c_san sample fetch returns a list of Subject Alt Name which was
presented by the client certificate.

The format is the same as the "openssl x509 -text" command, it's a
Description: Value list separated by commas.
The format is directly generated by the GENERAL_NAME_print() openssl
function.

https://github.com/openssl/openssl/blob/openssl-3.0/crypto/x509/v3_san.c#L207

Example:
IP Address:127.0.0.1, IP Address:127.0.0.2, IP Address:127.0.0.3, URI:http://docs.haproxy.org/2.7/, DNS:ca.tests.haproxy.com

BUG/MINOR: jwt: fix variable initialisation

Set the alg variable from sample_conv_jwt_verify_check() to
JWT_ALG_DEFAULT.

This was reported by coverity in #2630, but since you need to use the
first argument to use the 2nd, this has no real impact.

Mut be backported with 883f1bd (as far as 2.6).

BUG/MEDIUM: init: fix fd_hard_limit default in compute_ideal_maxconn

This commit fixes 41275a691 ("MEDIUM: init: set default for fd_hard_limit via
DEFAULT_MAXFD").

fd_hard_limit is taken in account implicitly via 'ideal_maxconn' value in
all maxconn adjustements, when global.rlimit_memmax is set:

MIN(global.maxconn, capped by global.rlimit_memmax, ideal_maxconn);

It also caps provided global.rlimit_nofile, if it couldn't be set as a current
process fd limit (see more details in the main() code).

So, lets set the default value for fd_hard_limit only, when there is no any
other haproxy-specific limit provided, i.e. rlimit_memmax, maxconn,
rlimit_nofile. Otherwise we may break users configs.

Please, note, that in master-worker mode, master does not need the
DEFAULT_MAXFD (1048576) as well, as we explicitly limit its maxconn to 100.

Must be backported in all stable versions until v2.6.0, including v2.6.0,
like the commit above.

MINOR: quic: rename "ssl error" trace

SSL status is reported each time quic_conn_io_cb() is finished via a
trace. Change the trace label from "ssl error" to "ssl status". This
allows to search for errors easier without being distracted by this
trace.

DEV: flags/quic: decode quic_conn flags

Decode quic_conn flags via qc_show_flags() function.

To support this, quic flags definition have been put outside of USE_QUIC
directive.

CI: add weekly QUIC Interop regression against LibreSSL

currently only quic-go and picoquic clients are enabled with testsuites
supposed to be "green". Tests will be run weekly.

BUG/MEDIUM: peers: Fix crash when syncing learn state of a peer without appctx

For a given peer, the synchronization of the learn state is no longer
performed in the peer appctx. It is delayed to be handled by the peers sync
task. It means that for a given peer, it is possible to have finished to
learn and only handle it after the appctx release. So the synchronization
may happen on a peer without appctx.

This was not tested and an unconditionnal wakeup on the appctx could lead to
a crash because of a NULL-deref. It may be experienced by running
reg-tests/peers/tls_basic_sync.vtc script in loop. The fix is obivous. In
sync_peer_learn_state(), we must omit to wakeup the appctx if it was already
released.

This patch should fix issue #2629. It must be backported to 3.0.

BUG/MEDIUM: quic: prevent crash on accept queue full

Handshake for quic_conn instances runs on a single non-chosen thread. On
completion, listener_accept() is performed to select the less loaded
thread before initializing connection instance. As such, quic_conn
instance is migrated to the thread with its upper connection.

In case accept queue is full, listener_accept() fallback to local accept
mode, which cause the connection to be assigned to the current thread.
However, this is not supported by QUIC as quic_conn instance is left on
the previously selected thread. In most cases, this will cause a
BUG_ON() due to a task manipulation from an outside thread.

To fix this, handle quic_conn thread rebind in multiple steps using the
new extended protocol API. Several operations have been moved from
qc_set_tid_affinity1() to newly defined qc_set_tid_affinity2(), in
particular CID TID update. This ensures that quic_conn instance is not
prematurely accessed on the new thread until accept queue push is
guaranteed to succeed.

qc_reset_tid_affinity() is also newly defined to reassign the newly
created tasks and tasklets to the current thread. This is necessary to
prevent the BUG_ON() crash described above.

This must be backported up to 2.8 after a period of observation. Note
that it depends on previous patch :
MINOR: proto: extend connection thread rebind API

MINOR: proto: extend connection thread rebind API

MINOR: listener: define callback for accept queue push

Extend API for connection thread rebind API by replacing single callback
set_affinity by three different ones. Each one of them is used at a
different stage of the operation :

* set_affinity1 is used similarly to previous set_affinity

* set_affinity2 is called directly from accept_queue_push_mp() when an
  entry has been found in accept ring. This operation cannot fail.

* reset_affinity is called after set_affinity1 in case of failure from
  accept_queue_push_mp() due to no space left in accept ring. This is
  necessary for protocols which must reconfigure resources before
  fallback on the current tid.

This patch does not have any functional changes. However, it will be
required to fix crashes for QUIC connections when accept queue ring is
full. As such, it must be backported with it.

DOC: configuration: update maxconn description

Let's update maxconn keyword description, in order to make it clear, which
setting has the precedence over the global.maxconn and the SYSTEM_MAXCONN if
set.

MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD

Let's provide a default value for fd_hard_limit, if it's not set in the
configuration. With this patch we could set some specific default via
compile-time variable DEFAULT_MAXFD as well. Hope, this will be helpfull for
haproxy package maintainers.

make -j 8 TARGET=linux-glibc DEBUG=-DDEFAULT_MAXFD=50000

If haproxy is comipled without DEFAULT_MAXFD defined, the default will be set
to 1048576.

This is done to avoid killing the process by its watchdog, while it started
without any limitations in its configuration or in the command line and the
hard RLIMIT_NOFILE is extremely huge (~1000000000). We use in this case
compute_ideal_maxconn() to calculate maxconn and maxsock, maxsock defines the
size of internal fdtab, which becames very-very large as well. When
the process starts to simply loop over this fdtab (0(n)), this takes a lot of
time, so watchdog does it job.

To avoid this, maxconn now is always reduced to some reasonable value either
by explicit global.fd-hard-limit from configuration, or by its default. The
default may be changed at build-time and overwritten then by
global.fd-hard-limit at runtime. Explicit global.fd-hard-limit from the
configuration has always precedence over DEFAULT_MAXFD, if set.

Must be backported in all stable versions until v2.6.0, including v2.6.0.

MINOR: quic: ensure quic_conn is never removed on thread affinity rebind

On accept, quic_conn instance is migrated from its original thread to a
new one. This operation is conducted in two steps, on the original than
the new thread instance. During the interval, quic_conn is artificially
rendered inactive. It must never be accessed nor removed until migration
is completed via qc_finalize_affinity_rebind(). This new BUG_ON() will
enforce that removal is never conducted until migration is completed.

MINOR: quic: add 2 BUG_ON() on datagram dispatch

QUIC datagram dispatch is an error prone operation as it must always
ensure the correct thread is used before accessing to the recipient
quic_conn instance. Strengthen this code part by adding two BUG_ON_HOT()
to ensure thread safety.

REORG: quic: remove quic_cid_trees reference from proto_quic

Previous commit removed access/manipulation to QUIC CID global tree
outside of quic_cid module. This ensures that proper locking is always
performed.

This commit finalizes this cleanup by marking CID global tree as static
only to quic_cid source file. Initialization of this tree is removed
from proto_quic and now performed using dedicated initcalls
quic_alloc_global_cid_tree().

As a side change, complete CID global tree documentation, in particular
to explain CID global tree artificial splitting and ODCID handling.
Overall, the code is now clearer and safer.

MINOR: quic: remove access to CID global tree outside of quic_cid module

haproxy generates for each QUIC connection a set of CID. The peer must
reuse them as DCID for its emitted packet. On datagram reception, DCID
field serves as identifier to dispatch them on their correct thread.

These CIDs are stored in a global CID tree. Access to this data
structure must always be protected with CID_LOCK. This commit is a
refactoring to regroup all CID tree access in quic_cid module. Several
code parts are ajusted :

* quic_cid_insert() is extended to check for insertion race-condition.
  This is useful on quic_conn instantiation. Code where such race cannot
  happen can use unsafe _quic_cid_insert() instead.

* on RETIRE_CONNECTION_ID frame reception, existing quic_cid_delete()
  function is used.

* remove tree lookup from qc_check_dcid(), extracted in the new
  quic_cmp_cid_conn() function. Ultimately, the latter should be removed
  as CID lookup could be conducted on quic_conn owned tree without
  locking.

CLEANUP: quic: remove non-existing quic_cid_tree definition

quic_cid_tree global variable does not exist anymore. Remove its
definition in quic_conn.c.

CLEANUP: quic: cleanup prototypes related to CIDs handling

Remove duplicated prototypes from quic_conn.h also present in
quic_cid.h. Also remove quic_derive_cid() prototype and mark it as
static.

BUG/MINOR: jwt: don't try to load files with HMAC algorithm

When trying to use a HMAC algorithm (HS256, HS384, HS512) the
sample_conv_jwt_verify_check() function of the converter tries to load a
file even if it is only supposed to contain a secret instead of a path.

When using lua, the check function is called at runtime so it even tries
to load file at each call... This fixes the issue for HMAC algorithm
but this is still a problem with the other algorithms, since we don't
have a way of pre-loading files before the call.

Another solution must be found to prevent disk IO with lua using other
algorithms.

Must be backported as far as 2.6.

BUG/MEDIUM: server: fix race on server_atomic_sync()

The following patch fixes a race condition during server addr/port
update :
  cd994407a9545a8d84e410dc0cc18c30966b70d8
  BUG/MAJOR: server/addr: fix a race during server addr:svc_port updates

The new update mechanism is implemented via an event update. It uses
thread isolation to guarantee that no other thread is accessing server
addr/port. Furthermore, to ensure server instance is not deleted just
before the event handler, server instance is lookup via its ID in proxy
tree.

However, thread isolation is only entered after server lookup. This
leaves a tiny race condition as the thread will be marked as harmless
and a concurrent thread can delete the server in the meantime. This
causes server_atomic_sync() to manipulated a deleted server instance to
reinsert it in used_server_addr backend tree. This can cause a segfault
during this operation or possibly on a future used_server_addr tree
access.

This issue was detected by criteo. Several backtraces were retrieved,
each related to server addr_node insert or delete operation, either in
srv_set_addr_desc(), or add/delete dynamic server handlers.

To fix this, simply extend thread isolation section to start it before
server lookup. This ensures that once retrieved the server cannot be
deleted until its addr/port are updated. To ensure this issue won't
happen anymore, a new BUG_ON() is added in srv_set_addr_desc().

Also note that ebpt_delete() is now called every time on delete handler
as this is a safe idempotent operation.

To reproduce these crashes, a script was executed to add then remove
different servers every second. In parallel, the following CLI command
was issued repeatdly without any delay to force multiple update on
servers port :

  set server <srv> addr 0.0.0.0 port $((1024 + RANDOM % 1024))

This must be backported at least up to 3.0. If above mentionned patch
has been selected for previous version, this commit must also be
backported on them.

DOC: configuration: more details about the master-worker mode

Add more details about the master-worker mode in the "master-worker"
global keyword.

Should fix issue #2198.

BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work with applet's buffers

In 3.0, the CLI applet was rewritten to use its own buffers. However, the
lua part, used to register CLI commands at runtime, was not updated
accordingly. It means the lua CLI commands still try to write in the channel
buffers. This is of course totally unexepected and not supported. Because of
this bug, the applet hangs intead of returning the command result.

The registration of lua CLI commands relies on the lua TCP applets. So the
send and receive functions were fixed to use the applet's buffer when it is
required and still use the channel buffers otherwies. This way, other lua
TCP applets can still run on the legacy mode, without the applet's buffers.

This patch must be backported to 3.0.

DOC: configuration: add details about crt-store in bind "crt" keyword

Add some details about the certificate storage cache system in the "crt"
bind keyword.

This should be backported to 3.0. Fix issue #2618.

BUG/MINOR: promex: Remove Help prefix repeated twice for each metric

When the support for modules was added, the function producing the #HELP
line of each metric was refactored. Since then, the prefix "#HELP
<metric-name>" is printed twice because a code block was not removed. It is
now fixed.

This patch must be backported to 3.0.

BUG/MEDIUM: quic: fix possible exit from qc_check_dcid() without unlocking

Locking of the CID tree was extended in qc_check_dcid() by recent commit
05f59a5 ("BUG/MINOR: quic: fix race condition in qc_check_dcid()") but
there was a direct return from the middle of the function which was not
covered by the unlock, resulting in the function keeping the lock on
success return.

Let's just remove this return and replace it with a variable to merge all
exit paths.

This must be backported wherever the fix above is backported.

BUG/MINOR: quic: Wrong datagram building when probing.

This issue was revealed by chacha20 interop test which very often fails with
ngtcp2 as client. This was due to the fact that 2 application level packets could
be coalesced into the same datagram as revealed by such a capture:

Frame 380: 255 bytes on wire (2040 bits), 255 bytes captured (2040 bits)
Point-to-Point Protocol
Internet Protocol Version 4, Src: 193.167.100.100, Dst: 193.167.0.100
User Datagram Protocol
QUIC IETF
    QUIC Connection information
        [Connection Number: 0]
    [Packet Length: 187]
    QUIC Short Header DCID=ec523fe99840f9c17c868a88d649147814 PKN=333
        0... .... = Header Form: Short Header (0)
        .1.. .... = Fixed Bit: True
        ..0. .... = Spin Bit: False
        [...0 0... = Reserved: 0]
        [.... .0.. = Key Phase Bit: False]
        [.... ..00 = Packet Number Length: 1 bytes (0)]
        Destination Connection ID: ec523fe99840f9c17c868a88d649147814
        [Packet Number: 333]
        Protected Payload […]: 43537d43a3c83e47db6891bd6a4fd7d7fa31941badcb87a540e843341d6a5e493ed4c3f6e6bbff094804ee0ab06830dc1a1bbf52ace4323d2e4f6e0bd4eea73df0721d2949d05a058d3afb974e814494ebf44d1375b0e7f1fd5bcf634cf32ef9a9b4018758a49d39a24c40
    STREAM id=0 fin=0 off=294768 len=144 dir=Bidirectional origin=Client-initiated
        Frame Type: STREAM (0x000000000000000e)
            .... ...0 = Fin: False
            .... ..1. = Len(gth): True
            .... .1.. = Off(set): True
        Stream ID: 0
            .... .... .... .... .... .... .... .... .... .... .... .... .... .... .... ...0 = Stream initiator: Client-initiated (0)
            .... .... .... .... .... .... .... .... .... .... .... .... .... .... .... ..0. = Stream direction: Bidirectional (0)
        Offset: 294768
        Length: 144
        Stream Data […]: 63eef6ccee0d2ab602db3682d0e7cc09b72db6adc307d7699a211144b4b6c029cbed9beae1491c10a5fe0678d815a5303843d33c0593fedc9b64068fd0207e280d05aac2c0054fe9ab30857bc3669ee51d34756cfd2e098eb1ab31a03911f6a103f0a16f8f984d9861efdcf4433c
QUIC IETF
    [Packet Length: 38]
    QUIC Short Header DCID=ec523fe99840f9c17c868a88d649147814 PKN=334
        0... .... = Header Form: Short Header (0)
        .1.. .... = Fixed Bit: True
        ..0. .... = Spin Bit: False
        [...0 0... = Reserved: 0]
        [.... .0.. = Key Phase Bit: False]
        [.... ..00 = Packet Number Length: 1 bytes (0)]
        Destination Connection ID: ec523fe99840f9c17c868a88d649147814
        [Packet Number: 334]
        Protected Payload: b9c0e6dc3fc523574f8164c31b6cd156496212
    PING
        Frame Type: PING (0x0000000000000001)
    PADDING Length: 2
        Frame Type: PADDING (0x0000000000000000)
        [Padding Length: 2]

On the peer side these two packet are considered as a unique one
because there may be only one packet by datagram at application encryption
level and reported as a STREAM frame encoding error:

I00000332 0xec523fe99840f9c17c868a88d649147814 con recv packet len=225
mask=b2c69c7827 sample=43a3c83e47db6891bd6a4fd7d7fa3194
I00000332 0xec523fe99840f9c17c868a88d649147814 pkt rx pkn=333 dcid=0xec523fe99840f9c17c868a88d649147814 type=1RTT k=0
I00000332 0xec523fe99840f9c17c868a88d649147814 frm rx 333 1RTT STREAM(0x0e) id=0x0 fin=0 offset=294768 len=144 uni=0
ngtcp2_conn_read_pkt: ERR_FRAME_ENCODING
I00000332 0xec523fe99840f9c17c868a88d649147814 pkt tx pkn=1531039643 dcid=0xae79dfc99d6c65d6 type=1RTT k=0
I00000332 0xec523fe99840f9c17c868a88d649147814 frm tx 1531039643 1RTT CONNECTION_CLOSE(0x1c) error_code=FRAME_ENCODING_ERROR(0x7) frame_type=0 reason_len=0 reason=[]
I00000332 0xec523fe99840f9c17c868a88d649147814 frm tx 1531039643 1RTT PADDING(0x00) len=9

Note here that the sum of the two packet sizes (from capture) is the same as the
packet length reporte by ngtcp2: 187+38 = 225. It also seems that wireshark tries
to parse as much as packet into the same datagram, regardless of the QUIC protocol
rules.

Haproxy traces revealed that this could happen at least when probing the peer.
The recent low level packet building modifications aim was to build
as much as datagrams into the same buffer. But it seems that the
probing packet case treatment has been broken. That said, I have not
identified impacted commit. This issue could be reproduced inside
interop test environment (no possible git bisection).

To fix this, rely on the <probe> variable value to identify if the last
packet built by qc_prep_pkts() was a probing one, then try to
coalesce some others packet into the same datagram if this was not the case.
Of course the test on <probe> value has to be done before setting it
for the next packet.

Must be backported to 3.0.

[RELEASE] Released version 3.1-dev2

Released version 3.1-dev2 with the following main changes :
    - BUG/MINOR: log: fix broken '+bin' logformat node option
    - DEBUG: hlua: distinguish burst timeout errors from exec timeout errors
    - REGTESTS: ssl: fix some regtests 'feature cmd' start condition
    - BUG/MEDIUM: ssl: AWS-LC + TLSv1.3 won't do ECDSA in RSA+ECDSA configuration
    - MINOR: ssl: activate sigalgs feature for AWS-LC
    - REGTESTS: ssl: activate new SSL reg-tests with AWS-LC
    - BUG/MEDIUM: proxy: fix email-alert invalid free
    - REORG: mailers: move free_email_alert() to mailers.c
    - BUG/MINOR: proxy: fix email-alert leak on deinit() (2nd try)
    - DOC: configuration: fix alphabetical order of bind options
    - DOC: management: document ptr lookup for table commands
    - BUG/MAJOR: quic: fix padding with short packets
    - BUG/MAJOR: quic: do not loop on emission on closing/draining state
    - MINOR: sample: date converter takes HTTP date and output an UNIX timestamp
    - SCRIPTS: git-show-backports: do not truncate git-show output
    - DOC: api/event_hdl: small updates, fix an example and add some precisions
    - BUG/MINOR: h3: fix crash on STOP_SENDING receive after GOAWAY emission
    - BUG/MINOR: mux-quic: fix crash on qcs SD alloc failure
    - BUG/MINOR: h3: fix BUG_ON() crash on control stream alloc failure
    - BUG/MINOR: quic: fix BUG_ON() on Tx pkt alloc failure
    - DEV: flags/show-fd-to-flags: adapt to recent versions
    - MINOR: capabilities: export capget and __user_cap_header_struct
    - MINOR: capabilities: prepare support for version 3
    - MINOR: capabilities: use _LINUX_CAPABILITY_VERSION_3
    - MINOR: cli/debug: show dev: add cmdline and version
    - MINOR: cli/debug: show dev: show capabilities
    - MINOR: debug: print gdb hints when crashing
    - BUILD: debug: also declare strlen() in __ABORT_NOW()
    - BUILD: Missing inclusion header for ssize_t type
    - BUG/MINOR: hlua: report proper context upon error in hlua_cli_io_handler_fct()
    - MINOR: cfgparse/log: remove leftover dead code
    - BUG/MEDIUM: stick-table: Decrement the ref count inside lock to kill a session
    - MINOR: stick-table: Always decrement ref count before killing a session
    - REORG: init: do MODE_CHECK_CONDITION logic first
    - REORG: init: encapsulate CHECK_CONDITION logic in a func
    - REORG: init: encapsulate 'reload' sockpair and master CLI listeners creation
    - REORG: init: encapsulate code that reads cfg files
    - BUG/MINOR: server: fix first server template name lookup UAF
    - MINOR: activity: make the memory profiling hash size configurable at build time
    - BUG/MEDIUM: server/dns: prevent DOWN/UP flap upon resolution timeout or error
    - BUG/MEDIUM: h3: ensure the ":method" pseudo header is totally valid
    - BUG/MEDIUM: h3: ensure the ":scheme" pseudo header is totally valid
    - BUG/MEDIUM: quic: fix race-condition in quic_get_cid_tid()
    - BUG/MINOR: quic: fix race condition in qc_check_dcid()
    - BUG/MINOR: quic: fix race-condition on trace for CID retrieval

BUG/MINOR: quic: fix race-condition on trace for CID retrieval

quic_rx_pkt_retrieve_conn() is used when parsing a received datagram
from the listener socket. It returned the quic_conn instance
corresponding to the first packet DCID, unless it is mapped to another
thread.

As expected, global CID tree access is protected by a lock in the
function. However, there is a race condition due to the final trace
where qc instance is dereferenced outside of the lock. Fix this by
adding a new trace under lock protection and remove qc deferencement at
function end.

This may fix first crash of github issue #2607.

This must be backported up to 2.8.

BUG/MINOR: quic: fix race condition in qc_check_dcid()

qc_check_dcid() is a function which check that a DCID is associated to
the expected quic_conn instance. This is used for quic_conn socket
receive handler as there is a tiny risk that a datagram to another
connection was received on this socket.

As other operations on global CID tree, a lock must be used to protect
against race condition. However, as previous commit, lock was not held
long enough as CID tree node is accessed outside of the lock region. To
fix this, increase critical section until CID dereferencement is done.

The impact of this bug should be similar to the previous one. However,
risk of crash are even less reduced as it should be extremely rare to
receive datagram for other connections on a quic_conn socket. As such,
most of the time first check condition of qc_check_dcid() is enough.

This may fix first crash of issue github #2607.

This must be backported up to 2.8.

BUG/MEDIUM: quic: fix race-condition in quic_get_cid_tid()

haproxy generates CID for clients which reuse them as DCID on their
packets. These CID are stored in a global tree quic_cid_trees. Each
operation on this tree must be done under lock protection.

quic_get_cid_tid() is a function which lookups a CID in global tree and
return the associated thread ID. This is used on datagram reception on
listener socket before redispatching the datagram to the correct thread.
This function uses a lock to protect quic_cid_trees access. However,
lock region is too small as CID tree node is accessed outside of it. Fix
this by extending lock protection for CID dereferencement until thread
ID is retrieved.

The impact of this bug is unknown, but it may possible cause crashes.
However, it is probably rare as most of datagram reception is done on
quic_conn socket which does not uses quic_get_cid_tid().

This may fix first crash of github issue #2607.

This must be backported up to 2.8.

BUG/MEDIUM: h3: ensure the ":scheme" pseudo header is totally valid

Ensure pseudo-header scheme is only constitued of valid characters
according to RFC 9110. If an invalid value is found, the request is
rejected and stream is resetted.

It's the same as for previous commit "BUG/MEDIUM: h3: ensure the
":method" pseudo header is totally valid" except that this time it
applies to the ":scheme" pseudo header.

This must be backported up to 2.6.

BUG/MEDIUM: h3: ensure the ":method" pseudo header is totally valid

Ensure pseudo-header method is only constitued of valid characters
according to RFC 9110. If an invalid value is found, the request is
rejected and stream is resetted.

Previously only characters forbidden in headers were rejected (NUL/CR/LF),
but this is insufficient for :method, where some other forbidden chars
might be used to trick a non-compliant backend server into seeing a
different path from the one seen by haproxy. Note that header injection
is not possible though.

This must be backported up to 2.6.

Many thanks to Yuki Mogi of FFRI Security Inc for the detailed report
that allowed to quicky spot, confirm and fix the problem.

BUG/MEDIUM: server/dns: prevent DOWN/UP flap upon resolution timeout or error

This is a complementary patch to c16eba818 ("BUG/MEDIUM: server/dns:
preserve server's port upon resolution timeout or error").

Indeed, since c16eba818, the port is properly preserved, but unsetting
server's address this way results in server_atomic_sync() function
thinking that we're actually setting a new address and not unsetting
the previous one because addr family is != AF_UNSPEC.

Upon DNS timeout, this could be observed:

[WARNING] (2588257) : Server http/s1 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (2588257) : Server http/s1 ('test1.localhost') is UP/READY (resolves again).

Notice that server timeouts and then immediately resolves again. Of course
in this case case the server's address was properly set to 0, meaning
that the server will not receive any traffic, but it is confusing and
could result in haproxy temporarily thinking that the server is actually
available while it's not.

To properly fix the issue and restore historical behavior, let's
explicitly set inetaddr's family to AF_UNSPEC after fetching original
server's address.

It should be backported in 3.0 with c16eba818.

MINOR: activity: make the memory profiling hash size configurable at build time

The MEMPROF_HASH_BITS variable was set to 10 without a possibility to
change it (beyond patching the code). After seeing a few reports already
with "other" being listed and a list with close to 1024 entries, it looks
like it's about time to either increase the hash size, or at least make
it configurable for special cases. As a reminder, in order to remain
fast, the algorithm searches no more than 16 places after the hash, so
when a table is almost full, searches are long and new places are rare.

The present patch just makes it possible to redefine it by passing
"-DMEMPROF_HASH_BITS=11" or "-DMEMPROF_HASH_BITS=12" in CFLAGS, and
moves the definition to defaults.h to make it easier to find. Such
values should be way sufficient for the vast majority of use cases.
Maybe in the future we'd change the default. At least this version
should be backported to ease rebuilds, say, till 2.8 or so.

BUG/MINOR: server: fix first server template name lookup UAF

This is a follow-up for 7223296 ("BUG/MINOR: server: fix first server
template not being indexed").

Indeed, in 7223296 we added a new call to _srv_parse_set_id_from_prefix()
for the first server before handling additional ones. But we actually
overlooked the fact that _srv_parse_set_id_from_prefix() was already
performed at the end of _srv_parse_tmpl_init() for the same server.

Since _srv_parse_set_id_from_prefix() frees srv->id, it results in UAF
when performing name lookups on the first server, because used_server_name
node key still uses the freed string pointer.

The early _srv_parse_set_id_from_prefix() call (added in 7223296) and
the original one perform the same task, except that the new one is
followed by name node insertion logic required for name lookups to work
properly. So let's simply get rid of the old one at the end of the
function.

_srv_parse_set_id_from_prefix() in the 'err:' label was also removed since
is is now useless as well starting with 7223296 and would trigger the same
bug on error paths. Thanks to Amaury for noticing it.

This bug was discovered while trying to address GH issue #2620.
Thanks to @x-yuri for his detailed report (with working repro).

It should be backported in 3.0 with 7223296.

REORG: init: encapsulate code that reads cfg files

Haproxy master process should not read its configuration the second time
after performing reexec and passing to MODE_MWORKER_WAIT. So, to make
this part of init() function more readable and to distinguish better the
point, where configs have been read, let's encapsulate it in a separate
function.

REORG: init: encapsulate 'reload' sockpair and master CLI listeners creation

Let's encapsulate the logic of 'reload' sockpair and master CLI listeners
creation, used by master CLI into a separate function, as we needed this
only in master-worker runtime mode. This makes the code of init() more
readable.

REORG: init: encapsulate CHECK_CONDITION logic in a func

As MODE_CHECK_CONDITION logic terminates the process anyway, no matter if
the test for the provided condition was successfull or not, let's
encapsulate it in a separate function. This makes the code of init() more
readable.

REORG: init: do MODE_CHECK_CONDITION logic first

In MODE_CHECK_CONDITION we only parse check_condition string, provided by
'-cc', and then we evaluate it. Haproxy process terminates at the
end of {if..else} block anyway, if the test has failed or passed. So, it
will be more appropriate to perform MODE_CHECK_CONDITION test first and
then do all other process runtime mode verifications.

MINOR: stick-table: Always decrement ref count before killing a session

Guarded functions to kill a sticky session, stksess_kill()
stksess_kill_if_expired(), may or may not decrement and test its reference
counter before really killing it. This depends on a parameter. If it is set
to non-zero value, the ref count is decremented and if it falls to zero, the
session is killed. Otherwise, if this parameter is equal to zero, the
session is killed, regardless the ref count value.

In the code, these functions are always called with a non-zero parameter and
the ref count is always decremented and tested. So, there is no reason to
still have a special case. Especially because it is not really easy to say
if it is supported or not. Does it mean it is possible to kill a sticky
session while it is still referenced somewhere ? probably not. So, does it
mean it is possible to kill a unreferenced session ? This case may be
problematic because the session is accessed outside of any lock and thus may
be released by another thread because it is unreferenced. Enlarging scope of
the lock to avoid any issue is possible but it is a bit of shame to do so
because there is no usage for now.

The best is to simplify the API and remove this case. Now, stksess_kill()
and stksess_kill_if_expired() functions always decrement and test the ref
count before killing a sticky session.

BUG/MEDIUM: stick-table: Decrement the ref count inside lock to kill a session

When we try to kill a session, the shard must be locked before decrementing
the ref count on the session. Otherwise, the ref count can fall to 0 and a
purge task (stktable_trash_oldest or process_table_expire) may release the
session before we have the opportunity to acquire the lock on the shard to
effectively kill the session. This could lead to a double free.

Here is the scenario:

    Thread 1                                 Thread 2

  sktsess_kill(ts)
    if (ATOMIC_DEC(&ts->ref_cnt) != 0)
        return
                   /* here the ref count is 0 */

                                       stktable_trash_oldest()
                                          LOCK(&sh_lock)
                                          if (!ATOMIC_LOAD(&ts->ref_cnf))
                                              __stksess_free(ts)
                                          UNLOCK(&sh_lock)

                  /* here the session was released */
    LOCK(&sh_lock)
    __stksess_free(ts)  <--- double free
    UNLOCK(&sh_lock)

The bug was introduced in 2.9 by the commit 7968fe3889 ("MEDIUM:
stick-table: change the ref_cnt atomically"). The ref count must be
decremented inside the lock for stksess_kill() and sktsess_kill_if_expired()
function.

This patch should fix the issue #2611. It must be backported as far as 2.9. On
the 2.9, there is no sharding. All the table is locked. The patch will have to
be adapted.

MINOR: cfgparse/log: remove leftover dead code

Remove development leftover introduced by commit 15e9c7da6 ("MINOR: log:
add log-profile parsing logic").

Indeed, since "log-profile" section keyword is registered via
REGISTER_CONFIG_SECTION() macro, it is not relevant to declare it in
common_kw_list[] from cfgparse-global.c. All it does is that it could
confuse the user by suggesting him to use "log-profile" inside a global
section when trying to find a best match in cfg_parse_global().

BUG/MINOR: hlua: report proper context upon error in hlua_cli_io_handler_fct()

As a result of copy pasting, hlua_cli_io_handler_fct() used to report lua
exceptions like E_ETMOUT as "Lua converter" instead of "Lua cli".

Let's fix that.

It could be backported to all stable versions.

[ada: for older versions, HLUA_E_BTMOUT case didn't exist so it has to be
skipped]

BUILD: Missing inclusion header for ssize_t type

Compilation issue detected as follows by gcc:

In file included from src/ncbuf.c:19:
src/ncbuf.c: In function 'ncb_write_off':
include/haproxy/bug.h:144:10: error: unknown type name 'ssize_t'
144 | extern ssize_t write(int, const void *, size_t); \

BUILD: debug: also declare strlen() in __ABORT_NOW()

Previous commit 8f204fa8ae ("MINOR: debug: print gdb hints when crashing")
broken on the CI where strlen() isn't known. Let's forward-declare it in
the __ABORT_NOW() functions, just like write(). No backport is needed.

MINOR: debug: print gdb hints when crashing

To make bug reporting easier for users, when crashing, let's suggest
what to do. Typically when a BUG_ON() matches, only the current thread
is useful the vast majority of the time, while when the watchdog
triggers, all threads are interesting.

The messages are printed at the end after the dump. We may adjust these
with wiki links in the future is more detailed instructions are relevant.

MINOR: cli/debug: show dev: show capabilities

If haproxy compiled with Linux capabilities support, let's show process
capabilities before applying the configuration and at runtime in 'show dev'
command output. This maybe useful for debugging purposes. Especially in
cases, when process changes its UID and GID to non-priviledged or it
has started and run under non-priviledged UID and needed capabilities are
set by admin on the haproxy binary.

MINOR: cli/debug: show dev: add cmdline and version

'show dev' command is very convenient to obtain haproxy debugging information,
while process is run in container. Let's extend its output with version and
cmdline. cmdline is useful in a way, as it shows absolute binary path and its
arguments, because sometimes the person, who is debugging failing container is
not the same, who has created and deployed it.

argc and argv are stored in the exported global structure, because
feed_post_mortem() is added as a post check function callback in the
post_check_list. So we can't simply change the signature of
feed_post_mortem(), without breaking other post check callbacks APIs.

Parsers are not supposed to modify argv, so we can safely bypass its pointer
to debug_parse_cli_show_dev(), without copying all argument stings somewhere
in the heap or on stack.

MINOR: capabilities: use _LINUX_CAPABILITY_VERSION_3

Linux kernel shows the warning below, when _LINUX_CAPABILITY_VERSION_1 is
used in capset() and capget().

        [1710243.523230] capability: warning: `haproxy' uses 32-bit capabilities (legacy support in use)

This triggers questions from users. Warning is shown by kernel, because
since Linux 2.6.25, 64-bit capabilities support was introduced in
_LINUX_CAPABILITY_VERSION_2. It's in order to be able to continiously
extend capabilities list with the new ones.

We can't use _LINUX_CAPABILITY_VERSION_2, because this version triggers
another warning, according linux/kernel/capability.c (see also more details
about it in comments from kernel sources and in man capset(2)).

kernel/capability.c:
    ...
    static int cap_validate_magic(cap_user_header_t header, unsigned *tocopy)
    {
            __u32 version;

            if (get_user(version, &header->version))
                    return -EFAULT;

            switch (version) {
            case _LINUX_CAPABILITY_VERSION_1:
                    warn_legacy_capability_use();
                    *tocopy = _LINUX_CAPABILITY_U32S_1;
                    break;
            case _LINUX_CAPABILITY_VERSION_2:
                    warn_deprecated_v2();
                    fallthrough;    /* v3 is otherwise equivalent to v2 */
            case _LINUX_CAPABILITY_VERSION_3:
                    *tocopy = _LINUX_CAPABILITY_U32S_3;
                    break;
            default:
            ...

So, to avoid any warnings, lets use _LINUX_CAPABILITY_VERSION_3, which
according to comments in linux/kernel/capability.c, has the same
functionality as _LINUX_CAPABILITY_VERSION_2 (i.e. array of 2
__user_cap_data_struct with 32-bits integers for each capability set), but
comes in Linux 2.6.26 with a header change, in order to protect legacy
source code.

For the moment, we don't authorize capabilities higher, than CAP_SYS_ADMIN
(21-st bit), so we always check the "low" 32 bits, i.e.
__user_cap_data_struct[0].

MINOR: capabilities: prepare support for version 3

Commit e338d263a76a ("Add 64-bit capability support to the kernel")
introduces in the kernel _LINUX_CAPABILITY_VERSION_1 and
_LINUX_CAPABILITY_VERSION_2 and its corresponded magic numbers "1"
(_LINUX_CAPABILITY_U32S_1) and "2" (_LINUX_CAPABILITY_VERSION_2).

Capabilities sets, since this commit, are composed as an arrays of
__user_cap_data_struct with length defined in version's magic number
(e.g. struct __user_cap_data_struct kdata[_LINUX_CAPABILITY_U32S_1]).

These magic numbers also help the kernel to figure out how many data
(in __user_cap_data_struct "units") it needs to copy_from/to_user in
capset/capget syscalls.

In order to use _LINUX_CAPABILITY_VERSION_3 in the next commit (it has the
same functionality as version 2), let's follow the kernel code and let's
allocate memory to store 32-capabilities as an array of
__user_cap_data_struct with the length of 1 (_LINUX_CAPABILITY_U32S_1).

MINOR: capabilities: export capget and __user_cap_header_struct

To be able to show process capabilities before applying its configuration and
also at runtime in 'show dev' command output, we need to export the wrapper
around capget() syscall. It also seems more handy to place
__user_cap_header_struct in .data section and declare it as globally
accessible, as we always fill it with the same values. This avoids allocate
and fill these 8 bytes each time on the stack frame, when capget() or capset()
wrappers are called.

DEV: flags/show-fd-to-flags: adapt to recent versions

The script hadn't been updated since it was introduced, and the
hard-coded field 12 doesn't match anymore (it's 16 now). Let's just
use "grep -o cflg..." to extract the desired part more flexibly.
This can be backported at least to 3.0, probably further, but it
will need to be tested prior to this. Better not bring it too far,
it's only used when debugging.

BUG/MINOR: quic: fix BUG_ON() on Tx pkt alloc failure

On quic_tx_packet allocation failure, it is possible to trigger BUG_ON()
crash on INITIAL packet building. This statement is responsible to
ensure INITIAL packets are padded to 1.200 bytes as required. If a
packet on higher encryption level allocation fails, PADDING frame cannot
properly encoded, despite the INITIAL packet properly built.

This crash happens due to qc_txb_store() invokation after quic_tx_packet
allocation failure to validate already built packets. However, this
statement is unneeded as qc_purge_tx_buf() is called just after. Simply
remove qc_txb_store() to fix this issue.

This was detected using -dMfail.

This should be backported up to 2.6.

BUG/MINOR: h3: fix BUG_ON() crash on control stream alloc failure

BUG_ON() from qcc_set_error() is triggered on HTTP/3 control stream
allocation failure. This is caused because both h3_finalize() and
qcc_init_stream_local() call qcc_set_error() which is forbidden to
prevent error code erasure.

Fix this by removing qcc_set_error() invocation from h3_finalize() on
allocation failure. Note that this function is still responsible to use
it on SETTING frame emission failure.

This was detected using -dMfail.

This must be backported up to 3.0.

BUG/MINOR: mux-quic: fix crash on qcs SD alloc failure

Since the following commit, sedesc are created since QCS instantiation
in qcs_new().
086e51017e7731ee9820b882fe6e8cc5f0dd5352
BUG/MEDIUM: mux-quic: Create sedesc in same time of the QUIC stream

However, sedesc is initialized before other QCS mandatory fields. If
sedesc allocation fails, a crash would occur on qcs_free() invocation
for QCS early release. To fix this, delay sedesc allocation until
function end.

This bug was detected using -dMfail.

This should be backported up to 2.6.

BUG/MINOR: h3: fix crash on STOP_SENDING receive after GOAWAY emission

After emitting a HTTP/3 GOAWAY frame, opening of streams higher than
advertised ID was prevented. h3_attach operation would return success
but without allocating H3S stream context for QCS. In addition, the
stream would be immediately scheduled for RESET_STREAM emission.

Despite the immediate stream close, the current is not sufficient enough
and can cause crashes. When of this occurence can be found if
STOP_SENDING is the first frame received for a stream. A crash would
occur under qcc_recv_stop_sending() after h3_attach invokation, when
h3_close() is used which try to access to H3S context.

To fix this, change h3_attach API. In case of success, H3S stream
context is always allocated, even if the stream will be scheduled for
immediate close. This renders the code more reliable.

This crash should be extremely rare, as it can only happen after GOAWAY
emission, which is only used on soft-stop or reload.

This should solve the second crash occurence reported on GH #2607.

This must be backported up to 2.8.

DOC: api/event_hdl: small updates, fix an example and add some precisions

Fix an example suggesting that using EVENT_HDL_SUB_TYPE(x, y) with y being
0 was valid. Then add some notes to explain how to use
EVENT_HDL_SUB_FAMILY() and EVENT_HDL_SUB_TYPE() with valid values.

Also mention that the feature is available starting from 2.8 and not 2.7.
Finally, perform some purely cosmetic updates.

This could be backported in 2.8.

SCRIPTS: git-show-backports: do not truncate git-show output

git-show-backports lists a git-show command which can be used to inspect
all commits subject to backport. This command specifies formatting
option to reproduce default git-show output, especially for commit
messages indented with 4 spaces character. However, it also add wrapping
on message line longer than 72 characters. This reduce lisibility of
messages where large info are written such as backtraces.

Improve this by changing git-show format option. Use a limit value of 0
to disable wrapping while preserving indentation.

This could be backported to every stable version to simplify backporting
process.

MINOR: sample: date converter takes HTTP date and output an UNIX timestamp

The `date` converter takes an HTTP date in input, it could be either a
imf, rfc850 or asctime date. It will output an UNIX timestamp.

BUG/MAJOR: quic: do not loop on emission on closing/draining state

To emit CONNECTION_CLOSE frame, a special buffer is allocated via
qc_txb_store(). This is due to QUIC_FL_CONN_IMMEDIATE_CLOSE flag.
However this flag is reset after qc_send_ppkts() invocation to prevent
reemission of CONNECTION_CLOSE frame.

qc_send() can invoke multiple times a series of qc_prep_pkts() +
qc_send_ppkts() to emit several datagrams. However, this may cause a
crash if on first loop a CONNECTION_CLOSE is emitted. On the next loop
iteration, QUIC_FL_CONN_IMMEDIATE_CLOSE is resetted, thus qc_prep_pkts()
will use the wrong buffer size as end delimiter. In some cases, this may
cause a BUG_ON() crash due to b_add() outside of buffer.

This bug can be reproduced by using a while loop of ngtcp2-client and
interrupting them randomly via Ctrl+C.

Here is the patch which introduce this regression :
cdfceb10ae136b02e51f9bb346321cf0045d58e0
MINOR: quic: refactor qc_prep_pkts() loop

BUG/MAJOR: quic: fix padding with short packets

QUIC sending functions were extended to be more flexible. Of all the
changes, they support now iterating over a variable instance of QEL
instance of only 2 previously. This change has rendered PADDING emission
less previsible, which was adjusted via the following patch :

  a60609f1aa3e5f61d2a2286fdb40ebf6936a80ee
  BUG/MINOR: quic: fix padding of INITIAL packets

Its main purpose was to ensure PADDING would only be generated for the
last iterated QEL instance, to avoid unnecessary padding. In parallel, a
BUG_ON() statement ensure that built INITIAL packets are always padded
to 1.200 bytes as necessary before emitted them.

This BUG_ON() statement caused crash in one particular occurence : when
building datagrams that mixed Initial long packets and 1-RTT short
packets. This last occurence type does not have a length field in its
header, contrary to Long packets. This caused a miscalculation for the
necessary padding size, with INITIAL packets not padded enough to reach
the necessary 1.200 bytes size.

This issue was detected on 3.0.2. It can be reproduced by using 0-RTT
combined with latency. Here are the used commands :

  $ ngtcp2-client --tp-file=/tmp/ngtcp2-tp.txt \
    --session-file=/tmp/ngtcp2-session.txt --exit-on-all-streams-close \
    127.0.0.1 20443 "https://[::]/?s=32o"
  $ sudo tc qdisc add dev lo root netem latency 500ms

Note that this issue cannot be reproduced on current dev version.
Indeed, it seems that the following patch introduce a slight change in
packet building ordering :

  cdfceb10ae136b02e51f9bb346321cf0045d58e0
  MINOR: quic: refactor qc_prep_pkts() loop

This must be backported to 3.0.

This should fix github issue #2609.

DOC: management: document ptr lookup for table commands

Add missing documentation and examples for the optional ptr lookup method
for table {show,set,clear} commands introduced in commit 9b2717e7 ("MINOR:
stktable: use {show,set,clear} table with ptr"), as initially described in
GH #2118.

It may be backported in 3.0.

DOC: configuration: fix alphabetical order of bind options

Put the curves, ecdhe, severity-output, v4v6 and v6only keyword at the
right place.

Fix issue #2594.

Could be backported in every stable versions.

BUG/MINOR: proxy: fix email-alert leak on deinit() (2nd try)

As shown in GH #2608 and ("BUG/MEDIUM: proxy: fix email-alert invalid
free"), simply calling free_email_alert() from free_proxy() is not the
right thing to do.

In this patch, we reuse proxy->email_alert.set memory space to introduce
proxy->email_alert.flags in order to support 2 flags:
PR_EMAIL_ALERT_SET (to mimic proxy->email_alert.set) and
PR_EMAIL_ALERT_RESOLVED (set once init_email_alert() was called on the
proxy to resolve email_alert.mailer pointer).

Thanks to PR_EMAIL_ALERT_RESOLVED flag, free_email_alert() may now
properly handle the freeing of proxy email_alert settings: if the RESOLVED
flag is set, then it means the .email_alert.mailers.name parsing hint was
replaced by the actual mailers pointer, thus no free should be attempted.

No backport needed: as described in ("BUG/MEDIUM: proxy: fix email-alert
invalid free"), this historical leak is not sensitive as it cannot be
triggered during runtime.. thus given that the fix is not backport-
friendly, it's not worth the trouble.

REORG: mailers: move free_email_alert() to mailers.c

free_email_alert() was declared in cfgparse.c, but it should belong to
mailers.c instead.

BUG/MEDIUM: proxy: fix email-alert invalid free

In fa90a7d3 ("BUG/MINOR: proxy: fix email-alert leak on deinit()"), I
tried to fix email-alert deinit() leak the simple way by leveraging
existing free_email_alert() helper function which was already used for
freeing email alert settings used in a default section.

However, as described in GH #2608, there is a subtelty that makes
free_email_alert() not suitable for use from free_proxy().

Indeed, proxy 'mailers.name' hint shares the same memory space than the
pointer to the corresponding mailers section (once the proxy is resolved,
name hint is replaced by the pointer to the section). However, since both
values share the same space (through union), we have to take care of not
freeing `mailers.name` once init_email_alert() was called on the proxy.

Unfortunately, free_email_alert() isn't protected against that, causing
double free() during deinit when mailers section is referenced from
multiple proxy sections. Since there is no easy fix, and that the leak in
itself isn't a big deal (fa90a7d3 was simply an opportunistic fix rather
than a must-have given that the leak only occurs during deinit and not
during runtime), let's actually revert the fix to restore legacy behavior
and prevent deinit errors.

Thanks to @snetat for having reported the issue on Github as well as
providing relevant infos to pinpoint the bug.

It should be backported everywhere fa90a7d3 was backported.
[ada: for versions prior to 3.0, simply revert the offending commit using
'git revert' as proxy_free_common() first appears in 3.0]

REGTESTS: ssl: activate new SSL reg-tests with AWS-LC

Prerequisites are now available in AWS-LC, so we can enable these
reg-tests.

With this patch, aws-lc only has 5 reg-tests that are not working:
- reg-tests/ssl/ssl_reuse.vtc: stateful session resumption is only supported with TLSv1.2
- reg-tests/ssl/ssl_curve_name.vtc: function to extract curve name is not available
- reg-tests/ssl/ssl_errors.vtc: errors are not the same than OpenSSL
- reg-tests/ssl/ssl_dh.vtc: AWS-LC does not support DH
- reg-tests/ssl/ssl_curves.vtc: not working correctly

Which means most of the features are working correctly.

MINOR: ssl: activate sigalgs feature for AWS-LC

AWSLC lacks the SSL_CTX_set1_sigalgs_list define, however the function
exists, which disables the feature in HAProxy, even if we could have
build with it.

SSL_CTX_set1_client_sigalgs_list() is not available, though.

This patch introduce the define so the feature is enabled.

BUG/MEDIUM: ssl: AWS-LC + TLSv1.3 won't do ECDSA in RSA+ECDSA configuration

SSL_get_ciphers() in AWS-LC seems to lack the TLSv1.3 ciphersuites,
which break the ECDSA key selection when doing TLSv1.3.

An issue was opened https://github.com/aws/aws-lc/issues/1638

Indeed, in ssl_sock_switchctx_cbk(), the sigalgs is used to determine if
ECDSA is doable or not, then the function compares the list of ciphers in
the clienthello with the list of configured ciphers.

The fix solves the issue by never skipping the TLSv1.3 ciphersuites,
even if they are not in SSL_get_ciphers().

REGTESTS: ssl: fix some regtests 'feature cmd' start condition

Since patch fde517b ("REGTESTS: wolfssl: temporarly disable some failing
reg-tests") some 'feature cmd' lines have an extra quotation mark, so
they were disable in every cases.

Must be backported to 2.9.

DEBUG: hlua: distinguish burst timeout errors from exec timeout errors

hlua burst timeout was introduced in 58e36e5b1 ("MEDIUM: hlua: introduce
tune.lua.burst-timeout").

It is a safety measure that allows to detect when too much time is spent
on a single lua execution (between 2 interruptions/yields), meaning that
the current thread is not able to perform other tasks. Such scenario
should be avoided because it will cause thread contention which may have
negative performance impact and could cause the watchdog to trigger. When
the burst timeout is exceeded, the current Lua execution is aborted and a
timeout error is reported to the user.

Unfortunately, the same error is currently being reported for cumulative
(AKA execution) timeout and for burst timeout, which may be confusing to
the user.

Indeed, "execution timeout" error historically results from the current
hlua context exceeding the total (cumulative) time it's allowed to run.
It is set per lua context using the dedicated tunables:
- tune.lua.session-timeout
- tune.lua.task-timeout
- tune.lua.service-timeout

We've already faced an user report where the user was able to trigger the
burst timeout and got "Lua task: execution timeout." error while the user
didn't set cumulative timeout. Thus the error was actually confusing
because it was indeed the burst timeout which was causing it due to the
use of cpu-intensive call from within the task without sufficient manual
"yield" keypoints around the cpu-intensive call to ensure it runs on a
dedicated scheduler cycle.

In this patch we make it so burst timeout related errors are reported as
"burst timeout" errors instead of "execution timeout" errors (which
in fact became the generic timeout errors catchall with 58e36e5b1).

To do this, hlua_timer_check() now returns a different value depending if
the exeeded timeout is the burst one or the cumulative one, which allows
us to return either HLUA_E_ETMOUT or HLUA_E_BTMOUT in hlua_ctx_resume().

It should improve the situation described in GH #2356 and may possibly be
backported with 58e36e5b1 to improve error reporting if it applies without
resistance.

BUG/MINOR: log: fix broken '+bin' logformat node option

In 12d08cf912 ("BUG/MEDIUM: log: don't ignore disabled node's options"),
while trying to restore historical node option inheritance behavior, I
broke the '+bin' logformat node option recently introduced in b7c3d8c87c
("MINOR: log: add +bin logformat node option").

Indeed, because of 12d08cf912, LOG_OPT_BIN is not set anymore on
individual nodes even if it was set globally, making the feature unusable.
('+bin' is also used for binary cbor encoding)

What I should have done instead is include LOG_OPT_BIN in the options
inherited from global ones. This is what's being done in this commit.
Misleading comment was adjusted.

It must be backported in 3.0 with 12d08cf912.

[RELEASE] Released version 3.1-dev1

Released version 3.1-dev1 with the following main changes :
    - REGTESTS: Remove REQUIRE_VERSION=2.1 from all tests
    - REGTESTS: Remove REQUIRE_VERSION=2.2 from all tests
    - CI: use "--no-install-recommends" for apt-get
    - CI: switch to lua 5.4
    - CI: use USE_PCRE2 instead of USE_PCRE
    - DOC: replace the README by a markdown version
    - CI: VTest: accelerate package install a bit
    - ADMIN: acme.sh: remove the old acme.sh code
    - BUG/MINOR: cfgparse: remove the correct option on httpcheck send-state warning
    - BUG/MINOR: tcpcheck: report correct error in tcp-check rule parser
    - BUG/MINOR: tools: fix possible null-deref in env_expand() on out-of-memory
    - DOC: configuration: add an example for keywords from crt-store
    - CI: speedup apt package install
    - DOC: add the FreeBSD status badge to README.md
    - DOC: change the link to the FreeBSD CI in README.md
    - MINOR: stktable: avoid ambiguous stktable_data_ptr() usage in cli_io_handler_table()
    - BUG/MINOR: hlua: use CertCache.set() from various hlua contexts
    - CLEANUP: hlua: fix CertCache class comment
    - CI: FreeBSD: upgrade image, packages
    - BUG/MEDIUM: h1-htx: Don't state interim responses are bodyless
    - MEDIUM: stconn: Be able to unblock zero-copy data forwarding from done_fastfwd
    - BUG/MEDIUM: mux-quic: Unblock zero-copy forwarding if the txbuf can be released
    - BUG/MINOR: quic: prevent crash on qc_kill_conn()
    - CLEANUP: hlua: use hlua_pusherror() where relevant
    - BUG/MINOR: hlua: don't use lua_pushfstring() when we don't expect LJMP
    - BUG/MINOR: hlua: fix unsafe hlua_pusherror() usage
    - BUG/MINOR: hlua: prevent LJMP in hlua_traceback()
    - CLEANUP: hlua: get rid of hlua_traceback() security checks
    - BUG/MINOR: hlua: fix leak in hlua_ckch_set() error path
    - CLEANUP: hlua: simplify ambiguous lua_insert() usage in hlua_ctx_resume()
    - BUG/MEDIUM: mux-quic: Don't unblock zero-copy fwding if blocked during nego
    - MINOR: mux-quic: Don't send an emtpy H3 DATA frame during zero-copy forwarding
    - BUG/MEDIUM: ssl: wrong priority whem limiting ECDSA ciphers in ECDSA+RSA configuration
    - BUG/MEDIUM: ssl: bad auth selection with TLS1.2 and WolfSSL
    - BUG/MINOR: quic: fix computed length of emitted STREAM frames
    - BUG/MINOR: quic: ensure Tx buf is always purged
    - BUG/MEDIUM: stconn/mux-h1: Fix suspect change causing timeouts
    - BUG/MAJOR: mux-h1:  Properly copy chunked input data during zero-copy nego
    - BUG/MINOR: mux-h1: Use the right variable to set NEGO_FF_FL_EXACT_SIZE flag
    - DOC: install: remove boringssl from the list of supported libraries
    - MINOR: log: fix "http-send-name-header" ignore warning message
    - BUG/MINOR: proxy: fix server_id_hdr_name leak on deinit()
    - BUG/MINOR: proxy: fix log_tag leak on deinit()
    - BUG/MINOR: proxy: fix email-alert leak on deinit()
    - BUG/MINOR: proxy: fix check_{command,path} leak on deinit()
    - BUG/MINOR: proxy: fix dyncookie_key leak on deinit()
    - BUG/MINOR: proxy: fix source interface and usesrc leaks on deinit()
    - BUG/MINOR: proxy: fix header_unique_id leak on deinit()
    - MINOR: proxy: add proxy_free_common() helper function
    - BUG/MEDIUM: proxy: fix UAF with {tcp,http}checks logformat expressions
    - MINOR: log: change wording in lf_expr_postcheck() error message
    - BUG/MEDIUM: log: fix lf_expr_postcheck() behavior with default section
    - CLEANUP: log/proxy: fix comment in proxy_free_common()
    - DOC: config: move "hash-key" from proxy to server options
    - DOC: config: add missing section hint for "guid" proxy keyword
    - DOC: config: add missing context hint for new server and proxy keywords
    - BUG/MINOR: promex: Skip resolvers metrics when there is no resolver section
    - DOC: internals: add a documentation about the master worker
    - BUG/MAJOR: mux-h1: Prevent any UAF on H1 connection after draining a request
    - BUG/MINOR: quic: fix padding of INITIAL packets
    - OPTIM: quic: fill whole Tx buffer if needed
    - MINOR: quic: refactor qc_build_pkt() error handling
    - MINOR: quic: use global datagram headlen definition
    - MINOR: quic: refactor qc_prep_pkts() loop
    - DOC/MINOR: management: add missed -dR and -dv options
    - DOC/MINOR: management: add -dZ option
    - DOC: management: rename show stats domain cli "dns" to "resolvers"
    - REORG: log: reorder send log helpers by dependency order
    - MINOR: session: expose session_embryonic_build_legacy_err() function
    - MEDIUM: log/session: handle embryonic session log within sess_log()
    - MINOR: log: provide sending log context to process_send_log() when available
    - MINOR: log: add log_orig_to_str() function
    - MINOR: log: provide log origin in logformat expressions using '%OG'
    - CLEANUP: log: remove ambiguous legacy comment for resolve_logger()
    - MINOR: log/backend: always free parsing hints in resolve_logger()
    - MINOR: log: make resolve_logger() static
    - MINOR: log: provide proxy context to resolve_logger()
    - MINOR: log: add __send_log_set_metadata_sd helper
    - MINOR: log: add logger flags
    - MINOR: log: add log-profile parsing logic
    - MINOR: log: add log profile buildlines
    - MEDIUM: log: handle log-profile in process_send_log()
    - DOC: config: add documentation for log profiles
    - REGTESTS: log: add a test for log-profile
    - MINOR: ssl: add ssl_sock_bind_verifycbk() in ssl_sock.h
    - REORG: ssl: move the SNI selection code in ssl_clienthello.c
    - BUILD: ssl: fix build with wolfSSL
    - CI: github: upgrade aws-lc to 1.29.0
    - Revert "CI: github: upgrade aws-lc to 1.29.0"
    - MEDIUM: ssl: support for ECDA+RSA certificate selection with AWS-LC
    - BUILD: ssl: disable deprecated functions for AWS-LC 1.29.0
    - MINOR: ssl: relax the 'ssl.default-dh-param' keyword parsing
    - CI: github: upgrade aws-lc to 1.29.0
    - DOC: INSTALL: minimum AWS-LC version is v1.22.0
    - CI: github: do the AWS-LC weekly build with ERR=1

CI: github: do the AWS-LC weekly build with ERR=1

The weekly CI that tries new version of AWS-LC was not building with
ERR=1, which let us think that everything was good but there was in fact
new warning that we missed.

Add ERR=1 to the build so the CI will failed for any new warning.

DOC: INSTALL: minimum AWS-LC version is v1.22.0

Change the minimum AWS-LC version required

CI: github: upgrade aws-lc to 1.29.0

Upgrade aws-lc to 1.29.0 on the push CI.

MINOR: ssl: relax the 'ssl.default-dh-param' keyword parsing

Some libraries are ignoring SSL_CTX_set_tmp_dh_callback(), but disabling
the 'ssl.default-dh-param' keyword when the function is not supported would
result in an error instead of silently continuing. This patch emits a
warning when the keyword is not supported instead of a loading failure.

BUILD: ssl: disable deprecated functions for AWS-LC 1.29.0

AWS-LC have a lot of functions that does nothing, which are now
deprecated and emits some warning.

This patch disables the following useless functions that emits a warning:
SSL_CTX_get_security_level(), SSL_CTX_set_tmp_dh_callback(),
ERR_load_SSL_strings(), RAND_keep_random_devices_open()

The list of deprecated functions is here:

https://github.com/aws/aws-lc/blob/main/docs/porting/functionality-differences.md