git.ipfire.org Git - thirdparty/haproxy.git/log

MEDIUM: stick-tables: Optimize the expiration process a bit.

In process_tables_expire(), if the table we're analyzing still has
entries, and thus should be put back into the tree, put it back into the
tree right away instead of putting it in the mt_list to have it
reinserted the next time the task runs. There is no problem with doing
so, as either the next expiration is in the future, or we handled the
maximum number of expirations per task call and we're about to stop
anyway.

This does not need to be backported.

BUG/MEDIUM: stick-tables: Make sure we handle expiration on all tables

In process_tables_expire(), when parsing all the tables with expiration
set to check whether any entry expired, make sure we start from the
oldest one. We can't just rely on eb32_first(), because of sign issues
on the timestamp. Not doing so may mean that some tables are not
considered for expiration.
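The sign issue can be illustrated with a small model. Expiration dates are 32-bit tick values that wrap around, so the smallest unsigned key (what eb32_first() returns on an eb32 tree) is not necessarily the oldest date. The sketch below stands in a sorted array for the tree and looks up the first key at or after "now minus half the 32-bit range", falling back to the first key on wrap-around. The array, the LOOK_BACK value and the helper name are assumptions for illustration, not the code of the actual fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: dates more than half the 32-bit range behind <now> are
 * treated as being in the future (wrap-around arithmetic). */
#define LOOK_BACK (1U << 31)

/* <keys> is sorted in ascending unsigned order, like an eb32 tree.
 * Returns the index of the oldest expiration date relative to <now>. */
static size_t oldest_expire_index(const uint32_t *keys, size_t n, uint32_t now)
{
	uint32_t start = now - LOOK_BACK;   /* wraps modulo 2^32 */
	size_t i;

	assert(n > 0);
	/* equivalent of a "lookup greater or equal" on <start> */
	for (i = 0; i < n; i++)
		if (keys[i] >= start)
			return i;
	/* nothing at or above <start>: dates wrapped, take the first key */
	return 0;
}
```

With now = 0x10 and keys {0x5, 0xFFFFFFF0}, eb32_first() alone would pick 0x5, although 0xFFFFFFF0 expired earlier; the look-back lookup returns 0xFFFFFFF0.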

This does not need to be backported.

MINOR: quic: remove <mux_state> field

This patch removes <mux_state> field from quic_conn structure. The
purpose of this field was to indicate if MUX layer above quic_conn is
not yet initialized, active, or already released.

It became tedious to properly set it, as the initialization order of the
various quic_conn/conn/MUX layers now differs between the frontend and
backend sides, and also depends on whether 0-RTT is used or not. Recently,
a new change introduced in connect_server() makes it possible to initialize
the QUIC MUX earlier if the ALPN is cached on the server structure. This
added another level of complexity.

Thus, this patch removes the <mux_state> field completely. Instead, a new
flag QUIC_FL_CONN_XPRT_CLOSED is defined. It is set in a single place
only, on the XPRT close callback invocation. It can be combined with the
new utility functions qc_wait_for_conn()/qc_is_conn_ready() to determine
the status of the conn/MUX layers without an extra quic_conn field.

DOC: configuration: deprecate the master-worker keyword

Deprecate the 'master-worker' keyword in the global section.

Split the configuration of the 'no-exit-on-failure' subkeyword into
another section which is not deprecated yet and explains that it's only
meant for debugging purposes.

MEDIUM: cfgparse: 'daemon' not compatible with -Ws

Emit a warning when the 'daemon' keyword is used in master-worker mode
for systemd (-Ws). This never worked and was always ignored by setting
MODE_FOREGROUND during cmdline parsing.

MEDIUM: cfgparse: deprecate 'master-worker' keyword alone

Warn when the 'master-worker' keyword is used without
'no-exit-on-failure'.

Warn when the 'master-worker' keyword is used while -W or -Ws already
set the mode.

BUG/MEDIUM: connections: permit to permanently remove an idle conn

There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
  - when picking a new connection that no longer has any avail stream
  - when temporarily working on the connection from an I/O handler,
    in which case it's re-added at the end
  - when killing a connection

The 2nd case above is quite specific, as it requires preserving the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flag preservation causes a tiny race
if an I/O event happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was;
then the connection gets purged, and another user can still find it
in the tree.

The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.

The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.

A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.

This will need to be backported to all stable versions. The issue was
reproduced on versions as old as 3.1 (older ones were not tested). The
patch will need to be adjusted for 3.2 and older, because a 2nd argument
"thr" was added in 3.3, so the patch will not apply to older versions
as-is.

BUG/MEDIUM: mux-h2: make sure not to move a dead connection to idle

In h2_detach(), it looks possible to place a dead connection back into
the idle list, and to later call h2_release() on it once detected as
dead. It's not certain that this happens, but nothing in the code shows
that it cannot, so it's better to make sure it cannot happen.

This should be preventively backported to all versions.

BUG/MEDIUM: mux-h1: fix 414 / 431 status code reporting

The more detailed status code reporting introduced with bc967758a2 is
checking against the error state to determine whether it is a too long
URL or too large headers. The check used always returns true, which
results in a 414 since the error state is only set at a later point.

This commit adjusts the check to use the current state instead, in order
to return the intended status code.
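The intent of the fix can be expressed as a decision on the parser state that was active when the size limit was hit, rather than on the error state, which is only entered afterwards. The enum and function below are hypothetical and only model that decision, not the mux-h1 code.

```c
#include <assert.h>

/* Hypothetical parser states: the request line is parsed before headers */
enum h1_parse_state { H1_ST_RQ_LINE, H1_ST_HDRS, H1_ST_ERROR };

/* Pick the status for a request that overflowed the buffer while being
 * parsed in state <cur>: still in the request line means the URI is too
 * long (414), otherwise the header block is too large (431). */
static int h1_too_large_status(enum h1_parse_state cur)
{
	return cur == H1_ST_RQ_LINE ? 414 : 431;
}
```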

This patch must be backported as far as 3.1.

BUG/MEDIUM: server: Also call srv_reset_path_parameters() on srv up

Also call srv_reset_path_parameters() when the server changes state
and comes up. It is not enough to do it when the server goes down,
because there's a small race condition: a connection could get
established just after we did it, and could have set the path parameters.

This does not need to be backported.

BUG/MEDIUM: server: Add a rwlock to path parameter

Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and that it can't be
seen in an inconsistent state.
Also don't set the parameters every time; only set them if they have
changed, to prevent needless writes.
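A minimal sketch of this pattern with POSIX read-write locks: readers take the lock in shared mode, and the writer only takes it exclusively when the value actually differs, re-checking under the write lock. The structure and helper names are hypothetical, the parameter is modelled as a plain string, and HAProxy uses its own lock macros rather than pthreads directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct srv_model {
	pthread_rwlock_t path_lock;
	char path_param[64];     /* stands for the server's path parameters */
	unsigned int writes;     /* counts actual updates, for illustration */
};

static void srv_set_path_param(struct srv_model *s, const char *val)
{
	int changed;

	/* cheap shared check first: most calls find the same value */
	pthread_rwlock_rdlock(&s->path_lock);
	changed = strcmp(s->path_param, val) != 0;
	pthread_rwlock_unlock(&s->path_lock);
	if (!changed)
		return;

	pthread_rwlock_wrlock(&s->path_lock);
	/* re-check: another thread may have stored it in the meantime */
	if (strcmp(s->path_param, val) != 0) {
		snprintf(s->path_param, sizeof(s->path_param), "%s", val);
		s->writes++;
	}
	pthread_rwlock_unlock(&s->path_lock);
}

/* Readers copy the value under the shared lock, never seeing a half-write */
static void srv_get_path_param(struct srv_model *s, char *out, size_t len)
{
	pthread_rwlock_rdlock(&s->path_lock);
	snprintf(out, len, "%s", s->path_param);
	pthread_rwlock_unlock(&s->path_lock);
}
```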

This does not need to be backported.

MINOR: quic: remove connection arg from qc_new_conn()

This patch is similar to the previous one, this time dealing with
qc_new_conn(). This function was asymmetric between the frontend and
backend sides, as the connection argument was set only in the latter case.

This was previously required due to the qc_alloc_ssl_sock_ctx() signature.
This has changed with the previous patch, thus qc_new_conn() can also be
realigned on both FE and BE sides. The <conn> member of the quic_conn
instance is always set outside of it, in qc_xprt_start() in the backend
case.

MINOR: quic: do not set conn member if ssl_sock_ctx

ssl_sock_ctx is a generic object used both on TCP/SSL and QUIC stacks.
Most notably it contains a <conn> member which is a pointer to struct
connection.

On the QUIC frontend side, this member is always set to NULL. Indeed,
the connection is only created after handshake completion. However, this
has changed on the backend side, where the connection is instantiated
prior to its quic_conn counterpart. Thus, the ssl_sock_ctx member was set
in this case as a convenience for later use in qc_ssl_do_hanshake().

However, this method was unsafe as the connection can be released
without resetting the ssl_sock_ctx member. Thus, the previous patch fixes
this by accessing the <conn> member through the quic_conn instance, which
is the proper way.

Thus, this patch resets the ssl_sock_ctx <conn> member to NULL. This is
deemed the cleanest method as it ensures that neither the frontend nor
the backend side can use it anymore.

BUG/MINOR: quic: fix crash on client handshake abort

On the backend side, a connection can be aborted and released prior to
handshake completion. This causes a crash in qc_ssl_do_hanshake() as
the <conn> member of ssl_sock_ctx is not reset in this case.

To fix this, use the <conn> member of quic_conn instead. This is safe as
it is properly set to NULL when a connection is released.

No impact on the frontend side as the <conn> member is not accessed.
Indeed, in this case the connection is most of the time allocated after
handshake completion.

No need to be backported.

CI: github: update to macos-26

macOS-15 images seem to have had difficulty running the reg-tests for a
few days, for an unknown reason. Rolling back both VTest2 and haproxy
doesn't seem to fix the problem, so this is probably related to a change
in GitHub Actions.

This patch switches to the new macos-26 images, which seem to fix the
problem.

SCRIPTS: build-ssl: fix rpath in AWS-LC install for openssl and bssl bin

AWS-LC binaries were not linked with an rpath, making them unusable
without manually setting LD_LIBRARY_PATH.

OPTIM: proxy: move atomically access fields out of the read-only ones

Perf top showed that h1_snd_buf() was having great difficulties accessing
the proxy's server_id_hdr_name field in the middle of the headers loop.
Moving the assignment out of the loop to a local variable moved the
problem there as well:

       |      if (!(h1m->flags & H1_MF_RESP) && isttest(h1c->px->server_id_hdr_n
  0.10 |20b0:   mov        -0x120(%rbp),%rdi
  1.33 |        mov        0x60(%rdi),%r10
  0.01 |        test       %eax,%eax
  0.18 |        jne        2118
12.87 |        mov        0x350(%r10),%rdi
  0.01 |        test       %rdi,%rdi
  0.05 |        je         2118
       |        mov        0x358(%r10),%r11

It turns out that there are several atomically accessed fields in its
vicinity, causing the cache line to bounce all the time. Let's collect
the few frequently changed fields and place them together at the end
of the structure, and plug the 32-bit hole with another isolated field.
Doing so also slightly reduced the cost of decrementing be->be_conn
in process_stream(), and overall the HTTP/1 performance increased by
about 1% on both ARM and x86_64.
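The layout principle can be sketched as follows: keep fields that are read on every request away from fields that many threads write. The field names are illustrative, the explicit cache-line alignment is an extra assumption (the commit only groups the frequently written fields at the end of struct proxy and fills a 32-bit hole), and the real structure has many more members.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, common on x86_64 and many ARM64 cores */
#define CACHE_LINE_SIZE 64

struct proxy_model {
	/* read-mostly fields, read by all threads in hot paths */
	const char *server_id_hdr_name;
	size_t server_id_hdr_len;
	unsigned int options;

	/* frequently written fields, grouped at the end and starting on their
	 * own cache line so that atomic updates do not evict the fields above */
	_Alignas(CACHE_LINE_SIZE) atomic_uint be_conn;
	atomic_uint queued;
};

static void be_conn_release(struct proxy_model *px)
{
	atomic_fetch_sub_explicit(&px->be_conn, 1, memory_order_relaxed);
}
```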

SCRIPTS: build-ssl: allow to build a FIPS version without FIPS

build-ssl.sh always prepends a "v" to the version, which prevents
building a FIPS version without FIPS enabled.

This patch checks whether FIPS is in the version string to choose
whether or not to add the "v".

Example:

AWS_LC_VERSION=AWS-LC-FIPS-3.0.0 BUILDSSL_DESTDIR=/opt/awslc-3.0.0 ./scripts/build-ssl.sh

OPTIM: backend: skip conn reuse for incompatible proxies

When trying to reuse a backend connection, a connection hash is
calculated to match an entry with similar parameters. Previously, this
operation was skipped if the stream content wasn't based on HTTP, as it
would have been incompatible with http-reuse.

With the introduction of SPOP backends, this condition was removed so
that they can also benefit from connection reuse. However, this means
that the hash calculation is now always performed when connecting to a
server, even for TCP or log backends. This is unnecessary as these
proxies cannot perform connection reuse.

Note also that the reuse mode is reset during post-parsing for
incompatible backends. This at least guarantees that no tree lookup will
be performed via be_reuse_connection(). However, a connection lookup is
still performed in the session via session_get_conn(), which is another
unnecessary operation.

Thus, this patch restores the condition so that reuse operations are now
entirely skipped if a backend mode is incompatible. This is implemented
via a new utility function named be_supports_conn_reuse().
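A sketch of the restored guard, assuming, as described above, that only HTTP and SPOP backends can reuse connections. The enum names, the counter and the model functions are hypothetical; the real be_supports_conn_reuse() works on the backend proxy itself.

```c
#include <assert.h>

/* Hypothetical subset of backend modes */
enum be_mode { BE_MODE_TCP, BE_MODE_HTTP, BE_MODE_SPOP, BE_MODE_LOG };

/* Only HTTP and SPOP backends can reuse connections */
static int be_supports_conn_reuse_model(enum be_mode mode)
{
	return mode == BE_MODE_HTTP || mode == BE_MODE_SPOP;
}

static unsigned int hash_calls;   /* counts hash computations, for illustration */

static void connect_server_model(enum be_mode mode)
{
	if (be_supports_conn_reuse_model(mode)) {
		hash_calls++;     /* stands for the connection hash computation */
		/* ... session and idle tree lookups would follow here ... */
	}
	/* ... otherwise go straight to opening a new connection ... */
}
```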

This could be backported up to 3.1, as this commit could be considered
as a performance regression for tcp/log backend modes.

BUG/MAJOR: stats-file: fix crash on non-x86 platform caused by unaligned cast

Since commit d655ed5f14 ("BUG/MAJOR: stats-file: ensure
shm_stats_file_object struct mapping consistency (2nd attempt)"), the
last_state_change field in the counters is a uint (to match how it's
reported). However, the me_generate_field() function retrieves the value
through explicit casts, which cause crashes on aarch64 and likely other
non-x86 64-bit platforms due to atomically reading an unaligned 64-bit
value, and may even randomly crash other 64-bit platforms when reading
past the end of the structure.

The fix for now adapts the cast to match the accessed type (i.e.
unsigned int), but the approach must change, as there's nothing there
that makes it possible to tell whether or not the type is correct just
by reading the code. At a minimum, a typeof() on a named field is
needed, but this requires more invasive changes, hence this temporary
fix.
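One way to get the kind of safety a typeof() on a named field would give is to let the compiler derive the access width from the field itself instead of writing a cast by hand. This is only a sketch assuming GCC or Clang builtins; the macro and structure names are hypothetical, and any follow-up change in HAProxy may look different.

```c
#include <assert.h>
#include <stdint.h>

/* Load a field atomically using its own declared type: __atomic_load_n()
 * infers the width from the pointer, so no hand-written cast can disagree
 * with the structure definition. */
#define LOAD_FIELD(ptr, field) __atomic_load_n(&(ptr)->field, __ATOMIC_RELAXED)

struct counters_model {
	uint32_t last_state_change;   /* 32-bit, like the shared-memory layout */
	uint32_t other;
};
```

Reading last_state_change this way always performs a 4-byte load, whereas casting its address to a 64-bit pointer would read 8 bytes, possibly unaligned and past the field.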

No backport is needed, as stats-file is only in 3.3.

BUG/MINOR: resolvers: ensure fair round robin iteration

Previous fixes restored round robin iteration, but an imbalance remains
when the response tree contains record types other than A or AAAA. Let's
take the following example: the DNS answers two A records and a CNAME.
The response "tree" (which is actually flat, more like a list) may look
as follows, ordered by hash:
- 1st item: first A record with IP 1
- 2nd item: second A record with IP 2
- 3rd item: CNAME record
As a consequence, resolv_get_ip_from_response will iterate as follows,
while the TTL is still valid:
- 1st call: DNS request is done, response tree is created, iteration
  starts at the first item, IP 1 is returned.
- 2nd call: cached response tree is used, iteration starts at the second
  item, IP 2 is returned.
- 3rd call: cached response tree is used, iteration starts at the third
  item, but it's a CNAME, so we continue to the next item, which restarts
  iteration at the first item, and IP 1 is returned.
- 4th call: cached response tree is used and iteration restarts at the
  beginning, returning IP 1 again.
The 1-2-1-1-2-1-1-2 sequence will repeat, so IP 1 will be used twice as
often as IP 2, creating a strong imbalance. Even with more IP addresses,
the first one by hashing order in the tree will always receive twice the
traffic of the others.
To fix this, set the next iteration item to the one following the selected
IP record, if any. This ensures we never use the same IP twice in a row.
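The fixed iteration can be modelled on a flat array standing in for the response tree. The record layout and function name are hypothetical; the point is that the next starting position becomes the item following the selected A/AAAA record, so a non-address record such as the CNAME can no longer make the first address come back twice in a row.

```c
#include <assert.h>
#include <stddef.h>

enum rec_type { REC_A, REC_AAAA, REC_CNAME };

struct rec {
	enum rec_type type;
	int ip;               /* stands for the address, for illustration */
};

/* Return the next address record starting at *next (wrapping around) and
 * move *next to the item following the selected one. -1 if none. */
static int pick_ip(const struct rec *recs, size_t n, size_t *next)
{
	size_t k;

	for (k = 0; k < n; k++) {
		size_t i = (*next + k) % n;

		if (recs[i].type == REC_A || recs[i].type == REC_AAAA) {
			*next = (i + 1) % n;
			return recs[i].ip;
		}
	}
	return -1;
}
```

For the example response (A with IP 1, A with IP 2, CNAME), this model yields 1-2-1-2-1-2 instead of 1-2-1-1-2-1.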

This commit should be backported where 3023e9819 ("BUG/MINOR: resolvers:
Restore round-robin selection on records in DNS answers") is, so as far
as 2.6.

REGTESTS: converters: check USE_OPENSSL in aes_gcm.vtc

Check USE_OPENSSL as well as the haproxy version for the aes_gcm
reg-test.

MINOR: sample: optional AAD parameter support to aes_gcm_enc/dec

The aes_gcm_enc() and aes_gcm_dec() sample converters now accept an
optional fifth argument for Additional Authenticated Data (AAD). When
provided, the AAD value is base64-decoded and used during AES-GCM
encryption or decryption. Both string and variable forms are supported.

This enables use cases that require authentication of additional data.

OPTIM: quic: adjust automatic ALPN setting for QUIC servers

If a QUIC server is declared without ALPN, "h3" value is automatically
set during _srv_parse_finalize().

This patch adjusts this operation. Instead of relying on
ssl_sock_parse_alpn(), a plain strdup() is used. This is considered more
efficient as the ALPN string is constant in this case. This method is
already used for listeners on the frontend side.
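ALPN values are stored in the TLS wire format, where each protocol name is preceded by its one-byte length (RFC 7301). The generic encoder below shows what parsing a comma-separated list involves; for "h3" alone the encoded form is always the three bytes "\002h3", which is presumably why a constant can simply be duplicated. The function is a hypothetical stand-in, not the code of ssl_sock_parse_alpn().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Encode a comma-separated list such as "h3,h2" into the ALPN wire format
 * (each name prefixed by its length byte). Returns a malloc'd buffer and
 * stores its length in <out_len>, or NULL on error. */
static char *alpn_encode(const char *list, size_t *out_len)
{
	size_t in_len = strlen(list);
	/* n names and n-1 commas in, n length bytes out: in_len + 1 bytes */
	char *out = malloc(in_len + 1);
	const char *p = list;
	size_t o = 0;

	if (!out)
		return NULL;
	for (;;) {
		const char *comma = strchr(p, ',');
		size_t len = comma ? (size_t)(comma - p) : strlen(p);

		if (len == 0 || len > 255) {    /* empty or oversized name */
			free(out);
			return NULL;
		}
		out[o++] = (char)len;
		memcpy(out + o, p, len);
		o += len;
		if (!comma)
			break;
		p = comma + 1;
	}
	*out_len = o;
	return out;
}
```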

MINOR: quic: reject conf with QUIC servers if not compiled

Ensure that QUIC support is compiled into haproxy when a QUIC server is
configured. This check is performed during _srv_parse_finalize() so that
it is detected both on configuration parsing and when adding a dynamic
server via the CLI.

Note that this changes the behavior of the srv_is_quic() utility function.
Previously, it always returned false when QUIC support wasn't compiled.
With this new check, it is now guaranteed that a QUIC server cannot
exist if compilation support is not active. Hence srv_is_quic() no
longer relies on the USE_QUIC define.

MINOR: quic: enable SSL on QUIC servers automatically

Previously, QUIC servers were rejected if SSL was not explicitly
activated using the 'ssl' configuration keyword.

Change this behavior: SSL is now automatically activated for QUIC
servers when the keyword is missing. A warning is displayed as it is
considered better to explicitly state that SSL is in use.
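The post-parsing rule can be sketched as a small finalization step. QUIC always runs over TLS 1.3 (RFC 9001), so enabling SSL implicitly is safe; the structure, function name and warning text below are hypothetical, not HAProxy's actual code or message.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical subset of a server's parsed settings */
struct srv_conf_model {
	const char *id;
	int is_quic;     /* server declared with a QUIC address */
	int use_ssl;     /* 'ssl' keyword present */
};

/* Turn SSL on for a QUIC server that omitted it, and warn about it.
 * Returns 1 if a warning was emitted, 0 otherwise. */
static int srv_finalize_quic_ssl(struct srv_conf_model *srv, FILE *warn)
{
	if (!srv->is_quic || srv->use_ssl)
		return 0;
	srv->use_ssl = 1;
	fprintf(warn, "server '%s': QUIC requires SSL, enabling it implicitly; "
	        "consider adding the 'ssl' keyword\n", srv->id);
	return 1;
}
```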

[RELEASE] Released version 3.3-dev11

Released version 3.3-dev11 with the following main changes :
    - BUG/MEDIUM: mt_list: Make sure not to unlock the element twice
    - BUG/MINOR: quic-be: unchecked connections during handshakes
    - BUG/MEDIUM: cli: also free the trash chunk on the error path
    - MINOR: initcalls: Add a new initcall stage, STG_INIT_2
    - MEDIUM: stick-tables: Use a per-shard expiration task
    - MEDIUM: stick-tables: Remove the table lock
    - MEDIUM: stick-tables: Stop if stktable_trash_oldest() fails.
    - MEDIUM: stick-tables: Stop as soon as stktable_trash_oldest succeeds.
    - BUG/MEDIUM: h1-htx: Don't set HTX_FL_EOM flag on 1xx informational messages
    - BUG/MEDIUM: h3: properly encode response after interim one in same buf
    - BUG/MAJOR: pools: fix default pool alignment
    - MINOR: ncbuf: extract common types
    - MINOR: ncbmbuf: define new ncbmbuf type
    - MINOR: ncbmbuf: implement add
    - MINOR: ncbmbuf: implement iterator bitmap utilities functions
    - MINOR: ncbmbuf: implement ncbmb_data()
    - MINOR: ncbmbuf: implement advance operation
    - MINOR: ncbmbuf: add tests as standalone mode
    - BUG/MAJOR: quic: use ncbmbuf for CRYPTO handling
    - MINOR: quic: remove received CRYPTO temporary tree storage
    - MINOR: stats-file: fix typo in shm-stats-file object struct size detection
    - MINOR: compiler: add FIXED_SIZE(size, type, name) macro
    - MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct
    - BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency
    - BUG/MEDIUM: build: limit excessive and counter-productive gcc-15 vectorization
    - BUG/MEDIUM: stick-tables: Don't loop if there's nothing left
    - MINOR: acme: add the dns-01-record field to the sink
    - MINOR: acme: display the complete challenge_ready command in the logs
    - BUG/MEDIUM: mt_lists: Avoid el->prev = el->next = el
    - MINOR: quic: remove unused conn-tx-buffers limit keyword
    - MINOR: quic: prepare support for options on FE/BE side
    - MINOR: quic: rename "no-quic" to "tune.quic.listen"
    - MINOR: quic: duplicate glitches FE option on BE side
    - MINOR: quic: split congestion controler options for FE/BE usage
    - MINOR: quic: split Tx options for FE/BE usage
    - MINOR: quic: rename max Tx mem setting
    - MINOR: quic: rename retry-threshold setting
    - MINOR: quic: rename frontend sock-per-conn setting
    - BUG/MINOR: quic: split max-idle-timeout option for FE/BE usage
    - BUG/MINOR: quic: split option for congestion max window size
    - BUG/MINOR: quic: rename and duplicate stream settings
    - BUG/MEDIUM: applet: Improve again spinning loops detection with the new API
    - Revert "BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency"
    - Revert "MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct"
    - Revert "MINOR: compiler: add FIXED_SIZE(size, type, name) macro"
    - BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency (2nd attempt)
    - BUG/MINOR: stick-tables: properly index string-type keys
    - BUILD: openssl-compat: fix build failure with OPENSSL=0 and KTLS=1
    - BUG/MEDIUM: mt_list: Use atomic operations to prevent compiler optims
    - MEDIUM: quic: Fix build with openssl-compat
    - MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
    - MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()
    - BUG/MEDIUM: cli: do not return ACKs one char at a time
    - BUG/MEDIUM: ssl: Crash because of dangling ckch_store reference in a ckch instance
    - BUG/MINOR: ssl: Remove unreachable code in CLI function
    - BUG/MINOR: acl: warn if "_sub" derivative used with an explicit match
    - DOC: config: fix confusing typo about ACL -m ("now" vs "not")
    - DOC: config: slightly clarify the ssl_fc_has_early() behavior
    - MINOR: ssl-sample: add ssl_fc_early_rcvd() to detect use of early data
    - CI: disable fail-fast on fedora rawhide builds
    - MINOR: http: fix 405,431,501 default errorfile
    - BUG/MINOR: init: Do not close previously created fd in stdio_quiet
    - MINOR: init: Make devnullfd global and create it earlier in init
    - MINOR: init: Use devnullfd in stdio_quiet calls instead of recreating a fd everytime
    - MEDIUM: ssl: Add certificate password callback that calls external command
    - MEDIUM: ssl: Add local passphrase cache
    - MINOR: ssl: Do not dump decrypted privkeys in 'dump ssl cert'
    - BUG/MINOR: resolvers: Apply dns-accept-family setting on additional records
    - MEDIUM: h1: Immediately try to read data for frontend
    - REGTEST: quic: add ssl_reuse.vtc new QUIC test
    - BUG/MINOR: ssl: returns when SSL_CTX_new failed during init
    - MEDIUM: ssl/ech: config and load keys
    - MINOR: ssl/ech: add logging and sample fetches for ECH status and outer SNI
    - MINOR: listener: implement bind_conf_find_by_name()
    - MINOR: ssl/ech: key management via stats socket
    - CI: github: add USE_ECH=1 to haproxy for openssl-ech job
    - DOC: configuration: "ech" for bind lines
    - BUG/MINOR: ech: non destructive parsing in cli_find_ech_specific_ctx()
    - DOC: management: document ECH CLI commands
    - MEDIUM: mux-h2: do not needlessly refrain from sending data early
    - MINOR: mux-h2: extract the code to send preface+settings into its own function
    - BUG/MINOR: mux-h2: send the preface along with the first request if needed

BUG/MINOR: mux-h2: send the preface along with the first request if needed

Tests involving 0-RTT and H2 on the backend show that 0-RTT is being
partially used but does not work. The analysis shows that only the
preface and settings are sent using early-data and the request is sent
separately. As explained in the previous patch, this is caused by the
fact that a wakeup of the iocb is needed just to send the preface, then
a new call to process_stream is needed to try sending again.

With this patch, we're making h2_snd_buf() able to send the preface
if it was not yet sent. Thanks to this, the preface, settings and first
request can now leave as a single TCP segment. In case of TLS with 0-RTT,
the whole block can now leave as early data.
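The effect of letting h2_snd_buf() emit the preface can be modelled with a single output buffer: if the preface was not sent yet, it is queued right before the first request, so everything leaves in one write. The structure, the function names and the "[SETTINGS]" placeholder standing for the SETTINGS frame are hypothetical; only the 24-byte client connection preface comes from the HTTP/2 specification.

```c
#include <assert.h>
#include <string.h>

/* Client connection preface defined by the HTTP/2 specification (RFC 9113) */
static const char H2_PREFACE[] = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n";
static const char SETTINGS_PH[] = "[SETTINGS]";   /* placeholder frame */

struct h2c_model {
	char out[512];        /* bytes queued for the next write */
	size_t out_len;
	int preface_sent;
};

static void out_append(struct h2c_model *c, const char *data, size_t len)
{
	memcpy(c->out + c->out_len, data, len);
	c->out_len += len;
}

/* Queue a request, preceded by the preface and SETTINGS if they were not
 * sent yet. Returns 0 (and queues nothing) if the buffer is too small. */
static int h2_snd_buf_model(struct h2c_model *c, const char *req, size_t len)
{
	size_t need = len;

	if (!c->preface_sent)
		need += (sizeof(H2_PREFACE) - 1) + (sizeof(SETTINGS_PH) - 1);
	if (c->out_len + need > sizeof(c->out))
		return 0;
	if (!c->preface_sent) {
		out_append(c, H2_PREFACE, sizeof(H2_PREFACE) - 1);
		out_append(c, SETTINGS_PH, sizeof(SETTINGS_PH) - 1);
		c->preface_sent = 1;
	}
	out_append(c, req, len);
	return 1;
}
```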

Even in clear-text H2, we're now seeing a 15% lower context-switch count,
and the number of calls to process_stream() per connection dropped from 3
to 2. The connection rate increased by an extra 9.5%. Compared to without
the last 3 patches, this is a 22% reduction of context-switches, 33%
reduction of process_stream() calls, and 15.7% increase in connection
rate. And more importantly, 0-RTT now really works with H2 on the
backend, saving one full RTT on the first request.

This fix is only for a missed optimization and a non-functional 0-RTT
on the backend. It's worth backporting it, but it doesn't cause enough
harm to hurry a backport. Better wait for it to live a little bit in
3.3 (till at least a week or two after the final release) before
backporting it. It is not certain that it's worth going beyond 3.2 in
any case. It depends on these two previous commits:

MEDIUM: mux-h2: do not needlessly refrain from sending data early
MINOR: mux-h2: extract the code to send preface+settings into its own function

MINOR: mux-h2: extract the code to send preface+settings into its own function

The code that deals with sending the preface + settings and changing the
state currently lives in h2_process_mux(), but we'll want to do it as
well from h2_snd_buf(), so let's move it to a dedicated function first.
At this point there is no functional change.

MEDIUM: mux-h2: do not needlessly refrain from sending data early

The mux currently refrains from sending data before H2_CS_FRAME_H, i.e.
before the peer's SETTINGS frame was received. While it makes sense on
the frontend, it's causing harm on the backend because it forces the
first request to be sent in two halves over an extra RTT: first the
preface and settings, second the request once the settings are received.
This is totally contrary to the philosophy of the H2 protocol, which
consists in permitting the client to send as soon as possible.

Actually what happens is the following:
  - process_stream() calls connect_server()
  - connect_server() creates a connection, and if the proto/alpn is guessed
    or known, the mux is instantiated for the current request.
  - the H2 init code wakes the h2 tasklet up and returns
  - process_stream() tries to send the request using h2_snd_buf(), but that
    one sees that we're before H2_CS_FRAME_H, refrains from doing so and
    returns.
  - process_stream() subscribes and quits
  - the h2 tasklet can now execute to send the preface and settings, which
    leave as a first TCP segment. The connection is ready.
  - the iocb is woken again once the server's SETTINGS frame is received,
    turning the connection to the H2_CS_FRAME_H state, and the iocb wakes
    process_stream() up.
  - process_stream() executes again and can try to send again.
  - h2_snd_buf() is called and finally sends the request as a second TCP
    segment.

Not only is this inefficient, but it also renders 0-RTT and TFO impossible
on H2 connections. When 0-RTT is used, only the preface and settings leave
as early data (the very first data of that connection), which is totally
pointless.

In order to fix this, we have to go through a few steps:
  - first we need to let data be sent to a server immediately after the
    SETTINGS frame was sent (i.e. in H2_CS_SETTINGS1 state instead of
    H2_CS_FRAME_H). However, some protocol extensions are advertised by
    the server using SETTINGS (e.g. RFC8441) and some requests might need
    to know the existence of such extensions. For this reason we're adding
    a new h2c flag, H2_CF_SETTINGS_NEEDED, which indicates that some
    operations were not done because a server's SETTINGS frame is needed.
    This is set when trying to send a protocol upgrade or extended CONNECT
    during H2_CS_SETTINGS1, indicating that it's needed to wait for
    H2_CS_FRAME_H in this case. The flag is always set on frontend
    connections. This is what is being done in this patch.

  - second, we need to be able to push the preface opportunistically with
    the first h2_snd_buf() so that it's not needed to wake the tasklet up
    just to send that and wake process_stream() again. This will be in a
    separate patch.
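The first step can be sketched as a small decision function. The state names mirror those in the text, but the enum, the flag value and the functions are hypothetical models rather than the mux-h2 code.

```c
#include <assert.h>

enum h2_cs_model { CS_PREFACE, CS_SETTINGS1, CS_FRAME_H };   /* ordered */
#define CF_SETTINGS_NEEDED 0x1u                               /* hypothetical */

struct h2c_state_model {
	enum h2_cs_model st0;
	unsigned int flags;
};

static void h2c_init_model(struct h2c_state_model *h2c, int is_frontend)
{
	h2c->st0 = CS_PREFACE;
	/* frontends always wait for the peer's SETTINGS */
	h2c->flags = is_frontend ? CF_SETTINGS_NEEDED : 0;
}

/* May a request be sent now? <needs_ext> is set for requests depending on a
 * server-advertised extension (e.g. extended CONNECT from RFC 8441). */
static int h2_may_send(struct h2c_state_model *h2c, int needs_ext)
{
	if (h2c->st0 >= CS_FRAME_H)
		return 1;                 /* peer SETTINGS already received */
	if (h2c->st0 < CS_SETTINGS1)
		return 0;                 /* our own SETTINGS not sent yet */
	if (needs_ext)
		h2c->flags |= CF_SETTINGS_NEEDED;
	return !(h2c->flags & CF_SETTINGS_NEEDED);
}
```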

By doing the first step, we're at least saving one needless tasklet
wakeup per connection (~9%), which results in ~5% backend connection
rate increase.

DOC: management: document ECH CLI commands

Document "show ssl ech", "add ssl ech", "set ssl ech" and "del ssl ech"

BUG/MINOR: ech: non destructive parsing in cli_find_ech_specific_ctx()

cli_find_ech_specific_ctx() parses the <frontend>/<bind_conf> and sets
a \0 in place of the '/'. But the original string is still used to emit
messages in the CLI, so we only output the frontend part.

This patch does the parsing in a trash buffer instead.
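A sketch of non-destructive parsing: the argument is copied into a caller-provided scratch buffer (a trash buffer in the actual patch) and only the copy is split, so the original string stays usable for error messages. The function name and its return convention are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Split "<frontend>/<bind_conf>" without modifying <arg>. <fe> and <bc>
 * point into <tmp>. Returns 0 on success, -1 if <arg> does not fit in
 * <tmp> or contains no '/'. */
static int split_fe_bind(const char *arg, char *tmp, size_t tmp_size,
                         const char **fe, const char **bc)
{
	size_t len = strlen(arg);
	char *slash;

	if (len + 1 > tmp_size)
		return -1;
	memcpy(tmp, arg, len + 1);
	slash = strchr(tmp, '/');
	if (!slash)
		return -1;
	*slash = '\0';            /* only the copy is cut */
	*fe = tmp;
	*bc = slash + 1;
	return 0;
}
```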

DOC: configuration: "ech" for bind lines

ECH is an experimental feature which is still a draft, but already exists
as a feature branch in OpenSSL.

This patch explains how to configure "ech" on bind lines.

CI: github: add USE_ECH=1 to haproxy for openssl-ech job

Add the USE_ECH=1 make option to the haproxy build in order to test the
build of the feature.

MINOR: ssl/ech: key management via stats socket

This patch extends the ECH support by adding runtime CLI commands to
view and modify ECH configurations.

New commands are added to the HAProxy CLI:
- "show ssl ech [<name>]" displays all ECH configurations or a specific
  one.
- "add ssl ech <name> <payload>" adds a new PEM-formatted ECH
  configuration.
- "set ssl ech <name> <payload>" replaces all existing ECH
  configurations.
- "del ssl ech <name> [<age-in-secs>]" removes ECH configurations,
  optionally filtered by age.

MINOR: listener: implement bind_conf_find_by_name()

Returns a pointer to the first bind_conf matching <name> in a frontend
<front>.

When name is prefixed by a @ (@<filename>:<linenum>), it tries to look
for the corresponding filename and line of the configuration file.

NULL is returned if no match is found.
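The lookup rule can be sketched over an array. The struct, the function name and the exact matching details (whole file name, last ':' as separator) are assumptions, not the signature of the real bind_conf_find_by_name(), which takes the frontend as argument.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct bind_conf_model {
	const char *name;    /* may be NULL when the bind line has no name */
	const char *file;    /* configuration file the line comes from */
	int line;
};

/* Return the first entry matching <name>, or NULL. "@<file>:<line>"
 * selects by configuration location instead of by name. */
static const struct bind_conf_model *
find_bind_conf(const struct bind_conf_model *tab, size_t n, const char *name)
{
	size_t i;

	if (name[0] == '@') {
		const char *colon = strrchr(name + 1, ':');
		size_t flen;
		char *end;
		long line;

		if (!colon)
			return NULL;
		flen = (size_t)(colon - (name + 1));
		line = strtol(colon + 1, &end, 10);
		if (end == colon + 1 || *end != '\0')
			return NULL;          /* missing or non-numeric line */
		for (i = 0; i < n; i++)
			if (tab[i].line == line && strlen(tab[i].file) == flen &&
			    strncmp(tab[i].file, name + 1, flen) == 0)
				return &tab[i];
		return NULL;
	}
	for (i = 0; i < n; i++)
		if (tab[i].name && strcmp(tab[i].name, name) == 0)
			return &tab[i];
	return NULL;
}
```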

MINOR: ssl/ech: add logging and sample fetches for ECH status and outer SNI

This patch adds functions to expose Encrypted Client Hello (ECH) status
and outer SNI information for logging and sample fetching.

Two new helper functions are introduced in ech.c:
- conn_get_ech_status() places the ECH processing status string into a
  buffer.
- conn_get_ech_outer_sni() retrieves the outer SNI value if ECH
  succeeded.

Two new sample fetch keywords are added:
- "ssl_fc_ech_status" returns the ECH status string.
- "ssl_fc_ech_outer_sni" returns the outer SNI value seen during ECH.

These allow ECH information to be used in HAProxy logs, ACLs, and
captures.

MEDIUM: ssl/ech: config and load keys

This patch introduces the USE_ECH option in the Makefile to enable
support for Encrypted Client Hello (ECH) with OpenSSL.

A new function, load_echkeys, is added to load ECH keys from a specified
directory. The SSL context initialization process in ssl_sock.c is
updated to load these keys if configured.

A new configuration directive, `ech`, is introduced to allow users to
specify the ECH key directory in the listener configuration.

BUG/MINOR: ssl: returns when SSL_CTX_new failed during init

In ssl_sock_initial_ctx(), return when SSL_CTX_new() fails instead of
trying to apply anything to the ctx. This may avoid crashing when
there's not enough memory anymore during configuration parsing.

Could be backported to every haproxy version.

REGTEST: quic: add ssl_reuse.vtc new QUIC test

Note that this test does not work with the OpenSSL 3.5.0 QUIC API because
the callback set by SSL_CTX_sess_set_new_cb() (ssl_sess_new_srv_cb()) is not
called (at least for QUIC clients).

The role of this new QUIC test is to run the same SSL/TCP test as
reg-tests/ssl/ssl_reuse.vtc but with QUIC connections where applicable (only with
TLSv1.3).

To do so, this QUIC test uses the "include" vtc command to run ssl/ssl_reuse.vtc.
It also sets the VTC_SOCK_TYPE environment variable with the "setenv" command and
"quic" as value. This will ask vtest2 to use QUIC sockets for all "fd@{...}"
addresses prefixed by the "${VTC_SOCK_TYPE}+" socket type if the VTC_SOCK_TYPE
value is "quic".

The SSL/TCP test ssl/ssl_reuse.vtc is modified to set this environment variable
with "setenv -ifunset" and "stream" as value, if it is not already set.

vtest2 must be used with this patch to support this new QUIC test:
https://github.com/vtest/VTest2/commit/9aa4d498dbd426adef8779c692a3e0865e55b9c2

Thanks to this latter patch, vtest2 retrieves the VTC_SOCK_TYPE environment variable
value, then it parses the vtc file to retrieve all the fd addresses prefixed by
"${VTC_SOCK_TYPE}+" and creates a QUIC socket or a TCP socket depending on this
variable value.

MEDIUM: h1: Immediately try to read data for frontend

In h1_init(), if we're a frontend connection, immediately attempt to
read data, if the connection is ready, instead of just subscribing.
There may already be data available, at least if we're using 0-RTT.

This may be backported up to 2.8 in a while, after 3.3 is released, so
that if it causes problems, we have a chance to hear about it.

BUG/MINOR: resolvers: Apply dns-accept-family setting on additional records

dns-accept-family setting was only evaluated for responses to A / AAAA DNS
queries. It was ignored when additional records in SRV responses were
parsed.

With this patch, when a SRV response is parsed, additional records not
matching the dns-accept-family setting are ignored, as expected.

This patch must be backported to 3.2.

MINOR: ssl: Do not dump decrypted privkeys in 'dump ssl cert'

A password-protected private key that was decrypted during init using
the password obtained via 'ssl-passphrase-cmd' should not be dumped by
the 'dump ssl cert' CLI command.

MEDIUM: ssl: Add local passphrase cache

Instead of calling the external password command for every loaded
encrypted certificate, we now keep a local password cache.
The passwords are not stored as plain text but obfuscated in the
cache. The obfuscation simply XORs each password with a random value
generated during init.
Once init is complete, the password cache is overwritten and freed so
that no leftover data remains from which the passwords could be dumped.
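The XOR obfuscation and the final wipe can be sketched as below. This is a minimal illustration, assuming a fixed-size mask filled with random bytes at init; `pw_mask`, `pw_obfuscate()` and `pw_wipe()` are hypothetical names, not HAProxy's actual API. XOR is its own inverse, so the same function both obfuscates and restores a password.

```c
#include <stddef.h>
#include <string.h>

#define PW_MASK_LEN 32

/* Hypothetical mask, assumed to be filled with random bytes during init. */
static unsigned char pw_mask[PW_MASK_LEN];

/* XOR <len> bytes of <buf> with the repeating mask. Applying it twice
 * restores the original content. */
static void pw_obfuscate(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= pw_mask[i % PW_MASK_LEN];
}

/* Overwrite a cache entry before releasing it. The volatile pointer makes
 * it less likely that the compiler drops these "useless" stores. */
static void pw_wipe(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len--)
        *p++ = 0;
}
```

This only protects against casually reading the passwords from memory (e.g. in a core dump); anyone who also gets the mask can recover them, hence the wipe once init is over.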

MEDIUM: ssl: Add certificate password callback that calls external command

When a certificate is protected by a password, the password can be
supplied through the pem_password_cb parameter of
PEM_read_bio_PrivateKey().
HAProxy will fetch the password automatically during init by calling a
user-defined external command that must write the password on its
standard output (see the new 'ssl-passphrase-cmd' global option).
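A callback of this kind can be sketched as follows. Its signature matches OpenSSL's `pem_password_cb` (`int cb(char *buf, int size, int rwflag, void *u)`), which must return the passphrase length or -1 on error. Passing the command line as the user data pointer and running it through `popen()` are simplifying assumptions for this sketch; the actual implementation may spawn the command differently (e.g. fork/exec), and `passphrase_cmd_cb()` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Same signature as OpenSSL's pem_password_cb. <u> is assumed to point to
 * the command line to run; its first output line is the passphrase. */
static int passphrase_cmd_cb(char *buf, int size, int rwflag, void *u)
{
    const char *cmd = u;
    FILE *out;
    size_t len;

    (void)rwflag;                 /* only relevant when encrypting */
    if (!cmd || size < 2)
        return -1;

    out = popen(cmd, "r");
    if (!out)
        return -1;

    if (!fgets(buf, size, out)) { /* no output at all */
        pclose(out);
        return -1;
    }
    if (pclose(out) != 0) {       /* command failed: discard its output */
        memset(buf, 0, (size_t)size);
        return -1;
    }

    len = strcspn(buf, "\r\n");   /* strip the trailing newline */
    buf[len] = '\0';
    return (int)len;
}
```

With OpenSSL, such a callback would be passed as `PEM_read_bio_PrivateKey(bio, NULL, passphrase_cmd_cb, (void *)cmd)`.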

MINOR: init: Use devnullfd in stdio_quiet calls instead of recreating a fd every time

Since commit "65760d MINOR: init: Make devnullfd global and create it
earlier in init", the devnullfd file descriptor pointing to /dev/null
is created regardless of the process's parameters, so we can use it in
all 'stdio_quiet' calls instead of recreating an FD.

MINOR: init: Make devnullfd global and create it earlier in init

The devnull fd might be needed during configuration parsing, for
instance if some options require a fork/exec. So we now create it much
earlier in the init process, regardless of the '-q' or '-d'
parameters.

BUG/MINOR: init: Do not close previously created fd in stdio_quiet

During init we were calling 'stdio_quiet' and passing the previously
created 'devnullfd' file descriptor. But 'stdio_quiet' then closed that
file descriptor as well, which raised an error (EBADF).
By no longer closing FDs that were opened outside of the 'stdio_quiet'
function, we let the caller manage its FD and avoid double close calls.
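The ownership rule can be sketched with a simplified helper. `stdio_quiet_sketch()` is a hypothetical name, and its interface (a negative value meaning "open /dev/null yourself") is an assumption for illustration. The point is that it only closes a descriptor it opened itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Redirect stdin, stdout and stderr to /dev/null.
 * If <fd> is negative, a temporary descriptor is opened and closed here.
 * Otherwise <fd> belongs to the caller, which remains responsible for it. */
static int stdio_quiet_sketch(int fd)
{
    int own = 0;
    int ret = 0;

    if (fd < 0) {
        fd = open("/dev/null", O_RDWR);
        if (fd < 0)
            return -1;
        own = 1;
    }

    for (int i = 0; i <= 2; i++) {
        if (dup2(fd, i) < 0)
            ret = -1;
    }

    /* never close the caller's fd, nor one that landed on 0, 1 or 2 */
    if (own && fd > 2)
        close(fd);
    return ret;
}
```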

This patch can be backported to all stable branches.

MINOR: http: fix 405,431,501 default errorfile

A few typos were present in the default errorfiles for the status codes
above (missing dot at the end of the sentence, extra closing bracket).
This fixes them. This can be backported.

CI: disable fail-fast on fedora rawhide builds

Previously the builds were coupled: if one failed, the others were
stopped. Since these builds are independent by nature, let's not fail
them all together.

MINOR: ssl-sample: add ssl_fc_early_rcvd() to detect use of early data

We currently have ssl_fc_has_early() which says that early data are still
unconfirmed by a final handshake, but nothing to see if a client has been
able to use early data at all, which is a problem because such mechanisms
generally depend on multiple factors and it's hard to know when they start
to work. This new sample fetch function will indicate that some early data
were seen over that front connection, i.e. this can be used to confirm
that at some point the client was able to push some. This is essentially
a debugging tool with no other practical use case.

DOC: config: slightly clarify the ssl_fc_has_early() behavior

Clarify that it's about handshake *completion*, and also mention that
the action to be used to wait for the handshake is "wait-for-handshake",
which was not mentioned.

This can be backported though it's very minor.

DOC: config: fix confusing typo about ACL -m ("now" vs "not")

A one-letter typo in the doc update coming with commit 6ea50ba462 ("MINOR:
acl; Warn when matching method based on a suffix is overwritten") inverts
the meaning of the sentence. It was "is not allowed" and not
"is now allowed". Needs to be backported only if the commit above ever is
(unlikely).

BUG/MINOR: acl: warn if "_sub" derivative used with an explicit match

A new warning was recently introduced for when an ACL derivative match
method is overridden by another '-m' method. It was implemented by the
following patch :

6ea50ba462692d6dcf301081f23cab3e0f6086e4
MINOR: acl; Warn when matching method based on a suffix is overwritten

However, this warning was not reported when "_sub" suffix was specified.
Fix this by adding PAT_MATCH_SUB in the warning comparison.

No backport needed unless the above commit is backported.

BUG/MINOR: ssl: Remove unreachable code in CLI function

Remove unreachable code in 'cli_parse_show_jwt' function.

This bug was raised in GitHub #3159.
This patch does not need to be backported.

BUG/MEDIUM: ssl: Crash because of dangling ckch_store reference in a ckch instance

When updating CAs via the CLI, we need to create new copies of all the
impacted ckch instances (i.e. those referenced in the ckch_inst_link list
of the updated CA) in order to use them instead of the old ones once the
update is completed. This relies on the ckch_inst_rebuild function, which
sets the ckch_store field of the ckch_inst. But we forgot to also
add the newly created instances to the ckch_inst list of the
corresponding ckch_store.

When updating a certificate afterwards, we iterate over all the
instances linked in the ckch_inst list of the ckch_store (which is
missing some instances because of the previous command) and rebuild the
instances before replacing the ckch_store. The previous ckch_store,
still referenced by the dangling ckch instance, then gets deleted, which
means that the instance keeps a reference to a freed object.

Then if we were to once again update the CA file, we would iterate over
the ckch instances referenced in the cafile_entry's ckch_inst_link list,
which includes the first mentioned ckch instance with the dead
ckch_store reference. This ends up crashing during the ckch_inst_rebuild
operation.

This bug was raised in GitHub #3165.
This patch should be backported to all stable branches.

BUG/MEDIUM: cli: do not return ACKs one char at a time

Since 3.0 where the CLI started to use rcv_buf, it appears that some
external tools sending chained commands are randomly experiencing
failures. Each time, this happens when the whole command is sent as a
single packet, immediately followed by a close. This is not a correct
way to use the CLI but this has been working for ages for simple
netcat-based scripts, so we should at least try to preserve this.

The cause of the failure is that the first LF that acks a command is
immediately sent back to the client and rejected due to the closed
connection. This in turn forwards the error back to the applet which
aborts its processing.

Before 3.0 the responses would be queued into the buffer, then sent
back to the channel, and would all fail at once. This changed when
snd_buf/rcv_buf were implemented because the applets are much more
responsive and since they yield between each command, they can
deliver one ACK at a time that is immediately forwarded down the
chain.

An easy way to observe the problem is to send 5 map updates, a shutdown,
and immediately close via tcploop, and in parallel run a periodic
"show map" to count the number of elements:

  $ tcploop -U /tmp/sock1 C S:"add map #0 1 1; add map #0 2 2; add map #0 3 3; add map #0 4 4; add map #0 5 5\n" F K

Before 3.0, there would always be 5 elements. Since 3.0 and before
20ec1de214 ("MAJOR: cli: Refacor parsing and execution of pipelined
commands"), almost always 2. And since that commit above in 3.2, almost
always one. Doing the same using socat or netcat shows almost always 5...
It's entirely timing-dependent, and might even vary based on the RTT
between the client and haproxy!

The approach taken here consists in applying the same principle as
MSG_MORE or Nagle's algorithm, but to the response buffer: the applet
doesn't need to send a single ACK for each command when it has already
been woken up and is scheduled to come back to work. It's fine (and even
desirable) that ACKs are grouped in a single packet as much as possible.

For this reason, this patch implements APPCTX_CLI_ST1_YIELD, a new CLI
flag which indicates that the applet left in a yielding state, i.e.
it has not finished its work. This flag is used by .rcv_buf to hold
pending data. This way we won't return partial responses for no reason,
and we can continue to emulate the previous behavior.
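The hold-while-yielding idea can be modeled in a few lines. Everything here is hypothetical (`struct cli_sketch`, its `yielding` field standing in for APPCTX_CLI_ST1_YIELD, the fixed-size output buffer); it is not the real applet code. Flushing anyway once the buffer is full is an extra assumption of this sketch, added so that a full buffer cannot stall the transfer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model: the applet appends ACKs to <out>, and
 * <yielding> mirrors a flag like APPCTX_CLI_ST1_YIELD. */
struct cli_sketch {
    char out[256];
    size_t len;
    bool yielding;
};

/* Queue one ACK produced by a command. Returns false if it doesn't fit. */
static bool cli_sketch_ack(struct cli_sketch *c, const char *ack)
{
    size_t n = strlen(ack);

    if (c->len + n > sizeof(c->out))
        return false;
    memcpy(c->out + c->len, ack, n);
    c->len += n;
    return true;
}

/* Transfer pending output to <dst>. While the applet is yielding (it will
 * come back soon with more ACKs), hold the data so that ACKs are grouped.
 * Assumption: a full buffer is flushed anyway to avoid stalling. */
static size_t cli_sketch_rcv_buf(struct cli_sketch *c, char *dst, size_t room)
{
    size_t n;

    if (c->yielding && c->len < sizeof(c->out))
        return 0;
    n = c->len < room ? c->len : room;
    memcpy(dst, c->out, n);
    memmove(c->out, c->out + n, c->len - n);
    c->len -= n;
    return n;
}
```

Two ACKs queued while yielding are then delivered together in a single transfer, instead of one per command.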

One very nice benefit of this is that it saves huge amounts of CPU on
the client. In the test below, which updates 1M map entries, the
CPU used by socat went from 100% to 0% and the total transfer time
dropped by 28%:

  before:
    $ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
         printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null

    real    0m2.407s
    user    0m1.485s
    sys     0m1.682s

  after:
    $ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
         printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null

    real    0m1.721s
    user    0m0.952s
    sys     0m0.057s

The difference is also quite visible on the number of syscalls during
the test (for 1k updates):

  before:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.071691           0    100001           sendmsg

  after:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000011           1         9           sendmsg

This patch will need to be backported to 3.0, and depends on these two
patches to be backported as well:

    MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
    MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()

MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()

This is in preparation for a future fix. For now it's simply a pure
copy of the original function, but dedicated to the CLI. It will
have to be backported to 3.0.

MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty

appctx_rcv_buf() prepares all the work to schedule the transfers between
the applet and the channel, and it takes care of setting the various flags
that indicate what condition is blocking the transfer from progressing.

There is one limitation though. In case an applet refrains from sending
data (e.g. because it is rate-limited or prefers to aggregate blocks), it
will leave a possibly empty channel buffer, and keep some data in its outbuf.
data in its outbuf will be seen by the function above as an indication
of a channel full condition, so it will place SE_FL_WANT_ROOM. But later,
sc_applet_recv() will see this flag with a possibly empty channel, and
will rightfully trigger a BUG_ON().

appctx_rcv_buf() should be more accurate in fact. It should only set
SE_FL_RCV_MORE when more data are present in the applet, then it should
either set or clear SE_FL_WANT_ROOM depending on whether the channel is
empty or not.
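The rule above can be expressed as a small pure function. The flag names and values (`SK_FL_*`) and `sketch_rcv_flags()` are hypothetical stand-ins for SE_FL_RCV_MORE and SE_FL_WANT_ROOM, used only to show the decision logic.

```c
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define SK_FL_RCV_MORE   0x1u  /* applet still holds data to deliver  */
#define SK_FL_WANT_ROOM  0x2u  /* blocked because the channel has data */

/* Compute the blocking flags after a transfer attempt:
 * RCV_MORE reflects data left in the applet, and WANT_ROOM is only set
 * when that data is blocked by a non-empty channel. An empty channel
 * never gets WANT_ROOM, even if the applet kept data for itself. */
static unsigned int sketch_rcv_flags(unsigned int flags,
                                     size_t applet_left, size_t chn_data)
{
    flags &= ~(SK_FL_RCV_MORE | SK_FL_WANT_ROOM);
    if (applet_left) {
        flags |= SK_FL_RCV_MORE;
        if (chn_data)
            flags |= SK_FL_WANT_ROOM;
    }
    return flags;
}
```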

Right now it doesn't seem possible to trigger this condition in the
current state of applets, but this will become possible with a future
bugfix that will have to be backported, so this patch will need to be
backported to 3.0.

MEDIUM: quic: Fix build with openssl-compat

As the QUIC options have been split into backend and frontend, there is
no more GTUNE_QUIC_LISTEN_OFF to be found in global.tune.options, look
for QUIC_TUNE_FE_LISTEN_OFF in quic_tune.fe instead.
This should fix the build with USE_QUIC and USE_QUIC_OPENSSL_COMPAT.

BUG/MEDIUM: mt_list: Use atomic operations to prevent compiler optims

As a follow-up to f40f5401b9f24becc6fdd2e77d4f4578bbecae7f, explicitly
use atomic operations to set the prev and next fields, to make sure the
compiler cannot make any assumption about them and actually performs
the stores.

This should be backported after f40f5401b9 up to 2.8.

BUILD: openssl-compat: fix build failure with OPENSSL=0 and KTLS=1

The USE_KTLS test is currently being done outside of the USE_OPENSSL
guard so disabling USE_OPENSSL still results in build failures on
libcs built with support for kernels before 4.17, because we enable
KTLS by default on linux. Let's move the KTLS block inside the
USE_OPENSSL guard instead.

No backport is needed since KTLS is only in 3.3.

BUG/MINOR: stick-tables: properly index string-type keys

This is one of the rare pleasant surprises of fixing an almost
16-year-old bug that remained unnoticed since the feature was implemented. In
1.4-dev7, commit 3bd697e071 ("[MEDIUM] Add stick table (persistence)
management functions and types") introduced stick-tables with multiple
key types, including strings, IP addresses and integers. Entries are
coded in binary and their binary representation is indexed. A special
case was made for strings in order to index them as zero-terminated
strings. However, there's one subtlety. While strings indeed have a
zero appended, they're still indexed using ebmb_insert(), which means
that all the bytes till the configured size are indexed as well. And
while these bytes generally come from a temporary storage that often
contains zeroes, or that is longer than the configured string length
and will result in truncation, it's not always the case and certain
traffic patterns with certain configurations manage to occasionally
present unpadded strings resulting in apparent duplicate keys appearing
in the dump, as shown in GH issue #3161. It seems to be essentially
reproducible at boot, and not to be particularly affected by mixed
patterns. These keys are in fact not exact duplicates in memory, but
everywhere they're used (including during synchronization), they are
equal.

What's interesting is that when this happens, one key can be presented
to a peer with its own data and will be indexed as the only one, possibly
replacing contents from the previous key, which might replace them again
later once updated in turn. This is visible in the dump of the issue
above, where key "localhost:8001" was split into two entries, one with a
request count of one and the other with a request count of 499999, and
indeed, all peers see only that last value, which overwrote the first
one.
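The effect can be reproduced in isolation. In this sketch, `memcmp()` over a fixed key size stands in for the tree's byte-wise comparison, and `KEY_SIZE`, `keys_equal()` and `key_set_str()` are hypothetical names. Zero-padding the slot is shown as one possible way to make string keys compare consistently; the actual patch may address it differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define KEY_SIZE 16  /* hypothetical configured string key length */

/* Compare the whole KEY_SIZE bytes, as a binary (memory-based) index
 * does: bytes past the terminating zero matter. */
static bool keys_equal(const char *a, const char *b)
{
    return memcmp(a, b, KEY_SIZE) == 0;
}

/* Copy a string into a key slot and zero-pad the remainder, so that two
 * keys holding the same string are always byte-identical. */
static void key_set_str(char *slot, const char *str)
{
    size_t len = strnlen(str, KEY_SIZE - 1);

    memcpy(slot, str, len);
    memset(slot + len, 0, KEY_SIZE - len);
}
```

Two slots filled with `strcpy()` over different leftover bytes hold the same C string yet compare as different keys, which matches the apparent duplicates described above.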

This fix must be backported to all stable branches. Special kudos to
Mark Wort for underlining that one.

BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency (2nd attempt)

This is a second attempt at fixing issues on 32bits systems which would
trigger the following BUG_ON() statement:

FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825 shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this

This is a drop-in replacement for d30b88a6c + 4693ee0ff, as suggested by
Willy.

Indeed, on supported platforms unsigned int can be assumed to be 4 bytes
long, and long can be assumed to be 8 bytes long. As such, the previous
attempt was overkill and added unnecessary maintenance complexity which
could result in bugs if not used properly. Moreover, it would only
partially solve the issue, since on little endian vs big endian
architectures, the provisioned memory areas (originating from the same
shm stats file) could be read differently by the host.

Instead we fix the alignment issues, and this alone helps to ensure
struct memory consistency on 64 vs 32bits platforms. It was tested
on both i386 and i586.

last_change and last_sess counters are now stored as unsigned int, as
it helped to fix the alignment issues and they were found to be used
as 32bits integers anyway.
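As a generic illustration (hypothetical structs, not the actual shm_stats_file_object), implicit padding makes a layout platform-dependent: on i386 a 64-bit integer is only 4-byte aligned inside a struct, so the first struct below is 12 bytes there but 16 bytes on x86_64. Explicit padding pins the layout, and a compile-time size check catches regressions.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of <b> is 8 on x86_64 but 4 on i386: size 16 vs 12 bytes. */
struct sample_implicit {
    uint32_t a;
    uint64_t b;
};

/* Explicit padding: same layout (16 bytes, <b> at offset 8) on both. */
struct sample_explicit {
    uint32_t a;
    uint32_t pad;   /* explicit padding keeps <b> 8-byte aligned */
    uint64_t b;
};

_Static_assert(sizeof(struct sample_explicit) == 16,
               "sample_explicit layout must not depend on the platform");
```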

Thanks to Willy for problem analysis and the patch proposal.

No backport needed.

Revert "MINOR: compiler: add FIXED_SIZE(size, type, name) macro"

This reverts commit 466a603b59ed77e9787398ecf1baf77c46ae57b1.
Due to the last 2 commits, this macro is now unused, and will probably
never be used, so let's get rid of that for now.

Revert "MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct"

This reverts commit 4693ee0ff7a5fa4a12ff69b1a33adca142e781ac.
As discussed in GH #3168, this works but it is not the proper way to fix
the issue. See following commits.

Revert "BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency"

This reverts commit d30b88a6cc47d662e92b524ad5818be312401d0e.
As discussed in GH #3168, this works but it is not the proper way to fix
the issue. See following commits.

BUG/MEDIUM: applet: Improve again spinning loops detection with the new API

A first attempt to fix this issue was already pushed (54b7539d6 "BUG/MEDIUM:
apppet: Improve spinning loop detection with the new API"). But it was not
fully accurate. Indeed, we must check if something was received or sent by
the applet before incrementing the call rate. But we must also make sure the
applet is allowed to receive or send data. That is what is performed in this
patch.

This patch must be backported as far as 3.0 with the patch above.

BUG/MINOR: quic: rename and duplicate stream settings

Several settings can be set to control stream multiplexing and
associated receive window. Previously, all of these settings were
configured using prefix "tune.quic.frontend.", despite being applied
blindly on both sides.

Fix this by duplicating these settings into frontend- and backend-specific
ones. Options are also renamed to use the standardized
"tune.quic.[be|fe].stream." prefix.

Also, each option is individually renamed to better reflect its purpose
and hide technical details related to QUIC transport parameter naming :
* max-data-size -> stream.rxbuf
* max-streams-bidi -> stream.max-concurrent
* stream-data-ratio -> stream.data-ratio

No need to backport.

BUG/MINOR: quic: split option for congestion max window size

BUG/MINOR: quic: split max-idle-timeout option for FE/BE usage

Streamline max-idle-timeout option. Rename it to use the newer cohesive
naming scheme 'tune.quic.fe|be.'.

Two different fields were already defined in global struct. These fields
are moved into quic_tune along with other QUIC settings. However, no
parser was defined for the backend option; this commit fixes that.

No need to backport this.

MINOR: quic: rename frontend sock-per-conn setting

On frontend side, a quic_conn can have a dedicated FD or use the
listener one. These different modes can be activated via a global QUIC
tune setting.

This patch adjusts the option. First, it is renamed to the more
meaningful name 'tune.quic.fe.sock-per-conn'. Also, arguments are now
either 'default-on' or 'force-off'. The objective is to better highlight
the relationship with the 'quic-socket' bind option.

The older option is deprecated and will be removed in 3.5.

MINOR: quic: rename retry-threshold setting

A QUIC global tune setting is defined to be able to force Retry emission
prior to handshake. By definition, this ability is only supported by
QUIC servers, hence it is a frontend option only.

Rename the option to use "fe" prefix. The old option name is deprecated
and will be removed in 3.5.

MINOR: quic: rename max Tx mem setting

QUIC global memory can be limited across the entire process via a global
tune setting. Previously, this setting used a misleading "frontend"
prefix. As this is applied as a sum between all QUIC connections, both
from frontend and backend sides, remove the prefix. The new option name
is "tune.quic.mem.tx-max".

The older option name is deprecated and will be removed in 3.5.

MINOR: quic: split Tx options for FE/BE usage

This patch is similar to the previous one, except that it is focused on
Tx QUIC settings. It is now possible to toggle GSO and pacing on
frontend and backend sides independently.

As with the previous patch, options are renamed to use unified "fe/be"
prefixes. This is part of the current series of commits which unify QUIC
settings. Older options are deprecated and will be removed in the 3.5
release.

MINOR: quic: split congestion controller options for FE/BE usage

Various settings related to the QUIC congestion controller can be configured.
This patch duplicates them to be able to set independent values on
frontend and backend sides.

As with the previous patch, options are renamed to use unified "fe/be"
prefixes. This is part of the current series of commits which unify QUIC
settings. Older options are deprecated and will be removed in the 3.5
release.

MINOR: quic: duplicate glitches FE option on BE side

Previously, QUIC glitches support was only implemented for frontend
side. Extend this so that the option can be specified separately both on
frontend and backend sides. Function _qcc_report_glitch() now retrieves
the relevant max value based on connection side.

In addition to this, the option has been renamed to use "fe/be" prefixes.
This is part of the current series of commits which unify QUIC settings.
Older options are deprecated and will be removed in the 3.5 release.

MINOR: quic: rename "no-quic" to "tune.quic.listen"

Rename the option to quickly enable/disable all QUIC listeners. It now
takes an on/off argument. The documentation is extended to reflect the
fact that QUIC backends are not impacted by this option.

The older keyword is simply removed. Deprecation is considered
unnecessary as this setting is only useful during debugging.

MINOR: quic: prepare support for options on FE/BE side

A major reorganization of QUIC settings is going to be performed. One of
its objectives is to clearly define options which can be separately
configured on frontend and backend proxy sides.

To implement this, quic_tune structure is extended to support fe and be
options. A set of macros/functions is also defined : it allows to
retrieve an option defined on both sides with unified code, based on
proxy side of a quic_conn/connection instance.
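The fe/be split with a side-based accessor could look like the sketch below. The structure layout, the `is_back` flag, the `QUIC_TUNE_GET()` macro and the numeric values are all hypothetical and arbitrary; they only illustrate how one code path can read the value matching the connection's side.

```c
#include <stdbool.h>

/* Hypothetical per-side settings. */
struct quic_tune_side {
    unsigned int max_idle_timeout;       /* ms */
    unsigned int stream_max_concurrent;
};

struct quic_tune_sketch {
    struct quic_tune_side fe;
    struct quic_tune_side be;
};

/* Hypothetical connection carrying its proxy side. */
struct qc_sketch {
    bool is_back;
};

/* Arbitrary example values. */
static struct quic_tune_sketch tune = {
    .fe = { .max_idle_timeout = 30000, .stream_max_concurrent = 100 },
    .be = { .max_idle_timeout = 30000, .stream_max_concurrent = 50 },
};

/* Pick the fe or be copy of <field> from the connection side, so callers
 * use the same code whatever the side. */
#define QUIC_TUNE_GET(qc, field) \
    ((qc)->is_back ? tune.be.field : tune.fe.field)
```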

MINOR: quic: remove unused conn-tx-buffers limit keyword

Remove parsing code for tune.quic.frontend.conn-tx-buffers.limit. This
option was deprecated for some time and in fact was noop and not
mentioned anymore in the documentation.

BUG/MEDIUM: mt_lists: Avoid el->prev = el->next = el

Avoid setting both el->prev and el->next on the same line.
The goal is to set both el->prev and el->next to el, but a naive
compiler, such as when we're using -O0, will set el->next first, then
will set el->prev to the value of el->next, but if we're unlucky,
el->next will have been set to something else by another thread.
So explicitly set both to what we want.
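The pattern can be sketched with GCC/Clang `__atomic` builtins. `struct mt_elem` and `mt_elem_init()` are hypothetical names shaped like an mt_list entry, and the relaxed memory order is an assumption of this sketch. Each link gets its own store of `el`, so nothing is ever re-read from `el->next`.

```c
/* Minimal doubly-linked element, shaped like an mt_list entry. */
struct mt_elem {
    struct mt_elem *next;
    struct mt_elem *prev;
};

/* Point both links at the element itself with two independent atomic
 * stores. The chained form "el->prev = el->next = el;" may be compiled
 * as "store el->next, then reload el->next into el->prev", picking up a
 * value that another thread wrote in between. */
static inline void mt_elem_init(struct mt_elem *el)
{
    __atomic_store_n(&el->next, el, __ATOMIC_RELAXED);
    __atomic_store_n(&el->prev, el, __ATOMIC_RELAXED);
}
```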

This should be backported up to 2.8.

MINOR: acme: display the complete challenge_ready command in the logs

When using a wildcard DNS domain in the ACME configuration, for example
*.example.com, one might think that it needs to use the challenge_ready
command with this domain. But that's not the case, the challenge_ready
command takes the domain asked by the ACME server, which is stripped of
the wildcard.

To make this clearer, the log message now shows the exact command the
user should send.

MINOR: acme: add the dns-01-record field to the sink

The dns-01-record field in the dpapi sink outputs the authentication
token which is needed in the TXT record in order to validate the DNS-01
challenge.

BUG/MEDIUM: stick-tables: Don't loop if there's nothing left

Before waking up the expiration task again at the end of it, make sure
the next date is set. If there's nothing left to do, then task_exp will
be TASK_ETERNITY and we then don't want to be woken up again.

BUG/MEDIUM: build: limit excessive and counter-productive gcc-15 vectorization

In https://bugs.gentoo.org/964719, Dan Goodliffe reported that using
CFLAGS="-O3 -march=westmere" creates a binary that segfaults on startup
with gcc-15. This could be reproduced here, is isolated to gcc-15 and
-O3, and is caused by gcc emitting "movdqa" instructions to read unaligned
longs taken from chars that were carefully isolated within ifdefs checking
for support for unaligned integers on the platform...

Some experiments showed that changing all casts all over the code to use
either a typedef-enforced align(1) or the packed union trick does the
job, but it needs a more in-depth validation since it clearly doesn't
produce the same code at all (at least on more modern machines).
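For context, reading through a packed type tells the compiler that a pointer may be unaligned, so it must not emit alignment-requiring loads for it. The sketch below uses a packed struct (a close variant of the packed union trick, a GCC/Clang extension) next to the portable `memcpy()` approach. `read_u64_una()` and `read_u64_memcpy()` are hypothetical helpers, and neither is claimed to be what HAProxy finally adopted.

```c
#include <stdint.h>
#include <string.h>

/* Alignment 1 via "packed"; "may_alias" allows reading other objects
 * (e.g. char buffers) through this type. */
struct una_u64 {
    uint64_t v;
} __attribute__((packed, may_alias));

static inline uint64_t read_u64_una(const void *p)
{
    return ((const struct una_u64 *)p)->v;
}

/* Portable alternative: compilers turn this into a plain unaligned load
 * where the target supports it. */
static inline uint64_t read_u64_memcpy(const void *p)
{
    uint64_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```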

However, the offending optimization option could be isolated, it's
"-fvect-cost-model=dynamic" which causes this, while -O2 uses
"-fvect-cost-model=very-cheap". Turning it back to very-cheap solves the
issue, reduces the code, and yields an extra 5% performance increase on
the http-request rate (181k vs 172k on a single core)! This could at
least partially explain why it has been observed several times over
the last few years that -O3 yields bigger and slower code than -O2.

It was also verified that the option doesn't change the emitted code
at -O0..-O2,-Os,-Oz, but only at -O3.

This patch detects the presence of this option and turns it on to
address the problem that some distros are facing after an upgrade to
gcc-15. As such it should be backported to recent LTS and stable
branches. Here, 3.1 was used, so it seems legit to at least target
the last two LTS branches (i.e. go as far as 3.0).

Thanks to Dan Goodliffe for sharing a working reproducer, Sam James
for starting the investigations and Christian Ruppert for bringing
the issue to us.

BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency

As reported by @tianon on GH #3168, running haproxy on 32bits i386
platform would trigger the following BUG_ON() statement:

FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825
shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this

In fact, some efforts were already taken to ensure shm_stats_file_object
struct size remains consistent on 64 vs 32 bits platforms, since
shm_stats_file_object is part of the public API and directly exposed in
the stats file.

However, some parts were overlooked: some structs that are embedded in
shm_stats_file_object struct itself weren't using fixed-width integers,
and would sometimes be unaligned. The result of this is that it was
up to the compiler (platform-dependent) to choose how to deal with such
ambiguities, which could cause the struct mapping/size to be inconsistent
from one platform to another.

Fortunately this was caught by the BUG_ON() statement, with the precious
help of @tianon.

To fix this, we now use fixed-width integers everywhere for members
(and submembers) of shm_stats_file_object struct, and we use explicit
padding where missing to avoid automatic padding when we don't expect
one. As for the previous commit, we leverage FIXED_SIZE() and
FIXED_SIZE_ARRAY() macros to set the expected width for each integer
without causing build issues on platforms that don't support larger
integers.

No backport needed, this feature was introduced during 3.3-dev.

MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct

freq-ctr struct is used by the shm_stats_file API, and more precisely,
it is used in the shm_stats_file_object struct for counters.

shm_stats_file_object struct must be platform-independent, thus
we switch to using explicit size types (AKA fixed width integer types)
for freq-ctr, in the attempt to make freq-ctr size and memory mapping
consistent from one platform to another.

We cannot simply use fixed-width integer because some of them are
involved in atomic operations, and forcing a given width could
cause build issues on some platforms where atomic ops are not
implemented for large integers. Instead we leverage the FIXED_SIZE
macro to keep handling the integers as before, but forcing them to
be stored using the expected number of bytes (unused bytes will simply
be ignored).

No change of behavior should be expected.

MINOR: compiler: add FIXED_SIZE(size, type, name) macro

FIXED_SIZE() macro can be used to instruct the compiler that the struct
member named <name>, handled as <type>, must be stored using <size> bytes,
even if the type used is actually smaller than the expected size.

FIXED_SIZE_ARRAY(), similar to FIXED_SIZE() but for arrays: it takes an
extra argument which is the number of members.

They may be used for portability concerns to ensure a structure mapping
remains consistent between platforms.

MINOR: stats-file: fix typo in shm-stats-file object struct size detection

As reported by @TimWolla on GH #3168, there was a typo in shm stats file
BUG_ON to report that the size of shm_stats_file_object changed.

No backport needed.

MINOR: quic: remove received CRYPTO temporary tree storage

The previous commit switched from ncbuf to ncbmbuf as storage for received
CRYPTO frames. The latter ensures that buffering of such frames cannot
fail anymore due to gap size.

Previously, extra mechanisms were implemented in the QUIC frames parsing
function to overcome the limitation of ncbuf on gaps size. Before
insertion, CRYPTO frames were stored in a temporary tree to order their
insertion. As this is not necessary anymore, this commit removes the
temporary tree insertion.

This commit is closely associated to the previous bug fix. As it
provides a neat optimization and code simplification, it can be backported
with it, but not in the next immediate release to spot potential
regression.

BUG/MAJOR: quic: use ncbmbuf for CRYPTO handling

In QUIC, TLS handshake messages such as ClientHello are encapsulated in
CRYPTO frames. Each QUIC implementation can split the content in several
frames of random sizes. In fact, this feature is now used by several
clients, based on Chrome's so-called "Chaos protection" mechanism :

https://quiche.googlesource.com/quiche/+/cb6b51054274cb2c939264faf34a1776e0a5bab7

To support this, haproxy uses a ncbuf storage to store received CRYPTO
frames before passing them to the SSL library. However, this storage
suffers from a limitation as gaps between two filled blocks cannot be
smaller than 8 bytes. Thus, depending on the size of received CRYPTO
frames and their order, ncbuf may not be sufficient. Over time, several
mechanisms were implemented in haproxy QUIC frames parsing to overcome
the ncbuf limitation.

However, recent reports highlighted that with some clients haproxy is
not able to deal with CRYPTO frames reception. In particular, this is
the case with the latest ngtcp2 release, which implements a similar
chaos protection mechanism via the following patch. It also seems that
this impacts haproxy interaction with firefox.

commit 89c29fd8611d5e6d2f6b1f475c5e3494c376028c
Author: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com>
Date: Mon Aug 4 22:48:06 2025 +0900

Crumble Client Initial CRYPTO (aka chaos protection)

To fix haproxy CRYPTO frames buffering once and for all, an alternative
non-contiguous buffer named ncbmbuf has been recently implemented. This
type does not suffer from gaps size limitation, albeit at the cost of a
small reduction in the size available for data storage.

Thus, the purpose of this current patch is to replace ncbuf with the
newer ncbmbuf for QUIC CRYPTO frames parsing. Now, ncbmb_add() is used
to buffer received frames, which is guaranteed to succeed. The only
remaining case of error is if a received frame offset and length exceed
the ncbmbuf data storage, which would result in a CRYPTO_BUFFER_EXCEEDED
error code.
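The bitmap principle that removes the gap size limitation can be sketched as follows: one bit per data byte records whether that byte was received, so any gap, even a single byte, is representable. This is a simplified, hypothetical model (`struct ncbm_sketch`, `ncbm_add()`, `ncbm_data()`); notably it keeps the bitmap in a separate array, whereas the text above implies the real ncbmbuf carves it out of the buffer's own storage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NCBM_SIZE 64  /* hypothetical data area size */

struct ncbm_sketch {
    unsigned char data[NCBM_SIZE];
    uint8_t bitmap[NCBM_SIZE / 8];   /* bit set = byte received */
};

static void ncbm_set(struct ncbm_sketch *b, size_t i)
{
    b->bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

static bool ncbm_isset(const struct ncbm_sketch *b, size_t i)
{
    return b->bitmap[i / 8] & (1u << (i % 8));
}

/* Store <len> bytes at <off>. Overlapping content simply overwrites the
 * previous bytes. The only failure is going past the storage area. */
static int ncbm_add(struct ncbm_sketch *b, size_t off,
                    const void *src, size_t len)
{
    if (off > NCBM_SIZE || len > NCBM_SIZE - off)
        return -1;   /* would map to CRYPTO_BUFFER_EXCEEDED */
    memcpy(b->data + off, src, len);
    for (size_t i = off; i < off + len; i++)
        ncbm_set(b, i);
    return 0;
}

/* Number of contiguous received bytes from <off> until the next gap. */
static size_t ncbm_data(const struct ncbm_sketch *b, size_t off)
{
    size_t n = 0;

    while (off + n < NCBM_SIZE && ncbm_isset(b, off + n))
        n++;
    return n;
}
```

A 1-byte gap left between two fragments is tracked exactly, and filling it makes the whole range readable at once.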

A notable behavior change when switching to ncbmbuf implementation is
that NCB_ADD_COMPARE mode cannot be used anymore during add. Instead,
crypto frame content received at a similar offset will be overwritten.

A final note regarding STREAM frames parsing. For now, it is considered
unnecessary to switch from ncbuf in this case. Indeed, QUIC clients do
not perform aggressive fragmentation for them. Keeping ncbuf ensures that
the data storage size is bigger than the equivalent ncbmbuf area.

This should fix github issue #3141.

This patch must be backported up to 2.6. The relevant commits for the
ncbmbuf implementation must be picked first.

MINOR: ncbmbuf: add tests as standalone mode

Write some tests for the ncbmbuf type. These tests should be run each
time the ncbmbuf implementation is adjusted. Use the following command :

$ gcc -g -DSTANDALONE -I./include -o ncbmbuf src/ncbmbuf.c && ./ncbmbuf
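The command above relies on a self-test `main()` compiled only when `STANDALONE` is defined. A minimal sketch of that pattern follows; the function under test, `count_set_bits()`, is hypothetical and merely stands in for ncbmbuf internals.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper standing in for the code under test: count the
 * bits set in a bitmap of <len> bytes. */
static size_t count_set_bits(const unsigned char *bm, size_t len)
{
	size_t count = 0;

	for (size_t i = 0; i < len; i++)
		for (unsigned char b = bm[i]; b; b &= b - 1) /* clear lowest set bit */
			count++;
	return count;
}

#ifdef STANDALONE
/* Only built with -DSTANDALONE, e.g.:
 *   gcc -g -DSTANDALONE -o selftest selftest.c && ./selftest
 */
int main(void)
{
	const unsigned char bm[] = { 0xff, 0x01, 0x00, 0x80 };
	int ret = 0;

	if (count_set_bits(bm, sizeof(bm)) != 10) {
		fprintf(stderr, "count_set_bits: unexpected result\n");
		ret = 1;
	}
	printf("%s\n", ret ? "FAIL" : "OK");
	return ret;
}
#endif
```

Without `-DSTANDALONE`, the file compiles as a regular module, so the tests never end up in the production binary.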

As with the previous patch, this commit must be backported prior to the
upcoming fix for QUIC CRYPTO frame parsing.

MINOR: ncbmbuf: implement advance operation

Implement the ncbmb_advance() function for the ncbmbuf type. This allows
removing bytes in front of the buffer, regardless of existing gaps.
This is implemented by resetting the corresponding bits of the bitmap.
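A minimal sketch of advancing by resetting bits, assuming a hypothetical `struct toy_bmbuf` with one bit per data byte (LSB-first) and a circular area indexed from `head`. This toy model and the names `bit_clear()` and `toy_advance()` are illustrations, not haproxy's ncbmbuf layout.

```c
#include <stddef.h>

#define TOY_SZ 32 /* hypothetical data size, multiple of 8 for simplicity */

/* Toy bitmap buffer: bit i set means data byte i is filled. <head> is
 * the index of the byte at offset 0 in the circular area. */
struct toy_bmbuf {
	unsigned char bitmap[TOY_SZ / 8];
	size_t head;
};

static void bit_clear(struct toy_bmbuf *b, size_t idx)
{
	b->bitmap[idx / 8] &= (unsigned char)~(1u << (idx % 8));
}

/* Remove <adv> bytes in front of the buffer, filled or not: reset their
 * bits so the area can be reused, then move the head forward. */
static void toy_advance(struct toy_bmbuf *b, size_t adv)
{
	if (adv > TOY_SZ)
		adv = TOY_SZ;
	for (size_t i = 0; i < adv; i++)
		bit_clear(b, (b->head + i) % TOY_SZ);
	b->head = (b->head + adv) % TOY_SZ;
}
```

No data is moved: advancing only costs one bit reset per removed byte, whatever mix of data and gaps it covers.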

As with the previous patch, this commit must be backported prior to the
upcoming fix for QUIC CRYPTO frame parsing.

MINOR: ncbmbuf: implement ncbmb_data()

Implement the ncbmb_data() function for the ncbmbuf type. Its purpose is
similar to its ncbuf counterpart : it returns the size in bytes of data
starting at a specific offset until the next gap.
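With a bitmap, this amounts to counting consecutive set bits from the requested offset. The sketch below assumes a hypothetical flat bitmap (bit i set when data byte i is filled, LSB-first in each byte); `bm_data()` is not haproxy's function.

```c
#include <stddef.h>

/* Return the number of filled bytes starting at <off> up to the next gap
 * or the end of the <size>-byte area; 0 if <off> lies in a gap. */
static size_t bm_data(const unsigned char *bm, size_t size, size_t off)
{
	size_t n = 0;

	while (off + n < size && ((bm[(off + n) / 8] >> ((off + n) % 8)) & 1))
		n++;
	return n;
}
```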

As with the previous patch, this commit must be backported prior to the
upcoming fix for QUIC CRYPTO frame parsing.

MINOR: ncbmbuf: implement iterator bitmap utilities functions

Extend the private API of the ncbmbuf type by defining an iterator type
for buffer bitmap handling. The purpose is to provide a simple method to
iterate over the bitmap one byte at a time, with a proper bitmask set to
hide irrelevant bits.
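An iterator of this kind could look like the sketch below, which visits the bitmap bytes covering data bytes [start, end) and computes, for the first and last byte, a mask hiding the bits outside the range. All names (`struct bm_iter`, `bm_iter_init()`, `bm_iter_next()`, `bm_count_range()`) are hypothetical and the LSB-first bit order is an assumption.

```c
#include <stddef.h>

/* Iterator over the bitmap bytes covering data bytes [start, end). */
struct bm_iter {
	size_t byte;        /* current bitmap byte index */
	size_t last;        /* last bitmap byte index of the range */
	size_t start;
	size_t end;
	unsigned char mask; /* bits of the current byte inside the range */
};

static unsigned char bm_iter_mask(const struct bm_iter *it)
{
	unsigned int mask = 0xff;

	if (it->byte == it->start / 8)
		mask &= 0xffu << (it->start % 8);          /* hide bits before start */
	if (it->byte == it->last && it->end % 8)
		mask &= 0xffu >> (8 - it->end % 8);        /* hide bits from end on */
	return (unsigned char)mask;
}

/* Initialize on the non-empty range [start, end). */
static void bm_iter_init(struct bm_iter *it, size_t start, size_t end)
{
	it->start = start;
	it->end = end;
	it->byte = start / 8;
	it->last = (end - 1) / 8;
	it->mask = bm_iter_mask(it);
}

/* Move to the next bitmap byte. Returns 0 once the range is exhausted. */
static int bm_iter_next(struct bm_iter *it)
{
	if (it->byte >= it->last)
		return 0;
	it->byte++;
	it->mask = bm_iter_mask(it);
	return 1;
}

/* Usage example: count filled bytes in [start, end). */
static size_t bm_count_range(const unsigned char *bm, size_t start, size_t end)
{
	struct bm_iter it;
	size_t count = 0;

	if (start >= end)
		return 0;
	bm_iter_init(&it, start, end);
	do {
		for (unsigned char b = bm[it.byte] & it.mask; b; b &= b - 1)
			count++;
	} while (bm_iter_next(&it));
	return count;
}
```

Callers then work on whole bytes with a ready-made mask instead of handling partial first and last bytes themselves.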

This internal type is unused for now, but will become useful when
implementing the ncbmb_data() and ncbmb_advance() functions.

As with the previous patch, this commit must be backported prior to the
upcoming fix for QUIC CRYPTO frame parsing.

MINOR: ncbmbuf: implement add

This patch implements the add operation for the ncbmbuf type.

This function is simpler than its ncbuf counterpart. Indeed, for now
only NCB_ADD_OVERWRT mode is supported. This compromise was chosen as
ncbmbuf will first be used for QUIC CRYPTO frame handling, which does
not require comparing existing filled blocks during insertion.
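The overwrite-only add can be sketched on a hypothetical flat buffer (`struct toy_buf`, `toy_add()`, `enum toy_ret` and the LSB-first bitmap are illustrations, not haproxy's definitions). Because gaps are just cleared bits, any insertion that fits in the buffer succeeds, and already filled bytes are simply replaced without comparison.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes. */
enum toy_ret { TOY_OK = 0, TOY_ERR_SIZE };

/* <area> holds <size> data bytes, <bitmap> one bit per data byte,
 * set once the byte is filled. */
struct toy_buf {
	unsigned char *area;
	unsigned char *bitmap;
	size_t size;
};

/* Overwrite-only add: copy <len> bytes at <off> and mark them filled.
 * The only failure is a range exceeding the buffer. */
static enum toy_ret toy_add(struct toy_buf *b, size_t off,
                            const void *data, size_t len)
{
	if (off > b->size || len > b->size - off)
		return TOY_ERR_SIZE;
	memcpy(b->area + off, data, len);
	for (size_t i = off; i < off + len; i++)
		b->bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
	return TOY_OK;
}
```

With this design, receiving the same CRYPTO data twice is harmless: the second copy just rewrites identical bytes.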

As with the previous patch, this commit must be backported prior to the
upcoming fix for QUIC CRYPTO frame parsing.

MINOR: ncbmbuf: define new ncbmbuf type

Define ncbmbuf, an alternative non-contiguous buffer implementation.
The "bm" abbreviation stands for bitmap, which reflects how gaps and
filled blocks are encoded. The main purpose of this implementation is
to get rid of the ncbuf limitation regarding the minimal gap size
between two blocks of data.
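In a bitmap encoding, a gap is just a run of cleared bits, so it has no minimal size. The sketch below shows what a type with make/init/is_empty helpers could look like under that encoding; `struct toy_ncbm`, `toy_make()`, `toy_init()` and `toy_is_empty()` are hypothetical and not haproxy's actual ncbmbuf definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* <data> holds <size> bytes, <bitmap> one bit per data byte, set when
 * the byte is filled. */
struct toy_ncbm {
	unsigned char *data;
	unsigned char *bitmap;
	size_t size;
};

/* Build a buffer over caller-provided storage. */
static struct toy_ncbm toy_make(unsigned char *data, unsigned char *bitmap,
                                size_t size)
{
	struct toy_ncbm b = { data, bitmap, size };

	return b;
}

/* Turn the whole buffer into a single gap. */
static void toy_init(struct toy_ncbm *b)
{
	memset(b->bitmap, 0, (b->size + 7) / 8);
}

/* Empty when no bit is set, whatever the data bytes contain. */
static bool toy_is_empty(const struct toy_ncbm *b)
{
	for (size_t i = 0; i < (b->size + 7) / 8; i++)
		if (b->bitmap[i])
			return false;
	return true;
}
```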

This commit adds the new module ncbmbuf. Along with it, some utility
functions such as ncbmb_make(), ncbmb_init() and ncbmb_is_empty() are
defined. The public API of ncbmbuf will be extended in the following
patches.

This patch is not considered a bug fix. However, it will be required to
fix the issue encountered on QUIC CRYPTO frame parsing. Thus, it will be
necessary to backport the current patch prior to the upcoming fix.

MINOR: ncbuf: extract common types

ncbuf is a module which provides a non-contiguous buffer type
implementation. This patch extracts some of its basic types into a new
file, ncbuf_common.h.

This patch will be useful to provide a new non-contiguous buffer
alternative implementation based on a bitmap.

This patch is not a bug fix. However, it is necessary for the ncbmbuf
implementation, which will be required to fix a QUIC issue on CRYPTO
frame parsing. Thus, it will be necessary to backport the current patch
prior to the upcoming fix.

BUG/MAJOR: pools: fix default pool alignment

The doc in commit 977feb5617 ("DOC: api: update the pools API with the
alignment and typed declarations") says that an alignment of zero means
the type's alignment, and the DECLARE_TYPED_POOL() macro follows this.
Yet this is not what create_pool_from_reg() does: it only raises the
alignment to that of a void* if lower, while it should start from the
type's alignment. As a result, haproxy refuses to start on some 32-bit
platforms since that commit, displaying an error such as:

   "BUG in the code: at src/mux_h2.c:454, requested creation of pool
    'h2s' aligned to 4 while type requires alignment of 8! Please
    report to developers. Aborting."

Let's just apply the type's alignment by default.
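The difference between the two behaviors can be sketched with hypothetical helpers taking the alignments as parameters (`pool_align_buggy()` and `pool_align_fixed()` are not the actual create_pool_from_reg() code). On a 32-bit platform with 4-byte pointers and a type requiring 8-byte alignment, a zero (default) request yields 4 with the old rule, matching the error above, and 8 with the fixed one.

```c
#include <stddef.h>

/* <req>: requested alignment (0 = default), <type_align>: alignment the
 * pooled type requires, <ptr_align>: alignment of a void* on the platform.
 */

/* Old rule: only raise the request to a pointer's alignment. */
static size_t pool_align_buggy(size_t req, size_t type_align, size_t ptr_align)
{
	(void)type_align; /* the type's alignment is ignored: this is the bug */
	return req < ptr_align ? ptr_align : req;
}

/* Fixed rule: zero means "the type's alignment", then raise to a pointer's. */
static size_t pool_align_fixed(size_t req, size_t type_align, size_t ptr_align)
{
	size_t align = req ? req : type_align;

	return align < ptr_align ? ptr_align : align;
}
```

In real code, the type alignment would typically come from C11 `_Alignof()` applied to the pooled type, and the pointer alignment from `_Alignof(void *)`.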

Thanks to @tianon for reporting this in GH issue #3168. No backport is
needed since aligned pools are 3.3-only.

BUG/MEDIUM: h3: properly encode response after interim one in same buf

Recently, proper support for forwarding interim responses to HTTP/3
clients has been implemented. However, there was still an issue when two
responses were both encoded in the same snd_buf() iteration.

The issue is caused by the H3 HEADERS frame encoding method : 5 bytes
are reserved in front of the buffer to encode both the H3 frame type and
the varint length field. After headers encoding, the output buffer head
is adjusted so that the length can be encoded using the minimal varint
size.

However, if the buffer is not empty because a previous response was
already encoded but not yet emitted, adjusting the buffer head corrupts
the entire H3 message. This only happens when both responses are encoded
in the same snd_buf() iteration, or at least without emission to the
quic_conn layer in between.

The result of this bug is that the HTTP/3 client is unable to parse the
response, most of the time reporting a formatting error. This can be
reproduced using the following netcat loop as the HTTP/1 server behind
haproxy :

$ while sleep 0.2; do \
printf "HTTP/1.1 100 continue\r\n\r\nHTTP/1.1 200 ok\r\nContent-length: 5\r\nConnection: close\r\n\r\nblah\n" | nc -lp8002
done

To fix this, only adjust the buffer head if the buffer is empty.
Otherwise, the frame length is simply encoded as a 4-byte varint so that
messages remain contiguous in the buffer.
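Both encoding paths can be sketched as follows. `h3_headers_prefix()` and the constants are hypothetical, not haproxy's h3 code. The varint format (2-bit length prefix, 1/2/4/8 bytes) is the one defined in RFC 9000, which also allows non-minimal encodings, and 0x01 is the HTTP/3 HEADERS frame type. The function returns how many reserved bytes may be skipped by moving the buffer head.

```c
#include <stddef.h>
#include <stdint.h>

#define H3_FT_HEADERS 0x01 /* HTTP/3 HEADERS frame type */
#define H3_RESERVED   5    /* 1 byte of frame type + up to 4 bytes of length */

/* Size of the QUIC variable-length integer encoding of <v>. */
static size_t qvarint_size(uint64_t v)
{
	if (v <= 63)
		return 1;
	if (v <= 16383)
		return 2;
	if (v <= 1073741823)
		return 4;
	return 8;
}

/* Write <v> big-endian on exactly <sz> bytes (1, 2, 4 or 8) with the
 * length prefix in the two most significant bits. <v> must fit in <sz>. */
static void qvarint_write(unsigned char *out, uint64_t v, size_t sz)
{
	unsigned char prefix = sz == 1 ? 0x00 : sz == 2 ? 0x40 : sz == 4 ? 0x80 : 0xc0;

	for (size_t i = 0; i < sz; i++)
		out[i] = (unsigned char)(v >> (8 * (sz - 1 - i)));
	out[0] |= prefix;
}

/* <rsv> points to the H3_RESERVED bytes reserved in front of
 * <payload_len> bytes of encoded headers. Writes type and length and
 * returns how many leading reserved bytes are unused. If the buffer
 * already held a response, moving the head would corrupt it, so a 4-byte
 * varint fills the reserved area exactly and 0 is returned. Returns
 * SIZE_MAX if the length does not fit in 4 varint bytes. */
static size_t h3_headers_prefix(unsigned char *rsv, uint64_t payload_len,
                                int buf_was_empty)
{
	size_t vsz = buf_was_empty ? qvarint_size(payload_len) : 4;
	size_t skip;

	if (payload_len > 1073741823)
		return SIZE_MAX;
	skip = H3_RESERVED - 1 - vsz;
	rsv[skip] = H3_FT_HEADERS;
	qvarint_write(rsv + skip + 1, payload_len, vsz);
	return skip;
}
```

The 4-byte form wastes up to 3 bytes per frame, but only when responses are queued back to back, which keeps the common case compact.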

This must be backported up to 2.6.